CIND 123 Winter 2020 - Assignment #1: CIND 719 Assignment 1








CIND 123 Winter 2020 - Assignment #1
CIND 719 Assignment 1

Complete this assignment using your understanding of Hadoop, HDFS, Hive, HiveQL and Hive DDL to perform data management, storage, retrieval, and analysis. Do not answer questions using any other software or method.

Dataset

Download two csv files (station_data.csv and trip_data.csv) from the course shell (Lab Resources & Datasets). Both files contain data collected from the second year of Bay Area Bike Share's operation, from 9/1/14 to 8/31/15. The schema of each file is given below.

• station_data.csv
- station id: station ID number
- name: name of station
- lat: latitude
- long: longitude
- dockcount: number of total docks at station
- landmark: city (San Francisco, Redwood City, Palo Alto, Mountain View, San Jose)
- installation: original date that station was installed

• trip_data.csv
- Trip ID: numeric ID of bike trip
- Duration: time of trip in seconds
- Start Date: start date of trip with date and time, in PST
- Start Station: station name of start station (corresponds to 'name' in the station_data.csv dataset)
- Start Terminal: numeric reference for start station (corresponds to 'station id' in the station_data.csv dataset)
- End Date: end date of trip with date and time, in PST
- End Station: station name for end station (corresponds to 'name' in the station_data.csv dataset)
- End Terminal: numeric reference for end station (corresponds to 'station id' in the station_data.csv dataset)
- Bike #: numeric ID of bike used
- Subscription Type: 'Subscriber' = annual or 30-day member; 'Customer' = 24-hour or 3-day member
- Zip Code: home zip code of subscriber (customers can choose to manually enter zip at kiosk, however this data is unreliable)

Study the datasets and understand how they are related before you attempt the following questions.

Questions

1. Find the 'most popular' bike, i.e. the bike that has made the highest number of trips. (1.5 pts)
2. Find the number of trips made by each subscription type. (1.5 pts)
3. Build a table that shows which stations are connected, and the minimum duration between them. You can use either station id or station name. Save this table as a comma-separated text file in '/user/assignment1/stationlist.csv' in HDFS. Include the directory listing of the output directory and the first five lines of the output file in your submission. (3 pts)
4. Find the number of trips originating from each landmark. Your output should include the landmark name and the number of trips originating from it. (3 pts)
5. Find the number of trips crossing landmarks, i.e. trips that originate in one landmark and end in another. Your output should include the originating and ending landmark names and the number of trips between them. (6 pts)

Assignment Submission Instructions

Prepare a report of your findings.
- For each question, provide the steps, commands, and queries in both text and image form.
  o Take clear and readable screenshots of the shell commands along with the outputs.
Submit the report to the Assignment 1 folder under Assessment/Assignments in your course shell.
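As a rough illustration of how the five questions map onto Hive DDL and HiveQL, the shell sketch below writes candidate statements to a file that could then be passed to `hive -f`. It is only a sketch under stated assumptions: the table names (stations, trips), the snake_case column names, the HDFS input directories, and the premise that each CSV has a single header row and no commas inside fields are all hypothetical and not taken from the assignment. Only the output path '/user/assignment1/stationlist.csv' comes from question 3.

```shell
#!/bin/bash
# Sketch only: writes candidate HiveQL for questions 1-5 to a file.
# ASSUMPTIONS (not from the assignment): table names stations/trips, the
# snake_case column names, the HDFS input directories, one header row per
# CSV, and no commas embedded inside any field.

HQL_FILE="${HQL_FILE:-assignment1.hql}"

cat > "$HQL_FILE" <<'EOF'
-- External tables over the two CSV files (assumed already uploaded to HDFS).
CREATE EXTERNAL TABLE IF NOT EXISTS stations (
  station_id INT, name STRING, lat DOUBLE, lon DOUBLE,
  dockcount INT, landmark STRING, installation STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/assignment1/input/stations'
TBLPROPERTIES ('skip.header.line.count'='1');

CREATE EXTERNAL TABLE IF NOT EXISTS trips (
  trip_id INT, duration INT, start_date STRING, start_station STRING,
  start_terminal INT, end_date STRING, end_station STRING,
  end_terminal INT, bike_no INT, subscription_type STRING, zip_code STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/assignment1/input/trips'
TBLPROPERTIES ('skip.header.line.count'='1');

-- Q1: bike with the highest number of trips
SELECT bike_no, COUNT(*) AS trips_made
FROM trips GROUP BY bike_no
ORDER BY trips_made DESC LIMIT 1;

-- Q2: number of trips per subscription type
SELECT subscription_type, COUNT(*) AS trips_made
FROM trips GROUP BY subscription_type;

-- Q3: connected station pairs with minimum duration, saved to HDFS as CSV
INSERT OVERWRITE DIRECTORY '/user/assignment1/stationlist.csv'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
SELECT start_terminal, end_terminal, MIN(duration)
FROM trips GROUP BY start_terminal, end_terminal;

-- Q4: trips originating from each landmark
SELECT s.landmark, COUNT(*) AS trips_made
FROM trips t JOIN stations s ON t.start_terminal = s.station_id
GROUP BY s.landmark;

-- Q5: trips that start in one landmark and end in another
SELECT s.landmark AS start_landmark, e.landmark AS end_landmark,
       COUNT(*) AS trips_made
FROM trips t
JOIN stations s ON t.start_terminal = s.station_id
JOIN stations e ON t.end_terminal = e.station_id
WHERE s.landmark <> e.landmark
GROUP BY s.landmark, e.landmark;
EOF

echo "Wrote $HQL_FILE"
# On a machine with Hive and HDFS available, one would then run:
#   hive -f "$HQL_FILE"
#   hdfs dfs -ls /user/assignment1/stationlist.csv
#   hdfs dfs -cat '/user/assignment1/stationlist.csv/*' | head -n 5
```

Note that INSERT OVERWRITE DIRECTORY produces a directory of part files rather than a single file, which is consistent with question 3 asking for a directory listing of the output.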
Answered 5 days after Feb 27, 2021


Sandeep Kumar answered on Feb 28 2021
#!/bin/bash
# Collect & Check Linux server status
# Sys env / paths / etc
# -------------------------------------------------------------------------------------------\
PATH=$PATH:/bin:/sbin:/usr/bin:/usr/sbin:/usr/local/bin:/usr/local/sbin
SCRIPT_PATH=$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)
# Initial variables
# ---------------------------------------------------\
HOSTNAME=`hostname`
SERVER_IP=`hostname -I`
MOUNT=$(mount|egrep -iw "ext4|ext3|xfs|gfs|gfs2|btrfs"|grep -v "loop"|sort -u -t' ' -k1,2)
FS_USAGE=$(df -PTh|egrep -iw "ext4|ext3|xfs|gfs|gfs2|btrfs"|grep -v "loop"|sort -k6n|awk '!seen[$1]++')
SERVICES="$SCRIPT_PATH/services-list.txt"
TESTFILE="$SCRIPT_PATH/tempfile"
TOTALMEM=$(free -m | awk '$1=="Mem:" {print $2}')
DEBUG=false
# Colored styles
on_success="DONE"
on_fail="FAIL"
white="\e[1;37m"
green="\e[1;32m"
red="\e[1;31m"
purple="\033[1;35m"
nc="\e[0m"
SuccessMark="\e[47;32m ------ OK \e[0m"
WarningMark="\e[43;31m ------ WARNING \e[0m"
CriticalMark="\e[47;31m ------ CRITICAL \e[0m"
d="-------------------------------------"
Info() {
    echo -en "${1}${green}${2}${nc}\n"
}
Warn() {
    echo -en "${1}${purple}${2}${nc}\n"
}
Success() {
    echo -en "${1}${green}${2}${nc}\n"
}
Error () {
    echo -en "${1}${red}${2}${nc}\n"
}
Splash() {
    echo -en "${white}${1}${nc}\n"
}
space() {
    echo -e ""
}
# Help information
usage() {
    Info "" "\nParameters:\n"
    echo -e "-sn - Skip speedtest\n
-sd - Skip disk test\n
-ss - Show all running services\n
-e - Extra info (Bash users, Who logged, All running services, Listen ports, UnOwned files, User list from processes)
    "
    Info "" "Usage:"
    echo -e "You can use this script with several parameters:"
    echo -e "./system-check.sh -sn -sd -e"
    echo -e "./system-check.sh -ss"
    exit 1
}
# Checks arguments
while [[ "$#" -gt 0 ]]; do
    case $1 in
        -sn|--skip-network) SKIPNET=1 ;;
        -ss|--skip-services) SKIPSERVICES=1 ;;
        -sd|--skip-disk) SKIPDISK=1 ;;
        -e|--extra) EXTRA=1 ;;
        -h|--help) usage ;;
        *) echo "Unknown parameter passed: $1"; exit 1 ;;
    esac
    shift
done
# Functions
# ------------------------------------------------------------------------------------------------------\
## Service functions
# Yes / No confirmation
confirm() {
    # call with a prompt string or use a default
    read -r -p "${1:-Are you sure? [y/N]} " response
    case "$response" in
        [yY][eE][sS]|[yY])
            true
            ;;
        *)
            false
            ;;
    esac
}
# Check if the current user is root
isRoot() {
    # Error prints its second argument in red, so the message is passed second
    if [ "$(id -u)" -ne 0 ]; then
        Error "" "You must be root user to continue"
        exit 1
    fi
    RID=$(id -u root 2>/dev/null)
    if [ $? -ne 0 ]; then
        Error "" "User root not found. You should create it to continue"
        exit 1
    fi
    if [ "$RID" -ne 0 ]; then
        Error "" "User root UID is not 0. User root must have UID 0"
        exit 1
    fi
}
# Check for a supported distro
checkDistro() {
    # RPM=1 marks RPM-based distros (CentOS, Fedora)
    if [ -e /etc/centos-release ]; then
        DISTRO=$(awk '{print $1,$4}' /etc/centos-release)
        RPM=1
    elif [ -e /etc/fedora-release ]; then
        DISTRO=$(awk '{print ($1,$3~/^[0-9]/?$3:$4)}' /etc/fedora-release)
        RPM=1
    elif [ -e /etc/os-release ]; then
        # Requires the lsb_release command to be installed
        DISTRO=$(lsb_release -d | awk -F"\t" '{print $2}')
        RPM=0
    else
        Error "" "Your distribution is not supported (yet)"
        exit 1
    fi
}
# Get the current date and time
getDate() {
    date '+%d-%m-%Y_%H-%M-%S'
}
# SELinux status
isSELinux() {
    if [[ "$RPM" -eq "1" ]]; then
        # selinuxenabled exits with 0 when SELinux is enabled
        if selinuxenabled; then
            Info "SELinux:\t\t" "ENABLED"
        else
            Error "SELinux:\t\t" "DISABLED"
        fi
    fi
}
# Check whether a file exists (true / false)
chk_fileExist() {
    PASSED=$1
    if [[ -d $PASSED ]]; then
     # echo "$PASSED is a...