CS 5834, Fall 2019: Intro to Urban Computing
Homework 2
NYC Taxi Data Analysis and Modeling
Due Date: 10/03/2019
Total: 100 Points

In this homework, you will process the taxi data collected from New York City, use regression models to predict the trip fare amount, and use different classification models to predict whether the tip was less than 20% of the fare or not.

Problem 1. Download and process data. (15 points)

1. The NYC taxi data can be found at
   http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml
   In this data, the yellow and green taxi trip records include fields capturing pick-up and drop-off dates/times, pick-up and drop-off locations, trip distances, itemized fares, rate types, payment types, and driver-reported passenger counts. You are free to choose a trip sheet (in CSV format) of the yellow taxi for any month of 2017 for your homework.
2. Randomly sample 10,000 trip records to solve Problems 2 and 3.
3. Create a dataset with the following attributes:
   a. VendorID
   b. Day/night. Please convert ‘tpep_pickup_datetime’ to day (1) or night (0).
   c. Passenger_count
   d. Trip_distance
   e. PULocationID
   f. DOLocationID
   g. Payment_type
   h. Payment_type_cat1, Payment_type_cat2, .... Note: Please convert ‘payment_type’ to dummy variables.
   i. Fare_amount
   j. Tip_amount
   k. Tip_rate_20. First, calculate ‘tip_rate’ as ‘tip_rate’ = ‘Tip_amount’ / ‘Fare_amount’. Second, if ‘tip_rate’ < 0.2, set ‘tip_rate_20’ = 0; otherwise, set it to 1.
4. Save the dataset as a CSV file. The first line of the CSV file should be the attribute names described in item 3.
5. Plot the distributions of the fare_amounts and tip_amounts.

Problem 2. Trip fare amount prediction. (35 points)

1. Build a linear regression model to predict the trip fare amount. You are free to use packages like sklearn or write your own code.
   a. Here is a link to the linear regression module of the sklearn package:
      http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html
   b. Use attributes b, c, d, e, f, h as input features and attribute i as the output.
   c. Your model should be evaluated with 5-fold cross-validation, and you have to report the averaged mean squared error (MSE) and its standard deviation. You can use this link to calculate the MSE:
      http://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_squared_error.html
2. Similarly, build a KNN regression model to predict the trip fare amount. The model should be evaluated with 5-fold cross-validation. In each fold, 80% of the data should be used for training and 20% for testing. You must choose the optimal value of k between 1 and 10 based on one half of the testing data, then calculate the MSE on the other half of the testing data with the best k. Finally, report the averaged MSE and its standard deviation.
3. Compare the results of the two models.

Problem 3. Tip rate classification. (35 points)

Sample 1000 trip records from your data, and solve the following problems.

1. Use a KNN model to predict tip_rate_20.
   a. Set k in KNN to 5.
   b. Use attributes b, c, d, h as input features.
   c. Use attribute k as class labels.
   d. Use Euclidean distance.
   e. Run 5-fold cross-validation to evaluate your model.
   f. Report the precision, recall, and F-score of the classification.
   g. Please follow this link to KNN in sklearn:
      http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html
2. Use a decision tree to predict tip_rate_20.
   a. Build the decision tree with attributes b, c, d, g.
   b. Use attribute k as class labels.
   c. Use 5-fold cross-validation to evaluate your model.
   d. Report the precision, recall, and F-score of the classification.
   e. Here is the link to the decision tree in the sklearn package:
      http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html

Problem 4. Subway services (15 points)

Suppose you are the CTO for WMATA and are looking to improve your services. If you are not familiar with WMATA, they run the metro system in the greater Washington DC area. Every traveler buys a metro card and then uses it on automated fare collection systems while both entering and exiting stations. Many hotels and online travel websites also sell the metro card (apart from sales at stations). A major problem you need to solve is to differentiate between tourist trips and normal commuters in your system.

1. Given your knowledge of ML, can you pose this as one of the tasks we have seen before in class? Make sure you clearly describe how you will create your dataset and justify why your setup makes sense.
2. Will your answer change in any way if WMATA collected the fare directly at the entry point only (so no card swipe at exit)?
3. Finally, assuming you have built this ML model to differentiate these commuters, how can you use your knowledge for improving the user experience?

There is no one ‘right’ answer for the questions above; we are looking to see if you can design well and reason about your choices/responses. Please try to keep your answer brief (3-4 lines) for each question.

Submission

Make a single .zip file including:
1. A report of the results and your answer to Problem 4 as one PDF.
2. Source code files under a folder named “src”
3. Data used in Problems 2 and 3 under a folder named “data”
4. Screenshots of the output of your code under a folder named “img”
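
As an illustration of the derived attributes in Problem 1, item 3 (b, h, and k), here is a minimal sketch in plain Python. Several details are assumptions that the assignment does not specify: "day" is taken to mean pickup hours 06:00 to 17:59, the timestamp format is assumed to be `YYYY-MM-DD HH:MM:SS`, the payment codes in `PAYMENT_TYPES` are a hypothetical list that should be replaced by the codes present in your sample, and records with a non-positive fare are skipped because the tip rate is undefined for them. The function name `derive_features` is also hypothetical.

```python
from datetime import datetime

# Assumption: the assignment does not define "day"; here it is 06:00-17:59.
DAY_START_HOUR = 6
DAY_END_HOUR = 18
# Assumption: hypothetical set of payment codes; use the codes found in your data.
PAYMENT_TYPES = (1, 2, 3, 4)


def derive_features(record):
    """Build attributes b, h, i, j, k from one raw trip record (a dict of strings).

    Returns None when the fare is not positive, since tip_rate is undefined then.
    """
    pickup = datetime.strptime(record["tpep_pickup_datetime"], "%Y-%m-%d %H:%M:%S")
    day_night = 1 if DAY_START_HOUR <= pickup.hour < DAY_END_HOUR else 0

    fare = float(record["fare_amount"])
    tip = float(record["tip_amount"])
    if fare <= 0:
        return None

    tip_rate = tip / fare
    row = {
        "Day_night": day_night,
        "Fare_amount": fare,
        "Tip_amount": tip,
        # Rule from the assignment: tip_rate < 0.2 -> 0, otherwise 1.
        "Tip_rate_20": 0 if tip_rate < 0.2 else 1,
    }

    # One 0/1 dummy column per payment code (attribute h).
    payment = int(record["payment_type"])
    for code in PAYMENT_TYPES:
        row[f"Payment_type_cat{code}"] = 1 if payment == code else 0
    return row
```

Applying `derive_features` to each sampled record and writing the resulting dicts with `csv.DictWriter` would produce a CSV whose first line holds the attribute names, as item 4 requires.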
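
The evaluation protocol in Problem 2, item 2 is easy to misread, so here is one possible reading as a sketch in plain Python with a hand-written KNN regressor. It is not the required implementation, and sklearn's `KNeighborsRegressor` could take the place of `knn_predict`. The function names, the random seed, the way folds are formed, and the use of the population standard deviation are all assumptions. In each of the 5 folds, the held-out 20% is split in half: the first half selects k from 1 to 10, and the second half gives the MSE that is reported.

```python
import random
import statistics


def knn_predict(train_X, train_y, x, k):
    """Average the targets of the k training points nearest to x (squared Euclidean)."""
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(row, x)), target)
        for row, target in zip(train_X, train_y)
    )
    neighbours = dists[:k]
    return sum(target for _, target in neighbours) / len(neighbours)


def knn_mse(train_X, train_y, X, y, k):
    """Mean squared error of KNN predictions on (X, y)."""
    errors = [(knn_predict(train_X, train_y, x, k) - t) ** 2 for x, t in zip(X, y)]
    return sum(errors) / len(errors)


def knn_cv_with_k_selection(X, y, n_folds=5, k_values=range(1, 11), seed=0):
    """Return (mean MSE, std of MSE, chosen k per fold)."""
    indices = list(range(len(X)))
    random.Random(seed).shuffle(indices)
    # Every n_folds-th shuffled index forms one fold (about 20% each for 5 folds).
    folds = [indices[i::n_folds] for i in range(n_folds)]

    fold_mse, chosen_k = [], []
    for i, held_out in enumerate(folds):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        half = len(held_out) // 2
        val, test = held_out[:half], held_out[half:]

        train_X = [X[j] for j in train]
        train_y = [y[j] for j in train]
        val_X, val_y = [X[j] for j in val], [y[j] for j in val]
        test_X, test_y = [X[j] for j in test], [y[j] for j in test]

        # Pick the k with the lowest MSE on the validation half (ties go to the smaller k).
        best_k = min(k_values, key=lambda k: knn_mse(train_X, train_y, val_X, val_y, k))
        # Report the error on the other half, which played no part in choosing k.
        fold_mse.append(knn_mse(train_X, train_y, test_X, test_y, best_k))
        chosen_k.append(best_k)

    return statistics.mean(fold_mse), statistics.pstdev(fold_mse), chosen_k
```

Calling `knn_cv_with_k_selection(X, y)` with `X` as a list of numeric feature lists and `y` as the fare amounts returns the averaged MSE, its standard deviation, and the k chosen in each fold. Keeping the selection half separate from the reporting half avoids scoring the model on the same records that were used to tune k.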
Sep 27, 2021