Final Project – Data Analysis for Business Applications

In the final project you are required to analyze a dataset from Uber. You will predict the demand for Uber rides within 1 kilometer of the Museum of Natural History, for each quarter of an hour, during the last 13 days of September 2014. Your predictions will be evaluated on a list of time intervals for which you do not have access to the real number of Uber pickups that occurred.

The Data

For this work you have 3 different datasets:

1. uber_train - This is the fundamental data you are going to analyze and use for modeling. This file contains data on over 4.5 million Uber pickups in New York City from April to September 2014.
   Note: You have a separate file for each month and you can use as much data as you like in order to train your model. Think carefully about how much data to use.
2. uber_test - For the list of time intervals in this data, you will make your predictions, based on the model you created.
3. External data source – You need to add at least one variable that is not supplied by Uber. You can use any data from the Internet or from any other source you can think of.
   Note: You are not allowed to leak data from the “test period” into the “training period”, meaning that only information that was known before the start of the “test period” can be used. For example: the precise temperature during the last week of September cannot be used, since it was not known at the end of the “training period”. Instead, you can use the weather forecast for that week.

Project Description

This work will be divided into 4 parts:

1. Data Rearrangement - Your training data is not in the same format as the test data and contains data on pickups outside the target zone.
   Thus, your first mission is:
   - Use the Museum of Natural History coordinates (latitude = 40.7813241, longitude = -73.9739882) to subset the data to pickups within 1 kilometer of the Museum of Natural History. The figure at the end of this document illustrates this target zone.
     Tip: use the ‘distm’ function from the 'geosphere' package to calculate the distance between two geo-points.
   - Transform your training data so that it is in the same format as the test data, i.e. aggregate your data by time interval.
   - Merge your external data source with your training data and extract as many features as you like from the data.
2. Exploratory Analysis and background - present and explain descriptive statistics (including plots) about the dataset. Emphasize in your analysis:
   i. What actions were performed on the data in order to reach its final form.
   ii. The relationship between the dependent variable and the other variables you will later use as predictors in your modeling.
   As we showed in class, data analysis is used to gain intuition for better predictions and models.
3. Model Estimation - build your best model. After evaluating a few different models with different variables, choose your best model. Explain why you chose this one, explain the model, and explain how you built the different variables. Describe clearly how you will eventually estimate the demand for Uber rides at any time interval in the target zone.
   Notice: Your final model MUST include at least one variable from an external data source. You should scan the Internet for different data sources that you find relevant for predicting the demand for Uber rides. Explain the data, how you joined it with the Uber data, and why you think it is relevant.
4. Predictions - for each time interval (row) in the data called ‘uber_test.csv’, use your best model to predict "number_of_pickups".
   Your output should be in the form of a table with 2 columns:

   Time_Interval          number_of_pickups
   2014-09-18 00:00:00    11
   2014-09-18 00:15:00    63
   2014-09-18 00:30:00    0

Submitted documents:

You should submit the following files:
- The ‘uber_test.csv’ file, which must look exactly like the table you are given (same column names).
  - Make sure that the file name is ‘uber_test.csv’.
  - Note that these predictions must be based on the model you selected and presented in the two other documents.
- The R code containing your script.
- A PDF document (up to 6 pages) that shows:
  - Main points you have found in your exploratory analysis (including plots and descriptive statistics).
  - A summary and explanation of your selected model for prediction, including an evaluation of the model on a validation set.
  - A brief description of other models you examined and ruled out.
  - If you have used modeling methods that weren’t taught in this course, you must add a one-page appendix (in addition to the 6 pages) that describes these methods.
  - DON’T add code to the PDF file.

Due dates

By 7-March-2019 23:59, submit all of the files listed above on Moodle (the course’s website).

Grading

Grading will follow this key:
- Quality of explanations and analysis (45%)
- Accuracy of prediction (in comparison to other teams) (50%)
- Document clarity (5%)

[Figure: Illustration of the target zone]

Good luck!
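
The Data Rearrangement step described in part 1 (keep pickups within 1 kilometer of the museum, then count them per quarter of an hour) can be sketched roughly as follows. This is only an illustration of the arithmetic, not the required submission: the assignment asks for R and suggests ‘distm’ from 'geosphere', whereas this sketch uses Python's standard library and a haversine great-circle formula in its place. The function names, the Earth-radius constant, and the sample pickups are all hypothetical; only the museum coordinates and the 1-kilometer radius come from the assignment.

```python
import math
from collections import Counter
from datetime import datetime

# Museum of Natural History coordinates, as given in the assignment.
MUSEUM_LAT, MUSEUM_LON = 40.7813241, -73.9739882
# Mean Earth radius in meters (an assumption of the spherical haversine model).
EARTH_RADIUS_M = 6_371_000


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def floor_to_quarter_hour(ts):
    """Round a timestamp down to the start of its 15-minute interval."""
    return ts.replace(minute=ts.minute - ts.minute % 15, second=0, microsecond=0)


def count_pickups(pickups, radius_m=1000.0):
    """Count pickups within radius_m of the museum, per 15-minute interval.

    pickups is an iterable of (timestamp, latitude, longitude) tuples.
    """
    counts = Counter()
    for ts, lat, lon in pickups:
        if haversine_m(lat, lon, MUSEUM_LAT, MUSEUM_LON) <= radius_m:
            counts[floor_to_quarter_hour(ts)] += 1
    return counts


# Hypothetical sample rows: (pickup time, latitude, longitude).
sample = [
    (datetime(2014, 9, 1, 0, 3, 10), 40.7813, -73.9740),   # at the museum
    (datetime(2014, 9, 1, 0, 14, 59), 40.7850, -73.9740),  # roughly 0.4 km north
    (datetime(2014, 9, 1, 0, 16, 0), 40.7800, -73.9700),   # inside, next interval
    (datetime(2014, 9, 1, 0, 20, 0), 40.7128, -74.0060),   # Lower Manhattan, outside
]
counts = count_pickups(sample)
for interval, n in sorted(counts.items()):
    print(interval, n)
```

With these sample rows, the 00:00 interval receives two pickups and the 00:15 interval receives one; the fourth row is dropped by the distance filter. Intervals with no pickups never appear as keys, so when building the full grid of time intervals they need to be filled in with 0 explicitly (looking up a missing key in a `Counter` returns 0), since the test file can contain intervals with zero pickups.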
Feb 12, 2021