FES 205 S&DS 230e Final Project Guidelines

Final project in the class due by Aug 5th midnight. Only thing between me and finishing this class. Uploaded the previous 3 assignments (HW 7-9). Need the Final Project completed. Also uploaded a potential starter file (Rideshare) but does not need to be used if it doesn't help.


S&DS 230e Data Analysis Final Project Guidelines

Due Friday, 8.13.21, 11:59pm, uploaded to CANVAS as PDF or DOC AND RMD

Overview

Analyze a dataset of your choice and write a 10-20 page report of your findings. This report must be created in RMarkdown and you’ll submit both a knitted PDF/doc file and the raw Rmarkdown code. Your goal is to demonstrate your ability to code in R, to clean data, to use appropriate graphical and statistical techniques in R, and to interpret your results.

Groups

You are encouraged but certainly not required to work in groups. Groups can be up to 4 students. Everyone in the group gets the same grade.

Data

You should choose a dataset that is interesting to you, OR you may use one of three datasets that I provide. The dataset should have at least 10 variables and at least 50 observations. You must have at least two continuous variables and at least two categorical variables. Some datasets will have hundreds of variables and more than 100,000 observations. Getting and cleaning the data may be the most difficult part of your project. YOU ABSOLUTELY SHOULD DISCUSS YOUR DATA WITH ME OR A TA BEFORE TURNING IN YOUR PROJECT.

There are many online sources for data: you can just go to Google, search for a subject, and add ‘data’. You can also scrape data off a website. Here are some good sites:

- ICPSR (https://www.icpsr.umich.edu/icpsrweb/landing.jsp). More than 10,000 datasets here.
- Kaggle (https://www.kaggle.com/datasets)
- The Census Bureau (http://www.census.gov/)
- NOAA (http://www.nodc.noaa.gov/)
- The US Environmental Protection Agency (http://www.epa.gov/epahome/Data.html)

Other ideas:

- Use your web scraping tools to get data on all roll call votes in the 116th Senate (2nd session, 2020) (https://www.senate.gov/legislative/LIS/roll_call_lists/vote_menu_116_2.htm)

You should NOT choose a dataset that has already been extensively cleaned and analyzed (i.e. from a textbook or ‘nice example’ website). However, if there is minimal cleaning to do, then put more effort into something else. You do NOT need to use all the variables in your dataset; indeed, you may end up cleaning/analyzing only 6 to 10 variables. Your goal is not to be comprehensive, but to demonstrate what you’ve learned.

If you decide not to find your own data, you can use one of the following three datasets, all available on CANVAS under Files > Final Project Information. Information on the variables and collection methods is also provided.

- World Bank Data from 2016
- Environmental Attitudes from the General Social Survey of 2000
- Food Choices (we looked briefly at a few variables in class): https://www.kaggle.com/borapajo/food-choices

Format

Your project should be presented as a report; it should have appropriate RMarkdown formatting and discussions should be in complete sentences. There is no minimum length (brevity and clarity are admired), and your knitted report should not be more than 15 pages long, including graphs and relevant output (just suppress irrelevant output). You should NOT have pages of output that you don’t discuss. You also don’t need to have RMarkdown show every last bit of output your code creates. It should feel more formal than a homework assignment, but you should be extremely concise in your discussion.

Sections of the Report

- Introduction (background, motivation): not more than a short paragraph.
- DATA: Make a LIST of all variables you actually use and describe units and anything else I should know. Ignore variables you don’t discuss.
- Data cleaning process: describe the cleaning process you used on your data. Talk about what issues you encountered.
- Descriptive plots, summary information. Plots should be clearly labeled, well formatted, and display an aesthetic sense.
- Analysis: see below.
- Conclusions and Summary: a short paragraph.

Content Requirements

Your report should include evidence of your ability in each of the following areas:

1) Data Cleaning: demonstrate use of find/replace, data cleaning, dealing with missing values, text character replacement, matching. It’s ok if your data didn’t require much of this.
2) Graphics: show appropriate use of at least ONE of each of the following: boxplot, scatterplot (can be matrix plot), normal quantile plot (can be related to regression), residual plots, histogram.
3) Basic tests: t-test, correlation, AND ability to create a bootstrap confidence interval for either a t-test or a correlation.
4) Permutation Test: include at least one.
5) Multiple Regression: use either backwards stepwise regression or some form of best subsets regression. Should include residual plots. A GLM with a mix of continuous and categorical predictors is fine here.
6) AT LEAST ONE OF THE FOLLOWING TECHNIQUES: ANOVA, ANCOVA, Logistic Regression, Multinomial Regression, OR data scraping off a website.

Additional Comments

Please do NOT have appendices. Unlike a journal article, include relevant plots and output in the section where you discuss the results (more of a narrative). This said, you should ONLY include output that is relevant to your discussion. I can always look at your RMarkdown code if I have questions. It is fine to suppress both long output and parts of your R code. As you work on this project, I expect you will regularly pester me and the TAs.

Submission - Please read this carefully

1) ONLY ONE person in a group should upload a copy of the final project (i.e. if there are three people in a group, only one person needs to upload the files).
2) BE SURE to put all members’ names on your project documents.
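
Content requirements 3 and 4 ask for a bootstrap confidence interval and a permutation test. The following is only a minimal sketch of both ideas, using simulated stand-in data; the variable names, the seed, the number of resamples, and the choice of a percentile interval are all assumptions made for illustration, not part of the guidelines.

```r
set.seed(230)  # arbitrary seed so the sketch is reproducible

# Simulated stand-in data (hypothetical; replace with two continuous variables)
n <- 60
x <- rnorm(n)
y <- 0.6 * x + rnorm(n)

# Bootstrap percentile CI for the correlation:
# resample (x, y) pairs with replacement and recompute the correlation
n_boot <- 2000
boot_cor <- replicate(n_boot, {
  idx <- sample(n, replace = TRUE)
  cor(x[idx], y[idx])
})
ci <- quantile(boot_cor, c(0.025, 0.975))

# Permutation test for a difference in group means:
# shuffle the group labels to simulate "no group effect"
group <- rep(c("A", "B"), each = n / 2)
obs_diff <- mean(y[group == "A"]) - mean(y[group == "B"])
perm_diff <- replicate(n_boot, {
  shuffled <- sample(group)
  mean(y[shuffled == "A"]) - mean(y[shuffled == "B"])
})
p_value <- mean(abs(perm_diff) >= abs(obs_diff))  # two-sided p-value

print(ci)
print(p_value)
```

With real data, x and y would be two continuous columns and group a categorical column from the chosen dataset.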
---
title: "Final Project Rideshare"
author: "Jack Kidney"
date: '2022-07-28'
output: html_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

```{r, eval = F}
rm(list=ls())
```

```{r}
rideshare <- read.csv("/users/jackkidney/downloads/final project/rideshare.csv")
attach(rideshare)
```

Data cleaning:

```{r}
# there is some missing data.
total_missing <- sum(is.na(rideshare))
paste0("we find that ", total_missing, " rides are missing data from at least one column, which is ", (round((sum(is.na(rideshare)) / dim(rideshare)[1]), 3) * 100), "% of our total ride data.")
# it seems that data is missing only from the price category.
sapply(rideshare, function(x) sum(is.na(x)))
# time_stamp is in unix format (seconds since january 1st 1970)
# convert to friendlier format.
# timestamp
#install.packages("lubridate")
# use lubridate to convert.
#library("lubridate")
# date is given in a nice format, but we're going to pretend that we only had the unix time listed under timestamp
drop <- c("hour", "day", "month", "datetime")
rideshare <- rideshare[,!(names(rideshare) %in% drop)]
rideshare$timedate <- as.POSIXct(rideshare$timestamp, origin="1970-01-01 05:00:00")
# i double checked the dataset and for some reason the author doesn't explain, the origin in this case is 5am instead of 12am...
head(rideshare$timedate)
# convert timedate
```

```{r}
boxplot(price ~ name, data=rideshare)
hist(price)
```

Our histogram and boxplots look pretty right-skewed, so maybe a transformation is in order here. Let's check out a Box-Cox transformation.

```{r}
library(MASS)  # provides boxcox()
# first, we fit the simplest model possible.
model1 <- lm(price ~ distance)
# figure out what value of lambda (x) gives max value of log-likelihood (y)
trans <- boxcox(price ~ distance)
trans$x[which.max(trans$y)]
```

Homework 09 Two Way ANOVA ANCOVA GLM

Due by 11:59pm, Monday, August 1, 2022

S&DS 230e

This assignment uses data from the International Social Survey Program on Environment from 2000. There are over 100 questions from over 31000 individuals across 38 countries. The data you’ll need is here (http://reuningscherer.net/s&ds230/data/envdata.csv). Be aware that it will take a few moments to load this data. You’ll also want the codebook that describes the variables (http://reuningscherer.net/s&ds230/data/Env_Survey_2000_Codebook.pdf).

1) Data Set Creation (23 pts - 3 pts each section, except part f which is 5 pts)

1.1) Read the data into an object called envdat (do not use the option as.is=TRUE). Check the dimension to be sure the data loaded correctly. Then create a new object called envdat2 which only contains information for the following countries: USA, Norway, Russia, New Zealand, Canada, Japan, and Mexico. The variable that contains country is V3. You’ll need to use the codebook to figure out which number goes with which country. Check the dimensions of your results - you should have 9102 observations.

envdat <- read.csv("http://reuningscherer.net/s&ds230/data/envdata.csv", as.is=F)
dim(envdat)
## [1] 31042 98
#making envdat2
envdat2 <- envdat[envdat$v3 %in% c(6, 12, 18, 19, 20, 24, 38),]

1.2) Create a new variable called country on envdat2 which has country names rather than country numbers. There are several ways to do this, but I suggest you use the recode() function in the car package. The syntax for this function is something like

library(car)
envdat2$country <- recode(envdat2$v3, "6='USA' ; 12='Norway' ; 18='Russia' ; 19='New Zealand' ; 20='Canada' ; 24='Japan' ; 38='Mexico' ")

Once you’ve created the variable, make a table of the resulting variable to see how many observations there are from each country.

table(envdat2$country)
## < table of extent 0 >

1.3) Make a variable Gender on envdat2 that contains gender (which is variable V200). Recode so that 1 becomes ‘Male’ and 2 becomes ‘Female’. Again, make a table of the resulting variable to see how many people identify as Male and how many as Female.

library(car)
envdat2$Gender <- recode(envdat2$v200, "1 = 'male'; 2 = 'female'")
recode(envdat2$v200, "1='Male' ; 2='Female'
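
Homework item 1.2 notes that there are several ways to recode country numbers. The sketch below shows one base-R alternative with factor(), on a small made-up data frame. It also illustrates one possible (unconfirmed) cause of the "table of extent 0" output above: R column names are case-sensitive, so if the column is actually named V3, as the homework text suggests, then envdat$v3 would be NULL and the subset would have no rows. The toy data frame and the upper-case column name are assumptions.

```r
# Toy data frame (hypothetical); the upper-case column name V3 is an assumption
toy <- data.frame(V3 = c(6, 12, 6, 38, 24, 6))

# Base-R alternative to car::recode(): map codes to labels with factor()
codes  <- c(6, 12, 18, 19, 20, 24, 38)
labels <- c("USA", "Norway", "Russia", "New Zealand", "Canada", "Japan", "Mexico")
toy$Country <- factor(toy$V3, levels = codes, labels = labels)
country_counts <- table(toy$Country)
print(country_counts)

# Column names are case-sensitive: a wrong-case name returns NULL,
# so the row filter is empty and the subset has zero rows
wrong_case   <- toy$v3
empty_subset <- toy[toy$v3 %in% codes, ]
print(is.null(wrong_case))
print(nrow(empty_subset))
```

Running names(envdat) on the real data would show the exact spelling of the country and gender columns.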
Answered 1 day after Aug 02, 2022

Mansi answered on Aug 03 2022
Final Project Report
Dataset used: Car Sales Data
1. Introduction
This data is taken from http://kaggle.com. It contains information on the specifications of different cars. It
would be interesting to see how these variables are related to each other and which factors significantly
affect car sales. The data has 16 variables (columns) and 157 observations (rows).
2. Data
1. Manufacturer (Nominal Scale)
2. Model (Nominal Scale)
3. Sales (in thousands)
4. Resale value
5. Price (in thousands)
6. Engine size (litres)
7. Horsepower (in KW)
8. Wheelbase (in inches)
9. Width (in inches)
10. Length (in inches)
11. Curb_weight (in tonnes)
12. Fuel_capacity (Gallons)
13. Fuel_efficiency (miles per gallon)
14. Vehicle_type (Nominal Scale)
Data Importing in R
Import the data into R as mydata, then check its dimensions and structure.
mydata<-read.csv("/Users/mansikhurana/Documents/Grey Nodes/R markdown project/car_sales.csv")
dim(mydata)
## [1] 157 16
str(mydata)
## 'data.frame': 157 obs. of 16 variables:
## $ Manufacturer : Factor w/ 30 levels "Acura","Audi",..: 1 1 1 1 2 2 2 3 3 3 ...
## $ Model : Factor w/ 156 levels "3-Sep","3000GT",..: 80 146 39 121 9 10 11 4 5 8 ...
## $ Sales_in_thousands : num 16.92 39.38 14.11 8.59 20.4 ...
## $ X__year_resale_value: num 16.4 19.9 18.2 29.7 22.3 ...
## $ Vehicle_type : Factor w/ 2 levels "Car","Passenger": 2 2 2 2 2 2 2 2 2 2 ...
## $ Price_in_thousands : num 21.5 28.4 NA 42 24 ...
## $ Engine_size : num 1.8 3.2 3.2 3.5 1.8 2.8 4.2 2.5 2.8 2.8 ...
## $ Horsepower : int 140 225 225 210 150 200 310 170 193 193 ...
## $ Wheelbase : num 101 108 107 115 103 ...
## $ Width : num 67.3 70.3 70.6 71.4 68.2 76.1 74 68.4 68.5 70.9 ...
## $ Length : num 172 193 192 197 178 ...
## $ Curb_weight : num 2.64 3.52 3.47 3.85 3 ...
## $ Fuel_capacity : num 13.2 17.2 17.2 18 16.4 18.5 23.7 16.6 16.6 18.5 ...
## $ Fuel_efficiency : int 28 25 26 22 27 22 21 26 24 25 ...
## $ Latest_Launch : Factor w/ 130 levels "1/14/12","1/15/11",..: 48 94 10 53 21 120 51 92 8 74 ...
## $ Power_perf_factor : num 58.3 91.4 NA 91.4 62.8 ...
The structure of the data (str) shows the type of each variable: for example, Vehicle_type is a factor
with two levels, ‘Car’ and ‘Passenger’, while Sales, Price, Engine size, etc. are numeric vectors.
3. Data Cleaning
First, let us look at descriptive statistics of the data. Although R provides the summary function for
summary statistics, we write our own user-defined function to compute them.
We also create an outlier flag, which is set when the maximum is greater than mean + 3*SD or the
minimum is less than mean - 3*SD. P99 and P1 (the 99th and 1st percentiles) may also be used to
judge the presence of outliers in the data. In the outlier flag, 0 means no outlier and 1 means
an outlier is present. nmiss gives the number of missing observations in each variable.
# user written function for creating descriptive statistics
mystats <- function(x) {
nmiss<-sum(is.na(x))
a <- x[!is.na(x)]
m <- mean(a)
n <- length(a)
s <- sd(a)
min <- min(a)
p1<-quantile(a,0.01)
p99<-quantile(a,0.99)
max <- max(a)
UC <- m+3*s
LC <- m-3*s
outlier_flag<- max>UC | min<LC
return(c(n=n, nmiss=nmiss, outlier_flag=outlier_flag, mean=m,
stdev=s,min = min, p1=p1,p99=p99,max=max, UC=UC, LC=LC ))
}
vars <- c( "Sales_in_thousands" , "X__year_resale_value" , "Price_in_thousands",
"Engine_size" , "Horsepower", "Wheelbase" , "Width" ,"Power_perf_factor" , "Length" , "Curb_weight" ,
"Fuel_capacity", "Fuel_efficiency" )
diag_stats<-t(data.frame(apply(mydata[vars], 2, mystats)))
diag_stats
## n nmiss outlier_flag mean stdev min
## Sales_in_thousands 157 0 1 52.998076 68.0294220 0.11000
## X__year_resale_value 121 36 1 18.072975 11.4533841 5.16000
## Price_in_thousands 155 2 1 27.390755 14.3516532 9.23500
## Engine_size 156 1 1 3.060897 1.0446530 1.00000
## Horsepower 156 1 1 185.948718 56.7003209 55.00000
## Wheelbase 156 1 1 107.487179 7.6413030 92.60000
## Width 156 1 0 71.150000 3.4518719 62.60000
## Power_perf_factor 155 2 1 77.043591 25.1426641 23.27627
## Length 156 1 0 187.343590 13.4317543 149.40000
## Curb_weight 155 2 1 3.378026 0.6305016 1.89500
## Fuel_capacity 156 1 1 17.951923 3.8879213 10.30000
## Fuel_efficiency 154 3 1 23.844156 4.2827056 15.00000
## p1.1% p99.99% max UC LC
## Sales_in_thousands 0.93728 260.64532 540.5610 257.086342 -151.09018956
## X__year_resale_value 6.17300 60.22000 67.5500 52.433128 -16.28717709
## Price_in_thousands 10.23144 78.47980 85.5000 70.445714 -15.66420473
## Engine_size 1.55500 5.70000 8.0000 6.194856 -0.07306148
## Horsepower 96.40000 325.75000 450.0000 356.049681 15.84775537
## Wheelbase 92.87500 134.37500 138.7000 130.411089 84.56327040
## Width 66.08500 79.19000 79.9000 81.505616 60.79438441
## Power_perf_factor 38.46192 141.11946 188.1443 152.471583 1.61559890
## Length 154.91500 219.30500 224.5000 227.638853 147.04832689
## Curb_weight 2.24540 5.39668 5.5720 5.269531 1.48652090
## Fuel_capacity 11.90000 30.90000 32.0000 29.615687 6.28815928
## Fuel_efficiency 15.00000 33.00000 45.0000 36.692273 10.99603916
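
The 3-SD rule behind outlier_flag can be checked on a small vector. This is a minimal sketch under the same definition (UC = mean + 3*SD, LC = mean - 3*SD); the helper name flag_outlier and the toy vectors are hypothetical and not part of the report.

```r
# Hypothetical helper: TRUE if any value lies outside mean +/- 3 * SD
flag_outlier <- function(x) {
  a  <- x[!is.na(x)]          # drop missing values, as mystats does
  uc <- mean(a) + 3 * sd(a)   # upper cutoff (UC)
  lc <- mean(a) - 3 * sd(a)   # lower cutoff (LC)
  max(a) > uc | min(a) < lc
}

with_extreme    <- c(1:20, 200)  # one value far above the rest
without_extreme <- 1:20

print(flag_outlier(with_extreme))
print(flag_outlier(without_extreme))
```

Because a single extreme value also inflates the mean and SD, this rule can miss outliers in small samples, which is one reason the P1 and P99 columns are a useful second check.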
Treating Outliers
## OUTLIERS
mydata$Sales_in_thousands[mydata$Sales_in_thousands>257.086342425636] <-257.086342425636
mydata$X__year_resale_value[mydata$X__year_resale_value>52.4331275042866] <-52.4331275042866
mydata$Price_in_thousands[mydata$Price_in_thousands>70.4457144064253] <-70.4457144064253
For the three variables above, outliers were detected, so values greater than UC are replaced
by the respective UC.
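
The cutoffs above are typed in by hand from the diag_stats table. As an alternative sketch, the same capping can be computed directly from the data with pmin(); the helper name cap_upper and the toy vector are hypothetical, and this is not the code used in the report.

```r
# Hypothetical helper: cap values above mean + 3 * SD at that cutoff
cap_upper <- function(x) {
  uc <- mean(x, na.rm = TRUE) + 3 * sd(x, na.rm = TRUE)
  pmin(x, uc)  # element-wise minimum; NA entries stay NA
}

sales  <- c(1:20, 200, NA)  # toy values, not the car sales data
cutoff <- mean(sales, na.rm = TRUE) + 3 * sd(sales, na.rm = TRUE)
capped <- cap_upper(sales)

print(cutoff)
print(max(capped, na.rm = TRUE))
```

Computing the cutoff inside the function avoids copying long decimal constants and keeps the result in sync if the data change.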
Treating Missing values
## Missing value treatment
mydata<- mydata[!is.na(mydata$Sales_in_thousands),] # dropping obs where values are missing
require(Hmisc)
## Loading required package: Hmisc
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
## Loading required package: ggplot2
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:base':
##
## format.pval, units
mydata1<-data.frame(apply(mydata[vars],2, function(x) impute(x, mean))) #Imputing missings with mean
mydat2<-cbind(mydata1,Vehicle_type=mydata$Vehicle_type )
The impute function from the Hmisc package is used to replace the missing values in our data with
the mean of each variable. mydata1 now contains no missing values, and the outliers
were already treated in the previous step. mydat2, which adds Vehicle_type back to mydata1, is the
final data that we will use for model building.
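
Mean imputation can also be written in base R without Hmisc. The following is a minimal sketch on a made-up data frame; the helper name impute_mean and the toy values are hypothetical.

```r
# Hypothetical helper: replace NA entries with the mean of the observed values
impute_mean <- function(x) {
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  x
}

# Toy data frame (not the car sales data)
toy <- data.frame(price = c(10, NA, 30), width = c(NA, 70, 72))
toy_imputed <- as.data.frame(lapply(toy, impute_mean))

print(toy_imputed)
```

Mean imputation keeps every row but shrinks the variance of the imputed variables, which is worth keeping in mind when interpreting later model results.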
4. Plots/Graphics
# Histogram of...