*** FOR THE ASSIGNMENT, I JUST NEED ASSISTANCE WITH QUESTIONS 3, 4, 5. I ALREADY DID 1 AND 2 AND I CAN FINISH THE REST ***













Option #1: Linear Regression Model





In this Critical Thinking Assignment, you will install R Markdown, explore and summarize a dataset as well as create a linear regression model. Your assignment submission will be an R Markdown generated Word document.


Install R Markdown. Create a new R Markdown file by performing the following steps.







  1. Open R Studio



  2. Select File | New | R Markdown



  3. Use "Module 3 CT Option 1" as the Title



  4. Use your name as the Author



  5. Select the Word output format



  6. Delete all default content after the R Setup block of code, which is all content from line 12 through the end of the file.






Explore Boston housing in the BostonHousing.csv file by performing the following steps.







  1. Apply what you learned in Modules 1 and 2 about data exploration by selecting and running appropriate data exploration functions. Run at least five functions.



  2. For your assignment submission, copy your commands into your R Markdown file.



    1. Include R comments on all your code.



    2. Separate sections of R code by using appropriate R Markdown headings.









  3. Fit a multiple linear regression model to the median house price (MEDV) as a function of CRIM, CHAS, and RM by following the process under "Example: Predicting the Price of Used Toyota Corolla Cars" in section 6.3.



  4. Use the R code example shown in Figure 6.3.




    Hint: You will need to remove the categorical variable CAT…MEDV prior to fitting the multiple linear regression model.







  5. Create a scatter plot with the plot() function with the following attributes.



    1. Use MEDV as the y-axis



    2. Use the most significant attribute as the x-axis.



    3. Use the abline() function to add a linear regression line to the scatter plot. Use the y-intercept as the y-value and the factor value of the most significant attribute as the slope value.









  6. For your assignment submission, copy your commands into your R Markdown file.



    1. Include R comments on all your code.



    2. Separate sections of R code by using appropriate R Markdown headings.









  7. Use the R Markdown "Knit" drop-down menu to select "Knit to Word" to create the Word document for your assignment submission.
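
For steps 3 to 5 above, a minimal sketch in R might look like the following. It is not the textbook's Figure 6.3 code, only an illustration of the same lm() workflow. The column names CRIM, CHAS, RM, and MEDV come from the assignment text; everything else is an assumption. In particular, `housing.df` is a small simulated stand-in so the block runs on its own, and the imported name of the categorical column is not known here, so it is dropped by matching the "CAT" prefix.

```r
# Simulated stand-in for BostonHousing.csv so this sketch runs on its own.
# In the assignment, replace this block with:
#   housing.df <- read.csv("BostonHousing.csv")
set.seed(1)
n <- 200
housing.df <- data.frame(
  CRIM = rexp(n, rate = 0.3),            # crime rate (simulated values)
  CHAS = rbinom(n, 1, 0.1),              # 0/1 indicator (simulated values)
  RM   = rnorm(n, mean = 6.3, sd = 0.7)  # rooms per dwelling (simulated values)
)
housing.df$MEDV <- -30 + 8.5 * housing.df$RM - 0.25 * housing.df$CRIM +
  4 * housing.df$CHAS + rnorm(n, sd = 5)
housing.df$CAT..MEDV <- as.integer(housing.df$MEDV > 30)  # assumed column name

# Hint from the assignment: drop the categorical CAT column before fitting.
# Matching on the "CAT" prefix avoids depending on the exact imported name.
housing.df <- housing.df[, !grepl("^CAT", names(housing.df))]

# Step 3: multiple linear regression of MEDV on CRIM, CHAS, and RM.
housing.lm <- lm(MEDV ~ CRIM + CHAS + RM, data = housing.df)
options(scipen = 999)  # avoid scientific notation in the printed summary
print(summary(housing.lm))

# Step 5: pick the predictor with the smallest p-value (most significant).
coef.table <- coef(summary(housing.lm))
p.values <- coef.table[-1, "Pr(>|t|)"]  # drop the intercept row
top.var <- names(which.min(p.values))

# Scatter plot of MEDV against that predictor, with a line whose intercept is
# the model intercept and whose slope is that predictor's coefficient.
plot(housing.df[[top.var]], housing.df$MEDV,
     xlab = top.var, ylab = "MEDV",
     main = paste("MEDV vs.", top.var))
abline(a = coef(housing.lm)["(Intercept)"],
       b = coef(housing.lm)[top.var],
       col = "red")
```

Because the intercept and slope come from the multiple regression, the drawn line corresponds to the other predictors being zero, so it may not pass through the middle of the points the way a simple regression line would.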









[Figure: software drop-down menu with the "Knit to Word" option selected.]





Your assignment submission must be one Word document that meets the following requirements:







  • Is an R Markdown generated Word document containing all R code used in this assignment, appropriate R comments on code, and appropriate R Markdown headings.



  • Does not include a cover page.



  • Does not include an abstract.



  • Includes a one-page description of what you did and what you learned. Add this description to the end of the R Markdown document as a new page. This page must conform to APA guidelines in the CSU Global Writing Center.






**************************************************










"6.3 Estimating the Regression Equation and Prediction



Once we determine the predictors to include and their form, we estimate the coefficients of the regression formula from the data using a method called ordinary least squares (OLS). This method finds values that minimize the sum of squared deviations between the actual outcome values (Y) and their predicted values based on that model ($\hat{Y}$).



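To predict the value of the outcome variable for a record with predictor values $x_1, x_2, \ldots, x_p$, we use the equation

$$\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 + \cdots + \hat{\beta}_p x_p \tag{6.2}$$"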
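
As a rough illustration of the OLS idea and of prediction with equation (6.2), the sketch below estimates coefficients with the normal equations, compares them to lm(), and predicts for one new record. The data are simulated, and all variable names (`x1`, `x2`, `beta.hat`, `sse`) are made up for this example.

```r
# Simulated data: two predictors and an outcome (all values are made up).
set.seed(42)
n  <- 100
x1 <- runif(n, 0, 10)
x2 <- rnorm(n)
y  <- 3 + 2 * x1 - 1.5 * x2 + rnorm(n)

# OLS via the normal equations: beta.hat = (X'X)^{-1} X'y
X <- cbind(Intercept = 1, x1 = x1, x2 = x2)
beta.hat <- solve(crossprod(X), crossprod(X, y))

# The same coefficients from lm()
fit <- lm(y ~ x1 + x2)
print(cbind(normal.eq = drop(beta.hat), lm = coef(fit)))

# Prediction for a new record (x1 = 4, x2 = 0.5):
# y.hat = b0 + b1 * x1 + b2 * x2
new.x <- c(1, 4, 0.5)
y.hat <- sum(drop(beta.hat) * new.x)
print(y.hat)

# Moving a coefficient away from the OLS estimate increases the
# sum of squared deviations.
sse <- function(b) sum((y - X %*% b)^2)
print(c(ols = sse(beta.hat), perturbed = sse(beta.hat + c(0.1, 0, 0))))
```

In practice lm() and predict() do this work; the explicit matrix form is shown only to make the "minimize the sum of squared deviations" statement concrete.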



















"Predictions based on this equation are the best predictions possible in the sense that they will be unbiased (equal to the true values on average) and will have the smallest mean squared error compared to any unbiased estimates if we make the following assumptions:













  1. The noise ε (or equivalently, Y) follows a normal distribution.
  2. The choice of predictors and their form is correct (linearity).
  3. The records are independent of each other.
  4. The variability in the outcome values for a given set of predictors is the same regardless of the values of the predictors (homoskedasticity).













An important and interesting fact for the predictive goal is that even if we drop the first assumption and allow the noise to follow an arbitrary distribution, these estimates are very good for prediction, in the sense that among all linear models, as defined by equation (6.1), the model using the least squares estimates, $\hat{\beta}$, will have the smallest mean squared errors. The assumption of a normal distribution is required in explanatory modeling, where it is used for constructing confidence intervals and statistical tests for the model parameters.













Even if the other assumptions are violated, it is still possible that the resulting predictions are sufficiently accurate and precise for the purpose they are intended for. The key is to evaluate predictive performance of the model, which is the main priority. Satisfying assumptions is of secondary interest and residual analysis can give clues to potential improved models to examine.
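
A quick residual analysis along these lines could be sketched as follows. The data are simulated with noise that grows with the predictor, so the homoskedasticity assumption is deliberately violated; the names `fit`, `res`, and `spread` are illustrative only.

```r
# Simulated data where the noise standard deviation grows with x.
set.seed(7)
n <- 150
x <- runif(n, 1, 10)
y <- 5 + 2 * x + rnorm(n, sd = 0.5 * x)
fit <- lm(y ~ x)

res <- residuals(fit)

# Residuals vs. fitted values: a funnel shape suggests non-constant variance.
plot(fitted(fit), res, xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)

# Normal Q-Q plot of the residuals: departures from the line suggest
# non-normal noise.
qqnorm(res)
qqline(res)

# Rough numeric check: residual spread in the lower vs. upper half of x.
spread <- tapply(res, x > median(x), sd)
print(spread)
```

Plots like these only give clues; as the passage notes, predictive performance on held-out data is the main criterion.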













Example: Predicting the Price of Used Toyota Corolla Cars



A large Toyota car dealership offers purchasers of new Toyota cars the option to buy their used car as part of a trade-in. In particular, a new promotion promises to pay high prices for used Toyota Corolla cars for purchasers of a new car. The dealer then sells the used cars for a small profit. To ensure a reasonable profit, the dealer needs to be able to predict the price that the dealership will get for the used cars. For that reason, data were collected on all previous sales of used Toyota Corollas at the dealership. The data include the sales price and other information on the car, such as its age, mileage, fuel type, and engine size. A description of each of these variables is given in Table 6.1. A sample of this dataset is shown in Table 6.2. The total number of records in the dataset is 1000 cars (we use the first 1000 cars from the dataset ToyotaCorolla.csv). After partitioning the data into training (60%) and validation (40%) sets, we fit a multiple linear regression model between price (the outcome variable) and the other variables (as predictors) using only the training set. Table 6.3 shows the estimated coefficients. Notice that the Fuel Type predictor has three categories (Petrol, Diesel, and CNG). We therefore have two dummy variables in the model: Fuel_TypePetrol (0/1) and Fuel_TypeDiesel (0/1); the third, for CNG (0/1), is redundant given the information on the first two dummies. Including the redundant dummy would cause the regression to fail, since the redundant dummy will be a perfect linear combination of the other two; R's "lm" routine handles this issue automatically.
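
The dummy-variable point can be checked with a small sketch. The six rows below reuse Price, Age, and Fuel Type values from Table 6.2; the data frame name `cars` and the hand-built third dummy are illustrative assumptions.

```r
# A few rows (Price, Age, Fuel Type) taken from Table 6.2.
cars <- data.frame(
  Price     = c(13500, 21500, 11950, 16950, 12950, 20950),
  Age       = c(23, 27, 39, 30, 23, 25),
  Fuel_Type = factor(c("Diesel", "Petrol", "CNG", "Petrol", "Diesel", "Petrol"))
)

# model.matrix() shows the design matrix lm() would build:
# an intercept plus two dummies; the first factor level (CNG) is the baseline.
X <- model.matrix(Price ~ Age + Fuel_Type, data = cars)
print(colnames(X))

# Adding a third dummy by hand makes the columns linearly dependent:
# the matrix has 5 columns but its rank stays at 4.
X.full <- cbind(X, Fuel_TypeCNG = as.numeric(cars$Fuel_Type == "CNG"))
print(c(columns = ncol(X.full), rank = qr(X.full)$rank))
```

The rank deficit is the "perfect linear combination" mentioned above: the three dummies always sum to the intercept column.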























Table 6.1 Variables in the Toyota Corolla Example

Variable    Description
Price       Offer price in Euros
Age         Age in months as of August 2004
Kilometers  Accumulated kilometers on odometer
Fuel Type   Fuel type (Petrol, Diesel, CNG)
HP          Horsepower
Metallic    Metallic color? (Yes = 1, No = 0)
Automatic   Automatic (Yes = 1, No = 0)
CC          Cylinder volume in cubic centimeters
Doors       Number of doors
QuartTax    Quarterly road tax in Euros
Weight      Weight in kilograms"



















"Table 6.2 Prices and Attributes for Used Toyota Corolla Cars (selected rows and columns only)





















































Price  Age  Kilometers  Fuel Type   HP  Metallic  Automatic    CC  Doors  Quart Tax  Weight
13500   23       46986  Diesel      90         1          0  2000      3        210    1165
13750   23       72937  Diesel      90         1          0  2000      3        210    1165
13950   24       41711  Diesel      90         1          0  2000      3        210    1165
14950   26       48000  Diesel      90         0          0  2000      3        210    1165
13750   30       38500  Diesel      90         0          0  2000      3        210    1170
12950   32       61000  Diesel      90         0          0  2000      3        210    1170
16900   27       94612  Diesel      90         1          0  2000      3        210    1245
18600   30       75889  Diesel      90         1          0  2000      3        210    1245
21500   27       19700  Petrol     192         0          0  1800      3        100    1185
12950   23       71138  Diesel      69         0          0  1900      3        185    1105
20950   25       31461  Petrol     192         0          0  1800      3        100    1185
19950   22       43610  Petrol     192         0          0  1800      3        100    1185
19600   25       32189  Petrol     192         0          0  1800      3        100    1185
21500   31       23000  Petrol     192         1          0  1800      3        100    1185
22500   32       34131  Petrol     192         1          0  1800      3        100    1185
22000   28       18739  Petrol     192         0          0  1800      3        100    1185
22750   30       34000  Petrol     192         1          0  1800      3        100    1185
17950   24       21716  Petrol     110         1          0  1600      3         85    1105
16750   24       25563  Petrol     110         0          0  1600      3         19    1065
16950   30       64359  Petrol     110         1          0  1600      3         85    1105
15950   30       67660  Petrol     110         1          0  1600      3         85    1105
16950   29       43905  Petrol     110         0          1  1600      3        100    1170
15950   28       56349  Petrol     110         1          0  1600      3         85    1120
16950   28       32220  Petrol     110         1          0  1600      3         85    1120
16250   29       25813  Petrol     110         1          0  1600      3         85    1120
15950   25       28450  Petrol     110         1          0  1600      3         85    1120
17495   27       34545  Petrol     110         1          0  1600      3         85    1120
15750   29       41415  Petrol     110         1          0  1600      3         85    1120
11950   39       98823  CNG        110         1          0  1600      5        197    1119









































































Table 6.3 Linear regression model of price vs. car attributes































































code for fitting a regression model

































car.df <- read.csv("ToyotaCorolla.csv")
# use first 1000 rows of data
car.df <- car.df[1:1000, ]
# select variables for regression
selected.var <- c(3, 4, 7, 8, 9, 10, 12, 13, 14, 17, 18)

# partition data
set.seed(1) # set seed for reproducing the partition
train.index <- sample(c(1:1000), 600)
train.df <- car.df[train.index, selected.var]
valid.df <- car.df[-train.index, selected.var]

# use lm() to run a linear regression of Price on all the predictors in the
# training set (it will automatically turn Fuel_Type into dummies).
# use . after ~ to include all the remaining columns in train.df as predictors.
car.lm <- lm(Price ~ ., data = train.df)
# use options() to ensure numbers are not displayed in scientific notation.
options(scipen = 999)
summary(car.lm)













Partial Output













> summary(car.lm)

Call:
lm(formula = Price ~ ., data = train.df)

Residuals:
    Min      1Q  Median      3Q     Max
-8212.5  -839.2   -14.3   831.5  7270.7

Coefficients:
                    Estimate  Std. Error t value             Pr(>|t|)
(Intercept)     -1774.877829 1643.744823  -1.080               0.2807
Age_08_04        -135.430875    4.875906 -27.776 < 0.0000000000000002 ***
KM                 -0.019003    0.002341  -8.116  0.00000000000000283 ***
Fuel_TypeDiesel  1208.339159  534.431400   2.261               0.0241 *
Fuel_TypePetrol  2425.876714  520.587979   4.660  0.00000391697679667 ***
HP                 38.985537    5.587183   6.978  0.00000000000811621 ***
Met_Color          84.792715  126.883452   0.668               0.5042
Automatic         306.684154  289.433138   1.060               0.2898
CC                  0.031966    0.099075   0.323               0.7471
Doors             -44.157742   64.056530  -0.689               0.4909
Quarterly_Tax      16.677343    2.602668   6.408  0.00000000030287017 ***
Weight             12.667487    1.536587   8.244  0.00000000000000109 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1406 on 588 degrees of freedom
Multiple R-squared: 0.8567, Adjusted R-squared: 0.854
F-statistic: 319.6 on 11 and 588 DF, p-value: < 0.00000000000000022















"The regression coefficients are then used to predict prices of individual used Toyota Corolla cars based on their age, mileage, and so on. Table 6.4 shows a sample of predicted prices for 20 cars in the validation set, using the estimated model. It gives the predictions and their errors (relative to the actual prices) for these 20 cars. Below the predictions, we have overall measures of predictive accuracy. Note that the mean error (ME) is $−40 and RMSE = $1321. A histogram of the residuals (Figure 6.1) shows that most of the errors are between ±$2000. This error magnitude might be small relative to the car price, but should be taken into account when considering the profit. Another observation of interest is the large positive residuals (under-predictions), which may or may not be a concern, depending on the application. Measures such as the mean error, and error percentiles are used to assess the predictive performance of a model and to compare models."

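The accuracy measures named in this passage, ME and RMSE, can be computed by hand on a validation set. The sketch below uses simulated data with made-up coefficients (`sim.df` and its values are assumptions, not the Toyota data) and base R only, so the numbers will not match the textbook's.

```r
# Simulated stand-in data (not the Toyota Corolla file).
set.seed(1)
n <- 1000
sim.df <- data.frame(Age = runif(n, 1, 80), KM = runif(n, 1000, 200000))
sim.df$Price <- 20000 - 120 * sim.df$Age - 0.02 * sim.df$KM +
  rnorm(n, sd = 1300)

# 60% training / 40% validation partition
train.index <- sample(seq_len(n), 600)
train.df <- sim.df[train.index, ]
valid.df <- sim.df[-train.index, ]

# Fit on the training set only, then predict the validation records.
fit <- lm(Price ~ ., data = train.df)
pred <- predict(fit, newdata = valid.df)

# Error = actual - predicted, so a positive error is an under-prediction.
err  <- valid.df$Price - pred
ME   <- mean(err)           # mean error
RMSE <- sqrt(mean(err^2))   # root mean squared error
print(c(ME = ME, RMSE = RMSE))

# Histogram of validation errors
hist(err, breaks = 25, xlab = "Residuals", main = "Validation errors")
```

ME shows whether predictions are biased high or low on average, while RMSE summarizes the typical size of an error in the outcome's own units.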












Jan 13, 2023