The dataset for this assignment contains house prices and 19 other features for each property. The features are described below and cover the house (number of bedrooms, bathrooms…), the lot (square footage…), and the sale conditions (time of year…). The overall goal of the assignment is to predict the sale price of a house using linear regression. The training set is in the file "house_prices_train.csv" and the test set is in the file "house_prices_test.csv".



Here is a brief description of each feature in the dataset:




  • SalePrice: the property's sale price in dollars. This is the target variable that you're trying to predict.


  • LotFrontage: Linear feet of street connected to property


  • LotArea: Lot size in square feet


  • YearBuilt: Original construction date


  • BsmtUnfSF: Unfinished square feet of basement area


  • TotalBsmtSF: Total square feet of basement area


  • 1stFlrSF: First Floor square feet


  • 2ndFlrSF: Second floor square feet


  • LowQualFinSF: Low quality finished square feet (all floors)


  • GrLivArea: Above grade (ground) living area square feet


  • FullBath: Full bathrooms above grade


  • HalfBath: Half baths above grade


  • BedroomAbvGr: Number of bedrooms above basement level


  • KitchenAbvGr: Number of kitchens


  • TotRmsAbvGrd: Total rooms above grade (does not include bathrooms)


  • GarageCars: Size of garage in car capacity


  • GarageArea: Size of garage in square feet


  • PoolArea: Pool area in square feet


  • MoSold: Month Sold


  • YrSold: Year Sold



Question 1: Data Cleaning



  • Open the training dataset and remove all rows that contain at least one missing value (NA)

  • Return the new clean dataset and the number of rows in that dataset
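
One possible way to approach this step is sketched below. It is a minimal sketch, not an official solution: the function name `clean_dataset` is hypothetical, and resetting the row index after dropping rows is a choice the assignment does not ask for.

```python
import pandas as pd


def clean_dataset(df: pd.DataFrame):
    """Drop every row with at least one missing value.

    Returns the cleaned DataFrame and its number of rows.
    (Hypothetical helper name; the assignment does not prescribe one.)
    """
    # how="any" removes a row as soon as one of its cells is missing
    clean = df.dropna(axis=0, how="any").reset_index(drop=True)
    return clean, len(clean)
```

A typical call would be `clean, n_rows = clean_dataset(pd.read_csv("house_prices_train.csv"))`. By default, `pd.read_csv` parses the string "NA" as a missing value, so no extra argument should be needed for that marker.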


Question 2:


For the training dataset, print a summary of the variables “LotArea”, “YearBuilt”, “GarageArea”, “BedroomAbvGr”, and “SalePrice”. Return the whole summary and a list containing, in this order:



  • The maximum sale price

  • The minimum garage area

  • The first quartile of lot area

  • The second most common year built

  • The mean of BedroomAbvGr



Hint: Use the built-in method describe() for a pandas.DataFrame
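
The hint above can be sketched as follows. This is an illustration under assumptions, not the expected answer: the names `SUMMARY_COLUMNS` and `summarize` are hypothetical, and `describe()` does not report how often each value occurs, so the second most common year is taken from `value_counts()` instead.

```python
import pandas as pd

# Columns named in the question (the constant name is an assumption)
SUMMARY_COLUMNS = ["LotArea", "YearBuilt", "GarageArea", "BedroomAbvGr", "SalePrice"]


def summarize(df: pd.DataFrame):
    """Return the describe() summary and the five requested statistics."""
    summary = df[SUMMARY_COLUMNS].describe()

    # value_counts() sorts by frequency, most common first,
    # so position 1 holds the second most common year.
    year_counts = df["YearBuilt"].value_counts()

    answers = [
        summary.loc["max", "SalePrice"],      # maximum sale price
        summary.loc["min", "GarageArea"],     # minimum garage area
        summary.loc["25%", "LotArea"],        # first quartile of lot area
        year_counts.index[1],                 # second most common year built
        summary.loc["mean", "BedroomAbvGr"],  # mean number of bedrooms
    ]
    return summary, answers
```

Two caveats: the sketch needs at least two distinct values in "YearBuilt", and if two years are tied in frequency, which one ends up in second place depends on how `value_counts()` orders the tie.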


Question 3:


Run a linear regression of "SalePrice" on the variables “LotArea”, “YearBuilt”, “GarageArea”, and “BedroomAbvGr”. For each variable, return its regression coefficient in a dictionary similar to this: {“LotArea”: 1.888, “YearBuilt”: -0.06, ...} (this is only an example, not the right answer).




Compute the Root Mean Squared Error (RMSE) using the file "house_prices_test.csv" to measure the out-of-sample performance of the model.
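
The fit and the out-of-sample RMSE could be sketched as below. The assignment does not name a library, so this sketch assumes ordinary least squares with an intercept, solved with `numpy.linalg.lstsq`; the function names `fit_linear_regression` and `rmse_on` are hypothetical. It also assumes the rows passed in contain no missing values in the columns used.

```python
import numpy as np
import pandas as pd


def fit_linear_regression(train: pd.DataFrame, features, target="SalePrice"):
    """Ordinary least squares with an intercept (assumed, not stated in the task).

    Returns the intercept and a {feature: coefficient} dictionary.
    """
    X = train[features].to_numpy(dtype=float)
    X = np.column_stack([np.ones(len(X)), X])  # leading column of ones = intercept
    y = train[target].to_numpy(dtype=float)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    coefs = {name: float(b) for name, b in zip(features, beta[1:])}
    return float(beta[0]), coefs


def rmse_on(test: pd.DataFrame, intercept, coefs, target="SalePrice"):
    """Root mean squared error of the fitted model on another dataset."""
    features = list(coefs)
    weights = np.array([coefs[f] for f in features])
    predictions = intercept + test[features].to_numpy(dtype=float) @ weights
    errors = test[target].to_numpy(dtype=float) - predictions
    return float(np.sqrt(np.mean(errors ** 2)))
```

With `features = ["LotArea", "YearBuilt", "GarageArea", "BedroomAbvGr"]`, the dictionary returned by `fit_linear_regression` has the shape the question asks for, and `rmse_on` applied to the data from "house_prices_test.csv" gives the out-of-sample error.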



Question 4:


Refit the model on the training set using all the variables and return the RMSE on the test set.


(The first column "unnamed: 0" is not a variable.)
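
A self-contained sketch of the full-feature refit follows. It rests on the same assumptions as a plain least-squares fit with an intercept via `numpy.linalg.lstsq`, and the helper names `all_feature_names` and `fit_all_and_score` are hypothetical. The index column is filtered by a case-insensitive "unnamed" prefix, which is an assumption about how the file is loaded.

```python
import numpy as np
import pandas as pd


def all_feature_names(df: pd.DataFrame, target="SalePrice"):
    """Every column except the target and the leftover index column."""
    return [
        c for c in df.columns
        if c != target and not str(c).lower().startswith("unnamed")
    ]


def fit_all_and_score(train: pd.DataFrame, test: pd.DataFrame, target="SalePrice"):
    """Fit least squares on all features of `train`; return the RMSE on `test`."""
    features = all_feature_names(train, target)

    def design(df):
        # Column of ones for the intercept, then the feature columns
        return np.column_stack([np.ones(len(df)), df[features].to_numpy(dtype=float)])

    y_train = train[target].to_numpy(dtype=float)
    beta, *_ = np.linalg.lstsq(design(train), y_train, rcond=None)

    residuals = test[target].to_numpy(dtype=float) - design(test) @ beta
    return float(np.sqrt(np.mean(residuals ** 2)))
```

If some columns happen to be exact linear combinations of others, `lstsq` still returns a least-squares solution, so the predictions and the RMSE remain well defined even though the individual coefficients are then not unique.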

Apr 20, 2021