1 University at Buffalo, Industrial and Systems Engineering IE322 Analytics and Computing for Industrial Engineers Lab#2 Fall 2022 Data Preprocessing and Linear Regression (This is an...

1 answer below »
Analytics and Computing for Industrial Engineers
Data Preprocessing and Linear Regression




1 University at Buffalo, Industrial and Systems Engineering IE322 Analytics and Computing for Industrial Engineers Lab#2 Fall 2022 Data Preprocessing and Linear Regression (This is an individual lab) Name: ________________ Due 23:59 October 23th, 2022 Description: The dataset for this lab is SalesData.csv (the same one as used in Lab 1), and it is available on UBlearns. The dataset is used to record the sales information of a company. Requirements: Draft a report to document your R code and results (or partial results if there are too many) in each step. Note that your report will be graded on both technical content (70%) and report quality (30%). Submit two files to UBLearns: 1) your report, and 2) your R script. 1. Data preprocessing (30 pts) Read the SalesData.csv data into R console as D0. (a) Identify the variable(s) that have missing data, and determine the number of missing values for each variable identified in your report. (b) Install R package “imputeTS” into your R environment, and import it using library(imputeTS). Note: this package is used for data imputation. (c) Perform data imputation by following methods: i. Using D0, delete all rows with missing data, and name the new dataset as S1. ii. Using D0, replace the missing values with the mean value of each numerical variable, and name the new dataset as S2. (Hint: you may use na_mean() in “imputeTS” package, or any other method for this data imputation). (d) Calculate the correlation between “Profit” and “Sales” for both S1 and S2. To compare how those two correlations are affected by different data imputation methods. Document your findings in the report. (e) Use any method to identify any outliers in “Shipping.Cost” column. Count how many outliers identified. Document your findings in the report. (f) Delete those outliers you identified in 2(e), and name the new dataset as S3. Abdullah Fahad Abdullah Fahad Abdullah Fahad Abdullah Fahad 2 2. EDA (20 pts) Using D0 for the following tasks. (a) Draw the histogram for the “Shipping.Cost” column, and change the y-axis as frequency. Then, add a density curve on it. Present the plot into your report. (b) Draw a single scatter plot to show the relationship between “Profit” (as in y-axis) and “Sales” (as in x-axis). Based on this plot: a. add a color of your choice for each “Ship.Mode” (i.e., Delivery Truck, Express Air, Regular Air). Make sure you choose different colors; b. add a legend. Present this plot into your report. (c) Draw a scatter plot in a matrix form (see slides Lecture 5, page 16) for "Order.Quantity", "Profit", and "Shipping.Cost". Present this plot into your plot. 3. Linear regression (50 pts) For the following tasks, use the “Profit” as dependent variable, and four independent variables: “Order.Quantity”, “Sales”, “Unit.Price”, “Shipping.Cost”. (a) Build four linear regression models (name as m0, m1, m2, m3) on dataset D0, S1, S2, and S3 respectively, using lm() function. (b) Fill out the following table in your report using summary(). Models Dataset & description Model performance RMSE R2 m0 D0: raw data m1 S1: excluded NAs m2 S2: imputed Nas with mean m3 S3: excluded outliers Your proposed model in (e) (c) Compare the model performance among m0, m1 and m2, and discuss how the data imputation methods affects the model performance. (d) Compare the model performance between m0 and m3, and discuss how the exclusion of outliers affects the model performance. (e) Propose another way to further improve the model performance in your report. Justify your method by building such a linear regression model. You may consider add additional variables (e.g., Store, company), or interaction/polynomial terms, based on the dataset S3. Fill out the table in (b) with your model results.
Answered 2 days AfterOct 18, 2022

Answer To: 1 University at Buffalo, Industrial and Systems Engineering IE322 Analytics and Computing for...

Mohd answered on Oct 21 2022
46 Votes
-
-
-
2022-10-21
Importing Data
library(readr)
salesdata <- read_csv("salesdata.csv", col_types = cols(`Order Date` = col_date(format = "%m/%d/%Y")))
View(salesdata)
First look of the Data
skimr::skim(salesdata)
Data summary
    Name
    salesdata
    Number of rows
    670
    Number of columns
    13
    _______________________
    
    Column type frequency:
    
    character
    7
    Date
    1
    numeric
    5
    ________________________
    
    Group variables
    None
Variable t
ype: character
    skim_variable
    n_missing
    complete_rate
    min
    max
    empty
    n_unique
    whitespace
    Order Priority
    2
    1
    3
    13
    0
    6
    0
    Ship Mode
    2
    1
    11
    14
    0
    3
    0
    Customer Name
    1
    1
    8
    18
    0
    85
    0
    Customer Segment
    0
    1
    8
    14
    0
    4
    0
    Product Category
    0
    1
    9
    15
    0
    3
    0
    company
    0
    1
    1
    1
    0
    1
    0
    Store
    0
    1
    1
    1
    0
    4
    0
Variable type: Date
    skim_variable
    n_missing
    complete_rate
    min
    max
    median
    n_unique
    Order Date
    0
    1
    2009-01-05
    2012-12-21
    2010-12-30
    395
Variable type: numeric
    skim_variable
    n_missing
    complete_rate
    mean
    sd
    p0
    p25
    p50
    p75
    p100
    hist
    Order Quantity
    0
    1
    26.08
    17.21
    -10.00
    12.00
    25.00
    39.00
    101.00
    ▅▇▆▁▁
    Sales
    1
    1
    1915.47
    3568.45
    4.99
    166.76
    497.14
    1773.86
    26133.39
    ▇▁▁▁▁
    Profit
    2
    1
    193.54
    1205.74
    -11053.60
    -82.33
    3.89
    161.82
    10951.31
    ▁▁▇▁▁
    Unit Price
    2
    1
    95.42
    242.49
    1.26
    7.25
    22.78
    99.99
    3499.99
    ▇▁▁▁▁
    Shipping Cost
    0
    1
    13.76
    18.04
    0.49
    3.14
    6.50
    15.10
    147.12
    ▇▁▁▁▁
There are missing values in sales, profit, Order priority, ship mode, customer name and unit price.
## removing all missing values records
library(imputeTS)
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
S1<-na.omit(salesdata)
#replacing categorical variable missing values with mode
#mode function
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
calc_mode <- function(x) {
u <- unique(x)
tab <- tabulate(match(x, u))
u[tab == max(tab)]
}
library(tidyverse)
## ── Attaching packages
## ───────────────────────────────────────
## tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6 ✔ purrr 0.3.4
## ✔ tibble 3.1.8 ✔ stringr 1.4.1
## ✔ tidyr 1.2.1 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
S2<-salesdata %>%
mutate(`Order Priority` = if_else(is.na(`Order Priority`),
calc_mode(`Order Priority`),
`Order Priority`))%>%
mutate(`Ship Mode` = if_else(is.na(`Ship Mode`),
calc_mode(`Ship Mode`),
`Ship Mode`))%>%
mutate(`Customer Name` = if_else(is.na(`Customer Name`),
calc_mode(`Customer Name`),
`Customer Name`))
#skimr::skim(S2)
S2<-na_mean(S2)
#skimr::skim(S2)
#Correletion between Sales and profit in S1
S1_c<-S1[,c("Sales","Profit")]
cor(S1_c)
## Sales Profit
## Sales 1.000000 0.480134
## Profit 0.480134 1.000000
#Correletion between Sales and profit in S2
S2_c<-S2[,c("Sales","Profit")]
cor(S2_c)
## Sales Profit
## Sales 1.0000000 0.4790883
## Profit 0.4790883 1.0000000
#Removing outliers in shipping cost
IQR<-mean(salesdata$`Shipping Cost`)+ 1.5*(15.1-3.14)
boxplot(salesdata$`Shipping Cost`, plot=FALSE)$out
## [1] 35.00 68.02 35.00 69.00 49.00 74.35 49.00 39.61 57.38 64.73
## [11] 35.00 35.00 58.95 53.48 62.74 43.75 53.03 50.00 60.49 36.61
## [21] 43.32 54.11 58.92 69.64 34.20 50.00 44.55 60.00 33.60 64.20
## [31] 66.27 99.00 36.09 41.91 35.02 35.00 87.01 48.80 66.27 69.30
## [41] 69.64 35.00 35.00 52.20 54.12 39.61 70.20 51.94 70.20 49.00
## [51] 49.00 60.00 56.14 35.00 69.00 60.00 36.61 64.73 66.67 64.73
## [61] 81.98 56.14 35.00 55.24 66.27 54.92 56.20 64.73 45.00 36.09
## [71] 70.20 84.84 49.00 69.00 49.00 60.00 45.51 55.24 147.12 49.00
## [81] 64.20 51.94
S3<-S2%>%
filter(`Shipping Cost`boxplot(S3$`Shipping Cost`)
hist(salesdata$`Shipping Cost`, prob = TRUE)

lines(density(salesdata$`Shipping Cost`), col = "green")
ggplot(salesdata, aes(x=Sales,y=Profit,color=`Ship Mode`)) +
...
SOLUTION.PDF

Answer To This Question Is Available To Download

Related Questions & Answers

More Questions »

Submit New Assignment

Copy and Paste Your Assignment Here