--- title: "Assignment 6: ML" date: output: bookdown::html_document2: toc: true toc_float: true theme: flatly highlight: monochrome code_folding: hide markdown_extensions: XXXXXXXXXXadmonition ---...

1 answer below »
Make only HTML Document in R studio ( Econometrics course )



--- title: "Assignment 6: ML" date: output: bookdown::html_document2: toc: true toc_float: true theme: flatly highlight: monochrome code_folding: hide markdown_extensions: - admonition --- ```{r setup, include=FALSE} knitr::opts_chunk$set( echo = TRUE, message = FALSE, warning = FALSE, messages = FALSE ) # kable options(knitr.kable.NA = '') # Set the graphical theme ggplot2::theme_set(ggplot2::theme_light()) # digits options(pillar.sigfig = 7 ,digits=3) ## New packages library(tidymodels) library(ISLR) library(glmnet) library(broom) # load library(tidyverse) library(AER) library(kableExtra) library(gridExtra) library(haven) library(modelsummary) ``` # Instructions Go through text and code, and replace "..." with the correct code. Make sure you turn `eval = T` after completing a code chunk. # Baseball data The data set `Hitters` data from the package `ISLR`. It contains data on salary (the outcome variable) and career performance statistics (the predictors or features). There are a few players without salary, and we first remove them from the data. ```{r} df <- hitters="" %="">% drop_na() ``` **A** In order to assess the performance of different estimation procedure for prediction, we first split the data into a **testing** and **training** data set. The test data set is sometimes called a "holdout", and it should only be used for model assessment. Using this data, we calculate a performance metric, such as the test MSPE. The first step is to randomly split the data into training and test data. Before we do this, it is important to set the random seed so that we can replicate our results later on. There are many ways to do this, but I like to use the package `tidymodels`'s function called `initial_split`. Thus, you have to install and load the package `tidymodels` to run this code chunk. 
Assignment 6: ML
1 Instructions
Go through the text and code, and replace “…” with the correct code. Make sure you set eval = T after completing a code chunk.
2 Baseball data
The data set Hitters comes from the package ISLR. It contains data on salary (the outcome variable) and career performance statistics (the predictors, or features). There are a few players without a salary, and we first remove them from the data.
df <- Hitters %>%
drop_na()
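As a quick sanity check (not part of the original solution), we can confirm how many players were dropped for a missing Salary; this assumes Hitters is available from the loaded ISLR package.

```r
# How many players have no recorded salary, and how many rows survive drop_na()
sum(is.na(Hitters$Salary))
nrow(Hitters) - nrow(df)   # rows removed (in Hitters, only Salary has NAs)
nrow(df)                   # complete cases used in the rest of the assignment
```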
A In order to assess the performance of different estimation procedures for prediction, we first split the data into a testing and a training data set. The test data set is sometimes called a “holdout,” and it should only be used for model assessment. Using this data, we calculate a performance metric, such as the test MSPE. The first step is to randomly split the data into training and test data. Before we do this, it is important to set the random seed so that we can replicate our results later on.
There are many ways to do this, but I like to use the tidymodels function initial_split. Thus, you have to install and load the package tidymodels to run this code chunk.
# set random seed for replication
set.seed(42101)
# split data into training/test
df_split <- initial_split(df, prop = 3/4)
# Create data
test <- testing(df_split)
train <- training(df_split)
So the steps are:
        set seed for replication,
        split the data randomly using initial_split. There are different options for how the split is done; for us, the only important one is prop, which allocates our data between the training and testing sets.
        use the commands training and testing on the split data to create the actual data frames.
Once this is done, we work with the training data to construct models. Then we assess the performance of the model on the testing set.
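A quick optional check (not in the original solution) confirms that the split allocated roughly three quarters of the complete cases to the training set:

```r
# Row counts after the split: train should hold about 3/4 of the observations
nrow(train)
nrow(test)
nrow(train) / nrow(df)   # should be close to 0.75
```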
B
cat(paste(
"Our training set has 3-quarters of the observations and",
dim(df)[2]-1,
"predictor variables."
))
Our training set has 3-quarters of the observations and 19 predictor variables.
But that is not the number of potential features; we could also consider interactions and polynomials of each variable. This can significantly increase the number of parameters to estimate.
Below, we estimate three linear regressions, each with a different predictor set. The first model, called Basic, uses only a player's experience (Years) and last-season and career home run and RBI statistics. The second model is a “full” model in the sense that it uses all of the variables. To estimate this, I use the R shortcut . to denote that I want R to create a formula with all the predictors. The last model, Extended, uses the full predictor set plus all interactions and squares of the variables. To do this, I use the shortcut .^2.
        What is the number of parameters estimated in each model? [There are two ways to do this. One obvious way is to use coef(model) %>% length(), which calculates the length of the estimated coefficient vector. The second is the rank computed by lm(), accessed via model$rank.] A short sketch follows the model-fitting code below.
        Calculate the training error
        Calculate the test error
Basic <- lm(Salary ~ Years + RBI + HmRun + CRBI + CHmRun, data = train)
Full <- lm(Salary ~ ., data = train)
Extended <- lm(Salary ~ .^2, data = train)
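As a sketch of one way to answer question 1, the number of estimated parameters in each model can be read off the rank stored by lm(); note that for the Extended model, length(coef()) can be slightly larger than the rank because aliased (collinear) terms are reported as NA coefficients.

```r
# Number of parameters actually estimated (intercept included) in each model
c(Basic = Basic$rank, Full = Full$rank, Extended = Extended$rank)

# For comparison: the full coefficient vector, which also counts NA (aliased) terms
length(coef(Extended))
```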
C
As we have discussed, we want to distinguish between in-sample and out-of-sample error. To estimate the three models above, we used the training data. Let's calculate the training error for each model.
Below, I show three ways to do this for the Basic model. The first is straightforward; we simply implement the MSE formula by taking the mean of the squared errors. I give a second alternative that uses piping because I think it's more readable.
In the last example, I create a function to do the work for us. It takes three arguments, the data frame, the outcome variable (as a string), and the model. I show how to use this to get the mse for the Basic model.
        Find the in-sample mse for all three models.
# training error
Basic.error <- mean((train$Salary - predict(Basic))^2)
# More readable
Basic.error <- (train$Salary - predict(Basic)) %>% .^2 %>% mean()
# Create your own function
mse <- function(.data, true, model) {
error <- (.data[[true]] - predict(model, newdata = .data))^2
mean(error)
}
cat(paste("The training error for Basic Model: ",
mse(train, "Salary", Basic),
"\n"))
The training error for Basic Model: 128132.666311328
cat(paste("The training error for Full Model: ",
(train$Salary - predict(Full)) %>% .^2 %>% mean(),
"\n"))
The training error for Full Model: 97154.1301798506
cat(paste("The training error for Extended Model: ",
(train$Salary - predict(Extended)) %>% .^2 %>% mean(),
"\n"))
The training error for Extended Model: 2587.25216012656
D
In order to visualize these outcomes, it's useful to construct a data frame of our model performance versus model complexity (the number of estimated parameters). I do this below, manually. The end result is a data frame with a column for the model, the mse, and the number of model parameters.
        Using the data set training.error, plot the mse vs k (the number of parameters) using geom_line(). Interpret your result. Which model has the best in-sample performance?
training.error <- tibble(
model = c('Basic', 'Full', 'Extended'),
mse = map_dbl(list(Basic, Full, Extended), ~mse(train, "Salary", .x)),
k = map_dbl(list(Basic, Full, Extended), ~.x$rank)
)
training.error
# A tibble: 3 x 3
  model          mse     k
  <chr>        <dbl> <dbl>
1 Basic    128132.7      6
2 Full      97154.13    20
3 Extended   2587.252  183
ggplot(training.error, aes(y = mse, x = k)) +
geom_line()
From the above plot of mse vs k (the number of parameters), we can observe that the mean squared error decreases as k increases; in-sample error can only fall as we add parameters. Therefore, the Extended model performs best in-sample.
E Machine learning methods are about finding models that are “generalizable”. This means that we want our model to have good out-of-sample performance. To gauge each of our models' out-of-sample performance, we can use our held-out test data.
        Repeat D for the data set test. Which model has the best out-of-sample performance?
test.error <- tibble(
model = c('Basic', 'Full', 'Extended'),
mse = map_dbl(list(Basic, Full, Extended), ~mse(test, "Salary", .x)),
k = map_dbl(list(Basic, Full, Extended), ~.x$rank)
)
test.error
# A tibble: 3 x 3
  model           mse     k
  <chr>         <dbl> <dbl>
1 Basic     107710.1      6
2 Full       89624.41    20
3 Extended 16617898.     183
ggplot(test.error, aes(y = mse, x = k)) +
geom_line()
From the above plot of mse vs k (the number of parameters), we can observe that as k increases, the mean squared error first decreases and then rises sharply, as the table also shows: the Extended model overfits the training data and generalizes poorly. Therefore, the Full model performs best out-of-sample.
2.1 Glmnet
Since the relationship between salary and the predictor variables is unknown, any regression model we run is ad hoc in the sense that we have no guidance as to which predictor variables to use, or how to use them (squares, interactions, etc.). There are methods that systematically run a number of models (stepwise or subset regression, for example) with the aim of finding the best model. However, these methods can be very time-consuming because even for a relatively small data set such as this one, there is a very large number of potential predictors.
We can use penalized regression (lasso, elastic net, or ridge) to constrain model complexity in the hope of improving out-of-sample performance. Let's start with a basic introduction to the command glmnet from the package of the same name. To demonstrate, we will use ridge regression.
glmnet doesn't have a formula interface like lm(). Instead, we have to manually supply an outcome vector \(Y\) and a matrix of features, \(\mathbf{X}\). Below, I demonstrate for the train data set.
### Train data
# extract Salary as a vector using pull()
Y_train <- train %>% pull(Salary)
# Extract X as matrix
X_train <- train %>%
select(-Salary) %>%
model.matrix(~ -1 + ., data = .)
### Test data
# extract Salary as a vector using pull()
Y_test <- test %>% pull(Salary)
# Extract X as matrix
X_test <- test %>%
select(-Salary) %>%
model.matrix(~ -1 + ., data = .)
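An optional dimension check (not part of the original solution) confirms that model.matrix() expanded the factor columns (League, Division, NewLeague) into dummy variables and that the outcome vectors line up with the feature matrices:

```r
# Feature matrices: rows = players, columns = numeric predictors plus factor dummies
dim(X_train)
dim(X_test)

# Outcome vectors should match the number of rows in each matrix
length(Y_train)
length(Y_test)
```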
To fit a glmnet model, we simply pass these objects to the glmnet command. There are, of course, a bunch of options. For us, the one option we want to specify is alpha, which determines the penalty. alpha = 0 gives a ridge penalty, and alpha = 1 gives a lasso penalty. Values 0 < alpha < 1 give elastic net penalties, which combine lasso and ridge.
# ridge
ridge <- glmnet(x = X_train, y = Y_train, alpha = 0)
The object we just created, called ridge, contains the results from our ridge regression. If you print it to the screen or try summary you will get unfamiliar output. The best way to inspect the object is to use the broom function tidy to create a data frame of the output. Let's do that:
ridge.data <- tidy(ridge)
The data frame ridge.data can now be printed to the screen or inspected in the viewer. Looking at it, you can see that it contains not just one set of regression results but many, indexed by step. This is part of the estimation procedure. We don't know what penalty to use beforehand, so glmnet computes results over a grid of penalties. The higher the penalty, the more “regularized” we expect our regression to be.
To see this, note that the penalty is on the sum of the squared coefficients. That is, the ridge penalty is \(\lambda \cdot \sum_{j=1}^{k}\beta_j^2\), so a higher \(\lambda\) means the regression is penalized more heavily for a larger \(\sum_{j=1}^{k}\beta_j^2\). Below, I use ridge.data, remove the intercept, and plot \(\sqrt{\sum_{j=1}^{k}\beta_j^2}\) against \(\log(\lambda)\). \(\sqrt{\sum_{j=1}^{k}\beta_j^2}\) is called the l-2 norm, and is interpreted as “distance from 0.” As \(\lambda \rightarrow \infty\), the coefficients shrink to zero.
# this code chunk is complete, set eval = T
ridge.data %>%
filter(term != "(Intercept)") %>%
group_by(step) %>%
summarize(sum.coef = sqrt(sum(estimate^2)) ,
lambda = mean(lambda)) %>%
ggplot(aes(y = sum.coef, x = log(lambda))) +
geom_line() +
labs(
y = "l-2 norm",
x = "log penalty",
title = "Ridge regression: sum of squared coefficients vs penalty"
)
As discussed, by default glmnet estimates a ridge regression over a grid of 100 potential penalty parameters. To visualize the “shrinkage” induced by the penalty term, we can plot the regression coefficients against lambda.
# this code chunk is complete, set eval = T
ridge.data %>%
filter(term != "(Intercept)") %>%
ggplot(aes(lambda, estimate, color = term)) +
geom_line() +
scale_x_log10() +
labs(
title = "Ridge coefficient path",
y = "betas",
x = "log lambda"
) +
theme(legend.position = 'bottom')
Lambda is an example of a tuning parameter. That is, it's a parameter that is determined separately from the model and has to be specified by the researcher. Obviously, we don't know a good value for \(\lambda\) a priori, but we can select a tuning parameter through a data-driven process known as cross-validation.
glmnet has a built-in function for this. By default, it does 10-fold cross-validation, but we can select how many folds we wish to use. Below, I estimate a series of ridge regressions for different values of lambda, using the cross-validation procedure we discussed in class (this is done automatically for us). Next, I plot the cross-validated out-of-sample mse for different values of lambda. The idea is that we want to choose the “best” lambda, the one that minimizes the cross-validated error.
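A minimal sketch of this cross-validation step, using cv.glmnet() with the ridge penalty (alpha = 0) and the default 10 folds, might look like the following; lambda.min is the penalty value with the lowest cross-validated MSE, and lambda.1se is the more conservative one-standard-error choice.

```r
# Cross-validated ridge regression over a grid of penalties (10 folds by default)
set.seed(42101)
cv.ridge <- cv.glmnet(x = X_train, y = Y_train, alpha = 0, nfolds = 10)

# Cross-validated MSE against log(lambda)
plot(cv.ridge)

# Penalty that minimizes the cross-validated error, and the 1-SE alternative
cv.ridge$lambda.min
cv.ridge$lambda.1se

# Out-of-sample performance of the selected model on the held-out test set
ridge.pred <- predict(cv.ridge, newx = X_test, s = "lambda.min")
mean((Y_test - ridge.pred)^2)
```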