--- title: "Assignment 6: ML" date: output: bookdown::html_document2: toc: true toc_float: true theme: flatly highlight: monochrome code_folding: hide markdown_extensions: XXXXXXXXXXadmonition ---...

1 answer below »
Make only HTML Document in R studio ( Econometrics course )



--- title: "Assignment 6: ML" date: output: bookdown::html_document2: toc: true toc_float: true theme: flatly highlight: monochrome code_folding: hide markdown_extensions: - admonition --- ```{r setup, include=FALSE} knitr::opts_chunk$set( echo = TRUE, message = FALSE, warning = FALSE, messages = FALSE ) # kable options(knitr.kable.NA = '') # Set the graphical theme ggplot2::theme_set(ggplot2::theme_light()) # digits options(pillar.sigfig = 7 ,digits=3) ## New packages library(tidymodels) library(ISLR) library(glmnet) library(broom) # load library(tidyverse) library(AER) library(kableExtra) library(gridExtra) library(haven) library(modelsummary) ``` # Instructions Go through text and code, and replace "..." with the correct code. Make sure you turn `eval = T` after completing a code chunk. # Baseball data The data set `Hitters` data from the package `ISLR`. It contains data on salary (the outcome variable) and career performance statistics (the predictors or features). There are a few players without salary, and we first remove them from the data. ```{r} df <- hitters="" %="">% drop_na() ``` **A** In order to assess the performance of different estimation procedure for prediction, we first split the data into a **testing** and **training** data set. The test data set is sometimes called a "holdout", and it should only be used for model assessment. Using this data, we calculate a performance metric, such as the test MSPE. The first step is to randomly split the data into training and test data. Before we do this, it is important to set the random seed so that we can replicate our results later on. There are many ways to do this, but I like to use the package `tidymodels`'s function called `initial_split`. Thus, you have to install and load the package `tidymodels` to run this code chunk. 
Assignment 6: ML
1 Instructions
Go through the text and code, and replace “…” with the correct code. Make sure you set eval = T after completing a code chunk.
2 Baseball data
The data set Hitters comes from the package ISLR. It contains data on salary (the outcome variable) and career performance statistics (the predictors, or features). There are a few players without a salary, and we first remove them from the data.
df <- Hitters %>%
drop_na()
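As a quick sanity check (not part of the original solution), we can confirm how many players were dropped for a missing Salary; this assumes Hitters is available from the loaded ISLR package.

```r
# How many players have no recorded salary, and how many rows survive drop_na()
sum(is.na(Hitters$Salary))
nrow(Hitters) - nrow(df)   # rows removed (in Hitters, only Salary has NAs)
nrow(df)                   # complete cases used in the rest of the assignment
```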
A In order to assess the performance of different estimation procedures for prediction, we first split the data into a testing and a training data set. The test data set is sometimes called a “holdout,” and it should only be used for model assessment. Using this data, we calculate a performance metric, such as the test MSPE. The first step is to randomly split the data into training and test data. Before we do this, it is important to set the random seed so that we can replicate our results later on.
There are many ways to do this, but I like to use the tidymodels function initial_split. Thus, you have to install and load the package tidymodels to run this code chunk.
# set random seed for replication
set.seed(42101)
# split data into training/test
df_split <- initial_split(df, prop = 3/4)
# Create data
test <- testing(df_split)
train <- training(df_split)
So the steps are:
        set seed for replication,
        split the data randomly using initial_split. There are different options for how the split is done; for us, the only important one is prop, which allocates our data between the training and testing sets.
        use the commands training and testing on the split data to create the actual data frames.
Once this is done, we work with the training data to construct models. Then we assess the performance of the model on the testing set.
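A quick optional check (not in the original solution) confirms that the split allocated roughly three quarters of the complete cases to the training set:

```r
# Row counts after the split: train should hold about 3/4 of the observations
nrow(train)
nrow(test)
nrow(train) / nrow(df)   # should be close to 0.75
```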
B
cat(paste(
"Our training set has 3-quarters of the observations and",
dim(df)[2]-1,
"predictor variables."
))
Our training set has 3-quarters of the observations and 19 predictor variables.
But that is not the number of potential features; we could also consider interactions and polynomials of each variable. This can significantly increase the number of parameters to estimate.
Below, we estimate three linear regressions, each with a different predictor set. The first model, called Basic, uses only a player's experience (Years) and last-season and career home run and RBI statistics. The second model is a “full” model in the sense that it uses all of the variables. To estimate this, I use the R shortcut . to denote that I want R to create a formula with all the predictors. The last model, Extended, uses the full predictor set plus all interactions and squares of the variables. To do this, I use the shortcut .^2.
        What is the number of parameters estimated in each model? [There are two ways to do this. One obvious way is to use coef(model) %>% length(), which calculates the length of the estimated coefficient vector. The second is the rank computed by lm(), accessed via model$rank.] A short sketch follows the model-fitting code below.
        Calculate the training error
        Calculate the test error
Basic <- lm(Salary ~ Years + RBI + HmRun + CRBI + CHmRun, data = train)
Full <- lm(Salary ~ ., data = train)
Extended <- lm(Salary ~ .^2, data = train)
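As a sketch of one way to answer question 1, the number of estimated parameters in each model can be read off the rank stored by lm(); note that for the Extended model, length(coef()) can be slightly larger than the rank because aliased (collinear) terms are reported as NA coefficients.

```r
# Number of parameters actually estimated (intercept included) in each model
c(Basic = Basic$rank, Full = Full$rank, Extended = Extended$rank)

# For comparison: the full coefficient vector, which also counts NA (aliased) terms
length(coef(Extended))
```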
C
As we have discussed, we want to distinguish between in-sample and out-of-sample error. To estimate the three models above, we used the training data. Let's calculate the training error for each model.
Below, I show three ways to do this for the Basic model. The first is straightforward; we simply implement the MSE formula by taking the mean of the squared errors. I give a second alternative that uses piping because I think it's more readable.
In the last example, I create a function to do the work for us. It takes three arguments, the data frame, the outcome variable (as a string), and the model. I show how to use this to get the mse for the Basic model.
        Find the in-sample mse for all three models.
# training error
Basic.error <- mean((train$Salary - predict(Basic))^2)
# More readable
Basic.error <- (train$Salary - predict(Basic)) %>% .^2 %>% mean()
# Create your own function
mse <- function(.data, true, model) {
error <- (.data[[true]] - predict(model, newdata = .data))^2
mean(error)
}
cat(paste("The training error for Basic Model: ",
mse(train, "Salary", Basic),
"\n"))
The training error for Basic Model: 128132.666311328
cat(paste("The training error for Full Model: ",
(train$Salary - predict(Full)) %>% .^2 %>% mean(),
"\n"))
The training error for Full Model: 97154.1301798506
cat(paste("The training error for Extended Model: ",
(train$Salary - predict(Extended)) %>% .^2 %>% mean(),
"\n"))
The training error for Extended Model: 2587.25216012656
D
In order to visualize these outcomes, it's useful to construct a data frame of our model performance versus model complexity (the number of estimated parameters). I do this below, manually. The end result is a data frame with a column for the model, the mse, and the number of model parameters.
        Using the data set training.error, plot the mse vs k (the number of parameters) using geom_line(). Interpret your result. Which model has the best in-sample performance?
training.error <- tibble(
model = c('Basic', 'Full', 'Extended'),
mse = map_dbl(list(Basic, Full, Extended), ~mse(train, "Salary", .x)),
k = map_dbl(list(Basic, Full, Extended), ~.x$rank)
)
training.error
# A tibble: 3 x 3
  model          mse     k
  <chr>        <dbl> <dbl>
1 Basic    128132.7      6
2 Full      97154.13    20
3 Extended   2587.252  183
ggplot(training.error, aes(y = mse, x = k)) +
geom_line()
From the above plot of mse vs k (the number of parameters), we can observe that the mean squared error decreases as k increases; in-sample error can only fall as we add parameters. Therefore, the Extended model performs best in-sample.
E Machine learning methods are about finding models that are “generalizable”. This means that we want our model to have good out-of-sample performance. To gauge each of our models' out-of-sample performance, we can use our held-out test data.
        Repeat D for the data set test. Which model has the best out-of-sample performance?
test.error <- tibble(
model = c('Basic', 'Full', 'Extended'),
mse = map_dbl(list(Basic, Full, Extended), ~mse(test, "Salary", .x)),
k = map_dbl(list(Basic, Full, Extended), ~.x$rank)
)
test.error
# A tibble: 3 x 3
  model           mse     k
  <chr>         <dbl> <dbl>
1 Basic     107710.1      6
2 Full       89624.41    20
3 Extended 16617898.     183
ggplot(test.error, aes(y = mse, x = k)) +
geom_line()
From the above plot of mse vs k (the number of parameters), we can observe that as k increases, the mean squared error first decreases and then rises sharply, as the table also shows: the Extended model overfits the training data and generalizes poorly. Therefore, the Full model performs best out-of-sample.
2.1 Glmnet
Since the relationship between salary and the predictor variables is unknown, any regression model we run is ad hoc in the sense that we have no guidance as to which predictor variables to use, or how to use them (squares, interactions, etc.). There are methods that systematically run a number of models (stepwise or subset regression, for example) with the aim of finding the best model. However, these methods can be very time-consuming because even for a relatively small data set such as this one, there is a very large number of potential predictors.
We can use penalized regression (lasso, elastic net, or ridge) to constrain model complexity in the hope of improving out-of-sample performance. Let's start with a basic introduction to the command glmnet from the package of the same name. To demonstrate, we will use ridge regression.
glmnet doesn't have a formula interface like lm(). Instead, we have to manually supply an outcome vector \(Y\) and a matrix of features, \(\mathbf{X}\). Below, I demonstrate for the train data set.
### Train data
# extract Salary as a vector using pull()
Y_train <- train %>% pull(Salary)
# Extract X as matrix
X_train <- train %>%
select(-Salary) %>%
model.matrix(~ -1 + ., data = .)
### Test data
# extract Salary as a vector using pull()
Y_test <- test %>% pull(Salary)
# Extract X as matrix
X_test <- test %>%
select(-Salary) %>%
model.matrix(~ -1 + ., data = .)
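An optional dimension check (not part of the original solution) confirms that model.matrix() expanded the factor columns (League, Division, NewLeague) into dummy variables and that the outcome vectors line up with the feature matrices:

```r
# Feature matrices: rows = players, columns = numeric predictors plus factor dummies
dim(X_train)
dim(X_test)

# Outcome vectors should match the number of rows in each matrix
length(Y_train)
length(Y_test)
```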
To fit a glmnet model, we simply pass these objects to the glmnet command. There are, of course, a bunch of options. For us, the one option we want to specify is alpha, which determines the penalty. alpha = 0 gives a ridge penalty, and alpha = 1 gives a lasso penalty. Values 0 < alpha < 1 give elastic net penalties, which combine lasso and ridge.
# ridge
ridge <- glmnet(x = X_train, y = Y_train, alpha = 0)
The object we just created, called ridge, contains the results from our ridge regression. If you print it to the screen or try summary you will get unfamiliar output. The best way to inspect the object is to use the broom function tidy to create a data frame of the output. Let's do that:
ridge.data <- tidy(ridge)
The data frame ridge.data can now be printed to the screen or inspected in the viewer. Looking at it, you can see that it contains not just one set of regression results but many, indexed by step. This is part of the estimation procedure. We don't know what penalty to use beforehand, so glmnet computes results over a grid of penalties. The higher the penalty, the more “regularized” we expect our regression to be.
To see this, note that the penalty is on the sum of the squared coefficients. That is, the ridge penalty is \(\lambda \cdot \sum_{j=1}^{k}\beta_j^2\), so a higher \(\lambda\) means the regression is penalized more heavily for a larger \(\sum_{j=1}^{k}\beta_j^2\). Below, I use ridge.data, remove the intercept, and plot \(\sqrt{\sum_{j=1}^{k}\beta_j^2}\) against \(\log(\lambda)\). \(\sqrt{\sum_{j=1}^{k}\beta_j^2}\) is called the l-2 norm, and is interpreted as “distance from 0.” As \(\lambda \rightarrow \infty\), the coefficients shrink to zero.
# this code chunk is complete, set eval = T
ridge.data %>%
filter(term != "(Intercept)") %>%
group_by(step) %>%
summarize(sum.coef = sqrt(sum(estimate^2)) ,
lambda = mean(lambda)) %>%
ggplot(aes(y = sum.coef, x = log(lambda))) +
geom_line() +
labs(
y = "l-2 norm",
x = "log penalty",
title = "Ridge regression: sum of squared coefficients vs penalty"
)
As discussed, by default glmnet estimates a ridge regression over a grid of 100 potential penalty parameters. To visualize the “shrinkage” induced by the penalty term, we can plot the regression coefficients against lambda.
# this code chunk is complete, set eval = T
ridge.data %>%
filter(term != "(Intercept)") %>%
ggplot(aes(lambda, estimate, color = term)) +
geom_line() +
scale_x_log10() +
labs(
title = "Ridge coefficient path",
y = "betas",
x = "log lambda"
) +
theme(legend.position = 'bottom')
Lambda is an example of a tuning parameter. That is, it's a parameter that is determined separately from the model and has to be specified by the researcher. Obviously, we don't know a good value for \(\lambda\) a priori, but we can select a tuning parameter through a data-driven process known as cross-validation.
glmnet has a built-in function for this. By default, it does 10-fold cross-validation, but we can select how many folds we wish to use. Below, I estimate a series of ridge regressions for different values of lambda, using the cross-validation procedure we discussed in class (this is done automatically for us). Next, I plot the cross-validated out-of-sample mse for different values of lambda. The idea is that we want to choose the “best” lambda, the one that minimizes the cross-validated error.
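A minimal sketch of this cross-validation step, using cv.glmnet() with the ridge penalty (alpha = 0) and the default 10 folds, might look like the following; lambda.min is the penalty value with the lowest cross-validated MSE, and lambda.1se is the more conservative one-standard-error choice.

```r
# Cross-validated ridge regression over a grid of penalties (10 folds by default)
set.seed(42101)
cv.ridge <- cv.glmnet(x = X_train, y = Y_train, alpha = 0, nfolds = 10)

# Cross-validated MSE against log(lambda)
plot(cv.ridge)

# Penalty that minimizes the cross-validated error, and the 1-SE alternative
cv.ridge$lambda.min
cv.ridge$lambda.1se

# Out-of-sample performance of the selected model on the held-out test set
ridge.pred <- predict(cv.ridge, newx = X_test, s = "lambda.min")
mean((Y_test - ridge.pred)^2)
```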