---
title: 'Lab 8a: CART'
author: (you)
date: April 8, 2019
output: html_document
---

# Introduction

Decision Trees use a simple system of clean, hierarchical rules that can help
predict the value of a numeric or categorical variable. The idea is very old
(Linnaeus made one in 1735 and Darwin used one in 1859), but modern computers
make it easy to generate them.

One common algorithm for making decision trees is called _CART (Classification
and Regression Trees)_. In R, we get CART from the `rpart` package. It uses a
simple algorithm to build trees very quickly, so very large problems can be
solved. Once we've learned CART models, we can move on to more complicated
variations, such as BART, Bagging and Boosting, and Random Forests.

# The Data Set

For this lab, we'll keep looking at the red wine data set from the last
homework. As a reminder, it comes from the UC-Irvine Machine Learning
Repository. It's a data set that relates several types of measurements of red
wine to a (human-created) rating of the wine's quality (a whole number between
3 and 8). You can copy it from the last homework, or download it from here:

[http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv](http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv)

This first bit of code loads our libraries, loads the data set and reminds you
of the variable names. Note that the `read.csv` command can download data sets
from the internet as well as load them locally.
```{r}
# Note the set.seed command to ensure repeatability on our randomness.
set.seed(4239857)
library(tidyverse)
library(cvTools)
# The install.packages command is only needed once. Copy it to the console,
# and run it if you need to install the rpart package.
# install.packages("rpart")
library(rpart)

# Load the file straight from the web site:
# wine <- read.csv("http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv", sep=";")

# Code to load the file locally instead. You don't need both.
wine <- read.csv("winequality-red.csv", sep=";")

names(wine)
```

# The `rpart` Library and Command

The `rpart` library gives us the `rpart` command, which is used to create CART
trees. (The name is short for _recursive partitioning_.) The basic format for
`rpart` is similar to that of the `lm` command. Note that interactions aren't
allowed in the formula for `rpart`.

`rpart(response ~ exp1 + exp2 + ..., data = data_file)`

Here's the command used to predict the quality of wine using all the
variables:

```{r}
fit.rpart <- rpart(quality ~ ., data = wine)
```

After you make the tree, you will likely want to get information from it. The
next few sections show some things you can do.

## Plotting the Tree

It's relatively easy to plot the tree created by `rpart`. The base-R `plot`
command draws the tree itself, but if you want to label it, you also need to
use the `text` command. These are general-purpose commands that can do many
things, but they have specific methods for your fitted trees. Compare the two
graphs:

```{r}
plot(fit.rpart)
```

```{r}
plot(fit.rpart)
text(fit.rpart)
```

If your plot or text doesn't look good, try these things:

- Replace your variable names with shorter names so that they can fit on the
  plot more easily. I wouldn't do this permanently, but perhaps just to make
  the graph.
- Check out the options for plot and text by using the `?plot.rpart` and
  `?text.rpart` commands. Options can adjust the shape of the tree or the
  placement of the text.

### Question 1

> Take a look at the tree diagram above. Trees are often human-understandable.
> What can you say about the apparent relationship between `alcohol` and
> quality rating? How about `volatile.acidity` and overall quality rating?
>
> **Answer Below:**

## Text Output

If you just type the name of your rpart fit, you'll get a printout in text
form (a schematic example follows the list below) that tells you the
following:

- Numbering and indentation represent the nesting of branches in the tree.
- Branches marked with asterisks are final branches at the bottom of the tree.
- For each branch, you'll get four pieces of information:
    - The logical expression that defines the branch.
    - The number of data points that go into that branch.
    - The _deviance_ (the sum of squared errors, without a square root) for
      data points in that branch.
    - The predicted value of y for that branch.
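To make that layout concrete, here's a schematic of the printout's shape. The
two header lines are what `rpart` actually prints; the node lines and all the
numbers below are made-up placeholders, not output from this data set, so run
`fit.rpart` yourself to see the real values.

```
node), split, n, deviance, yval
      * denotes terminal node

1) root 1599 1050.00 5.64
  2) alcohol< 10.5 965 480.00 5.38
    4) volatile.acidity>=0.25 850 400.00 5.30 *
    ...
```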
You probably won't have to use this output very much, but here it is:

```{r}
fit.rpart
```

### Question 2

> Two branches at the second level of the plot should be `sulphates < 0.575`
> and `sulphates < 0.645`. We know that the CART method only makes one
> division at a time. Based on the text output, which of the two splits was
> made first by the algorithm?
>
> **Answer Below:**

### Question 3

> If the CART algorithm had stopped sooner, you might have had a _single_ end
> node for all data points that had `alcohol >= 10.53`, `sulphates < 0.645`
> and `volatile.acidity < 1.015`. Based on the text output, what would have
> been the predicted quality for such wines?
>
> **Answer Below:**

## Cross-Validated RMSE

The `cvFit` command from the `cvTools` package works with our new CART trees.
The commands below calculate the CV-RMSE for our CART tree and for a linear
model that uses all the (other) variables to predict quality. We've seen the
definitions for these options in a previous lab; execute `?cvFit` in the
console if you'd like a reminder.

```{r}
cvFit(fit.rpart, y = wine$quality, data = wine, K = 10, R = 10)
fit.lm <- lm(quality ~ ., data = wine)
cvFit(fit.lm, y = wine$quality, data = wine, K = 10, R = 10)
```

### Question 4

> Which of the two models (CART or linear model) will do better at predicting
> quality for new bottles of wine, according to the CV-RMSE? Does it appear
> there is a small, medium or large difference between the accuracy of the
> two models?
>
> **Answer Below:**

# Complexity in CART

As we talked about in class, predictive models that try to account for the
problem of overfitting often include some sort of "penalty" term when finding
the best-fitting model. More complex models are penalized, in the hope that
simpler models are less likely to overfit the data. There's always a
trade-off:

- Larger penalties mean you'll get a simpler model, sometimes at the cost of
  accuracy.
- Smaller penalties mean you'll get a more complex model, sometimes at the
  cost of overfitting.

You can think of the model parameters that adjust penalties as knobs you can
turn to change how the model works. The goal is to find the setting that gives
the lowest cross-validated RMSE (and thus the best chance of an accurate
prediction on new data).

Our `rpart` command has an optional complexity parameter (creatively called
`cp`) that you can adjust. Here's what it looks like in action:

```{r}
fit.rpart <- rpart(quality ~ ., data = wine, cp = 0.01)
```

### Question 5

> Try various values of `cp` to find a value that seems to minimize
> cross-validated RMSE. Give all your models different names, and keep all
> your output for the various `cp` values so that I can see what you've tried.
> Below all your code, write a sentence stating the best `cp` value you found
> and the cross-validated RMSE for that value of `cp`. [Hints: Try `cp` values
> between 0.001 and 0.02. If the commands are taking too long, use `R=1`
> initially, then increase the repetitions only for your final runs.]
>
> **Answer Below:**

```{r}

```
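If you're unsure how to organize the search, here's one minimal sketch of the
pattern, not a required approach. The `cp` grid below is illustrative, and I'm
assuming `$cv` extracts the estimated CV error from the object that `cvFit`
returns; check `?cvFit` if your version behaves differently.

```{r}
# Sketch: evaluate CV-RMSE over a small grid of candidate cp values.
# R = 1 keeps the exploration fast; raise R for the final runs.
cp.grid <- c(0.001, 0.002, 0.005, 0.01, 0.02)  # illustrative values
cv.rmse <- sapply(cp.grid, function(cp.val) {
  fit <- rpart(quality ~ ., data = wine, cp = cp.val)
  cvFit(fit, y = wine$quality, data = wine, K = 10, R = 1)$cv
})
data.frame(cp = cp.grid, cv.rmse = cv.rmse)
```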
### Question 6

> Looking back at the CV-RMSE for the linear model we tried and your best
> CV-RMSE from Question 5, was CART able to out-perform the linear model?

---
title: 'Lab 8: CART, Part 2'
author: (you)
date: April 8, 2019
output: html_document
---

# Introduction

This is a continuation of the first part of the last lab. To do this lab,
you'll need the `cp` value you found at the end of the last lab to minimize
CV-RMSE. One more package is used, `rpart.plot`, which is specifically meant
to make much prettier trees.

First we'll reload the libraries and data. Add or remove hashtags as you need
to.

```{r}
# Note the set.seed command to ensure repeatability on our randomness.
set.seed(4239857)
library(tidyverse)
library(cvTools)
# The install.packages commands are only needed once. Copy them to the
# console, and run them if you need to install the rpart and rpart.plot
# packages.
# install.packages("rpart")
# install.packages("rpart.plot")
library(rpart)
library(rpart.plot)

# Load the file straight from the web site:
wine <- read.csv("http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv", sep=";")

# Code to load the file locally instead. You don't need both.
# wine <- read.csv("winequality-red.csv", sep=";")

names(wine)
```

Next, let's duplicate where you ended up at the end of the last lab. We've got
a linear model `fit.lm` and a CART model `fit.rpart` using the default value
of `cp`. Then, let's create a second CART model using your best `cp` from last
time. **Replace `cp = 0.003` in the command below with your `cp` value.**

```{r}
fit.lm <- lm(quality ~ ., data = wine)
fit.rpart <- rpart(quality ~ ., data = wine)
fit.rpart.best <- rpart(quality ~ ., data = wine, cp = 0.003)

# Look at the default CART fit vs. your best fit.
cvFit(fit.rpart, y = wine$quality, data = wine, K = 10, R = 10)
cvFit(fit.rpart.best, y = wine$quality, data = wine, K = 10, R = 10)
```

### Question 1

> In the code block below, write commands to plot both the default CART model
> and your best CART model. Then briefly describe in words how the two trees
> differ.

```{r}
rpart.plot(fit.rpart)
rpart.plot(fit.rpart.best)
```

# Categorical Prediction (Classification Models)

In this wine data, we've made a questionable assumption. We've treated quality
as a numeric variable, when in reality it only takes on a few discrete values
(3, 4, 5, 6, 7, 8). It _might_ be better to treat it as a categorical variable
for prediction.

**Note: The fact that I'm using the same data set to demonstrate both numeric
and categorical prediction is possible because of the specific nature of this
data set. Normally, only one or the other would be appropriate.**

The code below creates a new data set where quality is set to be a categorical
(factor) variable. Take a look and make sure it succeeded.

```{r}
wine.cat <- wine %>% mutate(quality = as.factor(quality))
head(wine.cat)
```

Now let's make a _classification tree_ for quality treated as a categorical
variable. The syntax isn't any different.

```{r}
fit.rpart.cat <- rpart(quality ~ ., data = wine.cat)
plot(fit.rpart.cat)
text(fit.rpart.cat)
```

### Question 2

> Compare the tree you