Assignment: Decision Trees Learning outcomes · Understand how to use decision trees on a Dataset to make a prediction · Learning hyper-parameters tuning for decision trees by using RandomGrid ·...

Hello, I have another ML Python assignment that I will not have time to fully complete. I have attached the Word doc with all of the questions that need to be answered, along with the assignment file that it needs to be completed in (assignment.ipynb). The data can be downloaded from https://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients#
Regards, Ed


Assignment: Decision Trees

Learning outcomes
· Understand how to use decision trees on a dataset to make a prediction
· Learn hyper-parameter tuning for decision trees by using RandomGrid
· Learn the effectiveness of ensemble algorithms (Random Forest, AdaBoost, Extra Trees classifier, Gradient Boosted Trees)

In the first part of this assignment, you will use Classification Trees for predicting if a user has a default payment option active or not. You can find the necessary data for performing this assignment here.

This dataset is aimed at the case of customer default payments in Taiwan. From the perspective of risk management, the result of predictive accuracy of the estimated probability of default will be more valuable than the binary result of classification - credible or not credible clients. Because the real probability of default is unknown, this study presented the novel Sorting Smoothing Method to estimate the real probability of default.

Required imports for this project are given below. Make sure you have all libraries required for this project installed. You may use conda or pip based on your setup.

NOTE: Since the data is in Excel format, you need to install xlrd in order to read the Excel file into your pandas dataframe. You can run pip install xlrd to install it.

Questions (15 points total)

Question 1 (2 pts)
Build a classifier by using a decision tree and calculate the confusion matrix. Try different hyper-parameters (at least two) and discuss the result.

Question 2 (4 pts)
Try to build the decision tree which you built for the previous question, but this time by RandomGrid search over hyper-parameters. Compare the results.

Question 3 (6 pts)
Try to build the same classifier by using the following ensemble models. For each of these models calculate accuracy, and for at least two in the list below, plot the learning curves.
· Random Forest
· AdaBoost
· Extra Trees Classifier
· Gradient Boosted Trees

Question 4 (3 pts)
Discuss and compare the results for all of the past three questions.
· How does changing hyperparameters affect model performance?
· Why do you think certain models performed better/worse?
· How does this performance line up with known strengths/weaknesses of these models?
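Question 2 asks for a randomized search over hyper-parameters. A minimal sketch of what that could look like with scikit-learn's `RandomizedSearchCV` is given here. It runs on synthetic stand-in data from `make_classification` (an assumption, so that the example is self-contained); the sample size, the sampling ranges, and `n_iter=10` are illustrative choices, not values taken from the assignment.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption): 23 features and an imbalanced binary target.
X, y = make_classification(
    n_samples=1000, n_features=23, n_informative=6,
    weights=[0.78, 0.22], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Distributions to sample from; randint(2, 9) draws integers 2..8 inclusive.
param_distributions = {
    "criterion": ["gini", "entropy"],
    "max_depth": randint(2, 9),
    "min_samples_leaf": randint(2, 9),
}

# Sample 10 random combinations and score each with 3-fold cross-validation.
search = RandomizedSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_distributions=param_distributions,
    n_iter=10, cv=3, scoring="accuracy", random_state=0,
)
search.fit(X_train, y_train)

# best_estimator_ is already refit on the full training split (refit=True by default).
best_tree = search.best_estimator_
cm = confusion_matrix(y_test, best_tree.predict(X_test))
print(search.best_params_)
print(cm)
```

Unlike an exhaustive grid, the randomized search evaluates only `n_iter` sampled combinations, so its cost stays fixed as more hyper-parameters or wider ranges are added.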
Answered Same Day Jul 17, 2021


Suraj answered on Jul 19 2021
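As a reading aid for the confusion matrices that the notebook prints for Questions 1 and 2, the sketch below turns them into accuracy and recall on the default class. The four matrices are copied from the notebook output; the labels and the `summarize` helper are illustrative names, and the layout assumed is scikit-learn's convention (rows are true classes, columns are predicted classes).

```python
import numpy as np

# Confusion matrices copied from the notebook output.
# Rows: true class (0 = no default, 1 = default); columns: predicted class.
matrices = {
    "default tree": np.array([[14306, 3261], [2883, 2050]]),
    "entropy, max_depth=2": np.array([[16868, 699], [3316, 1617]]),
    "gini, max_depth=4": np.array([[16669, 898], [3122, 1811]]),
    "grid-search tree": np.array([[16654, 913], [3112, 1821]]),
}


def summarize(cm):
    """Return (accuracy, recall on class 1) for a 2x2 confusion matrix."""
    tn, fp, fn, tp = cm.ravel()
    accuracy = (tn + tp) / cm.sum()
    recall = tp / (tp + fn)
    return accuracy, recall


summary = {name: summarize(cm) for name, cm in matrices.items()}
for name, (acc, rec) in summary.items():
    print(f"{name}: accuracy={acc:.3f}, recall(default)={rec:.3f}")
```

Each matrix sums to 22,500 test rows, of which 17,567 are non-defaults, so always predicting "no default" would already score about 0.781 accuracy. The unconstrained tree lands below that (about 0.727) while catching the most defaults (recall about 0.416); the depth-limited trees all reach about 0.821 accuracy with recall between roughly 0.33 and 0.37.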
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "9OBvBOCkPrga"
},
"source": [
"## Assignment 4"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "bEmSTWZSPrgb"
},
"source": [
"This assignment is based on content discussed in module 8 and uses Decision Trees and Ensemble Models in classification and regression problems."
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "1cUoTzQLPrgc"
},
"source": [
"## Learning outcomes "
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "Q1ygYVo_Prgc"
},
"source": [
"- Understand how to use decision trees on a dataset to make a prediction\n",
"- Learn hyper-parameter tuning for decision trees by using RandomGrid\n",
"- Learn the effectiveness of ensemble algorithms (Random Forest, AdaBoost, Extra Trees classifier, Gradient Boosted Trees)"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "9hjVbQlVPrgd"
},
"source": [
"In the first part of this assignment, you will use Classification Trees for predicting if a user has a default payment option active or not. You can find the necessary data for performing this assignment [here](https://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients) \n",
"\n",
"This dataset is aimed at the case of customer default payments in Taiwan. From the perspective of risk management, the result of predictive accuracy of the estimated probability of default will be more valuable than the binary result of classification - credible or not credible clients. Because the real probability of default is unknown, this study presented the novel Sorting Smoothing Method to estimate the real probability of default.\n",
"\n",
"Required imports for this project are given below. Make sure you have all libraries required for this project installed. You may use conda or pip based on your setup.\n",
"\n",
"__NOTE:__ Since the data is in Excel format, you need to install `xlrd` in order to read the Excel file into your pandas dataframe. You can run `pip install xlrd` to install it."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "R376ZBnBPrge"
},
"outputs": [],
"source": [
"#required imports\n",
"import numpy as np\n",
"import pandas as pd"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "ddF9R5pdPrgi"
},
"source": [
"After installing the necessary libraries, proceed to download the data. Since reading the excel file won't create headers by default, we added two more operations to substitute the columns."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "CtNCjjr7Prgj"
},
"outputs": [],
"source": [
"#loading the data\n",
"dataset = pd.read_excel(\"https://archive.ics.uci.edu/ml/machine-learning-databases/00350/default%20of%20credit%20card%20clients.xls\")\n",
"# drop the unnamed leading column that read_excel picks up from the sheet\n",
"dataset.drop(dataset.columns[dataset.columns.str.contains('unnamed',case = False)],axis = 1, inplace = True)\n",
"# drop row 0, which holds descriptive column names rather than data\n",
"# (drop with inplace=True returns None, so there is nothing to print)\n",
"dataset.drop(0,inplace=True)"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "cMh-sEIdPrgl"
},
"source": [
"In the following, you can take a look into the dataset."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "E0lAPOXQPrgl",
"outputId": "ea66ba57-f32c-4b39-c60a-e52402acbca1"
},
"outputs": [
{
"data": {
"text/plain": [
" X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 ... X15 X16 X17 \\\n",
"1 20000 2 2 1 24 2 2 -1 -1 -2 ... 0 0 0 \n",
"2 120000 2 2 2 26 -1 2 0 0 0 ... 3272 3455 3261 \n",
"3 90000 2 2 2 34 0 0 0 0 0 ... 14331 14948 15549 \n",
"4 50000 2 2 1 37 0 0 0 0 0 ... 28314 28959 29547 \n",
"5 50000 1 2 1 57 -1 0 -1 0 0 ... 20940 19146 19131 \n",
"6 50000 1 1 2 37 0 0 0 0 0 ... 19394 19619 20024 \n",
"7 500000 1 1 2 29 0 0 0 0 0 ... 542653 483003 473944 \n",
"8 100000 2 2 2 23 0 -1 -1 0 0 ... 221 -159 567 \n",
"9 140000 2 3 1 28 0 0 2 0 0 ... 12211 11793 3719 \n",
"10 20000 1 3 2 35 -2 -2 -2 -2 -1 ... 0 13007 13912 \n",
"\n",
" X18 X19 X20 X21 X22 X23 Y \n",
"1 0 689 0 0 0 0 1 \n",
"2 0 1000 1000 1000 0 2000 1 \n",
"3 1518 1500 1000 1000 1000 5000 0 \n",
"4 2000 2019 1200 1100 1069 1000 0 \n",
"5 2000 36681 10000 9000 689 679 0 \n",
"6 2500 1815 657 1000 1000 800 0 \n",
"7 55000 40000 38000 20239 13750 13770 0 \n",
"8 380 601 0 581 1687 1542 0 \n",
"9 3329 0 432 1000 1000 1000 0 \n",
"10 0 0 0 13007 1122 0 0 \n",
"\n",
"[10 rows x 24 columns]"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dataset.head(10)"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "r4jchSRoPrgr"
},
"source": [
"## Questions (15 points total)\n",
"\n",
"#### Question 1 (2 pts)\n",
"Build a classifier by using decision tree and calculate the confusion matrix. Try different hyper-parameters (at least two) and discuss the result."
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "1Qr1SPGlPrgr"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<class 'pandas.core.frame.DataFrame'>\n",
"Int64Index: 30000 entries, 1 to 30000\n",
"Data columns (total 24 columns):\n",
"X1 30000 non-null object\n",
"X2 30000 non-null object\n",
"X3 30000 non-null object\n",
"X4 30000 non-null object\n",
"X5 30000 non-null object\n",
"X6 30000 non-null object\n",
"X7 30000 non-null object\n",
"X8 30000 non-null object\n",
"X9 30000 non-null object\n",
"X10 30000 non-null object\n",
"X11 30000 non-null object\n",
"X12 30000 non-null object\n",
"X13 30000 non-null object\n",
"X14 30000 non-null object\n",
"X15 30000 non-null object\n",
"X16 30000 non-null object\n",
"X17 30000 non-null object\n",
"X18 30000 non-null object\n",
"X19 30000 non-null object\n",
"X20 30000 non-null object\n",
"X21 30000 non-null object\n",
"X22 30000 non-null object\n",
"X23 30000 non-null object\n",
"Y 30000 non-null object\n",
"dtypes: object(24)\n",
"memory usage: 5.7+ MB\n",
"[[14306 3261]\n",
" [ 2883 2050]]\n",
"<class 'numpy.ndarray'>\n",
"<class 'numpy.ndarray'>\n",
"[[16868 699]\n",
" [ 3316 1617]]\n",
"[[16669 898]\n",
" [ 3122 1811]]\n"
]
}
],
"source": [
"# YOUR CODE HERE\n",
"import matplotlib.pyplot as plt\n",
"from sklearn.tree import DecisionTreeClassifier\n",
"from sklearn.model_selection import train_test_split\n",
"from sklearn.metrics import accuracy_score,confusion_matrix\n",
"dataset.info()\n",
"dataset.describe()\n",
"# split the columns into features (X1..X23) and the target (Y)\n",
"ind=dataset.iloc[:,0:23].values\n",
"dep=dataset.iloc[:,23:24].values\n",
"dep=dep.astype('int')\n",
"# split into train and test sets\n",
"# note: test_size=0.75 trains on 25% of the rows and tests on the other 75% (22,500 rows)\n",
"x_train,x_test,y_train,y_test=train_test_split(ind,dep,test_size=0.75,random_state=0)\n",
"# model 1: decision tree with default hyper-parameters\n",
"tree=DecisionTreeClassifier()\n",
"tree.fit(x_train,y_train)\n",
"pred=tree.predict(x_test)\n",
"print(confusion_matrix(y_test,pred))\n",
"# model 2: entropy criterion with a shallow tree (max_depth=2)\n",
"tree=DecisionTreeClassifier(criterion=\"entropy\",max_depth=2,min_samples_leaf=1,min_samples_split=2)\n",
"tree.fit(x_train,y_train)\n",
"pred=tree.predict(x_test)\n",
"print(type(pred))\n",
"print(type(y_test))\n",
"print(confusion_matrix(y_test,pred))\n",
"# model 3: gini criterion with max_depth=4 and larger leaf/split minimums\n",
"tree=DecisionTreeClassifier(criterion=\"gini\",max_depth=4,min_samples_leaf=2,min_samples_split=3)\n",
"tree.fit(x_train,y_train)\n",
"pred=tree.predict(x_test)\n",
"print(confusion_matrix(y_test,pred))"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "QwcecRukPrgw"
},
"source": [
"#### Question 2 (4 pts)\n",
"\n",
"Try to build the decision tree which you built for the previous question, but this time by RandomGrid search over hyper-parameters. Compare the results."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "4XHRmsWOPrgx"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<class 'numpy.ndarray'>\n",
"<class 'numpy.ndarray'>\n",
"[[16654 913]\n",
" [ 3112 1821]]\n"
]
}
],
"source": [
"# YOUR CODE HERE\n",
"from sklearn.model_selection import GridSearchCV\n",
"# candidate values for each hyper-parameter\n",
"parameters = {'criterion':('gini','entropy'),'max_depth':(2,3,4,5,6,7,8),'min_samples_leaf':(2,3,4,5,6,7,8)}\n",
"# note: GridSearchCV tries every combination exhaustively (3-fold CV);\n",
"# RandomizedSearchCV is the randomized counterpart\n",
"grid=GridSearchCV(DecisionTreeClassifier(),param_grid=parameters,cv=3)\n",
"grid_model=grid.fit(x_train,y_train)\n",
"grid_model.best_estimator_\n",
"# refit with the hyper-parameters copied from the best_estimator_ printout;\n",
"# the remaining arguments were defaults (min_impurity_split and presort are\n",
"# omitted because recent scikit-learn releases no longer accept them)\n",
"tree=DecisionTreeClassifier(criterion='gini', max_depth=3, min_samples_leaf=4)\n",
"tree.fit(x_train,y_train)\n",
"pred=tree.predict(x_test)\n",
"print(type(pred))\n",
"print(type(y_test))\n",
"print(confusion_matrix(y_test,pred))"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "dEvsYwiXPrg3"
},
"source": [
"#### Question 3 (6 pts)\n",
"\n",
"Try to build the same classifier...