INSTRUCTIONS · Use of the following CLASSIFICATION Machine Learning Models · Decision Trees · Random Forest · Gradient Boosting Trees · First, given dataset “spx_tail”, the “tail events” as defined...

See attached Word doc. Need to have confirmation that the problem is well understood, as well as the methods to be used (decision trees, random forest, gradient boosting trees) and the validation methods (one-split and k-fold cross-validation, time series split).
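The three validation methods named in the question can be sketched side by side. This is a minimal illustration on synthetic data, not the assignment's solution: the random features and labels stand in for `spx_tail`, and the choice of `shuffle=False` for the single split is an assumption made so that the hold-out set is the most recent block of rows.

```python
import numpy as np
from sklearn.model_selection import KFold, TimeSeriesSplit, cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for spx_tail (assumption): 500 time-ordered rows,
# 10 lag-like features, roughly 10% positive labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))
y = rng.random(500) < 0.1

clf = DecisionTreeClassifier(max_depth=2, random_state=0)

# 1) One split: hold out the last 20% of rows (shuffle=False keeps time order).
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, shuffle=False)
one_split_acc = clf.fit(X_tr, y_tr).score(X_te, y_te)

# 2) k-fold with k = 5 (contiguous, unshuffled folds).
kfold_scores = cross_val_score(clf, X, y, cv=KFold(n_splits=5))

# 3) Time series split: every fold trains on earlier rows and tests on the block after them.
ts_scores = cross_val_score(clf, X, y, cv=TimeSeriesSplit(n_splits=5))

print("one split  : %.3f" % one_split_acc)
print("5-fold     : %.3f (+/- %.3f)" % (kfold_scores.mean(), kfold_scores.std() * 2))
print("time series: %.3f (+/- %.3f)" % (ts_scores.mean(), ts_scores.std() * 2))
```

For lagged return features, the time series split is the only one of the three in which no training row is later in time than its test rows, which is why it is usually reported alongside plain k-fold.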



INSTRUCTIONS
· Use of the following CLASSIFICATION Machine Learning Models
  · Decision Trees
  · Random Forest
  · Gradient Boosting Trees
· First, given dataset “spx_tail”, the “tail events” as defined below have to be identified in order to assign them such label
· All sections needed, i.e.
· “Please Report” section
· “Remember” section:
· one-split cross validation
· k-fold cross validation (use k = 5)
· top K Feature Importance for Random Forest and Gradient Boosting Trees
· Bonus 1
· Bonus 2
· Bonus 3
Answered Same Day · Nov 07, 2020
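The "top K Feature Importance" item could be approached roughly as follows. This is a sketch under stated assumptions: the helper `top_k_importances` is a hypothetical name, the data is synthetic (the label depends only on `lag_1`), and the ranking uses scikit-learn's impurity-based `feature_importances_` attribute, which both ensemble classifiers expose after fitting.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier

def top_k_importances(model, feature_names, k=10):
    """Return the k largest impurity-based importances of a fitted tree ensemble."""
    importances = pd.Series(model.feature_importances_, index=feature_names)
    return importances.sort_values(ascending=False).head(k)

# Synthetic stand-in (assumption): 20 lag columns, label driven by lag_1 only.
rng = np.random.default_rng(0)
cols = ["lag_%d" % i for i in range(1, 21)]
X = pd.DataFrame(rng.normal(size=(400, 20)), columns=cols)
y = X["lag_1"] < -1.0

rf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
gb = GradientBoostingClassifier(n_estimators=50, random_state=0).fit(X, y)

top_rf = top_k_importances(rf, cols, k=5)
top_gb = top_k_importances(gb, cols, k=5)
print("Random forest top 5:\n", top_rf)
print("Gradient boosting top 5:\n", top_gb)
```

Impurity-based importances are computed on the training data, so on a real dataset they are best read as a ranking of what the model used rather than as evidence of predictive power.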

Ximi answered on Nov 10 2020
{
"cells": [
{
"cell_type": "code",
"execution_count": 110,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import pandas as pd\n",
"from sklearn.model_selection import train_test_split, cross_val_score, TimeSeriesSplit\n",
"from sklearn.tree import DecisionTreeClassifier\n",
"from sklearn.metrics import accuracy_score, classification_report\n",
"from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, VotingClassifier\n",
"import numpy as np\n",
"import matplotlib.pyplot as plt\n",
"from sklearn.linear_model import LogisticRegression\n",
"from sklearn.naive_bayes import GaussianNB\n",
"from sklearn.neighbors import KNeighborsClassifier"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"#Reading CSV file\n",
"df = pd.read_csv('spx_tail.csv')"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(4319, 254)"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"#Shape of the data\n",
"df.shape"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" ticker lag_1 lag_2 lag_3 lag_4 lag_5 lag_6 \\\n",
"0 2001-01-02 -0.010503 0.003980 0.010385 0.007035 0.024110 0.007970 \n",
"1 2001-01-03 -0.028432 -0.010503 0.003980 0.010385 0.007035 0.024110 \n",
"2 2001-01-04 0.048884 -0.028432 -0.010503 0.003980 0.010385 0.007035 \n",
"3 2001-01-05 -0.010608 0.048884 -0.028432 -0.010503 0.003980 0.010385 \n",
"4 2001-01-08 -0.026593 -0.010608 0.048884 -0.028432 -0.010503 0.003980 \n",
"\n",
" lag_7 lag_8 lag_9 ... lag_244 lag_245 lag_246 \\\n",
"0 -0.031796 -0.013043 0.008038 ... 0.012096 -0.004396 -0.013149 \n",
"1 0.007970 -0.031796 -0.013043 ... 0.010615 0.012096 -0.004396 \n",
"2 0.024110 0.007970 -0.031796 ... -0.006856 0.010615 0.012096 \n",
"3 0.007035 0.024110 0.007970 ... 0.000522 -0.006856 0.010615 \n",
"4 0.010385 0.007035 0.024110 ... -0.007121 0.000522 -0.006856 \n",
"\n",
" lag_247 lag_248 lag_249 lag_250 lag_251 lag_252 T1 \n",
"0 0.011128 0.026730 0.000955 0.001920 -0.039099 0.000000 False \n",
"1 -0.013149 0.011128 0.026730 0.000955 0.001920 -0.039099 False \n",
"2 -0.004396 -0.013149 0.011128 0.026730 0.000955 0.001920 False \n",
"3 0.012096 -0.004396 -0.013149 0.011128 0.026730 0.000955 False \n",
"4 0.010615 0.012096 -0.004396 -0.013149 0.011128 0.026730 False \n",
"\n",
"[5 rows x 254 columns]"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"#A glance at the data\n",
"df.head()"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"#Separating input data and output variable\n",
"X = df.iloc[:, 1:-1]\n",
"y = df.iloc[:, -1]"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" lag_1 lag_2 lag_3 lag_4 lag_5 lag_6 lag_7 \\\n",
"0 -0.010503 0.003980 0.010385 0.007035 0.024110 0.007970 -0.031796 \n",
"1 -0.028432 -0.010503 0.003980 0.010385 0.007035 0.024110 0.007970 \n",
"2 0.048884 -0.028432 -0.010503 0.003980 0.010385 0.007035 0.024110 \n",
"3 -0.010608 0.048884 -0.028432 -0.010503 0.003980 0.010385 0.007035 \n",
"4 -0.026593 -0.010608 0.048884 -0.028432 -0.010503 0.003980 0.010385 \n",
"\n",
" lag_8 lag_9 lag_10 ... lag_243 lag_244 lag_245 \\\n",
"0 -0.013043 0.008038 -0.021696 ... 0.010615 0.012096 -0.004396 \n",
"1 -0.031796 -0.013043 0.008038 ... -0.006856 0.010615 0.012096 \n",
"2 0.007970 -0.031796 -0.013043 ... 0.000522 -0.006856 0.010615 \n",
"3 0.024110 0.007970 -0.031796 ... -0.007121 0.000522 -0.006856 \n",
"4 0.007035 0.024110 0.007970 ... -0.002917 -0.007121 0.000522 \n",
"\n",
" lag_246 lag_247 lag_248 lag_249 lag_250 lag_251 lag_252 \n",
"0 -0.013149 0.011128 0.026730 0.000955 0.001920 -0.039099 0.000000 \n",
"1 -0.004396 -0.013149 0.011128 0.026730 0.000955 0.001920 -0.039099 \n",
"2 0.012096 -0.004396 -0.013149 0.011128 0.026730 0.000955 0.001920 \n",
"3 0.010615 0.012096 -0.004396 -0.013149 0.011128 0.026730 0.000955 \n",
"4 -0.006856 0.010615 0.012096 -0.004396 -0.013149 0.011128 0.026730 \n",
"\n",
"[5 rows x 252 columns]"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"#Input data\n",
"X.head()"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0 False\n",
"1 False\n",
"2 False\n",
"3 False\n",
"4 False\n",
"Name: T1, dtype: bool"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"#Output data\n",
"y.head()"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [],
"source": [
"#Single hold-out split: 80% train, 20% test (train_test_split shuffles rows by default)\n",
"train_x, test_x, train_y, test_y = train_test_split(X, y, test_size=0.2)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Decision Trees\n",
"I. Implementing decision trees with different depths and impurity functions."
]
},
{
"cell_type": "code",
"execution_count": 58,
"metadata": {
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Depth: 2 \n",
"Entropy function: gini\n",
"Accuracy Score: 89.814815\n",
"Classification Report: \n",
" precision recall f1-score support\n",
"\n",
" False 0.90 1.00 0.95 776\n",
" True 0.50 0.01 0.02 88\n",
"\n",
"avg / total 0.86 0.90 0.85 864\n",
"\n",
"Depth: 2 \n",
"Entropy function: entropy\n",
"Accuracy Score: 89.814815\n",
"Classification Report: \n",
" precision recall f1-score support\n",
"\n",
" False 0.90 1.00 0.95 776\n",
" True 0.00 0.00 0.00 88\n",
"\n",
"avg / total 0.81 0.90 0.85 864\n",
"\n",
"Depth: 5 \n",
"Entropy function: gini\n",
"Accuracy Score: 89.004630\n",
"Classification Report: \n",
" precision recall f1-score support\n",
"\n",
" False 0.90 0.99 0.94 776\n",
" True 0.11 0.01 0.02 88\n",
"\n",
"avg / total 0.82 0.89 0.85 864\n",
"\n",
"Depth: 5 \n",
"Entropy function: entropy\n",
"Accuracy Score: 89.583333\n",
"Classification Report: \n",
" precision recall f1-score support\n",
"\n",
" False 0.90 1.00 0.94 776\n",
" True 0.25 0.01 0.02 88\n",
"\n",
"avg / total 0.83 0.90 0.85 864\n",
"\n",
"Depth: 10 \n",
"Entropy function: gini\n",
"Accuracy Score: 87.037037\n",
"Classification Report: \n",
" precision recall f1-score support\n",
"\n",
" False 0.90 0.97 0.93 776\n",
" True 0.07 0.02 0.03 88\n",
"\n",
"avg / total 0.81 0.87 0.84 864\n",
"\n",
"Depth: 10 \n",
"Entropy function: entropy\n",
"Accuracy Score: 85.879630\n",
"Classification Report: \n",
" precision recall f1-score support\n",
"\n",
" False 0.90 0.95 0.92 776\n",
" True 0.13 0.07 0.09 88\n",
"\n",
"avg / total 0.82 0.86 0.84 864\n",
"\n",
"Depth: 15 \n",
"Entropy function: gini\n",
"Accuracy Score: 84.722222\n",
"Classification Report: \n",
" precision recall f1-score support\n",
"\n",
" False 0.90 0.94 0.92 776\n",
" True 0.08 0.05 0.06 88\n",
"\n",
"avg / total 0.81 0.85 0.83 864\n",
"\n",
"Depth: 15 \n",
"Entropy function: entropy\n",
"Accuracy Score: 83.912037\n",
"Classification Report: \n",
" precision recall f1-score support\n",
"\n",
" False 0.90 0.92 0.91 776\n",
" True 0.13 0.10 0.11 88\n",
"\n",
"avg / total 0.82 0.84 0.83 864\n",
"\n",
"Depth: 20 \n",
"Entropy function: gini\n",
"Accuracy Score: 83.912037\n",
"Classification Report: \n",
" precision recall f1-score support\n",
"\n",
" False 0.89 0.93 0.91 776\n",
" True 0.05 0.03 0.04 88\n",
"\n",
"avg / total 0.81 0.84 0.82 864\n",
"\n",
"Depth: 20 \n",
"Entropy function: entropy\n",
"Accuracy Score: 81.944444\n",
"Classification Report: \n",
" precision recall f1-score support\n",
"\n",
" False 0.90 0.90 0.90 776\n",
" True 0.14 0.15 0.14 88\n",
"\n",
"avg / total 0.82 0.82 0.82 864\n",
"\n",
"Depth: 50 \n",
"Entropy function: gini\n",
"Accuracy Score: 78.472222\n",
"Classification Report: \n",
" precision recall f1-score support\n",
"\n",
" False 0.89 0.86 0.88 776\n",
" True 0.07 0.09 0.08 88\n",
"\n",
"avg / total 0.81 0.78 0.80 864\n",
"\n",
"Depth: 50 \n",
"Entropy function: entropy\n",
"Accuracy Score: 81.481481\n",
"Classification Report: \n",
" precision recall f1-score support\n",
"\n",
" False 0.90 0.89 0.90 776\n",
" True 0.12 0.12 0.12 88\n",
"\n",
"avg / total 0.82 0.81 0.82 864\n",
"\n"
]
}
],
"source": [
"# Evaluate every (max_depth, criterion) combination on the hold-out split\n",
"depths = [2, 5, 10, 15, 20, 50]\n",
"criterion = ['gini', 'entropy']\n",
"for depth in depths:\n",
"    for en_func in criterion:\n",
"        print (\"Depth: %d \\nEntropy function: %s\"%(depth, en_func))\n",
"        clf = DecisionTreeClassifier(criterion=en_func, max_depth=depth)\n",
"        clf.fit(train_x, train_y)\n",
"        pred_y = clf.predict(test_x)  # predict once, reuse for both metrics\n",
"\n",
"        #Accuracy Score\n",
"        print (\"Accuracy Score: %f\"%(accuracy_score(test_y, pred_y)*100))\n",
"\n",
"        #Classification report\n",
"        print (\"Classification Report: \")\n",
"        print(classification_report(test_y, pred_y))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
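"The decision tree performs best at max_depth = 2 with the gini impurity function: it predicts the True class with 50% precision, higher than any other configuration. Recall for the True class is only 0.01, however, so the 89.8% accuracy mostly reflects the majority False class (776 of 864 test rows)."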
]
},
{
"cell_type": "code",
"execution_count": 53,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Accuracy: 0.90 (+/- 0.01)\n"
]
}
],
"source": [
"#Cross validating with k = 5 with optimal parameters\n",
"clf = DecisionTreeClassifier(criterion='gini', max_depth=2)\n",
"scores = cross_val_score(clf, X, y, cv=5)\n",
"print(\"Accuracy: %0.2f (+/- %0.2f)\" % (scores.mean(), scores.std() * 2))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Bonus 1"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
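"The class weights used here, {True: 0.5, False: 0.5}, are equal rather than balanced, so the hold-out classification report is identical to the unweighted depth-2 gini tree (True-class precision 0.50, recall 0.01). Passing class_weight='balanced' instead would weight each class inversely to its frequency."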
]
},
{
"cell_type": "code",
"execution_count": 69,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Accuracy: 0.90 (+/- 0.00)\n",
"Classification Report: \n",
" precision recall f1-score support\n",
"\n",
" False 0.90 1.00 0.95 776\n",
" True 0.50 0.01 0.02 88\n",
"\n",
"avg / total 0.86 0.90 0.85 864\n",
"\n"
]
}
],
"source": [
"#Tuning class_weights on optimal parameters to check performance\n",
"clf = DecisionTreeClassifier(criterion='gini', max_depth=2, class_weight={True:0.5, False:0.5})\n",
"\n",
"scores = cross_val_score(clf, X, y, cv=5)\n",
"print(\"Accuracy: %0.2f (+/- %0.2f)\" % (scores.mean(), scores.std() * 2))\n",
"\n",
"clf.fit(train_x, train_y)\n",
"\n",
"#Classification report\n",
"print (\"Classification Report: \")\n",
"print(classification_report(test_y, clf.predict(test_x)))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Random Forest\n",
"II. Implementing random...
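As a supplement to the Bonus 1 cell in the notebook above, the difference between equal and balanced class weights can be sketched on synthetic data. Everything here is an assumption for illustration: the data is random, the labels are encoded as 0/1 instead of booleans, and the scores are computed in-sample, so the numbers say nothing about `spx_tail` itself.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils.class_weight import compute_class_weight

# Synthetic imbalanced data (assumption): roughly 10% positives, weakly tied to feature 0.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = ((X[:, 0] + rng.normal(size=1000)) < -1.8).astype(int)

# 'balanced' weights are n_samples / (n_classes * count of each class).
weights = compute_class_weight(class_weight="balanced", classes=np.array([0, 1]), y=y)
print("balanced weights (class 0, class 1):", weights)

settings = {"none": None, "equal": {0: 0.5, 1: 0.5}, "balanced": "balanced"}
preds = {}
for name, cw in settings.items():
    clf = DecisionTreeClassifier(max_depth=2, class_weight=cw, random_state=0)
    preds[name] = clf.fit(X, y).predict(X)
    print(name, "precision=%.2f recall=%.2f" % (
        precision_score(y, preds[name], zero_division=0),
        recall_score(y, preds[name], zero_division=0)))
```

Equal weights only rescale every sample by the same factor, so the "equal" tree makes the same predictions as the unweighted one; only the "balanced" setting changes how much the rare class counts when the tree picks its splits.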