{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Question 1\n", "\n", "\n", "This question is inspired from Exercise 3 in Chapter 5 in the textbook. \n", "\n", "On Canvas, you...

1 answer below »
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Question 1\n",
"\n",
"\n",
"This question is inspired from Exercise 3 in Chapter 5 in the textbook. \n",
"\n",
"On Canvas, you will see a CSV file named \"THA_diamonds.csv\". This file is a small subset of a real dataset on diamond prices in a [Kaggle competition](https://www.kaggle.com/shivam2503/diamonds). You will use this dataset for this question and the next question. \n",
"\n",
"**Some Background Information:** In our version of the dataset, the `price` feature has been discretized as `low`, `medium`, and `high`, and `premium`. If you are interested, these levels correspond to the following price ranges in the actual diamonds dataset:\n",
"- `low` price: price between \\\\$1000 and \\\\$2000\n",
"- `medium` price: price between \\\\$2000 and \\\\$3000\n",
"- `high` price: price between \\\\$3000 and \\\\$3500\n",
"- `premium` price: price between \\\\$3500 and \\\\$4000\n",
"\n",
"**Question Overview:** For this question, you will use the (unweighted) KNN algorithm for predicting the `carat` (numerical) target feature for the following single observation using the **Euclidean distance** metric with different number of neighbors:\n",
"- `cut` = good\n",
"- `color` = D\n",
"- `depth` = 60\n",
"- `price` = premium\n",
"- (`carat` = 0.71 but you will pretend that you do not have this information)\n",
"\n",
"In practice, you would use cross-validation or train-test split for determining optimal values of KNN hyperparameters. **However, as far as this assessment is concerned, you are to use entire data for training.**\n",
"\n",
"\n",
"### Part A (15 points)\n",
"Prepare your dataset for KNN modeling. Specifically, \n",
"1. Perform one-hot encoding of the categorical descriptive features in the input dataset.\n",
"2. Scale your descriptive features to be between 0 and 1.\n",
"3. Display the **last** 10 rows after one-hot encoding and scaling.\n",
"\n",
"**IMPORTANT NOTE: If your data preparation steps are incorrect, you will not get full credit for a correct follow-through.**"
]
},
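{
"cell_type": "markdown",
"metadata": {},
"source": [
"Below is a minimal sketch of one possible data-preparation workflow for Part A. It assumes \"THA_diamonds.csv\" is in the working directory and that `carat` is the only non-descriptive (target) column; `pd.get_dummies()` and a manual min-max formula are one way to satisfy steps 1 and 2."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# A minimal sketch for Part A, assuming \"THA_diamonds.csv\" is in the working directory.\n",
"import pandas as pd\n",
"\n",
"df = pd.read_csv(\"THA_diamonds.csv\")\n",
"\n",
"# Separate the target from the descriptive features.\n",
"features = df.drop(columns=[\"carat\"])\n",
"\n",
"# 1. One-hot encode the categorical descriptive features.\n",
"features_encoded = pd.get_dummies(features, dtype=float)\n",
"\n",
"# 2. Scale every descriptive feature to the [0, 1] range (min-max scaling).\n",
"features_scaled = (features_encoded - features_encoded.min()) / (\n",
"    features_encoded.max() - features_encoded.min()\n",
")\n",
"\n",
"# 3. Display the last 10 rows after encoding and scaling.\n",
"features_scaled.tail(10)"
]
},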
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**NOTE:** For Parts (B), (C), and (D) below, you are **not** allowed to use the `KNeighborsRegressor()` in Scikit-Learn module, but rather use manual calculations (via either Python or Excel). That is, you will need to show and explain all your solution steps **without** using Scikit-Learn. The reason for this restriction is so that you get to learn how some things work behind the scenes. \n",
"\n",
"### Part B (5 points)\n",
"What is the prediction of the 1-KNN algorithm (i.e., k=1 in KNN) for the `carat` target feature using your manual calculations (using the Euclidean distance metric) for the single observation given above?\n",
"\n",
"### Part C (5 points)\n",
"What is the prediction of the 5-KNN algorithm?\n",
"\n",
"### Part D (5 points)\n",
"What is the prediction of the 10-KNN algorithm?\n",
"\n",
"\n",
"### Part E (15 points)\n",
"\n",
"This part (E) is an exception to the solution mode instructions for this question. In particular, you will need to use the `KNeighborsRegressor()` in Scikit-Learn to perform the same predictions in each Part (B) to (D). That is, \n",
"- What is the prediction of the 1-KNN algorithm using `KNeighborsRegressor()`?\n",
"- What is the prediction of the 5-KNN algorithm using `KNeighborsRegressor()`?\n",
"- What is the prediction of the 10-KNN algorithm using `KNeighborsRegressor()`?\n",
"\n",
"Are you able to get the same results as in your manual calculations? Please explain.\n",
"\n",
"\n",
"### Part F: Wrap-up (5 points)\n",
"\n",
"**IMPORTANT NOTE: This Wrap-up section is mandatory. That is, for Parts (B) to (E) (inclusive), you will not get any points for solutions not presented in the table format explained below.** \n",
"\n",
"Add and display two tables called **\"df_summary_manual\"** and **\"df_summary_sklearn\"** respectively:\n",
"- For the table **\"df_summary_manual\"**, you will report your results for Parts (B) to (D) using your manual calculations.\n",
"- For the table **\"df_summary_sklearn\"**, you will report your results for the 3 predictions in Part (E) using `KNeighborsRegressor()`.\n",
"\n",
"\n",
"Each of these tables need to have the following 3 columns:\n",
"- method\n",
"- prediction for the observation given (to be rounded to 3 decimal places)\n",
"- is_best (True or False - only the best prediction's is_best flag needs to be True and all the others need to be False)\n",
"\n",
"Your table needs to have 3 rows (one for each method) in each table that summarizes your results. These tables should look like below:\n",
"\n",
"|method | prediction | is_best |\n",
"|---|---|---\n",
"|1-KNN | ? | ? | ? |\n",
"|5-KNN | ? | ? | ? |\n",
"|10-KNN | ? | ? | ? |\n",
"\n",
"In case of a Pandas data frame, you can populate this data frame line by line by referring to Cell #6 in our [Pandas tutorial](https://www.featureranking.com/tutorials/python-tutorials/pandas/).\n"
]
},
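{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch of the manual calculations for Parts (B) to (D) follows, assuming the raw data frame `df` from the Part A sketch. To avoid guessing the one-hot column names, the query observation is concatenated with the training features before encoding, and the scaling reuses the training minima and maxima."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# A sketch of an unweighted KNN prediction by hand (Parts B to D),\n",
"# assuming the raw data frame `df` from the Part A sketch above.\n",
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"# The query observation from the task description (carat withheld).\n",
"query = pd.DataFrame([{\"cut\": \"good\", \"color\": \"D\", \"depth\": 60, \"price\": \"premium\"}])\n",
"\n",
"# Encode the training features and the query with identical dummy columns\n",
"# by concatenating before calling get_dummies, then splitting again.\n",
"features = df.drop(columns=[\"carat\"])\n",
"combined = pd.get_dummies(pd.concat([features, query], ignore_index=True), dtype=float)\n",
"train_enc, query_enc = combined.iloc[:-1], combined.iloc[-1]\n",
"\n",
"# Min-max scale with the *training* minima and maxima.\n",
"col_min, col_max = train_enc.min(), train_enc.max()\n",
"train_scaled = (train_enc - col_min) / (col_max - col_min)\n",
"query_scaled = (query_enc - col_min) / (col_max - col_min)\n",
"\n",
"# Euclidean distance from the query to every training row.\n",
"distances = np.sqrt(((train_scaled - query_scaled) ** 2).sum(axis=1))\n",
"\n",
"# Unweighted KNN prediction: the mean carat of the k nearest neighbors.\n",
"for k in (1, 5, 10):\n",
"    nearest = distances.nsmallest(k).index\n",
"    print(f\"{k}-KNN prediction:\", round(df.loc[nearest, \"carat\"].mean(), 3))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For Part (E), a sketch of the Scikit-Learn cross-check, reusing `train_scaled`, `query_scaled`, and `df` from the cell above. `KNeighborsRegressor()` uses the Euclidean (Minkowski with p=2) distance by default."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# A sketch of the Part E cross-check with Scikit-Learn, assuming\n",
"# `train_scaled`, `query_scaled`, and `df` from the cell above.\n",
"from sklearn.neighbors import KNeighborsRegressor\n",
"\n",
"for k in (1, 5, 10):\n",
"    knn = KNeighborsRegressor(n_neighbors=k)  # Euclidean distance by default\n",
"    knn.fit(train_scaled, df[\"carat\"])\n",
"    pred = knn.predict(query_scaled.to_frame().T)\n",
"    print(f\"{k}-KNN prediction:\", round(float(pred[0]), 3))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Finally, a skeleton for the Part F summary tables, populated row by row as in Cell #6 of the Pandas tutorial; the `None` predictions and `is_best` flags are placeholders for your own results."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# A skeleton for the Part F tables; the prediction values and is_best\n",
"# flags are placeholders to be replaced with your own results.\n",
"df_summary_manual = pd.DataFrame(columns=[\"method\", \"prediction\", \"is_best\"])\n",
"df_summary_manual.loc[0] = [\"1-KNN\", None, False]\n",
"df_summary_manual.loc[1] = [\"5-KNN\", None, False]\n",
"df_summary_manual.loc[2] = [\"10-KNN\", None, False]\n",
"df_summary_manual"
]
},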
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Question 2"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This question is inspired from Exercise 3 in Chapter 4 in the textbook. \n",
"\n",
"You will use the same CSV file as in Question 1 named \"THA_diamonds.csv\". You will build a simple decision tree with **depth 1** using this dataset for predicting the `price` (categorical) target feature using the **Entropy** split criterion. \n",
"\n",
"To clarify, for Question 1, your target feature will be `carat` whereas for this Question 2, your target feature will be `price`."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Part A (10 points)\n",
"\n",
"The dataset for this question has 2 numerical descriptive features, `carat` and `depth`. \n",
"1. Discretize these 2 features separately as \"category_1\", \"category_2\", and \"category_3\" respectively using the *equal-frequency binning* technique. \n",
"2. Display the first 10 rows after discretization of these two features.\n",
"\n",
"After this discretization, all features in your dataset will be categorical (which we will assume to be **\"nominal categorical\"**). \n",
"\n",
"For this question, please do **NOT** perform any one-hot-encoding of the categorical descriptive features nor any scaling. Also, please do **NOT** perform any train-test splits.\n",
"\n",
"**IMPORTANT NOTE: If your discretizations are incorrect, you will not get full credit for a correct follow-through.**"
]
},
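{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch of the equal-frequency discretization for Part A follows, using pandas' `qcut()`, which splits on quantiles so each bin holds (roughly) the same number of rows. It re-reads the raw file so Question 2 starts from unencoded, unscaled data; ties at the bin edges may need extra care."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# A minimal sketch of Part A using pandas' qcut for equal-frequency binning.\n",
"# It re-reads the raw file so Question 2 starts from unencoded, unscaled data.\n",
"import pandas as pd\n",
"\n",
"df2 = pd.read_csv(\"THA_diamonds.csv\")\n",
"\n",
"labels = [\"category_1\", \"category_2\", \"category_3\"]\n",
"for col in [\"carat\", \"depth\"]:\n",
"    # qcut splits on quantiles, so each of the 3 bins holds (roughly)\n",
"    # the same number of rows -- i.e., equal-frequency binning.\n",
"    df2[col] = pd.qcut(df2[col], q=3, labels=labels)\n",
"\n",
"df2.head(10)"
]
},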
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Part B (5 points)\n",
"\n",
"Compute the impurity of the `price` target feature."
]
},
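{
"cell_type": "markdown",
"metadata": {},
"source": [
"A sketch of the entropy computation for Part B, assuming the discretized data frame `df2` from the Part A sketch and base-2 logarithms, as in the textbook."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# A sketch of the Part B entropy (impurity) computation,\n",
"# assuming `df2` from the Part A sketch above.\n",
"import numpy as np\n",
"\n",
"# Entropy: H = -sum(p * log2(p)) over the class proportions of `price`.\n",
"probs = df2[\"price\"].value_counts(normalize=True)\n",
"entropy = -(probs * np.log2(probs)).sum()\n",
"print(\"Entropy of price:\", round(entropy, 4))"
]
},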
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Part C (20 points)\n",
"\n",
"**IMPORTANT NOTE: For Parts C and D below, you will not get any points for solutions not presented in the required table format.** \n",
"\n",
"In this part, you will determine the root node for your decision tree.\n",
"\n",
"Your answer to this part needs to be a table and it needs to be called **\"df_splits\"**. Also, it needs to have the following 4 columns:\n",
"- split\n",
"- remainder\n",
"- info_gain\n",
"- is_optimal (True or False - only the optimal split's is_optimal flag needs to be True and the others need to be False)\n",
"\n",
"In your **\"df_splits\"** table, you should have **one row for each descriptive feature in the dataset**. As an example for your **\"df_splits\"** table, consider the `spam prediction` example in Table 4.2 in the textbook (**FIRST** Edition) on page 121, which was also covered in lectorials. The `df_splits` table would look something like the table below.\n",
"\n",
"|split| remainder | info_gain| is_optimal |\n",
"|---|---|---|---|\n",
"|suspicious words | ? | ? | True |\n",
"|unknown sender | ? | ? | False |\n",
"|contains images | ? | ? | False |\n",
"\n",
"**HINT:** Your `df_splits` table should have 4 rows."
]
},
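{
"cell_type": "markdown",
"metadata": {},
"source": [
"A sketch of the Part C split evaluation follows, assuming `df2` and the `entropy` value from the sketches above. For each descriptive feature, the remainder is the partition-size-weighted average entropy of `price`, and the information gain is the target entropy minus that remainder."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# A sketch of the Part C split evaluation, assuming `df2` and `entropy`\n",
"# from the sketches above.\n",
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"def entropy_of(series):\n",
"    p = series.value_counts(normalize=True)\n",
"    return -(p * np.log2(p)).sum()\n",
"\n",
"rows = []\n",
"for feature in [\"carat\", \"cut\", \"color\", \"depth\"]:\n",
"    # Remainder: weighted average entropy of the target in each partition.\n",
"    remainder = sum(\n",
"        (len(part) / len(df2)) * entropy_of(part[\"price\"])\n",
"        for _, part in df2.groupby(feature, observed=True)\n",
"    )\n",
"    rows.append({\"split\": feature, \"remainder\": remainder,\n",
"                 \"info_gain\": entropy - remainder})\n",
"\n",
"df_splits = pd.DataFrame(rows)\n",
"df_splits[\"is_optimal\"] = df_splits[\"info_gain\"] == df_splits[\"info_gain\"].max()\n",
"df_splits"
]
},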
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Part D (15 points)\n",
"\n",
"In this part, you will **assume** the `carat` descriptive feature is at the root node (**NOTE:** This feature may or may not be the optimal root node, but you will just assume it is). Under this assumption, you will make predictions for the `price` target variable. \n",
"\n",
"Your answer to this part needs to be a table and it needs to be called **\"df_pred\"**. Also, it needs to have the following 6 columns:\n",
"- leaf_condition\n",
"- low_price_prob (probability)\n",
"- medium_price_prob\n",
"- high_price_prob\n",
"- premium_price_prob\n",
"- leaf_prediction\n",
"\n",
"As an example, continuing the spam prediction problem, assume the `suspicious words` descriptive feature is at the root node. The `df_pred` table would look something like the table below.\n",
"\n",
"|leaf_condition| spam_prob | ham_prob | leaf_prediction |\n",
"|---|---|---|---|\n",
"|suspicious words == true | ? | ? | ? |\n",
"|suspicious words == false | ? | ? | ? |\n",
"\n",
"**HINT:** Your `df_pred` table should have 3 rows."
]
},
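{
"cell_type": "markdown",
"metadata": {},
"source": [
"A sketch of the Part D leaf predictions, assuming `df2` from the Part A sketch (so `carat` has the 3 discretized levels) and that the `price` levels are exactly the four strings given in Question 1."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# A sketch of the Part D leaf predictions with `carat` assumed at the root,\n",
"# reusing `df2` from the Part A sketch.\n",
"import pandas as pd\n",
"\n",
"rows = []\n",
"for level, part in df2.groupby(\"carat\", observed=True):\n",
"    # Class probabilities of `price` inside this leaf.\n",
"    probs = part[\"price\"].value_counts(normalize=True)\n",
"    rows.append({\n",
"        \"leaf_condition\": f\"carat == {level}\",\n",
"        \"low_price_prob\": probs.get(\"low\", 0.0),\n",
"        \"medium_price_prob\": probs.get(\"medium\", 0.0),\n",
"        \"high_price_prob\": probs.get(\"high\", 0.0),\n",
"        \"premium_price_prob\": probs.get(\"premium\", 0.0),\n",
"        \"leaf_prediction\": probs.idxmax(),  # majority class in the leaf\n",
"    })\n",
"\n",
"df_pred = pd.DataFrame(rows)\n",
"df_pred"
]
},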
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.5"
}
},
"nbformat": 4,
"nbformat_minor": 4
}