{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Question 1\n", "\n", "\n", "This question is inspired from Exercise 3 in Chapter 5 in the textbook. \n", "\n", "On Canvas, you...

1 answer below »
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Question 1\n",
"\n",
"\n",
"This question is inspired from Exercise 3 in Chapter 5 in the textbook. \n",
"\n",
"On Canvas, you will see a CSV file named \"THA_diamonds.csv\". This file is a small subset of a real dataset on diamond prices in a [Kaggle competition](https://www.kaggle.com/shivam2503/diamonds). You will use this dataset for this question and the next question. \n",
"\n",
"**Some Background Information:** In our version of the dataset, the `price` feature has been discretized as `low`, `medium`, and `high`, and `premium`. If you are interested, these levels correspond to the following price ranges in the actual diamonds dataset:\n",
"- `low` price: price between \\\\$1000 and \\\\$2000\n",
"- `medium` price: price between \\\\$2000 and \\\\$3000\n",
"- `high` price: price between \\\\$3000 and \\\\$3500\n",
"- `premium` price: price between \\\\$3500 and \\\\$4000\n",
"\n",
"**Question Overview:** For this question, you will use the (unweighted) KNN algorithm for predicting the `carat` (numerical) target feature for the following single observation using the **Euclidean distance** metric with different number of neighbors:\n",
"- `cut` = good\n",
"- `color` = D\n",
"- `depth` = 60\n",
"- `price` = premium\n",
"- (`carat` = 0.71 but you will pretend that you do not have this information)\n",
"\n",
"In practice, you would use cross-validation or train-test split for determining optimal values of KNN hyperparameters. **However, as far as this assessment is concerned, you are to use entire data for training.**\n",
"\n",
"\n",
"### Part A (15 points)\n",
"Prepare your dataset for KNN modeling. Specifically, \n",
"1. Perform one-hot encoding of the categorical descriptive features in the input dataset.\n",
"2. Scale your descriptive features to be between 0 and 1.\n",
"3. Display the **last** 10 rows after one-hot encoding and scaling.\n",
"\n",
"**IMPORTANT NOTE: If your data preparation steps are incorrect, you will not get full credit for a correct follow-through.**"
]
},
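{
"cell_type": "markdown",
"metadata": {},
"source": [
"Below is a minimal sketch of one possible data-preparation workflow for Part A. It assumes \"THA_diamonds.csv\" is in the working directory and that `carat` is the only non-descriptive (target) column; `pd.get_dummies()` and a manual min-max formula are one way to satisfy steps 1 and 2."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# A minimal sketch for Part A, assuming \"THA_diamonds.csv\" is in the working directory.\n",
"import pandas as pd\n",
"\n",
"df = pd.read_csv(\"THA_diamonds.csv\")\n",
"\n",
"# Separate the target from the descriptive features.\n",
"features = df.drop(columns=[\"carat\"])\n",
"\n",
"# 1. One-hot encode the categorical descriptive features.\n",
"features_encoded = pd.get_dummies(features, dtype=float)\n",
"\n",
"# 2. Scale every descriptive feature to the [0, 1] range (min-max scaling).\n",
"features_scaled = (features_encoded - features_encoded.min()) / (\n",
"    features_encoded.max() - features_encoded.min()\n",
")\n",
"\n",
"# 3. Display the last 10 rows after encoding and scaling.\n",
"features_scaled.tail(10)"
]
},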
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**NOTE:** For Parts (B), (C), and (D) below, you are **not** allowed to use the `KNeighborsRegressor()` in Scikit-Learn module, but rather use manual calculations (via either Python or Excel). That is, you will need to show and explain all your solution steps **without** using Scikit-Learn. The reason for this restriction is so that you get to learn how some things work behind the scenes. \n",
"\n",
"### Part B (5 points)\n",
"What is the prediction of the 1-KNN algorithm (i.e., k=1 in KNN) for the `carat` target feature using your manual calculations (using the Euclidean distance metric) for the single observation given above?\n",
"\n",
"### Part C (5 points)\n",
"What is the prediction of the 5-KNN algorithm?\n",
"\n",
"### Part D (5 points)\n",
"What is the prediction of the 10-KNN algorithm?\n",
"\n",
"\n",
"### Part E (15 points)\n",
"\n",
"This part (E) is an exception to the solution mode instructions for this question. In particular, you will need to use the `KNeighborsRegressor()` in Scikit-Learn to perform the same predictions in each Part (B) to (D). That is, \n",
"- What is the prediction of the 1-KNN algorithm using `KNeighborsRegressor()`?\n",
"- What is the prediction of the 5-KNN algorithm using `KNeighborsRegressor()`?\n",
"- What is the prediction of the 10-KNN algorithm using `KNeighborsRegressor()`?\n",
"\n",
"Are you able to get the same results as in your manual calculations? Please explain.\n",
"\n",
"\n",
"### Part F: Wrap-up (5 points)\n",
"\n",
"**IMPORTANT NOTE: This Wrap-up section is mandatory. That is, for Parts (B) to (E) (inclusive), you will not get any points for solutions not presented in the table format explained below.** \n",
"\n",
"Add and display two tables called **\"df_summary_manual\"** and **\"df_summary_sklearn\"** respectively:\n",
"- For the table **\"df_summary_manual\"**, you will report your results for Parts (B) to (D) using your manual calculations.\n",
"- For the table **\"df_summary_sklearn\"**, you will report your results for the 3 predictions in Part (E) using `KNeighborsRegressor()`.\n",
"\n",
"\n",
"Each of these tables need to have the following 3 columns:\n",
"- method\n",
"- prediction for the observation given (to be rounded to 3 decimal places)\n",
"- is_best (True or False - only the best prediction's is_best flag needs to be True and all the others need to be False)\n",
"\n",
"Your table needs to have 3 rows (one for each method) in each table that summarizes your results. These tables should look like below:\n",
"\n",
"|method | prediction | is_best |\n",
"|---|---|---\n",
"|1-KNN | ? | ? | ? |\n",
"|5-KNN | ? | ? | ? |\n",
"|10-KNN | ? | ? | ? |\n",
"\n",
"In case of a Pandas data frame, you can populate this data frame line by line by referring to Cell #6 in our [Pandas tutorial](https://www.featureranking.com/tutorials/python-tutorials/pandas/).\n"
]
},
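{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch of the manual calculations for Parts (B) to (D) follows, assuming the raw data frame `df` from the Part A sketch. To avoid guessing the one-hot column names, the query observation is concatenated with the training features before encoding, and the scaling reuses the training minima and maxima."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# A sketch of an unweighted KNN prediction by hand (Parts B to D),\n",
"# assuming the raw data frame `df` from the Part A sketch above.\n",
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"# The query observation from the task description (carat withheld).\n",
"query = pd.DataFrame([{\"cut\": \"good\", \"color\": \"D\", \"depth\": 60, \"price\": \"premium\"}])\n",
"\n",
"# Encode the training features and the query with identical dummy columns\n",
"# by concatenating before calling get_dummies, then splitting again.\n",
"features = df.drop(columns=[\"carat\"])\n",
"combined = pd.get_dummies(pd.concat([features, query], ignore_index=True), dtype=float)\n",
"train_enc, query_enc = combined.iloc[:-1], combined.iloc[-1]\n",
"\n",
"# Min-max scale with the *training* minima and maxima.\n",
"col_min, col_max = train_enc.min(), train_enc.max()\n",
"train_scaled = (train_enc - col_min) / (col_max - col_min)\n",
"query_scaled = (query_enc - col_min) / (col_max - col_min)\n",
"\n",
"# Euclidean distance from the query to every training row.\n",
"distances = np.sqrt(((train_scaled - query_scaled) ** 2).sum(axis=1))\n",
"\n",
"# Unweighted KNN prediction: the mean carat of the k nearest neighbors.\n",
"for k in (1, 5, 10):\n",
"    nearest = distances.nsmallest(k).index\n",
"    print(f\"{k}-KNN prediction:\", round(df.loc[nearest, \"carat\"].mean(), 3))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For Part (E), a sketch of the Scikit-Learn cross-check, reusing `train_scaled`, `query_scaled`, and `df` from the cell above. `KNeighborsRegressor()` uses the Euclidean (Minkowski with p=2) distance by default."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# A sketch of the Part E cross-check with Scikit-Learn, assuming\n",
"# `train_scaled`, `query_scaled`, and `df` from the cell above.\n",
"from sklearn.neighbors import KNeighborsRegressor\n",
"\n",
"for k in (1, 5, 10):\n",
"    knn = KNeighborsRegressor(n_neighbors=k)  # Euclidean distance by default\n",
"    knn.fit(train_scaled, df[\"carat\"])\n",
"    pred = knn.predict(query_scaled.to_frame().T)\n",
"    print(f\"{k}-KNN prediction:\", round(float(pred[0]), 3))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Finally, a skeleton for the Part F summary tables, populated row by row as in Cell #6 of the Pandas tutorial; the `None` predictions and `is_best` flags are placeholders for your own results."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# A skeleton for the Part F tables; the prediction values and is_best\n",
"# flags are placeholders to be replaced with your own results.\n",
"df_summary_manual = pd.DataFrame(columns=[\"method\", \"prediction\", \"is_best\"])\n",
"df_summary_manual.loc[0] = [\"1-KNN\", None, False]\n",
"df_summary_manual.loc[1] = [\"5-KNN\", None, False]\n",
"df_summary_manual.loc[2] = [\"10-KNN\", None, False]\n",
"df_summary_manual"
]
},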
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Question 2"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This question is inspired from Exercise 3 in Chapter 4 in the textbook. \n",
"\n",
"You will use the same CSV file as in Question 1 named \"THA_diamonds.csv\". You will build a simple decision tree with **depth 1** using this dataset for predicting the `price` (categorical) target feature using the **Entropy** split criterion. \n",
"\n",
"To clarify, for Question 1, your target feature will be `carat` whereas for this Question 2, your target feature will be `price`."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Part A (10 points)\n",
"\n",
"The dataset for this question has 2 numerical descriptive features, `carat` and `depth`. \n",
"1. Discretize these 2 features separately as \"category_1\", \"category_2\", and \"category_3\" respectively using the *equal-frequency binning* technique. \n",
"2. Display the first 10 rows after discretization of these two features.\n",
"\n",
"After this discretization, all features in your dataset will be categorical (which we will assume to be **\"nominal categorical\"**). \n",
"\n",
"For this question, please do **NOT** perform any one-hot-encoding of the categorical descriptive features nor any scaling. Also, please do **NOT** perform any train-test splits.\n",
"\n",
"**IMPORTANT NOTE: If your discretizations are incorrect, you will not get full credit for a correct follow-through.**"
]
},
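{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch of the equal-frequency discretization for Part A follows, using pandas' `qcut()`, which splits on quantiles so each bin holds (roughly) the same number of rows. It re-reads the raw file so Question 2 starts from unencoded, unscaled data; ties at the bin edges may need extra care."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# A minimal sketch of Part A using pandas' qcut for equal-frequency binning.\n",
"# It re-reads the raw file so Question 2 starts from unencoded, unscaled data.\n",
"import pandas as pd\n",
"\n",
"df2 = pd.read_csv(\"THA_diamonds.csv\")\n",
"\n",
"labels = [\"category_1\", \"category_2\", \"category_3\"]\n",
"for col in [\"carat\", \"depth\"]:\n",
"    # qcut splits on quantiles, so each of the 3 bins holds (roughly)\n",
"    # the same number of rows -- i.e., equal-frequency binning.\n",
"    df2[col] = pd.qcut(df2[col], q=3, labels=labels)\n",
"\n",
"df2.head(10)"
]
},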
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Part B (5 points)\n",
"\n",
"Compute the impurity of the `price` target feature."
]
},
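{
"cell_type": "markdown",
"metadata": {},
"source": [
"A sketch of the entropy computation for Part B, assuming the discretized data frame `df2` from the Part A sketch and base-2 logarithms, as in the textbook."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# A sketch of the Part B entropy (impurity) computation,\n",
"# assuming `df2` from the Part A sketch above.\n",
"import numpy as np\n",
"\n",
"# Entropy: H = -sum(p * log2(p)) over the class proportions of `price`.\n",
"probs = df2[\"price\"].value_counts(normalize=True)\n",
"entropy = -(probs * np.log2(probs)).sum()\n",
"print(\"Entropy of price:\", round(entropy, 4))"
]
},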
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Part C (20 points)\n",
"\n",
"**IMPORTANT NOTE: For Parts C and D below, you will not get any points for solutions not presented in the required table format.** \n",
"\n",
"In this part, you will determine the root node for your decision tree.\n",
"\n",
"Your answer to this part needs to be a table and it needs to be called **\"df_splits\"**. Also, it needs to have the following 4 columns:\n",
"- split\n",
"- remainder\n",
"- info_gain\n",
"- is_optimal (True or False - only the optimal split's is_optimal flag needs to be True and the others need to be False)\n",
"\n",
"In your **\"df_splits\"** table, you should have **one row for each descriptive feature in the dataset**. As an example for your **\"df_splits\"** table, consider the `spam prediction` example in Table 4.2 in the textbook (**FIRST** Edition) on page 121, which was also covered in lectorials. The `df_splits` table would look something like the table below.\n",
"\n",
"|split| remainder | info_gain| is_optimal |\n",
"|---|---|---|---|\n",
"|suspicious words | ? | ? | True |\n",
"|unknown sender | ? | ? | False |\n",
"|contains images | ? | ? | False |\n",
"\n",
"**HINT:** Your `df_splits` table should have 4 rows."
]
},
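{
"cell_type": "markdown",
"metadata": {},
"source": [
"A sketch of the Part C split evaluation follows, assuming `df2` and the `entropy` value from the sketches above. For each descriptive feature, the remainder is the partition-size-weighted average entropy of `price`, and the information gain is the target entropy minus that remainder."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# A sketch of the Part C split evaluation, assuming `df2` and `entropy`\n",
"# from the sketches above.\n",
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"def entropy_of(series):\n",
"    p = series.value_counts(normalize=True)\n",
"    return -(p * np.log2(p)).sum()\n",
"\n",
"rows = []\n",
"for feature in [\"carat\", \"cut\", \"color\", \"depth\"]:\n",
"    # Remainder: weighted average entropy of the target in each partition.\n",
"    remainder = sum(\n",
"        (len(part) / len(df2)) * entropy_of(part[\"price\"])\n",
"        for _, part in df2.groupby(feature, observed=True)\n",
"    )\n",
"    rows.append({\"split\": feature, \"remainder\": remainder,\n",
"                 \"info_gain\": entropy - remainder})\n",
"\n",
"df_splits = pd.DataFrame(rows)\n",
"df_splits[\"is_optimal\"] = df_splits[\"info_gain\"] == df_splits[\"info_gain\"].max()\n",
"df_splits"
]
},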
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Part D (15 points)\n",
"\n",
"In this part, you will **assume** the `carat` descriptive feature is at the root node (**NOTE:** This feature may or may not be the optimal root node, but you will just assume it is). Under this assumption, you will make predictions for the `price` target variable. \n",
"\n",
"Your answer to this part needs to be a table and it needs to be called **\"df_pred\"**. Also, it needs to have the following 6 columns:\n",
"- leaf_condition\n",
"- low_price_prob (probability)\n",
"- medium_price_prob\n",
"- high_price_prob\n",
"- premium_price_prob\n",
"- leaf_prediction\n",
"\n",
"As an example, continuing the spam prediction problem, assume the `suspicious words` descriptive feature is at the root node. The `df_pred` table would look something like the table below.\n",
"\n",
"|leaf_condition| spam_prob | ham_prob | leaf_prediction |\n",
"|---|---|---|---|\n",
"|suspicious words == true | ? | ? | ? |\n",
"|suspicious words == false | ? | ? | ? |\n",
"\n",
"**HINT:** Your `df_pred` table should have 3 rows."
]
},
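{
"cell_type": "markdown",
"metadata": {},
"source": [
"A sketch of the Part D leaf predictions, assuming `df2` from the Part A sketch (so `carat` has the 3 discretized levels) and that the `price` levels are exactly the four strings given in Question 1."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# A sketch of the Part D leaf predictions with `carat` assumed at the root,\n",
"# reusing `df2` from the Part A sketch.\n",
"import pandas as pd\n",
"\n",
"rows = []\n",
"for level, part in df2.groupby(\"carat\", observed=True):\n",
"    # Class probabilities of `price` inside this leaf.\n",
"    probs = part[\"price\"].value_counts(normalize=True)\n",
"    rows.append({\n",
"        \"leaf_condition\": f\"carat == {level}\",\n",
"        \"low_price_prob\": probs.get(\"low\", 0.0),\n",
"        \"medium_price_prob\": probs.get(\"medium\", 0.0),\n",
"        \"high_price_prob\": probs.get(\"high\", 0.0),\n",
"        \"premium_price_prob\": probs.get(\"premium\", 0.0),\n",
"        \"leaf_prediction\": probs.idxmax(),  # majority class in the leaf\n",
"    })\n",
"\n",
"df_pred = pd.DataFrame(rows)\n",
"df_pred"
]
},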
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.5"
}
},
"nbformat": 4,
"nbformat_minor": 4
}