CS 301 Fall 2022 Sample Exam Solution

Please work through this paper in detail, with explanations. The answers are provided for most questions except question 2. Please solve all of them with detailed explanations and the steps for how to solve them. Please provide notes and explanations for my understanding; I want to learn this. Perhaps include a PDF or scanned file with the explanations.
The file is attached.

Thank you.



CS 301 Fall 2022 Sample Exam Solution
Time: 2 hours 30 minutes. Total points: 45.
Name: --------------------------------------------------------

1. Multiple choice or single-answer questions. You do not need to show your work. (2+1+2+2+2+1 = 10)

a. Which one has the highest entropy (select all that apply)?
i. A fair coin
ii. An unfair coin
iii. A 6-sided fair die
iv. A 4-sided fair die

b. The accuracy of an unpruned decision tree is 40% on the test data, while that on the training data is close to 60%. Write the accuracy expression for this using the 0.632 bootstrap.
acc = (1/k) * Σ_{i=1..k} (0.632 × 0.4 + 0.368 × 0.6)

c. The following data demonstrates the relationship between the Math Score and the Age of the test writers. Write down the model equation.
score = -0.8157 × Age + 79.56886

d. The dataset contains the following 10 training points (each point consists of its coordinates and a class label):
X1(1,1) Male, X2(2,2) Male, X3(2,2.5) Female, X4(3,7) Female, X5(9,9) Male, X6(8,9) Male, X7(3,3) Male, X8(9,9) Male, X9(9,10) Female, X10(9,5) Female
Classify the test point X(8,7) using k (=3) nearest-neighbor classification. Use Manhattan distance.
Answer: Male

e. A transaction database contains three transactions as follows: {<a1, …, a100>, <a1, …, a100>, <a1, …, a50>}. Min support = 2. Write down the closed itemsets and the corresponding support counts. Write down the max itemsets.
Answer: Closed itemsets: <a1, …, a100>: 2 and <a1, …, a50>: 3. Max itemsets: <a1, …, a100>: 2.

f. Select all that are true.
i. Supervised discretization could be obtained by applying information-gain-based criteria.
ii. Supervised discretization could be obtained by simply finding the pure intervals where the class labels are the same.
iii. Discretization could be obtained by applying the Elbow method.

2. Given the dataset below, a support threshold of 3, and a confidence of 100%, generate all the association rules that describe weather conditions with the play outcome. Find the closed and the max patterns. (6+2+2 = 10)
We have solved exactly this in class.

3. Given to us are the following objects (A-I). Run AGNES over them and compute the dendrogram (use single link as the inter-cluster distance). Show each step and the computed dissimilarity matrix. (10)

Object  Age  Test-1  Test-2  Standing   Gender  AP courses taken
A       20   P       P       Junior     M       Chem, Math
B       19   P       P       Sophomore  F       CS, Math
C       19   F       P       Freshman   F       English
D       18   F       F       Freshman   F
E       25   F       F       Senior     M       Math, Physics
F       24   F       F       Senior     M       Math
G       21   F       P       Junior     M       CS, Chem
H       21   P       F       Sophomore  M       Physics
I       20   F       P       Junior     F       English

We have done a similar problem before and run AGNES on top of it.

4. Given below is historical data that determines the play decision based on weather parameters.
a. Classify X (Outlook=Sunny, Temperature=Cool, Humidity=High, Wind=Strong) using a Naïve Bayes classifier. (5)
Solution on the next page.
b. Compute the Naïve Bayes scores (rounded to two decimal points) produced by D1, D2, D3, D4, D5, considering their respective actual class to be their predicted class. Draw the ROC curve of D1-D5. (5)
i. Produce the scores of D1, D2, D3, D4, D5.
ii. ROC curve as we have covered in class.
c. If the cost of a false positive (the positive class is Play=Yes) is 9 and that of a false negative is 1, find the best threshold computed for the records considered in b. (5)
You can ignore this one. I won't ask this question, as it was not covered.

Answer (Karthi, answered Dec 16, 2022):

1.
a. iii. A 6-sided fair die
In information theory, entropy measures the amount of uncertainty in a random variable. For a fair distribution over n outcomes, the entropy is log2(n) bits, the maximum possible for that outcome space. Comparing the options: a fair coin has log2(2) = 1 bit, a 4-sided fair die has log2(4) = 2 bits, and a 6-sided fair die has log2(6) ≈ 2.58 bits. An unfair coin has less than 1 bit, because some of its outcomes are more likely than others. The 6-sided fair die therefore has the highest entropy of the four options.
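For the notes you asked for, here is a minimal Python sketch (not part of the original exam) that computes the entropy of each option directly. The 0.9/0.1 bias for the unfair coin is an arbitrary illustrative assumption, since the question does not specify it:

```python
import math

def entropy(probs):
    """Shannon entropy in bits: H = -sum(p * log2(p)) over outcomes with p > 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

distributions = {
    "fair coin": [0.5, 0.5],
    "unfair coin (assumed 0.9/0.1)": [0.9, 0.1],  # assumed bias, for illustration only
    "6-sided fair die": [1 / 6] * 6,
    "4-sided fair die": [1 / 4] * 4,
}

for name, probs in distributions.items():
    print(f"{name}: {entropy(probs):.3f} bits")
# The 6-sided fair die prints the largest value (~2.585 bits).
```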
b.
The accuracy expression using the 0.632 bootstrap for the given scenario is:
acc = (1/k) * Σ_{i=1..k} (0.632 × 0.4 + 0.368 × 0.6)
The 0.632 bootstrap is a method for estimating the accuracy of a classifier. It repeatedly samples the training data with replacement to form k bootstrap samples, builds a classifier on each sample, and then averages a weighted combination of each classifier's accuracy on its held-out (test) instances and on its training instances. The 0.632 factor comes from the fact that a bootstrap sample of size n drawn with replacement contains, on average, about 63.2% of the distinct instances of the original data set; the remaining ~36.8% of instances are held out for testing. In the given scenario the test accuracy is 40% and the training accuracy is close to 60%, so each term of the average is 0.632 × 0.4 + 0.368 × 0.6 = 0.4736, giving an overall accuracy estimate of about 47.4%.
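A minimal sketch of the arithmetic, using the accuracies from the question (the value of k is arbitrary here, since every bootstrap round uses the same two accuracies):

```python
# 0.632 bootstrap accuracy estimate: average over k rounds of
# 0.632 * test_accuracy + 0.368 * train_accuracy.
test_acc, train_acc = 0.4, 0.6
k = 10  # number of bootstrap rounds; any k gives the same result here

estimate = sum(0.632 * test_acc + 0.368 * train_acc for _ in range(k)) / k
print(f"0.632 bootstrap estimate: {estimate:.4f}")  # 0.4736
```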
c.
The given data demonstrates a linear relationship between Math Score and Age of the test writers. The model equation for this relationship is:
score = -0.8157 × Age + 79.56886
This equation represents a line with a slope of -0.8157 and a y-intercept of 79.56886. Given an age, the equation predicts the corresponding math score. For example, if a test writer is 20 years old, their predicted math score is -0.8157 × 20 + 79.56886 = -16.314 + 79.56886 ≈ 63.25.
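A one-function sketch of the prediction, with the coefficients taken from the question (the sample ages below are illustrative, not from the exam's data table):

```python
def predict_score(age):
    """Linear model from the question: score = -0.8157 * age + 79.56886."""
    return -0.8157 * age + 79.56886

for age in (18, 20, 25):  # illustrative ages
    print(f"age {age}: predicted score {predict_score(age):.2f}")
# age 20 -> 63.25
```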
d.
To classify the test point X(8,7) using k-nearest-neighbor classification with k=3 and Manhattan distance, we first calculate the distance between the test point and each of the 10 training points. The Manhattan distance between two points (x1, y1) and (x2, y2) is given by |x1-x2| + |y1-y2|. Using this formula, the distances between X(8,7) and the training points are:
X1(1,1): |1-8| + |1-7| = 13
X2(2,2): |2-8| + |2-7| = 11
X3(2,2.5): |2-8| + |2.5-7| = 10.5
X4(3,7): |3-8| + |7-7| = 5
X5(9,9): |9-8| + |9-7| = 3
X6(8,9): |8-8| + |9-7| = 2
X7(3,3): |3-8| + |3-7| = 9
X8(9,9): |9-8| + |9-7| = 3
X9(9,10): |9-8| + |10-7| = 4
X10(9,5): |9-8| + |5-7| = 3
Next, we sort the distances in ascending order and select the k=3 closest training points. The closest point is X6 (Male) at distance 2, followed by a three-way tie at distance 3 among X5 (Male), X8 (Male), and X10 (Female). Whichever two of the tied points fill the remaining slots, at least two of the three neighbors are Male, so the majority class label is "Male". The test point X(8,7) is therefore classified as "Male" by 3-nearest-neighbor classification with Manhattan distance.
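A minimal sketch of the same computation in Python (the training points are taken from the question):

```python
from collections import Counter

# Training points from the question: (x, y, label).
train = [
    (1, 1, "Male"), (2, 2, "Male"), (2, 2.5, "Female"), (3, 7, "Female"),
    (9, 9, "Male"), (8, 9, "Male"), (3, 3, "Male"), (9, 9, "Male"),
    (9, 10, "Female"), (9, 5, "Female"),
]

def knn_classify(point, train, k=3):
    """Majority vote among the k nearest neighbors by Manhattan distance."""
    dists = sorted(
        (abs(x - point[0]) + abs(y - point[1]), label) for x, y, label in train
    )
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

print(knn_classify((8, 7), train))  # Male
```

Note that the tuple sort breaks distance ties arbitrarily (here, alphabetically by label); as argued above, the prediction is Male under any tie-break.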
e.
In a transaction database, an itemset is a set of items that occur together in a transaction, and the support of an itemset is the number of transactions that contain it. A closed itemset is a frequent itemset such that no proper superset of it has the same support. A maximal (max) itemset is a frequent itemset such that no proper superset of it is frequent.
Given the transaction database {<a1, …, a100>, <a1, …, a100>, <a1, …, a50>} and a minimum support of 2, the closed itemsets and their corresponding support counts are:
<a1, …, a100>: 2
<a1, …, a50>: 3
Both itemsets are closed because no proper superset of either has the same support: the only superset of <a1, …, a50> that appears in the database is <a1, …, a100>, and its support (2) differs from 3. The max itemsets are:
<a1, …, a100>: 2
This is the only max itemset because it is frequent and no proper superset of it is frequent; <a1, …, a50> is not maximal because its frequent superset <a1, …, a100> exists.
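As a study aid, here is a brute-force sketch on a scaled-down analogue of the database (a1..a10 standing in for a1..a100, and a1..a5 for a1..a50, since enumerating all subsets of 100 items is infeasible). The closed/max structure is the same as in the question:

```python
from itertools import combinations

# Scaled-down analogue of the question's database:
# two copies of {a1..a10} and one copy of {a1..a5}.
transactions = [frozenset(range(1, 11)), frozenset(range(1, 11)), frozenset(range(1, 6))]
min_sup = 2
items = sorted(set().union(*transactions))

def support(itemset):
    """Number of transactions containing the itemset."""
    return sum(itemset <= t for t in transactions)

# Enumerate all frequent itemsets (2^10 candidates -- fine at this scale).
frequent = {
    frozenset(c): support(frozenset(c))
    for r in range(1, len(items) + 1)
    for c in combinations(items, r)
    if support(frozenset(c)) >= min_sup
}

# Closed: no proper superset with the same support.
closed = [s for s in frequent
          if not any(s < t and frequent[t] == frequent[s] for t in frequent)]
# Maximal: no frequent proper superset.
maximal = [s for s in frequent if not any(s < t for t in frequent)]

print("closed:", [(sorted(s), frequent[s]) for s in closed])    # {a1..a5}:3, {a1..a10}:2
print("maximal:", [(sorted(s), frequent[s]) for s in maximal])  # {a1..a10}:2
```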
f.
All three statements are true:
i. Supervised discretization could be obtained by applying information-gain-based criteria.
ii. Supervised discretization could be obtained by simply finding the pure intervals where the class labels are the same.
iii. Discretization could be obtained by applying the Elbow method.
Discretization is the process of dividing a continuous variable into a set of discrete intervals or bins, which can simplify complex data and make it easier to analyze. The first two methods are supervised because they use the class labels: information-gain-based criteria choose the split points that are most useful for predicting the class label, and pure-interval finding chooses intervals in which all observations share the same class label. The third statement describes unsupervised discretization: the values can be clustered (for example, with k-means), with the Elbow method used to choose the number of clusters, i.e., the number of bins. Note that statement iii says "discretization", not "supervised discretization", so it is also true.
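To make statement i concrete, here is a minimal sketch of information-gain-based supervised discretization; the age/label data at the bottom is made up for illustration. It scans every candidate threshold between consecutive distinct values and keeps the binary split with the highest information gain:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_split(values, labels):
    """Return (gain, threshold) of the binary split with the highest information gain."""
    pairs = sorted(zip(values, labels))
    base = entropy(labels)
    best = (-1.0, None)
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold can separate equal values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [l for v, l in pairs[:i]]
        right = [l for v, l in pairs[i:]]
        gain = base - (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        best = max(best, (gain, threshold))
    return best

# Toy data (assumed for illustration): ages with pass/fail labels.
ages = [18, 19, 20, 21, 24, 25]
labels = ["P", "P", "P", "F", "F", "F"]
gain, threshold = best_split(ages, labels)
print(f"split at {threshold} with information gain {gain:.3f} bits")  # 20.5, 1.000 bits
```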
2.
To generate all the association rules that describe weather conditions with the play outcome, we first need to find all frequent itemsets in the dataset using the support threshold of 3. For example, the itemset {Outlook=Sunny, Play=No} has a support of 3, since it appears in transactions 1, 2, and 8. Similarly, the itemset {Outlook=Overcast, Play=Yes}...
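The answer above is truncated. As a sketch of the mechanics (not the original solution), the following assumes the standard 14-record PlayTennis dataset that the question appears to reference, since the table itself is not reproduced here. It mines single-condition rules of the form condition -> Play=... at support >= 3 and confidence 100%; a full Apriori pass would also consider multi-condition antecedents:

```python
from collections import Counter

# Assumed data: the standard 14-record PlayTennis dataset
# (Outlook, Temperature, Humidity, Wind, Play).
records = [
    ("Sunny", "Hot", "High", "Weak", "No"),
    ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"),
    ("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "Normal", "Weak", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"),
    ("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Overcast", "Hot", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Strong", "No"),
]
attrs = ["Outlook", "Temperature", "Humidity", "Wind"]
MIN_SUP, MIN_CONF = 3, 1.0

# Count single conditions and (condition, play) pairs.
cond_count, pair_count = Counter(), Counter()
for *values, play in records:
    for attr, value in zip(attrs, values):
        cond_count[(attr, value)] += 1
        pair_count[(attr, value, play)] += 1

# Emit rules  attr=value -> Play=play  meeting both thresholds.
for (attr, value, play), sup in sorted(pair_count.items()):
    conf = sup / cond_count[(attr, value)]
    if sup >= MIN_SUP and conf >= MIN_CONF:
        print(f"{attr}={value} -> Play={play}  (support {sup}, confidence {conf:.0%})")
# Prints: Outlook=Overcast -> Play=Yes  (support 4, confidence 100%)
```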