Please see attached.


1 Problem 1: Decision Trees [5 Points]

1.1 Conditional Entropy [2 Points]

Recall the definition of entropy,
$$H(Y) = -\sum_{i=1}^{k} P(Y = y_i) \log_2 P(Y = y_i)$$
Using the definitions of conditional and joint entropy, prove that
$$H(X, Y) = H(X) + H(Y \mid X)$$

1.2 Information Gains [3 Points]

Suppose that a company is developing a new game. They conduct a survey to find out whether the game would be well received and find that of 400 responses, 300 enjoyed the game, while 100 did not.

(a) Consider the random variable L that indicates whether a user liked the game (1) or not (0). What is the entropy of L? (use $\log_2$) [1 Point]

(b) The company is trying to build a decision tree to classify whether a new user would like the new game. They have a binary attribute: age range $A \in \{15\text{-}20, 20\text{-}25\}$. The split of 15-20 to 20-25 for the positive class (L = 1) is {240, 60} and for the negative class (L = 0) is {60, 40}. What is the information gain for splitting on A? Would A be a valid feature to split on? [1 Point]

(c) They also found that the binary attribute game console $G \in \{\text{Xbox}, \text{Playstation}\}$ determined a user's liking of a game. The split of Xbox to Playstation users for the positive class (L = 1) was {100, 200} and for the negative class (L = 0) was {80, 20}. Would this be a better splitting attribute than the age range? [1 Point]

2 Problem 2: Conditional Independence & MLE [4 Points]

(a) Here we define pairwise independence as $p(X_2 \mid X_1) = p(X_2)$, and we define mutual independence as $p(X_{1:n}) = \prod_{i=1}^{n} p(X_i)$. Please specify whether pairwise independence between all pairs of variables implies mutual independence, and why. [2 Points]

(b) Suppose we have n i.i.d. data samples $X_i$, where $X_i \in \{0, 1\}$ follows the binomial distribution $P(X_i = x_i) = p^{x_i}(1 - p)^{1 - x_i}$. What is the maximum likelihood estimator of p? Please elaborate on the calculation process. [2 Points]

3 Problem 3: Linear Regression [2 Points]

Consider a linear regression model with a 2-dimensional response vector $y_i \in \mathbb{R}^2$. Suppose we have some binary input data, $x_i \in \{-1, 1\}$. The training data is as follows:

    x     y
   -1     (-1, -1)^T
   -1     (-1, -2)^T
    1     ( 1,  1)^T
    1     ( 1, -1)^T
    1     ( 2,  1)^T

Let us embed each $x_i$ into 2D using the following basis function: $\phi(-1) = (0, 1)^T$, $\phi(1) = (1, 0)^T$. The model becomes $\hat{y} = W^T \phi(x)$, where W is a $2 \times 2$ matrix. What should be the appropriate value of W?

Hint: $A^{-1} = \frac{1}{A_{11}A_{22} - A_{12}A_{21}} \begin{bmatrix} A_{22} & -A_{12} \\ -A_{21} & A_{11} \end{bmatrix}$

4 Problem 4: Logistic Regression [2 Points]

The multinomial logistic classifier uses a generalization of the sigmoid, called the softmax function, to compute the probability $p(y = c \mid x)$. The softmax function takes a vector $z = [z_1, z_2, \ldots, z_k]$ of k arbitrary values and maps them to a probability distribution, with each value in the range (0, 1) and all the values summing to 1:
$$\mathrm{softmax}(z_i) = \frac{\exp(z_i)}{\sum_{j=1}^{k} \exp(z_j)}, \qquad 1 \le i \le k$$
However, in practice, if we implement the softmax function directly from the equation above in a programming language like Python, it might not give the correct output (sometimes it cannot even return a valid number). What is the problem with implementing the softmax function this way in practice? How should we fix this problem (explain why the proposed solution works)?

5 Problem 5: Logistic Regression vs Decision Tree [2 Points]

Consider the data shown in the figure below.

Figure 1: Justify whether (a) logistic regression and (b) a decision tree will be able to perfectly classify the data.

6 Problem 6: Perceptron [5 Points]

Suppose we have N points $x_i$ ($x_i \in \mathbb{R}^n$) with labels $y_i \in \{-1, +1\}$. We are training a perceptron whose weight is $\beta = (\beta_1, \beta_0)$ on the data $x_i^*$, where $x_i^* = (x_i, 1)$.

(a) What is the expression for updating the weights of the perceptron, assuming we are using hinge loss and SGD? Justify your answer. [2 Points]

(b) Suppose we have N points $x_i$ ($x_i \in \mathbb{R}^n$) with labels $y_i \in \{-1, +1\}$. Assume the perceptron learning algorithm converges to a separating hyperplane $f(x) = \beta^T x^* = 0$ in a finite number of steps. Let $z_i = \frac{x_i^*}{\|x_i^*\|}$. Show that there exists a $\beta_{\mathrm{sep}}$ such that $y_i \beta_{\mathrm{sep}}^T z_i \ge 1$ for all $i$. [3 Points]

7 Problem 7: Multilayer Perceptron [3 Points]

A multilayer perceptron is a network composed of multiple layers of perceptrons.

Figure 2: Multilayer Perceptron.

(a) Given Figure 2, assume that you are given two features $x_0$ and $x_1$ and that the third input is the bias term. Express this multilayer perceptron as a single-layer perceptron. Note: W represents the weights of the perceptron; the A, B, and C nodes are only plain summation operations. [2 Points]

(b) What inferences can be drawn from the previous part for a general multilayer perceptron? [1 Point]

8 Problem 8: Linear SVM [4 Points]

Given 2-dimensional input data points $S_1 = \{(1, 4), (1, 5), (2, 4), (2, 5), (3, 3)\}$ and $S_2 = \{(3, 2), (3, 1), (4, 1), (5, 1), (6, 1), (6, 2)\}$, where $S_1$ contains the data points from the positive class and $S_2$ contains the data points from the negative class:

Figure 3: Plot of points $S_1 = \{(1, 4), (1, 5), (2, 4), (2, 5), (3, 3)\}$, $S_2 = \{(3, 2), (3, 1), (4, 1), (5, 1), (6, 1), (6, 2)\}$

(a) Suppose you are using a hard-margin linear SVM (i.e., a linear SVM that maximizes its margin while ensuring all data points are on the correct side of the margin). Draw three lines on the above diagram, showing the classification boundary and the two sides of the margin. Circle the support vector(s). [2 Points]

(b) Using the linear SVM classifier notation $\mathrm{sign}(w^T x + b)$, calculate the values of $w$ and $b$ learned in part (a). [2 Points]

9 Problem 9: Binary Logistic Regression vs Softmax [4 Points]

(a) Show that the softmax function with 2 classes is equivalent to binary logistic regression. [2 Points]

(b) Show that when $Y = \{-1, 1\}$,
$$P(y_i \mid x_i) = \frac{1}{1 + \exp(-y_i f(x_i))}$$
is equivalent to
$$P(y_i \mid x_i) = \frac{1}{1 + \exp(-f(x_i))}$$
when $Y = \{0, 1\}$, where $f(x) = w_0 + \sum_i w_i x_i$. [2 Points]

10 Problem 10: Kernelization [6 Points]

(a) Suppose $K_1$ and $K_2$ are valid kernels. Prove that:
1. $K(x, y) = K_1(x, y) - K_2(x, y)$ is a valid kernel. [2 Points]
2. $K(x, y) = K_1(x, y) \cdot K_2(x, y)$, i.e., the product of 2 valid kernels, is a valid kernel. [2 Points]

(b) Let $u$ and $v$ be arbitrary 2-dimensional vectors from the input space. Prove that any n-degree ($n > 2$) polynomial in $(u \cdot v)$, i.e.,
$$P_n(u, v) = a_n (u \cdot v)^n + a_{n-1} (u \cdot v)^{n-1} + \cdots + a_k (u \cdot v)^k + \cdots + a_1 (u \cdot v) + a_0$$
is a valid kernel function for this input space. Specifically, show that there exists some mapping $\Phi$ such that $P_n(u, v) = \Phi(u) \cdot \Phi(v)$. [2 Points]

11 Problem 11: Linear Regression - Mean Square Error [3 Points]

Charlie just finished the lecture on linear regression in the CS 4641 class. To familiarise himself with the concepts better, he thought of training a regression model on some dataset. Specifically, he defined the linear regression model F and the corresponding loss function L as follows:
$$F(X, W) = w_0 + \sum_{i=1}^{D} w_i x_i$$
$$L(W) = \frac{1}{2} \sum_{n=1}^{N} \big(F(X_n, W) - Y_n\big)^2$$
where $w_i$ is the scalar value of the $i$-th dimension of the $(D+1)$-dimensional weight vector W; $x_i$ is the scalar value of the $i$-th dimension of the D-dimensional input vector X; N is the total number of examples in the training set; and $X_n$ is the D-dimensional input vector and $Y_n$ the scalar regression output corresponding to the $n$-th training example.

Charlie's friend, Chaplin, wanted to have some fun, so he corrupted the above-mentioned training dataset by adding Gaussian noise $\eta_i$ sampled from $\mathcal{N}(0, \sigma^2)$ to $x_i$ for all N training examples and for each dimension $i$. More specifically, the expectation of the $i$-th noise variable is $E(\eta_i) = 0$, its variance is $\mathrm{Var}(\eta_i) = \sigma^2$, and different $\eta_i$'s are independent, i.e. $E(\eta_i \eta_j) = 0$ for $i \ne j$. Show that the expected difference between L for the original training data and the corrupted training data is a weight-decay regularisation term; precisely, a constant multiple of the sum of squares of all $w_i$'s except $w_0$.

12 Problem 12: SVM with Slack Variables [6 Points]

Consider a training dataset collected from error-prone sensors for two different class labels; the data is plotted in Figure 4. Assuming that we train an SVM with a quadratic kernel, answer the following questions qualitatively, where C refers to the slack penalty.

Figure 4: Training dataset

(a) Plot a rough sketch of the decision boundary for $C \to \infty$ (or a very large value of C) and justify your answer. [2 Points]

(b) Plot a rough sketch of the decision boundary for $C \to 0$ (or a very small value of C) and justify your answer. [2 Points]

(c) In a classification task, which case is expected to perform better: $C \to 0$ or $C \to \infty$? Justify. (Remember that the data points are collected from error-prone sensors, which means that the two outliers are likely to be wrong results produced by the sensor.) [2 Points]
Risavsignh answered on Nov 01 2021
Solution
1.1 - Conditional Entropy
Given,
$$H(Y) = -\sum_{i=1}^{k} P(Y = y_i) \log_2 P(Y = y_i)$$
and, by definition,
$$H(X, Y) = -\sum_{x}\sum_{y} P(X = x, Y = y) \log_2 P(X = x, Y = y)$$
$$H(Y \mid X) = -\sum_{x}\sum_{y} P(X = x, Y = y) \log_2 P(Y = y \mid X = x)$$
Starting from the LHS,
$$H(X, Y) = -\sum_{x}\sum_{y} P(X = x, Y = y) \log_2 P(X = x, Y = y)$$
From the definition of joint probability, $P(X = x, Y = y) = P(X = x)\, P(Y = y \mid X = x)$. So the equation becomes:
$$H(X, Y) = -\sum_{x}\sum_{y} P(X = x, Y = y) \big[\log_2 P(X = x) + \log_2 P(Y = y \mid X = x)\big]$$
$$= -\sum_{x} P(X = x) \log_2 P(X = x) - \sum_{x}\sum_{y} P(X = x, Y = y) \log_2 P(Y = y \mid X = x)$$
where the first term uses $\sum_{y} P(X = x, Y = y) = P(X = x)$. The two remaining sums are exactly $H(X)$ and $H(Y \mid X)$, so
$$H(X, Y) = H(X) + H(Y \mid X)$$
Hence QED.
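As a quick numerical sanity check (not required for the proof), the sketch below evaluates both sides of the identity on a small joint distribution; the 2x2 probability table is an arbitrary illustrative example and numpy is assumed to be available.

```python
# Numerical sanity check of H(X, Y) = H(X) + H(Y | X).
# The 2x2 joint distribution below is an arbitrary illustrative example.
import numpy as np

P_xy = np.array([[0.30, 0.20],   # P(X=0, Y=0), P(X=0, Y=1)
                 [0.10, 0.40]])  # P(X=1, Y=0), P(X=1, Y=1)

P_x = P_xy.sum(axis=1)              # marginal P(X)
P_y_given_x = P_xy / P_x[:, None]   # conditional P(Y | X)

H_xy = -np.sum(P_xy * np.log2(P_xy))                 # joint entropy H(X, Y)
H_x = -np.sum(P_x * np.log2(P_x))                    # marginal entropy H(X)
H_y_given_x = -np.sum(P_xy * np.log2(P_y_given_x))   # conditional entropy H(Y | X)

print(H_xy, H_x + H_y_given_x)  # both print ~1.846, as the identity requires
```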
1.2 - Information Gains
a. -(300 / 400) * log2(300 / 400) - (100 / 400) * log2(100 / 400) = 0.8113
b. Entropy for split - {15 - 20} = -(240 / 300) * log2(240 / 300) - (60 / 300) * log2(60 / 300) = 0.7219
Entropy for split - {20 - 25} = -(60 / 100) * log2(60 / 100) - (40 / 100) * log2(40 / 100) = 0.9709
Information gain = 0.8113 - (240+60) / (400) * 0.7219 - (60 + 40) / (400) * 0.9709 = 0.02715
Since the information gain is positive, splitting on A reduces the entropy of L, so A is a valid feature to split on.
c. Entropy for split - {Xbox} = -(100 / 180) * log2(100 / 180) - (80 / 180) * log2(80 / 180) = 0.9911
Entropy for split - {Playstation} = -(200 / 220) * log2(200 / 220) - (20 / 220) * log2(20 / 220) = 0.4395
Information gain = 0.8113 - (100+80) / (400) * 0.9911 - (200 + 20) / (400) * 0.4395 = 0.12358
Since the information gain for the game console split (0.12358) is higher than for the age-range split (0.02715), game console would be the better attribute to split on.
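For reference, the entropy and information-gain numbers above can be reproduced with a short script. This is only an illustrative sketch (numpy assumed; the helper names entropy and info_gain are conveniences, not part of the assignment):

```python
# Reproduces the entropy / information-gain arithmetic above from the survey counts.
import numpy as np

def entropy(counts):
    """Entropy (in bits) of a discrete distribution given as raw counts."""
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def info_gain(parent_counts, child_counts_list):
    """Information gain of a split: parent entropy minus weighted child entropies."""
    total = sum(sum(c) for c in child_counts_list)
    weighted = sum(sum(c) / total * entropy(c) for c in child_counts_list)
    return entropy(parent_counts) - weighted

print(entropy([300, 100]))                            # (a) H(L)  ~ 0.8113
print(info_gain([300, 100], [[240, 60], [60, 40]]))   # (b) IG(A) ~ 0.0272
print(info_gain([300, 100], [[100, 80], [200, 20]]))  # (c) IG(G) ~ 0.1236
```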
2 - Problem 2: Conditional Independence & MLE
a. Given pairwise independence, i.e. $p(X_j \mid X_i) = p(X_j)$, or equivalently $p(X_i, X_j) = p(X_i)\, p(X_j)$, for a pair of variables.
If we have two pairwise independent variables, then $p(X_1, X_2) = p(X_1)\, p(X_2)$, which is exactly the mutual-independence factorization for $n = 2$. Hence, in that case the two variables are mutually independent too.
If we have $n \ge 3$ variables with pairwise independence between all pairs, we know $p(X_1, X_2) = p(X_1)p(X_2)$, $p(X_1, X_3) = p(X_1)p(X_3)$, and so on for every pair.
But we cannot conclude from these pairwise factorizations that $p(X_{1:n}) = \prod_{i=1}^{n} p(X_i)$.
A standard counterexample: let $X_1$ and $X_2$ be independent fair coin flips and let $X_3 = X_1 \oplus X_2$ (XOR). Every pair is independent (each pair is uniformly distributed over $\{0, 1\}^2$), yet $P(X_1 = 1, X_2 = 1, X_3 = 1) = 0$ while $P(X_1 = 1)P(X_2 = 1)P(X_3 = 1) = 1/8$.
Hence, pairwise independence between all pairs of variables does not imply mutual independence.
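The XOR counterexample can also be checked empirically. The sketch below (simulated data, numpy assumed; not part of the required answer) verifies that every pair looks independent while the triple does not factorize:

```python
# Empirical check of the XOR counterexample: X1, X2 fair coins, X3 = X1 XOR X2.
# Every pair is independent, but the three variables are not mutually independent.
import itertools
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.integers(0, 2, 200_000)
x2 = rng.integers(0, 2, 200_000)
x3 = x1 ^ x2

samples = np.stack([x1, x2, x3], axis=1)

def joint_prob(cols, values):
    """Empirical probability that samples[:, cols] equals the given values."""
    return np.mean(np.all(samples[:, cols] == values, axis=1))

# Pairwise: P(Xi=1, Xj=1) should be ~0.25 = P(Xi=1) * P(Xj=1) for every pair.
for i, j in itertools.combinations(range(3), 2):
    print(joint_prob([i, j], [1, 1]))      # each ~0.25

# Mutual: P(X1=1, X2=1, X3=1) is exactly 0, but the product of marginals is ~0.125.
print(joint_prob([0, 1, 2], [1, 1, 1]))    # ~0.0
```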
b. Given n i.i.d. samples $X_i \in \{0, 1\}$ that follow
$$P(X_i = x_i) = p^{x_i}(1 - p)^{1 - x_i}$$
Likelihood function:
$$L(p) = \prod_{i=1}^{n} p^{x_i}(1 - p)^{1 - x_i} = p^{\sum_i x_i}(1 - p)^{\,n - \sum_i x_i}$$
Finding the value of $p$ that maximises this expression gives the maximum likelihood estimator $\hat{p}$. To simplify, we take the log of both sides; this preserves the maximiser because log is a monotonically increasing function and the likelihood is strictly positive for $0 < p < 1$. The new equation becomes:
$$\ell(p) = \Big(\sum_{i=1}^{n} x_i\Big)\log p + \Big(n - \sum_{i=1}^{n} x_i\Big)\log(1 - p)$$
Differentiating with respect to $p$ and setting the derivative to zero,
$$\frac{d\ell}{dp} = \frac{\sum_i x_i}{p} - \frac{n - \sum_i x_i}{1 - p} = 0 \quad\Rightarrow\quad \hat{p} = \frac{1}{n}\sum_{i=1}^{n} x_i$$
So the maximum likelihood estimator of $p$ is the sample mean of the observed $x_i$'s.
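As a quick check of the closed form, the sketch below compares the sample-mean estimator with a brute-force grid maximization of the log-likelihood. The data and the true value p = 0.7 are simulated purely for illustration (numpy assumed):

```python
# Sanity check of the Bernoulli MLE: the closed form p_hat = (1/n) * sum(x_i)
# should agree with the p that maximizes the log-likelihood over a fine grid.
import numpy as np

rng = np.random.default_rng(0)
x = rng.binomial(1, 0.7, size=1_000)   # simulated i.i.d. samples, true p = 0.7

p_hat = x.mean()                        # closed-form MLE (sample mean)

grid = np.linspace(0.001, 0.999, 999)
log_lik = x.sum() * np.log(grid) + (len(x) - x.sum()) * np.log(1 - grid)
p_grid = grid[np.argmax(log_lik)]       # numerical maximizer of the log-likelihood

print(p_hat, p_grid)                    # both ~0.7
```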