10-701 Machine Learning: Assignment 1
Due on February 20, 2014 at 12 noon
Barnabas Poczos, Aarti Singh

Instructions: Failure to follow these directions may result in loss of points.
• Your solutions for this assignment need to be in PDF format and should be submitted to Blackboard and to a webpage (to be specified later) for peer review.
• For the programming question, your code should be well documented, so that a TA can understand what is happening.
• DO NOT include any identification (your name, Andrew ID, or email) in either the content or the filename of your submission.

MLE, MAP, Concentration (Pengtao)

1. MLE of Uniform Distributions [5 pts]
Given a set of i.i.d. samples X_1, ..., X_n ~ Uniform(0, θ), find the maximum likelihood estimator of θ.
(a) Write down the likelihood function. (3 pts)
(b) Find the maximum likelihood estimator. (2 pts)

2. Concentration [5 pts]
The instructors would like to know what percentage of the students like the Introduction to Machine Learning course. Let this unknown (but hopefully very close to 1) quantity be denoted by µ. To estimate µ, the instructors created an anonymous survey which contains this question: "Do you like the Intro to ML course? Yes or No". Each student can only answer this question once, and we assume that the distribution of the answers is i.i.d.
(a) What is the MLE estimate of µ? (1 pt)
(b) Let the above estimator be denoted by µ̂. How many students should the instructors ask if they want the estimate µ̂ to be so close to the unknown µ that P(|µ̂ − µ| > 0.1) < 0.05? (4 pts)

3. MAP of Multinomial Distribution [10 pts]
You have just received a loaded 6-sided die from your statistician friend. Unfortunately, he does not remember its exact probability distribution p_1, p_2, ..., p_6. He remembers, however, that he generated the vector (p_1, p_2, ..., p_6) from the following Dirichlet distribution:

    P(p_1, p_2, ..., p_6) = \frac{\Gamma(\sum_{i=1}^{6} u_i)}{\prod_{i=1}^{6} \Gamma(u_i)} \prod_{i=1}^{6} p_i^{u_i - 1} \, \delta\Big(\sum_{i=1}^{6} p_i - 1\Big),

where he chose u_i = i for all i = 1, ..., 6. Here Γ denotes the gamma function, and δ is the Dirac delta. To estimate the probabilities p_1, p_2, ..., p_6, you roll the die 1000 times and observe that side i occurred n_i times (\sum_{i=1}^{6} n_i = 1000).
(a) Prove that the Dirichlet distribution is a conjugate prior for the multinomial distribution.
(b) What is the posterior distribution of the side probabilities, P(p_1, p_2, ..., p_6 | n_1, n_2, ..., n_6)?

Linear Regression (Dani)

1. Optimal MSE Rule [10 pts]
Suppose we knew the joint distribution P_XY. The optimal rule f* : X → Y which minimizes the MSE (mean squared error) is given as

    f^* = \arg\min_f \mathbb{E}\big[(f(X) - Y)^2\big].

Show that f^*(x) = \mathbb{E}[Y | X = x].

2. Ridge Regression [10 pts]
In class, we discussed ℓ2-penalized linear regression:

    \hat{\beta} = \arg\min_{\beta} \sum_{i=1}^{n} (Y_i - X_i \beta)^2 + \lambda \|\beta\|_2^2,

where X_i = [X_i^{(1)} ... X_i^{(p)}].
a) Show that a closed-form expression for the ridge estimator is

    \hat{\beta} = (A^\top A + \lambda I)^{-1} A^\top Y,

where A = [X_1; ...; X_n] and Y = [Y_1; ...; Y_n]. (A numerical sanity check of this formula is sketched after this problem.)
b) An advantage of ridge regression is that a unique solution always exists, since (A^\top A + \lambda I) is invertible. To be invertible, a matrix needs to be full rank. Argue that (A^\top A + \lambda I) is full rank by characterizing its p eigenvalues in terms of the singular values of A and λ.
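The following is a minimal numerical sanity check of the closed form in part a), not part of the assignment. It assumes NumPy is available; the names (A, Y, lam, n, p) and the random data are illustrative. It verifies that the gradient of the penalized objective vanishes at the closed-form estimator.

```python
# Hedged sketch: check beta_hat = (A^T A + lam*I)^{-1} A^T Y on random data by
# verifying that the gradient of sum_i (Y_i - X_i beta)^2 + lam*||beta||^2 is ~0 there.
# All names and sizes here are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
n, p, lam = 50, 5, 2.0
A = rng.normal(size=(n, p))                      # design matrix, rows are the X_i
Y = A @ rng.normal(size=p) + 0.1 * rng.normal(size=n)

# Closed-form ridge estimator from part a)
beta_hat = np.linalg.solve(A.T @ A + lam * np.eye(p), A.T @ Y)

# Gradient of the objective at beta_hat: -2 A^T (Y - A beta) + 2 lam beta
grad = -2 * A.T @ (Y - A @ beta_hat) + 2 * lam * beta_hat
print(np.allclose(grad, 0, atol=1e-8))           # expected: True
```

Note that np.linalg.solve is used rather than forming the inverse explicitly, which is the usual numerically safer way to evaluate this closed form.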
(a)="" prove="" that="" the="" dirichlet="" distribution="" is="" conjugate="" prior="" for="" the="" multinomial="" distribution.="" (b)="" what="" is="" the="" posterior="" distribution="" of="" the="" side="" probabilities,="" p(p1,="" p2,="" .="" .="" .="" ,="" p6|n1,="" n2,="" .="" .="" .="" ,="" n6)?="" linear="" regression="" (dani)="" 1.="" optimal="" mse="" rule="" [10="" pts]="" suppose="" we="" knew="" the="" joint="" distribution="" pxy="" .="" the="" optimal="" rule="" f="" ∗="" :="" x="" →="" y="" which="" minimizes="" the="" mse="" (mean="" square="" error)="" is="" given="" as:="" f∗="arg" min="" f="" e[(f(x)−="" y="" )2]="" show="" that="" f∗(x)="E[Y" |x].="" 2.="" ridge="" regression="" [10="" pts]="" in="" class,="" we="" discussed="" `2="" penalized="" linear="" regression:="" β̂="arg" min="" β="" n∑="" i="1" (yi="" −xiβ)2="" +="" λ‖β‖22="" where="" xi="[X" (1)="" i="" .="" .="" .="" x="" (p)="" i="" ].="" a)="" show="" that="" a="" closed="" form="" expression="" for="" the="" ridge="" estimator="" is="" β̂="(AAA">AAA + λIII)−1AAA>YYY where AAA = [X1; . . . ;Xn] and YYY = [Y1; ...;Yn]. b) An advantage of ridge regression is that a unique solution always exists since (AAA>AAA+λIII) is invertible. To be invertible, a matrix needs to be full rank. Argue that (AAA>AAA+ λIII) is full rank by characterizing its p eigenvalues in terms of the singular values of AAA and λ. Logistic Regression (Prashant) 1. Overfitting and Regularized Logistic Regression [10 pts] a) Plot the sigmoid function 1/(1 + e−wX) vs. X ∈ R for increasing weight w ∈ {1, 5, 100}. A qualitative sketch is enough. Use these plots to argue why a solution with large weights can cause logistic regression to overfit. Page 2 of 6 b) To prevent overfitting, we want the weights to be small. To achieve this, instead of maximum conditional likelihood estimation M(C)LE for logistic regression: max w0,...,wd n∏ i=1 P (Yi|Xi, w0, . . . , wd), we can consider maximum conditional a posterior M(C)AP estimation: max w0,...,wd n∏ i=1 P (Yi|Xi, w0, . . . , wd)P (w0, . . . , wd) where P (w0, . . . , wd) is a prior on the weights. Assuming a standard Gaussian prior N (0, III) for the weight vector, derive the gradient ascent update rules for the weights. 2. Multi-class Logistic Regression [10 pts] One way to extend logistic regression to multi-class (say K class labels) setting is to consider (K-1) sets of weight vectors and define P (Y = yk|X) ∝ exp(wk0 + d∑ i=1 wkiXi) for k = 1, . . . ,K − 1 a) What model does this imply for P (Y = yK |X)? b) What would be the classification rule in this case? c) Draw a set of training data with three labels and the decision boundary resulting from a multi-class logistic regression. (The boundary does not need to be quantitatively correct but should qualitatively depict how a typical boundary from multi-class logistic regression would look like.) Naive Bayes Classifier (Pulkit) 1. Naive Bayes Classification Implementation [25 pts] In this question, you will write a Naive Bayes classifier and verify its performance on a news-group data-set. As you learned in class, Naive Bayes is a simple classification algorithm that makes an assumption about conditional independence of features, but it works quite well in practice. You will implement the Naive Bayes algorithm (Multinomial Model) to classify a news corpus into 20 different categories. 
2. Multi-class Logistic Regression [10 pts]
One way to extend logistic regression to the multi-class setting (say K class labels) is to consider K − 1 sets of weight vectors and define

    P(Y = y_k | X) \propto \exp\Big(w_{k0} + \sum_{i=1}^{d} w_{ki} X_i\Big)  for k = 1, ..., K − 1.

a) What model does this imply for P(Y = y_K | X)?
b) What would be the classification rule in this case?
c) Draw a set of training data with three labels and the decision boundary resulting from multi-class logistic regression. (The boundary does not need to be quantitatively correct, but it should qualitatively depict how a typical boundary from multi-class logistic regression would look.)

Naive Bayes Classifier (Pulkit)

1. Naive Bayes Classification Implementation [25 pts]
In this question, you will write a Naive Bayes classifier and verify its performance on a newsgroup data set. As you learned in class, Naive Bayes is a simple classification algorithm that assumes the features are conditionally independent given the class, yet it works quite well in practice. You will implement the Naive Bayes algorithm (multinomial model) to classify a news corpus into 20 different categories.

Handout: http://www.cs.cmu.edu/~aarti/Class/10701_Spring14/assignments/homework1.tar

You have been provided with the following data files in the handout:
• train.data - Contains bag-of-words data for each training document. Each row of the file gives the number of occurrences of a particular term in a particular document, in the format (docId, termId, Count).
• train.label - Contains a label for each document in the training data.
• test.data - Contains bag-of-words data for each test document. The format of this file is the same as that of train.data.
• test.label - Contains a label for each document in the test data.

For this assignment, you need to write code to complete the following functions:
• logPrior(trainLabels) - Computes the log prior of the training data set. (5 pts)
• logLikelihood(trainData, trainLabels) - Computes the log likelihood of the training data set. (7 pts)
• naiveBayesClassify(trainData, trainLabels, testData) - Classifies the data using the Naive Bayes algorithm. (13 pts)

Implementation Notes
1. We compute log probabilities to prevent numerical underflow when multiplying many probabilities together. You may refer to this article on how to perform addition and multiplication in log space: http://blog.smola.org/post/987977550/log-probabilities-semirings-and-floating-point-numbers
2. You may encounter words during classification that you did not see during training, either for a particular class or overall. Your code should handle this case. Hint: Laplace smoothing.
3. Be memory efficient, and please do not create a document-term matrix in your code; that would require upwards of 600 MB of memory. (A sketch of how notes 1-3 can be handled together follows these notes.)
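The following is a minimal Python sketch of how notes 1-3 can be handled together. It is not the required logPrior/logLikelihood/naiveBayesClassify interface: the function names, file-path arguments, whitespace-separated field assumption, and the assumption that the vocabulary size is known are all illustrative. It streams the (docId, termId, Count) triples, accumulates per-class counts without a dense document-term matrix, and forms Laplace-smoothed log probabilities.

```python
# Hedged sketch of Implementation Notes 1-3: stream the (docId, termId, Count) triples,
# accumulate per-class counts without building a dense document-term matrix, and form
# Laplace-smoothed log probabilities. Names and arguments are illustrative, not the
# required interface; fields are assumed whitespace-separated.
import math
from collections import defaultdict

def load_labels(path):
    """Return a dict docId -> class label, taking docId as the 1-indexed line number."""
    with open(path) as f:
        return {i + 1: int(line) for i, line in enumerate(f)}

def train(data_path, label_path, vocab_size, alpha=1.0):
    labels = load_labels(label_path)

    class_docs = defaultdict(int)                        # documents per class
    class_term = defaultdict(lambda: defaultdict(int))   # class -> termId -> count
    class_total = defaultdict(int)                       # total word count per class

    for label in labels.values():
        class_docs[label] += 1

    with open(data_path) as f:
        for line in f:                                   # one sparse triple per line
            doc_id, term_id, count = map(int, line.split())
            c = labels[doc_id]
            class_term[c][term_id] += count
            class_total[c] += count

    n_docs = len(labels)
    log_prior = {c: math.log(n / n_docs) for c, n in class_docs.items()}

    # Laplace smoothing (note 2): every term gets pseudo-count alpha, so a term unseen
    # in a class still receives a finite log probability, stored once per class.
    log_like = {c: {t: math.log((cnt + alpha) / (class_total[c] + alpha * vocab_size))
                    for t, cnt in terms.items()}
                for c, terms in class_term.items()}
    log_unseen = {c: math.log(alpha / (class_total[c] + alpha * vocab_size))
                  for c in class_term}
    return log_prior, log_like, log_unseen
```

Classification would then score each class c by log_prior[c] plus the sum over a test document's (termId, count) pairs of count times log_like[c].get(termId, log_unseen[c]), and predict the arg max; summing log probabilities is the log-space multiplication referred to in note 1.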
Due to popular demand, we are allowing the solution to be coded in three languages: Octave, Julia, and Python.
• Octave is an industry standard in numerical computing. Unlike MATLAB, it is open source, with similar capabilities and syntax.
• Julia is a popular new open-source language developed for numerical and scientific computing, as well as being effective for general-purpose programming. This is the first time this language is being supported in a CMU course.
• Python is an extremely flexible language and is popular in industry and the data science community. Powerful Python libraries will not be available to you.
For Octave and Julia, a blank function interface has been provided for you (in the handout). For Python, however, you will need to perform the I/O for the data files yourself and ensure the results are written to the correct output files.

Challenge Question
This question is not graded, but it is highly recommended that you try it. In the question above, we use all the terms in the vocabulary to make a prediction, which introduces many noisy features. Although it seems counter-intuitive, classifiers built from a smaller vocabulary often perform better because they generalize better to unseen data: noisy features that are not well represented skew the estimated distribution of words and lead to classification errors. Classification can therefore be improved by selecting a subset of highly informative words. Write a program to select a subset of the words from the vocabulary provided to you, then run your Naive Bayes classification again with this subset and verify the change in accuracy. TF-IDF and information theory are good places to start looking. (A minimal example of one such selection heuristic is sketched below.)
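The following is a minimal sketch of one possible selection heuristic, a TF-IDF-style score, using only the standard library; the function name, the scoring formula, and the cutoff k are illustrative assumptions, not a prescribed approach.

```python
# Hedged sketch of vocabulary selection: score each term by a TF-IDF-style statistic
# computed from the sparse (docId, termId, Count) triples and keep the top k terms.
# Names, the scoring formula, and k are illustrative choices, not part of the handout.
import math
from collections import defaultdict

def select_vocabulary(data_path, k=10000):
    term_freq = defaultdict(int)      # total occurrences of each term
    doc_freq = defaultdict(set)       # documents in which each term appears
    docs = set()

    with open(data_path) as f:
        for line in f:
            doc_id, term_id, count = map(int, line.split())
            term_freq[term_id] += count
            doc_freq[term_id].add(doc_id)
            docs.add(doc_id)

    n_docs = len(docs)
    # TF-IDF-style score: frequent terms appearing in relatively few documents score high.
    score = {t: term_freq[t] * math.log(n_docs / len(doc_freq[t])) for t in term_freq}
    return set(sorted(score, key=score.get, reverse=True)[:k])

# Usage idea: restrict both the training and test triples to terms in this set before
# rerunning the Naive Bayes classifier, then compare the test accuracy.
```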
Support Vector Machines (Jit)

1. SVM Matching [15 pts]
Figure 1 (at the end of this problem) plots the SVM decision boundaries that result from using different kernels and/or different slack penalties. In Figure 1, there are two classes of training data, with labels y_i ∈ {−1, 1}, represented by circles and squares respectively. The SOLID circles and squares represent the support vectors. Determine which plot in Figure 1 was generated by each of the following optimization problems, and explain your reasoning for each choice.

1. The primal soft-margin problem

    \min_{w, b, \xi} \; \frac{1}{2} w \cdot w + C \sum_{i=1}^{n} \xi_i
    s.t. \forall i = 1, ..., n: \; \xi_i \ge 0, \; (w \cdot x_i + b) y_i - (1 - \xi_i) \ge 0,

with C = 0.1.

2. The same primal problem as in 1, with C = 1.

3. The dual problem

    \max_{\alpha} \; \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j K(x_i, x_j)
    s.t. \sum_{i=1}^{n} \alpha_i y_i = 0; \; \alpha_i \ge 0, \; \forall i = 1, ..., n,

where K(u, v) = u · v + (u · v)^2.

4. The same dual problem as in 3, where K(u, v) = exp(−‖u − v‖^2 / 2).

5. The same dual problem as in 3, where K(u, v) = exp(−‖u − v‖^2).

[Figure 1: Induced Decision Boundaries - the five plots are not reproduced in this text extract.]
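Because the plots of Figure 1 are not included in this extract, the following sketch shows how qualitatively similar boundaries could be regenerated with scikit-learn on toy data to build intuition for the matching. The toy data, the use of C = 1 in place of the unbounded hard-margin dual, and the mapping of problem 3's kernel to a kernel callable are all illustrative assumptions; this is in no way part of the assignment or the data behind Figure 1.

```python
# Hedged sketch: regenerate qualitatively similar decision boundaries to the five
# settings above with scikit-learn on toy data. Everything here (data, figure layout,
# the kernel callable, C = 1 standing in for the unconstrained dual) is an assumption.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 2))
y = np.where(X[:, 0] ** 2 + X[:, 1] ** 2 > 1.2, 1, -1)   # roughly circular class split

def mixed_kernel(U, V):
    """K(u, v) = u.v + (u.v)^2, the kernel of problem 3."""
    G = U @ V.T
    return G + G ** 2

# rbf gamma: exp(-gamma*||u-v||^2), so gamma=0.5 matches problem 4 and gamma=1 problem 5.
settings = [
    ("1: linear, C = 0.1",     SVC(kernel="linear", C=0.1)),
    ("2: linear, C = 1",       SVC(kernel="linear", C=1.0)),
    ("3: u.v + (u.v)^2",       SVC(kernel=mixed_kernel, C=1.0)),
    ("4: exp(-||u-v||^2 / 2)", SVC(kernel="rbf", gamma=0.5, C=1.0)),
    ("5: exp(-||u-v||^2)",     SVC(kernel="rbf", gamma=1.0, C=1.0)),
]

xx, yy = np.meshgrid(np.linspace(-3, 3, 200), np.linspace(-3, 3, 200))
grid = np.c_[xx.ravel(), yy.ravel()]

fig, axes = plt.subplots(1, 5, figsize=(20, 4))
for ax, (title, clf) in zip(axes, settings):
    clf.fit(X, y)
    Z = clf.decision_function(grid).reshape(xx.shape)
    ax.contour(xx, yy, Z, levels=[0.0])                  # the decision boundary
    ax.scatter(X[:, 0], X[:, 1], c=y, s=20)
    ax.set_title(title)
plt.show()
```

The C = 0.1 vs. C = 1 contrast and the two RBF widths are the kinds of differences the matching in Figure 1 asks you to recognize.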