IDS575: Machine Learning & Statistical Methods
Homework #2
Instructor: Moontae Lee
Updated: 2/11/2020
Total Points: 150+10

Policy

1. HW2 is due by 03/01/2020 11:59 PM Central Time. One submission per group.
2. You are allowed to work individually or as a group of up to three students.
3. Wider discussions are not prohibited; list the names of any students you discussed with beyond your group members. However, individual students/groups must write their own solutions.
4. Put your write-up and the results of the coding questions in a single PDF file. Compress your R source code into one ZIP file. Each student/group must submit only these two files. (You will lose points if the answers to the coding questions are not included in the PDF report.)
5. If you include graphs, be sure to also include the source code used to generate those figures. Every result must be reproducible!
6. Maximally leverage Piazza so that your questions and answers benefit other students. Stay up to date by checking notifications on both Piazza and the class webpage.
7. Late submissions will be penalized 20% per late day. No assignment will be accepted more than 3 days after its due date; for HW2, that is 03/04/2020 11:59 PM.

Problem 1: Linear Regression and Gradient Learning [30 points]

In class we derived linear regression and various learning algorithms based on gradient descent. In addition to the least squares objective, we also studied its probabilistic perspective, where each observation is assumed to have Gaussian noise. (i.e., the noise of each example is an independent and identically distributed sample from a normal distribution.) In this problem, you are to work with the following regression model, which includes two linear features and one quadratic feature:

y = θ0 + θ1·x1 + θ2·x2 + θ3·x1^2 + ε, where ε ~ N(0, σ^2)

Your goal is to develop a gradient descent learning algorithm that estimates the best parameters θ = {θ0, θ1, θ2, θ3}.

(a) Given the definition of the noise, derive the corresponding mean and variance parameters of the normal distribution for y | x1, x2; θ. Also write down its probability density function.

(b) You are provided with training observations D = {(x1^(i), x2^(i), y^(i)) | 1 ≤ i ≤ m}. Derive the conditional log-likelihood that will later be maximized to make D most likely.

(c) If you omit all the constants that do not depend on the parameters θ, what is the objective function J(θ0, θ1, θ2, θ3) on which you will perform Maximum Likelihood Estimation? Does J look similar to the least squares objective for this problem?

(d) Compute the gradient of J(θ) with respect to each parameter. (Hint: You should evaluate the partial derivatives of J(θ) with respect to each θj for 0 ≤ j ≤ 3.)

(e) [Coding] Develop two learning algorithms for this problem, batch and stochastic gradient descent, on the Auto dataset from Problem 4 of Homework 1. Try to find the two best input features for predicting the output mpg. Compare and report the differences between your own implementation and R's built-in function lm. (Hint: At least your stochastic gradient algorithm should learn parameters θ comparable to the result of calling R's built-in function. Otherwise, try to tune the learning rate α < 1.0.)
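For part (e), the following is a minimal sketch of one possible stochastic gradient descent loop, not a required or complete solution. It assumes the Auto data come from the ISLR package; the choice of horsepower and weight as x1 and x2, the standardization step, the learning rate, and the number of epochs are all illustrative assumptions that you should revisit when selecting your own two best features.

library(ISLR)

# Sketch of SGD for y = theta0 + theta1*x1 + theta2*x2 + theta3*x1^2.
# Feature choices and hyperparameters below are placeholders.
x1 <- as.numeric(scale(Auto$horsepower))  # standardized to help convergence
x2 <- as.numeric(scale(Auto$weight))
y  <- Auto$mpg
m  <- length(y)

theta <- rep(0, 4)    # (theta0, theta1, theta2, theta3)
alpha <- 0.01         # learning rate; tune if the updates diverge

for (epoch in 1:50) {
  for (i in sample(m)) {                   # visit examples in random order
    xi    <- c(1, x1[i], x2[i], x1[i]^2)   # feature vector with intercept
    err   <- sum(theta * xi) - y[i]        # h_theta(x^(i)) - y^(i)
    theta <- theta - alpha * err * xi      # stochastic gradient step
  }
}
theta

# Sanity check against R's built-in least squares fit
lm(y ~ x1 + x2 + I(x1^2))

The batch version replaces the inner loop with a single update per epoch that uses the gradient summed (or averaged) over all m examples.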
Problem 2: Logistic Regression and Evaluations [30 points]

Recall the grading problem of predicting pass/fail in the class. Suppose you collect data for a group of students in the class consisting of two input features, x1 = hours studied and x2 = undergrad GPA. Your goal is to predict the output y ∈ {Pass, Fail}. Suppose that you fit a logistic regression, learning its parameters (θ0, θ1, θ2) = (−6, 0.05, 1).

(a) What is the probability that a student who studies for 40 hours and has a GPA of 3.5 passes the class?

(b) How many hours would the student in part (a) need to study in order to have at least a 50% chance of passing the class?

The following questions must be answered using the Weekly dataset in the ISLR package. It contains 1,089 weekly returns for 21 years, from the beginning of 1990 to the end of 2010. You will use 1990-2008 as the training data and 2009-2010 as the test data.

(c) [Coding] Given the training data, perform a logistic regression where the input features are the five Lag variables plus Volume, and the binary output is Direction. Use the summary function to print out the results. Report the confusion matrix and the accuracy on both the training and test data for the learned model. (Hint: As nothing is specified, your default threshold for deciding Up/Down must be 0.5.)

(d) [Coding] Now run logistic regression five times, each time with only one input feature Lag_j (1 ≤ j ≤ 5). Compute the confusion matrix and the accuracy on both the training and test data for each of the learned models. Which are the best models among the five models here and the earlier model in part (c) in terms of accuracy and F-score, respectively? Does the best model also achieve the best accuracy or F-score on the training data? (Hint: The best model must be chosen based on the test data, not the training data!)

(e) [Coding] Draw six ROC curves, one for each of the 6 models from parts (c) and (d), by varying the threshold. Determine the best model in terms of the Area Under the Curve (AUC). (Note: The 6 curves must be plotted together in a single graph. You may use R's built-in functions to compute the AUC.)

(f) [Coding] Draw six precision-recall curves, one for each of the 6 models from parts (c) and (d), by varying the threshold. Determine the best model in terms of the area under the curve (AUC). (Note: The 6 curves must be plotted together in a single graph. You may use R's built-in functions to compute the AUC.)

(g) Report your observations about the different evaluation measures, and pick the one you think is most appropriate for this problem. Explain why.
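As a starting point for part (c), the sketch below fits the full logistic regression on the Weekly data and builds the test confusion matrix with base R; parts (d)-(f) follow the same pattern with different formulas and varying thresholds. The split by Year and the 0.5 cutoff come from the problem statement; the rest is one possible way to organize the code, not the required one.

library(ISLR)

train <- Weekly$Year <= 2008    # 1990-2008 as training data
test  <- !train                 # 2009-2010 as test data

fit <- glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume,
           data = Weekly, subset = train, family = binomial)
summary(fit)

# Predicted probabilities of "Up" on the test rows, thresholded at 0.5
prob.test <- predict(fit, Weekly[test, ], type = "response")
pred.test <- ifelse(prob.test > 0.5, "Up", "Down")

table(Predicted = pred.test, Actual = Weekly$Direction[test])  # confusion matrix
mean(pred.test == Weekly$Direction[test])                      # test accuracy
# Repeat the last three steps on the training rows for the training metrics.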
Problem 3: Generalized Linear Model [20 points]

Generalized linear models provide a standardized recipe for developing a new learning model and algorithm. In lecture, you have seen three examples: linear regression, logistic regression, and multi-class classification, where the outputs given an input are modeled by Normal, Bernoulli, and Categorical distributions, respectively. In this problem, your output will be a different type of random variable, one that models the number of times an event occurs in an interval of time or space. The distribution you are to use has the following form:

p(y; λ) = e^(−λ) λ^y / y!

where the only parameter λ indicates the rate of the event, which occurs independently. (If you receive on average 4 postal mails every day, then receiving any particular mail will not affect the arrival times of future mails. For this example, λ = 4.)

(a) Prove or disprove that this distribution is in the exponential family. If it is in the exponential family, write down clearly what the natural parameter η, the natural sufficient statistic T(y), the log partition function a(η), and the base measure b(y) are.

(b) Recall that the natural parameter of logistic regression was the log odds ratio of the Bernoulli parameter (e.g., η = log(φ / (1 − φ))). When we expressed φ in terms of η, it corresponded to the logistic function, explaining why the logistic function is the natural choice for binary classification. What is such a natural function for this distribution? Try to express η in terms of λ. (Hint: the mean of this distribution is λ.)

(c) Suppose you are given training data D = {(x^(i), y^(i)) | 1 ≤ i ≤ m}. Evaluate the partial derivative for a single example p(y^(i) | x^(i); θ) with respect to θj. Derive a stochastic gradient ascent algorithm for learning the model parameter θ.

(d) (Optional, +5 pts) In class we saw that linear regression and logistic regression have exactly the same learning algorithm, and this was explained by the fact that they are both GLMs. For the majority of exponential family members, which satisfy T(y) = y, prove that the stochastic gradient algorithm on the log-likelihood shares the same update θj := θj − α(hθ(x) − y)·xj.
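As a reminder for part (a), the one-parameter exponential family is usually written in the form below; your lecture notes may use a slightly different parameterization, so match the symbols to the course's convention before identifying each piece for this distribution.

p(y; η) = b(y) exp( η·T(y) − a(η) )

Showing that the given pmf can, or cannot, be algebraically rearranged into this form is exactly what the proof in part (a) asks for.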
Problem 4: Support Vector Machine [20 points]

Given the following toy dataset, consisting of seven examples with just two features, you are to learn a maximal margin classifier.

i   x1^(i)   x2^(i)   y
1   3        4        Yes
2   2        2        Yes
3   4        4        Yes
4   1        4        Yes
5   2        1        No
6   4        1        No
7   4        3        No

(a) Visualize all seven training examples and sketch the optimal separating hyperplane. Write down the equation of this hyperplane.

(b) State the classification rule between Yes and No. It should be of the form: "Yes" if b + w1·x1 + w2·x2 > 0, and "No" otherwise.

(c) Is the data linearly separable? If so, on top of your sketch, indicate the geometric margin and the support vectors.

(d) Would a slight perturbation of the 6th example affect the optimal margin hyperplane? Justify your answer.

Problem 5: Multiclass Support Vector Machine [50 points]

Text classification is an important problem in information retrieval and natural language processing. The task is to classify each text article into one of the predefined classes/categories. Here you have four sets of articles about: 1) operating systems, 2) vehicles, 3) sports, and 4) politics. Each set consists of various articles collected from the corresponding newsgroup, where each article is represented by Bag-of-Words (BoW) features. That is, after extracting all the relevant words from the entire dataset, feature vectors are defined by the frequencies of the words occurring in each article.

Since you have more than two classes, a single binary SVM classifier is not sufficient. Instead, you are going to combine several binary SVM classifiers to predict multiple classes. Thus, the goal is to train multiple linear SVM classifiers that predict binary classes based on the BoW features, and then combine them to properly predict one of the four classes. You can download the following data files from the class webpage.

• articles.train: Training data consisting of 4,000 articles (1,000 articles per class)
• articles.test: Test data consisting of 2,400 articles (600 articles per class)
• words.map: A word table mapping every relevant word to its line number

A single line in the training and test data has the following format:

• Format: (Class #) (Features)
• Class #: One of 1, 2, 3, 4 (1 = operating systems, 2 = vehicles, 3 = sports, 4 = politics)
• Features: A space-separated sequence of word#:frequency pairs

If a line is 1 11:2 13:1 23:12 25:1 27:2 28:1, for example, this is an article about operating systems which contains word 11 (addresses) twice, word 13 (organizations) once, and so on.
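As a starting point for loading this data, here is a minimal sketch of one way to parse the sparse word#:frequency format into class labels and a document-term matrix. The use of the Matrix package, the helper name read_articles, and the overall approach are illustrative choices, not part of the assignment; any correct parser is fine.

library(Matrix)

# One possible reader for lines of the form "class word:count word:count ..."
read_articles <- function(path, n_words) {
  lines  <- readLines(path)
  labels <- integer(length(lines))
  rows <- integer(0); cols <- integer(0); vals <- numeric(0)
  for (d in seq_along(lines)) {
    tokens    <- strsplit(lines[d], " ")[[1]]
    labels[d] <- as.integer(tokens[1])        # class number: 1, 2, 3, or 4
    pairs     <- strsplit(tokens[-1], ":")    # "word:frequency" pairs
    rows <- c(rows, rep(d, length(pairs)))
    cols <- c(cols, as.integer(vapply(pairs, function(p) p[1], character(1))))
    vals <- c(vals, as.numeric(vapply(pairs, function(p) p[2], character(1))))
  }
  list(y = labels,
       X = sparseMatrix(i = rows, j = cols, x = vals,
                        dims = c(length(lines), n_words)))
}

# Example usage: the vocabulary size is the number of lines in words.map
# train <- read_articles("articles.train", n_words = length(readLines("words.map")))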