
Hi,
I need help with questions 4, 5 and 6, please.
Thank you,
George Abou Kassm


IFT 6390 Fundamentals of Machine Learning
Ioannis Mitliagkas

Homework 1 - Theoretical part

• This homework must be done and submitted to Gradescope individually. You are welcome to discuss with other students but the solution you submit must be your own. Note that we will use Gradescope's plagiarism detection feature. All suspected cases of plagiarism will be recorded and shared with university officials for further handling.

• You need to submit your solution as a pdf file on Gradescope using the homework titled Homework 1 - Theory.

1. Probability warm-up: conditional probabilities and Bayes rule [5 points]

(a) Give the definition of the conditional probability of a discrete random variable X given a discrete random variable Y.

(b) Consider a biased coin with probability 3/4 of landing on heads and 1/4 on tails. This coin is tossed three times. What is the probability that exactly two heads occur (out of the three tosses) given that the first outcome was a head?

(c) Give two equivalent expressions of P(X, Y):
    (i) as a function of P(X) and P(Y|X)
    (ii) as a function of P(Y) and P(X|Y)

(d) Prove Bayes' theorem: P(X|Y) = P(Y|X) P(X) / P(Y).

(e) A survey of certain Montreal students is done, where 55% of the surveyed students are affiliated with UdeM while the others are affiliated with McGill. A student is drawn randomly from this surveyed group.
    i. What is the probability that the student is affiliated with McGill?
    ii. Now let's say that this student is bilingual, and you know that 80% of UdeM students are bilingual while 50% of McGill students are. Given this information, what is the probability that this student is affiliated with McGill?

2. Bag of words and single topic model [12 points]

We consider a classification problem where we want to predict the topic of a document from a given corpus (collection of documents). The topic of each document can either be sports or politics. 2/3 of the documents in the corpus are about sports and 1/3 are about politics.

We will use a very simple model where we ignore the order of the words appearing in a document and we assume that words in a document are independent from one another given the topic of the document.

In addition, we will use very simple statistics of each document as features: the probabilities that a word chosen randomly in the document is either "goal", "kick", "congress", "vote", or any other word (denoted by other). We will call these five categories the vocabulary or dictionary for the documents: V = {"goal", "kick", "congress", "vote", other}.

Consider the following distributions over words in the vocabulary given a particular topic:

    word                P(word | topic = sports)    P(word | topic = politics)
    word = "goal"       2/100                        0
    word = "kick"       1/200                        5/1000
    word = "congress"   1/1000                       1/50
    word = "vote"       2/500                        4/100
    word = other        970/1000                     935/1000

    Table 1

This table tells us, for example, that the probability that a word chosen at random in a document is "vote" is only 2/500 if the topic of the document is sports, but it is 4/100 if the topic is politics.

(a) What is the probability that a random word in a document is "goal" given that the topic is politics?

(b) In expectation, how many times will the word "goal" appear in a document containing 200 words whose topic is sports?

(c) We draw randomly a document from the corpus. What is the probability that a random word of this document is "goal"?

(d) Suppose that we draw a random word from a document and this word is "kick". What is the probability that the topic of the document is sports?
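For readers who want to sanity-check their hand computations for this question, here is a minimal Python sketch (ours, not part of the assignment) that encodes the topic prior and Table 1 and combines them with the law of total probability and Bayes' rule. Only the probability values come from the problem statement; the dictionary layout and function names are our own choices.

    from fractions import Fraction as F

    # Topic prior and Table 1 values, copied from the problem statement.
    prior = {"sports": F(2, 3), "politics": F(1, 3)}
    p_word_given_topic = {
        "sports":   {"goal": F(2, 100), "kick": F(1, 200), "congress": F(1, 1000),
                     "vote": F(2, 500), "other": F(970, 1000)},
        "politics": {"goal": F(0),      "kick": F(5, 1000), "congress": F(1, 50),
                     "vote": F(4, 100), "other": F(935, 1000)},
    }

    def p_word(word):
        # Law of total probability: P(word) = sum over topics of P(word | topic) P(topic).
        return sum(p_word_given_topic[t][word] * prior[t] for t in prior)

    def p_topic_given_word(topic, word):
        # Bayes' rule: P(topic | word) = P(word | topic) P(topic) / P(word).
        return p_word_given_topic[topic][word] * prior[topic] / p_word(word)

    print(p_word("goal"))                        # marginal probability, as in part (c)
    print(p_topic_given_word("sports", "kick"))  # posterior of the topic, as in part (d)

Using exact fractions avoids floating-point rounding, so the printed values can be compared directly with a hand-derived answer.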
(e) Suppose that we randomly draw two words from a document and the first one is "kick". What is the probability that the second word is "goal"?

(f) Going back to learning, suppose that you do not know the conditional probabilities given a topic or the probability of each topic (i.e. you don't have access to the information in Table 1 or the topic distribution), but you have a dataset of N documents where each document is labeled with one of the topics sports and politics. How would you estimate the conditional probabilities (e.g., P(word = "goal" | topic = politics)) and topic probabilities (e.g., P(topic = politics)) from this dataset?

3. Maximum likelihood estimation [5 points]

Let x ∈ R be uniformly distributed in the interval [0, θ], where θ is a parameter. That is, the pdf of x is given by

    fθ(x) = 1/θ   if 0 ≤ x ≤ θ
    fθ(x) = 0     otherwise

Suppose that n samples D = {x1, ..., xn} are drawn independently according to fθ(x).

(a) Let fθ(x1, x2, ..., xn) denote the joint pdf of n independent and identically distributed (i.i.d.) samples drawn according to fθ(x). Express fθ(x1, x2, ..., xn) as a function of fθ(x1), fθ(x2), ..., fθ(xn).

(b) We define the maximum likelihood estimate as the value of θ which maximizes the likelihood of having generated the dataset D from the distribution fθ(x). Formally,

    θMLE = argmax_{θ ∈ R} fθ(x1, x2, ..., xn).

Show that the maximum likelihood estimate of θ is max(x1, ..., xn).

4. Maximum likelihood meets histograms [10 points]

Let X1, X2, ..., Xn be n i.i.d. data points drawn from a piecewise constant probability density function over N equal-size bins between 0 and 1 (B1, B2, ..., BN), where the constants are θ1, θ2, ..., θN:

    p(x; θ1, ..., θN) = θj   if (j − 1)/N ≤ x < j/N, for j ∈ {1, 2, ..., N}
    p(x; θ1, ..., θN) = 0    otherwise

We define µj for j ∈ {1, 2, ..., N} as µj := Σ_{i=1}^{n} 1(Xi ∈ Bj), i.e. the number of data points that fall in bin Bj.

(a) Using the fact that the total area underneath a probability density function is 1, express θN in terms of the other constants.

(b) Write down the log-likelihood of the data in terms of θ1, θ2, ..., θ_{N−1} and µ1, µ2, ..., µ_{N−1}.

(c) Find the maximum likelihood estimate of θj for j ∈ {1, 2, ..., N}.

5. Histogram methods [10 points]

Consider a dataset {xj}_{j=1}^{n} where each point x ∈ [0, 1]^d. Let f(x) be the true unknown data distribution. You decide to use a histogram method to estimate the density f(x) and divide each dimension into m bins.

(a) Show that for a measurable set S, E[1{x ∈ S}] = P(x ∈ S).

(b) Combining the result of the previous question with the law of large numbers, show that the estimated probability of falling in bin i, as given by the histogram method, tends to ∫_{Vi} f(x) dx, the true probability of falling in bin i, as n → ∞. Vi denotes the volume occupied by bin i.

(c) Consider the MNIST dataset with 784 dimensions. We divide each dimension into 2 bins. How many digits (base 10) does the total number of bins have?

(d) Assume the existence of an idealized MNIST classifier based on a histogram of m = 2 bins per dimension. The accuracy of the classifier increases by ε = 5% (starting from 10% and up to a maximum of 100%) each time k = 4 new data points are added to every bin. What is the smallest number of samples the classifier requires to reach an accuracy of 90%?

(e) Assuming a uniform distribution over all bins, what is the probability that a particular bin is empty, as a function of d, m and n?
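The convergence claimed in part (b) can also be illustrated empirically. Below is a minimal Python sketch (ours, not part of the assignment) for the one-dimensional case d = 1: the fraction of samples falling in each bin approaches the true bin probability ∫_{Vi} f(x) dx as n grows. The choice of a Beta(2, 5) density as the "true" f and of m = 10 bins is arbitrary and only for illustration.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    m = 10                                    # bins per (single) dimension
    edges = np.linspace(0.0, 1.0, m + 1)

    for n in (100, 10_000, 1_000_000):
        x = rng.beta(2.0, 5.0, size=n)        # n samples from the "true" density f
        counts, _ = np.histogram(x, bins=edges)
        estimated = counts / n                # histogram estimate of each bin's probability
        true = np.diff(stats.beta.cdf(edges, 2.0, 5.0))   # true bin probabilities ∫_Vi f(x) dx
        print(n, np.max(np.abs(estimated - true)))        # worst-case gap shrinks as n grows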
Note the contrast between (b) and (e): even though with infinitely many data points the histogram becomes arbitrarily accurate at estimating the true distribution (b), in practice the number of samples required to even get a single data point in every region grows exponentially with d.

6. Gaussian mixture [10 points]

Let µ0, µ1 ∈ R^d, and let Σ0, Σ1 be two d × d positive definite matrices (i.e. symmetric with positive eigenvalues). We now introduce the two following pdfs over R^d:

    f_{µ0,Σ0}(x) = 1 / ((2π)^{d/2} √det(Σ0)) · exp(−(1/2)(x − µ0)^T Σ0^{−1}(x − µ0))
    f_{µ1,Σ1}(x) = 1 / ((2π)^{d/2} √det(Σ1)) · exp(−(1/2)(x − µ1)^T Σ1^{−1}(x − µ1))

These pdfs correspond to the multivariate Gaussian distribution of mean µ0 and covariance Σ0, denoted N_d(µ0, Σ0), and the multivariate Gaussian distribution of mean µ1 and covariance Σ1, denoted N_d(µ1, Σ1).

We now toss a balanced coin Y, and draw a random variable X in R^d, following this process: if the coin lands on tails (Y = 0) we draw X from N_d(µ0, Σ0), and if the coin lands on heads (Y = 1) we draw X from N_d(µ1, Σ1).

(a) Calculate P(Y = 0 | X = x), the probability that the coin landed on tails given X = x ∈ R^d, as a function of µ0, µ1, Σ0, Σ1, and x. Show all the steps of the derivation.

(b) Recall that the Bayes optimal classifier is

    h_Bayes(x) = argmax_{y ∈ {0,1}} P(Y = y | X = x).

Show that in this setting, if Σ0 = Σ1, the Bayes optimal classifier is linear in x.
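As a numerical companion to part (a), here is a small Python sketch (ours, not part of the assignment) that evaluates the posterior directly from the two Gaussian densities via scipy.stats.multivariate_normal. The dimension, means and covariances below are arbitrary test values; the computation is just Bayes' rule with the balanced-coin prior of 1/2, so it can be used to spot-check a hand-derived closed-form expression on random inputs.

    import numpy as np
    from scipy.stats import multivariate_normal

    rng = np.random.default_rng(1)
    d = 3
    mu0, mu1 = rng.normal(size=d), rng.normal(size=d)
    A0, A1 = rng.normal(size=(d, d)), rng.normal(size=(d, d))
    Sigma0 = A0 @ A0.T + d * np.eye(d)        # symmetric positive definite by construction
    Sigma1 = A1 @ A1.T + d * np.eye(d)

    def posterior_tails(x):
        # Bayes' rule with the balanced-coin prior P(Y=0) = P(Y=1) = 1/2:
        # P(Y = 0 | X = x) = f0(x) / (f0(x) + f1(x)).
        f0 = multivariate_normal(mu0, Sigma0).pdf(x)
        f1 = multivariate_normal(mu1, Sigma1).pdf(x)
        return f0 / (f0 + f1)

    x = rng.normal(size=d)
    print(posterior_tails(x))                 # compare against your closed-form answer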
100%)="" each="" time="" k="4" new="" data="" points="" are="" added="" to="" every="" bin.="" what="" is="" the="" smallest="" number="" of="" samples="" the="" classifier="" requires="" to="" reach="" an="" accuracy="" of="" 90%?="" (e)="" assuming="" a="" uniform="" distribution="" over="" all="" bins,="" what="" is="" the="" prob-="" ability="" that="" a="" particular="" bin="" is="" empty,="" as="" a="" function="" of="" d,="" m="" and="" n?="" note="" the="" contrast="" between="" (b)="" and="" (e):="" even="" if="" for="" infinitely="" many="" datapoints,="" the="" histogram="" will="" be="" arbitrarily="" accurate="" at="" estimating="" the="" true="" distribution="" (b),="" in="" practice="" the="" number="" of="" samples="" required="" to="" even="" get="" a="" single="" datapoint="" in="" every="" region="" grows="" exponentially="" with="" d.="" 6.="" gaussian="" mixture="" [10="" points]="" let="" µ0,="" µ1="" ∈="" rd,="" and="" let="" σ0,σ1="" be="" two="" d×="" d="" positive="" definite="" matrices="" (i.e.="" symmetric="" with="" positive="" eigenvalues).="" we="" now="" introduce="" the="" two="" following="" pdf="" over="" rd="" :="" fµ0,σ0(x)="1" (2π)="" d="" 2="" √="" det(σ0)="" e−="" 1="" 2="" (x−µ0)="" t="" σ−10="" (x−µ0)="" fµ1,σ1(x)="1" (2π)="" d="" 2="" √="" det(σ1)="" e−="" 1="" 2="" (x−µ1)="" t="" σ−11="" (x−µ1)="" these="" pdf="" correspond="" to="" the="" multivariate="" gaussian="" distribution="" of="" mean="" µ0="" and="" covariance="" σ0,="" denoted="" nd(µ0,σ0),="" and="" the="" multivari-="" ate="" gaussian="" distribution="" of="" mean="" µ1="" and="" covariance="" σ1,="" denoted="" nd(µ1,σ1).="" we="" now="" toss="" a="" balanced="" coin="" y="" ,="" and="" draw="" a="" random="" variable="" x="" in="" rd,="" following="" this="" process="" :="" if="" the="" coin="" lands="" on="" tails="" (y="0)" we="" draw="" x="" 5="" from="" nd(µ0,σ0),="" and="" if="" the="" coin="" lands="" on="" heads="" (y="1)" we="" draw="" x="" from="" nd(µ1,σ1).="" (a)="" calculate="" p(y="0|X" =="" x),="" the="" probability="" that="" the="" coin="" landed="" on="" tails="" given="" x="x" ∈="" rd,="" as="" a="" function="" of="" µ0,="" µ1,="" σ0,="" σ1,="" and="" x.="" show="" all="" the="" steps="" of="" the="" derivation.="" (b)="" recall="" that="" the="" bayes="" optimal="" classifier="" is="" hbayes(x)="argmax" y∈{0,1}="" p(y="y|X" =="" x).="" show="" that="" in="" this="" setting="" if="" σ0="Σ1" the="" bayes="" optimal="" classifier="" is="" linear="" in="" x.="">