Inference
Assignment in COMS30007 Machine Learning
Carl Henrik Ek
November 14, 2018

Welcome to the second assignment in the machine learning course. You will present the assignment by a written report that you should submit on SAFE. From the report it should be clear what you have done, and you need to support your claims with results. You are supposed to write down the answers to the specific questions detailed for each task. The report should clearly show how you have drawn your conclusions and arrived at your derivations. Your assumptions should be stated clearly. For the practical part of the task you should not show any of your code but rather only the results of your experiments, using images and graphs together with your analysis. You should still submit your code, though, and you are free to use whatever language you want. Being able to communicate your results and conclusions is a key skill for any scientific practitioner. It is up to you as an author to make sure that the report clearly shows what you have done and what your understanding is. Based on this, and only this, we will decide if you pass the task. No detective work should be needed on our side. Therefore, neat and tidy reports please!

Each report can be up to 5 pages long. I very much recommend that you get used to LaTeX. It is an amazing tool that you will find very useful in your further endeavours as a scientist. You can download the style file from the repository. You are not allowed to alter the margins or the font size, but you are free to use \vspace{} as much as you like. If you decide to use another formatting engine, try to mimic the style file provided, but I really recommend that you get used to LaTeX as soon as possible, as you will have to write a dissertation at some point and using something else should be considered an exercise in self-harm. If anyone is an Emacs connoisseur, I recommend using the excellent org-mode and the ox-latex export engine to prepare your report.

The grading of the assignments will be as follows,

50% Questions 1-4
70% Questions 1-7
80% Questions 1-8
90% Questions 1-10

This is your potential top mark if all the questions are answered correctly and the discussion is as it should be. For the 90% mark you have to do the amortised learning part, but as you have been provided with the code it is your intuitions that are important. The last 10% I use as a golden sprinkle that I hand out for things that I think are especially well done, so there is absolutely the possibility of getting a 100% mark.

Abstract

In the first assignment we looked at how we can create models that allow us to parametrise and factorise a distribution over the observed data domain. We saw that we can make decisions from the posterior distribution of the model, which we reached by combining the model with observed data. For many models, though, it is not feasible to compute the posterior in closed form, most commonly because the marginal likelihood, or evidence, is not analytically or computationally tractable [1].

[1] For those of you who did the evidence part of the Model coursework, think about how many elements you had in the summation for such a simple problem.

So how do we proceed? Well, one possibility is to look for a point estimate rather than the full distribution and proceed with a Maximum Likelihood or a Maximum-a-Posteriori estimate. However, this should really be our last resort, as these methods will at best tell us what the model believes the "best" solution is, but will not at all quantify what "best" means. Further, our assumptions in this case do not reach the data, which means they act at best as regularisers.
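To make the two point estimates mentioned above concrete, they can be written as follows. This is a standard formulation added here for reference; it is not part of the original hand-out.

\begin{align*}
\theta_{\mathrm{ML}}  &= \arg\max_{\theta}\; p(\mathbf{Y} \mid \theta), &
\theta_{\mathrm{MAP}} &= \arg\max_{\theta}\; p(\mathbf{Y} \mid \theta)\, p(\theta).
\end{align*}

In both cases the answer is a single value of \theta rather than the posterior p(\theta \mid \mathbf{Y}); the prior only shifts the location of the MAP estimate, which is why it behaves as a regulariser rather than a quantification of uncertainty.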
This means that for such an inference scheme we cannot decide whether we should trust the model or not, nor can we judge how well the model actually describes the data. The more sensible approach is to try to approximate the intractable integrals, and here there are two main families of methods, stochastic and deterministic. Both have their benefits and drawbacks. A stochastic approach is simple to formulate, but importantly it is hard to assess how well we are doing, and in many ways it is considered one of the "black arts" of machine learning. Deterministic approaches are usually very efficient, but they will never be exact. In this assignment we will look at both of these approaches for a simple, yet very intractable, posterior.

1 Approximate Inference

The task of inference in a machine learning model is the task of combining our assumptions with the observed data. Specifically, we have a set of observed data \mathbf{Y} that has been parametrised by a variable \theta, and the task requires us to use Bayes' rule to reach the posterior p(\theta \mid \mathbf{Y}),

p(\theta \mid \mathbf{Y}) = \frac{p(\mathbf{Y} \mid \theta)\, p(\theta)}{p(\mathbf{Y})}.

The challenging part of the relationship above is the marginal likelihood, or evidence, which is the probability of the observed data when all assumptions have been propagated through and integrated out,

p(\mathbf{Y}) = \int p(\mathbf{Y}, \theta)\, \mathrm{d}\theta.

In the first assignment we looked at situations where we can avoid calculating the marginal likelihood by exploiting conjugacy; however, in certain cases this is simply not possible as the integral is intractable, sometimes computationally but quite often analytically. In order to proceed we have to make sacrifices and approximate this integral. But for the assignment to get under way we need a model to play around with that will exemplify the different approaches. In this part we are going to look at the rather useful task of image restoration; specifically, we are going to work with binary images which have been corrupted by noise, and we are supposed to clean them up. The task is exactly the same if you want to perform image segmentation rather than denoising, and I will explain the latter as a task for extra marks.

1.1 The Model

Images are one of the most interesting and easily available sources of data; they contain a lot of information and can be acquired in an unobtrusive manner with very cheap sensors. Our task here is to build a model of images, specifically of binary, or black-and-white, images. Images are normally represented as a grid of pixels y_i; however, the images we observe are noisy and will rather be a realisation of an underlying latent pixel representation x_i. Now, to make our computations a bit easier, let us say that white is encoded by x_i = 1 and black by x_i = -1, and that the grey-scale values that we observe satisfy y_i \in (0, 1). We will write our likelihood in the following form,

p(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z_1} \prod_{i=1}^{N} e^{L_i(x_i)},   (1)

where L_i(x_i) is a function which generates a large value if x_i is likely to have generated y_i, and Z_1 is a factor that ensures that p(\mathbf{y} \mid \mathbf{x}) is a distribution. We have further assumed that the pixels in the image are conditionally independent given the latent variables \mathbf{x}. The next part is to think about what a sensible prior would be: what do we actually know about images? One important aspect of images, which makes them, well, images, is that there is significant correlation between neighbouring pixels.
What do we know about this relationship? Well, let us say that we see one white pixel; what do we believe the most likely colour of the pixel to the right to be? If I had to guess I would probably say white, as I think that images have more continuous segments of one colour than switches between colours. So this is now prior information, an assumption that we want to quantify in terms of a probability. We can write this down as follows,

p(\mathbf{x}) = \frac{1}{Z_0} e^{E_0(\mathbf{x})},   (2)

where again E_0(\mathbf{x}) is a function that is large if the configuration of \mathbf{x} is something that we believe is likely, and small otherwise, and Z_0 is a normalising term that ensures that p(\mathbf{x}) is a distribution. If we follow our previous reasoning and say that a pixel depends on its neighbouring pixels only, we can write E_0(\mathbf{x}) as a function of the following form,

E_0(\mathbf{x}) = \sum_{i=1}^{N} \sum_{j \in \mathcal{N}(i)} w_{ij} x_i x_j,   (3)

where \mathcal{N}(i) specifies the set of nodes that are neighbours of node i. Remember that x_i \in \{-1, 1\}; this means that x_i x_j will be 1 if the nodes have the same label and -1 if they have different labels. The scalars w_{ij} are the parameters with which we can control the strength of our prior, where a large value of w_{ij} implies that x_i and x_j are nodes that we really believe should have the same label. Now we have our final model and can describe the joint distribution,

p(\mathbf{x}, \mathbf{y}) = p(\mathbf{y} \mid \mathbf{x})\, p(\mathbf{x}) = \frac{1}{Z_1} \prod_{i=1}^{N} e^{L_i(x_i)} \cdot \frac{1}{Z_0} e^{\sum_{i=1}^{N} \sum_{j \in \mathcal{N}(i)} w_{ij} x_i x_j}.   (4)

We can also write up the graphical model, which is shown in Figure 1. The model that we have just described is referred to as a Markov Random Field with an Ising prior. This model was initially described in physics [2] to study the interaction of nearby magnets, where the latent variable was their "direction". However, it turns out that such models are very good models of images in many tasks.

[2] An interesting note about machine learning researchers is that very few come from a computer science background; much more common are physicists, engineers and, of course, statisticians. Maybe it is therefore not surprising to see a lot of physics-motivated models in use for rather different tasks compared to the ones they were initially designed for.

Figure 1: The graphical model of the MRF we will use for the images, a grid of latent nodes x_0, ..., x_8, each connected to an observed node y_0, ..., y_8. Note that we have connected the latent variables with lines and not arrows; that is because we do not specify these relationships as conditional probabilities but rather as joint probabilities.

1.2 Inference

The task we will study here is, given a noisy observation \mathbf{y}, to recover the latent variables \mathbf{x} that have generated the observations. This means that we want to reach the posterior distribution p(\mathbf{x} \mid \mathbf{y}); to do so we have to apply Bayes' rule,

p(\mathbf{x} \mid \mathbf{y}) = \frac{p(\mathbf{y} \mid \mathbf{x})\, p(\mathbf{x})}{p(\mathbf{y})}.

The denominator could be computed as follows,

p(\mathbf{y}) = \sum_{\mathbf{x}} p(\mathbf{y} \mid \mathbf{x})\, p(\mathbf{x}).

For any sensible size of image this summation is not computationally tractable. What we want to sum over is all possible values that \mathbf{x} can take, i.e. we want to test all possible binary images. If we have an image of size 10 x 10 it consists of 100 pixels. The number of combinations they can take is therefore 2^{100}. This is the number of terms in the marginalisation above, and it is a big number. This means that it is simply computationally intractable to compute Bayes' rule for any sensibly sized image, say something with 3-4 megapixels, and we need to perform some form of approximation.
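To make the scale of the problem tangible, here is a small sketch, again not part of the hand-out, that brute-forces the posterior by enumerating every binary image. The likelihood term L_i(x_i) = x_i (2 y_i - 1) and the single shared coupling weight w are my own simplifying assumptions; equation (3) allows a separate w_{ij} per edge. Even a 3 x 3 image already gives 2^9 = 512 terms, and the count doubles with every extra pixel, which is why this approach is hopeless for real images.

import itertools
import numpy as np

def likelihood_term(x, y):
    # Assumed L_i(x_i): positive when the label x in {-1, +1} agrees with
    # the grey value y in (0, 1), negative when it disagrees.
    return x * (2.0 * y - 1.0)

def prior_energy(x, w=1.0):
    # Ising energy on a 4-neighbourhood grid with one shared weight w.
    # Each edge is counted once here, which simply rescales w relative to
    # the double sum in equation (3).
    return w * (np.sum(x[:, :-1] * x[:, 1:]) + np.sum(x[:-1, :] * x[1:, :]))

def brute_force_posterior(y, w=1.0):
    """Enumerate all 2^N latent images to compute p(x | y) exactly.
    Feasible only for tiny grids, which is precisely the point."""
    rows, cols = y.shape
    configs, log_joints = [], []
    for bits in itertools.product([-1, 1], repeat=rows * cols):
        x = np.array(bits).reshape(rows, cols)
        configs.append(x)
        log_joints.append(np.sum(likelihood_term(x, y)) + prior_energy(x, w))
    joints = np.exp(np.array(log_joints))
    # Dividing by the sum of the joint over all configurations is exactly
    # dividing by p(y), the intractable marginal likelihood.
    return configs, joints / joints.sum()

y = np.random.rand(3, 3)                 # a toy "noisy observation"
configs, posterior = brute_force_posterior(y)
print(len(posterior))                    # already 512 terms for a 3 x 3 image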