https://courses.grainger.illinois.edu/cs361/sp2021/secure/project/CS361Project

CS 361: Probability and Statistics for Computer Science (Spring 2021)
Stochastic Optimization Project

1 Stochastic Optimization Theory Part

1.1 A common stochastic optimization task

In many machine learning problems, we are trying to minimize a function $f(\theta)$ of the following form:

    $f(\theta) = \frac{1}{k} \sum_{j=1}^{k} Q(\theta, j)$    (1.1)

where $Q(\theta, j)$ is the loss function for the $j$th data point and we have $k$ training data points in the data set $D$. Here, $\theta$ is the set of parameters of the model that we try to optimize; that is, we want to find $\theta^* = \arg\min_\theta f(\theta)$. However, $f$ is either unknown or very expensive to collect, but we have access to a noisy version $g(\theta)$. Using $g(\theta)$ to find the minimizer $\theta^*$ of $f(\theta)$ is a stochastic optimization task.

Exercise 1. (4 points) Define the random variable $g(\theta)$ here to be $Q(\theta, i)$, where $i$ is a random variable uniformly distributed over the integer set $[1 : k]$. Accordingly, $\nabla g(\theta)$ is defined to be $\nabla Q(\theta, i)$. Further, define the noise $z = Q(\theta, i) - f(\theta)$, so that $g(\theta) = Q(\theta, i) = f(\theta) + z$; this is why we call $g$ the noisy version of $f$. Given this definition of $f$ and $g$, prove that the equations $E[g(\theta)] = f(\theta)$, $E[\nabla g(\theta)] = \nabla f(\theta)$, and $E[z] = 0$ are satisfied.

1.2 Review of GD and SGD

The goal of optimization is to find the $\theta^*$ that minimizes $f(\theta)$, which could be more general than the form in (1.1). Again, sometimes $f$ is either unknown or very expensive to collect, but we have access to the noisy version $g(\theta)$. The following are two methods for approximating $\theta^*$ iteratively.

1.2.1 Gradient Descent Algorithm (GD)

- Initialize $\theta_1$.
- For $t = 1, 2, \dots$ do the following update:

    $\theta_{t+1} = \theta_t - \eta_t \nabla f(\theta_t)$    (1.2)

1.2.2 Stochastic Gradient Descent Algorithm (SGD)

- Initialize $\theta_1$.
- For $t = 1, 2, \dots$: obtain a noisy version of the gradient (i.e., $\nabla g(\theta_t)$), then do the following update:

    $\theta_{t+1} = \theta_t - \eta_t \nabla g(\theta_t)$    (1.3)

1.3 Convergence rates for SGD and GD

Suppose we want to minimize a function $f$ with learning rate sequence $\eta_t$. We have the following theorems:

Theorem 1.1 Assume $\nabla f$ has a unique root $\theta^*$. If $f$ is twice continuously differentiable and strongly convex, and $\eta_t = O(t^{-1})$, then to reach an approximation error $\epsilon = |E[f(\theta_t)] - f(\theta^*)|$, stochastic gradient descent requires $t = O(1/\epsilon)$ updates.

Theorem 1.2 Assume $\nabla f$ has a unique root $\theta^*$. If either the smoothness assumptions about $f$ in Theorem 1.1 or the condition $\eta_t = O(t^{-1})$ is not met, then to reach an approximation error $\epsilon = |E[f(\theta_t)] - f(\theta^*)|$, stochastic gradient descent requires $t = O(1/\epsilon^2)$ updates.

Theorem 1.3 Assume $\nabla f$ has a unique root $\theta^*$. To reach an approximation error $\epsilon = |E[f(\theta_t)] - f(\theta^*)|$, gradient descent requires $t = O(\ln(1/\epsilon))$ updates.

2 Comparing SGD vs GD for their running time

Let us consider the minimization task for $f$ discussed earlier in (1.1).

- $f(\theta)$ and $g(\theta)$ are as described in Exercise 1, and $\nabla f(\theta^*) = 0$.
- Assume that $f$ is strongly convex and twice continuously differentiable.
- We use the learning rate sequence $\eta_t = 1/t$.
- We have $k$ data points.
- We want to achieve a training precision of $\epsilon = |E[f(\theta_t)] - f(\theta^*)|$.
- Assume finding $\nabla Q(\theta, j)$ takes $c$ time, and finding $\nabla f(\theta)$ takes $kc$ time.

Exercise 2. (15 points) Fill in Table 1 with complexity orders in terms of $k$ and $\epsilon$.

Table 1: Comparing SGD vs GD in terms of training precision

                                        SGD     GD
    Computational cost per update       O(1)
    Number of updates to reach ε
    Total cost

Explain your answer for each element of the second row by providing a reference to the theorem mentioned in this document. For every other element, explain your answer.
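To make the cost model in Exercise 2 concrete, here is a minimal NumPy sketch of the two update rules (1.2) and (1.3) on a finite-sum objective of the form (1.1). The least-squares loss, the synthetic data, and all names here (grad_Q, grad_f, and so on) are illustrative assumptions, not part of the assignment; the point is that one GD update evaluates the per-point gradient $k$ times (cost $kc$), while one SGD update evaluates it once (cost $c$).

    import numpy as np

    rng = np.random.default_rng(0)

    # Illustrative finite-sum setup: Q(theta, j) = (x_j^T theta - y_j)^2 / 2.
    k, d = 100, 5
    X = rng.uniform(-1.0, 1.0, size=(k, d))
    y = X @ np.ones(d) + 0.1 * rng.standard_normal(k)

    def grad_Q(theta, j):
        # Gradient of the loss on one data point: cost c per call.
        return (X[j] @ theta - y[j]) * X[j]

    def grad_f(theta):
        # Full gradient of f averages all k per-point gradients: cost k*c.
        return sum(grad_Q(theta, j) for j in range(k)) / k

    def gd(theta, T):
        for t in range(1, T + 1):
            theta = theta - (1.0 / t) * grad_f(theta)      # update (1.2), eta_t = 1/t
        return theta

    def sgd(theta, T):
        for t in range(1, T + 1):
            j = rng.integers(k)                            # i ~ Uniform over [1:k]
            theta = theta - (1.0 / t) * grad_Q(theta, j)   # update (1.3), eta_t = 1/t
        return theta

    print(gd(np.zeros(d), 200))      # 200 updates, each touching all k points
    print(sgd(np.zeros(d), 2000))    # 2000 updates, each touching one point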
3 Comparing SGD vs GD with respect to test loss

The last exercise was about finding the computational complexity required to reach a specific precision with respect to the optimal training loss. However, a specific precision of training loss does not translate into the same precision of test loss. Here are some helpful facts:

- To achieve a specific test loss precision $\rho$, you need a large enough number of training samples $k$. The relation $\rho = O(k^{-\gamma})$ must hold for most loss functions, where $0 < \gamma \le 1$ is an unknown constant.
- Let's assume $\epsilon = \Theta(\rho)$; for simplicity, you can assume that it is as if $\epsilon = c \cdot \rho$, where $c$ is a positive constant.

Exercise 3. (16 points)

1) Fill in Table 2 with complexity orders in terms of $\rho$ and $\gamma$. You can refer to the previous table and replace the elements with appropriate values. Make sure you state the reason for each element, even if it looks obvious.

Table 2: Comparing SGD vs GD in terms of testing precision

                                        SGD     GD
    Computational cost per update       O(1)
    Number of updates to reach ρ
    Total cost

2) For a typical $\gamma = 0.2$, explain why choosing GD over SGD can be very unwise.

CS 361: Probability and Statistics for Computer Science (Spring 2021)
Stochastic Optimization Implementation

Objective: implement state-of-the-art stochastic optimization algorithms for machine learning problems: linear regression and classification (using neural networks).

3.1 Adaptive Moment Estimation (Adam) gradient descent algorithm

SGD can be ineffective when the function is highly non-convex. Unfortunately, most modern applications of machine learning involve non-convex optimization problems. One stochastic optimization algorithm that works well even under non-convexity is Adam [KB14]. Adam uses past gradient information to "speed up" optimization by smoothing, and the algorithm has been further improved [SSS18]. You will implement the Adam stochastic optimization algorithm for a linear regression problem. The pseudo-code for Adam is reproduced here from the original paper; credits to [KB14]. Disclaimer: the textbook, in Chapter 13, uses $\beta$ for the parameters, but we will be using $\theta$.

Algorithm 1. Here $g_t^2$ indicates the elementwise square $g_t \odot g_t$. Good default settings for the tested machine learning problems are $\alpha = 0.001$, $\beta_1 = 0.9$, $\beta_2 = 0.999$, and $\epsilon = 10^{-8}$. All operations on vectors are element-wise. With $\beta_1^t$ and $\beta_2^t$, we denote $\beta_1$ and $\beta_2$ to the power $t$.

    Require: $\alpha$: stepsize
    Require: $\epsilon$: division-by-zero control parameter
    Require: $\beta_1, \beta_2 \in [0, 1)$: exponential decay rates for the moment estimates
    Require: $f(\theta)$: stochastic objective function with parameters $\theta$
    Require: $\theta_0$: initial parameter vector
    $m_0 \leftarrow 0$ (initialize 1st moment vector)
    $v_0 \leftarrow 0$ (initialize 2nd moment vector)
    $t \leftarrow 0$ (initialize timestep)
    while $\theta_t$ not converged do
        $t \leftarrow t + 1$
        $g_t \leftarrow \nabla_\theta f_t(\theta_{t-1})$ (get gradients w.r.t. stochastic objective at timestep $t$)
        $m_t \leftarrow \beta_1 \cdot m_{t-1} + (1 - \beta_1) \cdot g_t$ (update biased first moment estimate)
        $v_t \leftarrow \beta_2 \cdot v_{t-1} + (1 - \beta_2) \cdot g_t^2$ (update biased second raw moment estimate)
        $\hat{m}_t \leftarrow m_t / (1 - \beta_1^t)$ (compute bias-corrected first moment estimate)
        $\hat{v}_t \leftarrow v_t / (1 - \beta_2^t)$ (compute bias-corrected second raw moment estimate)
        $\theta_t \leftarrow \theta_{t-1} - \alpha \cdot \hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon)$ (update parameters)
    end while
    return $\theta_t$ (resulting parameters)
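For reference, a direct NumPy transliteration of Algorithm 1 might look like the sketch below. This is not the starter code: the names adam and grad_f are assumptions, grad_f stands in for whatever stochastic gradient oracle the problem provides (e.g., $\nabla Q(\theta, j)$ for a uniformly sampled $j$), and a fixed number of updates replaces the convergence test.

    import numpy as np

    def adam(grad_f, theta0, n_updates=1000,
             alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
        # Minimal Adam sketch following Algorithm 1; grad_f(theta, t) returns
        # a stochastic gradient of the objective at timestep t.
        theta = np.asarray(theta0, dtype=float).copy()
        m = np.zeros_like(theta)                      # 1st moment vector
        v = np.zeros_like(theta)                      # 2nd moment vector
        for t in range(1, n_updates + 1):
            g = grad_f(theta, t)                      # gradient at timestep t
            m = beta1 * m + (1 - beta1) * g           # biased 1st moment estimate
            v = beta2 * v + (1 - beta2) * g * g       # biased 2nd raw moment estimate
            m_hat = m / (1 - beta1 ** t)              # bias-corrected 1st moment
            v_hat = v / (1 - beta2 ** t)              # bias-corrected 2nd moment
            theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
        return theta

    # Smoke test on f(theta) = ||theta||^2 (gradient 2*theta): the result
    # should approach the zero vector.
    print(adam(lambda th, t: 2 * th, np.array([1.0, -2.0]), n_updates=5000))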
Exercise 4. Consider the following problem setting:

- The number of training data points is $k = 1000$. The number of test data points is 50.
- Generation of the data set to fit the model: the $j$th data point is represented by $x_j$ and is a 10-dimensional vector. Each element of $x_j$ is generated from a uniform distribution over the interval $[-1, 1]$.
- $\theta_{\mathrm{true}}$ (i.e., the true parameter set) and $\theta$ (i.e., the variable indicating a candidate parameter set) are also 10-dimensional vectors. Assume that $\theta_{\mathrm{true}}$ is all ones; however, we will pretend that we do not know it, and we want to estimate it using the training data.
- Data is generated according to $y_j = x_j^T \theta_{\mathrm{true}} + 0.1 \cdot \xi$, where $\xi \sim N(0, 1)$ is a sample from the normal distribution. The label $y_j$ is a scalar.
- Let us assume that all the algorithm variants start from the same initial $\theta$, whose elements are picked uniformly in the interval $[0, 0.1]$.
- Use 1000 updates.

(A NumPy sketch of this data-generation setup appears after the problem list below.)

Since this is an optimization problem, we need to define the loss function. Let's use the following form:

    $f(\theta) = \frac{1}{k} \sum_{j=1}^{k} Q(\theta, j)$    (3.1)

    $Q(\theta, j) = \left| x_j^T \theta - y_j \right|^{\delta}$    (3.2)

where $\delta$ is a hyper-parameter that we use to control and define the loss function.

For the following problems, state your findings, provide analysis, and substantiate them in your write-up with screenshots of the plots from the gradient descent Python notebook code. Please save the notebook to your own Google Drive so that your work is preserved after the runtime expires.

1. (5 points) For $\delta = 2$, the problem becomes the least squares regression that you learned in Chapter 13. State the closed-form solution (i.e., $\theta = \dots$) in your report, and then implement the solution to solve for $\theta$. Provide the results of the experiment and state whether it is close to the true value.

2. (5 points) For $Q(\theta, j)$, find an expression that gives you the gradient $\nabla Q(\theta, j)$. Report this expression, and implement it in the appropriate function. Hint: for $r(\theta) = h(e(\theta)) = h(x_j^T \theta - y_j)$, the gradient can be written as $\nabla r(\theta) = \frac{\partial h}{\partial e} \cdot \nabla e(\theta) = \frac{\partial h}{\partial e} \cdot x_j$ according to the chain rule. Hint: the sign function, $\mathrm{sgn}(x)$, may be useful.

3. (5 points) Code the Adam optimization algorithm (with default hyper-parameters, such as the learning rate mentioned in the pseudocode above) and SGD to find the best parameter $\theta$. Use a batch size of 1 for this problem, and perform 1000 parameter updates. Report the final set of parameters.

4. (5 points) Update your code to compute the average test loss at every 50th update, and plot these values. You might notice that the error bars of Adam and SGD overlap; this is due to the inherent randomness from sampling values. To avoid this probabilistic overlap, increase the number of replicates (num_rep in the starter code) until the error bars of Adam and SGD no longer overlap. Report this curve.

5. (9 points) Run the Adam method for each of $\delta = 0.4, 0.7, 1, 2, 3$, and $5$. Report your observations clearly, and analyze the trends you are seeing. State whether Adam consistently outperforms SGD. Your analysis should include the reason why one method outperforms the other under each setting.
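As referenced in the problem setting above, here is a minimal NumPy sketch of the data-generation step under the stated distributions. The names (make_data, X_train, Q, the seed) are illustrative and need not match the starter notebook; this sets up the experiment but does not answer problems 1-5.

    import numpy as np

    rng = np.random.default_rng(361)                  # arbitrary illustrative seed

    d, k, n_test = 10, 1000, 50
    theta_true = np.ones(d)                           # true parameters: all ones

    def make_data(n):
        # Each element of x_j ~ Uniform[-1, 1]; y_j = x_j^T theta_true + 0.1 * xi.
        X = rng.uniform(-1.0, 1.0, size=(n, d))
        y = X @ theta_true + 0.1 * rng.standard_normal(n)
        return X, y

    X_train, y_train = make_data(k)
    X_test, y_test = make_data(n_test)

    # Common initial theta for all algorithm variants: Uniform[0, 0.1].
    theta0 = rng.uniform(0.0, 0.1, size=d)

    def Q(theta, X, y, delta):
        # Per-point loss (3.2); averaging over all points gives f in (3.1).
        return np.abs(X @ theta - y) ** delta

    print(Q(theta0, X_train, y_train, delta=2).mean())   # f(theta0) for delta = 2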
example �="" ="" 1="" is="" an="" unknown="" constant.="" •="" let’s="" assume="" ✏="⇥(⇢);" for="" simplicity,="" you="" can="" assume="" that="" it="" is="" as="" if="" ✏="" is="" c⇥⇢="" where="" c="" is="" a="" positive="" constant.="" exercise="" 3.="" (16="" points)="" 1)="" fill="" in="" table="" 2,="" with="" complexity="" orders="" in="" terms="" of="" ⇢="" and="" �.="" you="" can="" refer="" to="" the="" previous="" table,="" and="" replace="" the="" elements="" with="" appropriate="" values.="" make="" sure="" you="" state="" the="" reason="" for="" each="" element="" even="" if="" it="" looks="" obvious.="" –="" stochastic="" optimization="" project="" 3="" table="" 2="" comparing="" sgd="" vs="" gd="" in="" terms="" of="" testing="" precision="" sgd="" gd="" computational="" cost="" per="" update="" o(1)="" number="" of="" updates="" to="" reach="" ⇢="" total="" cost="" 2)="" for="" a="" typical="" �="0.2," explain="" why="" choosing="" gd="" over="" sgd="" can="" be="" very="" unwise.="" cs="" 361:="" probability="" and="" statistics="" for="" computer="" science="" (spring="" 2021)="" stochastic="" optimization="" implementation="" objective:="" implement="" state-of-the-art="" stochastic="" optimization="" algorithms="" for="" machine="" learning="" problems,="" linear="" regression="" and="" classification="" (using="" neural="" networks).="" 3.1="" adaptive="" momentum="" estimation="" (adam)="" gradient="" descent="" algorithm="" sgd="" can="" be="" ine↵ective="" when="" the="" function="" is="" highly="" non-convex.="" unfortunately,="" most="" modern="" applications="" of="" machine="" learning="" involve="" non-convex="" optimization="" problems.="" one="" stochastic="" optimization="" algorithm="" that="" works="" well="" even="" for="" non-convexity="" is="" adam="" [kb14].="" adam="" uses="" past="" gradient="" information="" to="" “speed”="" up="" optimization="" by="" smoothing,="" and="" the="" algoirthm="" has="" been="" further="" improved="" [sss18].="" you="" will="" implement="" the="" adam="" stochastic="" optimization="" algorithm="" for="" a="" linear="" regression="" problem.="" the="" pseudo-code="" for="" adam="" has="" been="" reproduced="" here="" from="" this="" paper.="" credits="" to="" [kb14].="" disclaimer:="" the="" textbook,="" in="" chapter="" 13,="" uses="" �="" for="" parameters="" but="" we="" will="" be="" using="" ✓.="" algorithm="" 1:="" g="" 2="" t="" indicates="" the="" elementwise="" square="" gt�gt.="" good="" default="" settings="" for="" the="" tested="" machine="" learning="" problems="" are="" ↵="0.001," �1="0.9,�2" =="" 0.999="" and="" ✏="10�8." 
all="" operations="" on="" vectors="" are="" element-wise.="" with="" �="" t="" 1="" and="" �="" t="" 2,="" we="" denote="" �1="" and="" �2="" to="" the="" power="" t.="" require:="" ↵:="" stepsize="" require:="" ✏:="" division-by-zero="" control="" parameter="" require:="" �1,�2="" 2="" [0,="" 1):="" exponential="" decay="" rates="" for="" the="" moment="" estimates="" require:="" f(✓):="" stochastic="" objective="" function="" with="" parameters="" ✓.="" require:="" ✓0:="" initial="" parameter="" vector="" m0="" 0="" (initialize="" 1st="" moment="" vector)="" v0="" 0="" (initialize="" 2nd="" moment="" vector)="" t="" 0="" (initialize="" timestep)="" while="" ✓t="" not="" converged="" do="" t="" t+="" 1="" gt="" r✓ft(✓t�1)="" (get="" gradients="" w.r.t.="" stochastic="" objective="" at="" timestep="" t)="" mt="" �1="" ·mt�1="" +="" (1�="" �1)="" ·="" gt="" (update="" biased="" first="" moment="" estimate)="" vt="" �2="" ·="" vt�1="" +="" (1�="" �2)="" ·="" g2t="" (update="" biased="" second="" raw="" moment="" estimate)="" m̂t="" mt/(1�="" �t1)="" (compute="" bias-corrected="" first="" moment="" estimate).="" v̂t="" vt/(1�="" �t2)="" (compute="" bias-correct="" second="" raw="" moment="" estimate).="" ✓̂t="" ✓t�1="" �="" ↵="" ·="" m̂t/(="" p="" v̂t="" +="" ✏)="" (update="" parameters)="" end="" while="" return="" ✓t="" (resulting="" parameters)="" exercise="" 4.="" consider="" the="" following="" problem="" setting:="" •="" the="" number="" of="" training="" data="" points="" is="" k="1000." the="" number="" of="" test="" data="" points="" is="" 50.="" •="" generation="" of="" the="" data="" set="" to="" fit="" for="" the="" model:="" the="" jth="" data="" point="" is="" represented="" by="" xj="" and="" is="" a="" 10-="" dimensional="" vector.="" each="" element="" of="" xj="" is="" being="" generated="" from="" a="" uniform="" distribution="" over="" the="" interval="" [�1,="" 1].="" •="" ✓true="" (i.e.="" the="" true="" parameter="" set)="" and="" ✓="" (i.e.="" the="" variable="" indicating="" a="" candidate="" parameter="" set)="" are="" also="" 10-dimensional="" vectors.="" assume="" that="" ✓true="" is="" all="" ones.="" however,="" we="" will="" pretend="" that="" we="" do="" not="" know="" it="" and="" we="" want="" to="" estimate="" it="" using="" the="" training="" data.="" •="" data="" is="" being="" generated="" according="" to="" yj="xTj" ✓true+0.1="" ·⇠="" (where="" ⇠="" ⇠="" n="" (0,="" 1)="" is="" a="" sample="" from="" the="" normal="" distribution).="" the="" label="" yj="" is="" a="" scalar.="" 1="" –="" stochastic="" optimization="" implementation="" 2="" •="" let="" us="" assume="" that="" all="" the="" algorithm="" variants="" start="" from="" the="" same="" initial="" ✓,="" whose="" elements="" are="" picked="" uniformly="" in="" the="" interval="" [0,="" 0.1].="" •="" use="" 1000="" updates.="" since="" this="" is="" an="" optimization="" problem,="" we="" need="" to="" define="" the="" loss="" function.="" let’s="" use="" the="" following="" format="" f="" (✓)="1" k="" kx="" j="1" q(✓,="" j)="" (3.1)="" q(✓,="" j)="����x" t="" j="" ✓="" �="" yj="" ����="" �="" (3.2)="" where="" �="" is="" a="" hyper-parameter="" that="" we="" use="" to="" control="" and="" define="" the="" loss="" function.="" for="" the="" following="" problems,="" state="" findings,="" provide="" analysis,="" and="" substantiate="" them="" in="" your="" write-up="" with="" screen-="" shots="" of="" the="" plots="" from="" the="" gradient="" descent="" python="" notebook="" code.="" please="" 
save="" the="" notebook="" to="" your="" own="" google="" drive="" so="" that="" your="" work="" could="" be="" preserved="" after="" the="" runtime="" expires.="" 1.="" (5="" points)="" for="" �="2," the="" problem="" becomes="" the="" least="" squares="" regression="" that="" you="" learned="" from="" chapter="" 13.="" state="" the="" closed="" form="" solution="" (i.e.="" ✓="...)" in="" your="" report,="" and="" then="" implement="" the="" solution="" to="" solve="" for="" ✓.="" provide="" the="" results="" of="" the="" experiment="" and="" state="" whether="" it="" is="" close="" to="" the="" true="" value.="" 2.="" (5="" points)="" for="" q(✓,="" j),="" find="" an="" expression="" that="" gives="" you="" the="" gradient="" rq(✓,="" j).="" report="" this="" expression,="" and="" implement="" it="" in="" the="" appropriate="" function.="" hint:="" for="" r(✓)="h(e(✓))" =="" h(xtj="" ✓="" �="" yj),="" the="" gradient="" can="" be="" written="" as="" rr(✓)="@h@e" ·="" re(✓)="@h" @e="" ·="" xj="" according="" to="" the="" chain="" rule.="" hint:="" the="" sign="" function,="" sgn(x),="" may="" be="" useful.="" 3.="" (5="" points)="" code="" the="" adam="" optimization="" algorithm="" (with="" default="" hyper-parameters="" such="" as="" the="" learning="" rate="" as="" mentioned="" in="" the="" pseudocode="" above)="" and="" sgd="" to="" find="" the="" best="" parameter="" ✓.="" use="" a="" batch="" size="" of="" 1="" for="" this="" problem,="" and="" perform="" 1000="" parameter="" updates.="" report="" the="" final="" set="" of="" parameters.="" 4.="" (5="" points)="" update="" your="" code="" to="" compute="" the="" average="" test="" loss="" at="" every="" 50th="" update,="" and="" plot="" these="" values.="" you="" might="" notice="" that="" the="" error="" bars="" of="" adam="" and="" sgd="" overlap.="" this="" is="" due="" to="" the="" inherent="" randomness="" from="" sampling="" values.="" to="" avoid="" this="" probabilistic="" overlap,="" increase="" the="" number="" of="" replicates="" (num="" rep="" in="" the="" starter="" code)="" until="" the="" error="" bars="" between="" adam="" and="" sgd="" do="" not="" overlap.="" report="" this="" curve.="" 5.="" (9="" points)="" run="" adam="" method="" for="" each="" of="" �="0.4," 0.7,="" 1,="" 2,="" 3,="" and="" 5.="" report="" your="" observations="" clearly,="" and="" analyze="" the="" trends="" you="" are="" seeing.="" state="" whether="" adam="" consistently="" out="" performs="" sgd.="" your="" analysis="" should="" include="" the="" reason="" why="" one="" method="" outperforms="" the="" other="" under="" each="" setting.="">