CS4803-7643: Deep Learning, Fall 2021
Problem Set 2

Instructor: Dhruv Batra
TAs: Joanne Truong, Andrew Szot, Amol Agrawal, Bhavika Devnani, Jordan Rodrigues, Sivagami Nambi, Swornim Baral
Discussions: https://piazza.com/gatech/fall2022/cs48037643
Due: 11:59pm, September 29, 2021

Instructions

1. We will be using Gradescope to collect your assignments. Please read the following instructions for submitting to Gradescope carefully!

   • For the HW2 component on Gradescope, upload a single PDF containing the answers to all the theory questions and the report for the coding problems. However, the solution to each problem or subproblem must be on a separate page. When submitting to Gradescope, please make sure to mark the page(s) corresponding to each problem/sub-problem. Likewise, the pages of the report must be marked with their corresponding subproblems.

   • For the HW2 Code component on Gradescope, please use the collect_submission.sh script provided and upload the resulting zip file to Gradescope. Please make sure you have saved the most recent version of your code.

   • Note: This is a large class and Gradescope's assignment segmentation features are essential. Failure to follow these instructions may result in parts of your assignment not being graded. We will not entertain regrading requests for failure to follow instructions.

2. LaTeX'd solutions are strongly encouraged (f21_hw2_starter/theory/sol2.tex), but scanned handwritten copies are acceptable. Hard copies are not accepted.

3. We generally encourage you to collaborate with other students. You may talk to a friend and discuss the questions and potential directions for solving them. However, you need to write your own solutions and code separately, not as a group activity. Please list the students you collaborated with.

1 Gradient Descent

1. [3 points] We often use iterative optimization algorithms such as gradient descent to find the $w$ that minimizes a loss function $f(w)$. Recall that in gradient descent, we start with an initial value of $w$ (say $w^{(1)}$) and iteratively take a step in the direction of the negative of the gradient of the objective function, i.e.,

$$w^{(t+1)} = w^{(t)} - \eta \nabla f(w^{(t)}) \tag{1}$$

for learning rate $\eta > 0$.

In this question, we will develop a slightly deeper understanding of this update rule, in particular for minimizing a convex function $f(w)$. Note: this analysis will not directly carry over to training neural networks, since loss functions for training neural networks are typically not convex, but it will (a) develop intuition and (b) provide a starting point for research in non-convex optimization (which is beyond the scope of this class).
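To make the update rule in Eqn 1 concrete, here is a minimal NumPy sketch of gradient descent on a simple convex quadratic. The objective, the learning rate, and the iteration count are illustrative choices and are not part of the assignment.

```python
import numpy as np

# Gradient descent (Eqn 1) on a convex quadratic f(w) = 0.5 w^T A w - b^T w.
# A, b, eta, and the iteration count are arbitrary illustrative choices.
A = np.array([[3.0, 0.5],
              [0.5, 1.0]])          # symmetric positive definite => f is convex
b = np.array([1.0, -2.0])

def grad_f(w):
    return A @ w - b                # gradient of the quadratic

eta = 0.1                           # learning rate, eta > 0
w = np.zeros(2)                     # initial iterate w^(1)
for t in range(200):
    w = w - eta * grad_f(w)         # w^(t+1) = w^(t) - eta * grad f(w^(t))

print(w)                            # should approach the exact minimizer ...
print(np.linalg.solve(A, b))        # ... which is A^{-1} b for this quadratic
```

With this particular A, setting eta near 1 instead makes the iterates oscillate and diverge, which is the step-size issue that Section 2 (Estimating Hessians) revisits.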
Recall the first-order Taylor approximation of $f$ at $w^{(t)}$:

$$f(w) \approx f(w^{(t)}) + \langle w - w^{(t)}, \nabla f(w^{(t)}) \rangle \tag{2}$$

When $f$ is convex, this approximation forms a lower bound of $f$, i.e.,

$$f(w) \ge \underbrace{f(w^{(t)}) + \langle w - w^{(t)}, \nabla f(w^{(t)}) \rangle}_{\text{affine lower bound to } f(\cdot)} \quad \forall w \tag{3}$$

Since this approximation is a 'simpler' function than $f(\cdot)$, we could consider minimizing the approximation instead of $f(\cdot)$. Two immediate problems: (1) the approximation is affine (thus unbounded from below) and (2) the approximation is faithful only for $w$ close to $w^{(t)}$.

To solve both problems, we add a squared $\ell_2$ proximity term to the approximation minimization:

$$\operatorname*{argmin}_{w} \; \underbrace{f(w^{(t)}) + \langle w - w^{(t)}, \nabla f(w^{(t)}) \rangle}_{\text{affine lower bound to } f(\cdot)} \; + \; \underbrace{\frac{\lambda}{2}}_{\text{trade-off}} \underbrace{\left\lVert w - w^{(t)} \right\rVert^2}_{\text{proximity term}} \tag{4}$$

Notice that the optimization problem above is an unconstrained quadratic programming problem, meaning that it can be solved in closed form (hint: gradients). What is the solution $w^*$ of the above optimization? What does that tell you about the gradient descent update rule? What is the relationship between $\lambda$ and $\eta$?

2. [4 points] Let's prove a lemma that will initially seem disconnected from the rest of the analysis but will come in handy in the next sub-question when we start combining things. Specifically, the analysis in this sub-question holds for any $w^\star$, but in the next sub-question we will use it for the $w^\star$ that minimizes $f(w)$.

Consider a sequence of vectors $v_1, v_2, \dots, v_T$, and an update equation of the form $w^{(t+1)} = w^{(t)} - \eta v_t$ with $w^{(1)} = 0$. Show that:

$$\sum_{t=1}^{T} \langle w^{(t)} - w^\star, v_t \rangle \le \frac{\lVert w^\star \rVert^2}{2\eta} + \frac{\eta}{2} \sum_{t=1}^{T} \lVert v_t \rVert^2 \tag{5}$$

3. [4 points] Now let's start putting things together and analyze the convergence rate of gradient descent, i.e., how fast it converges to $w^\star$. For this question, assume that $f$ is convex and $\rho$-Lipschitz. A function that is $\rho$-Lipschitz is one where the norm of its gradient is bounded by $\rho$, i.e., $\lVert \nabla f(w) \rVert \le \rho$.

First, show that for $\bar{w} = \frac{1}{T} \sum_{t=1}^{T} w^{(t)}$,

$$f(\bar{w}) - f(w^\star) \le \frac{1}{T} \sum_{t=1}^{T} \langle w^{(t)} - w^\star, \nabla f(w^{(t)}) \rangle \tag{6}$$

Next, use the result from part 2, with upper bounds $B$ and $\rho$ for $\lVert w^\star \rVert$ and $\lVert \nabla f(w^{(t)}) \rVert$ respectively, and show that for fixed $\eta = \sqrt{\frac{B^2}{\rho^2 T}}$, the convergence rate of gradient descent is $O(1/\sqrt{T})$, i.e., the upper bound on $f(\bar{w}) - f(w^\star)$ is proportional to $\frac{1}{\sqrt{T}}$.

2 Estimating Hessians

4. [6 points] Optimization is an extremely important part of deep learning. In the previous question, we explored gradient descent, which uses the direction of maximum change to minimize a loss function. However, gradient descent leaves a few questions unresolved: how do we choose the learning rate $\eta$? If $\eta$ is small, we will take a long time to reach the optimal point; if $\eta$ is large, we will oscillate from one side of the curve to the other. So what should we do?

One solution is to use Hessians, which measure curvature, i.e., the rate of change of the gradients. Intuitively, if we knew how steep a curve were, we would know how fast we should move in a given direction. This is the intuition behind second-order optimization methods such as Newton's method.

Let us formally define the Hessian matrix $H$ of a function $f$ as the square $n \times n$ matrix containing all second partial derivatives of $f$, i.e.:

$$H = \begin{bmatrix} \frac{\partial^2 f}{\partial x_1^2} & \frac{\partial^2 f}{\partial x_1 \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_1 \partial x_n} \\ \frac{\partial^2 f}{\partial x_2 \partial x_1} & \frac{\partial^2 f}{\partial x_2^2} & \cdots & \frac{\partial^2 f}{\partial x_2 \partial x_n} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial^2 f}{\partial x_n \partial x_1} & \frac{\partial^2 f}{\partial x_n \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_n^2} \end{bmatrix}$$

Recall the second-order Taylor approximation of $f$ at $w^{(t)}$:

$$f(w) \approx f(w^{(t)}) + \langle w - w^{(t)}, \nabla f(w^{(t)}) \rangle + \frac{1}{2} (w - w^{(t)})^\top H (w - w^{(t)}) \tag{7}$$

(a) What is the solution to the following optimization problem?

$$\operatorname*{argmin}_{w} \left[ f(w^{(t)}) + \langle w - w^{(t)}, \nabla f(w^{(t)}) \rangle + \frac{1}{2} (w - w^{(t)})^\top H (w - w^{(t)}) \right] \tag{8}$$

What does that tell you about how to set the learning rate $\eta$ in gradient descent?

Now that we've derived Newton's update algorithm, we should also mention that there is a catch to using Newton's method. Newton's method requires us to (1) calculate $H$, and (2) invert $H$. Having to compute a Hessian is expensive; $H$ is massive and we would also have to figure out how to store it.
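To see the storage problem concretely, here is a small sketch, assuming PyTorch is available (it is not required by this problem), that materializes the full Hessian of a loss with respect to a flattened parameter vector. The tiny two-layer architecture and the random data are made up for illustration and are not the MLP described in part (b) below.

```python
import torch
from torch.autograd.functional import hessian

# Deliberately tiny, made-up two-layer network so the full Hessian fits in memory.
n_in, n_hidden, n_out = 4, 8, 3
n_params = n_in * n_hidden + n_hidden + n_hidden * n_out + n_out   # weights + biases

x = torch.randn(16, n_in)               # fake inputs, illustration only
y = torch.randint(0, n_out, (16,))      # fake labels

def loss_fn(theta):
    # Unpack the flat parameter vector theta into two fully-connected layers.
    i = 0
    W1 = theta[i:i + n_in * n_hidden].reshape(n_in, n_hidden); i += n_in * n_hidden
    b1 = theta[i:i + n_hidden];                                 i += n_hidden
    W2 = theta[i:i + n_hidden * n_out].reshape(n_hidden, n_out); i += n_hidden * n_out
    b2 = theta[i:]
    h = torch.relu(x @ W1 + b1)
    s = h @ W2 + b2
    return torch.nn.functional.cross_entropy(s, y)

theta = torch.randn(n_params)
H = hessian(loss_fn, theta)             # full Hessian: shape (n_params, n_params)
print(n_params, H.shape)
```

Even for this toy model the Hessian has n_params squared entries, so the storage cost grows quadratically with the parameter count.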
(b) Consider an MLP with 3 fully-connected layers, each with 50 hidden neurons, except for the output layer, which represents 10 classes. We can represent the transformations as $x \in \mathbb{R}^{50} \rightarrow h^{(1)} \in \mathbb{R}^{50} \rightarrow h^{(2)} \in \mathbb{R}^{50} \rightarrow s \in \mathbb{R}^{10}$. Assume that $x$ does not include any bias feature appended to it. How many parameters are in this MLP? What is the size of the corresponding Hessian?

Rather than store and manipulate the Hessian $H$ directly, we will instead focus on being able to compute the result of a Hessian-vector product $Hv$, where $v$ is an arbitrary vector. Why? Because in many cases one does not need the full Hessian but only $Hv$. Computing $Hv$ is a core building block for computing a number of quantities, including $H^{-1} \nabla f$ (hint, hint). You will next show a surprising result: it is possible to 'extract information from the Hessian', specifically to compute the Hessian-vector product, without ever explicitly calculating or storing the Hessian itself!

Consider the Taylor series expansion of the gradient operator about a point in weight space:

$$\nabla_w(w + \Delta w) = \nabla_w(w) + H \Delta w + O(\lVert \Delta w \rVert^2) \tag{9}$$

where $w$ is a point in weight space, $\Delta w$ is a perturbation of $w$, $\nabla_w$ is the gradient, and $\nabla_w(w + \Delta w)$ is the gradient of $f$ evaluated at $w + \Delta w$. If you have difficulty understanding the expression above, consider starting with Eqn 2, replacing $w - w^{(t)}$ with $\Delta w$ and $f(\cdot)$ with $\nabla_w(\cdot)$.

(c) Use Eqn 9 to derive a numerical approximation of $Hv$ (in terms of $\nabla_w$). Hint: Consider choosing $\Delta w = rv$, where $v$ is a vector and $r$ is a small number.

The approximation you derived above is susceptible to numerical instability. We would like a method that is free of these numerical issues and is exact instead of an approximation. To that end, let's now define a useful operator, known as the R-operator. The R-operator with respect to $v$ is defined as:

$$\mathcal{R}_v\{f(w)\} = \left. \frac{\partial}{\partial r} f(w + rv) \right|_{r=0} \tag{10}$$

(d) The R-operator has many useful properties. Let's first prove some of them. Show that:

$$\mathcal{R}_v\{c\, f(w)\} = c\, \mathcal{R}_v\{f(w)\} \qquad \text{[linearity under scalar multiplication]}$$

$$\mathcal{R}_v\{f(w)\, g(w)\} = \mathcal{R}_v\{f(w)\}\, g(w) + f(w)\, \mathcal{R}_v\{g(w)\} \qquad \text{[product rule of R-operators]}$$

(e) Now, instead of numerically approximating $Hv$, use the R-operator to derive an equation to exactly calculate $Hv$.

(f) Explain how you might implement $Hv$ in MLPs if you already have access to an auto-differentiation library.

3 Automatic Differentiation

5. [4 points] In practice, writing the closed-form expression of the derivative of a loss function $f$ w.r.t. the parameters of a deep neural network is hard (and mostly unnecessary) as $f$ becomes complex. Instead, we define computation graphs and use automatic differentiation algorithms (typically backpropagation) to compute gradients using the chain rule. For example, consider the expression

$$f(x, y) = (x + y)(y + 1) \tag{11}$$

Let's define intermediate variables $a$ and $b$ such that

$$a = x + y \tag{12}$$

$$b = y + 1 \tag{13}$$

$$f = a \times b \tag{14}$$

A computation graph for the "forward pass" through $f$ is shown in Fig. 1.

[Figure 1: computation graph for the forward pass through $f(x, y) = (x + y)(y + 1)$, built from the intermediate variables $a$ and $b$.]

We can then work backwards and compute the derivative of $f$ w.r.t. each intermediate variable ($\frac{\partial f}{\partial a}$, $\frac{\partial f}{\partial b}$) and chain them together to get $\frac{\partial f}{\partial x}$ and $\frac{\partial f}{\partial y}$.

Let $\sigma(\cdot)$ denote the standard sigmoid function. Now, for the following vector function:

$$f_1(w_1, w_2) = \cos(w_1)\cos(w_2) + \sigma(w_2) \tag{15}$$

$$f_2(w_1, w_2) = \ln(w_1 + w_2) + w_1^2 w_2 \tag{16}$$

(a) Draw the computation graph. Compute the value of $f$ at $\vec{w} = (1, 2)$.

(b) At this $\vec{w}$, compute the Jacobian $\frac{\partial \vec{f}}{\partial \vec{w}}$ using numerical differentiation (using $\Delta w = 0.01$).

(c) At this $\vec{w}$, compute the Jacobian using forward-mode auto-differentiation.

(d) At this $\vec{w}$, compute the Jacobian using backward-mode auto-differentiation.

(e) Don't you love that software exists to do this for us?
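In the spirit of part (e), here is a sketch of asking an autodiff library for this Jacobian. PyTorch is assumed purely for illustration; this machine check is not a substitute for the hand computations asked for in parts (b) through (d).

```python
import torch
from torch.autograd.functional import jacobian

def f(w):
    # Vector function from Eqns 15-16: returns (f1(w1, w2), f2(w1, w2)).
    w1, w2 = w[0], w[1]
    f1 = torch.cos(w1) * torch.cos(w2) + torch.sigmoid(w2)
    f2 = torch.log(w1 + w2) + w1 ** 2 * w2
    return torch.stack([f1, f2])

w = torch.tensor([1.0, 2.0])
print(f(w))             # value of f at w = (1, 2), as in part (a)
print(jacobian(f, w))   # 2x2 Jacobian df/dw at the same point (reverse mode by default)
```

Forward-mode equivalents also exist in modern autodiff libraries (for example jax.jacfwd, or torch.func.jvp in recent PyTorch releases), mirroring the computation in part (c).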
4 Convolutions

6.