Final CS5250 Final The Final for CS 5250 will be a short research paper. Details: ● The paper should be approximately 3-5 pages, single spaced. ● You may send me a digital copy of your paper, in PDF...

Question 3-6


Final CS5250 Final The Final for CS 5250 will be a short research paper. Details: ● The paper should be approximately 3-5 pages, single spaced. ● You may send me a digital copy of your paper, in PDF format. You may submit it to Blackboard. ● This is an open-book, open Internet final, and you have as much time as you want, but it must be your own work. Do not collaborate with others, and do not plagiarize from the internet! ● The primary source for this research paper is chapter 7 of the book. If you do not have a copy of the book, you will find the relevant information here (which is where the book got the information): https://cloud.google.com/blog/products/gcp/an-in-depth-look-at-googles-first-tensor-proc essing-unit-tpu In Chapter 7, the book discusses Domain Specific Architectures. So far in the book we have been talking about general purpose computers, which are designed to run basically any program. But we are close to running out of options to continue increasing performance for a general purpose computer. The next option is to design special hardware that is really good at solving a specific problem. For your research paper, you need to answer 3 questions. The first two are rather simpler, and your answer should be about a paragraph. The last question is more complex, and will require most of the time. Question 1: What is Dennard Scaling? What is its basic principle, and what does it tell us? Does Dennard Scaling still apply? If so, for how much longer do we expect it to apply, and if not, when did it end? Do not just copy this from the internet. Study this, then answer these questions in your own words. Question 2: What is Moore's Law? What did it tell us? Does Moore's law still apply, or does it no longer work? Questions 3-6: Read the article listed above regarding Google's Tensor Processing Unit. There are many interesting things there, so read through the whole article. The questions for you to answer deals with the multiplier array, and are given below. The key for this processor is their systolic multiplier array. The purpose of this array is to multiply an array of values by an array of weights. In our example here (and in the paper) the array size is 3x3, but in the real chip it is 256x256. When this is being used, all of the weight values are loaded into the array. The first input matrix then enters the array through the top, and the resulting array appears out the right side of the array. The first input array is followed by the second input array, which is then followed by the third, and so on. Be sure to watch the animated GIF in the paper, this shows how the data flows through the array. Because this is a custom chip, we don't have to store all of the values and all the partial products into memory, they can be stored in registers on the chip, and these registers directly connect to their data sources and sinks, so all of these operations can happen in parallel. The system is pipelined, so that during each clock cycle, each cell is multiplying one input value by a fixed weight value, then adding the result to a sum. During the next clock cycle the input value moves down one location in the array, and the partial sum moves right one location in the array. Because this is pipelined, all 9 multipliers are working each clock cycle. The calculation performed by this array is: Y = W × X, where W is a matrix of weights, X is a matrix of input values, and Y is the matrix of output values. For our 3 × 3 case, the circuit computes: Y11 = W11 × X11 + W12 × X12 + W13 × X13 Y12 = W21 × X11 + W22 × X12 + W23 × X13 Y13 = W31 × X11 + W32 × X12 + W33 × X13 Y21 = W11 × X21 + W12 × X22 + W13 × X23 Y22 = W21 × X21 + W22 × X22 + W23 × X23 Y23 = W31 × X21 + W32 × X22 + W33 × X23 Y31 = W11 × X31 + W12 × X32 + W13 × X33 Y32 = W21 × X31 + W22 × X32 + W23 × X33 Y33 = W31 × X31 + W32 × X32 + W33 × X33 For the real chip, the matrices are 256 × 256, so the problem is a LOT bigger. Question 3: Write MIPS code to implement the above 3 × 3 multiplier. All of the W and X values need to be loaded from memory, and all of the resulting Y values are stored into memory. You can use registers for holding partial products (and of course the registers are used when loading values to and from memory). This is doing floating point adds and multiplies, so try to design your code to minimize stalls. How many cycles does it take to perform this operation? Question 4: Estimate how long it would take to do the complete 256 × 256 processor using MIPS. Consider that there are not nearly enough registers on the chip to load values only once then use them for all the calculations. Question 5: For the TPU chip, the weight values are already loaded into the whole array. Therefore, we only have to read the X values in from memory once (whereas with MIPS we had to load the values multiple times, once for each 'cell' in the multiplier), and we only have to store the Y values once. Also note that we don't have to read/write intermediate values to memory or registers, as they are stored in latches in the array. Consequently, calculate the estimated time to load the X matrix and to store the Y matrix. Question 6: How much faster is the TPU version of this array multiplier than the MIPS version?
May 03, 2021
SOLUTION.PDF

Get Answer To This Question

Related Questions & Answers

More Questions »

Submit New Assignment

Copy and Paste Your Assignment Here