The Final for CS 5250 will be a short research paper. Details:
● The paper should be approximately 3-5 pages, single spaced.
● You may send me a digital copy of your paper, in PDF format. You may submit it to
● This is an open-book, open Internet final, and you have as much time as you want, but it
must be your own work. Do not collaborate with others, and do not plagiarize from the
● The primary source for this research paper is chapter 7 of the book. If you do not have
a copy of the book, you will find the relevant information here (which is where the book
got the information):
In Chapter 7, the book discusses Domain Specific Architectures. So far in the book we have
been talking about general purpose computers, which are designed to run basically any
program. But we are close to running out of options to continue increasing performance for a
general purpose computer. The next option is to design special hardware that is really good at
solving a specific problem.
For your research paper, you need to answer 3 questions. The first two are rather simpler, and
your answer should be about a paragraph. The last question is more complex, and will require
most of the time.
Question 1: What is Dennard Scaling? What is its basic principle, and what does it tell us?
Does Dennard Scaling still apply? If so, for how much longer do we expect it to apply, and if
not, when did it end? Do not just copy this from the internet. Study this, then answer these
questions in your own words.
Question 2: What is Moore's Law? What did it tell us? Does Moore's law still apply, or does it
no longer work?
Questions 3-6: Read the article listed above regarding Google's Tensor Processing Unit.
There are many interesting things there, so read through the whole article. The questions for
you to answer deals with the multiplier array, and are given below.
The key for this processor is their systolic multiplier array. The purpose of this array is to
multiply an array of values by an array of weights. In our example here (and in the paper) the
array size is 3x3, but in the real chip it is 256x256. When this is being used, all of the weight
values are loaded into the array. The first input matrix then enters the array through the top,
and the resulting array appears out the right side of the array. The first input array is followed
by the second input array, which is then followed by the third, and so on. Be sure to watch the
animated GIF in the paper, this shows how the data flows through the array. Because this is a
custom chip, we don't have to store all of the values and all the partial products into memory,
they can be stored in registers on the chip, and these registers directly connect to their data
sources and sinks, so all of these operations can happen in parallel.
The system is pipelined, so that during each clock cycle, each cell is multiplying one input
value by a fixed weight value, then adding the result to a sum. During the next clock cycle the
input value moves down one location in the array, and the partial sum moves right one location
in the array. Because this is pipelined, all 9 multipliers are working each clock cycle.
The calculation performed by this array is: Y = W × X,
where W is a matrix of weights,
X is a matrix of input values, and
Y is the matrix of output values.
For our 3 × 3 case, the circuit computes:
Y11 = W11 × X11 + W12 × X12 + W13 × X13
Y12 = W21 × X11 + W22 × X12 + W23 × X13
Y13 = W31 × X11 + W32 × X12 + W33 × X13
Y21 = W11 × X21 + W12 × X22 + W13 × X23
Y22 = W21 × X21 + W22 × X22 + W23 × X23
Y23 = W31 × X21 + W32 × X22 + W33 × X23
Y31 = W11 × X31 + W12 × X32 + W13 × X33
Y32 = W21 × X31 + W22 × X32 + W23 × X33
Y33 = W31 × X31 + W32 × X32 + W33 × X33
For the real chip, the matrices are 256 × 256, so the problem is a LOT bigger.
Question 3: Write MIPS code to implement the above 3 × 3 multiplier. All of the W and X
values need to be loaded from memory, and all of the resulting Y values are stored into
memory. You can use registers for holding partial products (and of course the registers are
used when loading values to and from memory). This is doing floating point adds and
multiplies, so try to design your code to minimize stalls. How many cycles does it take to
perform this operation?
Question 4: Estimate how long it would take to do the complete 256 × 256 processor using
MIPS. Consider that there are not nearly enough registers on the chip to load values only
once then use them for all the calculations.
Question 5: For the TPU chip, the weight values are already loaded into the whole array.
Therefore, we only have to read the X values in from memory once (whereas with MIPS we
had to load the values multiple times, once for each 'cell' in the multiplier), and we only have to
store the Y values once. Also note that we don't have to read/write intermediate values to
memory or registers, as they are stored in latches in the array. Consequently, calculate the
estimated time to load the X matrix and to store the Y matrix.
Question 6: How much faster is the TPU version of this array multiplier than the MIPS