Question 1: (15 points)
Suppose we are considering a
change to an instruction set. The
base machine initially has only loads and stores to memory, and all operations
work on the registers. Such
machines are called load-store machines (see Appendix A of Hennessey-Patterson 4th
edition). Measurements of the
load-store machine show that the instruction mix and the clock cycles per
instruction are as follows:
|
Instruction Type |
Frequency |
Clock Cycle Count |
|
ALU operations |
43% |
1 |
|
Loads |
21% |
2 |
|
Stores |
12% |
2 |
|
Branches |
24% |
2 |
Based on our analysis of the
above instruction traces, we have also discovered that 25% of the arithmetic
logic unit (ALU) operations indirectly use a loaded operand – but that is
never used again.
This seems a bit wasteful, and
we are considering the addition of ALU instructions that have one source
operand in memory. Of course
implementing this instruction type will have other repercussions. These are detailed below:
á
The
new register-memory instructions will have a clock cycle count of 2.
á
Because
we will now have three types of instructions needing to access memory (new ALU
type, loads, and stores), we have decided to increase the number of ports
available to memory. After careful
analysis, we discovered that increasing the number of ports would have the
added bonus of decreasing the clock cycle count for all store instructions from
2 to 1.
á
That
said, the extra chip area for our additional memory ports comes at the expense
of our branch prediction hardware.
As a result, the clock cycle count for branches will increase from 2 to
3.,
á
The
overall clock cycle time (i.e. the clock rate) will not change.
At the end of the day, will the
above enhancements improve CPU performance? Justify your answer.
Question 2: (15 points)
After analyzing instruction traces for a set of
benchmarks of interest, we have discovered that we often find an instruction
sequence similar to that shown below:
LOAD $1, 0($2) #
Load the contents of the memory address stored in $2 into $1
ADDi $2, $2, 4 #
Add 4 to the contents of $2 to specify the next memory address to load
Because this instruction mix appears so frequently, we
are contemplating adding a new instruction that will accomplish both of these
tasks (as part of the same instruction).
An example of our new instruction is shown below:
LOADi
$1, 0($2) #
$1 § Memory($2) || $2 = $2 + 4
As you can never get Òsomething for nothing,Ó obviously
there will be a performance penalty for making this switch. In this case, our clock cycle time will
increase. (The clock rate of the
original machine is 1.1 GHz, and the clock rate of the new machine would be
1.05 GHz).
By studying our benchmark instruction traces, we have
discovered that 1 out of every 4 instructions is a load. (Note that I am not saying that 1 out
of every 4 instructions is a load followed by an add immediate
instruction. IÕm just saying that
1 out of every 4 instructions is a load.)
What percentage of these loads must be eliminated in
order for this new instruction to be a Ògood ideaÓ? (You may assume that CPI does not change, only the clock
rateÉ)
Question 3: (10 points)
Your company has a benchmark that is considered
representative of your typical applications. An embedded processor under consideration to support your task
does not have a floating-point unit and must emulate each floating-point
instruction by a sequence of integer instructions. This processor is rated at 120 MIPS on the benchmark. A third-party vendor offers a
compatible co-processor to boost performance. The co-processor executes each floating-point instruction in
hardware (i.e. no emulation is necessary). The processor/co-processor combination rates 80 MIPS on the
same benchmark. Use the following
symbols to answer parts (a)-(e) of this question:
á
I:
Number of integer
instructions executed on the benchmark
á
F:
Number of floating-point
instructions executed on the benchmark
á
Y: Number
of integer instructions to emulate one floating-point instruction
á
W: Time to execute the benchmark on
the processor alone
á
B: Time to execute the benchmark
on the processor/co-processor combination
(a)
Write
an equation for the MIPS rating of each configuration using the symbols above.
(b)
For
the configuration without the co-processor, we measure that F = 8x106,
Y=50, and W=4 seconds. Find I.
(c)
What
is the value of B?
(d)
What
is the MFLOPS rating of the system with the co-processor?
(e)
Your
colleague wants to purchase the co-processor even though the MIPS rating for
the configuration using the coprocessor is less than that of the processor
alone. Is your colleagueÕs
evaluation correct? Defend your
answer.
Question 4: (10 points)
LittleÕs Law: A computer can process database requests
at a steady state rate of 2000 requests per second. Each transaction takes, on
average, 0.01 seconds to process. On average, how many requests are being
processed at any given instant? If the processor only has room to hold half
that many requests at any given time, what will happen to the system
throughput? (i.e., how many requests/second will the computer process?)
Question 5: (10 points – a, b = 2 points, c, d = 3 points)
Answer question 1.14 on page 62 of Hennessey and
Patterson, A Quantitative Approach, 4th Edition. For convenience, this question appears
below:
Your company has just bought a new dual Pentium
processor, and you have been tasked with optimizing your software for this
processor. You will run two
applications on this dual Pentium, but the resource requirements are not equal. The first application needs 80% of the
resources, and the other only 20% of the resources.
Question 6: (20 points)
As we discussed in Lecture 01, power has become an
important consideration in the field of computer architecture. In order to learn a little bit more,
for this question, you will read Trevor MudgeÕs paper ÒPower: A First-Class Architectural Design
ConstraintÓ and answer the questions below:
(a)
Briefly
describe the three components of power of which computer architects should be
aware? (Note, a good answer will
have ~2-4 sentences on each and convince me that you really understand Òwhere
the power is coming fromÉÓ)
(b)
Mudge
indicates that parallel processing may be a way to limit power while not
sacrificing performance. However,
this will only work to a point.
Why? (Again, write about
3-4 sentences.)
(c)
Mudge
notes that there may be better metrics than MIPS/W when considering
architectural performance in the context of power. One is an energy delay product. Why might this be a good metric? (3-4 sentences)
(d)
Discuss
how we might alleviate leakage power in memory. What are the pros and cons of the approaches proposed in
MudgeÕs paper? (5-6 sentences)
(e)
Mudge
talks about alleviating power associated with interchip buses. One method that he discusses is to
encode address line data into a Gray code. Why might this be a good idea? (Hint: see
slide 38 from Lecture 01)
Question 7: (20 points)
SimpleScalar is a set of simulator tools that we will
use throughout the semester to study different aspects of different computer
architectures. The SimpleScalar
toolset (see www.simplescalar.com for more information), which is written in C,
is used widely in research for evaluating microarchitectural ideas, and it
includes several types of simulators. These simulators trade off speed of
simulation versus modeling detail. They can all simulate several ISAs, but in
this class we will only use them to simulate the Alpha ISA (used by DEC and
then Compaq).
On an x86/Linux machine, create a working directory.
Copy the Þles instruct-progs.tar.gz and simplesim-3v0d.tar.gz from /afs/nd.edu/coursesp.08/cse/cse60321.01/simplescalar
to your working directory. For both of these Þles, gunzip (gunzip file.tar.gz)
and then untar (tar -xvf file.tar) them. Move to the newly created simplesim-3.0
directory and then build the purely functional simulator, sim-safe, by typing make
sim-safe. Now you are ready to run the benchmarks that are in the newly created
benchmarks directory. Follow the instructions for running 3 out of the 4
benchmarks (all but compress) that are in benchmarks/README. The target is
ÒalphaÓ, since we are simulating the Alpha ISA. To make sure everything is
running correctly, here are the instruction counts for go (545811989), gcc
(337360339), and anagram (25595137). It is possible your instruction counts
will be slightly different (less than 1% different) but this is OK. Also, do
not worry if the output is not the same as the reference outputs—this is
also OK.
Experiment: For each of these three benchmarks, evaluate
the following design idea. Assume
that you have a 1GHz processor whose clock rate is bottlenecked by the time to
access the L1 data cache for loads (but not stores), and that all instructions
currently take 1 cycle.
We could increase the clock rate to 1.5 GHz if we let
loads take 2 cycles each (but all other instructions are still 1 cycle). Is
this a good idea? Show your math!
To evaluate this idea, you must modify sim-safe to count
loads and stores of all types.
Note: sim-safe reports Òsim_elapsed_timeÓ when it is
done. This is a measure of how long the simulation took to run, NOT how long it
would take the simulated machine to run the benchmark. This is a very important
distinction.