Question 1:  (15 points)

Suppose we are considering a change to an instruction set.  The base machine initially has only loads and stores to memory, and all operations work on the registers.  Such machines are called load-store machines (see Appendix A of Hennessey-Patterson 4th edition).  Measurements of the load-store machine show that the instruction mix and the clock cycles per instruction are as follows:

 

Instruction Type

Frequency

Clock Cycle Count

ALU operations

43%

1

Loads

21%

2

Stores

12%

2

Branches

24%

2

 

Based on our analysis of the above instruction traces, we have also discovered that 25% of the arithmetic logic unit (ALU) operations indirectly use a loaded operand – but that is never used again.

 

This seems a bit wasteful, and we are considering the addition of ALU instructions that have one source operand in memory.  Of course implementing this instruction type will have other repercussions.  These are detailed below:

á          The new register-memory instructions will have a clock cycle count of 2.

á          Because we will now have three types of instructions needing to access memory (new ALU type, loads, and stores), we have decided to increase the number of ports available to memory.  After careful analysis, we discovered that increasing the number of ports would have the added bonus of decreasing the clock cycle count for all store instructions from 2 to 1.

á          That said, the extra chip area for our additional memory ports comes at the expense of our branch prediction hardware.  As a result, the clock cycle count for branches will increase from 2 to 3.,

á          The overall clock cycle time (i.e. the clock rate) will not change.

 

At the end of the day, will the above enhancements improve CPU performance?  Justify your answer.

 

 

 

Question 2:  (15 points)

After analyzing instruction traces for a set of benchmarks of interest, we have discovered that we often find an instruction sequence similar to that shown below:

 

                  LOAD   $1, 0($2)               # Load the contents of the memory address stored in $2 into $1

                  ADDi     $2, $2, 4                # Add 4 to the contents of $2 to specify the next memory address to load

 

Because this instruction mix appears so frequently, we are contemplating adding a new instruction that will accomplish both of these tasks (as part of the same instruction).  An example of our new instruction is shown below:

 

                  LOADi $1, 0($2)                # $1 § Memory($2) || $2 = $2 + 4

 

As you can never get Òsomething for nothing,Ó obviously there will be a performance penalty for making this switch.  In this case, our clock cycle time will increase.  (The clock rate of the original machine is 1.1 GHz, and the clock rate of the new machine would be 1.05 GHz).

 

By studying our benchmark instruction traces, we have discovered that 1 out of every 4 instructions is a load.  (Note that I am not saying that 1 out of every 4 instructions is a load followed by an add immediate instruction.  IÕm just saying that 1 out of every 4 instructions is a load.) 

 

What percentage of these loads must be eliminated in order for this new instruction to be a Ògood ideaÓ?  (You may assume that CPI does not change, only the clock rateÉ)

 

 

 

Question 3:  (10 points)

Your company has a benchmark that is considered representative of your typical applications.  An embedded processor under consideration to support your task does not have a floating-point unit and must emulate each floating-point instruction by a sequence of integer instructions.  This processor is rated at 120 MIPS on the benchmark.  A third-party vendor offers a compatible co-processor to boost performance.  The co-processor executes each floating-point instruction in hardware (i.e. no emulation is necessary).  The processor/co-processor combination rates 80 MIPS on the same benchmark.  Use the following symbols to answer parts (a)-(e) of this question:

 

á          I:      Number of integer instructions executed on the benchmark

á          F:    Number of floating-point instructions executed on the benchmark

á          Y:  Number of integer instructions to emulate one floating-point instruction

á          W:   Time to execute the benchmark on the processor alone

á          B:    Time to execute the benchmark on the processor/co-processor combination

 

(a)                  Write an equation for the MIPS rating of each configuration using the symbols above.

(b)                  For the configuration without the co-processor, we measure that F = 8x106, Y=50, and W=4 seconds.  Find I.

(c)                  What is the value of B?

(d)                  What is the MFLOPS rating of the system with the co-processor?

(e)                  Your colleague wants to purchase the co-processor even though the MIPS rating for the configuration using the coprocessor is less than that of the processor alone.  Is your colleagueÕs evaluation correct?  Defend your answer.

 

 

 

Question 4:  (10 points)

LittleÕs Law: A computer can process database requests at a steady state rate of 2000 requests per second. Each transaction takes, on average, 0.01 seconds to process. On average, how many requests are being processed at any given instant? If the processor only has room to hold half that many requests at any given time, what will happen to the system throughput? (i.e., how many requests/second will the computer process?)

 

 

 

Question 5:  (10 points – a, b = 2 points, c, d = 3 points)

Answer question 1.14 on page 62 of Hennessey and Patterson, A Quantitative Approach, 4th Edition.  For convenience, this question appears below:

 

Your company has just bought a new dual Pentium processor, and you have been tasked with optimizing your software for this processor.  You will run two applications on this dual Pentium, but the resource requirements are not equal.  The first application needs 80% of the resources, and the other only 20% of the resources.

 

  1. Given that 40% of the first application is parallelizable, how much speedup would you achieve with that application if run in isolation?
  2. Given that 99% of the second application is parallelizable, how much speedup would this application observe if run in isolation?
  3. Given that 40% of the first application is parallelizable, how much overall system speedup would you observe if you parallelized it?
  4. Given that 99% of the second application is parallelizable, how much overall system speedup would you get?

 

 

 

Question 6:  (20 points)

As we discussed in Lecture 01, power has become an important consideration in the field of computer architecture.  In order to learn a little bit more, for this question, you will read Trevor MudgeÕs paper ÒPower:  A First-Class Architectural Design ConstraintÓ and answer the questions below:

(a)                  Briefly describe the three components of power of which computer architects should be aware?  (Note, a good answer will have ~2-4 sentences on each and convince me that you really understand Òwhere the power is coming fromÉÓ)

(b)                  Mudge indicates that parallel processing may be a way to limit power while not sacrificing performance.  However, this will only work to a point.  Why?  (Again, write about 3-4 sentences.)

(c)                  Mudge notes that there may be better metrics than MIPS/W when considering architectural performance in the context of power.  One is an energy delay product.  Why might this be a good metric?  (3-4 sentences)

(d)                  Discuss how we might alleviate leakage power in memory.  What are the pros and cons of the approaches proposed in MudgeÕs paper?  (5-6 sentences)

(e)                  Mudge talks about alleviating power associated with interchip buses.  One method that he discusses is to encode address line data into a Gray code.  Why might this be a good idea?  (Hint:  see slide 38 from Lecture 01)

 

 

 

Question 7:  (20 points)

SimpleScalar is a set of simulator tools that we will use throughout the semester to study different aspects of different computer architectures.  The SimpleScalar toolset (see www.simplescalar.com for more information), which is written in C, is used widely in research for evaluating microarchitectural ideas, and it includes several types of simulators. These simulators trade off speed of simulation versus modeling detail. They can all simulate several ISAs, but in this class we will only use them to simulate the Alpha ISA (used by DEC and then Compaq). 

 

On an x86/Linux machine, create a working directory. Copy the Þles instruct-progs.tar.gz and simplesim-3v0d.tar.gz from /afs/nd.edu/coursesp.08/cse/cse60321.01/simplescalar to your working directory. For both of these Þles, gunzip (gunzip file.tar.gz) and then untar (tar -xvf file.tar) them. Move to the newly created simplesim-3.0 directory and then build the purely functional simulator, sim-safe, by typing make sim-safe. Now you are ready to run the benchmarks that are in the newly created benchmarks directory. Follow the instructions for running 3 out of the 4 benchmarks (all but compress) that are in benchmarks/README. The target is ÒalphaÓ, since we are simulating the Alpha ISA. To make sure everything is running correctly, here are the instruction counts for go (545811989), gcc (337360339), and anagram (25595137). It is possible your instruction counts will be slightly different (less than 1% different) but this is OK. Also, do not worry if the output is not the same as the reference outputs—this is also OK. 

 

Experiment: For each of these three benchmarks, evaluate the following design idea.  Assume that you have a 1GHz processor whose clock rate is bottlenecked by the time to access the L1 data cache for loads (but not stores), and that all instructions currently take 1 cycle.

 

We could increase the clock rate to 1.5 GHz if we let loads take 2 cycles each (but all other instructions are still 1 cycle). Is this a good idea? Show your math!

 

To evaluate this idea, you must modify sim-safe to count loads and stores of all types.

 

Note: sim-safe reports Òsim_elapsed_timeÓ when it is done. This is a measure of how long the simulation took to run, NOT how long it would take the simulated machine to run the benchmark. This is a very important distinction.