
Department of Electrical and Computer Engineering

University of Alabama in Huntsville

CPE 631 – Advanced Computer Systems Architecture

Spring 2003

Midterm Exam

Instructor: Dr. Aleksandar Milenković

Date: February 26, 2003

Place: TH 115 / SED

Time: 2:20 PM – 3:40 PM

Notes:

(1) Work should be performed systematically and neatly. This exam is open book, open notes, closed neighbor(s).

(2) The test has 4 questions on 10 pages. Please check to be sure that you have a complete test.

(3) Write your answers on the test, right below the questions. If you need more space you can use Page 9.

(4) Write your name on the front page.

(5) SED students might want to seal their exam in an envelope.

Good luck!

|Question |Points |Score |
|Q1 |25 | |
|Q2 |15 | |
|Q3 |30 | |
|Q4 |30 | |
|Sum |100 | |

Please print in capitals:

Last name:____________________________

First name: ___________________________

Question #1: (Fundamentals of Computer Design) (25 points)

1.1 (10 points)

Consider two different implementations, M1 and M2, of the same instruction set. M1 has a clock rate of 400MHz, and M2 has a clock rate of 200MHz. The average number of cycles for each instruction class on M1 and M2 is given in the following table:

|Instruction class |CPI (M1) |CPI (M2) |Mix with C1 |Mix with C2 |Mix with C3 |
|A |4 |2 |30% |30% |50% |
|B |6 |4 |50% |20% |30% |
|C |8 |3 |20% |50% |20% |

The table also summarizes the instruction mixes produced by three different compilers for this instruction set. C1 is a compiler produced by the makers of M1, C2 is a compiler produced by the makers of M2, and C3 is a third-party product. Assume that each compiler uses the same number of instructions for a given program, but that the instruction mix is as described in the table.

Answer the following questions:

(a) Using C1 on both M1 and M2, how much faster can the makers of M1 claim that M1 is compared to M2?

(b) Using C2 on both M2 and M1, how much faster can the makers of M2 claim that M2 is compared to M1?

(c) If you purchase M1, which compiler would you use?

(d) If you purchase M2, which compiler would you use?

(e) Which machine would you purchase given the choice of compiler, if we assume that all other criteria are identical, including costs?
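The arithmetic behind these subquestions can be sketched in Python. The dictionaries below restate the table values; `time_per_instr` is a helper name chosen here for illustration, not part of the exam.

```python
# Average CPI = sum over instruction classes of (mix fraction * class CPI);
# average time per instruction = CPI / clock rate.

cpi_m1 = {"A": 4, "B": 6, "C": 8}
cpi_m2 = {"A": 2, "B": 4, "C": 3}
mix = {                      # instruction mix per compiler, from the table
    "C1": {"A": 0.30, "B": 0.50, "C": 0.20},
    "C2": {"A": 0.30, "B": 0.20, "C": 0.50},
    "C3": {"A": 0.50, "B": 0.30, "C": 0.20},
}
clock_m1, clock_m2 = 400e6, 200e6  # Hz

def time_per_instr(cpi, mix_c, clock):
    """Average execution time per instruction in seconds."""
    avg_cpi = sum(mix_c[c] * cpi[c] for c in cpi)
    return avg_cpi / clock

# (a) Using C1 on both machines:
t1 = time_per_instr(cpi_m1, mix["C1"], clock_m1)   # 5.8 / 400 MHz = 14.5 ns
t2 = time_per_instr(cpi_m2, mix["C1"], clock_m2)   # 3.2 / 200 MHz = 16.0 ns
print(f"(a) M1 is {t2 / t1:.3f}x faster than M2 with C1")
```

The same helper answers parts (b) through (e) by swapping in the other compiler mixes.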

1.2 (15 points) A certain benchmark contains 195,578 floating-point operations. The benchmark was run on an embedded processor after compilation with optimization turned on. The embedded processor is based on a current RISC processor that includes floating-point function units, but the embedded processor does not include floating point for reasons of cost, power consumption, and lack of need for floating point by the target applications. The compiler allows floating-point instructions to be calculated with the hardware units or using software routines, depending on compiler flags. The benchmark took 1.08 seconds on the RISC processor and 13.6 seconds using software on its embedded version. Assume that the clock cycle rate is 16.67MHz for both the RISC processor and its embedded version.

Assume that the CPI using the RISC processor was measured to be 10, while the CPI of the embedded version of the processor was measured to be 6.

(a) (5 points) What is the MIPS rating for both runs?

(b) (5 points) What is the total number of instructions executed for both runs?

(c) (5 points) On the average, how many integer instructions does it take to perform a floating-point operation in software?
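A sketch of the arithmetic, under one common interpretation of part (c): the extra instructions executed in the software run are assumed to be the integer replacements for the floating-point operations.

```python
# Question 1.2: MIPS rating, instruction counts, and integer
# instructions per software floating-point operation.
clock = 16.67e6          # Hz, same for both versions
fp_ops = 195_578

t_hw, cpi_hw = 1.08, 10  # RISC processor with FP hardware
t_sw, cpi_sw = 13.6, 6   # embedded version, FP in software

# (a) MIPS rating = clock / (CPI * 10^6)
mips_hw = clock / (cpi_hw * 1e6)
mips_sw = clock / (cpi_sw * 1e6)

# (b) instruction count = time * clock / CPI
ic_hw = t_hw * clock / cpi_hw
ic_sw = t_sw * clock / cpi_sw

# (c) extra instructions per floating-point operation
per_fp = (ic_sw - ic_hw) / fp_ops
print(mips_hw, mips_sw, ic_hw, ic_sw, per_fp)
```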

Question #2: (Instruction Set Principles) (15 points)

2.1. (5 points) Several researchers have suggested that adding a register-memory addressing mode to a load-store machine might be useful. The idea is to replace sequences of

LOAD R1, 0(Rb)

ADD R2, R2, R1

by

ADD R2, 0(Rb)

Assume the new instruction causes the clock cycle time to increase by 5%. The original load instruction frequency is 25.1% (before the modification). The new instruction affects only the clock cycle time, not the CPI. What percentage of loads must be eliminated for the machine with the new instruction to have at least the same performance as the original one?
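The break-even condition can be checked with a short Python sketch. Let f be the fraction of loads eliminated; each eliminated load removes one instruction from the count, while the cycle time grows by 5%.

```python
# Setting (1 - 0.251 * f) * IC * CPI * (1.05 * T) <= IC * CPI * T
# and solving for f (IC, CPI, and T cancel):
load_freq = 0.251       # fraction of instructions that are loads
cycle_penalty = 1.05    # new cycle time relative to the old one

f = (1 - 1 / cycle_penalty) / load_freq
print(f"at least {f:.1%} of loads must be eliminated")
```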

2.2 (5 points) The ARM embedded processor supports the conditional execution of instructions. Every instruction starts with a 4-bit condition field that is tested against the condition flags (N – negative, Z – zero, C – carry, V – overflow) and determines whether the instruction acts as a nop or as a real instruction. E.g., ADDNE r1, r1, r0 ; add if the zero flag is not set.

Explain the motivation for introducing this feature.

2.3. (5 points) The ARM embedded processor supports block loads and stores. Under control of a 16-bit mask within the instruction, any of the 16 registers can be loaded from or stored to memory in a single instruction. E.g.,

LDMIA r1, {r0, r4, r7} ; r0:=mem[r1], r4:=mem[r1+4], r7:=mem[r1+8]

Explain the motivation for introducing this feature.

Question #3 (Basic Pipelining) (30 points)

3.1. (5 points)

What are WAW hazards? Give an example. What can be done in order to avoid them?

3.2 (15 points)

Assume an extended MIPS pipeline with a pipelined FP adder (3-clock-cycle execution) and a pipelined FP multiplier (5-clock-cycle execution), executing the following loop:

Loop: L.D F0, 0(R1)

MULT.D F4, F0, F2

ADD.D F6, F4, F8

S.D 0(R1), F6

SUBI R1, R1, #8

BNEZ R1, Loop

Show the timing of one iteration of this instruction sequence, assuming that data forwarding paths are supported and that the branch is handled as predicted-not-taken in the Decode stage of the pipeline. If structural hazards arise from write-back contention, assume that the earliest instruction gets priority and later instructions are stalled.

Use attached sheet (Page 10) to enter your answer.

Could reordering the instructions, together with changing individual operands, reduce the number of clock cycles needed per iteration? What changes would you suggest?

3.3 (10 points)

Consider the following code sequence executing on a CPU with scoreboarding.

MULT.D F0, F6, F4

SUB.D F8, F0, F2

ADD.D F2, F10, F2

Assume that execution of the MULT.D instruction requires 5 clock cycles, while the SUB.D and ADD.D instructions require 3 clock cycles each.

Fill in the following table, entering the clock cycle in which each instruction issues, reads operands, finishes execution, and writes its result back into the register file. Assume that there are 2 FP multiply/divide functional units and 2 FP add/subtract functional units.

|Instruction |Issue |Read op. |Execute |WriteBack |Comment |
|MULT.D |1 |2 | | | |
|SUB.D | | | | | |
|ADD.D | | | | | |

Bonus problem (5 points):

Answer the question above assuming Tomasulo’s algorithm.

|Instruction |Issue |Execute |WriteBack |Comment |
|MULT.D |1 | | | |
|SUB.D | | | | |
|ADD.D | | | | |

Question #4 (Memory Hierarchy) (30 points)

4.1 (10 points)

(a) (2 points) Can cache prefetching help reduce the number of cold (compulsory) misses?

(b) (2 points) Explain how giving read priority over writes reduces the miss penalty.

(c) (6 points) Assume a cache miss penalty of 100 clock cycles and that all instructions normally take 1.0 clock cycle (ignoring memory stalls). There are 1.4 memory references per instruction, and the average number of cache misses per 1000 instructions is 25. What is the performance impact of the memory subsystem? Compare the ideal system (all loads and stores are satisfied immediately, with no latency), the real system without a cache, and the system with the cache included.
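A sketch of the CPI arithmetic, assuming that without a cache every one of the 1.4 memory references per instruction pays the full 100-cycle penalty:

```python
# Question 4.1(c): effective CPI for the three configurations.
miss_penalty = 100          # clock cycles
base_cpi = 1.0
refs_per_instr = 1.4
misses_per_instr = 25 / 1000

cpi_ideal = base_cpi                                      # no memory stalls
cpi_cache = base_cpi + misses_per_instr * miss_penalty    # 1 + 0.025 * 100
cpi_nocache = base_cpi + refs_per_instr * miss_penalty    # 1 + 1.4 * 100

print(f"cache vs ideal:    {cpi_cache / cpi_ideal:.1f}x slower")
print(f"no cache vs cache: {cpi_nocache / cpi_cache:.1f}x slower")
```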

4.2. (20 points) Draw an overall picture of a hypothetical memory hierarchy, starting from the virtual address through the L2 cache access. The page size is 16KB, the TLB is 2-way set-associative with 512 entries, the L1 data cache is 16KB direct-mapped, the L1 instruction cache is 8KB direct-mapped, and the L2 unified cache is 2MB 4-way set-associative. The cache line (block) size is 64B for both L1 and L2 caches. The virtual address is 64 bits and the physical address is 44 bits. The L1 caches are physically tagged.

Clearly identify all relevant fields of the virtual and physical addresses and their sizes, as well as the sizes of the tag and data fields for all caches (L1 data, L1 instruction, L2 unified). Also identify the bit widths of all relevant parts (TLB tag, TLB data, L1 data cache tag, …).
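The field widths follow from the sizes given above. The sketch below computes them, assuming byte addressing and physically indexed, physically tagged caches; `cache_fields` is a helper name used here for illustration.

```python
# Question 4.2: bit widths of the address fields.
from math import log2

VA, PA = 64, 44
page_offset = int(log2(16 * 1024))      # 14 bits (16KB pages)
vpn = VA - page_offset                  # 50-bit virtual page number
ppn = PA - page_offset                  # 30-bit physical page number (TLB data)

tlb_sets = 512 // 2                     # 2-way, 512 entries -> 256 sets
tlb_index = int(log2(tlb_sets))         # 8 bits
tlb_tag = vpn - tlb_index               # 42 bits

block_offset = int(log2(64))            # 6 bits (64B blocks)

def cache_fields(size_bytes, assoc):
    """Return (index bits, tag bits) for a physically addressed cache."""
    sets = size_bytes // (64 * assoc)
    index = int(log2(sets))
    return index, PA - index - block_offset

l1d = cache_fields(16 * 1024, 1)        # L1 data:        (8, 30)
l1i = cache_fields(8 * 1024, 1)         # L1 instruction: (7, 31)
l2 = cache_fields(2 * 1024 * 1024, 4)   # L2 unified:     (13, 25)
print(tlb_tag, ppn, l1d, l1i, l2)
```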

Question 3.2.

Timing chart (enter the pipeline stage occupied by each instruction in clock cycles 1–40):

|Iteration |Instruction |1 … 40 |
|1st |L.D F0, 0(R1) | |
|1st |MULT.D F4, F0, F2 | |
|1st |ADD.D F6, F4, F8 | |
|1st |S.D 0(R1), F6 | |
|1st |SUBI R1, R1, #8 | |
|1st |BNEZ R1, Loop | |
|2nd |L.D F0, 0(R1) | |
|2nd |MULT.D F4, F0, F2 | |
|2nd |ADD.D F6, F4, F8 | |
|2nd |S.D 0(R1), F6 | |
|2nd |SUBI R1, R1, #8 | |
|2nd |BNEZ R1, Loop | |

Execution time per one iteration [#clock cycles]:

................
................
