


CS 152 Computer Architecture

Final Project Report

Spring 2001

Prof. Kubiatowicz

Section: 2-4 pm Wednesday

TA: Ed Liao

Team members:

Lin Zhou 14255590

Sonkai Lao 14429797

Wen Hui Guan 12270914

Wilma Yeung 14499996

Table of Contents:

Abstract

Division of Labor

Detailed Strategy

Results

Conclusion

Appendix I (notebooks)

Appendix II (schematics)

Appendix III (VHDL files)

Appendix IV (testing)

Appendix V (delay time table)

Abstract:

We implemented a deep-pipelined processor with seven stages, along with a branch predictor, separate instruction and data caches, and a stream buffer. Starting from the original 5-stage pipeline, we split the instruction fetch stage into two stages, and did the same with the execution stage. In the first fetch stage, the instruction is read from the instruction cache; the branch decision is predicted in the second fetch stage. Since the ALU is the functional unit with the worst delay (15 ns), we broke it into two 16-bit adders to achieve better performance. Each adder is paired with a logic unit and placed in a different execution stage. We also added a branch predictor in the second fetch stage. With this implementation, our pipelined processor sustains a cycle time of 22.5 ns, assuming an ideal memory system.

To optimize our memory system, we implemented two cache subsystems, one for data and the other for instructions. Furthermore, because memory accesses to the instruction cache are frequently sequential, we attached a stream buffer to it. Finally, the performance metrics we use for the caches are hit time and miss penalty. The hit time for both the data and instruction caches is 7 ns. The miss penalty for the data cache is estimated at 124.5 ns, while that for the instruction cache is 422 ns, based on the results from running the mystery program with the ideal memory.

Division of Labor:

| Project Design Areas                   | Team Member(s) Involved   |
| Processor datapath and controllers     | Sonkai Lao, Lin Zhou      |
| Cache system, controllers, and arbiter | Wilma Yeung, Wen Hui Guan |
| Branch target predictor                | Sonkai Lao                |
| Tracer monitor                         | Wen Hui Guan              |
| Forwarding unit                        | Wilma Yeung               |
| Report write-up                        | Lin Zhou, Wen Hui Guan    |

Detailed Strategy:

a. General Design Strategy

Memory System

1) Instruction Cache and Stream Buffer

A stream buffer is added to reduce compulsory and capacity misses. Because of branches and jumps, instructions are typically executed in sequence in groups of four to five. We design both the instruction cache and the stream buffer with a fully associative lookup method, so each cache and buffer block is associated with a comparator.

For the instruction cache, we implement a FIFO replacement policy, enforced by using a counter that selects a block for replacement in the fashion of a clock. When a read miss occurs in both the cache and the buffer, the buffer is flushed. The counter advances after the instruction cache reads in two sequential words for a block, and the buffer controller then fills the buffer with the next four sequential words. This sequence of reads corresponds to two memory accesses under our implementation of three-word burst requests. Based on this implementation, the miss penalty is about 124.5 ns. When the requested word hits in the cache, the hit time is 7 ns; if it misses in the cache but hits in the buffer, it costs only one extra nanosecond to bring the word to the NPC.
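As an illustration of the fully associative lookup, the sketch below shows per-block tag comparators whose hit lines are ORed into a single hit signal. This is a minimal VHDL sketch under our own assumptions: the entity and port names are hypothetical, and the eight blocks with 9-bit tags merely follow the component sizes used elsewhere in this report.

library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical sketch: fully associative lookup over 8 blocks.
-- Each block stores a 9-bit tag; a per-block comparator drives one
-- hit line, and the hit lines are ORed into a single cache_hit.
entity assoc_lookup is
  port (
    req_tag   : in  std_logic_vector(8 downto 0);
    tags      : in  std_logic_vector(8*9-1 downto 0); -- 8 stored tags
    valid     : in  std_logic_vector(7 downto 0);
    hit_lines : out std_logic_vector(7 downto 0);
    cache_hit : out std_logic
  );
end assoc_lookup;

architecture rtl of assoc_lookup is
  signal hits : std_logic_vector(7 downto 0);
begin
  gen_cmp : for i in 0 to 7 generate
    -- one 9-bit comparator per block, gated by the valid bit
    hits(i) <= valid(i) when tags((i+1)*9-1 downto i*9) = req_tag
               else '0';
  end generate;

  hit_lines <= hits;
  cache_hit <= hits(0) or hits(1) or hits(2) or hits(3) or
               hits(4) or hits(5) or hits(6) or hits(7);
end rtl;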

2) Data Cache Re-design (with fully associative, burst request, and write-back)

We implement a fully associative access method for the data cache along with a write-back write policy. Since we employ two 32-bit DRAM banks in parallel, we connect the data cache to the memory via a 64-bit bus. We also implement a FIFO replacement policy, enforced, as in the instruction cache, by a counter that selects a block for replacement in the fashion of a clock. The counter advances only when a read miss or a write miss occurs.

To increase the performance of the cache, we design it so that when a write to the DRAM is required, the cache writes only one word, instead of two, to the memory. However, when there is a read miss and a block is selected for replacement, two words are read into the block to take advantage of spatial locality. This also reduces the write overhead, since a word is written back to the DRAM only if the dirty bit corresponding to that word is set. We therefore implement each block with two dirty bits.

Write misses and write hits involve some complication, since we need to update only one word (32 bits) over a 64-bit bus while keeping the implementation simple. On a write miss, the block selected for replacement is handled in the same manner as on a read miss: we request a read of two words from the DRAMs and set the valid bits after writing any dirty word(s) back to the memory. To fit these scenarios onto a 64-bit input bus, we need to consider several write cases, so our approach is to keep two valid bits and two dirty bits for each block. On a write hit, we simply update the appropriate word and set its dirty bit. On a compulsory write miss, we update the tag, the cache word, and the corresponding valid bit; by keeping two valid bits, we can avoid allocating the full block on a write and thus reduce overhead. Because the data source for a write can be either the CPU or the DRAM, the target block may be empty (no valid bits set), and the input data arrives on a 64-bit bus, we need to choose the 64-bit input data appropriately. To this end, the data cache controller distinguishes the input source and generates a 3-bit source-selection signal to choose the input data.

3) Data Cache controller design

As described in the data cache design section, write hits and read hits are handled similarly. A read hit involves selecting the right word to output to the CPU; on a write hit, the appropriate word must be updated. Both cases are handled in the same state by recognizing the type of hit and generating the corresponding selection signals.

On either kind of miss, the cache requests permission to access the DRAM through the arbiter (described below) and must wait for the arbiter's grant signal before performing the write or read. The state diagram (Appendix II, Data_cache_fsm.ppt) shows that hits are handled in the START state. When a miss is triggered, the controller moves to the GRANT_W or GRANT_R state until the write or read permission is granted. Once permission is granted, the controller transitions to the write or read state and waits until the DRAM is idle. When the operation completes, the controller returns to the initial state.

It is possible for a read miss to occur right after a write miss is serviced. In this case, as the diagram shows, the state machine can request the read immediately after the write to the DRAM finishes, which saves one cycle on the fly. A sketch of this controller follows.
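The minimal VHDL sketch below captures this state machine. It is illustrative only: the state names match the diagram, but the port names (hit, read_miss, write_miss, grant, dram_idle) and the single-cycle handshakes are our assumptions, not the actual implementation in Data_cache_ctrl.vhd.

library ieee;
use ieee.std_logic_1164.all;

-- Sketch of the data cache controller FSM: hits are serviced in
-- START; misses request the bus via the arbiter (GRANT_W/GRANT_R),
-- then wait in WRITE/READ until the DRAM is idle. A read miss
-- queued behind a write miss re-requests the bus directly from
-- WRITE, saving one cycle.
entity dcache_ctrl is
  port (
    clk, reset                 : in  std_logic;
    hit, read_miss, write_miss : in  std_logic;
    grant, dram_idle           : in  std_logic;
    bus_req, do_write, do_read : out std_logic
  );
end dcache_ctrl;

architecture rtl of dcache_ctrl is
  type state_t is (START, GRANT_W, GRANT_R, WRITE, READ);
  signal state : state_t;
begin
  process (clk, reset)
  begin
    if reset = '1' then
      state <= START;
    elsif rising_edge(clk) then
      case state is
        when START =>
          if write_miss = '1' then state <= GRANT_W;
          elsif read_miss = '1' then state <= GRANT_R;
          end if;                        -- hits are handled in START
        when GRANT_W =>
          if grant = '1' then state <= WRITE; end if;
        when GRANT_R =>
          if grant = '1' then state <= READ; end if;
        when WRITE =>
          if dram_idle = '1' then
            if read_miss = '1' then state <= GRANT_R; -- saved cycle
            else state <= START;
            end if;
          end if;
        when READ =>
          if dram_idle = '1' then state <= START; end if;
      end case;
    end if;
  end process;

  bus_req  <= '1' when state = GRANT_W or state = GRANT_R else '0';
  do_write <= '1' when state = WRITE else '0';
  do_read  <= '1' when state = READ  else '0';
end rtl;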

4) Instruction and Stream Buffer controller design

The instruction cache and stream buffer controllers are two communicating state machines. On a hit in either the cache or the buffer, both machines stay in the START state. If the hit is in the buffer, the buffer controller, while in the START state, directs the buffer to output the requested word instead.

On misses, the buffer controller takes the active role, communicating with the arbiter and memory controller to request a sequence of 6 words (burst read 3 times, two words each). Once the arbiter grants the memory access request to the buffer controller, the first two words are written to the selected cache block, while the next 4 sequential words are put into the buffer. After each burst read from the memory, the buffer controller waits one cycle to let the DRAM recover. It is also quite possible for a read miss to follow a write miss; we treat this case by requesting memory access again immediately after the write miss is handled, because the availability of instruction data is critical to processor performance. The instruction cache and stream buffer controller state diagrams are in Appendix II (Instr_cache_fsm.ppt and Stream_buffer.fsm.ppt).

5) Memory controller design

The memory controller (memctrl) is a synchronous process containing two asynchronous components, a memory read component (mem_read) and a memory write component (mem_write). The synchronous input signals of memctrl trigger mem_read or mem_write to perform the operation; separating them makes it easy for memctrl to distinguish read from write operations. To increase the throughput of the DRAMs, we implemented burst read requests of 3 words: within the RAS and CAS timing constraints, for each RAS assertion we toggle the CAS signal three times to accomplish the burst read.
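A minimal sketch of this page-mode burst sequence is below. The port names and the one-cycle CAS phases are our assumptions; the real memctrl must honor the DRAM's actual RAS/CAS timing parameters.

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Sketch of a page-mode burst read: one RAS assertion followed by
-- three CAS pulses, each presenting the next column address.
entity burst_read is
  port (
    clk, reset, start : in  std_logic;
    row, col          : in  unsigned(8 downto 0);
    ras_n, cas_n      : out std_logic;
    addr              : out unsigned(8 downto 0);
    done              : out std_logic
  );
end burst_read;

architecture rtl of burst_read is
  type state_t is (IDLE, ROW_ADDR, COL_LO, COL_HI, FINISH);
  signal state : state_t;
  signal cnt   : unsigned(1 downto 0);  -- counts the three CAS pulses
  signal cur   : unsigned(8 downto 0);  -- current column address
begin
  process (clk, reset)
  begin
    if reset = '1' then
      state <= IDLE;
    elsif rising_edge(clk) then
      case state is
        when IDLE =>
          if start = '1' then
            cnt <= "00"; cur <= col; state <= ROW_ADDR;
          end if;
        when ROW_ADDR => state <= COL_LO;   -- row latched, RAS low
        when COL_LO   => state <= COL_HI;   -- CAS low: word comes out
        when COL_HI =>                      -- CAS back high
          if cnt = "10" then state <= FINISH;
          else cnt <= cnt + 1; cur <= cur + 1; state <= COL_LO;
          end if;
        when FINISH => state <= IDLE;       -- let the DRAM recover
      end case;
    end if;
  end process;

  ras_n <= '0' when state = ROW_ADDR or state = COL_LO
                 or state = COL_HI else '1';
  cas_n <= '0' when state = COL_LO else '1';
  addr  <= row when state = ROW_ADDR else cur;
  done  <= '1' when state = FINISH else '0';
end rtl;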

6) Memory arbiter

We design the arbiter as a state machine with three states: Start, Instruction, and Data. From the Start state, when a request comes from either the data or the instruction cache, the machine transitions to the respective state and asserts the appropriate memory request signal. When requests come from both the instruction and the data cache at the same time, the data request takes priority and the instruction fetch register is stalled until that memory access completes; the instruction request is then serviced.
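The sketch below shows this three-state arbiter in VHDL. Port names (i_req, d_req, mem_done) are our own; the actual implementation is in Final_mem_arbiter.vhd.

library ieee;
use ieee.std_logic_1164.all;

-- Sketch of the 3-state arbiter: on simultaneous requests the data
-- cache wins, and the instruction side is serviced once the data
-- access completes.
entity arbiter is
  port (
    clk, reset       : in  std_logic;
    i_req, d_req     : in  std_logic;  -- instruction / data requests
    mem_done         : in  std_logic;  -- current access finished
    i_grant, d_grant : out std_logic
  );
end arbiter;

architecture rtl of arbiter is
  type state_t is (START, INSTR, DATA);
  signal state : state_t;
begin
  process (clk, reset)
  begin
    if reset = '1' then
      state <= START;
    elsif rising_edge(clk) then
      case state is
        when START =>
          if d_req = '1' then    state <= DATA;   -- data has priority
          elsif i_req = '1' then state <= INSTR;
          end if;
        when DATA =>
          if mem_done = '1' then
            -- a pending instruction request is serviced next
            if i_req = '1' then state <= INSTR;
            else state <= START;
            end if;
          end if;
        when INSTR =>
          if mem_done = '1' then state <= START; end if;
      end case;
    end if;
  end process;

  d_grant <= '1' when state = DATA  else '0';
  i_grant <= '1' when state = INSTR else '0';
end rtl;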

Datapath

7) Main structure of the 7-stage deep pipeline (overview)

8) Branch Target Predictor and controller

To implement a branch target predictor, we put a branch prediction table in the second instruction fetch stage. The table maintains 16 entries of PCs, predicted PCs, and next-state bits; the prediction is indexed by the input PC from the next-PC logic. Each entry is associated with a 2-bit state machine that always predicts “not taken” the first time. The predicted NPC is latched into the next-PC logic, but the actual branch decision is made in the first EXE stage. If the prediction is wrong, the instruction registers are flushed, which costs two extra cycles.
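The per-entry 2-bit state machine can be sketched as a saturating counter, as below. This is a hedged illustration: the report only specifies that the initial prediction is “not taken,” so the exact state encoding in Predict_entry.vhd and B_table_ctrl.vhd may differ, and the port names here are hypothetical.

library ieee;
use ieee.std_logic_1164.all;

-- Sketch of a 2-bit predictor: starts in strong-not-taken and is
-- updated with the actual outcome resolved in EXE1.
entity predictor_2bit is
  port (
    clk, reset : in  std_logic;
    update     : in  std_logic;  -- actual branch outcome available
    taken      : in  std_logic;  -- resolved direction from EXE1
    predict    : out std_logic   -- '1' = predict taken
  );
end predictor_2bit;

architecture rtl of predictor_2bit is
  type state_t is (SN, WN, WT, ST);  -- strong/weak not-taken/taken
  signal state : state_t := SN;
begin
  process (clk, reset)
  begin
    if reset = '1' then
      state <= SN;                   -- first prediction: not taken
    elsif rising_edge(clk) then
      if update = '1' then
        case state is
          when SN => if taken = '1' then state <= WN; end if;
          when WN => if taken = '1' then state <= WT;
                     else state <= SN; end if;
          when WT => if taken = '1' then state <= ST;
                     else state <= WN; end if;
          when ST => if taken = '0' then state <= WT; end if;
        end case;
      end if;
    end if;
  end process;

  predict <= '1' when state = WT or state = ST else '0';
end rtl;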

9) ALU Modification

To reduce the cycle time, we divide the original 32-bit ALU into two 16-bit adders, so an addition or subtraction is handled across the two stages: the first adder in EXE1 operates on the lower 16 bits of the operands, while the upper 16 bits, together with the carry out of the first adder, are calculated by the other adder in EXE2. Each execution stage also has a 32-bit logic unit that handles logic operations and shifts. The reason for having two logic units is to avoid stalls between arithmetic and shift/logic operations: both logic units do their operations in different stages, and the output of the first logic unit is forwarded to the EXE2 stage. If there is a RAW hazard, the controller detects it and selects the correct logic unit output through MUX control; forwarding then proceeds as usual. A sketch of the split add appears at the end of this subsection.

For example, the first output of the logic unit is chosen when

sll $1 $2 3
addiu $4 $1 3

and the second output of the logic unit is chosen when

addiu $1 $2 3
sll $4 $1 2

In addition, gate-level logic blocks, such as the SLT logic unit, are built in order to reduce the cycle time of the critical path.
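Here is the promised sketch of the split add in VHDL. It is a minimal illustration with hypothetical names (the actual 16-bit adder lives in Half_alu.vhd): the low-half sum and its carry are latched in the EXE1/EXE2 pipeline register, and the high half is completed one stage later.

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Sketch of the split 32-bit add: EXE1 adds the low halves and
-- latches the carry; EXE2 adds the high halves plus that carry.
entity split_add is
  port (
    clk    : in  std_logic;
    a, b   : in  std_logic_vector(31 downto 0); -- operands in EXE1
    sum_lo : out std_logic_vector(15 downto 0); -- EXE1 result
    sum_hi : out std_logic_vector(15 downto 0)  -- EXE2 result
  );
end split_add;

architecture rtl of split_add is
  signal lo_sum   : unsigned(16 downto 0);      -- 16-bit sum + carry
  signal lo_ff    : unsigned(15 downto 0);
  signal carry_ff : unsigned(0 downto 0);
  signal hi_a_ff, hi_b_ff : unsigned(15 downto 0);
begin
  -- EXE1: 16-bit add of the lower operand halves
  lo_sum <= ('0' & unsigned(a(15 downto 0)))
          + ('0' & unsigned(b(15 downto 0)));

  process (clk)
  begin
    if rising_edge(clk) then
      lo_ff       <= lo_sum(15 downto 0);
      carry_ff(0) <= lo_sum(16);                -- carry into EXE2
      hi_a_ff     <= unsigned(a(31 downto 16)); -- upper halves travel
      hi_b_ff     <= unsigned(b(31 downto 16)); -- in pipeline registers
    end if;
  end process;

  -- EXE2: 16-bit add of the upper halves plus the latched carry
  sum_hi <= std_logic_vector(hi_a_ff + hi_b_ff + carry_ff);
  sum_lo <= std_logic_vector(lo_ff);  -- in the real datapath this is
                                      -- piped forward so the 32-bit
                                      -- result aligns after EXE2
end rtl;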

10) Forwarding Unit

The forwarding unit (hazard controller) is written in VHDL. It compares the registers among the decode, 1st execution, 2nd execution, and memory stages; when a data dependency is detected, it forwards the needed value to the earlier stage, thereby avoiding RAW hazards. The VHDL generates the selection signals for the various MUXes to accomplish the correct data flow based on the result of the comparisons. There are 2 muxes in the decode stage, 4 muxes in the 1st execution stage, and 5 muxes in the 2nd execution stage. The unit also controls which logic unit to use for each shift/logic operation.
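One such comparison, sketched below with hypothetical port names (the real logic is in Hazardcontroller.vhd), generates the select signal for a single operand mux; the nearer producer wins, and register $0 is never forwarded.

library ieee;
use ieee.std_logic_1164.all;

-- Sketch of one forwarding comparison: if the EXE2 or MEM
-- instruction writes the register that EXE1 reads as rs, steer the
-- corresponding mux to the forwarded value.
entity fwd_rs is
  port (
    rs_exe1                     : in  std_logic_vector(4 downto 0);
    rd_exe2, rd_mem             : in  std_logic_vector(4 downto 0);
    regwrite_exe2, regwrite_mem : in  std_logic;
    sel_rs                      : out std_logic_vector(1 downto 0)
    -- "00": register file value, "01": forward from EXE2,
    -- "10": forward from MEM
  );
end fwd_rs;

architecture rtl of fwd_rs is
begin
  sel_rs <= "01" when regwrite_exe2 = '1' and rd_exe2 /= "00000"
                      and rd_exe2 = rs_exe1 else
            "10" when regwrite_mem = '1' and rd_mem /= "00000"
                      and rd_mem = rs_exe1 else
            "00";
end rtl;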

b. Testing Strategy

Delay modeling of major components:

1) Forwarding Unit

The forwarding unit generates the selection signals for the muxes by comparing registers from different stages. In hardware it can be built from muxes and simple logic gates, so we model its delay as 5 ns.

2) Predict entry

The predict entry keeps the PC, the NPC, and the state bits of the prediction state machine. It can be implemented as a register, so a delay of 3 ns is reasonable.

3) Branch Table Controller

The controller's state transitions follow edge-triggered flip-flops. Since all the output signals result from simple combinational logic, we model the controller's delay as that of a series of two registers, which is 6 ns.

4) Comparator

The 9-bit comparator used to compare tags can be implemented with 9 two-input NAND gates and 9 inverters. However, ANDing the 9 single-bit results would need a 9-input AND gate, whose high fan-in increases the delay. The alternative is to break the AND into more levels, which costs about 3 levels of gates, excluding inverters. Therefore, we give it a delay of 6 ns.
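Behaviorally the comparator is one line of VHDL, as in the sketch below (the entity name follows Comparator9.vhd from Appendix III; the port names are our guess). The gate-level trade-off above concerns how this equality is realized as per-bit equality gates feeding a wide AND.

library ieee;
use ieee.std_logic_1164.all;

-- Behavioral sketch of the 9-bit tag comparator; synthesis realizes
-- it as the per-bit gates plus the wide AND discussed above.
entity comparator9 is
  port (
    a, b : in  std_logic_vector(8 downto 0);
    eq   : out std_logic
  );
end comparator9;

architecture rtl of comparator9 is
begin
  eq <= '1' when a = b else '0';
end rtl;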

5) 3-bit counter

The 3-bit counter can be implemented with 3 toggle flip-flops in parallel and a small amount of combinational logic to select the output bits, so the counter's delay is tied to the delay of the flip-flops. It is reasonable to model the delay of a toggle flip-flop as 2 ns, the same as that of a register, so a delay of 2 ns for the 3-bit counter seems justifiable.

6) Cache block select decoder

The decoder takes a 3-bit value as input and outputs 8 bits in parallel, with at most one bit asserted, based on the 3-bit value. This can be realized as one level of 8 NAND gates in parallel, excluding inverters. As a result, we assign it a delay of 2 ns.

7) Instruction, Data Cache controllers

The cache controllers can be implemented with pure combinational logic, with state transitions triggered by flip-flops. A delay of 3 ns is reasonable.

8) Memory controller

The controller's state transitions follow edge-triggered flip-flops. Since all the output signals result from simple combinational logic, we model the controller's delay as that of a series of two registers, which is 6 ns.

9) Memory arbiter

The memory arbiter is designed as a state machine with direct state-bit assignment, so its structure is two-level combinational logic. As with the memory controller, we use a 6 ns delay to model the arbiter.

10) Other common components are modeled with the same values as in labs 4 through 6.

Memory System Testing Methodology

Cache, Stream buffer, and controllers:

1) We first test the individual cache and controller components; the individual testing scripts (.cmd files) can be found in Appendix IV. We then run a comprehensive test on the memory and cache system as a whole. Read hits are tested in a straightforward manner by addressing cache contents and verifying their values.

2) In particular, we check the correct handling of the replacement policy by first causing write misses on all blocks. An additional write miss then selects the first block for replacement because of the FIFO policy, so the dirty word is written back to the DRAM before the new word is written into the cache block. We check the memory contents for the correct update. Read misses are tested similarly.

3) For the instruction cache, we particularly check a read miss in the cache with a hit in the buffer. Reading the next 3 sequential words should then cause no unnecessary delay or stall, because the buffer already holds the 4 sequential words. After that, a miss in both the cache and the buffer should flush the buffer and bring in another 6 sequential words.

Memory controller:

1) We first test the basic write and read operations and verify the corresponding memory content and the output.

2) To test the burst read of 3 words from memory, we input a row address and then lower the RAS signal. During this period, we supply the controller with three consecutive column addresses, each followed by lowering the CAS signal. After one access of the DRAMs, we expect 6 sequential words as output, since we have two DRAMs in parallel and access both at the same time.

Comprehensive testing:

3) Our testing strategy is hierarchical. First, we break the pipeline down into 7 stages and test it with the mystery program used in the previous labs to make sure the pipeline works. After testing all the basic components, such as the cache blocks, caches, controllers, forwarding unit, branch predictor, and arbiter, we proceed to add one functional unit at a time to the pipeline and run it with an ideal memory. After ensuring those functional units work properly, we attach the memory and cache system to the pipeline and run a series of intensive tests to further confirm the proper functionality of each unit and of the whole processor. Finally, a trace monitor is added to the datapath to monitor the committed results at the write-back stage.

Results:

a. Critical path

4) The processor critical path:

Compared with the cycle time of 37 ns obtained from the 5-stage pipeline, the 22.5 ns cycle time of the deep-pipelined processor is a 39.2% improvement ((37 - 22.5) / 37 ≈ 0.392).

5) Cache performance metrics:

Data and Instruction Hit Time (on cache or buffer):

9-bit comparator (6 ns) + 32-bit tristate buffer (1 ns) → 7 ns

Instruction cache miss penalty: 422 ns.

Data cache miss penalty: 124.5 ns.

The miss penalties for the data and instruction caches differ significantly because the instruction cache bursts 3 reads (page mode) on a miss, while the data cache reads only once (normal mode).

b. Cycle time for the pipelined processor

The calculated delay time for the critical path is 22.5 ns, so the clock rate for the new processor is 1 / 22.5 ns ≈ 44.44 MHz.

 

Conclusion:

We have implemented a deep-pipelined processor running at 44.44 MHz with branch prediction capability. We also implemented a stream buffer to complement the instruction cache. Both the data and instruction caches use fully associative access with a write-back write policy and a FIFO replacement policy. Furthermore, we implemented burst memory read requests of 3 words for each DRAM, which work well with the memory controller, DRAM, cache, and cache controller as a whole via a 64-bit bus. An arbiter is constructed to arbitrate simultaneous data and instruction requests; the data cache request always takes priority. If time permitted, we would further enhance the cache system by implementing a victim cache for the data cache.

Appendix I (notebooks):

a. Lin Zhou

b. Sonkai Lao

c. Wilma Yeung

d. Wen Hui Guan

Appendix II (schematics):

Schematics already shown in the body of this report are not listed again.

|Component                    |Schematic file        |
|Data cache controller        |Data_cache_fsm.ppt    |
|Instruction cache controller |Instr_cache_fsm.ppt   |
|Cache and memory system      |Mem_cache_system.ppt  |
|Arbiter controller           |Fsm_for_arbiter.ppt   |
|Stream buffer controller     |Stream_buffer.fsm.ppt |
|Presentation schematics      |Sum.ppt               |

Appendix III (VHDL files):

VHDL files for all components and blocks used:

|Component                      |VHDL file             |
|Data cache controller          |Data_cache_ctrl.vhd   |
|Predict entry                  |Predict_entry.vhd     |
|Instruction cache controller   |Instr_cache_ctrl.vhd  |
|Stream buffer controller       |Stream_buf_ctrl.vhd   |
|Memory controller              |Memctrl.vhd           |
|Branch table controller        |B_table_ctrl.vhd      |
|Memory arbiter                 |Final_mem_arbiter.vhd |
|Hazard controller (forwarding) |Hazardcontroller.vhd  |
|16-bit ALU                     |Half_alu.vhd          |
|Debugging tracer (monitor)     |Display.vhd           |
|3-bit counter                  |Counter_3bit.vhd      |
|m1x8 mux                       |Mux1x8.vhd            |
|m10x2 mux                      |M10x2.vhd             |
|m32x2 mux                      |M32x2.vhd             |
|m8x2 mux                       |M8x2.vhd              |
|10-bit tristate                |Tristate_10bit.vhd    |
|32-bit tristate                |Tristate.vhd          |
|9-bit comparator               |Comparator9.vhd       |
|9-bit register                 |Register9int.vhd      |
|1-bit register                 |Reg1.vhd              |
|Block select decoder           |Block_sel_decoder.vhd |
|Memory read component          |Mem_read.vhd          |
|Memory write component         |Mem_write.vhd         |

Note: All other components, such as extenders, shifters, and muxes, are the same as those used in labs 4 through 6, so they are not listed here again.

Appendix IV (testing):

Component testing:

Tests are performed in hierarchical order:

|Component               |Command file                                       |
|Overall system test     |Final.cmd, Quicksort.cmd                           |
|Cache system            |Cachesystem.cmd, Test_cache_sys.cmd, A_partial.cmd |
|Pipeline datapath       |Test_datapathfinal.cmd                             |
|Forwarding unit         |Forward.cmd, Forward_no_sort.cmd                   |
|Forwarding on quicksort |Forward_sort.cmd                                   |
|Instruction cache       |Instr_cache                                        |
|Memory system           |Test_mem_sys.cmd                                   |
|DRAM bank               |Test_memoryBank.cmd                                |
|Data cache              |Test_datacache.cmd                                 |
|Replacement policy      |Replace.cmd                                        |
|Arbiter                 |Test_arbiter.cmd                                   |

All testbenches are saved in the folder finalproj/behv/testbnch/.

Log files:

1) For the overall system test: finalproj/fsimtemp.log

Final datapath testing:

Datapath schematic file: ../finalproj/sch/datapathhazard.1

Datapath command file: ../finalproj/cmd/final.cmd

Monitor output file: ../finalproj/disassembly.out

Appendix V (delay time table):

|Component                       |Delay time                    |
|Stream buffer controller        |6 ns                          |
|Predict entry                   |3 ns                          |
|Hazard controller (forwarding)  |5 ns                          |
|Branch table controller         |6 ns                          |
|Data cache controller           |6 ns                          |
|Instruction cache controller    |6 ns                          |
|Block select decoder            |2 ns                          |
|3-bit counter                   |2 ns                          |
|m1x8 mux                        |3.5 ns                        |
|m32x8 mux                       |3.5 ns                        |
|m10x2 mux                       |1.5 ns                        |
|m32x2 mux                       |1.5 ns                        |
|m8x2 mux                        |1.5 ns                        |
|10-bit tristate                 |1 ns                          |
|32-bit tristate                 |2 ns                          |
|9-bit comparator                |6 ns                          |
|9-bit register                  |3 ns                          |
|32-bit register                 |3 ns                          |
|Logic unit (shifter/logic unit) |10 ns                         |
|16-bit adder                    |8 ns                          |
|SLT logic                       |3 ns                          |
|1-bit register                  |3 ns                          |
|Memory controller               |6 ns                          |
|Memory arbiter                  |6 ns                          |
|Memory read component           |part of the memory controller |
|Memory write component          |part of the memory controller |

