Spring 2003 CS 152 - Lab 7 (Final Project):




10-Stage Deep-Pipelined MIPS Processor with Branch Prediction and Jump Target Prediction

(Sunshine Slam)

Group:

Tofu Soup

Members:

Khian Hao Lim

Soe Myint

Leo Ting

Haywood Ho

Ka Hou Chan

Abstract

In this final project, we implemented a 10-stage deep pipeline (18 points), a branch predictor (8 points) and a jr target predictor (8 points). To build the deep-pipelined processor, we modified our design and datapath from Lab 6. In summary, we added branch and jr predictors and branch and jr verifiers; we redesigned the instruction cache and data cache, breaking each into 2 stages; we modified the stalling logic and forwarding logic modules to accommodate the new 10-stage pipeline; we added a pipeline monitor for debugging; and we added a statistics collector module to measure our processor’s performance.

In this project, we were concerned with the following important issues:

▪ Minimizing the critical path delay after place and route to achieve a 20 ns cycle time

▪ Correcting wrong predictions (invalidating instructions in the pipeline)

▪ Dividing the cache into 2 stages

▪ Handling hazards and forwarding

In the end, our design works very well in simulation. The timing analyzer results indicate that most of our critical path delay comes from routing (we could not get close to a 50-50 split between logic and routing delay).

 

Division of Labor

Khian Hao Lim: Design datapath, branch and jr predictors; synthesize; debug

Soe Myint: Change forwarding and stalling modules

Leo Ting: Design the cache datapath, cache control and debug

Haywood Ho: Modify SDRAM control, learn about synthesis tools, synthesize, use timing constraints to minimize critical path, debug

Ka Hou Chan: Design datapath, measure performance and debug

Detailed Description

Datapath Description

In order to meet the 20 ns timing requirement, we had to progressively break the datapath into more stages. We started from the 5-stage pipeline and eventually ended up with a 10-stage pipeline comprising the following stages:

[pic]

Legend:

IF - instruction fetch

ID – instruction decode

FO – forwarding

EX – execute

WB – write back

Summary of stages

IF0-1-2

The IF0 stage was one of the last stages to be added. Its sole purpose is to give the branch prediction logic and the PC + 4 logic time to propagate through the muxes and reach the PC register. The instruction cache is broken into 2 stages to allow time for the tag comparison delay and muxing. Branch prediction should be done as early as possible to ensure that fewer instructions are lost on a branch. However, the earliest possible stage to do this is IF2, when the instruction has come out of the instruction cache. Thus, one instruction may still need to be squashed in the IF0 stage when a branch is taken or a jump is encountered, because the corrected PC value takes one more cycle to reach the PC register.

ID-FO-EX1-EX2

The ID stage is similar to that of the 5-stage pipeline. However, while reducing the critical path we noticed the delay of the large forwarding muxes; each one can be an 11-entry 32-bit mux. We decided to break the forwarding into 2 stages: the second stage, FO, exists purely to absorb the routing delay of the forwarding muxes. Branch and jump-register verification occurs in the EX1 stage, after all operands have been forwarded. Because of the delay of this verification unit, we initially planned to split verification across the EX1 and EX2 stages. The EX2 stage was kept so that the ALU or the verification unit could later be split across stages; this flexibility makes such a split much easier.

ME1-ME2-WB

Like the instruction cache, the data cache is split into 2 stages. The write buffer can be seen as residing in the ME1 stage of the data cache. The WB stage contains the disassembly monitor, which $displays the instruction in the WB stage, and a statistics collection unit that computes CPI, miss rates, stall percentages and other statistics and dumps them to a file.

Other modules

Also in the design are 3 other modules that cannot be classified under a single stage: the memory control, the stalling logic and the pipeline monitor.

The memory control can be seen as sitting outside the pipeline, receiving requests from the write buffer and the caches.

The stalling logic module determines whether the pipeline needs to stall by checking the signals generated by the write buffer when it is full, by the instruction and data caches when they miss, and by the memory control module before it is ready to receive requests. It also inserts bubbles into the pipeline when there is a lw/use dependency or a break instruction in the ID stage.

The pipeline monitor module displays the instruction at each pipeline stage and whether that instruction is valid. This proved very useful in debugging branch/jumpreg prediction, branch verification, stalling and forwarding.

Branch Predictor and Jump Target Predictor

Introduction

The branch and jumpreg predictor (henceforth called BJPredictor) sits in the IF2 stage. It makes its prediction and possibly changes nextPC, and hence what the instruction cache fetches in the next cycle. Because it sits in the IF2 stage, there is a delay slot after each branch/jumpreg but no lost instructions when the prediction is correct. It performs its own decoding after receiving the instruction from the cache and thus can affect the critical path.

BJPredictor’s companion module, the branch and jumpreg verifier (henceforth called BJVerifier), sits in the EX1 stage. By EX1, all data has been retrieved and forwarded. The BJVerifier checks whether the branch or jump was predicted correctly, corrects the PC, causes the instructions in the earlier stages to be squashed and lets the BJPredictor update its tables.

Algorithm

The BJPredictor can be partitioned into 2 portions: prediction and updating.

For prediction, every cycle the BJPredictor checks whether the instruction in the IF2 stage is a branch/jump instruction. If it is a branch instruction, it looks up the branch_table to guess whether the branch will be taken and alters nextPC accordingly. If it is a jal or j instruction, nextPC can be computed exactly. If it is a jr instruction, it looks up the jumpreg_table to find the destination of the previous jump.
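The per-cycle decision can be sketched in software as follows. This is a behavioral model, not the Verilog itself: the table names follow the report, but the instruction dictionary, its fields and the index function are hypothetical stand-ins for the module's own decoding logic.

```python
# Behavioral sketch of the BJPredictor's decision in IF2. branch_table and
# jumpreg_table follow the report; instr and index are hypothetical stand-ins.

BR_TAKEN_STATES = ("br_uncertain", "br_certain")  # states that predict taken

def predict_next_pc(pc, instr, branch_table, jumpreg_table, index):
    """Return the predicted next PC for the instruction currently in IF2."""
    if instr["kind"] == "branch":
        if branch_table[index(pc)] in BR_TAKEN_STATES:
            return pc + 4 + (instr["offset"] << 2)  # predict taken
        return pc + 4                               # predict not taken
    if instr["kind"] in ("j", "jal"):
        # Absolute jumps need no prediction: the target is in the instruction.
        return (pc & 0xF0000000) | (instr["target"] << 2)
    if instr["kind"] == "jr":
        return jumpreg_table[index(pc)]             # last observed destination
    return pc + 4                                   # ordinary instruction
```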

For updating, every cycle the BJVerifier sends out information about the instruction in the EX1 stage. If the instruction is a branch and the BJPredictor's prediction was wrong, the BJPredictor updates the corresponding entry following the state transition diagram shown below. If the instruction is a jr instruction, the BJPredictor writes the destination into the corresponding jumpreg_table entry. Thus, the BJPredictor always keeps the last destination of a jr instruction for as long as the entry is retained in the jumpreg_table.

[pic]

State Transition Diagram for each entry of Branch Predictor Table

(with reference to COD pg 502)
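For reference, the classic two-bit scheme behaves as a saturating counter; a minimal software sketch follows. The state names are taken from the statistics dump later in this report, but the exact next-state rule in our implementation follows the diagram above and may differ in detail.

```python
# Sketch of a classic 2-bit saturating counter, one per branch_table entry.
# State names match the BJPredictor's statistics dump; the report's actual
# next-state rule follows its own diagram and may differ in detail.

STATES = ["no_br_certain", "no_br_uncertain", "br_uncertain", "br_certain"]

def next_state(state, taken):
    """Move one step toward the observed outcome, saturating at the ends."""
    i = STATES.index(state)
    i = min(i + 1, 3) if taken else max(i - 1, 0)
    return STATES[i]

def predicts_taken(state):
    """The upper half of the counter range predicts taken."""
    return state in ("br_uncertain", "br_certain")
```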

Implementation

The following is a snippet of code from BJPredictor.v

=========================================

`define table_width 10

reg [1:0] branch_table [(`table_width*8)-1:0];

reg [31:0] jumpreg_table[(`table_width*8)-1:0];

=========================================

The branch and jump tables are implemented as register arrays with a `defined table size. Changing this parameter resizes the tables immediately, facilitating experimentation. Note that only 2 bits are needed per branch_table entry to implement the four states shown in the state transition diagram.

The tables are initialized to 0s after reset. For the branch_table, each entry is then in the “not branch, certain” state. For the jumpreg_table, this means a jump to a PC of 0.

Both tables are indexed using the low bits of the address, excluding the last 2 bits. This also saves the resources needed to store tags. Thus, an address with the same few low bits will kick a previous entry out. However, since this is only a predictor, we expect performance will not be much affected.
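The tag-less, direct-mapped lookup described above can be sketched as follows, assuming a power-of-two table size (such as the 16-entry experimentation table shown later):

```python
# Sketch of the tag-less table lookup described above. Dropping the low 2
# bits skips the byte offset of word-aligned addresses; no tags are stored,
# so aliasing PCs simply evict each other's entries.

def table_index(pc, table_entries=16):
    """Index both predictor tables with the low PC bits above the byte offset."""
    return (pc >> 2) % table_entries
```

Two PCs whose low bits match alias to the same entry, so a new branch kicks the previous occupant out, which is acceptable for a predictor.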

Testing

Test fixtures were built for the BJPredictor and the BJVerifier. The test fixtures drive inputs and verify that the outputs are as expected.

Statistics Collection

To facilitate tuning of parameters after benchmarking, statistics collection and dumping features have been incorporated. The number of branch instructions encountered, the number of wrong branch predictions, the number of jr instructions encountered and the number of wrong jr predictions are updated continuously. When a dummy wire called “display_statistics” is forced high in simulation, the BJPredictor module dumps the statistics above and the current tables to a log file.

Example output for a small experimentation table:

// *****************************************************

test_BJPredictor.a_BJPredictor@ 1050500000: Printing Branch Table

00000000: no_br_certain

00000001: no_br_certain

00000002: no_br_certain

00000003: no_br_certain

00000004: no_br_certain

00000005: no_br_certain

00000006: no_br_certain

00000007: no_br_certain

00000008: no_br_uncertain

00000009: no_br_certain

0000000a: no_br_certain

0000000b: no_br_certain

0000000c: no_br_certain

0000000d: no_br_certain

0000000e: no_br_certain

0000000f: no_br_certain

test_BJPredictor.a_BJPredictor@ 1050500000: Printing Jump Table

00000000: dest: 00000000

00000001: dest: 00000000

00000002: dest: 00000000

00000003: dest: 00000000

00000004: dest: 00000000

00000005: dest: 00000000

00000006: dest: 00000000

00000007: dest: 00000000

00000008: dest: 00000030

00000009: dest: 00000000

0000000a: dest: 00000000

0000000b: dest: 00000000

0000000c: dest: 00000050

0000000d: dest: 00000000

0000000e: dest: 00000000

0000000f: dest: 00000000

branch_count: 5

branch_wrong_count: 5

jumpreg_count: 35

jumpreg_wrong_count: 0

***************************************************** //

Branch Prediction Performance

To show that dynamic branch prediction actually earned us some benefit, we compared several branch prediction schemes:

1) Always predict branch taken

2) Always predict branch not taken

3) Our implementation of 2 bit dynamic branch prediction (with a 16 entry table)

4) The version of 2 bit dynamic branch prediction that Kubi introduced in lecture (with a 16 entry table). The difference between this and our implementation is the next state logic of the 2 bit state machine.

The suite of tests we used to generate the following results includes:

1) A recursive factorial program we wrote. In the tests we computed a factorial of 12.

2) The “base.s” program provided by the TAs.

3) The “extra.s” program provided by the TAs.

4) The “quicksort.s” program provided by the TAs.

The total number of branches encountered in each program’s flow:

Factorial (12): 167

Base: 283

Extra: 85

Quicksort: 1099

We calculate the branch hit percentage as follows:

branch hit percentage = (number of branches - number of wrong predictions) / (number of branches) x 100%

The following table and graphs summarize the branch hit percentage (number of correct predictions / number of branches) of each of the implementations for each of the programs:

|Branch Prediction |

|Test program |Predict always branch |Predict always not branch |Our original version |Kubi’s version mentioned in lecture |

|Quick sort |51.41 % |48.59 % |65.97 % |67.06 % |

|Extra |92.94 % |7.06 % |87.06 % |87.06 % |

|Base |98.94 % |1.06 % |91.17 % |91.17 % |

|Compute 12! |30.54 % |69.46 % |89.22 % |88.62 % |

|Average |68.46 % |31.54 % |83.36 % |83.48 % |
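The percentages above come straight from the counters that the BJPredictor dumps (branch_count and branch_wrong_count are the real counter names; the helper below is only an illustrative sketch):

```python
# Sketch of the statistic defined above, computed from the BJPredictor's
# dumped counters (e.g. branch_count and branch_wrong_count in the log).

def hit_percentage(total, wrong):
    """Correct predictions as a percentage of all branches encountered."""
    return 100.0 * (total - wrong) / total if total else 0.0
```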

From the results above, we plotted the following graphs:

[pic]

[pic]

[pic]

[pic]

Analysis

It can be observed that the performance of static branch prediction is inconsistent compared to dynamic branch prediction, which consistently keeps accuracy between roughly 70% and 90%. The two dynamic schemes perform very similarly, with the version Kubi presented in lecture doing marginally better than our implementation. Because the program sizes are small, the dynamic branch prediction algorithms performed well even with only a 16-entry table. We also observed that larger tables provide little additional benefit for programs of this size.

Jump Target Predictor Performance

Similar to the branch prediction, we calculate the jr predictor accuracy as follows:

jr prediction accuracy = (number of jr instructions - number of wrong jr predictions) / (number of jr instructions) x 100%

The accuracy of our jump target predictor is summarized in the following table:

|Test program |Jr prediction accuracy % |

|Quick sort |49.23 % |

|Extra |No jr instruction |

|Base |No jr instruction |

|Compute 12! |50 % |

|average |49.62 % |

The average accuracy of our jr prediction is about 50 %.

Forwarding Logic

Since we have a 10-stage pipeline, we have to forward data from 5 stages into the FO_EX1 stage. For a load word followed by a store word instruction, we opted to stall the pipeline rather than forward in the memory stage, to reduce the complexity of the pipeline stages.

[pic]

Testing forwarding logic

We had written several test cases in previous lab assignments, so we reused them to test the forwarding logic. First we tested the forwarding logic alone, using forwarding_logic_testbench.v to check its behavior when forwarding from several different pipeline stages to several different dependent instructions. Then we used block RAM as our instruction and data caches to test the forwarding logic together with the processor, to make sure the pipeline stages and control signals work correctly. Once that passed, we combined the processor with the caches and tested the forwarding logic again. The tests used for the forwarding logic can be found in lab7/tests/forwarding_*.s

Stalling Logic

For the stalling logic, we have three kinds of stalls:

• Stalling for break instruction

• Stalling for lw instruction followed by instructions depending on lw

• Stalling because of cache miss.

Stalling for break instruction

When a break instruction is in the pipeline, we have to stall all instructions following it while allowing the instructions before it to finish. The processor stalls until the break is released.

Stalling for lw instruction followed by instruction depending on lw

Instructions following a lw instruction have to stall for several cycles, depending on the positions of the lw and the dependent instruction:

lw instru in ID_FO stage and dependent instru in IF2_ID stage => stall for 4 cycles

lw instru in FO_EX1 stage and dependent instru in IF2_ID stage => stall for 3 cycles

lw instru in EX1_EX2 stage and dependent instru in IF2_ID stage => stall for 2 cycles

lw instru in EX2_ME1 stage and dependent instru in IF2_ID stage => stall for 1 cycle.
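The rules above amount to a small lookup from the lw instruction's stage to the number of bubbles to insert (a software sketch; the stage-register names match the pipeline stages in the report):

```python
# The lw/use stall rules above as a lookup table. Stage names follow the
# report's pipeline registers; the dependent instruction sits in IF2_ID.

LW_STALLS = {
    "ID_FO": 4,    # lw just decoded, dependent instruction right behind it
    "FO_EX1": 3,
    "EX1_EX2": 2,
    "EX2_ME1": 1,  # lw about to access memory; one bubble suffices
}

def lw_stall_cycles(lw_stage):
    """Bubbles to insert when the instruction in IF2_ID uses the lw result."""
    return LW_STALLS.get(lw_stage, 0)  # 0 once the value can be forwarded
```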

Stalling because of cache miss

When a cache miss occurs, we stall all pipeline stages until the data is returned from the DRAM to the cache and the cache responds to the processor’s request.

Testing Stalling logic

We had written several test cases in previous lab assignments, so we reused them to test the stalling logic. First we tested the stalling logic alone, using stalling_logic_testbench.v in the lab7/tests directory to check its behavior under different situations. Then we used block RAM as our instruction and data caches to test the stalling logic together with the processor, to make sure the pipeline stages and control signals work correctly. Once that passed, we combined the processor with the SDRAM, cache and write buffer and tested the stalling logic again. The tests used for the stalling logic can be found in the lab7/tests directory.

Cache Architecture

Overview

I was assigned the task of developing the cache module for Lab 7. The design is based on our Lab 6 design, except that it must now be pipelined into two stages, because the Lab 6 cache module contained a critical path that exceeded the 20 ns limit imposed in Lab 7. Nonetheless, the Lab 7 cache architecture retains the basic traits of the cache designed in Lab 6.

Architecture

I planned to keep the same overall physical architecture of the cache as in Lab 6. Thus, each cache (instruction or data) has:

• 8192 (8 × 1024) bytes of overall data storage (not including tags or valid bits)

• 8-word (32-byte) cache lines

• 2-way set-associativity

• write-through policy

• random cache-line replacement policy

Each cache module is organised as follows:

• two data BlockRAMs (each 8 words wide and 128 lines deep)

• two tag BlockRAMs (each 14 bits wide and 128 lines deep)

• one write buffer (with storage for 4 independent words and their addresses)

• one cache controller

• pipeline registers to hold two separate instructions

The following describes the 2-way set-associative cache design that we chose to implement:

Total data size: 8192 bytes

2-way set-associativity: therefore 4096 bytes per set

Word size: 4 bytes

Block size: 8 words = 32 bytes

Thus, we have 128 blocks per set.

32-bit byte address

|Tag |Entry Index |Block offset |Byte offset |

|(bits 31:12) |(bits 11:5) |(bits 4:2) |(bits 1:0) |

Note, however, that the SDRAM on the Xilinx board can effectively access only 23-bit word addresses. Thus, the effective tag portion of an address comprises only 13 bits.
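Under this layout, a 32-bit byte address decomposes as follows (an illustrative sketch: 7 index bits for 128 blocks per set, 3 block-offset bits for 8-word lines, 2 byte-offset bits, and the remaining high bits as tag):

```python
# Decomposing a 32-bit byte address per the cache layout above:
#   128 blocks/set -> 7 index bits; 8-word (32-byte) blocks -> 3 block-offset
#   bits; 2 byte-offset bits; the rest is tag (of which only 13 bits matter
#   given the SDRAM's 23-bit word addresses).

def split_address(addr):
    byte_off  = addr        & 0x3    # bits 1:0
    block_off = (addr >> 2) & 0x7    # bits 4:2
    index     = (addr >> 5) & 0x7F   # bits 11:5
    tag       = addr >> 12           # bits 31:12
    return tag, index, block_off, byte_off
```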

BlockRAM organisation:

|Valid |Tag |Data |Valid |Tag |Data |

|(1 bit) |(13 bits) |(256 bits = 8 words) |(1 bit) |(13 bits) |(256 bits = 8 words) |

Critical Paths

|Component |Levels of logic |Logic delay (ns) |Routing delay (ns) |Total (ns) |

|Write Buffer |12 |6.192 (28.7%) |15.353 (71.3%) |21.545 |

|Stalling logic |7 |4.019 (19.8%) |16.256 (80.2%) |20.275 |

|ALU |13 |7.177 (36.5%) |12.476 (63.5%) |19.653 |

|Forwarding logic |12 |6.715 (35.1%) |12.406 (64.9%) |19.121 |

|Branch Verifier |15 |7.130 (37.7%) |11.790 (62.3%) |18.920 |

|Branch Predictor |6 |4.765 (24.7%) |14.545 (75.3%) |19.310 |

|Forwarding Muxes |4 |3.158 (22.8%) |10.666 (77.2%) |13.824 |

The table above summarizes the timing analyzer extracts that follow. Component names were generated randomly by the schematic editor; here is a translation table.

|Schematic Names |Actual component |

|XLXI_32 |Write Buffer |

|XLXI_101 |Stalling logic |

|XLXI_107 |ALU |

|XLXI_156 |Branch Verifier |

|XLXI_155 |Branch Predictor |

|XLXI_189 |One of forwarding muxes |

************************************************************************************

Slack: -1.601ns (requirement - (data path - negative clock skew))

Source: proc/UUT/XLXI_123/data_cache/XLXI_32/currently_waiting

Destination: proc/UUT/XLXI_123/data_cache/XLXI_32/v0/DOUT

Requirement: 20.000ns

Data Path Delay: 21.545ns (Levels of Logic = 12)

Negative Clock Skew: -0.056ns

Source Clock: clk7 rising at 0.000ns

Destination Clock: clk7 rising at 20.000ns

Timing Improvement Wizard

Data Path: proc/UUT/XLXI_123/data_cache/XLXI_32/currently_waiting to proc/UUT/XLXI_123/data_cache/XLXI_32/v0/DOUT

Delay type Delay(ns) Logical Resource(s)

---------------------------- -------------------

Tcko 0.992 proc/UUT/XLXI_123/data_cache/XLXI_32/currently_waiting

net (fanout=2) 1.743 proc.UUT.XLXI_123.data_cache.wr_buf_wr_waiting

Tilo 0.468 G_1355

net (fanout=27) 1.717 proc.UUT.XLXI_123.data_cache.wr_buf_addr_sel

Tilo 0.468 proc/UUT/XLXI_123/data_cache/XLXI_67/DOUT[13]

net (fanout=7) 2.075 proc/UUT/XLXI_123/data_cache/wr_buf_addr[13]

Topcyf 0.939 proc/UUT/XLXI_123/data_cache/XLXI_32/un1_c1_0.I_16

proc/UUT/XLXI_123/data_cache/XLXI_32/un1_c1_0.I_8

proc/UUT/XLXI_123/data_cache/XLXI_32/un1_c1_0.I_44

net (fanout=1) 0.000 proc/UUT/XLXI_123/data_cache/XLXI_32/un1_c1_0.I_44/O

Tbyp 0.149 proc/UUT/XLXI_123/data_cache/XLXI_32/un1_c1_0.I_26

proc/UUT/XLXI_123/data_cache/XLXI_32/un1_c1_0.I_80

net (fanout=1) 0.000 proc/UUT/XLXI_123/data_cache/XLXI_32/un1_c1_0.I_80/O

Tbyp 0.149 proc/UUT/XLXI_123/data_cache/XLXI_32/un1_c1_0.I_62

proc/UUT/XLXI_123/data_cache/XLXI_32/un1_c1_0.I_2

net (fanout=3) 3.031 proc/UUT/XLXI_123/data_cache/XLXI_32/I_2_1

Tilo 0.468 proc/UUT/XLXI_123/data_cache/XLXI_32/Valid29_1

net (fanout=36) 1.535 proc/UUT/XLXI_123/data_cache/XLXI_32/Valid29_1

Tilo 0.468 proc/UUT/XLXI_123/data_cache/XLXI_32/un19_a3_enable_1

net (fanout=4) 0.866 proc/UUT/XLXI_123/data_cache/XLXI_32/un19_a0_enable_1

Tilo 0.468 proc/UUT/XLXI_123/data_cache/XLXI_32/un19_a0_enable

net (fanout=2) 1.874 proc/UUT/XLXI_123/data_cache/XLXI_32/un19_a0_enable

Tilo 0.468 proc/UUT/XLXI_123/data_cache/XLXI_32/a0_enable_0_and2_1

net (fanout=17) 0.725 proc/UUT/XLXI_123/data_cache/XLXI_32/a0_enable_1

Tilo 0.468 proc/UUT/XLXI_123/data_cache/XLXI_32/v0_enable_0

net (fanout=1) 1.787 proc/UUT/XLXI_123/data_cache/XLXI_32/v0_enable

Tceck 0.687 proc/UUT/XLXI_123/data_cache/XLXI_32/v0/DOUT

---------------------------- ------------------------------

Total 21.545ns (6.192ns logic, 15.353ns route)

(28.7% logic, 71.3% route)

************************************************************************************

Timing constraint: TS_clk7_c = PERIOD TIMEGRP "clk7_c" 20 nS HIGH 50.000000 % ;

8988 items analyzed, 22 timing errors detected.

Minimum period is 20.275ns.

--------------------------------------------------------------------------------

Slack: -0.275ns (requirement - (data path - negative clock skew))

Source: proc/UUT/XLXI_136/instr/DOUT[28]

Destination: proc/UUT/XLXI_132/PC/DOUT[12]

Requirement: 20.000ns

Data Path Delay: 20.275ns (Levels of Logic = 7)

Negative Clock Skew: 0.000ns

Source Clock: clk7 rising at 0.000ns

Destination Clock: clk7 rising at 20.000ns

Timing Improvement Wizard

Data Path: proc/UUT/XLXI_136/instr/DOUT[28] to proc/UUT/XLXI_132/PC/DOUT[12]

Delay type Delay(ns) Logical Resource(s)

---------------------------- -------------------

Tcko 0.992 proc/UUT/XLXI_136/instr/DOUT[28]

net (fanout=7) 3.195 proc/UUT/EX2_ME1_opcode[2]

Tilo 0.468 proc/UUT/XLXI_101/buffer_full10_399

net (fanout=1) 2.457 proc/UUT/XLXI_101/buffer_full10_399

Tilo 0.468 proc/UUT/XLXI_101/complete_stall

net (fanout=28) 3.041 proc/UUT/not_complete_stall

Tilo 0.468 proc/UUT/not_complete_stall_1

net (fanout=31) 2.662 proc/UUT/not_complete_stall_1

Tilo 0.468 proc/UUT/XLXI_172

net (fanout=6) 2.745 proc/UUT/enable

Tilo 0.468 proc/UUT/enable_2

net (fanout=21) 2.156 proc/UUT/enable_2

Tceck 0.687 proc/UUT/XLXI_132/PC/DOUT[12]

---------------------------- ------------------------------

Total 20.275ns (4.019ns logic, 16.256ns route)

(19.8% logic, 80.2% route)

Minimum period is 19.653ns.

--------------------------------------------------------------------------------

Slack: 0.347ns (requirement - (data path - negative clock skew))

Source: proc/UUT/XLXI_178/A/DOUT[20]

Destination: proc/UUT/XLXI_178/B/DOUT[0]

Requirement: 20.000ns

Data Path Delay: 19.653ns (Levels of Logic = 13)

Negative Clock Skew: 0.000ns

Source Clock: clk7 rising at 0.000ns

Destination Clock: clk7 rising at 20.000ns

-------------------------------------------------------------------------

Data Path: proc/UUT/XLXI_178/A/DOUT[20] to proc/UUT/XLXI_178/B/DOUT[0]

Delay type Delay(ns) Logical Resource(s)

---------------------------- -------------------

Tcko 0.992 proc/UUT/XLXI_178/A/DOUT[20]

net (fanout=9) 4.929 proc/UUT/FO_EX1_A[20]

Topcyf 0.939 proc/UUT/XLXI_107/un24_DOUT_axb_20

proc/UUT/XLXI_107/un24_DOUT_cry_20

proc/UUT/XLXI_107/un24_DOUT_cry_21

net (fanout=1) 0.000 proc/UUT/XLXI_107/un24_DOUT_cry_21/O

Tbyp 0.149 proc/UUT/XLXI_107/un24_DOUT_cry_22

proc/UUT/XLXI_107/un24_DOUT_cry_23

net (fanout=1) 0.000 proc/UUT/XLXI_107/un24_DOUT_cry_23/O

Tbyp 0.149 proc/UUT/XLXI_107/un24_DOUT_cry_24

proc/UUT/XLXI_107/un24_DOUT_cry_25

net (fanout=1) 0.000 proc/UUT/XLXI_107/un24_DOUT_cry_25/O

Tbyp 0.149 proc/UUT/XLXI_107/un24_DOUT_cry_26

proc/UUT/XLXI_107/un24_DOUT_cry_27

net (fanout=1) 0.000 proc/UUT/XLXI_107/un24_DOUT_cry_27/O

Tbyp 0.149 proc/UUT/XLXI_107/un24_DOUT_cry_28

proc/UUT/XLXI_107/un24_DOUT_cry_29

net (fanout=1) 0.000 proc/UUT/XLXI_107/un24_DOUT_cry_29/O

Tciny 0.677 proc/UUT/XLXI_107/un24_DOUT_cry_30

proc/UUT/XLXI_107/un24_DOUT_s_31

net (fanout=1) 0.960 proc/UUT/XLXI_107/un24_DOUT_s_31

Tilo 0.468 proc/UUT/XLXI_107/DOUT_4_and2_0[31]

net (fanout=1) 0.679 proc/UUT/XLXI_107/N_832

Tilo 0.468 proc/UUT/XLXI_107/DOUT_4[31]

net (fanout=2) 1.268 proc/UUT/XLXN_844

Tif5x 0.871 proc/UUT/XLXI_108/DOUT_3_bm[0]

proc/UUT/XLXI_108/DOUT_3[0]

net (fanout=2) 2.672 proc/UUT/ALUOut1[0]

Tif5 0.903 proc/UUT/XLXI_189/DOUT_5_am[0]

proc/UUT/XLXI_189/DOUT_5[0]

net (fanout=1) 0.000 proc/UUT/XLXI_189/N_2992

Tf5iny 0.220 proc/UUT/XLXI_189/DOUT_7[0]

net (fanout=1) 1.968 proc/UUT/XLXI_189/N_3058

Tick 1.043 proc/UUT/XLXI_189/DOUT[0]

proc/UUT/XLXI_178/B/DOUT[0]

---------------------------- ------------------------------

Total 19.653ns (7.177ns logic, 12.476ns route)

(36.5% logic, 63.5% route)

************************************************************************************

Timing constraint: TS_clk7_c = PERIOD TIMEGRP "clk7_c" 20 nS HIGH 50.000000 % ;

24200 items analyzed, 0 timing errors detected.

Minimum period is 19.121ns.

--------------------------------------------------------------------------------

Slack: 0.879ns (requirement - (data path - negative clock skew))

Source: proc/UUT/XLXI_133/instr/DOUT[0]

Destination: proc/UUT/XLXI_184/reg_ALU_B_forward_sel/DOUT[0]

Requirement: 20.000ns

Data Path Delay: 19.121ns (Levels of Logic = 12)

Negative Clock Skew: 0.000ns

Source Clock: clk7 rising at 0.000ns

Destination Clock: clk7 rising at 20.000ns

Data Path: proc/UUT/XLXI_133/instr/DOUT[0] to proc/UUT/XLXI_184/reg_ALU_B_forward_sel/DOUT[0]

Delay type Delay(ns) Logical Resource(s)

---------------------------- -------------------

Tcko 0.992 proc/UUT/XLXI_133/instr/DOUT[0]

net (fanout=28) 3.065 proc/UUT/XLXN_1301[0]

Tilo 0.468 proc/UUT/XLXI_184/G_748

net (fanout=1) 1.262 proc/UUT/XLXI_184/N_812

Tilo 0.468 proc/UUT/XLXI_184/G_752_799

net (fanout=1) 0.900 proc/UUT/XLXI_184/G_752_799

Tilo 0.468 proc/UUT/XLXI_184/G_691

net (fanout=1) 0.297 proc/UUT/XLXI_184/N_890

Tilo 0.468 proc/UUT/XLXI_184/EX1_EX2_B_forward_sel_iv_0_3_307_1

net (fanout=1) 1.040 proc/UUT/XLXI_184/EX1_EX2_B_forward_sel_iv_0_3_307_1

Tilo 0.468 proc/UUT/XLXI_184/EX1_EX2_B_forward_sel_iv_0_3_307

net (fanout=1) 0.297 proc/UUT/XLXI_184/EX1_EX2_B_forward_sel_iv_0_3_307

Tilo 0.468 proc/UUT/XLXI_184/EX1_EX2_B_forward_sel_iv_0_3_308

net (fanout=1) 0.448 proc/UUT/XLXI_184/EX1_EX2_B_forward_sel_iv_0_3_308

Tilo 0.468 proc/UUT/XLXI_184/EX1_EX2_B_forward_sel_iv_0[3]

net (fanout=2) 1.425 proc.UUT.XLXI_184.EX1_EX2_B_forward_sel[3]

Tilo 0.468 proc/UUT/XLXI_184/un8_ALU_B_forward_sel_unreg

net (fanout=5) 1.636 proc/UUT/XLXI_184/un8_ALU_B_forward_sel_unreg

Tilo 0.468 proc/UUT/XLXI_184/un16_ALU_B_forward_sel_unreg

net (fanout=2) 1.527 proc.UUT.XLXI_184.un16_ALU_B_forward_sel_unreg

Tilo 0.468 G_1310

net (fanout=2) 0.509 G_1310

Tick 1.043 G_1310_rt

proc/UUT/XLXI_184/reg_ALU_B_forward_sel/DOUT[0]

---------------------------- ------------------------------

Total 19.121ns (6.715ns logic, 12.406ns route)

(35.1% logic, 64.9% route)

Destination: proc/UUT/XLXI_85/DOUT[4]

Requirement: 20.000ns

Data Path Delay: 18.920ns (Levels of Logic = 15)

Negative Clock Skew: 0.000ns

Source Clock: clk7 rising at 0.000ns

Destination Clock: clk7 rising at 20.000ns

Data Path: proc/UUT/XLXI_134/PC/DOUT[13] to proc/UUT/XLXI_85/DOUT[4]

Delay type Delay(ns) Logical Resource(s)

---------------------------- -------------------

Tcko 0.992 proc/UUT/XLXI_134/PC/DOUT[13]

net (fanout=2) 2.266 proc/UUT/ID_FO_PC[13]

Topcyg 1.000 proc/UUT/XLXI_156/my_branch_calcuation/branchAddress_1_axb_11

proc/UUT/XLXI_156/my_branch_calcuation/branchAddress_1_cry_11

net (fanout=1) 0.000 proc/UUT/XLXI_156/my_branch_calcuation/branchAddress_1_cry_11/O

Tbyp 0.149 proc/UUT/XLXI_156/my_branch_calcuation/branchAddress_1_cry_12

proc/UUT/XLXI_156/my_branch_calcuation/branchAddress_1_cry_13

net (fanout=1) 0.000 proc/UUT/XLXI_156/my_branch_calcuation/branchAddress_1_cry_13/O

Tbyp 0.149 proc/UUT/XLXI_156/my_branch_calcuation/branchAddress_1_cry_14

proc/UUT/XLXI_156/my_branch_calcuation/branchAddress_1_cry_15

net (fanout=1) 0.000 proc/UUT/XLXI_156/my_branch_calcuation/branchAddress_1_cry_15/O

Tbyp 0.149 proc/UUT/XLXI_156/my_branch_calcuation/branchAddress_1_cry_16

proc/UUT/XLXI_156/my_branch_calcuation/branchAddress_1_cry_17

net (fanout=1) 0.000 proc/UUT/XLXI_156/my_branch_calcuation/branchAddress_1_cry_17/O

Tbyp 0.149 proc/UUT/XLXI_156/my_branch_calcuation/branchAddress_1_cry_18

proc/UUT/XLXI_156/my_branch_calcuation/branchAddress_1_cry_19

net (fanout=1) 0.000 proc/UUT/XLXI_156/my_branch_calcuation/branchAddress_1_cry_19/O

Tbyp 0.149 proc/UUT/XLXI_156/my_branch_calcuation/branchAddress_1_cry_20

proc/UUT/XLXI_156/my_branch_calcuation/branchAddress_1_cry_21

net (fanout=1) 0.000 proc/UUT/XLXI_156/my_branch_calcuation/branchAddress_1_cry_21/O

Tbyp 0.149 proc/UUT/XLXI_156/my_branch_calcuation/branchAddress_1_cry_22

proc/UUT/XLXI_156/my_branch_calcuation/branchAddress_1_cry_23

net (fanout=1) 0.000 proc/UUT/XLXI_156/my_branch_calcuation/branchAddress_1_cry_23/O

Tbyp 0.149 proc/UUT/XLXI_156/my_branch_calcuation/branchAddress_1_cry_24

proc/UUT/XLXI_156/my_branch_calcuation/branchAddress_1_cry_25

net (fanout=1) 0.000 proc/UUT/XLXI_156/my_branch_calcuation/branchAddress_1_cry_25/O

Tbyp 0.149 proc/UUT/XLXI_156/my_branch_calcuation/branchAddress_1_cry_26

proc/UUT/XLXI_156/my_branch_calcuation/branchAddress_1_cry_27

net (fanout=1) 0.000 proc/UUT/XLXI_156/my_branch_calcuation/branchAddress_1_cry_27/O

Tciny 0.677 proc/UUT/XLXI_156/my_branch_calcuation/branchAddress_1_cry_28

proc/UUT/XLXI_156/my_branch_calcuation/branchAddress_1_s_29

net (fanout=3) 1.401 proc/UUT/branchAddress[31]

Topcyg 1.000 proc/UUT/XLXI_156/un2_branch_wrong_0.I_19

proc/UUT/XLXI_156/un2_branch_wrong_0.I_11

net (fanout=2) 1.705 proc/UUT/XLXI_156/I_11_0

Tilo 0.468 proc/UUT/XLXI_156/branch_wrong

net (fanout=2) 1.170 proc/UUT/BJVerifier_branch_wrong

Tilo 0.468 proc/UUT/XLXI_156/predict_wrong

net (fanout=35) 5.248 proc/UUT/XLXN_968

Tdick 1.333 proc/UUT/XLXI_91/DOUT[4]

proc/UUT/XLXI_85/DOUT[4]

---------------------------- ------------------------------

Total 18.920ns (7.130ns logic, 11.790ns route)

(37.7% logic, 62.3% route)

************************************************************************************

Timing constraint: TS_clk7_c = PERIOD TIMEGRP "clk7_c" 20 nS HIGH 50.000000 % ;

723 items analyzed, 0 timing errors detected.

Minimum period is 13.824ns.

--------------------------------------------------------------------------------

Slack: 6.176ns (requirement - (data path - negative clock skew))

Source: proc/UUT/XLXI_184/reg_ALU_B_forward_sel/DOUT[0]

Destination: proc/UUT/XLXI_178/B/DOUT[14]

Requirement: 20.000ns

Data Path Delay: 13.824ns (Levels of Logic = 4)

Negative Clock Skew: 0.000ns

Source Clock: clk7 rising at 0.000ns

Destination Clock: clk7 rising at 20.000ns

Data Path: proc/UUT/XLXI_184/reg_ALU_B_forward_sel/DOUT[0] to proc/UUT/XLXI_178/B/DOUT[14]

Delay type Delay(ns) Logical Resource(s)

---------------------------- -------------------

Tcko 0.992 proc/UUT/XLXI_184/reg_ALU_B_forward_sel/DOUT[0]

net (fanout=32) 7.677 proc/UUT/ALU_B_forward_sel[0]

Tif5 0.903 proc/UUT/XLXI_189/DOUT_5_am[14]

proc/UUT/XLXI_189/DOUT_5[14]

net (fanout=1) 0.000 proc/UUT/XLXI_189/N_3006

Tf5iny 0.220 proc/UUT/XLXI_189/DOUT_7[14]

net (fanout=1) 2.989 proc/UUT/XLXI_189/N_4549

Tick 1.043 proc/UUT/XLXI_189/DOUT[14]

proc/UUT/XLXI_178/B/DOUT[14]

---------------------------- ------------------------------

Total 13.824ns (3.158ns logic, 10.666ns route)

(22.8% logic, 77.2% route)

--------------------------------------------------------------------------------

Slack: 0.690ns (requirement - (data path - negative clock skew))

Source: proc/UUT/XLXI_93/instr_cache/XLXI_36/DOUT[26]

Destination: proc/UUT/XLXI_85/DOUT[20]

Requirement: 20.000ns

Data Path Delay: 19.310ns (Levels of Logic = 6)

Negative Clock Skew: 0.000ns

Source Clock: clk7 rising at 0.000ns

Destination Clock: clk7 rising at 20.000ns

Data Path: proc/UUT/XLXI_93/instr_cache/XLXI_36/DOUT[26] to proc/UUT/XLXI_85/DOUT[20]

Delay type Delay(ns) Logical Resource(s)

---------------------------- -------------------

Tcko 0.992 proc/UUT/XLXI_93/instr_cache/XLXI_36/DOUT[26]

net (fanout=1) 5.243 proc/UUT/XLXI_93/instr_cache/ME2_cache_data_rd_dout[26]

Tilo 0.468 proc/UUT/XLXI_93/dout[26]

net (fanout=2) 1.878 proc/UUT/instr_cache_instr[26]

Tilo 0.468 proc/UUT/XLXI_155/un11_is_branch_781

net (fanout=2) 2.289 proc/UUT/XLXI_155/un11_is_branch_781

Tilo 0.468 proc/UUT/XLXI_155/un11_is_branch_1

net (fanout=30) 3.964 proc/UUT/XLXI_155/un11_is_branch_1

Tif5x 0.871 proc/UUT/XLXI_155/target_addr_mux/DOUT_3_bm[20]

proc/UUT/XLXI_155/target_addr_mux/DOUT_3[20]

net (fanout=1) 1.171 proc/UUT/BJPredictor_target_addr[20]

Tif5ck 1.498 proc/UUT/XLXI_91/DOUT_am[20]

proc/UUT/XLXI_91/DOUT[20]

proc/UUT/XLXI_85/DOUT[20]

---------------------------- ------------------------------

Total 19.310ns (4.765ns logic, 14.545ns route)

(24.7% logic, 75.3% route)

************************************************************************************ //

Physical Design

The physical design and resource usage vary greatly with the flags we use with some of the tools. Some of the options we changed include:

Synplify (synthesis):

- fanout guide

- resource sharing

Xilinx Project Navigator (mapping, place and route)

- timing constraints

- place and route effort

- number of passes

The reports we provide below represent an average run that is fairly optimized for speed:

Design Summary

//*******************************************************

Number of errors: 0

Number of warnings: 8

Number of Slices: 6,418 out of 19,200 33%

Number of Slices containing

unrelated logic: 0 out of 6,418 0%

Number of Slice Flip Flops: 5,218 out of 38,400 13%

Total Number 4 input LUTs: 7,749 out of 38,400 20%

Number used as LUTs: 7,637

Number used as a route-thru: 45

Number used for Dual Port RAMs: 66

(Two LUTs used per Dual Port RAM)

Number used as 16x1 RAMs: 1

Number of bonded IOBs: 168 out of 512 32%

IOB Flip Flops: 71

Number of Tbufs: 544 out of 19,520 2%

Number of Block RAMs: 68 out of 160 42%

Number of GCLKs: 2 out of 4 50%

Number of GCLKIOBs: 1 out of 4 25%

Total equivalent gate count for design: 1,216,135

Additional JTAG gate count for IOBs: 8,112

******************************************************* //

Clock speed on board

As explained in the Critical Path Analysis section, we have critical paths under 25ns and should be able to run the processor on the board comfortably at 27 MHz (~37ns). We tried quick_sort, verify, and several of our randomly generated lw/sw test programs at this speed and got correct results on the board.

Performance

To collect statistics on our deep pipelined processor, we added a “statistics module” that collects control signals and other information from the instruction cache, data cache, stalling logic module, and the write-back stage register. It counts how many valid instructions are executed (i.e., valid instructions reaching the WB stage) and how many cycles are used, excluding the cycles spent initializing the memory and during the break. These numbers are written to a log file. We then calculate our CPI as follows:

CPI = (cycles used, excluding memory initialization and break) / (valid instructions executed)
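The calculation itself is simple; a minimal sketch in Python of how the statistics module's counters combine (the counter names here are hypothetical — the real module is a Verilog block that snoops the WB-stage valid bit):

```python
def cpi(total_cycles, init_cycles, break_cycles, valid_instructions):
    """Cycles per instruction, excluding memory-initialization and break cycles."""
    useful_cycles = total_cycles - init_cycles - break_cycles
    return useful_cycles / valid_instructions

# Illustrative numbers only (not taken from our log files):
print(round(cpi(total_cycles=11107, init_cycles=107, break_cycles=0,
                valid_instructions=5000), 2))  # -> 2.2
```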

We ran several test programs, including quick_sort.s, extra.s, base.s, and our own test program “compute 12!”, which computes the factorial of 12 without using the mult instruction (multiplication is implemented by repeated addition in the program). The measurements are summarized in the following table:

|Test program |CPI  |
|Quick sort   |2.2  |
|Extra        |3.6  |
|Base         |2.99 |
|Compute 12!  |2.05 |
|Average      |2.71 |

Notice that the average CPI is about 2 to 3. The CPI is greater than 2 because whenever we have a lw-use hazard, we must stall for more than 1 cycle (4 cycles, in fact) due to our 10-stage pipeline (we have an extra FO stage, 2 EXE stages, and 2 MEM stages), compared to the regular 5-stage pipeline. Moreover, since we have 3 IF stages, we must invalidate the instruction after the branch delay slot whenever a branch is taken. In general, more stages mean a higher CPI cost per hazard.
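The cost argument above can be made concrete with a back-of-the-envelope model. This sketch is illustrative only; the hazard frequencies are made-up numbers, not measurements from our statistics module:

```python
def cpi_estimate(base_cpi, hazards):
    """CPI = base + sum(frequency * stall penalty) over all hazard types.

    `hazards` maps a hazard name to (fraction of instructions affected,
    stall cycles per occurrence). All numbers here are illustrative.
    """
    return base_cpi + sum(freq * penalty for freq, penalty in hazards.values())

# 5-stage pipeline: a lw-use hazard costs 1 stall cycle.
# 10-stage pipeline: the same hazard costs 4 cycles, and each taken branch
# squashes the instruction after the delay slot (made-up frequencies).
five_stage = cpi_estimate(1.0, {"lw_use": (0.10, 1)})
ten_stage = cpi_estimate(1.0, {"lw_use": (0.10, 4), "taken_branch": (0.12, 1)})
print(round(five_stage, 2), round(ten_stage, 2))  # -> 1.1 1.52
```

The point is only that a fixed hazard mix gets more expensive as each penalty grows with pipeline depth.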

Furthermore, wrong branch and jr predictions can also increase our CPI, because we need to invalidate the fetched instructions.

Other stalls due to instruction and data cache miss, and write buffer full during sw also increase our CPI.

The CPI for the test program “Extra” is higher than the others because the program contains many consecutive sw and lw instructions, which cause many memory stalls and lw-use stalls. For general programs not written specifically to stress the cache, such as quick sort and the factorial computation, the CPI is only a little greater than 2.

How do different branch prediction schemes affect CPI?

Since the branch prediction scheme can affect our processor’s performance significantly, we compared the CPI under different prediction schemes to see the difference.

|CPI with different Branch Predictors |
|Test program |Our original version |Kubi’s version mentioned in lecture |Predicting always branch |Predicting always not branch |
|Quick sort   |2.2  |2.19 |2.32 |2.2  |
|Extra        |3.6  |3.60 |3.51 |3.82 |
|Base         |2.99 |2.99 |2.87 |3.14 |
|Compute 12!  |2.05 |2.05 |2.47 |2.16 |
|Average      |2.71 |2.71 |2.79 |2.83 |

As we can see from the table above, a bad branch predictor can cause a higher CPI, and our dynamic branch predictor also does slightly better than the static predictors.
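Kubi’s scheme from lecture is, as we understand it, a table of 2-bit saturating counters indexed by PC. To illustrate why such a predictor beats the static schemes once a branch changes bias, here is a small simulation of a single counter (a sketch, not our Verilog implementation):

```python
def predict_2bit(outcomes):
    """Accuracy of one 2-bit saturating counter (states 0..3,
    predict taken when state >= 2), starting at weakly-taken."""
    state, correct = 2, 0
    for taken in outcomes:
        correct += (state >= 2) == taken
        state = min(state + 1, 3) if taken else max(state - 1, 0)
    return correct / len(outcomes)

def predict_static(outcomes, always_taken):
    """Accuracy of a static always-taken / always-not-taken predictor."""
    return sum(o == always_taken for o in outcomes) / len(outcomes)

# A branch that is taken 20 times, then not taken 20 times:
trace = [True] * 20 + [False] * 20
print(predict_2bit(trace),               # -> 0.95 (mispredicts twice at the flip)
      predict_static(trace, True),       # -> 0.5
      predict_static(trace, False))      # -> 0.5
```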

Conclusion/Evaluation

There are still many improvements we could make to our current pipeline. At normal place and route effort levels, we are unable to reach 50 MHz. At higher effort levels, we can meet a timing constraint of 18 ns for the clock, although we have not been able to get that working on the board (even though the timing analysis says it should).

There are many things we could continue to work on, given more time. We have since eliminated the critical path through the write buffer by adding a random write buffer (this removes some carry chains caused by the pointer arithmetic of the FIFO write buffer). But other modules remain on the critical path, and we could work on them to improve the minimum period:

Pipelining the branch verifier. We could spread the execution of that module over two stages, EX1 and EX2. This would mean invalidating one more instruction whenever we mispredict a branch.

Pipelining the stalling logic. We could do half the stalling computation (the comparison of the opcode and funct fields) on the instruction in the IF2 stage, in parallel with the branch predictor logic, and then generate the actual stall signals in the ID stage. This would, however, require a major overhaul of the stalling logic and much testing.

Pipelining the ALU. As this is the next module on the list of critical paths, we could do this and actually make use of the EX2 stage we originally thought we would need. But this would also require changes to forwarding or stalling, depending on whether we want to forward 16-bit partial results between the stages or simply stall two consecutive dependent ALU instructions.

Testing all the corner cases was probably the most difficult part. There are very complex interactions between the stalling, forwarding, branch prediction, and branch verification modules. More cases kept surfacing during testing, and in some cases the signals we added to check for them lengthened our critical path, forcing us to adjust our designs.

Synthesis was also a pain, as the 5.1 tools in the 150 lab did not always create bit files that worked. We could only use guide design files in 5.1, because checking this option crashes the 4.1 tools. We still do not know how to make the best use of the tools, as our routing is still not up to the 50-50 (logic delay vs. routing delay) rule of thumb that Xilinx recommends. This may, however, be due to the nature of the processor: with its long feedback loops, it is hard for the router to do a good job.

Appendix I (online notebook)

///**********************************

Leo Ting’s online notebook (58 hours)

%%%%%%%%%%%%%%% Sat May 3 16:18:13 PDT 2003 %%%%%%%%%%%%%%%

Goal: begin work on converting the Lab 6 cache control module to a FSM design

#001: i have decided to convert the entire cache system to use 23-bit word addresses, as this will simplify the schematics

as well as the control logic - this is based on the fact that the SDRAM system uses 23 bits of addressing

#002: the cache datapath will be completely revamped

among other things, the use of an FSM controller will allow a slightly cleaner design

this should also be more advantageous with respect to critical path analysis

#003: the new cache datapath is now mostly complete

the new control module should be the only major thing left to instantiate

some notes:

the cache datapath will now take place across 2 pipeline stages (IF1, IF2 in the case of the instruction cache, and

ME1, ME2 in the case of the data cache - the names in this datapath will follow the data cache)

i have tried to do as much work in the ME1 stage as possible so that the ME2 stage will have less logic to take care of

ostensibly, the cache controller will reside in the ME2 stage, and so there will of course be some delay associated with

the fsm controller

i am aiming for a mealy machine type of FSM controller, as this gives the best operation of the cache given the current

timing paradigm (wrt to the SDRAM controller)

%%%%%%%%%%%%%%% Sun May 4 02:36:25 PDT 2003 %%%%%%%%%%%%%%%

%%%%%%%%%%%%%%% Mon May 5 20:35:41 PDT 2003 %%%%%%%%%%%%%%%

Goal: begin work on the new cache controller module

#004: changed some of the wire names in the cache datapath so that they are more descriptive

#005: hmm, there may be a hazard in which a lw that is in the ME1 stage is made to (due to the way the controller works) re-read

the cache after a write-buffer update to the cache; would this be wrong?

#006: in any case, the cache control module now contains the cache-read and cache-write FSMs, and the next-state logic for both

FSMs have been coded

however, there are still some bugs and incomplete logic that need to be ironed out - i will do that tomorrow

also, i will write black-box tests for the cache module, and i will add a debugging $display component to the cache control

module

%%%%%%%%%%%%%%% Mon May 5 23:58:39 PDT 2003 %%%%%%%%%%%%%%%

%%%%%%%%%%%%%%% Tue May 6 14:39:49 PDT 2003 %%%%%%%%%%%%%%%

Goal: continue work on the new cache controller module

#007: ok, i have determined a solution for the existing loopholes in the control module; i will implement it after the midterm

%%%%%%%%%%%%%%% Tue May 6 15:52:57 PDT 2003 %%%%%%%%%%%%%%%

%%%%%%%%%%%%%%% Wed May 7 22:47:34 PDT 2003 %%%%%%%%%%%%%%%

Goal: continue work on the new cache controller module

#008: hmm, another problem: it seems that if a lw is in the ME2 stage, and a sw is in the ME1 stage, and if the lw is in a

cache miss, then there is a WAR hazard - the sw might cause the write buffer to write to SDRAM before the SDRAM controller

processes the lw's read request, and hence possibly causes an updated (but incorrect) value to be presented to the lw

a solution would be to hijack the write buffer's write request signal to the SDRAM controller, and to prevent the write buffer from writing

#009: ok, expanding on the previous suggestion, i have created a third FSM, one that takes care of write-buffer-SDRAM interaction

i am also including one of the multiplexor selector signals inside the cache-read FSM, because it is easier this way

#010: finally! the cache datapath schematic and control module is finished

i am ready to begin testing the cache, and will be writing testbenches soon

%%%%%%%%%%%%%%% Thu May 8 04:19:17 PDT 2003 %%%%%%%%%%%%%%%

%%%%%%%%%%%%%%% Thu May 8 14:22:56 PDT 2003 %%%%%%%%%%%%%%%

Goal: begin testing the cache module

%%%%%%%%%%%%%%% Thu May 8 17:29:39 PDT 2003 %%%%%%%%%%%%%%%

%%%%%%%%%%%%%%% Fri May 9 20:27:14 PDT 2003 %%%%%%%%%%%%%%%

Goal: continue testing the cache module

#011: ok, after discussing with kh, we have decided to introduce two new states into the cache-read FSM; our goal is to eliminate a

critical path that now starts from the SDRAM module, passes through the SDRAM_cache_wr_wait signal, and then through the cache_addr

multiplexor; the new state puts a register in the middle of this path, as well as reduces the number of entries in the multiplexor from

5 to 4

#012: hmm, it seems that i may be able to remove the cache-write FSM; right now, the FSM is redundant because none of the signals in the

cache controller depend on the state of the FSM

#013: it seems that i may be able to get rid of two states in the cache-write FSM as well; i will have to check on this

#014: ok, several test fixtures have been written that provide a black-box behaviour analysis of the cache control module; so far, these have

revealed a few bugs that are in the cache control module

#015: bug: if a lw is in ME2 and a sw is in ME1 and the lw causes a cache miss, there is a possibility that the data written to the

write buffer (by the sw) may be written to SDRAM before the cache-miss is resolved, and therefore, the lw may get data that has been

overwritten by the sw

currently, the cache control does not handle this case

a solution would be to prevent the write buffer from actually signalling a write request to the SDRAM controller in the event of a

cache miss

however, the cache controller must still allow an existing write-buffer-SDRAM write-request (at the time of the cache miss) to continue

one way to augment the cache controller with this functionality is to have a register that holds the previous value of the

cache_SDRAM_wr_req signal; this allows the cache controller to detect when a write-buffer-SDRAM write-request has just begun (in

the current cycle), and hence, allows the cache controller to intercept that request (any later, and it will be too late - the SDRAM

controller will acknowledge the write request and things will go wrong)
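The interception scheme described above amounts to a rising-edge detector on cache_SDRAM_wr_req. A behavioural sketch in Python (the signal names follow the notebook; everything else is a stand-in for the Verilog):

```python
class WriteReqInterceptor:
    """Holds the previous cycle's cache_SDRAM_wr_req in a register so a
    request that began THIS cycle can be suppressed during a cache miss;
    a request already visible last cycle has been (or is being) acknowledged
    by the SDRAM controller and must be allowed to complete."""

    def __init__(self):
        self.prev_wr_req = False  # the extra register described above

    def clock(self, wr_req, cache_miss):
        just_started = wr_req and not self.prev_wr_req
        suppress = cache_miss and just_started
        # If suppressed, the SDRAM controller never saw the request, so the
        # register must not record it as in-flight.
        self.prev_wr_req = wr_req and not suppress
        return wr_req and not suppress  # request as seen by the SDRAM controller

wb = WriteReqInterceptor()
assert wb.clock(wr_req=True, cache_miss=True) == False   # new request held off
assert wb.clock(wr_req=True, cache_miss=False) == True   # miss resolved: proceeds
```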

%%%%%%%%%%%%%%% Sun May 11 01:57:31 PDT 2003 %%%%%%%%%%%%%%%

%%%%%%%%%%%%%%% Sun May 11 20:55:03 PDT 2003 %%%%%%%%%%%%%%%

Goal: start integrating the entire processor (with respect to z)

#016: heh, bug in the cache datapath - the write-buffer-cache-tag match logic did not incorporate the valid bits that come out of the cache

tag modules

#017: bug! the logic that compares the tag values in the ME1 stage does not have an input from the ME2 cache address stage - this causes cache

misses to possibly halt forever!

%%%%%%%%%%%%%%% Mon May 12 03:21:06 PDT 2003 %%%%%%%%%%%%%%%

%%%%%%%%%%%%%%% Mon May 12 23:17:12 PDT 2003 %%%%%%%%%%%%%%%

Goal: continue writing test cases for the cache module, and then integrate it with the processor

#018: ok, it seems that the cache is finally working

thus, we will now begin full testing of the combined processor

#019: helped soe resolve a problem with the write buffer; the problem involved registers inadvertently delaying their input sampling

to after the clock edge

#020: ok, wrote a few more test cases for the cache module; so far, no bugs detected

%%%%%%%%%%%%%%% Tue May 13 03:45:31 PDT 2003 %%%%%%%%%%%%%%%

%%%%%%%%%%%%%%% Tue May 13 21:27:20 PDT 2003 %%%%%%%%%%%%%%%

Goal: continue testing the processor

#021: discovered a bug in the cache datapath; the ME1_cache_data_rd_dout mux was using the wrong selector line, thus, the cache was

sometimes outputting the wrong set of the 2-way cache

#022: discovered a bug in the cache control: during pipeline stalls, the address to the cache BlockRAMs must be taken from the ME1 stage,

not the EX2 stage, as we discovered

%%%%%%%%%%%%%%% Wed May 14 02:47:38 PDT 2003 %%%%%%%%%%%%%%%

%%%%%%%%%%%%%%% Wed May 14 20:14:21 PDT 2003 %%%%%%%%%%%%%%%

Goal: prepare tomorrow's slides, run a few more test cases

%%%%%%%%%%%%%%% Thu May 15 00:22:20 PDT 2003 %%%%%%%%%%%%%%%

%%%%%%%%%%%%%%% Thu May 15 22:12:33 PDT 2003 %%%%%%%%%%%%%%%

Goal: write a randomised write buffer

%%%%%%%%%%%%%%% Fri May 16 03:12:02 PDT 2003 %%%%%%%%%%%%%%%

%%%%%%%%%%%%%%% Fri May 16 13:29:25 PDT 2003 %%%%%%%%%%%%%%%

Goal: test the randomised write buffer

***********************************//

///***********************************

Khian Hao Lim’s online notebook (~80 hours)

--------------------------------------------------

4/31/03 8pm

First meeting on final project. Decided to do deep pipelining, branch prediction, jump register prediction.

TODO:

- change datapath/controller with kenneth

- branch predictor and jump predictor

- help leo change mem control

- change cycle counter to not run with break in id stage

- pipeline memory mapped io into 2 stages and forward from middle

- check everyone's param and files for conformance

poss:

- 27mhz clock (fix lab6)

- run with higher clock

--------------------------------------------------

5/1/03 9pm

Started with kenneth on changing datapath

notes:

- 3 layer muxing for npc

todo:

- each reg block

- using valid bit because of branch

- tell leo about need for stalling to control enable of his second register

- considering cutting alu into 2

- document everything clearly

- name all components and groups properly

- debug displays (displaying to console or file whenever table (eg BJPredictor) changed)

- testbenches for BJPredictor and BJVerifier

- assertions on invariants

- disassembly module to print valid bit as well

- BJPredictor, BJVerifier to print to file and later get ready for presentation

get branch miss rate, cost of miss, cpi as a result of stalls and squashing of instr

- memory mapped io needs register to act as 2 stages also for forwarding

- file log for every module?

- unified way of doing assert and file io displays (only disassembly module gets console??)

random_notes:

- IF_ID stage has delayed_rst signal.. document that

--------------------------------------------------

5/2/03 10 am

Continue with todo's from yesterday nite

Note: currently branch and jump tables not initialized on rst

test_BJPredictor and test_BJVerifer todo:

- test assertions

- write up documentation

- try modelsim help and environment

- timeformat

file writing notes:

- param

- $Id: khianhao_notebook.txt,v 1.17 2003/05/14 20:31:43 cs152-khianhao Exp $

- displays and file io

- delay

- check 'z or 'x on fallen elses

scripting of tests: (done)

- script that copies the file to c:\temp\quicktest.mem

- change the register file to dump out contents to c:\temp\regfile.log

- tcl script that looks for a list of *.mem.con.head files, calls the rename script on each file, run the simulation, raise a line high, run the simulation for one more cycle, concats the data from c:\temp\regfile.log into c:\temp\regfile.logs

necessities: cycle is 1us, Fake_Toplevel is entry point, directory structure is the same

--------------------------------------------------

5/3/03 9am

todo today:

//- look at the links haywood sent

- document and test my jump, branch predictor

- try coregened mux

- check that lab6 resynthesized same delay, if not check why not

- bufg dll for better clock buffing

- datapath changes and testing

- memory_mapped_io needs reg for dout

- wrapper files and using tools to check correctness of connection

//- BUFGDLL

- status

- find out about timing

- do timing simulation

- name components in datapath so debugging and delays easier to see

- minimizing fanout in synplify

- tristate mux?

- try multiply clk

- have reg in sdram controller

- changing implns to one hot

- improve cycle time in branch

timing:

*- what is IBUFG, BUFG

documenting predictor, verifier

- algorithm

- where they reside

- critical path

- logging and statistics collection

- change of one line

- testing

datapath qns and notes:

- delayed rst needed to prevent first instr from executing twice

- need memory_mapped_io to have register for dout as well

- but don't we need to forward from in the middle of memory_mapped_io and datamem?

- don't use the mem_wr_en and reg_wr_en as validity flags, everything should check valid bit

- clear up PC, PC+8, nPC business

- tell soe to better name the stalling logic and change the datapath as necessary

- careful: lw-use stall

- why are we making the forwarding stuff combinational logic

- testbench needed to ensure lines all going through correctly

final testing include:

- cycle count

- DP'S

- io input

meeting timing:

- use timing constraints to

- what haywood said about not accurate static timing analysis

--------------------------------------------------

5/6/03 8 pm

qns to leo

- what happens for leo when proc stalls

- ask leo whether he register the values on stall

- memory mapped io reg

- no resetting of branch info

- ask leo about first cycle executed twice

--------------------------------------------------

5/7/03 7 pm

todo:

//- ask leo: - what happens for leo when proc stalls (then can remove pc mux)

// - whether he register the values on stall

// - ask leo about first cycle executed twice

//- no resetting of branch info

//- remove buf for synthesis of lab6

//- whole processor setup for synthesis

//- synthesize lab7

//- pc + 8

//- hex_led_manager

//- retry clock problems

//- look at forwarding logic, stalling logic and cache

//- stall after every bne/jump

//- pc mux one less, can put back later

//- ask kenneth to verify datapath correct

//- fix the PC(31:0) to instr mem change

//- remind soe to change the valid bit

//- lookup manual for timing analyzer

//- valid for predictor

to improve timing (in order):

- do map report first

- more synplify flags

- follow period settings, ports/pads as in timing closure and run normal, other modes

- floor planning

possibilities for reducing critical path delay in design:

- memory control register input, output to memory

- memory control register input and output to cache

- branch prediction in 2 stages

- branch prediction after instr cache, therefore one cycle late

- remove that mux in the way at nextpc

- cache in 1 cycle

critical paths in order (5/10 1:12pm) logic_delay

- stalling 8.0

- shifter 6.5

- branch verifier 6.0

- branch predictor (will be trouble once we increase the width)

- alu 8 (but routing is easier)

- forwarding 7.5

to soe:

forwarding:

//- PC+8

//- not regdstdata but memiodout

// forwarding, stalling:

//- valid bit not checked

iterations to reduce path:

- start with datapath, half-tested, compilable

- found nextPC to be a problem, pipeline

- optimize ALU

- forwarding critical path, inserted new FO stage

--------------------------------------------------

5/10 3pm

- start testing lab7

- register RAM_CLK

- write and run test bench for BJPredictor and BJVerifier

//- bjverifier should take in en for its registers

//- test memory mapped io reged

//- forwarding left unchanged but stalling needs to change

// to drop the alu from critical path

//- pc initial not -4

//- look at forwarding

try:

//- FPGA editor

//- floor planner

- remove unnecessary logic!!!

- glob_count needs to change to accomodate faster clock

- think about DCS

- predictor to compile to using larger table

- predictor in 2

- cvs commit/update of mem_ctrl

- debug both write buffers, so can report on performance, one slot allows 18ns

- clk multiplier and test clocks and registered io on board

- forwarding should take in en for its registers

- naming of wires and components

- tell leo about naming cache.sch

things we can do to recover the efficiency:

- alu alu case (easiest) //done

- stalling logic (control doing some preprocessor work (target the IF2/ID stage register))

- write buffer slots

- branch delay slot nop (hardest) //dont try

How to make routing lower:

- show advanced settings

- map properties: route for speed

- place and route: 2000 routing passes, effort 5 for all, extra effort on

- try using design guide

- try using mppr

to test:

- effects of registering io from/to pads

timeline:

7pm

- write buffer not making 20ns

7:11

- initialization code for bjpredictor tables

7:31

- reverted back to using older version of mem_ctrl

9:23

- fixed rst, clk flip

11:00

- some of the forwarding wires were wrong

11:34

- bug with forwarding, mux should be 2 entries bigger

11:34

- Kenneth reminded me to remove the valid for ex1 stage

testing need to do:

//- need to initialize those bjpredictor ram

//- check whether forwarding needs the register block

//- check why reg_we == 0 and mem_we == 0

//- lw from 00000? how come didn't go to cache, how come didn't forward

//- improve monitor and other displays

//- synthesize again to check

//- think about nextPC, enabling when instructions are invalid etc

//- for level0boot case, how come branched to wrong place, pc not updated etc

//- do we need the register block's enable's signal to mem_map io and the

// other modules?

//- remove all those todo things in datapath

//- valid bit in stalling (ask haywood)

//- valid bit external in datapath

//- some of the debugging in reg blocks

//- first instr comes out at PC==0 but invalid, y?

//- should check valid or not before bjverifer update bjpredictor

//- mem_map_io needs reg_enable

//- instmem should be getting the same as the pipeline regs

//- datamem should do something like instmem in handling 2 stages

//- understand reg_enables

//- zz in branch target address

//- put back block ram and test without memory

//- those valid assignments

//- check that when invalid, don't commit

//- stats.v is incorrect in its counting

//- how should we enable the block register in instmem (level0boot) and mem_map_io

//- should instmem's reg block use valid or ~proc_cache_stall

//- check that all those that need valid bits check for it

// (mem_map_io, stalling_logic, forwarding_logic, datamem, regfile)

//- debug one at a time, let branch predictor be assign 1

//- simple tests of datapath without memory, test stalling for hazards, forwarding, branch prediction, verification

//- ask leo whether he loop data (eg addr) himself

//- what should be the stall signal (proc_stall, complete stall for lab6)be for instmem and datamen

//- we seem to be stalling for sw

- numDOUT > 1 in write buffer to prevent sw store

- document the semantics of the stalling logic

- check all modules don't have the #delay after always (@posedge clk)

- need to buffer some of those signals ourselves (eg stalling en is delayed because of break_stall's fanout etc)

- rerun all test of instr mixes, make sure controls and general datapath works

- for mem to work, need register to pad to be half clock cycle

- call error when UNKNOWN instr or other circumstances

5/13/03 9:45pm todo

- stats needs name and correct:

stall percentage for write buffer

- analyze branch statistics

- do presentation

- look for performance of sw/branch to see we don't invalidate extra things, stall extra etc

- display when we stall and the reason (in stalling logic and pipeline monitor)

- verify that first instr flow is correct

- test reset in simulation

- jr, guess jr to 31 (debug reg file) or guess value lwed from stack

individual block that needs display:

- bjpredictor and verifier display predict_change, update etc

notes:

- stalling_logic v 1.14 has short cut

- write_buffer.v 1.4 has short cut

- bjverifier v1.11 has pipelined version

- counter's initial out= 0

- watch out all counters (remember that mem_ctrl needs time to reset)

- first instr comes out at PC==0 but invalid

- changed the sel logic in bjverifer.v

- boardramcreate should be initial addr_reg PC enable... critical path

several Critical paths we found:

1. stage reg -> stalling -> PC

2. next PC predictor

3. forwarding logic

4. PC+8 thing

Some ways to reduce high fan-out:

- duplicate the signal in the logic

Sat May 10 03:30:51 PDT 2003

- ======================

5

+ ========================

Sun May 11 22:20:14 PDT 2003

goal :debug, try some test cases and check wire connections

bugs found

1. branch verifier doesn't need to invalidate the previous instr if the branch predictor is wrong, since that

instr is the delay slot instruction

2. the stalling logic should check whether the break instruction is valid or not...

above bugs are corrected inside CVS...

Khian just made new changes in the BJVerifier.v

******

right now I am testing the datapath without instruction cache, I use blockram instead.

So i modified instmem.v:

wire [31:0] addr_reg_if1_if2_dout;

reg32 addr_reg (.CLK(CLK), .RST(RST), .EN(~proc_cache_stall), .DIN(addr), .DOUT(addr_reg_dout));

reg32 addr_reg_if1_if2 (.CLK(CLK), .RST(RST), .EN(~proc_cache_stall), .DIN(addr_reg_dout), .DOUT(addr_reg_if1_if2_dout));

wire [31:0] dout;

wire [31:0] bram_addr;

ramblock2048b bram(bram_addr[12:2], CLK, 32'bx, dout, asynchdout0, 1'b1, 1'b0);

//always @ (posedge CLK) begin

// doutramdelay = doutram;

// end

assign bram_addr = (proc_cache_stall) ? addr_reg_if1_if2_dout : addr_reg_dout;

*****

It seems that for the following codes

addiu $8, $0, 0xf00d # $8 = f00d ,78

lw $1, 0($31) # get the first word ,7c

beq $1, $8, pass1 # first error, 80

addiu $12, $12, 1 # $12 = 1, 84

addiu $13, $13, 1 # ERROR CODE 1, 88

pass1:

...

addiu $13,$13,1 is not executed in fact the beq is wrong...

1 thing to remember is that...according to our current design

addiu $13,$13,1 is fetched no matter what, and it seems that it must also get invalidated....

It seems to me the branch verifier doesn't set the right address....

Finally I found the bug that caused the above problem:

the last input of the B_after_forward mux should be ID_FO_B instead of ID_FO_A !!!!!

after I fixed it..it works now...

okay.. I haven't committed the change and going to leave a note to khian

Mon May 12 05:49:40 PDT 2003

- ===========================

7

+ =========================

Wed May 14 20:12:41 PDT 2003

Goal: debug again

a bug about break is found:

we need a way to distinguish the following two situations

(1)

beq

break

add

vs

(2)

beq

add

break

when the next PC predictor predicts that the branch should not be taken, and then verifier detects that

it is wrong, we need to invalidate the break in case (2). according to our current implementation, the break

in case (2) is executed anyway, and waits for the release signal.....

i made a simple test case

break_test.s, it verifies that the break in (2) is executed no matter what....

although it is wrong to execute the break, the PC is set correctly.....

so right now the problem is that an extra break executes when it is not supposed to...

Wed May 15 01:32:41 PDT 2003

- ===========================

5

+ ==============================

Wed May 15 10:50:41 PDT 2003

goal: collect stats of different test benches

collect stats for extra.s, quick_sort, and base...

but it seems that the count for jr is wrong...since it indicates that it is 100% right for base and extra..

need to take a look at it later..

go to presentation, continue on the stat later

Wed May 15 12:30:41 PDT 2003

- ==========================

2

+ ========================

Fri May 16 15:45:50 PDT 2003

goal: change the BJPredictor to Kubi's version to see the accuracy difference...

(100 us for initialize mem, take 107 cycles)

quick_sort: simulation runtime: 2318us + 12290 us = 14608 + 10 us

khian's BJ predictor correct %: 65.97 %

Kubi's BJ predictor correct %

extra: runtime: 1321us + 842 us = 2163 us (up to break)

khian's BJ predictor %: 87.06 %

Kubi's BJ predictor %:

base: runtime: 4249 us + 310 us + 450 us = 5229 us

khian's BJ predictor %: 91.17 %

kubi's BJ predictor %

compute 8! runtime: 600us + 1 us +700 us

I generated some stats files inside

lab7\kenneth\kubi_predictor\

and

lab7\kenneth\khian_predictor\

also i tried pipelining the verifier... but after I made the change... it seems it is not working properly...

i suspected that we also need to change sth to handle stall and update things correctly since now the predict_wrong signal is like 1 cycle delay...

so i am not going to do it now....

Sat May 17 00:30:50 PDT 2003

- ================

9

+ ===================

Sat May 17 15:48:41 PDT 2003

goal: modify the BJPredictor to always predict branch and to always predict not branch

new statistics files are created in the folders..

lab7\kenneth\always_branch\

lab7\kenneth\always_not_branch\

I also modified the datapath to collect the verifier's predicted PC, and added some new inputs to the stats module

but i didn't commit the files, just copied those files into my folder

lab7\kenneth\modified file for statistic\

datapath.sch

stats.v

stats.sym

Sat May 17 18:00:16 PDT 2003

==================

***************************************//

///***********************************

Soe Myint’s online notebook (~45 hours)

5/5/2003

7:37pm

start lab7 STALLING LOGIC

MIGHT HAVE TO CHANGE STALLING LOGIC AND FORWARDING LOGIC IF WE CHANGE DATA PATH TO ADD IN IF3 STAGE AND DELETE EX2 STAGE

LW_STALL ID_EX1 with IF2_ID conflict

EX1_EX2 with IF2_ID conflict

EX2_ME1 with IF2_ID conflict

deleted multiply logic from the previous stalling logic.

input WR_BUF_FULL ; // signal from write_buffer that the buffer is full; need to stall until data has been written

input start; // start the processor; start = 0 means the DRAM is setting up, need to stall the whole processor

have to make sure that the registers hold a real instruction, not a NOP instruction from previous stalling

input ID_EX1_reg_en ; // to make sure that the reg holds a real instruction, not a NOP instruction generated by previous stalling

input EX1_EX2_reg_en;

input EX2_ME1_reg_en;

deleted from the always block of the stalling logic; might need it later.

IF1_IF2_opcode or IF1_IF2_funct or IF1_IF2_rs or IF1_IF2_rt or IF1_IF2_rd or

EX1_EX2_opcode or EX1_EX2_funct or EX1_EX2_rs or EX1_EX2_rt or EX1_EX2_rd or

EX2_ME1_opcode or EX2_ME1_funct or EX2_ME1_rs or EX2_ME1_rt or EX2_ME1_rd or

ME1_ME2_opcode or ME1_ME2_funct or ME1_ME2_rs or ME1_ME2_rt or ME1_ME2_rd or

change from this

or_funct:

begin

if(IF_ID_rs == ID_EX_rt || IF_ID_rt == ID_EX_rt)

lw_stall = 1;

else

lw_stall = 0;

end // case: or_funct

TO this

or_funct:

begin

if(IF2_ID_rs == ID_EX1_rt || IF2_ID_rt == ID_EX1_rt)

lw_stall_ID_EX1 = 1;

else

lw_stall_ID_EX1 = 0;

end // case: or_funct
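Behaviorally, the new check is the old IF_ID/ID_EX one with the registers renamed for the deeper pipeline: stall when the instruction in IF2_ID reads the destination of an in-flight lw (in ID_EX1, and likewise for the EX1_EX2 and EX2_ME1 cases listed earlier). A small Python model of the per-stage check (hypothetical helper, mirroring the Verilog above):

```python
def lw_stall(if2_id_rs, if2_id_rt, lw_rt):
    """Model of one lw-hazard check: stall if the instruction in IF2_ID
    reads the register (rt) that an in-flight lw is still loading.
    The real logic replicates this for ID_EX1, EX1_EX2, and EX2_ME1."""
    return if2_id_rs == lw_rt or if2_id_rt == lw_rt

# lw $5, 0($3) in ID_EX1; "or $6, $5, $4" in IF2_ID -> must stall
assert lw_stall(if2_id_rs=5, if2_id_rt=4, lw_rt=5)
# an independent instruction -> no stall
assert not lw_stall(if2_id_rs=6, if2_id_rt=7, lw_rt=5)
```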

10:51pm

finished stalling_logic,

committed to CVS

has not tested the stalling_logic yet

start forwarding logic

muxes A and B select the forwarded signal depending on the outputs

output [2:0] ALU_A_forward_sel;

output [2:0] ALU_B_forward_sel;

will be forwarding from 5 places

2am:

==========================================================

5/7/2003

10:00pm

continue forwarding logic

multiply deleted.

multu_funct = 6'd25, // ** ** indicates some change made by Ka Hou Chan on 4/9 12 pm

mfhi_funct = 6'd16, // **

mflo_funct = 6'd18, // **

the original version of the forwarding logic is on the U: drive, named forwarding_logicbackup.v

11:10pm

deleted from original

/*

// parameters that will be presented to the data_mem_write_forward_sel output

// i.e., they will be used as selector values in the forwarding mux

parameter data_mem_forward_from_normal = 1'b0;

parameter data_mem_forward_from_data_mem_read = 1'b1; */

or

IF_ID_inst or ID_EX_inst or IF_ID_inst_reads_rs or IF_ID_inst_reads_rt or ID_EX_reg_wr_en)

2:31 AM

finished three forwarding places; need to do the ME1 and ME2 forwarding logic.

3:42 am

stalling_logic and forwarding_logic compiled with a dummy testbench.

the datapath has been updated and all signals from the forwarding logic and stalling logic have been added to the datapath

the datapath backup is in the u: drive soe folder.

6:02 am

go home now

==========================================

5/8/2003

1:00 pm

changes to forwarding logic

changed the reg_wr_en bit to a valid bit

have to add one more stage since the forwarding logic is too long

put the muxes in their own stage

put a backup in the "without fo stage" folder

always @(IF2_ID_opcode or IF2_ID_funct or IF2_ID_rs or IF2_ID_rt or IF2_ID_rd or

ID_EX1_opcode or ID_EX1_funct or ID_EX1_rs or ID_EX1_rt or ID_EX1_rd or

IF2_ID_inst_reads_rs or IF2_ID_inst_reads_rt or ID_EX1_valid or ID_EX1_inst)

begin

case (ID_EX1_inst)

JAL_inst:

begin

// logic for ALU A source

if (IF2_ID_inst_reads_rs && IF2_ID_rs == reg_ra && ID_EX1_valid == 1)

ID_EX1_reg31_A_forward_sel = AB_forward_from_ID_EX1_reg;

else

ID_EX1_reg31_A_forward_sel = AB_forward_from_normal;

if (IF2_ID_inst_reads_rt && IF2_ID_rt == reg_ra && ID_EX1_valid == 1)

ID_EX1_reg31_B_forward_sel = AB_forward_from_ID_EX1_reg;

else

ID_EX1_reg31_B_forward_sel = AB_forward_from_normal;

end // case: JAL_inst

default:

// all other instruction except jal

begin

ID_EX1_reg31_A_forward_sel = AB_forward_from_normal;

ID_EX1_reg31_B_forward_sel = AB_forward_from_normal;

end // case: default

endcase // case(ID_EX1_inst)

end // always @ (IF2_ID_opcode or IF2_ID_funct or IF2_ID_rs or IF2_ID_rt or IF2_ID_rd or...

4:40pm

one more stage has been added to stalling_logic

IF2_ID_valid added to stalling_logic for efficiency

always @ ( IF2_ID_opcode or IF2_ID_funct or IF2_ID_rs or IF2_ID_rt or IF2_ID_rd or

EX2_ME1_opcode or EX2_ME1_funct or EX2_ME1_rs or EX2_ME1_rt or EX2_ME1_rd or

EX2_ME1_valid or IF2_ID_valid )

begin

if(EX2_ME1_opcode == lw_opcode && EX2_ME1_rt != 0 && EX2_ME1_valid == 1 && IF2_ID_valid)

begin

10:22pm

go home

5/12/2003

6:05 pm

found two bugs in the datapath

while stalling, the cache will keep producing instructions, so we need a register at the IF1 and IF2 stages to keep the instruction at that PC
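A behavioral sketch of that hold register (Python model; the class and signal names are hypothetical): while stall is asserted, the register captures the instruction first produced at that PC and replays it, instead of passing through whatever the cache produces next.

```python
class HoldReg:
    """Model of the IF1/IF2 hold register: latches the cache output on
    the first stalled cycle and replays it until the stall clears."""
    def __init__(self):
        self.held = None

    def step(self, cache_out, stall):
        if not stall:
            self.held = None           # pass-through: nothing to hold
            return cache_out
        if self.held is None:
            self.held = cache_out      # capture on the first stalled cycle
        return self.held               # replay while stalled

r = HoldReg()
assert r.step("addu", stall=False) == "addu"
assert r.step("beq",  stall=True)  == "beq"   # captured at stall onset
assert r.step("sw",   stall=True)  == "beq"   # cache moved on; we hold
assert r.step("sw",   stall=False) == "sw"    # stall cleared
```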

11:04 pm

finish writing stats.v and stats_testbench

committed to cvs

start changing the datapath

add one more output to datamem and instruction mem

module datamem (// inputs

CLK,

RST,

we,

addr,

din,

SDRAM_cache_rd_dout,

SDRAM_cache_rd_wait,

SDRAM_cache_wr_wait,

proc_cache_stall,

EX_opcode,

valid,

// outputs

cache_SDRAM_rd_addr,

cache_SDRAM_rd_req,

cache_SDRAM_wr_addr,

cache_SDRAM_wr_din,

cache_SDRAM_wr_req,

dout,

cache_proc_wait,

cache_SDRAM_wr_buf_full,

proc_cache_rd_req

);

one more output, proc_cache_rd_req, was added for the data mem and the instruction mem

12:36 am

datapath is updated with datamem, instruction mem and memio changed.

2:00 am

go home

--=============

5/13/2003

7:05 pm

start the test cases

4 tests regfile.log.test_ran_test_*.mem.con.head

one test failed:

regfile.log.test_ran_test_gen_ran_instr.s failed; register 20 is 0 instead of fffffff

fixed the cache to resolve the bug

testing forwarding_test_ID_EX_ALI.s

testing forwarding_test_ID_EX_ALR.s

testing forwarding_test_ID_EX_BRANCH3.s

testing forwarding_test_ID_EX_BRANCH4.s

testing forwarding_test_ID_EX_J.s

testing forwarding_test_ID_EX_JAL.s

testing forwarding_test_ID_EX_JR.s

testing forwarding_test_ID_EX_LUI.s

testing forwarding_test_ID_EX_LW.s

testing forwarding_test_ID_EX_ME_ALR.s

5/14/2003

6:00 pm

testing fowarding_test_ID_EX_ME_LUI.s

testing fowarding_test_ID_EX_SHIFT.s

testing fowarding_test_ID_EX_SW.s

testing fowarding_test_ID_ME_ALI.s

testing fowarding_test_ID_ME_ALR.s

testing fowarding_test_ID_ME_BRANCH3.s

testing fowarding_test_ID_ME_BRANCH4.s

testing fowarding_test_ID_ME_J.s

testing fowarding_test_ID_ME_JAL.s

testing fowarding_test_ID_ME_JAL_2.s

testing fowarding_test_ID_ME_JR.s

testing fowarding_test_ID_ME_LUI.s

testing fowarding_test_ID_ME_LW.s

1 am

=========

5/15/2003

***********************************//

///***********************************

Haywood Ho’s online notebook (~77 hours)

4/30 8:00pm

Met together with group. Decided finally on deep pipelining. Timed

ALU, muxes, current datapath, to get an idea of how many stages we would need.

Decided on splitting cache into 2 stages, etc. Was assigned to change

mem_cntrl and help with forwarding and stalling.

5/1 2:00am

5/3 4:00pm

Reading up on timing constraints. Trying to figure out how to use the

timing analyzer. Hard to find good examples on the net.

5/3 7:00pm

5/10 3:00pm Finished modification of the SDRAM controller for 50 MHz; preliminary

testing shows that it works. Registered the outputs; hopefully this will alleviate

the path from the memory controller to the SDRAM.

5/11 5:00am

5/11 4:00pm

Debugging, writing test cases, reading over the forwarding/stalling

logic to make sure there were no bugs. Found some bugs where the valid bits were not

checked in the forwarding and stalling logic. Had the idea to replace the caches with

block RAMs and registers, and am working on it. Ran into ModelSim scripting

problems.

5/12 2:40am

5/12 4:00pm

Found a clean way to get testing done without the RAMs working; will cvs

commit my changes later. Found a bug in which I was using the block RAMs to

simulate the caches but forgot to connect the stall signals to the registers.

5/13 2:00am

5/13 10:05am

More testing with the block RAMs; found a couple more bugs in

the stalling logic with weird test vectors. Some wires were reversed on the

schematic.

5/13 1:30pm

5/13 4:00pm

More testing again; found a few more bugs involving the PC when we

stall. Actually, we found out later it had more to do with the way I wrote my block

RAMs, as I did not take into account the stalling signals, which I should have fed

into the block RAMs to register the instruction that had been fetched halfway.

5/14 3:40am

5/14 4:00pm

Ran through more test cases; more bugs with the branching logic and branch prediction.

5/15 4:30am

5/15 4:00pm

Synthesis, but it doesn't work on the board now. Trying many different

options. Test cases give weird results; the PC sometimes flies off. One particularly nasty bug is:

beq (predict wrong)

break

break

beq (predict wrong)

addu

break

beq (predict wrong)

break

addu

These cases require special logic to fix. The worst case is beq, break, addu, as

when the beq gets to the EX1 stage, the break is in the FO stage, and the stalled

break is in the ID stage. We have to invalidate the next instructions after the

branch, but we must make sure not to invalidate the break in the ID stage. This

adds to our critical path, as our critical path is now predict_wrong from the branch

validator module with some logic and into the enable signal of the ID/FO register.
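The rule described above can be sketched as a behavioral model (Python, hypothetical names; the real logic is gating the ID/FO register enable in Verilog): on predict_wrong, clear the valid bits of the wrong-path stages, but spare the ID stage when the stalled break sits there.

```python
def flush_on_mispredict(valid, id_holds_stalled_break):
    """Model of the mispredict flush.  valid: dict stage -> valid bit
    for the stages behind the branch (IF0..FO).  Clear the wrong-path
    stages, but keep ID valid when it holds the stalled break, as the
    special-case logic described above requires."""
    out = dict(valid)
    for stage in ("IF0", "IF1", "IF2", "FO"):
        out[stage] = False
    if not id_holds_stalled_break:
        out["ID"] = False
    return out

v = {"IF0": True, "IF1": True, "IF2": True, "ID": True, "FO": True}
assert flush_on_mispredict(v, id_holds_stalled_break=True)["ID"]
assert not flush_on_mispredict(v, id_holds_stalled_break=False)["ID"]
```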

5/15 11:50pm

5/16 2:30pm

more synthesis, changing options to try to make it work on the board.

currently it works on a board in the 150 lab, using the guide constraint file only. We

have a version at extra high effort that has a critical path of 26ns, but this doesn't

seem to work on the board.

5/16 10:00pm

5/17 4:00pm

Checked the results of synthesis; again it doesn't work with machines

in the 150 or 152 lab. Totally lost; don't understand why.

5/17 6:00pm

***********************************//

Appendix II (schematics)

I did not paste any schematics here as I thought it would not really help (too small). Our schematics are located in the /sch directory of our zip file. The verilog files produced from these schematics can be viewed in the /sch_gen_verilog directory.

Appendix III (Verilog files)

The verilog files are located in the /verilog directory.

Appendix IV (testing files)

Test files are located in the /test directory. The .tf files are test fixtures for each module. Fake_Toplevel.v was used for simulation testing of the processor. The .s files are the directed test cases that we used to test the processor as a whole. test_ran_gen_instr.s is a randomly generated test. The .mem files are the output of mipsasm; the .mem.con files are the processor inputs after adding the additional 2 lines at the top for the level0boot code to work.

Sample transcripts are located in the transcripts.zip file. I have not included all transcripts, as they are quite large. Stats output is located in the stats.zip file. Register file dumps are located in the regfile.log.zip file; these were generated from an automated test script (batch_testing.do), which you can view in the scripts directory. There are a couple of directions in the file to follow if you wish to use this script to run our test cases. Symbols are in the /sym directories.

[Figure: the 10 pipeline stages, in order: IF0, IF1, IF2, ID, FO, EX1, EX2, ME1, ME2, WB.]

[Datapath figure: the IF stages hold the PC and nextPC muxes, the branch/JumpReg prediction, and the instruction cache; ID holds the decode control and the RegFile; FO holds the forwarding logic and forwarding muxes A and B; EX1/EX2 hold the ALU and the branch/JumpReg verification; ME1/ME2 hold the data cache; the Monitor and Statistics Collection modules observe the pipeline.]

[Branch predictor state diagram: four states, two labeled "Predict taken" and two labeled "Predict not taken", with transitions on "Taken"/"Not taken" branch outcomes (a 2-bit saturating-counter scheme).]
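The four-state scheme above (two predict-taken states, two predict-not-taken states, stepping on each Taken/Not-taken outcome) is the classic 2-bit saturating counter. A behavioral Python model (the class name and the reset state are assumptions, not taken from our Verilog):

```python
class TwoBitPredictor:
    """2-bit saturating-counter branch predictor (behavioral model).
    States 3 and 2 predict taken; states 1 and 0 predict not taken."""
    def __init__(self, state=2):
        self.state = state             # start weakly taken (assumption)

    def predict(self):
        return self.state >= 2         # the "Predict taken" half

    def update(self, taken):
        # Saturate: move one step toward the actual outcome.
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)

p = TwoBitPredictor()
assert p.predict()                     # weakly taken -> predict taken
p.update(False); assert not p.predict()    # 2 -> 1: now predict not taken
p.update(True);  assert p.predict()        # 1 -> 2: back to predict taken
```

The two states per direction are what give the predictor hysteresis: a single anomalous outcome (e.g. a loop exit) does not flip a strongly-biased prediction.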

[Testing flow figure, steps in order:]

▪ Individual writing of modules

▪ Individual writing of test benches for modules and testing.

▪ Simple test of memory controller on board

▪ Wire up data cache, instr cache and SDRAM together and run simple test benches to test communication between cache and SDRAM

▪ Using block RAMs as caches, run simple MIPS programs with and without lw/sw instructions (to test branch prediction, branch verification, forwarding and stalling logic). These were all targeted test vectors that tried to test with/without hazards.

▪ Run write buffer test benches to test communication between write buffer and SDRAM

▪ Run randomly generated test files with different instructions (many mispredicted branches, lw stalls, breaks, etc.).

▪ Wire up simulation processor, data cache, instr cache, level0boot and run all the test cases above again.

▪ Run vector test files with many lw/sw instructions to check communication between cache and processor.

▪ Switch to synthesis models and debug on the Xilinx board.

[Cache controller state diagram: states normal, miss, miss return, normal update 1-4, and miss update 1-4; transitions driven by the miss / !miss conditions and the SDRAM_read_done and SDRAM_wr_done signals.]
