Computer Architecture Project Report



ECE 552 Introduction to Computer Architecture

WISC-99S

Computer Architecture Design Project

Final Report

Team Members

KOI, Chao (koic@cae)

CHAN, Tung-Fai (tchan@cae)

April, 1999

TABLE OF CONTENTS

Introduction

Overview

Components Descriptions

Control Units Descriptions

Costs of the Design

Discussions

Comments

Schematic Printouts

VHDL Codes

Test Program Simulations

Commented Simulation Documents

INTRODUCTION

WISC-99S is a 16-bit RISC-oriented computer with a load/store architecture. The design and implementation of this architecture are based on Mentor Graphics tools. The architecture has eight general-purpose registers and six major types of instruction: computation register, computation immediate, load/store, branch, subroutine jump, and reserved-for-future. The detailed specifications are provided in the “Project Description” handout from Professor Saluja.

OVERVIEW

We have designed two versions of the architecture, multi-cycle and pipeline. We built the multi-cycle version first because it gave us a better understanding of the system architecture and a chance to improve performance without the help of pipelining. After we successfully simulated the multi-cycle version and optimized it, we moved on to the pipeline version. The optimizations made to both versions are discussed in the Discussions section.

Our multi-cycle design is based on the state diagram on Pg. 3. State 0 and State 1 are common to all instruction types and stand for Instruction Fetch and Instruction Decode respectively. State 2 is common to the load and store instructions. Each of the remaining states corresponds to one specific instruction. This design handles the NOP (No Operation) instruction and the Reserved-for-future exception, both of which are stated in the Project Description. A NOP simply goes back from State 1 to State 0, which means that the PC (Program Counter) is incremented and nothing else is done. The Reserved-for-future exception is treated as an invalid OPCODE (Operation Code): the address of the current instruction is stored in memory location FFFF and further execution is halted. The Control Unit is implemented in VHDL and then converted into a symbol.

The pipeline divides a typical instruction into five stages: IF (Instruction Fetch), ID (Instruction Decode), EX (Execution), MEM (Memory Access), and WB (Write Back). Each stage takes one clock cycle. The control signals of the Control Unit are provided on Pg. 4. The Control Unit is also implemented in VHDL, but instead of a finite state machine it is a single-cycle machine, because the signals are propagated to later stages through registers at the end of each stage. The advantage of pipelining is that each component (Control Unit, ALU, Memory, etc.) is used in parallel by different instructions while each instruction still appears to have its own datapath. However, this design has a higher cost because each component can be used only once per instruction. For example, there must be at least 2 ALUs or Adders (our design uses 3): one in the IF stage for the PC increment and another in the EX stage for normal execution. Furthermore, extra hardware and design effort are needed to tackle problems such as Data Hazards and Branch Hazards. More details on optimization and hazard handling are provided in the Discussions section.

COMPONENTS DESCRIPTIONS

Multi-cycle and Pipeline use the same set of components with minor modifications. Of course, Pipeline uses a larger amount of hardware. Here are the significant components used in the two versions:

|ALU Control Unit |Controls the operation that takes place in the ALU. It has 2 control input sources: the Control Unit and the current instruction. The Control Unit can select the ALU operation explicitly (ADD, SUB, or AND), or let the operation depend on the FUNC field of the instruction. |
|Adder [1] |A simplified version of the ALU. It does not include other operations such as AND and XOR, which greatly reduces the cost and increases efficiency. No control input is required and only ADD can be performed. |
|ALU [2] |Performs ADD, SUB, AND, XOR, and SHIFT operations according to the control inputs. It has two 16-bit data inputs, a 3-bit control input, and a 16-bit result output. |
|Branch Detection Unit [1] |Detects whether the input value is zero or negative and outputs two 1-bit values: for a negative number, NEG = 1; for zero (0x0000), ZERO = 1; otherwise 0. |
|EndianfromMem/EndiantoMem [3] |Interchange the big/little-endian bit orientation of a value, because all buses in our design are declared as (15:0) while the memory uses (0:15). |
|Instrreg |Extracts each field (rs, rt, rd, func, immd, opcode) from a 16-bit instruction fetched from memory. |
|LoadImmUnit |Performs the load immediate operation. It has 1 input from the Control Unit that determines which position (high or low 8 bits) is loaded. |
|MemoryBlock |Contains the main memory of the datapath, together with the EndianfromMem/EndiantoMem components that invert the bit orientation. It is an abstracted form of memory. |
|RegFile |Contains 8 x 16-bit registers. It can read 2 registers simultaneously, while writes are synchronous. It has an enable signal to control the read/write of the selected register. |
|SignExtendUnit |Performs sign extension of the immediate value from the instruction. An input signal from the Control Unit determines which immediate (6 or 8 bits) is processed. |
|ForwardUnit [1] |Detects data hazards in the pipeline design and forwards the appropriate value to the current stage. Implemented in VHDL. |
|HazardUnit [1] |Detects data hazards that cannot be resolved by forwarding and stalls the pipe by inserting a NOP instruction into the current stage (IF). Implemented in VHDL. |
|NOP Instruction [1] |Detects whether the current instruction needs a data memory access and, if so, inserts a NOP instruction to stall the pipe. This is necessary because there is only one memory, which can serve either an instruction fetch or a data access at a time. Implemented in VHDL. |
|PCSourceUnit [1] |Enables the PC and determines its source. It handles branch instructions and the insertion of NOP instructions in the IF stage. Implemented in VHDL. |

Besides these components, several other common hardware elements are used: 2-to-1 and 4-to-1 16-bit MUXes, and the different kinds of registers used in the pipeline design to save all values at the end of each stage. To save pages, we neither explain these common components nor include their printouts.
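As an illustration of two of the components above, the following Python sketch models the behaviour of the SignExtendUnit and the Branch Detection Unit. It is only a behavioural sketch, not the VHDL or schematic used in the project. The function names are hypothetical, and the sketch assumes that the immediate occupies the low 6 or 8 bits of the value passed in, with the selector encoded as 0 for the 6-bit and 1 for the 8-bit immediate.

```python
MASK16 = 0xFFFF

def sign_extend(immd, immd_src):
    """Sign-extend a 6-bit (immd_src=0) or 8-bit (immd_src=1) immediate to 16 bits."""
    width = 8 if immd_src else 6
    immd &= (1 << width) - 1               # keep only the immediate field
    if immd & (1 << (width - 1)):          # sign bit set: fill the upper bits with 1s
        immd |= MASK16 & ~((1 << width) - 1)
    return immd

def branch_detect(value):
    """Return (ZERO, NEG) for a 16-bit value, each as 0 or 1."""
    value &= MASK16
    zero = 1 if value == 0 else 0
    neg = (value >> 15) & 1                # bit 15 is the sign bit
    return zero, neg
```

For example, `sign_extend(0x80, 1)` gives `0xFF80`, and `branch_detect(0x8000)` gives `(0, 1)`.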


CONTROL UNITS DESCRIPTIONS

Control Units are built in VHDL for both the Multi-cycle and Pipeline datapaths. The Control Unit of the Multi-cycle design is a finite state machine. The state diagram is provided. By and large, all instructions except load/store (Memory Access) complete within 3 states. State 0 and State 1 correspond to Instruction Fetch and Instruction Decode respectively and are common to all instructions. Every final state loops back to State 0 to fetch a new instruction. Here is a description of each state:

|State Number |Tasks |
|0 |Fetch the instruction into the Instruction Register; increment the PC |
|1 |Calculate the Branch Address by adding PC + Immediate |
|2 |Select which type of immediate is used (6 or 8 bits) and sign-extend it; calculate the Memory Access address |
|3 |Select the ALU result instead of the PC as the memory address input; set the Memory to Read mode; select which register is written and its data source |
|4 |Select the ALU result instead of the PC as the memory address input; set the Memory to Write mode |
|5 |Select the RegFile as ALU sources A and B; set the ALU Control to follow the FUNC field of the instruction; select which register is written and its data source |
|6 |Select the RegFile as ALU source A and the Immediate as source B; select which type of immediate is used and sign-extend it; set the ALU Control to an explicit ADD |
|7 |Select the RegFile as ALU source A and the Immediate as source B; select which type of immediate is used and sign-extend it; set the ALU Control to an explicit AND |
|8 |Select which part of the register the immediate is loaded into (upper 8 bits); select which register is written and its data source |
|9 |Select which part of the register the immediate is loaded into (lower 8 bits); select which register is written and its data source |
|10 |Select the RegFile as ALU source A and the constant 0 as source B; set the ALU Control to an explicit ADD; select which signal from the ALU is used (NEG); select which source the PC reads from |
|11 |Select the RegFile as ALU source A and the constant 0 as source B; set the ALU Control to an explicit ADD; select which signal from the ALU is used (ZERO); select which source the PC reads from |
|12 |Select the RegFile as the PC source; enable the PC to read; select the PC as ALU source A and the constant 0 as source B; set the ALU Control to an explicit ADD; select which register is written and its data source |
|13 |Select the RegFile as the PC source; enable the PC to read |

Because the Register File and all other registers, such as PC and ALUOUT, are synchronous, all data are written on the positive edge of the next clock cycle.
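The state sequences described above can be summarised in a short Python sketch. The instruction-class labels (for example "compute_reg" or "jump_link") are hypothetical names chosen for illustration, and the mapping of classes to final states is read off the state table, so treat it as an assumption rather than as the project's VHDL.

```python
# States visited after the common Fetch (0) and Decode (1) states.
# The class names are illustrative labels, not identifiers from the project.
TAIL_STATES = {
    "nop":          [],        # goes back to State 0 directly from State 1
    "load":         [2, 3],
    "store":        [2, 4],
    "compute_reg":  [5],
    "add_immd":     [6],
    "and_immd":     [7],
    "load_immd_hi": [8],
    "load_immd_lo": [9],
    "branch_neg":   [10],
    "branch_zero":  [11],
    "jump_link":    [12],
    "jump_reg":     [13],
}

def state_sequence(instr_class):
    """Return the list of states one instruction passes through."""
    return [0, 1] + TAIL_STATES[instr_class]

def cycles(instr_class):
    """Each state takes one clock cycle in the multi-cycle design."""
    return len(state_sequence(instr_class))
```

With this table, `cycles("load")` and `cycles("store")` both give 4, while the computation, immediate, branch, and jump classes give 3.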

State Diagram for Multi-cycle Control Unit:

|State 5 |State 6 |State 7 |State 8 |
|ALUA = 1 |ALUA = 1 |ALUA = 1 | |
|ALUB = 0 |ALUB = 10 |ALUB = 10 |MemtoReg = 10 |
|ALUOp = 11 |ALUOp = 00 |ALUOp = 10 |RegDst = 01 |
|MemtoReg = 0 |ImmdSrc = 0 |ImmdSrc = 0 |RegWrite |
|RegDst = 0 |MemtoReg = 00 |MemtoReg = 00 |LdImmdPos = 0 |
|RegWrite = 1 |RegDst = 01 |RegDst = 01 | |
| |RegWrite = 1 |RegWrite = 1 | |

|State 0 |State 1 |State 2 |State 3 |
|PCWrite = 1 | | | |
|MemRW = 1 | | | |
|ALUA = 0 |ALUA = 1 |ALUA = 1 |IorD = 0 |
|ALUB = 01 |ALUB = 10 |ALUB = 10 |MemRW = 1 |
|IorD = 00 |ALUOp = 00 |ALUOp = 00 |MemtoReg = 1 |
|PCSource = 00 |ImmdSrc = 1 |ImmdSrc = 0 |RegDst = 10 |
|IRWrite = 1 | | |RegWrite = 1 |
|ALUOp = 00 | | | |

|State 9 |State 10 |State 11 |State 4 |
|LdImmdPos = 1 |ALUA = 1 |ALUA = 1 | |
|MemtoReg = 10 |ALUB = 11 |ALUB = 11 | |
|RegDst = 01 |ALUOp = 00 |ALUOp = 00 |PCSource = 10 |
|RegWrite = 1 |PCWriteCond = 1 |PCWriteCond = 1 |PCWrite = 1 |
| |BranchType = 0 |BranchType = 1 | |
| |PCSource = 01 |PCSource = 01 | |

|Invalid1 |Invalid2 |State 13 |State 12 |
|ALUA = 0 |MemData = 1 |PCSource = 10 |ALUA = 0 |
|ALUB = 01 |IorD = 10 |PCWrite = 1 |ALUB = 11 |
|ALUOp = 01 |MemRW = 0 | |MemtoReg = 0 |
| | | |RegDst = 10 |
| | | |PCSource = 10 |
| | | |PCWrite = 1 |
| | | |RegWrite = 1 |

End

Here is a description of each Control Signal (for both the Multi-cycle and Pipeline designs):

|Control Signal |Descriptions |
|ALUA |Selects ALU Source A: 0 – PC; 1 – Register File |
|ALUB |Selects ALU Source B: 00 – Register File; 01 – Constant 0x0001; 10 – Immediate value; 11 – Constant 0x0000 |
|ALUOp |Controls the operation of the ALU: 00 – explicit ADD; 01 – explicit SUB; 10 – explicit AND; 11 – according to FUNC in the instruction |
|PCWrite |Enables/disables the PC |
|MemRW |Read/Write mode of the memory: 1 – Read; 0 – Write |
|MemtoReg |Selects the data written into the Register File: 00 – value from ALU; 01 – value from Memory; 10 – value from Load Immediate unit |
|RegDst |Determines which register is written |
|RegWrite |Enables/disables writes to the Register File |
|ImmdSrc |Determines which immediate is sign-extended: 0 – 6-bit Immediate; 1 – 8-bit Immediate |
|LdImmdPos |Determines which position (high/low) of the register is loaded: 0 – upper 8 bits; 1 – lower 8 bits |
|IorD |Selects the source of the memory address: 00 – PC; 01 – ALU; 10 – Constant 0xFFFF |
|PCSource |Selects the source of the PC: 00 – PC + 1; 01 – Branch Address; 10 – Register File; 11 – Constant 0x0000 |
|PCWriteCond |Indicates that the PC may be written; whether it is actually written depends on the result of the value comparison |
|BranchType |Determines which signal from the ALU is read: 0 – Neg (for BLT instruction); 1 – Zero (for BEQ instruction) |
|MemData |Selects the source of the data written to memory: 0 – Register Value; 1 – ALU |
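One way to read PCWrite, PCWriteCond, BranchType, and PCSource together is sketched below in Python. The combining logic is an assumption inferred from the signal descriptions (an unconditional write, or a conditional write gated by the selected ALU flag); the function names are hypothetical and the sketch is not taken from the project's VHDL.

```python
def pc_enable(pc_write, pc_write_cond, branch_type, zero, neg):
    """Return 1 when the PC should be loaded on the next clock edge."""
    flag = zero if branch_type else neg      # 1 selects Zero (BEQ), 0 selects Neg (BLT)
    return 1 if (pc_write or (pc_write_cond and flag)) else 0

def next_pc(pc_source, pc, branch_addr, reg_value):
    """Select the next PC value according to the 2-bit PCSource encoding."""
    sources = {
        0b00: (pc + 1) & 0xFFFF,             # PC + 1
        0b01: branch_addr & 0xFFFF,          # Branch Address
        0b10: reg_value & 0xFFFF,            # Register File
        0b11: 0x0000,                        # Constant 0x0000
    }
    return sources[pc_source]
```

For a BEQ whose compared value is zero, `pc_enable(0, 1, 1, zero=1, neg=0)` returns 1 and `next_pc(0b01, ...)` selects the branch address.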

The Control Unit for the Pipeline datapath, like that of the previous version, is implemented in VHDL and uses essentially the same set of Control Signals with the same descriptions. However, it is not designed as a finite state machine. Instead, it is a single-clock-cycle machine in which all signals are generated as soon as the OPCODE of a new instruction is received. The generated signals are then propagated along the datapath through the control registers ID_EX_Control, EX_MEM_Control, and MEM_WB_Control. Due to the page limit, we exclude the printouts of these registers.

There is no state diagram for the Control Unit of the Pipeline version. The output signals for each OPCODE are simply the combination of the signals of all states that the OPCODE requires (except State 0 and State 1). For example, for a Computational Register type instruction, the output signals are simply those of State 5. For a load instruction, the output is the combination of the signals in State 2 and State 3.
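The propagation of control signals through the control registers can be pictured with the following Python sketch. It is a simplified model under stated assumptions: each control word is treated as an opaque value that moves one register per clock, whereas in the real design each later register holds fewer signals because the signals already used are dropped. The function name `run` is hypothetical.

```python
from collections import deque

# Names of the three control pipeline registers, as listed in the report.
CONTROL_REGS = ["ID_EX_Control", "EX_MEM_Control", "MEM_WB_Control"]

def run(control_words):
    """Shift one control word per clock through the control registers.

    Returns one snapshot (register name -> control word) per clock edge.
    """
    # A bounded deque drops its oldest (rightmost) entry on appendleft.
    regs = deque([None] * len(CONTROL_REGS), maxlen=len(CONTROL_REGS))
    snapshots = []
    for word in control_words:
        regs.appendleft(word)
        snapshots.append(dict(zip(CONTROL_REGS, regs)))
    return snapshots
```

Feeding the words for three consecutive instructions shows the first word reaching MEM_WB_Control on the third clock edge, which is when its write-back signals would be needed.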

COSTS OF DESIGN

Below is the cost breakdown for Multi-cycle design:

|Block |And |Or |Other |Buffer |Register |RAM |MUX |Decoder |Cost |
|MemBlock | | | |1 | |1 | | |0.48 |
|InstrReg | | | | |16 | | | |96 |
|RegFile |8*and2 | | | |128 | |32*mux81 |Dec38 |1557 |
|SignExtendUnit | | | | | | |16*mux21 | |16 |
|ALU |39*and2, 18*and3, 14*and4 |9*or2, 5*or3, 10*or4 |96*xor2, 7*inv, 1*nor16 | | | |18*mux21, 17*mux41 | |796 |
|LoadImmdUnit | | | | | | |16*mux21 | |32 |
|ALUControl |1*and2 | | | | | |3*mux21 | |7 |
|Misc. |1*and2 |1*nor2 | | |64 | |49*mux21, 83*mux41 | |1148 |
|Total | | | | | | | | |3652.48 |
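As a quick check of the Multi-cycle total, the per-block costs from the table above can be summed in a few lines of Python. The dictionary name is hypothetical; the numbers are copied from the Cost column.

```python
# Per-block costs copied from the Multi-cycle cost table.
MULTI_CYCLE_COSTS = {
    "MemBlock": 0.48,
    "InstrReg": 96,
    "RegFile": 1557,
    "SignExtendUnit": 16,
    "ALU": 796,
    "LoadImmdUnit": 32,
    "ALUControl": 7,
    "Misc.": 1148,
}

# Round to two decimals to hide floating-point noise from the 0.48 term.
total = round(sum(MULTI_CYCLE_COSTS.values()), 2)
print(total)  # 3652.48
```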

Below is the cost breakdown for Pipeline design:

|Block |And |Or |Other |Buffer |Register |RAM |MUX |Decoder |Cost |
|MemBlock | | | |1 | |1 | | |0.48 |
|InstrReg | | | | |16 | | | |96 |
|RegFile |8*and2 | | | |128 | |32*mux81 |Dec38 |1557 |
|SignExtendUnit | | | | | | |16*mux21 | |16 |
|ALU |39*and2, 16*and3, 14*and4 |8*or2, 5*or3, 10*or4 |96*xor2, 7*inv | | | |18*mux21, 17*mux41 | |612.28 |
|LoadImmdUnit | | | | | | |16*mux21 | |32 |
|ALUControl |1*and2 | | | | | |3*mux21 | |7 |
|2*Adder |72*and2, 30*and3, 30*and4 |10*or2, 10*or3, 10*or4 |160*xor2 | | | | | |1025.4 |
|MEM_WB_Control | | | | |5 | | | |30 |
|IF_ID_Reg | | | | |32 | | | |192 |
|MEM_WB_Reg | | | | |57 | | | |342 |
|ID_EX_Control | | | | |15 | | | |90 |
|ID_EX_Reg | | | | |84 | | | |504 |
|EX_MEM_Control | | | | |9 | | | |54 |
|EX_MEM_Reg | | | | |57 | | | |342 |
|Misc. |1*and2 |1*nor2 | | | | |6*mux21, 10*mux41 | |92 |
|Total | | | | | | | | |4997.16 |

DISCUSSIONS

Optimizations have been made to both versions to improve performance and reliability. In this section, we discuss the optimizations in both the Multi-cycle and Pipeline datapaths and the hazard handling in the latter. We also discuss the pros and cons of each design.

Multi-cycle:

▪ Optimization

▪ In the Multi-cycle design, all instructions are executed serially, so the best ways to improve performance are to reduce the number of clock cycles each instruction takes and to shorten the clock period. In our design, the clock period is determined by the longest path among all states. We found that the bottleneck is the ALU execution, which we cannot really improve. Therefore, we focused on reducing the number of clock cycles for each instruction type.

▪ As mentioned in the text, the first two states, State 0 and State 1, are common to all instructions and cannot be eliminated. We targeted the states after State 1 and succeeded in making most instructions complete in 3 states (including State 0 and State 1). The 2 exceptions are the load and store instructions, which take 4 states. The technique that removes 1 state from the computational register, computational immediate, and load instructions relies on the synchronous register file. Because data are only written on the next positive clock edge, we supply the value to the register file directly from the ALU or the Memory Block instead of first storing it in ALUOut or the Memory Data Register, which would cost an extra clock cycle.

▪ Through this optimization, we save 1 clock cycle (60ns in our design) for each computational register or computational immediate instruction, and these are the most common instructions. In the test program, for example, 55 instructions (more than 90%) belong to these 2 categories, so the optimization saves 55 x 60 = 3300ns in total.
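The saving quoted above is a simple product, shown here as a small Python helper. The function and parameter names are hypothetical; the defaults are the figures given in the text (1 cycle saved per instruction, 60ns clock period).

```python
def time_saved_ns(instr_count, cycles_saved_per_instr=1, clock_period_ns=60):
    """Total time saved when each instruction needs fewer clock cycles."""
    return instr_count * cycles_saved_per_instr * clock_period_ns

print(time_saved_ns(55))  # 3300
```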

Pipeline:

▪ Optimization and hazard handling:

▪ Using a Pipeline design is in itself already a great improvement in efficiency over the Multi-cycle design, as most instructions can be executed “almost” in parallel. Yet we have made several modifications to further improve its reliability and efficiency.

▪ First of all, to handle data hazards, as described in the text, we implemented a forwarding unit which forwards data from different stages. However, our design differs from the one described in the text. In the text, the MUX is placed at the beginning of the EX stage and selects the appropriate value from the registers of the different stages (EX/MEM, MEM/WB). Our design, instead, puts the MUX at the end of the ID stage. There are several reasons for doing so:

1) As mentioned, the clock period is determined by the worst-case path among all stages, and in our design EX is the bottleneck, so the clock period has to be based on the longest time taken in the EX stage. Putting the MUX in the EX stage would definitely lengthen the clock period. A 4-to-1 MUX has a 4ns delay, and if the clock period increases by 8ns, each instruction takes 8 x 5 = 40ns longer to complete. That is a huge trade-off.

2) Putting the MUX in the ID stage lets us simulate the “Write before Read” behaviour of the register file. In our design the register file is synchronous and data are not written until the next clock edge. Data therefore cannot be read until the next clock, and “Write before Read” cannot be accomplished within the same clock. However, with the MUX at the end of the ID stage, we can forward the value directly from the wire carrying the value that is going to be written at the next clock edge. In other words, instead of obtaining the value after the register file has stored it, we obtain the value at the same time as the register file. With the forwarding MUX in the EX stage, the only way to accomplish this would be to add a new MUX in the ID stage, which increases the cost in addition to the EX-stage delay mentioned above.

3) The third reason is related to the branch hazard. We try to determine whether the branch is taken in the ID stage, because then a branch stalls the pipe for only 1 clock cycle. However, if we followed the text and compared the values from the register file directly, a data hazard would occur. Therefore, instead of comparing the values from the register file, we have to compare the forwarded values from the MUX. If the forwarding MUX were in the EX stage, we could not determine the branch until the EX stage and one more clock cycle would have to be stalled. In our pipeline version each clock cycle is 64ns, so determining the branch in the EX stage would be rather costly.

▪ For these reasons, putting the MUX in the ID stage increases both the efficiency and the reliability of the datapath.
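The selection made by a forwarding MUX at the end of the ID stage can be sketched in Python as follows. This is a hedged illustration: the function name, the tuple layout, and the priority order (newest result first) are assumptions made for the example and are not taken from the project's ForwardUnit VHDL. Cases that forwarding cannot resolve, which the HazardUnit handles by stalling, are not modelled.

```python
def forward_operand(src_reg, regfile_value, in_flight):
    """Pick the value an instruction in ID should see for register src_reg.

    in_flight lists the results of older instructions still in the pipe,
    newest first, as (dest_reg, reg_write, value) tuples. The last entry
    plays the role of the wire feeding the register file write port, which
    gives the "Write before Read" behaviour without a second MUX.
    """
    for dest_reg, reg_write, value in in_flight:
        if reg_write and dest_reg == src_reg:
            return value          # forward the newest matching result
    return regfile_value          # no hazard: use the register file output
```

Because the forwarded value is already selected in ID, the same value can feed both the EX-stage operands and the branch comparison.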

Comparison:

▪ The advantage of the pipeline is, of course, its performance in the long run. The Multi-cycle datapath takes #ns, with a 60ns clock cycle, to complete the test program, while the pipeline takes only #ns with a 64ns clock cycle. This is an improvement of almost 50%.

▪ The disadvantage is cost. Additional hardware is spent on the forwarding unit, the hazard detection units, and MUXes. From the tables above, Pipeline is over 1000 units more expensive than Multi-cycle. Secondly, because of the additional hardware, the clock period is longer. Also, in the pipeline design every instruction, whatever its type, has to go through all 5 stages, so the advantage shrinks for short programs. For example, 3 R-type instructions take 3 x 3 x 60 = 540ns on the multi-cycle datapath, which runs them serially, and 7 x 64 = 448ns on the pipeline; a single R-type instruction takes 3 x 60 = 180ns on multi-cycle but 5 x 64 = 320ns on the pipeline.

▪ Overall, the Pipeline design is worth more than the Multi-cycle design when both performance and cost are considered.
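The short-run comparison can be reproduced with the small Python sketch below. It assumes an ideal pipeline with no stalls and 3 states per instruction on the multi-cycle datapath (the R-type case); the function names are hypothetical, and the clock periods are the 60ns and 64ns quoted in the text.

```python
def multi_cycle_time_ns(n_instr, states_per_instr=3, period_ns=60):
    """Instructions run serially, each taking states_per_instr clock cycles."""
    return n_instr * states_per_instr * period_ns

def pipeline_time_ns(n_instr, stages=5, period_ns=64):
    """Ideal pipeline: fill the pipe once, then finish one instruction per cycle."""
    return (stages + n_instr - 1) * period_ns

for n in (1, 2, 3):
    print(n, multi_cycle_time_ns(n), pipeline_time_ns(n))
# 1 180 320
# 2 360 384
# 3 540 448
```

Under these assumptions the pipeline overtakes the multi-cycle datapath from the third instruction onward.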

Performance of Multi-cycle = 1 / # =

Performance of Pipeline = 1 / # =

Cost of Multi-cycle = 3652.48

Cost of Pipeline = 4997.16

Performance per unit cost for Multi-cycle =

Performance per unit cost for Pipeline =

COMMENTS

There are in fact several ways to further improve the design. The first is to use caches. Caches are, in effect, a group of registers whose delay is 3ns (from the specifications), which greatly reduces the time taken by a memory access. However, we did not implement this feature because it would not help our design. In our design a memory access takes only 20ns (even faster than the ALU!) and completes within a single clock cycle, so the bottleneck is the ALU rather than the memory. Improving the efficiency of memory access would therefore not improve the overall performance. In reality, however, a memory access takes much longer than an ALU operation, so the bottleneck is the memory and caches are very helpful for improving performance.

-----------------------

[1] Only used in Pipeline Design

[2] The ALU used in Multi-cycle also performs the ZERO/NEG/OVERFLOW tests, but the ALU in the Pipeline version does not, because these tests are done by the Branch Detection Unit in the ID stage. This reduces cost.

[3] No printout is included.
