COMPUTER ARCHITECTURE - Weebly



UNIT - I

COMPUTER ARCHITECTURE

• Organization and co-ordination of different functional units and their structure.

• Encompasses the specification of an instruction set and the hardware units that implement the instructions.

Encoding : Numbers are represented as binary bits.

• ASCII : American Standard Code for Information Interchange. Each character is represented by a 7-bit code.

• EBCDIC : Extended Binary Coded Decimal Interchange Code. Each character is represented by an 8-bit code.
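As a quick illustration, the sketch below prints the 7-bit ASCII codes of a few characters (values come from the standard ASCII table):

```python
# Print the 7-bit ASCII code of a few characters, in decimal and binary.
for ch in "A", "a", "0":
    code = ord(ch)                        # ASCII code point of the character
    print(ch, code, format(code, "07b"))  # e.g. A 65 1000001
```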

FUNCTIONAL UNITS :

- Input,

- Output,

- ALU,

- Control Unit,

- Memory Unit.

[pic]

Digital computer systems consist of three distinct units: the input unit, the central processing unit, and the output unit. These units are interconnected by electrical cables to permit communication between them. This allows the computer to function as a system.

INPUT UNIT:

A computer must receive both data and program statements to function properly and be able to solve problems. The method of feeding data and programs to a computer is accomplished by an input device. Computer input devices read data from a source, such as magnetic disks, and translate that data into electronic impulses for transfer into the CPU. Some typical input devices are a keyboard, a mouse, or a scanner.

CENTRAL PROCESSING UNIT

The brain of a computer system is the central processing unit (CPU). The CPU processes data transferred to it from one of the various input devices. It then transfers either an intermediate or final result from the CPU to one or more output devices. A central control section and work areas are required to perform calculations or manipulate data. The CPU is the computing center of the system. It consists of a control section, an arithmetic-logic section, and an internal storage section (main memory). Each section within the CPU serves a specific function and has a particular relationship with the other sections within the CPU.

CONTROL UNIT

The control section directs the flow of traffic (operations) and data. It also maintains order within the computer. The flow of control is indicated by dotted arrows in figure. The control section selects one program statement at a time from the program storage area, interprets the statement, and sends the appropriate electronic impulses to the arithmetic-logic and storage sections so they can carry out the instructions. The control section does not perform actual processing operations on the data. The control section instructs the input device on when to start and stop transferring data to the input storage area. It also tells the output device when to start and stop receiving data from the output storage area.

ARITHMETIC LOGIC UNIT

The arithmetic-logic section performs arithmetic operations, such as addition, subtraction, multiplication, and division. Through internal logic capability, it tests various conditions encountered during processing and takes action based on the result. As indicated by the solid arrows in figure, data flows between the arithmetic-logic section and the internal storage section during processing. Specifically, data is transferred as needed from the storage section to the arithmetic-logic section, processed, and returned to internal storage. At no time does processing take place in the storage section. Data may be transferred back and forth between these two sections several times before processing is completed. The results are then transferred from internal storage to an output device, as indicated by the solid arrow in figure.

MEMORY UNIT

The internal storage section is sometimes called primary storage, main storage, or main memory, because this section functions similarly to our own human memory. The storage section serves four purposes; three relate to retention (holding) of data during processing. First, as indicated by the solid arrow (fig. 3-1), data is transferred from an input device to the INPUT STORAGE AREA, where it remains until the computer is ready to process it. Second, a WORKING STORAGE AREA ("scratch pad" memory) within the storage section holds both the data being processed and the intermediate results of the arithmetic-logic operations. Third, the storage section retains the processing results in the OUTPUT STORAGE AREA. From there the processing results can be transferred to an output device. The fourth storage section, the PROGRAM STORAGE AREA, contains the program statements transferred from an input device to process the data. Please note that the four areas (input, working storage, output, and program storage) are not fixed in size or location but are determined by individual program requirements.

OUTPUT UNIT

As program statements and data are received by the CPU from an input device, the results of the processed data are sent from the CPU to an OUTPUT DEVICE. These results are transferred from the output storage area onto an output medium, such as a floppy disk, hard drive, video display, printer, and so on. By now, you should have an idea of the functions performed by a CPU. It is the CPU that executes stored programs   and   does   all   of   the   processing   and manipulating of data. The input and output (I/O) devices simply aid the computer by sending and receiving data and programs.

RAM – Random Access Memory – Memory in which any location can be reached in a short fixed amount of time.

Memory Access Time: Time required to access one word is called memory access time.

BASIC OPERATIONAL CONCEPTS

• Any activity in a computer is governed by instructions. To perform a given task, an appropriate program consisting of a list of instructions is stored in the memory.

• Individual instructions are brought from the memory into the processor, which executes the specified operations.

• The operands (data) are also stored in the memory.

The instruction

Add LOCA, R0

adds the contents of memory location LOCA to the contents of R0 and stores the result in R0, i.e. R0 ← [LOCA] + [R0]

The sequence of steps needed to execute the above instruction is:

1. The instruction is fetched from the memory into the processor.

2. The operand at LOCA is fetched.

3. The contents of LOCA are added to R0.

4. The sum is stored in R0.

Data transfer between memory and processor is facilitated with two registers namely MAR (Memory Address Register) and MDR (Memory Data Register).

- MAR holds the address of the location to be accessed.

- MDR contains the data to be written into or read out of the addressed location.

In addition to MAR and MDR, there are also IR, PC and n general-purpose registers

IR - Instruction Register – Holds the instruction that is currently being executed.

PC - Program Counter – Contains the memory address of the next instruction to be fetched. During the execution of an instruction, the contents of the PC are automatically incremented.

Operating Steps

[pic]

• Programs are fed through the input unit and stored in the memory.

• PC is set to point to the first instruction of the program.

• The contents of the PC are transferred to the MAR and a Read control signal is sent to the memory.

• The addressed word is read out of the memory and loaded into the MDR.

• The contents of MDR are transferred to the IR.

• The operands are fetched by sending their address to MAR and initiating a Read cycle.

• When the operand has been read from the memory into the MDR, it is transferred from MDR to ALU.

• After fetching all the operands, ALU performs the desired operation.

• To write the result into memory, the result is sent to the MDR. The address of the location where the result is to be stored is sent to the MAR, and a Write cycle is initiated.

• During execution of the current instruction, the contents of PC are automatically incremented.

• As soon as the execution of the current instruction is completed, a new instruction can be fetched.
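The operating steps above can be sketched as a toy fetch/execute loop. The machine below is hypothetical (a single "Add address" instruction, 4-byte instruction words, one register R0), chosen only to mirror the MAR/MDR/PC/IR roles described:

```python
# Minimal sketch of the fetch/execute cycle, assuming a hypothetical machine
# whose only instruction is ("Add", addr): R0 <- R0 + [addr].
memory = {0: ("Add", 100), 4: ("Add", 104), 100: 5, 104: 7}  # program + data
PC, R0 = 0, 0
while PC in memory and isinstance(memory[PC], tuple):
    MAR = PC              # address of the next instruction goes to MAR
    MDR = memory[MAR]     # memory read loads the addressed word into MDR
    IR = MDR              # the instruction moves from MDR to IR
    PC += 4               # PC is incremented during execution
    op, addr = IR
    MAR = addr            # operand address -> MAR, and a Read cycle starts
    MDR = memory[MAR]     # the operand arrives in MDR
    R0 += MDR             # the ALU performs the addition
print(R0)                 # 5 + 7 = 12
```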

Handling Interrupts:

- An interrupt is a request from an I/O device for service by the processor.

- The processor provides the requested service by executing an appropriate Interrupt service Routine (ISR).

- Before giving control to the ISR, the processor must save its state (contents of the PC, general registers, and some control information).

- After completing ISR, state of the processor is restored so that the interrupted program may continue.

Bus : A group of lines that serves as a connecting path for several devices.

Compiler : Translates the high-level language program into a machine language program.

Operating System: Collection of routines that is used to control the sharing of and interaction among various computer units.

Multiprogramming: Concurrent execution of several application programs by the operating system to make the best possible use of computer resources.

PERFORMANCE

Elapsed time : Total time required to execute the program

- Measure of the performance of the entire computer system

- Affected by the speed of the processor, disk and the printer.

Processor time : The period during which the processor remains active.

- Depends on the hardware involved in the execution of individual machine instructions.

Speed is an important measure of performance. It is affected by,

- Design of the hardware

- Machine language instructions

- Compiler design

Cache Memory

[pic]

At the start of execution, all program instructions and the required data are stored in the main memory. As execution proceeds, instructions and data are fetched one by one over the bus into the processor, and a copy is placed in the cache. When an instruction or data item is fetched for the second time, it is taken from the cache. The access time for data in the cache is much less than that for data in the main memory. Thus the usage of cache memory significantly improves the performance of a system.

Processor Clock

- Processor circuits are controlled by a clock signal.

- The clock defines regular time intervals called clock cycles.

- To execute a machine instruction, it is divided into a sequence of steps such that each step can be completed within one clock cycle.

- The length of one clock cycle is P.

- The clock rate R = 1/P cycles per second.

- Cycles per second are called 'Hertz' (Hz).

Today’s processors have clock rates (R) that range from a few hundred million to over a billion cycles per second.

BASIC PERFORMANCE EQUATION

The basic performance equation is given by

T = (N x S) / R

Where

‘T’ is the processor time required to execute a program written in a high level language

‘N’ is the actual number of instruction executions (not necessarily equal to the number of instructions in the object program)

‘R’ is the clock rate in cycles / second

‘S’ is the average number of basic steps needed to execute one machine instruction, where each basic step is completed in one clock cycle.

To achieve high performance, the value of T should be reduced. N, S and R are not independent parameters: changing one may affect another. These factors are adjusted together to achieve higher performance.
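A worked example of the basic performance equation, with assumed (illustrative) values for N, S and R:

```python
# Basic performance equation T = (N x S) / R, with assumed example values.
N = 2_000_000        # instruction executions
S = 4                # average basic steps (clock cycles) per instruction
R = 500_000_000      # clock rate: 500 MHz
T = (N * S) / R      # processor time in seconds
print(T)             # 0.016
```

Halving S (better pipelining) or doubling R (faster clock) would halve T, which is why the text treats these parameters together.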

Pipelining and Superscalar Operation

Instead of executing the instructions one after the other, they can be overlapped. This can improve the performance of the processors

Pipelining:

The technique of overlapping the execution of successive instructions. Pipelining increases the rate of executing instructions significantly and causes the effective value of S to approach one. Using multiple functional units, multiple instruction pipelines can be created and a higher degree of concurrency can be achieved.

With multiple functional units, it is possible to start the execution of several instructions in every clock cycle. This mode of operation is called ‘Superscalar execution’. Parallel execution of instruction should preserve the logical correctness of a program.

Clock Rate

There are two ways to increase the clock rate:

1. Improving the IC technology to make the logic circuits faster.

2. Reducing the amount of processing done in one step. Thus P decreases and R increases.

INSTRUCTION SET: CISC AND RISC

A simple instruction requires a small number of basic steps to execute. A given programming task may require a large number of such instructions.

Complex instructions combined with pipelining would achieve better performance.

It is easier to implement efficient pipelining in processors with simple instruction sets.

• The processors with simple instructions are called Reduced Instruction Set Computer (RISC) and

• The processors with complex instructions are called Complex Instruction Set Computers (CISC).

Compiler

To reduce N, we should have a suitable machine instruction set and a compiler that makes good use of it. An optimizing compiler takes advantage of various features of the target processor to reduce the product N x S. A compiler can improve performance by re-arranging the program instructions.

Performance Measurement

The performance of a computer is measured using benchmark programs. The performance measure is the time it takes a computer to execute a given benchmark. A set of real application programs is selected to evaluate performance. These programs range from game playing, compilers, and database applications to numerically intensive programs in quantum chemistry.

In each case, the program is compiled for the computer under test and the running time on a real computer is measured. The same program is also compiled and run on one computer selected as reference. The SPEC (System Performance Evaluation Corporation) rating is given by

Running time on the reference computer

SPEC rating = --------------------------------------------------------

Running time on the computer under test

There would be a suite of benchmark programs.

The overall SPEC rating for the computer is given by

SPEC rating = (SPEC1 x SPEC2 x … x SPECn)^(1/n)

i.e. the geometric mean of the n individual SPECi ratings.
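A sketch of the overall rating as a geometric mean; the per-program running times below are assumed example values, not real SPEC results:

```python
# Overall SPEC rating = geometric mean of the per-program ratings.
# Running times are assumed example values (seconds).
ref_times  = [500.0, 120.0, 900.0]   # reference computer
test_times = [250.0, 100.0, 300.0]   # computer under test
ratings = [r / t for r, t in zip(ref_times, test_times)]  # 2.0, 1.2, 3.0
overall = 1.0
for s in ratings:
    overall *= s                      # product of the SPECi
overall **= 1.0 / len(ratings)        # (product)^(1/n)
print(round(overall, 3))
```

A rating above 1 means the computer under test ran the suite faster than the reference machine.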

Memory Locations and Addresses

Memory Organization

Memory is organized as millions of storage cells. Each cell can store one bit of information. Cells are handled / processed together as bytes / words.

Memory is organized so that a group of n bits can be stored or retrieved in a single, basic operation. Each group of n bits is referred to as ‘word’ of information and n is called the word length.

To access a word / byte, we need to specify the 'address' of the location where it is actually stored.

Address Space : The collection of addresses of successive locations in the memory.

- A k-bit address generates 2^k addressable locations [0 … 2^k - 1]

- eg: a 24-bit address generates 2^24 locations.

Byte Addressable Memory (Fig. 2.5, 2.6 P.No.34)

Allocating an address to each individual bit is impractical. Hence successive addresses are assigned to successive bytes; such an assignment is called byte-addressable memory.

Big Endian and Little Endian

• Big-endian is used when lower byte addresses are used for the more significant bytes (the leftmost bytes) of the word.

• Little-endian is used for the opposite ordering, where the lower byte addresses are used for the less significant bytes (the rightmost bytes) of the word.
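The two byte orderings can be seen with Python's struct module; the 32-bit word value 0x12345678 is just an example:

```python
import struct

# The 32-bit word 0x12345678 laid out in memory under each byte ordering.
word = 0x12345678
big    = struct.pack(">I", word)   # big-endian: most significant byte first
little = struct.pack("<I", word)   # little-endian: least significant byte first
print(big.hex())     # 12345678
print(little.hex())  # 78563412
```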

Word Alignment

Words are aligned in memory if they begin at a byte address that is a multiple of the number of bytes in a word.

The number of bytes in a word is in powers of 2.

2 bytes - words at addresses 0, 2, 4, …

4 bytes - words at addresses 0, 4, 8, …

8 bytes - words at addresses 0, 8, 16, …

Unaligned addresses are also possible.
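A minimal check following the definition above (an address is aligned if it is a multiple of the word size in bytes):

```python
# A word address is aligned if it is a multiple of the word size in bytes.
def is_aligned(addr, word_size):
    return addr % word_size == 0

print(is_aligned(4, 4))   # True
print(is_aligned(6, 4))   # False
print(is_aligned(16, 8))  # True
```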

Accessing

Numbers : Accessed by their word addresses

Characters : Accessed by their byte addresses

Strings : Accessed by the byte address of the first character

Memory Operations

1. Load (Read / Fetch)

2. Store

|Load                                            |Store                                           |

|Transfers a copy from memory to processor       |Transfers an item from processor to memory      |

|Original data remains unchanged                 |Original contents of the location are lost      |

|Processor sends the address of the desired      |Processor sends the address of the desired      |
|location to memory. Memory reads the data       |location and the data to be written into        |
|and sends them to the processor.                |that location.                                  |

In a single operation, either a word or a byte can be transferred between the processor and the memory. Registers are the storage areas within the processor.

INSTRUCTIONS AND INSTRUCTION SEQUENCING

4 types of operations

• Data transfers between the memory and the processor registers

• Arithmetic and Logic operations

• Program sequencing and control

• I/O transfer

Register Transfer Notation

Let the names for memory locations be LOC, A, VAR2,

register names be R0, R5,

and I/O register names be DATAIN, OUTSTATUS, etc.

The contents of a location are denoted by placing square brackets around its name.

R1 ← [LOC] ------- (1)

i.e. the contents of the memory location LOC are transferred into processor register R1.

R3 ← [R1] + [R2] ------- (2)

The contents of register R1 are added to the contents of register R2 and the result is placed in register R3.

Assembly Language Notation

Move LOC, R1 (similar to (1))

The contents of the memory location LOC are transferred to register R1.

(2) is represented as Add R1, R2, R3

Basic Instruction Types

(1) Three Address instruction

Syntax :

Operation Source 1, Source 2, Destination

eg:

Add A, B, C

If k bits are needed to specify the memory address of each operand, the above instruction must contain k + k + k = 3k bits for addressing purposes.

For a processor with 32 bit address space, a 3 address instruction is too large to fit in one word.

(2) Two Address instructions

Syntax :

Operation Source, Destination

Add A, B

Add A, B, C is equivalent to Add A, B:

add the contents of A and B and place the result in B.

(3) One – Address instruction

These instructions specify only one memory operand. The second operand is always in a unique location: a processor register called the accumulator.

eg : Add A

Add the contents of the memory location A to the contents of the accumulator and place the result in the accumulator.

(4) Zero – address instructions

These are instructions in which the locations of all operands are defined implicitly.

This is possible in machines that store operands in a pushdown stack.

Example

• Three-Address Instructions

Program to evaluate X = (A + B) * (C + D) :

ADD R1, A, B /* R1 ← M[A] + M[B] */

ADD R2, C, D /* R2 ← M[C] + M[D] */

MUL X, R1, R2 /* M[X] ← R1 * R2 */

- Results in short programs

- Instruction becomes long (many bits)

• Two-Address Instructions

Program to evaluate X = (A + B) * (C + D) :

MOV R1, A /* R1 ← M[A] */

ADD R1, B /* R1 ← R1 + M[B] */

MOV R2, C /* R2 ← M[C] */

ADD R2, D /* R2 ← R2 + M[D] */

MUL R1, R2 /* R1 ← R1 * R2 */

MOV X, R1 /* M[X] ← R1 */

• One-Address Instructions

Use an implied AC register for all data manipulation

Program to evaluate X = (A + B) * (C + D) :

LOAD A /* AC ← M[A] */

ADD B /* AC ← AC + M[B] */

STORE T /* M[T] ← AC */

LOAD C /* AC ← M[C] */

ADD D /* AC ← AC + M[D] */

MUL T /* AC ← AC * M[T] */

STORE X /* M[X] ← AC */

• Zero-Address Instructions

Can be found in a stack-organized computer

Program to evaluate X = (A + B) * (C + D) :

PUSH A /* TOS ← A */

PUSH B /* TOS ← B */

ADD /* TOS ← (A + B) */

PUSH C /* TOS ← C */

PUSH D /* TOS ← D */

ADD /* TOS ← (C + D) */

MUL /* TOS ← (C + D) * (A + B) */

POP X /* M[X] ← TOS */
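The zero-address program above can be traced with a small stack sketch; the operand values A=2, B=3, C=4, D=5 are assumed for illustration, giving X = (2+3)*(4+5) = 45:

```python
# Trace of the zero-address (stack) program, with assumed example values.
M = {"A": 2, "B": 3, "C": 4, "D": 5}   # memory operands
stack = []
def push(name): stack.append(M[name])  # PUSH: operand value to top of stack
def add():      stack.append(stack.pop() + stack.pop())  # ADD: pop two, push sum
def mul():      stack.append(stack.pop() * stack.pop())  # MUL: pop two, push product

push("A"); push("B"); add()    # TOS = A + B
push("C"); push("D"); add()    # TOS = C + D
mul()                          # TOS = (A + B) * (C + D)
M["X"] = stack.pop()           # POP X
print(M["X"])                  # 45
```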

INSTRUCTION EXECUTION AND STRAIGHT LINE SEQUENCING

Assumptions

- Memory is byte addressable

- Word length is 32 bits

- Only one memory operand is allowed for each instruction and the computer has a number of processor registers.

- Each instruction is 4 bytes long

(i.e.) if the first instruction is at location i, the second instruction is at i+4, the third at i+8, and so on.

Program Counter : Processor Register that contains the address of the next instruction to be executed.

Straight Line Sequencing : The instruction are fetched and executed one at a time in the order of increasing addresses.

Instruction Execution - 2 phases

• Instruction Fetch

The instruction is fetched from the memory location whose address is in the program counter (PC). The instruction is placed in the Instruction Register (IR).

• Instruction Execute

The instruction in IR is examined to determine which operation is to be performed.

Fetch Cycle Execute cycle

Branching

Consider adding a list of n numbers. The addresses of the n numbers are NUM1, NUM2, …, NUMn.

Straight Line Program

[pic]

The steps needed are,

1. Add each number NUMi to register R0.

2. After adding all the numbers, the result is placed in the memory location SUM using a 'Move' instruction.

Using a loop to add n numbers

The same program mentioned above can be done more efficiently using a loop.

A loop is a straight-line sequence of instructions executed as many times as needed.

According to the figure, the loop starts at location LOOP and ends at the Branch>0 instruction.

During each pass through the loop,

- The address of the next number is calculated

- The next number is added to Ro.

The loop count (n) is maintained in register R1. R1 is decremented during every pass through the loop.

After decrementing R1, the branch instruction tests whether the loop count has reached zero. If it is zero, the loop terminates; otherwise the loop is repeated.
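The loop can be sketched in Python, with R0, R1 and R2 standing for the registers described above; the four numbers are assumed example data:

```python
# Register-transfer sketch of the add-n-numbers loop (assumed example data).
NUM = [10, 20, 30, 40]        # the n numbers in memory
R0 = 0                        # accumulates the sum
R1 = len(NUM)                 # loop count n
R2 = 0                        # index of the next number
while True:
    R0 += NUM[R2]             # add the next number to R0
    R2 += 1                   # address of the next number is calculated
    R1 -= 1                   # decrement the loop count
    if R1 == 0:               # conditional branch: repeat while count > 0
        break
SUM = R0                      # Move R0, SUM
print(SUM)                    # 10 + 20 + 30 + 40 = 100
```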

Branch Instruction

This instruction loads a new value into the program counter. This new value is called the branch target.

Conditional Branch Instruction: Causes a branch only if a specified condition is satisfied.

Condition Codes

The results of various operations are tracked by the processor and the required information is recorded in individual bits called the condition code flags.

These flags are grouped together in a special processor register called the condition code register or status register.

Commonly used flags are

• N (Negative) – set to 1 if the result is negative.

• Z (Zero) – set to 1 if the result is zero.

• V (Overflow) – set to 1 if arithmetic overflow occurs.

• C (Carry) – set to 1 if a carry-out results from the operation.
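A sketch of how the four flags could be computed for an addition, assuming an 8-bit word and unsigned inputs for illustration (real processors set these bits in hardware):

```python
# Compute result and N, Z, V, C flags for an addition, assuming 8-bit words.
def add_flags(a, b, bits=8):
    mask = (1 << bits) - 1
    result = (a + b) & mask                  # keep only the low 'bits' bits
    N = (result >> (bits - 1)) & 1           # sign bit of the result
    Z = 1 if result == 0 else 0              # result is zero
    C = 1 if (a + b) > mask else 0           # carry out of the top bit
    sa, sb, sr = a >> (bits - 1), b >> (bits - 1), result >> (bits - 1)
    V = 1 if (sa == sb and sr != sa) else 0  # same-sign operands, different-sign result
    return result, N, Z, V, C

print(add_flags(0x7F, 0x01))  # 0x80 with N=1, V=1 (signed overflow)
print(add_flags(0xFF, 0x01))  # 0x00 with Z=1, C=1
```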

The Stored Program Concept

• Learning how instructions are represented leads to discovering the secret of computing: “the stored-program concept”

• Today’s computers are built on two key principles: Instructions are represented as numbers

• Programs can be stored in memory to be read or written just like numbers

INSTRUCTION SET ARCHITECTURE (ISA)

• Introduces the wide variety of design alternatives available to the instruction set architect

• Present a taxonomy of ISA alternatives and give some qualitative assessment of the advantages and disadvantages of various approaches

• Present and analyze some instruction set measurements that are largely independent of a specific instruction set

• Address the issue of languages and compilers and their bearing on instruction set architecture

• Show some examples of instruction set architectures

To command a computer's hardware, you must speak its language

• The words of a machine's language are called instructions, and its vocabulary is called its instruction set

• Once you learn one machine language, it is easy to pick up others:

• There are few fundamental operations that all computers must provide

• All designers have the same goal of finding a language that simplifies building the hardware and the compiler while maximizing performance and minimizing cost

• Learning how instructions are represented leads to discovering the secret of computing: “the stored-program concept”

• The MIPS instruction set is used as a case study

Classifying Instruction Set Architectures

Accumulator Architecture

• Common in early stored-program computers when hardware was so expensive

• Machine has only one register (accumulator) involved in all math. & logical operations

• All operations assume the accumulator as a source operand and a destination for the operation, with the other operand stored in memory

Extended Accumulator Architecture

• Dedicated registers for specific operations, e.g stack and array index registers, added

• The 8085 microprocessor is an example of such special-purpose register arch.

General-Purpose Register Architecture

• MIPS is an example of such an arch. where registers are not restricted to playing a single role

• This type of instruction set can be further divided into:

• Register-memory: allows for one operand to be in memory

• Register-register (load-store):demands all operands to be in registers

High-Level-Language Architecture

• In the 1960s, systems software was rarely written in high-level languages and virtually every commercial operating system before Unix was written in assembly

• Some people blamed the code density on the instruction set rather than the programming language

• A machine design philosophy was advocated with the goal of making the hardware more like high-level languages

• The effectiveness of high-level languages, memory size limitation and lack of efficient compilers doomed this philosophy to a historical footnote

Reduced Instruction Set Architecture

• With the recent developments in compiler technology and expanded memory sizes, fewer programmers are using assembly-level coding

• Instruction set architectures became measured by how well compilers, rather than programmers, use them

• RISC architecture favors simplifying hardware design over enriching the offered set of instructions, relying on compilers to effectively use them to perform complex operations

• Virtually all new architectures since 1982 follow the RISC philosophy of fixed instruction lengths, load-store operations, and limited addressing modes

[pic]

ADDRESSING MODES

Addressing modes: The different ways in which the location of an operand is specified in an instruction are referred to as addressing modes.

Variable Representation in Assembly Language

A variable can be represented in 2 ways

1. By allocating a register

2. By allocating a memory location

1. Register mode

The operand is the contents of a processor register. The name of the register is given in the instruction.

2. Absolute mode

The operand is in a memory location. The address of this location is given explicitly in the instruction. This mode is also called direct mode

3. Immediate mode

The operand is given explicitly in the instruction. This mode is only used to specify the value of a source operand.

eg : Move #200, R0

The immediate operand is prefixed with #; the value 200 is moved into register R0.

4. Indirect mode

Effective address: The actual memory address of the operand

Indirect mode: The effective address of the operand is the contents of a register or memory location whose address appears in the instruction.

Example :

Fig 2.12, Fig 2.11

Add (R2), R0

The register R2 contains the address of the operand.

Consider an operand stored at location 1000. The value 1000 is stored in register R2.

Pointer: The register / memory location that contains the address of an operand.

5. Indexed Mode

The effective address of the operand is generated by adding a constant value to the contents of a register.

Index Register: The register used to hold the offset in the indexed mode of addressing is called an index register.

The indexed mode is symbolically represented as X(Ri), where X is a constant value (the offset) and Ri is the name of the register.

EA = X + [Ri]

The offset value X is also called the displacement.

Eg. If R1 contains the value 1000,

Add 20(R1), R2 adds the operand at location 20 + 1000 = 1020 to the operand contained in register R2.

The offset X may be a constant value, or it may itself be contained in a register. In that case, the indexed mode is represented as

(Ri, Rj)

The second register is called the base register. The effective address in this case is given by

EA = [Ri] + [Rj]
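Both effective-address calculations can be illustrated directly; the register contents below are assumed example values matching the Add 20(R1), R2 example:

```python
# Effective-address calculation for the two indexed modes (assumed contents).
registers = {"R1": 1000, "R2": 40}

def ea_indexed(X, Ri):    # X(Ri):   EA = X + [Ri]
    return X + registers[Ri]

def ea_based(Ri, Rj):     # (Ri,Rj): EA = [Ri] + [Rj]
    return registers[Ri] + registers[Rj]

print(ea_indexed(20, "R1"))   # 20 + 1000 = 1020
print(ea_based("R1", "R2"))   # 1000 + 40 = 1040
```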

Index Mode is extensively used to represent arrays. Two dimensional arrays are effectively represented by indexed mode.

eg: Student’s marklist

[pic]

[pic]

Indexed addressing used in accessing test scores in the list above figure

6. Relative Addressing

A special case of the indexed mode in which the index register is replaced by the program counter.

Relative mode: The effective address is determined by the indexed mode using the program counter in place of a general-purpose register Ri.

This mode is commonly used to specify target address in the branch instructions.

Additional Modes

- Autoincrement mode

- Autodecrement mode.

Auto Increment Mode

The effective address of the operand is the contents of a register specified in the instruction. After accessing the operand, the contents of this register are automatically incremented to point to the next item in a list.

Representation : (Ri) +

The increment is 1 for byte sized operands, 2 for 16 bit operands and 4 for 32 bit operands.

(Similar to post increment operation)

In a 32-bit word-length processor, successive accesses give

1. EA = [Ri]

2. EA = [Ri] + 4

3. EA = [Ri] + 8

Auto Decrement mode:

The contents of a register specified in the instruction are first automatically decremented and are then used as the effective address of the operand.

Representation : -(Ri)

(Similar to predecrement operation)

For a 32-bit word length, successive accesses give

1. EA = [Ri] - 4

2. EA = [Ri] - 8

3. EA = [Ri] - 12

The autoincrement and autodecrement modes are together used to implement a stack.
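A sketch of such a stack, assuming 4-byte words and a stack that grows toward lower addresses: push behaves like Move X, -(SP) and pop like Move (SP)+, X in the notation above.

```python
# Stack built from the two modes: push uses autodecrement -(SP),
# pop uses autoincrement (SP)+, assuming 4-byte words.
memory = {}
SP = 100                      # stack pointer, grows toward lower addresses

def push(value):
    global SP
    SP -= 4                   # autodecrement first ...
    memory[SP] = value        # ... then use SP as the effective address

def pop():
    global SP
    value = memory[SP]        # use SP as the effective address first ...
    SP += 4                   # ... then autoincrement
    return value

push(11); push(22)
print(pop())                  # 22 (last in, first out)
print(pop())                  # 11
```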

[pic]

COMPLEX INSTRUCTION SET COMPUTER

• These computers with many instructions and addressing modes came to be known as Complex Instruction Set Computers (CISC)

• One goal for CISC machines was to have a machine language instruction to match each high-level language statement type

• The large number of instructions and addressing modes led CISC machines to have variable length instruction formats

• The large number of instructions means a greater number of bits to specify them

• In order to manage this large number of opcodes efficiently, they were encoded with different lengths:

– More frequently used instructions were encoded using short opcodes.

– Less frequently used ones were assigned longer opcodes.

• Also, multiple operand instructions could specify different addressing modes for each operand

– For example,

• Operand 1 could be a directly addressed register,

• Operand 2 could be an indirectly addressed memory location,

• Operand 3 (the destination) could be an indirectly addressed register.

• All of this led to the need to have different length instructions in different situations, depending on the opcode and operands used

• For example, an instruction that only specifies register operands may only be two bytes in length

– One byte to specify the instruction and addressing mode

– One byte to specify the source and destination registers.

– An instruction that specifies memory addresses for operands may need five bytes

– One byte to specify the instruction and addressing mode

– Two bytes to specify each memory address

• Maybe more if there’s a large amount of memory.

• Variable length instructions greatly complicate the fetch and decode problem for a processor

• The circuitry to recognize the various instructions and to properly fetch the required number of bytes for operands is very complex

• Another characteristic of CISC computers is that they have instructions that act directly on memory addresses

– For example,

ADD L1, L2, L3

that takes the contents of M[L1], adds it to the contents of M[L2], and stores the result in location M[L3]

• An instruction like this takes three memory access cycles to execute

• That makes for a potentially very long instruction execution cycle

• The problems with CISC computers are

– The complexity of the design may slow down the processor,

– The complexity of the design may result in costly errors in the processor design and implementation,

– Many of the instructions and addressing modes are used rarely, if ever

REDUCED INSTRUCTION SET COMPUTERS

• In the late ‘70s and early ‘80s there was a reaction to the shortcomings of the CISC style of processors

• Reduced Instruction Set Computers (RISC) were proposed as an alternative

• The underlying idea behind RISC processors is to simplify the instruction set and reduce instruction execution time

• RISC processors often feature:

– Few instructions

– Few addressing modes

– Only load and store instructions access memory

– All other operations are done using on-processor registers

– Fixed length instructions

– Single cycle execution of instructions

– The control unit is hardwired, not microprogrammed

• Since all but the load and store instructions use only registers for operands, only a few addressing modes are needed

• By having all instructions the same length, reading them in is easy and fast

• The fetch and decode stages are simple, looking much more like Mano’s Basic Computer than a CISC machine

• The instruction and address formats are designed to be easy to decode

• Unlike the variable length CISC instructions, the opcode and register fields of RISC instructions can be decoded simultaneously

• The control logic of a RISC processor is designed to be simple and fast

• The control logic is simple because of the small number of instructions and the simple addressing modes

• The control logic is hardwired, rather than microprogrammed, because hardwired control is faster

RISC Characteristics

• Relatively few instructions

• Relatively few addressing modes

• Memory access limited to load and store instructions

• All operations done within the registers of the CPU

• Fixed-length, easily decoded instruction format

• Single-cycle instruction format

• Hardwired rather than microprogrammed control

Advantages of RISC

- VLSI Realization

- Computing Speed

- Design Costs and Reliability

- High Level Language Support

Assembly Language: A complete set of symbolic names and rules for their use constitute assembly language

Syntax: The set of rules for using the mnemonics in the specification of complete instructions and programs.

Assembler

- Translates Assembly language program into machine instructions.

- Part of the system software

- It is stored as a sequence of machine instructions in the memory of the computer.

Source Program : User program is alphanumeric text format

Object Program : Assembled machine language program

Assembler Directive

The non-executable statements that provide information to the assembler during the translation of source program into object program are called assembler directives.

eg : EQU, ORIGIN, DATAWORD etc.

EQU - Equates a variable to a constant

ORIGIN - Tells the assembler where in the memory to place the data block that

follows

DATAWORD - Inform the assembler of the data to be loaded into the memory

RESERVE - Declares the number of memory blocks to be reserved for data

Syntax of an ALP statement

Label operation operand(s) comment

The operand field contains addressing information for accessing the operands.

Assembly and Execution of Program

Source Object

Program Program

Symbol Table: Table of all names (variables and labels) and their corresponding values, maintained by the assembler.

Two Pass Assembler

I Pass - Creation of the symbol table. At the end of pass I, all names will have been assigned numerical values.

II Pass - Assembler goes through the source program and substitutes values for all names from the symbol table.
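The two passes can be sketched in Python for a hypothetical one-address machine. The mnemonics, opcodes and one-word-per-statement layout below are illustrative assumptions, not a real instruction set:

```python
# Minimal two-pass assembler sketch (hypothetical ISA, one word per statement).
OPCODES = {"LOAD": 0x1, "ADD": 0x2, "STORE": 0x3, "HALT": 0xF}

def assemble(source):
    # Pass I: record the address of every label in the symbol table.
    symtab, addr = {}, 0
    for line in source:
        label, _, rest = line.partition(":")
        if rest:                       # the line carried a label
            symtab[label.strip()] = addr
        addr += 1                      # one word per statement
    # Pass II: translate mnemonics, substituting values from the symbol table.
    code = []
    for line in source:
        stmt = line.partition(":")[2] or line
        parts = stmt.split()
        op = OPCODES[parts[0]]
        operand = symtab.get(parts[1], 0) if len(parts) > 1 else 0
        code.append((op << 8) | operand)
    return symtab, code
```

For example, assembling ["LOOP: LOAD A", "ADD B", "STORE A", "HALT", "A: HALT", "B: HALT"] gives A the address 4, and LOAD A becomes 0x104.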

Loader - A utility program.

Performs the sequence of input operations needed to transfer the machine language program from the disk into a specified place in the memory.

Debugger: A system program that helps the user detect and correct errors in a program.

Enables the user to stop execution of the object program and examine the contents of various processor register and memory locations.

ALU DESIGN

N Bit Ripple Carry Adder:

Si = Xi exor Yi exor Ci

Ci+1 = Yi Ci + Xi Ci + Xi Yi

The sum is obtained by XORing Xi, Yi and Ci. The full carry expression is

Ci+1 = Yi Ci + Xi Ci + Xi Yi + Xi Yi Ci

The fourth term is omitted because whenever it is '1' (i.e. Xi = Yi = Ci = 1), the term Xi Yi is also '1', so the result is unaffected.

N Bit Ripple Carry:

N full adders can be arranged to form an n-bit ripple carry adder. In this circuit, the inputs are fed into the FAs at each stage. The carry out from one FA becomes the carry in to the next FA, so the carry propagates (ripples) from the first FA to the last.

K n-bit adders can be cascaded to form a kn-bit adder.

Adder/ Subtractor:

An n-bit adder can also be used for subtraction. X - Y is performed in the following steps:

1. Take the 2's complement of Y (the number being subtracted).

2. Add it to X.

3. If the first bit of the result is 1, the result is negative; take the 2's complement of the result to obtain its magnitude.

4. If the first bit of the result is 0, the result is positive; leave it as such.

5. While generating the result, ignore the carries, if any.
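The steps above can be sketched directly in Python; the 4-bit width below is an arbitrary choice:

```python
BITS = 4

def twos_complement(v, bits=BITS):
    # invert all bits and add 1, keeping only 'bits' bits
    return (~v + 1) & ((1 << bits) - 1)

def subtract(x, y, bits=BITS):
    # steps 1-2: X + (2's complement of Y); step 5: ignore the carry out
    raw = (x + twos_complement(y, bits)) & ((1 << bits) - 1)
    if raw >> (bits - 1):            # step 3: first bit 1 -> negative result
        return -twos_complement(raw, bits)
    return raw                       # step 4: first bit 0 -> positive result
```

For example, subtract(7, 3) returns 4, while subtract(3, 7) returns -4.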

Circuit Design:

Aim:

To perform X+Y and X-Y, Xi is fed into the circuit without any change and Yi is fed into the circuit through XOR gates.

The second input to the XOR gate is ADD/SUB control line.

To Do Addition:

Y value should be unchanged. Hence ADD/SUB is set to 0.

|ADD/SUB |Y |XOR |

|0 |1 |1 |

|0 |0 |0 |

To Do Subtraction:

ADD/SUB is set to 1.

|ADD/SUB |Y |XOR |

|1 |0 |1 |

|1 |1 |0 |

But we need the 2's complement to do subtraction (i.e. 1's complement + 1). This is done automatically, because the ADD/SUB line is also fed in as the carry-in C0. Hence 1 is added to the 1's complement, yielding the 2's complement.

Delay:

Delay through any combinational logic network is determined by adding up the number of logic gate delays along the longest signal propagation path through the network.

Sum gate requires 1 gate delay.

Carry requires 2 gate delays.(delay along the longest path is 2).

To Generate C4, S4:

C4 = 8 delays (2 delays per stage, four stages).

S4 = 8+1=9 delays.

Approaches to reduce delay:

• Implement the ripple carry design in the fastest possible electronic technology.

• Use an augmented logic gate network structure that is larger but faster.

Design of Fast Adder:

• Uses Carry Look Ahead Logic.

In the previous case,(n bit ripple carry), the FAs cannot operate simultaneously. Because the carry i/p to FA depend on the carry o/p of the previous FA.

Carry Look Ahead Logic itself generates carries and give them to the FAs and operate simultaneously thus reducing the delay significantly.

CARRY LOOK AHEAD LOGIC (CAL):

Uses 2 functions namely Generate and Propagate.

Ci+1 = Xi Yi + Xi Ci + Yi Ci

= Xi Yi + (Xi + Yi) Ci

= Gi + Pi Ci

where,

Gi = Xi Yi

Pi = Xi + Yi

Take i = 3,

C4 = G3 + P3 C3 ------ (1)

C3 = G2 + P2 C2 ------ (2)

C2 = G1 + P1 C1 ------ (3)

C1 = G0 + P0 C0 ------ (4)

Substituting (2) into (1),

C4 = G3 + P3 (G2 + P2 C2)

= G3 + P3 G2 + P3 P2 (G1 + P1 C1) (put 3)

= G3 + P3 G2 + P3 P2 G1 + P3 P2 P1 (G0 + P0 C0) (put 4)

C4 = G3 + P3 G2 + P3 P2 G1 + P3 P2 P1 G0 + P3 P2 P1 P0 C0

From the above equation it is clear that to generate C4 , we require only C0.

But in the case of ripple carry, C4 can be generated only when C1, C2 and C3 are available.

Thus a carry look ahead logic can generate all the carries simultaneously, when all the inputs and C0 are fed into the logic.
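As a quick sketch (not the gate-level implementation), with bit vectors as LSB-first Python lists:

```python
def cla_carries(x_bits, y_bits, c0=0):
    # Gi = Xi.Yi ("generate"), Pi = Xi + Yi ("propagate")
    g = [x & y for x, y in zip(x_bits, y_bits)]
    p = [x | y for x, y in zip(x_bits, y_bits)]
    carries = [c0]
    for i in range(len(x_bits)):
        carries.append(g[i] | (p[i] & carries[i]))   # Ci+1 = Gi + Pi.Ci
    return carries

x = [1, 0, 1, 1]              # X = 13 (1101), LSB first
y = [1, 1, 0, 1]              # Y = 11 (1011), LSB first
carries = cla_carries(x, y)   # C0..C4
```

The loop here computes the carries one after another for clarity; the hardware instead evaluates the expanded products (C4 = G3 + P3 G2 + ...) in parallel, once all Gi and Pi are available.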

A bit storage cell is used to generate Gi and Pi for each bit position. All the bit storage cells operate simultaneously, so all the generate and propagate functions are available simultaneously.

Thus irrespective of the size of the adder,

1 gate delay is required to generate Gi , Pi.

1 gate delay is required to generate Pi Ci (AND)

1 gate delay is required to generate Gi + Pi Ci .

Totally, 3 gate delays are required to generate all the carries, and 1 more gate delay for each sum bit.

Thus in a 4 bit adder,

C4 = 3 delay

S4 = 3 + 1 = 4 delays (irrespective of the size of the adder).

Circuit Design - BIT STORAGE CELL:

Gi = Xi Yi

This is implemented with a AND gate.

Si = Xi exor Yi exor Ci

This is implemented with two exor gates

Pi = Xi + Yi

This is implemented with same XOR gate which was used to generate the sum.

XOR's truth table is the same as OR's except when both inputs are 1.

|X |Y |XOR |OR |

|0 |0 |0 |0 |

|0 |1 |1 |1 |

|1 |0 |1 |1 |

|1 |1 |0 |1 |

When X = Y = 1, X exor Y = 0,

whereas X + Y = 1.

This difference can be ignored because when Xi = Yi = 1, Gi = 1, so Ci+1 = Gi + Pi Ci is 1 regardless of the value of Pi.

Higher Level Generate and Propagate Functions:

The carry look ahead logic (CAL) above produces up to 4 carries. When the circuit is to be extended to more bits, the same circuit design cannot be used.

This is because, to generate C4, the CAL needs 5 inputs; to generate C8 it would require still more inputs. The CAL is a logic gate circuit, and there is a limit on the number of inputs to a single gate. This is called the 'fan-in' problem. Hence the same circuit cannot simply be extended.

Instead, two levels can be used (e.g. for a 16-bit adder): the Gi and Pi functions of each 4-bit block are combined into block generate and block propagate functions, which are given as inputs to a second-level carry look ahead circuit. Thus the carries can be generated in parallel across all blocks.

MULTIPLICATION:

Multiplication of Positive Numbers:

- Normal Multiplication Technique

Simplest way of Multiplication:

Use the adder circuitry in the ALU for a number of sequential steps.

Delay: 6(n-1) - 1 gate delays.

Circuit Design:

1. A register Q to store the multiplier. The individual bits are represented as qn-1 … q1 q0.

2. A register M to store the multiplicand. The individual bits are represented as mn-1 … m1 m0.

3. A register A which is used to temporarily hold the intermediate results. The bits are an-1 … a1 a0.

4. A flipflop C to store the carry .

5. A control sequencer to determine whether to perform ‘ADD/NO ADD’.

6. An adder circuit to perform the addition.

7. A multiplexer circuit that selects one of the inputs to the adder circuit.

For ADD,MUX selects Multiplicand for addition.

For NO ADD MUX selects ‘0’ as an i/p to the n-bit adder.

The second i/p to the n-bit adder is always the partial product at a particular instance. This is being held in register A.

Multiplication Rule:

Eg:1000(8) X 0011(3)

Initially C -> 0

A -> 0

M -> Multiplicand (1000)

Q -> Multiplier (0011)

Scan the multiplier from the right(i.e from q0)

If q0 =1

1. Add A+M.

2. Right shift A and Q

else if q0 =0

1.No ADD.

2. Right shift A AND Q.

These steps constitute a cycle.

For 4 bit multiplication, 4 cycles are to be performed.

N bit multiplication ,n cycles are to be performed.

Depending upon the result of addition at each cycle, C flip-flop (carry) is set to ‘0’ or ‘1’ correspondingly.
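The cycle above can be sketched as follows for unsigned operands (register widths are implied by n):

```python
def shift_add_multiply(multiplicand, multiplier, n=4):
    # C, A, Q registers; M holds the multiplicand
    c, a, q, m = 0, 0, multiplier, multiplicand
    for _ in range(n):                 # one cycle per multiplier bit
        if q & 1:                      # q0 = 1 -> ADD
            total = a + m
            a, c = total & ((1 << n) - 1), total >> n
        # right shift C, A and Q together (C flows into A, A's LSB into Q)
        q = (q >> 1) | ((a & 1) << (n - 1))
        a = (a >> 1) | (c << (n - 1))
        c = 0
    return (a << n) | q                # 2n-bit product in A (high) and Q (low)
```

For the example above, shift_add_multiply(8, 3) returns 24.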

Signed Operand Multiplication:

To multiply numbers of two different signs(+ and -)

1. Take the 2’s complement of the negative number.

2. Perform normal multiplication( positive number X 2’s complement of negative no).

3. While generating partial products, each partial product should be extended to 2n bits (an n x n multiplication yields a 2n-bit product):

When the multiplier bit is 1, the remaining high-order bits are filled with 1.

When the multiplier bit is 0, the remaining bits are filled with 0.

4. The 2's complement of the result is taken, with a negative sign in front, to indicate the negative number.

[ Not to be confused with signed number addition

While adding + and -, the result can be either +ve or -ve depending upon which number is greater. [e.g. 3 - 6 = -3; 6 - 3 = +3]

But in the case of multiplication of operands with different signs, the result is always -ve irrespective of the magnitudes of the two numbers.]

Fast Multipliers:

Two Methods:

1. Bit Pair Recoding.

2. CSA – Carry Save Adder.

Bit Pair Recoding:

Truth table:

|Multiplier bit pair |Multiplicand selected at position i|

|i+1 |i |i-1 | |

|0 |0 |0 |0 x M |

|0 |0 |1 |+1 x M |

|0 |1 |0 |+1 x M |

|0 |1 |1 |+2 x M |

|1 |0 |0 |-2 x M |

|1 |0 |1 |-1 x M |

|1 |1 |0 |-1 x M |

|1 |1 |1 |0 x M |

Example:

13 * -6

13=> 01101 x

-6=> 11010

Recoding of -6:

Booth=> 0 -1 +1 -1 0

For bit pair recoding, refer to the truth table:

0 -1 +1 -1 0 ( Booth recoded form

Pairing the Booth digits two at a time, each pair (di+1, di) is replaced by the single digit 2 x di+1 + di:

0 -1 +1 -1 0 ( Booth form

0 -1 -2 ( Bit pair recoded form

Now 13 x -6 is,

0 1 1 0 1

0 -1 -2

________________________________________________

1 1 1 1 1 0 0 1 1 0

1 1 1 1 0 0 1 1

0 0 0 0 0 0

________________________________________________

1] 1 1 1 0 1 1 0 0 1 0 (the carry out of the sign position is ignored)

________________________________________________

0 0 0 1 0 0 1 1 0 1 [1's Complement]

1 +

________________________________________________

0 0 0 1 0 0 1 1 1 0 [= 78, so the product is -78]

_____________________________________

Note:

While generating partial products, care should be taken that each partial product lines up with the position of its multiplier bit pair. This is shown as (1, 3, 5) in the example. Bit pair recoding requires n/2 summands for an n x n bit multiplication.
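The recoding itself can be sketched as follows; operands are n-bit 2's-complement values given as LSB-first lists, which is purely an implementation convenience:

```python
def booth_recode(bits):
    # Booth digit at position i is b(i-1) - b(i), with b(-1) = 0
    prev, digits = 0, []
    for b in bits:
        digits.append(prev - b)
        prev = b
    return digits

def bit_pair_recode(bits):
    booth = booth_recode(bits)
    if len(booth) % 2:
        booth.append(0)                # pad to an even number of digits
    # pair adjacent Booth digits: one recoded digit per 2 bits
    return [2 * booth[i + 1] + booth[i] for i in range(0, len(booth), 2)]
```

For -6 (11010, i.e. [0, 1, 0, 1, 1] LSB first) this yields the Booth digits 0 -1 +1 -1 0 and the bit-pair digits 0 -1 -2, matching the example; each recoded digit carries a weight of 4^i.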

Carry Save Addition: ( CSA)

Logic:

Instead of letting the carries ripple along the rows, they can be "saved" and introduced into the next row, at the correct weighted positions.

Example:

1 0 1 1 0 1 (45) x

1 1 1 1 1 1 (63)

____________________________________

1 0 1 1 0 1 -------A

1 0 1 1 0 1 -------B

1 0 1 1 0 1 -------C

1 0 1 1 0 1 -------D

1 0 1 1 0 1 -------E

1 0 1 1 0 1 -------F

1 0 1 1 0 0 0 1 0 0 1 1 (2,835)

___________________________________________________________________

Procedure:

1. In CSA, each FA can handle 3 inputs. Hence the partial products are divided into groups consisting of three.

2. The results of three input addition (Sum and Carry) are added with the remaining results.

3. The whole process is repeated until there is no result (or input) left out.

Considering the above example, the logic is,

A + B + C -> S1, C1

D + E + F -> S2, C2

S1 + C1 + S2 -> S3, C3

S3 + C3 + C2 -> S4, C4

S4 + C4 -> Product (final addition with a conventional adder)

4. While generating the results, it is very important that the weighted positions of the partial products are taken care of.

Procedure:

1 0 1 1 0 1 (M)

1 1 1 1 1 1 (Q)

_______________________________________

| 1 | 0 1 1 0 1 -------A

1 | 0 | 1 1 0 1 -------B

1 0 | 1 | 1 0 1 -------C

1 0 1 | 1 | 0 1 -------D

1 0 1 1 | 0 | 1 -------E

1 0 1 1 0 | 1 | -------F

_________________________________________________________

A+B+C ( | 1 | 0 1 1 0 1 -------A

1 | 0 | 1 1 0 1 -------B

1 0 | 1 | 1 0 1 -------C

____________________________________________

1 1 0 0 0 0 1 1 -------S1

0 0 1 1 1 1 0 0 -------C1

Delay:

32 x 32 bit Multiplication

Array multiplier - 185 gate delays

CSA - 29 gate delays

To perform carry save addition of k summands, approximately 1.7 log2 k - 1.7 levels of CSA are required.
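The reduction for the 45 x 63 example can be sketched with Python integers standing in for the summand rows; the final carry-propagate addition is just the ordinary +:

```python
def csa(a, b, c):
    # one layer of full adders applied bitwise across whole words:
    # sum bits and carry bits are kept separate; carries shift left one place
    s = a ^ b ^ c
    carry = (a & b) | (a & c) | (b & c)
    return s, carry << 1

summands = [45 << i for i in range(6)]   # six shifted copies of 101101
while len(summands) > 2:
    s, c = csa(*summands[:3])            # reduce three rows to two
    summands = summands[3:] + [s, c]
S, C = summands                          # S4, C4: one final CPA remains
```

Each csa() step preserves the total (a + b + c == s + carry), so S + C equals the product 2,835.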

Integer Division:

[pic]

Circuit Design:

1. Dividend is held in the register Q.

2. Divisor is held in the register M.

3. Register A is used for temporarily storing intermediate values.

4. n+1 bit adder is used for subtraction.

5. Control sequencer is used to determine whether to perform addition/subtraction in subsequent cycle.

Divide Rule:

Two types of division is possible,

1. Restoring Division

2. Non Restoring Division

Restoring Division Procedure:

1. Initially A is cleared to 0, Q is loaded with the dividend and M is loaded with divisor.

2. For dividing a no of bit length n, n cycles are needed.

3. Repeat the following for n times.

a. Shift (left) both Q and A.

b. Subtract the divisor from A [by 2's complement addition].

i. If the first bit of the result is 1, set q0 of the dividend to 0, then add the divisor back to A (restore). Ignore the carries, if any.

ii. Else if the first bit of the result is 0, set q0 of the dividend to 1 and go to the next cycle.

4. After n cycles register A has Remainder and Q has Quotient.
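A sketch of the procedure for unsigned n-bit operands; Python's signed integers stand in for the 2's-complement arithmetic of the n+1-bit A register:

```python
def restoring_divide(dividend, divisor, n=4):
    a, q, m = 0, dividend, divisor
    for _ in range(n):
        # shift A and Q left as one pair (Q's MSB flows into A)
        a = ((a << 1) | (q >> (n - 1))) & ((1 << (n + 1)) - 1)
        q = (q << 1) & ((1 << n) - 1)
        a -= m                          # trial subtraction
        if a < 0:                       # sign bit set -> restore
            a += m                      # q0 stays 0
        else:
            q |= 1                      # q0 = 1
    return q, a                         # quotient in Q, remainder in A
```

For example, restoring_divide(8, 3) returns (2, 2), i.e. quotient 2, remainder 2.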

Non Restoring Division Procedure:

1. Initially A is cleared to 0,Q is loaded with the dividend and M is loaded with divisor.

2. For dividing a no of bit length n, n cycles are needed.

3. Repeat the following for n times.

a. Shift (left) both Q and A.

b. Subtract the divisor from A [by 2's complement addition].

i. If the first bit of the result is 1, set q0 of the dividend to 0. In the subsequent cycle, addition is to be performed instead of subtraction.

ii. Else if the first bit of the result is 0, set q0 of the dividend to 1 and go to the next cycle.

4. After n cycles, if the first bit of the final result is 1, add the divisor to A; else leave it as such. Register A then has the remainder and Q has the quotient.
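A matching sketch of the non-restoring procedure; again, Python's unbounded signed integers play the role of the n+1-bit A register:

```python
def nonrestoring_divide(dividend, divisor, n=4):
    a, q, m = 0, dividend, divisor
    for _ in range(n):
        a = (a << 1) | (q >> (n - 1))   # shift A and Q left as one pair
        q = (q << 1) & ((1 << n) - 1)
        if a < 0:
            a += m                      # previous cycle left A negative -> add
        else:
            a -= m                      # otherwise subtract as usual
        if a >= 0:
            q |= 1                      # q0 = 1 when the step succeeded
    if a < 0:                           # final correction step
        a += m
    return q, a                         # quotient in Q, remainder in A
```

The restore inside the loop is gone; a single correcting addition at the end fixes a negative remainder, so nonrestoring_divide(8, 3) also returns (2, 2).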

Floating Point Numbers of Operation:

A computer must be able to represent numbers and operate on them in such a way that the position of the binary point is variable and is automatically adjusted as computation proceeds.

In such case representation is

6.0247 x 10^23

where 6.0247 is the mantissa (the string of significant digits) and 23 is the exponent.

Floating Point Condition:

When the decimal point is placed to the right of the first significant digit, the number is said to be normalized.

IEEE Standard for floating point numbers:

Defines both the representation and the way in which the four basic arithmetic operations are to be performed.

<---------------------------- 32 bits ---------------------------->

|S |E' |M |

S - 1-bit sign

E' - 8-bit signed exponent in excess-127 representation

M - 23-bit mantissa

Value = +/- 1.M x 2^(E'-127)
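The field layout and the value formula can be checked with Python's struct module, which exposes the raw 32-bit pattern of a single-precision value (normal numbers only; special cases such as 0 and infinity are ignored in this sketch):

```python
import struct

def decode_ieee754_single(x):
    # reinterpret the 4 float bytes as a big-endian unsigned 32-bit integer
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    s = bits >> 31                      # sign bit
    e_prime = (bits >> 23) & 0xFF       # 8-bit excess-127 exponent E'
    m = bits & 0x7FFFFF                 # 23-bit mantissa field M
    value = (-1) ** s * (1 + m / 2 ** 23) * 2.0 ** (e_prime - 127)
    return s, e_prime, value
```

For example, 6.5 is 1.101 binary x 2^2, so its stored exponent is E' = 2 + 127 = 129 and decode_ieee754_single(6.5) returns (0, 129, 6.5).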

E' = E + 127. E' lies in the range 0 < E' < 255.

MAT - Memory Access Time.

RAM - the access time is the same for any word.

Reducing MAT -> use a cache memory.

Cache Memory: A small, fast memory that is inserted between the larger, slower main memory and the processor. It holds the currently active segments of a program and their data.

Virtual Memory: Data may be stored in physical memory locations that have addresses different from those specified by the program.

• Memory control circuitry translates the address specified by the program into an address that can be used to access the physical memory.

• The address generated by the processor is the virtual (logical) address.

• Virtual address is mapped to physical address -> done by Memory Management Unit (MMU).

Mapping function can be changed during program execution according to system requirement.

Virtual memory -> increases the apparent size of the physical memory. Only the active portion of the address space is mapped onto locations in the physical memory.

The remaining virtual addresses are mapped onto the bulk storage device (e.g. magnetic disk).

As the active portion changes during program execution, MMU changes the mapping function also.

During every memory access, the address processing mechanism determines whether the addressed information is in the physical memory. If it is, the proper word is accessed. Else, the page containing the word is transferred from the disk to the memory,

-> replacing a currently inactive page.

The time required to move pages between disk and memory is what causes the loss of speed.

Semiconductor RAM Memories:

Cycle times range from about 100 ns down to a few nanoseconds.

Static memory -> cache memory

Dynamic memory -> computer main memory

MEMORY CONTROLLER:

[pic]

To reduce the number of pins, memory chips use multiplexed address inputs. The multiplexing is done by the memory controller circuit, which is placed between the processor and the dynamic memory. The processor issues all bits of an address at the same time to the memory controller. The controller accepts the complete address and the R/W (bar) signal from the processor under the control of a request signal, which indicates that a memory access operation is needed. The controller then forwards the row and column portions of the address to the memory and generates the RAS (bar) and CAS (bar) signals. When used with DRAM chips, the memory controller also has to provide all the information needed to control the refreshing process.

Refreshing Overhead

EXAMPLE

- An SDRAM chip has cells arranged in 8k rows. It takes 4 clock cycles to access each row.

- The clock rate is 133 MHz. What is the refresh overhead? (The refresh period is 64 ms in a typical SDRAM.)

- Total number of rows= 8k = 8192

- Time taken to refresh a single row = 4 clock cycles.

- Time taken to refresh 8192 rows = 8192 x 4

- In terms of cycles - 32,768 cycles.

Clock rate = 133 MHz = 133 x 10^6 Hz.

Time taken to refresh all rows, in seconds = 32,768 / (133 x 10^6)

= 246 x 10^-6 seconds

= 0.246 milliseconds.

Therefore, Refresh Overhead = (0.246 x 10^-3) / (64 x 10^-3)

Refresh Overhead= 0.0038
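The arithmetic above can be reproduced directly:

```python
rows = 8 * 1024                  # 8k rows
cycles_per_row = 4               # clock cycles to refresh one row
clock_hz = 133e6                 # 133 MHz clock
refresh_period_s = 64e-3         # all rows must be refreshed every 64 ms

refresh_time_s = rows * cycles_per_row / clock_hz    # about 0.246 ms
overhead = refresh_time_s / refresh_period_s         # about 0.0038
```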

RAMBUS MEMORY

• The rambus technology increases the Speed of the memory by using a faster narrow bus.

• The key feature is to use a Fast Signaling method to transfer information between chips.

• Instead of using signals whose voltage levels are either 0 or Vsupply to represent the logic values, the signals consist of much smaller voltage swings around a reference voltage Vref.

• The reference voltage is about 2V, and the two logic values are represented by 0.3V swings above and below Vref.

• This results in shorter transition times, which allows for a higher speed of transmission. This type of signaling is known as Differential Signaling.

• Differential Signaling requires special circuit Interface.

Rambus Channel: A complete specification for the design of communication links for the differential signaling.

Rambus DRAM: Chips that have the circuitry needed to interface the Rambus channel.

Direct RDRAM: - 2 channel Rambus

- It has 18 data lines intended to transfer two bytes of data at a time.

- There are no separate address lines.

RDRAM chips can be assembled into larger modules.

COMMUNICATION TECHNOLOGY

3 types of packets are usually transmitted between the master and the slave. They are

❖ Request – issued by the master that indicates the type of operation that is to be performed.

❖ Acknowledge – the addressed slave responds by giving either a positive or a negative acknowledgement.

❖ Data – the data is sent/received.

READ ONLY MEMORY

Since RAM is volatile, a Boot program cannot be stored in RAM. We need a non-volatile memory to store a Boot Program. (i.e) ROM.

ROM

[pic]

A ROM cell has a transistor. A logic value 0 is stored in the cell if the transistor is connected to ground at point P; otherwise, a 1 is stored. To read the state of the cell, the word line is activated. A sense circuit at the end of the bit line generates the proper output value. Data are written into a ROM when it is manufactured.

1. PROM - Programmable ROM

PROM can be programmed by the user. This is achieved by inserting a fuse at point P. Before it is programmed, the memory contains all 0s. The user can insert 1s at the required locations by burning out the fuses at these locations using high-current pulses. However, this process is irreversible; the data cannot be erased.

PROMs provide a faster and less expensive approach.

2. EPROM - Erasable PROM

In EPROM, the stored data can be erased and new data can be stored. EPROM is capable of retaining stored information for a long time. Erasure requires dissipating the charges trapped in the transistor of memory cells. This is done by exposing the chips to ultra violet light. Hence, EPROM chips are mounted in packages that have transparent windows.

3. EEPROM - Electrically Erasable PROM

The drawback of EPROM is that a chip must be physically removed from the circuit for reprogramming and that its entire contents are erased by the ultra- violet light.

EEPROM are electrically erasable and do not have to be removed for erasure. It is possible to erase the cell contents selectively. The drawback of EEPROM is that different voltages are needed for erasing, writing and reading the stored data.

4. Flash Memory

A flash memory is similar to EEPROM .A flash cell is based on a single transistor controlled by trapped charge.

In an EEPROM it is possible to read and write the contents of a single cell. In a flash device, it is possible to read the contents of a single cell, but writing is only possible for an entire block of cells. Before writing, the previous contents of the block must be erased.

Flash devices have higher capacity and consume less power. They are used in hand-held computers, cell phones, digital cameras and MP3 music players.

Single flash chips do not provide sufficient Storage capacity. Hence, larger modules consisting of a number of chips are needed. Two choices of implementation are,

- Flash Cards

- Flash Drives

Flash Cards:

Many flash chips are mounted on a small card to form a flash card. These flash cards have a standard interface that makes them usable in a variety of products. They come in a variety of sizes such as 8, 32, or 64 Mbytes.

Flash Drives:

Flash drives are solid-state electronic devices that have no movable parts. They have shorter seek and access times. They have lower power consumption and are well suited for battery-driven applications. They are also insensitive to vibration. The drawbacks are a smaller capacity and a higher cost per bit. Flash memory will also deteriorate after it has been written a large number of times.

SPEED SIZE COST

Memory system requirements

- Speed - ↑

- Size - ↑

- Cost - ↓

Hence, to achieve these requirements, different types of memory units are employed to form a hierarchy.

[pic]

- The fastest access is to data held in processor registers.

- At the next level, is a relatively small amount of memory that can be implemented directly on the processor cache. [Primary cache]. This is also referred to as level1 (L1) cache.

- A large, secondary cache is placed between the primary cache and the rest of the memory. This is referred as level 2 (L2) cache. It is usually implemented with SRAM chips.

- The next level is the main memory. This is implemented using dynamic memory components (SIMMs, DIMMs and RIMMs). The main memory is larger and slower than the cache memory.

- Next level is the disk devices that provide a huge amount of inexpensive storage. They are very slow when compared to the main memory.

CACHE MEMORY

The speed of the main memory is very low in comparison with the speed of modern processors. Hence, a cache memory is used that makes the main memory appear to the processor to be faster than it really is.

The effectiveness of the cache memory is based on the locality of reference:

i) Temporal

ii) Spatial

[pic]

Temporal Locality:

Whenever an instruction or data item is first needed, it should be brought into the cache, where it will hopefully remain until it is needed again.

Spatial Locality:

Instead of fetching just one item from the main memory to cache, it is useful to fetch several items that reside at adjacent addresses.

Block:

Set of contiguous address locations of some size.

Cache Line:

Another term for cache block.

Mapping Function:

Defines the correspondence between the main memory blocks and those in the cache. When the cache is full and a memory word that is not in the cache is referenced, the cache control hardware must decide which block should be removed to create space for the new block that contains the referenced word.

Replacement Algorithm:

The collection of rules for making this decision.

Cache Control Circuitry:

Determines whether the requested word currently exists in cache or not.

Read/Write Hit:

The requested word exists in the cache.

Read/Write Miss:

The requested word does not exist in the cache.

Write Operation:

i. Write through protocol

ii. Write back/ copy back protocol

Write Through Protocol:

Cache location and main memory location are updated simultaneously.

Write Back Protocol:

Update only the cache location and mark it updated with an associated flag bit called ‘dirty bit or modified bit ‘.

The main memory location of the word is updated later, when the block containing this marked word is to be removed from the cache during a replacement.

WRITE THROUGH PROTOCOL (demerit): Results in unnecessary write operations in the main memory when the word in the cache is updated several times.

WRITE BACK PROTOCOL (demerit): Results in unnecessary write operations because when a cache block is written back to the memory, all words of the block are written back, even if only a single word has been changed while the block was in the cache.

Read Miss can be handled in two ways:

• The block of words that contains the requested word is copied from the main memory into the cache. After the entire block is loaded into the cache, the requested word is forwarded to the processor.

• Load through/ Early Restart – The word may be sent to the processor as soon as it is read from the main memory.

Write Miss can be handled in two ways:

• The information is written directly into the main memory [write through protocol]

• The block containing the addressed word is first brought into cache, and then the desired word in the cache is overwritten with the new information [write back protocol].

MAPPING FUNCTIONS:

These are three types of mapping namely,

• Direct mapping

• Associative mapping

• Set associative mapping

EXAMPLE A:

A cache consists of 128 blocks of 16 words each, i.e. 2048 words in total. The main memory has 64K words, i.e. 4k blocks of 16 words each. Assume consecutive addresses refer to consecutive words.

DIRECT MAPPING:

[pic]

• Here block j of the main memory maps onto block j modulo 128 of the cache.

• Ex: 128th block of main memory maps to

• 128 mod 128 = 0th block of cache

• 129th block of main memory maps to

• 129 mod 128 = 1st block of cache

The drawback here is that more than one memory block is mapped onto a given cache block position. This gives rise to contention.

Ex: 128, 256, 512,129

o Among these 4 blocks 128, 256, 512 map to block 0 and 129 maps to block 1.

o In this case, block 0 is locked by block 128 even when other blocks of the cache are free. Hence, 256 and 512 have to wait until block 128 releases block 0 of the cache.

o Contention is resolved by allowing the new block to overwrite the currently resident block.

o Placement of a block in cache is determined by the memory address. The memory address is divided into three fields namely

• Tag (5 bits)

• Block (7 bits)

• Word (4 bits)

According to the given Example A,

• Each block of cache can be mapped to any one of 32 blocks of main memory [4096/128=32]

• i.e. blocks 0, 128, 256, …, 3968 of the main memory can be mapped to block 0 of the cache. The tag field (5 bits) of the main memory address identifies which of the 32 candidate blocks is currently in the cache (2^5 = 32).

• Totally, there are 128 blocks in the cache. To select one block we need 7 bits (2^7 = 128). Hence, these 7 bits form the 'block' field of the main memory address.

• Each block consists of 16 words. To access one particular word, we need 4 bits (2^4 = 16). These 4 bits form the 'word' field of the main memory address.

• The direct mapping technique is easy to implement but not very flexible.
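For Example A, the 5 + 7 + 4 bit field split can be sketched as:

```python
def split_direct(addr):
    # 16-bit main-memory address -> 5-bit tag, 7-bit block, 4-bit word
    word = addr & 0xF            # low 4 bits: word within a block
    block = (addr >> 4) & 0x7F   # next 7 bits: cache block number
    tag = addr >> 11             # top 5 bits: which of the 32 candidates
    return tag, block, word
```

For example, word 5 of main-memory block 129 (address 129 x 16 + 5) splits into tag 1 and cache block 1, consistent with 129 mod 128 = 1.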

ASSOCIATIVE MAPPING:

[pic]

According to this technique, a main memory block can be placed into any cache block position.

Hence, any block of main memory can be mapped to any of the 128 blocks of the cache. Thus, 12 bits are required (2^12 = 4096) to identify which of the 4096 blocks of main memory is mapped into the cache. The individual word can be accessed with 4 bits as in the previous case.

Hence the tag field of the main memory address is 12 bits and the word field is 4 bits.

Thus the space in the cache can be used more efficiently. A new block that has to be brought into the cache has to replace an existing block only if the cache is full.

The cost of an associative cache is higher than the cost of a direct mapped cache.

SET – ASSOCIATIVE MAPPING

[pic]

This technique is a combination of the direct and associative mapping technique. Blocks of the cache are grouped into sets, and the mapping allows a block of the main memory to reside in any block of a specific set. A cache that has k blocks per set is referred to as k way set associative cache.

Thus, the contention problem of the direct method is reduced by having a few choices for block placement.

The hardware cost is reduced by decreasing the size of associative search.

Example:

Here every set of cache has 2 blocks in it and totally there are 64 sets. In this case, memory blocks 0, 64,128 maps to cache set 0 [direct mapping]. In set 0, they can occupy either block 0 or block 1 [associative mapping]

The main memory address has three fields namely tag, set and word.

Totally, there are 64 sets in the cache, and 6 bits are required to identify one set (2^6 = 64). Hence, the set field has 6 bits.

4096 / 64 = 64, i.e. each cache set can be mapped to 64 blocks of the main memory. Hence, identifying which of the 64 candidate blocks is currently mapped to a cache set requires 6 bits (2^6 = 64). These 6 bits constitute the 'tag' field.

The ‘word’ field has 4 bits as in the previous case.
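The corresponding 6 + 6 + 4 bit split for this 2-way set-associative cache:

```python
def split_set_associative(addr):
    # 16-bit main-memory address -> 6-bit tag, 6-bit set, 4-bit word
    word = addr & 0xF
    set_no = (addr >> 4) & 0x3F   # 64 sets
    tag = addr >> 10              # 64 candidate blocks per set
    return tag, set_no, word
```

For example, main-memory block 64 (address 64 x 16) splits into set 0 with tag 1, consistent with 64 mod 64 = 0.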

Valid Bit:

- A control bit that is associated with each block

- Indicates whether the block contains valid data or not.

The valid bit of a particular cache block is set to 1 the first time the block is loaded from the main memory. Whenever a main memory block is updated by a source that bypasses the cache, a check is made to determine whether the block being updated is currently in the cache.

If it is, its valid bit is cleared to zero. This ensures that stale data will not exist in the cache.

Cache Coherence Problem:

Two different entities using the same copy of data (ex: processor and DMA ) may have different values assigned to it.

Replacement Algorithm:

LRU - Least Recently Used

When a block is to be overwritten, overwrite the one that has gone the longest time without being referenced.

The algorithm is implemented by the cache controller. For a four-way set-associative cache, a 2-bit counter can be used for each block.

When a hit occurs, the counter of the block that is referenced is set to 0. Counters with values originally lower than the referenced one are incremented by one and the others remain unchanged.

When a miss occurs there are 2 cases

Case 1: If the set is not full, the counter associated with the new block loaded from the main memory is set to 0, and the values of all other counters are incremented by 1

Case 2: When a miss occurs and the set is full, the block with counter value 3 is removed, the new block is put in its place, and its counter is set to 0. The other three block counters are incremented by 1.
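These counter rules can be sketched for one four-block set (a software model of the counters, not the actual controller hardware):

```python
class LRUSet:
    """2-bit-counter LRU for one 4-block set, per the rules above."""
    def __init__(self):
        self.count = {}                    # block tag -> 2-bit counter

    def access(self, tag):
        if tag in self.count:              # hit
            ref = self.count[tag]
            for b in self.count:           # only counters below the
                if self.count[b] < ref:    # referenced one move up
                    self.count[b] += 1
            self.count[tag] = 0
            return "hit"
        if len(self.count) < 4:            # miss, set not full
            for b in self.count:
                self.count[b] += 1
            self.count[tag] = 0
            return "miss"
        victim = max(self.count, key=self.count.get)   # counter == 3
        del self.count[victim]
        for b in self.count:
            self.count[b] += 1
        self.count[tag] = 0
        return "miss, replaced " + victim

s = LRUSet()
for t in "ABCD":
    s.access(t)
print(s.access("E"))   # miss, replaced A  (A is least recently used)
```

After the replacement the counters read B = 3, C = 2, D = 1, E = 0, so B is the next candidate for eviction.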

LRU – WORST CASE:

When accesses are made to sequential elements of an array that is slightly too large to fit in the cache.

EXAMPLE OF MAPPING FUNCTIONS (VERY IMPORTANT )

EXAMPLE: A cache has only eight blocks of data.

Each block consists of only one word of data. Memory is word addressable with 16 bit address.

LRU replacement algorithm is used for block replacement.

PROBLEM:

A two-dimensional array (4 × 10) of numbers, each occupying one word, is stored in main memory locations 7A00 through 7A27. The elements of the array are stored in column order.

[pic]

The following code is operated on these data

SUM := 0

for j := 0 to 9 do

    SUM := SUM + A(0, j) ______________ LOOP (1)

end

AVE := SUM / 10

for i := 9 down to 0 do ______________ LOOP (2)

    A(0, i) := A(0, i) / AVE

end

Direct Mapping

[pic]

Block j of Main Memory is mapped to j mod 8 of cache Memory.

Thus, during the first pass of loop 1 (j = 0), A(0,0), stored at location 7A00, is mapped to cache block 7A00 mod 8 = 0.

During the second pass (j = 1), Sum = Sum + A(0,1). A(0,1) is stored at location 7A04 (the array is in column order, four rows per column), and 7A04 mod 8 = 4, so it is mapped to block 4 of the cache.

The Same procedure is repeated for loop 2.
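The placement of the whole row A(0, 0..9) under this mapping can be tabulated with a small helper (illustrative, one word per block, 8 cache blocks). Because consecutive row elements are 4 words apart, they all land in cache blocks 0 and 4 only, which shows the contention of the direct-mapped scheme:

```python
def cache_block(word_addr):
    # Direct mapping, one word per block, 8 blocks:
    # main-memory block j goes to cache block j mod 8.
    return word_addr % 8

# A(0, j) lives at 0x7A00 + 4*j (column order, four rows per column)
placement = [cache_block(0x7A00 + 4 * j) for j in range(10)]
print(placement)  # [0, 4, 0, 4, 0, 4, 0, 4, 0, 4]
```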

Associative-Mapped Cache

[pic]

During the first 8 passes through the first loop (j = 0 to 7), the elements are brought into consecutive cache blocks.

During the ninth pass (j = 8), the LRU algorithm selects the least recently used block, A(0,0), and replaces it with A(0,8).

Now considering Loop 2,

When i = 1 (i.e., during the ninth pass, after completing i = 9, 8, 7, 6, 5, 4, 3, 2), the least recently used block is A(0,9). This is replaced by A(0,1).

The Same procedure is repeated for i=0.

Set-Associative Mapped Cache

[pic]

Here the data cache is divided into 2 sets, each capable of holding 4 blocks. The blocks having even addresses are mapped into set 0 of the cache. Within set 0, they can occupy any one of the four blocks.

Hence, in general, associative mapping performs best, set-associative mapping is next best, and direct mapping is the worst.

However, associative mapping is expensive to implement, so set-associative mapping is normally used.

Unified Cache:

The same cache memory is used for storing both data and instructions.

Split Cache:

Separate data cache and instruction cache are used.

PERFORMANCE CONSIDERATIONS:

Performance depends on how fast machine instructions can be brought into the processor for execution and how fast they can be executed.

Interleaving:

The memory is structured as a collection of physically separate modules. Each module has its own Address Buffer Register (ABR) and Data Buffer Register (DBR). Memory access operations may proceed in more than one module at the same time.

There are 2 cases in structuring these modules;

Case 1:

[pic]

The consecutive words are placed in a single module. Hence, to access them, only one module is to be accessed. All the other modules remain free.

The higher order bits of the main memory address refer to the module number and the lower order bits refer to the word address in the module.

Case 2:

[pic]

Consecutive words are placed in the consecutive modules. To access these words, access is made to several modules at the same time. This leads to faster access. This technique is called Memory Interleaving.

Here the higher order bits refer to the word address in the module and the lower order bits refer to module number.

Effect of Interleaving - Example

A cache whose blocks contain 8 words is used. During a cache miss, it takes one clock cycle to send an address to the main memory. The first word from the memory is accessed in 8 cycles and each subsequent word is accessed in 4 cycles. One further clock cycle is needed to send a word to the cache. When a single memory module is used (Case 1), the time needed to load the desired block into the cache is

1 + 8 + (7 × 4) + 1 = 38 cycles

(1 cycle to send the address, 8 cycles for the first word, 7 × 4 cycles for the remaining seven words, and 1 cycle to send the last word to the cache.)

Now, using interleaved memory with 4 modules, the time taken to load the desired block into the cache is

1 + 8 + 4 + 4 = 17 cycles

The first cycle is used to send the address to the main memory.

During the first 8 cycles, the 4 modules are accessed and 4 words are fetched simultaneously.

Then, 4 more cycles are needed to access one word from each module (4 words in total) simultaneously. While these words are being fetched, the previously fetched 4 words are sent to the cache.

4 more cycles are needed to send the remaining 4 words to the cache.

[In the previous case, when a word is being fetched, the previously fetched word is sent to cache. Thus, it requires only one extra cycle to send the last fetched word to the cache.]
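Both cycle counts can be reproduced directly from the timing parameters given above (function names are illustrative):

```python
def single_module_cycles(block_words=8, first=8, rest=4):
    # 1 cycle to send the address + `first` cycles for the first word
    # + `rest` cycles for each of the remaining words
    # + 1 cycle to send the last word to the cache
    return 1 + first + (block_words - 1) * rest + 1

def four_way_interleaved_cycles():
    # 1 (address) + 8 (first four words, fetched in parallel)
    # + 4 (next four words, while the first four go to the cache)
    # + 4 (send the remaining four words to the cache)
    return 1 + 8 + 4 + 4

print(single_module_cycles(), four_way_interleaved_cycles())  # 38 17
```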

Hit Rate And Miss Penalty:

Hit : Successful access to the data in a cache.

Hit Rate : Number of hits stated as a fraction of all attempted accesses.

Miss Rate : Number of misses stated as a fraction of attempted accesses.

Miss Penalty : The extra time needed to bring the desired information into the cache.

Example:

Let h be the hit rate

M be the miss penalty

C be the time to access the information in the cache.

The average access time is given by,

t ave = hC + (1- h) M.
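As a quick numerical check of this formula, with h = 0.95, C = 1 cycle, and M = 17 cycles (illustrative numbers, not from the text):

```python
def t_ave(h, C, M):
    # average access time: hits cost C, misses cost the penalty M
    return h * C + (1 - h) * M

print(round(t_ave(0.95, 1, 17), 3))  # 1.8
```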

Cache on Processor Chips:

In high performance processors, two cache levels are normally used. The L1 cache is fabricated on the processor chip and the L2 cache is implemented externally. L1 is fast and small; L2 is slower and larger.

The average access time with two levels of cache is

t ave = h1C1+ (1- h1) h2C2 + (1- h1)(1- h2)M

h1 -> hit rate of L1 cache

h2 -> hit rate of L2 cache

C1 -> time to access information in the L1 cache

C2 -> time to access information in the L2 cache

M -> time to access information in the main memory.

Fraction of accesses that miss in both caches = (1 - h1)(1 - h2).
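The two-level formula can be checked numerically as well (all numbers here are illustrative, not from the text):

```python
def t_ave2(h1, h2, C1, C2, M):
    # two-level version of the average access time formula above
    return h1 * C1 + (1 - h1) * h2 * C2 + (1 - h1) * (1 - h2) * M

print(round(t_ave2(0.95, 0.9, 1, 10, 100), 3))  # 1.9
```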

Other Enhancements

Write Buffer:

When a write through protocol is used, each operation results in writing a new value into the main memory. If the processor must wait for the memory function to be completed, the processor is slowed down.

To improve performance, a write buffer can be included for temporary storage of write requests.

The processor places each write request into the buffer and continues execution of the next instruction. When a write-back protocol is used, an existing block that is dirty must be written back to the main memory in the case of a read miss. If the required write-back is performed first, the processor would be slowed down. This can be avoided by providing a fast write buffer for temporary storage of the dirty block that is ejected from the cache while the new block is being read.

Prefetching:

To avoid stalling the processor, we can prefetch the data into the cache before they are needed. A special prefetch instruction may be provided in the instruction set of the processor. Prefetch instructions can be inserted into a program either by the programmer or by the compiler.

Prefetching can also be done through hardware.

Lockup Free Cache:

It is desirable to design a cache in such a way that it can support more than one outstanding miss. A cache that can support multiple outstanding misses is called lockup-free. Since it can service several misses concurrently, it must include circuitry that keeps track of all outstanding misses.

VIRTUAL MEMORIES:

In modern computers, the physical main memory is not as large as the address space spanned by the processor.

For example, a processor that issues 32-bit addresses has an addressable space of 4G bytes, while the size of the main memory typically ranges from 100 MB to 1 GB.

When a program does not completely fit into the main memory, the parts of it not currently being executed are stored on secondary storage device.

For executing a program, it must be brought into the main memory. When a new segment of a program is to be moved into a full memory, it must replace another segment already in the memory. All of this is taken care of by the operating system.

VIRTUAL MEMORY TECHNIQUE:

A technique that automatically moves program and data blocks into the physical main memory when they are required for execution.

Virtual / Logical Address:

Binary addresses that the processor issues for either instructions or data. These addresses are translated into physical addresses by a combination of hardware and software components.

[pic]

Memory Management Unit (MMU):

A special hardware unit that translates virtual addresses into physical address.

Address Translation:

It is assumed that all programs and data are composed of fixed length units called pages, each of which consists of a block of words that occupy contiguous locations in the main memory.

Pages commonly range from 2K to 16K bytes in length.

Pages should not be too small because the access time of a magnetic disk is longer than the access time of the main memory. If pages are too large, a substantial portion of a page may not be used and occupies valuable space in the main memory.

[pic]

A virtual address translation method is based on the concept of fixed-length pages.

Each virtual address generated by the processor is interpreted as a virtual page number followed by an offset that specifies the location of a particular byte within a page.

Information about the main memory location of each page is kept in a page table.

An area in the main memory that can hold one page is called a page frame. The starting address of the page table is kept in a page table base register.

By adding the virtual page number to the contents of this register, the address of the corresponding entry in the page table is obtained. The contents of this location give the starting address of the page if that page currently resides in the main memory. Each entry in the page table also includes some control bits that describe the status of the page.
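A minimal sketch of this translation, assuming 4K-byte pages and a toy page table (the sizes, table contents, and names are illustrative, not from the text):

```python
PAGE_BITS = 12            # assume 4K-byte pages (illustrative)

# toy page table: virtual page number -> page frame number
page_table = {0: 5, 1: 2, 7: 0}

def translate(vaddr):
    # split the virtual address into page number and offset,
    # look up the frame, and append the unchanged offset
    vpn = vaddr >> PAGE_BITS
    offset = vaddr & ((1 << PAGE_BITS) - 1)
    if vpn not in page_table:
        raise LookupError("page fault")   # page not in main memory
    return (page_table[vpn] << PAGE_BITS) | offset

print(hex(translate(0x1234)))  # 0x2234  (page 1 -> frame 2)
```

A reference to a virtual page with no table entry raises the "page fault" error, mirroring the page-fault case discussed below.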

Translation Look Aside Buffer (TLB):

The page table information is used by the MMU for every read and write access, so ideally the page table should be situated within the MMU. However, the page table may be large, and it is impossible to place the entire table in the MMU.

Hence, the page table is kept in the main memory and a copy of small portion of the page table can be accommodated within the MMU. (Similar to cache).

This portion consists of the page table entries that correspond to the most recently accessed pages.

A small cache, called the TLB, is incorporated into the MMU for this purpose.

The operation of the TLB is essentially the same as that of a cache. The contents of the TLB must be kept coherent with the contents of the page tables in the memory.

Given a virtual address, the MMU looks in the TLB for the referenced page. If the page table entry for this page is found in the TLB, the physical address is obtained immediately. Otherwise, the entry is obtained from the page table in the main memory and the TLB is updated.

Page Fault: When a program generates an access request to a page that is not in the main memory, a page fault occurs.

If a new page is brought from the disk when the main memory is full, it must replace one of the resident pages. Any replacement policy (eg: LRU) can be applied for this.

MEMORY MANAGEMENT REQUIREMENTS:

• The operating system routines are arranged into a virtual address space called System Space.

• User application programs reside in a separate virtual address space called User Space.

A processor has 2 states namely,

➢ Supervisor state

➢ User state

Supervisor State:

State of the processor when the OS routines are being executed.

User State:

Processor state to execute user programs.

The instructions that cannot be executed in the user state are called privileged instructions.

[Secondary storage devices -> Learn from any general book].

UNIT – V

INPUT- OUTPUT ORGANIZATION

• Memory mapped I/O

• Programmed I/O

Memory Mapped I/O :

I/O devices and memory share the same address space (i.e.) instructions like ‘MOVE’ treat I/O buffers (DATAIN & DATAOUT) as memory locations

Here to transfer data from DATAIN to Register R0, the following instruction can be used

• Move DATAIN,R0

Programmed I/O:

I/O devices and memory have different address spaces. Hence to transfer from DATAIN or to DATAOUT, special instructions such as ‘In’ and ‘Out’ are used.

Here the processor repeatedly checks a status flag to achieve synchronization

between the processor and an I/O device. This process of checking is called “Polling”.
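A toy sketch of such a polling loop; the device model and its names are hypothetical stand-ins for the DATAIN buffer and its status flag:

```python
class ToyDevice:
    """Hypothetical device model: a DATAIN buffer plus a status flag."""
    def __init__(self, data):
        self._data = list(data)
    @property
    def ready(self):              # the status flag being polled
        return bool(self._data)
    def read(self):               # read one character from DATAIN
        return self._data.pop(0)

def poll_read(dev):
    while not dev.ready:          # busy-wait on the status flag:
        pass                      # this loop is the "polling"
    return dev.read()

dev = ToyDevice("hi")
print(poll_read(dev) + poll_read(dev))  # hi
```

The busy-wait wastes processor time whenever the device is slow, which is the weakness that interrupts (next section) address.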

Interrupts:

An I/O device can communicate with a processor by raising interrupts. These Interrupts alert the processor that the I/O device is ready. The I/O device can interrupt by sending a hardware signal called ‘Interrupt’. At least one of the bus control lines called ‘Interrupt- Request’ line is dedicated for this purpose.

Processor executes a routine (function) in response to an interrupt. This routine is called Interrupt Service Routine (ISR).

As a part of handling interrupts, the processor must inform the device that its request has been recognized so that it may remove its interrupt-request signal. This is done by a special signal called interrupt – acknowledge signal.

The treatment of an ISR is very similar to that of a subroutine. Thus, before starting execution of the ISR, any information that may be altered during the execution of the routine must be saved. This information must be restored before execution of the interrupted program is resumed. Saving this information increases the delay between the time an interrupt request is received and the start of execution of the ISR. This delay is called interrupt latency. Typically, the processor saves only the contents of the program counter and the processor status register.

Interrupt Hardware:

[pic]

A single interrupt request line may be used to serve n devices. All devices are connected to the line via switches to ground. If all the interrupt-request signals INTR1 to INTRn are inactive (i.e., if all switches are open), the voltage on the interrupt-request line is Vdd. This corresponds to the inactive state of the line. When a device requests an interrupt by closing its switch, the voltage on the line drops to 0, causing the INTR signal received by the processor to go to 1. Since the closing of one or more switches will cause the line voltage to drop to 0, the value of INTR is the logical OR of the requests from individual devices:

INTR=INTR1+…. +INTRn

Resistor R is called a pull-up resistor because it pulls the line voltage up to the high-voltage state when the switches are open.

Enabling & Disabling Interrupts:

The arrival of an Interrupt request from an external device causes the processor to suspend the execution of one program and start the other. Because interrupts can arrive at any time, they may alter the sequence of events. Hence the processor should be provided with the facility to enable and disable interrupts. There are many possibilities

• To have the processor hardware ignore the interrupt-request line until the execution of the first instruction of the ISR has been completed. The interrupt-enable instruction will then be the last instruction in the ISR before the return from interrupt.

• To have the processor automatically disable interrupts before starting the execution of the ISR. A bit in the processor status register (PS), called Interrupt-enable, indicates whether interrupts are enabled.

• The processor has a special interrupt request line for which the interrupt handling circuit responds only to the leading edge of the signal. Such a line is called edge triggered.

The sequence of events involved in handling an interrupt request from a single device is,

o The device raises an interrupt request.

o The processor interrupts the program currently being executed.

o Interrupts are disabled by changing the control bits in the PS.

o The device is informed that its request has been recognized and in response it deactivates the interrupt request signal.

o The action requested by the interrupt is performed by the ISR.

o Interrupts are enabled and execution of the interrupted program is resumed.

Handling Multiple Devices :

When a request is received over the common interrupt-request line, additional information is needed to identify the particular device that activated the line. If two devices have activated the line at the same time, it must be possible to break the tie and select one of the two requests for service. When the ISR for the selected device has been completed, the second request can be serviced. The information needed to determine whether a device is requesting an interrupt is available in its status register, in a bit called the IRQ bit.

Polling:

The simplest way to identify the interrupting device is to have the ISR poll all the I/O devices connected to the bus. This is easy to implement, but time is wasted interrogating the IRQ bits of devices that are not requesting any service.

Vectored Interrupts:

Here, the device requesting an interrupt may identify itself directly to the processor. Then the processor can immediately start executing the corresponding ISR. All the interrupt handling schemes based on this approach are called Vectored Interrupts.

A device requesting an interrupt can identify itself by sending a special code to the processor. The code supplied by the device may represent the starting address of the ISR for that device. This address is called the interrupt vector.

Interrupt Nesting:

[pic]

I/O devices should be organized in a priority structure. An interrupt request from a high priority device should be accepted while the processor is servicing another request from a lower priority device.

A multiple level priority organization means that during execution of an ISR interrupt requests will be accepted from some devices but not from others, depending upon device’s priority. The processor accepts interrupts only from devices that have priorities higher than its own.

The processor’s priority is encoded in a few bits of the processor status word. A multiple priority scheme can be implemented by using separate interrupt request and interrupt acknowledge lines for each device. Interrupt requests received over these lines are sent to a priority arbitration circuit in the processor. A request is accepted only if it has a higher priority level than that currently assigned to processor.

Simultaneous Requests:

When there are simultaneous interrupt requests from two or more devices, the processor must decide which request to service first. A widely used scheme is to connect the devices to form a daisy chain. The interrupt-request line INTR is common to all devices. The interrupt-acknowledge line INTA is connected in a daisy-chain fashion, such that the INTA signal propagates serially through the devices. When several devices raise an interrupt request and the INTR line is activated, the processor responds by setting the INTA line to 1. Device 1 passes the signal on to device 2 only if it does not require any service. If device 1 has a pending request for interrupt, it blocks the INTA signal and proceeds to put its identifying code on the data lines. Thus, the device that is electrically closest to the processor has the highest priority.
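The daisy-chain selection can be sketched as a toy model (not hardware; the function name is illustrative):

```python
def daisy_chain_grant(requests):
    # `requests[i]` is True if device i+1 has a pending interrupt.
    # INTA enters at device 1 and is blocked by the first device
    # with a pending request, so lower-numbered (electrically
    # closer) devices have higher priority.
    for i, pending in enumerate(requests):
        if pending:
            return i + 1          # this device keeps INTA
    return None                   # no device requested service

print(daisy_chain_grant([False, True, True]))  # 2
```

With devices 2 and 3 both requesting, device 2 wins because it sees (and blocks) the acknowledge signal first.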

[pic]

The daisy chain arrangement and the priority scheme can be combined to produce a more general structure(fig 4.8(b)). Devices are organized into groups and each group is connected at a different priority level. Within a group, devices are connected in a daisy chain. [pic]

Controlling Device Requests:

It is important to ensure that interrupt requests are generated only by those I/O devices that are being used by a given program. Idle devices must not be allowed to generate interrupt requests, even though they may be ready to participate in I/O transfer operations. Hence a mechanism is needed to control whether a device is allowed to generate an interrupt request. This is usually provided in the form of an interrupt enable bit in the device’s interface circuit.

Eg: KEN - Keyboard Interrupt Enable

DEN - Display Interrupt Enable

Thus there are 2 independent mechanisms for controlling interrupt requests. At the device end, an interrupt – enable bit in a control register determines whether the device is allowed to generate an interrupt request. At the processor end, either an interrupt enable bit in the PS register or a priority structure determines whether a given interrupt request will be accepted.

Exceptions:

Exception refers to any event that causes an interruption. Thus I/O interrupts are example of an exception. Other kinds of Exceptions are,

• Recovery from Errors

• Debugging

• Privilege Exception

Recovery From Errors:

The processor interrupts a program if it detects an error/unusual condition while executing the instructions of a program. It suspends the program being executed and starts an exception service routine.

Debugging:

• Trace

• Breakpoints

Trace:

In trace mode, an exception occurs after execution of every instruction using debugging program as the exception service routine.

Breakpoints:

Breakpoints provide a similar facility except that the program being debugged is interrupted only at specific points selected by the user.

An instruction called Trap or software interrupt is provided for this purpose.

Privilege Exception:

Certain instructions can be executed only while the processor is in supervisor mode. These are called privileged instructions.

An attempt to execute such an instruction in user mode will produce a privilege exception.

DIRECT MEMORY ACCESS:

To transfer large blocks of data at high speed, Direct Memory Access is used. A special control unit may be provided to allow transfer of a block of data directly between an external device and the main memory.

DMA transfers are performed by a control circuit that is part of the I/O device interface. This circuit is called DMA controller. The DMA controller performs the functions that would normally be carried out by the processor when accessing the memory. For each word transferred, it provides the memory address and all of the bus signals that control data transfer.

To initiate the transfer of a block of words, the processor sends the starting address, the number of words in the block and direction of the transfer. On receiving this information, the DMA controller proceeds to perform the requested operation. When the entire block has been transferred, the controller informs the processor by raising an interrupt signal.

While a DMA transfer is taking place, the program that requested the transfer cannot continue, and the processor can be used to execute another program. After the DMA transfer is completed, the processor can return to the program that requested the transfer.

Thus, for an I/O operation involving DMA, the OS puts the program that requested the transfer in the Blocked state, initiates the DMA operation, and starts the execution of another program. When the transfer is completed, the DMA controller informs the processor by sending an interrupt request. In response, the OS puts the suspended program in the Runnable state so that it can be selected by the scheduler to continue execution.

[pic]

Two registers are used for storing the starting address and the word count. The third register contains the status and control flags. The R/W bit determines the direction of the transfer. When the controller has completed transferring a block of data and is ready to receive another command, it sets the Done flag to 1. Bit 30 is the Interrupt Enable flag (IE).

To start a DMA transfer of a block of data from the main memory to one of the disks, a program writes the address and word count information into the registers of the corresponding channel of the disk controller. When the DMA transfer is completed, this fact is recorded in the status and control register by setting the Done bit. If the IE bit is set, the controller also sends an interrupt request to the processor and sets the IRQ bit.
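A bit-level sketch of this status/control register behaviour; only IE's position (bit 30) is stated in the text, so the other bit positions are assumptions for illustration:

```python
RW   = 1 << 0     # direction of transfer (assumed bit position)
DONE = 1 << 1     # block transfer complete (assumed bit position)
IE   = 1 << 30    # interrupt enable (bit 30, as stated in the text)
IRQ  = 1 << 31    # interrupt request pending (assumed bit position)

def finish_transfer(status):
    # the controller sets Done; if IE is also set, it raises IRQ
    status |= DONE
    if status & IE:
        status |= IRQ
    return status

s = finish_transfer(IE)
print(bool(s & DONE), bool(s & IRQ))  # True True
```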

Since the processor originates most memory access cycles, the DMA Controller can be said to steal memory cycles from processor. This is called Cycle stealing.

Alternatively, the DMA controller may be given exclusive access to the main memory to transfer a block of data without interruption. This is known as block/burst mode.

BUS ARBITRATION:

A conflict may arise if both the processor and a DMA controller, or two DMA controllers, try to use the bus at the same time to access the main memory. To resolve these conflicts, an arbitration procedure is implemented on the bus to coordinate the activities of all devices requesting memory transfers.

BUS MASTER:

The device that is allowed to initiate data transfers on the bus at any given time.

There are 2 approaches

• Centralized Arbitration

• Distributed Arbitration

CENTRALISED ARBITRATION:

Here a single bus arbiter performs the required arbitration. The arbiter may be the processor or a separate unit connected to the bus.

[pic]

The processor is normally the bus master unless it grants bus mastership to one of the DMA controllers. A DMA controller indicates that it needs to become the bus master by activating the Bus-Request line (BR). In response, the processor activates the Bus-Grant signal, BG1, indicating to the DMA controllers that they may use the bus when it becomes free. This signal is connected to all DMA controllers in a daisy-chain arrangement. Hence, after receiving the Bus-Grant signal, a DMA controller waits for Bus-Busy to become inactive and then assumes mastership of the bus. At that time, it activates Bus-Busy to prevent other devices from using the bus at the same time.

[pic]

Timing Diagram:

DMA controller 2 requests and acquires bus mastership and later releases the bus. During its tenure as bus master, it may perform one or more data transfer operations, depending on whether it is operating in cycle-stealing or block mode. After it releases the bus, the processor resumes bus mastership.

DISTRIBUTED ARBITRATION:

Here, all devices participate in the selection of the next bus master.

[pic]

Each device on the bus is assigned a 4 bit identification number. When one or more devices request the bus, they assert the Start-Arbitration signal and place their 4 bit ID numbers on the lines ARB0 through ARB3.

A winner is selected as a result of the interaction among the signals transmitted over these lines.

For example, suppose two devices, A and B, having ID numbers 5 and 6, are requesting the use of the bus. Device A transmits the pattern 0101 and device B transmits the pattern 0110.

|      |A |B |OR |
|ARB3 |0 |0 |0 |
|ARB2 |1 |1 |1 |
|ARB1 |0 |1 |1 |
|ARB0 |1 |0 |1 |

The OR of the two patterns is 0111. Each device compares this pattern with its own code, bit by bit from the most significant end. Device A finds a mismatch at ARB1:

|      |A |OR |
|ARB3 |0 |0 | same
|ARB2 |1 |1 | same
|ARB1 |0 |1 | not same

When a device detects a mismatch, it changes the bit at that position and all lower order bits of its driven code to zero.

Thus, A's code 0101 becomes 0100 (ARB1 and ARB0 are driven as zero). This new value is again placed on the lines.

|      |A |B |OR |
|ARB3 |0 |0 |0 |
|ARB2 |1 |1 |1 |
|ARB1 |0 |1 |1 |
|ARB0 |0 |0 |0 |

The OR pattern is now 0110, which matches B's code.

Thus, B is the winner and is given ownership of the bus.
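The whole arbitration sequence can be simulated as a sketch, assuming 4-bit IDs and an open-collector (wired-OR) bus; the function name and iteration scheme are illustrative:

```python
def arbitrate(ids, width=4):
    # Each device drives its ID onto the open-collector lines (wired
    # OR). Scanning from the most significant bit, a device that sees
    # a 1 on a line where it drives 0 clears that bit and all lower
    # bits of its driven pattern. Repeat until stable; the surviving
    # pattern is the winner's ID (the highest requesting ID).
    drive = {d: d for d in ids}
    while True:
        lines = 0
        for v in drive.values():
            lines |= v
        changed = False
        for d, v in drive.items():
            for bit in range(width - 1, -1, -1):
                mask = 1 << bit
                if (lines & mask) and not (v & mask):
                    new = v & ~((mask << 1) - 1)   # clear bit and below
                    if new != v:
                        drive[d] = new
                        changed = True
                    break
        if not changed:
            return lines

print(arbitrate([0b0101, 0b0110]))  # 6  (device B, ID 0110, wins)
```

Running it with IDs 5 and 6 reproduces the worked example: device A backs off at ARB1 and the lines settle at B's code.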

BUSES:

The processor, main memory and I/O devices are interconnected by means of a bus whose primary function is to provide a communication path for the transfer of data. The bus includes the lines needed to support interrupts and arbitration.

A bus protocol is a set of rules that govern the behavior of various devices connected to the bus,

The bus lines are of three types

• Data

• Address

• Control

The bus control signals also carry timing information. A variety of schemes have been devised for the timing of the data transfers over a bus.

These are classified as

• Synchronous.

• Asynchronous.

Master:

Device that initiates data transfer by issuing read /write commands on the bus. It is also called as initiator. Usually the processor is the master.

Slave/Target:

The device addressed by the master.

Synchronous bus:

In synchronous bus, all devices derive timing information from a common clock line. Equally spaced pulses on this line define equal time intervals. Each of these intervals constitutes a bus cycle, during which one data transfer can take place.

Timing diagram:

Input operation: (fig 4.23)

[pic]

1. During t0, the master places the device address on the address lines and sends an appropriate command on the control lines. The clock pulse width, t1 - t0, must be longer than the maximum propagation delay between two devices connected to the bus.

2. The slave responds at time t1. It is important that slaves take no action and place no data on the bus before t1. The information on the bus is unreliable during the period t0 to t1 because signals are changing state.

3. At the end of t2, master captures the data on the data lines into its input buffer.

Output operation: (fig 4.24) (same as input operation)

The master places the output data on the data lines when it transmits the address and command information. At time t2, the addressed device captures the data lines and loads the data into its data buffer.

Multiple Cycle Transfers:

These are some limitations in the above design.

1. The clock period must be chosen to accommodate the longest delay on the bus and the slowest device interface. This forces all devices to operate at the speed of the slowest device.

2. The processor has no way of determining whether the addressed device has actually responded. To overcome this limitation, most buses incorporate signals that represent a response from the device. Also a high frequency clock signal is used such that a complete data transfer cycle would span several clock cycles.

Timing diagram: (fig 4.25)

[pic]

1. During cycle1, the master sends address and command information on the bus, requesting a read operation. The slave receives this information and decodes it.

2. During cycle2, the slave makes a decision to respond and begins to access the requested data. It is assumed that some delay is involved in getting the data, and hence the slave cannot respond immediately.

3. The data become ready and are placed on the bus in clock cycle 3. At the same time, the slave asserts a control signal called Slave-ready.

The master, which has been waiting for this signal, strobes (captures) the data into its input buffer at the end of clock cycle3.

Thus, the slave-ready signal is an acknowledgement from the slave to the master, confirming that valid data have been sent.

If the addressed device does not respond at all, the master waits for some predefined maximum number of clock cycles and then aborts the operation.

Asynchronous Bus:

Under this scheme, handshaking is used. The common clock is replaced by two timing control lines, Master-ready and Slave-ready.

Procedure

The master places the address and command information on the bus. Then it indicates to all devices that it has done so by activating the Master- Ready line. This causes all devices on the bus to decode the address. The selected slave performs the required operation and informs the processor by activating the slave ready line. The master waits for Slave-Ready to become asserted before it removes its signals from the bus. In the case of a read operation it also strobes the data into its input buffer.

Timing Diagram:

[pic]

1. To - The master places the address and command information on the bus, and all devices on the bus begin to decode this information.

2. T1 - The master sets the Master ready line to 1 to inform the I/O devices that the address and command information is ready. A skew (delay) may occur on the bus. Skew occurs when two signals simultaneously transmitted from one source arrive at the destination at different times.

3. T2 - The selected slave, having decoded the address and command information, performs the required input operation by placing the data from its data register on the data lines. At the same time, it sets the Slave-ready signal to 1.

4. T3 - The Slave-ready signal arrives at the master, indicating that the input data are available on the bus.

5. T4 - The master removes the address and command information from the bus.

6. T5 - When the device interface receives the 1-to-0 transition of the Master-ready signal, it removes the data and the Slave-ready signal from the bus.
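The T0-T5 sequence above can be traced in software. This is an event-ordered sketch, not a timing-accurate model; signal names follow the text, everything else is illustrative:

```python
# Event trace of the Master-ready / Slave-ready asynchronous handshake (T0-T5).

def async_read(device, address):
    trace = []
    bus = {"addr": address, "master_ready": False, "slave_ready": False}
    trace.append("T0: master places address/command on bus")
    bus["master_ready"] = True
    trace.append("T1: Master-ready = 1; devices decode the address")
    data = device[address]               # selected slave fetches the data
    bus["slave_ready"] = True
    trace.append("T2: slave drives data lines, Slave-ready = 1")
    trace.append("T3: master strobes data into its input buffer")
    bus["addr"] = None
    bus["master_ready"] = False
    trace.append("T4: master removes address and Master-ready")
    bus["slave_ready"] = False
    trace.append("T5: slave removes data and Slave-ready")
    return data, trace

data, trace = async_read({0x20: 0x7F}, 0x20)
print(hex(data), len(trace))   # 0x7f 6
```

The interlocking is visible in the ordering: each party changes its signal only after observing the other's previous change, which is exactly the full-handshake property described next.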

Timing diagram – I/O operation (fig. 4.27)

[pic]

Full Handshake: The two handshaking Signals are fully interlocked. A change of state in one signal is followed by a change in the other signal.

|Synchronous Bus                            |Asynchronous Bus                                    |
|Faster data transfer rates can be achieved.|Handshaking eliminates the need for synchronization |
|                                           |of the sender and receiver clocks; thus, timing     |
|                                           |design is simplified.                               |

INTERFACE CIRCUITS

An I/O interface consists of the circuitry required to connect an I/O device to a computer bus. On one side of the interface, we have the bus signals for address, data and control. The other side is called a port, which has a data path with its associated controls to transfer data between the interface and the I/O device. There are two types of ports:

• Serial Port

• Parallel port

Serial port: Transmits data one bit at a time.

Parallel port: Transfers data in the form of a number of bits (e.g., 8 or 16) simultaneously to or from the device.

In the case of a parallel port, the connection between the device and the computer uses a multiple pin connector and a cable with as many wires. This is suitable for devices that are physically close to the computer.

The serial format is more convenient and cost-effective where longer cables are needed.

• An I/O interface does the following:

• Provides a storage buffer.

• Contains status flags that can be accessed by the processor.

• Contains address decoding circuitry.

• Generates appropriate timing signals.

• Performs any format conversion.
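The functions listed above can be sketched as a minimal software model of an input interface: a buffered data register (DATAIN), a status flag (SIN), and address decoding on bit A0. The register layout and method names are illustrative assumptions:

```python
# Minimal model of an input interface: DATAIN register, SIN status flag,
# and A0-based selection (A0 = 1 reads status, A0 = 0 reads data).

class InputInterface:
    def __init__(self):
        self.datain = 0      # buffered character from the device
        self.sin = 0         # status flag: 1 when new data is available

    def device_write(self, char):       # device side: a new character arrives
        self.datain = char
        self.sin = 1

    def bus_read(self, a0):             # processor side
        if a0 == 1:                     # read status register (SIN in bit 0)
            return self.sin
        self.sin = 0                    # reading data clears the status flag
        return self.datain

kbd = InputInterface()
kbd.device_write(ord('A'))
while kbd.bus_read(a0=1) == 0:          # poll SIN until data is ready
    pass
print(chr(kbd.bus_read(a0=0)))          # A
```

The polling loop mirrors program-controlled I/O: the processor reads the status register until SIN becomes 1, then reads the data register, which clears SIN.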

Parallel port (fig. 4.28)

Connecting a Keyboard to a Processor:

[pic]

A keyboard consists of mechanical switches that are normally open. When a key is pressed, its switch closes and establishes a path for an electrical signal. This signal is detected by an encoder circuit that generates the ASCII code for the corresponding character.

To avoid key bouncing, debouncing circuits are included as part of the encoder block. This information is sent to the interface circuit, which contains the data register DATAIN and a status flag, SIN. Transfers through the interface are controlled using the handshake signals Master-ready and Slave-ready. The control line R/W distinguishes read and write transfers.

Circuit design (fig. 4.29)

[pic]

The output lines of the DATAIN register are connected to the data lines of the bus by means of three-state drivers, which are turned on when the processor issues a read instruction with the address that selects this register.

The SIN signal is generated by a status flag circuit. This signal is also sent to the bus through a three-state driver. It is connected to bit D0, so that it appears as bit 0 of the status register. An address decoder is used to select the input interface when the high-order 31 bits of an address correspond to any of the addresses assigned to this interface.

Address bit A0 determines whether the status or the data register is to be read when the Master-ready signal is active. The control handshake is accomplished by activating Slave-ready when either the Read-status or Read-data signal is equal to 1.

Output interface (fig. 4.31)

[pic]

The printer operates under control of the handshake signals Valid and Idle, in a manner similar to the handshake used on the bus with the Master-ready and Slave-ready signals. When it is ready to accept a character, the printer asserts its Idle signal. The interface circuit can then place a new character on the data lines and activate the Valid signal. In response, the printer starts printing the new character and negates the Idle signal, which causes the interface to deactivate the Valid signal.
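The Valid/Idle exchange can be sketched as two cooperating objects. This is an illustrative event model, not a timing-accurate one; class and method names are assumptions:

```python
# Toy model of the Valid/Idle handshake between an output interface and a printer.

class Printer:
    def __init__(self):
        self.idle = True     # asserted when ready to accept a character
        self.printed = []

    def see_valid(self, char):          # interface asserted Valid with a char
        self.idle = False               # start printing: negate Idle
        self.printed.append(char)

    def done(self):
        self.idle = True                # printing finished: reassert Idle

class OutputInterface:
    def __init__(self, printer):
        self.printer = printer

    def send(self, char):
        assert self.printer.idle        # wait until the printer asserts Idle
        self.printer.see_valid(char)    # place char on data lines, assert Valid
        # the printer negated Idle, so the interface deactivates Valid here
        self.printer.done()             # (printing completes for this sketch)

p = Printer()
out = OutputInterface(p)
for c in "OK":
    out.send(c)
print("".join(p.printed))   # OK
```

Each character passes through one full Idle → Valid → not-Idle → not-Valid cycle, matching the prose above.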

Fig. 4.32 ---> same as the input interface circuit, except that:

[pic]

DATAIN is replaced by DATAOUT.

SIN is replaced by SOUT.

[pic]

The input and output interfaces can be combined into a single circuit. This is shown in fig. 4.33. Eight lines are used for input and eight lines are used for output.

A general 8 bit parallel interface (fig. 4.34)

[pic]

A more flexible parallel port is created if the data lines to the I/O devices are bidirectional. The direction of each line is set in the Data Direction Register (DDR).

Fig. 4.35 - A timing logic block is introduced to generate the Load-data and Read-status signals. Initially the machine is in the idle state. When the output of the address decoder, My-address, shows that this interface is being addressed, the machine changes state to respond. As a result, it asserts Go, which in turn asserts either Load-data or Read-status, depending on address bit A0 and the state of the R/W line.
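The DDR mechanism can be sketched as follows: each DDR bit set to 1 makes the corresponding data line an output, and 0 makes it an input. The register names and the reset state are illustrative assumptions:

```python
# Toy model of a DDR-controlled bidirectional 8-bit parallel port.

class BidirPort:
    def __init__(self):
        self.ddr = 0x00      # all lines are inputs after reset (assumed)
        self.out_latch = 0x00
        self.pins_in = 0x00  # value driven by the external device

    def write(self, value):
        self.out_latch = value & 0xFF   # only bits with DDR = 1 reach the pins

    def read(self):
        # output bits read back the latch; input bits read the external pins
        return (self.out_latch & self.ddr) | (self.pins_in & ~self.ddr & 0xFF)

port = BidirPort()
port.ddr = 0x0F              # low nibble is output, high nibble is input
port.write(0xA5)             # only 0x05 is actually driven out
port.pins_in = 0x70          # external device drives the input nibble
print(hex(port.read()))      # 0x75
```

The read-back merge shows why DDR ports are flexible: one register interface serves both directions, line by line.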

Serial Port (fig.4.37)

[pic]

The key feature of the interface circuit for a serial port is that it is capable of communicating in a bit-serial fashion on the device side and in a bit-parallel fashion on the bus side. The transformation between the parallel and serial formats is achieved with shift registers that have parallel access capability.
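The shift-register conversion can be sketched in a few lines. This assumes LSB-first shifting, which is common but not stated in the text:

```python
# Parallel-to-serial and serial-to-parallel conversion with a shift register.

def to_serial(byte):
    """Shift an 8-bit value out LSB first (parallel load, serial out)."""
    bits = []
    for _ in range(8):
        bits.append(byte & 1)
        byte >>= 1
    return bits

def to_parallel(bits):
    """Shift bits in LSB first and reassemble the byte (serial in, parallel out)."""
    value = 0
    for i, b in enumerate(bits):
        value |= (b & 1) << i
    return value

bits = to_serial(0x4B)
print(bits)                    # [1, 1, 0, 1, 0, 0, 1, 0]
print(hex(to_parallel(bits)))  # 0x4b
```

A real serial interface uses two such registers, one per direction, so that a new character can be loaded on the bus side while the previous one is still being shifted out to the device.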

Standard I/O interfaces:

Bridge: Circuit that interconnects two buses and translates the signals and protocols of one bus into those of the other.

PCI BUS – PERIPHERAL COMPONENT INTERCONNECT:

Devices connected to the PCI bus appear to the processor as if they were connected directly to the Processor bus. They are assigned addresses in the memory address space of the processor.

The PCI bus is processor independent. An important feature of the PCI bus is its plug-and-play capability.

Data transfer:

The bus supports three independent address spaces

• Memory.

• I/O.

• Configuration.

A 4-bit command that accompanies the address identifies which of the three spaces is being used in a given data transfer operation.

The PCI bus provides a separate physical connection for the main memory.

A master is called an initiator in PCI terminology; this is either a processor or a DMA controller. The addressed device that responds to read and write commands is called a target.

(Table 4.3) Data Transfer Signals

Consider a bus transaction in which the processor reads four 32-bit words from the memory. A complete transfer operation on the bus, involving an address and a burst of data, is called a transaction. Individual word transfers within a transaction are called phases. A clock signal provides the timing reference used to coordinate different phases of a transaction. All signal transitions are triggered by the rising edge of the clock.

Timing Diagram: (fig 4.40)

1. The processor asserts FRAME# to indicate the beginning of a transaction. It sends the address on the AD lines and a command on the C/BE# lines.

2. The processor removes the address and disconnects its drivers from the AD lines.

3. The initiator asserts the initiator-ready signal, IRDY#, to indicate that it is ready to receive data. If the target has data ready to send at this time, it asserts target-ready, TRDY#, and sends a word of data.

4. The initiator uses the FRAME# signal to indicate the duration of the burst. It negates this signal during the second last word of the transfer.
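The four steps above can be traced as a clock-by-clock sketch. Active-low signals are shown as booleans (True = asserted); the trace strings and word addressing are illustrative, not taken from the PCI specification:

```python
# Toy trace of a PCI read burst: address phase, turnaround, then data phases.

def pci_read_burst(memory, address, nwords):
    trace, data = [], []
    trace.append("clk1: FRAME# asserted; address on AD, command on C/BE#")
    trace.append("clk2: initiator turns AD lines around, asserts IRDY#")
    for i in range(nwords):
        frame = i < nwords - 1           # FRAME# negated for the final word
        data.append(memory[address + 4 * i])   # target asserts TRDY#, sends a word
        trace.append(f"clk{3 + i}: data phase {i + 1}, FRAME# asserted={frame}")
    return data, trace

mem = {0x1000 + 4 * i: i * 11 for i in range(4)}
words, trace = pci_read_burst(mem, 0x1000, 4)
print(words)   # [0, 11, 22, 33]
```

The burst structure is the point: one address phase is followed by several data phases, and the initiator signals the end of the burst through FRAME# rather than a word count.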

Fig. 4.41 -> a more general input transaction.

It shows how the IRDY# and TRDY# signals can be used by the initiator and target to indicate a pause in the middle of a transaction.

Device Configuration:

PCI simplifies device configuration by incorporating in each I/O device interface a small configuration ROM that stores information about the device. The configuration ROM is accessible in the configuration address space. Each device has an input signal called initialization device select, IDSEL#. During a configuration operation, this signal is applied to the AD inputs of the device, which causes the device to be selected. The configuration software scans all locations in the configuration address space to identify which devices are present. Each device is then assigned an address by writing that address into the appropriate device register. The configuration software also sets parameters such as the device interrupt priority.

This process relieves the user from having to be involved in the configuration process. The user simply plugs in the interface board and turns on the power. The device is ready to use.
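The scan-and-assign process can be sketched as follows. Everything here is hypothetical: the slot list, the address map (base and window size), and the config-ROM layout are invented for illustration and do not reflect real PCI configuration-space registers:

```python
# Hypothetical sketch of plug-and-play configuration: scan slots via IDSEL,
# read each device's config ROM, and assign a base address.

def configure_bus(slots, base=0xE0000000, window=0x1000):
    assigned = {}
    next_addr = base
    for slot, device in enumerate(slots):
        if device is None:               # asserting IDSEL# gets no response
            continue
        vendor = device["config_rom"]["vendor_id"]   # read from config space
        device["base_address"] = next_addr           # write base-address register
        assigned[slot] = (vendor, next_addr)
        next_addr += window
    return assigned

slots = [{"config_rom": {"vendor_id": 0x8086}}, None,
         {"config_rom": {"vendor_id": 0x1000}}]
print(configure_bus(slots))
```

The user-visible effect matches the text: empty slots are skipped silently, and every present device ends up with an address without any manual setup.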

Electrical Characteristics

The PCI bus has been defined for operation with either a 5-V or a 3.3-V power supply. The motherboard may be designed to operate with either signaling system.

SCSI BUS- SMALL COMPUTER SYSTEM INTERFACE

▪ A SCSI bus may have 8 data lines-narrow bus

▪ A SCSI bus may have 16 data lines-wide bus

A SCSI bus may use single-ended transmission or differential signaling. A SCSI connector may have 50, 68, or 80 pins.

Devices connected to the SCSI bus are not part of the address space of the processor. The SCSI bus is connected to the processor bus through a SCSI controller.

A controller connected to a SCSI bus can be either an initiator or a target. The initiator establishes a logical connection with the target. Data transfers on the SCSI bus are always controlled by the target controller.

Disk Read Operation

1. The SCSI controller, acting as an initiator, contends for control of the bus.

2. When the initiator wins the arbitration process, it selects the target controller and hands over control of the bus to it.

3. The target starts an output operation; in response, the initiator sends a command specifying the required operation.

4. In order to perform a disk seek operation, the target sends a message to the initiator indicating that it will temporarily suspend the connection between them. This releases the bus.

5. The target controller sends a command to the disk drive to move the read head to the first sector involved in the requested read operation. The target then transfers the data to the initiator.

6. The target controller sends a command to the disk drive to perform another seek operation. At the end of the transfer, the logical connection between the two controllers is terminated.

7. Initiator stores the data in the main memory.

8. The SCSI controller sends an interrupt to the processor to inform it that the requested operation has been completed. The SCSI bus standard defines a wide range of messages to handle different types of I/O devices.
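The eight steps above can be condensed into an event trace. This is purely illustrative: no real SCSI commands, phases, or timing are modelled, and the sector labels are invented:

```python
# Illustrative event trace of the SCSI disk read sequence described above.

def scsi_disk_read(sectors):
    trace = []
    trace.append("initiator wins arbitration, selects target")
    trace.append("target requests the command; initiator sends a read command")
    trace.append("target suspends the connection (disconnect) during the seek")
    data = []
    for s in sectors:                    # target reselects and transfers data
        trace.append(f"target reselects, transfers sector {s}")
        data.append(f"data<{s}>")
    trace.append("connection terminated; initiator stores data in main memory")
    trace.append("SCSI controller interrupts the processor")
    return data, trace

data, trace = scsi_disk_read([7, 8])
print(len(trace), data)   # 7 ['data<7>', 'data<8>']
```

The disconnect/reselect pair is the notable design choice: the bus is released during the slow mechanical seek so other devices can use it.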

Bus Signals (Table 4.4)

The active-low signals are preceded by a '-' sign. The bus has no address lines. Instead, the data lines are used to identify the bus controllers involved during the selection or reselection process and during bus arbitration. The main phases in the bus operation are

▪ Arbitration

▪ Selection

▪ Information Transfer

▪ Reselection

Arbitration

The bus is free when the -BSY signal is in the inactive state. Any controller can request the use of the bus while it is in this state. A controller requesting the bus asserts its associated data line to identify itself. The SCSI bus uses a simple distributed arbitration scheme: each controller is assigned a fixed priority, and the controller using the highest-numbered line wins the arbitration process.
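The distributed arbitration rule can be sketched in a few lines; the function name and the ID range are illustrative:

```python
# SCSI-style distributed arbitration: each requesting controller asserts its
# own data line; every requester sees all asserted lines, and the controller
# on the highest-numbered line wins.

def arbitrate(requesters):
    """requesters: list of controller IDs (e.g. 0-7 on a narrow bus)
    asserting their data lines while the bus is free."""
    if not requesters:
        return None                      # nobody requested: bus stays free
    return max(requesters)               # highest-priority (highest ID) wins

print(arbitrate([2, 5, 6]))   # 6
print(arbitrate([]))          # None
```

The scheme is distributed because no central arbiter is needed: each controller compares its own line number against the asserted lines and backs off on its own if it sees a higher one.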

Selection (Fig 4.42)

The winner can select another device and hand over the bus to that device. This is done by asserting the SEL signal.

Information Transfer

Handshake signaling is used to control information transfer. The REQ and ACK signals replace the Master-ready and Slave-ready signals.

The high-speed versions of the SCSI bus use double-edge clocking (i.e., data are transferred on both the rising and falling edges of these signals), thus doubling the transfer rate.

Reselection

When the suspended logical connection is to be restored, the target must gain control of the bus. Before data transfer can resume, the initiator must hand control over to the target. The connection between the two controllers is thus reestablished, with the target in control of the bus, as required for data transfer to proceed.

USB- (Important)[pic]
