


The Memory Hierarchy

Introduction

Terminology

Access time

Time to access a word in memory

Specifies the read or write time

Note these may be different

The memory may be organized as

Bits, bytes, or words

Cycle time

Time from the start of one read until the start of the next

Block size

Number of words in a block

Note this is a logical description

Bandwidth

Word transmission rate

Latency

Time to access the first of a sequence of words

Block access time

Time to access an entire block from the start of a read

DRAM

Dynamic random access memory

SRAM

Static random access memory

Semi-static RAM

The periphery is clock activated (dynamic) - thus is inactive until clocked.

Only one memory cycle is permitted per clock

Periphery circuitry must be allowed to reset after each active memory cycle for the minimum precharge time.

No refresh is required.

SDRAM

Synchronous DRAM

SDRAM synchronizes all addresses, data, and control signals to the system clock. This allows much higher data transfer rates than asynchronous transfers

ROM

Read only memory

PROM

Programmable read only memory

EPROM

Erasable programmable read only memory

EEPROM

Electrically erasable programmable read only memory

CAS

Column address strobe

Clock signal used in dynamic memories to control the input of column addresses

RAS

Row address strobe

Clock signal used in dynamic memories to control the input of row addresses

Refresh

Technique used in DRAM or SDRAM through which data is retained in memory

Refresh time interval

Time between two refresh operations - determined by the system in which the memory is operating

Memory block on Von Neumann machines

Actually composed of a number of memory components

Arranged

In hierarchical manner

To cooperate with each other

Hierarchical metrics

Speed

Storage capacity

At top

Slowest, largest memories

Also known as secondary memory

Also tend to be least expensive

Size

On the order of 10’s to 100’s of gigabits

Latency

On the order of 10’s of ms

Bandwidth

1 MB per sec

Cost

$0.02 per MB

Devices

Tape for archival storage

High density disk drives

Bottom

Smallest, fastest memories

Call this cache memory

Also tend to be most expensive

Size

On the order of 100’s to 1000’s of bits

Up to several MB in some machines today

Latency

10-20 ns

Bandwidth

8-10 MB per sec

Cost per MB

$500.00

Devices

Registers

High speed cache

Middle

Often called primary memory

Size

On the order of 100’s of Mbits to a few Gbits

Latency

50 ns

Bandwidth

1 MB per sec

Cost

$30.00 per MB

Devices

RAM, ROM

Some hard drives

Lower speed cache

Motivation

We would prefer programs to execute as quickly as possible

As we’ve seen

Accessing memory takes time

Each access contributes to the time required to execute an instruction

Static RAM Design

A typical SRAM cell appears as follows

Observe that we have 6 transistors per cell

Two access transistors enable the cell for

Read and write

Write operation

Value written into cell by

• Applying value to bi and !bi through write / sense amplifiers

• Assert word line

• Causes new value to be written into latch

Read operation

Value read from cell by

• Precharge bi and !bi to voltage halfway between 0 and 1

• Assert the word line

• Drives bi and !bi to high and low or low and high

• Values are sensed and amplified by write / sense amplifier

Typical timing is given as shown

DRAM Design

A typical DRAM cell appears as follows

Observe that we have only one transistor per cell

Read and write operations use a single bit line

Write operation

Value written into cell by

• Applying 0 or 1 to bi through write / sense amplifiers

• Assert word line

• Charges the cap if a 1 is stored, discharges it if a 0 is stored

Read operation

Value read from cell by

• Precharge bi to voltage halfway between 0 and 1

• Assert the word line

• Gates signal onto bi

• Values are sensed and amplified by write / sense amplifier

• Read operation causes cap to discharge

• Sensed and amplified value placed back on bit line

Called a refresh operation

Typical timing is given as shown

Chip Organization

Independent of type of internal storage

Typical RAM chip configured as shown in following drawing

Making Things Work

Locality of Reference

Goal

Reduce number of accesses

Make each access as short as possible

Utilized to much greater extent in today’s memories

Ideally would like to make all memory as fast as technology allows

Such action has associated cost

Memories near bottom are expensive

Support circuitry for such memories also expensive

Additional circuitry required

Power supplies to support

Almost all programs executed today written using procedural paradigm

If we analyze how such programs

Are designed

Execute

We discover an interesting phenomenon

Execution generally occurs

Sequentially

Small loops

Number of instructions

Means overall forward progress through the program

Proceeds at a much lower rate than

Access times of the fastest memory

Put another way

With respect to entire program

We are executing within a small window

That moves forward through program

This is shown as the following

Formally such phenomenon called

Locality of reference

We recognize the program is executing only a few instructions

Within small window

Benefits

If we can keep those few instructions

In fast memory

Program will appear to be executing out of fast memory

Gain benefits of such speed

Reduced cost

Important point

Approach works provided

Area within which we are executing is in window

Method can easily be defeated with

Large loops

Branches outside window
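To make the idea concrete, here is a minimal C illustration (not from the original notes). Both loops compute the same sum, but the first touches consecutive addresses and stays inside a small window, while the second jumps N words between accesses and defeats the cache:

```c
#include <stdio.h>

#define N 1024

static int a[N][N];

/* Row-major traversal: consecutive accesses touch adjacent
   addresses, so they tend to fall in the same cached block. */
long sum_row_major(void) {
    long sum = 0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            sum += a[i][j];
    return sum;
}

/* Column-major traversal: consecutive accesses are N words
   apart, stepping outside the small window the cache holds. */
long sum_col_major(void) {
    long sum = 0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            sum += a[i][j];
    return sum;
}

int main(void) {
    printf("%ld %ld\n", sum_row_major(), sum_col_major());
    return 0;
}
```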

Architecture

Let’s now look at portion of memory hierarchy

We’ll not consider

Archival storage


ROM or CDROM

Registers

We will focus on

Hard Drive

RAM

Cache

Secondary memory

2^18 pages

Primary memory

2^10 pages

Page = 4 blocks

Block = 1 K words

Word = 4 bytes

[pic]
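As a quick sanity check on these parameters, a small C sketch (constants mirror the figures above) works out the capacities: a page is 4 blocks × 1 K words × 4 bytes = 16 KB, so primary memory holds 2^10 × 16 KB = 16 MB and secondary memory 2^18 × 16 KB = 4 GB:

```c
#include <stdio.h>

int main(void) {
    const unsigned long long BYTES_PER_WORD  = 4;
    const unsigned long long WORDS_PER_BLOCK = 1ULL << 10;  /* 1 K words  */
    const unsigned long long BLOCKS_PER_PAGE = 4;
    const unsigned long long PRIMARY_PAGES   = 1ULL << 10;  /* 2^10 pages */
    const unsigned long long SECONDARY_PAGES = 1ULL << 18;  /* 2^18 pages */

    unsigned long long page_bytes =
        BLOCKS_PER_PAGE * WORDS_PER_BLOCK * BYTES_PER_WORD;

    printf("page      = %llu bytes\n", page_bytes);                   /* 16 KB */
    printf("primary   = %llu bytes\n", PRIMARY_PAGES * page_bytes);   /* 16 MB */
    printf("secondary = %llu bytes\n", SECONDARY_PAGES * page_bytes); /* 4 GB  */
    return 0;
}
```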

Caching and Virtual Memory

Can represent these as

Goal

Operate out of cache memory

When need instruction or data not in cache

Bring in from RAM

Refer to this as caching

When need instruction or data not in RAM

Bring in from hard drive

Use mechanism similar to caching

Call it virtual memory

Performance

How well we do this

Establishes effectiveness of our memory management scheme

Caching

Let’s examine cache memory and caching techniques first

Once again idea

In one sense

Take advantage of locality of reference

Instructions

Data

To minimize access time

Larger sense

Can use caching techniques

Many places to optimize performance

Internet

Bit images cached locally to improve display speed

Network file systems

Temporarily maintain local copy

To avoid having to retransfer

Based upon assumption will be using again in near future

Implementation

Caching requires certain amount of local memory

Size determines how much information can be stored locally

High Level Description

Program begins executing

Encounters needed data or instruction

Check cache

If in cache

Have a cache hit

Use

Else

Have cache miss

Must go get from somewhere else

Bring in new block of data or instructions

If room left in cache

Store block

Else

Must make room

Remove existing block

If block has been modified

Save changes

Else

Discard

Write new block in its place

Important issues

How do we know something is not in cache

Where do we go to find something if not in cache

What if not there

How do we know if room left in cache

How do we know if information in cache modified

How do we select block to replace

Detailed Implementation

Will address each question as we build a cache

Implementation scheme called direct mapped cache

First step

Design cache

Hardware

Collection of memory devices

[pic]

Memory address register

Memory data register

Words will be 32 bits

Will have a 256 K word cache

Architecture

We will logically divide cache into 256 blocks

Each block will be 1 k words long

Note this is a logical division

Further note

Address increments are rounded to make simpler

Provides reasonable size piece of memory to work with

Cache will now logically appear as

2^8 = 256 blocks    2^10 = 1 K words

[pic]

A 1 K block requires 10 address bits

To uniquely identify each location

Recall our word is 4 bytes long

Bits A0 - A1

Identify the byte

Bits A2 - A11

Identify a word within a block

Because our cache is logically divided into 256 blocks

Need 8 bits to identify each block

We can use the actual physical address to do this

Thus will use address bits

A12 - A19

These 8 bits will give the required 256 combinations

We’ll call these the index

We do this as follows

Any block of addresses with

A12 - A19 = 0000 0000

Store in Block 0

A12 - A19 = 0000 0001

Store in Block 1

A12 - A19 = 0000 0010

Store in Block 2

etc.

A20 - A31

Not directly used to address cache

Used to distinguish

Memory blocks that map to the same cache block

Called a tag

Stored in tag table

Tag Table

Tag table provides last bit of information

Contains

One entry for each block in cache

Ours will contain 256 entries

One for each block

Entry contains

Bit to indicate if word within block modified

Called dirty bit

Address bits A20 - A31 of corresponding block

Bit to indicate block in cache (the occupied bit)

Summarizing

A0 - A1 Identify Byte within a Word

A2 - A11 Identify Word within a block

A12 - A19 Identify Block within cache

A20 - A31 Identify the memory block occupying a cache block (the tag)

Stored in tag table
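These field boundaries translate directly into shift-and-mask operations. A minimal C sketch; the function names are illustrative, not from the notes:

```c
#include <stdint.h>

/* Shift-and-mask extraction of the fields summarized above.       */
/* Example: 0x00ABC123 -> tag 0x00A, block 0xBC, word 0x48, byte 3 */
static inline uint32_t byte_in_word(uint32_t a)  { return  a        & 0x3;   } /* A1  - A0  */
static inline uint32_t word_in_block(uint32_t a) { return (a >> 2)  & 0x3FF; } /* A11 - A2  */
static inline uint32_t block_index(uint32_t a)   { return (a >> 12) & 0xFF;  } /* A19 - A12 */
static inline uint32_t tag_bits(uint32_t a)      { return  a >> 20;          } /* A31 - A20 */
```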

Finding a word

To find a word in cache execute a simple process (a C sketch follows the outline)

Use bits A12 - A19 to index into tag table

If stored tag matches bits A20 - A31

Use bits A12 - A19 to index into cache

Use bits A2 - A11 to index into block

Use bits A0 - A1 for byte access

if WRITE operation

Set dirty bit in tag table

Modify word

else return word

else

Get block from primary memory

if block occupied

Check dirty bit

if set

write block to primary memory

write block to cache

set occupied bit

else

write block to cache

set occupied bit
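Here is a hedged C sketch of this read path for the direct-mapped, write-back cache described above. The primary-memory transfer routines are hypothetical stand-ins for the memory interface; the tag-table layout follows the description in these notes:

```c
#include <stdint.h>
#include <stdbool.h>

#define NUM_BLOCKS      256    /* 2^8 blocks          */
#define WORDS_PER_BLOCK 1024   /* 1 K words per block */

/* One tag-table entry per cache block, as described above. */
struct tag_entry {
    bool     occupied;  /* block holds valid data          */
    bool     dirty;     /* block modified since loaded     */
    uint32_t tag;       /* A31 - A20 of the resident block */
};

static struct tag_entry tag_table[NUM_BLOCKS];
static uint32_t cache[NUM_BLOCKS][WORDS_PER_BLOCK];

/* Hypothetical stand-ins for the primary-memory interface. */
static void mem_read_block(uint32_t tag, uint32_t index, uint32_t *dst) {
    for (int i = 0; i < WORDS_PER_BLOCK; i++)
        dst[i] = 0;                        /* real hardware fills from memory */
    (void)tag; (void)index;
}
static void mem_write_block(uint32_t tag, uint32_t index, const uint32_t *src) {
    (void)tag; (void)index; (void)src;     /* write-back omitted in sketch */
}

uint32_t cache_read(uint32_t addr) {
    uint32_t index = (addr >> 12) & 0xFF;  /* A19 - A12: block in cache */
    uint32_t word  = (addr >> 2)  & 0x3FF; /* A11 - A2:  word in block  */
    uint32_t tag   = addr >> 20;           /* A31 - A20: the tag        */
    struct tag_entry *e = &tag_table[index];

    if (!e->occupied || e->tag != tag) {   /* cache miss: fetch the block   */
        if (e->occupied && e->dirty)       /* dirty block: write back first */
            mem_write_block(e->tag, index, cache[index]);
        mem_read_block(tag, index, cache[index]);
        e->occupied = true;
        e->dirty    = false;
        e->tag      = tag;
    }
    return cache[index][word];             /* hit (or freshly filled block) */
}
```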

Data or instruction

Most contemporary computers

Use two caches

Data

Instruction

Same principles work for both

Only extra work

Deciding which cache to use

Performance

Factors to consider in each case

With and without cache

With cache

With and without miss

Optimizing size

Effect of look-ahead
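The usual figure of merit is the effective access time, t_eff = h × t_cache + (1 − h) × t_miss, where h is the hit ratio. A small C sketch with assumed, purely illustrative timings shows how strongly t_eff depends on h:

```c
#include <stdio.h>

int main(void) {
    double t_cache = 10.0;  /* ns: cache hit (assumed)                  */
    double t_miss  = 60.0;  /* ns: hit time plus miss penalty (assumed) */

    /* Effective access time for several hit ratios. */
    for (double h = 0.80; h <= 1.0001; h += 0.05)
        printf("h = %.2f  t_eff = %.1f ns\n",
               h, h * t_cache + (1.0 - h) * t_miss);
    return 0;
}
```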

Associative Cache

Alternate approach

Let block be placed anywhere in cache

Use associative search to locate

Organization now appears as

Let’s specify the following

Main memory

8 K with 8 byte blocks

Cache

2 K with 256 8 byte blocks

Tag table

256 entries

Main memory block goes anywhere in cache

Entry in tag table is main memory block number

Linear search of tag table not feasible

Let a main memory address be of the form

We find a word as follows
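The original figure is not reproduced here. As a rough C sketch under the parameters above (2 K cache, 256 8-byte blocks, tag = main memory block number): in hardware all 256 tag comparisons happen at once in a content-addressable memory; software can only model that search with a scan.

```c
#include <stdint.h>
#include <stdbool.h>

#define CACHE_BLOCKS 256   /* 2 K cache of 8-byte blocks */
#define BLOCK_BYTES  8

struct assoc_entry {
    bool     valid;
    uint16_t block_num;    /* main-memory block number = the tag */
};

static struct assoc_entry tags[CACHE_BLOCKS];
static uint8_t cache[CACHE_BLOCKS][BLOCK_BYTES];

/* Associative lookup: compare the tag of every entry.
   Returns the byte on a hit, -1 on a miss. */
int assoc_lookup(uint16_t addr) {
    uint16_t block_num = addr >> 3;          /* strip 3 offset bits  */
    for (int i = 0; i < CACHE_BLOCKS; i++)
        if (tags[i].valid && tags[i].block_num == block_num)
            return cache[i][addr & 0x7];     /* hit: return the byte */
    return -1;                               /* miss */
}
```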

Problem with full associative cache

Long search time

Complexity of underlying logic

Let’s now look at scheme that combines features of

Direct mapping

Associative mapping

Called block set associative

Block Set Associative

Approach combining direct and associative mapping

Main memory organized as collection of groups

Each group comprises number of blocks

Cache memory organized as collection of sets

Containing specified number of blocks

Set number corresponds to main memory group number

Any block from group j can be placed into set j

Set is now searched associatively

Far less complex search

Dealing with smaller search space

Organization now appears as

Let’s specify the following

Main memory

8 K with 8 byte blocks

Cache

2 K with 256 8 byte blocks

Tag table

256 entries

Our addresses have the following association

We can now see how a main memory address is

Mapped to a cache address

Computation of address follows in same manner as

Direct and associative mappings
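A hedged C sketch of the lookup. The 2-way organization (256 cache blocks forming 128 sets) and the modulo group-to-set mapping are assumptions for illustration; the direct step computes the set index from the block number, and only the blocks within that set are searched associatively:

```c
#include <stdint.h>
#include <stdbool.h>

#define NUM_SETS    128    /* 256 blocks / 2 ways (2-way assumed) */
#define WAYS        2
#define BLOCK_BYTES 8

struct set_entry {
    bool     valid;
    uint16_t tag;          /* block number div NUM_SETS */
};

static struct set_entry sets[NUM_SETS][WAYS];
static uint8_t cache[NUM_SETS][WAYS][BLOCK_BYTES];

/* Direct step: set index straight from the address.
   Associative step: only the WAYS entries of that set are searched. */
int set_assoc_lookup(uint16_t addr) {
    uint16_t block_num = addr >> 3;        /* strip 3 offset bits */
    uint16_t set = block_num % NUM_SETS;   /* group j -> set j    */
    uint16_t tag = block_num / NUM_SETS;
    for (int w = 0; w < WAYS; w++)
        if (sets[set][w].valid && sets[set][w].tag == tag)
            return cache[set][w][addr & 0x7];
    return -1;                             /* miss */
}
```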

Intel Pentium

Implements separate

Data and instruction caches

Each uses a 2-way block set associative scheme

Virtual Memory

Virtual memory is a scheme

Very much like caching

[pic]

Difference

Caching between primary memory and CPU

Virtual memory works between secondary and primary memories

Translates

From logical address (program)

To physical address (primary memory)

When information not found in cache

Primary memory checked

When information not found in primary memory

Secondary memory checked

In essence primary memory acts as cache for secondary memory

Purpose is twofold

Take advantage of speed of primary memory

Create the appearance of unlimited primary memory

High Level Analysis

As we saw with cache scheme

Size of primary memory significantly smaller than secondary

Rather than blocks

Primary memory divided into pages

Would like each program to have memory space allocated when loaded

Will assume memory space is contiguous

Would like to be able to place pages anywhere in primary memory

Makes addressing only slightly more complicated

Will store location of program memory in page table

Similar to tag table in cache scheme

General retrieval algorithm similar to what we’ve seen

Program begins executing

Encounters needed data or instruction

Check cache

If in cache

Have a cache hit

Use

Else

Have cache miss

Check primary memory

If in primary

Bring associated block into cache

Else

Have a page fault

Get from secondary memory

If no room left

Must make room

Remove existing page

If page has been modified

Save changes

Else

Discard

Write new page in its place

Bring associated block into cache

Implementation

Design primary memory

Hardware

Collection of memory devices

Memory address register

Memory data register

Words will be 32 bits

Will have 16 M byte primary memory

Architecture

We will logically divide primary memory into 2^10 4 K word pages

Each page will be 4 K words long

Each will hold 4 blocks

Each block will hold 1 K words

Note this is a logical division

Further note

Again address increments are rounded to make simpler

Provides reasonable size piece of memory to work with

[pic]

Primary memory will now logically appear as

Note j not necessarily = i + 1

We consider a

Virtual address

Primary memory address

Secondary memory address

[pic]

Assume Virtual Memory has

2^12 pages

Assume Secondary Memory has

2^18 pages = 1 G words or 4 G bytes

Similar to cache

A2 - A11

Identifies a word within a block

Assume Main Memory has

2^10 pages ≈ 1000 pages

Identified by bits A23 - A14

Now

A13 - A12

Identify block in page

A11 - A2

Identify word in block

A1 - A0

Byte within a word

Page table

Contains one entry for each possible page in secondary memory

Our design will have 2^18 entries

Your pages can be anywhere in secondary memory

To your program they appear at

0, 1, 2…m-1

Could also have

Page tables i, j, and k etc.

Alternately use

A31 - A24 to identify one of 256 page tables

Potentially allows for up to 256 jobs in memory

When job enters system

Page table included

May have only subset of your pages in memory at any one time

A23 - A14 to identify a page within a page table

From point of view of virtual memory address

[pic]

Page number represents an offset into the page table

Valid bit to indicate if page in primary memory

Entry contains pointer to location in main memory

If not in main memory

Point to location in secondary memory

Dirty bit to indicate modified data

|VM Page Number |Page Address |Status |Location in MM or SM |
|0              |0 - 4095     |       |0                    |
|1              |4096 - 8191  |       |2                    |
|2              |             |       |1                    |
|               |             |       |13                   |
|127            |             |       |123                  |
|               |             |       |7                    |
|2^12           |…            |       |55                   |

Once we find the page in primary memory

Address calculation proceeds as with cache

When program loaded

Primary memory space allocated

Amount depends upon program

Address of allocated space

Stored in page table register

Gives starting address of allocated space

To find pages

Go to page table

Add

Contents of page table register to A31 - A14 of virtual address

Gives index into page table

Address

Physical memory if there

Pointer to secondary memory otherwise

Use A13 - A12 to identify block
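A hedged C sketch of the translation step, following the bit layout above. The page-table-register indexing mirrors the "To find pages" procedure; the page-fault handler, block selection via A13 - A12, and bounds checking are omitted:

```c
#include <stdint.h>
#include <stdbool.h>

#define PT_ENTRIES (1u << 18)   /* one entry per secondary-memory page */
#define PAGE_BYTES (1u << 14)   /* 4 K words x 4 bytes = 16 KB         */

struct pte {
    bool     valid;   /* page resident in primary memory        */
    bool     dirty;   /* page modified since loaded             */
    uint32_t frame;   /* location in primary (or secondary) mem */
};

static struct pte page_table[PT_ENTRIES];
static uint32_t   page_table_reg;   /* base of this job's page table */

/* Translate a 32-bit virtual address; false signals a page fault. */
bool translate(uint32_t vaddr, uint32_t *paddr) {
    uint32_t vpn    = vaddr >> 14;               /* A31 - A14               */
    uint32_t offset = vaddr & (PAGE_BYTES - 1);  /* A13 - A0: block + word  */
    struct pte *e = &page_table[page_table_reg + vpn];

    if (!e->valid)
        return false;                            /* page fault */
    *paddr = (e->frame << 14) | offset;
    return true;
}
```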

Page Replacement

Clearly primary memory of limited size

16 M bytes

We have the ability to address

4 G bytes (1 G words)

Although 16 M seems like a lot

Do not want to restrict program to that size

To satisfy requirements will need to be able to

Load additional pages into memory

As long as space left no problem

If no space

Must remove something

Several schemes available each has advantages and disadvantages

All require checking dirty bit prior to removal

If set

Write operation necessary

Otherwise

Overwrite page

Two most common

Require time stamp on each page

LRU

Remove least recently used page

Similar in spirit to FIFO, which removes the oldest loaded page

Assumes the page untouched longest is least likely to be used in the future

MRU

Remove most recently used page

Similar in spirit to LIFO, which removes the newest page

Assumes the newest page is least likely to be used in the future

Random

Select and remove page at random

Easy to implement
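A minimal C sketch of LRU using the time stamps mentioned above; the frame count and names are illustrative, not from the notes:

```c
#include <stdint.h>

#define NUM_FRAMES 1024     /* 2^10 primary-memory pages */

static uint64_t last_used[NUM_FRAMES];  /* time stamp per resident page */
static uint64_t now;                    /* advances on every reference  */

/* Record a reference to the page in frame f. */
void touch(int f) { last_used[f] = ++now; }

/* LRU victim: the frame with the oldest time stamp.
   Before reuse the caller must check the frame's dirty bit
   and write the page back if it is set. */
int choose_victim_lru(void) {
    int victim = 0;
    for (int f = 1; f < NUM_FRAMES; f++)
        if (last_used[f] < last_used[victim])
            victim = f;
    return victim;
}
```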

Performance

With and without VM

With VM

With and without page fault

Optimizing size

Speeding things up - TLB

Our virtual memory page table

Contains information for 2^18 pages

Search process and address calculation

Can be very time consuming

Would like to improve response time

Easiest way

Use caching techniques learned earlier

Keep in memory a cache containing

Most recently used page addresses

Called the Translation Lookaside Buffer - TLB

Search for a page begins in the TLB

If not found then check page table as before

Architecture

Choose 256 entries

256 most recently calculated addresses stored

Implement associative search

Keyed on virtual memory page number

Each entry must contain

Valid Bit

Indicates if entry is valid

Dirty Bit

Indicates if entry has been changed

Tag

A31 - A14 of virtual address

Physical Page

Computed address of page in primary memory
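A hedged C sketch of the TLB lookup, modeling the associative search as a scan (in hardware all entries are compared at once). On a miss the full page-table walk shown earlier runs, and its result would be installed here:

```c
#include <stdint.h>
#include <stdbool.h>

#define TLB_ENTRIES 256

struct tlb_entry {
    bool     valid;
    bool     dirty;
    uint32_t tag;     /* A31 - A14 of the virtual address       */
    uint32_t frame;   /* computed page address in primary memory */
};

static struct tlb_entry tlb[TLB_ENTRIES];

/* Associative search keyed on the virtual page number. */
bool tlb_lookup(uint32_t vaddr, uint32_t *paddr) {
    uint32_t tag = vaddr >> 14;
    for (int i = 0; i < TLB_ENTRIES; i++) {
        if (tlb[i].valid && tlb[i].tag == tag) {
            *paddr = (tlb[i].frame << 14) | (vaddr & 0x3FFF);
            return true;    /* TLB hit: no page-table access needed */
        }
    }
    return false;           /* TLB miss: walk the page table */
}
```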
