## Memory Hierarchy and Cache Ch 4-5

- Memory Hierarchy
- Main Memory
- Cache
- Implementation

12.9.2002 Copyright Teemu Kerola 2002



#### Goal (4)

- I want my memory lightning fast
- I want my memory to be gigantic in size
- Register access viewpoint (HW solution)
  - data access as fast as HW register
  - data size as large as memory
- Memory access viewpoint (HW help for SW solution: virtual memory)
  - data access as fast as memory
  - data size as large as disk


#### Memory Hierarchy (5)



- Most often needed data is kept close
- Access to small data sets can be made fast
  - simpler circuits
- Faster is more expensive
- Large can be bigger and cheaper (per byte)

Memory Hierarchy (figure):

- up: smaller, faster, more expensive, more frequent access
- down: bigger, slower, less expensive, less frequent access


#### Principle of locality (7)



- In any given time period, memory references occur only to a <u>small subset</u> of the whole address space
- The reason why memory hierarchies work



- Average access cost is close to the cost of accessing the small data set
- How to determine that small data set?
- How to keep track of it?
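The claim about average cost can be checked with a little arithmetic. The sketch below is only an illustration: it assumes a two-level hierarchy where a miss costs the fast access plus the slow access, and it borrows the 5 ns and 60 ns example figures that these slides quote for SRAM and DRAM. The function name and the hit ratios are made up for the example.

```python
def average_access_time(hit_ratio, t_fast_ns, t_slow_ns):
    """Average access time for a two-level memory hierarchy.

    Assumption: a miss pays for the fast level lookup and then the slow access.
    """
    if not 0.0 <= hit_ratio <= 1.0:
        raise ValueError("hit_ratio must be between 0 and 1")
    miss_ratio = 1.0 - hit_ratio
    return hit_ratio * t_fast_ns + miss_ratio * (t_fast_ns + t_slow_ns)


# Illustrative hit ratios; 5 ns and 60 ns are the slides' example access times.
for h in (0.80, 0.95, 0.99):
    print(f"hit ratio {h:.2f}: {average_access_time(h, 5, 60):.1f} ns")
```

The higher the fraction of references that fall into the small, fast data set, the closer the average gets to the fast access time.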


#### Principle of locality (5)

- In any given time period, memory references occur only to a <u>small subset</u> of the whole address space (Finnish: paikallisuus)
- Temporal locality: it is likely that a data item referenced a short time ago will be referenced again soon (Finnish: ajallinen paikallisuus)
- Spatial locality: it is likely that data items close to the one referenced a short time ago will be referenced soon (Finnish: alueellinen paikallisuus)
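Both kinds of locality can be counted in an address trace. This is a rough sketch, not a standard metric: the window length, the block size, and the trace (a loop adding up a 64-element array of 4-byte items into one variable at a made-up address 1000) are all assumptions chosen for illustration.

```python
def locality_stats(addresses, block_size=16, window=8):
    """Return (temporal, spatial) fractions for a non-empty address trace.

    temporal: the same address was referenced within the last `window` references
    spatial:  a new address, but in a block referenced within that window
    """
    temporal = spatial = 0
    recent = []                          # the last `window` addresses referenced
    for addr in addresses:
        if addr in recent:
            temporal += 1
        elif addr // block_size in {a // block_size for a in recent}:
            spatial += 1
        recent.append(addr)
        if len(recent) > window:
            recent.pop(0)
    n = len(addresses)
    return temporal / n, spatial / n


# Hypothetical trace of: for i in range(64): total += a[i]
trace = []
for i in range(64):
    trace.append(4 * i)      # a[i], 4-byte elements starting at address 0
    trace.append(1000)       # total, at a made-up address
print(locality_stats(trace))
```

References to `total` show up as temporal locality, and references to neighbouring array elements show up as spatial locality.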




#### Memory

- Random access semiconductor memory
  - give address & control, read/write data
- ROM, PROMs
  - system startup memory, BIOS (Basic Input/Output System)
    - load and execute OS at boot
  - also random access
- RAM
  - "normal" memory accessible by CPU

Table 5.1 (Table 4.2 [Stal99])


**RAM**

- Dynamic RAM, DRAM
  - simpler, slower, denser, bigger (bytes per chip)
  - main memory?
  - e.g., \$0.12 / MB (year 2001)?
  - e.g., 60 ns access
  - periodic refreshing required
  - refresh required after read
- Static RAM, SRAM
  - more complex (more chip area/byte), faster, smaller (bytes per chip)
  - e.g., \$0.50 / MB (year 2001)?
  - e.g., 5 ns access?
  - no periodic refreshing needed
  - data remains until power is lost


#### DRAM Access

- 16 Mb DRAM
  - 4 bit data items
  - 4M data elements, 2K \* 2K square
  - Address 22 bits
    - row address select (RAS)
    - column address select (CAS)
    - row and column addresses share (are multiplexed on) 11 address pins
- Simultaneous access to many 16 Mb memory chips to access larger data items
  - Access 8 bit words in parallel? Need 8 chips if each chip supplies 1 bit (2 chips with the 4-bit chips above).

Fig. 5.3 (Fig. 4.4 [Stal99])

Fig. 5.4 (b) (Fig. 4.5 (b) [Stal99])

Fig. 5.5 (Fig. 4.6 [Stal99])
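A 22-bit element address is sent over the 11 address pins in two halves. The sketch below shows the split; it assumes the row number is the high-order half, which the slide does not state, and the function and constant names are made up for the example.

```python
ROW_BITS = 11   # 2K rows
COL_BITS = 11   # 2K columns


def split_dram_address(address):
    """Split a 22-bit element address into (row, column), 11 bits each.

    Assumption: the row is the high-order half of the address.
    """
    if not 0 <= address < (1 << (ROW_BITS + COL_BITS)):
        raise ValueError("address must fit in 22 bits")
    row = address >> COL_BITS                   # sent first, latched with RAS
    column = address & ((1 << COL_BITS) - 1)    # sent second, latched with CAS
    return row, column


row, column = split_dram_address(0x2ABCDE)
print(f"row {row}, column {column}")
```

Each (row, column) pair selects one 4-bit data item in the 2K \* 2K square.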


#### SDRAM (Synchronous DRAM)

- 16 bits in parallel
  - access 4 DRAMs (4 bits each) in parallel
- CPU clock also synchronizes the bus
  - no separate clock for the bus
  - CPU knows how long it takes to make a reference, so it can do other work while waiting
- Faster than plain DRAM
- Current main memory technology (year 2001)
  - e.g., \$0.11 / MB (year 2001)


#### RDRAM (Rambus DRAM)

- New technology, works with fast memory bus
  - expensive, e.g., \$0.40 / MB (year 2001)?
- Faster transfer rate than with SDRAM
  - e.g., 1.6 GB/sec vs. 200 MB/sec (?)
- Faster access than SDRAM
  - e.g., 38 ns vs. 44 ns
- Fast internal Rambus channel (800 MHz)
- Rambus memory controller connects to bus
- Speed slows down with many memory modules
  - serially connected on Rambus channel
  - not good for servers with 1 GB memory (for now!)
- 5% of memory chips (year 2000), 12% (2005)?

#### Flash memory

- Original invention
  - Fujio Masuoka, Toshiba Corp., 1984
  - non-volatile, data remains with power off
  - slow to write ("program")
- NAND flash, 1987
  - Fujio Masuoka
  - reduces the wiring per bit to one-eighth of that of the original flash memory

#### Intel ETOX Flash

- Intel, 1997
- A single transistor with the addition of an electrically isolated polysilicon floating gate capable of storing charge (electrons)
- Negatively charged electrons act as a barrier between the control gate and the floating gate.
- Depending on the flow through the floating gate (more or less than 50%) it has value 1 or 0.
- Read/Write data in small blocks


- Use high voltage to write, and "Fowler-Nordheim Tunneling" to clear
  - http://developer.intel.com/technology/itj/q41997/articles/art\_1.htm



#### Flash Implementations

- BIOS (PCs, phones, other hand-held devices, ...)
- Toshiba SmartMedia, 2-256 MB
- Sony Memory Stick, 2-256 MB
- CompactFlash, 8-512 MB
- PlayStation II Memory Card, 8 MB
- MMC MultiMedia Card, 32-128 MB
- IBM MicroDrive (hard disk!) compatible memory card
- Hand-held phone memories




## Cache Memory

(Finnish: välimuisti)

- Problem: how can I make my (main) memory as fast as my registers?
- Answer: (processor) cache
  - keep the most probably referenced data in a fast cache close to the processor, and the rest of it in memory
    - much smaller than main memory
    - (much) more expensive (per byte) than memory
    - most data accesses go to the cache


Fig. 4.3 & 4.6 (Fig. 4.13 & 4.16 [Stal99])


#### Memory references with cache (5)

- Data is in cache? Use it from there (hit)
- Data is only in memory? Read it to cache (miss)
  - CPU waits until data available
- Many blocks (cache lines) help for temporal locality
  - many different data items in cache
- Large blocks help for spatial locality
  - lots of "nearby" data available
- Fixed cache size? Select "many" or "large"?

Fig. 4.4 (Fig. 4.14 [Stal99])
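The "many" or "large" trade-off for a fixed cache size can be tried out with a tiny simulator. This is a minimal sketch assuming a direct-mapped cache and byte addresses; the 256-byte cache size and both traces are invented for illustration, and real caches and workloads behave less cleanly.

```python
def hit_ratio(addresses, cache_bytes, block_bytes):
    """Hit ratio of a direct-mapped cache for a trace of byte addresses."""
    num_lines = cache_bytes // block_bytes
    lines = [None] * num_lines            # block number held by each cache line
    hits = 0
    for addr in addresses:
        block = addr // block_bytes
        line = block % num_lines          # direct mapping: one possible line
        if lines[line] == block:
            hits += 1
        else:
            lines[line] = block           # miss: read the block into the cache
    return hits / len(addresses)


# Hypothetical traces: a sequential scan, and 16 scattered items reused 10 times.
scan = list(range(0, 4096, 4))
scattered = [72 * i for i in range(16)] * 10

for name, trace in (("scan", scan), ("scattered", scattered)):
    for block_bytes in (8, 64):           # 32 small lines vs. 4 large lines
        ratio = hit_ratio(trace, 256, block_bytes)
        print(f"{name:9s} block {block_bytes:2d} B: hit ratio {ratio:.3f}")
```

In this made-up example the sequential scan prefers large blocks, while the reused scattered items prefer many small blocks.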


























#### Two definitions for "Set" in "Set Associative Mapping"

- Definition 1: a "set" is the set of all possible cache locations where the referenced memory block can be
  - Field "set" of the memory address determines this set
  - [Stal03], [Stal99]
- Definition 2: cache memory is split into multiple "sets", and the referenced memory block can be in only one location in each "set"
  - Field "index" of the memory address determines the possible location of the referenced block in each "set"
  - [HePa96], [PaHe98]
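Under either definition the same address bits do the selecting; only the name of the field differs. The sketch below splits an address into its three fields. It assumes power-of-two sizes, and the example configuration (16-byte blocks, 256 groups of candidate lines) is made up.

```python
def split_address(address, block_bytes, num_sets):
    """Split a byte address into (tag, set_or_index, offset).

    The middle field is called "set" in [Stal99] and "index" in [HePa96].
    Assumption: block_bytes and num_sets are powers of two.
    """
    if block_bytes & (block_bytes - 1) or num_sets & (num_sets - 1):
        raise ValueError("block_bytes and num_sets must be powers of two")
    offset_bits = block_bytes.bit_length() - 1
    set_bits = num_sets.bit_length() - 1
    offset = address & (block_bytes - 1)
    set_or_index = (address >> offset_bits) & (num_sets - 1)
    tag = address >> (offset_bits + set_bits)
    return tag, set_or_index, offset


tag, set_or_index, offset = split_address(0x12345, block_bytes=16, num_sets=256)
print(hex(tag), hex(set_or_index), hex(offset))
```

The tag is then compared against every candidate line that the middle field selects.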








#### Replacement Algorithm

- Which cache block (line) to remove to make room for new block from memory?
- Direct mapping: trivial case (only one possible line)
- First-In-First-Out (FIFO)
- Least-Frequently-Used (LFU)
- Random
- Which one is best?
  - Chip area?
  - Fast? Easy to implement?
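The listed policies can be compared by counting misses on a trace. This is a minimal sketch assuming a fully associative cache (any block can go in any line); the tie-breaking rule for LFU, the seed, and the trace are choices made for the example, not something the slides specify.

```python
import random
from collections import Counter


def count_misses(blocks, capacity, policy, seed=0):
    """Count misses in a fully associative cache holding `capacity` blocks."""
    rng = random.Random(seed)
    cache = []            # resident blocks, oldest load first
    uses = Counter()      # reference count per resident block (for LFU)
    misses = 0
    for b in blocks:
        if b in cache:
            uses[b] += 1
            continue
        misses += 1
        if len(cache) == capacity:
            if policy == "FIFO":
                victim = cache[0]
            elif policy == "LFU":
                # Ties go to the block loaded earliest (an assumption).
                victim = min(cache, key=lambda x: uses[x])
            elif policy == "Random":
                victim = rng.choice(cache)
            else:
                raise ValueError(f"unknown policy: {policy}")
            cache.remove(victim)
            del uses[victim]
        cache.append(b)
        uses[b] = 1
    return misses


trace = [1, 2, 3, 1, 1, 4, 1, 5, 2, 1]    # made-up block reference trace
for policy in ("FIFO", "LFU", "Random"):
    print(policy, count_misses(trace, 3, policy))
```

Which policy wins depends on the trace, and the bookkeeping each one needs (load order, counters, or nothing) is what costs chip area.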


#### Write Policy

- How to handle writes to memory?
- Write through (Finnish: läpikirjoittava)
  - each write always goes to memory
  - each write is as slow as a cache miss!
- Write back (Finnish: lopuksi kirjoittava, takaisin kirjoittava?)
  - write cache block to memory only when it is replaced in cache
  - memory may have stale (old) data
  - cache coherence problem (Finnish: välimuistin yhteneväisyysongelma)
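The difference between the two policies shows up when memory writes are counted. The sketch below assumes a direct-mapped cache that loads a block on a write miss (write allocate), which the slides do not specify; the trace of written block numbers is made up.

```python
def memory_writes(written_blocks, num_lines, policy):
    """Return (writes_to_memory, stale_blocks_in_memory) for a write trace.

    Assumption: direct-mapped cache, block loaded into the cache on a write miss.
    """
    if policy == "write-through":
        return len(written_blocks), 0     # every write goes to memory
    if policy != "write-back":
        raise ValueError(f"unknown policy: {policy}")
    lines = [None] * num_lines            # block held by each cache line
    dirty = [False] * num_lines           # line modified since it was loaded?
    writes = 0
    for block in written_blocks:
        line = block % num_lines
        if lines[line] != block:
            if dirty[line]:
                writes += 1               # replaced block is written back
            lines[line] = block
        dirty[line] = True
    # Lines still dirty at the end: memory holds old data for these blocks.
    return writes, sum(dirty)


trace = [0] * 100 + [4] * 100 + [0] * 100   # blocks 0 and 4 share line 0
print(memory_writes(trace, 4, "write-through"))
print(memory_writes(trace, 4, "write-back"))
```

Write back saves memory traffic, but until the last dirty block is written out, anyone reading memory directly sees stale data.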


#### Line size

- How big should a cache line be?
- Optimise for temporal or spatial locality?
  - bigger is better for spatial locality
- <u>Data</u> references and <u>code</u> references behave differently
- Best size varies with <u>program</u> or <u>program phase</u>
- 2-8 words?
  - word = 1 float??


#### Number of Caches (3)

- A single cache would be too large for best results
- Unified vs. split cache (Finnish: yhdistetty, erilliset)
  - same cache for data and code, or not?
  - split cache: can optimise structure separately for data and code
- Multiple levels of caches
  - L1 same chip as CPU
  - L2 same package or chip as CPU
    - older systems: same board
  - L3 same board as CPU

Fig. 4.13 (Fig. 4.23 [Stal99])


-- End of Ch. 4-5: Cache Memory --

http://www.intel.com/procs/servers/feature/cache/unique.htm

"The Pentium® Pro processor's unique multi-cavity chip package brings L2 cache memory closer to the CPU, delivering higher performance for business-critical computing needs."
