# Memory Hierarchy and Cache Ch 4-5

Memory Hierarchy Main Memory Cache Implementation

24 9 2003

Copyright Teemu Kerola 2003

# Teemu's Cheesecake

Register, on-chip cache, memory, disk, and tape speeds relative to times locating cheese for the cheese cake you Europa

are baking..





(Jupiter) (tape)

24 9 2003

Copyright Teemu Kerola 2003

#### Goal (4)

- I want my memory lightning fast
- I want my memory to be gigantic in size
- Register access viewpoint:
  - data access as fast as HW register
  - data size as large as memory
- Memory access viewpoint
  - data access as fast as memory
  - data size as large as disk

HW solution

cache



HW help for SW solution

24 9 2003

Copyright Teemu Kerola 2003

# Memory Hierarchy (5)

Fig. 4.1

2

- · Most often needed data is kept close
- · Access to small data sets can be made fast - simpler circuits
- Faster is more expensive
- Large can be bigger and cheaper (per byte)

Memory Hierarchy

smaller, faster, more expensive, up:

more frequent access

down: bigger, slower, less expensive,

less frequent access

24 9 2003 Copyright Teemu Kerola 2003

# Principle of locality (8)

(paikallisuus)

- · In any given time period, memory references occur only to a small subset of the whole address
- · The reason why memory hierarchies work

Prob (small data set) = 99% Cost (small data set) =  $2 \mu s$ Prob (the rest) = 1%Cost (the rest) =  $20 \mu s$ Aver cost 99% \* 2  $\mu$ s + 1% \* 20  $\mu$ s = 2.2  $\mu$ s | Fig. 4.2

- · Average cost is close to the cost of small data set
- How to determine data for that small set?
- How to keep track of it?

24.9.2003 Copyright Teemu Kerola 2003

# Principle of locality (5)

- In any given time period, memory references occur only to a small subset of the whole address space (paikallisuus)
- Temporal locality: it is likely that a data item referenced a short time ago will be referenced (again) soon (ajallinen paikallisuus)
- Spatial locality: it is likely that a data items (close) to the one referenced a short time ago will be referenced soon (alueellinen paikallisuus)



24 9 2003 Copyright Teemu Kerola 2003

# Memory

- Random access semiconductor memory
  - give address & control, read/write data
- ROM, PROMS, FLASH

Table 5.1 (Table 4.2 [Stall99])

- system startup memory,

BIOS (Basic Input/Output System)

- · load and execute OS at boot
- also random access
- RAM
  - "normal" memory accessible by CPU

24 9 2003

Copyright Teemu Kerola 2003

#### **RAM**

E.g., \$0.12 / MB (year 2001)

- Dynamic RAM, DRAM
  - simpler, slower, denser, bigger (bytes per chip)
  - main memory?

E.g., 60 ns access

- periodic refreshing required
- refresh required after read
- Static RAM, SRAM E.g., \$0.70 / MB (year 2001)
  - more complex (more chip area/byte), faster, smaller (bytes per chip)
  - cache?

E.g., 5 ns access?

- no periodic refreshing needed
- data remains until power is lost

24 9 2003

Copyright Teemu Kerola 2003

#### DRAM Access

- 16 Mb DRAM
  - 4 bit data items

Fig. 5.3 (Fig. 4.4 [Stal99])

- 4M data elements, 2K \* 2K square
- Address 22 bits

Fig. 5.4 (b) (Fig. 4.5 (b) [Stal99])

- row access select (RAS)
- column access select (CAS)
- · interleaved on 11 address pins
- Simultaneous access to many 16Mb memory chips to access larger data items
  - Access 8 bit words in parallel? Need 8 chips.

Fig. 5.5 (Fig. 4.6 [Stal99])

24 9 2003

Copyright Teemu Kerola 2003

# SDRAM (Synchronous DRAM)

- 16 bits in parallel
  - access 4 DRAMs (4 bits each) in parallel
- CPU clock synchronizes also the bus
  - not by separate clock for the bus
  - CPU knows how longs it takes make a reference - it can do other work while waiting
- Faster than plain DRAM
- Current main memory technology (year 2001)

E.g., \$0.11 / MB (year 2001)

24 9 2003

Copyright Teemu Kerola 2003

# RDRAM (RambusDRAM)

- · New technology, works with fast memory bus
  - expensive

E.g., \$0.40 / MB (year 2001)?

E.g., 38 ns vs. 44 ns

· Faster transfer rate than with SDRAM

· Faster access than SDRAM

E.g., 1.6 GB/sec vs. 200 MB/sec (?)

- Fast internal Rambus channel (800 MHz)
- · Rambus memory controller connects to bus
- Speed slows down with many memory modules
  - serially connected on Rambus channel
  - not good for servers with 1 GB memory (for now!)
- 5% of memory chips (year 2000), 12% (2005)?

24 9 2003 Copyright Teemu Kerola 2003

# Flash memory

- · Original invention
  - Fujio Masuoka, Toshiba Corp., 1984
  - non-volatile, data remains with power off
  - slow to write ("program")
- · Nand-Flash, 1987
  - Fujio Masuoka
  - lowers the wiring per bit to one-eighth that of the Flash Memory's

24 9 2003

Copyright Teemu Kerola 2003



12

#### Intel ETOX Flash

- · Intel, 1997
- A single transistor with the addition of an electrically isolated polysilicon floating gate capable of storing charge (electrons)
- · Negatively charged electrons act as a barrier between the control gate and the floating gate.
- Depending on the flow through the floating gate (more or less than 50%) it has value 1 or 0.
- Read/Write data in small blocks



http://developer.intel.com/technology/ itj/q41997/articles/art\_1.htm

24.9.2003 Copyright Teemu Kerola 2003

#### Intel StrataFlash

- Flash cell is analog, not digital storage
- · Use different charge levels to store 2 bits (or more!) of data in each flash cell



14

24 9 2003

Copyright Teemu Kerola 2003

# Flash **Implementations**

- BIOS (PC's, phones, other hand-held devices....)
- Toshiba SmartMedia, 2-256 MB ...
- · Sony Memory Stick, 2-1024 MB
- CompactFlash, 8-512 MB .....
- · PlayStation II Memory Card, 8 MB
- MMC MultiMedia Card, 32-128 MB
- Fuji XD Picture Card 32-256 MB
- · Hand-held phone memories

24 9 2003 Copyright Teemu Kerola 2003 24 9 2003 Copyright Teemu Kerola 2003 16

# Cache Memory

(välimuisti)

15

- Problem: how can I make my (main) memory as fast as my registers?
- Answer: (processor) cache
  - keep most probably referenced data in fast cache close to processor, and rest of it in memory
    - · much smaller than main memory
    - (much) more expensive (per byte) than memory
    - most of data accesses only to cache 90% 99%?

17

Fig. 4.3 & 4.6 (Fig. 4.13 & 4.16 [Stal99])

24.9.2003 Copyright Teemu Kerola 2003

# Memory references with cache (5)

• Data is in cache?

Data is only in memory?

Read it to cache CPU waits until data available



Many blocks (cache lines) help for temporal locality many different data items in cache

Large blocks help for spatial locality lots of "nearby" data available

Fig. 4.4 (Fig. 4.14 [Stall99])

Fixed cache size?

Select "many" or "large"? (can not have both!)

24 9 2003 Copyright Teemu Kerola 2003 18



















# Fully Associative Mapping Lots of circuits tag fields are long - wasted space! each cache line tag must be compared simultaneously with the memory address tag lots of wires

- lots of comparison circuits area on chip
   Final comparison "or" has large gate delay
  - did any of these 64 comparisons match?

 $^2 \log(64) = 8$  levels of binary or-gates

Large surface

- how about 262144 comparisons?

• ⇒ Can use it only for small caches

24.9.2003 Copyright Teemu Kerola 2003 28

#### Set Associative Mapping (6) (joukkoassosiatiivinen kuvaus) • With set size k=2, every block has 2 possible locations in cache Possible location of block is Cache line size determined by set (index) field bits block = block size 34 bit address $2^5 = 32$ bytes tag set (index) offset (byte address) → 22 Unique bits that are Nr of sets $v=2^7=128$ blocks = 4 KB different for each block, Total cache size vk=2\*4 KB= 8 KB stored in each cache line Fig. 4.11 (confusing, complex?) Fig. 5.8 [HePa96] (E.g., k=2, s=29, d=7, w=5)Fig. 7.19 [PaHe98] (Fig. 4.21 [Stall99]) 24 9 2003 Copyright Teemu Kerola 2003

# Two definitions for "Set" in "Set Associative Mapping"

- Term "set" is the set of all possible locations where referenced memory block can be
  - Field "set" of memory address determines this set
  - [Stal03], [Stal99]
- Cache memory is split into multiple "sets", and the referenced memory block can be in only one location in each "set"
  - Field "index" of memory address determines possible location of referenced block in each "set"
  - [HePa96], [PaHe98]

24.9.2003 Copyright Teemu Kerola 2003 30







# Set Associative Mapping

- Set associative cache with set size 2 = 2-way cache
- Degree of associativity v? Usually 2
  - v large?
- Fig. 7.16 [PaHe98]
- v large.
  - More data items (v) in one set
  - less "collisions" within set
  - final comparison (matching tags?) gate delay?
- v maximum (nr of cache lines)
  - ⇒ fully associative mapping
- $-v \min (1) \Rightarrow \text{direct mapping}$

24.9.2003 Copyright Teemu Kerola 2003 34

# Replacement Algorithm

- Which cache block (line) to remove to make room for new block from memory?
- Direct mapping case trivial
- First-In-First-Out (FIFO)
- Least-Frequently-Used (LFU)
- Random
- Which one is best?
  - Chip area?
  - Fast? Easy to implement?

24.9.2003 Copyright Teemu Kerola 2003

35

# Write Policy

- How to handle writes to memory?
- Write through

(läpikirjoittava)

- each write goes always to memory
- each write is a cache miss!
- Write back

(lopuksi kirjoittava takaisin kirjoittava?)

- write cache block to memory only when it is replaced in cache
- memory may have stale (old) data
- cache coherence problem (välimuistin

(välimuistin yhteneväisyysongelma)

24.9.2003 Copyright Teemu Kerola 2003

30

#### Line size

- · How big cache line?
- Optimise for temporal or spatial locality?
  - bigger cache line is better for spatial locality
  - More cache lines is better for temporal locality
- <u>Data</u> references and <u>code</u> references behave in a different way

37

- Best size varies with <u>program</u> or <u>program phase</u>
- 2-8 words?
  - word = 1 float??

24.9.2003 Copyright Teemu Kerola 2003

# Number/types of Caches (3)

- One cache too large for best results
- Unified vs. split cache

(yhdistetty, erilliset)

- same cache for data and code, or not?
- split cache: can optimise structure separately for data and code
- Multiple levels of caches
  - L1 same chip as CPU

- L3 - same board as CPU

- L2 same package or chip as CPU
  - older systems: same board

Fig. 4.13 (Fig. 4.23 [Stal99])

24.9.2003 Copyright Teemu Kerola 2003

u Kerola 2003 38

-- End of Ch. 4-5: Cache Memory --



http://www.intel.com/procs/servers/feature/cache/unique.htm

"The Pentium® Pro processor's unique multi-cavity chip package brings L2 cache memory closer to the CPU, delivering higher performance for business-critical computing needs."

24.9.2003 Copyright Teemu Kerola 2003 39