## IA-64 and Crusoe Architectures Ch 15

- IA-64
  - General Organization
  - Predication, Speculation
  - Software Pipelining
  - Example: Itanium
- Crusoe
  - General Architecture
  - Emulated Precise Exceptions

11.10.2002 Copyright Teemu Kerola 2002

### General Organization

- EPIC Explicit Parallel Instruction Computing
  - parallelism visible at instruction level, not "secretly" implemented in the processor
    - new instruction stream semantics
  - compiler prevents many "hazards" (dependency problems), hardware can depend on it
- VLIW (Very Long Instruction Word)
- Branch predication: many speculative execution tracks
- Speculate on memory data loads


### IA-64 General Organization

- 128 64-bit (+ Not a Thing bit) registers
  - integer, logical, general purpose
- 128 82-bit registers
  - floating point (IEEE double extended)
  - graphics
  - f0: 0.0, f1: 1.0
- 64 1-bit predicate registers
- 8 64-bit branch registers

Fig 15.1

Slide 9 [Lamb00]


### **Instruction Format**

- Instruction (41 bits)
  - operation & predicates
  - up to 6 instruction executions in parallel
- Fig 15.2

Tbl 15.3

Slide 8 [Lamb00]

- Instruction bundle (128 bits)
  - three instructions & template
  - smallest unit to fetch instructions from memory
- Instruction group
  - machine instructions that could be issued in parallel
  - end of group marked with ";;" in symbolic assembly language code
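
The bundle arithmetic above (three 41-bit instructions plus a template in 128 bits) leaves 128 - 3 × 41 = 5 bits for the template. A minimal sketch of packing and unpacking such a bundle follows. It assumes the template sits in the low-order bits with the three slots above it; the slot values are arbitrary bit patterns, not real IA-64 encodings, and the function names are hypothetical.

```python
TEMPLATE_BITS = 5          # 128 - 3 * 41
SLOT_BITS = 41             # one instruction
SLOT_MASK = (1 << SLOT_BITS) - 1


def pack_bundle(template, slots):
    """Pack a 5-bit template and three 41-bit slots into one 128-bit integer."""
    if not 0 <= template < (1 << TEMPLATE_BITS):
        raise ValueError("template must fit in 5 bits")
    if len(slots) != 3:
        raise ValueError("a bundle holds exactly three instructions")
    bundle = template
    for i, slot in enumerate(slots):
        if not 0 <= slot <= SLOT_MASK:
            raise ValueError("each instruction must fit in 41 bits")
        # Assumed layout: slot i starts right above the template and earlier slots.
        bundle |= slot << (TEMPLATE_BITS + i * SLOT_BITS)
    return bundle


def unpack_bundle(bundle):
    """Split a 128-bit bundle back into (template, [slot0, slot1, slot2])."""
    template = bundle & ((1 << TEMPLATE_BITS) - 1)
    slots = [
        (bundle >> (TEMPLATE_BITS + i * SLOT_BITS)) & SLOT_MASK
        for i in range(3)
    ]
    return template, slots


example = pack_bundle(0b00001, [0x123, SLOT_MASK, 0x1])
```

Because the processor always fetches whole bundles, the template is what tells the hardware which kind of functional unit each slot needs and where an instruction group ends.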


### **Predicated Execution**

- Execute both paths of each branch
- Fig 15.3 (a)
- if-then-else gives two predicates, and each path will advance with its own predicate
- Predicate values known only after branch instruction completes
- Discard "wrong" path, commit "right" path
  - known always before commit time?
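
The idea in the list above can be sketched as a small simulation. This is only an illustrative model, not IA-64 code: one compare produces two complementary predicates, both paths are issued, and a result is committed only when its predicate is true. The function name and register dictionary are hypothetical.

```python
def predicated_if_then_else(a, b):
    """Model `if a == b: x = 1 else: x = 2` without a taken/not-taken branch."""
    regs = {"x": 0}

    # One compare writes two complementary predicates (then-path, else-path).
    p1 = (a == b)
    p2 = not p1

    # Both paths are issued; every instruction carries its own predicate.
    issued = [
        (p1, "x", 1),   # then-path: x = 1, guarded by p1
        (p2, "x", 2),   # else-path: x = 2, guarded by p2
    ]

    committed = 0
    for predicate, dest, value in issued:
        if predicate:
            regs[dest] = value      # "right" path: result is committed
            committed += 1
        # "wrong" path: result is simply discarded

    return regs["x"], committed
```

Calling `predicated_if_then_else(3, 3)` commits only the then-path result, and `predicated_if_then_else(3, 4)` commits only the else-path result; in both cases two instructions were issued and one was wasted.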




### Speculative Loading, i.e., Control Speculation

- Start loading from memory in advance so that data is available earlier
  - load instruction "hoisted" earlier in code, before some <u>branch instruction</u>
  - interrupts are delayed (via NaT bit in register), and handled only at the time when they would have been handled normally
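
A minimal sketch of the deferred-interrupt mechanism described above, under simplifying assumptions: memory is a dictionary, a missing address stands for a faulting load, and a sentinel object stands for a register with its NaT bit set. The names `speculative_load`, `check`, and `run` are hypothetical; they loosely correspond to a speculative load hoisted above a branch and a check left at the load's original position.

```python
class DeferredFault(Exception):
    """Raised when a faulting speculative load is actually needed."""


NAT = object()  # stands in for a register whose NaT bit is set


def speculative_load(memory, address):
    """Hoisted load: never raises; a faulting address only sets NaT."""
    return memory.get(address, NAT)


def check(value):
    """Check at the load's original position: the deferred fault surfaces here."""
    if value is NAT:
        raise DeferredFault("speculative load faulted")
    return value


def run(memory, address, take_branch):
    r1 = speculative_load(memory, address)   # hoisted above the branch
    if take_branch:
        return None                          # value never used, fault never seen
    return check(r1) + 1                     # original position of the load
```

With `run({}, 100, True)` the bad load costs nothing, while `run({}, 100, False)` raises exactly where the non-speculative program would have faulted.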



Fig 15.3 (b)

Slides 27, 28 [Lamb00]











### Itanium

- 1st implementation of IA-64 architecture
- "Simpler" than conventional superscalar
  - no reservation stations, reorder buffers
  - no large renamed register set for architecture registers
  - no dependency issue logic
    - dependencies solved by compiler, and explicitly solved in code
- Very large memory address space
- Explicit control over memory hierarchies
- Explicit memory op fences

Slides 10-12 [Lamb00]



### Itanium 2

- Upgraded cache hierarchy
  - split L1: 16KB + 16KB, 4-way set assoc, 64B lines
  - unified L2: 256KB, 8-way set assoc, 128B lines
  - on-chip unified L3: 3MB, 12-way set assoc
- TLB hierarchy
  - instruction L1 TLB: 32 entry full assoc
  - instruction L2 TLB: 128 entry full assoc
  - data L1 TLB: 32 entry full assoc
  - data L2 TLB: 128 entry full assoc
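
As a worked check of the cache parameters above, the number of sets follows from size / (ways × line size). The sketch below applies this to the L1 and L2 figures; L3 is left out because the slide does not give its line size. The helper names are hypothetical.

```python
KB = 1024


def cache_sets(size_bytes, ways, line_bytes):
    """Number of sets in a set-associative cache."""
    lines = size_bytes // line_bytes
    return lines // ways


def log2_exact(n):
    """Base-2 logarithm of a power of two, as an integer."""
    assert n > 0 and n & (n - 1) == 0, "expected a power of two"
    return n.bit_length() - 1


l1_sets = cache_sets(16 * KB, ways=4, line_bytes=64)     # 64 sets
l2_sets = cache_sets(256 * KB, ways=8, line_bytes=128)   # 256 sets

# Address bits used inside a line (offset) and to select a set (index).
l1_offset_bits, l1_index_bits = log2_exact(64), log2_exact(l1_sets)    # 6, 6
l2_offset_bits, l2_index_bits = log2_exact(128), log2_exact(l2_sets)   # 7, 8
```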


### Itanium 2 (contd)

- Max 6 issues per cycle
  - 11 issue ports
- Many functional units, all fully pipelined
  - 6 general purpose ALUs
  - 4 data cache memory ports
  - 6 multimedia FUs
  - 4 FPUs
  - 3 branch units
- Perfect loop prediction
- Lots of branch prediction hints in code


### IA-64 Summary

- Parallel semantics for ISA (Instr Set Arch)
- Lots of explicit ILP (Instr Level Parallelism)
- Memory hierarchy (cache) controls in ISA
  - normal access temporal locality hint (e.g., ifetch.t1) suggests to keep data in L1D, L2, and L3
  - less important hint (e.g., Fpload.nt1) suggests to keep data only in L2 and L3
- Memory synchronization primitives in ISA


### IA-64 Summary (contd)

- Lots of speculative work that may be wasted
  - predicated execution
  - misprediction costs mostly avoided
  - branch prediction hints in ISA
  - load speculation: "hoist" loads above branch or store
- Large visible register set, no hidden rename regs
  - automatic stack frame save/restore
- HW-controlled software pipelining


### Crusoe Architecture

- Major Ideas
- General Architecture
- Emulated Precise Exceptions
- What to do with It


### Background

- Transmeta Corporation
  - Paul Allen (Microsoft), George Soros (Soros Funds)
  - David R. Ditzel (Sun): orig. CEO, now CTO
  - Edmund J. Kelly, Malcolm John Wing, Robert F. Cmelik
  - Linus B. Torvalds, February 1997 $\rightarrow \dots$
- Patent 5832205
  - applied August 20, 1996
  - granted November 3, 1998
  - many (a few) other patents ...
- Crusoe processor
  - published January 19, 2000


### Basic Idea(s)



Create a new processor which, when coupled with a "morph host" emulator, can run Intel/Windows code faster than a state-of-the-art Intel processor, *or* at the same speed but with less electric power



- New processor can be implemented with significantly fewer gates than competitive processors
- Compete with Intel, friendly with Microsoft
  - sell chip with emulator code to system manufacturers (Dell, IBM, Sun, etc.)
- X86 (IA-32) binary is new binary standard
- Native OS not so important
  - services from target OS: e.g., Windows or Linux


### Major General Ideas

- Emulation can be faster than direct execution
- TLB used to solve new problems
  - track memory accesses for memory mapped I/O
  - track memory accesses for self-modifying code
- Most of the executed code is generated "on the fly"
  - not compiled before execution begins
  - extremely optimized dynamic code generation
- Optimized code allows for simpler machine
  - smaller, faster, uses less power?


### Major General Ideas (contd)

- Self-modified code (dynamically created code) can be generated so that it is extremely optimized for execution
  - issue dependencies, reorder, reschedule problems solved at code generation (<u>not</u> in HW)
  - processor HW does not need to solve these
- Optimize for speed, but only when needed
  - do <u>not optimize</u> for speed when exact state change is required (<u>this is the tricky part!</u>)
- Alias detection to assist keeping globals in registers


### Major General Ideas (contd)

- NOT: faster and with less power

Class action suit (5.7.2001) ... stating that ... a revolutionary process that delivered longer battery life in Mobile Internet Computers while delivering high performance ....

http://www.theregister.co.uk/content/3/20058.html


### Major Emulation Ideas

- Target processor (i.e., Intel processor) state kept in dedicated HW registers
  - working state ("speculated" state?), committed state
- Memory store buffer keeps uncommitted ("speculated") emulated memory state
- Specific instructions support emulation
  - commit, rollback (exact exceptions)
  - prot (aliases)
- TLB (and VM) designed to support emulation
  - A/N-bit (mem-mapped I/O), T-bit (self-mod. code)


### General Architecture

- VLIW implementation
  - VLIW = Very Long Instruction Word
  - 4 simultaneous RISC instructions in "molecule"
    - one each of float, int, load/store, branch
  - large L3 Translation Cache for VLIW "molecules"
    - 8-16 MB
    - similar to Pentium 4 Trace Cache?
  - no circuitry for issue dependencies, reorder, optimize, reschedule
    - compiler takes care of these
    - data & structural dependencies under compiler control?
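
To illustrate how software rather than hardware can resolve issue dependencies, here is a minimal sketch of a greedy, in-order packer that groups operations into "molecules" with at most one operation per unit type (float, int, load/store, branch). It is an assumption-laden toy, not Transmeta's algorithm: it never reorders operations, and it starts a new molecule whenever a unit slot is taken or an operation depends on a register written earlier in the same molecule. The operation tuples and the name `pack_molecules` are hypothetical.

```python
def pack_molecules(ops):
    """Pack (name, unit, dest, srcs) operations into molecules, in program order.

    A molecule is a dict mapping unit type -> operation name.
    """
    molecules = []
    current, written = {}, set()
    for name, unit, dest, srcs in ops:
        # Data dependency: reads or rewrites a register written in this molecule.
        depends = any(s in written for s in srcs) or dest in written
        # Structural dependency: the unit slot is already occupied.
        if unit in current or depends:
            molecules.append(current)
            current, written = {}, set()
        current[unit] = name
        if dest is not None:
            written.add(dest)
    if current:
        molecules.append(current)
    return molecules


ops = [
    ("ld r1,[r9]",    "mem",    "r1", ["r9"]),
    ("add r2,r3,r4",  "int",    "r2", ["r3", "r4"]),
    ("fmul f1,f2,f3", "float",  "f1", ["f2", "f3"]),
    ("add r5,r1,r2",  "int",    "r5", ["r1", "r2"]),
    ("br loop",       "branch", None, []),
]
molecules = pack_molecules(ops)
```

For this input the first three operations share one molecule, and the second `add` starts a new one because the int slot is already used (it also depends on `r1` and `r2`). The hardware can then issue each molecule without checking anything.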


### General Architecture (contd)

- Large register set
  - native regs: 64 INT, 32 FP
    - extra regs for renaming
  - target architecture regs: complete CPU state
    - INT, FP, control (e.g., Reax, Recx, Rseq, Reip)
    - working regs for normal emulation
    - committed regs for saving emulated processor state


### General Architecture (contd)

- TLB
  - new features to solve new problems
    - previously the TLB also solved memory protection problems, in addition to plain VM address mapping
  - A/N-bit for memory-mapped I/O detection
    - trap to emulator, which creates precise code
    - memory-mapped I/O requires precise emulated processor state changes
  - T-bit for self-modifying code detection
    - trap to emulator, which recreates emulating code in instruction cache ("translation buffer")
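
The T-bit mechanism can be sketched as a translation cache that marks every page it has translated code from and drops those translations when the page is written. This is a simplified model under stated assumptions: a 4 KB page size, lazy retranslation on the next fetch instead of immediate regeneration, and a caller-supplied `translate` function standing in for the emulator's code generator. All names are hypothetical.

```python
PAGE_SIZE = 4096  # assumed page size


class TranslationCache:
    def __init__(self, translate):
        self.translate = translate   # hypothetical: target address -> native code
        self.cache = {}              # entry address -> translated code
        self.t_bit = set()           # pages whose T-bit is set
        self.translations = 0        # how many times the translator ran

    def fetch(self, address):
        """Return translated code, translating only on a cache miss."""
        if address not in self.cache:
            self.cache[address] = self.translate(address)
            self.t_bit.add(address // PAGE_SIZE)   # protect the source page
            self.translations += 1
        return self.cache[address]

    def store(self, address):
        """Model a target store; returns True if it trapped to the emulator."""
        page = address // PAGE_SIZE
        if page not in self.t_bit:
            return False
        # Self-modifying code: translations from this page are now stale.
        self.cache = {a: c for a, c in self.cache.items()
                      if a // PAGE_SIZE != page}
        self.t_bit.discard(page)
        return True


tc = TranslationCache(lambda a: f"native@{a:#x}")
tc.fetch(0x1000)
tc.fetch(0x1000)                 # hit: no second translation
after_hits = tc.translations
plain_store = tc.store(0x2000)   # unrelated page: no trap
code_store = tc.store(0x1004)    # same page as translated code: trap
tc.fetch(0x1000)                 # miss again: retranslated
```

Stores to ordinary data pages stay cheap, and only writes into pages that hold translated code pay for a trap.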


### General Architecture (contd)

- Target memory store buffer
  - implemented with special register(s) to support emulation
  - keeps track of which target processor memory stores are committed and which are not
  - uncommitted memory stores can be discarded at rollback
    - modify HW registers implementing it
    - commit & rollback controlled from <u>outside</u> of the processor, not internally as is usual with speculative instructions


### General Architecture (contd)

- RISC instruction set
  - explicitly parallel code (VLIW)
  - commit instruction supports emulation
    - commits emulated processor and memory state
    - use when the <u>target processor</u> (Intel) state is coherent!
  - rollback instruction (?) supports emulation
    - some or all of it can be in emulator code
    - recover latest committed emulated target register state
    - delete uncommitted writes from store buffer
    - retranslate emulation code for precise state changes
      - commit now after every emulated instruction?
  - prot instruction for alias detection
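
The commit and rollback semantics above can be modelled in a few lines. This is a sketch under simplifying assumptions, not the actual hardware design: working and committed register sets are plain dictionaries, the target memory store buffer is a list of pending writes, and the class and method names are hypothetical.

```python
class EmulatedState:
    def __init__(self, regs, memory):
        self.committed_regs = dict(regs)   # last coherent target register state
        self.working_regs = dict(regs)     # speculated state used by translations
        self.memory = dict(memory)         # committed target memory
        self.store_buffer = []             # uncommitted (address, value) writes

    def write_reg(self, name, value):
        self.working_regs[name] = value

    def store(self, address, value):
        self.store_buffer.append((address, value))   # held back until commit

    def load(self, address):
        # The newest uncommitted store to this address wins over memory.
        for a, v in reversed(self.store_buffer):
            if a == address:
                return v
        return self.memory.get(address, 0)

    def commit(self):
        """Make working registers and buffered stores the committed state."""
        self.committed_regs = dict(self.working_regs)
        for a, v in self.store_buffer:
            self.memory[a] = v
        self.store_buffer.clear()

    def rollback(self):
        """Recover committed registers and delete uncommitted writes."""
        self.working_regs = dict(self.committed_regs)
        self.store_buffer.clear()


s = EmulatedState({"eax": 1}, {0x10: 5})
s.write_reg("eax", 2)
s.store(0x10, 9)
speculated = s.load(0x10)   # 9, visible only to the emulated program
s.rollback()                # e.g. an exception hit inside optimized code
after_rollback = (s.working_regs["eax"], s.load(0x10))
s.write_reg("eax", 2)
s.store(0x10, 9)
s.commit()
```

After a rollback the emulator can re-execute the same target instructions conservatively, committing after each one, so that the exception is reported with a precise target state.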


### Ordinary Program Execution

(Figure: memory holding `LDA R1, =543` and `ADD R2, R4, R5`, cache, instruction exec. circuits)













### **Crusoe Summary**

- Emulation can be done faster or with less energy than the "real thing"
- VLIW (EPIC?) core architecture
- Special HW to speed up emulation
  - x86 regs
  - memory-mapped I/O detection
  - alias and self-modifying code detection
- Special HW for precise interrupts
  - 2nd set of x86 regs
  - target memory store buffer
  - commit and rollback instruction in ISA


### Crusoe Summary (contd)

- Complex overall structure
- "Code Morphing Software"
  - JIT optimized code generation
  - compiler and interpreter resident in memory
  - fast but imprecise, or slow and precise emulation
- Optimize for speed or size (power, electricity)?
  - Small size ⇒ cheaper, less power

TM3200, TM5400, ..., TM5600: low power; TM5800: high speed


### -- IA-64 and Crusoe End --



"Aqua 3400 Portable Wireless Internet Access Device, Transmeta 400MHz, 8.4" TFT touch-screen"



"NEC Versa DayLite combines the power-saving 600 Mhz Crusoe TM5600 processor with dual battery systems that NEC claims will extend battery life to up to 7.5 hours on a single charge"
