## CPU Structure and Function Ch 11

General Organisation Registers Instruction Cycle Pipelining Branch Prediction Interrupts

## General CPU Organization (4)

- ALU
  - does all <u>real</u> work
- Registers
  - data stored here
- Internal CPU Bus
- Control



Fig. 11.1

Fig. 11.2

More in Chapters 14-15

- determines who does what when
- driven by clock
- uses control signals (wires) to control what every circuit is doing at any given clock cycle

Copyright Teemu Kerola 2001

### Register Organisation (4)

- Registers make up CPU work space
- User visible registers ADD R1,R2,R3
   accessible directly via instructions
- Control and status registers BNeq Loop
  - may be accessible indirectly via instructions
  - may be accessible only internally HW exception
- Internal latches for temporary storage during instruction execution
  - E.g., ALU operand either from constant in instruction or from machine register

## User Visible Registers

- Varies from one architecture to another
- General purpose register (GPR)
  - Data, address, index, PC, condition, ....
- Data register
  - Int, FP, Double, Index
- Address register
- Segment and stack pointers
  - only privileged instruction can write?
- Condition codes
  - result of some previous ALU operation

## Control and Status Registers (5)

• PC

- next instruction (not current!)
- part of process state
- IR, Instruction (Decoding) Register
  - current instruction
- MAR, Memory Address Register
  - current memory address
- MBR, Memory Buffer Register
  - current data to/from memory
- PSW, Program Status Word
  - what is allowed? What is going on?
  - part of process state



## PSW - Program Status Word (6)

- State info from latest ALU-op
  - Sign, zero?
  - Carry (for multiword ALU ops)?
  - Overflow?
- Interrupts that are enabled/disabled?
- Pending interrupts?
- CPU execution mode (supervisor, user)?
- Stack pointer, page table pointer?
- I/O registers?

26/09/2001

### Instruction Cycle (4)

- Basic cycle with interrupt handling Fig. 11.4
- Indirect cycle
- Data Flow
  - CPU, Bus, Memory
- Data Path



Figs 11.5-6

### Fig 14.5

- CPU's "internal data bus" or "data mesh"
- All computation is data transformations occurring on the data path
- Control signals determine data flow & action for each clock cycle

# Pipeline Example

#### (liukuhihna)

- Laundry Example (David A. Patterson)
- Ann, Brian, Cathy, Dave each have one load of clothes to wash, dry, and fold
- Washer takes 30 minutes
- Dryer takes 40 minutes
- "Folder" takes 20 minutes





**ABCD** 





# Pipelining Lessons (4)

- Pipelining doesn't help <u>latency</u> of single task, but it helps throughput of the entire workload
- Pipeline rate limited by slowest pipeline stage
- Multiple tasks operating simultaneously
- Potential speedup = maximum possible speedup = Number pipe stages





# Pipelining Lessons (3)

- <u>Unbalanced lengths</u> of pipe stages reduces speedup
- May need more resources
  - Enough electrical current to run both washer and dryer simultaneously?
  - Need to have at least
    2 people present all
    the time?
- Time to "fill" pipeline and time to "drain" it reduces speedup



# 2-stage Instruction Execution Pipeline (4) Fig. 11.10

- Good: instruction pre-fetch at the same time as execution of previous instruction
- Bad: execution phase is longer, I.e., fetch stage is sometimes idle
- Bad: Sometimes (jump, branch) wrong instruction is fetched
  - every  $6^{th}$  instruction?
- Not enough parallelism  $\Rightarrow$  more stages?

# Another Possible Instruction Execution Pipeline

- FE <u>Fe</u>tch instruction
- DI <u>D</u>ecode <u>instruction</u>
- CO <u>Calculate operand effective addresses</u>
- FO <u>Fetch operands from memory</u>
- EI <u>Execute</u> Instruction
- WO <u>Write operand</u> (result) to memory





### Pipeline Execution Time (3)

- <u>Time</u> to execute <u>one instruction</u> (latency, seconds) may be <u>longer</u> than for non-pipelined machine
  - extra latches to store intermediate results
- <u>Time</u> to execute 1000 instructions (seconds) is <u>shorter</u> (better) than that for non-pipelined machine, I.e.,

<u>Throughput</u> (instructions per second) for pipelined machine is <u>better</u> (bigger) than that for non-pipelined machine

• Is this good or bad? Why?

## Pipeline Speedup Problems

- Some stages are shorter than the others
- Dependencies between instructions
  - control dependency
    - E.g., conditional branch decision know only after EI stage

### Pipeline Speedup Problems

- Dependencies between instructions
  - data dependency
    - One instruction depends on data produced by <u>some earlier</u> instruction
  - <u>structural dependency</u>
    - Many instructions need the <u>same resource</u> <u>at the same time</u>
    - memory bus, ALU, ...

Known Fig. 11.12 after EI stage MUL **R1**,R2,R3 LOAD R6, $ArrB(\mathbf{R1})$ Needed in CO stage STORE R1, VarX R2,R3,VarY ADD R3,R4,R5 MUL FO FI

Cycle Time overhead?  $\tau = \max[\tau_i] + d = \tau_m + d >> d$ (min) cycle time

max gate delay in stage

delay in latches between stages (= clock pulse, or clock cycle time) gate delay in stage i

- Cycle time is the same for all stages - time (in clock pulses) to execute the cycle
- Each stage executed in one cycle time •
- Longest stage determines min cycle time – max MHz rate for system clock

Pipeline Speedupn instructions, k stagesn instructions, k stagesTime  
not pipelined:
$$T_1 = nk\tau$$
Time  
pipelined: $T_k = [k + (n-1)]\tau$ k cycles until  
l st instruction  
completest cycle for  
each of the rest  
(n-1) instructions

(n-1) instructions

Copyright Teemu Kerola 2001



### Branch Problem Solutions (5)

### • Delayed Branch

- compiler places some useful instructions
   (1 or more!) after branch (or jump) instructions
- these instructions are almost completely executed when branch decision is known
- less actual work lost

Fig. 12.7

– can be difficult to do

### Branch Probl. Solutions (contd) (6)

- Multiple instruction streams
  - execute speculatively in both directions
    - Problem: we do not know the branch target address early!
  - if one direction splits, continue each way again
  - lots of hardware
    - speculative results (registers!), control
  - speculative instructions may delay real work
    - bus & register contention?
  - need to be able to <u>cancel</u> not-taken instruction streams in pipeline

### Branch Probl. Solutions (contd) (2)

• Prefetch Branch Target

IBM 360/91 (1967)

- prefetch just branch target instruction
- do not execute it, I.e., do only FI stage
- if branch take, no need to wait for memory
- Loop Buffer
  - keep n most recently fetched instructions in high speed buffer inside CPU
  - works for small loops (at most *n* instructions)

# Branch Probl. Solutions (contd) (5)

- Branch Prediction
  - guess (intelligently) which way branch will go
  - static prediction: all *taken* or all *not taken*
  - static prediction based on opcode
    - E.g., because BLE instruction is *usually* at the end of loop, guess "taken"
  - dynamic prediction taken/not taken
    - based on previous time this instruction was executed
    - need space (1 bit) in CPU for each (?) branch
    - end of loop always wrong twice!
    - extension based on two previous time execution
      - need more space (2 bits)



### Branch Address Prediction (3)

- It is not enough to know whether <u>branch</u> is <u>taken or not</u>
- Must know also <u>branch address</u> to fetch target instruction
- Branch History Table
  - state information to guess whether branch will be taken or not
  - previous branch target address
  - stored in CPU for each (?) branch

## Branch History Table

• Cached

#### PowerPC 620

- entries only for most recent branches
  - Branch instruction address, or tag bits for it
  - Branch taken prediction bits (2?)
  - Target address (from previous time) or complete target instruction?
- Why cached
  - expensive hardware, not enough space for all possible branches
  - at lookup time check first whether entry for correct branch instruction

### CPU Example: PowerPC

- User Visible Registers
  - 32 general purpose regs, each 64 bits

Fig. 11.22

Fig. 11.23a

Table 11.3

Fig. 11.23b

Table 11.4

- Exception reg (XER), 32 bits
- 32 FP regs, each 64 bits
  - FP status & control (FPSCR), 32 bits
- branch processing unit registers
  - Condition, 32 bits
    - -8 fields, each 4 bits
    - identity given in instructions
  - Link reg, 64 bits
    - E.g., return address
  - Count regs, 64 bits
    - E.g., loop counter

### CPU Example: PowerPC

### • Interrupts

- cause

- system condition or event
- instruction



### CPU Example: PowerPC

- Machine State Register, 64 bits Table 11.6
  - bit 48: external (I/O) interrupts enabled?
  - bit 49: privileged state or not
  - bits 52&55: which FP interrupts enabled?
  - bit 59: data address translation on/off
  - bit 63: big/little endian mode
- Save/Restore Regs SRR0 and SRR1
  - temporary data needed for interrupt handling

## Power PC Interrupt Invocation

• Save return PC to SRR0

### Table 11.6

- current or next instruction at the time of interrupt
- Copy relevant areas of MSR to SRR1
- Copy additional interrupt info to SRR1
- Copy fixed new value into MSR
  - different for each interrupt
  - address translation off, disable interrupts
- Copy interrupt handler entry point to PC
  - two possible handlers, selection based on bit 57 of original MSR

### Power PC Interrupt Return

### Table 11.6

- Return From Interrupt (rfi) instruction
   privileged
- Rebuild original MSR from SRR1
- Copy return address from SRR0 to PC

