Processor Architecture
CPU Organisation & Control
- Processor continuously runs fetch-decode-execute cycle
- Each instruction cycle takes several CPU clock cycles
- Requires the interaction of many CPU components
- ALU, CU, PC, IR, MAR, MDR
- Machine instructions may specify
- Op code
- Source operand reference
- Result operand reference
- Next instruction reference
- Some CPU registers are user-visible, such as data and address registers
- Control and status registers are used by CU and privileged OS processes only
- Executing an instruction may involve one or more operands, each of which must be fetched
- The instruction cycle model accounts for this with an indirect cycle
- Instruction pipelining uses otherwise wasted time, as new inputs can be accepted before previously accepted instructions have been output
- Control unit is responsible for generating control signals to drive the cycle
- Observe opcode input and choose right control signal - decode
- Assert control signals - execute
- Two approaches to CU design:
- Hardwired
- Uses a sequencer and a digital logic circuit that produces outputs
- Fast but limited by complexity and inflexibility
- Microprogrammed
- Uses a microprogram memory
- Has its own fetch-execute cycle - mini computer in the CPU
- Microaddress, MicroPC, MicroIR, microinstructions
- Easy to design, implement, flexible, can be reprogrammed
- Slower than hardwired
- Hardwired
- Instruction sequencing should be designed to utilise as many memory cycles as possible, possibly by overlapping fetches
- Proper sequence must be followed in sequencing control signals, to avoid conflicts
- MAR <- PC must precede MBR <- Memory
- Micro-ops are enabled by control signals to transfer data between registers/busses and perform arithmetic or logical operations
- Each step in the operation of a larger machine instruction is encoded into a micro-instruction
- Micro-instructions make up the micro-program
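A minimal sketch of the microprogrammed approach (the control store contents, signal names, and opcodes here are invented for illustration): a micro-PC steps through the control store, and each microinstruction asserts a set of control signals.

```python
# Hypothetical microprogrammed control unit: a micro-PC steps through a
# control store, asserting the control signals named in each microinstruction.
# The signal names and microcode below are illustrative, not a real ISA.

CONTROL_STORE = {
    # Shared fetch sequence: MAR <- PC must precede MBR <- Memory.
    "fetch": [
        {"MAR<-PC"},
        {"MBR<-Memory", "PC<-PC+1"},   # overlap memory read with PC increment
        {"IR<-MBR"},
    ],
    # Per-opcode execute sequences.
    "ADD": [
        {"MAR<-IR.addr"},
        {"MBR<-Memory"},
        {"ACC<-ACC+MBR"},
    ],
    "STORE": [
        {"MAR<-IR.addr"},
        {"MBR<-ACC"},
        {"Memory<-MBR"},
    ],
}

def run_instruction(opcode: str) -> None:
    """Run one machine instruction as its microprogram: fetch, then execute."""
    for phase in ("fetch", opcode):
        micro_pc = 0                          # the CU's own 'mini' PC
        program = CONTROL_STORE[phase]
        while micro_pc < len(program):
            micro_ir = program[micro_pc]      # micro-IR holds current microinstruction
            print(f"{phase}[{micro_pc}]: assert {sorted(micro_ir)}")
            micro_pc += 1                     # next microinstruction address

run_instruction("ADD")
```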
- Micro-program word length is based on 3 factors:
- The max number of simultaneous micro-ops supported
- How control info is represented/encoded
- How the next micro-instruction address is specified
- Horizontal/direct control has very wide word length with few micro-instructions per machine instruction
- Outputs buffered/gated with timing signals
- Fewer instructions == faster
- Vertical control uses narrower instructions with control signals encoded into bits
- Limited ability to express parallelism
- Requires external decoder to identify what control lines are being asserted
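A rough sketch of the two encodings (signal names and groupings invented): a horizontal word carries one bit per control line, while a vertical field is narrower but needs a decoder and expresses less parallelism.

```python
# Horizontal vs vertical microinstruction encoding (signal names invented).
SIGNALS = ["MAR<-PC", "MBR<-Memory", "IR<-MBR", "PC<-PC+1", "ACC<-ACC+MBR"]

# Horizontal: one bit per control line; wide word, no decoding needed.
def horizontal_decode(word: int) -> set[str]:
    return {sig for i, sig in enumerate(SIGNALS) if word & (1 << i)}

# Vertical: signals grouped into a narrow encoded field; an external
# decoder expands the field into at most one asserted line per group.
GROUP = {0: None, 1: "MAR<-PC", 2: "MBR<-Memory", 3: "IR<-MBR"}

def vertical_decode(field: int) -> set[str]:
    sig = GROUP[field]
    return {sig} if sig else set()

# A horizontal word can assert several micro-ops at once (parallelism)...
print(horizontal_decode(0b01010))   # {'MBR<-Memory', 'PC<-PC+1'}
# ...while a vertical field expresses only one signal per group per cycle.
print(vertical_decode(2))           # {'MBR<-Memory'}
```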
Performance
- M. J. Flynn in 1966 defined a simple means of classifying machines; SISD (single instruction stream, single data stream) is one such classification
- Uses fetch-decode-execute
- Fetch sub-cycle takes a fairly constant time
- Execute sub-cycle may vary greatly in duration
- A simple measure of performance is MIPS, millions of instructions per second
- Not actually that useful as it measures how fast a processor can do nothing
- Parallel performance is very difficult to measure due to system architecture and degree of parallelism varying
- Instruction bandwidth measures the instruction execution rate, similar to MIPS
- Data bandwidth, measured in FLOPS, measures the throughput
- It is nigh-on impossible to get full theoretical throughput in any system, especially parallel
- Speedup is a useful measure that factors in the degree of parallelism
- $S(n) = \frac{T_1}{T_n}$: (execution time on sequential machine, $T_1$) / (execution time on parallel machine, $T_n$)
- A closely related measure is efficiency, $E(n) = \frac{S(n)}{n}$
- Both measures depend on parallelism of algorithm
- An algorithm may be characterised by its degree of parallelism $P(t)$, which is the degree of parallelism that exists at time $t$
- Assume all computations are of two types: vector operations of length $N$, and scalar operations, where $N = 1$
- $f$ is the total proportion of scalar ops, so $(1 - f)$ is the measure of parallelism in the program
- $V$ is the throughput of vector ops in MFLOPS and $S$ is the scalar throughput
- Average throughput $W = \left(\frac{f}{S} + \frac{1 - f}{V}\right)^{-1}$
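A worked example of these measures, with invented numbers (note `S` below is the scalar throughput from the formula, not the speedup):

```python
# Worked example of the performance measures above (numbers invented).

def speedup(t_seq: float, t_par: float) -> float:
    """S(n) = T_1 / T_n: sequential time over parallel time."""
    return t_seq / t_par

def efficiency(s: float, n: int) -> float:
    """E(n) = S(n) / n for n processors."""
    return s / n

def avg_throughput(f: float, S: float, V: float) -> float:
    """W = 1 / (f/S + (1-f)/V): fraction f of ops run at scalar rate S,
    the rest at vector rate V (both in MFLOPS)."""
    return 1.0 / (f / S + (1.0 - f) / V)

s = speedup(t_seq=100.0, t_par=16.0)
print(s, efficiency(s, n=8))                 # 6.25 0.78125 on 8 processors
# Even a 10% scalar fraction drags throughput to ~92 MFLOPS, nowhere near V:
print(avg_throughput(f=0.1, S=10.0, V=1000.0))
```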
Pipelining
- The problem with an instruction/execute pipeline is contention over memory access
- Overcome with interleaved memory
- Two possible methods of controlling the transfer of information between pipeline stages
- Asynchronously using handshake signals
- Most flexible, max speed determined by slowest stage
- Synchronously, where there are latches between each stage all synced to a clock
- Example 5-stage I/E pipeline: fetch instruction, decode instruction, fetch operands, execute instruction, store results
- Pipelining assumes the only interaction between stages is the passage of information, but there are 3 major things that can cause hazards and stall the pipeline
- Structural hazards, resource conflicts where two stages wish to use the same resource, e.g. a memory port
- Interleave memory or prefetch data into cache
- Control hazards occur when there is a change in order of execution of instructions, eg when there is a branch or jump
- Cause the pipeline to stall and have to refill it
- Strategies exist to reduce pipeline failures due to conditional branches
- Instruction pre-fetch buffers, which fetch both paths of the branch
- Complex and rarely used
- Pipeline freeze strategy, which freezes the pipeline when it receives a branch instruction
- Simple, but poor performance
- Static prediction leverages known facts about branches to guess which one is taken
- 60% of all branches are taken, so it may be better to predict taken
- However, predicting not-taken wastes fewer pipeline cycles, so average performance may be better
- Dynamic prediction predicts on the fly for each instruction
- Based on branch instruction characteristics, target address characteristics, and branch history
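As a sketch of history-based dynamic prediction, here is the common 2-bit saturating-counter scheme (the table size and PC-indexing are illustrative choices, not from the notes):

```python
# A 2-bit saturating-counter dynamic branch predictor (a standard scheme;
# the table size and indexing here are illustrative).
TABLE_SIZE = 1024
counters = [1] * TABLE_SIZE   # 0,1 = predict not taken; 2,3 = predict taken

def predict(pc: int) -> bool:
    return counters[pc % TABLE_SIZE] >= 2

def update(pc: int, taken: bool) -> None:
    i = pc % TABLE_SIZE
    if taken:
        counters[i] = min(3, counters[i] + 1)
    else:
        counters[i] = max(0, counters[i] - 1)

# A loop branch: taken many times, not taken once at the exit. The 2-bit
# hysteresis means the single exit misprediction doesn't flip the state.
pc = 0x40
for taken in [True] * 9 + [False]:
    print(predict(pc), taken)
    update(pc, taken)
```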
- Data hazards, where an instruction depends on the result of a previous instruction that has not yet completed
- Pipeline clock period is determined by the slowest stage, usually execution
- Pipeline the execution unit separately, or have multiple execution units
- Sometimes useful to add feedback between stages (recursion), where the output of one stage becomes the input to a previous one
- Used in accumulation
- Alternative designs are always possible, which come with their own performance tradeoffs
- Space-time diagrams show pipeline usage
- Efficiency = (busy area)/(total area)
- Speedup $S(n) = \frac{\text{time unpipelined}}{\text{time pipelined}}$
- More generally, $S(n) = \frac{nk}{k + n - 1}$
- $k$ is the number of stages, $n$ is the number of instructions executed
- As $n \to \infty$, $S(n) \to k$
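A quick numerical check of the speedup formula (a minimal sketch; $k = 5$ is an arbitrary stage count):

```python
# Speedup of a k-stage pipeline over a non-pipelined unit on the same n
# instructions: nk cycles unpipelined vs k + n - 1 cycles pipelined.
def pipeline_speedup(k: int, n: int) -> float:
    return (n * k) / (k + n - 1)

for n in (1, 10, 100, 10_000):
    print(n, round(pipeline_speedup(k=5, n=n), 2))
# 1 1.0        a single instruction gains nothing
# 10 3.57
# 100 4.81
# 10000 5.0    as n grows, speedup approaches k
```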
- Complex pipelines with feedback and differently clocked stages can be difficult to design and optimise
- Reservation tables are space-time diagrams that show where data can be admitted to the pipeline
- Xs in adjacent columns of the same row show that stages operate for more than one clock period
- More than one X in a row, not next to each other, shows feedback
- Pipelines may not accept initiations at the start of every clock period, or collisions may occur
- Potential collisions are shown by the distance in time slots between Xs in each row
- Collision vector $C = (c_n \dots c_2 c_1)$ is derived from the distances between Xs
- $c_0 = 1$ always, as two initiations can never begin in the same clock period, so it is omitted
- $c_i = 1$ if a collision would occur with an initiation $i$ cycles after a previous initiation
- The initial collision vector is the state of the pipeline after the first initiation
- Take the distances between all pairs of Xs in each row: if a distance is $i$, then set bit $c_i$
- e.g. collisions at distances 1 and 4 give $C = (1001)$
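As a sketch of the derivation, here is the rule applied mechanically to an invented reservation table whose forbidden latencies come out as 1 and 4, matching the example vector above:

```python
# Deriving the collision vector from a reservation table (table invented).
# Rows are stages, columns are clock periods; True marks an X.
table = [
    [True,  False, False, False, True ],   # stage 1 used at t=0 and t=4
    [False, True,  True,  False, False],   # stage 2 used at t=1 and t=2
    [False, False, False, True,  False],   # stage 3 used at t=3
]

def collision_vector(table: list[list[bool]]) -> list[int]:
    """c[i] = 1 if an initiation i cycles after another would collide."""
    n = len(table[0])
    c = [0] * n                      # index i = distance i between initiations
    for row in table:
        times = [t for t, used in enumerate(row) if used]
        for a in times:
            for b in times:
                if b > a:
                    c[b - a] = 1     # two initiations b-a apart share a stage
    return c

print(collision_vector(table))
# [0, 1, 0, 0, 1]: latencies 1 and 4 are forbidden, 2 and 3 are allowed
```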
- Need a control mechanism to determine if new initiations can happen without a collision occurring
- Latency is the number of clock periods between initiations
- Average latency is the number of clock periods between initiations over some repeating cycle
- Minimum average latency is the smallest possible considering all possible sequences of initiations
- The goal for optimum design
- A pipeline changes state as a result of initiations, so represent activity as a state diagram
- A diagram of all pipeline states and changes starting with the initial collision vector
- Shifting the collision vector to the right gives the next state
- If the shifted vector has a 1 in its lowest bit, cannot initiate
- If it is 0, then can do a new initiation; the new vector is the bitwise OR of the shifted vector and the initial collision vector
- State diagram can be reduced to show only changes where initiations are taken
- Numbers on edges indicate number of clock periods to reach the next state shown
- Can identify cycles in graph
- Always taking an initiation as soon as the state allows, to give minimum latency each time, is the greedy strategy
- Will not always give minimum average latency but is close
- Often more than one greedy cycle
- Average latency for a greedy cycle is less than or equal to the number of 1s in the initial collision vector plus 1
- Gives an upper bound on latency
- Minimum average latency is greater than or equal to the max number of Xs in any reservation table row
- Gives a lower bound on latency
- Max Xs in any row $\le$ min avg latency $\le$ greedy cycle avg latency $\le$ number of 1s in the initial collision vector $+ 1$
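A sketch of the greedy strategy using the shift/OR mechanics above; the state is an integer whose bit $i$ marks a forbidden initiation $i$ cycles ahead (same invented collision vector as the earlier example):

```python
def greedy_latency_cycle(cv_bits: list[int], steps: int = 30) -> list[int]:
    """cv_bits[i] = 1 means latency i is forbidden. Returns the sequence
    of latencies chosen by initiating as soon as no collision occurs."""
    cv = sum(bit << i for i, bit in enumerate(cv_bits))
    state = cv                       # state just after the first initiation
    latencies, since_last = [], 0
    for _ in range(steps):
        state >>= 1                  # a clock period passes; distances shrink
        since_last += 1
        if state & 1 == 0:           # low bit clear: initiating now is safe
            latencies.append(since_last)
            since_last = 0
            state |= cv              # OR in the initial collision vector
    return latencies

lats = greedy_latency_cycle([0, 1, 0, 0, 1])   # forbidden latencies {1, 4}
print(lats, sum(lats) / len(lats))             # [2, 3, 2, 3, ...] avg 2.5
# Bounds check: 2 (max Xs per row) <= 2.5 <= 3 (two 1s in the vector, + 1)
```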
- A given pipeline may not give the required latency, so insert delays into the pipeline to expand the number of time slots and reduce collisions
- Can identify where to place delays to give a latency of $l$ cycles:
- Start with the first X; enter an X in a revised table, and mark as forbidden every $l$ cycles after it, to indicate the positions reserved for initiations
- Repeat for all Xs; when an X falls on a forbidden mark, delay that X by one or more cycles
- Mark all delayed positions, and delay all subsequent Xs by the same amount
- Delays can be added using a latch to delay by a cycle
Honestly just check the slides and examples for this one it makes zero sense lol
Superscalar Processors
- A single, linear instruction pipeline provides at best a steady-state Clocks per Instruction (CPI) of 1
- Fetching/decoding more than one instruction per clock cycle can reduce the CPI below 1
- An easy way to do this is to duplicate the pipeline
- For example:
- Two fetch/decode stages
- Execution staging window register
- Multiple execution pipelines for different instructions
- A non-uniform superscalar does not duplicate the whole pipeline, only some stages
- Number of replications before window is the degree of the superscalar processor
- Some pipeline stages need less than half a clock cycle, so doubling the internal clock speed gets two tasks done per external clock cycle
- Known as superpipelining
- A pipeline takes $k + n - 1$ clock cycles to execute $n$ instructions
- A superscalar pipeline of degree $m$ takes $k + \frac{n - m}{m}$ cycles to do the same
- An example pipeline has 4 stages: fetch, decode, execute, write-back
- Each stage is duplicated
- $m = 2$, the number of replications
- $k = 4$, the number of stages
- If instructions are aligned, the number of clocks required is $k + \frac{n - m}{m} = \frac{n}{2} + 3$
- If instructions are unaligned, then an extra cycle is needed: $\frac{n}{2} + 4$
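Putting the timing formulas together as a sketch (the one-cycle misalignment penalty is an assumption based on the example above):

```python
# Timing of a k-stage pipeline vs a degree-m superscalar running n
# instructions, using the formulas above.
def scalar_cycles(k: int, n: int) -> int:
    return k + n - 1

def superscalar_cycles(k: int, n: int, m: int, aligned: bool = True) -> float:
    cycles = k + (n - m) / m
    return cycles if aligned else cycles + 1   # assumed misalignment penalty

k, n, m = 4, 40, 2
print(scalar_cycles(k, n))                                 # 43
print(superscalar_cycles(k, n, m))                         # 23.0
print(scalar_cycles(k, n) / superscalar_cycles(k, n, m))   # ~1.87, tends to m
```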
- The CPI of a superscalar processor of degree $m$ is $\frac{1}{m}$
- For large values of $n$, the speedup approaches $m$, limited by the startup delay and pipeline length
- As $m$ increases, speedup increases linearly too, until the point where instruction-level parallelism limits further increases
- For many problems, ILP gives parallelism in the range 2-4x
- No reason to have a huge number of duplicated pipelines, as most programs have a limited degree of inherent parallelism
- Can be maximised by compiler and hardware techniques
- Limited by dependencies
- The program to be executed is a linear stream of instructions
- Instruction fetch stage includes branch prediction to form a dynamic stream which may include dependencies
- Processor dispatches instructions to be executed according to their dependencies
- Instructions are conceptually put back into sequential order and results recorded - known as committing or retiring the instruction
- Needed as instructions are executed out of order
- Instruction may also be executed speculatively and not need to be retired
Instruction Level Parallelism
- Common instructions can be initiated simultaneously and executed independently
- Superscalar processors rely on this ability to execute instructions in separate pipelines, possibly out-of-order
- Multiple functional units for multiple tasks
- ILP refers to the degree to which instructions can be executed in parallel
- Common techniques to exploit it include instruction pipelining and superscalar execution, but also:
- Out-of-order execution
- Register renaming
- Instructions conflict over use of the registers, and the processor has to stall to resolve the conflicts
- Can treat the problem as a resource conflict, and dynamically rename registers in hardware to reduce dependencies
- Use different registers to the ones that the instructions say
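A toy illustration of renaming (instruction format and register pool invented): every write gets a fresh physical register, so WAW and WAR hazards disappear while true dependencies survive.

```python
# Register renaming sketch: architectural registers are remapped to a pool
# of physical registers so WAR/WAW conflicts stop forcing stalls.
free = [f"p{i}" for i in range(8)]       # free physical registers
alias = {}                               # architectural -> physical mapping

def rename(dst: str, srcs: list[str]) -> str:
    srcs_phys = [alias.get(s, s) for s in srcs]   # read current mappings
    new = free.pop(0)                    # fresh register for every write...
    alias[dst] = new                     # ...so later writes don't clash
    print(f"{new} <- {srcs_phys}")
    return new

# r1 <- r2+r3; r4 <- r1; r1 <- r5+r6 (WAW on r1 with the first instruction)
rename("r1", ["r2", "r3"])   # p0 <- ['r2', 'r3']
rename("r4", ["r1"])         # p1 <- ['p0']       true dependency preserved
rename("r1", ["r5", "r6"])   # p2 <- ['r5', 'r6'] WAW removed: writes p2, not p0
```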
- Branch prediction
- Prefetch both sides of the branch, reduces delay
- Can be static or dynamic
- Speculative execution aims to do the work before it is known if results will be needed
- Relies on resource abundance to provide performance improvements
- Five factors fundamentally constrain ILP:
- True data dependency
- An instruction cannot execute because it requires data that will be produced by a preceding instruction
- Usually causes pipeline delays
- Procedural dependency
- Inherent to the sequential nature of execution
- Instructions following a branch have a dependency on the result of the branch
- Variable length instructions can prevent simultaneous fetching
- Resource conflicts
- Two or more instructions require a system resource at the same time
- Memories, caches, functional units, etc
- Output dependency (write-after-write), where two instructions write the same register and the writes must complete in program order
- Antidependency (write-after-read), where an instruction must not overwrite a register before an earlier instruction has read it
- A program may not always have enough inherent ILP to take advantage of the machine parallelism
- Limited machine parallelism will always inhibit performance
- Processor must be able to identify ILP
- Instruction issue refers to the process of initiating execution in the processor's functional units
- An instruction has been issued once it finishes decoding and reaches the first execute stage
- The instruction issue policy can have a large performance impact
- Three types of instruction order are significant:
- Fetch order
- Execute order
- Order in which instructions update the contents of memory
- Issue policy can fuck with these orders to whatever extent it pleases provided the results are correct
- Three general categories for instruction issue policies:
- In-order issue with in-order completion
- Do the same as what would be done by a sequential processor
- Issuing stalls when there is a conflict for a functional unit or an instruction takes more than one cycle
- In-order issue with out-of-order completion
- A number of instructions may be being executed at any time
- Limited by machine parallelism in functional units
- Still stalled by resource conflicts and dependencies
- Introduces output dependencies
- Out-of-order issue with out-of-order completion
- In-order issue will only decode up to a dependency or conflict
- Further decouple decode and execute stages
- A buffer - the instruction window - holds instructions after decode
- Processor can continually fetch/decode as long as window not full and execution is separate
- Increases the number of instructions available to the execution unit
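A toy model of out-of-order issue from an instruction window (instructions and registers invented): anything whose sources are ready issues, regardless of program order.

```python
# Decoded instructions wait in the window and issue as soon as their source
# registers are ready, not in program order.
window = [
    ("i1", "r1", {"r2"}),    # (name, destination, source registers)
    ("i2", "r3", {"r1"}),    # depends on i1 -> must wait
    ("i3", "r5", {"r4"}),    # independent -> can issue ahead of i2
]
ready = {"r2", "r4"}         # register values available so far

while window:
    issuable = [ins for ins in window if ins[2] <= ready]
    assert issuable, "deadlock: remaining instructions all blocked"
    for name, dst, _ in issuable:
        print("issue", name)           # prints i1, i3, then i2: out of order
        ready.add(dst)                 # completing it produces its result
        window = [i for i in window if i[0] != name]
```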