Processor Architecture

CPU Organisation & Control

  • Processor continuously runs fetch-decode-execute cycle
    • Each instruction cycle takes several CPU clock cycles
    • Requires the interaction of many CPU components
      • ALU, CU, PC, IR, MAR, MDR
    • Machine instructions may specify
      • Op code
      • Source operand reference
      • Result operand reference
      • Next instruction reference
    • Some CPU registers are user-visible, such as data and address registers
    • Control and status registers are used by CU and privileged OS processes only
    • Executing an instruction may involve one or more operands, each of which must be fetched
      • Accounted for in the instruction cycle model by an additional indirect cycle
  • Instruction pipelining allows otherwise wasted time to be used, as new inputs can be accepted before previously accepted instructions have been output
  • Control unit is responsible for generating control signals to drive the cycle
    • Observe opcode input and choose right control signal - decode
    • Assert control signals - execute
    • Two approaches to CU design:
      • Hardwired
        • Uses a sequencer and a digital logic circuit that produces outputs
        • Fast but limited by complexity and inflexibility
      • Microprogrammed
        • Uses a microprogram memory
        • Has its own fetch-execute cycle - a mini computer in the CPU
          • Microaddress, MicroPC, MicroIR, microinstructions
        • Easy to design and implement; flexible and can be reprogrammed
        • Slower than hardwired
  • Instruction sequencing should be designed to utilise as many memory cycles as possible, possibly by overlapping fetches
    • Control signals must be asserted in the proper sequence to avoid conflicts (see the fetch-cycle sketch after this list)
      • MAR <- PC must precede MBR <- Memory
  • Micro-ops are enabled by control signals to transfer data between registers/busses and perform arithmetic or logical operations
    • Each step in the operation of a larger machine instruction is encoded into a micro-instruction
    • Micro-instructions make up the micro-program
    • Micro-program word length is based on 3 factors:
      • The max number of simultaneous micro-ops supported
      • How control info is represented/encoded
      • How the next micro-instruction address is specified
    • Horizontal/direct control has very wide word length with few micro-instructions per machine instruction
      • Outputs buffered/gated with timing signals
      • Fewer instructions == faster
    • Vertical control uses narrower instructions with control signals encoded into bits
      • Limited ability to express parallelism
      • Requires external decoder to identify what control lines are being asserted
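
As a concrete example of micro-op sequencing, here is a minimal Python sketch of the fetch cycle. The toy memory, register dictionary, and grouping of micro-ops into time periods are illustrative assumptions, not any particular CU design:

```python
# Minimal sketch of the fetch cycle as a sequence of micro-ops.
# Register names (PC, MAR, MBR, IR) follow the notes; everything
# else here is an illustrative assumption.

memory = {0: "LOAD R1, 100", 1: "ADD R1, R2"}   # toy instruction memory
regs = {"PC": 0, "MAR": None, "MBR": None, "IR": None}

def fetch_cycle(regs, memory):
    # t1: MAR <- PC             (must precede the memory read)
    regs["MAR"] = regs["PC"]
    # t2: MBR <- Memory[MAR];   PC <- PC + 1 can share this period,
    #     as the two micro-ops use no common resource
    regs["MBR"] = memory[regs["MAR"]]
    regs["PC"] += 1
    # t3: IR <- MBR             (instruction is now ready for decode)
    regs["IR"] = regs["MBR"]
    return regs["IR"]

print(fetch_cycle(regs, memory))   # LOAD R1, 100
print(fetch_cycle(regs, memory))   # ADD R1, R2
```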

Performance

  • M J Flynn in 1966 defined a simple means of classifying machines; SISD is one such classification
    • Uses fetch-decode-execute
    • Fetch sub-cycle runs at a fairly constant speed
    • Execute sub-cycle may vary in speed greatly
  • A simple measure of performance is MIPS, millions of instructions per second
    • Not actually that useful as it measures how fast a processor can do nothing
  • Parallel performance is very difficult to measure due to system architecture and degree of parallelism varying
  • Instruction bandwidth measures the instruction execution rate, similar to MIPS
  • Data bandwidth, measured in FLOPS, measures the throughput
  • It is nigh-on impossible to get full theoretical throughput in any system, especially parallel
  • Speedup is a useful measure that factors in the degree of parallelism
    • S(n) = T(1)/T(n), where T(1) is the execution time on a sequential machine and T(n) is the execution time on a parallel machine with n processors
    • A closely related measure is efficiency, E(n) = S(n)/n
    • Both measures depend on the parallelism of the algorithm (see the worked example at the end of this section)
  • An algorithm may be characterised by its degree of parallelism P(t), which is the degree of parallelism that exists at time t
  • Assume all computations are of two types, vector operations of length N and scalar operations (where N = 1)
    • f is the total proportion of scalar ops, so (1 - f) is the measure of parallelism in the program
    • V is the throughput of vector ops in MFLOPS and S is the scalar throughput
      • Average throughput W = 1/(f/S + (1 - f)/V)
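
A worked example of these measures; a minimal sketch, with the symbol names (T, n, f, S, V, W) taken from the reconstructed formulas above:

```python
# Worked example of the performance measures above.

def speedup(t_seq, t_par):
    return t_seq / t_par              # S(n) = T(1)/T(n)

def efficiency(s, n):
    return s / n                      # E(n) = S(n)/n

def avg_throughput(f, scalar_mflops, vector_mflops):
    # Harmonic-mean throughput: W = 1/(f/S + (1 - f)/V)
    return 1.0 / (f / scalar_mflops + (1 - f) / vector_mflops)

s = speedup(t_seq=100.0, t_par=16.0)   # 6.25 on, say, n = 8 processors
print(s, efficiency(s, n=8))           # 6.25 0.78125
# Even 10% scalar code drags a 1000 MFLOPS vector unit down hard:
print(avg_throughput(f=0.1, scalar_mflops=10, vector_mflops=1000))  # ~91.7
```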

Pipelining

  • The problem with an instruction/execute pipeline is contention over memory access
    • Overcome with interleaved memory
  • Two possible methods of controlling the transfer of information between pipeline stages
    • Asynchronously using handshake signals
      • Most flexible, max speed determined by slowest stage
    • Synchronously, where there are latches between each stage all synced to a clock
  • Example 5-stage I/E pipeline: fetch instruction, decode instruction, fetch operands, execute instruction, store results
  • Pipelining assumes the only interaction between stages is the passage of information, but there are 3 major things that can cause hazards and stall the pipeline
    • Structural hazards, resource conflicts where two stages wish to use the same resource, e.g. a memory port
      • Interleave memory or prefetch data into cache
    • Control hazards occur when there is a change in order of execution of instructions, eg when there is a branch or jump
      • Cause the pipeline to stall and need refilling
      • Strategies exist to reduce pipeline failures due to conditional branches
        • Instruction pre-fetch buffers, which fetch both branch paths
          • Complex and rarely used
        • Pipeline freeze strategy, which freezes the pipeline when it receives a branch instruction
          • Simple, but poor performance
        • Static prediction leverages known facts about branches to guess which one is taken
          • 60% of all branches are taken, so it may seem better to predict taken
          • However, predicting not-taken wastes fewer pipeline cycles, so average performance may be better
        • Dynamic prediction predicts on the fly for each instruction (see the predictor sketch after this list)
          • Based on branch instruction characteristics, target address characteristics, and branch history
    • Data hazards, where an instruction depends on the result of a previous instruction that has not yet completed
  • Pipeline clock period is determined by the slowest stage, usually execution
    • Pipeline the execution unit separately, or have multiple execution units
  • Sometimes useful to add feedback between stages (recursion), where the output of one stage becomes the input to a previous one
    • Used in accumulation
  • Alternative designs are always possible, which come with their own performance tradeoffs
  • Space-time diagrams show pipeline usage
    • Efficiency = (busy area)/(total area)
      • Speedup = (time taken without pipeline)/(time taken with pipeline)
    • More generally, S(k, n) = nk/(k + n - 1)
      • k is the number of stages, n is the number of instructions executed
      • As n → ∞, S(k, n) → k
  • Complex pipelines with feedback and differently clocked stages can be difficult to design and optimise
    • Reservation tables are space-time diagrams that show where data can be admitted to the pipeline
      • Xs in adjacent columns of the same row show that a stage operates for more than one clock period
      • More than one X in a row, not adjacent to each other, shows feedback
      • Pipelines may not accept initiations at the start of every clock period, or collisions may occur
        • Potential collisions shown by the distance in time slots between Xs in each row
    • Collision vector C = (c_{n-1} … c_1 c_0) is derived from the distances between Xs
      • c_0 = 1, always (two initiations cannot start in the same cycle)
      • c_i = 1 if a collision would occur with an initiation i cycles after a previous initiation
      • The initial collision vector is the state of the pipeline after the first initiation
        • Take the distances between all pairs of Xs in each row; if the distance is d, set bit c_d
    • Need a control mechanism to determine if new initiations can happen without a collision occurring
      • Latency is the number of clock periods between initiations
      • Average latency is the number of clock periods between initiations over some repeating cycle
      • Minimum average latency is the smallest possible considering all possible sequences of initiations
        • The goal for optimum design
      • A pipeline changes state as a result of initiations, so represent activity as a state diagram
        • A diagram of all pipeline states and changes starting with the initial collision vector
        • Shifting the collision vector one place to the right (each clock period) gives the next state
          • If the shifted vector has a 1 in the low-order bit, cannot initiate
          • If it has a 0, then can do a new initiation; the new vector is the bitwise OR of the shifted vector and the initial collision vector
        • State diagram can be reduced to show only changes where initiations are taken
          • Numbers on edges indicate the number of clock periods to reach the next state shown
          • Can identify cycles in graph
    • Always taking initiations as soon as the shifted vector allows, to give the minimum latency each time, is the greedy strategy (see the simulator sketch after this list)
      • Will not always give minimum average latency but is close
      • Often more than one greedy cycle
      • Average latency for a greedy cycle is less than or equal to the number of 1s in the initial collision vector
        • Gives an upper bound on latency
      • Minimum average latency is greater than or equal to the max number of Xs in any reservation table row
        • Gives a lower bound on latency
      • Max Xs in any row ≤ min avg latency ≤ greedy cycle avg latency ≤ number of 1s in the initial collision vector
    • A given pipeline may not give the required latency, so insert delays into the pipeline to expand the number of time slots and reduce collisions
    • Can identify where to place delays to give a latency of d cycles:
      • Start with the first X; enter an X in a revised table and mark as forbidden every d cycles after it, to indicate the positions reserved for initiations
      • Repeat for all Xs; when an X falls on a forbidden mark, delay that X by one or more cycles
      • Mark all delayed positions and delay all subsequent Xs by the same amount
    • Delays can be added using a latch to delay by a cycle
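
A sketch of one common dynamic prediction scheme, a 2-bit saturating counter per branch; the table size, indexing, and branch history here are illustrative assumptions:

```python
# 2-bit saturating-counter branch predictor (one common dynamic scheme).

class TwoBitPredictor:
    def __init__(self, entries=1024):
        self.counters = [1] * entries    # 0,1 = predict not taken; 2,3 = taken

    def predict(self, pc):
        return self.counters[pc % len(self.counters)] >= 2

    def update(self, pc, taken):
        i = pc % len(self.counters)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)

bp = TwoBitPredictor()
history = [True, True, False, True, True, True]   # a loop-like branch
hits = 0
for outcome in history:
    hits += bp.predict(0x40) == outcome
    bp.update(0x40, outcome)
print(f"{hits}/{len(history)} correct")   # 4/6: two takens warm the counter up
```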
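
And since the reservation-table material is dense, a small simulator of the collision-vector mechanics may help. The 3-stage reservation table here is made up for illustration, not taken from the slides:

```python
# Collision vectors and the greedy strategy for an example reservation table.
# Bit c_d of the collision vector is 1 if an initiation d cycles after a
# previous one would collide.

table = ["X...X",    # stage 1: Xs at t=0 and t=4 -> distance 4 is forbidden
         ".XX..",    # stage 2: Xs at t=1,2 (busy two periods) -> distance 1
         "...X."]    # stage 3: single X, no collisions

def initial_collision_vector(table):
    cv = 0
    for row in table:
        slots = [t for t, c in enumerate(row) if c == "X"]
        for a in slots:
            for b in slots:
                if b > a:
                    cv |= 1 << (b - a)    # distance d -> set bit c_d
    return cv | 1                         # c_0 = 1, always

def greedy_latencies(table, steps=20):
    icv = initial_collision_vector(table)
    state, latencies, since = icv, [], 0
    for _ in range(steps):
        state >>= 1                       # one clock period passes
        since += 1
        if state & 1 == 0:                # low-order bit 0: no collision
            state |= icv                  # new state = shifted OR initial
            latencies.append(since)
            since = 0
    return latencies

print(bin(initial_collision_vector(table)))  # 0b10011: latencies 0, 1, 4 forbidden
lats = greedy_latencies(table)
print(lats, sum(lats) / len(lats))           # [2, 3, 2, 3, ...] -> avg 2.5
# Bounds check: max Xs in a row (2) <= 2.5 <= 1s in the initial vector (3)
```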

Honestly just check the slides and examples for this one it makes zero sense lol

Superscalar Processors

  • A single, linear instruction pipeline provides at very best a steady-state Clocks per Instruction (CPI) of 1
  • Fetching/decoding more than one instruction per clock cycle can reduce the CPI below 1
  • An easy way to do this is to duplicate the pipeline
  • For example:
    • Two fetch/decode stages
    • Execution staging window register
    • Multiple execution pipelines for different instructions
    • A non-uniform superscalar does not duplicate the whole pipeline, only some stages
  • Number of replications before window is the degree of the superscalar processor
  • Some pipeline stages need less than half a clock cycle, so doubling the internal clock speed gets two tasks done per external clock cycle
    • Known as superpipelining
  • A pipeline of k stages takes k + N - 1 clock cycles to execute N instructions
    • A superscalar pipeline of degree m takes k + (N - m)/m cycles to do the same (see the worked example after this list)
  • An example pipeline has 4 stages: fetch, decode, execute, write-back
    • Each stage is duplicated
      • m = 2, the number of replications
      • k = 4, the number of stages
    • If instructions are aligned, the number of clocks required is k + (N - m)/m = N/2 + 3
    • If instructions are unaligned, an extra clock is needed: N/2 + 4
  • The CPI of a superscalar processor of degree m is 1/m
  • For large values of N, the speedup is limited by m and the pipeline length k, tending to m as N → ∞
  • As m increases, speedup increases linearly too, until the point where instruction level parallelism limits further increases
    • For many problems, ILP gives parallelism in the range 2-4x
  • No reason to have a huge number of duplicated pipelines, as most programs have a limited degree of inherent parallelism
    • Can be maximised by compiler and hardware techniques
    • Limited by dependencies
  • The program to be executed is a linear stream of instructions
    • Instruction fetch stage includes branch prediction to form a dynamic stream which may include dependencies
    • Processor dispatches instructions to be executed according to their dependencies
    • Instructions are conceptually put back into sequential order and results recorded - known as committing or retiring the instruction
      • Needed as instructions are executed out of order
      • Instruction may also be executed speculatively and not need to be retired
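
A quick numeric check of the cycle-count formulas above, using the 4-stage, degree-2 example (the instruction count N = 40 is arbitrary):

```python
# Cycle counts for a simple pipeline vs a degree-m superscalar.

def pipeline_cycles(k, N):
    return k + N - 1             # k stages, N instructions

def superscalar_cycles(k, N, m):
    return k + (N - m) / m       # degree m, aligned issue

k, N = 4, 40
base = pipeline_cycles(k, N)             # 43 cycles
dual = superscalar_cycles(k, N, m=2)     # 4 + 19 = 23 cycles
print(base, dual, base / dual)           # speedup ~1.87
# As N grows, the speedup tends to m:
print(pipeline_cycles(k, 10**6) / superscalar_cycles(k, 10**6, m=2))  # ~2.0
```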

Instruction Level Parallelism

  • Common instructions can be initiated simultaneously and executed independently
  • Superscalar processors rely on this ability to execute instructions in separate pipelines, possibly out-of-order
    • Multiple functional units for multiple tasks
  • ILP refers to the degree to which instructions can be executed in parallel
  • Common techniques to exploit it include instruction pipelining and superscalar execution, but also:
    • Out-of-order execution
    • Register renaming
      • Values conflict for use of the registers, so the processor has to stall to resolve conflicts
      • Can treat the problem as a resource conflict, and dynamically rename registers in hardware to reduce dependencies
      • Use different physical registers from the ones that the instructions name (see the renaming sketch at the end of this section)
    • Branch prediction
      • Prefetch both sides of the branch, which reduces delay
      • Can be static or dynamic
      • Speculative execution aims to do the work before it is known if results will be needed
        • Relies on resource abundance to provide performance improvements
  • Five factors fundamentally constrain ILP:
    • True data dependency
      • An instruction cannot execute because it requires data that will be produced by a preceding instruction
      • Usually causes pipeline delays
    • Procedural dependency
      • Inherent to the sequential nature of execution
      • Instructions following a branch have a dependency on the result of the branch
      • Variable length instructions can prevent simultaneous fetching
    • Resource conflicts
      • Two or more instructions require a system resource at the same time
      • Memories, caches, functional units, etc
    • Output dependency
      • Two instructions write the same register, so the writes must complete in program order
    • Antidependency
      • An instruction must not overwrite a register before an earlier instruction has read its old value
  • A program may not always have enough inherent ILP to take advantage of the machine parallelism
    • Limited machine parallelism will always inhibit performance
    • Processor must be able to identify ILP
  • Instruction issue refers to the process of initiating instruction execution in the processor's functional units
    • An instruction has been issued once it finishes decoding and hits the first execute stage
    • The instruction issue policy can have a large performance impact
    • Three types of instruction order are significant:
      • Fetch order
      • Execute order
      • Order in which instructions update the contents of memory
    • Issue policy can reorder these however it pleases, provided the results are correct
  • Three general categories for instruction issue policies (a toy comparison follows this list):
    • In-order issue with in-order completion
      • Do the same as would be done by a sequential processor
        • Issuing stalls when there is a conflict for a functional unit, or when an instruction takes more than one cycle
    • In-order issue with out-of-order completion
      • A number of instructions may be being executed at any time
      • Limited by machine parallelism in functional units
      • Still stalled by resource conflicts and dependencies
      • Introduces output dependencies
    • Out-of-order issue with out-of-order completion
      • In-order issue will only decode up to a dependency or conflict
      • Further decouple decode and execute stages
      • A buffer - the instruction window - holds instructions after decode
      • Processor can continue to fetch/decode as long as the window is not full, since execution is decoupled
      • Increases the number of instructions available to the execution unit
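
A minimal sketch of register renaming; the three-instruction program, opcode-free instruction format, and mapping scheme are illustrative assumptions:

```python
# Register renaming: map architectural registers onto a larger physical
# register file so that name-only (output/anti) dependencies disappear.

# Each instruction: (dest, src1, src2). R3 is reused, so I3 has an
# output dependency on I1 purely through the register name.
program = [("R3", "R1", "R2"),   # I1: R3 <- R1 op R2
           ("R4", "R3", "R1"),   # I2: R4 <- R3 op R1 (true dependency on I1)
           ("R3", "R5", "R6")]   # I3: R3 <- R5 op R6 (reuses the name R3)

def rename(program):
    mapping, out = {}, []
    for i, (dest, s1, s2) in enumerate(program):
        # Sources read whichever physical register currently holds the value
        p1 = mapping.get(s1, s1)
        p2 = mapping.get(s2, s2)
        # Each destination write gets a fresh physical register
        mapping[dest] = f"P{i}"
        out.append((mapping[dest], p1, p2))
    return out

for before, after in zip(program, rename(program)):
    print(before, "->", after)
# I3 now writes P2 rather than R3, so it no longer conflicts with I1
# and may complete out of order; I2's true dependency (on P0) remains.
```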
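
And a toy comparison of completion orders under the issue policies above, assuming a 3-cycle multiply, 1-cycle adds, and no dependencies between the three instructions (latencies chosen for illustration):

```python
# In-order vs out-of-order completion for three independent instructions,
# issued in order at one per cycle.

insts = [("I1", "mul", 3), ("I2", "add", 1), ("I3", "add", 1)]

def in_order_completion(insts):
    done = []
    for start, (name, _, lat) in enumerate(insts):
        finish = start + lat
        if done:
            finish = max(finish, done[-1][1] + 1)   # cannot retire early
        done.append((name, finish))
    return done

def out_of_order_completion(insts):
    return [(name, start + lat) for start, (name, _, lat) in enumerate(insts)]

print(in_order_completion(insts))      # [('I1', 3), ('I2', 4), ('I3', 5)]
print(out_of_order_completion(insts))  # [('I1', 3), ('I2', 2), ('I3', 3)]
# Out of order, the adds are no longer held back by the slow multiply.
```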