Memory Systems

Main Memory

  • We have a memory hierarchy to balance the tradeoff between cost and speed
  • Want to exploit temporal and spatial locality
  • Moore's law is long dead and never really applied to memory
  • The basic element of main memory is a memory cell capable of being written to or read from
    • Need to indicate read/write, data input, and also an enable line
  • When organising memory cells into a larger chip, it is important to maintain a structured approach and keep the circuit as compact as possible
    • For example, a 16 word x 8 bit memory chip requires 128 cells and 4-bit addresses
    • A 1024 bit device as a 128x8 array requires 7 address pins and 8 data pins
      • Alternatively, it is possible to organise it as a 1024x1 array, which would be really dumb as it would result in a massive decoder and inefficient space usage
    • Dividing the address inputs into 2 parts, column and row address, minimises the decoder space and allows more space for memory (see the sketch below)
  • Can use the same principle to build smaller ICs into larger ICs, using decoders/multiplexers to split address spaces
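A minimal sketch of the row/column address split, assuming a 1024-cell device organised as a 32 x 32 array; the figures and the helper name split_address are illustrative assumptions, not a description of a specific part.

```python
# Illustrative only: a 1024-cell device organised as a 32 x 32 array,
# so a 10-bit cell address splits into a 5-bit row and a 5-bit column.
ROW_BITS, COL_BITS = 5, 5

def split_address(addr: int) -> tuple[int, int]:
    """Return (row, column) select values for a 10-bit cell address."""
    row = (addr >> COL_BITS) & ((1 << ROW_BITS) - 1)
    col = addr & ((1 << COL_BITS) - 1)
    return row, col

# One 1-of-1024 decoder would need 1024 outputs; two 1-of-32 decoders
# need only 64 outputs between them, which is why the split saves space.
print(split_address(0b1101100101))  # -> (27, 5)
```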
  • Semiconductor memory (Random Access Memory) is generally what's used for main store
  • Two main technologies:
    • Static RAM (SRAM) uses a flip-flop as a storage element for each bit
    • Dynamic RAM (DRAM) uses the presence or lack of charge in a capacitor for each bit
      • Charge leaks away over time so needs refreshing, but DRAM is generally cheaper if the overhead of the refresh circuitry is sufficiently amortised
    • SRAM typically faster so is used for cache
    • DRAM used for main memory
  • The interface to main memory is always a bottleneck, so some fancier DRAM organisations are used
    • Synchronous DRAM exchanges data with the processor according to an external clock signal
      • Clock runs at the speed of the bus to avoid waiting on memory
      • Processor can perform other tasks while waiting because clock period and wait times are known
    • Rambus DRAM was used by Intel for Pentium and Itanium
      • Exchanges data over a 28-wire bus no more than 12cm long
      • Provides address and control information
      • Asynchronous and block-oriented
      • Fast because requests are issued by the processor over the RDRAM bus instead of using explicit R/W and enable signals
      • Bus properties such as impedances must be known to the processor
    • DDR SDRAM extends SDRAM by sending data to the processor on both the rising and falling edges of the clock
      • Actually used
    • Cache DRAM (CDRAM) combines DRAM with a small SRAM cache
      • Performance very dependent upon domain and load
  • ROM typically used in microprogramming or systems stuff
    • ROM is mask-written read only memory
    • PROM is same as above, but electrically written
    • EPROM is same as above, but is erasable via UV light at the chip level
    • EEPROM is erasable electrically at the byte-level
  • Flash memory is a high speed semiconductor memory
    • Used for persistent storage
    • Limited to block-level erasure
    • Typically uses 1 transistor per bit

Interleaved Memory

  • A collection of multiple DRAM chips grouped to form a memory bank
  • Banks can service requests simultaneously, so $K$ banks can increase memory read/write rates by up to a factor of $K$
  • If consecutive words of memory are stored in different banks, the transfer of a block of memory is sped up
  • Distributing addresses among memory units/banks is called interleaving
    • Interleaving addresses among $K$ memory units is known as $K$-way interleaving
  • Most effective when the number of memory banks is equal to the number of words in a cache line (see the sketch below)
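A minimal sketch of low-order interleaving, assuming $K = 4$ banks; the helper bank_and_offset is a made-up name for illustration. Consecutive word addresses rotate through the banks, so a block transfer can keep every bank busy.

```python
# Low-order (K-way) interleaving: bank index comes from the low address bits.
K = 4  # assumed number of banks

def bank_and_offset(word_addr: int) -> tuple[int, int]:
    """Map a word address to (bank index, address within that bank)."""
    return word_addr % K, word_addr // K

for addr in range(8):
    print(addr, bank_and_offset(addr))
# Words 0..3 land in banks 0..3, so a 4-word cache line can be fetched
# with all four banks working in parallel.
```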

Virtual Memory

  • Virtual memory is a hierarchical system across caches, main memory and swap that is managed by the OS
  • Locality of reference principle: addresses generated by the CPU should be in the first level of memory as often as possible
    • Use temporal, spatial, sequential locality to predict
    • The working set of memory addresses usually changes slowly so should maintain it closest to CPU
  • Performance measured as hit ratio $H$ (assuming a two-level memory hierarchy with data in $M_1$ and $M_2$)
  • The average access time is $t_A = t_{A1} + (1 - H)\,t_B$ (worked example after this list)
    • When there is a miss, the block is swapped in from $M_2$ to $M_1$ and then accessed
    • $t_B$ is the time to transfer a block, so every reference pays $t_{A1}$ and a miss adds $t_B$ (with $t_B \approx t_{A2}$)
    • $r = t_{A2}/t_{A1}$, the access time ratio of the two levels
    • $e = t_{A1}/t_A = \frac{1}{1 + (1 - H)r}$, the factor by which average access time differs from its minimum, i.e. the access efficiency
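A worked example of the formulas above, using assumed figures (10 ns for $M_1$, 100 ns block transfer, $H = 0.95$):

```python
# Assumed figures, purely illustrative.
t_A1, t_B, H = 10e-9, 100e-9, 0.95   # M1 access, block transfer, hit ratio

t_A = t_A1 + (1 - H) * t_B           # average access time
r = t_B / t_A1                       # access time ratio (t_B ~ t_A2)
e = t_A1 / t_A                       # access efficiency

print(f"t_A = {t_A * 1e9:.1f} ns, r = {r:.0f}, e = {e:.2f}")
# -> t_A = 15.0 ns, r = 10, e = 0.67
```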
  • Memory capacity is limited by cost considerations, so wasting space is bad
    • The efficiency with which space is being used can be defined as the ratio of useful stuff in memory over total memory, the space utilisation $u = \frac{\text{useful memory}}{\text{total memory}}$
    • Wasted space can be empty due to fragmentation, or inactive data that is never used
    • System also takes up some memory space
  • Virtual memory space is usually much greater than physical
    • If a memory address is referenced that is not in main memory, then there is a page fault and the OS fetches the data
    • When virtual address space is much greater than physical, most page table entries are empty
      • Fixed by inverted hashed page tables, where page numbers are hashed to smaller values that index a page table where each entry corresponds to physical frames
      • Hash collisions handled by extra chain field in the page table which indicates where colliding entry lives
      • Lookup process (sketched in code below) is:
        • Hash page number
        • Index the page table using hash. If the tag matches then page found
        • If not then check chain field and go to that index
          • If chain field is null then page fault
      • Average number of probes for an inverted page table with a good hashing algorithm is about 1.5
        • It is practical to have a page frame table with twice as many entries as there are frames of memory
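A minimal sketch of the inverted page table lookup described above. The entry layout (tag, frame, chain) and the table size are assumptions for illustration, and Python's built-in hash stands in for the real hash function.

```python
# Inverted (hashed) page table: each slot holds (vpn tag, frame, chain index).
TABLE_SIZE = 16                       # assumed; roughly 2x the number of frames
table = [None] * TABLE_SIZE

def lookup(vpn: int):
    """Return the physical frame for a virtual page number, or None on a page fault."""
    idx = hash(vpn) % TABLE_SIZE      # 1. hash the page number
    while idx is not None:
        entry = table[idx]
        if entry is None:
            return None               # empty slot -> page fault
        tag, frame, chain = entry
        if tag == vpn:                # 2. tag matches -> page found
            return frame
        idx = chain                   # 3. follow the chain field (None = fault)
    return None
```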
  • Segmentation allows programmer to view memory as multiple address spaces - segments
    • Each segment has its own access and usage rights
    • Provides a number of advantages:
      • Simplifies dynamic data structures, as segments can grow/shrink
      • Programs can be altered and recompiled independently without relinking and reloading
      • Can be shared among processes
      • Access privileges give protection
    • Programs divided into segments which are logical parts of variable length
    • Segments are made up of pages, so the segment table is used to locate the page table for the segment (translation sketched below)
      • Two levels of lookup tables, with the address split into 3 fields: segment, page and offset
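A minimal sketch of the two-level (segment, page, offset) translation, assuming an illustrative 8/12/12-bit split of a 32-bit virtual address; the field widths and table layout are assumptions, not a particular architecture.

```python
# Assumed split: 8-bit segment | 12-bit page | 12-bit offset.
SEG_BITS, PAGE_BITS, OFF_BITS = 8, 12, 12

def split_virtual(addr: int) -> tuple[int, int, int]:
    off  = addr & ((1 << OFF_BITS) - 1)
    page = (addr >> OFF_BITS) & ((1 << PAGE_BITS) - 1)
    seg  = addr >> (OFF_BITS + PAGE_BITS)
    return seg, page, off

def translate(addr: int, segment_table, page_tables) -> int:
    """Segment table selects a page table; the page table gives the frame number."""
    seg, page, off = split_virtual(addr)
    page_table = page_tables[segment_table[seg]]
    return (page_table[page] << OFF_BITS) | off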
  • Translation Lookaside Buffer (TLB) holds most recently reference table entries as a cache
    • When TLB misses, there is a significant overhead in searching main memory page tables
    • Average address translation time $t_{tr} = t_{TLB} + (1 - H_{TLB})\,t_{PT}$, where $t_{TLB}$ is the TLB lookup time, $H_{TLB}$ the TLB hit ratio and $t_{PT}$ the time to search the page tables in main memory (worked example below)
    • TLB miss ratio usually low, less than 0.01
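Plugging assumed figures into the translation-time formula above (1 ns TLB lookup, 100 ns page-table search, miss ratio 0.01):

```python
t_tlb, t_pt, miss = 1e-9, 100e-9, 0.01   # assumed figures
t_translate = t_tlb + miss * t_pt
print(f"{t_translate * 1e9:.1f} ns")     # -> 2.0 ns on average
```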
  • Page size has an impact on memory space utilisation factor
    • Too large, then excessive internal fragmentation
    • Too small, then page tables become large and reduces space utilisation
    • $S_s$ is the segment size in words, so when $S_s \gg S_p$ (the page size), the last page assigned to a segment will contain on average $S_p/2$ words
    • Size of the page table associated with each segment is approx $S_s/S_p$ words, assuming each table entry is 1 word
    • Memory overhead for each segment is $S = \frac{S_p}{2} + \frac{S_s}{S_p}$
    • Space utilisation is therefore $u = \frac{S_s}{S_s + S} = \frac{S_s}{S_s + S_p/2 + S_s/S_p}$
    • Optimum page size $S_p^{opt} = \sqrt{2 S_s}$, found by minimising $S$ (worked example below)
    • Optimum utilisation $u^{opt} = \frac{1}{1 + \sqrt{2/S_s}}$
    • Hit ratio increases with page size up to a maximum, then begins to decrease again
      • The value of $S_p$ yielding the maximum hit ratio can be greater than the optimum page size for utilisation
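A worked example of the optimum page size result, assuming a segment size of $S_s = 65{,}536$ words:

```python
from math import sqrt

S_s = 2 ** 16                     # assumed segment size in words
S_p_opt = sqrt(2 * S_s)           # optimum page size, ~362 words
u_opt = 1 / (1 + sqrt(2 / S_s))   # optimum utilisation, ~0.995

print(round(S_p_opt), round(u_opt, 3))
```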
  • When a page fault occurs, the memory management software is called to swap in a page from secondary storage
    • If memory is full, it is necessary to swap out a page
    • Efficient page replacement algorithm required
      • Doing it randomly would be fucking stupid, might evict something being used
      • FIFO is simple and removes oldest page, but still might evict something being used
      • Clock replacement algorithm modifies FIFO by keeping track of page usage through a use bit (sketched after this list)
        • Use bit is set when the page is referenced, so pages used since the last sweep get a second chance instead of being evicted
      • LRU algorithm works well but complex to implement, requires an age counter per entry
        • Usually approximated through use bits set at intervals
      • Working set replacement algorithm keeps track of the set of pages referenced during a time interval
        • Replaces the page which has not been referenced during the preceding time interval
        • As time passes, a moving window captures a working set of pages
        • Implementation is complex
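A minimal sketch of the clock (second-chance) replacement algorithm mentioned above; the Clock class and frame count are assumptions for illustration.

```python
class Clock:
    """Clock page replacement: a use bit per frame, swept by a circular hand."""

    def __init__(self, n_frames: int):
        self.frames = [None] * n_frames
        self.use = [0] * n_frames
        self.hand = 0

    def access(self, page: int) -> None:
        if page in self.frames:                  # hit: mark the page as used
            self.use[self.frames.index(page)] = 1
            return
        while self.use[self.hand]:               # miss: give used pages a second chance
            self.use[self.hand] = 0
            self.hand = (self.hand + 1) % len(self.frames)
        self.frames[self.hand] = page            # evict the first unused page
        self.use[self.hand] = 1
        self.hand = (self.hand + 1) % len(self.frames)
```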
  • Thrashing occurs when there are too many processes in too little memory, and the OS spends most of its time swapping pages rather than doing useful work

Cache

  • Cache contains copies of sections of main memory and relies on locality of reference
  • Objective of cache is to have as high a hit ratio as possible
  • Three techniques used for cache mapping
    • Direct, maps each block of memory to only one possible cache line
    • Associative, permits each main memory block to be loaded into any line of cache
      • Cache control logic must examine each cache line for a match
    • Set associative, where each block maps to a single set but can be placed in any cache line within that set
  • In direct mapping, the address is divided into three fields: tag, line and word (see the address-splitting sketch below)
    • Cache is accessed with the same line and word as main memory
    • Tag is stored with data in the cache
      • If tag matches that of the address, then that's a cache hit
      • If a miss occurs, the new data and tag is fetched to cache
    • Simple and inexpensive
    • Fixed cache location for each block means that if two needed blocks map to the same line then the cache will thrash
    • Victim cache was originally proposed as a solution
      • A fully associative cache of 4-16 lines sat between L1 and L2
  • Fully associative cache schemes divide the CPU address into tag and word
    • Cache accessed by same word
    • Tag stored with data; every tag has to be examined to determine whether there's a hit or miss
      • Complex because of this
  • Set associative combines the two, where a given block maps to any line in a given set
    • eg, a 4-way cache has 4 lines per set and a block can map to any one of these 4
    • Performance increases diminish as set size increases
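A minimal sketch of splitting an address into tag, set (line) and word fields, assuming 32-byte lines and 256 sets; direct mapping is simply the 1-way case of this.

```python
# Assumed geometry: 32-byte lines, 256 sets.
LINE_BYTES, N_SETS = 32, 256
WORD_BITS = LINE_BYTES.bit_length() - 1   # 5-bit word/byte offset
SET_BITS = N_SETS.bit_length() - 1        # 8-bit set (line) index

def split(addr: int) -> tuple[int, int, int]:
    """Return (tag, set, word) fields of an address."""
    word = addr & (LINE_BYTES - 1)
    set_index = (addr >> WORD_BITS) & (N_SETS - 1)
    tag = addr >> (WORD_BITS + SET_BITS)
    return tag, set_index, word

print(split(0x12345))  # -> (9, 26, 5)
```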
  • Performance can be improved with separate instruction and data caches, L1 usually split
  • Principle of inclusion states that L1 should always be subset of L2, L2 subset of L3, etc
    • When L3 is fetched to, data is written to L2 and L1 also
  • Writing to cache can result in cache and main memory having inconsistent data
    • It is necessary to maintain coherence if
      • I/O operates on main memory
      • Processors share main memory
    • There are two common methods for maintaining consistency
      • With write through, every write operation to cache is repeated to main memory in parallel
        • Adds overhead to write to memory, but usually there are several reads between each write
        • Average access time $t_A = t_{A1} + (1 - H)t_B + w(t_{A2} - t_{A1})$ (worked comparison after this list)
          • Assumes $t_B$ is the time to transfer a block to cache, and $w$ is the fraction of references that are writes
        • Main memory write operation must complete before any further cache operations
          • If the size of a block matches the datapath width, then the whole block can be transferred in one operation, so $t_B = t_{A2}$
            • If not, then $b$ transfers are required and $t_B \approx b\,t_{A2}$
        • Write through often enhanced by buffers for writes to main memory, freeing cache for subsequent accesses
        • In some systems, cache is not fetched to when a miss occurs on a write operation, meaning data is written to main memory but not cache
          • Reduces average access time as read misses incur less overhead
      • With write back, a write operation to main memory is performed only at block replacement time
        • Increases efficiency if variables are changed a number of times
        • Simple write back refers to always writing back a block when a swap is required, even if data is unaltered
        • Average access time becomes $t_A = t_{A1} + 2(1 - H)t_B$
          • x2 because on a miss you write the old block back and then fetch the new one
        • Tagged write back only writes back a block if the contents have altered
          • 1-bit tag stored with each block, and is set when block altered
          • Tags examined at replacement time
          • Access time $t_A = t_{A1} + (1 - H)t_B + w_b(1 - H)t_B$
            • $w_b$ is the probability a block has been altered
        • Write buffers can also be implemented
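A worked comparison of the write policies above, using assumed figures (10 ns cache access, 100 ns block transfer and main memory access, $H = 0.95$, 20% of references are writes, half of replaced blocks dirty):

```python
# Assumed figures, purely illustrative.
t_A1, t_A2, t_B = 10e-9, 100e-9, 100e-9
H, w, w_b = 0.95, 0.2, 0.5

write_through = t_A1 + (1 - H) * t_B + w * (t_A2 - t_A1)
simple_wb     = t_A1 + 2 * (1 - H) * t_B
tagged_wb     = t_A1 + (1 - H) * t_B + w_b * (1 - H) * t_B

for name, t in [("write-through", write_through),
                ("simple write-back", simple_wb),
                ("tagged write-back", tagged_wb)]:
    print(f"{name}: {t * 1e9:.1f} ns")
# -> 33.0 ns, 20.0 ns and 17.5 ns with these particular numbers
```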
  • Most modern processors have at least two cache levels
    • Normal memory hierarchy principles apply, though on an L2 miss data is written to L1 and L2
    • With two levels, average access time becomes $t_A = t_{A1} + (1 - H_1)\big(t_{A2} + (1 - H_2)t_B\big)$, where $H_1$ and $H_2$ are the L1 and L2 hit ratios and $t_B$ is the block transfer time from main memory
  • A replacement policy is required for evicting cache lines in associative and set-associative mappings
    • Most effective policy is LRU, implemented totally in hardware
    • Two possible implementations, counter and reference matrix
      • A counter associated with each line is incremented at regular intervals and reset every time the line is referenced
        • On a miss when the cache is full, the line whose counter is at the maximum value is replaced and its counter reset, with all other counters set to 0
    • Reference matrix is based on a matrix of status bits (sketched after this list)
      • With $N$ lines to consider, the upper triangular part of an $N \times N$ matrix is formed, excluding the diagonal, giving $N(N-1)/2$ status bits
      • When the $i$th line is referenced, all bits in the $i$th row are set to one and then the $i$th column is zeroed
      • The least recently used line is the one that has all 0s in its row and all 1s in its column
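A minimal sketch of the reference-matrix scheme for $N = 4$ lines; a full N x N bit matrix is used here for clarity rather than the packed upper-triangular form.

```python
N = 4
matrix = [[0] * N for _ in range(N)]   # status bits

def reference(i: int) -> None:
    """Set row i to ones, then zero column i."""
    for j in range(N):
        matrix[i][j] = 1
    for j in range(N):
        matrix[j][i] = 0

def lru_line() -> int:
    """The LRU line is the one with all zeros in its row."""
    return min(range(N), key=lambda i: sum(matrix[i]))

for line in (0, 1, 2, 3, 1, 0):
    reference(line)
print(lru_line())   # -> 2, the least recently referenced line
```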
  • There are three types of cache miss:
    • Compulsory, where an access will always miss because it is the first access to the block
    • Capacity, where a miss occurs because a cache is not large enough to contain all the blocks needed
    • Conflict, misses occurring as a result of blocks not being fully associative
    • Sometimes a fourth category, coherency, is used to describe misses occurring due to cache flushes in multiprocessor systems
  • Performance measures based solely on hit rate don't factor in the actual cost of a cache miss, which is the real performance issue
    • Average memory access time = hit time + (miss rate x miss penalty)
    • Measuring access time can be a more indicative measure
  • There are a number of measures that can be taken to optimise cache performance
    • Have larger block sizes to exploit spatial locality
      • Likely to reduce number of compulsory misses
      • Will increase cache miss penalty
    • Have a larger cache
      • Longer hit times, increased power consumption and higher cost
    • Higher levels of associativity
      • Reduces number of conflict misses
      • Can cause longer hit times and increased power consumption
    • Multilevel Caches
      • Idea is to reduce miss penalty
      • L1 cache keeps pace with CPU clock, further caches serve to reduce the number of main memory accesses
      • Can redefine average access time for multilevel caches: L1 hit time + (L1 miss rate x (L2 hit time + (L2 miss rate x L2 miss penalty))) (worked example below)
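Plugging assumed figures into the multilevel expression above (1 ns L1 hit, 5% L1 miss rate, 5 ns L2 hit, 20% L2 miss rate, 100 ns L2 miss penalty):

```python
l1_hit, l1_miss_rate = 1e-9, 0.05      # assumed figures
l2_hit, l2_miss_rate = 5e-9, 0.20
l2_miss_penalty = 100e-9

amat = l1_hit + l1_miss_rate * (l2_hit + l2_miss_rate * l2_miss_penalty)
print(f"{amat * 1e9:.2f} ns")          # -> 2.25 ns
```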
    • Prioritising read misses over writes
      • Write buffers can hold updated value for a location needed on a read miss
      • If no conflicts, then sending the read before the write will reduce the miss penalty
      • Optimisation easily implemented in write buffer
      • Most modern processors do this as the cost is low
    • Avoid address translation during cache indexing
      • Caches must cope with the translation of virtual addresses to physical
      • Using the page offset to index the cache means the TLB lookup can be overlapped with the cache access rather than delaying it
        • Imposes restrictions in structure and size of cache
    • Controlling L1 cache size and complexity
      • Fast clock cycles encourage small and simple L1 caches
      • Lower levels of associativity can reduce hit times as they are less complex
    • Way prediction
      • Reduce conflict misses
      • Keep extra bits in the cache to predict the way (the block within the set) to try on the next cache access
      • Requires block predictor bits in each block
        • Determine which block to try on the next cache access
        • If prediction correct then latency is equal to direct mapped, otherwise at least an extra clock cycle required
        • Prediction accuracy commonly 90%+ for 2-way cache
    • Pipelined access
      • Effective latency of an L1 cache hit can be multiple cycles
      • Pipelining allows higher clock speeds and bandwidth
      • Can incur slower hit times
    • Non-blocking cache
      • Processors in many systems do not need to stall on a data cache miss
        • Instruction fetch could be performed while data fetched from main memory following a miss
      • Allows more than one cache request to be issued at a time
        • Cache can continue to supply hits immediately following a miss
      • Performance hard to measure and model
        • Out-of-order processors can hide impact of L1 misses that hit L2
    • Multi-bank caches
      • Increase cache bandwidth by having multiple banks that support simultaneous access
      • Ideal if cache accesses spread themselves across banks
        • Sequential interleaving spreads block addresses sequentially across banks
    • Critical word first
      • A processor often only needs one word of a block at a time
      • Request the missing word first and send it to the processor, then fill the remainder of the block
      • Most beneficial for large caches with large blocks
    • Merging write buffer
      • Write buffers are used by write-through and write-back caches
      • If write buffer is empty then data and full address are written to buffer
      • If the write buffer contains other modified blocks, the addresses can be checked to see whether the new data matches an existing buffer entry; if so, the data is combined with that entry
        • Known as write merging
      • Reduces miss penalty
    • Hardware prefetching
      • Put the data in cache before it's requested
      • Instruction prefetches usually done in hardware
      • Processor fetches two blocks on a miss, the missed block and then prefetches the next one
      • Prefetched block put in instruction stream buffer
    • Compiler driven prefetching
      • Reduces miss rate and penalty
      • Compiler inserts prefetching instructions based on what it can deduce about a program
    • Compiler can make other optimisations such as loop interchange and blocking