We have a memory hierarchy to balance the tradeoff between cost and speed
Want to exploit temporal and spatial locality
Moore's law is long dead and never really applied to memory
The basic element of main memory is a memory cell capable of being written or read to
Need lines to indicate read/write, carry the data input, and enable (select) the cell
When organising memory cells into a larger chip, it is important to maintain a structured approach and keep the circuit as compact as possible
For example, a 16 word x 8 bit memory chip requires 128 cells and 4-bit addresses
A 1024 bit device as a 128x8 array requires 7 address pins and 8 data pins
Alternatively, it is possible to organise it as a 1024x1 array, which would be really dumb as it would result in a massive decoder and inefficient space usage
Dividing the address inputs into 2 parts, a row address and a column address, minimises the decoder space and allows more space for memory
Can use the same principle to build smaller ICs into larger ICs, using decoders/multiplexers to split address spaces
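A minimal sketch of the row/column split in C, assuming a hypothetical 1024x1 device organised as a 32x32 cell array (so two 5-to-32 decoders replace one 10-to-1024 decoder):

```c
#include <stdio.h>

int main(void) {
    /* Assumed organisation: 1024 cells as a 32x32 array, so the
       10-bit address splits into a 5-bit row and a 5-bit column. */
    unsigned addr = 0x2B7 & 0x3FF;        /* any 10-bit address */
    unsigned row  = (addr >> 5) & 0x1F;   /* upper 5 bits drive the row decoder */
    unsigned col  = addr & 0x1F;          /* lower 5 bits drive the column decoder */
    printf("address 0x%03X -> row %u, column %u\n", addr, row, col);
    return 0;
}
```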
Semiconductor memory is generally what's used for main store, Random Access Memory
Two main technologies:
Static RAM (SRAM) uses a flip-flop as a storage element for each bit
Dynamic RAM (DRAM) uses the presence or lack of charge in a capacitor for each bit
Charge leaks away over time so needs refreshing, but DRAM is generally cheaper if the overhead of the refresh circuitry is sufficiently amortised
SRAM typically faster so is used for cache
DRAM used for main memory
The interface to main memory is always a bottleneck so we can do some fancy DRAM organisation stuff
Synchronous DRAM exchanges data with the processor according to an external clock signal
Clock runs at the speed of the bus to avoid waiting on memory
Processor can perform other tasks while waiting because clock period and wait times are known
Rambus DRAM was used by Intel for Pentium and Itanium
Exchanges data over a 28-wire bus no more than 12cm long
Provides address and control information
Asynchronous and block-oriented
Fast because requests are issued by the processor over the RDRAM bus instead of using explicit R/W and enable signals
Bus properties such as impedances must be known to the processor
DDR SDRAM extends SDRAM by sending data to the processor on both the rising and falling edges of the clock
Actually used
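A back-of-the-envelope peak transfer rate for DDR, using assumed figures (200 MHz bus clock, 64-bit data bus; neither number comes from these notes):

```c
#include <stdio.h>

int main(void) {
    /* DDR transfers data on both clock edges, so two transfers per cycle.
       The clock rate and bus width below are assumptions for illustration. */
    double bus_clock_hz        = 200e6;  /* assumed bus clock */
    double bus_width_bytes     = 8.0;    /* assumed 64-bit data bus */
    double transfers_per_clock = 2.0;    /* rising + falling edge */
    double peak = bus_clock_hz * transfers_per_clock * bus_width_bytes;
    printf("peak transfer rate ~ %.1f GB/s\n", peak / 1e9);
    return 0;
}
```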
Cache DRAM (CDRAM) combines DRAM with a small SRAM cache
Performance very dependent upon domain and load
ROM typically used in microprogramming or systems stuff
ROM is mask-written read only memory
PROM is same as above, but electrically written
EPROM is same as above, but is erasable via UV light at the chip level
Virtual memory is a hierarchical system across caches, main memory and swap that is managed by the OS
Locality of reference principle: addresses generated by the CPU should be in the first level of memory as often as possible
Use temporal, spatial, sequential locality to predict
The working set of memory addresses usually changes slowly so should maintain it closest to CPU
Performance measured as hit ratio H = N1/(N1 + N2) (assuming a two-level memory hierarchy with data in M1 and M2; N1 and N2 are the numbers of references satisfied by each)
The average access time tA=HtA1+(1−H)tA2
When there is a miss, the block is swapped in from M2 to M1 then accessed
tB is the time to transfer a block, so tA2 ≈ tB
r=tA2/tA1, the access time ratio of the two levels
e=tA1/tA, the factor by which average access time differs from minimum, access efficiency
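A worked example of these formulas with assumed figures (tA1 = 10 ns, tA2 = 100 ns, H = 0.95):

```c
#include <stdio.h>

int main(void) {
    /* Assumed two-level hierarchy: M1 access 10 ns, M2 access 100 ns, H = 0.95. */
    double tA1 = 10.0, tA2 = 100.0, H = 0.95;
    double tA = H * tA1 + (1.0 - H) * tA2;  /* average access time: 14.5 ns */
    double r  = tA2 / tA1;                  /* access time ratio: 10 */
    double e  = tA1 / tA;                   /* access efficiency: ~0.69 */
    printf("tA = %.1f ns, r = %.0f, e = %.2f\n", tA, r, e);
    return 0;
}
```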
Memory capacity is limited by cost considerations, so wasting space is bad
The efficiency with which space is being used can be defined as the ratio of useful stuff in memory over total memory, u = Su/S
Wasted space can be empty due to fragmentation, or inactive data that is never used
System also takes up some memory space
Virtual memory space is usually much greater than physical
If a memory address is referenced that is not in main memory, then there is a page fault and the OS fetches the data
When virtual address space is much greater than physical, most page table entries are empty
Fixed by inverted (hashed) page tables, where page numbers are hashed to smaller values that index a page table in which each entry corresponds to a physical frame
Hash collisions handled by extra chain field in the page table which indicates where colliding entry lives
Lookup process is:
Hash page number
Index the page table using hash. If the tag matches then page found
If not then check chain field and go to that index
If chain field is null then page fault
Average number of probes for an inverted page table with good hashing algorithm is 1.5
Practical to have a page frame table with twice as many entries as there are frames of memory
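A minimal sketch of the lookup steps above, assuming a hypothetical machine with 8 frames and a 16-entry table (the entry layout and hash are made up for illustration):

```c
#include <stdio.h>
#include <stdint.h>

#define NFRAMES    8
#define TABLE_SIZE (2 * NFRAMES)   /* twice as many entries as frames */
#define NIL        -1

/* One slot of the inverted/hashed page table (illustrative layout). */
struct ipt_entry {
    int64_t vpn;    /* virtual page number stored here (the tag), NIL if empty */
    int     frame;  /* physical frame holding that page */
    int     chain;  /* index of the next entry with the same hash, or NIL */
};

static struct ipt_entry table[TABLE_SIZE];

static int hash_vpn(uint64_t vpn) { return (int)(vpn % TABLE_SIZE); }

/* Returns the frame number, or -1 to signal a page fault. */
static int lookup(uint64_t vpn) {
    int i = hash_vpn(vpn);                 /* 1. hash the page number */
    while (i != NIL) {
        if (table[i].vpn == (int64_t)vpn)  /* 2. tag matches: page found */
            return table[i].frame;
        i = table[i].chain;                /* 3. otherwise follow the chain field */
    }
    return -1;                             /* 4. chain exhausted: page fault */
}

int main(void) {
    for (int i = 0; i < TABLE_SIZE; i++)
        table[i] = (struct ipt_entry){NIL, NIL, NIL};
    /* Virtual pages 5 and 21 collide (both hash to 5); the chain resolves it. */
    table[5] = (struct ipt_entry){5, 0, 6};
    table[6] = (struct ipt_entry){21, 1, NIL};
    printf("vpn 21 -> frame %d\n", lookup(21));
    printf("vpn 9  -> frame %d (page fault)\n", lookup(9));
    return 0;
}
```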
Segmentation allows programmer to view memory as multiple address spaces - segments
Each segment has its own access and usage rights
Provides a number of advantages:
Simplifies dynamic data structures, as segments can grow/shrink
Program sections can be altered and recompiled independently without relinking and reloading the whole program
Can be shared among processes
Access privileges give protection
Programs divided into segments which are logical parts of variable length
Segments are divided into pages; the segment table entry locates that segment's page table, which the page number then indexes
Two levels of lookup tables, with the address split into 3 parts: segment number, page number, and offset
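A sketch of the two-level lookup, with assumed field widths (4-bit segment number, 8-bit page number, 12-bit offset) that are not taken from the notes:

```c
#include <stdio.h>
#include <stdint.h>

#define OFFSET_BITS 12   /* assumed field widths, for illustration only */
#define PAGE_BITS   8
#define SEG_BITS    4

static uint32_t page_table_seg0[1 << PAGE_BITS];               /* page -> frame */
static uint32_t *segment_table[1 << SEG_BITS] = { page_table_seg0 };

static uint64_t translate(uint32_t vaddr) {
    uint32_t seg    = vaddr >> (PAGE_BITS + OFFSET_BITS);
    uint32_t page   = (vaddr >> OFFSET_BITS) & ((1 << PAGE_BITS) - 1);
    uint32_t offset = vaddr & ((1 << OFFSET_BITS) - 1);
    uint32_t *pt    = segment_table[seg];   /* 1st level: segment table -> page table */
    uint32_t frame  = pt[page];             /* 2nd level: page table -> frame */
    return ((uint64_t)frame << OFFSET_BITS) | offset;
}

int main(void) {
    page_table_seg0[3] = 42;                /* page 3 of segment 0 -> frame 42 */
    uint32_t vaddr = (0u << 20) | (3u << 12) | 0x0AB;
    printf("virtual 0x%05X -> physical 0x%06llX\n",
           (unsigned)vaddr, (unsigned long long)translate(vaddr));
    return 0;
}
```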
Translation Lookaside Buffer (TLB) holds most recently reference table entries as a cache
When TLB misses, there is a significant overhead in searching main memory page tables
Average address translation time tt = ttlb + (1 − Htlb)tpt, where tpt is the time to search the page tables in main memory on a TLB miss
TLB miss ratio usually low, less than 0.01
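A worked example of the translation-time formula, with assumed figures (1 ns TLB access, 100 ns to search the main-memory page tables, 0.01 miss ratio):

```c
#include <stdio.h>

int main(void) {
    /* Assumed figures: TLB access 1 ns, page table search 100 ns, miss ratio 0.01. */
    double t_tlb = 1.0, t_pt = 100.0, miss_ratio = 0.01;
    double tt = t_tlb + miss_ratio * t_pt;   /* tt = ttlb + (1 - Htlb) * tpt = 2 ns */
    printf("average address translation time = %.1f ns\n", tt);
    return 0;
}
```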
Page size Sp has an impact on memory space utilisation factor
Too large, then excessive internal fragmentation
Too small, then page tables become large, which reduces space utilisation
Ss is the segment size in words, so when Ss>Sp, the last page assigned to a segment will contain on average Sp/2 words
Size of the page table associated with each segment is approx Ss/Sp words, assuming each table entry is 1 word
Memory overhead for each segment is S = Sp/2 + Ss/Sp
Space utilisation is therefore u = Ss/(Ss + S) = 2SsSp/(Sp² + 2Ss(1 + Sp))
Optimum page size Sp(opt) = √(2Ss)
Optimum utilisation u(opt) = 1/(1 + √(2/Ss))
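The optimum follows from minimising the per-segment overhead S(Sp) = Sp/2 + Ss/Sp with respect to Sp, using only the definitions above:

```latex
\frac{dS}{dS_p} = \frac{1}{2} - \frac{S_s}{S_p^{2}} = 0
\;\Rightarrow\; S_p^{opt} = \sqrt{2S_s},
\qquad
S^{min} = \frac{\sqrt{2S_s}}{2} + \frac{S_s}{\sqrt{2S_s}} = \sqrt{2S_s},
\qquad
u^{opt} = \frac{S_s}{S_s + \sqrt{2S_s}} = \frac{1}{1 + \sqrt{2/S_s}}
```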
Hit ratio increases with page size up to a maximum, then begins to decrease again
Value of Sp yielding max hit ratios can be greater than the optimum page size for utilisation
When a page fault occurs, the memory management software is called to swap in a page from secondary storage
If memory is full, it is necessary to swap out a page
Efficient page replacement algorithm required
Doing it randomly would be fucking stupid, might evict something being used
FIFO is simple and removes oldest page, but still might evict something being used
Clock replacement algorithm modifies FIFO by keeping track of recent use through a use bit per page
Use bit is set when the page is referenced and cleared as the clock hand sweeps past; the first page found with use bit 0 is replaced
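A minimal sketch of this clock policy in C, assuming four frames and a made-up page trace (page -1 marks an empty frame):

```c
#include <stdio.h>
#include <stdbool.h>

#define NFRAMES 4

static int  page_in_frame[NFRAMES] = {-1, -1, -1, -1};
static bool use_bit[NFRAMES];
static int  hand = 0;

static void reference(int page) {
    for (int i = 0; i < NFRAMES; i++)
        if (page_in_frame[i] == page) { use_bit[i] = true; return; }  /* hit */

    /* Page fault: sweep the hand, clearing use bits, until a frame with
       use bit 0 is found; that frame's page is replaced. */
    while (use_bit[hand]) {
        use_bit[hand] = false;
        hand = (hand + 1) % NFRAMES;
    }
    printf("fault on page %d: replace page %d in frame %d\n",
           page, page_in_frame[hand], hand);
    page_in_frame[hand] = page;
    use_bit[hand] = true;
    hand = (hand + 1) % NFRAMES;
}

int main(void) {
    int trace[] = {1, 2, 3, 4, 1, 5, 2, 6};   /* made-up reference string */
    for (unsigned i = 0; i < sizeof trace / sizeof trace[0]; i++)
        reference(trace[i]);
    return 0;
}
```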
LRU algorithm works well but complex to implement, requires an age counter per entry
Usually approximated through use bits set at intervals
Working set replacement algorithm keeps track of the set of pages referenced during a time interval
Replaces the page which has not been referenced during the preceding time interval
As time passes, a moving window captures a working set of pages
Implementation is complex
Thrashing occurs when there are too many processes in too little memory, so the OS spends most of its time swapping pages rather than doing useful work
A processor often only needs one word of a block at a time
Request the missing word first and send it to the processor, then fill the remainder of the block
Most beneficial for large caches with large blocks
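A sketch of the fill order this implies ("critical word first"), assuming an 8-word block where word 5 missed:

```c
#include <stdio.h>

int main(void) {
    int words_per_block = 8;   /* assumed block size */
    int missed_word     = 5;   /* the word the processor actually asked for */
    for (int i = 0; i < words_per_block; i++) {
        int w = (missed_word + i) % words_per_block;   /* wrap-around fill order */
        printf("%s word %d\n", i == 0 ? "forward to CPU:" : "fill block:    ", w);
    }
    return 0;
}
```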
Merging write buffer
Write buffers are used by write-through and write-back caches
If write buffer is empty then data and full address are written to buffer
If the write buffer already contains other modified blocks, the addresses can be checked to see if the new data's address matches an existing buffer entry; if so, the new data is combined with that entry
Known as write merging
Reduces miss penalty
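An illustrative merging write buffer, assuming 4 entries that each cover a 4-word aligned block (sizes are made up for the example):

```c
#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

#define BUF_ENTRIES     4
#define WORDS_PER_ENTRY 4   /* each entry covers one 4-word aligned block */

struct wb_entry {
    bool     valid;
    uint32_t block;                        /* which aligned block this entry holds */
    uint32_t data[WORDS_PER_ENTRY];
    bool     word_valid[WORDS_PER_ENTRY];
};

static struct wb_entry buf[BUF_ENTRIES];

static void buffered_write(uint32_t word_addr, uint32_t value) {
    uint32_t block = word_addr / WORDS_PER_ENTRY;
    uint32_t word  = word_addr % WORDS_PER_ENTRY;

    for (int i = 0; i < BUF_ENTRIES; i++)
        if (buf[i].valid && buf[i].block == block) {      /* write merging */
            buf[i].data[word] = value;
            buf[i].word_valid[word] = true;
            printf("addr %u merged into entry %d\n", word_addr, i);
            return;
        }

    for (int i = 0; i < BUF_ENTRIES; i++)
        if (!buf[i].valid) {                              /* allocate a fresh entry */
            buf[i] = (struct wb_entry){ .valid = true, .block = block };
            buf[i].data[word] = value;
            buf[i].word_valid[word] = true;
            printf("addr %u placed in new entry %d\n", word_addr, i);
            return;
        }

    printf("buffer full: stall until an entry is written to memory\n");
}

int main(void) {
    buffered_write(100, 1);   /* block 25 -> new entry */
    buffered_write(101, 2);   /* same block -> merged, no extra entry used */
    buffered_write(108, 3);   /* block 27 -> new entry */
    return 0;
}
```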
Hardware prefetching
Put the data in cache before it's requested
Instruction prefetches usually done in hardware
Processor fetches two blocks on a miss, the missed block and then prefetches the next one
Prefetched block put in instruction stream buffer
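A sketch of next-block prefetching into a one-entry stream buffer, modelling only the miss path and ignoring the cache itself:

```c
#include <stdio.h>

static long stream_buffer = -1;   /* block number currently held, -1 if empty */

static void miss_on_block(long block) {
    if (block == stream_buffer)
        printf("block %ld: supplied by stream buffer\n", block);
    else
        printf("block %ld: fetched from memory\n", block);
    stream_buffer = block + 1;    /* prefetch the sequentially next block */
    printf("  prefetch block %ld into stream buffer\n", block + 1);
}

int main(void) {
    miss_on_block(10);   /* miss; block 11 is prefetched */
    miss_on_block(11);   /* satisfied by the stream buffer */
    miss_on_block(20);   /* non-sequential access: back to memory */
    return 0;
}
```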
Compiler driven prefetching
Reduces miss rate and penalty
Compiler inserts prefetching instructions based on what it can deduce about a program
Compiler can make other optimisations such as loop interchange and blocking
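For example, loop interchange: C arrays are row-major, so making the column index the innermost loop gives unit-stride accesses (a sketch of the idea, not a compiler's actual output):

```c
#include <stdio.h>

#define N 512

static double a[N][N];

/* Poor spatial locality: the inner loop strides through memory N elements apart. */
static double sum_column_order(void) {
    double total = 0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            total += a[i][j];
    return total;
}

/* After loop interchange: the inner loop walks consecutive memory locations. */
static double sum_row_order(void) {
    double total = 0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            total += a[i][j];
    return total;
}

int main(void) {
    printf("sums agree: %.1f %.1f\n", sum_column_order(), sum_row_order());
    return 0;
}
```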