We have a memory hierarchy to balance the tradeoff between cost and speed
Want to exploit temporal and spatial locality
Moore's law is long dead and never really applied to memory
The basic element of main memory is a memory cell capable of being written or read to
Need lines to indicate read/write, carry the data input, and enable (select) the cell
When organising memory cells into a larger chip, it is important to maintain a structured approach and keep the circuit as compact as possible
For example, a 16 word x 8 bit memory chip requires 128 cells and 4-bit addresses
A 1024 bit device as a 128x8 array requires 7 address pins and 8 data pins
Alternatively, it is possible to organise it as a 1024x1 array, which would be really dumb as it would result in a massive decoder and inefficient space usage
Dividing the address inputs into 2 parts, a row address and a column address, minimises the decoder space and allows more space for memory
Can use the same principle to build smaller ICs into larger ICs, using decoders/multiplexers to split address spaces
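A minimal sketch of the row/column split in C, assuming a hypothetical 1024x1 device organised as a 32x32 cell array (so two 5-to-32 decoders replace one 10-to-1024 decoder):

```c
#include <stdio.h>

int main(void) {
    /* Assumed organisation: 1024 cells as a 32x32 array, so the
       10-bit address splits into a 5-bit row and a 5-bit column. */
    unsigned addr = 0x2B7 & 0x3FF;        /* any 10-bit address */
    unsigned row  = (addr >> 5) & 0x1F;   /* upper 5 bits drive the row decoder */
    unsigned col  = addr & 0x1F;          /* lower 5 bits drive the column decoder */
    printf("address 0x%03X -> row %u, column %u\n", addr, row, col);
    return 0;
}
```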
Semiconductor memory is generally what's used for main store, Random Access Memory
Two main technologies:
Static RAM (SRAM) uses a flip-flop as a storage element for each bit
Dynamic RAM (DRAM) uses the presence or lack of charge in a capacitor for each bit
Charge leaks away over time so needs refreshing, but DRAM is generally cheaper if the overhead of the refresh circuitry is sufficiently amortised
SRAM typically faster so is used for cache
DRAM used for main memory
The interface to main memory is always a bottleneck so we can do some fancy DRAM organisation stuff
Synchronous DRAM exchanges data with the processor according to an external clock signal
Clock runs at the speed of the bus to avoid waiting on memory
Processor can perform other tasks while waiting because clock period and wait times are known
Rambus DRAM was used by Intel for Pentium and Itanium
Exchanges data over a 28-wire bus no more than 12cm long
Provides address and control information
Asynchronous and block-oriented
Fast because requests are issued by the processor over the RDRAM bus instead of using explicit R/W and enable signals
Bus properties such as impedances must be known to the processor
DDR SDRAM extends SDRAM by sending data to the processor on both the rising and falling edges of the clock
Actually used
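A back-of-the-envelope peak transfer rate for DDR, using assumed figures (200 MHz bus clock, 64-bit data bus; neither number comes from these notes):

```c
#include <stdio.h>

int main(void) {
    /* DDR transfers data on both clock edges, so two transfers per cycle.
       The clock rate and bus width below are assumptions for illustration. */
    double bus_clock_hz        = 200e6;  /* assumed bus clock */
    double bus_width_bytes     = 8.0;    /* assumed 64-bit data bus */
    double transfers_per_clock = 2.0;    /* rising + falling edge */
    double peak = bus_clock_hz * transfers_per_clock * bus_width_bytes;
    printf("peak transfer rate ~ %.1f GB/s\n", peak / 1e9);
    return 0;
}
```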
Cache DRAM (CDRAM) combines DRAM with a small SRAM cache
Performance very dependent upon domain and load
ROM typically used in microprogramming or systems stuff
ROM is mask-written read only memory
PROM is same as above, but electrically written
EPROM is same as above, but is erasable via UV light at the chip level
Virtual memory is a hierarchical system across caches, main memory and swap that is managed by the OS
Locality of reference principle: addresses generated by the CPU should be in the first level of memory as often as possible
Use temporal, spatial, sequential locality to predict
The working set of memory addresses usually changes slowly so should maintain it closest to CPU
Performance measured as hit ratio H = N1/(N1 + N2) (assuming a two-level memory hierarchy with data in M1 and M2; N1 and N2 are the numbers of references satisfied by each)
The average access time tA=HtA1+(1−H)tA2
When there is a miss, the block is swapped in from M2 to M1 then accessed
tB is the time to transfer a block, so tA2 ≈ tB
r=tA2/tA1, the access time ratio of the two levels
e=tA1/tA, the factor by which average access time differs from minimum, access efficiency
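A worked example of these formulas with assumed figures (tA1 = 10 ns, tA2 = 100 ns, H = 0.95):

```c
#include <stdio.h>

int main(void) {
    /* Assumed two-level hierarchy: M1 access 10 ns, M2 access 100 ns, H = 0.95. */
    double tA1 = 10.0, tA2 = 100.0, H = 0.95;
    double tA = H * tA1 + (1.0 - H) * tA2;  /* average access time: 14.5 ns */
    double r  = tA2 / tA1;                  /* access time ratio: 10 */
    double e  = tA1 / tA;                   /* access efficiency: ~0.69 */
    printf("tA = %.1f ns, r = %.0f, e = %.2f\n", tA, r, e);
    return 0;
}
```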
Memory capacity is limited by cost considerations, so wasting space is bad
The efficiency with which space is being used can be defined as the ratio of useful stuff in memory over total memory, u = Su/S
Wasted space can be empty due to fragmentation, or inactive data that is never used
System also takes up some memory space
Virtual memory space is usually much greater than physical
If a memory address is referenced that is not in main memory, then there is a page fault and the OS fetches the data
When virtual address space is much greater than physical, most page table entries are empty
Fixed by inverted (hashed) page tables, where page numbers are hashed to smaller values that index a page table in which each entry corresponds to a physical frame
Hash collisions handled by extra chain field in the page table which indicates where colliding entry lives
Lookup process is:
Hash page number
Index the page table using hash. If the tag matches then page found
If not then check chain field and go to that index
If chain field is null then page fault
Average number of probes for an inverted page table with good hashing algorithm is 1.5
Practical to have a page frame table with twice as many entries as there are frames of memory
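A minimal sketch of the lookup steps above, assuming a hypothetical machine with 8 frames and a 16-entry table (the entry layout and hash are made up for illustration):

```c
#include <stdio.h>
#include <stdint.h>

#define NFRAMES    8
#define TABLE_SIZE (2 * NFRAMES)   /* twice as many entries as frames */
#define NIL        -1

/* One slot of the inverted/hashed page table (illustrative layout). */
struct ipt_entry {
    int64_t vpn;    /* virtual page number stored here (the tag), NIL if empty */
    int     frame;  /* physical frame holding that page */
    int     chain;  /* index of the next entry with the same hash, or NIL */
};

static struct ipt_entry table[TABLE_SIZE];

static int hash_vpn(uint64_t vpn) { return (int)(vpn % TABLE_SIZE); }

/* Returns the frame number, or -1 to signal a page fault. */
static int lookup(uint64_t vpn) {
    int i = hash_vpn(vpn);                 /* 1. hash the page number */
    while (i != NIL) {
        if (table[i].vpn == (int64_t)vpn)  /* 2. tag matches: page found */
            return table[i].frame;
        i = table[i].chain;                /* 3. otherwise follow the chain field */
    }
    return -1;                             /* 4. chain exhausted: page fault */
}

int main(void) {
    for (int i = 0; i < TABLE_SIZE; i++)
        table[i] = (struct ipt_entry){NIL, NIL, NIL};
    /* Virtual pages 5 and 21 collide (both hash to 5); the chain resolves it. */
    table[5] = (struct ipt_entry){5, 0, 6};
    table[6] = (struct ipt_entry){21, 1, NIL};
    printf("vpn 21 -> frame %d\n", lookup(21));
    printf("vpn 9  -> frame %d (page fault)\n", lookup(9));
    return 0;
}
```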
Segmentation allows programmer to view memory as multiple address spaces - segments
Each segment has its own access and usage rights
Provides a number of advantages:
Simplifies dynamic data structures, as segments can grow/shrink
Program sections can be altered and recompiled independently without relinking and reloading the whole program
Can be shared among processes
Access privileges give protection
Programs divided into segments which are logical parts of variable length
Segments are divided into pages; the segment table entry locates that segment's page table, which the page number then indexes
Two levels of lookup tables, with the address split into 3 parts: segment number, page number, and offset
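A sketch of the two-level lookup, with assumed field widths (4-bit segment number, 8-bit page number, 12-bit offset) that are not taken from the notes:

```c
#include <stdio.h>
#include <stdint.h>

#define OFFSET_BITS 12   /* assumed field widths, for illustration only */
#define PAGE_BITS   8
#define SEG_BITS    4

static uint32_t page_table_seg0[1 << PAGE_BITS];               /* page -> frame */
static uint32_t *segment_table[1 << SEG_BITS] = { page_table_seg0 };

static uint64_t translate(uint32_t vaddr) {
    uint32_t seg    = vaddr >> (PAGE_BITS + OFFSET_BITS);
    uint32_t page   = (vaddr >> OFFSET_BITS) & ((1 << PAGE_BITS) - 1);
    uint32_t offset = vaddr & ((1 << OFFSET_BITS) - 1);
    uint32_t *pt    = segment_table[seg];   /* 1st level: segment table -> page table */
    uint32_t frame  = pt[page];             /* 2nd level: page table -> frame */
    return ((uint64_t)frame << OFFSET_BITS) | offset;
}

int main(void) {
    page_table_seg0[3] = 42;                /* page 3 of segment 0 -> frame 42 */
    uint32_t vaddr = (0u << 20) | (3u << 12) | 0x0AB;
    printf("virtual 0x%05X -> physical 0x%06llX\n",
           (unsigned)vaddr, (unsigned long long)translate(vaddr));
    return 0;
}
```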
Translation Lookaside Buffer (TLB) holds most recently reference table entries as a cache
When TLB misses, there is a significant overhead in searching main memory page tables
Average address translation time tt = ttlb + (1 − Htlb)tpt, where tpt is the time to search the page tables in main memory on a TLB miss
TLB miss ratio usually low, less than 0.01
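A worked example of the translation-time formula, with assumed figures (1 ns TLB access, 100 ns to search the main-memory page tables, 0.01 miss ratio):

```c
#include <stdio.h>

int main(void) {
    /* Assumed figures: TLB access 1 ns, page table search 100 ns, miss ratio 0.01. */
    double t_tlb = 1.0, t_pt = 100.0, miss_ratio = 0.01;
    double tt = t_tlb + miss_ratio * t_pt;   /* tt = ttlb + (1 - Htlb) * tpt = 2 ns */
    printf("average address translation time = %.1f ns\n", tt);
    return 0;
}
```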
Page size Sp has an impact on memory space utilisation factor
Too large, then excessive internal fragmentation
Too small, then page tables become large, which reduces space utilisation
Ss is the segment size in words, so when Ss>Sp, the last page assigned to a segment will contain on average Sp/2 words
Size of the page table associated with each segment is approx Ss/Sp words, assuming each table entry is 1 word
Memory overhead for each segment is S = Sp/2 + Ss/Sp
Space utilisation is therefore u = Ss/(Ss + S) = 2SsSp/(Sp² + 2Ss(1 + Sp))
Optimum page size Sp(opt) = √(2Ss)
Optimum utilisation u(opt) = 1/(1 + √(2/Ss))
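The optimum follows from minimising the per-segment overhead S(Sp) = Sp/2 + Ss/Sp with respect to Sp, using only the definitions above:

```latex
\frac{dS}{dS_p} = \frac{1}{2} - \frac{S_s}{S_p^{2}} = 0
\;\Rightarrow\; S_p^{opt} = \sqrt{2S_s},
\qquad
S^{min} = \frac{\sqrt{2S_s}}{2} + \frac{S_s}{\sqrt{2S_s}} = \sqrt{2S_s},
\qquad
u^{opt} = \frac{S_s}{S_s + \sqrt{2S_s}} = \frac{1}{1 + \sqrt{2/S_s}}
```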
Hit ratio increases with page size up to a maximum, then begins to decrease again
Value of Sp yielding max hit ratios can be greater than the optimum page size for utilisation
When a page fault occurs, the memory management software is called to swap in a page from secondary storage
If memory is full, it is necessary to swap out a page
Efficient page replacement algorithm required
Doing it randomly would be fucking stupid, might evict something being used
FIFO is simple and removes oldest page, but still might evict something being used
Clock replacement algorithm modifies FIFO by keeping track of recent use through a use bit per page
Use bit is set when the page is referenced and cleared as the clock hand sweeps past; the first page found with use bit 0 is replaced
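A minimal sketch of this clock policy in C, assuming four frames and a made-up page trace (page -1 marks an empty frame):

```c
#include <stdio.h>
#include <stdbool.h>

#define NFRAMES 4

static int  page_in_frame[NFRAMES] = {-1, -1, -1, -1};
static bool use_bit[NFRAMES];
static int  hand = 0;

static void reference(int page) {
    for (int i = 0; i < NFRAMES; i++)
        if (page_in_frame[i] == page) { use_bit[i] = true; return; }  /* hit */

    /* Page fault: sweep the hand, clearing use bits, until a frame with
       use bit 0 is found; that frame's page is replaced. */
    while (use_bit[hand]) {
        use_bit[hand] = false;
        hand = (hand + 1) % NFRAMES;
    }
    printf("fault on page %d: replace page %d in frame %d\n",
           page, page_in_frame[hand], hand);
    page_in_frame[hand] = page;
    use_bit[hand] = true;
    hand = (hand + 1) % NFRAMES;
}

int main(void) {
    int trace[] = {1, 2, 3, 4, 1, 5, 2, 6};   /* made-up reference string */
    for (unsigned i = 0; i < sizeof trace / sizeof trace[0]; i++)
        reference(trace[i]);
    return 0;
}
```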
LRU algorithm works well but complex to implement, requires an age counter per entry
Usually approximated through use bits set at intervals
Working set replacement algorithm keeps track of the set of pages referenced during a time interval
Replaces the page which has not been referenced during the preceding time interval
As time passes, a moving window captures a working set of pages
Implementation is complex
Thrashing occurs when there are too many processes in too little memory, so the OS spends most of its time swapping pages rather than doing useful work
A processor often only needs one word of a block at a time
Request the missing word first and send it to the processor, then fill the remainder of the block
Most beneficial for large caches with large blocks
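A sketch of the fill order this implies ("critical word first"), assuming an 8-word block where word 5 missed:

```c
#include <stdio.h>

int main(void) {
    int words_per_block = 8;   /* assumed block size */
    int missed_word     = 5;   /* the word the processor actually asked for */
    for (int i = 0; i < words_per_block; i++) {
        int w = (missed_word + i) % words_per_block;   /* wrap-around fill order */
        printf("%s word %d\n", i == 0 ? "forward to CPU:" : "fill block:    ", w);
    }
    return 0;
}
```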
Merging write buffer
Write buffers are used by write-through and write-back caches
If write buffer is empty then data and full address are written to buffer
If the write buffer already contains other modified blocks, the addresses can be checked to see if the new data's address matches an existing buffer entry; if so, the new data is combined with that entry
Known as write merging
Reduces miss penalty
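An illustrative merging write buffer, assuming 4 entries that each cover a 4-word aligned block (sizes are made up for the example):

```c
#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

#define BUF_ENTRIES     4
#define WORDS_PER_ENTRY 4   /* each entry covers one 4-word aligned block */

struct wb_entry {
    bool     valid;
    uint32_t block;                        /* which aligned block this entry holds */
    uint32_t data[WORDS_PER_ENTRY];
    bool     word_valid[WORDS_PER_ENTRY];
};

static struct wb_entry buf[BUF_ENTRIES];

static void buffered_write(uint32_t word_addr, uint32_t value) {
    uint32_t block = word_addr / WORDS_PER_ENTRY;
    uint32_t word  = word_addr % WORDS_PER_ENTRY;

    for (int i = 0; i < BUF_ENTRIES; i++)
        if (buf[i].valid && buf[i].block == block) {      /* write merging */
            buf[i].data[word] = value;
            buf[i].word_valid[word] = true;
            printf("addr %u merged into entry %d\n", word_addr, i);
            return;
        }

    for (int i = 0; i < BUF_ENTRIES; i++)
        if (!buf[i].valid) {                              /* allocate a fresh entry */
            buf[i] = (struct wb_entry){ .valid = true, .block = block };
            buf[i].data[word] = value;
            buf[i].word_valid[word] = true;
            printf("addr %u placed in new entry %d\n", word_addr, i);
            return;
        }

    printf("buffer full: stall until an entry is written to memory\n");
}

int main(void) {
    buffered_write(100, 1);   /* block 25 -> new entry */
    buffered_write(101, 2);   /* same block -> merged, no extra entry used */
    buffered_write(108, 3);   /* block 27 -> new entry */
    return 0;
}
```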
Hardware prefetching
Put the data in cache before it's requested
Instruction prefetches usually done in hardware
Processor fetches two blocks on a miss, the missed block and then prefetches the next one
Prefetched block put in instruction stream buffer
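A sketch of next-block prefetching into a one-entry stream buffer, modelling only the miss path and ignoring the cache itself:

```c
#include <stdio.h>

static long stream_buffer = -1;   /* block number currently held, -1 if empty */

static void miss_on_block(long block) {
    if (block == stream_buffer)
        printf("block %ld: supplied by stream buffer\n", block);
    else
        printf("block %ld: fetched from memory\n", block);
    stream_buffer = block + 1;    /* prefetch the sequentially next block */
    printf("  prefetch block %ld into stream buffer\n", block + 1);
}

int main(void) {
    miss_on_block(10);   /* miss; block 11 is prefetched */
    miss_on_block(11);   /* satisfied by the stream buffer */
    miss_on_block(20);   /* non-sequential access: back to memory */
    return 0;
}
```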
Compiler driven prefetching
Reduces miss rate and penalty
Compiler inserts prefetching instructions based on what it can deduce about a program
Compiler can make other optimisations such as loop interchange and blocking
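For example, loop interchange: C arrays are row-major, so making the column index the innermost loop gives unit-stride accesses (a sketch of the idea, not a compiler's actual output):

```c
#include <stdio.h>

#define N 512

static double a[N][N];

/* Poor spatial locality: the inner loop strides through memory N elements apart. */
static double sum_column_order(void) {
    double total = 0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            total += a[i][j];
    return total;
}

/* After loop interchange: the inner loop walks consecutive memory locations. */
static double sum_row_order(void) {
    double total = 0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            total += a[i][j];
    return total;
}

int main(void) {
    printf("sums agree: %.1f %.1f\n", sum_column_order(), sum_row_order());
    return 0;
}
```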