Memory Management
Paging, segmentation, virtual memory, page replacement.
Paging
Page table, TLB, fragmentation.
Instruction pipelining — overlap execution of multiple instructions to improve throughput.
Classic 5-stage pipeline:
- IF (Instruction Fetch)
- ID (Instruction Decode) + register read
- EX (Execute) in ALU
- MEM (Memory access)
- WB (Write Back)
Without pipelining: 5 cycles per instruction → 1/5 instruction per cycle.
With pipelining (steady state): 1 instruction per cycle. 5× speedup in ideal case.
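To make the overlap concrete, a small Python sketch (purely illustrative, all names ours) that prints the stage occupancy per cycle for an ideal hazard-free pipeline:

```python
# Print a cycle-by-cycle occupancy diagram for an ideal 5-stage pipeline
# (no hazards). Instruction i enters stage s at cycle i + s.
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def pipeline_diagram(n_instructions: int) -> None:
    total_cycles = len(STAGES) + n_instructions - 1
    print("cycle " + " ".join(f"{c:>4}" for c in range(1, total_cycles + 1)))
    for i in range(n_instructions):
        row = ["    "] * total_cycles
        for s, name in enumerate(STAGES):
            row[i + s] = f"{name:>4}"
        print(f"I{i + 1:<4} " + " ".join(row))

pipeline_diagram(4)   # after the 4-cycle fill, one WB completes per cycle
```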
Pipeline hazards:
1. Structural hazard: two stages competing for the same resource (e.g., memory unit serving both IF and MEM). Fix: duplicate (separate I-cache and D-cache).
2. Data hazard: instruction depends on result of preceding one not yet computed.
- RAW (Read After Write): most common. ADD r1, r2, r3 followed by SUB r4, r1, r5 — SUB reads r1 before ADD wrote it.
- Mitigations: forwarding (bypass result from EX/MEM to next ALU input), stalling (insert bubbles), compiler reordering.
3. Control hazard: caused by branches.
- Wait until branch is resolved (stall ~3 cycles) — bad.
- Branch prediction: static (always taken / always not) or dynamic (history-based, e.g., 2-bit saturating counter, branch target buffer). Modern CPUs achieve >95% accuracy.
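A minimal sketch of the 2-bit saturating counter mentioned above (a single counter for one branch; real predictors keep a table of these, indexed by branch address):

```python
# One 2-bit saturating counter (states 0-3); 0-1 predict not-taken,
# 2-3 predict taken. The starting state is an assumption.
def predict_run(outcomes):
    counter, correct = 2, 0                  # start at "weakly taken"
    for taken in outcomes:
        correct += (counter >= 2) == taken   # compare prediction to outcome
        counter = min(counter + 1, 3) if taken else max(counter - 1, 0)
    return correct / len(outcomes)

# A loop branch taken 9 times, then falling through once:
print(predict_run([True] * 9 + [False]))     # 0.9; only the loop exit mispredicts
```

The saturation is the point: one anomalous outcome nudges the counter a single step, so a loop branch is not mispredicted twice per loop the way a 1-bit scheme would be.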
Speedup formula, for k stages and n instructions at a fixed cycle time: S = (n × k) / (k + n − 1 + stall cycles).
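Plugging in illustrative numbers (100 instructions, 5 stages, 1 ns cycle, an assumed 20 total stall cycles):

```python
# Illustrative speedup calculation; the 20 stall cycles are an assumption.
n, k, cycle = 100, 5, 1.0                      # instructions, stages, ns
stalls = 20
t_unpipelined = n * k * cycle                  # 500 ns
t_pipelined = (k + n - 1 + stalls) * cycle     # 124 ns
print(t_unpipelined / t_pipelined)             # ~4.03x, under the ideal 5x
```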
MEMORY HIERARCHY
| Level | Size | Speed | Cost/byte |
|---|---|---|---|
| Registers (CPU) | ~1 KB | < 1 ns | Highest |
| L1 cache | tens of KB | 1-2 ns | High |
| L2 cache | hundreds of KB | 3-10 ns | High |
| L3 cache | MBs | 10-30 ns | Moderate |
| Main memory (DRAM) | GBs | 50-100 ns | Low |
| SSD / Disk | TBs | ~0.1 ms (SSD) to ~10 ms (HDD) | Lowest |
Principle of locality justifies caches:
- Temporal locality: recently-used items used again soon.
- Spatial locality: items near recently-used items likely to be used.
CACHE BASICS
When CPU requests data, check cache first.
- Hit: data found in cache. Fast.
- Miss: must fetch from lower level. Slow.
Average memory access time (AMAT):
AMAT = Hit time + Miss rate × Miss penalty.
Example: Hit time = 1 ns, hit rate = 0.95, miss penalty = 100 ns.
AMAT = 1 + 0.05 × 100 = 6 ns.
Even a 5% miss rate makes the average access six times the 1 ns hit time. Hence the obsession with high hit rates.
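The formula transcribes directly to code; the two-level extension below uses assumed L2 numbers purely for illustration:

```python
# AMAT transcribed directly; the L2 figures are assumptions for illustration.
def amat(hit_time, miss_rate, miss_penalty):
    return hit_time + miss_rate * miss_penalty

print(amat(1.0, 0.05, 100.0))                  # 6.0 ns, as in the example

# Multi-level: an L1 miss penalty is itself the AMAT of the next level.
l2_amat = amat(hit_time=10.0, miss_rate=0.20, miss_penalty=100.0)  # 30 ns
print(amat(1.0, 0.05, l2_amat))                # 2.5 ns
```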
CACHE MAPPING POLICIES — where does a memory block go in cache?
1. Direct mapped: block goes to one specific cache line. Simple but conflict misses when multiple blocks map to same line.
2. Fully associative: block can go anywhere. No conflict misses, but lookup is expensive (every tag must be compared, in parallel in hardware).
3. Set associative: k-way; block goes to one of k lines in a set. Compromise between simplicity and flexibility. Common: 4-way, 8-way.
Address breakdown:
- Tag (uniquely identifies block).
- Index (which set).
- Offset (byte within block).
REPLACEMENT POLICIES (when cache is full):
- LRU (Least Recently Used): evict least recently accessed. Good but expensive to track.
- FIFO: simple but suffers Belady's anomaly.
- Random: surprisingly competitive; simple.
- NRU and pseudo-LRU: cheap approximations of LRU. MRU is a distinct policy that can beat LRU on large cyclic scans.
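In software, true LRU is a few lines with OrderedDict; the per-access bookkeeping below is what makes exact LRU too expensive in hardware, hence the approximations above. A minimal sketch:

```python
# Minimal LRU cache: OrderedDict keeps entries in recency order.
from collections import OrderedDict

class LRUCache:
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.data = OrderedDict()

    def get(self, key):
        if key not in self.data:
            return None                      # miss
        self.data.move_to_end(key)           # mark as most recently used
        return self.data[key]

    def put(self, key, value):
        if key in self.data:
            self.data.move_to_end(key)
        self.data[key] = value
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)    # evict least recently used
```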
WRITE POLICIES:
1. Write-through: write to cache AND main memory simultaneously. Simple, consistent. Slow (bottlenecked by memory; usually softened with a write buffer).
2. Write-back: write only to cache; mark "dirty"; flush to memory when evicted. Fast but consistency overhead.
Combined with:
- Write-allocate: on write miss, load block into cache first.
- Write-no-allocate: on write miss, write directly to memory.
Typical pairings: write-back + write-allocate, OR write-through + no-allocate.
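A toy model of write-back + write-allocate (class and method names are ours; fully associative with FIFO eviction, chosen for brevity). The point is the dirty bit: memory is updated only when a dirty line is evicted:

```python
# Toy write-back + write-allocate cache; "memory" is just a dict.
class WriteBackCache:
    def __init__(self, capacity: int, memory: dict):
        self.capacity, self.memory = capacity, memory
        self.lines = {}                          # addr -> [value, dirty]

    def _make_room(self):
        if len(self.lines) >= self.capacity:
            addr, (value, dirty) = next(iter(self.lines.items()))
            if dirty:
                self.memory[addr] = value        # lazy flush on eviction
            del self.lines[addr]

    def read(self, addr):
        if addr not in self.lines:               # read miss: fetch the block
            self._make_room()
            self.lines[addr] = [self.memory.get(addr), False]
        return self.lines[addr][0]

    def write(self, addr, value):
        if addr not in self.lines:               # write-allocate on miss
            self._make_room()
        self.lines[addr] = [value, True]         # dirty: memory is now stale
```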
Worked example. A direct-mapped cache has 64 lines, each 16 bytes. Memory address is 32 bits. Find tag, index, offset fields.
- 16 bytes → 4 bits offset.
- 64 lines → 6 bits index.
- Tag = 32 − 6 − 4 = 22 bits.
For a 32-bit address ABCDEF12 (hex):
- Offset = low 4 bits = 0010 = 2.
- Index = next 6 bits = 110001 = 49 (0x31).
- Tag = upper 22 bits = 0x2AF37B.
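The same field extraction in code, checked against the values above:

```python
# Split a 32-bit address into tag / index / offset for this cache
# (16-byte blocks -> 4 offset bits; 64 lines -> 6 index bits).
OFFSET_BITS, INDEX_BITS = 4, 6

def split_address(addr: int):
    offset = addr & ((1 << OFFSET_BITS) - 1)            # low 4 bits
    index = (addr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset

tag, index, offset = split_address(0xABCDEF12)
print(hex(tag), index, offset)    # 0x2af37b 49 2
```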
Page replacement
FIFO, LRU, optimal, second-chance.
When physical memory is full and a new page is referenced, the OS must evict an existing page. Page replacement algorithms decide which one.
Three classics:
1. FIFO (First-In, First-Out). Evict the page that has been in memory longest.
- Simple to implement (queue).
- Suffers from Belady's anomaly: more frames can paradoxically cause MORE faults.
2. LRU (Least Recently Used). Evict the page that hasn't been used for the longest time.
- Closer to optimal in practice.
- Implementation: stack/counter (expensive) or approximation via "use bits".
- Does NOT suffer Belady's anomaly (it's a "stack algorithm").
3. Optimal (OPT, Belady's). Evict the page that won't be needed for the longest time in the future.
- Provably optimal — fewest faults.
- Requires future knowledge → unimplementable in real OS, but used as a benchmark.
Worked example. Reference string: 1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5. With 3 frames:
FIFO: 1*, 2*, 3*, 4* (replaces 1), 1* (replaces 2), 2* (replaces 3), 5* (replaces 4), 1, 2, 3* (replaces 1), 4* (replaces 2), 5 (hit). 9 faults.
LRU: 1*, 2*, 3*, 4* (LRU=1), 1* (LRU=2), 2* (LRU=3), 5* (LRU=4), 1, 2, 3* (LRU=5), 4* (LRU=1), 5* (LRU=2). 10 faults.
OPT: 1*, 2*, 3*, 4* (replaces 3, whose next use is farthest away), 1, 2, 5* (replaces 4), 1, 2, 3* (replaces 1, never used again), 4* (replaces 2, never used again), 5 (hit). 7 faults.
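A small simulator (function names are ours) reproduces the three counts:

```python
# Fault counters for the three classic policies.
def fifo_faults(refs, n):
    frames, faults = [], 0
    for p in refs:
        if p not in frames:
            faults += 1
            frames.append(p)
            if len(frames) > n:
                frames.pop(0)                 # evict oldest arrival
    return faults

def lru_faults(refs, n):
    frames, faults = [], 0
    for p in refs:
        if p in frames:
            frames.remove(p)                  # refresh recency on a hit
        else:
            faults += 1
            if len(frames) == n:
                frames.pop(0)                 # front = least recently used
        frames.append(p)
    return faults

def opt_faults(refs, n):
    frames, faults = set(), 0
    for i, p in enumerate(refs):
        if p in frames:
            continue
        faults += 1
        if len(frames) == n:
            rest = refs[i + 1:]
            # evict the page whose next use is farthest away (or never)
            far = lambda q: rest.index(q) if q in rest else float("inf")
            frames.remove(max(frames, key=far))
        frames.add(p)
    return faults

refs = [1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5]
print(fifo_faults(refs, 3), lru_faults(refs, 3), opt_faults(refs, 3))  # 9 10 7
```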
Belady's anomaly demo. With FIFO and reference: 1,2,3,4,1,2,5,1,2,3,4,5:
- 3 frames → 9 page faults
- 4 frames → 10 page faults (paradox!)
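Reusing fifo_faults and refs from the sketch above, the anomaly is one comparison:

```python
print(fifo_faults(refs, 3))   # 9 faults
print(fifo_faults(refs, 4))   # 10 faults: more frames, more faults
```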
Approximations used in real OSes:
- Second-chance / Clock: FIFO with a use-bit. If the candidate's use-bit is 1, give it a second chance and clear the bit.
- NRU: classify pages by (referenced, modified) bits.
- Aging: per-page counter shifted right each tick, with the reference bit ORed into the top; approximates LRU. WSClock combines the clock scan with a working-set age test.
Linux approximates this with separate active and inactive page lists, promoting and demoting pages based on referenced bits.
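A minimal sketch of the second-chance / clock scan described above: the hand clears use bits as it passes and evicts the first page whose bit is already 0.

```python
# Clock / second-chance page replacement. access() returns True on a hit.
class Clock:
    def __init__(self, n_frames: int):
        self.pages = [None] * n_frames       # resident pages (None = free)
        self.use = [0] * n_frames            # reference ("use") bits
        self.hand = 0

    def access(self, page) -> bool:
        if page in self.pages:               # hit: hardware would set the bit
            self.use[self.pages.index(page)] = 1
            return True
        while True:                          # miss: scan for a victim
            if self.pages[self.hand] is None or self.use[self.hand] == 0:
                self.pages[self.hand] = page
                self.use[self.hand] = 1
                self.hand = (self.hand + 1) % len(self.pages)
                return False
            self.use[self.hand] = 0          # second chance: clear, move on
            self.hand = (self.hand + 1) % len(self.pages)
```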