Virtual memory and paging

Every process sees a private 48-bit address space; the MMU maps it to physical frames page-by-page.

The answer

Virtual memory gives every process the illusion of a private, contiguous address space, usually 48 bits (256TB) on x86_64. The MMU translates virtual addresses to physical addresses page-by-page through a 4-level (or 5-level) page table walk, with the TLB caching recent translations. A page is 4KB by default, 2MB or 1GB with huge pages.
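
The sizes above follow from plain arithmetic. A minimal sketch in Python: the function names are made up for this note, and the TLB entry count of 1536 is an illustrative assumption, not a figure for any particular CPU.

```python
# Sizes implied by a 48-bit virtual address space and the x86_64 page sizes.
VA_BITS = 48
PAGE_SIZES = {"4KB": 4 * 1024, "2MB": 2 * 1024**2, "1GB": 1024**3}

# Hypothetical TLB capacity, chosen only for illustration.
ASSUMED_TLB_ENTRIES = 1536


def address_space_bytes(bits: int = VA_BITS) -> int:
    """Total bytes addressable with the given number of address bits."""
    return 1 << bits


def pages_needed(region_bytes: int, page_bytes: int) -> int:
    """Number of pages required to cover a region, rounded up."""
    return -(-region_bytes // page_bytes)


def tlb_reach(entries: int, page_bytes: int) -> int:
    """Bytes of memory covered when every TLB entry maps one page."""
    return entries * page_bytes


if __name__ == "__main__":
    print(f"address space: {address_space_bytes() // 1024**4} TiB")
    for name, size in PAGE_SIZES.items():
        pages = pages_needed(1024**3, size)
        reach = tlb_reach(ASSUMED_TLB_ENTRIES, size)
        print(f"{name}: {pages} pages per GiB, TLB reach {reach // 1024**2} MiB")
```

The reach figure is why huge pages help: the same number of TLB entries covers 512 times more memory with 2MB pages than with 4KB pages.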

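Pages are allocated lazily on first touch (demand paging): an anonymous page gets a physical frame only when it is first written, and a file-backed page is read in when it is first accessed. Pages can be swapped out to disk under memory pressure. Pages can be shared between processes (shared libraries such as libc, copy-on-write pages after fork). The cost of a TLB miss is one page walk, around 100 cycles, less when the page-table entries are already in cache. The cost of a page fault that triggers disk I/O is millions of cycles.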
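
Demand paging can be observed from user space. A rough sketch, assuming Linux (it reads /proc/self/statm): the 32 MiB size and the function names are arbitrary choices for this note. It maps anonymous memory, then writes one byte per page and compares resident memory at each step.

```python
import mmap
import os

PAGE = mmap.PAGESIZE
SIZE = 32 * 1024 * 1024  # 32 MiB, arbitrary size for the demo


def resident_bytes() -> int:
    """Current RSS of this process, from /proc/self/statm (Linux only)."""
    with open("/proc/self/statm") as f:
        resident_pages = int(f.read().split()[1])  # second field: resident pages
    return resident_pages * os.sysconf("SC_PAGE_SIZE")


def demand_paging_demo(size: int = SIZE) -> tuple[int, int, int]:
    """Return RSS before mapping, after mapping, and after touching every page."""
    before = resident_bytes()
    # Private anonymous mapping: address space is reserved, frames are not.
    buf = mmap.mmap(-1, size, flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS)
    mapped = resident_bytes()
    for offset in range(0, size, PAGE):
        buf[offset] = 1  # first write to each page triggers a page fault
    touched = resident_bytes()
    buf.close()
    return before, mapped, touched
```

On a typical Linux system the second value stays close to the first, and the third is larger by roughly the size of the mapping.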

The translation

virtual address (48 bits)
  -> PML4 index (9 bits) -> page directory pointer table
  -> PDPT index  (9 bits) -> page directory
  -> PD index    (9 bits) -> page table
  -> PT index    (9 bits) -> physical frame
  +  page offset (12 bits, copied through unchanged)
= physical address (frame number plus offset; width is CPU-dependent, up to 52 bits on x86_64)

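Four levels means up to four dependent memory accesses per translation when the TLB misses, because each entry holds the address of the next table. Hardware page walkers in the MMU do the walk without software involvement and can overlap it with other work, but it is still expensive.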
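
The bit split in the diagram can be written as shifts and masks. A minimal sketch for 4-level paging with 4KB pages: the function names are made up for this note, and the input is assumed to be the low 48 bits of the address (real 64-bit pointers sign-extend bit 47 into the upper bits).

```python
# Field widths for 4-level x86_64 paging with 4KB pages.
OFFSET_BITS = 12
INDEX_BITS = 9
LEVELS = ("PML4", "PDPT", "PD", "PT")  # highest level first


def split_virtual_address(vaddr: int) -> dict[str, int]:
    """Split a 48-bit virtual address into four table indices and an offset."""
    if not 0 <= vaddr < (1 << 48):
        raise ValueError("expected the low 48 bits of a virtual address")
    fields = {"offset": vaddr & ((1 << OFFSET_BITS) - 1)}
    shift = OFFSET_BITS
    # Peel off 9 bits at a time, from the lowest level (PT) up to PML4.
    for level in reversed(LEVELS):
        fields[level] = (vaddr >> shift) & ((1 << INDEX_BITS) - 1)
        shift += INDEX_BITS
    return fields


def join_fields(fields: dict[str, int]) -> int:
    """Inverse of split_virtual_address, useful as a sanity check."""
    vaddr = 0
    for level in LEVELS:
        vaddr = (vaddr << INDEX_BITS) | fields[level]
    return (vaddr << OFFSET_BITS) | fields["offset"]


if __name__ == "__main__":
    for name, value in split_virtual_address(0x7FFFFFFFF123).items():
        print(f"{name:>6}: {value}")
```

Each 9-bit index selects one of 512 entries in a 4KB table of 8-byte entries, which is why every level of the tree fits in exactly one page.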

What lives where

A typical Linux process layout:

Stack at the top, grows down. Usually 8MB cap.
Memory-mapped region: shared libs, mmap'd files, anonymous mmaps for big allocations. Grows down.
Heap: malloc territory. Grows up via brk or via mmap for large allocs.
BSS: zero-initialized globals.
Data: initialized globals.
Text: code, read-only.

Process address space layout, top to bottom.
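
The layout of a live process can be read from /proc/PID/maps. A small sketch, assuming Linux: the Mapping type and function names are made up for this note. It parses each line and lists regions from the highest address down, matching the order above.

```python
from typing import NamedTuple


class Mapping(NamedTuple):
    start: int
    end: int
    perms: str
    name: str  # file path, a pseudo-name like [heap] or [stack], or "" if anonymous


def parse_maps_line(line: str) -> Mapping:
    """Parse one line of /proc/PID/maps into a Mapping."""
    # Fields: address range, perms, offset, dev, inode, optional pathname.
    parts = line.split(maxsplit=5)
    start, end = (int(x, 16) for x in parts[0].split("-"))
    name = parts[5].strip() if len(parts) > 5 else ""
    return Mapping(start, end, parts[1], name)


def read_maps(pid: str = "self") -> list[Mapping]:
    """Read every mapping of a process, highest address first (Linux only)."""
    with open(f"/proc/{pid}/maps") as f:
        mappings = [parse_maps_line(line) for line in f if line.strip()]
    return sorted(mappings, key=lambda m: m.start, reverse=True)
```

Calling read_maps() and printing each entry typically shows [stack] near the top, shared libraries below it, and [heap] just above the program's own text and data mappings.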

RSS vs VSZ vs PSS

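VSZ (virtual size): every page the process has mapped, including pages that were never touched and have no physical frame. Mostly meaningless as a measure of memory use.
RSS (resident set size): physical memory currently backing this process's pages. Counts each shared page in full for every process that maps it, so summing RSS across processes over-counts.
PSS (proportional set size): RSS but shared pages are divided by the number of sharers. Summing PSS across all processes gives the physical memory actually used by process mappings (kernel memory and unmapped page cache are not included).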
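
A toy accounting model makes the difference concrete. The numbers below are hypothetical: three processes, each with 1000 private pages, all sharing the same 3000 pages.

```python
PAGE_KB = 4  # 4KB pages


def rss_kb(private_pages: int, shared_pages: int) -> int:
    """RSS counts every resident page in full, shared or not."""
    return (private_pages + shared_pages) * PAGE_KB


def pss_kb(private_pages: int, shared_pages: int, sharers: int) -> float:
    """PSS charges each shared page as 1/sharers of a page."""
    return (private_pages + shared_pages / sharers) * PAGE_KB


def physical_kb(processes: int, private_pages: int, shared_pages: int) -> int:
    """Frames in use: private pages per process plus one copy of the shared pages."""
    return (processes * private_pages + shared_pages) * PAGE_KB


if __name__ == "__main__":
    n, private, shared = 3, 1000, 3000  # hypothetical workload
    print("sum of RSS:", n * rss_kb(private, shared), "KB")
    print("sum of PSS:", n * pss_kb(private, shared, n), "KB")
    print("physical  :", physical_kb(n, private, shared), "KB")
```

With these made-up numbers the RSS values sum to 48000 KB, while the PSS sum and the physical total both come to 24000 KB.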

Look at PSS via smem or /proc/PID/smaps_rollup. RSS is fine for a single process, but it misleads when comparing or summing across processes that share libraries.
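
Reading both numbers for one process takes a few lines. A sketch, assuming a Linux kernel that provides /proc/PID/smaps_rollup (4.14 or later) and permission to read it. The function names are made up for this note.

```python
def parse_smaps_rollup(text: str) -> dict[str, int]:
    """Map field names such as 'Rss' and 'Pss' to their values in kB."""
    fields = {}
    for line in text.splitlines():
        parts = line.split()
        # Data lines look like "Pss:    567 kB"; the header line does not.
        if len(parts) == 3 and parts[0].endswith(":") and parts[2] == "kB":
            fields[parts[0][:-1]] = int(parts[1])
    return fields


def memory_summary(pid: str = "self") -> dict[str, int]:
    """Return RSS and PSS in kB for one process (Linux only)."""
    with open(f"/proc/{pid}/smaps_rollup") as f:
        fields = parse_smaps_rollup(f.read())
    return {"rss_kb": fields["Rss"], "pss_kb": fields["Pss"]}
```

For a process that maps shared libraries, pss_kb is normally smaller than rss_kb. The gap is the share of those pages charged to other processes.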

Swap and OOM

Under memory pressure, the kernel reclaims pages. Clean file-backed pages get dropped (the file is the backup). Dirty file-backed pages get written back to their file first, then dropped. Anonymous pages get swapped to disk if swap is configured. If reclaim cannot free enough memory, for example because swap is off or full, the OOM killer picks a process and kills it.
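
Whether swap is off, full, or still available can be checked from /proc/meminfo. A sketch, assuming Linux: SwapTotal and SwapFree are real meminfo fields, while the function names and the three status labels are made up for this note.

```python
def parse_meminfo(text: str) -> dict[str, int]:
    """Parse /proc/meminfo-style text into {field: value}; sizes are in kB."""
    info = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        values = rest.split()
        if sep and values:
            info[key.strip()] = int(values[0])
    return info


def swap_status(info: dict[str, int]) -> str:
    """Classify swap as 'off', 'full', or 'available' from parsed meminfo."""
    total = info.get("SwapTotal", 0)
    free = info.get("SwapFree", 0)
    if total == 0:
        return "off"
    if free == 0:
        return "full"
    return "available"


def read_swap_status() -> str:
    """Read /proc/meminfo and report the swap status (Linux only)."""
    with open("/proc/meminfo") as f:
        return swap_status(parse_meminfo(f.read()))
```

A status of "off" or "full" means anonymous pages have nowhere to go, so sustained pressure ends with the OOM killer rather than with swapping.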

On servers, swap is controversial. Disabling swap makes OOM behavior more predictable (you crash earlier, but you crash visibly). Keeping swap helps tolerate brief spikes but can cause death-by-swap if the working set exceeds RAM.

The interview answer

"Every process gets a private 48-bit address space. The MMU translates page-by-page through a 4-level walk, with the TLB caching results. Pages are 4KB, allocated lazily on first touch. The big real-world traps are confusing RSS with actual usage (shared libs inflate it), and swap making latency variable. For high-perf work I use huge pages (2MB) to reduce TLB pressure."

Learn more

Docs
OSTEP: Paging
Paper
What every programmer should know about memory - Ulrich Drepper