Virtual memory and paging
Page tables, TLB, huge pages, COW, demand paging, NUMA, and the per-process /proc files that tell you the truth.
Why virtual memory exists
Three reasons. First, isolation: process A cannot read process B's memory because their address spaces are different. Second, abstraction: each program sees a clean, contiguous space starting at 0, not the fragmented physical reality. Third, overcommit: you can allocate more virtual memory than physical RAM, and physical pages only get attached on first use.
Without virtual memory you would either statically partition RAM (terrible utilization) or have processes share one address space (memory safety nightmare, see classic Mac OS 9 and Windows 3.1).
Page tables on x86_64
x86_64 uses 4-level page tables by default. A 48-bit virtual address breaks into:
[ sign extend ][ PML4 9 ][ PDPT 9 ][ PD 9 ][ PT 9 ][ offset 12 ]
CR3 holds the physical address of the PML4. To translate, the MMU walks: PML4 entry points to a page directory pointer table, that entry points to a page directory, that to a page table, that to the actual physical frame. The 12 low bits of the virtual address are the offset within the 4KB frame.
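The bit slicing described above can be sketched in a few lines of Python. This only illustrates the index arithmetic, not how the MMU is built; the function name split_va is made up for this example.

```python
def split_va(va):
    """Split a 48-bit x86_64 virtual address into its page-walk indices."""
    offset = va & 0xFFF           # low 12 bits: byte offset within the 4KB frame
    pt = (va >> 12) & 0x1FF       # next 9 bits: index into the page table
    pd = (va >> 21) & 0x1FF       # index into the page directory
    pdpt = (va >> 30) & 0x1FF     # index into the page directory pointer table
    pml4 = (va >> 39) & 0x1FF     # top 9 bits: index into the PML4
    return pml4, pdpt, pd, pt, offset


# Build an address from known indices, then take it apart again.
va = (1 << 39) | (2 << 30) | (3 << 21) | (4 << 12) | 5
print(hex(va), split_va(va))
```

Each 9-bit field selects one of the 512 entries in the table at that level, which is why every shift differs from the previous one by 9.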
Each table is one page (4KB) holding 512 entries of 8 bytes each. Fully populated page tables for a 256TB address space would be huge, but the lower-level tables are themselves allocated lazily, only when something is mapped in the range they cover. A nearly empty process needs little more than its top-level PML4.
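To put a number on "huge," here is a rough back-of-the-envelope calculation. It assumes every 4KB page of the 48-bit space is mapped, which no real process does; the constant and function names are illustrative only.

```python
ENTRY_BYTES = 8
ENTRIES_PER_TABLE = 512
PAGE_BYTES = 4096
VA_BITS = 48


def full_table_bytes():
    """Bytes of page tables needed to map every 4KB page of a 48-bit space."""
    sizes = {}
    entries = 2 ** VA_BITS // PAGE_BYTES       # one PT entry per 4KB page
    for level in ("PT", "PD", "PDPT", "PML4"):
        sizes[level] = entries * ENTRY_BYTES   # 8 bytes per entry at this level
        entries //= ENTRIES_PER_TABLE          # one parent entry per 512 children
    return sizes


for level, size in full_table_bytes().items():
    print(level, size, "bytes")
```

Nearly all of the cost sits in the lowest level, which is exactly the level that lazy allocation avoids populating.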
5-level paging (Ice Lake server and later) adds a PML5 layer, extending VA to 57 bits (128PB). Userspace must opt in via mmap hints because some software assumes pointers fit in 47 bits.
The TLB
A page walk is 4 dependent memory accesses. Even if every level hits in L1 cache (roughly 4 cycles, about 1ns, each), that adds several nanoseconds to every translation, and in practice many walks miss in L1 and go to L3 or memory, costing 100ns or more.
To avoid this, the CPU has a Translation Lookaside Buffer (TLB) that caches recent VA-to-PA translations. Typical sizes:
- L1 dTLB: 64-128 entries, 1 cycle hit
- L1 iTLB: 32-64 entries
- L2 TLB (shared): 1024-2048 entries, 7-10 cycle hit
- TLB miss: full page walk, 100+ cycles
A TLB entry covers one page (4KB), so 1024 entries covers 4MB of working set. If your hot data is 100MB, you will TLB-thrash. The fix is huge pages.
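The "TLB reach" arithmetic is simple enough to check by hand, but a small sketch makes the effect of page size obvious. The entry count of 1024 is taken from the text; whether a given CPU can hold that many huge-page entries is hardware dependent and is an assumption here.

```python
KB, MB, GB = 2**10, 2**20, 2**30


def tlb_reach(entries, page_bytes):
    """Bytes of address space covered when every TLB entry is in use."""
    return entries * page_bytes


for label, page in (("4KB", 4 * KB), ("2MB", 2 * MB), ("1GB", GB)):
    print(label, "pages:", tlb_reach(1024, page) // MB, "MB of reach")
```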
Huge pages
A 2MB huge page is one TLB entry instead of 512. A 1GB huge page is one entry instead of 262,144. Linux supports both via MAP_HUGETLB on mmap or transparent huge pages (THP).
- Explicit huge pages (hugetlbfs): you reserve N huge pages at boot via /proc/sys/vm/nr_hugepages. Postgres, MySQL, JVMs, DPDK all support this. Predictable. No fragmentation. Requires planning.
- Transparent huge pages (THP): kernel automatically promotes 2MB-aligned anonymous regions to huge pages. Easy. Has caused latency spikes (THP defrag can stall) and is disabled by many databases.
Rule: for known hot regions (database buffer pool, KVS), use explicit huge pages. For general code, either set THP to madvise mode or off.
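Before changing the THP setting it helps to know what it currently is. On Linux the mode is exposed in /sys/kernel/mm/transparent_hugepage/enabled as a line such as "always [madvise] never", with the active mode in brackets. A minimal sketch for reading it, where the helper name active_thp_mode is made up for this example:

```python
THP_ENABLED = "/sys/kernel/mm/transparent_hugepage/enabled"


def active_thp_mode(text):
    """Return the bracketed word from a line like 'always [madvise] never'."""
    for word in text.split():
        if word.startswith("[") and word.endswith("]"):
            return word[1:-1]
    return None


if __name__ == "__main__":
    try:
        with open(THP_ENABLED) as f:
            print("THP mode:", active_thp_mode(f.read()))
    except OSError:
        print("THP sysfs file not available on this system")
```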
Demand paging and COW
When a process calls malloc(1GB), libc usually calls mmap with PROT_READ|PROT_WRITE and MAP_ANONYMOUS. The kernel records the mapping but does NOT allocate any physical pages. When the process first writes to a page, a page fault traps into the kernel, which allocates a physical frame (zero-filled for MAP_ANONYMOUS) and returns.
This is why top may show a virtual size (VIRT, the number ps calls VSZ) of 4GB but a resident size (RES, or RSS) of 50MB. Most of the allocation is mapped but unbacked.
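The gap between mapped and resident memory can be observed from inside a process. The sketch below is Linux-only and assumes /proc/self/statm is readable; it maps 64MB anonymously, then touches one byte per page and compares RSS before and after. The helper names are illustrative.

```python
import mmap
import os


def parse_statm_rss(text, page_size):
    """Resident set size in bytes from the contents of /proc/PID/statm."""
    resident_pages = int(text.split()[1])    # second field: resident pages
    return resident_pages * page_size


def current_rss():
    with open("/proc/self/statm") as f:
        return parse_statm_rss(f.read(), os.sysconf("SC_PAGE_SIZE"))


if __name__ == "__main__" and os.path.exists("/proc/self/statm"):
    size = 64 * 1024 * 1024
    page = os.sysconf("SC_PAGE_SIZE")
    # Mapping alone only records the region; no frames are attached yet.
    region = mmap.mmap(-1, size, flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS)
    before = current_rss()
    for off in range(0, size, page):
        region[off] = 1                       # first write faults the page in
    after = current_rss()
    print("RSS grew by about", (after - before) // (1024 * 1024), "MB")
    region.close()
```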
Copy-on-write (COW) after fork: the child gets its own page tables, but they point at the same physical frames as the parent's, and writable private pages are marked read-only in both. When either process writes, a page fault triggers, and the kernel allocates a fresh frame, copies the page into it, and maps it writable for the writer.
The same sharing of physical frames is how /lib/x86_64-linux-gnu/libc.so.6 is shared across every process on the system. The libc text section is mapped read-only into every process, backed by the same physical frames; because it is never written, no copy is ever made.
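The visible effect of this behaviour is easy to demonstrate, even though the page faults themselves are hidden. In the sketch below (Unix-only, since it uses os.fork) the child overwrites a buffer it inherited, and the parent's copy stays untouched. The function name is made up, and the example shows the semantics COW provides rather than the kernel mechanism.

```python
import os


def fork_write_demo():
    """Return (parent's view, child's view) of a buffer the child modifies."""
    data = bytearray(b"original")
    r, w = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: this write lands in the child's own copy of the page.
        os.close(r)
        data[:] = b"modified"
        os.write(w, bytes(data))
        os._exit(0)
    # Parent: read what the child saw, then look at our own buffer.
    os.close(w)
    child_view = os.read(r, 64)
    os.close(r)
    os.waitpid(pid, 0)
    return bytes(data), child_view


if __name__ == "__main__":
    print(fork_write_demo())
```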
RSS vs VSZ vs PSS, the truth
VSZ: every page mapped, including unbacked. Includes mmap holes. Mostly garbage.
RSS: physical pages currently in this process's address space. Shared pages counted in FULL for each sharer.
PSS: like RSS but shared pages divided by number of sharers. Summing PSS across all processes gives the real memory used by userspace.
USS: unique set size. Pages mapped only by this process. What you would free by killing it.
To see USS and PSS:
cat /proc/PID/smaps_rollup
smem -P process-name
For server capacity planning, PSS is the honest number. For "what will I get back if I kill this," USS.
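If smem is not installed, the same numbers can be pulled out of smaps_rollup by hand. The sketch below parses the "Name: N kB" lines and approximates USS as Private_Clean plus Private_Dirty, which is an assumption of this example. The sample text is made up for illustration, not captured from a real process.

```python
def parse_rollup(text):
    """Return {field: kB} for the 'Name:  N kB' lines of smaps_rollup."""
    fields = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 3 and parts[0].endswith(":") and parts[2] == "kB":
            fields[parts[0][:-1]] = int(parts[1])
    return fields


def uss_kb(fields):
    # USS approximated as private pages only (clean + dirty).
    return fields.get("Private_Clean", 0) + fields.get("Private_Dirty", 0)


# Hypothetical sample in the shape of /proc/PID/smaps_rollup.
SAMPLE = """\
55d0c0000000-7ffd0a1f2000 ---p 00000000 00:00 0    [rollup]
Rss:                5000 kB
Pss:                3200 kB
Shared_Clean:       2400 kB
Shared_Dirty:          0 kB
Private_Clean:       600 kB
Private_Dirty:      2000 kB
"""

f = parse_rollup(SAMPLE)
print("RSS", f["Rss"], "PSS", f["Pss"], "USS", uss_kb(f), "(kB)")
```

To use it on a live process, pass the contents of /proc/PID/smaps_rollup to parse_rollup instead of SAMPLE.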
The page cache
When you read a file, the kernel caches the pages in the page cache. Subsequent reads of the same file hit cache and never touch disk. Writes go to the page cache and are flushed to disk asynchronously (writeback).
The page cache uses any RAM not otherwise allocated. free -m shows it under "buff/cache." A box with 64GB RAM and 8GB used by processes will typically show 50GB of page cache, nearly all of it quickly reclaimable: clean pages can be dropped at once, dirty pages must be written back first.
Postgres and most databases rely on the kernel page cache as an L2 buffer (their internal buffer pool is L1). Databases that bypass the page cache (MySQL with O_DIRECT, for example) take full responsibility for caching themselves.
Swap and the OOM killer
When free physical memory gets low, the kernel reclaims pages. The reclaim algorithm (LRU-based) considers:
- Clean file-backed pages: drop them, they can be reread from disk.
- Dirty file-backed pages: write back, then drop.
- Anonymous pages (heap, stack): no backing file, so swap them to swap space.
If swap is off or full, the kernel cannot reclaim anonymous pages. When it runs out of options, the OOM killer fires. On current kernels it scores each process mainly by its memory footprint (RSS plus swap and page-table usage), adjusted by oom_score_adj, then SIGKILLs the highest scorer.
/proc/PID/oom_score_adj is a -1000 to 1000 knob. -1000 means "never kill me," 1000 means "kill me first." Set critical processes to -500. Set easily-restartable workers higher.
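A minimal sketch of reading and setting the knob from Python, assuming a Linux /proc filesystem. The helper names are made up for this example. Raising a process's own value is normally allowed, while lowering it generally requires privileges, so the demo at the bottom only reads.

```python
import os


def get_oom_score_adj(pid):
    with open(f"/proc/{pid}/oom_score_adj") as f:
        return int(f.read())


def set_oom_score_adj(pid, value):
    """Write value to /proc/PID/oom_score_adj after range-checking it."""
    if not -1000 <= value <= 1000:
        raise ValueError("oom_score_adj must be between -1000 and 1000")
    with open(f"/proc/{pid}/oom_score_adj", "w") as f:
        f.write(str(value))


if __name__ == "__main__":
    try:
        print("current oom_score_adj:", get_oom_score_adj(os.getpid()))
    except OSError as exc:
        print("could not read oom_score_adj:", exc)
```

A worker pool could call set_oom_score_adj(os.getpid(), 500) at startup so that workers are sacrificed before the supervisor.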
vm.swappiness (0 to 100, extended to 0 to 200 in kernel 5.8) controls the tradeoff between swapping anonymous pages and reclaiming page cache. Default 60. Servers often set it to 10 (prefer keeping anon pages in RAM) or 1 (almost never swap).
NUMA
On multi-socket servers, each CPU socket has its own memory controller and "local" memory. Accessing memory on another socket is slower (200ns vs 80ns, roughly). This is non-uniform memory access (NUMA).
Linux tries to keep tasks and their memory on the same NUMA node. numactl --hardware shows topology. numastat -p PID shows per-process node distribution. For latency-critical work, pin tasks to a node and allocate memory there: numactl --membind=0 --cpunodebind=0 ./prog.
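When numactl is not an option, the CPU half of that pinning can be done from inside the process. The sketch below is Linux-only and reads a node's CPU list from /sys/devices/system/node/nodeN/cpulist, then restricts the calling process to those CPUs. It does not set a memory policy; it relies on the kernel's default of allocating on the node where the task runs, which is weaker than --membind. The function names are illustrative.

```python
import os


def parse_cpulist(text):
    """Turn a kernel CPU list such as '0-3,8-9' into a set of CPU numbers."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return cpus


def pin_to_node(node):
    """Restrict the calling process to the CPUs of one NUMA node."""
    with open(f"/sys/devices/system/node/node{node}/cpulist") as f:
        cpus = parse_cpulist(f.read())
    os.sched_setaffinity(0, cpus)    # pid 0 means the calling process
    return cpus
```

Calling pin_to_node(0) early, before the big allocations are touched, gives first-touch placement the best chance of landing memory on node 0.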
What /proc/meminfo really tells you
MemTotal: physical RAM
MemFree: truly free, not in page cache
MemAvailable: what an app could allocate without swapping (includes reclaimable cache)
Buffers: block device cache (small)
Cached: page cache
SwapCached: pages in swap that are also in RAM
Active/Inactive: LRU lists
AnonPages: anonymous (heap/stack) pages in RAM
Mapped: file-backed pages mapped into a process
Shmem: shared memory (tmpfs, shm_open)
KReclaimable: kernel-side reclaimable (dentry cache, inode cache, slabs)
Slab: kernel data structures
The number you usually want is MemAvailable. It is what the kernel estimates is available without causing swap.
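For monitoring scripts, /proc/meminfo is easy to parse directly. A minimal sketch, assuming the usual "Name: value kB" layout (a few fields, such as the HugePages counters, carry no unit); the function name is made up for this example.

```python
import os


def parse_meminfo(text):
    """Return {field: value} from /proc/meminfo text; most values are in kB."""
    info = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            info[name.strip()] = int(parts[0])
    return info


if __name__ == "__main__" and os.path.exists("/proc/meminfo"):
    with open("/proc/meminfo") as f:
        info = parse_meminfo(f.read())
    print("MemAvailable:", info.get("MemAvailable", 0) // 1024, "MB")
```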
Common interview pitfalls
Mental model
Think of virtual memory as a library. Every process gets its own catalog (page tables). The catalog maps "row 47, shelf 3" (virtual address) to a real warehouse location (physical frame). The catalog can list books that don't exist yet (unallocated pages) and books that currently sit in off-site storage (swapped out); the same physical book can also appear in several processes' catalogs (shared pages). The MMU is the librarian; the TLB is the librarian's memory of where they just walked.
Learn more
- Paper: What every programmer should know about memory, Ulrich Drepper
- Docs: Linux memory management documentation, kernel.org
- Docs: Brendan Gregg: Linux Memory, Brendan Gregg