Filesystems and inodes
VFS, page cache, dentry cache, journals, COW filesystems, fsync semantics, and why naive disk benchmarks lie.
The VFS layer
Linux has many filesystems but one common API. The Virtual File System (VFS) defines abstract operations (read, write, open, mmap, stat) and each filesystem implements them. Userspace calls go through VFS, which dispatches to the actual fs.
VFS objects:
- superblock: describes a mounted filesystem (block size, free space, root inode).
- inode: in-memory representation of a file's metadata.
- dentry: directory entry, cached name-to-inode mapping.
- file: open file with offset, mode, flags.
When you call open("/etc/passwd"), VFS walks /, etc, passwd, consulting the dentry cache for cached lookups and falling through to the filesystem driver on miss. The result is a file descriptor pointing to an in-memory file object.
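The component-by-component walk can be mimicked from userspace. This is only an illustration: the kernel does the walk inside VFS using the dentry cache, whereas the sketch below simply calls os.lstat on each path prefix and records the inode number it finds. walk_path is a hypothetical helper name, not a kernel or library API.

```python
import os

def walk_path(path):
    """Return (component, inode number) pairs for each step of an absolute path."""
    current = os.sep
    steps = [(current, os.lstat(current).st_ino)]  # start at the root directory
    for name in path.strip(os.sep).split(os.sep):
        if not name:
            continue
        current = os.path.join(current, name)
        # Each lstat resolves one more name inside the directory found so far.
        steps.append((name, os.lstat(current).st_ino))
    return steps
```

Calling walk_path("/etc/passwd") on a Linux machine should yield three pairs: one for /, one for etc, and one for passwd.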
The dentry cache and inode cache
Both are critical for performance. Without them, every path lookup would hit the disk.
- Dentry cache: maps "in directory inode X, name Y -> inode Z." Walking a path is N cache lookups.
- Inode cache: keeps inode metadata in RAM. stat() is essentially free if hot.
Both are reclaimable under memory pressure. The kernel reports them under Slab / SReclaimable in /proc/meminfo. A "find /" on a fresh boot is slow; the second run is fast because the dentries and inodes are already cached.
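To see how much slab memory is reclaimable, the two /proc/meminfo fields can be parsed directly. A minimal sketch, assuming the usual "Name: value kB" layout; parse_meminfo and reclaimable_slab_fraction are illustrative names.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field name: integer value}."""
    fields = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])  # most values are in kB
    return fields

def reclaimable_slab_fraction(fields):
    """Share of slab memory reported as reclaimable, from 0.0 to 1.0."""
    slab = fields.get("Slab", 0)
    return fields.get("SReclaimable", 0) / slab if slab else 0.0

# On a Linux host:
#   fields = parse_meminfo(open("/proc/meminfo").read())
#   print(reclaimable_slab_fraction(fields))
```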
The page cache, again
For file data (not metadata), the page cache holds 4KB pages of file contents indexed by (inode, offset). When you read, the kernel checks the page cache; on a miss, it reads from disk into the cache, then copies to your buffer. On write (without O_DIRECT), data lands in the page cache and is flushed asynchronously by the kernel's per-device writeback (flusher) threads. Older kernels used pdflush for this; it was removed in 2.6.32.
free -h shows page cache under "buff/cache." On a healthy server most "used" RAM is page cache. This regularly confuses operators, who panic at "98% memory used" when 70% of that is reclaimable cache. The "available" column is the number to watch.
echo 1 > /proc/sys/vm/drop_caches (debug only, don't do this in prod) clears the page cache; echo 2 drops the dentry and inode caches, and echo 3 drops both. Only clean pages are dropped, so run sync first if recently written data should be included.
fsync, fdatasync, and the durability story
write() returns when data is in the page cache. The kernel will flush eventually, typically within 30 seconds (vm.dirty_expire_centisecs). Power loss before flush = data loss.
To force durability:
- fsync(fd): flush data AND metadata. Blocks until the disk acknowledges.
- fdatasync(fd): flush data and only the metadata required to read the data back (a changed file size is synced; mtime is not).
- sync_file_range: more granular control, doesn't sync metadata.
- sync(): flush everything. Don't use in apps; use fsync per file.
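In application code, the durability step usually looks like the sketch below: write, fsync the file, then fsync the containing directory so that a newly created name is durable too. durable_write is a hypothetical helper, and the directory fsync assumes Linux-style behavior where a directory can be opened read-only and synced.

```python
import os

def durable_write(path, data):
    """Write bytes to path; return only after file and directory are flushed."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        view = memoryview(data)
        while view:                      # os.write may write less than asked
            written = os.write(fd, view)
            view = view[written:]
        os.fsync(fd)                     # data + metadata of the file itself
    finally:
        os.close(fd)
    # A newly created file also needs its directory entry made durable.
    dir_fd = os.open(os.path.dirname(os.path.abspath(path)), os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

Swapping os.fsync(fd) for os.fdatasync(fd) on the file is the cheaper variant when timestamps do not matter.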
fsync on modern NVMe: ~50us. On a slow SSD: 1-10ms. On a write-cached HDD: highly variable. On a network filesystem: depends on the server.
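Postgres flushes the WAL to disk on every commit by default (with fsync or fdatasync, depending on wal_sync_method). MySQL's InnoDB does the same for its redo log by default. SQLite has multiple synchronous modes (FULL, NORMAL, OFF) that trade durability for speed.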
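For SQLite, the mode is selected with PRAGMA synchronous. A minimal sketch using Python's built-in sqlite3 module; open_with_sync and current_sync are illustrative helper names, and the numeric values (0, 1, 2) are SQLite's codes for OFF, NORMAL, and FULL.

```python
import sqlite3

SYNC_LEVELS = {"OFF": 0, "NORMAL": 1, "FULL": 2}

def open_with_sync(path, level):
    """Open a database and set its synchronous mode for this connection."""
    conn = sqlite3.connect(path)
    conn.execute(f"PRAGMA synchronous = {SYNC_LEVELS[level]}")
    return conn

def current_sync(conn):
    """Return the numeric synchronous level of the connection."""
    return conn.execute("PRAGMA synchronous").fetchone()[0]
```

The setting is per connection, so every connection that writes needs to apply it.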
Journaling and crash consistency
Naive filesystems risk corruption on crash: imagine the inode was updated to claim 10 new blocks but the block bitmap wasn't yet updated. After reboot, you have either dangling blocks or aliased blocks (two files thinking they own the same data).
Journaled filesystems (ext3, ext4, xfs, ntfs) write changes to a journal first, then to their final location. After crash, replay the journal. ext4 has three modes:
- journal: data and metadata both journaled. Safest, slowest.
- ordered (default): metadata journaled, data written before metadata commit. Crash leaves either old or new data, not garbage.
- writeback: metadata journaled, data flushed independently. Fastest, crash can leave stale data in newly extended files.
COW filesystems (btrfs, zfs, apfs) take a different approach: every write goes to a new location, then the tree is atomically updated to point at the new version. Crash means you have either the old or the new tree, never a half-update.
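The journal-then-apply idea can be shown with a toy key-value store. This is a teaching sketch, not how ext4 or XFS lay out their journals: the class name, file names, and JSON record format are all invented for illustration, and the final data file is swapped in with an atomic rename to keep the example short.

```python
import json
import os

class ToyJournaledStore:
    """Key-value 'disk' with a write-ahead journal. Illustrative only."""

    def __init__(self, directory):
        self.data_path = os.path.join(directory, "data.json")
        self.journal_path = os.path.join(directory, "journal.log")
        self.checkpoint()  # crash recovery: replay whatever the journal holds

    def read(self):
        if not os.path.exists(self.data_path):
            return {}
        with open(self.data_path) as f:
            return json.load(f)

    def log(self, updates):
        """Step 1: append the intended change to the journal and flush it."""
        with open(self.journal_path, "a") as f:
            f.write(json.dumps(updates) + "\n")
            f.flush()
            os.fsync(f.fileno())

    def checkpoint(self):
        """Step 2: apply journaled changes to the final location, then clear."""
        if not os.path.exists(self.journal_path):
            return
        state = self.read()
        with open(self.journal_path) as f:
            for line in f:
                try:
                    state.update(json.loads(line))  # replay is idempotent
                except json.JSONDecodeError:
                    break  # torn record from a crash mid-append: ignore it
        tmp = self.data_path + ".tmp"
        with open(tmp, "w") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, self.data_path)
        os.remove(self.journal_path)
```

A "crash" between log() and checkpoint() loses nothing: constructing the store again replays the journal, and a half-written trailing record is discarded.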
ext4 internals, quick
ext4 divides the disk into block groups (~128MB each), each with its own inode table, block bitmap, and inode bitmap. Locality: a file's data lives near its inode, child directories near parent.
Block addressing uses extents (since ext4): instead of pointing to individual blocks, an extent says "starting at logical block X, N contiguous physical blocks." Much more compact for large files than the old direct/indirect pointer scheme of ext3.
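An ext4 inode is 256 bytes by default. Its 60-byte i_block area holds an extent header plus up to four extents inline, and a single extent can map up to 128MB of contiguous blocks (with 4KB blocks), so most files need no separate extent-tree blocks. The default block size is 4KB, matching the page size.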
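The extent mapping described above amounts to a range lookup. A simplified sketch: the Extent fields below mirror the description in the text rather than ext4's on-disk struct, and logical_to_physical is a hypothetical helper.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Extent:
    logical_start: int   # first logical block of the file covered
    physical_start: int  # first physical block on disk
    length: int          # number of contiguous blocks

def logical_to_physical(extents, logical_block):
    """Map a file's logical block number to a physical block, or None for a hole."""
    for ext in extents:
        offset = logical_block - ext.logical_start
        if 0 <= offset < ext.length:
            return ext.physical_start + offset
    return None
```

Two extents such as Extent(0, 1000, 8) and Extent(8, 5000, 4) describe a 12-block file in two records, where the old pointer scheme would need 12 block pointers.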
XFS, btrfs, ZFS quick contrasts
XFS: very mature, parallel allocation (independent allocation groups), delayed allocation, inode size fixed at format time. Default on RHEL 7+. Excellent for large files and parallel workloads.
btrfs: COW, snapshots, online resize, RAID 0/1/10, checksums, compression. Less mature than ext4/xfs; RAID 5/6 has had data loss bugs. Used by Synology and Facebook (in some places), and the default on openSUSE.
ZFS: COW, snapshots, end-to-end checksums, native RAID-Z, deduplication, compression, encryption. Not in the mainline kernel due to the CDDL/GPL license conflict; the OpenZFS module is the standard install. Heavy memory user (a common rule of thumb is 1GB RAM per TB of storage, and considerably more with dedup enabled).
The "open but deleted" case
When you unlink("foo"), the directory entry is removed. If the link count drops to 0 but some process still has an fd open, the inode is marked for deletion but its data is kept. Only when the last fd closes does the inode actually get freed.
This is why rm on a live log file doesn't free disk space: the daemon still holds the fd. To reclaim the space, truncate the file in place instead:
> /var/log/big.log # truncate via shell redirect, frees space NOW
or send SIGHUP to the holding daemon to reopen its log file (most daemons handle this).
lsof | grep deleted is the diagnostic.
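The behavior is easy to reproduce. A small sketch, assuming a POSIX system: it unlinks a temporary file while keeping its fd open, then checks that the name is gone, the link count is 0, and the data is still readable. unlink_while_open is an illustrative name.

```python
import os
import tempfile

def unlink_while_open():
    """Return (name still exists, link count, data read back) after unlink."""
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, b"still here")
        os.unlink(path)                        # remove the only directory entry
        link_count = os.fstat(fd).st_nlink     # inode lives on with 0 links
        os.lseek(fd, 0, os.SEEK_SET)
        data = os.read(fd, 64)
    finally:
        os.close(fd)                           # last reference: inode is freed
    return os.path.exists(path), link_count, data
```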
inotify and fs events
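inotify (Linux) lets you watch directories and files for changes. inotify_init() returns one fd per inotify instance; each inotify_add_watch() call attaches a watch to that instance and returns a watch descriptor. You poll or read the instance fd to receive events. Common uses: file watchers, build tools, IDE indexing.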
Per-user limits matter: /proc/sys/fs/inotify/max_user_watches. Default 8192 on many distros, way too low for VS Code or webpack on a big repo. Raise to 524288.
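Python has no inotify wrapper in the standard library, so the sketch below calls the libc functions through ctypes. It is Linux-only and deliberately minimal: it adds one watch to one inotify instance, runs a caller-supplied trigger, and decodes the events from a single read. watch_for_create is a hypothetical helper; the IN_CREATE value and the event header layout (wd, mask, cookie, len) come from sys/inotify.h.

```python
import ctypes
import os
import struct

IN_CREATE = 0x00000100                  # from <sys/inotify.h>
EVENT_HEADER = struct.Struct("iIII")    # wd, mask, cookie, name length

def watch_for_create(directory, trigger):
    """Watch directory, call trigger(), return names of files created."""
    libc = ctypes.CDLL(None, use_errno=True)
    fd = libc.inotify_init1(0)          # one fd for the whole instance
    if fd < 0:
        raise OSError(ctypes.get_errno(), "inotify_init1 failed")
    try:
        wd = libc.inotify_add_watch(fd, os.fsencode(directory), IN_CREATE)
        if wd < 0:
            raise OSError(ctypes.get_errno(), "inotify_add_watch failed")
        trigger()
        buf = os.read(fd, 4096)         # blocks until at least one event
    finally:
        os.close(fd)                    # closing the fd drops all its watches
    names, offset = [], 0
    while offset < len(buf):
        _wd, mask, _cookie, name_len = EVENT_HEADER.unpack_from(buf, offset)
        offset += EVENT_HEADER.size
        name = buf[offset:offset + name_len].rstrip(b"\0")
        offset += name_len
        if mask & IN_CREATE:
            names.append(os.fsdecode(name))
    return names
```

Every inotify_add_watch call counts against max_user_watches, which is why recursive watchers on large trees exhaust the default so quickly.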
The macOS equivalents are kqueue's EVFILT_VNODE and FSEvents; on Windows it is ReadDirectoryChangesW.
Mount options worth knowing
mount -o noatime,nodiratime,...
- noatime: don't update access time on read. Faster, breaks mutt and a few other tools.
- relatime (default on modern Linux): only update atime if it's older than mtime/ctime, or more than a day old. Good compromise.
- nodiratime: same as noatime, but for directories only (noatime already implies it).
- discard: TRIM on SSD on every delete (slower); usually better to run fstrim weekly via timer.
- data=writeback for ext4: faster, less crash-safe.
- barrier=0: disable write barriers. Faster, dangerous; use only with a battery-backed cache or you'll regret it.
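To check which of these options a filesystem is actually mounted with, the option column of /proc/mounts can be parsed. A minimal sketch, assuming the standard whitespace-separated format (device, mount point, type, options, ...); mount_options and atime_policy are illustrative names.

```python
def mount_options(mounts_text, mount_point):
    """Return the option list for mount_point from /proc/mounts-style text."""
    found = None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 4 and fields[1] == mount_point:
            found = fields[3].split(",")  # a later mount shadows earlier ones
    return found

def atime_policy(options):
    """Return 'noatime' or 'relatime' if listed, else None."""
    for name in ("noatime", "relatime"):
        if name in options:
            return name
    return None  # neither flag listed in the options column

# On a Linux host:
#   opts = mount_options(open("/proc/mounts").read(), "/")
```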
What benchmarks lie about
Naive dd if=/dev/zero of=test bs=1M count=1000 numbers are wrong because:
- /dev/zero gives you zeros, which compress and dedup trivially.
- The data lives in page cache; you measured RAM bandwidth, not disk.
- No fsync means writes haven't actually committed.
Use fio for real benchmarks: specify random vs sequential, sync vs async, queue depth, working set size, fsync frequency. SSD numbers vary 100x between "single thread sequential, no fsync" and "16 thread random write with fsync every op."
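The effect of fsync on a measurement can be seen with a few lines of Python. This is a rough sketch, not a substitute for fio: timed_write is a hypothetical helper, it is single-threaded and sequential only, and it reuses one block of random bytes (incompressible, but still trivially deduplicated).

```python
import os
import time

def timed_write(path, total_bytes, fsync="never", block_size=1 << 20):
    """Write total_bytes to path and return elapsed seconds.

    fsync: "never" (page cache only), "end" (one fsync before the clock
    stops), or "each" (fsync after every block).
    """
    if fsync not in ("never", "end", "each"):
        raise ValueError(fsync)
    block = os.urandom(block_size)
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        start = time.perf_counter()
        written = 0
        while written < total_bytes:
            written += os.write(fd, block[: total_bytes - written])
            if fsync == "each":
                os.fsync(fd)
        if fsync == "end":
            os.fsync(fd)
        return time.perf_counter() - start
    finally:
        os.close(fd)
```

Comparing the three modes on the same device gives a feel for how much of a "fast" result came from the page cache; the gap depends heavily on the hardware and filesystem.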
tmpfs and /dev/shm
tmpfs is a filesystem that lives in RAM (with swap backing). /tmp on many distros, and /dev/shm always, are tmpfs. Reads and writes are memory operations, no disk involved.
Useful for: build caches, scratch space, shared memory between processes (mmap a file in /dev/shm), test data.
Watch for: tmpfs counts against memory limits. A cgroup with a 1GB limit that fills /dev/shm with 800MB will OOM-kill processes in that cgroup.
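Sharing memory through a file in /dev/shm can be sketched as follows. To stay self-contained, the example creates two independent mappings of the same file within one process, standing in for two cooperating processes. shared_mapping_roundtrip is an illustrative name, and the fallback to the system temp directory (when /dev/shm is missing or unwritable) is an assumption added for portability.

```python
import mmap
import os
import tempfile

def shared_mapping_roundtrip(message, size=4096):
    """Write message through one mapping and read it back through another."""
    shm_dir = "/dev/shm"
    if not (os.path.isdir(shm_dir) and os.access(shm_dir, os.W_OK)):
        shm_dir = tempfile.gettempdir()   # fallback: may not be RAM-backed
    fd, path = tempfile.mkstemp(dir=shm_dir)
    try:
        os.ftruncate(fd, size)            # a mapping needs a non-empty file
        writer = mmap.mmap(fd, size)      # shared, read-write by default
        reader_fd = os.open(path, os.O_RDONLY)
        try:
            reader = mmap.mmap(reader_fd, size, access=mmap.ACCESS_READ)
            writer[:len(message)] = message
            seen = bytes(reader[:len(message)])
            reader.close()
        finally:
            os.close(reader_fd)
            writer.close()
        return seen
    finally:
        os.close(fd)
        os.unlink(path)
```

The pages backing such a file are charged to the cgroup that touched them, which is the memory-limit effect described above.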
Mental model
A filesystem is a phone book + a warehouse. Directories are pages of the phone book, mapping names to inode numbers (warehouse slot IDs). The inode table is the warehouse register, listing what's in each slot, who owns it, and when it was last touched. The data blocks are the actual stuff in the slots. Deleting a name removes a phone book entry; the inventory stays until the last reference is gone. fsync is "go to the warehouse and make sure my changes are actually in storage, not just on the manager's clipboard."
Learn more
- Linux VFS documentation (kernel.org)
- ext4 howto (kernel.org)
- Filesystem performance tools (Brendan Gregg)