Processes vs threads
Address spaces, clone flags, the GIL, fork-vs-exec, and why goroutines beat both for I/O.
What a process actually is
A process on Linux is a task_struct in the kernel. It owns a virtual address space (a set of page tables mapping virtual addresses to physical frames), a file descriptor table, a signal disposition table, credentials (uid, gid), namespaces, cgroup memberships, and a PID. When you fork(), the kernel creates a new task_struct, copies the page tables (with copy-on-write semantics so physical memory is shared until written), dups the fd table, and returns twice (once in parent, once in child).
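The "returns twice" behaviour and the private address space can be seen in a few lines. This is a minimal sketch, assuming Linux (or another Unix) and CPython; `fork_demo` is an illustrative name, not a standard API.

```python
import os

def fork_demo():
    """Fork once; return (value_in_parent, value_in_child, pids_match)."""
    counter = [0]                       # ordinary heap data, shared copy-on-write
    read_fd, write_fd = os.pipe()       # the child inherits copies of both fds

    pid = os.fork()                     # returns 0 in the child, the child's PID in the parent
    if pid == 0:
        # Child: this write lands on the child's private copy of the page.
        os.close(read_fd)
        counter[0] += 41
        os.write(write_fd, f"{counter[0]},{os.getpid()}".encode())
        os._exit(0)                     # leave without running the parent's cleanup code

    # Parent: its copy of counter is untouched by the child's write.
    os.close(write_fd)
    with os.fdopen(read_fd) as pipe:
        child_value, child_pid = pipe.read().split(",")
    os.waitpid(pid, 0)                  # reap the child
    return counter[0], int(child_value), int(child_pid) == pid

if __name__ == "__main__":
    print(fork_demo())
```

The parent still sees 0 while the child reports 41, which is the address-space isolation the paragraph describes.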
The cost of fork is dominated by copying page tables, not the memory itself. A 4GB process with 4KB pages has roughly 1 million pages, which means about 1 million page table entries to walk and copy. That is where the millisecond-scale cost of forking a large process comes from. During fork the kernel write-protects the shared pages in both processes; the first write from either side traps into the kernel, which allocates a private copy of that page. This is called copy-on-write, or COW.
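The arithmetic behind "roughly 1 million pages" is worth writing out. The sketch below assumes 4 KiB base pages, 8-byte page table entries (the x86_64 layout), and a process whose memory is fully mapped; huge pages or sparse mappings change the result. `page_table_cost` is a hypothetical helper.

```python
PAGE_SIZE = 4 * 1024      # 4 KiB base pages (assumption; huge pages are larger)
PTE_SIZE = 8              # bytes per page table entry on x86_64

def page_table_cost(mapped_bytes):
    """Return (number of leaf PTEs, bytes of leaf page tables) for a mapping."""
    pages = mapped_bytes // PAGE_SIZE
    return pages, pages * PTE_SIZE

if __name__ == "__main__":
    pages, pte_bytes = page_table_cost(4 * 1024**3)    # a 4 GiB process
    print(pages, "entries,", pte_bytes // 2**20, "MiB of leaf page tables")
```

For 4 GiB this gives 1,048,576 entries and 8 MiB of leaf page tables, which is what fork has to walk even though no data pages are copied.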
What a thread actually is on Linux
A thread is a task_struct that shares fields with its siblings. There is no pthread_t kernel structure. When pthread_create runs, glibc calls clone() with the flags CLONE_VM | CLONE_FS | CLONE_FILES | CLONE_SIGHAND | CLONE_THREAD | CLONE_SYSVSEM | CLONE_SETTLS | CLONE_PARENT_SETTID | CLONE_CHILD_CLEARTID. Each flag controls one piece of sharing: VM shares the address space, FILES shares the fd table, SIGHAND shares signal handlers, THREAD says "you're a thread of the same thread group, share my TGID."
The TGID is what getpid() returns; the per-thread ID is what gettid() returns. For the first thread of a process the two are equal. Most monitoring tools collapse threads under their TGID, which is why top shows one row per process by default and top -H shows threads.
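The TGID/TID split is visible from user space. A minimal sketch, assuming Linux and CPython 3.8+, where `threading.get_native_id()` reports the kernel thread ID (the gettid() value); `collect_ids` is an illustrative name.

```python
import os
import threading

def collect_ids(n_threads=3):
    """Return (pid, [(pid_seen_in_thread, tid), ...]) for n concurrent threads."""
    barrier = threading.Barrier(n_threads)   # keep all threads alive at the same time
    lock = threading.Lock()
    seen = []

    def worker():
        barrier.wait()
        with lock:
            seen.append((os.getpid(), threading.get_native_id()))

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return os.getpid(), seen

if __name__ == "__main__":
    pid, seen = collect_ids()
    print("getpid:", pid)
    for thread_pid, tid in seen:
        print("  thread sees pid", thread_pid, "tid", tid)
```

Every thread reports the same PID (the TGID) but a different TID, and none of the worker TIDs equals the PID.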
The GIL problem and why CPython forks
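CPython has a Global Interpreter Lock (GIL): only one thread can execute Python bytecode at a time. Multi-threading in Python therefore helps with I/O (a thread releases the GIL during a blocking syscall) but does almost nothing for CPU-bound pure-Python work; C extensions can release the GIL around long-running native code. To run Python code on multiple cores with a standard build, you need multiprocessing or a pre-fork server like Gunicorn with multiple worker processes. (CPython 3.13 added an experimental free-threaded build without the GIL, but it is not the default.)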
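A sketch of the process-based workaround for CPU-bound work, assuming Linux and the "fork" start method (requested explicitly, since the platform default varies by Python version). The function names and the prime-counting workload are illustrative only.

```python
import math
import multiprocessing as mp
import os

def count_primes(lo, hi):
    """Count primes in [lo, hi) by trial division (deliberately CPU-bound)."""
    count = 0
    for n in range(max(lo, 2), hi):
        if all(n % d for d in range(2, math.isqrt(n) + 1)):
            count += 1
    return count

def _worker(lo, hi, out):
    out.put((os.getpid(), count_primes(lo, hi)))

def parallel_count(limit, n_workers=4):
    """Split [0, limit) across worker processes; return (total, worker PIDs)."""
    ctx = mp.get_context("fork")
    out = ctx.Queue()
    step = limit // n_workers
    bounds = [(i * step, limit if i == n_workers - 1 else (i + 1) * step)
              for i in range(n_workers)]
    procs = [ctx.Process(target=_worker, args=(lo, hi, out)) for lo, hi in bounds]
    for p in procs:
        p.start()
    results = [out.get() for _ in procs]     # drain the queue before joining
    for p in procs:
        p.join()
    return sum(c for _, c in results), {pid for pid, _ in results}

if __name__ == "__main__":
    print(parallel_count(10_000))
```

Each worker has its own interpreter and its own GIL, so the ranges are scanned on separate cores; the same split across threads would run one range at a time on a standard build.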
Ruby's MRI has the GVL with the same characteristic. Node.js runs JavaScript on a single thread by design and typically scales across cores with the cluster module or PM2, both of which fork worker processes. Java, Go, Rust, C++, and .NET all have true threading and do not need this dance.
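The cost: each forked Python worker has its own interpreter state and its own view of your imported modules. Pre-forking in the style of Postgres helps because COW means most of that memory stays shared until written. uWSGI loads the application before forking by default, and Gunicorn does so when run with --preload, so the imported code starts out shared across workers. In CPython the sharing erodes over time: reference counts live in object headers, so even reading an object writes to its page and triggers a private copy.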
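The pre-fork pattern itself is small. This is a hedged sketch, not how Gunicorn or uWSGI are implemented: the 1 MiB `table` stands in for imported modules and configuration, and `prefork` is a hypothetical name.

```python
import os

def prefork(n_workers=3):
    """Load data once, fork workers that read it, return each worker's checksum."""
    # "Import" phase: done once in the parent, before any fork.
    table = bytes(range(256)) * 4096          # 1 MiB stand-in for modules and config

    workers = []
    for _ in range(n_workers):
        read_fd, write_fd = os.pipe()
        pid = os.fork()
        if pid == 0:
            # Worker: reads the inherited table without reloading it. Only pages
            # that get written (for example the one holding the object's
            # reference count) are copied; the rest stay shared with the parent.
            os.close(read_fd)
            os.write(write_fd, str(sum(table)).encode())
            os._exit(0)
        os.close(write_fd)                    # parent keeps only the read end
        workers.append((pid, read_fd))

    checksums = []
    for pid, read_fd in workers:
        with os.fdopen(read_fd) as pipe:
            checksums.append(int(pipe.read()))
        os.waitpid(pid, 0)
    return checksums

if __name__ == "__main__":
    print(prefork())
```

Every worker computes the same checksum from the same physical pages, which is the memory saving that forking after import is meant to buy.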
fork after threads is a footgun
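If you create threads and then call fork from one of them, the child process gets exactly one thread (the one that called fork), but every mutex that a sibling thread held at that moment is still locked in the child, and the thread that would have unlocked it no longer exists. The classic example is an allocator lock: if a sibling thread was in the middle of malloc when fork happened, the child's heap lock is held forever and the next malloc deadlocks. glibc guards its own malloc locks around fork to avoid this particular case, but the same hazard applies to any other lock in your program or its libraries (logging, connection pools, a custom allocator).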
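The "child gets exactly one thread" part is easy to observe. A minimal sketch, assuming Linux and CPython; Python 3.12+ emits a DeprecationWarning when forking a multi-threaded process, which is silenced here on purpose. The child calls more than async-signal-safe functions, so treat this as a demonstration rather than a pattern to copy.

```python
import os
import threading
import warnings

def threads_before_and_after_fork():
    """Return (thread count in parent, thread count seen by the forked child)."""
    stop = threading.Event()
    sibling = threading.Thread(target=stop.wait)   # a thread that just waits
    sibling.start()
    parent_count = threading.active_count()

    read_fd, write_fd = os.pipe()
    with warnings.catch_warnings():
        warnings.simplefilter("ignore", DeprecationWarning)
        pid = os.fork()
    if pid == 0:
        # Child: only the forking thread was duplicated; the sibling is gone.
        os.close(read_fd)
        os.write(write_fd, str(threading.active_count()).encode())
        os._exit(0)

    os.close(write_fd)
    with os.fdopen(read_fd) as pipe:
        child_count = int(pipe.read())
    os.waitpid(pid, 0)
    stop.set()
    sibling.join()
    return parent_count, child_count

if __name__ == "__main__":
    print(threads_before_and_after_fork())
```

Had the sibling been holding a lock at the moment of fork, the child would have inherited the locked state with no thread left to release it.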
The official POSIX rule is that after fork in a multi-threaded program, the child may call only async-signal-safe functions until it calls exec. In practice, this means "fork and immediately exec, do not do anything else." This is one reason many runtimes and libraries prefer posix_spawn to a hand-rolled fork+exec for launching child processes.
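With posix_spawn there is no window in which your own code runs in the child. A minimal sketch using `os.posix_spawn` from the standard library (CPython 3.8+; `os.waitstatus_to_exitcode` needs 3.9+); `spawn_and_wait` is an illustrative wrapper.

```python
import os
import sys

def spawn_and_wait(argv):
    """Start argv[0] with posix_spawn, wait for it, and return its exit code."""
    pid = os.posix_spawn(argv[0], argv, os.environ)   # path, argv, environment
    _, status = os.waitpid(pid, 0)
    return os.waitstatus_to_exitcode(status)

if __name__ == "__main__":
    code = spawn_and_wait([sys.executable, "-c", "raise SystemExit(7)"])
    print("child exited with", code)
```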
Stacks, TLS, and what is private
Each thread gets its own stack (8MB by default on most Linux systems, because glibc takes the default from the stack rlimit; tune it with pthread_attr_setstacksize). That 8MB is a virtual address space reservation, not physical memory. Physical pages are allocated lazily as the stack grows.
Each thread has thread-local storage (TLS). In C/C++ this is __thread or thread_local. errno is implemented as a TLS variable, which is why errno is thread-safe despite looking like a global. On x86_64 the TLS block is addressed through the FS segment base, which is set with arch_prctl(ARCH_SET_FS) or by the CLONE_SETTLS flag at thread creation; 32-bit x86 uses set_thread_area and the GS register instead.
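The same idea exists one level up in most languages. A sketch using Python's `threading.local`, which is a library-level analogue of TLS and not the FS-based mechanism described above; `tls_demo` is an illustrative name.

```python
import threading

def tls_demo(n_threads=4):
    """Each thread writes the same attribute name; return what each one reads back."""
    tls = threading.local()
    barrier = threading.Barrier(n_threads)
    seen = {}

    def worker(i):
        tls.value = i            # same name in every thread, separate storage
        barrier.wait()           # every thread has written before anyone reads
        seen[i] = tls.value      # still this thread's own value

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # The calling thread never assigned tls.value, so it has no such attribute.
    return seen, hasattr(tls, "value")

if __name__ == "__main__":
    print(tls_demo())
```

Each thread reads back its own value even though all of them assigned to the same name, which is the property that makes errno safe to use from many threads.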
Signals are weird in threaded programs. Each thread has its own signal mask, but signal handlers are process-wide. If a signal is sent to the process as a whole, the kernel delivers it to an arbitrary thread that does not have it masked. Best practice: mask all signals in worker threads and dedicate one thread to handling signals via sigwait.
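The mask-everywhere, sigwait-in-one-thread pattern can be sketched as follows. It assumes Linux, CPython, and that the function is called from the main thread before any other threads exist, so that every thread inherits the blocked mask; SIGUSR1 is an arbitrary choice for the demonstration.

```python
import os
import signal
import threading

def handle_signals_in_one_thread():
    """Block SIGUSR1 everywhere, receive it in a dedicated thread via sigwait."""
    watched = {signal.SIGUSR1}
    # Block first: threads created afterwards inherit the creator's signal mask.
    old_mask = signal.pthread_sigmask(signal.SIG_BLOCK, watched)
    received = []

    def signal_thread():
        # Sleeps until one of the watched signals is pending, then consumes it.
        received.append(signal.sigwait(watched))

    waiter = threading.Thread(target=signal_thread, daemon=True)
    waiter.start()
    os.kill(os.getpid(), signal.SIGUSR1)     # process-directed signal
    waiter.join(timeout=5)
    signal.pthread_sigmask(signal.SIG_SETMASK, old_mask)
    return received

if __name__ == "__main__":
    print(handle_signals_in_one_thread())
```

Because every thread has the signal blocked, it stays pending on the process until the dedicated thread picks it up synchronously, and no asynchronous handler ever runs.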
Cost numbers that matter
| Operation | Approximate cost |
|---|---|
| fork (small process) | 200us |
| fork (4GB process) | 5ms |
| clone (thread) | 10us |
| posix_spawn (fork+exec) | 1ms |
| context switch (same core) | 1us |
| context switch (cross core) | 5us |
| goroutine spawn | 1us |
| async/await task | 100ns |
These numbers are from x86_64 Linux on modern hardware, mid-2020s. Numbers move a bit with kernel version but the orders of magnitude hold.
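To get a feel for the gap on your own machine, a rough micro-benchmark is enough. This sketch measures create-plus-wait from Python, so interpreter overhead inflates both figures well above the raw syscall costs in the table; only the ratio is meaningful. All names are illustrative.

```python
import os
import threading
import time

def time_per_op(op, n=200):
    """Average wall-clock seconds per call of op over n calls."""
    start = time.perf_counter()
    for _ in range(n):
        op()
    return (time.perf_counter() - start) / n

def spawn_thread():
    t = threading.Thread(target=lambda: None)
    t.start()
    t.join()

def spawn_process():
    pid = os.fork()
    if pid == 0:
        os._exit(0)              # child exits immediately
    os.waitpid(pid, 0)

if __name__ == "__main__":
    print(f"thread create+join: {time_per_op(spawn_thread) * 1e6:.0f} us")
    print(f"fork+wait:          {time_per_op(spawn_process) * 1e6:.0f} us")
```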
Goroutines and the M:N model
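Go's runtime multiplexes goroutines (G) onto OS threads (M) using logical processors (P). GOMAXPROCS, the number of Ps, defaults to the number of logical CPUs. A goroutine starts with a 2KB stack that grows as needed (up to a configurable maximum). When a goroutine blocks in a syscall, its M blocks with it; the runtime detaches the P and hands it to another M (an idle one, or a freshly created one) so the other goroutines on that P keep running. Network I/O avoids even that: it goes through the runtime's netpoller, so a goroutine waiting on a socket does not tie up an M.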
This is the M:N scheduler. Erlang does the same with processes. Java added virtual threads in JDK 21 with the same idea. The win is that you get the programming model of one-thread-per-request without the OS cost, because the runtime is doing the scheduling in userspace.
Async/await in JavaScript, Python, and Rust is a different shape: cooperative scheduling with explicit yield points. The functions are state machines, suspended on await, resumed when the future is ready. Cost per task is even lower than goroutines because there is no stack at all, just a heap-allocated state machine.
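The cooperative model is easy to see with Python's asyncio. A minimal sketch: many tasks, each suspended at an await, all driven by one OS thread. The sleep stands in for network I/O, and `handle` and `run` are illustrative names.

```python
import asyncio
import threading

async def handle(i, thread_ids):
    await asyncio.sleep(0.01)                      # explicit yield point, stands in for I/O
    thread_ids.add(threading.get_native_id())      # record which OS thread resumed us
    return i * 2

async def _main(n):
    thread_ids = set()
    results = await asyncio.gather(*(handle(i, thread_ids) for i in range(n)))
    return sum(results), len(thread_ids)

def run(n=10_000):
    """Run n concurrent tasks; return (sum of results, number of OS threads used)."""
    return asyncio.run(_main(n))

if __name__ == "__main__":
    print(run())
```

All the tasks are in flight at once, yet the second element of the result is 1: the event loop resumes each state machine on the same thread.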
When to use which, the actual decision
- Crash isolation needed (browser tabs, payment workers, Postgres backends): processes.
- CPU parallelism in a language with no GIL (Java, Go, Rust, C++): threads.
- CPU parallelism in a language with a GIL (Python, Ruby): processes.
- I/O concurrency at high scale (10k+ concurrent connections): async or goroutines, not threads.
- Sharing large in-memory caches between workers: processes with shared memory (mmap) or threads.
- You're not sure: threads if the language is memory-safe, processes if it isn't.
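For the shared-cache case, an anonymous shared mapping created before fork is the simplest form of "processes with shared memory". A sketch assuming Linux and CPython, where `mmap.mmap(-1, size)` creates an anonymous MAP_SHARED mapping by default; `shared_mmap_demo` is an illustrative name, and each worker writes a distinct byte so no locking is needed here.

```python
import mmap
import os

def shared_mmap_demo(n_workers=4):
    """Fork workers that each write one byte into a shared mapping."""
    buf = mmap.mmap(-1, mmap.PAGESIZE)     # anonymous and shared across fork
    pids = []
    for i in range(n_workers):
        pid = os.fork()
        if pid == 0:
            buf[i] = i + 1                 # visible to the parent: the page is shared, not COW
            os._exit(0)
        pids.append(pid)
    for pid in pids:
        os.waitpid(pid, 0)
    result = bytes(buf[:n_workers])
    buf.close()
    return result

if __name__ == "__main__":
    print(list(shared_mmap_demo()))
```

Unlike ordinary heap data, which becomes private on first write after fork, writes to this mapping are seen by every process that inherited it. Real caches with concurrent writers to the same location need synchronization on top.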
Mental model for interviews
Think of a process as an apartment: walls, locked door, own kitchen. Threads are roommates in the same apartment: they share the kitchen (heap), each has their own bedroom (stack), and they can stab each other (data races) if they are not careful. Async tasks are calendar events on one person's day: only one runs at a time, but switching between them is just turning the page.