
Python GIL

Why CPython has a GIL, how it interacts with reference counting and C extensions, how it switches threads, and what PEP 703's free-threaded build changes.

Why CPython has a GIL

The GIL is a single mutex that must be held by any thread executing Python bytecode. It exists primarily because CPython uses reference counting for memory management, and reference counts must be updated atomically. Without a GIL, every Py_INCREF and Py_DECREF would need to be an atomic operation (or protected by a lock), and reference counting happens on nearly every operation: binding a name, passing an argument, and storing an item in a container each touch a refcount.
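To get a feel for how often refcounts move, the sketch below counts how many references a few ordinary operations add to one object. The helper name refcount_delta is made up for this illustration, and sys.getrefcount reports a value that includes a temporary reference of its own, so only the difference between two readings is meaningful.

```python
import sys


def refcount_delta(extra_refs: int = 3) -> int:
    """Return how much an object's refcount grows after adding references."""
    obj = object()                  # a fresh object that nothing else points to
    before = sys.getrefcount(obj)   # absolute value includes the call's own temporary ref
    holders = [obj] * extra_refs    # each list slot holds one more reference
    after = sys.getrefcount(obj)
    del holders                     # dropping the list releases those references again
    return after - before


print(refcount_delta(3))
```

Every one of those increments and decrements is a plain memory write under the GIL. Without it, each would have to be safe against another thread touching the same counter at the same moment.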

Guido van Rossum's long-standing position, paraphrased: he would welcome a patch that removes the GIL, provided it doesn't slow down single-threaded code. For most of CPython's history, every serious attempt failed that test, often by a wide margin. The CPython developers consistently chose single-thread speed over multi-core scaling, because the vast majority of Python code runs in a single thread, and programs that need more cores can use multiple processes.

The GIL is a tradeoff. It buys:

Fast reference counting (no atomic instructions in the hot path).
Simple C extension API (extensions assume single-threaded execution by default).
Simple internal data structures (no lock granularity decisions inside dict, list, etc.).

It costs:

No multi-core scaling for CPU-bound Python code.
Thread switches are coarser than they could be.

How the GIL actually works

A thread that wants to execute Python bytecode must hold the GIL. When the interpreter calls PyEval_AcquireThread, it blocks until the GIL is available, then takes it. When the thread is going to do something that doesn't need Python state (a blocking syscall, a CPU-heavy NumPy operation), it calls PyEval_SaveThread to release the GIL, then PyEval_RestoreThread when it wants to re-enter Python.

Inside the interpreter loop, the GIL is also released periodically to give other threads a chance. Before Python 3.2, this was count-based: the running thread offered up the GIL every 100 "ticks" (roughly, bytecode instructions). Since 3.2, it's time-based: a thread waiting for the GIL times out after 5ms (the "switch interval") and sets a flag that makes the running thread release the GIL at its next safe point in the eval loop. The flag-and-release mechanism costs less than maintaining a counter on every instruction, and when no thread is waiting, the running thread is never interrupted.

import sys
sys.getswitchinterval()   # 0.005 (5ms)
sys.setswitchinterval(0.1)  # 100ms
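The switch interval is process-wide, so an experiment that changes it should put it back afterwards. A minimal sketch of one way to do that follows. The context manager switch_interval is a hypothetical helper, not part of the standard library; it uses only sys.getswitchinterval and sys.setswitchinterval.

```python
import sys
from contextlib import contextmanager


@contextmanager
def switch_interval(seconds: float):
    """Temporarily change the GIL switch interval, restoring it on exit."""
    previous = sys.getswitchinterval()
    sys.setswitchinterval(seconds)
    try:
        yield previous
    finally:
        # Restore the old value even if the body raised.
        sys.setswitchinterval(previous)


with switch_interval(0.001) as old:
    # Inside the block, a waiting thread requests the GIL after about 1 ms.
    current = sys.getswitchinterval()

print(old, current, sys.getswitchinterval())
```

A shorter interval makes threads hand off more often, which improves responsiveness at the price of more switching overhead. A longer one does the opposite.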

There's a fairness mechanism. A thread that gives up the GIL because another thread asked for it doesn't immediately try to grab it again - it waits until a different thread has actually acquired it. This was added in 3.2 because before then, on multi-core systems, the releasing thread would often re-acquire the GIL before any other thread woke up, starving the waiters and wasting time on signaling. David Beazley's well-known 2010 talk demonstrated this with benchmarks: 2 threads on 2 cores were slower than 1 thread on 1 core because of the overhead of fighting over the GIL.

Figure: GIL handoff with the post-3.2 fairness mechanism

I/O-bound vs CPU-bound: a tale of two workloads

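The GIL is released around nearly every blocking call in CPython's standard library. File open, read, and write, socket.recv, and time.sleep all release the GIL while waiting. Third-party libraries such as requests inherit this, because requests.get ends up waiting on standard-library sockets. Other Python threads run during the wait.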

This is why threading is great for I/O-bound work. A web scraper with 10 threads fetching 10 URLs really does have 10 requests in flight at once. Each thread holds the GIL only while running Python code, such as parsing the response, not while waiting for the network. When latency dominates, the speedup over fetching sequentially approaches 10x.
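The scraper scenario can be sketched without touching the network. In the example below, fake_fetch is a stand-in that sleeps instead of making a request (time.sleep releases the GIL the same way a socket wait does). The URLs and the 50 ms latency are invented for the illustration.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_fetch(url: str, latency: float = 0.05) -> str:
    """Stand-in for a network call: sleeping releases the GIL like a real wait."""
    time.sleep(latency)
    return f"fetched {url}"


def fetch_all(urls, workers):
    """Run fake_fetch over urls on a thread pool; return (results, seconds)."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(fake_fetch, urls))
    return results, time.perf_counter() - start


urls = [f"https://example.com/page/{i}" for i in range(10)]
_, one_thread = fetch_all(urls, workers=1)
results, ten_threads = fetch_all(urls, workers=10)
print(f"1 thread: {one_thread:.2f}s, 10 threads: {ten_threads:.2f}s")
```

With one worker the waits happen back to back. With ten they overlap, so the total is close to a single wait.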

CPU-bound work is different. A pure Python loop computing primes never releases the GIL voluntarily; it gives it up only when another thread requests it after the switch interval. Two threads doing the same CPU work share one core's worth of bytecode execution and pay for switching on top. Throughput often goes down with more threads.

import threading
import time
 
def cpu_work():
    s = 0
    for i in range(10_000_000):
        s += i
    return s
 
# Single thread
start = time.perf_counter()
cpu_work()
print(time.perf_counter() - start)  # e.g. 0.5s
 
# Two threads
start = time.perf_counter()
t1 = threading.Thread(target=cpu_work)
t2 = threading.Thread(target=cpu_work)
t1.start(); t2.start()
t1.join(); t2.join()
print(time.perf_counter() - start)  # e.g. 1.1s (slower, not 0.5s)

C extensions and GIL release

The standard library and major third-party C extensions release the GIL during expensive operations. NumPy releases it during matrix operations. Pillow releases it during image processing. The compression and hashing modules release it during their work.

This is why NumPy-heavy code can use threads effectively for "embarrassingly parallel" array operations. The threads do almost all their work in C with the GIL released, only re-acquiring briefly to update Python objects.
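The same pattern works with the standard library's hashing module, which makes for an example with no third-party dependency. This is a sketch under assumptions: the buffer sizes and worker count are arbitrary, the helper names are invented, and how much wall-clock time you save depends on your core count and Python build, so no timing is claimed here.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor


def sha256_hex(data: bytes) -> str:
    """Hash one buffer; the heavy lifting happens in C."""
    return hashlib.sha256(data).hexdigest()


def hash_all(buffers, workers: int = 4):
    """Hash every buffer on a thread pool, preserving input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(sha256_hex, buffers))


# Eight 1 MiB buffers, each filled with a different byte value.
buffers = [bytes([i]) * (1024 * 1024) for i in range(8)]
digests = hash_all(buffers)
print(len(digests), "digests computed")
```

hashlib is documented to release the GIL while hashing larger inputs, so the threads in the pool can make progress on different cores at the same time. The Python-level work per task (one function call and one hexdigest) is tiny by comparison.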

You can release the GIL from your own C extension with the macros:

Py_BEGIN_ALLOW_THREADS
// expensive C work, no Python API calls
Py_END_ALLOW_THREADS

Inside the block, you cannot touch any Python object or call any Python API. You also have to handle thread safety yourself. The contract is: you promised not to touch Python state, and the interpreter is free to let other threads execute.

Multiprocessing: the workaround

If you have CPU-bound Python work that you want to run in parallel, the standard answer is multiprocessing. Each process is a separate Python interpreter with its own GIL. The Process class spawns OS processes; Pool and concurrent.futures.ProcessPoolExecutor manage pools.

The cost: process startup is slow (fork is fast on Linux, but spawn, the default on macOS and Windows, starts a fresh interpreter and can take tens to hundreds of milliseconds per process). Data passed between processes must be pickled. Shared memory is awkward (multiprocessing.Value, multiprocessing.Array, and multiprocessing.shared_memory in Python 3.8+).

The rule of thumb: if the work is fine-grained (microseconds per task), multiprocessing overhead kills you. If it's coarse-grained (seconds per task) and little data crosses the process boundary, multiprocessing scales close to linearly with cores.
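A minimal sketch of the coarse-grained case: split a range into one large chunk per worker so that each task does a lot of computation and exchanges only a few integers. The function names (count_primes, split_range, parallel_count) and the limit of 50,000 are chosen for illustration.

```python
from concurrent.futures import ProcessPoolExecutor


def count_primes(lo: int, hi: int) -> int:
    """Count primes in [lo, hi) by trial division (deliberately CPU-heavy)."""
    count = 0
    for n in range(max(lo, 2), hi):
        if all(n % d for d in range(2, int(n ** 0.5) + 1)):
            count += 1
    return count


def split_range(limit: int, parts: int):
    """Split [0, limit) into at most `parts` contiguous (lo, hi) chunks."""
    step = -(-limit // parts)  # ceiling division
    return [(lo, min(lo + step, limit)) for lo in range(0, limit, step)]


def parallel_count(limit: int, workers: int = 4) -> int:
    """Count primes below `limit`, one coarse chunk per worker process."""
    chunks = split_range(limit, workers)
    with ProcessPoolExecutor(max_workers=workers) as pool:
        # Only two ints go to each worker and one int comes back,
        # so pickling cost is negligible next to the computation.
        return sum(pool.map(count_primes, *zip(*chunks)))


if __name__ == "__main__":  # needed for the spawn and forkserver start methods
    print(parallel_count(50_000))
```

The chunks are equal in width, not in work (larger numbers take longer to test), so the last worker finishes last. Using more chunks than workers evens this out at the cost of a little more scheduling overhead.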

PEP 703: free-threaded Python

PEP 703 was accepted in 2023 and the first experimental free-threaded build shipped with Python 3.13 in late 2024. The plan:

Phase 1 (3.13): experimental, opt-in build (python3.13t). Default Python still has the GIL.
Phase 2 (3.14 onward): officially supported but still opt-in (PEP 779). Encourage ecosystem migration.
Phase 3 (no release committed yet): the free-threaded build becomes the default, and the GIL build is eventually phased out.
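While both builds coexist, it helps to check at runtime which one you are on. The sketch below is one way to do it. The function name gil_status is invented; the Py_GIL_DISABLED config variable and sys._is_gil_enabled() are the real hooks, and the latter only exists from 3.13, hence the fallback.

```python
import sys
import sysconfig


def gil_status() -> dict:
    """Report whether this is a free-threaded build and whether the GIL is active."""
    # Py_GIL_DISABLED is 1 on free-threaded builds, 0 or None otherwise.
    free_threaded_build = bool(sysconfig.get_config_var("Py_GIL_DISABLED"))
    # sys._is_gil_enabled() exists from 3.13; older versions always have the GIL.
    is_enabled = getattr(sys, "_is_gil_enabled", lambda: True)
    return {
        "version": sys.version.split()[0],
        "free_threaded_build": free_threaded_build,
        "gil_enabled": bool(is_enabled()),
    }


print(gil_status())
```

The two flags are reported separately because a free-threaded build can still run with the GIL switched back on, for example when it is re-enabled at startup or when an extension module that has not declared free-threading support is imported.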

What changed under the hood:

Reference counting is split per object into a thread-local count and a shared count ("biased reference counting"). The thread that created the object updates the local count with cheap non-atomic operations; every other thread pays for atomic operations on the shared count.
Container types (dict, list, set) got per-object locks for thread-safe mutation. Reads of dict and list are optimized to avoid taking the lock in the common case, which keeps contention low on read-heavy workloads.
The cycle GC still runs in a stop-the-world phase, pausing all other threads while it collects. Concurrent GC for free-threaded mode is in design.
Immortal objects (PEP 683, which first shipped in 3.12) cover small ints, None, True, False, and some constants. Their refcounts are never updated (the interpreter checks with _Py_IsImmortal), which avoids cache-line bouncing on heavily shared objects.

Performance cost:

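Single-threaded benchmarks: roughly 5-10% slower than the GIL build as of 3.14. The experimental 3.13 build was considerably slower (around 40% on the pyperformance suite), mostly because the specializing interpreter was disabled there.
Multi-threaded CPU-bound: can approach linear scaling with cores when threads rarely contend on shared objects (the whole point).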
I/O-bound: same as before, no real change.

The transition is the hardest part. C extensions written assuming the GIL need updates. NumPy, PyTorch, and other major projects have ongoing work to support free-threading. The next 2-3 years are the painful migration.

When to use which model

I/O-bound, latency-sensitive, many concurrent tasks: asyncio. Single thread, no GIL contention, cheap context switches.

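I/O-bound, simpler code: threading with ThreadPoolExecutor. Comparable throughput to asyncio at moderate concurrency; each thread costs more memory than a coroutine, so asyncio pulls ahead as the task count grows.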

CPU-bound, parallel, coarse-grained: multiprocessing or ProcessPoolExecutor. Works today, no GIL issues.

CPU-bound, vector math: NumPy, SciPy, PyTorch. They drop the GIL inside heavy operations, so the work can spread across cores through your own threads or the library's internal thread pools.

CPU-bound, want threads, willing to use experimental Python: python3.13t plus threading.

CPU-bound with C extensions: write the hot loop in C/Cython/Rust and release the GIL around it.
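To make the asyncio entry concrete, the sketch below runs ten simulated I/O waits concurrently and records which thread each coroutine resumed on. The names fake_io and run_tasks and the 50 ms latency are assumptions for the illustration; asyncio.sleep stands in for a real network wait.

```python
import asyncio
import threading
import time


async def fake_io(task_id: int, latency: float = 0.05) -> str:
    """Simulate an I/O wait and report which thread resumed the coroutine."""
    await asyncio.sleep(latency)
    return threading.current_thread().name


async def run_tasks(n: int = 10):
    """Run n simulated I/O tasks concurrently; return (thread names, seconds)."""
    start = time.perf_counter()
    names = await asyncio.gather(*(fake_io(i) for i in range(n)))
    return set(names), time.perf_counter() - start


thread_names, elapsed = asyncio.run(run_tasks())
print(thread_names, f"{elapsed:.2f}s")
```

All ten waits overlap, so the total is close to one wait, yet only one thread name shows up. That is concurrency without parallelism: replace the sleep with a CPU-heavy loop and the tasks would run one after another.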

Common misconceptions

"Python is slow because of the GIL." False. Python's single-threaded performance is set by interpreter overhead, not the GIL. The GIL only limits multi-core scaling. Removing it does not make single-threaded code faster: a pure single-threaded program runs no faster on python3.13t than on python3.13, and typically somewhat slower because of the extra refcounting and locking work.

"asyncio bypasses the GIL." Misleading. The event-loop thread holds the GIL like any other thread; because asyncio runs everything on that one thread, there's no GIL contention - but there's also no parallelism. You get concurrency (overlapping I/O) but not multi-core CPU use.

"Threads in Python are useless." False. They are excellent for I/O-bound work. The misconception comes from people benchmarking CPU-bound work and seeing no speedup, then generalizing wrongly.

"The GIL is just a bad design." Lazy take. The GIL bought 30 years of fast single-threaded execution and a thriving C extension ecosystem. The tradeoff has shifted now that many-core CPUs are universal, and that's why PEP 703 happened. But it wasn't a mistake when it was made.

Mental model

The GIL is a token. One token. To execute Python bytecode, you need to hold it. Every thread takes turns. When a thread is doing something that doesn't need the token (waiting on a socket, running C code that promised not to touch Python state), it puts the token down and someone else picks it up.

The implication is: if all your work is bytecode-executing Python, only one thread runs at a time. If your work mostly waits for external things, threads scale well. If your work is in C extensions that release the GIL, threads scale.

Free-threaded Python removes the token. Threads can execute bytecode in parallel on different cores. The price is per-object locking and atomic refcounts, paid even in the single-threaded case.

Learn more

Docs: PEP 703: Making the Global Interpreter Lock Optional (python.org)
Talk: David Beazley: Understanding the Python GIL (PyCon)
Talk: Larry Hastings: Python's Infamous GIL (Larry Hastings)
Docs: Python docs: threading module (python.org)