CPython bytecode compilation
From source to running code in CPython: PEG parser, AST, bytecode compiler, code objects, the ceval.c interpreter loop, specializing adaptive bytecode, and the new 3.13 JIT.
Why bytecode
CPython could have been a tree-walking interpreter (run the AST directly). Tree walkers are simple but slow: every operation involves following pointers through tree nodes. Bytecode interpreters compile the AST to a flat sequence of fixed-size instructions that an interpreter loop can dispatch quickly.
CPython chose bytecode in the early days (1990s) for the standard reason: it's a sweet spot between simplicity and performance. The compiler is straightforward (tens of thousands of lines, not millions). The interpreter loop is tight enough that modern CPUs predict branches well. You get reasonable performance without the complexity of a JIT.
The downside is no compile-time optimization beyond simple peephole passes. The compiler doesn't inline functions, doesn't do escape analysis, doesn't unroll loops. Everything dynamic happens at runtime. This is why Python is much slower than C and why making it faster has been a multi-year project.
The pipeline in detail
1. Tokenizer
The tokenizer (Parser/tokenizer.c) reads source character by character and produces a stream of tokens: NAME, NUMBER, STRING, OP, NEWLINE, INDENT, DEDENT, etc. Python's indentation-as-syntax is handled here: the tokenizer tracks indentation levels and emits INDENT/DEDENT tokens for block boundaries.
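The tokenizer's job is mechanical and well understood, and its design has changed little across versions. The main recent change came with PEP 701 in Python 3.12, which made f-strings tokenize into their own token types.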
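To see the INDENT/DEDENT tokens for yourself, the standard library's tokenize module exposes a token stream from Python. This is a minimal sketch: the sample source string is made up for illustration, and tokenize is a Python-level view of the tokens rather than the C tokenizer itself.

```python
import io
import tokenize

# A made-up snippet with one indented block.
source = "if x:\n    y = 1\nz = 2\n"

# generate_tokens takes a readline callable and yields TokenInfo tuples.
tokens = list(tokenize.generate_tokens(io.StringIO(source).readline))
names = [tokenize.tok_name[tok.type] for tok in tokens]

for tok in tokens:
    print(f"{tokenize.tok_name[tok.type]:10} {tok.string!r}")
```

The INDENT token appears before `y` and the DEDENT before `z`, which is how the parser learns where the block starts and ends.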
2. Parser
Through Python 3.8, CPython used an LL(1) parser generated from a grammar in Grammar/Grammar. LL(1) means it could look ahead only one token. This limited the grammar: some patterns required ugly workarounds.
Python 3.9 replaced it with a PEG (Parsing Expression Grammar) parser via PEP 617. PEG allows unlimited lookahead, with memoization keeping the cost bounded. The grammar is more expressive, and error messages improved significantly in the releases that followed. The grammar file (Grammar/python.gram) generates the parser, and its rule actions build AST nodes directly; the node types themselves are defined separately in Parser/Python.asdl.
The parser produces an AST: a tree of node types like Module, FunctionDef, If, Call, Name. You can dump it:
import ast
tree = ast.parse("x = 1 + 2")
print(ast.dump(tree, indent=2))
3. Compiler
The compiler (Python/compile.c) walks the AST and emits bytecode. Major steps:
a. Symbol table. A pre-pass over the AST collects all variable references and assignments, determining scopes (local, enclosing, global, builtin). This drives the choice of LOAD_FAST vs LOAD_GLOBAL etc.
b. Bytecode emission. Walks the AST emitting instructions. Function bodies become separate code objects. Constants and names go into pools.
c. Peephole optimization. A small pass eliminates obvious inefficiencies: constant folding (2 + 3 becomes 5), dead code elimination after returns, jump-to-jump simplification.
d. Code object assembly. Wraps the bytecode, constants, names, line number table, and exception table into a PyCodeObject.
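The symbol-table pass in step (a) can be inspected with the standard library's symtable module. A small sketch, assuming a made-up function named bump; the scope answers are what decide between LOAD_FAST and LOAD_GLOBAL later on.

```python
import symtable

# Hypothetical source used only for illustration.
source = """
counter = 0
def bump(step):
    global counter
    local_total = step + len("x")
    counter += local_total
    return local_total
"""

module_table = symtable.symtable(source, "<demo>", "exec")
func_table = module_table.get_children()[0]  # the table for bump()

for name in ("step", "local_total", "counter", "len"):
    sym = func_table.lookup(name)
    print(
        f"{name:12} parameter={sym.is_parameter()} "
        f"local={sym.is_local()} global={sym.is_global()}"
    )
```

`step` and `local_total` resolve as locals, `counter` is global because of the declaration, and `len` is an implicit global that falls through to builtins at runtime.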
The compiler is straightforward by design. Most of CPython's perf work focuses on the interpreter, not the compiler.
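Constant folding is easy to observe from Python. The sketch below compiles a made-up assignment and checks that the product was computed at compile time; where in the pipeline the folding happens has moved between releases, so treat this as a check on the result rather than on the mechanism.

```python
import dis

# 60 * 60 * 24 is all constants, so the compiler can fold it.
code = compile("seconds_per_day = 60 * 60 * 24", "<demo>", "exec")
opnames = [ins.opname for ins in dis.get_instructions(code)]

print(code.co_consts)  # the folded value sits in the constant pool
print(opnames)         # no multiply instruction is left
```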
4. Interpreter
Python/ceval.c contains the interpreter loop. It's one of the most-optimized pieces of C code in the project. Pseudo-code:
for (;;) {
    opcode = NEXTOP();
    switch (opcode) {
        case LOAD_FAST: ... break;
        case BINARY_ADD: ... break;
        case CALL_FUNCTION: ... break;
        ...
    }
}
Modern CPython uses "computed gotos" (a GCC extension, also supported by Clang) instead of switch for faster dispatch on supported compilers. Each opcode handler jumps directly to the next opcode's handler via a label table. This avoids the switch's bounds check and helps branch prediction.
The code object
A PyCodeObject is what compile() returns. It contains:
- co_code: the bytecode bytes.
- co_consts: tuple of constant values (literals, nested code objects for functions).
- co_names: tuple of names referenced (for LOAD_NAME, LOAD_ATTR).
- co_varnames: local variable names.
- co_freevars, co_cellvars: closure-related names.
- co_argcount, co_kwonlyargcount, etc.: signature info.
- co_lnotab / co_linetable: line number mapping for tracebacks.
- co_exceptiontable (3.11+): exception handling info.
You can introspect any function's code object: func.__code__. You can disassemble: dis.dis(func).
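A short sketch of that introspection, using a made-up closure so the closure-related fields have something in them. The function names outer and inner are illustrative only.

```python
import types

def outer(a, b=2, *, key=None):
    total = a + b
    def inner():
        return total  # 'total' is captured from the enclosing scope
    return inner

code = outer.__code__
# Nested functions show up as code objects inside co_consts.
inner_code = next(c for c in code.co_consts if isinstance(c, types.CodeType))

print(code.co_argcount, code.co_kwonlyargcount)
print(code.co_varnames)
print(code.co_cellvars, inner_code.co_freevars)
print(len(code.co_code), "bytes of bytecode")
```

`total` appears in the outer function's co_cellvars and the inner function's co_freevars: that pairing is how the closure is wired up.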
The stack machine
CPython's VM is stack-based. Every operation pushes and pops from a per-frame value stack. There are no registers, unlike register-based VMs such as Lua 5's or Android's Dalvik. (The JVM, like CPython, is stack-based.)
Stack machines are simple to compile to (just translate each AST node to push/pop sequences) and simple to interpret (one instruction pointer, one stack pointer). The cost is more memory traffic: each operation reads from and writes to the stack.
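The shape of a stack-machine interpreter fits in a few lines. This is a toy model, not CPython's loop: the run function, the tuple-based instruction format, and the dict used for locals are all assumptions made for illustration. Only the opcode names are borrowed.

```python
def run(program, local_vars):
    """Execute a list of (opname, argument) pairs on a value stack."""
    stack = []
    pc = 0  # instruction pointer
    while True:
        opname, arg = program[pc]
        pc += 1
        if opname == "LOAD_FAST":
            stack.append(local_vars[arg])
        elif opname == "LOAD_CONST":
            stack.append(arg)
        elif opname == "BINARY_ADD":
            right = stack.pop()
            left = stack.pop()
            stack.append(left + right)
        elif opname == "RETURN_VALUE":
            return stack.pop()
        else:
            raise ValueError(f"unknown opcode {opname!r}")

program = [
    ("LOAD_FAST", "a"),
    ("LOAD_FAST", "b"),
    ("BINARY_ADD", None),
    ("RETURN_VALUE", None),
]
result = run(program, {"a": 2, "b": 3})
print(result)
```

Every instruction touches the stack, which is the memory-traffic cost mentioned above. A register VM would name its operands directly instead.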
Before Python 3.6, an instruction was one byte, or three bytes when it took an argument. Since 3.6, every instruction is two bytes (one opcode, one operand), with EXTENDED_ARG for larger operands. 3.11 kept that encoding and added inline cache entries after certain instructions in the instruction stream, which is what specialization needs.
def add(a, b):
    return a + b
The bytecode (simplified; exact output varies by version, and 3.11+ emits BINARY_OP in place of BINARY_ADD):
LOAD_FAST a
LOAD_FAST b
BINARY_ADD
RETURN_VALUE
Four instructions. Tight, predictable.
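To check what your own interpreter emits, dis.get_instructions gives the same information as dis.dis in a form you can inspect programmatically. A minimal sketch; the exact opcode names printed depend on the Python version.

```python
import dis

def add(a, b):
    return a + b

instructions = list(dis.get_instructions(add))
opnames = [ins.opname for ins in instructions]

print(opnames)
# Each instruction unit is two bytes, so the raw length is always even.
print(len(add.__code__.co_code), "bytes")
```

Newer versions may show a prologue instruction or fused loads, but the add and the return are always there.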
.pyc caching
On import, CPython compiles module.py and writes __pycache__/module.cpython-3X.pyc. The file format:
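- 4-byte magic number (changes whenever the bytecode format changes, in practice with each minor Python version).
- 4-byte bit field (flags selecting timestamp-based or hash-based invalidation; added in 3.7 by PEP 552).
- For timestamp-based files: 4 bytes of source modification time, then 4 bytes of source file size. For hash-based files: an 8-byte hash of the source in place of both.
- Marshalled code object.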
On next import, CPython checks the magic and timestamp/hash. If they match the source, it loads the marshalled code object directly. If they don't, it recompiles.
This is why imports are slow the first time and fast after. The cache makes Python's startup tolerable in normal use.
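The header layout can be checked by compiling a throwaway file and reading the first 16 bytes back. This sketch assumes Python 3.7+ and forces timestamp-based invalidation so the field layout is predictable; the file name demo.py is made up.

```python
import importlib.util
import marshal
import os
import py_compile
import struct
import tempfile
import types

source_bytes = b"x = 1 + 2\n"

with tempfile.TemporaryDirectory() as tmp:
    src_path = os.path.join(tmp, "demo.py")
    pyc_path = os.path.join(tmp, "demo.pyc")
    with open(src_path, "wb") as f:
        f.write(source_bytes)
    py_compile.compile(
        src_path,
        cfile=pyc_path,
        doraise=True,
        invalidation_mode=py_compile.PycInvalidationMode.TIMESTAMP,
    )
    with open(pyc_path, "rb") as f:
        data = f.read()

magic = data[:4]
# Three little-endian 32-bit fields: flags, source mtime, source size.
flags, mtime, size = struct.unpack("<III", data[4:16])
code = marshal.loads(data[16:])

print(magic == importlib.util.MAGIC_NUMBER, flags, mtime, size)
print(code)
```

Everything after the 16-byte header is just the marshalled code object, which is why loading a cached module skips the tokenizer, parser, and compiler entirely.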
PEP 659: Specializing adaptive interpreter
Python 3.11 brought a major architectural change: bytecode instructions specialize themselves at runtime based on observed types.
The idea: most call sites in Python see only one or two object types. x + y is almost always integer-plus-integer, or float-plus-float, or string-concat. The generic add instruction (BINARY_ADD before 3.11, BINARY_OP since) has to check types, dispatch to the right __add__ or __radd__, handle subclasses, etc. The specialized form just does the integer add.
The mechanism:
- Initially, the instruction is generic (BINARY_OP).
- After a few executions, the interpreter observes the types and "specializes" the instruction in-place to a specific form (BINARY_OP_ADD_INT).
- The specialized form has a guard at the top: check types. If they match, do the fast path. If they don't, count failures.
- After several guard failures, de-specialize back to the generic form.
This is the inline cache pattern, familiar from JavaScript engines such as V8, adapted to a bytecode interpreter. The bytecode is mutable at runtime, and inline cache slots are baked into the instruction stream.
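On Python 3.11 and later, the dis module can show the specialized instructions if you pass adaptive=True. A sketch, assuming a warm-up loop of 1000 calls is enough to trigger specialization (the real thresholds are an implementation detail and vary by version); on older interpreters it falls back to the plain listing.

```python
import dis
import sys

def add(a, b):
    return a + b

# Warm the function up so the adaptive interpreter can specialize it.
for _ in range(1000):
    add(1, 2)

if sys.version_info >= (3, 11):
    # adaptive=True reports the specialized opcodes currently in place.
    names = [ins.opname for ins in dis.get_instructions(add, adaptive=True)]
else:
    names = [ins.opname for ins in dis.get_instructions(add)]

print(names)
```

Calling add with strings or floats many times and re-running the listing is a quick way to watch an instruction change form.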
The result was a 10-60% speedup in 3.11 over 3.10, depending on workload. The "Faster CPython" team led by Mark Shannon has continued the work in 3.12 and 3.13 with more specialized forms and better feedback collection.
The 3.13 JIT
Python 3.13 (October 2024) shipped an experimental JIT, off by default. It uses the "copy-and-patch" technique from a 2021 paper by Haoran Xu and Fredrik Kjolstad.
The idea: pre-compile templates of machine code when CPython itself is built, one per low-level interpreter operation (in 3.13, the micro-ops that specialized bytecode is lowered to). At runtime, when a stretch of code is hot, concatenate the templates for its instructions and patch in the constants and addresses. The result is straight-line machine code instead of an interpreter loop.
It's a baseline JIT. There's no SSA optimization and no inlining of the kind an optimizing compiler would do. The speedup in 3.13 is modest at best: PEP 744 describes the JIT as roughly on par with the specializing interpreter. The value is the infrastructure: future versions can add optimization passes on top.
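The copy-and-patch step itself is simple enough to model in Python. This is a toy under loud assumptions: the byte values form a made-up instruction set, the STENCILS table and copy_and_patch function are hypothetical names, and nothing here produces real machine code. It only shows the "copy a template, fill in the hole" mechanic.

```python
import struct

# Hypothetical stencils: (template bytes, offset of a 4-byte hole or None).
HOLE = b"\x00\x00\x00\x00"
STENCILS = {
    "PUSH_CONST": (b"\x01" + HOLE, 1),
    "ADD": (b"\x02", None),
    "RETURN": (b"\x03", None),
}

def copy_and_patch(trace):
    """Concatenate stencils for a trace and patch operands into the holes."""
    code = bytearray()
    for opname, operand in trace:
        template, hole_offset = STENCILS[opname]
        start = len(code)
        code += template  # copy
        if hole_offset is not None:
            # patch: write the operand as a little-endian 32-bit integer
            struct.pack_into("<i", code, start + hole_offset, operand)
    return bytes(code)

trace = [("PUSH_CONST", 2), ("PUSH_CONST", 3), ("ADD", None), ("RETURN", None)]
compiled = copy_and_patch(trace)
print(compiled.hex(" "))
```

No code generation logic runs at this stage, only memcpy-style copying and a few writes, which is why the approach is cheap to maintain.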
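The JIT has to be compiled in with the --enable-experimental-jit configure option. On a build configured with --enable-experimental-jit=yes-off, it is present but disabled, and PYTHON_JIT=1 turns it on at runtime. As of 3.13 it's still labeled experimental. Production use should wait a few more releases.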
Performance patterns
What the bytecode tells you:
Attribute access is slow. obj.attr is LOAD_ATTR, which does a __getattribute__ lookup. Caching to a local is faster:
# Slow: looks up obj.method on every iteration
for x in xs: obj.method(x)
# Faster: look it up once and reuse the bound method
m = obj.method
for x in xs: m(x)
Global lookups are slow. LOAD_GLOBAL searches the module's globals dict then the builtins dict. Locals are LOAD_FAST which is array indexing. Make hot loops self-contained.
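You can confirm the global-versus-local difference in the bytecode without timing anything. A sketch with two made-up functions; binding abs as a default argument is one common way to turn a builtin lookup into a local.

```python
import dis

def total_abs_global(xs):
    total = 0
    for x in xs:
        total += abs(x)  # abs is looked up via LOAD_GLOBAL each time
    return total

def total_abs_local(xs, abs=abs):
    total = 0
    for x in xs:
        total += abs(x)  # abs is now a local variable
    return total

def opnames(func):
    return [ins.opname for ins in dis.get_instructions(func)]

print("LOAD_GLOBAL" in opnames(total_abs_global))
print("LOAD_GLOBAL" in opnames(total_abs_local))
```

Since 3.11, specialization caches global lookups too, so the gap is smaller than it used to be. Measure before rewriting code this way.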
Comprehensions are fast. [x*x for x in xs] compiles to a tight loop with LIST_APPEND. Faster than for x in xs: result.append(x*x).
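The LIST_APPEND claim can be checked the same way. In this sketch the helper all_opnames is a made-up name; it walks nested code objects because, before 3.12, a comprehension body lived in its own code object rather than inline.

```python
import dis
import types

def squares_comprehension(xs):
    return [x * x for x in xs]

def squares_loop(xs):
    result = []
    for x in xs:
        result.append(x * x)
    return result

def all_opnames(code):
    """Opcode names for a code object and any code objects nested in it."""
    names = [ins.opname for ins in dis.get_instructions(code)]
    for const in code.co_consts:
        if isinstance(const, types.CodeType):
            names.extend(all_opnames(const))
    return names

comp_ops = all_opnames(squares_comprehension.__code__)
loop_ops = all_opnames(squares_loop.__code__)

print("LIST_APPEND" in comp_ops)
print(any(name.startswith("CALL") for name in loop_ops))
```

The explicit loop pays for an attribute lookup and a call on every iteration. The comprehension replaces both with a single dedicated instruction.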
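Calls have overhead. Each function call sets up a new frame: allocate, link to caller, copy arguments. Pre-3.11 this was expensive; 3.11 made calls cheaper by allocating lightweight frames inline (full frame objects are created only on demand) and by avoiding a C-level call for most Python-to-Python calls. The same release also introduced "zero-cost" exception handling, where a try block costs almost nothing unless an exception is actually raised.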
Tools
- dis.dis(obj): disassemble any function or code object.
- dis.show_code(func): print code object metadata.
- compile(src, filename, mode): compile a string to a code object.
- ast.parse(src): parse to AST without compiling.
- sys.settrace(tracer): install a tracer that is called on call, line, return, and exception events.
- sys.getsizeof(func.__code__): size of the code object in bytes; len(func.__code__.co_code) gives the length of the bytecode itself.
For deeper analysis, the Faster CPython project on GitHub has tools and benchmarks. The official benchmark suite is pyperformance.
Why the design has staying power
CPython's bytecode-interpreter design has been around for 30+ years. Several alternatives (PyPy with a tracing JIT, Jython on the JVM, IronPython on .NET) exist but haven't displaced CPython. The reasons:
- Simplicity. CPython's compiler and interpreter are written in C in a way that one person can understand. New contributors can navigate it.
- C extension compatibility. The C API is built around the bytecode interpreter's model. JITs that change the model (PyPy) struggle with extensions.
- Portability. A bytecode interpreter runs anywhere C runs. JITs need machine-code generators per architecture.
- Good enough. For most Python workloads, the bottleneck is in C extensions (NumPy, libxml, etc.) or I/O. The interpreter speed matters less than people think.
The Faster CPython project is trying to close the gap with PyPy by adding specialization and JIT while preserving compatibility. 3.11-3.13 have made big progress. The next few versions should make more.
Mental model
Source goes through a deterministic pipeline (tokenize, parse, compile) to a code object: a flat sequence of bytecodes plus a constant pool plus name tables. The interpreter loop dispatches bytecodes one at a time on a stack VM.
Specialization mutates the bytecode at runtime: hot instructions get replaced with type-specific versions that skip checks. The 3.13 JIT goes further by replacing the interpreter loop entirely for hot code with copied-and-patched native code.
You don't usually need to think about any of this when writing Python. But knowing it explains why some patterns are faster, why .pyc files exist, why imports are slow the first time, why local variables beat globals, and why the language keeps getting faster between releases.
Learn more
- Docs: Python docs: dis module (python.org)
- Docs: PEP 617: New PEG parser for CPython (python.org)
- Article: Eli Bendersky: CPython internals