Compilers
A compiler turns source text in one language into source text
(or machine code) in another. The classical pipeline —
lex, parse,
codegen — is a small idea that scales: the same
three stages run inside gcc, rustc, and
every interpreter you've ever used. The way to learn it is to
write one.
This section builds the same compiler — a growing subset of C, targeting x86-64 assembly — repeatedly, in different implementation languages. Same input, same output, different host. The host language changes what feels easy and what feels hard, and the contrast is most of what's worth learning after you've seen the algorithms once.
Implementation tracks
- C compiler in Python — The opening track. Python lets the algorithms breathe — string slicing, dictionaries, exceptions, no manual memory. Read this first; the structure stays the same in every other track.
- C compiler in C — The self-hosting move. No regex library, no built-in dict, no garbage collection. Forces you to confront the data structures you got for free in Python (symbol tables, AST allocation, error messages with source locations). When this version compiles itself, you have a self-hosting compiler.
- C compiler in Lisp — The AST-as-data move. Lisp's homoiconicity makes the parser output literally a Lisp list, and codegen becomes a tree-walk with pattern matching. The whole compiler reads as one long transformation pipeline.
- C compiler in OCaml — The type-driven move. Algebraic data types make AST nodes precise and exhaustive pattern matching catches missing cases at compile time. Many compilers in industry are written in OCaml or its descendants for this reason.
Only the Python track is written so far. The other tracks reuse the same chapter list, so once you've worked through Python you can pick any other track and read it as a "how do I express the same algorithm here?" exercise.
What we compile
A growing subset of C, chapter by chapter. The first chapter
handles only int main(void) { return INTEGER; };
the last handles structs, pointers, and a multi-function program
with a real calling convention. The C subset is the same across
all tracks — only the implementation changes.
The target is x86-64 assembly in AT&T syntax — the dialect
GNU as assembles by default. We hand the assembly
to gcc for assembling and linking against the C
runtime, so our main function becomes a real Unix
program with a real exit code. Linux and macOS both work; the
chapters call out the small Mach-O / ELF differences when they
matter.