The smallest C program
Compilers
Let's start with a simple C program. It's even less interesting that "Hello World".
int main(void) { return 42; } This program just returns the integer 42. The object of a compiler is to turn this langauge into one that means something to your CPU. Which basically means translate from C to assembly.
Compilation classically has three stages: lex (turn characters into tokens), parse (turn tokens into a tree), and codegen (turn the tree into target code). For chapter 1 the tree degenerates to a single integer, but we'll write the three stages as separate functions anyway. Future chapters will grow them, not invent new architecture.
The lexer
Before writing any code, decide what you want it to produce. The lexer's job is to turn the source string into a flat list of tokens, each tagged with a kind. Here is exactly what we want out, on the right, given the input on the left:
int main(void) { return 42; }
↓
[ ('KW', 'int'),
('IDENT', 'main'),
('PUNCT', '('),
('KW', 'void'),
('PUNCT', ')'),
('PUNCT', '{'),
('KW', 'return'),
('INT', 42),
('PUNCT', ';'),
('PUNCT', '}') ]
Four token kinds: KW for the three keywords
int, void, return;
IDENT for identifiers (here, only main);
INT for integer literals; PUNCT for the
punctuation ( ) { } ;. Whitespace and
line comments produce no token — they're consumed silently.
The lexer's algorithm is a single loop: look at the next character, decide which of the five things it starts (whitespace, comment, punctuation, integer, or identifier), consume the run of characters that belongs to that thing, emit a token if appropriate, and continue. That's it. Skeleton first:
KEYWORDS = {'int', 'void', 'return'}
PUNCTUATION = set('(){};')
def lex(src):
"""Walk the source one character at a time and emit tokens."""
tokens = []
i = 0
while i < len(src):
c = src[i]
# ... classify c, append a token, advance i ...
return tokens Now fill in the five branches one at a time. Whitespace and comments emit nothing; we just advance the cursor:
# Whitespace: skip and continue.
if c.isspace():
i += 1
continue
# Line comment: skip to end of line.
if c == '/' and i + 1 < len(src) and src[i + 1] == '/':
while i < len(src) and src[i] != '\n':
i += 1
continue
The comment branch peeks at the next character to make sure
it's a second / before consuming. We don't have
division yet, but in a later chapter when / can
also be the divide operator, this peek is exactly what
distinguishes a/b from a // b.
Punctuation is the easy case: a single character is a single token. Look it up in our small set, emit, advance:
# Punctuation: each character is its own one-character token.
if c in PUNCTUATION:
tokens.append(('PUNCT', c))
i += 1
continue
Integer literals span multiple characters, so we walk a second
cursor j forward past the digits, then slice
src[i:j] to recover the literal. Notice we
convert it to a Python int immediately — the
parser shouldn't care that the source said the digits "4" and
"2" specifically:
# Integer literal: consume a run of digits.
if c.isdigit():
j = i
while j < len(src) and src[j].isdigit():
j += 1
tokens.append(('INT', int(src[i:j])))
i = j
continue
Identifiers and keywords share the same scanning rule: start
with a letter or underscore, then more letters or digits or
underscores. The only difference between an identifier
and a keyword is whether the resulting word is in our
KEYWORDS set. If it is, tag it KW;
otherwise IDENT. After this branch, anything we
haven't recognized is a syntax error:
# Identifier or keyword: starts with a letter or underscore,
# then more letters / digits / underscores. We classify it as
# a keyword only if the resulting word is in KEYWORDS.
if c.isalpha() or c == '_':
j = i
while j < len(src) and (src[j].isalnum() or src[j] == '_'):
j += 1
word = src[i:j]
kind = 'KW' if word in KEYWORDS else 'IDENT'
tokens.append((kind, word))
i = j
continue
raise SyntaxError(f"Unexpected character at {i}: {c!r}") Five branches, each one obvious. No regex, no escape sequences, no capture groups — just "what does the next character begin?" asked five times. Real-world lexers (including C's) use the same pattern; production lexers usually swap regex or a generated DFA in for speed, but the structure is identical.
The parser
A real parser builds an AST. For chapter 1 there's exactly one
valid program shape, so the parser is a flat sequence of
expect(...) calls — a hand-written recursive
descent parser with no recursion yet, because there's no
nesting yet. The only thing we extract is the integer literal
after return.
def parse(tokens):
"""Parse: int main(void) { return INT; } → integer value."""
pos = [0]
def peek():
return tokens[pos[0]] if pos[0] < len(tokens) else None
def expect(kind, val=None):
t = peek()
if t is None or t[0] != kind or (val is not None and t[1] != val):
raise SyntaxError(
f"Expected {kind}"
+ (f' \"{val}\"' if val else '')
+ f', got {t!r}'
)
pos[0] += 1
return t
expect('KW', 'int')
expect('IDENT', 'main')
expect('PUNCT', '(')
expect('KW', 'void')
expect('PUNCT', ')')
expect('PUNCT', '{')
expect('KW', 'return')
value = expect('INT')[1]
expect('PUNCT', ';')
expect('PUNCT', '}')
return value expect is the workhorse: it asks for a specific
token kind (and optionally a specific value), advances the
cursor on success, throws on mismatch. The body of
parse is then a transcription of the grammar rule:
function : "int" IDENT "(" "void" ")" "{" "return" INT ";" "}"
All ten symbols, in order, with no choices to make. We just
keep expect-ing until the program is consumed.
The code generator
The target is x86-64 assembly in AT&T syntax (the dialect
GNU as consumes by default). System V AMD64 ABI:
integer return values live in %eax. So our
codegen is two real instructions plus the symbol declaration:
def codegen(value):
"""Emit System V AMD64 assembly that returns `value` from main."""
return (
' .globl main\n'
'main:\n'
f' movl ${value}, %eax\n'
' ret\n'
) .globl main exports the symbol so the linker can
find it. main: is the label the C runtime jumps to
after its startup code. movl $N, %eax loads the
return value into the register the calling convention expects.
ret pops the return address off the stack and
jumps to it. Three instructions, one program.
Wire it together
The full file. Reads C source from stdin (or a filename
argument), prints assembly to stdout. Save as
tinyc.py:
"""tinyc.py — a C compiler, chapter 1.
Compiles the smallest valid C program:
int main(void) { return INTEGER; }
into x86-64 assembly suitable for `gcc tinyc.s -o tinyc`.
"""
import sys
KEYWORDS = {'int', 'void', 'return'}
PUNCTUATION = set('(){};')
def lex(src):
tokens = []
i = 0
while i < len(src):
c = src[i]
if c.isspace():
i += 1
continue
if c == '/' and i + 1 < len(src) and src[i + 1] == '/':
while i < len(src) and src[i] != '\n':
i += 1
continue
if c in PUNCTUATION:
tokens.append(('PUNCT', c))
i += 1
continue
if c.isdigit():
j = i
while j < len(src) and src[j].isdigit():
j += 1
tokens.append(('INT', int(src[i:j])))
i = j
continue
if c.isalpha() or c == '_':
j = i
while j < len(src) and (src[j].isalnum() or src[j] == '_'):
j += 1
word = src[i:j]
kind = 'KW' if word in KEYWORDS else 'IDENT'
tokens.append((kind, word))
i = j
continue
raise SyntaxError(f"Unexpected character at {i}: {c!r}")
return tokens
def parse(tokens):
pos = [0]
def peek():
return tokens[pos[0]] if pos[0] < len(tokens) else None
def expect(kind, val=None):
t = peek()
if t is None or t[0] != kind or (val is not None and t[1] != val):
raise SyntaxError(f"Expected {kind} {val!r}, got {t!r}")
pos[0] += 1
return t
expect('KW', 'int')
expect('IDENT', 'main')
expect('PUNCT', '(')
expect('KW', 'void')
expect('PUNCT', ')')
expect('PUNCT', '{')
expect('KW', 'return')
value = expect('INT')[1]
expect('PUNCT', ';')
expect('PUNCT', '}')
return value
def codegen(value):
return (
' .globl main\n'
'main:\n'
f' movl ${value}, %eax\n'
' ret\n'
)
def compile_c(src):
return codegen(parse(lex(src)))
if __name__ == '__main__':
src = sys.stdin.read() if len(sys.argv) < 2 else open(sys.argv[1]).read()
sys.stdout.write(compile_c(src)) Run it
Pipe a C program in, redirect the assembly to a file,
assemble-and-link with gcc, run, check the exit
code:
$ echo 'int main(void) { return 42; }' | python tinyc.py
.globl main
main:
movl $42, %eax
ret
$ echo 'int main(void) { return 42; }' | python tinyc.py > tiny.s
$ gcc tiny.s -o tiny
$ ./tiny
$ echo $?
42
The shell variable $? holds the exit status of
the last command. ./tiny exited with 42 because
main returned 42 because the codegen put 42 in
%eax because the parser pulled 42 out of the
integer literal because the lexer recognized "42" as an
INT token. The whole pipeline is now
end-to-end, in well under 100 lines of Python.
On macOS, two adjustments. First, change
.globl main / main: to
.globl _main / _main: — Mach-O
prefixes user symbols with an underscore. Second, on Apple
Silicon the host CPU is arm64, so cross-compile to x86-64 and
let Rosetta run the binary:
gcc -arch x86_64 tiny.s -o tiny. (On Intel Macs
and on Linux, the plain gcc tiny.s -o tiny just
works.) The chapters target x86-64 throughout because its
instruction set has the longest pedagogical track record;
porting the codegen to arm64 is a fine exercise once you've
finished the series.
Exercises
Try these before clicking the reveal buttons. The point is to hold the chapter-1 compiler in your head and answer from there, not to look up answers and confirm.
Why have a separate lexer and parser instead of parsing characters directly?
Modularity and human comfort, not correctness. The lexer
handles character-level concerns — whitespace, comments,
multi-character operators (==, <=,
&& in later chapters) — and converts a
character stream into a token stream. The parser then handles
structural concerns — matching parentheses, applying
operator precedence, building the AST.
Combining them is possible. The combined grammar would mix "is this character a digit?" with "is this expression a return statement?" in the same productions. It works, but every rule reads worse, and a bug in one area is harder to isolate from a bug in the other. The split is for the humans, not the machine.
What do `int` and `void` actually do for the chapter-1 compiler?
Nothing semantic. The parser requires both keywords —
int before main, void
inside the parens — but consumes them and throws them away.
The compiler doesn't have a type system yet; it doesn't
check that the return value is actually an int,
doesn't reject void main(void) if you later wired
it up that way, doesn't know what void means
beyond "the literal token must be there." The keywords are
just punctuation, used to lock onto the function-definition
shape.
A real C compiler does check types. We'll bolt that on around the chapter on functions, when type information starts to actually constrain what code can be generated. Until then the keywords are present-but-ignored.
Why emit assembly text instead of machine code directly?
Assembly is text and the GNU assembler (as, which
gcc drives for us) handles every layer between
that text and an executable: instruction encoding, the ELF or
Mach-O object format, relocations, linking against the C
runtime so main becomes the real program entry
point. Emitting machine code directly would mean handling all
of that by hand — opcode bytes, ModR/M bytes, displacement
and immediate fields, ELF sections, relocation tables — for
no pedagogical benefit.
Production compilers do emit machine code (or, more
commonly, lower to LLVM IR and let LLVM do it). At our level
the assembler is a free abstraction; we only pay for it in
"you need gcc on your machine."
Problem 1. Run the chapter-1 lexer in your head on this source. List every token in order with its kind and value:
int main(void) { return 100; } Ten tokens, with whitespace silently consumed:
KWintIDENTmainPUNCT(KWvoidPUNCT)PUNCT{KWreturnINT100PUNCT;PUNCT}
Note that main classifies as IDENT,
not KW — it isn't in our KEYWORDS
set even though every C program in this chapter has
main as the function name. Real C compilers
would call this a "library-defined name," but for our lexer
it's just an arbitrary identifier that happens to be the one
the parser later looks for.
Problem 2. Compile this program through your
chapter-1 compiler, run it, then check echo $?:
int main(void) { return 257; }
What does $? print? Is your compiler wrong?
$? prints 1, not 257.
Your compiler is doing the right thing — the truncation is
happening downstream.
The Unix exit-status convention only preserves the low 8
bits of the value main returns. 257
in binary is 1 0000 0001; the low 8 bits are
0000 0001 = 1. The kernel stores the truncated
byte; the shell reads it back through
WEXITSTATUS() on wait()'s
return value; $? shows the result.
Your codegen put 257 into %eax correctly. The C
runtime's _start code took whatever was in
%eax and passed it to the exit()
syscall, which on Linux takes a 32-bit argument but the
kernel only stores 8 bits of exit status per process. This
is documented behavior, but it surprises everyone the first
time they hit it.
Try return values 0, 1, 255, 256, 511. $? will
show 0, 1, 255, 0, 255 respectively — the value modulo 256.
Where this is going
Chapter 1 is the smallest possible non-trivial C compiler.
The next chapter adds unary operators
(-x, !x, ~x), which
forces us to introduce a real AST instead of collapsing the
program to a single integer, and forces the codegen to emit
more than three instructions. Every chapter from there expands
one piece of the language; the architecture established here
(lex / parse / codegen, three Python functions, a single
compile_c entry point) does not change.