The smallest C program

Compilers

Let's start with a simple C program. It's even less interesting that "Hello World".

int main(void) { return 42; }

This program just returns the integer 42. The object of a compiler is to turn this langauge into one that means something to your CPU. Which basically means translate from C to assembly.

Compilation classically has three stages: lex (turn characters into tokens), parse (turn tokens into a tree), and codegen (turn the tree into target code). For chapter 1 the tree degenerates to a single integer, but we'll write the three stages as separate functions anyway. Future chapters will grow them, not invent new architecture.


The lexer

Before writing any code, decide what you want it to produce. The lexer's job is to turn the source string into a flat list of tokens, each tagged with a kind. Here is exactly what we want out, on the right, given the input on the left:

int main(void) { return 42; }

[ ('KW',    'int'),
  ('IDENT', 'main'),
  ('PUNCT', '('),
  ('KW',    'void'),
  ('PUNCT', ')'),
  ('PUNCT', '{'),
  ('KW',    'return'),
  ('INT',   42),
  ('PUNCT', ';'),
  ('PUNCT', '}') ]

Four token kinds: KW for the three keywords int, void, return; IDENT for identifiers (here, only main); INT for integer literals; PUNCT for the punctuation ( ) { } ;. Whitespace and line comments produce no token — they're consumed silently.

The lexer's algorithm is a single loop: look at the next character, decide which of the five things it starts (whitespace, comment, punctuation, integer, or identifier), consume the run of characters that belongs to that thing, emit a token if appropriate, and continue. That's it. Skeleton first:

KEYWORDS = {'int', 'void', 'return'}
PUNCTUATION = set('(){};')


def lex(src):
    """Walk the source one character at a time and emit tokens."""
    tokens = []
    i = 0
    while i < len(src):
        c = src[i]
        # ... classify c, append a token, advance i ...
    return tokens

Now fill in the five branches one at a time. Whitespace and comments emit nothing; we just advance the cursor:

        # Whitespace: skip and continue.
        if c.isspace():
            i += 1
            continue

        # Line comment: skip to end of line.
        if c == '/' and i + 1 < len(src) and src[i + 1] == '/':
            while i < len(src) and src[i] != '\n':
                i += 1
            continue

The comment branch peeks at the next character to make sure it's a second / before consuming. We don't have division yet, but in a later chapter when / can also be the divide operator, this peek is exactly what distinguishes a/b from a // b.

Punctuation is the easy case: a single character is a single token. Look it up in our small set, emit, advance:

        # Punctuation: each character is its own one-character token.
        if c in PUNCTUATION:
            tokens.append(('PUNCT', c))
            i += 1
            continue

Integer literals span multiple characters, so we walk a second cursor j forward past the digits, then slice src[i:j] to recover the literal. Notice we convert it to a Python int immediately — the parser shouldn't care that the source said the digits "4" and "2" specifically:

        # Integer literal: consume a run of digits.
        if c.isdigit():
            j = i
            while j < len(src) and src[j].isdigit():
                j += 1
            tokens.append(('INT', int(src[i:j])))
            i = j
            continue

Identifiers and keywords share the same scanning rule: start with a letter or underscore, then more letters or digits or underscores. The only difference between an identifier and a keyword is whether the resulting word is in our KEYWORDS set. If it is, tag it KW; otherwise IDENT. After this branch, anything we haven't recognized is a syntax error:

        # Identifier or keyword: starts with a letter or underscore,
        # then more letters / digits / underscores. We classify it as
        # a keyword only if the resulting word is in KEYWORDS.
        if c.isalpha() or c == '_':
            j = i
            while j < len(src) and (src[j].isalnum() or src[j] == '_'):
                j += 1
            word = src[i:j]
            kind = 'KW' if word in KEYWORDS else 'IDENT'
            tokens.append((kind, word))
            i = j
            continue

        raise SyntaxError(f"Unexpected character at {i}: {c!r}")

Five branches, each one obvious. No regex, no escape sequences, no capture groups — just "what does the next character begin?" asked five times. Real-world lexers (including C's) use the same pattern; production lexers usually swap regex or a generated DFA in for speed, but the structure is identical.


The parser

A real parser builds an AST. For chapter 1 there's exactly one valid program shape, so the parser is a flat sequence of expect(...) calls — a hand-written recursive descent parser with no recursion yet, because there's no nesting yet. The only thing we extract is the integer literal after return.

def parse(tokens):
    """Parse: int main(void) { return INT; }  →  integer value."""
    pos = [0]

    def peek():
        return tokens[pos[0]] if pos[0] < len(tokens) else None

    def expect(kind, val=None):
        t = peek()
        if t is None or t[0] != kind or (val is not None and t[1] != val):
            raise SyntaxError(
                f"Expected {kind}"
                + (f' \"{val}\"' if val else '')
                + f', got {t!r}'
            )
        pos[0] += 1
        return t

    expect('KW', 'int')
    expect('IDENT', 'main')
    expect('PUNCT', '(')
    expect('KW', 'void')
    expect('PUNCT', ')')
    expect('PUNCT', '{')
    expect('KW', 'return')
    value = expect('INT')[1]
    expect('PUNCT', ';')
    expect('PUNCT', '}')
    return value

expect is the workhorse: it asks for a specific token kind (and optionally a specific value), advances the cursor on success, throws on mismatch. The body of parse is then a transcription of the grammar rule:

function : "int" IDENT "(" "void" ")" "{" "return" INT ";" "}"

All ten symbols, in order, with no choices to make. We just keep expect-ing until the program is consumed.


The code generator

The target is x86-64 assembly in AT&T syntax (the dialect GNU as consumes by default). System V AMD64 ABI: integer return values live in %eax. So our codegen is two real instructions plus the symbol declaration:

def codegen(value):
    """Emit System V AMD64 assembly that returns `value` from main."""
    return (
        '    .globl main\n'
        'main:\n'
        f'    movl ${value}, %eax\n'
        '    ret\n'
    )

.globl main exports the symbol so the linker can find it. main: is the label the C runtime jumps to after its startup code. movl $N, %eax loads the return value into the register the calling convention expects. ret pops the return address off the stack and jumps to it. Three instructions, one program.


Wire it together

The full file. Reads C source from stdin (or a filename argument), prints assembly to stdout. Save as tinyc.py:

"""tinyc.py — a C compiler, chapter 1.

Compiles the smallest valid C program:

    int main(void) { return INTEGER; }

into x86-64 assembly suitable for `gcc tinyc.s -o tinyc`.
"""

import sys


KEYWORDS = {'int', 'void', 'return'}
PUNCTUATION = set('(){};')


def lex(src):
    tokens = []
    i = 0
    while i < len(src):
        c = src[i]

        if c.isspace():
            i += 1
            continue

        if c == '/' and i + 1 < len(src) and src[i + 1] == '/':
            while i < len(src) and src[i] != '\n':
                i += 1
            continue

        if c in PUNCTUATION:
            tokens.append(('PUNCT', c))
            i += 1
            continue

        if c.isdigit():
            j = i
            while j < len(src) and src[j].isdigit():
                j += 1
            tokens.append(('INT', int(src[i:j])))
            i = j
            continue

        if c.isalpha() or c == '_':
            j = i
            while j < len(src) and (src[j].isalnum() or src[j] == '_'):
                j += 1
            word = src[i:j]
            kind = 'KW' if word in KEYWORDS else 'IDENT'
            tokens.append((kind, word))
            i = j
            continue

        raise SyntaxError(f"Unexpected character at {i}: {c!r}")

    return tokens


def parse(tokens):
    pos = [0]

    def peek():
        return tokens[pos[0]] if pos[0] < len(tokens) else None

    def expect(kind, val=None):
        t = peek()
        if t is None or t[0] != kind or (val is not None and t[1] != val):
            raise SyntaxError(f"Expected {kind} {val!r}, got {t!r}")
        pos[0] += 1
        return t

    expect('KW', 'int')
    expect('IDENT', 'main')
    expect('PUNCT', '(')
    expect('KW', 'void')
    expect('PUNCT', ')')
    expect('PUNCT', '{')
    expect('KW', 'return')
    value = expect('INT')[1]
    expect('PUNCT', ';')
    expect('PUNCT', '}')
    return value


def codegen(value):
    return (
        '    .globl main\n'
        'main:\n'
        f'    movl ${value}, %eax\n'
        '    ret\n'
    )


def compile_c(src):
    return codegen(parse(lex(src)))


if __name__ == '__main__':
    src = sys.stdin.read() if len(sys.argv) < 2 else open(sys.argv[1]).read()
    sys.stdout.write(compile_c(src))

Run it

Pipe a C program in, redirect the assembly to a file, assemble-and-link with gcc, run, check the exit code:

$ echo 'int main(void) { return 42; }' | python tinyc.py
    .globl main
main:
    movl $42, %eax
    ret

$ echo 'int main(void) { return 42; }' | python tinyc.py > tiny.s
$ gcc tiny.s -o tiny
$ ./tiny
$ echo $?
42

The shell variable $? holds the exit status of the last command. ./tiny exited with 42 because main returned 42 because the codegen put 42 in %eax because the parser pulled 42 out of the integer literal because the lexer recognized "42" as an INT token. The whole pipeline is now end-to-end, in well under 100 lines of Python.

On macOS, two adjustments. First, change .globl main / main: to .globl _main / _main: — Mach-O prefixes user symbols with an underscore. Second, on Apple Silicon the host CPU is arm64, so cross-compile to x86-64 and let Rosetta run the binary: gcc -arch x86_64 tiny.s -o tiny. (On Intel Macs and on Linux, the plain gcc tiny.s -o tiny just works.) The chapters target x86-64 throughout because its instruction set has the longest pedagogical track record; porting the codegen to arm64 is a fine exercise once you've finished the series.


Exercises

Try these before clicking the reveal buttons. The point is to hold the chapter-1 compiler in your head and answer from there, not to look up answers and confirm.

Concept

Why have a separate lexer and parser instead of parsing characters directly?

Concept

What do `int` and `void` actually do for the chapter-1 compiler?

Concept

Why emit assembly text instead of machine code directly?

Problem 1. Run the chapter-1 lexer in your head on this source. List every token in order with its kind and value:

int main(void) { return 100; }
ProblemLex a small program by hand
Reveal the solution once you've written out your token list.

Problem 2. Compile this program through your chapter-1 compiler, run it, then check echo $?:

int main(void) { return 257; }

What does $? print? Is your compiler wrong?

ProblemThe exit-code surprise
Predict before revealing — and try a few other return values too (0, 255, 256, 511).

Where this is going

Chapter 1 is the smallest possible non-trivial C compiler. The next chapter adds unary operators (-x, !x, ~x), which forces us to introduce a real AST instead of collapsing the program to a single integer, and forces the codegen to emit more than three instructions. Every chapter from there expands one piece of the language; the architecture established here (lex / parse / codegen, three Python functions, a single compile_c entry point) does not change.