Unary operators
Compilers
Chapter 1's compiler accepted exactly one expression: an integer literal. Chapter 2 grows that to any unary expression — negation, bitwise NOT, logical NOT — applied to an integer or to another unary expression. Four small example programs:
int main(void) { return -42; }
int main(void) { return ~0; }
int main(void) { return !0; }
int main(void) { return -~!42; }
The grammar still has only one statement (return EXPR;),
but EXPR is no longer a single token. It can nest
arbitrarily deep — -~!42, ~~~~0,
!!!!!1. Two things we did not need in chapter 1
suddenly become necessary: a real AST to
represent the tree of operations, and recursion
in both the parser and the codegen to handle the nesting.
The AST
An abstract syntax tree is the parser's output, a structured representation of the program. For chapter 2 we need four kinds of node — one for integer literals and one for each of the three unary operators. The shape is naturally a tree, because operators nest:
Source: -~!42
AST: Neg
└─ BitNot
└─ Not
└─ IntLit 42 Python has many ways to represent tree nodes. We'll use plain tuples with a string tag in position 0, because tuples print legibly and don't require any class declarations. The first element is the node kind; the rest are its children:
# An integer literal:
('IntLit', 42)
# Unary operators wrap a child expression:
('Neg', ('IntLit', 42)) # -42
('BitNot', ('IntLit', 0)) # ~0
('Not', ('IntLit', 0)) # !0
# Operators nest, so the AST nests:
('Neg', ('BitNot', ('Not', ('IntLit', 42)))) # -~!42 Later chapters will outgrow tuples (when nodes have many fields, or when we want to attach types and source positions). At that point we'll switch to small Python classes. For now, tuples are the smallest possible thing that works.
The lexer
The lexer barely changes. The three new characters
(-, !, ~) are
single-character tokens with no prefix or suffix to worry about,
so the only edit is to add them to the punctuation set:
# Chapter 1:
PUNCTUATION = set('(){};')
# Chapter 2:
PUNCTUATION = set('(){};-!~')
The lexer doesn't care that these are operators rather than
"real" punctuation — they're all just one-character tokens at
this stage. The C standard actually calls all of these
punctuators; we'll keep the PUNCT tag.
When we add multi-character operators in later chapters
(==, &&, <=),
the lexer will need a new branch that peeks ahead — but the
one-character ones stay in this set.
The parser
Chapter 1's parser expected a single INT token
after return. Chapter 2 expects a whole expression,
which means a recursive helper. The grammar rule for an
expression is:
expr : INT
| "-" expr
| "!" expr
| "~" expr A unary operator's right-hand side is itself an expression, which is exactly why this rule is recursive. The parser function mirrors the rule one-to-one — peek at the next token, pick the matching branch, recurse if needed:
def parse_expr(peek, advance):
"""Parse one unary expression. Returns an AST node."""
t = peek()
if t == ('PUNCT', '-'):
advance()
return ('Neg', parse_expr(peek, advance))
if t == ('PUNCT', '!'):
advance()
return ('Not', parse_expr(peek, advance))
if t == ('PUNCT', '~'):
advance()
return ('BitNot', parse_expr(peek, advance))
if t and t[0] == 'INT':
advance()
return ('IntLit', t[1])
raise SyntaxError(f"Expected expression, got {t!r}")
Each branch consumes its operator token and recurses to parse
the operand. The integer branch is the base case — it
consumes the INT token and returns a leaf
IntLit node. Any other token is a parse error.
The function-level parser is mostly the same as chapter 1; it
just calls parse_expr() instead of expecting a
bare INT:
def parse(tokens):
"""Parse: int main(void) { return EXPR; } → AST node."""
pos = [0]
def peek():
return tokens[pos[0]] if pos[0] < len(tokens) else None
def advance():
pos[0] += 1
def expect(kind, val=None):
t = peek()
if t is None or t[0] != kind or (val is not None and t[1] != val):
raise SyntaxError(f"Expected {kind} {val!r}, got {t!r}")
advance()
return t
expect('KW', 'int')
expect('IDENT', 'main')
expect('PUNCT', '(')
expect('KW', 'void')
expect('PUNCT', ')')
expect('PUNCT', '{')
expect('KW', 'return')
expr = parse_expr(peek, advance)
expect('PUNCT', ';')
expect('PUNCT', '}')
return expr
Note that parse_expr here is defined inside
parse (in the full file at the bottom of the page)
so it can close over peek and advance.
Splitting it out, as shown above, makes the recursion easier
to read in isolation.
The code generator
Codegen is now recursive too — but the recursion follows a discipline that will hold for the rest of the series. We establish an invariant:
Every call to emit_expr(node) emits
instructions that leave the value of node in
%eax.
Given that invariant, each unary operator's codegen is trivial:
emit the child first (its value lands in %eax),
then emit one or three instructions that modify %eax
in place. The parent operator inherits the child's value
automatically — no explicit data flow needed:
def emit_expr(node):
"""Emit instructions that leave the value of `node` in %eax."""
kind = node[0]
if kind == 'IntLit':
return f' movl ${node[1]}, %eax\n'
if kind == 'Neg':
return emit_expr(node[1]) + ' negl %eax\n'
if kind == 'BitNot':
return emit_expr(node[1]) + ' notl %eax\n'
if kind == 'Not':
return emit_expr(node[1]) + (
' cmpl $0, %eax\n'
' sete %al\n'
' movzbl %al, %eax\n'
)
raise ValueError(f"Unknown node: {kind}")
def codegen(expr):
return (
' .globl main\n'
'main:\n'
+ emit_expr(expr)
+ ' ret\n'
)
The instruction choices: negl %eax is two's
complement negation. notl %eax is bitwise NOT.
Logical NOT (!) is the only one that needs more
than one instruction, because x86-64 has no "set to 1 if zero,
else 0" in a single op:
-
cmpl $0, %eax— compare%eaxwith 0; sets the zero flag if equal. -
sete %al— set the low byte of%eaxto 1 if the zero flag is set, else 0. -
movzbl %al, %eax— zero-extend that byte to fill all 32 bits of%eax.
The third instruction matters because sete only
writes one byte. Without the zero-extend, the upper 24 bits of
%eax would still hold whatever the comparison left
behind, and the parent operator would compute on garbage.
End-to-end walkthrough
Putting all three stages together for a non-trivial input:
Source: return -~!42;
AST: ('Neg', ('BitNot', ('Not', ('IntLit', 42))))
Codegen: emit_expr walks the tree post-order. Each call leaves
its result in %eax, and the parent operates on %eax.
.globl main
main:
movl $42, %eax # IntLit 42 → %eax = 42
cmpl $0, %eax # Not begins
sete %al # %al = (42 == 0) = 0
movzbl %al, %eax # %eax = 0
notl %eax # BitNot → %eax = 0xFFFFFFFF
negl %eax # Neg → %eax = 1
ret # main returns 1
Read the assembly bottom-up to see why postorder works:
negl %eax only makes sense if %eax
already holds the value of ~!42, which only makes
sense if the previous instructions correctly computed
!42 = 0 and then flipped its bits to get
0xFFFFFFFF. Each operator finds its operand
already in %eax because its child's codegen ran
first.
The full file
The whole compiler, with chapter 1's program still working:
"""tinyc.py — a C compiler, chapter 2.
Adds unary operators (-, !, ~). Compiles, e.g.:
int main(void) { return -~!42; }
into x86-64 assembly suitable for `gcc tinyc.s -o tinyc`.
"""
import sys
KEYWORDS = {'int', 'void', 'return'}
PUNCTUATION = set('(){};-!~')
# ---- lexer ---------------------------------------------------------
def lex(src):
tokens = []
i = 0
while i < len(src):
c = src[i]
if c.isspace():
i += 1
continue
if c == '/' and i + 1 < len(src) and src[i + 1] == '/':
while i < len(src) and src[i] != '\n':
i += 1
continue
if c in PUNCTUATION:
tokens.append(('PUNCT', c))
i += 1
continue
if c.isdigit():
j = i
while j < len(src) and src[j].isdigit():
j += 1
tokens.append(('INT', int(src[i:j])))
i = j
continue
if c.isalpha() or c == '_':
j = i
while j < len(src) and (src[j].isalnum() or src[j] == '_'):
j += 1
word = src[i:j]
kind = 'KW' if word in KEYWORDS else 'IDENT'
tokens.append((kind, word))
i = j
continue
raise SyntaxError(f"Unexpected character at {i}: {c!r}")
return tokens
# ---- parser --------------------------------------------------------
def parse(tokens):
pos = [0]
def peek():
return tokens[pos[0]] if pos[0] < len(tokens) else None
def advance():
pos[0] += 1
def expect(kind, val=None):
t = peek()
if t is None or t[0] != kind or (val is not None and t[1] != val):
raise SyntaxError(f"Expected {kind} {val!r}, got {t!r}")
advance()
return t
def parse_expr():
t = peek()
if t == ('PUNCT', '-'):
advance(); return ('Neg', parse_expr())
if t == ('PUNCT', '!'):
advance(); return ('Not', parse_expr())
if t == ('PUNCT', '~'):
advance(); return ('BitNot', parse_expr())
if t and t[0] == 'INT':
advance(); return ('IntLit', t[1])
raise SyntaxError(f"Expected expression, got {t!r}")
expect('KW', 'int')
expect('IDENT', 'main')
expect('PUNCT', '(')
expect('KW', 'void')
expect('PUNCT', ')')
expect('PUNCT', '{')
expect('KW', 'return')
expr = parse_expr()
expect('PUNCT', ';')
expect('PUNCT', '}')
return expr
# ---- code generator ------------------------------------------------
def emit_expr(node):
kind = node[0]
if kind == 'IntLit':
return f' movl ${node[1]}, %eax\n'
if kind == 'Neg':
return emit_expr(node[1]) + ' negl %eax\n'
if kind == 'BitNot':
return emit_expr(node[1]) + ' notl %eax\n'
if kind == 'Not':
return emit_expr(node[1]) + (
' cmpl $0, %eax\n'
' sete %al\n'
' movzbl %al, %eax\n'
)
raise ValueError(f"Unknown node: {kind}")
def codegen(expr):
return (
' .globl main\n'
'main:\n'
+ emit_expr(expr)
+ ' ret\n'
)
def compile_c(src):
return codegen(parse(lex(src)))
if __name__ == '__main__':
src = sys.stdin.read() if len(sys.argv) < 2 else open(sys.argv[1]).read()
sys.stdout.write(compile_c(src)) Run it
The pipe-and-assemble idiom from chapter 1 still works. Here we
skip writing the .s file by piping the assembly
straight into gcc -xassembler -, which tells gcc
"treat stdin as assembly source." (On Apple Silicon, prepend
the -arch x86_64 flag and remember the symbol
underscore — see chapter 1's platform note.)
$ echo 'int main(void) { return -42; }' | python tinyc.py | gcc -xassembler - -o t && ./t; echo $?
214
$ echo 'int main(void) { return ~0; }' | python tinyc.py | gcc -xassembler - -o t && ./t; echo $?
255
$ echo 'int main(void) { return !0; }' | python tinyc.py | gcc -xassembler - -o t && ./t; echo $?
1
$ echo 'int main(void) { return -~!42; }' | python tinyc.py | gcc -xassembler - -o t && ./t; echo $?
1
$ echo 'int main(void) { return 42; }' | python tinyc.py | gcc -xassembler - -o t && ./t; echo $?
42
A few things to notice: -42 shows up as
214 in $? because the shell's
8-bit exit code is unsigned (256 - 42 = 214). ~0
is 0xFFFFFFFF, which truncates to 0xFF =
255. !0 is 1, as expected. -~!42 is 1
(we walked through why above). And return 42; from
chapter 1 still works — the new code is strictly additive.
Exercises
Chapter 2 is the chapter where the architecture becomes load
bearing. The exercises lean on understanding the AST and the
"value lives in %eax" invariant, because that's
what every later chapter will reuse.
The codegen invariant says every emit_expr call leaves the value in %eax. What does that buy us when we add binary operators next chapter?
It gives us a place to stash partial results without having
to reason about what registers are free. For
a + b, the codegen will emit a
into %eax, push %eax onto the
stack, emit b into %eax, then pop
the saved a into a second register and add. The
generated code is mechanical: each operator reads its inputs
from a known location (%eax for the most-recent
result, the stack for previously-saved ones) and writes its
output to %eax.
This is the standard "stack machine on top of registers" model. It's not the fastest possible code — a real compiler uses register allocation to avoid spilling to the stack — but it's correct, simple, and supports arbitrarily nested expressions. We'll keep this discipline through every chapter that adds new operators.
Why does logical NOT (`!`) need three instructions when `-` and `~` need only one?
Because - and ~ are bit-level
operations the hardware natively supports —
negl computes the two's complement,
notl flips every bit, both produce a 32-bit
result that is the new %eax. Logical NOT is
different: it asks "is this value zero?" and produces 0 or 1.
That's a comparison plus a conditional materialisation, not a
bit operation.
x86-64 splits comparisons across instructions: one to set
flags (cmpl), one to read a flag into a byte
(sete), one to widen the byte to a full register
(movzbl). The split is a hardware design choice;
ARM's equivalents pack more into one instruction. Either
way, "produce a boolean from a comparison" is fundamentally
multi-step work and our codegen has to reflect that.
Why use tuples for AST nodes instead of Python classes?
Brevity and printability, while the AST is small. A tuple
like ('Neg', ('IntLit', 42)) is one line, prints
legibly when you debug-print it, and requires no class
definitions. For four node kinds with one or zero children
each, a class hierarchy would be more code than the data it
describes.
Tuples will get awkward around chapter 4 or 5, when nodes start carrying multiple fields (left and right children for binary ops, name and arguments for function calls, type and initializer for variable declarations). At that point we'll switch to small dataclasses. The point of starting with tuples is to keep the AST chapter from being mostly about Python's class system instead of about parsing.
Problem 1. What integer does this program return after going through your chapter-2 compiler?
int main(void) { return ~~~~5; } Predict from the source first, then check by tracing the AST and the generated assembly.
It returns 5. ~ is its own
inverse — flipping every bit twice gets you back to the
original value, and four ~'s is two pairs.
The AST is ('BitNot', ('BitNot', ('BitNot', ('BitNot',
('IntLit', 5))))). The codegen emits
movl $5, %eax followed by four
notl %eax instructions. 5 in 32-bit
two's complement is 0x00000005; one
notl produces 0xFFFFFFFA; another
produces 0x00000005 again. After four flips:
back to 5.
$? shows 5, since 5 fits in the low byte.
Problem 2. ~~x always equals
x. Does !!x always equal
x? Compile both for several different inputs and
explain the pattern.
int main(void) { return !!7; }
int main(void) { return !!1; }
int main(void) { return !!0; } !!7 = 1, !!1 = 1, !!0 = 0.
!!x equals x only when x
is already 0 or 1; for any other value it collapses to 1.
The reason: ! is lossy. It throws away every bit
except "was this value zero?". The first !
collapses any nonzero x to 0, then the second
! turns 0 back into 1 — but the original value
is gone. !! is the canonical "convert anything
to a 0-or-1 boolean" idiom in C precisely because of this
collapsing.
~, by contrast, is bijective: every bit is
independent and reversible. ~~x always recovers
the original value because no information is lost. The
difference between an information-preserving operator and an
information-destroying one shows up clearly when you compose
them with their own "inverse."
Where this is going
Chapter 3 adds binary operators —
+ - * / — which forces three new things: operator
precedence (so 1 + 2 * 3 parses as
1 + (2 * 3)), the stack-spill technique for
holding partial results, and the idiv instruction
with its weird sign-extension setup. The recursion-and-invariant
architecture from this chapter carries over unchanged; only the
grammar and the per-operator codegen grow.