Unary operators

Compilers

Chapter 1's compiler accepted exactly one expression: an integer literal. Chapter 2 grows that to any unary expression — negation, bitwise NOT, logical NOT — applied to an integer or to another unary expression. Four small example programs:

int main(void) { return -42; }
int main(void) { return ~0;  }
int main(void) { return !0;  }
int main(void) { return -~!42; }

The grammar still has only one statement (return EXPR;), but EXPR is no longer a single token. It can nest arbitrarily deep — -~!42, ~~~~0, !!!!!1. Two things we did not need in chapter 1 suddenly become necessary: a real AST to represent the tree of operations, and recursion in both the parser and the codegen to handle the nesting.


The AST

An abstract syntax tree is the parser's output, a structured representation of the program. For chapter 2 we need four kinds of node — one for integer literals and one for each of the three unary operators. The shape is naturally a tree, because operators nest:

Source:    -~!42

AST:       Neg
            └─ BitNot
                └─ Not
                    └─ IntLit 42

Python has many ways to represent tree nodes. We'll use plain tuples with a string tag in position 0, because tuples print legibly and don't require any class declarations. The first element is the node kind; the rest are its children:

# An integer literal:
('IntLit', 42)

# Unary operators wrap a child expression:
('Neg',    ('IntLit', 42))                     # -42
('BitNot', ('IntLit', 0))                      # ~0
('Not',    ('IntLit', 0))                      # !0

# Operators nest, so the AST nests:
('Neg', ('BitNot', ('Not', ('IntLit', 42))))   # -~!42

Later chapters will outgrow tuples (when nodes have many fields, or when we want to attach types and source positions). At that point we'll switch to small Python classes. For now, tuples are the smallest possible thing that works.


The lexer

The lexer barely changes. The three new characters (-, !, ~) are single-character tokens with no prefix or suffix to worry about, so the only edit is to add them to the punctuation set:

# Chapter 1:
PUNCTUATION = set('(){};')

# Chapter 2:
PUNCTUATION = set('(){};-!~')

The lexer doesn't care that these are operators rather than "real" punctuation — they're all just one-character tokens at this stage. The C standard actually calls all of these punctuators; we'll keep the PUNCT tag. When we add multi-character operators in later chapters (==, &&, <=), the lexer will need a new branch that peeks ahead — but the one-character ones stay in this set.


The parser

Chapter 1's parser expected a single INT token after return. Chapter 2 expects a whole expression, which means a recursive helper. The grammar rule for an expression is:

expr : INT
     | "-" expr
     | "!" expr
     | "~" expr

A unary operator's right-hand side is itself an expression, which is exactly why this rule is recursive. The parser function mirrors the rule one-to-one — peek at the next token, pick the matching branch, recurse if needed:

def parse_expr(peek, advance):
    """Parse one unary expression. Returns an AST node."""
    t = peek()
    if t == ('PUNCT', '-'):
        advance()
        return ('Neg', parse_expr(peek, advance))
    if t == ('PUNCT', '!'):
        advance()
        return ('Not', parse_expr(peek, advance))
    if t == ('PUNCT', '~'):
        advance()
        return ('BitNot', parse_expr(peek, advance))
    if t and t[0] == 'INT':
        advance()
        return ('IntLit', t[1])
    raise SyntaxError(f"Expected expression, got {t!r}")

Each branch consumes its operator token and recurses to parse the operand. The integer branch is the base case — it consumes the INT token and returns a leaf IntLit node. Any other token is a parse error.

The function-level parser is mostly the same as chapter 1; it just calls parse_expr() instead of expecting a bare INT:

def parse(tokens):
    """Parse: int main(void) { return EXPR; }  →  AST node."""
    pos = [0]
    def peek():
        return tokens[pos[0]] if pos[0] < len(tokens) else None
    def advance():
        pos[0] += 1
    def expect(kind, val=None):
        t = peek()
        if t is None or t[0] != kind or (val is not None and t[1] != val):
            raise SyntaxError(f"Expected {kind} {val!r}, got {t!r}")
        advance()
        return t

    expect('KW', 'int')
    expect('IDENT', 'main')
    expect('PUNCT', '(')
    expect('KW', 'void')
    expect('PUNCT', ')')
    expect('PUNCT', '{')
    expect('KW', 'return')
    expr = parse_expr(peek, advance)
    expect('PUNCT', ';')
    expect('PUNCT', '}')
    return expr

Note that parse_expr here is defined inside parse (in the full file at the bottom of the page) so it can close over peek and advance. Splitting it out, as shown above, makes the recursion easier to read in isolation.


The code generator

Codegen is now recursive too — but the recursion follows a discipline that will hold for the rest of the series. We establish an invariant:

Every call to emit_expr(node) emits instructions that leave the value of node in %eax.

Given that invariant, each unary operator's codegen is trivial: emit the child first (its value lands in %eax), then emit one or three instructions that modify %eax in place. The parent operator inherits the child's value automatically — no explicit data flow needed:

def emit_expr(node):
    """Emit instructions that leave the value of `node` in %eax."""
    kind = node[0]
    if kind == 'IntLit':
        return f'    movl ${node[1]}, %eax\n'
    if kind == 'Neg':
        return emit_expr(node[1]) + '    negl %eax\n'
    if kind == 'BitNot':
        return emit_expr(node[1]) + '    notl %eax\n'
    if kind == 'Not':
        return emit_expr(node[1]) + (
            '    cmpl $0, %eax\n'
            '    sete %al\n'
            '    movzbl %al, %eax\n'
        )
    raise ValueError(f"Unknown node: {kind}")


def codegen(expr):
    return (
        '    .globl main\n'
        'main:\n'
        + emit_expr(expr)
        + '    ret\n'
    )

The instruction choices: negl %eax is two's complement negation. notl %eax is bitwise NOT. Logical NOT (!) is the only one that needs more than one instruction, because x86-64 has no "set to 1 if zero, else 0" in a single op:

The third instruction matters because sete only writes one byte. Without the zero-extend, the upper 24 bits of %eax would still hold whatever the comparison left behind, and the parent operator would compute on garbage.


End-to-end walkthrough

Putting all three stages together for a non-trivial input:

Source:    return -~!42;

AST:       ('Neg', ('BitNot', ('Not', ('IntLit', 42))))

Codegen:   emit_expr walks the tree post-order. Each call leaves
           its result in %eax, and the parent operates on %eax.

    .globl main
main:
    movl $42, %eax        # IntLit 42      → %eax = 42
    cmpl $0, %eax         # Not begins
    sete %al              # %al = (42 == 0) = 0
    movzbl %al, %eax      # %eax = 0
    notl %eax             # BitNot         → %eax = 0xFFFFFFFF
    negl %eax             # Neg            → %eax = 1
    ret                   # main returns 1

Read the assembly bottom-up to see why postorder works: negl %eax only makes sense if %eax already holds the value of ~!42, which only makes sense if the previous instructions correctly computed !42 = 0 and then flipped its bits to get 0xFFFFFFFF. Each operator finds its operand already in %eax because its child's codegen ran first.


The full file

The whole compiler, with chapter 1's program still working:

"""tinyc.py — a C compiler, chapter 2.

Adds unary operators (-, !, ~). Compiles, e.g.:

    int main(void) { return -~!42; }

into x86-64 assembly suitable for `gcc tinyc.s -o tinyc`.
"""

import sys


KEYWORDS = {'int', 'void', 'return'}
PUNCTUATION = set('(){};-!~')


# ---- lexer ---------------------------------------------------------

def lex(src):
    tokens = []
    i = 0
    while i < len(src):
        c = src[i]

        if c.isspace():
            i += 1
            continue

        if c == '/' and i + 1 < len(src) and src[i + 1] == '/':
            while i < len(src) and src[i] != '\n':
                i += 1
            continue

        if c in PUNCTUATION:
            tokens.append(('PUNCT', c))
            i += 1
            continue

        if c.isdigit():
            j = i
            while j < len(src) and src[j].isdigit():
                j += 1
            tokens.append(('INT', int(src[i:j])))
            i = j
            continue

        if c.isalpha() or c == '_':
            j = i
            while j < len(src) and (src[j].isalnum() or src[j] == '_'):
                j += 1
            word = src[i:j]
            kind = 'KW' if word in KEYWORDS else 'IDENT'
            tokens.append((kind, word))
            i = j
            continue

        raise SyntaxError(f"Unexpected character at {i}: {c!r}")

    return tokens


# ---- parser --------------------------------------------------------

def parse(tokens):
    pos = [0]
    def peek():
        return tokens[pos[0]] if pos[0] < len(tokens) else None
    def advance():
        pos[0] += 1
    def expect(kind, val=None):
        t = peek()
        if t is None or t[0] != kind or (val is not None and t[1] != val):
            raise SyntaxError(f"Expected {kind} {val!r}, got {t!r}")
        advance()
        return t

    def parse_expr():
        t = peek()
        if t == ('PUNCT', '-'):
            advance(); return ('Neg', parse_expr())
        if t == ('PUNCT', '!'):
            advance(); return ('Not', parse_expr())
        if t == ('PUNCT', '~'):
            advance(); return ('BitNot', parse_expr())
        if t and t[0] == 'INT':
            advance(); return ('IntLit', t[1])
        raise SyntaxError(f"Expected expression, got {t!r}")

    expect('KW', 'int')
    expect('IDENT', 'main')
    expect('PUNCT', '(')
    expect('KW', 'void')
    expect('PUNCT', ')')
    expect('PUNCT', '{')
    expect('KW', 'return')
    expr = parse_expr()
    expect('PUNCT', ';')
    expect('PUNCT', '}')
    return expr


# ---- code generator ------------------------------------------------

def emit_expr(node):
    kind = node[0]
    if kind == 'IntLit':
        return f'    movl ${node[1]}, %eax\n'
    if kind == 'Neg':
        return emit_expr(node[1]) + '    negl %eax\n'
    if kind == 'BitNot':
        return emit_expr(node[1]) + '    notl %eax\n'
    if kind == 'Not':
        return emit_expr(node[1]) + (
            '    cmpl $0, %eax\n'
            '    sete %al\n'
            '    movzbl %al, %eax\n'
        )
    raise ValueError(f"Unknown node: {kind}")


def codegen(expr):
    return (
        '    .globl main\n'
        'main:\n'
        + emit_expr(expr)
        + '    ret\n'
    )


def compile_c(src):
    return codegen(parse(lex(src)))


if __name__ == '__main__':
    src = sys.stdin.read() if len(sys.argv) < 2 else open(sys.argv[1]).read()
    sys.stdout.write(compile_c(src))

Run it

The pipe-and-assemble idiom from chapter 1 still works. Here we skip writing the .s file by piping the assembly straight into gcc -xassembler -, which tells gcc "treat stdin as assembly source." (On Apple Silicon, prepend the -arch x86_64 flag and remember the symbol underscore — see chapter 1's platform note.)

$ echo 'int main(void) { return -42; }'   | python tinyc.py | gcc -xassembler - -o t && ./t; echo $?
214
$ echo 'int main(void) { return ~0; }'     | python tinyc.py | gcc -xassembler - -o t && ./t; echo $?
255
$ echo 'int main(void) { return !0; }'     | python tinyc.py | gcc -xassembler - -o t && ./t; echo $?
1
$ echo 'int main(void) { return -~!42; }'  | python tinyc.py | gcc -xassembler - -o t && ./t; echo $?
1
$ echo 'int main(void) { return 42; }'     | python tinyc.py | gcc -xassembler - -o t && ./t; echo $?
42

A few things to notice: -42 shows up as 214 in $? because the shell's 8-bit exit code is unsigned (256 - 42 = 214). ~0 is 0xFFFFFFFF, which truncates to 0xFF = 255. !0 is 1, as expected. -~!42 is 1 (we walked through why above). And return 42; from chapter 1 still works — the new code is strictly additive.


Exercises

Chapter 2 is the chapter where the architecture becomes load bearing. The exercises lean on understanding the AST and the "value lives in %eax" invariant, because that's what every later chapter will reuse.

Concept

The codegen invariant says every emit_expr call leaves the value in %eax. What does that buy us when we add binary operators next chapter?

Concept

Why does logical NOT (`!`) need three instructions when `-` and `~` need only one?

Concept

Why use tuples for AST nodes instead of Python classes?

Problem 1. What integer does this program return after going through your chapter-2 compiler?

int main(void) { return ~~~~5; }

Predict from the source first, then check by tracing the AST and the generated assembly.

ProblemCancellation under bitwise NOT
Reveal once you've worked it out by hand.

Problem 2. ~~x always equals x. Does !!x always equal x? Compile both for several different inputs and explain the pattern.

int main(void) { return !!7; }
int main(void) { return !!1; }
int main(void) { return !!0; }
Problem!! is not the identity
Predict each $? before running, then explain why !! and ~~ behave differently.

Where this is going

Chapter 3 adds binary operators — + - * / — which forces three new things: operator precedence (so 1 + 2 * 3 parses as 1 + (2 * 3)), the stack-spill technique for holding partial results, and the idiv instruction with its weird sign-extension setup. The recursion-and-invariant architecture from this chapter carries over unchanged; only the grammar and the per-operator codegen grow.