Arithmetic Units

Gate-Level CPU Simulator

The CPU has to do arithmetic — addition, subtraction, the bitwise operations — and we've committed to building all of it from boolean gates. No + shortcuts. So we need to figure out how to make a wire-level circuit that, given two binary numbers, produces their sum. Once we can add, subtraction comes free (two's complement). Once we have add, subtract, AND, OR, XOR, we have the ALU.

This series builds it bottom-up: half-adder → full-adder → ripple-carry adder → subtraction → ALU. Each step is a small composition of the previous one. Five short pages instead of one long one — read in order or jump to whichever piece you need.

The five pieces

Half-adder — XOR for the sum, AND for the carry. The atomic unit. Adds two single bits, produces a 2-bit result.
Full-adder — two half-adders plus an OR. Accepts an incoming carry, so it can handle any column of multi-bit addition.
Ripple-carry adder — chain $n$ full-adders end-to-end. The carry walks through the chain. $O (n)$ gate-delay critical path.
Subtraction — invert $B$ , set cin to 1, reuse the adder. Two's complement gives us subtraction without a new circuit.
ALU — every operation runs in parallel; a multiplexer picks one. No branching at the gate level, just gating.

The necessarily-true facts

The half-adder is exactly one XOR and one AND. Any other 2-input adder built from gates with the same truth table is logically equivalent (different gate counts, same outputs).
The full-adder needs three input bits and produces two output bits. A sum involving an incoming carry cannot be computed from fewer than three inputs without losing information.
An $n$ -bit ripple-carry adder uses $5 n$ gates and has a critical path of $O (n)$ gate delays. The gate count is linear, but so is the delay — that's the tradeoff name "ripple" refers to.
Subtraction reuses the adder. XOR each bit of $B$ with the sub control and feed sub in as the bit-0 carry. No second adder, no new arithmetic logic — just $n$ XOR gates on the B-side.
The ALU evaluates every operation every cycle. Hardware doesn't branch; it computes all results in parallel and multiplexes one out. The cost of an unused operation is the gates it occupies, not the time it takes.
Overflow is the carry-out of bit $n - 1$ for unsigned arithmetic; for signed, it's the XOR of the bit- $n - 1$ carry-in and carry-out. The same hardware detects both — what differs is which wire the control unit reads.

Insights

Patterns that recur past arithmetic — observations about what generalizes whenever a system hits the same constraint. Distinct from the facts above: those are mechanically true; these are why the architecture has the shape it does.

I. When you can't compress time, you can only widen work.

The single-cycle clock period bottoms out at gate-delay × longest-path; below that wall, no instruction goes faster. Past it, the only available knob is what can run independently of what. The ALU is the answer at the smallest scale: every operation runs every cycle, a multiplexer gates one through, and the unused operations are the price of parallelism — paid in silicon rather than time.

II. The pattern recurs at every scale where a system hits a latency wall.

ILP issues independent operations in the same cycle. SIMD applies one operation to many elements at once. GPUs scale this to thousands of cores. Distributed systems scale it across machines. Biology arrived at the same answer long before any of these — neurons fire in parallel because there is no central clock.

III. The cost of computation never disappears, it only moves.

Single-cycle pays for parallelism in silicon: every gate exists every cycle whether it's used or not. Pipelined designs trade gate area for control complexity — hazard logic, forwarding paths, stall machinery. Architects don't choose whether to pay; they choose where.

Common questions

Why build a full-adder out of two half-adders instead of from a single 8-row truth table directly?

You can do it the truth-table way (a 3-input gate with an 8-bit truth table for sum, another for cout). Same function, same outputs. But the half-adder construction is modular: half-adders are useful elsewhere, and composing them this way means you only have to verify the half-adder once and trust the rest by composition. It also makes the gate count obvious (5 gates per full-adder) without needing a Karnaugh map.

Why is ripple-carry an $O (n)$ critical path and not $O (1)$ ?

Because the carry-out of bit $i$ can't be computed until bit $i$ 's full-adder has actually seen its carry-in, which only arrives after bit $i - 1$ finishes, which only arrives after bit $i - 2$ finishes, and so on. Each bit's carry depends on every lower bit's carry. Carry-lookahead breaks this dependency by computing "generate" and "propagate" signals up front in parallel and then summing them in $O (lo g n)$ depth.

If the ALU computes every operation in parallel, isn't that wasteful?

In gates, yes — the silicon is there whether or not you use it. But the alternative (a sequential ALU that picks one operation, runs it, picks the next) is much slower per cycle, because hardware sequencing is itself expensive. For a single-cycle CPU, the only knob you have is how much you can do in one cycle, and parallel evaluation is how you maximize it. Power-conscious designs use clock gating to silence the unused branches, but the gates exist regardless.

Why does a single XOR + cin trick handle both add and subtract?

Because two's complement encodes negation as "invert and add 1." Inverting $B$ can be done with XOR-by-1 (XOR with 0 leaves it alone, XOR with 1 inverts it), so a row of XORs with one input wired to sub conditionally inverts. The "+1" is wired into the carry-in slot, which is otherwise unused for plain addition. So the same circuit is "add" when sub = 0 and "subtract" when sub = 1, with no extra arithmetic — just a single shared control wire.