Performance Analysis

Gate-Level CPU Simulator

Look — at the gate level, "performance" doesn't mean GHz. We don't have a real clock yet. What we have is gates and wires. So when you ask "is this CPU fast?" the question splits into two: how much hardware is it (gate count, area, cost), and how deep is the longest chain of gates a signal has to walk through before the cycle ends (the critical path). Together with the cycle model, those two numbers tell you everything about what this machine can do per second.

This page works through both — gate count, then critical path, then the two-phase clock that ties them together, then why ripple- carry is the bottleneck, and finally what pipelining buys you and what it doesn't. Concrete numbers throughout, because you can't reason about performance with hand-waves.

Piece 1 — What we're actually measuring

Two metrics, and they fight each other. Gate count is total area: how much silicon, how much cost, how much power. Critical path depth is the longest combinational chain — measured in gate delays, where one gate delay is the time it takes a single gate's output to settle after its inputs change. The clock period of the CPU is bounded below by that longest chain. Faster than that and the next flip-flop latches garbage.

Now — here's the thing. You can almost always trade one for the other. Want a shorter critical path? Spend more gates (carry- lookahead, wider muxes, parallel decoders). Want fewer gates? Accept a longer path (ripple-carry, sequential decoders, narrower datapaths). Every interesting choice in CPU design is a point on that tradeoff curve. The single-cycle 32-bit RISC we built sits pretty far toward "few gates, slow clock" — which is fine, because we're optimizing for understandability, not speed.

Piece 2 — Gate count: where the silicon goes

Let's add it up. From the arithmetic page, a 32-bit ripple-carry adder is 5 gates per full-adder times 32 bits — 160 gates. The ALU's bitwise units (AND, OR, XOR, NOR rows) add another 128. The output multiplexer plus the add/subtract control logic is roughly 200 more. Call the whole ALU about 500 gates.

The register file is where the budget actually goes. Thirty-two registers, 32 bits each — that's 1024 flip-flops, and each flip- flop is itself 4–6 gates if you build it from NANDs in a cross- coupled latch. Plus the read-port decoders (turning a 5-bit register address into a 32-line one-hot select) and the write-port enable logic. The register file alone is comfortably over 6000 gates.

Control unit? Small. A few hundred gates of decode logic that turns the opcode bits into the control signals the datapath needs. Sign-extend is literally free — it's just wires that fan out the sign bit. PC increment and branch mux is maybe 50 gates.

# Rough gate budget for the 32-bit single-cycle RISC.
# These are order-of-magnitude estimates, not exact synthesis counts.

ADDER_GATES        = 5 * 32          # 5 gates per full-adder, 32 bits  -> 160
ALU_BITWISE_GATES  = 32 * 4          # AND/OR/XOR/NOR rows, 32 bits     -> 128
ALU_MUX_GATES      = 200             # output mux + sub/zero logic      ~ 200
REGFILE_BITS       = 32 * 32         # 32 registers, 32 bits each       -> 1024
REGFILE_PORTS      = 600             # two read ports + one write port  ~ 600
CONTROL_UNIT       = 100             # decode for a small ISA           ~ 100
SIGN_EXTEND        = 0               # just wires, no gates
PC_LOGIC           = 50              # increment + branch mux           ~ 50

# Flip-flops cost ~6 gates each in a NAND-only count:
REGFILE_FF_GATES   = REGFILE_BITS * 6   # -> 6144

total = (ADDER_GATES + ALU_BITWISE_GATES + ALU_MUX_GATES
         + REGFILE_FF_GATES + REGFILE_PORTS
         + CONTROL_UNIT + PC_LOGIC)
# total ~ 7400 gates. Low thousands once you don't double-count flip-flops.

Big picture: a single-cycle 32-bit RISC built from gates lands somewhere in the low thousands of gates if you don't count flip- flop internals, and around 7–10 thousand if you do. Modern CPUs are nine orders of magnitude bigger. But the shape of the budget is the same: register file dominates, ALU is medium, control is tiny.

Piece 3 — The critical path

The critical path is the longest combinational chain from one flip-flop to the next. In single-cycle, every instruction has to complete in one cycle, so the worst-case path through the entire datapath is what bounds the clock.

Walk it with me. The cycle starts when the PC's flip-flop updates. From there: the PC drives the instruction memory address, and we wait for the instruction to come out — call that 4 gate delays. Then the opcode fans into the control unit, 3 more gate delays of decode. Then the register file: 5-to-32 decode plus a 32-to-1 read mux, roughly 4 gate delays. Sign-extend is free — just wires. Now the operands hit the ALU.

The ALU is where the cycle gets eaten. A 32-bit ripple-carry adder is 32 stages and each stage adds about 2 gate delays to the carry chain. That's $32 im es 2 = 64$ gate delays in the worst case — for one addition. After the ALU comes data memory (about 6 gate delays for a load), then the writeback mux (2 more), and finally we latch the result back into the register file.

# Critical path of the single-cycle CPU, in gate delays.
# Every signal that flows from the rising edge to the next must
# traverse SOME path; the worst one bounds the cycle.

t_imem        = 4    # instruction memory read (small ROM)
t_decode      = 3    # control unit: a few levels of AND/OR
t_regread     = 4    # register file read port (decoder + mux)
t_signext     = 0    # pure wiring
t_alu_ripple  = 64   # 32-bit ripple-carry: 32 stages * 2 gate delays
t_dmem        = 6    # data memory read (slightly bigger than imem)
t_wb_mux      = 2    # writeback mux: ALU result vs memory result

T_cycle = (t_imem + t_decode + t_regread + t_signext
           + t_alu_ripple + t_dmem + t_wb_mux)
# T_cycle ~ 83 gate delays. The ALU alone is ~77% of that.

Total: roughly 83 gate delays. The ALU alone — really, just the ripple-carry chain inside it — is 77% of the cycle. Everything else is rounding error. If you want this CPU to go faster, the only thing worth attacking is that adder.

Piece 4 — The two-phase clock

Here's a thing that confuses people. We talk about "the cycle" like it's one indivisible tick, but inside one cycle there are really two phases happening, and the design depends on keeping them separate.

Phase 1: combinational settle. The flip-flops just updated at the rising edge. Their outputs flow into the combinational logic — gates, muxes, the ALU — and we wait. The signals propagate, the carries ripple, the muxes select. After enough time (the critical path delay), every wire in the combinational cloud has stopped changing. Phase 2: sequential latch. The next rising edge arrives and the flip-flops sample whatever the combinational logic produced, locking in the new state.

Why split it? Because if a flip-flop's output were also feeding its own input through some combinational path, and the flip-flop were transparent (let signals through whenever the clock is high), then the new output would race back into the same flip-flop and you'd get a feedback storm — outputs oscillating, glitches, nothing settling. The two-phase model — flip-flops only sample at the edge, combinational logic only runs between edges — guarantees that the input a flip-flop sees has been stable long enough to be trustworthy. This is why the cycle period must be at least the critical path: anything shorter and the combinational hasn't finished by the time the next edge hits.

Piece 5 — Why ripple-carry caps the clock

Ripple-carry dominates because it's the longest serial dependency in the whole machine. Bit 0's full-adder finishes, hands its carry to bit 1, which finishes and hands its carry to bit 2, and so on. Each step is fast — 2 gate delays — but you can't start step $i$ until step $i - 1$ is done. Thirty-two steps, no parallelism, 64 gate delays.

The fix is carry-lookahead. The trick: for each bit, you can compute two cheap signals from the inputs alone — "generate" (this bit forces a carry no matter what came in) and "propagate" (this bit will pass an incoming carry through). Both are computable in parallel for all 32 bits in one or two gate delays. Then a tree of combiners works out the actual carries in $O (l o g n)$ depth, not $O (n)$ . For 32 bits, that's about 5 gate-levels of tree on top of the 1–2 of generate/ propagate — roughly 6 gate delays total instead of 64.

# Why ripple-carry dominates: each bit waits on the one below it.

def ripple_delay(n, t_fa=2):
    """n-bit ripple-carry: every full-adder adds t_fa gate delays
    to the carry chain."""
    return n * t_fa

def cla_delay(n, t_pg=1, t_tree=1):
    """Carry-lookahead: generate/propagate happen in parallel,
    then a log-depth tree of carry combiners."""
    import math
    return t_pg + t_tree * math.ceil(math.log2(n))

# For n = 32:
#   ripple_delay(32) = 64 gate delays
#   cla_delay(32)    = 6  gate delays
# More than 10x faster -- at the cost of ~3-4x more gates.

Ten times faster. The catch: carry-lookahead uses 3–4× the gates of ripple-carry. Classic area/speed tradeoff. We use ripple-carry in this build because it's pedagogically clean and the gate count matters more than the cycle time when the goal is to understand the machine. If you wanted to push the clock, swapping in CLA is the single biggest win — see the roadmap for where this lives in the future-work list.

Piece 6 — The cycle-time tradeoff and why pipelining helps

In a single-cycle CPU, every instruction takes one cycle. Sounds fine — until you realize the cycle period is set by the slowest instruction. A load has to do imem + decode + regread + ALU + dmem + wb. An add only does imem + decode + regread + ALU + wb. But the clock period is fixed; the add gets the same long cycle as the load, idling through the data-memory stage. The fast operations subsidise the slow ones.

Pipelining is the escape valve. You chop the datapath into stages — classically IF / ID / EX / MEM / WB — with flip-flops between them. Each stage runs concurrently, on a different instruction. The cycle time becomes the slowest stage, not the slowest whole instruction. If the stages are roughly balanced, you get nearly a 5× throughput win on a 5-stage pipeline.

# Single-cycle vs pipelined timing, conceptually.

# Single-cycle: T_cycle = sum of every stage's delay.
T_single = t_imem + t_decode + t_regread + t_alu + t_dmem + t_wb
# Throughput: 1 instruction per T_single.

# 5-stage pipeline (IF / ID / EX / MEM / WB):
T_pipe = max(t_imem, t_decode, t_regread, t_alu, t_dmem, t_wb) + t_reg
# t_reg is the small extra delay of the inter-stage flip-flops.
# Throughput: 1 instruction per T_pipe (after the pipe fills).

# Each instruction still takes ~5 * T_pipe to traverse the whole pipe.
# What changed: at any moment, FIVE instructions are in flight.

Now — pipelining doesn't make a single instruction faster. The add still takes 5 cycles to walk through the pipe. What changed is that 5 instructions are in flight at once, so the throughput is one instruction per pipe-cycle even though latency is still 5. There's also a small per-stage overhead from the pipeline registers (the flip-flops between stages) — call it $t_{e x t re g}$ — but that's tiny compared to the savings.

And no — you can't just bolt registers onto a single-cycle CPU and call it pipelined. The stages have to be balanced (nobody wants the EX stage to take 64 gate delays while IF takes 4), the forwarding paths have to handle data hazards, branches need prediction or stalls. Pipelining is a real architectural rework. It's the topic of the next page in the broader story; here we just need to know that it exists, what it buys, and what it doesn't.

The necessarily-true facts

The clock period is bounded below by the longest combinational chain between flip-flops. If you clock faster than that, sequential elements latch unsettled values and the machine produces wrong results. No exceptions.
In a single-cycle CPU, the slowest instruction sets the cycle for all instructions. Add, load, branch — they all share the same clock. The fast ones spend part of the cycle idle.
Carry-lookahead replaces $O (n)$ carry depth with $O (l o g n)$ at the cost of more gates. For 32 bits, ~64 gate delays becomes ~6, with roughly 3–4× the gate count of ripple-carry.
The two-phase clock prevents flip-flop output from racing into its own input. Combinational settles between edges; sequential latches at the edge. Without this separation, feedback paths would oscillate.
Pipelining raises throughput without lowering per- cycle work. The cycle becomes the slowest stage rather than the slowest whole instruction. Latency per instruction stays the same (or grows slightly from pipeline-register overhead); throughput increases by roughly the number of stages.
Gate count and critical path trade off in opposite directions on most design knobs. Wider parallel hardware shortens the path and raises area; narrower serial hardware lengthens the path and lowers area. Carry-lookahead vs ripple-carry is the cleanest example.
The register file dominates the gate budget. 32 × 32 = 1024 flip-flops plus port logic puts it well above everything else combined. The ALU is the second-largest, and control is small.

Common questions

Why doesn't pipelining make individual instructions faster?

Because the same logic gates have to run on the instruction's data — fetch the bits, decode the opcode, run the ALU, hit memory, write back. Cutting the work into stages doesn't reduce the work; it just lets a different instruction use the next stage while this one moves on. An individual instruction still walks through every stage, so its latency is basically unchanged (slightly worse, actually, because of the pipeline-register delay between stages). What pipelining buys is throughput: at any instant, N instructions are in flight, so the machine retires one per pipe-cycle instead of one per whole- instruction-cycle.

Why two-phase clock instead of one?

Because flip-flops aren't magical — they need their input to be stable for some setup time before the clock edge, and to stay stable for some hold time after. If your clocking scheme let combinational logic and sequential latching happen at the same time, an output could race through a feedback loop and arrive at a flip-flop's input during its own latch window. The two-phase model — settle, then latch, settle, then latch — guarantees the input is stable when the edge hits. In practice modern designs use edge-triggered flip-flops which collapse this into a single rising edge, but conceptually it's still two phases: a "compute" interval between edges and a "capture" instant at the edge.

Can you pipeline a single-cycle CPU just by adding registers everywhere?

No, not cleanly. Adding registers between stages is necessary but not sufficient. First, the stages have to be balanced — if your EX stage is 64 gate delays and your IF stage is 4, the cycle is still 64 and the IF stage idles for 94% of it. You'd need to break the ALU itself into substages, which is an architectural job. Second, instructions interfere with each other when pipelined: an add that depends on the result of a load two instructions back has a "data hazard," and handling it requires forwarding paths or pipeline stalls. Branches are even worse — you've fetched the wrong instruction before you know whether the branch is taken. Real pipelining means rebuilding the datapath, the control, and the hazard logic. It's not a transparent retrofit.

How does carry-lookahead actually win?

By computing the carries in parallel instead of in sequence. For each bit position you precompute "generate" ( $g_{i} = a_{i} c d o t b_{i}$ — this bit forces a carry) and "propagate" ( $p_{i} = a_{i} o pl u s b_{i}$ — this bit will pass an incoming carry through). Both depend only on the input bits, so all 32 generates and propagates settle in 1–2 gate delays simultaneously. Then a binary tree of carry combiners computes the actual carry-in to each bit using the algebraic identity $c_{i + 1} = g_{i} + p_{i} c d o t c_{i}$ , and the tree depth is $l o g_{2} 32 = 5$ . So you go from 32 sequential stages of ripple to roughly 6 parallel gate-levels, at the cost of all the extra gates that compute the tree. The reason it works is that addition has the right algebraic structure — "carry" is associative under the right operation — to admit a parallel-prefix computation. Not every operation does, which is why we don't have parallel-prefix versions of, say, division.

Is gate delay actually constant per gate? In real silicon it depends on fan-out, wire length, transistor sizing.

Right — at the gate level we're abstracting all of that away. We're treating each gate as taking one unit of time, regardless of how many other gates its output drives or how far the wire travels. In real CMOS, fan-out matters (driving more loads slows the gate), wire RC matters (long wires are slow), and transistor sizing matters (bigger transistors switch faster but cost area and power). Synthesis tools do real timing analysis with all of those effects. But for understanding the shape of CPU performance — why ripple is slow, why lookahead is fast, why pipelining helps — the unit-delay abstraction is exactly what you want. It captures the asymptotics without the noise.

If the register file is most of the gate count, why do we obsess over the ALU?

Because gate count and critical path are different metrics. The register file uses lots of gates but they're mostly in parallel — the read-port decode is a tree, the flip-flops all latch concurrently. So it contributes ~4 gate delays to the cycle despite being thousands of gates. The ALU, with its serial ripple-carry chain, contributes 64 gate delays from a much smaller gate count. Performance optimization targets the critical path, not the area. They're correlated but not the same thing.