“Know how to solve every problem that has been solved.” “What I cannot create, I do not understand.” — Richard Feynman

Datapath

Gate-Level CPU Simulator

The CPU has two parts that talk to each other and do completely different jobs. The datapath is the wires and components that physically carry operand bits — the program counter, the register file, the ALU, the memories, the adders, the muxes. The control unit is a separate block that looks at the current instruction and produces a bundle of 1-bit control signals telling the datapath what to do this cycle. This page is about the datapath. The control unit gets its own page.

The thing to keep in mind: the datapath has no idea what instruction is running. It computes everything — every add, every load, every branch target — every single cycle. The control unit's signals decide which of those computations actually drives a register or a memory write at the end. Hardware doesn't branch; it multiplexes.

Piece 1 — What's in the datapath

Look at any single-cycle MIPS-style diagram and you'll see the same parts. Here they are, with what each one does on every cycle:

Program counter (PC). A 32-bit register holding the address of the instruction currently executing. The only sequential element on the main path — everything else is combinational.
Instruction memory. Address-in, instruction-out. In our simulator it's a read-only lookup from PC to a 32-bit word.
Register file. 32 registers, each 32 bits. Two read ports (so we can read $rs$ and $r t$ simultaneously) and one write port. The read ports are combinational; the write port is clocked.
ALU. Two 32-bit operands in, one 32-bit result out, plus a zero flag. Built from the adders and bitwise units of the previous page.
Data memory. Address + write-data in, read-data out, with a MemRead/MemWrite control. Where loads and stores go.
Sign-extender. Pads a 16-bit immediate to 32 bits, copying the sign bit. Pure wires — no gates beyond the fan-out of bit 15.
Shift-left-2. For branch offsets. Multiplies the immediate by 4 (since instructions are 4-byte aligned). Also pure wires — just a re-labelling of bit positions.
Two adders. One for $PC + 4$ , one for $PC + 4 + (imm ≪ 2)$ . Both run every cycle.
Multiplexers, everywhere there's a choice. The whole point of the next piece.

No "if" anywhere on that list. No conditional execution. Every box runs every cycle, on whatever bits arrive at its inputs.

Piece 2 — Multiplexers as the hardware "if"

At the gate level there is no branching. A wire carries a bit; it can't decide not to carry one. So when the datapath has a choice — should the ALU's second operand be a register or an immediate? should the register write-back value be the ALU result or a memory load? should the next PC be $PC + 4$ or the branch target? — both candidates are computed in parallel, in their own slabs of silicon, and a multiplexer picks one.

A 2-to-1 mux is just gates. Given inputs $a$ , $b$ , and a 1-bit selector $s$ , the output is $(a AND \overline{s}) OR (b AND s)$ . Wide muxes (32-bit busses) are 32 of these in parallel, sharing the selector wire.

def mux2(a, b, sel):
    """2-to-1 multiplexer, bit-wise. sel=0 picks a, sel=1 picks b.
    Built from gates: out = (a AND NOT sel) OR (b AND sel).
    """
    not_sel = NOT(sel)
    left  = [AND(ai, not_sel) for ai in a]
    right = [AND(bi, sel)     for bi in b]
    return [OR(l, r) for l, r in zip(left, right)]

There are at least four muxes on the main datapath of a basic single-cycle MIPS:

ALUSrc mux. Picks the ALU's second operand: register $r t$ (for R-type) or sign-extended immediate (for I-type loads, stores, addi).
MemToReg mux. Picks what gets written back to the register file: ALU result (R-type, addi) or data memory output (lw).
RegDst mux. Picks which register field is the write destination: $r d$ (R-type) or $r t$ (I-type).
PCSrc mux. Picks the next PC: $PC + 4$ (default) or the branch target (when Branch is asserted and the ALU's zero flag is set).

Every one of those muxes is fed by a control signal that the control unit sets based on the opcode. The control unit doesn't do any data movement itself — it just labels, on every cycle, which path through the muxes is the live one.

Piece 3 — One R-type instruction's journey

Trace add $rd, $rs, $rt through the datapath. Every wire I name is a 32-bit bus unless I say otherwise.

Fetch. The PC register's output drives the instruction memory's address port. The 32-bit instruction comes out. In parallel, an adder computes $PC + 4$ .
Decode. Decode is just bit-slicing — wires tapped off the instruction word. Bits [25:21] are $rs$ , [20:16] are $r t$ , [15:11] are $r d$ , [15:0] would be the immediate. The control unit also reads bits [31:26] (the opcode) and [5:0] (the funct field) and produces its signals.
Register read. $rs$ and $r t$ drive the register file's two read-port address inputs; the file emits the contents of those two registers on its read-data outputs. Call them $A$ and $B$ .
ALU. $A$ goes into the ALU's first operand. $B$ goes into the ALUSrc mux alongside the sign-extended immediate; for an R-type, ALUSrc = 0, so $B$ wins. ALUOp = "add", so the ALU's adder result is what the output mux selects. Out comes $A + B$ .
Memory. Data memory's address port is wired to the ALU output. It reads — but MemRead is 0 for an R-type, so whatever it produces is ignored. MemWrite is also 0, so nothing is written.
Write-back. The MemToReg mux picks between ALU output and memory output. For R-type, MemToReg = 0, so the ALU result goes through. RegWrite = 1, so on the clock edge the register file commits the value to $r d$ .
Next PC. Branch = 0, so the PCSrc mux passes $PC + 4$ . On the clock edge, that becomes the new PC.

Note step 5: the data memory did a read — it had to, because there's no "if" — but the result was thrown away by the write-back mux. That's normal.

Same trace, with actual bits

Drop concrete values in. Suppose $1 = 5, $2 = 10, PC starts at 0x00, and instruction memory at 0x00 holds 0x00221820 — the encoding of add $3, $1, $2 from the architecture page.

Fetch. PC = 0x00 → instruction memory returns 0x00221820. The PC+4 adder produces 0x04 in parallel.
Decode. Bits get sliced out by position: [25:21] = 00001 → rs = 1; [20:16] = 00010 → rt = 2; [15:11] = 00011 → rd = 3; [31:26] = 000000 (R-type opcode); [5:0] = 100000 (funct = add).
Register read. Read port A's address line carries 1; out comes $1 = 5. Port B's address line carries 2; out comes $2 = 10. Both happen in parallel.
ALU. The control unit set ALUSrc = 0, so the ALUSrc mux passes B = 10 (not the immediate). ALUOp ↔ funct together select the ALU's "add" operation. Output: $5 + 10 = 15$ . The ALU's Zero flag stays 0 (the result isn't zero).
Memory. Data memory's address port carries 15; it dutifully reads the word at address 15 and emits whatever garbage was there. MemRead = 0 and MemWrite = 0 mean nothing is acted on, but the read still happened — silicon doesn't know how to "not look."
Write-back. The MemToReg mux picks between the memory output (15's contents) and the ALU output (15). MemToReg = 0, so the ALU's 15 wins. The RegDst mux picks between rt (2) and rd (3); RegDst = 1, so rd = 3 wins. RegWrite = 1, so on the clock edge: $3 ← 15.
Next PC. The branch-target adder produced something irrelevant; Branch = 0 forces the PCSrc mux to pick PC + 4 = 0x04. On the clock edge, PC ← 0x04.

Same chain of components as the abstract trace; this time with the bits filled in. Two pieces are worth flagging. Step 5 — the memory read whose result we threw away — is the cost of "every box runs every cycle"; it's silicon spent, not time. Step 6 — the two muxes (MemToReg, RegDst) — is where the control unit actually shows up: two 1-bit signals decide whether this instruction looks like an add or a load to the write port.

Piece 4 — One load instruction's journey

Now lw $rt, offset($rs). The shape is similar but the muxes flip:

Fetch / Decode. Same as before. Opcode is lw, so the control unit produces ALUSrc = 1, MemRead = 1, MemToReg = 1, RegWrite = 1, Branch = 0.
Register read. $rs$ drives the read port; $A$ = base address. $r t$ also reads, producing some $B$ the load doesn't actually need.
ALU. ALUSrc = 1, so the second ALU operand is the sign-extended immediate (the offset). ALUOp = "add". Out comes $A + offset$ — the effective address.
Memory. The ALU output drives the data memory's address port. MemRead = 1, so the memory emits the 32-bit word at that address.
Write-back. MemToReg = 1, so the mux now passes the memory output (not the ALU output) to the register file's write-data port. The destination is $r t$ (RegDst = 0). RegWrite = 1; on the clock edge it commits.
Next PC. Branch = 0, so PC moves to $PC + 4$ .

The ALU did an add, just like for the R-type. What changed is only the source of operand B (immediate, not register) and the destination of the write-back (memory output, not ALU output). Two flipped mux selectors — that's the entire difference between an R-type ADD and a load.

Piece 5 — Branches and the next-PC mux

Branches are what the PCSrc mux is for. For beq $rs, $rt, offset:

The ALU does a subtract: $A - B$ . Its zero flag goes high iff $A = B$ .
In parallel, the branch-target adder computes $PC + 4 + (imm ≪ 2)$ . Always runs, always settles, regardless of the opcode.
Control sets Branch = 1. The actual PC source is selected by $Branch AND Zero$ : only if both are 1 does the PCSrc mux pick the branch target. Otherwise it picks $PC + 4$ .

For every non-branch instruction, Branch is 0 and the AND gate forces PCSrc to 0 — $PC + 4$ wins, regardless of what the ALU's zero flag happens to say. So the same hardware runs every cycle; the control bit decides whether the zero flag matters.

def datapath_step(pc, instr_mem, reg_file, data_mem, control):
    """One single-cycle pass. Every wire is computed; muxes select.

    'control' is the bundle of signals coming from the control unit:
    RegWrite, ALUSrc, MemRead, MemWrite, MemToReg, Branch, ALUOp, ...
    """
    # ---- Fetch ----------------------------------------------------
    instr   = instr_mem[pc]                  # 32-bit instruction
    pc_plus_4 = ripple_carry_add(pc, FOUR)   # always computed

    # ---- Decode (just slicing the instruction word) ---------------
    rs   = instr[25:21]
    rt   = instr[20:16]
    rd   = instr[15:11]
    imm  = sign_extend(instr[15:0], 32)      # 16 -> 32 bits

    # ---- Register read --------------------------------------------
    A = reg_file.read(rs)
    B = reg_file.read(rt)

    # ---- ALU ------------------------------------------------------
    alu_b = mux2(B, imm, control.ALUSrc)     # register OR immediate
    alu_out = ALU(A, alu_b, control.ALUOp)

    # ---- Memory ---------------------------------------------------
    mem_data = data_mem.read(alu_out)        # always read
    if control.MemWrite:                     # gated by control
        data_mem.write(alu_out, B)

    # ---- Write-back -----------------------------------------------
    wb = mux2(alu_out, mem_data, control.MemToReg)
    if control.RegWrite:
        reg_file.write(rd_or_rt, wb)

    # ---- Next PC --------------------------------------------------
    branch_tgt = ripple_carry_add(pc_plus_4, shift_left_2(imm))
    take_branch = AND(control.Branch, alu_zero(alu_out))
    next_pc = mux2(pc_plus_4, branch_tgt, take_branch)
    return next_pc

The Python above is shaped like the datapath, but cheats with if control.MemWrite for readability — at the gate level that's not an "if", it's a write-enable signal feeding the memory's clocked write port. Same for RegWrite. Everything else is wires and combinational gates.

Piece 6 — Single-cycle means one big combinational chain

All of this — fetch, decode, register read, ALU, memory access, write-back mux, next-PC mux — happens combinationally between two clock edges. There are no pipeline registers cutting the path into stages. The PC register and the register-file/memory write ports are the only sequential elements; everything in between is pure gates settling.

That has consequences. The clock period has to be long enough for a signal to propagate from the PC register, through instruction memory, decode, register read, ALU (worst-case operation), data memory, the write-back mux, all the way to the register-file write port — in one go. That longest chain is the critical path. The clock can never run faster than that chain's delay. The performance page is about exactly this tradeoff and what pipelining buys you.

The necessarily-true facts

Every fork in the datapath is a multiplexer, never a branch. The hardware can't choose which path to compute; it computes both and the control signal selects one. The cost of the unused path is silicon, not time.
The control unit only emits 1-bit control signals; the datapath only carries operand bits. No operand value ever flows out of the control unit, and no control signal is ever produced by the datapath proper. (The ALU's zero flag is the one wire that crosses the boundary, and it's a single bit consumed by the next-PC logic.)
The PC register is the only sequential element on the main combinational path. Register-file and data-memory writes are clocked, but they're sinks, not part of the chain that has to settle for the next instruction's address. The PC update is what runs every cycle without exception.
The critical path is the longest combinational chain through the datapath. Clock period ≥ critical-path delay. Whatever the worst-case instruction touches end-to-end sets the speed of every other instruction too.
Decode is wires. Slicing the instruction word into $rs$ , $r t$ , $r d$ , opcode, funct, immediate is just labelling fixed bit positions. No gates, no delay beyond fan-out.
Sign-extend and shift-left-2 are wires. Sign-extend replicates bit 15 onto bits 16–31; shift-left-2 relabels bit $i$ as bit $i + 2$ with two zero bits at the bottom. Both are zero-gate operations.
Every instruction visits every stage. Even an R-type ADD reads data memory and a load runs the ALU; the muxes determine which results are used. The datapath doesn't skip work.

Common questions

Why are there muxes for things you could compute conditionally?

Because at the gate level there's no "conditionally." A wire either carries a bit or it doesn't, and every gate runs every cycle on whatever bits arrive. The mux is what lets you express a choice in a system that can't actually choose — you compute both options in parallel and let a control bit gate which one drives downstream wires. The "wasted" computation isn't wasted in time (it ran in parallel with the chosen one); it's wasted in silicon, which is a one-time cost paid at fab.

Why doesn't the load instruction skip the ALU and just go straight to memory?

Because the address has to be computed — $base + offset$ — and the ALU is the adder we already have. Adding a second adder just for loads would be redundant silicon for the same work. So the load instruction reuses the ALU as an address calculator, and the ALU's output drives the memory's address port instead of the register file's write port. The MemToReg mux does the redirecting.

What happens to the unused result of the wrong mux input?

Nothing. It sat on a wire for a few hundred picoseconds and dissipated as heat. Concretely: for an R-type ADD, the data memory still got an address, still did its lookup, still produced a 32-bit word — that word arrived at the MemToReg mux's input and was discarded because MemToReg = 0. No write happened (MemWrite = 0). The dynamic power for that read is real; designs that care about power use clock gating or operand-isolation latches to silence unused units, but the logical behavior is "compute everything, ignore most of it."

If both adders run every cycle, why not use one and multiplex?

You could. The reason designs use two is timing: the $PC + 4$ adder and the branch-target adder feed different parts of the next-PC mux, and both their outputs need to be ready before that mux can settle. Sharing one adder means sequencing — first compute one sum, then the other — which either needs an extra cycle (kills single-cycle) or a faster adder (more silicon than just having two slow ones). At 32 bits and ripple-carry, two adders is the cheap answer.

Where does the control unit fit in this picture?

Off to the side, with one input (the opcode and funct fields, both sliced from the instruction word) and a handful of 1-bit outputs (ALUSrc, MemRead, MemWrite, MemToReg, RegWrite, RegDst, Branch, ALUOp). Each of those wires lands on the select port of a mux or the enable port of a memory or the op-code port of the ALU. The control unit doesn't move operand bits and doesn't see operand values — it's a pure opcode-to-control-bits combinational circuit. Its own page covers how it's built.

Why is the PC the only sequential element on the main path?

Because in a single-cycle design, every other piece of state (registers, memory) is updated at the end of the cycle, after combinational settling. The PC has to be sequential because next-PC depends on this-PC, and you need something to break that loop — a register that holds this-PC for the whole cycle while next-PC is being computed downstream. A pipelined CPU would add several more PC-like registers between stages, but for single-cycle one is enough.

Could you replace some muxes with tri-state buffers?

Yes, on a real chip — that's how busses are usually built. A tri-state buffer with its enable off is electrically "not driving" the wire, so several of them can share one bus and only one drives at a time. Logically equivalent to a mux, but at the gate-simulation level we don't have tri-state — every wire has exactly one driver — so we represent the choice explicitly with AND/OR muxes. The behavior the CPU sees is the same.

With the datapath understood, the next page is the control unit: how the opcode-to-control-bits mapping is itself just gates and a small truth table. After that, performance looks at the critical path through this whole arrangement and where the speed actually goes. The earlier pages on logic gates, arithmetic, and the architecture overview are the components this page wires together.