“Know how to solve every problem that has been solved.” “What I cannot create, I do not understand.” — Richard Feynman

CPU Architecture

Gate-Level CPU Simulator

Look — we have gates (logic gates) and we have arithmetic units built from gates (arithmetic). Those are the building blocks. But a pile of adders is not a CPU. A CPU is a piece of hardware that fetches a binary instruction, decodes what it means, executes it, and moves on to the next one — and it has to do this for any instruction in the instruction set, all wired up in advance, with no software in the loop.

This page is the architecture: the high-level skeleton that turns "some gates" into "a programmable machine." Five pieces. The instruction format that defines what the bits mean. The register file that holds the working state. The memory model that defines where code and data live. The five conceptual stages every instruction passes through. And the choice to do all five stages in one cycle, with everything that implies.

Piece 1 — The instruction format

An instruction is 32 bits. That's the contract: the program is a sequence of 32-bit words, and each word tells the hardware exactly one thing to do. The hardware is fixed; the program is just data. So the question is: how do you carve up 32 bits to encode "ADD register 5 to register 6, put the result in register 7" — or "load the word at address (register 4 + 12) into register 9" — or "jump to instruction 0x00400020"?

The trick MIPS uses, and that I've borrowed for this CPU, is three instruction formats: R-type (register-register), I-type (immediate), J-type (jump). The top 6 bits — the opcode — pick the format. Once the opcode is known, the remaining 26 bits are interpreted differently depending on format.

Format	[31:26]	[25:21]	[20:16]	[15:11]	[10:6]	[5:0]
R	opcode	rs	rt	rd	shamt	funct
I	opcode	rs	rt	imm16
J	opcode	addr26

R-type carries three register indices (5 bits each, addressing one of $2^{5} = 32$ registers), a 5-bit shift amount for shift instructions, and a 6-bit function code that distinguishes ADD from SUB from AND etc. — all of which share the same R-type opcode. I-type packs a 16-bit signed immediate alongside two register indices, which is what loads, stores, and addi need. J-type spends 26 bits on a target address for jumps.

Worked example: encoding "add $3, $1, $2"

Concrete: take add $3, $1, $2 — "add the contents of registers 1 and 2, put the sum in register 3." It's R-type, so each field lands in a fixed slot:

field	opcode	rs	rt	rd	shamt	funct
bits	[31:26]	[25:21]	[20:16]	[15:11]	[10:6]	[5:0]
value	`000000`	`00001`	`00010`	`00011`	`00000`	`100000`
meaning	R-type	$1	$2	$3	—	add

Concatenate the value row left to right: 000000 00001 00010 00011 00000 100000 = 0x00221820. That single 32-bit number, sitting in instruction memory, is the entire add $3, $1, $2 instruction. The hardware never sees mnemonic text — it sees these 32 bits, routed by position to the right wires. Change rs from 00001 to 00100 and you've changed the instruction to add $3, $4, $2 without touching anything else.

In software you'd write if opcode == R: read funct field. In hardware, every field is extracted from every instruction on every cycle. The wires that carry bits [5:0] are always showing the funct field; whether anyone uses that field depends on the control unit, which sees the opcode and decides. Hardware is parallel by default — selecting is the expensive operation, not computing.

def decode(instr):
    """Pull every field out of a 32-bit instruction in parallel.
    Hardware doesn't 'check the opcode and then read fields' — it
    extracts every field every cycle and lets the control unit
    decide which extractions matter."""
    opcode = (instr >> 26) & 0x3F        # bits [31:26], 6 bits
    rs     = (instr >> 21) & 0x1F        # bits [25:21], 5 bits
    rt     = (instr >> 16) & 0x1F        # bits [20:16], 5 bits
    rd     = (instr >> 11) & 0x1F        # bits [15:11], 5 bits  (R-type)
    shamt  = (instr >>  6) & 0x1F        # bits [10: 6], 5 bits  (R-type)
    funct  =  instr        & 0x3F        # bits [ 5: 0], 6 bits  (R-type)
    imm16  =  instr        & 0xFFFF      # bits [15: 0]          (I-type)
    addr26 =  instr        & 0x03FFFFFF  # bits [25: 0]          (J-type)
    return opcode, rs, rt, rd, shamt, funct, imm16, addr26

The convention that registers are indexed in fields [25:21], [20:16], [15:11] — same bit positions in R-type and I-type — is not a coincidence. It means the register file's read-port address wires can be hooked directly to those bits without a multiplexer in front. The decoder's "work" is mostly just naming wires; the bits route to the right place by their position in the instruction.

Piece 2 — The register file

Registers are the CPU's working memory: 32 of them, each 32 bits wide, fast enough to read and write inside a single cycle. They are physically a small bank of flip-flops with addressing logic in front. The instruction format reserves 5 bits per register index — that's where the "32" comes from.

The register file has two read ports and one write port. Two read ports because most instructions need to read two source operands at the same time (an ADD reads rs and rt; a store reads the address base and the data to store). One write port because no instruction writes more than one result. The number of ports is a physical property of the hardware — it's how many pieces of addressing logic and how many output buses the file has — not a software choice. Adding a third read port would cost real silicon.

class RegisterFile:
    """32 registers, two read ports, one write port. Reads are
    combinational (no clock); writes are clocked (happen on the rising
    edge). Register 0 is hardwired to zero — writes to it are ignored."""
    def __init__(self, width=32, count=32):
        self.regs = [0] * count
        self.width = width

    def read(self, ra, rb):
        # Both reads happen in parallel — two physically separate ports.
        return self.regs[ra], self.regs[rb]

    def write(self, rw, data, write_enable):
        # On the clock edge: if write_enable AND rw != 0, latch data.
        if write_enable and rw != 0:
            self.regs[rw] = data & ((1 << self.width) - 1)

Register 0 is hardwired to the constant zero. Reads of register 0 return 0; writes to register 0 are silently ignored. This sounds like a waste of one of our 32 registers, but it's enormously useful: any instruction that needs the constant 0 as an operand just names register 0, no special "load zero" instruction needed. "MOVE rd, rs" is just "ADD rd, rs, $0". "BRANCH IF ZERO rs" is just "BRANCH IF EQUAL rs, $0". A whole family of pseudo-instructions falls out of one wired constant.

The reads are combinational — the moment you put a 5-bit address on a read port, the corresponding register's contents appear on the output bus, no clock needed. The write is clocked — it only takes effect on the rising edge of the clock, and only if the write-enable signal is asserted. That asymmetry is what makes a single cycle work: you can read your sources, send them through the ALU, and write the result back, all between two clock edges, because the read happens instantly and the write waits for the edge.

Piece 3 — Memory: instructions and data, separately

The CPU has two memories. One holds the program — the sequence of 32-bit instructions — and is read-only at runtime. One holds the data the program operates on — variables, arrays, the stack — and is read/write. The program counter (PC) addresses the instruction memory; load and store instructions address the data memory.

This is a Harvard architecture: code and data live in physically separate memories. Real chips usually present a unified address space to the programmer (von Neumann) but split the caches into separate I-cache and D-cache underneath, which is Harvard at the level that matters for performance. For a single-cycle simulator, going full Harvard is a simplification that costs us nothing: it lets the fetch stage and the memory stage run in parallel without contending for the same memory port.

class InstructionMemory:
    """Read-only at runtime. Indexed by PC. Word-addressable: the
    bottom two bits of the address are ignored because every
    instruction is exactly 4 bytes."""
    def __init__(self, program):
        self.words = list(program)        # list of 32-bit ints

    def fetch(self, pc):
        return self.words[pc >> 2]


class DataMemory:
    """Read/write. Loads are combinational; stores happen on the
    clock edge when mem_write is asserted."""
    def __init__(self, size_words=1024):
        self.words = [0] * size_words

    def load(self, addr):
        return self.words[addr >> 2]

    def store(self, addr, data, mem_write):
        if mem_write:
            self.words[addr >> 2] = data

Memory is byte-addressable in the abstract — the address is in bytes, so address 0 is the first byte, address 4 is the fifth — but every instruction is exactly 4 bytes and every word is 4 bytes, so the bottom two bits of any address that names a word are always zero. The hardware ignores those bottom two bits and indexes into a word-aligned array. If you try to load a word from an unaligned address, the answer is "you don't" — that's an alignment fault, and a teaching CPU just doesn't support it.

The PC is itself a 32-bit register. After every instruction it increments by 4 (next word). Branches and jumps overwrite it with a new value computed from the instruction. There is exactly one PC, exactly one instruction in flight per cycle, and exactly one path from the PC to the next PC.

Piece 4 — The five stages of an instruction

Every instruction, regardless of type, passes through the same five conceptual stages. They are stages in the sense of "things that have to happen in this order for any of it to work" — not in the sense of "separate clock cycles." In a single-cycle CPU they all happen combinationally, in one cycle, between two clock edges.

Fetch. Read the instruction at address PC from instruction memory. The output of this stage is the 32-bit instruction word.
Decode. Split the instruction into its fields (opcode, rs, rt, rd, shamt, funct, immediate) and feed the opcode and funct into the control unit, which produces all the control signals (which ALU op, whether to read memory, whether to write a register, etc.). See the control unit page.
Register read. Use rs and rt as addresses into the register file's two read ports. Both source operands come out in parallel.
Execute. Send the operands through the ALU. For arithmetic and logic instructions, the ALU's output is the answer. For loads and stores, the ALU computes the effective address. For branches, the ALU compares the two operands.
Memory access. If the instruction is a load, read the word at the ALU's output address from data memory. If a store, write the second operand to that address. Otherwise this stage is a no-op (the wires still carry signals, they just don't do anything visible).
Write-back. If the instruction produces a register result, latch that result into the destination register on the clock edge. The data is either the ALU output (for arithmetic) or the memory output (for loads); a multiplexer controlled by mem_to_reg picks the right one.

That's six things, not five — fetch, decode, register read, execute, memory, write-back. Textbooks compress the count differently depending on whether they fold "decode" and "register read" together. The point is: there is a fixed, instruction-independent pipeline of stages, and every instruction activates the parts of it that it needs.

def step(cpu):
    """One clock cycle of a single-cycle CPU. Conceptually five
    stages — but in the hardware, all of these run as one big
    combinational chain between two clock edges."""
    # 1. FETCH: read instruction at PC.
    instr = cpu.imem.fetch(cpu.pc)

    # 2. DECODE: split into fields, generate control signals.
    op, rs, rt, rd, shamt, funct, imm16, addr26 = decode(instr)
    ctrl = cpu.control_unit(op, funct)

    # 3. REGISTER READ: pull both source operands in parallel.
    a, b = cpu.regfile.read(rs, rt)

    # 4. EXECUTE: ALU does the arithmetic / logic.
    alu_b = b if not ctrl.alu_src else sign_extend(imm16)
    alu_out = cpu.alu(a, alu_b, ctrl.alu_op)

    # 5. MEMORY: load or store, if this instruction touches memory.
    if ctrl.mem_read:
        mem_data = cpu.dmem.load(alu_out)
    elif ctrl.mem_write:
        cpu.dmem.store(alu_out, b, mem_write=True)
        mem_data = 0
    else:
        mem_data = 0

    # 6. WRITE-BACK: latch result into the register file on the edge.
    write_data = mem_data if ctrl.mem_to_reg else alu_out
    rw = rd if ctrl.reg_dst else rt
    cpu.regfile.write(rw, write_data, ctrl.reg_write)

    # Update PC for the next cycle.
    cpu.pc = next_pc(cpu.pc, ctrl, imm16, addr26, alu_out)

Notice that the Python is sequential — it has to be, because Python is sequential — but the hardware version of this is one giant feedforward circuit. There's no if in hardware: the load path and the store path and the do-nothing path all exist as wires, all the time, and the control signals select which one drives the write port. Every instruction pays for every stage's worth of gate delay, even the ones it doesn't use, because the signal still has to propagate through.

Worked example: a tiny program in motion

Here's the architecture running. A three-instruction program that loads constants into two registers and adds them:

0x00:  addi $1, $0, 5     ; $1 ← 0 + 5  = 5
0x04:  addi $2, $0, 10    ; $2 ← 0 + 10 = 10
0x08:  add  $3, $1, $2    ; $3 ← $1 + $2 = 15

The CPU starts with PC = 0x00 and every register (except $0, which is hardwired) holding zero. Each row below is the machine state at the END of that cycle, after the write-back has committed:

cycle	PC fetched	instruction	$1	$2	$3	PC after
0	—	(initial)	0	0	0	`0x00`
1	`0x00`	addi $1, $0, 5	5	0	0	`0x04`
2	`0x04`	addi $2, $0, 10	5	10	0	`0x08`
3	`0x08`	add $3, $1, $2	5	10	15	`0x0c`

Three cycles, three instructions, three register updates. Cycle 3 is the interesting one: the read ports of the register file pull $1 = 5 and $2 = 10 — values that were written in cycles 1 and 2. How does that work without a race? The clocked-write asymmetry from Piece 2: writes commit at the END of a cycle, so by the time cycle 3 begins (after the next rising edge) those values have settled and are visible to the read ports. The clock edge is the synchronization. No extra sequencing logic is needed, because the cycle boundary already enforces "every write before is finished, every read after sees it."

Piece 5 — Single-cycle is a choice, not a law

Here's the design decision that defines this CPU: every instruction completes in exactly one clock cycle. The fetch, the decode, the register read, the ALU, the memory access, and the write-back — all of it — happens between one rising edge of the clock and the next. There are no pipeline registers in the middle. The PC advances by 4 every cycle (or to a branch target), and the next instruction starts.

This is gloriously simple to reason about. There is no hazard logic, no forwarding, no stall machinery, no branch prediction. The state of the machine after cycle $n$ is determined entirely by the state after cycle $n - 1$ and the instruction fetched at PC. You can simulate it in your head.

But it's also slow. The clock period must be at least as long as the longest combinational path through the whole datapath — and that's the load instruction, which has to walk fetch → decode → register read → ALU (32-bit ripple-carry add for the address) → memory access → write-back, all serially. Every other instruction is faster, but the clock period is set by the worst case, so they all run at the load's pace. A pipelined CPU breaks that path into stages with latches between them, runs the clock at the speed of the slowest stage instead of the sum, and overlaps multiple instructions in flight at once. You pay for it with hazard logic and branch penalties; you gain roughly an order of magnitude in throughput.

For a teaching CPU built from gates, single-cycle wins on every axis except speed. We're not racing anything; the goal is that you can point to a wire and explain what's on it. (See the performance page for the actual numbers and where the time goes.)

The necessarily-true facts

Register 0 always reads as zero. The hardware doesn't store a value there — the read port for index 0 is wired to a constant 0 bus, and the write-enable for index 0 is gated off. No software discipline required; it's a property of the silicon.
An instruction can read at most two registers and write at most one. Not because of the encoding (R-type has rs, rt, and rd in its field layout, which is three register names), but because the register file has two read ports and one write port. rd is a write address, not a read address. Adding a three-source instruction to the ISA would require physically adding a third read port.
The clock period is bounded below by the critical path of the slowest instruction. For a single-cycle design, every cycle accommodates the worst case, so the CPU runs as slowly as its slowest instruction. For 32-bit ripple-carry arithmetic plus a memory access, that's a hundred-or-so gate delays; nothing on the page can change that without changing the design (carry-lookahead, pipelining, etc.).
Every instruction is 32 bits and word-aligned. Bottom two bits of any instruction address are zero, by construction. The PC increments by 4 between sequential instructions, and branch targets are computed in units of words.
Decoding is constant-time and unconditional. Every field is extracted from every instruction every cycle by the same wires. The control unit then chooses which of those extractions to act on. There is no "if R-type then read funct" in the hardware — the funct wires are always live.
The register file's reads are combinational; its writes are clocked. This asymmetry is what permits a whole instruction (read sources, do work, write back) to fit in one cycle: the read result is available immediately, and the write commits at the end of the cycle, after the result has had time to propagate through the ALU and any memory access.
Harvard architecture means fetch and memory access can happen in the same cycle. Because instruction memory and data memory are physically separate, both can be accessed without contention. A unified-memory design would either need two memory ports or would have to serialise fetch and memory access, doubling the cycle time.

Common questions

Why hardwire register 0 to zero instead of letting it be a normal register that programs are conventionally expected to keep at 0?

Because "convention" is something the compiler has to enforce on every instruction, and a single bug or a single hand-written assembly routine breaks it. A hardwired zero is free at runtime — it's just a wire to ground — and it cannot be wrong. It also means the instruction decoder can use "register 0" as a sentinel meaning "constant zero," which gives you MOVE, NEG, NOT, NOP, BRANCH-ZERO, etc. as syntactic sugar over the normal three-operand instructions. You spend one register slot, you get a half-dozen pseudo-instructions for free, and you remove a whole class of bugs.

Why a separate instruction memory and data memory? Real machines don't have that.

Real machines have a unified address space because programs need to be able to load themselves from disk, jump into JIT-compiled code, do self-modifying tricks, etc. — and because virtual memory is one mechanism for everything. But under the cache, virtually every modern CPU has split L1 caches: an I-cache and a D-cache. So the actual silicon at the speed-critical layer is Harvard. For a single-cycle simulator, going full Harvard removes a port-contention problem we'd otherwise have to solve, and it doesn't change anything at the ISA level — the program still sees one address space, because the I-memory and D-memory in this design happen to use non-overlapping addresses.

Why single-cycle if pipelining is faster?

Because the goal here is to understand a CPU, not to ship one. Pipelining adds a forest of complications — hazards, forwarding paths, stalls, branch prediction, exception handling — every one of which exists to paper over the fact that you've got multiple instructions in flight at once. Each of those is genuinely interesting on its own, but stacking them all simultaneously while you're also trying to learn how an ALU hooks up to a register file is a recipe for confusion. Single-cycle gets you a working CPU end-to-end with no hazards to think about, and once that's solid, pipelining is a well-defined transformation on top of it.

Why exactly 32 registers? Why not 16 or 64?

The number is set by how many bits you spend on each register index in the instruction. R-type has three register fields that have to fit in 32 bits along with opcode (6 bits), shamt (5 bits), and funct (6 bits) — that's 17 bits of overhead, leaving 15 bits for three indices, or 5 bits each. Five bits addresses $2^{5} = 32$ registers. You could go to 16 registers (4 bits each, 12 total, freeing 3 bits) but that cramps the compiler badly. Going to 64 registers (6 bits each, 18 total) doesn't fit without shrinking opcode or funct. So 32 is what falls out of "32-bit instructions, three register operands, MIPS-style encoding." It's also empirically a sweet spot — compilers rarely benefit from more, and 16 is too few for register-rich code.

If hardware computes every branch in parallel, what's the cost of an unused field like funct on an I-type instruction?

Roughly nothing. The funct wires are tied to bits [5:0] of the instruction whether or not the current opcode cares about them. The register-file read port for "rd" is indexed by bits [15:11] every cycle even if the instruction doesn't have an rd. A few bits of routing, a few extra register reads happen, the multiplexers in front of the writeback path ignore them. The cost is gate area, not time. This is the consistent theme of single-cycle hardware: everything happens every cycle, and selection is what makes it correct. You don't pay for unused work in latency, only in silicon.

Why is the write to the register file clocked but the read combinational?

Because the read needs to happen before the ALU can do anything, and the ALU's result needs to settle before the write-back can latch the right value. If the read were also clocked, you'd need an extra cycle just to get operands out — which is exactly what a multi-cycle design does and is exactly what we're avoiding. Combinational read means "the moment you put an address on the port, the value is on the bus" — no waiting. Clocked write means "the value is committed atomically at the cycle boundary," which is what gives the datapath a clean before/after to reason about. The asymmetry is load-bearing.

Next we'll wire these pieces together — instruction memory, register file, ALU, data memory — into the actual datapath, and then build the control unit that drives all the multiplexer-select wires you've seen referenced here.