Control Unit
Gate-Level CPU Simulator
The datapath is a pile of components — register file, ALU, memory, PC adder, multiplexers. On its own it does nothing. Every cycle, something has to tell it: "this cycle, the ALU does subtraction; the register file writes its result back; memory stays quiet; the PC takes the branch path if the Zero flag is high." That something is the control unit. It looks at the current instruction and produces, in pure combinational logic, the bundle of control wires that wakes up the right pieces of the datapath and silences the rest.
Here's the thing. The control unit doesn't do anything in the arithmetic sense. It computes nothing. Its job is purely translation — opcode bits in, control bits out — and that translation is just a wide truth table. We've seen this pattern already on the logic-gates page: a gate is its truth table; a wider gate is just a wider table. The control unit is exactly that idea, scaled up.
Piece 1 — What the control unit is for
Look — the datapath has roughly five places where the design has
to make a per-instruction choice. The ALU's B input is either a
register or a sign-extended immediate; pick one. The destination
register field is either rt or rd; pick
one. The write-back source is either the ALU's output or a value
loaded from memory; pick one. Memory is either being read, written,
or idle; pick one. The PC either steps forward by 4 or jumps to a
branch target; pick one. Each pick is a multiplexer in the
datapath, and each multiplexer needs a select line. The control
unit is the thing that drives those select lines.
Now — what does the control unit see? Just the instruction. In a
32-bit MIPS-style ISA the top six bits are the opcode, and for
R-type instructions the bottom six bits are a function code that
distinguishes add from sub from
and and so on. (See the architecture page for the
instruction format.) That's it. From those bits — and nothing else
— the control unit emits the entire control vector for this cycle.
No clock state. No previous-cycle memory. No "what did we do last time." Every instruction is independent, and the control signals are a pure function of the opcode bits sitting in the instruction register right now. That purity is the whole reason a single-cycle CPU is a tractable thing to build at the gate level — and it's what we'll keep coming back to.
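To make "the control unit sees only the instruction" concrete, here is a minimal sketch that slices a 32-bit word into the MIPS-style fields named above. The helper name `decode_fields` and the two example encodings are assumptions chosen for illustration; they are not part of the simulator.

```python
def decode_fields(instr):
    """Split a 32-bit MIPS-style instruction word into its fields.

    Hypothetical helper for illustration only.
    """
    return {
        "opcode": (instr >> 26) & 0x3F,  # bits 31:26, all the main control unit sees
        "rs":     (instr >> 21) & 0x1F,  # bits 25:21
        "rt":     (instr >> 16) & 0x1F,  # bits 20:16
        "rd":     (instr >> 11) & 0x1F,  # bits 15:11 (meaningful for R-types)
        "funct":  instr & 0x3F,          # bits 5:0   (meaningful for R-types)
        "imm":    instr & 0xFFFF,        # bits 15:0  (meaningful for I-types)
    }

add_word = 0x012A4020  # add $t0, $t1, $t2
lw_word = 0x8D280004   # lw  $t0, 4($t1)

add_fields = decode_fields(add_word)
lw_fields = decode_fields(lw_word)
print(format(add_fields["opcode"], "06b"), format(add_fields["funct"], "06b"))
print(format(lw_fields["opcode"], "06b"), lw_fields["imm"])
```

Only `opcode` (and, for R-types, `funct`) feeds the control logic; the register numbers and the immediate go straight to the datapath.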
Piece 2 — The signals you need to generate
Before we wire up a decoder, let's enumerate what it has to produce. For our 32-bit single-cycle CPU there are nine control signals, give or take. Each one is a single bit (or, for ALUOp, a pair of bits) and each one drives exactly one decision in the datapath.
# The control vector — every bit drives one decision in the datapath.
# For a MIPS-style 32-bit single-cycle CPU these are the standard signals:
control_signals = {
    "RegWrite": 0,     # 1 = write the register file this cycle
    "RegDst":   0,     # 0 = destination is rt, 1 = destination is rd
    "ALUSrc":   0,     # 0 = ALU's B input is a register, 1 = sign-extended immediate
    "ALUOp":    0b00,  # 2-bit hint to the ALU control (see Piece 4)
    "MemRead":  0,     # 1 = read data memory this cycle
    "MemWrite": 0,     # 1 = write data memory this cycle
    "MemToReg": 0,     # 0 = write-back from ALU, 1 = write-back from memory
    "Branch":   0,     # 1 = this is a branch instruction
    "Jump":     0,     # 1 = this is a jump instruction
}
Walk through them. RegWrite is the simplest — it's
the write-enable of the register file. If it's 1, whatever is on
the register file's write-data port gets latched into the
destination register at the end of the cycle. If 0, the register
file is read-only this cycle. sw, beq and j
write nothing back, so their RegWrite is 0;
R-types and lw have it high.
RegDst picks which instruction field names the
destination register. R-types put the destination in the
rd field (bits 15:11); I-types like lw
put it in rt (bits 20:16). One mux, one select line.
ALUSrc is the same idea on the ALU's B input: 0 means
"read it from the register file", 1 means "use the sign-extended
16-bit immediate from the instruction." Add and sub use a register;
addi and lw and sw use the immediate.
MemToReg picks the source for the register file's
write-data port: 0 = the ALU's result (for arithmetic), 1 = the
word loaded from memory (for lw). MemRead
and MemWrite are the data-memory enables — exactly
one of them is high for lw and sw
respectively, and both are 0 otherwise. (You really do not want
both high in the same cycle.)
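The three selections described so far (RegDst, ALUSrc, MemToReg) are each a two-input multiplexer. The sketch below models them with a hypothetical `mux2` helper; the register numbers and data values are made up purely to show which input each select line lets through.

```python
def mux2(select, in0, in1):
    """Two-input multiplexer: select = 0 passes in0, select = 1 passes in1."""
    return in1 if select else in0

def select_datapath_values(ctrl, rt, rd, reg_b, imm_ext, alu_result, mem_data):
    """Apply RegDst, ALUSrc and MemToReg to illustrative datapath values."""
    return {
        "write_reg":  mux2(ctrl["RegDst"], rt, rd),                  # destination register
        "alu_b":      mux2(ctrl["ALUSrc"], reg_b, imm_ext),          # ALU's B input
        "write_data": mux2(ctrl["MemToReg"], alu_result, mem_data),  # write-back value
    }

lw_ctrl = {"RegDst": 0, "ALUSrc": 1, "MemToReg": 1}
add_ctrl = {"RegDst": 1, "ALUSrc": 0, "MemToReg": 0}

# Made-up values standing in for what the datapath wires carry this cycle.
sample = dict(rt=8, rd=9, reg_b=100, imm_ext=4, alu_result=104, mem_data=0xBEEF)

lw_sel = select_datapath_values(lw_ctrl, **sample)
add_sel = select_datapath_values(add_ctrl, **sample)
print(lw_sel)
print(add_sel)
```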
Branch is the interesting one. It says "this is a
branch instruction; if the ALU's Zero flag is high, take the
branch." The actual decision to redirect the PC is
$\text{PCSrc} = \text{Branch} \land \text{Zero}$, a single AND gate sitting just outside the control unit.
Jump is unconditional and overrides PCSrc entirely.
Notice the pattern: every datapath choice has at least one control
bit attached. Nothing in the datapath moves without the control
unit's permission.
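The next-PC choice can be sketched in a few lines. This assumes the usual arrangement in which PCSrc is the AND of Branch and Zero and the Jump mux sits after the branch mux; the function name `next_pc` and the addresses are illustrative only.

```python
def next_pc(pc, branch, zero, jump, branch_target, jump_target):
    """Pick the next PC from the control bits and the ALU's Zero flag."""
    pc_src = branch & zero                        # the one AND gate outside the control unit
    after_branch_mux = branch_target if pc_src else pc + 4
    return jump_target if jump else after_branch_mux  # Jump overrides PCSrc

# beq with equal operands takes the branch; with unequal operands it falls through.
taken = next_pc(0x100, branch=1, zero=1, jump=0, branch_target=0x140, jump_target=0x400)
not_taken = next_pc(0x100, branch=1, zero=0, jump=0, branch_target=0x140, jump_target=0x400)
# A non-branch instruction ignores Zero even if the ALU happens to produce 0.
plain = next_pc(0x100, branch=0, zero=1, jump=0, branch_target=0x140, jump_target=0x400)
jumped = next_pc(0x100, branch=0, zero=0, jump=1, branch_target=0x140, jump_target=0x400)
print(hex(taken), hex(not_taken), hex(plain), hex(jumped))
```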
Piece 3 — The decoder is a truth table
Here's where the logic-gates idea pays off. Look at it this way: the input is a 6-bit opcode, the output is a 9-bit (or so) control vector. That's a function from the $2^6 = 64$ possible opcodes to a 9-bit output. A function from a finite domain to a finite range can always be written out as a truth table. We already know how to write those down: one row per input value, one bit-string per output.
# The control unit as a literal truth table.
# Key: 6-bit opcode. Value: the 9-bit control vector
# (RegWrite, RegDst, ALUSrc, ALUOp[1:0], MemRead, MemWrite, MemToReg, Branch, Jump).
CONTROL_ROM = {
    "000000": "1_1_0_10_0_0_0_0_0",  # R-type (add, sub, and, or, slt, ...)
    "100011": "1_0_1_00_1_0_1_0_0",  # lw (load word)
    "101011": "0_x_1_00_0_1_x_0_0",  # sw (store word; RegDst, MemToReg are don't-cares)
    "000100": "0_x_0_01_0_0_x_1_0",  # beq (branch if equal)
    "000010": "0_x_x_xx_0_0_x_0_1",  # j (jump)
    # ... one row per opcode the ISA defines.
}

def control_unit(opcode_bits):
    """Pure combinational lookup. No state, no memory between cycles."""
    key = "".join(str(b) for b in opcode_bits)
    return CONTROL_ROM[key]
That's the entire control unit, conceptually. Each row says: "when
the opcode is this, the control vector is that."
The x's are don't-cares — the signal's value doesn't
matter for that instruction because the datapath piece it controls
is silenced by some other signal. (For sw, RegWrite is
0, so it doesn't matter what RegDst or MemToReg are: nothing is
being written back.)
Now — how do you turn this table into gates? Two ways, and they are the same way. First way: for each output bit, write down the list of opcodes for which it should be 1, then OR together the AND-products that recognize each of those opcodes. That's a sum-of-products implementation — two layers of gates, AND then OR, with NOT gates on the inputs as needed. Brute force. Works.
Second way: just store the table directly in a small ROM. Address the ROM with the opcode bits; each addressed location holds the 9-bit control vector for that opcode. Read it out. This is physically a lookup-table memory, but logically — and this is the point — it is the same function as the AND/OR network. Same truth table, different layout in silicon. The ROM implementation is closer to "the gate is its truth table" from the logic-gates page; the AND/OR network is the synthesised version of the same data. FPGAs blur the line: their LUTs are tiny ROMs that act as gates.
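As a rough check that the two implementations really are the same function, the sketch below builds both for the five signals that have no don't-cares and compares them row by row. The names `minterm` and `sop_control` are illustrative assumptions, and the AND/OR network is modelled with integer bit operations rather than real gates.

```python
# Table rows for the signals that are fully specified for every instruction.
ROM = {
    0b000000: {"RegWrite": 1, "MemRead": 0, "MemWrite": 0, "Branch": 0, "Jump": 0},  # R-type
    0b100011: {"RegWrite": 1, "MemRead": 1, "MemWrite": 0, "Branch": 0, "Jump": 0},  # lw
    0b101011: {"RegWrite": 0, "MemRead": 0, "MemWrite": 1, "Branch": 0, "Jump": 0},  # sw
    0b000100: {"RegWrite": 0, "MemRead": 0, "MemWrite": 0, "Branch": 1, "Jump": 0},  # beq
    0b000010: {"RegWrite": 0, "MemRead": 0, "MemWrite": 0, "Branch": 0, "Jump": 1},  # j
}

def minterm(opcode, pattern):
    """AND of six literals: 1 only when every opcode bit equals the pattern bit."""
    product = 1
    for i in range(6):
        op_bit = (opcode >> i) & 1
        want = (pattern >> i) & 1
        product &= op_bit if want else 1 - op_bit  # inverted literal where a 0 is required
    return product

def sop_control(opcode):
    """Sum-of-products decoder: each output is an OR over the matching minterms."""
    is_rtype = minterm(opcode, 0b000000)
    is_lw = minterm(opcode, 0b100011)
    is_sw = minterm(opcode, 0b101011)
    is_beq = minterm(opcode, 0b000100)
    is_j = minterm(opcode, 0b000010)
    return {
        "RegWrite": is_rtype | is_lw,
        "MemRead": is_lw,
        "MemWrite": is_sw,
        "Branch": is_beq,
        "Jump": is_j,
    }

agree = all(sop_control(op) == ROM[op] for op in ROM)
print("ROM and AND/OR network agree on every defined opcode:", agree)
```

For opcodes outside the table the AND/OR version simply emits all zeros, which is one reasonable choice for undefined instructions but not the only one.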
Reading the table: control vectors side by side
The CONTROL_ROM compresses each instruction's control vector into a single underscore-separated string. Easier to compare when you pivot it into a table — one column per instruction, one row per signal:
| signal | add (R-type) | lw | sw | beq | j |
|---|---|---|---|---|---|
| RegWrite | 1 | 1 | 0 | 0 | 0 |
| RegDst | 1 | 0 | x | x | x |
| ALUSrc | 0 | 1 | 1 | 0 | x |
| ALUOp[1:0] | 10 | 00 | 00 | 01 | xx |
| MemRead | 0 | 1 | 0 | 0 | 0 |
| MemWrite | 0 | 0 | 1 | 0 | 0 |
| MemToReg | 0 | 1 | x | x | x |
| Branch | 0 | 0 | 0 | 1 | 0 |
| Jump | 0 | 0 | 0 | 0 | 1 |
Read down a column to see what an instruction needs from the
datapath. Read across a row to see which instructions activate
that signal. MemRead only fires for lw;
MemWrite only for sw. RegWrite
fires for everything that produces a register result
(add, lw) and stays low for everything
that doesn't. The x entries are don't-cares — for
sw, no register is being written, so it doesn't
matter what RegDst says.
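The pivot from underscore-separated strings to a signal-by-instruction view is mechanical. A minimal sketch, assuming the same field order as the ROM comment; `unpack`, `row` and the `NAMES` mapping are hypothetical helpers for illustration.

```python
SIGNALS = ["RegWrite", "RegDst", "ALUSrc", "ALUOp", "MemRead",
           "MemWrite", "MemToReg", "Branch", "Jump"]

CONTROL_ROM = {
    "000000": "1_1_0_10_0_0_0_0_0",  # R-type
    "100011": "1_0_1_00_1_0_1_0_0",  # lw
    "101011": "0_x_1_00_0_1_x_0_0",  # sw
    "000100": "0_x_0_01_0_0_x_1_0",  # beq
    "000010": "0_x_x_xx_0_0_x_0_1",  # j
}

NAMES = {"000000": "R-type", "100011": "lw", "101011": "sw",
         "000100": "beq", "000010": "j"}

def unpack(vector):
    """Turn '1_0_1_00_...' into {'RegWrite': '1', 'RegDst': '0', ...}."""
    return dict(zip(SIGNALS, vector.split("_")))

def row(signal):
    """Read across the table: one signal's value for every instruction."""
    return {NAMES[op]: unpack(vec)[signal] for op, vec in CONTROL_ROM.items()}

for signal in SIGNALS:
    print(f"{signal:9}", row(signal))
```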
Worked example: where lw's column comes from
Why does lw $rt, offset($rs)'s column read what it
reads? Walk through it bit by bit, asking what the load instruction
needs the datapath to do.
- RegWrite = 1. A load writes a value into a register. The register file's write port has to commit on the clock edge.
- RegDst = 0. The destination register name lives in the rt field (bits [20:16]) for I-type instructions, not rd. The RegDst mux picks rt.
- ALUSrc = 1. The ALU has to compute the effective address: $\text{address} = R[\text{rs}] + \operatorname{SignExt}(\text{offset})$. The base comes from the register file (read port A); the offset is the sign-extended 16-bit immediate. ALUSrc selects the immediate onto the ALU's B input.
- ALUOp = 00. "Address calculation — always add." The ALU control decoder reads this and emits the ALU's add op-select.
- MemRead = 1. Data memory is read this cycle. The word at the ALU's output address arrives on the memory's read-data bus.
- MemWrite = 0. No store happens. (You really don't want both MemRead and MemWrite high in the same cycle.)
- MemToReg = 1. The write-back mux picks the memory output (not the ALU output) as the value to write into the register file.
- Branch = 0. Not a branch, so PCSrc stays 0 and the PC takes PC+4 regardless of the ALU's Zero flag.
- Jump = 0. Not a jump.
Every bit corresponds to one decision the datapath has to make,
and the answer is dictated by what the instruction is asking for.
Now compare to add's column: RegDst
flips to 1 (destination is rd), ALUSrc flips to 0
(B input is a register), ALUOp changes to 10 (look
at funct), and MemRead, MemToReg change
to 0 (no memory involved). Five bits flipped — that's the entire
structural difference between "load a word from memory into a
register" and "add two registers."
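The "five bits flipped" count can be checked directly from the two rows. This is a small sketch; it compares the lw and R-type vectors field by field and then bit by bit, using the same field order as the table.

```python
SIGNALS = ["RegWrite", "RegDst", "ALUSrc", "ALUOp", "MemRead",
           "MemWrite", "MemToReg", "Branch", "Jump"]

LW = dict(zip(SIGNALS, "1_0_1_00_1_0_1_0_0".split("_")))
ADD = dict(zip(SIGNALS, "1_1_0_10_0_0_0_0_0".split("_")))

# Signals whose value differs between the two instructions.
changed = [s for s in SIGNALS if LW[s] != ADD[s]]

# Individual bits that differ (ALUOp is two bits wide, but only one of them changes).
flipped_bits = sum(a != b for s in SIGNALS for a, b in zip(LW[s], ADD[s]))

print(changed)
print(flipped_bits)
```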
Piece 4 — Two-level ALU control
Here's a wrinkle. The ALU itself has its own little control unit.
Why? Because the main control unit only sees the opcode, but for
R-type instructions the opcode is always 000000 — the
distinction between add, sub,
and, or, slt lives in the
6-bit function code, not the opcode. The main control unit doesn't
look at the function code. So we split the decoding in two.
The main control unit emits a 2-bit signal called ALUOp
that says, roughly, "what class of ALU operation is this?"
00 means "address calculation" — this is a load or
store, the ALU should add. 01 means "branch
comparison" — this is beq, the ALU should subtract so
the Zero flag tells us whether the operands were equal.
10 means "look at the function code" — this is an
R-type, and the specific operation is encoded in the bottom six
bits of the instruction.
A second small combinational block — the ALU control — takes ALUOp and the funct field together and produces the actual 4-bit op-select that drives the ALU's internal multiplexer (see the arithmetic page for what that mux selects between).
# Piece 4: the little control unit *inside* the ALU.
# Inputs: ALUOp (2 bits, from main control) and funct (6 bits, from instruction)
# Output: 4-bit ALU op-select that picks add / sub / and / or / slt / ...
def alu_control(ALUOp, funct):
    # ALUOp tells you the *class* of operation:
    if ALUOp == 0b00:  # lw / sw: address calculation, always add
        return 0b0010  # ADD
    if ALUOp == 0b01:  # beq: always subtract, check Zero
        return 0b0110  # SUB
    if ALUOp == 0b10:  # R-type: funct decides the specific op
        return {
            0b100000: 0b0010,  # add
            0b100010: 0b0110,  # sub
            0b100100: 0b0000,  # and
            0b100101: 0b0001,  # or
            0b101010: 0b0111,  # slt
        }[funct]
    # ALUOp = 0b11 is not used by this subset; fail loudly instead of returning None.
    raise ValueError(f"unused ALUOp encoding: {ALUOp:02b}")
Why bother with two levels? Couldn't the main control unit just look at opcode and funct together and emit the 4-bit ALU op directly? Yes, it could. But splitting it keeps each table small and local: the main decoder only knows opcodes; the ALU decoder only knows ALUOp + funct. The two pieces are independently verifiable, and the funct field, which only matters for R-types, doesn't have to thread through the rest of the control unit. It's the same gates either way; the partition is for the human reading the design, not for the silicon.
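To illustrate that the split is a matter of organisation rather than function, the sketch below writes the ALU decode both ways and checks that they agree on every case in this subset. The names `two_level`, `one_level`, `ALUOP_FROM_OPCODE` and `FLAT_RTYPE` are assumptions for this example only.

```python
FUNCT_TO_OP = {
    0b100000: 0b0010,  # add
    0b100010: 0b0110,  # sub
    0b100100: 0b0000,  # and
    0b100101: 0b0001,  # or
    0b101010: 0b0111,  # slt
}

# Level 1: what the main control unit emits for each opcode.
ALUOP_FROM_OPCODE = {0b000000: 0b10, 0b100011: 0b00, 0b101011: 0b00, 0b000100: 0b01}

def two_level(opcode, funct):
    """Main decoder produces ALUOp; the ALU control combines it with funct."""
    alu_op = ALUOP_FROM_OPCODE[opcode]
    if alu_op == 0b00:
        return 0b0010  # ADD for address calculation
    if alu_op == 0b01:
        return 0b0110  # SUB for the beq comparison
    return FUNCT_TO_OP[funct]

# One-level alternative: a single table keyed directly by instruction bits.
FLAT_RTYPE = {(0b000000, funct): op for funct, op in FUNCT_TO_OP.items()}
FLAT_OTHER = {0b100011: 0b0010, 0b101011: 0b0010, 0b000100: 0b0110}

def one_level(opcode, funct):
    """Look at opcode and funct together and emit the op-select directly."""
    if opcode == 0b000000:
        return FLAT_RTYPE[(opcode, funct)]
    return FLAT_OTHER[opcode]

cases = [(0b000000, f) for f in FUNCT_TO_OP] + [(op, 0) for op in FLAT_OTHER]
same = all(two_level(op, f) == one_level(op, f) for op, f in cases)
print("two-level and one-level decoders agree:", same)
```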
Piece 5 — Why control is combinational, not stateful
A natural question: shouldn't the control unit be a state machine? It feels like it should. "Fetch, decode, execute, memory, write-back" — those are stages, surely the control unit walks through them.
In a single-cycle design, no. Every instruction completes in one cycle. Within that cycle, the datapath is laid out such that fetch, decode, execute, memory, and write-back happen in a single chain of combinational gates fed by the rising clock edge. The control signals don't change during the cycle — they're set the moment the instruction register settles, and they stay constant until the next instruction arrives. So there's no sequencing for the control unit to do. Pure combinational logic suffices.
Multi-cycle and pipelined CPUs do have stateful control — a small finite-state machine that walks an instruction through its stages, or a set of pipeline registers that carry control bits forward in lockstep with the data. Those designs trade simplicity of control for either lower hardware cost (multi-cycle) or higher throughput (pipelined). But for the gate-level CPU we're building, the rule holds: control is a pure function of the current instruction.
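The difference between the two styles shows up in code as "function" versus "object with a field". The sketch below is a toy contrast, not a model of any real multi-cycle controller: the `MultiCycleControl` class and its fixed five-stage walk are assumptions made for illustration (a real FSM would also use the opcode to decide which stages to visit).

```python
CONTROL_ROM = {
    "100011": "1_0_1_00_1_0_1_0_0",  # lw
    "101011": "0_x_1_00_0_1_x_0_0",  # sw
}

def single_cycle_control(opcode):
    """Combinational: the output depends only on the argument."""
    return CONTROL_ROM[opcode]

class MultiCycleControl:
    """Toy stateful controller: the output also depends on a stored stage."""
    STAGES = ("fetch", "decode", "execute", "memory", "write-back")

    def __init__(self):
        self.stage = 0  # the stored state a single-cycle design does not need

    def step(self, opcode):
        current = self.STAGES[self.stage]
        self.stage = (self.stage + 1) % len(self.STAGES)
        return current

# Same input, same output, every time.
first = single_cycle_control("100011")
second = single_cycle_control("100011")

# Same input, different output on successive calls, because of self.stage.
fsm = MultiCycleControl()
stages = [fsm.step("100011") for _ in range(3)]
print(first == second, stages)
```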
The necessarily-true facts
- The control signals are a pure function of the current instruction. No clock state, no previous-cycle memory. The opcode (and funct field for R-types) determines every control bit, and the same instruction always produces the same control vector.
- The control unit is a truth table — the same pattern as a logic gate, just wider. 6 input bits, ~9 output bits. The "gate is its truth table" insight from the logic-gates page applies directly; an AND/OR decoder and a ROM lookup compute the same function with different physical layouts.
- Every datapath multiplexer has at least one control bit driving its select line. ALUSrc, RegDst, MemToReg, PCSrc — each one names a mux. If a datapath choice exists, something in the control vector picks the side.
- ALU control is two-level: ALUOp + funct → 4-bit op-select. The main control unit decodes opcode to ALUOp; a small second decoder combines ALUOp and funct to drive the ALU's internal multiplexer. The split is for clarity, not necessity — a single one-level decoder would compute the same function.
- Don't-care bits in the truth table are real don't-cares. For sw, RegWrite is 0, so RegDst and MemToReg can take any value without changing the architectural result. Synthesis tools exploit this to minimize gate count; you should not assume they'll be 0 in hardware.
- Branching is $\text{PCSrc} = \text{Branch} \land \text{Zero}$. The control unit emits Branch (this is a branch instruction); the ALU emits Zero (the operands were equal). The conjunction is one AND gate outside the control unit, and it selects between PC+4 and the branch target.
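The don't-care point can be seen on RegDst alone. Within the five-instruction subset used on this page, only the R-type and lw rows constrain RegDst, so a single inverter on opcode bit 5 is enough. The sketch below compares that with the "every x becomes 0" version; the function names are hypothetical, and the simplification would need revisiting if more opcodes were added.

```python
# (opcode, required RegDst) for each instruction; "x" means unconstrained.
ROWS = {
    "R-type": (0b000000, "1"),
    "lw":     (0b100011, "0"),
    "sw":     (0b101011, "x"),
    "beq":    (0b000100, "x"),
    "j":      (0b000010, "x"),
}

def regdst_zero_filled(opcode):
    """Every x forced to 0: RegDst is 1 only for opcode 000000 (a 6-input NOR)."""
    return int(opcode == 0b000000)

def regdst_minimized(opcode):
    """Don't-cares exploited: NOT of opcode bit 5 (a single inverter)."""
    return 1 - ((opcode >> 5) & 1)

def satisfies(fn):
    """True if fn matches the table on every row where RegDst is specified."""
    return all(spec == "x" or fn(op) == int(spec) for op, spec in ROWS.values())

# Both versions are correct, yet they disagree on the don't-care rows.
differ = [name for name, (op, _) in ROWS.items()
          if regdst_zero_filled(op) != regdst_minimized(op)]
print(satisfies(regdst_zero_filled), satisfies(regdst_minimized), differ)
```

The rows where the two versions disagree are exactly the ones where RegWrite is 0, so the difference never reaches the register file.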
Common questions
Why does the ALU have its own little control unit?
Because the main control unit only looks at the
opcode, and the opcode for every R-type instruction is the same
(000000). The choice between add,
sub, and, or,
slt lives in the 6-bit function code, not the
opcode. Rather than thread funct through the main decoder — where
it's irrelevant for every non-R-type — we hand the main decoder
a coarse 2-bit ALUOp ("address-add", "branch-sub", "look at
funct") and let a small second decoder handle the funct case
locally. Same total logic; cleaner partition.
Why is the control unit combinational rather than a state machine?
Because in a single-cycle CPU every instruction completes in exactly one cycle and the control signals are constant for the duration of that cycle. There's no sequencing to do — the datapath is a single combinational chain from instruction-register output to register-file write port, and the control unit just sets the muxes for the chain. Multi-cycle and pipelined designs do introduce control state (an FSM walking through stages, or pipeline registers carrying control bits forward), but that's a different architecture making a different tradeoff. For us: opcode in, control vector out, no memory.
How many control bits do you actually need for this CPU?
For a MIPS-style subset (R-types, lw, sw, beq, j) it's about nine: RegWrite, RegDst, ALUSrc, ALUOp[1:0], MemRead, MemWrite, MemToReg, Branch, Jump. The ALU then has a separate 4-bit op-select coming out of the ALU control block. So roughly 9 bits out of the main control unit, 4 more out of the ALU control. Add more instructions (multiply, jal, shift-by-immediate, etc.) and the control vector grows by a few bits each time. The number isn't fundamental — it's "one bit per datapath decision."
Could you make the control table a literal lookup ROM instead of gates?
Yes, and it's exactly the same function. A ROM addressed by the 6-bit opcode, with each location storing the 9-bit control vector, computes the identical truth table that an AND/OR decoder computes. The AND/OR decoder is more compact when the table is sparse (most opcodes don't define an instruction, so only a few product terms are needed); the ROM pays for all 64 locations whether or not they are used, so it is the denser choice only when most opcodes are defined. Modern FPGAs split the difference: their lookup-table cells are tiny ROMs that act as gates, so the choice between "ROM" and "gates" basically disappears at that scale. The point is that the control unit is a table; how you implement the table is a layout question, not a logical one.
What happens if two control bits conflict — say MemRead and MemWrite both go high?
At the gate level, undefined behavior in the data memory: you'd be asserting both an enable to read and an enable to write on the same cycle, and the memory's response depends on its internal logic (some designs let one enable win, others may corrupt the addressed cell), so nothing downstream can rely on it. The control unit's truth table is responsible for never emitting a conflicting vector. This is why the table is written explicitly per-opcode rather than by independently computing each signal: if MemWrite is 1 then MemRead must be 0, and that constraint is baked into the rows of the ROM.
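Because the constraint lives in the rows of the table, it can be checked mechanically. A minimal sketch of such a check, using the same string encoding as CONTROL_ROM; the function name `conflicting_rows` and the deliberately broken test row are made up for illustration.

```python
SIGNALS = ["RegWrite", "RegDst", "ALUSrc", "ALUOp", "MemRead",
           "MemWrite", "MemToReg", "Branch", "Jump"]

CONTROL_ROM = {
    "000000": "1_1_0_10_0_0_0_0_0",  # R-type
    "100011": "1_0_1_00_1_0_1_0_0",  # lw
    "101011": "0_x_1_00_0_1_x_0_0",  # sw
    "000100": "0_x_0_01_0_0_x_1_0",  # beq
    "000010": "0_x_x_xx_0_0_x_0_1",  # j
}

def conflicting_rows(rom):
    """Return the opcodes whose row asserts MemRead and MemWrite together."""
    bad = []
    for opcode, vector in rom.items():
        fields = dict(zip(SIGNALS, vector.split("_")))
        if fields["MemRead"] == "1" and fields["MemWrite"] == "1":
            bad.append(opcode)
    return bad

# The table from this page is clean.
print(conflicting_rows(CONTROL_ROM))

# A deliberately broken row (hypothetical opcode) is caught.
broken = dict(CONTROL_ROM, **{"111111": "0_x_1_00_1_1_x_0_0"})
print(conflicting_rows(broken))
```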
If control is just a truth table, why is anyone paid to design it?
For this CPU, the table is small and the design is mostly bookkeeping. For a real ISA — x86 with thousands of opcodes, variable-length instructions, microcoded fallback paths, multiple-issue pipelines — the "table" is enormous and the decoder is one of the most complicated pieces of the chip. The principle is the same: opcode in, control out, pure function. The engineering is in compressing the table, sharing logic across rows, meeting timing on the decoder's critical path, and handling instructions whose decode depends on context. The single-cycle educational CPU strips all that away so you can see the principle.