ALU

Gate-Level CPU Simulator

The ALU adds, subtracts, and performs the bitwise operations AND, OR, XOR, and NOR. Depending on the ISA, it also handles a few more: comparison (which uses subtraction), shifts, and sometimes multiply. At the gate level, the ALU is wide and parallel: every operation runs concurrently on the same inputs, and a multiplexer at the end selects which result actually leaves the ALU.

class ALU:
    """A 32-bit ALU. The 'op' control bits select which result the
    ALU outputs, but every result is computed in parallel — there
    is no branching at the gate level.
    """
    def __init__(self, width=32):
        self.w = width

    def __call__(self, A, B, op):
        # All possible results, all computed regardless of op:
        sum_  = ripple_carry_add(A, B)[:self.w]
        diff  = ripple_subtract(A, B)
        and_  = [AND(a, b) for a, b in zip(A, B)]
        or_   = [OR(a, b)  for a, b in zip(A, B)]
        xor_  = [XOR(a, b) for a, b in zip(A, B)]
        # ... etc.

        # A multiplexer (built from gates) picks the right one:
        return mux([sum_, diff, and_, or_, xor_, ...], op)

[Interactive demo: 4-bit ALU (ADD / AND / OR / XOR). With A = 5 and B = 3, the four op modules output ADD = 8, AND = 1, OR = 7, XOR = 6; with op = ADD the MUX selects ADD, so Y = 8. All four ops compute in parallel; the MUX gates only the selected op's bus through to Y.]

For a single-cycle CPU, the only knob you have is how much you can do in one cycle, and parallel evaluation is how you maximize it.

Why are all four operations always running?

Hardware doesn't branch the way software does. There's no if op == ADD then add() else and() at the gate level. Every cycle, every operation circuit computes on every input. The MUX at the end picks one result and drives Y from it; the rest are computed and thrown away. In the demo above with A=5, B=3, the four op modules' values stay 8, 1, 7, 6 whichever op is selected. The only thing that changes between selections is which bus the MUX gates through.
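
The same behavior can be checked in a few lines of Python. This is a minimal sketch, not the simulator's actual code: the gate helpers, the little-endian bit lists, and the names `to_bits`, `from_bits`, `ripple_add`, and `alu4` are all assumptions made for illustration, and the one-hot select loop only stands in for the MUX.

```python
# Hypothetical gate primitives; the simulator's own may look different.
def AND(a, b): return a & b
def OR(a, b):  return a | b
def XOR(a, b): return a ^ b

def to_bits(value, width=4):
    """Little-endian bit list: index 0 is the least significant bit."""
    return [(value >> i) & 1 for i in range(width)]

def from_bits(bits):
    return sum(bit << i for i, bit in enumerate(bits))

def ripple_add(A, B):
    """Chain of full adders; the final carry-out is dropped."""
    carry, out = 0, []
    for a, b in zip(A, B):
        p = XOR(a, b)
        out.append(XOR(p, carry))
        carry = OR(AND(a, b), AND(p, carry))
    return out

OPS = ("ADD", "AND", "OR", "XOR")

def alu4(a, b, op):
    A, B = to_bits(a), to_bits(b)
    # Every result bus is computed, whatever op asks for.
    buses = {
        "ADD": ripple_add(A, B),
        "AND": [AND(x, y) for x, y in zip(A, B)],
        "OR":  [OR(x, y) for x, y in zip(A, B)],
        "XOR": [XOR(x, y) for x, y in zip(A, B)],
    }
    # Gate each bus with its select line and OR the gated buses together.
    y = [0] * len(A)
    for name in OPS:
        sel = 1 if name == op else 0
        y = [OR(acc, AND(sel, bit)) for acc, bit in zip(y, buses[name])]
    return {name: from_bits(bus) for name, bus in buses.items()}, from_bits(y)

for op in OPS:
    buses, y = alu4(5, 3, op)
    print(op, buses, "Y =", y)
```

Each printed line shows the same four bus values; only Y differs from one op to the next.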

Isn't that wasteful?

In gate count, yes. The silicon for the unused operations is there every cycle whether you use it or not. But the alternative, running operations one at a time, would itself need extra gates and an extra cycle to sequence them. For a single-cycle CPU, the only knob you have is "how much can we do in one cycle," and parallel evaluation is how you turn that knob up. Power-conscious designs reduce switching in unused branches, for example by holding the inputs of idle units steady (operand isolation) or by clock-gating the registers that feed them, and so reclaim some power, but the gates themselves are always there.

What's inside the MUX?

More gates. A 4-to-1 multiplexer is a small AND-OR network: each operation's output bus is ANDed with an "is-this-one-selected" signal decoded from the op-control bits, and the OR of all four gated buses drives the output. These are the same primitives as the operations themselves: gates all the way down.
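
The AND-OR network described above might be sketched as follows. The names `decode2` and `mux4`, the two select bits `s1` and `s0`, and the op ordering (0 = ADD, 1 = AND, 2 = OR, 3 = XOR) are assumptions for illustration, not the simulator's actual interface.

```python
# Hypothetical gate primitives operating on single bits (0 or 1).
def AND(a, b): return a & b
def OR(a, b):  return a | b
def NOT(a):    return 1 - a

def decode2(s1, s0):
    """2-to-4 decoder: exactly one of the four select lines is 1."""
    return [
        AND(NOT(s1), NOT(s0)),  # op 0
        AND(NOT(s1), s0),       # op 1
        AND(s1, NOT(s0)),       # op 2
        AND(s1, s0),            # op 3
    ]

def mux4(buses, s1, s0):
    """Gate each bus with its select line, then OR the gated buses bitwise."""
    selects = decode2(s1, s0)
    out = [0] * len(buses[0])
    for bus, sel in zip(buses, selects):
        gated = [AND(sel, bit) for bit in bus]  # all zeros unless selected
        out = [OR(o, g) for o, g in zip(out, gated)]
    return out

# Result buses for A=5, B=3, least significant bit first (assumed bit order).
buses = [
    [0, 0, 0, 1],  # ADD = 8
    [1, 0, 0, 0],  # AND = 1
    [1, 1, 1, 0],  # OR  = 7
    [0, 1, 1, 0],  # XOR = 6
]
print(mux4(buses, 0, 0))  # ADD bus
print(mux4(buses, 1, 1))  # XOR bus
```

Because exactly one select line is 1, three of the four gated buses are all zeros and the OR passes the remaining bus through unchanged.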

About 15 AND 10 = 10 — that's a bit-mask

15 = 1111, 10 = 1010, so 15 AND 10 = 1010 = 10. ANDing a value with a mask keeps every bit where the mask is 1 and zeros the rest. It's the same trick a RISC instruction decoder uses to pull out individual fields — opcode, register indices, immediate values — from a packed 32-bit instruction word. The same AND module sitting in the ALU is doing real work elsewhere in the CPU.
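
As a rough illustration of field extraction by masking, the sketch below pulls the fields out of a 32-bit word. It assumes the RISC-V R-type layout (7-bit opcode, 5-bit register indices, 3-bit and 7-bit function fields); the text above only says "a RISC instruction decoder", so the layout is an assumption here, and the helper names `field` and `decode_rtype` are hypothetical.

```python
def field(word, shift, width):
    """Shift the field down to bit 0, then AND with a mask of `width` ones."""
    mask = (1 << width) - 1
    return (word >> shift) & mask

def decode_rtype(instr):
    """Split a 32-bit word using the RISC-V R-type bit positions (assumed)."""
    return {
        "opcode": field(instr, 0, 7),
        "rd":     field(instr, 7, 5),
        "funct3": field(instr, 12, 3),
        "rs1":    field(instr, 15, 5),
        "rs2":    field(instr, 20, 5),
        "funct7": field(instr, 25, 7),
    }

# The mask example from the text: 1111 AND 1010 = 1010.
print(15 & 10)

# 0x002081B3 encodes `add x3, x1, x2` in RISC-V.
print(decode_rtype(0x002081B3))
```

In hardware the shift costs nothing, since it is just a choice of which wires feed the AND gates; the software version needs the explicit `>>`.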

The multiplexer is itself just gates (a tree of ANDs and ORs fed by the control bits). Crucially: there is no if statement in hardware. All branches of the computation happen in every cycle; the control signals select which one is allowed to drive the output. This is a fundamental difference from software: the cost of "dead" computations is real silicon, not zero.