“Know how to solve every problem that has been solved.” “What I cannot create, I do not understand.” — Richard Feynman


Execution Model

What You'll Learn

How modern CPUs execute instructions
Pipelines, superscalar execution, and out-of-order execution
The difference between latency and throughput
How compilers transform code

Mental Model

Modern CPUs don't execute instructions one at a time. They use pipelines to overlap execution of multiple instructions, superscalar execution to run multiple instructions per cycle, and out-of-order execution to find parallelism even in sequential code.

Understanding this execution model helps explain why some code patterns are fast and others are slow, even when they do the same amount of work.

Key Concepts

Pipelines

Instructions are broken into stages (fetch, decode, execute, write-back). While one instruction is in the execute stage, the next is in decode, and the one after that is being fetched. This allows multiple instructions to be "in flight" simultaneously.
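The benefit of overlapping stages can be put into numbers with a deliberately idealized model. The sketch below assumes every stage takes exactly one cycle and nothing ever stalls; the function names are made up for this illustration.

```c
#include <stdint.h>

/* Idealized model: every stage takes one cycle and nothing ever stalls. */

/* Without pipelining, each instruction occupies the whole datapath
   for `stages` cycles before the next one may begin. */
uint64_t cycles_unpipelined(uint64_t n_instructions, uint64_t stages) {
    return n_instructions * stages;
}

/* With pipelining, the first instruction needs `stages` cycles to pass
   through, and each later instruction finishes one cycle after the
   one before it. */
uint64_t cycles_pipelined(uint64_t n_instructions, uint64_t stages) {
    if (n_instructions == 0) return 0;
    return stages + (n_instructions - 1);
}
```

For 100 instructions on the four-stage pipeline described above, this model gives 400 cycles without overlap and 103 with it. Real pipelines fall short of that because of stalls, which the model ignores.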

Superscalar Execution

Modern CPUs have multiple execution units (ALUs, FPUs, load/store units). If two independent instructions can use different units, they execute in parallel in the same cycle. The independence between instructions that makes this possible is called instruction-level parallelism (ILP).

Out-of-Order Execution

The CPU can reorder instructions to find parallelism. If instruction B depends on A, but C is independent, the CPU can execute C while B is still waiting for A's result. A reorder buffer ensures results are still committed in program order, so the program behaves as if it ran sequentially.
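The A, B, C situation can be traced with a toy scheduler. This is only a sketch of the dataflow idea, not how any real core is built: it assumes unlimited execution units, and the `Instr` record and `dataflow_schedule` function are hypothetical names for this example.

```c
#include <stddef.h>

/* Hypothetical instruction record for this sketch: a latency in cycles and
   up to two producers, given as indices of EARLIER instructions (-1 = none). */
typedef struct {
    int latency;
    int dep[2];
} Instr;

/* Fills start[i] with the earliest cycle at which instruction i can begin
   when each instruction waits only for its own inputs, and returns the
   cycle at which the last result is ready.
   Assumes unlimited execution units and no other limits. */
int dataflow_schedule(const Instr *prog, size_t n, int *start) {
    int total = 0;
    for (size_t i = 0; i < n; i++) {
        int ready = 0;
        for (int d = 0; d < 2; d++) {
            int p = prog[i].dep[d];
            if (p >= 0) {
                int done = start[p] + prog[p].latency;  /* producer's finish time */
                if (done > ready) ready = done;
            }
        }
        start[i] = ready;
        if (ready + prog[i].latency > total) total = ready + prog[i].latency;
    }
    return total;
}
```

Take A with an assumed latency of 4 cycles, B depending on A, and C independent, both with latency 1. Written as `{ {4, {-1, -1}}, {1, {0, -1}}, {1, {-1, -1}} }`, this yields start cycles 0, 4, and 0: C begins right away instead of queuing behind B.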

Micro-operations (µops)

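Complex instructions are broken into simpler micro-operations. The CPU schedules and executes µops, not the original instructions. Because each µop is a small unit of work, the scheduler can handle the parts of one instruction separately, for example starting the load half of a read-modify-write instruction as soon as its address is known.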

Latency vs Throughput

These are different concepts:

Latency: How long one operation takes to complete
Throughput: How many operations can complete per unit time

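Example: An integer ADD might have 1-cycle latency and, because there are multiple ALUs, a throughput of 2-4 per cycle. The distinction is easier to see with a slower operation: a multiply might take 3-4 cycles to produce its result, but because the multiplier is pipelined, you can start a new independent multiply every cycle even though each one takes multiple cycles to complete.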
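One way to observe the gap between latency and throughput is to sum the same array twice, once through a single dependency chain and once through several independent chains. The sketch below uses floating-point adds, on the assumption that the compiler keeps them in source order (GCC and Clang do unless given options such as -ffast-math). The function names and the choice of four chains are illustrative.

```c
#include <stddef.h>
#include <time.h>

/* One dependency chain: every add needs the previous sum,
   so the loop advances at the latency of a floating-point add. */
double sum_one_chain(const double *x, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++) s += x[i];
    return s;
}

/* Four independent chains: adds from different chains can overlap,
   so the loop moves closer to the throughput limit. */
double sum_four_chains(const double *x, size_t n) {
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += x[i];
        s1 += x[i + 1];
        s2 += x[i + 2];
        s3 += x[i + 3];
    }
    for (; i < n; i++) s0 += x[i];   /* leftover elements */
    return (s0 + s1) + (s2 + s3);
}

/* CPU seconds spent in `reps` calls of f. Each result is added to *sink
   so that the calls have a visible effect. */
double cpu_seconds(double (*f)(const double *, size_t),
                   const double *x, size_t n, int reps, double *sink) {
    clock_t t0 = clock();
    for (int r = 0; r < reps; r++) *sink += f(x, n);
    return (double)(clock() - t0) / CLOCKS_PER_SEC;
}
```

Timing both functions over an array small enough to stay in cache, with many repetitions, should show the four-chain version running faster. How much faster depends on the CPU and on what the compiler emitted, so check the assembly before drawing conclusions.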

IPC (Instructions Per Cycle)

IPC measures how many instructions complete per clock cycle. Ideal IPC depends on the CPU's issue width (how many instructions can start per cycle). Real IPC is lower due to:

Dependencies between instructions
Resource conflicts (multiple instructions need the same unit)
Cache misses and branch mispredictions

Note: You can measure IPC using hardware counters (perf on Linux), but you can also infer it from timing measurements by comparing dependent vs independent code patterns.
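Inferring IPC from timing comes down to one division. The helper below is a rough sketch: `estimate_ipc` is a made-up name, the instruction count has to come from you (for example by counting the loop body in the assembly and multiplying by the iteration count), and the clock frequency is assumed constant for the whole run.

```c
/* Rough IPC estimate from a timing measurement.
   `instructions` is the number of instructions the timed region executed
   and `clock_hz` is the core frequency, ASSUMED constant during the run.
   Both are inputs the caller must supply; nothing here measures them. */
double estimate_ipc(double instructions, double seconds, double clock_hz) {
    double cycles = seconds * clock_hz;
    return instructions / cycles;
}
```

For instance, 6e9 instructions in 2.0 seconds at an assumed 1.5 GHz works out to an IPC of 2.0. Frequency scaling can invalidate the constant-clock assumption, so treat the result as an estimate and prefer hardware counters when they are available.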

How Compilers Change the Experiment

Compilers apply many optimizations that affect what the CPU actually executes:

Loop unrolling: Reduces loop overhead and exposes more ILP
Instruction scheduling: Reorders instructions to separate dependent operations and hide latency
Constant propagation: Eliminates computations at compile time
Dead code elimination: Removes unused computations
Vectorization: Uses SIMD instructions when possible

Always inspect the generated assembly (for example, the -S flag in GCC and Clang writes it to a .s file) to see what the compiler actually produced. The source code and the executed code can be very different.
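A small benchmark kernel shows why this matters. In the sketch below, the `sink` variable and the `sum_below` function are hypothetical names; the volatile accesses are one common way to stop the compiler from folding the loop into a constant or deleting it as dead code.

```c
#include <stdint.h>

/* A store to a volatile object must actually be performed, so writing the
   result here keeps the computation from being removed as dead code. */
static volatile uint64_t sink;

/* Reading the bound through a volatile hides its value from the optimizer,
   which blocks constant propagation into the loop. */
uint64_t sum_below(uint64_t bound) {
    volatile uint64_t opaque = bound;
    uint64_t n = opaque;
    uint64_t s = 0;
    for (uint64_t i = 0; i < n; i++) s += i;   /* 0 + 1 + ... + (n - 1) */
    sink = s;
    return s;
}
```

Even with these guards, an optimizer may replace the loop with a closed-form expression, in which case the timing no longer reflects n additions. Compiling with something like `cc -O2 -S` and reading the loop body is the only way to know which version you are measuring.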

Expected Shape of Results

When measuring code with dependencies vs independent operations:

Dependent chains: Performance limited by latency (one operation per latency period)
Independent operations: Performance limited by throughput (multiple operations per cycle)
The ratio between them reveals the CPU's ability to find parallelism
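As a back-of-the-envelope model of these three points, let N be the number of operations, L the latency of one operation in cycles, and T the throughput in operations per cycle. These symbols are introduced here for illustration, and the model ignores loop overhead and memory effects.

```latex
\begin{align*}
C_{\text{dep}}   &\approx N \cdot L \\
C_{\text{indep}} &\approx \frac{N}{T} \\
\frac{C_{\text{dep}}}{C_{\text{indep}}} &\approx L \cdot T
\end{align*}
```

With assumed values of L = 4 and T = 2, a fully dependent chain would take about 8 times as many cycles as fully independent code. Reaching that ratio requires at least L · T = 8 independent operations in flight at once; with fewer independent chains the measured ratio will be smaller.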

Interpretation

When you see independent operations running faster than dependent chains, the CPU is finding and exploiting instruction-level parallelism. The out-of-order execution engine is working. When dependencies force sequential execution, you're limited by instruction latency.

Common Pitfalls

Compiler optimizations: The compiler may unroll loops or reorder operations, changing the experiment
Register pressure: Too many live variables can force spills to memory, adding latency
Resource conflicts: Multiple instructions competing for the same execution unit
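False dependencies: Two instructions reuse the same register without actually sharing data. Register renaming usually removes these, but some cases (such as writes to only part of a register) can still limit parallelism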

Tooling Upgrades (Optional)

Linux: perf stat

perf stat -e instructions,cycles ./your_program reports both counts, and perf prints the derived IPC as "insn per cycle" next to the instruction count. Compare dependent vs independent code to see the difference.

macOS: Instruments

Instruments "Counters" template can show IPC and other microarchitectural metrics. Requires compatible hardware.

Windows: ETW

ETW can capture CPU performance counters including IPC. Windows Performance Analyzer provides visualization.

Checklist

✓ Understand the difference between latency and throughput
✓ Know what pipelines, superscalar, and out-of-order execution mean
✓ Understand how compilers can change what gets executed
✓ Ready to measure dependent vs independent code patterns