Execution Model
What You'll Learn
- How modern CPUs execute instructions
- Pipelines, superscalar execution, and out-of-order execution
- The difference between latency and throughput
- How compilers transform code
Mental Model
Modern CPUs don't execute instructions one at a time. They use pipelines to overlap execution of multiple instructions, superscalar execution to run multiple instructions per cycle, and out-of-order execution to find parallelism even in sequential code.
Understanding this execution model helps explain why some code patterns are fast and others are slow, even when they do the same amount of work.
Key Concepts
Pipelines
Instruction processing is split into stages (classically fetch, decode, execute, write-back). While one instruction is in the execute stage, the next is in decode, and the one after that is being fetched. This allows multiple instructions to be "in flight" simultaneously.
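As a rough illustration of why overlapping stages pays off, the following Python sketch counts cycles for an idealized pipeline. It is a toy model, not a measurement: the function names are made up for this example, and it assumes every stage takes exactly one cycle and nothing ever stalls.

```python
def unpipelined_cycles(n_instructions: int, n_stages: int) -> int:
    """Each instruction passes through every stage before the next one starts."""
    return n_instructions * n_stages


def pipelined_cycles(n_instructions: int, n_stages: int) -> int:
    """Ideal pipeline: fill it once, then one instruction finishes per cycle."""
    if n_instructions == 0:
        return 0
    return n_stages + (n_instructions - 1)


if __name__ == "__main__":
    # Assumed 4-stage pipeline (fetch, decode, execute, write-back).
    for n in (1, 4, 100):
        print(n, unpipelined_cycles(n, 4), pipelined_cycles(n, 4))
```

With 4 stages and 100 instructions the model gives 400 cycles without overlap and 103 with it. Real pipelines fall short of this ideal because of dependencies, cache misses, and branch mispredictions.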
Superscalar Execution
Modern CPUs have multiple execution units (ALUs, FPUs, load/store units). If two independent instructions can use different units, they can execute in the same cycle. The parallelism available among nearby instructions is called instruction-level parallelism (ILP), and superscalar hardware is what exploits it.
Out-of-Order Execution
The CPU can reorder instructions to find parallelism. If instruction B depends on A, but C is independent, the CPU can execute C before B completes. A reorder buffer ensures results are committed in program order.
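The A/B/C scenario above can be played out in a small Python toy scheduler. This is only a sketch under loud assumptions: the Instr type, the simulate function, and the latencies are all invented for illustration, every instruction can use any execution unit, and there is no reorder buffer limit. It compares in-order issue against out-of-order issue at a given issue width.

```python
from typing import NamedTuple


class Instr(NamedTuple):
    name: str
    deps: tuple   # names of earlier instructions whose results this one needs
    latency: int  # cycles from start until the result is available


def simulate(program, width, out_of_order):
    """Return the cycle at which the last result becomes available."""
    seen = set()
    for instr in program:
        if any(d not in seen for d in instr.deps):
            raise ValueError(f"{instr.name} depends on a later or unknown instruction")
        seen.add(instr.name)

    done_at = {}             # name -> cycle at which its result is ready
    pending = list(program)  # not yet started, kept in program order
    cycle = 0
    while pending:
        started = 0
        for instr in list(pending):
            if started == width:
                break  # issue width exhausted for this cycle
            ready = all(done_at.get(d, float("inf")) <= cycle for d in instr.deps)
            if ready:
                done_at[instr.name] = cycle + instr.latency
                pending.remove(instr)
                started += 1
            elif not out_of_order:
                break  # in-order: a stalled instruction blocks everything behind it
        cycle += 1
    return max(done_at.values(), default=0)


if __name__ == "__main__":
    # Hypothetical program: two slow operations, each followed by a dependent one.
    program = [
        Instr("A", (), 3),
        Instr("B", ("A",), 1),
        Instr("C", (), 3),
        Instr("D", ("C",), 1),
    ]
    print("in-order:    ", simulate(program, width=2, out_of_order=False))  # 7
    print("out-of-order:", simulate(program, width=2, out_of_order=True))   # 4
```

In this model the in-order machine stalls behind B while A is still running, so C cannot start until cycle 3. The out-of-order machine starts C alongside A, and both dependent instructions finish by cycle 4.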
Micro-operations (µops)
Complex instructions are broken into simpler micro-operations. The CPU schedules and executes µops, not the original instructions. This allows better parallelism and scheduling.
Latency vs Throughput
These are different concepts:
- Latency: How long one operation takes to complete
- Throughput: How many operations can complete per unit time
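Example: An integer ADD might have 1-cycle latency and a throughput of 2-4 per cycle (because there are multiple ALUs). A pipelined multiply might take around 3 cycles to produce its result yet still accept a new multiply every cycle. If the operations are independent, you can start new ones every cycle even though each takes multiple cycles to complete; if each one depends on the previous result, you pay the full latency every time.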
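The arithmetic behind this distinction can be sketched in a few lines of Python. The function names and the latency and throughput figures are illustrative assumptions, not data for any particular CPU, and the model ignores everything except latency and issue rate.

```python
import math


def chain_cycles(n_ops: int, latency: int) -> int:
    """Each op needs the previous result, so the latencies add up."""
    return n_ops * latency


def independent_cycles(n_ops: int, latency: int, per_cycle: int) -> int:
    """Ops start `per_cycle` at a time; the last batch finishes `latency` cycles after it starts."""
    if n_ops == 0:
        return 0
    last_start = math.ceil(n_ops / per_cycle) - 1
    return last_start + latency


if __name__ == "__main__":
    # Assumed multiply: 3-cycle latency, 1 started per cycle.
    print(chain_cycles(1000, 3), independent_cycles(1000, 3, 1))  # 3000 1002
    # Assumed add: 1-cycle latency, 4 started per cycle.
    print(chain_cycles(1000, 1), independent_cycles(1000, 1, 4))  # 1000 250
```

For long runs the dependent case approaches latency cycles per operation, while the independent case approaches 1 / throughput cycles per operation.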
IPC (Instructions Per Cycle)
IPC measures how many instructions complete per clock cycle. Ideal IPC depends on the CPU's issue width (how many instructions can start per cycle). Real IPC is lower due to:
- Dependencies between instructions
- Resource conflicts (multiple instructions need the same unit)
- Cache misses and branch mispredictions
Note: You can measure IPC using hardware counters (perf on Linux), but you can also infer it from timing measurements by comparing dependent vs independent code patterns.
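Both routes in the note reduce to simple arithmetic, sketched here in Python. The helper names and all numbers are made up for illustration. The timing route also assumes you know the clock frequency the core actually ran at, which frequency scaling and turbo modes can make unreliable.

```python
def ipc(instructions: int, cycles: int) -> float:
    """IPC from two hardware counter readings."""
    return instructions / cycles


def measured_cycles_per_op(elapsed_seconds: float, clock_hz: float, n_ops: int) -> float:
    """Estimate cycles per operation from wall-clock time and an assumed clock rate."""
    return elapsed_seconds * clock_hz / n_ops


if __name__ == "__main__":
    # Hypothetical counter readings: 8 billion instructions in 4 billion cycles.
    print(ipc(8_000_000_000, 4_000_000_000))  # 2.0

    # Hypothetical timings for 1 billion operations on an assumed 3 GHz core.
    n_ops, clock_hz = 1_000_000_000, 3.0e9
    print(measured_cycles_per_op(1.00, clock_hz, n_ops))  # dependent run: 3.0
    print(measured_cycles_per_op(0.25, clock_hz, n_ops))  # independent run: 0.75
```

In this invented example the dependent version costs about 3 cycles per operation (consistent with a 3-cycle latency), while the independent version drops below 1 cycle per operation, which is only possible if more than one operation completes per cycle.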
How Compilers Change the Experiment
Compilers apply many optimizations that affect what the CPU actually executes:
- Loop unrolling: Reduces loop overhead and exposes more ILP
- Instruction scheduling: Reorders instructions to hide latency by moving dependent operations further apart
- Constant propagation: Eliminates computations at compile time
- Dead code elimination: Removes unused computations
- Vectorization: Uses SIMD instructions when possible
Always inspect the generated assembly (for example, with the -S flag in GCC or Clang) to see what the compiler actually produced. The source code and the executed code can be very different.
Expected Shape of Results
When measuring code with dependencies vs independent operations:
- Dependent chains: Performance limited by latency (one operation per latency period)
- Independent operations: Performance limited by throughput (multiple operations per cycle)
- The ratio between them reveals the CPU's ability to find parallelism
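One way to picture the expected curve is a simple model in Python: with k independent dependency chains interleaved, cost per operation falls as latency / k until it reaches the throughput floor. The function name and the latency and throughput values below are assumptions for illustration, not figures for a specific CPU, and the model ignores loop overhead, register pressure, and memory effects.

```python
def predicted_cycles_per_op(chains: int, latency: int, per_cycle: int) -> float:
    """Latency-bound (latency / chains) until the throughput floor (1 / per_cycle) takes over."""
    return max(latency / chains, 1 / per_cycle)


if __name__ == "__main__":
    # Assumed operation: 4-cycle latency, 2 can start per cycle.
    for k in (1, 2, 4, 8, 16):
        print(k, predicted_cycles_per_op(k, latency=4, per_cycle=2))
    # 1 4.0
    # 2 2.0
    # 4 1.0
    # 8 0.5
    # 16 0.5
```

In this model the curve flattens once the number of chains reaches latency × throughput (8 here). A measured curve that flattens earlier suggests some other limit, such as one of the pitfalls listed further down.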
Interpretation
When you see independent operations running faster than dependent chains, the CPU is finding and exploiting instruction-level parallelism. The out-of-order execution engine is working. When dependencies force sequential execution, you're limited by instruction latency.
Common Pitfalls
- Compiler optimizations: The compiler may unroll loops or reorder operations, changing the experiment
- Register pressure: Too many live variables can force spills to memory, adding latency
- Resource conflicts: Multiple instructions competing for the same execution unit
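- False dependencies: Register renaming removes most write-after-read and write-after-write hazards, but a few cases (such as instructions that update only part of a register) can still serialize otherwise independent work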
Tooling Upgrades (Optional)
Linux: perf stat
perf stat -e instructions,cycles ./your_program reports both counts, and perf derives IPC from them (shown as "insn per cycle"). Compare dependent vs independent code to see the difference.
macOS: Instruments
Instruments "Counters" template can show IPC and other microarchitectural metrics. Requires compatible hardware.
Windows: ETW
ETW can capture CPU performance counters including IPC. Windows Performance Analyzer provides visualization.
Checklist
- ✓ Understand the difference between latency and throughput
- ✓ Know what pipelines, superscalar, and out-of-order execution mean
- ✓ Understand how compilers can change what gets executed
- ✓ Ready to measure dependent vs independent code patterns