Execution Model

What You'll Learn

Mental Model

Modern CPUs don't execute instructions one at a time. They use pipelines to overlap execution of multiple instructions, superscalar execution to run multiple instructions per cycle, and out-of-order execution to find parallelism even in sequential code.

Understanding this execution model helps explain why some code patterns are fast and others are slow, even when they do the same amount of work.

Key Concepts

Pipelines

Instructions are broken into stages (fetch, decode, execute, write-back). While one instruction is in the execute stage, the next is in decode, and the one after that is being fetched. This allows multiple instructions to be "in flight" simultaneously.
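The benefit of overlapping stages can be put into numbers with a small idealized model. The sketch below assumes a four-stage pipeline with no stalls, where every stage takes exactly one cycle; the function names are illustrative, and real pipelines are deeper and do stall.

```python
def unpipelined_cycles(num_instructions: int, num_stages: int = 4) -> int:
    """Cycles if each instruction finishes every stage before the next one starts."""
    return num_instructions * num_stages


def pipelined_cycles(num_instructions: int, num_stages: int = 4) -> int:
    """Cycles on an ideal pipeline (assumption: one cycle per stage, no stalls)."""
    if num_instructions == 0:
        return 0
    # The first instruction needs num_stages cycles to pass through the pipe;
    # every later instruction finishes one cycle after the one before it.
    return num_stages + (num_instructions - 1)


if __name__ == "__main__":
    for n in (1, 4, 100):
        print(n, unpipelined_cycles(n), pipelined_cycles(n))
    # 1 4 4
    # 4 16 7
    # 100 400 103
```

For long instruction streams the pipelined cost approaches one cycle per instruction, even though each individual instruction still spends four cycles in flight.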

Superscalar Execution

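Modern CPUs have multiple execution units (ALUs, FPUs, load/store units). If two independent instructions can use different units, they can execute in the same cycle. The parallelism available among independent instructions is called instruction-level parallelism (ILP), and superscalar hardware is what exploits it.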

Out-of-Order Execution

The CPU can reorder instructions to find parallelism. If instruction B depends on A, but C is independent of both, the CPU can execute C while B is still waiting for A's result. A reorder buffer ensures results are committed (retired) in program order, so the program still behaves as if it ran sequentially.
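The A/B/C scenario can be traced with a toy scheduler. This is a deliberately simplified model, not how any real core is implemented: it assumes a fixed issue width, latencies of at least one cycle, no limits on execution units or buffer sizes, and unlimited retirement bandwidth. The names `schedule` and `commit_order` are hypothetical helpers for this illustration.

```python
def schedule(instrs, issue_width=2):
    """Toy out-of-order issue model.

    instrs: list of (name, latency, deps) in program order, where deps lists
    the names of earlier instructions. Returns {name: (start, finish)} cycles.
    """
    known = set()
    for name, _, deps in instrs:
        if any(d not in known for d in deps):
            raise ValueError(f"{name} depends on an instruction not defined earlier")
        known.add(name)

    times = {}
    pending = list(instrs)
    cycle = 0
    while pending:
        issued = 0
        for instr in list(pending):  # oldest first
            name, latency, deps = instr
            # Ready once every producer has finished by the start of this cycle.
            ready = all(d in times and times[d][1] <= cycle for d in deps)
            if ready and issued < issue_width:
                times[name] = (cycle, cycle + latency)
                pending.remove(instr)
                issued += 1
        cycle += 1
    return times


def commit_order(instrs, times):
    """In-order retirement: no instruction commits before its predecessor."""
    commits, last = {}, 0
    for name, _, _ in instrs:
        last = max(last, times[name][1])
        commits[name] = last
    return commits


if __name__ == "__main__":
    # A is slow (3 cycles), B needs A's result, C is independent.
    program = [("A", 3, []), ("B", 1, ["A"]), ("C", 1, [])]
    times = schedule(program)
    print(times)                         # A: (0, 3), C: (0, 1), B: (3, 4)
    print(commit_order(program, times))  # A: 3, B: 4, C: 4
```

In this model C executes in cycle 0, long before B can start, but it commits no earlier than B, which mirrors the role of the reorder buffer.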

Micro-operations (µops)

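Complex instructions are broken into simpler micro-operations (µops). The CPU schedules and executes µops, not the original instructions. Because each µop is tracked separately, the independent parts of a complex instruction (for example, the memory load in an instruction that loads and then adds) can be scheduled as soon as their own inputs are ready.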

Latency vs Throughput

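These are different concepts: latency is the number of cycles from when an instruction starts until its result is available to a dependent instruction, while throughput is how many instructions of that kind the CPU can start per cycle when they are independent of each other.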

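Example: On many x86 cores, an integer ADD has 1 cycle latency and 2-4 per cycle throughput (because there are multiple ALUs), so independent adds can execute several at a time. An integer multiply typically has a latency of about 3 cycles but a throughput of 1 per cycle: if you have independent multiplies, you can start a new one every cycle even though each takes multiple cycles to complete.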
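The difference can be worked through with a back-of-the-envelope model. The sketch below uses hypothetical numbers (a 3-cycle operation that can start once per cycle, and a 1-cycle operation that can start four times per cycle) and ignores every other bottleneck; the function names are illustrative.

```python
import math


def dependent_chain_cycles(n: int, latency: int) -> int:
    """Each operation waits for the previous result, so latencies add up."""
    return n * latency


def independent_ops_cycles(n: int, latency: int, per_cycle: int) -> int:
    """Independent operations are limited by how many can start per cycle."""
    if n == 0:
        return 0
    # The last operation starts in cycle ceil(n / per_cycle) - 1
    # and finishes `latency` cycles after that.
    return math.ceil(n / per_cycle) - 1 + latency


if __name__ == "__main__":
    # Hypothetical 3-cycle operation, one start per cycle.
    print(dependent_chain_cycles(1000, latency=3))               # 3000
    print(independent_ops_cycles(1000, latency=3, per_cycle=1))  # 1002
    # Hypothetical 1-cycle operation, four starts per cycle.
    print(dependent_chain_cycles(1000, latency=1))               # 1000
    print(independent_ops_cycles(1000, latency=1, per_cycle=4))  # 250
```

A dependent chain is bounded by latency, while a stream of independent operations is bounded by throughput; the same 1000 operations differ by roughly 3x to 4x in this model.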

IPC (Instructions Per Cycle)

IPC measures how many instructions complete per clock cycle. Ideal IPC depends on the CPU's issue width (how many instructions can start per cycle). Real IPC is lower due to:

Note: You can measure IPC using hardware counters (perf on Linux), but you can also infer it from timing measurements by comparing dependent vs independent code patterns.
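As a rough sketch of both approaches, the helpers below compute IPC from counter values and estimate it from wall-clock time. The function names are hypothetical, and the timing-based versions assume the core runs at a known, constant clock frequency, which frequency scaling can violate in practice.

```python
def ipc(instructions: int, cycles: int) -> float:
    """IPC from hardware counter values."""
    return instructions / cycles


def ipc_from_timing(instructions: int, seconds: float, clock_hz: float) -> float:
    """Estimate IPC from elapsed time (assumption: constant clock_hz)."""
    cycles = seconds * clock_hz
    return instructions / cycles


def relative_ipc(seconds_dependent: float, seconds_independent: float) -> float:
    """How much higher the independent version's IPC is, assuming both
    versions execute the same number of instructions at the same clock."""
    return seconds_dependent / seconds_independent


if __name__ == "__main__":
    print(ipc(4_000_000, 2_000_000))                      # 2.0
    print(ipc_from_timing(3_000_000_000, 1.0, 3.0e9))     # 1.0
    print(relative_ipc(0.9, 0.3))
```

The relative form is often the more robust one: even without knowing the exact instruction count or clock speed, the ratio of the two timings indicates how much parallelism the independent version exposed.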

How Compilers Change the Experiment

Compilers apply many optimizations that affect what the CPU actually executes:

Always inspect the generated assembly (for example, with the -S flag in GCC or Clang) to see what the compiler actually produced. The source code and the executed code can be very different.

Expected Shape of Results

When measuring code with dependencies vs independent operations:

Interpretation

When you see independent operations running faster than dependent chains, the CPU is finding and exploiting instruction-level parallelism. The out-of-order execution engine is working. When dependencies force sequential execution, you're limited by instruction latency.

Common Pitfalls

Tooling Upgrades (Optional)

Linux: perf stat

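perf stat -e instructions,cycles ./your_program reports both counts, and perf prints the derived ratio as "insn per cycle" next to the instruction count; that ratio is the actual IPC. Compare dependent vs independent code to see the difference.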

macOS: Instruments

Instruments "Counters" template can show IPC and other microarchitectural metrics. Requires compatible hardware.

Windows: ETW

ETW can capture CPU performance counters including IPC. Windows Performance Analyzer provides visualization.

Checklist