Instruction Latency vs Throughput

What You'll Learn

Mental Model

Not all instructions cost the same. On typical modern CPUs a simple integer operation such as ADD has a latency of 1 cycle, while an integer division (DIV) can take roughly 20-40 cycles, depending on the processor and the operand size. Understanding these costs helps explain performance differences.

Latency: How many cycles pass from the start of one operation until its result is available to a dependent operation.
Throughput: How many independent operations of the same kind can start per cycle.
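
A back-of-the-envelope model makes the distinction concrete. The sketch below assumes an idealized operation with a latency of `latency` cycles and a throughput of `per_cycle` operations started per cycle. The function names and the simplified pipeline model are illustrative assumptions, not a description of any specific CPU.

```cpp
#include <cstdint>

// Cycles for n operations where each one needs the previous result:
// nothing overlaps, so the total is n * latency.
std::uint64_t dependent_cycles(std::uint64_t n, std::uint64_t latency) {
    return n * latency;
}

// Cycles for n independent operations: the CPU starts `per_cycle` of them
// each cycle, and the last group still needs `latency` cycles to finish.
std::uint64_t independent_cycles(std::uint64_t n, std::uint64_t latency,
                                 std::uint64_t per_cycle) {
    if (n == 0) return 0;
    std::uint64_t issue_cycles = (n + per_cycle - 1) / per_cycle;  // ceil(n / per_cycle)
    return issue_cycles - 1 + latency;
}
```

With a 3-cycle latency and one operation started per cycle, 1000 dependent operations take 3000 cycles under this model, while 1000 independent ones take 1002.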

Experiment: Add Chain

#include <chrono>
#include <iostream>

using Clock = std::chrono::steady_clock;  // monotonic, so suited to measuring intervals
using Duration = std::chrono::nanoseconds;

int main() {
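    // volatile keeps the compiler from optimizing the loop away, but it also
    // forces every read and write through memory, so the timing includes
    // load/store cost. unsigned avoids signed-overflow undefined behavior.
    volatile unsigned a = 1, b = 2, c = 3, d = 4;
    const int iterations = 10000000;
    
    auto start = Clock::now();
    for (int i = 0; i < iterations; ++i) {
        a = a + b;  // Dependency chain: each result feeds a later addition
        b = b + c;
        c = c + d;
        d = d + a;
    }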
    auto end = Clock::now();
    
    auto elapsed = std::chrono::duration_cast<Duration>(end - start);
    double ns_per_op = static_cast<double>(elapsed.count()) / (iterations * 4);
    
    std::cout << "ADD chain: " << ns_per_op << " ns/operation\n";
    return 0;
}
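
Latency figures are usually quoted in cycles, while the program above reports nanoseconds. A minimal sketch of the conversion, assuming you know the clock frequency the core actually ran at (the `ghz` parameter is a value you must supply yourself; turbo and power-saving states can change it during a run):

```cpp
// Convert a measured time per operation into clock cycles.
// A clock running at `ghz` GHz ticks `ghz` times per nanosecond,
// so cycles = nanoseconds * ghz.
double ns_to_cycles(double ns_per_op, double ghz) {
    return ns_per_op * ghz;
}
```

For example, 0.5 ns per operation at 4 GHz corresponds to 2 cycles per operation.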

Experiment: Multiply Chain

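// Same structure, but with multiplication.
// Unsigned arithmetic wraps on overflow (signed overflow is undefined behavior),
// and odd starting values keep the products from collapsing to zero.
volatile unsigned a = 3, b = 5, c = 7, d = 9;
for (int i = 0; i < iterations; ++i) {
    a = a * b;
    b = b * c;
    c = c * d;
    d = d * a;
}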

Experiment: Divide Chain

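// Same structure, but with division.
// Each quotient becomes the next divisor, so the divisions form one serial chain.
// The `| 1u` keeps every divisor nonzero; dividing by zero would crash the program.
const unsigned n = 1000000007u;
volatile unsigned a = 1, b = 1, c = 1, d = 7;
for (int i = 0; i < iterations; ++i) {
    a = n / (d | 1u);  // Much slower than ADD or MUL
    b = n / (a | 1u);
    c = n / (b | 1u);
    d = n / (c | 1u);
}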

Experiment: Independent Operations (Throughput)

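// Independent operations to measure throughput.
// The four sums never read each other, so the CPU can overlap them.
// unsigned lets the sums wrap instead of overflowing a signed int.
volatile unsigned sum1 = 0, sum2 = 0, sum3 = 0, sum4 = 0;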
for (int i = 0; i < iterations; ++i) {
    sum1 = sum1 + i;      // Independent
    sum2 = sum2 + (i * 2);
    sum3 = sum3 + (i * 3);
    sum4 = sum4 + (i * 4);
}
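
The fragments above all reuse the timing code from the first experiment. One way to avoid repeating it is a small helper that times any loop body. This is a sketch only: `time_ns_per_op` is a hypothetical name, and the lambda call adds a little overhead unless the compiler inlines it.

```cpp
#include <chrono>

// Run `body(i)` for `iterations` iterations and return nanoseconds per
// operation, where each call to body performs `ops_per_iteration` operations.
template <typename Body>
double time_ns_per_op(int iterations, int ops_per_iteration, Body body) {
    using Clock = std::chrono::steady_clock;
    auto start = Clock::now();
    for (int i = 0; i < iterations; ++i) {
        body(i);
    }
    auto end = Clock::now();
    auto elapsed =
        std::chrono::duration_cast<std::chrono::nanoseconds>(end - start);
    return static_cast<double>(elapsed.count()) /
           (static_cast<double>(iterations) * ops_per_iteration);
}
```

An experiment then shrinks to a single call, for example `time_ns_per_op(iterations, 4, [&](int i) { sum1 = sum1 + i; /* ... */ })`, with the variables declared as in the fragment being timed.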

What to Measure

Expected Shape of Results

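On modern CPUs, expect the ADD chain to be fastest, the multiply chain somewhat slower, and the divide chain slowest by a wide margin.

Independent operations will be faster than dependency chains because the CPU can overlap their execution.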

Interpretation

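Why division dominates: Division is implemented iteratively, like long division. Each step produces part of the quotient and needs the remainder left by the previous step, so the steps cannot overlap and the operation takes many cycles. Multiplication can sum its partial products with parallel adders, but division has no equivalent shortcut.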
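
The sequential nature of division is easy to see in software. The following is a minimal sketch of restoring (shift-and-subtract) division, the binary form of long division. It is meant as an illustration: `divide_bitwise` is a hypothetical name, and hardware dividers typically use faster variants that produce more than one quotient bit per step while keeping the same step-to-step dependency.

```cpp
#include <cstdint>

// Restoring (shift-and-subtract) division of two 32-bit unsigned values.
// One quotient bit is produced per step, and each step needs the remainder
// left by the previous one, so the 32 steps cannot run in parallel.
// Precondition: divisor != 0.
std::uint32_t divide_bitwise(std::uint32_t dividend, std::uint32_t divisor) {
    std::uint64_t remainder = 0;  // 64-bit so the shift below cannot overflow
    std::uint32_t quotient = 0;
    for (int bit = 31; bit >= 0; --bit) {
        // Bring down the next bit of the dividend.
        remainder = (remainder << 1) | ((dividend >> bit) & 1u);
        if (remainder >= divisor) {  // the divisor fits: subtract it, set the bit
            remainder -= divisor;
            quotient |= (std::uint32_t{1} << bit);
        }
    }
    return quotient;
}
```

Every iteration reads the `remainder` written by the one before it, which is the dependency that keeps the steps from overlapping.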

Why unrolling can increase throughput: When operations are independent, unrolling exposes more parallelism. The CPU can start multiple operations per cycle, limited by how many instructions it can issue per cycle and how many suitable execution units it has.
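
As an illustration of the idea, the sketch below sums an array once with a single accumulator and once unrolled by four with separate accumulators. The function names are hypothetical, and an optimizing compiler may unroll or vectorize either version on its own, so inspect the generated assembly before drawing conclusions from timings.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// One accumulator: every addition waits for the previous one.
std::uint64_t sum_single(const std::vector<std::uint32_t>& v) {
    std::uint64_t sum = 0;
    for (std::size_t i = 0; i < v.size(); ++i) sum += v[i];
    return sum;
}

// Unrolled by four with separate accumulators: the four additions in each
// iteration do not depend on one another, so the CPU can overlap them.
std::uint64_t sum_unrolled4(const std::vector<std::uint32_t>& v) {
    std::uint64_t s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    std::size_t i = 0;
    for (; i + 4 <= v.size(); i += 4) {
        s0 += v[i];
        s1 += v[i + 1];
        s2 += v[i + 2];
        s3 += v[i + 3];
    }
    for (; i < v.size(); ++i) s0 += v[i];  // leftover elements
    return (s0 + s1) + (s2 + s3);
}
```

Both functions return the same sum; only the shape of the dependency chain differs.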

Assembly Inspection

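To see what the compiler actually generated, have it emit assembly (for example with the -S flag of GCC or Clang) or disassemble the compiled binary (for example with objdump -d).

Look for the actual instruction mnemonics (ADD, IMUL/MUL, DIV/IDIV) and check whether the compiler unrolled loops or reordered operations.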

Common Pitfalls

Tooling Upgrades (Optional)

Linux: perf stat

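perf stat -e arith.div,arith.mul,arith.fpu ./your_program can count specific instruction types. These event names are specific to particular CPU models; run perf list to see which events your processor exposes.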

macOS: Instruments

Instruments can show time spent in different instruction types, though with less detail than perf.

Windows: ETW

ETW can capture instruction-level events, though setup is more complex.

Checklist