“Know how to solve every problem that has been solved.” “What I cannot create, I do not understand.” — Richard Feynman

Instruction Latency vs Throughput

What You'll Learn

Why different instructions have different costs
How to measure instruction latency
How to measure instruction throughput
Why division is so much slower than addition

Mental Model

Not all instructions are created equal. Simple operations like ADD complete in 1 cycle. Complex operations like DIV can take 20-40 cycles. Understanding these costs helps explain performance differences.

Latency: How many cycles pass between an operation starting and its result being available to an operation that depends on it.
Throughput: How many operations can start per cycle (when independent).

Experiment: Add Chain

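#include <chrono>
#include <iostream>

using Clock = std::chrono::steady_clock;  // monotonic clock, suited to interval timing
using Duration = std::chrono::nanoseconds;

int main() {
    volatile unsigned seed = 2;  // volatile read keeps the addend opaque to the optimizer
    const unsigned b = seed;
    unsigned a = 1;              // unsigned: wraparound is well defined
    const int iterations = 10000000;

    auto start = Clock::now();
    for (int i = 0; i < iterations; ++i) {
        // Dependency chain: each add needs the result of the previous one.
        // The empty asm (GCC/Clang only) keeps a in a register and stops
        // the compiler from folding the four adds into one.
        a += b; asm volatile("" : "+r"(a));
        a += b; asm volatile("" : "+r"(a));
        a += b; asm volatile("" : "+r"(a));
        a += b; asm volatile("" : "+r"(a));
    }
    auto end = Clock::now();

    auto elapsed = std::chrono::duration_cast<Duration>(end - start);
    double ns_per_op = static_cast<double>(elapsed.count()) / (iterations * 4.0);

    // Printing a keeps the result observable.
    std::cout << "ADD chain: " << ns_per_op << " ns/operation (result " << a << ")\n";
    return 0;
}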

Experiment: Multiply Chain

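// Same structure, but with multiplication
volatile unsigned seed = 3;  // volatile read keeps the multiplier opaque to the optimizer
const unsigned b = seed;     // odd multiplier, so the product never collapses to 0
unsigned a = 1;              // unsigned: wraparound is well defined
for (int i = 0; i < iterations; ++i) {
    // Each multiply needs the previous product. The empty asm (GCC/Clang only)
    // stops the compiler from merging the four multiplies into one.
    a *= b; asm volatile("" : "+r"(a));
    a *= b; asm volatile("" : "+r"(a));
    a *= b; asm volatile("" : "+r"(a));
    a *= b; asm volatile("" : "+r"(a));
}
// Print a after the loop so the result stays observable.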

Experiment: Divide Chain

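// Same structure, but with division
volatile unsigned seed_n = 1000000, seed_x = 7;  // volatile reads keep the values opaque to the optimizer
const unsigned n = seed_n;
unsigned x = seed_x;
for (int i = 0; i < iterations; ++i) {
    // Each quotient becomes the next divisor, so the divides form a serial chain.
    // x alternates between 142857 and 7, so the divisor is never zero.
    x = n / x;  // Much slower!
    x = n / x;
    x = n / x;
    x = n / x;
}
// Print x after the loop so the work is not removed as dead code.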

Experiment: Independent Operations (Throughput)

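// Independent operations to measure throughput
volatile unsigned seed = 1;  // volatile read keeps the addend opaque to the optimizer
const unsigned k = seed;
unsigned sum1 = 0, sum2 = 0, sum3 = 0, sum4 = 0;  // unsigned: wraparound is well defined
for (int i = 0; i < iterations; ++i) {
    sum1 += k;  // Independent: no add uses another add's result
    sum2 += k;
    sum3 += k;
    sum4 += k;
    // Empty asm (GCC/Clang only) keeps all four adds in every iteration.
    asm volatile("" : "+r"(sum1), "+r"(sum2), "+r"(sum3), "+r"(sum4));
}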

What to Measure

Latency: ns/operation for dependency chains (ADD, MUL, DIV)
Throughput: ns/operation for independent operations
Ratio: Compare ADD vs MUL vs DIV latency
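
One way to collect these numbers consistently is a small timing helper. The sketch below is only an illustration: the name ns_per_op, the repeats parameter, and the best-of-N policy are assumptions of this example, not part of the experiments above. It runs a workload several times and keeps the smallest time per operation, which tends to filter out interruptions from the operating system.

```cpp
#include <algorithm>
#include <chrono>
#include <limits>

// Hypothetical helper: runs `body` `repeats` times and returns the best
// (smallest) time per operation in nanoseconds. `ops_per_run` is the number
// of operations a single call to `body` performs.
template <typename Body>
double ns_per_op(Body&& body, long long ops_per_run, int repeats = 5) {
    using Clock = std::chrono::steady_clock;
    double best = std::numeric_limits<double>::infinity();
    for (int r = 0; r < repeats; ++r) {
        auto start = Clock::now();
        body();
        auto end = Clock::now();
        // Elapsed time as a floating-point count of nanoseconds.
        std::chrono::duration<double, std::nano> elapsed = end - start;
        best = std::min(best, elapsed.count() / static_cast<double>(ops_per_run));
    }
    return best;
}
```

Each experiment loop can be wrapped in a lambda and passed as body, with ops_per_run set to iterations * 4 to match the four operations per iteration.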

Expected Shape of Results

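Typical results for the dependency chains on modern CPUs, assuming a clock of roughly 2-4 GHz (0.25-0.5 ns per cycle):

ADD: ~0.25-0.5 ns/op (1 cycle latency, 2-4 per cycle throughput)
MUL: ~0.75-1.5 ns/op (typically 3 cycle latency, 1-2 per cycle throughput)
DIV: ~2.5-20 ns/op (roughly 10-40 cycle latency depending on CPU generation and operand width, very low throughput)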

Independent operations will be faster than dependency chains because the CPU can overlap execution.
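
To compare a measured ns/op figure with the cycle counts above, multiply by the clock frequency in GHz. The helpers below are a minimal sketch; the names ns_to_cycles and ops_per_cycle are made up for this example, and clock_ghz is whatever frequency you assume the core ran at during the measurement (turbo and power management can make the real value differ).

```cpp
// Converts a measured time per operation into clock cycles.
// `clock_ghz` is the core clock assumed to be in effect during the run.
constexpr double ns_to_cycles(double ns_per_op, double clock_ghz) {
    return ns_per_op * clock_ghz;  // 1 GHz means 1 cycle per nanosecond
}

// Operations completed per cycle, the usual way to express throughput.
constexpr double ops_per_cycle(double ns_per_op, double clock_ghz) {
    return 1.0 / ns_to_cycles(ns_per_op, clock_ghz);
}
```

For instance, 0.25 ns/op at an assumed 4 GHz corresponds to 1 cycle per operation, and 0.125 ns/op at the same clock corresponds to 2 operations per cycle.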

Interpretation

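Why division dominates: Hardware dividers work iteratively, much like long division. Each step produces a few quotient bits and needs the remainder left by the previous step, so the steps cannot overlap. Multiplication has no such dependency: its partial products can all be formed at once and summed by a tree of adders.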
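
The sequential nature of division can be seen in a textbook shift-subtract (restoring) algorithm. The sketch below is illustrative only: the function name divide_restoring is made up for this example, and it is not how any particular CPU implements DIV (real dividers typically produce more than one quotient bit per step). What it shares with hardware is that every step depends on the remainder from the step before.

```cpp
#include <cstdint>

// Illustrative shift-subtract (restoring) division for 32-bit unsigned values.
// Textbook sketch, not a model of a specific CPU's divider.
// Precondition: divisor != 0.
std::uint32_t divide_restoring(std::uint32_t dividend, std::uint32_t divisor) {
    std::uint64_t remainder = 0;  // 64-bit so the shift below cannot overflow
    std::uint32_t quotient = 0;
    for (int bit = 31; bit >= 0; --bit) {
        // Bring down the next dividend bit. This needs the remainder left by
        // the previous iteration, so iterations cannot run in parallel.
        remainder = (remainder << 1) | ((dividend >> bit) & 1u);
        if (remainder >= divisor) {
            remainder -= divisor;
            quotient |= (1u << bit);
        }
    }
    return quotient;
}
```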

Why unrolling can increase throughput: When operations are independent, unrolling with separate accumulators gives the CPU several dependency chains to work on at once. It can then start multiple operations per cycle, limited by how many execution units and issue slots are available.
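
As a sketch of this idea, the two functions below sum the same data with one accumulator and with four. The names sum_single and sum_unrolled4 are made up for this example. It uses double on purpose: without flags such as -ffast-math, compilers generally do not reassociate floating-point additions, so the single-accumulator loop stays a serial chain, whereas an integer version would likely be restructured by the optimizer on its own.

```cpp
#include <cstddef>
#include <vector>

// One accumulator: every addition waits for the previous one.
double sum_single(const std::vector<double>& v) {
    double s = 0.0;
    for (double x : v) s += x;
    return s;
}

// Four accumulators: four independent chains that can be in flight together.
double sum_unrolled4(const std::vector<double>& v) {
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    const std::size_t n = v.size();
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += v[i];
        s1 += v[i + 1];
        s2 += v[i + 2];
        s3 += v[i + 3];
    }
    for (; i < n; ++i) s0 += v[i];  // leftover elements
    return (s0 + s1) + (s2 + s3);
}
```

Because the additions happen in a different order, the two functions can differ in the last bits for general floating-point inputs; for small whole numbers they agree exactly.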

Assembly Inspection

To see what the compiler actually generated:

GCC/Clang: g++ -O3 -S your_file.cpp (creates your_file.s)
MSVC: cl /O2 /FA your_file.cpp (creates your_file.asm)

Look for the actual instruction mnemonics (on x86: ADD, IMUL, DIV, IDIV) and check whether the compiler unrolled loops, reordered operations, or replaced an operation with something cheaper.

Common Pitfalls

Integer vs floating-point: FP division is often faster than integer division
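Power-of-two division: Compiler optimizes x / 2 to x >> 1 for unsigned x; for signed x it adds a small fix-up so negative values still round toward zero. Division by other constants is usually replaced by a multiply and shift, so use a divisor the compiler cannot see at compile time.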
Constant folding: Compiler may compute constant expressions at compile time
Register allocation: Too many variables can cause spills, adding memory latency
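
The power-of-two pitfall is easy to check by hand. The sketch below uses two hypothetical helpers, half_unsigned and half_signed, to show why a plain shift is only a drop-in replacement for unsigned values. It assumes an arithmetic right shift for negative integers, which C++20 guarantees and mainstream compilers provide in earlier standards too.

```cpp
// Unsigned: dividing by 2 and shifting right by 1 always agree.
constexpr unsigned half_unsigned(unsigned x) { return x >> 1; }

// Signed: division truncates toward zero, so a plain shift gives the wrong
// answer for negative odd values (it rounds toward negative infinity).
// Adding 1 to negative inputs first corrects the rounding; this is the kind
// of fix-up compilers typically emit for x / 2.
constexpr int half_signed(int x) {
    return (x + (x < 0 ? 1 : 0)) >> 1;
}
```

For example, -7 / 2 is -3, while -7 >> 1 is -4 under an arithmetic shift; half_signed(-7) returns -3.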

Tooling Upgrades (Optional)

Linux: perf stat

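perf stat -e arith.div,arith.mul,arith.fpu ./your_program can count specific instruction types. These event names are specific to particular CPU microarchitectures and may not exist on your machine; run perf list to see which events it supports. The generic perf stat -e cycles,instructions ./your_program works on most systems and reports instructions per cycle.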

macOS: Instruments

Instruments can show time spent in different instruction types, though with less detail than perf.

Windows: ETW

ETW can capture instruction-level events, though setup is more complex.

Checklist

✓ Measured ADD, MUL, and DIV latency (dependency chains)
✓ Measured throughput for independent operations
✓ Compared ratios between instruction types
✓ Inspected assembly to verify compiler output
✓ Verified results are consistent