Instruction Latency vs Throughput
What You'll Learn
- Why different instructions have different costs
- How to measure instruction latency
- How to measure instruction throughput
- Why division is so much slower than addition
Mental Model
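Not all instructions are created equal. Simple operations like ADD typically complete in 1 cycle. Complex operations like DIV can take 20-40 cycles, depending on the CPU and the operand width. Understanding these costs helps explain performance differences.
Latency: How long one operation takes, from its inputs being ready to its result being usable by a dependent operation.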
Throughput: How many operations can start per cycle (when independent).
Experiment: Add Chain
#include <chrono>
#include <iostream>

using Clock = std::chrono::high_resolution_clock;
using Duration = std::chrono::nanoseconds;

int main() {
    // volatile stops the compiler from removing the loop, but it also routes
    // every value through memory, so the timing includes store-to-load
    // forwarding on top of the ADD itself.
    // unsigned: wraparound is well defined, whereas signed overflow is not.
    volatile unsigned a = 1, b = 2, c = 3, d = 4;
    const int iterations = 10000000;

    auto start = Clock::now();
    for (int i = 0; i < iterations; ++i) {
        a = a + d; // Dependency chain: each line uses the previous result
        b = b + a;
        c = c + b;
        d = d + c;
    }
    auto end = Clock::now();

    auto elapsed = std::chrono::duration_cast<Duration>(end - start);
    double ns_per_op = static_cast<double>(elapsed.count()) / (iterations * 4);
    std::cout << "ADD chain: " << ns_per_op << " ns/operation\n";
    return 0;
}
Experiment: Multiply Chain
// Same structure, but with multiplication.
// Odd unsigned values: products wrap around (well defined) and never become zero.
volatile unsigned a = 1, b = 3, c = 5, d = 7;
for (int i = 0; i < iterations; ++i) {
    a = a * d;
    b = b * a;
    c = c * b;
    d = d * c;
}
Experiment: Divide Chain
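// Same structure, but with division.
// Each quotient becomes the next divisor; | 1 keeps every divisor non-zero,
// so the loop can never divide by zero.
volatile unsigned a = 1000000, b = 2, c = 3, d = 4;
const unsigned n = 1000000;
for (int i = 0; i < iterations; ++i) {
    a = n / (d | 1); // Much slower!
    b = n / (a | 1);
    c = n / (b | 1);
    d = n / (c | 1);
}
Experiment: Independent Operations (Throughput)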
// Independent operations to measure throughput.
// unsigned: the growing sums wrap around (well defined) rather than
// overflowing a signed int.
volatile unsigned sum1 = 0, sum2 = 0, sum3 = 0, sum4 = 0;
for (int i = 0; i < iterations; ++i) {
    sum1 = sum1 + i; // Independent
    sum2 = sum2 + (i * 2);
    sum3 = sum3 + (i * 3);
    sum4 = sum4 + (i * 4);
}
What to Measure
- Latency: ns/operation for dependency chains (ADD, MUL, DIV)
- Throughput: ns/operation for independent operations
- Ratio: Compare ADD vs MUL vs DIV latency
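The measurements listed above can be wrapped in a small helper so that every experiment is timed the same way. The sketch below is one possible shape and is not part of the original experiments: ns_per_op and best_of are hypothetical names, and it uses std::chrono::steady_clock on the assumption that a monotonic clock is the safer choice for measuring intervals.

```cpp
#include <algorithm>
#include <chrono>
#include <limits>

// Hypothetical helper: times one call of `body` and returns nanoseconds per
// operation, where `ops` is how many operations the caller says `body` performs.
template <typename F>
double ns_per_op(F&& body, long long ops) {
    using Clock = std::chrono::steady_clock;
    const auto start = Clock::now();
    body();
    const auto end = Clock::now();
    const std::chrono::duration<double, std::nano> elapsed = end - start;
    return elapsed.count() / static_cast<double>(ops);
}

// Hypothetical helper: repeats the measurement and keeps the fastest run,
// which is usually the one least disturbed by other activity on the machine.
template <typename F>
double best_of(int runs, F&& body, long long ops) {
    double best = std::numeric_limits<double>::infinity();
    for (int r = 0; r < runs; ++r) {
        best = std::min(best, ns_per_op(body, ops));
    }
    return best;
}
```

A call such as best_of(5, [&] { /* one experiment loop */ }, 4LL * iterations) would time that loop five times and report the fastest ns/operation figure. Comparing the fastest run with the others is also a quick way to see how noisy the machine is.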
Expected Shape of Results
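Typical results on modern CPUs running at roughly 3-4 GHz (1 cycle ≈ 0.25-0.33 ns):
- ADD: ~0.25-0.35 ns/op (1 cycle latency, 2-4 per cycle throughput)
- MUL: ~0.75-1.7 ns/op (3-5 cycle latency, 1-2 per cycle throughput)
- DIV: ~5-13 ns/op (20-40 cycle latency, very low throughput)
These are pure instruction costs. Because the experiment variables are volatile, every operation also makes a round trip through memory, so measured chain times will sit somewhat above these figures.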
Independent operations will be faster than dependency chains because the CPU can overlap execution.
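To compare a measured ns/op figure with the cycle counts quoted above, multiply it by the clock frequency in GHz. A minimal sketch, assuming you know (or can look up) the frequency the core actually ran at; ns_to_cycles is a hypothetical helper name:

```cpp
// Hypothetical helper: converts a per-operation time into CPU cycles.
// 1 GHz is one cycle per nanosecond, so cycles = nanoseconds * GHz.
constexpr double ns_to_cycles(double ns_per_op, double clock_ghz) {
    return ns_per_op * clock_ghz;
}
```

For example, ns_to_cycles(0.25, 4.0) gives 1 cycle and ns_to_cycles(10.0, 4.0) gives 40 cycles. Clock frequency varies with turbo and power states, so treat the result as an estimate rather than an exact count.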
Interpretation
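Why division dominates: Division is implemented iteratively (like long division). Each step needs the remainder left by the previous step, so the algorithm is inherently sequential and takes many cycles. Multiplication can add its partial products in parallel adders, but division cannot. Dividers are also usually not fully pipelined, which is why division has low throughput as well as high latency.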
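To make the "like long division" point concrete, here is a software sketch of bit-by-bit restoring division. It only illustrates the dependency structure and does not describe how any particular CPU's divider is built; restoring_divide and DivResult are hypothetical names.

```cpp
#include <cstdint>

struct DivResult {
    std::uint32_t quotient;
    std::uint32_t remainder;
};

// Bit-by-bit restoring division (illustrative only). Each step needs the
// remainder produced by the previous step, so the 32 steps cannot overlap.
// The caller must pass a non-zero divisor.
DivResult restoring_divide(std::uint32_t dividend, std::uint32_t divisor) {
    std::uint64_t remainder = 0;  // 64-bit so the shift below cannot overflow
    std::uint32_t quotient = 0;
    for (int bit = 31; bit >= 0; --bit) {
        // Bring down the next bit of the dividend.
        remainder = (remainder << 1) | ((dividend >> bit) & 1u);
        // Trial subtraction: keep it only if the divisor fits.
        if (remainder >= divisor) {
            remainder -= divisor;
            quotient |= (std::uint32_t{1} << bit);
        }
    }
    return {quotient, static_cast<std::uint32_t>(remainder)};
}
```

Hardware dividers generally produce more than one quotient bit per step, but the step-to-step dependency on the running remainder stays the same.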
Why unrolling can increase throughput: When operations are independent, unrolling exposes more parallelism. The CPU can start multiple operations per cycle, limited only by execution unit availability.
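A sketch of what "exposing more parallelism" can look like in source code: summing an array with one accumulator versus four. The function names are hypothetical, and whether the second version is actually faster depends on the compiler and CPU (an optimizing compiler may already transform the first version in a similar way).

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// One accumulator: every addition waits for the previous one.
std::uint64_t sum_single(const std::vector<std::uint32_t>& v) {
    std::uint64_t sum = 0;
    for (std::uint32_t x : v) sum += x;
    return sum;
}

// Four accumulators: the four chains are independent of each other, so the
// CPU is free to overlap them.
std::uint64_t sum_unrolled4(const std::vector<std::uint32_t>& v) {
    std::uint64_t s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    std::size_t i = 0;
    for (; i + 4 <= v.size(); i += 4) {
        s0 += v[i];
        s1 += v[i + 1];
        s2 += v[i + 2];
        s3 += v[i + 3];
    }
    for (; i < v.size(); ++i) s0 += v[i];  // leftover elements
    return s0 + s1 + s2 + s3;
}
```

With integer addition the reordering does not change the result. For floating-point sums it can change the rounding, which is one reason compilers do not always make this transformation on their own.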
Assembly Inspection
To see what the compiler actually generated:
- GCC/Clang: g++ -O3 -S your_file.cpp (creates your_file.s)
- MSVC: cl /O2 /FA your_file.cpp (creates your_file.asm)
Look for the actual instruction opcodes (ADD, MUL or IMUL, DIV or IDIV) and see if the compiler unrolled loops or reordered operations.
Common Pitfalls
- Integer vs floating-point: FP division is often faster than integer division
- Power-of-two division: Compiler optimizes x / 2 to x >> 1 (for signed x it adds a small fix-up so the result still rounds toward zero)
- Constant folding: Compiler may compute constant expressions at compile time
- Register allocation: Too many variables can cause spills, adding memory latency
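The power-of-two pitfall is easy to check directly. The sketch below (hypothetical function names) shows that a plain right shift matches division by 2 for unsigned values but not for negative signed ones, which is why a compiler cannot replace signed x / 2 with a bare shift. It assumes >> on a negative int is an arithmetic shift, which C++20 guarantees and mainstream compilers implemented before that.

```cpp
// Division by 2: truncates toward zero for signed values.
constexpr int div2(int x) { return x / 2; }

// Right shift by 1: rounds toward negative infinity for signed values
// (arithmetic shift; guaranteed since C++20).
constexpr int shr1(int x) { return x >> 1; }

// Unsigned values have no sign to worry about, so the two always agree.
constexpr bool unsigned_agree(unsigned x) { return (x / 2) == (x >> 1); }
```

For positive inputs div2 and shr1 return the same value, but div2(-7) is -3 while shr1(-7) is -4. When benchmarking division, use a divisor the compiler cannot see at compile time, or the DIV instruction may not be in the loop at all.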
Tooling Upgrades (Optional)
Linux: perf stat
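perf stat -e arith.div,arith.mul,arith.fpu ./your_program can count specific arithmetic events. These event names are CPU-specific and may not exist on your machine; run perf list to see which events are available.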
macOS: Instruments
Instruments can show time spent in different instruction types, though with less detail than perf.
Windows: ETW
ETW can capture instruction-level events, though setup is more complex.
Checklist
- ✓ Measured ADD, MUL, and DIV latency (dependency chains)
- ✓ Measured throughput for independent operations
- ✓ Compared ratios between instruction types
- ✓ Inspected assembly to verify compiler output
- ✓ Verified results are consistent