ILP Experiments
What You'll Learn
- How to measure instruction-level parallelism
- Why dependency chains are slow
- How loop unrolling affects performance
- How to estimate issue width through measurement
Mental Model
When instructions depend on each other, the CPU must wait. When they're independent, the CPU can execute them in parallel. The difference in performance between these patterns reveals the CPU's ability to find and exploit instruction-level parallelism.
Experiment 1: Dependency Chain
This code creates a chain of dependencies—each operation depends on the previous one:
    #include <chrono>
    #include <cstdint>
    #include <iostream>

    // steady_clock is monotonic, which suits interval timing.
    using Clock = std::chrono::steady_clock;
    using Duration = std::chrono::nanoseconds;

    int main() {
        // volatile stops the compiler from folding or vectorizing the loop.
        // 64-bit unsigned: the final sum (499999500000) does not fit in an int.
        volatile std::uint64_t sum = 0;
        const int iterations = 1000000;

        auto start = Clock::now();
        for (int i = 0; i < iterations; ++i) {
            sum = sum + i;  // Each iteration depends on the previous sum
        }
        auto end = Clock::now();

        auto elapsed = std::chrono::duration_cast<Duration>(end - start);
        double ns_per_iter = static_cast<double>(elapsed.count()) / iterations;
        std::cout << "Dependency chain: " << ns_per_iter << " ns/iter\n";
        std::cout << "Sum (to prevent optimization): " << sum << "\n";
        return 0;
    }

Compile: g++ -O3 -o dep_chain dep_chain.cpp
Run: ./dep_chain
Experiment 2: Independent Operations
This code performs independent operations that can execute in parallel:
    #include <chrono>
    #include <cstdint>
    #include <iostream>

    // steady_clock is monotonic, which suits interval timing.
    using Clock = std::chrono::steady_clock;
    using Duration = std::chrono::nanoseconds;

    int main() {
        // Four accumulators with no data dependencies between them.
        // 64-bit unsigned so the sums cannot overflow.
        volatile std::uint64_t sum1 = 0, sum2 = 0, sum3 = 0, sum4 = 0;
        const int iterations = 1000000;

        auto start = Clock::now();
        for (int i = 0; i < iterations; ++i) {
            sum1 = sum1 + i;        // Independent
            sum2 = sum2 + (i * 2);  // Independent
            sum3 = sum3 + (i * 3);  // Independent
            sum4 = sum4 + (i * 4);  // Independent
        }
        auto end = Clock::now();

        auto elapsed = std::chrono::duration_cast<Duration>(end - start);
        double ns_per_iter = static_cast<double>(elapsed.count()) / iterations;
        // Each iteration performs four additions, so also report time per addition
        // for a fair comparison with the dependency chain.
        std::cout << "Independent ops: " << ns_per_iter << " ns/iter ("
                  << ns_per_iter / 4 << " ns per addition)\n";
        std::cout << "Sums (to prevent optimization): "
                  << sum1 << " " << sum2 << " " << sum3 << " " << sum4 << "\n";
        return 0;
    }

Experiment 3: Loop Unrolling
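Unrolling the loop by hand, with one accumulator per unrolled step so that the additions do not all feed a single chain:
    #include <chrono>
    #include <cstdint>
    #include <iostream>

    // steady_clock is monotonic, which suits interval timing.
    using Clock = std::chrono::steady_clock;
    using Duration = std::chrono::nanoseconds;

    int main() {
        // One accumulator per unrolled step. A single shared accumulator would
        // keep every addition on the same dependency chain.
        volatile std::uint64_t sum0 = 0, sum1 = 0, sum2 = 0, sum3 = 0;
        const int iterations = 1000000;  // must be a multiple of 4

        auto start = Clock::now();
        for (int i = 0; i < iterations; i += 4) {
            sum0 = sum0 + i;
            sum1 = sum1 + (i + 1);
            sum2 = sum2 + (i + 2);
            sum3 = sum3 + (i + 3);
        }
        auto end = Clock::now();

        auto elapsed = std::chrono::duration_cast<Duration>(end - start);
        // Divided by the element count, so this is comparable to Experiment 1.
        double ns_per_iter = static_cast<double>(elapsed.count()) / iterations;
        std::uint64_t sum = sum0 + sum1 + sum2 + sum3;
        std::cout << "Unrolled (4x): " << ns_per_iter << " ns/iter\n";
        std::cout << "Sum: " << sum << "\n";
        return 0;
    }

What to Measure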
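- ns per addition: Nanoseconds per iteration for each pattern, divided by the number of additions in the loop body (Experiment 2 performs four per iteration) so the patterns are comparable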
- Ratio: Compare dependency chain vs independent operations
- Unroll scaling: How performance changes with unroll factor (2x, 4x, 8x)
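The unroll-scaling measurement can be sketched with one template whose parameter is the unroll factor. This is a minimal sketch rather than part of the original experiments: the names `UnrollResult` and `run_unrolled`, and the choice of one accumulator per unrolled step, are assumptions made for illustration.

```cpp
#include <chrono>
#include <cstddef>
#include <cstdint>

// Hypothetical result type: the checksum plus the measured time per addition.
struct UnrollResult {
    std::uint64_t sum;  // total across all accumulators
    double ns_per_add;  // elapsed nanoseconds divided by n
};

// Hypothetical helper: sums 0..n-1 using K separate accumulators.
template <std::size_t K>
UnrollResult run_unrolled(std::uint64_t n) {
    // volatile keeps the compiler from collapsing or vectorizing the loop,
    // at the cost of routing every update through memory.
    volatile std::uint64_t acc[K] = {};

    auto start = std::chrono::steady_clock::now();
    std::uint64_t i = 0;
    for (; i + K <= n; i += K) {
        // Constant trip count: compilers usually unroll this fully at -O3.
        for (std::size_t k = 0; k < K; ++k) {
            acc[k] = acc[k] + (i + k);
        }
    }
    for (; i < n; ++i) {
        acc[0] = acc[0] + i;  // leftover elements when n is not a multiple of K
    }
    auto end = std::chrono::steady_clock::now();

    std::uint64_t total = 0;
    for (std::size_t k = 0; k < K; ++k) {
        total += acc[k];
    }
    double ns = std::chrono::duration<double, std::nano>(end - start).count();
    return {total, n > 0 ? ns / static_cast<double>(n) : 0.0};
}
```

Calling `run_unrolled<1>(n)`, `run_unrolled<2>(n)`, `run_unrolled<4>(n)`, and `run_unrolled<8>(n)` with the same `n` gives one time-per-addition figure per unroll factor. Whether the inner loop is actually unrolled is up to the compiler, so it is worth confirming in the assembly.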
Expected Shape of Results
You should see:
- Dependency chain: Slowest, limited by instruction latency
- Independent operations: Faster, limited by throughput
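- Unrolled: Often fastest per element when each unrolled step has its own accumulator, since this amortizes loop overhead and exposes more ILP; with a single shared accumulator it stays bound by the dependency chain
The exact numbers depend on your CPU, but the relative performance should show clear differences. Per addition, the independent operations are often around 2-4x faster than the dependency chain on modern CPUs; with four independent chains, the gain cannot be much more than 4x.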
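To relate a measured time to instruction latency, it can help to convert nanoseconds into clock cycles. The helper below is a hypothetical sketch: `ns_to_cycles` is not part of the experiments, and `clock_ghz` is a value you supply yourself. Frequency scaling and turbo modes mean the nominal clock is only an approximation.

```cpp
// Hypothetical helper: converts a time per operation into clock cycles.
// clock_ghz is an assumed, caller-supplied frequency; it is not measured here.
double ns_to_cycles(double ns_per_op, double clock_ghz) {
    return ns_per_op * clock_ghz;  // 1 GHz means 1 cycle per nanosecond
}
```

For example, 0.5 ns per addition at an assumed 4 GHz corresponds to 2 cycles per addition.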
Interpretation
When independent operations are faster, the CPU is finding parallelism. The out-of-order execution engine is reordering and overlapping instructions. When unrolling helps, it's because:
- Loop overhead (branch, increment, compare) is amortized
- More independent operations are visible to the scheduler, provided the unrolled steps do not all feed one accumulator
- Better register allocation opportunities
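The ratio between the dependency chain and the independent operations hints at how much work the CPU can overlap. With four independent chains the ratio is capped at about 4, so it is only a lower bound. To probe the issue width (how many instructions the CPU can start per cycle), keep adding independent chains until the time per addition stops improving; the point where it flattens reflects the throughput limit for that instruction mix, which the issue width bounds from above.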
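The reasoning above can be sketched as a simple latency-versus-throughput model. This is a deliberately simplified assumption, not a measurement: `model_cycles` and `model_speedup` are hypothetical names, and the latency and throughput values you pass in are illustrative inputs, not figures for any particular CPU.

```cpp
#include <algorithm>

// Simplified model: n_ops operations split evenly across `chains` independent
// dependency chains. Each chain is serial, and the core can complete at most
// `ops_per_cycle` such operations per cycle.
double model_cycles(double n_ops, double chains,
                    double latency_cycles, double ops_per_cycle) {
    double latency_bound = (n_ops / chains) * latency_cycles;
    double throughput_bound = n_ops / ops_per_cycle;
    return std::max(latency_bound, throughput_bound);
}

// Predicted speedup of `chains` independent chains over a single chain.
double model_speedup(double chains, double latency_cycles, double ops_per_cycle) {
    return model_cycles(1.0, 1.0, latency_cycles, ops_per_cycle) /
           model_cycles(1.0, chains, latency_cycles, ops_per_cycle);
}
```

With an assumed latency of 4 cycles and 2 operations per cycle, the model predicts a 4x speedup for 4 chains and saturates at 8x (latency times throughput) once there are 8 or more chains. In this model, the plateau is where the limit stops being the dependency chain and becomes throughput.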
Common Pitfalls
- Compiler unrolling: The compiler may unroll or vectorize loops automatically. Check the generated assembly (for example, g++ -O3 -S dep_chain.cpp writes it to dep_chain.s).
- Register pressure: Too many variables can force register spills, adding memory latency
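- False dependencies: Even "independent" operations can compete for the same execution port or memory pipeline, and some instructions depend on the old value of their destination register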
- Measurement noise: Use median of many runs, not a single measurement
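Taking the median of many runs can be sketched as below. The helper names `median` and `median_ns` are hypothetical and not part of the experiments. `work` stands for whichever loop you are timing, and returning 0.0 for an empty sample set is an arbitrary choice made for this sketch.

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <utility>
#include <vector>

// Median of a set of samples; takes a copy so the caller's data is untouched.
double median(std::vector<double> samples) {
    if (samples.empty()) return 0.0;  // assumption: treat "no data" as zero
    std::sort(samples.begin(), samples.end());
    const std::size_t mid = samples.size() / 2;
    return samples.size() % 2 == 1
               ? samples[mid]
               : (samples[mid - 1] + samples[mid]) / 2.0;
}

// Hypothetical helper: runs `work` `runs` times and returns the median
// wall-clock time of a single run, in nanoseconds.
template <typename Work>
double median_ns(Work&& work, int runs) {
    std::vector<double> samples;
    for (int r = 0; r < runs; ++r) {
        auto start = std::chrono::steady_clock::now();
        work();
        auto end = std::chrono::steady_clock::now();
        samples.push_back(
            std::chrono::duration<double, std::nano>(end - start).count());
    }
    return median(std::move(samples));
}
```

The median is less sensitive than the mean to the occasional run that is slowed by an interrupt or a context switch.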
Tooling Upgrades (Optional)
Linux: perf stat
perf stat -e instructions,cycles ./your_program reports both counts, and perf derives IPC from them (shown as "insn per cycle"). Compare the three experiments to see how IPC changes.
macOS: Instruments
Instruments "Counters" template can show IPC and instruction retirement rates. Compare dependent vs independent patterns.
Windows: ETW
ETW can capture CPU performance counters. Use Windows Performance Analyzer to visualize IPC differences.
Checklist
- ✓ Measured dependency chain performance
- ✓ Measured independent operations performance
- ✓ Measured unrolled loop performance
- ✓ Compared ratios between patterns
- ✓ Verified results are consistent across runs
- ✓ Checked assembly to see what compiler produced