“Know how to solve every problem that has been solved.” “What I cannot create, I do not understand.” — Richard Feynman

ILP Experiments

What You'll Learn

How to measure instruction-level parallelism
Why dependency chains are slow
How loop unrolling affects performance
How to estimate issue width through measurement

Mental Model

When instructions depend on each other, the CPU must wait. When they're independent, the CPU can execute them in parallel. The difference in performance between these patterns reveals the CPU's ability to find and exploit instruction-level parallelism.

Experiment 1: Dependency Chain

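This code creates a chain of dependencies: each addition needs the result of the previous one. Because sum is declared volatile, every update is a load, an add, and a store, so the chain runs through memory and each load has to wait for the previous store: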

#include <chrono>
#include <iostream>

using Clock = std::chrono::steady_clock;  // monotonic, so suited to interval timing
using Duration = std::chrono::nanoseconds;

int main() {
    // unsigned: the running total exceeds INT_MAX, and signed overflow is undefined behavior
    volatile unsigned sum = 0;
    const int iterations = 1000000;
    
    auto start = Clock::now();
    for (int i = 0; i < iterations; ++i) {
        sum = sum + i;  // Each iteration depends on previous sum
    }
    auto end = Clock::now();
    
    auto elapsed = std::chrono::duration_cast<Duration>(end - start);
    double ns_per_iter = static_cast<double>(elapsed.count()) / iterations;
    
    std::cout << "Dependency chain: " << ns_per_iter << " ns/iter\n";
    std::cout << "Sum (to prevent optimization): " << sum << "\n";
    
    return 0;
}

Compile: g++ -O3 -o dep_chain dep_chain.cpp
Run: ./dep_chain

Experiment 2: Independent Operations

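This code keeps four separate accumulators. The four chains do not depend on one another, so the CPU can overlap them. Each iteration now performs four additions instead of one, so divide ns/iter by 4 before comparing with Experiment 1: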

#include <chrono>
#include <iostream>

using Clock = std::chrono::steady_clock;  // monotonic, so suited to interval timing
using Duration = std::chrono::nanoseconds;

int main() {
    // unsigned: the running totals exceed INT_MAX, and signed overflow is undefined behavior
    volatile unsigned sum1 = 0, sum2 = 0, sum3 = 0, sum4 = 0;
    const int iterations = 1000000;
    
    auto start = Clock::now();
    for (int i = 0; i < iterations; ++i) {
        sum1 = sum1 + i;      // Independent
        sum2 = sum2 + (i * 2); // Independent
        sum3 = sum3 + (i * 3); // Independent
        sum4 = sum4 + (i * 4); // Independent
    }
    auto end = Clock::now();
    
    auto elapsed = std::chrono::duration_cast<Duration>(end - start);
    double ns_per_iter = static_cast<double>(elapsed.count()) / iterations;
    
    std::cout << "Independent ops: " << ns_per_iter << " ns/iter\n";
    std::cout << "Sums (to prevent optimization): " 
              << sum1 << " " << sum2 << " " << sum3 << " " << sum4 << "\n";
    
    return 0;
}

Experiment 3: Loop Unrolling

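Unrolling the loop manually to cut loop overhead. All four additions still update the same sum, so they remain one dependency chain; unrolling by itself amortizes the branch, increment, and compare but does not create independent work: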

#include <chrono>
#include <iostream>

using Clock = std::chrono::steady_clock;  // monotonic, so suited to interval timing
using Duration = std::chrono::nanoseconds;

int main() {
    // unsigned: the running total exceeds INT_MAX, and signed overflow is undefined behavior
    volatile unsigned sum = 0;
    const int iterations = 1000000;
    
    auto start = Clock::now();
    for (int i = 0; i < iterations; i += 4) {
        sum += i;
        sum += i + 1;
        sum += i + 2;
        sum += i + 3;
    }
    auto end = Clock::now();
    
    auto elapsed = std::chrono::duration_cast<Duration>(end - start);
    double ns_per_iter = static_cast<double>(elapsed.count()) / iterations;
    
    std::cout << "Unrolled (4x): " << ns_per_iter << " ns/iter\n";
    std::cout << "Sum: " << sum << "\n";
    
    return 0;
}
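
All four additions in the unrolled loop above update the same sum, so they still form a single chain. One common way to let unrolling expose parallelism is to give each unrolled addition its own accumulator and combine the accumulators after the loop. The sketch below assumes that approach. sum_four_accumulators is a hypothetical helper, not part of the original experiments, and it keeps the accumulators volatile only to match the way the experiments stop the compiler from folding the loop away.

```cpp
#include <cstdint>

// Hypothetical helper: sums 0..n-1 with four independent accumulators.
// Unsigned arithmetic wraps modulo 2^32, so large n is well defined.
std::uint32_t sum_four_accumulators(std::uint32_t n) {
    volatile std::uint32_t s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    std::uint32_t i = 0;

    // Main unrolled loop: four chains, none of which reads another's result.
    for (; n - i >= 4; i += 4) {
        s0 = s0 + i;
        s1 = s1 + (i + 1);
        s2 = s2 + (i + 2);
        s3 = s3 + (i + 3);
    }

    // Tail loop for the last n % 4 values.
    for (; i < n; ++i) {
        s0 = s0 + i;
    }

    // Combine the partial sums once, outside the hot loop.
    return s0 + s1 + s2 + s3;
}
```

Timing this function with the same clock calls as the experiments and dividing by n gives a per-addition figure that can be compared directly with Experiments 1 and 3.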

What to Measure

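ns/iter: Nanoseconds per iteration for each pattern. Experiment 2 performs four additions per iteration, so also record ns per addition (ns/iter divided by 4)
Ratio: Compare dependency chain vs independent operations, per addition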
Unroll scaling: How performance changes with unroll factor (2x, 4x, 8x)
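
To try several unroll factors without copying the program each time, a small template can take the number of accumulators as a parameter. This is a minimal sketch under stated assumptions: time_chains and ChainResult are hypothetical names, the accumulators are volatile as in the experiments, and whether the compiler fully unrolls the inner loop is something to confirm in the assembly.

```cpp
#include <chrono>
#include <cstdint>

// Hypothetical result type: time per addition plus a checksum that keeps
// the work observable and makes the routine testable.
struct ChainResult {
    double ns_per_add;
    std::uint32_t checksum;
};

// Runs roughly total_adds additions spread across K independent accumulators.
// Any remainder of total_adds / K is dropped.
template <int K>
ChainResult time_chains(std::uint32_t total_adds) {
    static_assert(K >= 1, "need at least one accumulator");
    volatile std::uint32_t acc[K] = {};
    const std::uint32_t trips = total_adds / K;

    const auto start = std::chrono::steady_clock::now();
    for (std::uint32_t t = 0; t < trips; ++t) {
        // K is a compile-time constant, so the compiler can unroll this loop.
        for (int k = 0; k < K; ++k) {
            acc[k] = acc[k] + t;
        }
    }
    const auto end = std::chrono::steady_clock::now();

    std::uint32_t checksum = 0;
    for (int k = 0; k < K; ++k) {
        checksum += acc[k];
    }

    const double ns =
        std::chrono::duration<double, std::nano>(end - start).count();
    const double adds = static_cast<double>(trips) * K;
    return {adds > 0 ? ns / adds : 0.0, checksum};
}
```

Calling time_chains<1>, time_chains<2>, time_chains<4>, and time_chains<8> with the same total_adds and comparing ns_per_add shows where extra accumulators stop helping.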

Expected Shape of Results

You should see:

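Dependency chain: Slowest per addition, limited by instruction latency
Independent operations: Faster per addition, limited by throughput
Unrolled: Reduces loop overhead. With a single accumulator the additions still form one chain, so expect it to land near the dependency chain unless loop overhead was the bottleneck; it exposes more ILP only when the unrolled additions use separate accumulators

The exact numbers depend on your CPU, but the relative performance should show clear differences. Measured per addition, independent operations should be 2-4x faster than dependency chains on modern CPUs.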

Interpretation

When independent operations are faster, the CPU is finding parallelism. The out-of-order execution engine is reordering and overlapping instructions. When unrolling helps, it's because:

Loop overhead (branch, increment, compare) is amortized
More independent operations are visible to the scheduler (only when the unrolled operations do not all feed the same accumulator)
Better register allocation opportunities

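The per-addition ratio between dependency chain and independent operations tells you how many operations the CPU managed to overlap. Treat it as a lower bound on issue width (how many instructions the CPU can start per cycle) rather than a direct measurement, because four chains may be too few to saturate the core.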
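
The arithmetic behind that comparison is small enough to write down. The helpers below are a sketch with hypothetical names; clock_ghz is an assumption you must supply yourself, since the experiments do not measure the clock frequency and it can vary while the program runs.

```cpp
// Normalizes a per-iteration time to a per-addition time.
// Experiment 2 does 4 additions per iteration; Experiments 1 and 3 report 1.
double ns_per_add(double ns_per_iter, int adds_per_iter) {
    return ns_per_iter / adds_per_iter;
}

// Converts nanoseconds to cycles, assuming a fixed clock in GHz
// (1 GHz is one cycle per nanosecond).
double cycles_per_add(double ns_per_add_value, double clock_ghz) {
    return ns_per_add_value * clock_ghz;
}

// How many additions were overlapped on average: chain time per addition
// divided by independent time per addition.
double overlap_factor(double chain_ns_per_add, double independent_ns_per_add) {
    return chain_ns_per_add / independent_ns_per_add;
}
```

Feed these the numbers your own runs print. An overlap factor close to the number of accumulators suggests the chains ran fully in parallel, and a smaller value suggests another limit was reached first.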

Common Pitfalls

Compiler unrolling: The compiler may unroll loops automatically. Check assembly.
Register pressure: Too many variables can force register spills, adding memory latency.
False dependencies: Even "independent" operations might share resources, such as execution ports or, with volatile variables as used here, the load and store units.
Measurement noise: Use median of many runs, not a single measurement
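
Taking the median of many runs can be sketched as follows. median and median_ns are hypothetical helpers, not part of the original programs, and the number of runs is left to the caller.

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <utility>
#include <vector>

// Median of a set of samples; returns 0.0 for an empty set.
double median(std::vector<double> samples) {
    if (samples.empty()) return 0.0;
    const auto mid =
        samples.begin() + static_cast<std::ptrdiff_t>(samples.size() / 2);
    std::nth_element(samples.begin(), mid, samples.end());
    const double upper = *mid;
    if (samples.size() % 2 == 1) return upper;
    // Even count: average the two middle values.
    const double lower = *std::max_element(samples.begin(), mid);
    return (lower + upper) / 2.0;
}

// Times work() `runs` times and returns the median duration in nanoseconds.
template <typename Fn>
double median_ns(Fn&& work, std::size_t runs) {
    std::vector<double> samples;
    samples.reserve(runs);
    for (std::size_t r = 0; r < runs; ++r) {
        const auto start = std::chrono::steady_clock::now();
        work();
        const auto end = std::chrono::steady_clock::now();
        samples.push_back(
            std::chrono::duration<double, std::nano>(end - start).count());
    }
    return median(std::move(samples));
}
```

Wrapping each experiment's loop in a lambda and passing it to median_ns gives one robust figure per pattern. The median is used instead of the mean because a single run disturbed by an interrupt or a frequency change would otherwise skew the result.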

Tooling Upgrades (Optional)

Linux: perf stat

perf stat -e instructions,cycles ./your_program reports both counts and derives IPC from them (shown as "insn per cycle"). Compare the three experiments to see how IPC changes.

macOS: Instruments

Instruments "Counters" template can show IPC and instruction retirement rates. Compare dependent vs independent patterns.

Windows: ETW

ETW can capture CPU performance counters. Use Windows Performance Analyzer to visualize IPC differences.

Checklist

✓ Measured dependency chain performance
✓ Measured independent operations performance
✓ Measured unrolled loop performance
✓ Compared ratios between patterns
✓ Verified results are consistent across runs
✓ Checked assembly to see what compiler produced