ILP Experiments

What You'll Learn

Mental Model

When instructions depend on each other, the CPU must wait. When they're independent, the CPU can execute them in parallel. The difference in performance between these patterns reveals the CPU's ability to find and exploit instruction-level parallelism.

Experiment 1: Dependency Chain

This code creates a dependency chain, where each addition needs the result of the previous one:

#include <chrono>
#include <iostream>

using Clock = std::chrono::high_resolution_clock;
using Duration = std::chrono::nanoseconds;

int main() {
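    // volatile forces a real load and store each iteration.
    // long long is needed because the total (499999500000) overflows int.
    volatile long long sum = 0;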
    const int iterations = 1000000;
    
    auto start = Clock::now();
    for (int i = 0; i < iterations; ++i) {
        sum = sum + i;  // Each iteration depends on previous sum
    }
    auto end = Clock::now();
    
    auto elapsed = std::chrono::duration_cast<Duration>(end - start);
    double ns_per_iter = static_cast<double>(elapsed.count()) / iterations;
    
    std::cout << "Dependency chain: " << ns_per_iter << " ns/iter\n";
    std::cout << "Sum (to prevent optimization): " << sum << "\n";
    
    return 0;
}

Compile: g++ -O3 -o dep_chain dep_chain.cpp
Run: ./dep_chain

Experiment 2: Independent Operations

This code performs independent operations that can execute in parallel:

#include <chrono>
#include <iostream>

using Clock = std::chrono::high_resolution_clock;
using Duration = std::chrono::nanoseconds;

int main() {
    // long long is needed because the totals overflow int.
    volatile long long sum1 = 0, sum2 = 0, sum3 = 0, sum4 = 0;
    const int iterations = 1000000;
    
    auto start = Clock::now();
    for (int i = 0; i < iterations; ++i) {
        sum1 = sum1 + i;      // Independent
        sum2 = sum2 + (i * 2); // Independent
        sum3 = sum3 + (i * 3); // Independent
        sum4 = sum4 + (i * 4); // Independent
    }
    auto end = Clock::now();
    
    auto elapsed = std::chrono::duration_cast<Duration>(end - start);
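    // Each loop iteration performs four additions, so divide by 4 * iterations
    // to get a per-addition time that is comparable with Experiment 1.
    double ns_per_add = static_cast<double>(elapsed.count()) / (4.0 * iterations);
    
    std::cout << "Independent ops: " << ns_per_add << " ns/add\n";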
    std::cout << "Sums (to prevent optimization): " 
              << sum1 << " " << sum2 << " " << sum3 << " " << sum4 << "\n";
    
    return 0;
}

Experiment 3: Loop Unrolling

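Unrolling the loop by hand performs four additions per loop iteration, which cuts the branch and counter overhead per addition. Note that all four additions still update the same sum, so the dependency chain from Experiment 1 is unchanged and unrolling alone exposes little extra parallelism here: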

#include <chrono>
#include <iostream>

using Clock = std::chrono::high_resolution_clock;
using Duration = std::chrono::nanoseconds;

int main() {
    // long long is needed because the total (499999500000) overflows int.
    volatile long long sum = 0;
    const int iterations = 1000000;
    
    auto start = Clock::now();
    for (int i = 0; i < iterations; i += 4) {
        sum += i;
        sum += i + 1;
        sum += i + 2;
        sum += i + 3;
    }
    auto end = Clock::now();
    
    auto elapsed = std::chrono::duration_cast<Duration>(end - start);
    double ns_per_iter = static_cast<double>(elapsed.count()) / iterations;
    
    std::cout << "Unrolled (4x): " << ns_per_iter << " ns/iter\n";
    std::cout << "Sum: " << sum << "\n";
    
    return 0;
}

What to Measure

Expected Shape of Results

You should see:

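The exact numbers depend on your CPU, compiler, and flags, but the relative performance should show clear differences. Compare the experiments per addition, since Experiment 2 performs four additions per loop iteration. On that basis, independent operations are typically around 2-4x faster than a dependency chain on modern CPUs.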

Interpretation

When independent operations are faster, the CPU is finding parallelism. The out-of-order execution engine is reordering and overlapping instructions. When unrolling helps, it's because:

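The ratio between the dependency-chain time and the independent-operations time, both taken per addition, gives you a rough sense of how many operations the CPU can overlap. This is related to the CPU's issue width (how many instructions it can start per cycle) but is not a direct measurement of it: with only four independent chains, the speedup from overlap is limited to roughly 4x, however wide the CPU is.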

Common Pitfalls

Tooling Upgrades (Optional)

Linux: perf stat

perf stat -e instructions,cycles ./your_program reports both counts, and perf derives IPC from them (shown as "insn per cycle" next to the instruction count). Compare the three experiments to see how IPC changes.

macOS: Instruments

Instruments "Counters" template can show IPC and instruction retirement rates. Compare dependent vs independent patterns.

Windows: ETW

ETW can capture CPU performance counters. Use Windows Performance Analyzer to visualize IPC differences.

Checklist