Branch Prediction Experiments

What You'll Learn

Mental Model

By comparing code with predictable branches vs unpredictable branches, we can measure the cost of mispredictions. The difference in runtime reveals how much the CPU is paying for wrong guesses.

Experiment 1: Always-Taken Branch

This branch is always taken—completely predictable:

#include <chrono>
#include <iostream>
#include <vector>

using Clock = std::chrono::high_resolution_clock;
using Duration = std::chrono::nanoseconds;

int main() {
    std::vector<int> data(100000, 1);
    volatile int sum = 0;
    const int iterations = 1000;
    
    auto start = Clock::now();
    for (int iter = 0; iter < iterations; ++iter) {
        for (size_t i = 0; i < data.size(); ++i) {
            if (data[i] > 0) {  // Always true!
                sum += data[i];
            }
        }
    }
    auto end = Clock::now();
    
    auto elapsed = std::chrono::duration_cast<Duration>(end - start);
    double ns_per_element = static_cast<double>(elapsed.count()) / 
                           (iterations * data.size());
    
    std::cout << "Always-taken: " << ns_per_element << " ns/element\n";
    return 0;
}

Experiment 2: Always-Not-Taken Branch

// Same structure, but branch is never taken
std::vector<int> data(100000, -1);
for (size_t i = 0; i < data.size(); ++i) {
    if (data[i] > 0) {  // Always false!
        sum += data[i];
    }
}

Experiment 3: Random Branch

#include <random>

std::vector<int> data(100000);
std::random_device rd;
std::mt19937 gen(rd());
std::uniform_int_distribution<> dis(0, 1);

for (size_t i = 0; i < data.size(); ++i) {
    data[i] = dis(gen);  // Random 0 or 1
}

// Now the branch is unpredictable
for (size_t i = 0; i < data.size(); ++i) {
    if (data[i] > 0) {  // 50/50 probability
        sum += data[i];
    }
}

Experiment 4: Sorted vs Unsorted (Classic)

This is the classic branch prediction demonstration:

#include <algorithm>
#include <random>

// Unsorted data
std::vector<int> unsorted(100000);
std::iota(unsorted.begin(), unsorted.end(), 0);
std::shuffle(unsorted.begin(), unsorted.end(), std::mt19937{});

// Sorted data
std::vector<int> sorted(100000);
std::iota(sorted.begin(), sorted.end(), 0);
// Already sorted

// Measure both
volatile int sum = 0;
const int threshold = 50000;

// Unsorted version
auto start = Clock::now();
for (int iter = 0; iter < iterations; ++iter) {
    for (int val : unsorted) {
        if (val < threshold) {  // Unpredictable!
            sum += val;
        }
    }
}
auto end = Clock::now();
double unsorted_time = std::chrono::duration_cast<Duration>(end - start).count();

// Sorted version
start = Clock::now();
for (int iter = 0; iter < iterations; ++iter) {
    for (int val : sorted) {
        if (val < threshold) {  // Predictable pattern!
            sum += val;
        }
    }
}
end = Clock::now();
double sorted_time = std::chrono::duration_cast<Duration>(end - start).count();

std::cout << "Unsorted: " << unsorted_time / (iterations * unsorted.size()) << " ns/element\n";
std::cout << "Sorted:   " << sorted_time / (iterations * sorted.size()) << " ns/element\n";
std::cout << "Speedup:  " << unsorted_time / sorted_time << "x\n";

What to Measure

Expected Shape of Results

Typical results:

The sorted vs unsorted difference can be 2-5x on modern CPUs, even though they do the same computation!

Interpretation

When branches are predictable, the CPU speculates correctly and execution proceeds smoothly. When branches are unpredictable, the CPU guesses wrong frequently, causing pipeline flushes. The difference in runtime is the cost of those flushes.

Speculation: The CPU executes instructions from the predicted path before knowing if it's correct. This is why mispredictions are expensive—all that speculative work is wasted.

Pipeline flush: When a misprediction is detected, the pipeline must be cleared and restarted from the correct path. This takes 10-20 cycles.

Common Pitfalls

Tooling Upgrades (Optional)

Linux: perf stat

perf stat -e branches,branch-misses ./your_program shows exact misprediction counts. Compare sorted vs unsorted to see the difference.

macOS: Instruments

Instruments can show branch prediction statistics on supported hardware. Compare the different experiments.

Windows: ETW

ETW can capture branch prediction events. Use WPA to analyze branch behavior differences.

Checklist