“Know how to solve every problem that has been solved.” “What I cannot create, I do not understand.” — Richard Feynman

Branch Prediction Experiments

What You'll Learn

How to measure branch prediction effects
The performance difference between predictable and unpredictable branches
Why sorted data can be faster than unsorted data
How speculation and pipeline flushes affect performance

Mental Model

By comparing code with predictable branches vs unpredictable branches, we can measure the cost of mispredictions. The difference in runtime reveals how much the CPU is paying for wrong guesses.

Experiment 1: Always-Taken Branch

This branch is always taken—completely predictable:

#include <chrono>
#include <iostream>
#include <vector>

using Clock = std::chrono::high_resolution_clock;
using Duration = std::chrono::nanoseconds;

int main() {
    std::vector<int> data(100000, 1);
    volatile int sum = 0;
    const int iterations = 1000;
    
    auto start = Clock::now();
    for (int iter = 0; iter < iterations; ++iter) {
        for (size_t i = 0; i < data.size(); ++i) {
            if (data[i] > 0) {  // Always true!
                sum += data[i];
            }
        }
    }
    auto end = Clock::now();
    
    auto elapsed = std::chrono::duration_cast<Duration>(end - start);
    double ns_per_element = static_cast<double>(elapsed.count()) / 
                           (iterations * data.size());
    
    std::cout << "Always-taken: " << ns_per_element << " ns/element\n";
    return 0;
}

Experiment 2: Always-Not-Taken Branch

// Same structure, but branch is never taken
std::vector<int> data(100000, -1);
for (size_t i = 0; i < data.size(); ++i) {
    if (data[i] > 0) {  // Always false!
        sum += data[i];
    }
}

Experiment 3: Random Branch

#include <random>

std::vector<int> data(100000);
std::random_device rd;
std::mt19937 gen(rd());
std::uniform_int_distribution<> dis(0, 1);

for (size_t i = 0; i < data.size(); ++i) {
    data[i] = dis(gen);  // Random 0 or 1
}

// Now the branch is unpredictable
for (size_t i = 0; i < data.size(); ++i) {
    if (data[i] > 0) {  // 50/50 probability
        sum += data[i];
    }
}

Experiment 4: Sorted vs Unsorted (Classic)

This is the classic branch prediction demonstration:

#include <algorithm>
#include <random>

// Unsorted data
std::vector<int> unsorted(100000);
std::iota(unsorted.begin(), unsorted.end(), 0);
std::shuffle(unsorted.begin(), unsorted.end(), std::mt19937{});

// Sorted data
std::vector<int> sorted(100000);
std::iota(sorted.begin(), sorted.end(), 0);
// Already sorted

// Measure both
volatile int sum = 0;
const int threshold = 50000;

// Unsorted version
auto start = Clock::now();
for (int iter = 0; iter < iterations; ++iter) {
    for (int val : unsorted) {
        if (val < threshold) {  // Unpredictable!
            sum += val;
        }
    }
}
auto end = Clock::now();
double unsorted_time = std::chrono::duration_cast<Duration>(end - start).count();

// Sorted version
start = Clock::now();
for (int iter = 0; iter < iterations; ++iter) {
    for (int val : sorted) {
        if (val < threshold) {  // Predictable pattern!
            sum += val;
        }
    }
}
end = Clock::now();
double sorted_time = std::chrono::duration_cast<Duration>(end - start).count();

std::cout << "Unsorted: " << unsorted_time / (iterations * unsorted.size()) << " ns/element\n";
std::cout << "Sorted:   " << sorted_time / (iterations * sorted.size()) << " ns/element\n";
std::cout << "Speedup:  " << unsorted_time / sorted_time << "x\n";

What to Measure

ns/element: Time per element processed
Ratio: Compare predictable vs unpredictable branches
Speedup: How much faster predictable branches are

Expected Shape of Results

Typical results:

Always-taken/not-taken: Fast (~0.5-1 ns/element), near-zero mispredictions
Random branch: Slow (~5-10 ns/element), ~50% mispredictions
Sorted data: Fast, predictable pattern
Unsorted data: Slow, unpredictable pattern

The sorted vs unsorted difference can be 2-5x on modern CPUs, even though they do the same computation!

Interpretation

When branches are predictable, the CPU speculates correctly and execution proceeds smoothly. When branches are unpredictable, the CPU guesses wrong frequently, causing pipeline flushes. The difference in runtime is the cost of those flushes.

Speculation: The CPU executes instructions from the predicted path before knowing if it's correct. This is why mispredictions are expensive—all that speculative work is wasted.

Pipeline flush: When a misprediction is detected, the pipeline must be cleared and restarted from the correct path. This takes 10-20 cycles.

Common Pitfalls

Compiler optimizations: Compiler may eliminate branches or use conditional moves
Small datasets: May fit in cache, changing results
Measurement noise: Use median of many runs
Branchless code: Compiler may generate branchless code (ternary operators, CMOV)

Tooling Upgrades (Optional)

Linux: perf stat

perf stat -e branches,branch-misses ./your_program shows exact misprediction counts. Compare sorted vs unsorted to see the difference.

macOS: Instruments

Instruments can show branch prediction statistics on supported hardware. Compare the different experiments.

Windows: ETW

ETW can capture branch prediction events. Use WPA to analyze branch behavior differences.

Checklist

✓ Measured always-taken branch performance
✓ Measured always-not-taken branch performance
✓ Measured random branch performance
✓ Measured sorted vs unsorted data performance
✓ Compared ratios to understand mispredict cost
✓ Verified results are consistent