Branch Prediction Experiments
What You'll Learn
- How to measure branch prediction effects
- The performance difference between predictable and unpredictable branches
- Why sorted data can be faster than unsorted data
- How speculation and pipeline flushes affect performance
Mental Model
By comparing code with predictable branches vs unpredictable branches, we can measure the cost of mispredictions. The difference in runtime reveals how much the CPU is paying for wrong guesses.
Experiment 1: Always-Taken Branch
This branch is always taken—completely predictable:
#include <chrono>
#include <iostream>
#include <vector>
using Clock = std::chrono::high_resolution_clock;
using Duration = std::chrono::nanoseconds;
int main() {
std::vector<int> data(100000, 1);
volatile int sum = 0;
const int iterations = 1000;
auto start = Clock::now();
for (int iter = 0; iter < iterations; ++iter) {
for (size_t i = 0; i < data.size(); ++i) {
if (data[i] > 0) { // Always true!
sum += data[i];
}
}
}
auto end = Clock::now();
auto elapsed = std::chrono::duration_cast<Duration>(end - start);
double ns_per_element = static_cast<double>(elapsed.count()) /
(iterations * data.size());
std::cout << "Always-taken: " << ns_per_element << " ns/element\n";
return 0;
} Experiment 2: Always-Not-Taken Branch
// Same structure, but branch is never taken
std::vector<int> data(100000, -1);
for (size_t i = 0; i < data.size(); ++i) {
if (data[i] > 0) { // Always false!
sum += data[i];
}
} Experiment 3: Random Branch
#include <random>
std::vector<int> data(100000);
std::random_device rd;
std::mt19937 gen(rd());
std::uniform_int_distribution<> dis(0, 1);
for (size_t i = 0; i < data.size(); ++i) {
data[i] = dis(gen); // Random 0 or 1
}
// Now the branch is unpredictable
for (size_t i = 0; i < data.size(); ++i) {
if (data[i] > 0) { // 50/50 probability
sum += data[i];
}
} Experiment 4: Sorted vs Unsorted (Classic)
This is the classic branch prediction demonstration:
#include <algorithm>
#include <random>
// Unsorted data
std::vector<int> unsorted(100000);
std::iota(unsorted.begin(), unsorted.end(), 0);
std::shuffle(unsorted.begin(), unsorted.end(), std::mt19937{});
// Sorted data
std::vector<int> sorted(100000);
std::iota(sorted.begin(), sorted.end(), 0);
// Already sorted
// Measure both
volatile int sum = 0;
const int threshold = 50000;
// Unsorted version
auto start = Clock::now();
for (int iter = 0; iter < iterations; ++iter) {
for (int val : unsorted) {
if (val < threshold) { // Unpredictable!
sum += val;
}
}
}
auto end = Clock::now();
double unsorted_time = std::chrono::duration_cast<Duration>(end - start).count();
// Sorted version
start = Clock::now();
for (int iter = 0; iter < iterations; ++iter) {
for (int val : sorted) {
if (val < threshold) { // Predictable pattern!
sum += val;
}
}
}
end = Clock::now();
double sorted_time = std::chrono::duration_cast<Duration>(end - start).count();
std::cout << "Unsorted: " << unsorted_time / (iterations * unsorted.size()) << " ns/element\n";
std::cout << "Sorted: " << sorted_time / (iterations * sorted.size()) << " ns/element\n";
std::cout << "Speedup: " << unsorted_time / sorted_time << "x\n"; What to Measure
- ns/element: Time per element processed
- Ratio: Compare predictable vs unpredictable branches
- Speedup: How much faster predictable branches are
Expected Shape of Results
Typical results:
- Always-taken/not-taken: Fast (~0.5-1 ns/element), near-zero mispredictions
- Random branch: Slow (~5-10 ns/element), ~50% mispredictions
- Sorted data: Fast, predictable pattern
- Unsorted data: Slow, unpredictable pattern
The sorted vs unsorted difference can be 2-5x on modern CPUs, even though they do the same computation!
Interpretation
When branches are predictable, the CPU speculates correctly and execution proceeds smoothly. When branches are unpredictable, the CPU guesses wrong frequently, causing pipeline flushes. The difference in runtime is the cost of those flushes.
Speculation: The CPU executes instructions from the predicted path before knowing if it's correct. This is why mispredictions are expensive—all that speculative work is wasted.
Pipeline flush: When a misprediction is detected, the pipeline must be cleared and restarted from the correct path. This takes 10-20 cycles.
Common Pitfalls
- Compiler optimizations: Compiler may eliminate branches or use conditional moves
- Small datasets: May fit in cache, changing results
- Measurement noise: Use median of many runs
- Branchless code: Compiler may generate branchless code (ternary operators, CMOV)
Tooling Upgrades (Optional)
Linux: perf stat
perf stat -e branches,branch-misses ./your_program shows exact misprediction counts.
Compare sorted vs unsorted to see the difference.
macOS: Instruments
Instruments can show branch prediction statistics on supported hardware. Compare the different experiments.
Windows: ETW
ETW can capture branch prediction events. Use WPA to analyze branch behavior differences.
Checklist
- ✓ Measured always-taken branch performance
- ✓ Measured always-not-taken branch performance
- ✓ Measured random branch performance
- ✓ Measured sorted vs unsorted data performance
- ✓ Compared ratios to understand mispredict cost
- ✓ Verified results are consistent