Hardware Prefetch (What It Helps / When It Fails)

What You'll Learn

Mental Model

CPUs have hardware prefetchers that try to predict what data you'll need next and load it into cache ahead of time. They work well for predictable patterns (sequential, fixed stride) but fail for unpredictable patterns (random access).

Experiment 1: Sequential Access

#include <chrono>
#include <iostream>
#include <vector>

using Clock = std::chrono::high_resolution_clock;
using Duration = std::chrono::nanoseconds;

double measure_sequential(std::vector<int>& data) {
    volatile int sum = 0;
    const int iterations = data.size();
    
    auto start = Clock::now();
    for (size_t i = 0; i < iterations; ++i) {
        sum += data[i];  // Sequential access
    }
    auto end = Clock::now();
    
    auto elapsed = std::chrono::duration_cast<Duration>(end - start);
    return static_cast<double>(elapsed.count()) / iterations;
}

int main() {
    std::vector<int> data(100 * 1024 * 1024 / sizeof(int), 1);  // 100 MB
    
    double sequential = measure_sequential(data);
    std::cout << "Sequential: " << sequential << " ns/element\n";
    return 0;
}

Experiment 2: Fixed Stride

double measure_stride(std::vector<int>& data, size_t stride) {
    volatile int sum = 0;
    const int iterations = data.size();
    
    auto start = Clock::now();
    for (size_t i = 0; i < iterations; i += stride) {
        sum += data[i];
    }
    auto end = Clock::now();
    
    auto elapsed = std::chrono::duration_cast<Duration>(end - start);
    return static_cast<double>(elapsed.count()) / (iterations / stride);
}

// Try different strides
for (size_t stride = 1; stride <= 64; stride *= 2) {
    double time = measure_stride(data, stride);
    std::cout << "Stride " << stride << ": " << time << " ns/element\n";
}

Experiment 3: Random Access

#include <random>
#include <algorithm>

double measure_random(std::vector<int>& data) {
    std::vector<size_t> indices(data.size());
    std::iota(indices.begin(), indices.end(), 0);
    std::shuffle(indices.begin(), indices.end(), std::mt19937{});
    
    volatile int sum = 0;
    auto start = Clock::now();
    for (size_t idx : indices) {
        sum += data[idx];  // Random access
    }
    auto end = Clock::now();
    
    auto elapsed = std::chrono::duration_cast<Duration>(end - start);
    return static_cast<double>(elapsed.count()) / data.size();
}

double random_time = measure_random(data);
std::cout << "Random: " << random_time << " ns/element\n";

What to Measure

Expected Shape of Results

Interpretation

Prefetchers predict patterns: Sequential and fixed-stride access are predictable. The prefetcher loads cache lines ahead of your access, hiding memory latency.

Random access fails: Prefetchers can't predict random patterns. Every access is a cache miss, exposing full DRAM latency.

Checklist