“Know how to solve every problem that has been solved.” “What I cannot create, I do not understand.” — Richard Feynman

Hardware Prefetch (What It Helps / When It Fails)

What You'll Learn

How hardware prefetchers work
What access patterns prefetchers can predict
When prefetching fails
Measuring prefetch effectiveness

Mental Model

CPUs have hardware prefetchers that try to predict what data you'll need next and load it into cache ahead of time. They work well for predictable patterns (sequential, fixed stride) but fail for unpredictable patterns (random access).

Experiment 1: Sequential Access

#include <chrono>
#include <iostream>
#include <vector>

using Clock = std::chrono::high_resolution_clock;
using Duration = std::chrono::nanoseconds;

double measure_sequential(std::vector<int>& data) {
    volatile int sum = 0;
    const int iterations = data.size();
    
    auto start = Clock::now();
    for (size_t i = 0; i < iterations; ++i) {
        sum += data[i];  // Sequential access
    }
    auto end = Clock::now();
    
    auto elapsed = std::chrono::duration_cast<Duration>(end - start);
    return static_cast<double>(elapsed.count()) / iterations;
}

int main() {
    std::vector<int> data(100 * 1024 * 1024 / sizeof(int), 1);  // 100 MB
    
    double sequential = measure_sequential(data);
    std::cout << "Sequential: " << sequential << " ns/element\n";
    return 0;
}

Experiment 2: Fixed Stride

double measure_stride(std::vector<int>& data, size_t stride) {
    volatile int sum = 0;
    const int iterations = data.size();
    
    auto start = Clock::now();
    for (size_t i = 0; i < iterations; i += stride) {
        sum += data[i];
    }
    auto end = Clock::now();
    
    auto elapsed = std::chrono::duration_cast<Duration>(end - start);
    return static_cast<double>(elapsed.count()) / (iterations / stride);
}

// Try different strides
for (size_t stride = 1; stride <= 64; stride *= 2) {
    double time = measure_stride(data, stride);
    std::cout << "Stride " << stride << ": " << time << " ns/element\n";
}

Experiment 3: Random Access

#include <random>
#include <algorithm>

double measure_random(std::vector<int>& data) {
    std::vector<size_t> indices(data.size());
    std::iota(indices.begin(), indices.end(), 0);
    std::shuffle(indices.begin(), indices.end(), std::mt19937{});
    
    volatile int sum = 0;
    auto start = Clock::now();
    for (size_t idx : indices) {
        sum += data[idx];  // Random access
    }
    auto end = Clock::now();
    
    auto elapsed = std::chrono::duration_cast<Duration>(end - start);
    return static_cast<double>(elapsed.count()) / data.size();
}

double random_time = measure_random(data);
std::cout << "Random: " << random_time << " ns/element\n";

What to Measure

ns/element: Time per element for different access patterns
Bandwidth: GB/s throughput
Ratio: Compare sequential vs random access

Expected Shape of Results

Sequential: Fast (~0.5-2 ns/element), prefetcher helps
Fixed stride: Fast if stride is small, slower for large strides
Random: Slow (~50-200 ns/element), prefetcher can't help

Interpretation

Prefetchers predict patterns: Sequential and fixed-stride access are predictable. The prefetcher loads cache lines ahead of your access, hiding memory latency.

Random access fails: Prefetchers can't predict random patterns. Every access is a cache miss, exposing full DRAM latency.

Checklist

✓ Measured sequential, stride, and random access performance
✓ Understood when prefetching helps vs fails