Hardware Prefetch (What It Helps / When It Fails)
What You'll Learn
- How hardware prefetchers work
- What access patterns prefetchers can predict
- When prefetching fails
- Measuring prefetch effectiveness
Mental Model
CPUs have hardware prefetchers that try to predict what data you'll need next and load it into cache ahead of time. They work well for predictable patterns (sequential, fixed stride) but fail for unpredictable patterns (random access).
Experiment 1: Sequential Access
#include <chrono>
#include <iostream>
#include <vector>
using Clock = std::chrono::high_resolution_clock;
using Duration = std::chrono::nanoseconds;
double measure_sequential(std::vector<int>& data) {
volatile int sum = 0;
const int iterations = data.size();
auto start = Clock::now();
for (size_t i = 0; i < iterations; ++i) {
sum += data[i]; // Sequential access
}
auto end = Clock::now();
auto elapsed = std::chrono::duration_cast<Duration>(end - start);
return static_cast<double>(elapsed.count()) / iterations;
}
int main() {
std::vector<int> data(100 * 1024 * 1024 / sizeof(int), 1); // 100 MB
double sequential = measure_sequential(data);
std::cout << "Sequential: " << sequential << " ns/element\n";
return 0;
} Experiment 2: Fixed Stride
double measure_stride(std::vector<int>& data, size_t stride) {
volatile int sum = 0;
const int iterations = data.size();
auto start = Clock::now();
for (size_t i = 0; i < iterations; i += stride) {
sum += data[i];
}
auto end = Clock::now();
auto elapsed = std::chrono::duration_cast<Duration>(end - start);
return static_cast<double>(elapsed.count()) / (iterations / stride);
}
// Try different strides
for (size_t stride = 1; stride <= 64; stride *= 2) {
double time = measure_stride(data, stride);
std::cout << "Stride " << stride << ": " << time << " ns/element\n";
} Experiment 3: Random Access
#include <random>
#include <algorithm>
double measure_random(std::vector<int>& data) {
std::vector<size_t> indices(data.size());
std::iota(indices.begin(), indices.end(), 0);
std::shuffle(indices.begin(), indices.end(), std::mt19937{});
volatile int sum = 0;
auto start = Clock::now();
for (size_t idx : indices) {
sum += data[idx]; // Random access
}
auto end = Clock::now();
auto elapsed = std::chrono::duration_cast<Duration>(end - start);
return static_cast<double>(elapsed.count()) / data.size();
}
double random_time = measure_random(data);
std::cout << "Random: " << random_time << " ns/element\n"; What to Measure
- ns/element: Time per element for different access patterns
- Bandwidth: GB/s throughput
- Ratio: Compare sequential vs random access
Expected Shape of Results
- Sequential: Fast (~0.5-2 ns/element), prefetcher helps
- Fixed stride: Fast if stride is small, slower for large strides
- Random: Slow (~50-200 ns/element), prefetcher can't help
Interpretation
Prefetchers predict patterns: Sequential and fixed-stride access are predictable. The prefetcher loads cache lines ahead of your access, hiding memory latency.
Random access fails: Prefetchers can't predict random patterns. Every access is a cache miss, exposing full DRAM latency.
Checklist
- ✓ Measured sequential, stride, and random access performance
- ✓ Understood when prefetching helps vs fails