Cache Size Detection (Timing Sweep)
What You'll Learn
- How to detect cache sizes using timing alone
- Why latency increases in steps
- How to measure memory access patterns
- Understanding cache hierarchy through measurement
Mental Model
As you increase the working set size, you'll see step-like increases in access latency. Each step corresponds to a cache level being exceeded. By measuring when latency jumps, you can detect cache sizes without any hardware counters.
Experiment: Working Set Sweep
This experiment sweeps through different working set sizes and measures access time:
#include <chrono>
#include <iostream>
#include <vector>
#include <iomanip>
using Clock = std::chrono::high_resolution_clock;
using Duration = std::chrono::nanoseconds;
// Pointer-chasing to force memory access
double measure_access_time(size_t working_set_bytes) {
size_t elements = working_set_bytes / sizeof(size_t);
std::vector<size_t> data(elements);
// Create a linked list pattern (pointer chase)
for (size_t i = 0; i < elements - 1; ++i) {
data[i] = (i + 1) * sizeof(size_t);
}
data[elements - 1] = 0; // Terminate
// Warmup
size_t ptr = 0;
for (int i = 0; i < 1000; ++i) {
ptr = data[ptr / sizeof(size_t)];
}
// Measure
const int iterations = 100000;
volatile size_t sink = 0;
auto start = Clock::now();
ptr = 0;
for (int i = 0; i < iterations; ++i) {
ptr = data[ptr / sizeof(size_t)];
sink = ptr; // Prevent optimization
}
auto end = Clock::now();
auto elapsed = std::chrono::duration_cast<Duration>(end - start);
return static_cast<double>(elapsed.count()) / iterations;
}
int main() {
std::cout << "Working Set (KB)\tAccess Time (ns)\n";
std::cout << std::fixed << std::setprecision(2);
// Sweep from 1 KB to 256 MB
for (size_t kb = 1; kb <= 256 * 1024; kb *= 2) {
size_t bytes = kb * 1024;
double ns = measure_access_time(bytes);
std::cout << kb << "\t\t" << ns << "\n";
}
return 0;
} Compile: g++ -O3 -o cache_sweep cache_sweep.cpp
Run: ./cache_sweep
Alternative: Stride-Walk Pattern
A simpler pattern that's easier to understand:
// Stride-walk: access every Nth element
double measure_stride(size_t working_set_bytes, size_t stride = 64) {
size_t elements = working_set_bytes / sizeof(int);
std::vector<int> data(elements, 1);
volatile int sum = 0;
const int iterations = 1000000;
auto start = Clock::now();
for (int i = 0; i < iterations; ++i) {
size_t idx = (i * stride) % elements;
sum += data[idx];
}
auto end = Clock::now();
auto elapsed = std::chrono::duration_cast<Duration>(end - start);
return static_cast<double>(elapsed.count()) / iterations;
} What to Measure
- ns/access: Time per memory access vs working set size
- Latency jumps: Where access time increases suddenly
- Plateaus: Stable latency regions (within a cache level)
Expected Shape of Results
You should see a step-like curve:
- Small working sets (< L1): Fast (~1-3 ns), fits in L1
- L1 to L2 transition: Jump to ~10-20 ns
- L2 to L3 transition: Jump to ~40-75 ns
- L3 to DRAM transition: Jump to ~100-300 ns
The exact sizes depend on your CPU, but you should see clear steps. Plot the data to visualize the cache hierarchy.
Interpretation
Where the jumps come from: When the working set exceeds a cache level, data must be fetched from the next level down. Each level is slower but larger. The latency jump is the difference between cache levels.
Why it's step-like: Caches have fixed sizes. Once you exceed a cache, all accesses go to the next level, causing a sudden latency increase. The plateau before the jump is when you're still within that cache level.
Results Section
Plot your results here: Create a log-log plot of working set size (x-axis) vs access time (y-axis). You should see clear steps.
Insert your plot here
Expected: Step-like curve with jumps at L1→L2, L2→L3, L3→DRAM boundaries
Common Pitfalls
- Prefetching: Hardware prefetchers can mask cache misses in sequential access
- TLB effects: Large working sets may cause TLB misses, adding latency
- Measurement noise: Use median of many runs, not single measurements
- Pointer-chasing overhead: The pointer arithmetic adds some overhead
- Virtual memory: Very large working sets may cause page faults
Tooling Upgrades (Optional)
Linux: perf stat
perf stat -e L1-dcache-loads,L1-dcache-load-misses,LLC-loads,LLC-load-misses ./your_program
shows exact cache miss counts. Compare with timing measurements.
macOS: Instruments
Instruments can show cache miss rates on supported hardware. Useful for validating timing measurements.
Windows: ETW
ETW can capture cache-related performance counters. Windows Performance Analyzer provides visualization.
Checklist
- ✓ Measured access time across working set sizes (1 KB → 256 MB)
- ✓ Identified latency jumps (L1→L2, L2→L3, L3→DRAM)
- ✓ Plotted results and identified cache boundaries
- ✓ Verified results are consistent across runs
- ✓ Compared with known cache sizes (if available)