“Know how to solve every problem that has been solved.” “What I cannot create, I do not understand.” — Richard Feynman

Cache Size Detection (Timing Sweep)

What You'll Learn

How to detect cache sizes using timing alone
Why latency increases in steps
How to measure memory access patterns
Understanding cache hierarchy through measurement

Mental Model

As you increase the working set size, you'll see step-like increases in access latency. Each step corresponds to a cache level being exceeded. By measuring when latency jumps, you can detect cache sizes without any hardware counters.

Experiment: Working Set Sweep

This experiment sweeps through different working set sizes and measures access time:

#include <chrono>
#include <iostream>
#include <vector>
#include <iomanip>

using Clock = std::chrono::high_resolution_clock;
using Duration = std::chrono::nanoseconds;

// Pointer-chasing to force memory access
double measure_access_time(size_t working_set_bytes) {
    size_t elements = working_set_bytes / sizeof(size_t);
    std::vector<size_t> data(elements);
    
    // Create a linked list pattern (pointer chase)
    for (size_t i = 0; i < elements - 1; ++i) {
        data[i] = (i + 1) * sizeof(size_t);
    }
    data[elements - 1] = 0;  // Terminate
    
    // Warmup
    size_t ptr = 0;
    for (int i = 0; i < 1000; ++i) {
        ptr = data[ptr / sizeof(size_t)];
    }
    
    // Measure
    const int iterations = 100000;
    volatile size_t sink = 0;
    
    auto start = Clock::now();
    ptr = 0;
    for (int i = 0; i < iterations; ++i) {
        ptr = data[ptr / sizeof(size_t)];
        sink = ptr;  // Prevent optimization
    }
    auto end = Clock::now();
    
    auto elapsed = std::chrono::duration_cast<Duration>(end - start);
    return static_cast<double>(elapsed.count()) / iterations;
}

int main() {
    std::cout << "Working Set (KB)\tAccess Time (ns)\n";
    std::cout << std::fixed << std::setprecision(2);
    
    // Sweep from 1 KB to 256 MB
    for (size_t kb = 1; kb <= 256 * 1024; kb *= 2) {
        size_t bytes = kb * 1024;
        double ns = measure_access_time(bytes);
        std::cout << kb << "\t\t" << ns << "\n";
    }
    
    return 0;
}

Compile: g++ -O3 -o cache_sweep cache_sweep.cpp
Run: ./cache_sweep

Alternative: Stride-Walk Pattern

A simpler pattern that's easier to understand:

// Stride-walk: access every Nth element
double measure_stride(size_t working_set_bytes, size_t stride = 64) {
    size_t elements = working_set_bytes / sizeof(int);
    std::vector<int> data(elements, 1);
    
    volatile int sum = 0;
    const int iterations = 1000000;
    
    auto start = Clock::now();
    for (int i = 0; i < iterations; ++i) {
        size_t idx = (i * stride) % elements;
        sum += data[idx];
    }
    auto end = Clock::now();
    
    auto elapsed = std::chrono::duration_cast<Duration>(end - start);
    return static_cast<double>(elapsed.count()) / iterations;
}

What to Measure

ns/access: Time per memory access vs working set size
Latency jumps: Where access time increases suddenly
Plateaus: Stable latency regions (within a cache level)

Expected Shape of Results

You should see a step-like curve:

Small working sets (< L1): Fast (~1-3 ns), fits in L1
L1 to L2 transition: Jump to ~10-20 ns
L2 to L3 transition: Jump to ~40-75 ns
L3 to DRAM transition: Jump to ~100-300 ns

The exact sizes depend on your CPU, but you should see clear steps. Plot the data to visualize the cache hierarchy.

Interpretation

Where the jumps come from: When the working set exceeds a cache level, data must be fetched from the next level down. Each level is slower but larger. The latency jump is the difference between cache levels.

Why it's step-like: Caches have fixed sizes. Once you exceed a cache, all accesses go to the next level, causing a sudden latency increase. The plateau before the jump is when you're still within that cache level.

Results Section

Plot your results here: Create a log-log plot of working set size (x-axis) vs access time (y-axis). You should see clear steps.

Insert your plot here

Expected: Step-like curve with jumps at L1→L2, L2→L3, L3→DRAM boundaries

Common Pitfalls

Prefetching: Hardware prefetchers can mask cache misses in sequential access
TLB effects: Large working sets may cause TLB misses, adding latency
Measurement noise: Use median of many runs, not single measurements
Pointer-chasing overhead: The pointer arithmetic adds some overhead
Virtual memory: Very large working sets may cause page faults

Tooling Upgrades (Optional)

Linux: perf stat

perf stat -e L1-dcache-loads,L1-dcache-load-misses,LLC-loads,LLC-load-misses ./your_program shows exact cache miss counts. Compare with timing measurements.

macOS: Instruments

Instruments can show cache miss rates on supported hardware. Useful for validating timing measurements.

Windows: ETW

ETW can capture cache-related performance counters. Windows Performance Analyzer provides visualization.

Checklist

✓ Measured access time across working set sizes (1 KB → 256 MB)
✓ Identified latency jumps (L1→L2, L2→L3, L3→DRAM)
✓ Plotted results and identified cache boundaries
✓ Verified results are consistent across runs
✓ Compared with known cache sizes (if available)