Measurement Methodology

What You'll Learn

Mental Model

Benchmarking is harder than it looks. CPUs have variable frequency, caches warm up, compilers optimize away "unused" code, and background processes add noise. To get reliable measurements, we need to control these factors.

The goal: measure the true cost of your code, not artifacts of the measurement process.

Portable Benchmark Harness

Here's a portable C++ benchmark harness. The timing code uses only the standard library; the sink helper is the one compiler-specific piece:

#include <chrono>
#include <vector>
#include <algorithm>
#include <iostream>
#include <iomanip>

// Portable monotonic timer (high_resolution_clock is not guaranteed to be steady)
using Clock = std::chrono::steady_clock;
using Duration = std::chrono::nanoseconds;

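// Prevent the compiler from optimizing away work (GCC/Clang inline asm;
// other compilers fall back to a best-effort volatile read)
template<typename T>
void sink(const T& value) {
#if defined(__GNUC__) || defined(__clang__)
    asm volatile("" : : "r,m"(value) : "memory");
#else
    (void)*reinterpret_cast<const volatile char*>(&value);
#endif
}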

// Benchmark a function
template<typename Func>
double benchmark(Func&& func, int iterations = 1000, int warmup = 100) {
    // Warmup runs to fill caches and stabilize CPU frequency
    for (int i = 0; i < warmup; ++i) {
        func();
    }
    
    // Actual measurements
    std::vector<double> times;
    times.reserve(iterations);
    
    for (int i = 0; i < iterations; ++i) {
        auto start = Clock::now();
        func();
        auto end = Clock::now();
        
        auto duration = std::chrono::duration_cast<Duration>(end - start);
        times.push_back(duration.count());
    }
    
    // Use median to reduce impact of outliers
    std::sort(times.begin(), times.end());
    double median = times[times.size() / 2];
    double p95 = times[static_cast<size_t>(times.size() * 0.95)];
    
    std::cout << "Median: " << std::fixed << std::setprecision(2) 
              << median << " ns\n";
    std::cout << "P95:    " << p95 << " ns\n";
    std::cout << "Min:    " << times[0] << " ns\n";
    std::cout << "Max:    " << times.back() << " ns\n";
    
    return median;
}

// Example usage
int main() {
    volatile int sum = 0;  // volatile prevents optimization
    
    auto func = [&sum]() {
        for (int i = 0; i < 1000; ++i) {
            sum = sum + i;  // compound assignment on a volatile is deprecated in C++20
        }
    };
    
    benchmark(func, 1000, 100);
    return 0;
}

Key Principles

1. Warmup Runs

CPUs start at low frequency and ramp up. Caches are cold. Run warmup iterations before measuring to stabilize the system. Typically 100-1000 warmup runs are sufficient.

2. Avoid Debug Builds

Always compile with optimizations enabled (-O3 for GCC/Clang, /O2 for MSVC). Debug builds have different performance characteristics and aren't representative.
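A benchmark binary can guard against being built without optimizations by accident. The sketch below is one possible check, not a standard facility: built_with_optimizations is a hypothetical helper, and it assumes that GCC and Clang define __OPTIMIZE__ in optimizing builds and that on MSVC the absence of _DEBUG is a reasonable (but imperfect) proxy for a release build.

```cpp
// Hypothetical helper: reports whether this translation unit looks like an
// optimized build. Assumption: GCC/Clang define __OPTIMIZE__ at -O1 and
// above; on MSVC, _DEBUG only signals a debug runtime, so this is a proxy.
constexpr bool built_with_optimizations() {
#if defined(__OPTIMIZE__)
    return true;
#elif defined(_MSC_VER) && !defined(_DEBUG)
    return true;
#else
    return false;
#endif
}
```

A harness could print a warning at startup when this returns false, so that numbers from a debug build are never mistaken for real results.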

3. Prevent Dead Code Elimination

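Compilers will remove code that produces no visible side effects. To keep the measured work alive, pass its result to a sink function like the one in the harness above, or accumulate into a volatile variable as the example's main() does.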
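As a minimal sketch of the idea, the snippet below computes a value and hands it to a helper that marks it as used. The names do_not_optimize, sum_of_squares, and run_once are hypothetical; the helper assumes GCC or Clang extended asm and falls back to a volatile copy on other compilers, which is only a best-effort substitute.

```cpp
#include <cstdint>

// Hypothetical helper: tells the optimizer the value is observed without
// emitting instructions. Assumes GCC/Clang extended asm; elsewhere it makes
// a volatile copy, which is a best-effort fallback.
template <typename T>
inline void do_not_optimize(const T& value) {
#if defined(__GNUC__) || defined(__clang__)
    asm volatile("" : : "r,m"(value) : "memory");
#else
    volatile T copy = value;
    (void)copy;
#endif
}

// Work whose result would be dead code if nothing observed it.
inline std::uint64_t sum_of_squares(std::uint64_t n) {
    std::uint64_t total = 0;
    for (std::uint64_t i = 0; i < n; ++i) {
        total += i * i;
    }
    return total;
}

// Runs the work once and keeps the result alive for the optimizer.
inline std::uint64_t run_once(std::uint64_t n) {
    const std::uint64_t result = sum_of_squares(n);
    do_not_optimize(result);
    return result;
}
```

Compared with a volatile accumulator, this approach adds no loads or stores inside the measured loop, so it disturbs the timing less.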

4. Minimize I/O

File I/O, console output, and system calls add huge variance. Do all I/O outside the timed region. Pre-allocate data structures.

5. Measure Variance

Don't just report the mean. Report median (robust to outliers), percentiles (P95, P99), min, and max. High variance indicates measurement problems.
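The reporting step can be sketched as a small summary function. Summary, percentile, and summarize are illustrative names, and the percentile index follows the same floor(q * N) convention the harness above uses for P95; other percentile definitions exist and give slightly different values.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Summary of a set of timing samples (all values in nanoseconds).
struct Summary {
    double min;
    double median;
    double p95;
    double p99;
    double max;
};

// Element at index floor(q * N) of an already sorted, non-empty vector,
// clamped to the last element. q is a fraction in [0, 1].
inline double percentile(const std::vector<double>& sorted, double q) {
    assert(!sorted.empty());
    const std::size_t idx = static_cast<std::size_t>(q * sorted.size());
    return sorted[std::min(idx, sorted.size() - 1)];
}

// Takes the samples by value so the caller's data is left unsorted.
inline Summary summarize(std::vector<double> samples) {
    assert(!samples.empty());
    std::sort(samples.begin(), samples.end());
    return Summary{
        samples.front(),
        percentile(samples, 0.50),
        percentile(samples, 0.95),
        percentile(samples, 0.99),
        samples.back(),
    };
}
```

With only a handful of samples, P95 and P99 collapse onto the maximum, which is one reason to collect at least a few hundred measurements.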

6. Reduce Noise

Close background applications. On Linux, use taskset to pin to a CPU core. On macOS, use caffeinate to prevent sleep. Isolate the process from system activity.
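Pinning can also be done from inside the benchmark instead of through taskset. The following is a Linux-only sketch using sched_getaffinity and sched_setaffinity; pin_to_cpu and first_allowed_cpu are hypothetical helper names, and on other platforms both simply report failure.

```cpp
#if defined(__linux__)
#include <sched.h>
#endif

// Pins the calling thread to a single CPU, roughly what `taskset -c <cpu>`
// does from the shell. Linux-only; returns false elsewhere or on failure.
inline bool pin_to_cpu(int cpu) {
#if defined(__linux__)
    if (cpu < 0 || cpu >= CPU_SETSIZE) return false;
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    // A pid of 0 means "the calling thread".
    return sched_setaffinity(0, sizeof(set), &set) == 0;
#else
    (void)cpu;
    return false;
#endif
}

// Returns the lowest-numbered CPU this thread is currently allowed to run
// on, or -1 if that cannot be determined (including on non-Linux systems).
inline int first_allowed_cpu() {
#if defined(__linux__)
    cpu_set_t set;
    CPU_ZERO(&set);
    if (sched_getaffinity(0, sizeof(set), &set) != 0) return -1;
    for (int cpu = 0; cpu < CPU_SETSIZE; ++cpu) {
        if (CPU_ISSET(cpu, &set)) return cpu;
    }
#endif
    return -1;
}
```

Calling pin_to_cpu(first_allowed_cpu()) before the warmup loop keeps the benchmark on one core; picking from the allowed set matters inside containers, where CPU 0 may not be available to the process.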

7. Use High-Resolution Timers

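std::chrono::high_resolution_clock typically reports time in nanoseconds on modern systems, which is sufficient for microbenchmarks, but the actual resolution depends on the platform. It is also not guaranteed to be monotonic (check is_steady), so std::chrono::steady_clock is the safer choice for measuring intervals.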
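The unit a clock reports in says little about how fine its steps really are. One rough way to check, sketched below, is to read the clock back to back and record the smallest step it ever shows. observed_tick_ns is a hypothetical helper, and its result includes the cost of calling now(), so it is an upper bound on the resolution, not an exact figure.

```cpp
#include <algorithm>
#include <chrono>
#include <cstdint>
#include <limits>

static_assert(std::chrono::steady_clock::is_steady,
              "steady_clock is required to be monotonic");

// Smallest non-zero step observed between consecutive readings of Clock,
// in nanoseconds. The value includes the overhead of Clock::now() itself.
template <typename Clock>
std::int64_t observed_tick_ns(int probes = 1000) {
    using std::chrono::duration_cast;
    using std::chrono::nanoseconds;

    std::int64_t smallest = std::numeric_limits<std::int64_t>::max();
    for (int i = 0; i < probes; ++i) {
        const auto t0 = Clock::now();
        auto t1 = Clock::now();
        // Spin until the clock visibly advances.
        while (t1 == t0) {
            t1 = Clock::now();
        }
        const std::int64_t step = duration_cast<nanoseconds>(t1 - t0).count();
        smallest = std::min(smallest, step);
    }
    return smallest;
}
```

If observed_tick_ns<std::chrono::steady_clock>() comes back close to the duration of the function under test, a single call is too short to time on its own, and timing a batch of calls per sample is the usual workaround.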

What to Measure

Always report units clearly. "10 ns" is very different from "10 ms".
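One way to make units hard to omit is to format every duration through a single helper. The sketch below assumes the harness records nanoseconds; format_duration is a hypothetical name and the two-decimal format is an arbitrary choice.

```cpp
#include <cstdio>
#include <string>

// Formats a duration given in nanoseconds with an explicit unit, choosing
// the largest unit that keeps the printed value at or above 1.
inline std::string format_duration(double ns) {
    const char* unit = "ns";
    double value = ns;
    if (ns >= 1e9) {
        unit = "s";
        value = ns / 1e9;
    } else if (ns >= 1e6) {
        unit = "ms";
        value = ns / 1e6;
    } else if (ns >= 1e3) {
        unit = "us";
        value = ns / 1e3;
    }
    char buf[64];
    std::snprintf(buf, sizeof(buf), "%.2f %s", value, unit);
    return std::string(buf);
}
```

For example, format_duration(1500.0) yields "1.50 us" and format_duration(1e7) yields "10.00 ms".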

Expected Shape of Results

Good benchmarks show low variance and results that are consistent from run to run.

If you see high variance or inconsistent results, investigate noise sources before drawing conclusions.
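"High variance" can be made concrete with the coefficient of variation, the sample standard deviation divided by the mean. The sketch below is one possible check; looks_noisy and its 5% default threshold are assumptions chosen for illustration, not an established cutoff.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Coefficient of variation: sample standard deviation divided by the mean.
// It is unitless, so it can be compared across benchmarks.
inline double coefficient_of_variation(const std::vector<double>& samples) {
    assert(samples.size() >= 2);
    double mean = 0.0;
    for (double s : samples) mean += s;
    mean /= static_cast<double>(samples.size());
    assert(mean > 0.0);  // timing samples are expected to be positive

    double sq_sum = 0.0;
    for (double s : samples) sq_sum += (s - mean) * (s - mean);
    const double stddev =
        std::sqrt(sq_sum / static_cast<double>(samples.size() - 1));
    return stddev / mean;
}

// Assumed threshold for illustration: more than 5% relative spread is
// treated as a sign that noise sources should be investigated first.
inline bool looks_noisy(const std::vector<double>& samples,
                        double max_cv = 0.05) {
    return coefficient_of_variation(samples) > max_cv;
}
```

Because the mean and standard deviation are sensitive to outliers, a single large spike can trip this check, which is usually the desired behavior when hunting for noise.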

Common Pitfalls

Compiler Optimizations

Compilers are aggressive. They will unroll loops, eliminate dead code, and hoist invariants. Use -O3 but be aware of what the compiler is doing. Inspect assembly if needed.
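Hoisting is a risk whenever the benchmarked call has inputs the compiler can see through. A simple countermeasure, sketched here, is to read the input through a volatile object on every iteration. collatz_steps and run_with_opaque_input are hypothetical names, and the volatile accesses add a small load/store cost per iteration that becomes part of the measurement.

```cpp
#include <cassert>
#include <cstdint>

// Pure function: with a visible, loop-invariant argument the optimizer may
// evaluate it once and hoist it out of a timing loop.
inline std::uint64_t collatz_steps(std::uint64_t n) {
    assert(n >= 1);
    std::uint64_t steps = 0;
    while (n != 1) {
        n = (n % 2 == 0) ? n / 2 : 3 * n + 1;
        ++steps;
    }
    return steps;
}

// Reads the input through a volatile object on every iteration, so the
// compiler must treat it as unknown each time and cannot hoist the call.
// The volatile accumulator keeps every result observable.
inline std::uint64_t run_with_opaque_input(std::uint64_t input, int iterations) {
    volatile std::uint64_t opaque = input;
    volatile std::uint64_t total = 0;
    for (int i = 0; i < iterations; ++i) {
        total = total + collatz_steps(opaque);
    }
    return total;
}
```

Inspecting the generated assembly remains the only way to confirm that the loop body actually survived.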

CPU Frequency Scaling

Modern CPUs change frequency based on load and temperature. Warmup helps, but for very precise measurements, you may need to disable frequency scaling (requires admin privileges).

Cache Effects

First run after allocation is slower (cold cache). Subsequent runs are faster (warm cache). This is why warmup is essential.
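One way to see the cold-start effect on a given machine is to time each run individually with no warmup and compare the first samples against later ones. time_each_run below is a hypothetical helper written for that purpose; how many runs it takes to settle depends on the workload and hardware.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <vector>

// Times each call to func individually, with no warmup, and returns the
// samples in run order (nanoseconds). The first entries capture the cold
// runs; later entries approach the steady state.
template <typename Func>
std::vector<std::int64_t> time_each_run(Func&& func, int runs) {
    assert(runs >= 0);
    using Clock = std::chrono::steady_clock;
    using std::chrono::duration_cast;
    using std::chrono::nanoseconds;

    std::vector<std::int64_t> samples;
    samples.reserve(static_cast<std::size_t>(runs));
    for (int i = 0; i < runs; ++i) {
        const auto start = Clock::now();
        func();
        const auto end = Clock::now();
        samples.push_back(duration_cast<nanoseconds>(end - start).count());
    }
    return samples;
}
```

Printing the first ten samples next to the median of the rest gives a quick estimate of how many warmup iterations the harness needs.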

Allocation Overhead

Memory allocation can dominate timing. Pre-allocate all buffers before the timed region.
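A small sketch of keeping allocation out of the timed region: the caller allocates the buffer up front, and the timed function only touches memory that already exists. time_fill_ns is a hypothetical name used for illustration.

```cpp
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <vector>

// Times only the fill loop, in nanoseconds. The buffer is allocated by the
// caller beforehand, so no allocation happens between the two clock reads.
inline std::int64_t time_fill_ns(std::vector<int>& buffer) {
    using Clock = std::chrono::steady_clock;
    const auto start = Clock::now();
    for (std::size_t i = 0; i < buffer.size(); ++i) {
        buffer[i] = static_cast<int>(i);
    }
    const auto end = Clock::now();
    return std::chrono::duration_cast<std::chrono::nanoseconds>(end - start)
        .count();
}

// Usage sketch:
//   std::vector<int> buffer(1000000);   // allocated outside the timed region
//   std::int64_t ns = time_fill_ns(buffer);
```

Reusing the same buffer across iterations also avoids measuring the allocator's behavior instead of the code under test.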

Tooling Upgrades (Optional)

Linux: perf stat

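perf stat -r 10 ./your_benchmark runs the benchmark 10 times and reports averaged counters such as cycles (with the effective clock rate), instructions, and branch misses. Add -e cache-references,cache-misses to include cache counters. Very useful for understanding what's happening.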

macOS: Instruments

Xcode Instruments can profile your code with its Time Profiler, hardware counters, and system call tracing. Useful for identifying bottlenecks.

Windows: ETW

Event Tracing for Windows (ETW) can capture CPU events. Windows Performance Analyzer (WPA) provides visualization. More complex to set up but powerful.

Checklist