Measurement Methodology
What You'll Learn
- How to write reliable, portable benchmarks
- Techniques to reduce measurement noise
- How to avoid compiler optimizations that invalidate experiments
- Best practices for timing measurements
Mental Model
Benchmarking is harder than it looks. CPUs have variable frequency, caches warm up, compilers optimize away "unused" code, and background processes add noise. To get reliable measurements, we need to control these factors.
The goal: measure the true cost of your code, not artifacts of the measurement process.
Portable Benchmark Harness
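Here's a C++ benchmark harness built on std::chrono. The timing code is portable; the sink helper relies on GCC/Clang inline assembly and needs a different implementation on MSVC: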
#include <chrono>
#include <vector>
#include <algorithm>
#include <cstddef>
#include <iostream>
#include <iomanip>

// Portable high-resolution timer
using Clock = std::chrono::high_resolution_clock;
using Duration = std::chrono::nanoseconds;

// Prevent compiler from optimizing away work.
// Uses GCC/Clang extended inline assembly; this does not compile on MSVC.
template<typename T>
void sink(T&& value) {
    asm volatile("" : "+r"(value) : : "memory");
}

// Benchmark a function and return the median time per call in nanoseconds
template<typename Func>
double benchmark(Func&& func, int iterations = 1000, int warmup = 100) {
    // Nothing to measure; avoid indexing into an empty vector below
    if (iterations <= 0) {
        return 0.0;
    }

    // Warmup runs to fill caches and stabilize CPU frequency
    for (int i = 0; i < warmup; ++i) {
        func();
    }

    // Actual measurements
    std::vector<double> times;
    times.reserve(static_cast<std::size_t>(iterations));
    for (int i = 0; i < iterations; ++i) {
        auto start = Clock::now();
        func();
        auto end = Clock::now();
        auto duration = std::chrono::duration_cast<Duration>(end - start);
        times.push_back(static_cast<double>(duration.count()));
    }

    // Use median to reduce impact of outliers
    std::sort(times.begin(), times.end());
    double median = times[times.size() / 2];
    double p95 = times[static_cast<std::size_t>(times.size() * 0.95)];

    // Printing happens after all timing is done
    std::cout << "Median: " << std::fixed << std::setprecision(2)
              << median << " ns\n";
    std::cout << "P95: " << p95 << " ns\n";
    std::cout << "Min: " << times[0] << " ns\n";
    std::cout << "Max: " << times.back() << " ns\n";
    return median;
}

// Example usage
int main() {
    // volatile forces every read and write of sum, so the loop is not removed
    volatile int sum = 0;
    auto func = [&sum]() {
        for (int i = 0; i < 1000; ++i) {
            sum = sum + i;  // compound assignment on volatile is deprecated in C++20
        }
    };
    benchmark(func, 1000, 100);
    return 0;
}
}
Key Principles
1. Warmup Runs
CPUs often idle at a low frequency and ramp up under load, and caches start cold. Run warmup iterations before measuring so the system reaches a steady state. For short functions, 100-1000 warmup runs are typically sufficient.
2. Avoid Debug Builds
Always compile with optimizations enabled (-O3 for GCC/Clang, /O2 for MSVC).
Debug builds have different performance characteristics and aren't representative.
3. Prevent Dead Code Elimination
Compilers will optimize away code that doesn't produce visible side effects. Use:
- volatile variables
- asm volatile("" ::: "memory") memory barriers
- Actually consume computed values (print, accumulate, etc.)
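As a minimal sketch of the "consume the value" option, the snippet below routes both the input and the result through volatile globals. The names g_input, g_consumed, consume, sum_to, and timed_body are hypothetical helpers made up for this illustration; they are not part of the harness above.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical globals for illustration. Reading the input from a volatile
// keeps the compiler from folding the whole computation into a constant.
volatile std::uint64_t g_input = 1000;

// A volatile store is an observable side effect, so the stored value
// must actually be produced at run time.
volatile std::uint64_t g_consumed = 0;

inline void consume(std::uint64_t value) {
    g_consumed = value;
}

// Work whose result would otherwise be discarded.
inline std::uint64_t sum_to(std::uint64_t n) {
    std::uint64_t total = 0;
    for (std::uint64_t i = 0; i < n; ++i) {
        total += i;
    }
    return total;
}

// The body you would place inside the timed region.
inline void timed_body() {
    const std::uint64_t n = g_input;  // opaque input
    consume(sum_to(n));               // result escapes through a volatile store
}
```

This costs one extra load and store per call, which is usually negligible. Note that the compiler may still simplify the loop itself (for example into a closed-form expression), so inspecting the generated assembly remains worthwhile.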
4. Minimize I/O
File I/O, console output, and system calls add huge variance. Do all I/O outside the timed region. Pre-allocate data structures.
5. Measure Variance
Don't just report the mean. Report median (robust to outliers), percentiles (P95, P99), min, and max. High variance indicates measurement problems.
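The statistics listed above can be computed from the raw samples in a few lines. The following is a sketch, not a definitive implementation: Summary, percentile, and summarize are hypothetical names, and the percentile uses a simple floor(q * n) index, which is one of several reasonable conventions.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical summary record for one benchmark run (all values in ns).
struct Summary {
    double min;
    double median;
    double p95;
    double p99;
    double max;
};

// Percentile of an already sorted, non-empty vector.
// Index convention (an assumption): floor(q * n), clamped to the last element.
inline double percentile(const std::vector<double>& sorted, double q) {
    assert(!sorted.empty());
    std::size_t idx =
        static_cast<std::size_t>(q * static_cast<double>(sorted.size()));
    idx = std::min(idx, sorted.size() - 1);
    return sorted[idx];
}

// Takes the samples by value so the caller's vector is left unsorted.
inline Summary summarize(std::vector<double> samples) {
    assert(!samples.empty());
    std::sort(samples.begin(), samples.end());
    return Summary{samples.front(),
                   percentile(samples, 0.50),
                   percentile(samples, 0.95),
                   percentile(samples, 0.99),
                   samples.back()};
}
```

With only a handful of samples, P95 and P99 collapse onto the maximum, so high percentiles only become meaningful with hundreds of measurements or more.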
6. Reduce Noise
Close background applications. On Linux, use taskset to pin to a CPU core.
On macOS, use caffeinate to prevent sleep. Isolate the process from system activity.
7. Use High-Resolution Timers
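std::chrono::high_resolution_clock typically offers nanosecond-scale resolution on modern systems, but the standard guarantees neither its resolution nor that it is monotonic; it is often just an alias for steady_clock or system_clock.
This is usually sufficient for microbenchmarks. When you need a clock that is guaranteed never to jump backwards, use std::chrono::steady_clock.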
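One way to check what a clock actually provides is to query it. The sketch below uses only standard std::chrono members (is_steady, period, now); the names BenchClock, tick_ns, and observed_resolution are hypothetical and introduced here for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <type_traits>

// Use high_resolution_clock only when it is monotonic on this platform;
// otherwise fall back to steady_clock.
using BenchClock = std::conditional_t<
    std::chrono::high_resolution_clock::is_steady,
    std::chrono::high_resolution_clock,
    std::chrono::steady_clock>;

static_assert(BenchClock::is_steady, "benchmark clock must be monotonic");

// Nominal tick length in nanoseconds, derived from the clock's period.
// This is the representable unit, not necessarily the real resolution.
constexpr double tick_ns() {
    using Period = BenchClock::period;
    return 1e9 * static_cast<double>(Period::num) /
           static_cast<double>(Period::den);
}

// Smallest non-zero gap observed between consecutive now() calls:
// an empirical estimate of the usable resolution plus call overhead.
inline std::chrono::nanoseconds observed_resolution(int trials = 1000) {
    assert(trials > 0);
    auto best = std::chrono::nanoseconds::max();
    for (int i = 0; i < trials; ++i) {
        const auto t0 = BenchClock::now();
        auto t1 = BenchClock::now();
        while (t1 == t0) {
            t1 = BenchClock::now();  // spin until the clock advances
        }
        best = std::min(
            best, std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0));
    }
    return best;
}
```

If the observed resolution is comparable to the duration of the function under test, time a batch of calls per measurement instead of a single call.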
What to Measure
- Wall-clock time: Total elapsed time (nanoseconds)
- Throughput: Operations per second or bytes per second
- Latency: Time per operation
- Scaling: How performance changes with input size or thread count
Always report units clearly. "10 ns" is very different from "10 ms".
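Latency and throughput are both derived from the same elapsed time, and carrying the unit in the type helps avoid ns/ms mix-ups. A small sketch, where Nanos, latency_ns_per_op, ops_per_second, and bytes_per_second are hypothetical helper names:

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

// Floating-point nanoseconds; any std::chrono duration converts to it implicitly.
using Nanos = std::chrono::duration<double, std::nano>;

// Latency: average time per operation, in nanoseconds.
inline double latency_ns_per_op(Nanos elapsed, std::uint64_t ops) {
    assert(ops > 0);
    return elapsed.count() / static_cast<double>(ops);
}

// Throughput: operations completed per second.
inline double ops_per_second(Nanos elapsed, std::uint64_t ops) {
    const double seconds = std::chrono::duration<double>(elapsed).count();
    assert(seconds > 0.0);
    return static_cast<double>(ops) / seconds;
}

// Throughput: bytes processed per second.
inline double bytes_per_second(Nanos elapsed, std::uint64_t bytes) {
    const double seconds = std::chrono::duration<double>(elapsed).count();
    assert(seconds > 0.0);
    return static_cast<double>(bytes) / seconds;
}
```

For example, 1000 operations completed in 2 ms correspond to 2000 ns per operation, or 500,000 operations per second.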
Expected Shape of Results
Good benchmarks show:
- Low variance (P95 within 2x of median)
- Consistent results across runs
- Clear trends when varying parameters
If you see high variance or inconsistent results, investigate noise sources before drawing conclusions.
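These criteria can be turned into simple automated checks. In the sketch below, low_variance encodes the "P95 within 2x of median" rule from the list above, and relative_spread reports how far the medians of repeated runs drift apart. Both function names are hypothetical, and what counts as an acceptable spread is left to the caller.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// True when P95 is within max_ratio times the median (2x by default).
inline bool low_variance(double median, double p95, double max_ratio = 2.0) {
    return median > 0.0 && p95 <= max_ratio * median;
}

// Relative spread of the medians from several runs: (max - min) / min.
// A value of 0.10 means the slowest run was 10% slower than the fastest.
inline double relative_spread(const std::vector<double>& run_medians) {
    assert(!run_medians.empty());
    const auto range =
        std::minmax_element(run_medians.begin(), run_medians.end());
    const double lo = *range.first;
    const double hi = *range.second;
    assert(lo > 0.0);
    return (hi - lo) / lo;
}
```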
Common Pitfalls
Compiler Optimizations
Compilers are aggressive. They will unroll loops, eliminate dead code, and hoist invariants.
Use -O3 but be aware of what the compiler is doing. Inspect assembly if needed.
CPU Frequency Scaling
Modern CPUs change frequency based on load and temperature. Warmup helps, but for very precise measurements, you may need to disable frequency scaling (requires admin privileges).
Cache Effects
First run after allocation is slower (cold cache). Subsequent runs are faster (warm cache). This is why warmup is essential.
Allocation Overhead
Memory allocation can dominate timing. Pre-allocate all buffers before the timed region.
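A sketch of keeping allocation outside the timed region: the caller allocates the buffer once, and only the work on it is timed. The function names time_fill and measure_fill are hypothetical, and std::iota stands in for whatever work you actually want to measure.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <numeric>
#include <vector>

// Times only the work on an existing buffer; no allocation happens in here.
inline std::chrono::nanoseconds time_fill(std::vector<int>& buffer) {
    const auto start = std::chrono::steady_clock::now();
    std::iota(buffer.begin(), buffer.end(), 0);  // the work being measured
    const auto end = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::nanoseconds>(end - start);
}

// Allocates first (untimed), then measures, then consumes the result.
inline std::chrono::nanoseconds measure_fill(std::size_t n) {
    assert(n > 0);
    std::vector<int> buffer(n);  // allocation and zero-fill are not timed
    const auto elapsed = time_fill(buffer);
    volatile int keep = buffer.back();  // consume so the fill is not removed
    (void)keep;
    return elapsed;
}
```

Constructing the vector with its final size also touches every page up front, so the timed region does not pay for first-touch page faults.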
Tooling Upgrades (Optional)
Linux: perf stat
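perf stat -r 10 ./your_benchmark runs the benchmark 10 times and reports averaged counters such as cycles, instructions, and branch misses, along with the effective clock rate.
Add -e to request specific events such as cache misses. Very useful for understanding what's happening.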
macOS: Instruments
Xcode Instruments can profile your code with time profiling, hardware counters, and system call tracing. Useful for identifying bottlenecks.
Windows: ETW
Event Tracing for Windows (ETW) can capture CPU events. Windows Performance Analyzer (WPA) provides visualization. More complex to set up but powerful.
Checklist
- ✓ Compiled with optimizations enabled (-O3 or /O2)
- ✓ Warmup runs included (100-1000 iterations)
- ✓ Dead code elimination prevented (volatile or memory barriers)
- ✓ I/O moved outside timed region
- ✓ Variance reported (median, P95, min, max)
- ✓ Background processes minimized
- ✓ Results are consistent across multiple runs
- ✓ Units clearly labeled (ns, µs, ms, etc.)