Performance Engineering Self-Test Checklist

A timing-first self-test for CPU performance engineering. For each question: explain simply, design a microbenchmark, predict outcomes, and interpret results.

Use this page to assess whether you actually understand performance engineering.

How to Use This Checklist

Tip: Keep your microbenchmarks tiny, control one variable, and validate by inspecting assembly when relevant.
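As one way to act on the tip above, here is a minimal timing harness sketch in C. It assumes a POSIX system with clock_gettime; the names now_ns, median_u64, bench_median_ns, spin_work, and MAX_REPS are hypothetical helpers made up for this illustration, not part of any library. It runs a few untimed warmup iterations, times each repetition separately, and reports the median so that a few outlier runs do not dominate.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <time.h>

enum { MAX_REPS = 1024 };  /* arbitrary cap chosen for this sketch */

/* Monotonic clock reading in nanoseconds (POSIX clock_gettime). */
static uint64_t now_ns(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/* qsort comparator for uint64_t that avoids overflow from subtraction. */
static int cmp_u64(const void *a, const void *b) {
    uint64_t x = *(const uint64_t *)a, y = *(const uint64_t *)b;
    return (x > y) - (x < y);
}

/* Sorts samples in place and returns the median (upper median for even n). */
uint64_t median_u64(uint64_t *samples, size_t n) {
    assert(n > 0);
    qsort(samples, n, sizeof samples[0], cmp_u64);
    return samples[n / 2];
}

/* Calls fn(arg) `warmup` times untimed, then times `reps` calls one by one.
   Returns the median time in ns, or 0 if reps is 0 or exceeds MAX_REPS. */
uint64_t bench_median_ns(void (*fn)(void *), void *arg,
                         size_t warmup, size_t reps) {
    uint64_t samples[MAX_REPS];
    if (reps == 0 || reps > MAX_REPS) return 0;
    for (size_t i = 0; i < warmup; i++) fn(arg);
    for (size_t i = 0; i < reps; i++) {
        uint64_t t0 = now_ns();
        fn(arg);
        samples[i] = now_ns() - t0;
    }
    return median_u64(samples, reps);
}

/* Hypothetical workload: *arg volatile increments, so the loop is kept. */
void spin_work(void *arg) {
    size_t n = *(const size_t *)arg;
    volatile uint64_t x = 0;
    for (size_t i = 0; i < n; i++) x = x + 1;
}
```

A typical call would be bench_median_ns(spin_work, &n, 3, 101), repeated while changing only n. Timing each call separately adds clock overhead to every sample, so for very short kernels it may be better to time a batch of calls and divide.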

Template (Copy/Paste Per Question)

Q#: [Paste question here]

- Explain it simply:
  [Write a 2–6 sentence explanation for a smart general engineer.]

- Design a microbenchmark:
  [What's the minimal code + setup? What input sizes? What to vary? How to avoid compiler elision?]

- Interpret results:
  [Predict first: what shapes do you expect? What would confirm or refute your hypothesis? What confounders exist?]

- Notes / Links:
  [Anything useful: compiler flags, tooling notes, known gotchas.]
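For the "How to avoid compiler elision?" prompt in the template, one common approach is to store the result of the measured work in a volatile object. The sketch below illustrates that idea in portable C; g_sink, sum_u64, run_sum, and last_sink are hypothetical names used only for this example. It is not the only technique (compiler-specific barriers are another option), and it does not by itself stop constant folding when the inputs are known at compile time, so inputs should come from runtime data.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* A store to a volatile object is observable behavior, so the compiler
   has to keep the computation that produces the stored value. */
static volatile uint64_t g_sink;

/* Hypothetical kernel under test: sum the first n elements of data. */
uint64_t sum_u64(const uint64_t *data, size_t n) {
    assert(data != NULL || n == 0);
    uint64_t acc = 0;
    for (size_t i = 0; i < n; i++) acc += data[i];
    return acc;
}

/* Benchmark body: the result goes into the sink instead of being dropped.
   The single variable the experiment changes is n (the working-set size). */
void run_sum(const uint64_t *data, size_t n) {
    g_sink = sum_u64(data, n);
}

/* Reads the sink back, e.g. to sanity-check that the work really ran. */
uint64_t last_sink(void) {
    return g_sink;
}
```

After building with optimizations, it is still worth checking the generated assembly to confirm the loop is present, since the sink only guarantees that the final value is produced, not how the compiler computes it.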

I. Measurement & Methodology (1–15)

Q1: What does "performance" mean — latency, throughput, tail latency, or all three?

  • Explain it simply: [...]
  • Design a microbenchmark: [...]
  • Interpret results: [...]

Q2: Why is measuring wall-clock time often sufficient?

  • Explain it simply: [...]
  • Design a microbenchmark: [...]
  • Interpret results: [...]

Q3: When is wall-clock time misleading?

  • Explain it simply: [...]
  • Design a microbenchmark: [...]
  • Interpret results: [...]

Q4: What is a warmup run and why does it matter?

  • Explain it simply: [...]
  • Design a microbenchmark: [...]
  • Interpret results: [...]

Q5: Why should you often report the median instead of the mean?

  • Explain it simply: [...]
  • Design a microbenchmark: [...]
  • Interpret results: [...]

Q6: What causes high variance in microbenchmarks?

  • Explain it simply: [...]
  • Design a microbenchmark: [...]
  • Interpret results: [...]

Q7: How does CPU frequency scaling affect results?

  • Explain it simply: [...]
  • Design a microbenchmark: [...]
  • Interpret results: [...]

Q8: Why can debug builds invalidate performance experiments?

  • Explain it simply: [...]
  • Design a microbenchmark: [...]
  • Interpret results: [...]

Q9: What is dead-code elimination and how does it ruin benchmarks?

  • Explain it simply: [...]
  • Design a microbenchmark: [...]
  • Interpret results: [...]

Q10: How do you prevent the compiler from optimizing away work?

  • Explain it simply: [...]
  • Design a microbenchmark: [...]
  • Interpret results: [...]

Q11: What is the difference between a microbenchmark and a macrobenchmark?

  • Explain it simply: [...]
  • Design a microbenchmark: [...]
  • Interpret results: [...]

Q12: Why is isolating a single variable important in experiments?

  • Explain it simply: [...]
  • Design a microbenchmark: [...]
  • Interpret results: [...]

Q13: What does "constant folding" mean?

  • Explain it simply: [...]
  • Design a microbenchmark: [...]
  • Interpret results: [...]

Q14: Why can I/O distort CPU benchmarks?

  • Explain it simply: [...]
  • Design a microbenchmark: [...]
  • Interpret results: [...]

Q15: Why must you inspect assembly for serious performance work?

  • Explain it simply: [...]
  • Design a microbenchmark: [...]
  • Interpret results: [...]

II. Execution Core & ILP (16–30)

Q16: What is the difference between latency and throughput?

  • Explain it simply: [...]
  • Design a microbenchmark: [...]
  • Interpret results: [...]

Q17: What is instruction-level parallelism (ILP)?

  • Explain it simply: [...]
  • Design a microbenchmark: [...]
  • Interpret results: [...]

Q18: What is a superscalar processor?

  • Explain it simply: [...]
  • Design a microbenchmark: [...]
  • Interpret results: [...]

Q19: What does "out-of-order execution" mean?

  • Explain it simply: [...]
  • Design a microbenchmark: [...]
  • Interpret results: [...]

Q20: What is a reorder buffer?

  • Explain it simply: [...]
  • Design a microbenchmark: [...]
  • Interpret results: [...]

Q21: What is a dependency chain?

  • Explain it simply: [...]
  • Design a microbenchmark: [...]
  • Interpret results: [...]

Q22: Why does a long dependency chain reduce throughput?

  • Explain it simply: [...]
  • Design a microbenchmark: [...]
  • Interpret results: [...]

Q23: Why can loop unrolling improve performance?

  • Explain it simply: [...]
  • Design a microbenchmark: [...]
  • Interpret results: [...]

Q24: What is instruction issue width?

  • Explain it simply: [...]
  • Design a microbenchmark: [...]
  • Interpret results: [...]

Q25: What limits IPC (instructions per cycle)?

  • Explain it simply: [...]
  • Design a microbenchmark: [...]
  • Interpret results: [...]

Q26: Why is division much slower than addition?

  • Explain it simply: [...]
  • Design a microbenchmark: [...]
  • Interpret results: [...]

Q27: What is a pipeline stall?

  • Explain it simply: [...]
  • Design a microbenchmark: [...]
  • Interpret results: [...]

Q28: What is register renaming?

  • Explain it simply: [...]
  • Design a microbenchmark: [...]
  • Interpret results: [...]

Q29: Why do independent instructions execute faster than dependent ones?

  • Explain it simply: [...]
  • Design a microbenchmark: [...]
  • Interpret results: [...]

Q30: What happens when the reorder buffer fills?

  • Explain it simply: [...]
  • Design a microbenchmark: [...]
  • Interpret results: [...]

III. Branch Prediction (31–40)

Q31: Why are branch mispredictions expensive?

  • Explain it simply: [...]
  • Design a microbenchmark: [...]
  • Interpret results: [...]

Q32: What happens in the pipeline during a mispredict?

  • Explain it simply: [...]
  • Design a microbenchmark: [...]
  • Interpret results: [...]

Q33: Why is sorted data often faster than unsorted?

  • Explain it simply: [...]
  • Design a microbenchmark: [...]
  • Interpret results: [...]

Q34: What makes a branch predictable?

  • Explain it simply: [...]
  • Design a microbenchmark: [...]
  • Interpret results: [...]

Q35: Why is random branching slow?

  • Explain it simply: [...]
  • Design a microbenchmark: [...]
  • Interpret results: [...]

Q36: What is speculative execution?

  • Explain it simply: [...]
  • Design a microbenchmark: [...]
  • Interpret results: [...]

Q37: Why can removing branches improve performance?

  • Explain it simply: [...]
  • Design a microbenchmark: [...]
  • Interpret results: [...]

Q38: What is branchless programming?

  • Explain it simply: [...]
  • Design a microbenchmark: [...]
  • Interpret results: [...]

Q39: When does branchless code hurt performance?

  • Explain it simply: [...]
  • Design a microbenchmark: [...]
  • Interpret results: [...]

Q40: Why does a predictable branch become "free"?

  • Explain it simply: [...]
  • Design a microbenchmark: [...]
  • Interpret results: [...]

IV. Memory Hierarchy (41–60)

Q41: What is a cache line?

  • Explain it simply: [...]
  • Design a microbenchmark: [...]
  • Interpret results: [...]

Q42: Why is spatial locality important?

  • Explain it simply: [...]
  • Design a microbenchmark: [...]
  • Interpret results: [...]

Q43: What is temporal locality?

  • Explain it simply: [...]
  • Design a microbenchmark: [...]
  • Interpret results: [...]

Q44: Why is sequential access fast?

  • Explain it simply: [...]
  • Design a microbenchmark: [...]
  • Interpret results: [...]

Q45: Why is random access slow?

  • Explain it simply: [...]
  • Design a microbenchmark: [...]
  • Interpret results: [...]

Q46: What causes cache thrashing?

  • Explain it simply: [...]
  • Design a microbenchmark: [...]
  • Interpret results: [...]

Q47: What is cache associativity?

  • Explain it simply: [...]
  • Design a microbenchmark: [...]
  • Interpret results: [...]

Q48: What is a conflict miss?

  • Explain it simply: [...]
  • Design a microbenchmark: [...]
  • Interpret results: [...]

Q49: What is a compulsory miss?

  • Explain it simply: [...]
  • Design a microbenchmark: [...]
  • Interpret results: [...]

Q50: What is a capacity miss?

  • Explain it simply: [...]
  • Design a microbenchmark: [...]
  • Interpret results: [...]

Q51: Why can a large or power-of-two stride destroy performance?

  • Explain it simply: [...]
  • Design a microbenchmark: [...]
  • Interpret results: [...]

Q52: How do you detect cache sizes experimentally?

  • Explain it simply: [...]
  • Design a microbenchmark: [...]
  • Interpret results: [...]

Q53: Why is pointer chasing slow?

  • Explain it simply: [...]
  • Design a microbenchmark: [...]
  • Interpret results: [...]

Q54: What is the difference between the L1, L2, and L3 caches?

  • Explain it simply: [...]
  • Design a microbenchmark: [...]
  • Interpret results: [...]

Q55: Why are large working sets slow?

  • Explain it simply: [...]
  • Design a microbenchmark: [...]
  • Interpret results: [...]

Q56: What is the hardware prefetcher?

  • Explain it simply: [...]
  • Design a microbenchmark: [...]
  • Interpret results: [...]

Q57: When does prefetching fail?

  • Explain it simply: [...]
  • Design a microbenchmark: [...]
  • Interpret results: [...]

Q58: Why does structure-of-arrays outperform array-of-structures in many cases?

  • Explain it simply: [...]
  • Design a microbenchmark: [...]
  • Interpret results: [...]

Q59: What is false sharing?

  • Explain it simply: [...]
  • Design a microbenchmark: [...]
  • Interpret results: [...]

Q60: Why does padding sometimes dramatically improve performance?

  • Explain it simply: [...]
  • Design a microbenchmark: [...]
  • Interpret results: [...]

V. Virtual Memory & TLB (61–68)

Q61: What is virtual memory?

  • Explain it simply: [...]
  • Design a microbenchmark: [...]
  • Interpret results: [...]

Q62: What is a page?

  • Explain it simply: [...]
  • Design a microbenchmark: [...]
  • Interpret results: [...]

Q63: What is the TLB?

  • Explain it simply: [...]
  • Design a microbenchmark: [...]
  • Interpret results: [...]

Q64: Why are TLB misses expensive?

  • Explain it simply: [...]
  • Design a microbenchmark: [...]
  • Interpret results: [...]

Q65: What happens during a page walk?

  • Explain it simply: [...]
  • Design a microbenchmark: [...]
  • Interpret results: [...]

Q66: Why does accessing one element per page hurt performance?

  • Explain it simply: [...]
  • Design a microbenchmark: [...]
  • Interpret results: [...]

Q67: What are huge pages?

  • Explain it simply: [...]
  • Design a microbenchmark: [...]
  • Interpret results: [...]

Q68: When do page faults occur?

  • Explain it simply: [...]
  • Design a microbenchmark: [...]
  • Interpret results: [...]

VI. Memory Bandwidth & Roofline (69–76)

Q69: What is the difference between memory latency and bandwidth?

  • Explain it simply: [...]
  • Design a microbenchmark: [...]
  • Interpret results: [...]

Q70: Why does adding threads not always increase bandwidth?

  • Explain it simply: [...]
  • Design a microbenchmark: [...]
  • Interpret results: [...]

Q71: What is memory saturation?

  • Explain it simply: [...]
  • Design a microbenchmark: [...]
  • Interpret results: [...]

Q72: What is operational intensity?

  • Explain it simply: [...]
  • Design a microbenchmark: [...]
  • Interpret results: [...]

Q73: What is the roofline model?

  • Explain it simply: [...]
  • Design a microbenchmark: [...]
  • Interpret results: [...]

Q74: What makes a kernel memory-bound?

  • Explain it simply: [...]
  • Design a microbenchmark: [...]
  • Interpret results: [...]

Q75: What makes a kernel compute-bound?

  • Explain it simply: [...]
  • Design a microbenchmark: [...]
  • Interpret results: [...]

Q76: Why can compute-bound kernels scale better than memory-bound ones?

  • Explain it simply: [...]
  • Design a microbenchmark: [...]
  • Interpret results: [...]

VII. SIMD & Vectorization (77–84)

Q77: What is SIMD?

  • Explain it simply: [...]
  • Design a microbenchmark: [...]
  • Interpret results: [...]

Q78: Why is vectorization powerful?

  • Explain it simply: [...]
  • Design a microbenchmark: [...]
  • Interpret results: [...]

Q79: Why doesn't vectorization always give linear speedup?

  • Explain it simply: [...]
  • Design a microbenchmark: [...]
  • Interpret results: [...]

Q80: What prevents auto-vectorization?

  • Explain it simply: [...]
  • Design a microbenchmark: [...]
  • Interpret results: [...]

Q81: Why is alignment important?

  • Explain it simply: [...]
  • Design a microbenchmark: [...]
  • Interpret results: [...]

Q82: What is the difference between scalar and vector loads?

  • Explain it simply: [...]
  • Design a microbenchmark: [...]
  • Interpret results: [...]

Q83: Why can memory become the bottleneck after vectorization?

  • Explain it simply: [...]
  • Design a microbenchmark: [...]
  • Interpret results: [...]

Q84: What is horizontal reduction and why is it tricky?

  • Explain it simply: [...]
  • Design a microbenchmark: [...]
  • Interpret results: [...]

VIII. Concurrency & Contention (85–92)

Q85: What is cache coherence?

  • Explain it simply: [...]
  • Design a microbenchmark: [...]
  • Interpret results: [...]

Q86: Why do shared writes scale poorly?

  • Explain it simply: [...]
  • Design a microbenchmark: [...]
  • Interpret results: [...]

Q87: What is a coherence protocol?

  • Explain it simply: [...]
  • Design a microbenchmark: [...]
  • Interpret results: [...]

Q88: What is false sharing in multithreaded code?

  • Explain it simply: [...]
  • Design a microbenchmark: [...]
  • Interpret results: [...]

Q89: Why do mutexes collapse under contention?

  • Explain it simply: [...]
  • Design a microbenchmark: [...]
  • Interpret results: [...]

Q90: What is lock-free programming?

  • Explain it simply: [...]
  • Design a microbenchmark: [...]
  • Interpret results: [...]

Q91: Why do atomic increments not scale well?

  • Explain it simply: [...]
  • Design a microbenchmark: [...]
  • Interpret results: [...]

Q92: What is tail latency and why does contention increase it?

  • Explain it simply: [...]
  • Design a microbenchmark: [...]
  • Interpret results: [...]

IX. Compilers & Code Generation (93–100)

Q93: Why does `-O3` change performance dramatically?

  • Explain it simply: [...]
  • Design a microbenchmark: [...]
  • Interpret results: [...]

Q94: What is inlining and when does it help?

  • Explain it simply: [...]
  • Design a microbenchmark: [...]
  • Interpret results: [...]

Q95: What is loop vectorization?

  • Explain it simply: [...]
  • Design a microbenchmark: [...]
  • Interpret results: [...]

Q96: What is loop interchange?

  • Explain it simply: [...]
  • Design a microbenchmark: [...]
  • Interpret results: [...]

Q97: What is strength reduction?

  • Explain it simply: [...]
  • Design a microbenchmark: [...]
  • Interpret results: [...]

Q98: Why can function calls hurt hot loops?

  • Explain it simply: [...]
  • Design a microbenchmark: [...]
  • Interpret results: [...]

Q99: What is instruction cache pressure?

  • Explain it simply: [...]
  • Design a microbenchmark: [...]
  • Interpret results: [...]

Q100: Why is reading assembly an essential performance skill?

  • Explain it simply: [...]
  • Design a microbenchmark: [...]
  • Interpret results: [...]