Performance Engineering Self-Test Checklist
A timing-first self-test for CPU performance engineering. For each question: explain simply, design a microbenchmark, predict outcomes, and interpret results.
Use this page to assess whether you actually understand performance engineering.
How to Use This Checklist
- Pick a section.
- Choose 3–5 questions.
- Write your answers in the placeholders.
- Implement the microbenchmark (portable, timing-first).
- Write down your prediction, then run it, collect results, and write your interpretation.
- Revisit later and refine.
Tip: Keep your microbenchmarks tiny, control one variable, and validate by inspecting assembly when relevant.
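The tip above can be sketched as a small timing harness. This is a minimal C++ sketch, not a definitive implementation: `g_sink`, `median_ns`, `time_runs`, and `sum_to` are hypothetical helpers invented for illustration, and it assumes `std::chrono::steady_clock` has enough resolution for your workload on your platform.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <vector>

// Volatile sink (hypothetical name): storing each result here is an
// observable side effect, so the compiler cannot discard the work.
volatile std::uint64_t g_sink = 0;

// Median of a sample set. Takes a copy because nth_element reorders it.
double median_ns(std::vector<double> samples) {
    if (samples.empty()) return 0.0;
    const std::size_t mid = samples.size() / 2;
    std::nth_element(samples.begin(), samples.begin() + mid, samples.end());
    const double hi = samples[mid];
    if (samples.size() % 2 == 1) return hi;
    // Even count: average the two middle values.
    const double lo = *std::max_element(samples.begin(), samples.begin() + mid);
    return (lo + hi) / 2.0;
}

// Runs `work` `warmup` times untimed, then `reps` times timed.
// Returns one wall-clock duration in nanoseconds per timed run.
template <class Work>
std::vector<double> time_runs(Work&& work, int warmup, int reps) {
    assert(warmup >= 0 && reps > 0);
    using clock = std::chrono::steady_clock;
    for (int i = 0; i < warmup; ++i) g_sink = work();
    std::vector<double> ns;
    ns.reserve(static_cast<std::size_t>(reps));
    for (int i = 0; i < reps; ++i) {
        const auto t0 = clock::now();
        g_sink = work();
        const auto t1 = clock::now();
        ns.push_back(std::chrono::duration<double, std::nano>(t1 - t0).count());
    }
    return ns;
}

// Example workload (assumption, for illustration only): sum of 0..n-1.
std::uint64_t sum_to(std::uint64_t n) {
    std::uint64_t s = 0;
    for (std::uint64_t i = 0; i < n; ++i) s += i;
    return s;
}
```

A typical call might look like `auto runs = time_runs([] { return sum_to(1000000); }, 3, 21);` followed by `median_ns(runs)`. An optimizing compiler may reduce `sum_to` to a closed-form expression, which is the kind of surprise that inspecting the assembly reveals.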
Template (Copy/Paste Per Question)
Q#: [Paste question here]
- Explain it simply:
[Write a 2–6 sentence explanation for a smart general engineer.]
- Design a microbenchmark:
[What's the minimal code + setup? What input sizes? What to vary? How to avoid compiler elision?]
- Interpret results:
[What curve shapes do you expect (flat, linear, step changes)? What would confirm/deny your hypothesis? What confounders exist?]
- Notes / Links:
[Anything useful: compiler flags, tooling notes, known gotchas.]
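One way to approach the "Design a microbenchmark" placeholder is a sweep that varies a single variable, here the input size. The following is a minimal C++ sketch under stated assumptions: `SweepPoint`, `sum_buffer`, `g_checksum`, and `sweep_sizes` are hypothetical names, the kernel is a plain buffer sum chosen only for illustration, and the minimum over repetitions is used as a simple noise filter (the median is an equally reasonable choice).

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <vector>

// Hypothetical result row: one input size and its best observed cost.
struct SweepPoint {
    std::size_t elements;
    double ns_per_element;
};

// Volatile sink so the summed result counts as an observable side effect.
volatile std::uint64_t g_checksum = 0;

// Kernel under test (assumption): sums every element of the buffer.
std::uint64_t sum_buffer(const std::vector<std::uint32_t>& buf) {
    std::uint64_t s = 0;
    for (std::uint32_t v : buf) s += v;
    return s;
}

// Doubles the element count from min_elems up to max_elems and records
// the fastest of `reps` timed runs at each size.
std::vector<SweepPoint> sweep_sizes(std::size_t min_elems,
                                    std::size_t max_elems,
                                    int reps) {
    assert(reps > 0);
    using clock = std::chrono::steady_clock;
    std::vector<SweepPoint> out;
    if (min_elems == 0) return out;  // doubling from zero would never advance
    for (std::size_t n = min_elems; n <= max_elems; n *= 2) {
        std::vector<std::uint32_t> buf(n);
        std::iota(buf.begin(), buf.end(), 0u);  // fill with 0, 1, 2, ...
        double best = -1.0;
        for (int r = 0; r < reps; ++r) {
            const auto t0 = clock::now();
            g_checksum = sum_buffer(buf);
            const auto t1 = clock::now();
            const double ns =
                std::chrono::duration<double, std::nano>(t1 - t0).count();
            if (best < 0.0 || ns < best) best = ns;
        }
        out.push_back({n, best / static_cast<double>(n)});
    }
    return out;
}
```

Calling `sweep_sizes(1 << 10, 1 << 24, 5)` and plotting `ns_per_element` against `elements` gives a curve you can compare against your prediction.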
I. Measurement & Methodology (1–15)
Q1: What does "performance" mean — latency, throughput, tail latency, or all three?
- Explain it simply: [...]
- Design a microbenchmark: [...]
- Interpret results: [...]
Q2: Why is measuring wall-clock time often sufficient?
- Explain it simply: [...]
- Design a microbenchmark: [...]
- Interpret results: [...]
Q3: When is wall-clock time misleading?
- Explain it simply: [...]
- Design a microbenchmark: [...]
- Interpret results: [...]
Q4: What is a warmup run and why does it matter?
- Explain it simply: [...]
- Design a microbenchmark: [...]
- Interpret results: [...]
Q5: Why should you report the median instead of the mean?
- Explain it simply: [...]
- Design a microbenchmark: [...]
- Interpret results: [...]
Q6: What causes high variance in microbenchmarks?
- Explain it simply: [...]
- Design a microbenchmark: [...]
- Interpret results: [...]
Q7: How does CPU frequency scaling affect results?
- Explain it simply: [...]
- Design a microbenchmark: [...]
- Interpret results: [...]
Q8: Why can debug builds invalidate performance experiments?
- Explain it simply: [...]
- Design a microbenchmark: [...]
- Interpret results: [...]
Q9: What is dead-code elimination and how does it ruin benchmarks?
- Explain it simply: [...]
- Design a microbenchmark: [...]
- Interpret results: [...]
Q10: How do you prevent the compiler from optimizing away work?
- Explain it simply: [...]
- Design a microbenchmark: [...]
- Interpret results: [...]
Q11: What is the difference between a microbenchmark and a macrobenchmark?
- Explain it simply: [...]
- Design a microbenchmark: [...]
- Interpret results: [...]
Q12: Why is isolating a single variable important in experiments?
- Explain it simply: [...]
- Design a microbenchmark: [...]
- Interpret results: [...]
Q13: What does "constant folding" mean?
- Explain it simply: [...]
- Design a microbenchmark: [...]
- Interpret results: [...]
Q14: Why can I/O distort CPU benchmarks?
- Explain it simply: [...]
- Design a microbenchmark: [...]
- Interpret results: [...]
Q15: Why must you inspect assembly for serious performance work?
- Explain it simply: [...]
- Design a microbenchmark: [...]
- Interpret results: [...]
II. Execution Core & ILP (16–30)
Q16: What is the difference between latency and throughput?
- Explain it simply: [...]
- Design a microbenchmark: [...]
- Interpret results: [...]
Q17: What is instruction-level parallelism (ILP)?
- Explain it simply: [...]
- Design a microbenchmark: [...]
- Interpret results: [...]
Q18: What is a superscalar processor?
- Explain it simply: [...]
- Design a microbenchmark: [...]
- Interpret results: [...]
Q19: What does "out-of-order execution" mean?
- Explain it simply: [...]
- Design a microbenchmark: [...]
- Interpret results: [...]
Q20: What is a reorder buffer?
- Explain it simply: [...]
- Design a microbenchmark: [...]
- Interpret results: [...]
Q21: What is a dependency chain?
- Explain it simply: [...]
- Design a microbenchmark: [...]
- Interpret results: [...]
Q22: Why does a long dependency chain reduce throughput?
- Explain it simply: [...]
- Design a microbenchmark: [...]
- Interpret results: [...]
Q23: Why can loop unrolling improve performance?
- Explain it simply: [...]
- Design a microbenchmark: [...]
- Interpret results: [...]
Q24: What is instruction issue width?
- Explain it simply: [...]
- Design a microbenchmark: [...]
- Interpret results: [...]
Q25: What limits IPC (instructions per cycle)?
- Explain it simply: [...]
- Design a microbenchmark: [...]
- Interpret results: [...]
Q26: Why is division much slower than addition?
- Explain it simply: [...]
- Design a microbenchmark: [...]
- Interpret results: [...]
Q27: What is a pipeline stall?
- Explain it simply: [...]
- Design a microbenchmark: [...]
- Interpret results: [...]
Q28: What is register renaming?
- Explain it simply: [...]
- Design a microbenchmark: [...]
- Interpret results: [...]
Q29: Why do independent instructions execute faster than dependent ones?
- Explain it simply: [...]
- Design a microbenchmark: [...]
- Interpret results: [...]
Q30: What happens when the reorder buffer fills?
- Explain it simply: [...]
- Design a microbenchmark: [...]
- Interpret results: [...]
III. Branch Prediction (31–40)
Q31: Why are branch mispredictions expensive?
- Explain it simply: [...]
- Design a microbenchmark: [...]
- Interpret results: [...]
Q32: What happens in the pipeline during a mispredict?
- Explain it simply: [...]
- Design a microbenchmark: [...]
- Interpret results: [...]
Q33: Why is sorted data often faster than unsorted?
- Explain it simply: [...]
- Design a microbenchmark: [...]
- Interpret results: [...]
Q34: What makes a branch predictable?
- Explain it simply: [...]
- Design a microbenchmark: [...]
- Interpret results: [...]
Q35: Why is random branching slow?
- Explain it simply: [...]
- Design a microbenchmark: [...]
- Interpret results: [...]
Q36: What is speculative execution?
- Explain it simply: [...]
- Design a microbenchmark: [...]
- Interpret results: [...]
Q37: Why can removing branches improve performance?
- Explain it simply: [...]
- Design a microbenchmark: [...]
- Interpret results: [...]
Q38: What is branchless programming?
- Explain it simply: [...]
- Design a microbenchmark: [...]
- Interpret results: [...]
Q39: When does branchless code hurt performance?
- Explain it simply: [...]
- Design a microbenchmark: [...]
- Interpret results: [...]
Q40: Why does a predictable branch become "free"?
- Explain it simply: [...]
- Design a microbenchmark: [...]
- Interpret results: [...]
IV. Memory Hierarchy (41–60)
Q41: What is a cache line?
- Explain it simply: [...]
- Design a microbenchmark: [...]
- Interpret results: [...]
Q42: Why is spatial locality important?
- Explain it simply: [...]
- Design a microbenchmark: [...]
- Interpret results: [...]
Q43: What is temporal locality?
- Explain it simply: [...]
- Design a microbenchmark: [...]
- Interpret results: [...]
Q44: Why is sequential access fast?
- Explain it simply: [...]
- Design a microbenchmark: [...]
- Interpret results: [...]
Q45: Why is random access slow?
- Explain it simply: [...]
- Design a microbenchmark: [...]
- Interpret results: [...]
Q46: What causes cache thrashing?
- Explain it simply: [...]
- Design a microbenchmark: [...]
- Interpret results: [...]
Q47: What is cache associativity?
- Explain it simply: [...]
- Design a microbenchmark: [...]
- Interpret results: [...]
Q48: What is a conflict miss?
- Explain it simply: [...]
- Design a microbenchmark: [...]
- Interpret results: [...]
Q49: What is a compulsory miss?
- Explain it simply: [...]
- Design a microbenchmark: [...]
- Interpret results: [...]
Q50: What is a capacity miss?
- Explain it simply: [...]
- Design a microbenchmark: [...]
- Interpret results: [...]
Q51: Why can a large (especially power-of-two) stride destroy performance?
- Explain it simply: [...]
- Design a microbenchmark: [...]
- Interpret results: [...]
Q52: How do you detect cache sizes experimentally?
- Explain it simply: [...]
- Design a microbenchmark: [...]
- Interpret results: [...]
Q53: Why is pointer chasing slow?
- Explain it simply: [...]
- Design a microbenchmark: [...]
- Interpret results: [...]
Q54: What is the difference between the L1, L2, and L3 caches?
- Explain it simply: [...]
- Design a microbenchmark: [...]
- Interpret results: [...]
Q55: Why are large working sets slow?
- Explain it simply: [...]
- Design a microbenchmark: [...]
- Interpret results: [...]
Q56: What is the hardware prefetcher?
- Explain it simply: [...]
- Design a microbenchmark: [...]
- Interpret results: [...]
Q57: When does prefetching fail?
- Explain it simply: [...]
- Design a microbenchmark: [...]
- Interpret results: [...]
Q58: Why does structure-of-arrays outperform array-of-structures in many cases?
- Explain it simply: [...]
- Design a microbenchmark: [...]
- Interpret results: [...]
Q59: What is false sharing?
- Explain it simply: [...]
- Design a microbenchmark: [...]
- Interpret results: [...]
Q60: Why does padding sometimes dramatically improve performance?
- Explain it simply: [...]
- Design a microbenchmark: [...]
- Interpret results: [...]
V. Virtual Memory & TLB (61–68)
Q61: What is virtual memory?
- Explain it simply: [...]
- Design a microbenchmark: [...]
- Interpret results: [...]
Q62: What is a page?
- Explain it simply: [...]
- Design a microbenchmark: [...]
- Interpret results: [...]
Q63: What is the TLB?
- Explain it simply: [...]
- Design a microbenchmark: [...]
- Interpret results: [...]
Q64: Why are TLB misses expensive?
- Explain it simply: [...]
- Design a microbenchmark: [...]
- Interpret results: [...]
Q65: What happens during a page walk?
- Explain it simply: [...]
- Design a microbenchmark: [...]
- Interpret results: [...]
Q66: Why does accessing one element per page hurt performance?
- Explain it simply: [...]
- Design a microbenchmark: [...]
- Interpret results: [...]
Q67: What are huge pages?
- Explain it simply: [...]
- Design a microbenchmark: [...]
- Interpret results: [...]
Q68: When do page faults occur?
- Explain it simply: [...]
- Design a microbenchmark: [...]
- Interpret results: [...]
VI. Memory Bandwidth & Roofline (69–76)
Q69: What is the difference between memory latency and bandwidth?
- Explain it simply: [...]
- Design a microbenchmark: [...]
- Interpret results: [...]
Q70: Why does adding threads not always increase bandwidth?
- Explain it simply: [...]
- Design a microbenchmark: [...]
- Interpret results: [...]
Q71: What is memory bandwidth saturation?
- Explain it simply: [...]
- Design a microbenchmark: [...]
- Interpret results: [...]
Q72: What is operational intensity?
- Explain it simply: [...]
- Design a microbenchmark: [...]
- Interpret results: [...]
Q73: What is the roofline model?
- Explain it simply: [...]
- Design a microbenchmark: [...]
- Interpret results: [...]
Q74: What makes a kernel memory-bound?
- Explain it simply: [...]
- Design a microbenchmark: [...]
- Interpret results: [...]
Q75: What makes a kernel compute-bound?
- Explain it simply: [...]
- Design a microbenchmark: [...]
- Interpret results: [...]
Q76: Why can compute-bound kernels scale better with thread count than memory-bound ones?
- Explain it simply: [...]
- Design a microbenchmark: [...]
- Interpret results: [...]
VII. SIMD & Vectorization (77–84)
Q77: What is SIMD?
- Explain it simply: [...]
- Design a microbenchmark: [...]
- Interpret results: [...]
Q78: Why is vectorization powerful?
- Explain it simply: [...]
- Design a microbenchmark: [...]
- Interpret results: [...]
Q79: Why doesn't vectorization always give linear speedup?
- Explain it simply: [...]
- Design a microbenchmark: [...]
- Interpret results: [...]
Q80: What prevents auto-vectorization?
- Explain it simply: [...]
- Design a microbenchmark: [...]
- Interpret results: [...]
Q81: Why is alignment important?
- Explain it simply: [...]
- Design a microbenchmark: [...]
- Interpret results: [...]
Q82: What is the difference between scalar and vector loads?
- Explain it simply: [...]
- Design a microbenchmark: [...]
- Interpret results: [...]
Q83: Why can memory become the bottleneck after vectorization?
- Explain it simply: [...]
- Design a microbenchmark: [...]
- Interpret results: [...]
Q84: What is horizontal reduction and why is it tricky?
- Explain it simply: [...]
- Design a microbenchmark: [...]
- Interpret results: [...]
VIII. Concurrency & Contention (85–92)
Q85: What is cache coherence?
- Explain it simply: [...]
- Design a microbenchmark: [...]
- Interpret results: [...]
Q86: Why do shared writes scale poorly?
- Explain it simply: [...]
- Design a microbenchmark: [...]
- Interpret results: [...]
Q87: What is a coherence protocol?
- Explain it simply: [...]
- Design a microbenchmark: [...]
- Interpret results: [...]
Q88: What is false sharing in multithreaded code?
- Explain it simply: [...]
- Design a microbenchmark: [...]
- Interpret results: [...]
Q89: Why do mutexes collapse under contention?
- Explain it simply: [...]
- Design a microbenchmark: [...]
- Interpret results: [...]
Q90: What is lock-free programming?
- Explain it simply: [...]
- Design a microbenchmark: [...]
- Interpret results: [...]
Q91: Why do atomic increments not scale well?
- Explain it simply: [...]
- Design a microbenchmark: [...]
- Interpret results: [...]
Q92: What is tail latency and why does contention increase it?
- Explain it simply: [...]
- Design a microbenchmark: [...]
- Interpret results: [...]
IX. Compilers & Code Generation (93–100)
Q93: Why does `-O3` change performance dramatically?
- Explain it simply: [...]
- Design a microbenchmark: [...]
- Interpret results: [...]
Q94: What is inlining and when does it help?
- Explain it simply: [...]
- Design a microbenchmark: [...]
- Interpret results: [...]
Q95: What is loop vectorization?
- Explain it simply: [...]
- Design a microbenchmark: [...]
- Interpret results: [...]
Q96: What is loop interchange?
- Explain it simply: [...]
- Design a microbenchmark: [...]
- Interpret results: [...]
Q97: What is strength reduction?
- Explain it simply: [...]
- Design a microbenchmark: [...]
- Interpret results: [...]
Q98: Why can function calls hurt hot loops?
- Explain it simply: [...]
- Design a microbenchmark: [...]
- Interpret results: [...]
Q99: What is instruction cache pressure?
- Explain it simply: [...]
- Design a microbenchmark: [...]
- Interpret results: [...]
Q100: Why is reading assembly an essential performance skill?
- Explain it simply: [...]
- Design a microbenchmark: [...]
- Interpret results: [...]