“Know how to solve every problem that has been solved.” “What I cannot create, I do not understand.” — Richard Feynman

Performance Engineering Self-Test Checklist

A timing-first self-test for CPU performance engineering. For each question: explain simply, design a microbenchmark, predict outcomes, and interpret results.

Use this page to assess whether you actually understand performance engineering.

How to Use This Checklist

1. Pick a section.
2. Choose 3–5 questions.
3. Write your answers in the placeholders.
4. Implement the microbenchmark (portable and timing-first: wall-clock timers before hardware counters).
5. Run it, collect results, and write your interpretation.
6. Revisit later and refine.

Tip: Keep your microbenchmarks tiny, control one variable, and validate by inspecting assembly when relevant.
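
A minimal sketch of what a portable, timing-first harness could look like in C++, assuming `std::chrono::steady_clock` has enough resolution for the work being timed. The helper names `median` and `time_runs_ns` are hypothetical and not taken from any library. The sketch runs a few untimed warmup iterations, records one wall-clock sample per timed run, and leaves the summary statistic to the caller.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Median of a non-empty sample set. Taken by value because it reorders the data.
inline double median(std::vector<double> samples) {
    assert(!samples.empty());
    const auto mid = static_cast<std::ptrdiff_t>(samples.size() / 2);
    std::nth_element(samples.begin(), samples.begin() + mid, samples.end());
    const double upper = samples[static_cast<std::size_t>(mid)];
    if (samples.size() % 2 == 1) return upper;
    // Even count: the lower middle value is the largest element left of `mid`.
    const double lower = *std::max_element(samples.begin(), samples.begin() + mid);
    return 0.5 * (lower + upper);
}

// Hypothetical helper: calls `work` `warmup` times untimed, then `reps` times
// timed, and returns the per-run wall-clock durations in nanoseconds.
template <class Work>
std::vector<double> time_runs_ns(Work&& work, int warmup, int reps) {
    using clock = std::chrono::steady_clock;  // monotonic, unaffected by clock adjustments
    for (int i = 0; i < warmup; ++i) work();
    std::vector<double> ns;
    ns.reserve(static_cast<std::size_t>(reps));
    for (int i = 0; i < reps; ++i) {
        const auto t0 = clock::now();
        work();
        const auto t1 = clock::now();
        ns.push_back(std::chrono::duration<double, std::nano>(t1 - t0).count());
    }
    return ns;
}
```

To control one variable, call `time_runs_ns` with the same `warmup` and `reps` while changing only the parameter under study (for example, the input size), then compare `median(...)` of each sample set. If a single call to `work` is shorter than the clock's resolution, loop inside `work` and divide the measured time by the loop count.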

Template (Copy/Paste Per Question)

Q#: [Paste question here]
- Explain it simply:

  [Write a 2–6 sentence explanation for a smart general engineer.]
- Design a microbenchmark:

  [What's the minimal code + setup? What input sizes? What to vary? How to avoid compiler elision?]
- Interpret results:

  [What shapes do you expect? What would confirm/deny your hypothesis? What confounders exist?]
- Notes / Links:

  [Anything useful: compiler flags, tooling notes, known gotchas.]
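
For the compiler elision prompt in the template, one common portable approach is to store each result into a `volatile` sink, so the computation has an observable side effect. The following is a sketch under that assumption; `keep`, `sum_prefix`, and `run_once` are hypothetical names introduced only for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical sink: a volatile store is an observable side effect, so the
// compiler has to produce the stored value instead of discarding the work.
inline void keep(std::uint64_t value) {
    static volatile std::uint64_t sink;
    sink = value;
}

// Kernel under test: sums the first `n` elements of `data`.
inline std::uint64_t sum_prefix(const std::vector<std::uint64_t>& data, std::size_t n) {
    assert(n <= data.size());  // precondition; compiled out when NDEBUG is defined
    std::uint64_t total = 0;
    for (std::size_t i = 0; i < n; ++i) total += data[i];
    return total;
}

// One benchmark iteration: compute, then hand the result to the sink.
inline std::uint64_t run_once(const std::vector<std::uint64_t>& data, std::size_t n) {
    const std::uint64_t total = sum_prefix(data, n);
    keep(total);
    return total;
}
```

A sink only guarantees that the final value is produced. The compiler may still fold constants or rewrite the loop when the inputs are known at compile time, so feed the kernel data that is only known at run time and confirm in the generated assembly that the loop you meant to time is still there.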

I. Measurement & Methodology (1–15)

Q1: What does "performance" mean — latency, throughput, tail latency, or all three?

Explain it simply: [...]
Design a microbenchmark: [...]
Interpret results: [...]

Q2: Why is measuring wall-clock time often sufficient?

Explain it simply: [...]
Design a microbenchmark: [...]
Interpret results: [...]

Q3: When is wall-clock time misleading?

Explain it simply: [...]
Design a microbenchmark: [...]
Interpret results: [...]

Q4: What is a warmup run and why does it matter?

Explain it simply: [...]
Design a microbenchmark: [...]
Interpret results: [...]

Q5: Why should you often report the median instead of the mean?

Explain it simply: [...]
Design a microbenchmark: [...]
Interpret results: [...]

Q6: What causes high variance in microbenchmarks?

Explain it simply: [...]
Design a microbenchmark: [...]
Interpret results: [...]

Q7: How does CPU frequency scaling affect results?

Explain it simply: [...]
Design a microbenchmark: [...]
Interpret results: [...]

Q8: Why can debug builds invalidate performance experiments?

Explain it simply: [...]
Design a microbenchmark: [...]
Interpret results: [...]

Q9: What is dead-code elimination and how does it ruin benchmarks?

Explain it simply: [...]
Design a microbenchmark: [...]
Interpret results: [...]

Q10: How do you prevent the compiler from optimizing away work?

Explain it simply: [...]
Design a microbenchmark: [...]
Interpret results: [...]

Q11: What is the difference between microbenchmark and macrobenchmark?

Explain it simply: [...]
Design a microbenchmark: [...]
Interpret results: [...]

Q12: Why is isolating a single variable important in experiments?

Explain it simply: [...]
Design a microbenchmark: [...]
Interpret results: [...]

Q13: What does "constant folding" mean?

Explain it simply: [...]
Design a microbenchmark: [...]
Interpret results: [...]

Q14: Why can I/O distort CPU benchmarks?

Explain it simply: [...]
Design a microbenchmark: [...]
Interpret results: [...]

Q15: Why must you inspect assembly for serious performance work?

Explain it simply: [...]
Design a microbenchmark: [...]
Interpret results: [...]

II. Execution Core & ILP (16–30)

Q16: What is the difference between latency and throughput?

Explain it simply: [...]
Design a microbenchmark: [...]
Interpret results: [...]

Q17: What is instruction-level parallelism (ILP)?

Explain it simply: [...]
Design a microbenchmark: [...]
Interpret results: [...]

Q18: What is a superscalar processor?

Explain it simply: [...]
Design a microbenchmark: [...]
Interpret results: [...]

Q19: What does "out-of-order execution" mean?

Explain it simply: [...]
Design a microbenchmark: [...]
Interpret results: [...]

Q20: What is a reorder buffer?

Explain it simply: [...]
Design a microbenchmark: [...]
Interpret results: [...]

Q21: What is a dependency chain?

Explain it simply: [...]
Design a microbenchmark: [...]
Interpret results: [...]

Q22: Why does a long dependency chain reduce throughput?

Explain it simply: [...]
Design a microbenchmark: [...]
Interpret results: [...]

Q23: Why can loop unrolling improve performance?

Explain it simply: [...]
Design a microbenchmark: [...]
Interpret results: [...]

Q24: What is instruction issue width?

Explain it simply: [...]
Design a microbenchmark: [...]
Interpret results: [...]

Q25: What limits IPC (instructions per cycle)?

Explain it simply: [...]
Design a microbenchmark: [...]
Interpret results: [...]

Q26: Why is division much slower than addition?

Explain it simply: [...]
Design a microbenchmark: [...]
Interpret results: [...]

Q27: What is a pipeline stall?

Explain it simply: [...]
Design a microbenchmark: [...]
Interpret results: [...]

Q28: What is register renaming?

Explain it simply: [...]
Design a microbenchmark: [...]
Interpret results: [...]

Q29: Why do independent instructions execute faster than dependent ones?

Explain it simply: [...]
Design a microbenchmark: [...]
Interpret results: [...]

Q30: What happens when the reorder buffer fills?

Explain it simply: [...]
Design a microbenchmark: [...]
Interpret results: [...]

III. Branch Prediction (31–40)

Q31: Why are branch mispredictions expensive?

Explain it simply: [...]
Design a microbenchmark: [...]
Interpret results: [...]

Q32: What happens in the pipeline during a mispredict?

Explain it simply: [...]
Design a microbenchmark: [...]
Interpret results: [...]

Q33: Why is sorted data often faster than unsorted?

Explain it simply: [...]
Design a microbenchmark: [...]
Interpret results: [...]

Q34: What makes a branch predictable?

Explain it simply: [...]
Design a microbenchmark: [...]
Interpret results: [...]

Q35: Why is random branching slow?

Explain it simply: [...]
Design a microbenchmark: [...]
Interpret results: [...]

Q36: What is speculative execution?

Explain it simply: [...]
Design a microbenchmark: [...]
Interpret results: [...]

Q37: Why can removing branches improve performance?

Explain it simply: [...]
Design a microbenchmark: [...]
Interpret results: [...]

Q38: What is branchless programming?

Explain it simply: [...]
Design a microbenchmark: [...]
Interpret results: [...]

Q39: When does branchless code hurt performance?

Explain it simply: [...]
Design a microbenchmark: [...]
Interpret results: [...]

Q40: Why does a predictable branch become "free"?

Explain it simply: [...]
Design a microbenchmark: [...]
Interpret results: [...]

IV. Memory Hierarchy (41–60)

Q41: What is a cache line?

Explain it simply: [...]
Design a microbenchmark: [...]
Interpret results: [...]

Q42: Why is spatial locality important?

Explain it simply: [...]
Design a microbenchmark: [...]
Interpret results: [...]

Q43: What is temporal locality?

Explain it simply: [...]
Design a microbenchmark: [...]
Interpret results: [...]

Q44: Why is sequential access fast?

Explain it simply: [...]
Design a microbenchmark: [...]
Interpret results: [...]

Q45: Why is random access slow?

Explain it simply: [...]
Design a microbenchmark: [...]
Interpret results: [...]

Q46: What causes cache thrashing?

Explain it simply: [...]
Design a microbenchmark: [...]
Interpret results: [...]

Q47: What is cache associativity?

Explain it simply: [...]
Design a microbenchmark: [...]
Interpret results: [...]

Q48: What is a conflict miss?

Explain it simply: [...]
Design a microbenchmark: [...]
Interpret results: [...]

Q49: What is a compulsory miss?

Explain it simply: [...]
Design a microbenchmark: [...]
Interpret results: [...]

Q50: What is a capacity miss?

Explain it simply: [...]
Design a microbenchmark: [...]
Interpret results: [...]

Q51: Why can a large stride destroy performance?

Explain it simply: [...]
Design a microbenchmark: [...]
Interpret results: [...]

Q52: How do you detect cache sizes experimentally?

Explain it simply: [...]
Design a microbenchmark: [...]
Interpret results: [...]

Q53: Why is pointer chasing slow?

Explain it simply: [...]
Design a microbenchmark: [...]
Interpret results: [...]

Q54: What is the difference between L1, L2, and L3?

Explain it simply: [...]
Design a microbenchmark: [...]
Interpret results: [...]

Q55: Why are large working sets slow?

Explain it simply: [...]
Design a microbenchmark: [...]
Interpret results: [...]

Q56: What is the hardware prefetcher?

Explain it simply: [...]
Design a microbenchmark: [...]
Interpret results: [...]

Q57: When does prefetching fail?

Explain it simply: [...]
Design a microbenchmark: [...]
Interpret results: [...]

Q58: Why does structure-of-arrays outperform array-of-structures in many cases?

Explain it simply: [...]
Design a microbenchmark: [...]
Interpret results: [...]

Q59: What is false sharing?

Explain it simply: [...]
Design a microbenchmark: [...]
Interpret results: [...]

Q60: Why does padding sometimes dramatically improve performance?

Explain it simply: [...]
Design a microbenchmark: [...]
Interpret results: [...]

V. Virtual Memory & TLB (61–68)

Q61: What is virtual memory?

Explain it simply: [...]
Design a microbenchmark: [...]
Interpret results: [...]

Q62: What is a page?

Explain it simply: [...]
Design a microbenchmark: [...]
Interpret results: [...]

Q63: What is the TLB?

Explain it simply: [...]
Design a microbenchmark: [...]
Interpret results: [...]

Q64: Why are TLB misses expensive?

Explain it simply: [...]
Design a microbenchmark: [...]
Interpret results: [...]

Q65: What happens during a page walk?

Explain it simply: [...]
Design a microbenchmark: [...]
Interpret results: [...]

Q66: Why does accessing one element per page hurt performance?

Explain it simply: [...]
Design a microbenchmark: [...]
Interpret results: [...]

Q67: What are huge pages?

Explain it simply: [...]
Design a microbenchmark: [...]
Interpret results: [...]

Q68: When do page faults occur?

Explain it simply: [...]
Design a microbenchmark: [...]
Interpret results: [...]

VI. Memory Bandwidth & Roofline (69–76)

Q69: What is the difference between memory latency and bandwidth?

Explain it simply: [...]
Design a microbenchmark: [...]
Interpret results: [...]

Q70: Why does adding threads not always increase bandwidth?

Explain it simply: [...]
Design a microbenchmark: [...]
Interpret results: [...]

Q71: What is memory saturation?

Explain it simply: [...]
Design a microbenchmark: [...]
Interpret results: [...]

Q72: What is operational intensity?

Explain it simply: [...]
Design a microbenchmark: [...]
Interpret results: [...]

Q73: What is the roofline model?

Explain it simply: [...]
Design a microbenchmark: [...]
Interpret results: [...]

Q74: What makes a kernel memory-bound?

Explain it simply: [...]
Design a microbenchmark: [...]
Interpret results: [...]

Q75: What makes a kernel compute-bound?

Explain it simply: [...]
Design a microbenchmark: [...]
Interpret results: [...]

Q76: Why can compute-bound kernels scale better than memory-bound ones?

Explain it simply: [...]
Design a microbenchmark: [...]
Interpret results: [...]

VII. SIMD & Vectorization (77–84)

Q77: What is SIMD?

Explain it simply: [...]
Design a microbenchmark: [...]
Interpret results: [...]

Q78: Why is vectorization powerful?

Explain it simply: [...]
Design a microbenchmark: [...]
Interpret results: [...]

Q79: Why doesn't vectorization always give linear speedup?

Explain it simply: [...]
Design a microbenchmark: [...]
Interpret results: [...]

Q80: What prevents auto-vectorization?

Explain it simply: [...]
Design a microbenchmark: [...]
Interpret results: [...]

Q81: Why is alignment important?

Explain it simply: [...]
Design a microbenchmark: [...]
Interpret results: [...]

Q82: What is the difference between scalar and vector loads?

Explain it simply: [...]
Design a microbenchmark: [...]
Interpret results: [...]

Q83: Why can memory become the bottleneck after vectorization?

Explain it simply: [...]
Design a microbenchmark: [...]
Interpret results: [...]

Q84: What is horizontal reduction and why is it tricky?

Explain it simply: [...]
Design a microbenchmark: [...]
Interpret results: [...]

VIII. Concurrency & Contention (85–92)

Q85: What is cache coherence?

Explain it simply: [...]
Design a microbenchmark: [...]
Interpret results: [...]

Q86: Why do shared writes scale poorly?

Explain it simply: [...]
Design a microbenchmark: [...]
Interpret results: [...]

Q87: What is a coherence protocol?

Explain it simply: [...]
Design a microbenchmark: [...]
Interpret results: [...]

Q88: What is false sharing in multithreaded code?

Explain it simply: [...]
Design a microbenchmark: [...]
Interpret results: [...]

Q89: Why do mutexes collapse under contention?

Explain it simply: [...]
Design a microbenchmark: [...]
Interpret results: [...]

Q90: What is lock-free programming?

Explain it simply: [...]
Design a microbenchmark: [...]
Interpret results: [...]

Q91: Why do atomic increments not scale well?

Explain it simply: [...]
Design a microbenchmark: [...]
Interpret results: [...]

Q92: What is tail latency and why does contention increase it?

Explain it simply: [...]
Design a microbenchmark: [...]
Interpret results: [...]

IX. Compilers & Code Generation (93–100)

Q93: Why does `-O3` change performance dramatically?

Explain it simply: [...]
Design a microbenchmark: [...]
Interpret results: [...]

Q94: What is inlining and when does it help?

Explain it simply: [...]
Design a microbenchmark: [...]
Interpret results: [...]

Q95: What is loop vectorization?

Explain it simply: [...]
Design a microbenchmark: [...]
Interpret results: [...]

Q96: What is loop interchange?

Explain it simply: [...]
Design a microbenchmark: [...]
Interpret results: [...]

Q97: What is strength reduction?

Explain it simply: [...]
Design a microbenchmark: [...]
Interpret results: [...]

Q98: Why can function calls hurt hot loops?

Explain it simply: [...]
Design a microbenchmark: [...]
Interpret results: [...]

Q99: What is instruction cache pressure?

Explain it simply: [...]
Design a microbenchmark: [...]
Interpret results: [...]

Q100: Why is reading assembly an essential performance skill?

Explain it simply: [...]
Design a microbenchmark: [...]
Interpret results: [...]