Attention From Scratch
Interview Prep
The problem
Implement scaled dot-product attention and a single-head self-attention block in NumPy. Then add causal masking for autoregressive generation. This is the load-bearing primitive of every transformer; if you can't write it from memory in 15 minutes, expect to be asked to.
Pattern: "soft lookup" by content similarity
Attention is a soft, differentiable version of dictionary lookup. You have queries , keys , and values . For each query, compute a similarity score against every key (), softmax the scores into weights, then return the weighted sum of values. The output for query is:
The scaling keeps the softmax inputs from growing with dimension — for large , dot products have variance , and unscaled softmax would saturate to near-one-hot, killing gradients.
Scaled dot-product attention
import numpy as np
def softmax(x, axis=-1):
x = x - x.max(axis=axis, keepdims=True)
e = np.exp(x)
return e / e.sum(axis=axis, keepdims=True)
def scaled_dot_product_attention(Q, K, V, mask=None):
"""
Q: (..., n_q, d_k) queries
K: (..., n_k, d_k) keys
V: (..., n_k, d_v) values
mask: optional boolean array, True where the position should be masked out.
Returns: (..., n_q, d_v) attended output.
"""
d_k = Q.shape[-1]
scores = Q @ K.swapaxes(-1, -2) / np.sqrt(d_k) # (..., n_q, n_k)
if mask is not None:
scores = np.where(mask, -1e9, scores)
weights = softmax(scores, axis=-1) # (..., n_q, n_k)
return weights @ V, weights
Three shape conventions to get right: Q @ K.T is
, the score
matrix. Softmax along axis=-1 normalizes per
query: each row sums to 1. Then weights @ V is
.
Self-attention block
Self-attention is just attention where are three learned linear projections of the same input. Each token attends to every other token in the sequence — including itself — through these three different views.
def self_attention_block(X, W_q, W_k, W_v, W_o, mask=None):
"""
Single-head self-attention.
X: (n, d_model)
W_q: (d_model, d_k)
W_k: (d_model, d_k)
W_v: (d_model, d_v)
W_o: (d_v, d_model)
"""
Q = X @ W_q # (n, d_k)
K = X @ W_k # (n, d_k)
V = X @ W_v # (n, d_v)
attended, _ = scaled_dot_product_attention(Q, K, V, mask=mask)
return attended @ W_o # (n, d_model) The output projection mixes information across the different value channels back into the model dimension. With one attention head this is a single linear layer; with multi-head, it concatenates head outputs first.
Causal masking for autoregressive generation
During autoregressive generation, position must not attend to positions — otherwise the model could "cheat" by looking at future tokens during training. The mask sets future positions' scores to before the softmax, so their weights become zero.
def causal_mask(n):
"""Mask for causal (autoregressive) self-attention.
Returns a boolean (n, n) array where position i can attend to positions 0..i.
True = masked OUT (future positions)."""
return np.triu(np.ones((n, n), dtype=bool), k=1)
Combine: scaled_dot_product_attention(Q, K, V, mask=causal_mask(n))
gives causal self-attention. This is what's inside every GPT
decoder layer.
Trace
# Tiny worked example with n=3 tokens, d_k=2.
Q = np.array([[1.0, 0.0],
[0.0, 1.0],
[1.0, 1.0]])
K = Q.copy() # self-attention with simple Q=K
V = np.eye(3) # one-hot values
# scores = Q @ K^T / sqrt(2)
# = [[1, 0, 1],
# [0, 1, 1],
# [1, 1, 2]] / sqrt(2)
# After softmax along axis=-1, the attention weights are roughly:
# row 0: high on positions 0 and 2 (similar Q), low on position 1
# row 2: highest on position 2 (most similar to itself)
# Output = weights @ V picks out a soft combination of the three rows of V. Complexity
- Time: — the quadratic term is the score matrix and the output matmul. This is the famous "quadratic in sequence length" cost that motivates flash attention, linear attention, Performer, and friends.
- Space: for the score matrix. For long contexts, this dominates: a 32K context needs ~4 GB of attention scores per head in float32.
What the interviewer is testing
- Do you remember the formula? verbatim, including the scaling. Forgetting the is a "did you actually understand it" red flag.
- Can you get the shapes right?
Q @ K.swapaxes(-1, -2)is the safe way to transpose only the last two axes when batched.Q.Tworks only for 2D arrays. - Can you mask correctly? Adding
(or
-1e9) before softmax, not multiplying weights by zero after — the latter doesn't renormalize. - Do you know why ? The variance-control argument. The interviewer often asks for it.
Variations worth knowing
- Multi-head attention: split into heads of size , apply attention independently per head, concatenate, project. Same total parameters as single-head but with parallel subspaces.
- Cross-attention: queries come from one sequence, keys/values from another. Used in the encoder-decoder attention of the original "Attention Is All You Need" transformer.
- Flash attention: rewrite the computation to tile Q, K, V into chunks that fit in SRAM, fuse softmax with the matmuls, never materialize the full matrix. Same algorithm, ~10× faster wall-clock on modern GPUs.
- Linear attention: replace softmax with a kernel that factorizes into , dropping the cost to . Trades expressivity for linear scaling — useful for very long contexts.
- Rotary positional embeddings (RoPE): inject position into Q and K via rotation matrices, not additive embeddings. Standard in LLaMA-style models.