“Know how to solve every problem that has been solved.” “What I cannot create, I do not understand.” — Richard Feynman

Attention From Scratch

Interview Prep

HardML Engineeringmltransformernumpy

The problem

Implement scaled dot-product attention and a single-head self-attention block in NumPy. Then add causal masking for autoregressive generation. Every transformer is built around this primitive; if you can't write it from memory in 15 minutes, expect to be asked to.

Pattern: "soft lookup" by content similarity

Attention is a soft, differentiable version of dictionary lookup. You have queries $Q$ , keys $K$ , and values $V$ . For each query, compute a similarity score against every key ( $score_{ij} = Q_{i} \cdot K_{j}$ ), softmax the scores into weights, then return the weighted sum of values. The output for query $i$ is:

Attention (Q, K, V)_{i} = j \sum softmax_{j} (Q_{i} K_{j}^{⊤} / d_{k}) V_{j}

The $d_{k}$ scaling keeps the softmax inputs from growing with dimension — for large $d_{k}$ , dot products have variance $\propto d_{k}$ , and unscaled softmax would saturate to near-one-hot, killing gradients.

Scaled dot-product attention

import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V, mask=None):
    """
    Q: (..., n_q, d_k)   queries
    K: (..., n_k, d_k)   keys
    V: (..., n_k, d_v)   values
    mask: optional boolean array, True where the position should be masked out.

    Returns: (..., n_q, d_v) attended output.
    """
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-1, -2) / np.sqrt(d_k)      # (..., n_q, n_k)
    if mask is not None:
        scores = np.where(mask, -1e9, scores)
    weights = softmax(scores, axis=-1)                   # (..., n_q, n_k)
    return weights @ V, weights

Three shape conventions to get right: Q @ K.T is $(n_{q}, d_{k}) \times (d_{k}, n_{k}) = (n_{q}, n_{k})$ , the score matrix. Softmax along axis=-1 normalizes per query: each row sums to 1. Then weights @ V is $(n_{q}, n_{k}) \times (n_{k}, d_{v}) = (n_{q}, d_{v})$ .

Self-attention block

Self-attention is just attention where $Q, K, V$ are three learned linear projections of the same input. Each token attends to every other token in the sequence — including itself — through these three different views.

def self_attention_block(X, W_q, W_k, W_v, W_o, mask=None):
    """
    Single-head self-attention.
    X:   (n, d_model)
    W_q: (d_model, d_k)
    W_k: (d_model, d_k)
    W_v: (d_model, d_v)
    W_o: (d_v, d_model)
    """
    Q = X @ W_q                                  # (n, d_k)
    K = X @ W_k                                  # (n, d_k)
    V = X @ W_v                                  # (n, d_v)

    attended, _ = scaled_dot_product_attention(Q, K, V, mask=mask)
    return attended @ W_o                        # (n, d_model)

The output projection $W_{o}$ mixes information across the different value channels back into the model dimension. With one attention head this is a single linear layer; with multi-head, it concatenates head outputs first.

Causal masking for autoregressive generation

During autoregressive generation, position $i$ must not attend to positions $> i$ — otherwise the model could "cheat" by looking at future tokens during training. The mask sets future positions' scores to $- \infty$ before the softmax, so their weights become zero.

def causal_mask(n):
    """Mask for causal (autoregressive) self-attention.
    Returns a boolean (n, n) array where position i can attend to positions 0..i.
    True = masked OUT (future positions)."""
    return np.triu(np.ones((n, n), dtype=bool), k=1)

Combine: scaled_dot_product_attention(Q, K, V, mask=causal_mask(n)) gives causal self-attention. This is what's inside every GPT decoder layer.

Trace

# Tiny worked example with n=3 tokens, d_k=2.
Q = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
K = Q.copy()                 # self-attention with simple Q=K
V = np.eye(3)                # one-hot values

# scores = Q @ K^T / sqrt(2)
#        = [[1, 0, 1],
#           [0, 1, 1],
#           [1, 1, 2]] / sqrt(2)

# After softmax along axis=-1, the attention weights are roughly:
#   row 0: high on positions 0 and 2 (similar Q), low on position 1
#   row 2: highest on position 2 (most similar to itself)

# Output = weights @ V picks out a soft combination of the three rows of V.

Complexity

Time: $O (n^{2} \cdot d_{k} + n^{2} \cdot d_{v})$ — the quadratic $n^{2}$ term is the score matrix and the output matmul. This is the famous "quadratic in sequence length" cost that motivates flash attention, linear attention, Performer, and friends.
Space: $O (n^{2})$ for the score matrix. For long contexts, this dominates: a 32K context needs ~4 GB of attention scores per head in float32.

What the interviewer is testing

Do you remember the formula? $softmax (Q K^{⊤} / d_{k}) V$ verbatim, including the scaling. Forgetting the $d_{k}$ is a "did you actually understand it" red flag.
Can you get the shapes right? Q @ K.swapaxes(-1, -2) is the safe way to transpose only the last two axes when batched. Q.T works only for 2D arrays.
Can you mask correctly? Adding $- \infty$ (or -1e9) before softmax, not multiplying weights by zero after — the latter doesn't renormalize.
Do you know why $d_{k}$ ? The variance-control argument. The interviewer often asks for it.

Variations worth knowing

Multi-head attention: split $d_{model}$ into $h$ heads of size $d_{k} = d_{model} / h$ , apply attention independently per head, concatenate, project. Same total parameters as single-head but with parallel subspaces.
Cross-attention: queries come from one sequence, keys/values from another. Used in the encoder-decoder attention of the original "Attention Is All You Need" transformer.
Flash attention: rewrite the computation to tile Q, K, V into chunks that fit in SRAM, fuse softmax with the matmuls, never materialize the full $n \times n$ matrix. Same algorithm, ~10× faster wall-clock on modern GPUs.
Linear attention: replace softmax with a kernel that factorizes into $ϕ (Q) (ϕ (K)^{⊤} V)$ , dropping the cost to $O (n d^{2})$ . Trades expressivity for linear scaling — useful for very long contexts.
Rotary positional embeddings (RoPE): inject position into Q and K via rotation matrices, not additive embeddings. Standard in LLaMA-style models.