Attention From Scratch

Interview Prep

HardML Engineeringmltransformernumpy

The problem

Implement scaled dot-product attention and a single-head self-attention block in NumPy. Then add causal masking for autoregressive generation. This is the load-bearing primitive of every transformer; if you can't write it from memory in 15 minutes, expect to be asked to.

Pattern: "soft lookup" by content similarity

Attention is a soft, differentiable version of dictionary lookup. You have queries , keys , and values . For each query, compute a similarity score against every key (), softmax the scores into weights, then return the weighted sum of values. The output for query is:

The scaling keeps the softmax inputs from growing with dimension — for large , dot products have variance , and unscaled softmax would saturate to near-one-hot, killing gradients.

Scaled dot-product attention

import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V, mask=None):
    """
    Q: (..., n_q, d_k)   queries
    K: (..., n_k, d_k)   keys
    V: (..., n_k, d_v)   values
    mask: optional boolean array, True where the position should be masked out.

    Returns: (..., n_q, d_v) attended output.
    """
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-1, -2) / np.sqrt(d_k)      # (..., n_q, n_k)
    if mask is not None:
        scores = np.where(mask, -1e9, scores)
    weights = softmax(scores, axis=-1)                   # (..., n_q, n_k)
    return weights @ V, weights

Three shape conventions to get right: Q @ K.T is , the score matrix. Softmax along axis=-1 normalizes per query: each row sums to 1. Then weights @ V is .

Self-attention block

Self-attention is just attention where are three learned linear projections of the same input. Each token attends to every other token in the sequence — including itself — through these three different views.

def self_attention_block(X, W_q, W_k, W_v, W_o, mask=None):
    """
    Single-head self-attention.
    X:   (n, d_model)
    W_q: (d_model, d_k)
    W_k: (d_model, d_k)
    W_v: (d_model, d_v)
    W_o: (d_v, d_model)
    """
    Q = X @ W_q                                  # (n, d_k)
    K = X @ W_k                                  # (n, d_k)
    V = X @ W_v                                  # (n, d_v)

    attended, _ = scaled_dot_product_attention(Q, K, V, mask=mask)
    return attended @ W_o                        # (n, d_model)

The output projection mixes information across the different value channels back into the model dimension. With one attention head this is a single linear layer; with multi-head, it concatenates head outputs first.

Causal masking for autoregressive generation

During autoregressive generation, position must not attend to positions — otherwise the model could "cheat" by looking at future tokens during training. The mask sets future positions' scores to before the softmax, so their weights become zero.

def causal_mask(n):
    """Mask for causal (autoregressive) self-attention.
    Returns a boolean (n, n) array where position i can attend to positions 0..i.
    True = masked OUT (future positions)."""
    return np.triu(np.ones((n, n), dtype=bool), k=1)

Combine: scaled_dot_product_attention(Q, K, V, mask=causal_mask(n)) gives causal self-attention. This is what's inside every GPT decoder layer.

Trace

# Tiny worked example with n=3 tokens, d_k=2.
Q = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
K = Q.copy()                 # self-attention with simple Q=K
V = np.eye(3)                # one-hot values

# scores = Q @ K^T / sqrt(2)
#        = [[1, 0, 1],
#           [0, 1, 1],
#           [1, 1, 2]] / sqrt(2)

# After softmax along axis=-1, the attention weights are roughly:
#   row 0: high on positions 0 and 2 (similar Q), low on position 1
#   row 2: highest on position 2 (most similar to itself)

# Output = weights @ V picks out a soft combination of the three rows of V.

Complexity

What the interviewer is testing

Variations worth knowing