Backprop a 2-Layer MLP
Interview Prep
The problem
Implement forward and backward pass for a 2-layer MLP with sigmoid hidden activation and softmax output, trained with cross-entropy loss. Use only NumPy — no autograd. Make it batched. Then verify your backward pass against a finite-difference numerical gradient.
This is the canonical ML engineer interview question — the autograd primitive every candidate should be able to derive from first principles. If you can do this, you can write any layer's backward by hand.
Pattern: the chain rule, applied layer by layer
Forward pass produces a chain of intermediate values . Backward pass computes the gradient of with respect to every parameter by walking that chain in reverse, multiplying each local Jacobian. The softmax-followed-by-cross-entropy combination simplifies to , which is the single most important shortcut to memorize.
Forward pass
import numpy as np
def sigmoid(z): return 1.0 / (1.0 + np.exp(-z))
def sigmoid_prime(a): return a * (1.0 - a) # given a = sigmoid(z)
def softmax(z):
z = z - z.max(axis=-1, keepdims=True)
e = np.exp(z)
return e / e.sum(axis=-1, keepdims=True)
def forward(X, W1, b1, W2, b2):
"""
X: (n, d_in) batch of inputs
W1: (d_in, d_h) first-layer weights
b1: (d_h,) first-layer biases
W2: (d_h, d_out) second-layer weights
b2: (d_out,) second-layer biases
"""
z1 = X @ W1 + b1 # (n, d_h)
a1 = sigmoid(z1) # (n, d_h)
z2 = a1 @ W2 + b2 # (n, d_out)
p = softmax(z2) # (n, d_out)
return p, (X, a1, p) def cross_entropy(p, y):
"""Average cross-entropy. y is integer labels of shape (n,)."""
n = len(y)
return -np.log(p[np.arange(n), y]).mean() Backward pass
def backward(cache, y, W2):
"""Returns gradients dW1, db1, dW2, db2."""
X, a1, p = cache
n = len(y)
# softmax + cross-entropy combined gradient (the classical simplification):
# dL/dz2 = (p - one_hot(y)) / n
dz2 = p.copy()
dz2[np.arange(n), y] -= 1.0
dz2 /= n # (n, d_out)
dW2 = a1.T @ dz2 # (d_h, d_out)
db2 = dz2.sum(axis=0) # (d_out,)
da1 = dz2 @ W2.T # (n, d_h)
dz1 = da1 * sigmoid_prime(a1) # (n, d_h)
dW1 = X.T @ dz1 # (d_in, d_h)
db1 = dz1.sum(axis=0) # (d_h,)
return dW1, db1, dW2, db2
Read the shapes carefully — the most common backprop bug is a
transposed matmul. The rule of thumb: weights gradient =
(input to that layer).T @ (gradient of that layer's output).
For dW2: the input to layer 2 is a1, so
dW2 = a1.T @ dz2. The bias gradient is just
dz.sum(axis=0) because the bias is broadcast across
the batch.
One training step
def train_step(X, y, W1, b1, W2, b2, lr=0.1):
p, cache = forward(X, W1, b1, W2, b2)
loss = cross_entropy(p, y)
dW1, db1, dW2, db2 = backward(cache, y, W2)
W1 -= lr * dW1; b1 -= lr * db1
W2 -= lr * dW2; b2 -= lr * db2
return loss Verifying with finite differences
Always test your analytical gradient against finite differences on
a small example. If |analytical - numerical| / (|analytical|
+ |numerical|) < 1e-5 for every parameter, you're good.
Discrepancy > 1e-3 means a real bug. Don't ship a
custom backward pass without this check.
def numerical_grad(f, x, eps=1e-5):
"""Two-sided finite-difference gradient. For testing only."""
g = np.zeros_like(x)
it = np.nditer(x, flags=['multi_index'], op_flags=['readwrite'])
while not it.finished:
i = it.multi_index
orig = x[i].copy()
x[i] = orig + eps; f_plus = f()
x[i] = orig - eps; f_minus = f()
x[i] = orig
g[i] = (f_plus - f_minus) / (2 * eps)
it.iternext()
return g Complexity per step
- Time: — two matmuls dominate, forward and backward each cost the same.
- Space: for the cached activations needed by the backward pass.
- Numerical-check time: forward passes for parameters — too slow for production training, but essential for testing custom layers.
What the interviewer is testing
- Do you know the softmax-CE shortcut? The simplified form is the single most important identity. Candidates who don't know it spend 10 minutes re-deriving the Jacobian of softmax separately.
- Can you handle batched shapes? The
ndimension threading through every matmul. Bias broadcasting needssum(axis=0)on the way back. - Did you cache the right intermediates?
X, a1, pare the minimum — anything else is reconstructible. - Can you grad-check it? Asking "how would you validate this code is correct?" is the natural follow-up.
Variations worth knowing
- ReLU instead of sigmoid:
a = max(z, 0),dz = da * (z > 0). No vanishing-gradient issue at large , but dead neurons at . - Layer normalization: the backward pass picks up extra terms from the mean and variance computations. Tedious to derive — a frequent follow-up question.
- L2 weight decay: add to the loss. Gradient just adds to the matmul terms.
- Multi-class to binary: swap softmax+CE for sigmoid+BCE. The simplified backward is identical in form — — just a different range of values.