Backprop a 2-Layer MLP

Interview Prep

StandardML Engineeringmlautogradnumpy

The problem

Implement forward and backward pass for a 2-layer MLP with sigmoid hidden activation and softmax output, trained with cross-entropy loss. Use only NumPy — no autograd. Make it batched. Then verify your backward pass against a finite-difference numerical gradient.

This is the canonical ML engineer interview question — the autograd primitive every candidate should be able to derive from first principles. If you can do this, you can write any layer's backward by hand.

Pattern: the chain rule, applied layer by layer

Forward pass produces a chain of intermediate values $X \to z_{1} \to a_{1} \to z_{2} \to p \to L$ . Backward pass computes the gradient of $L$ with respect to every parameter by walking that chain in reverse, multiplying each local Jacobian. The softmax-followed-by-cross-entropy combination simplifies to $\partial L / \partial z_{2} = (p - y_{onehot}) / n$ , which is the single most important shortcut to memorize.

Forward pass

import numpy as np

def sigmoid(z):           return 1.0 / (1.0 + np.exp(-z))
def sigmoid_prime(a):     return a * (1.0 - a)              # given a = sigmoid(z)
def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def forward(X, W1, b1, W2, b2):
    """
    X:  (n, d_in)      batch of inputs
    W1: (d_in, d_h)    first-layer weights
    b1: (d_h,)         first-layer biases
    W2: (d_h, d_out)   second-layer weights
    b2: (d_out,)       second-layer biases
    """
    z1 = X @ W1 + b1               # (n, d_h)
    a1 = sigmoid(z1)               # (n, d_h)
    z2 = a1 @ W2 + b2              # (n, d_out)
    p  = softmax(z2)               # (n, d_out)
    return p, (X, a1, p)

def cross_entropy(p, y):
    """Average cross-entropy. y is integer labels of shape (n,)."""
    n = len(y)
    return -np.log(p[np.arange(n), y]).mean()

Backward pass

def backward(cache, y, W2):
    """Returns gradients dW1, db1, dW2, db2."""
    X, a1, p = cache
    n = len(y)

    # softmax + cross-entropy combined gradient (the classical simplification):
    #   dL/dz2 = (p - one_hot(y)) / n
    dz2 = p.copy()
    dz2[np.arange(n), y] -= 1.0
    dz2 /= n                                   # (n, d_out)

    dW2 = a1.T @ dz2                           # (d_h, d_out)
    db2 = dz2.sum(axis=0)                      # (d_out,)

    da1 = dz2 @ W2.T                           # (n, d_h)
    dz1 = da1 * sigmoid_prime(a1)              # (n, d_h)

    dW1 = X.T @ dz1                            # (d_in, d_h)
    db1 = dz1.sum(axis=0)                      # (d_h,)

    return dW1, db1, dW2, db2

Read the shapes carefully — the most common backprop bug is a transposed matmul. The rule of thumb: weights gradient = (input to that layer).T @ (gradient of that layer's output). For dW2: the input to layer 2 is a1, so dW2 = a1.T @ dz2. The bias gradient is just dz.sum(axis=0) because the bias is broadcast across the batch.

One training step

def train_step(X, y, W1, b1, W2, b2, lr=0.1):
    p, cache = forward(X, W1, b1, W2, b2)
    loss = cross_entropy(p, y)
    dW1, db1, dW2, db2 = backward(cache, y, W2)
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2
    return loss

Verifying with finite differences

def numerical_grad(f, x, eps=1e-5):
    """Two-sided finite-difference gradient. For testing only."""
    g = np.zeros_like(x)
    it = np.nditer(x, flags=['multi_index'], op_flags=['readwrite'])
    while not it.finished:
        i = it.multi_index
        orig = x[i].copy()
        x[i] = orig + eps; f_plus  = f()
        x[i] = orig - eps; f_minus = f()
        x[i] = orig
        g[i] = (f_plus - f_minus) / (2 * eps)
        it.iternext()
    return g

Complexity per step

Time: $O (n \cdot (d_{in} d_{h} + d_{h} d_{out}))$ — two matmuls dominate, forward and backward each cost the same.
Space: $O (n d_{h})$ for the cached activations needed by the backward pass.
Numerical-check time: $O (P)$ forward passes for $P$ parameters — too slow for production training, but essential for testing custom layers.

What the interviewer is testing

Do you know the softmax-CE shortcut? The simplified $d z_{2} = p - y$ form is the single most important identity. Candidates who don't know it spend 10 minutes re-deriving the Jacobian of softmax separately.
Can you handle batched shapes? The n dimension threading through every matmul. Bias broadcasting needs sum(axis=0) on the way back.
Did you cache the right intermediates? X, a1, p are the minimum — anything else is reconstructible.
Can you grad-check it? Asking "how would you validate this code is correct?" is the natural follow-up.

Variations worth knowing

ReLU instead of sigmoid: a = max(z, 0), dz = da * (z > 0). No vanishing-gradient issue at large $∣ z ∣$ , but dead neurons at $z < 0$ .
Layer normalization: the backward pass picks up extra terms from the mean and variance computations. Tedious to derive — a frequent follow-up question.
L₂ weight decay: add $λ \sum W_{ij}^{2}$ to the loss. Gradient just adds $2 λW$ to the matmul terms.
Multi-class to binary: swap softmax+CE for sigmoid+BCE. The simplified backward is identical in form — $d z = p - y$ — just a different range of values.