Backprop a 2-Layer MLP

Interview Prep

StandardML Engineeringmlautogradnumpy

The problem

Implement forward and backward pass for a 2-layer MLP with sigmoid hidden activation and softmax output, trained with cross-entropy loss. Use only NumPy — no autograd. Make it batched. Then verify your backward pass against a finite-difference numerical gradient.

This is the canonical ML engineer interview question — the autograd primitive every candidate should be able to derive from first principles. If you can do this, you can write any layer's backward by hand.

Pattern: the chain rule, applied layer by layer

Forward pass produces a chain of intermediate values . Backward pass computes the gradient of with respect to every parameter by walking that chain in reverse, multiplying each local Jacobian. The softmax-followed-by-cross-entropy combination simplifies to , which is the single most important shortcut to memorize.

Forward pass

import numpy as np

def sigmoid(z):           return 1.0 / (1.0 + np.exp(-z))
def sigmoid_prime(a):     return a * (1.0 - a)              # given a = sigmoid(z)
def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def forward(X, W1, b1, W2, b2):
    """
    X:  (n, d_in)      batch of inputs
    W1: (d_in, d_h)    first-layer weights
    b1: (d_h,)         first-layer biases
    W2: (d_h, d_out)   second-layer weights
    b2: (d_out,)       second-layer biases
    """
    z1 = X @ W1 + b1               # (n, d_h)
    a1 = sigmoid(z1)               # (n, d_h)
    z2 = a1 @ W2 + b2              # (n, d_out)
    p  = softmax(z2)               # (n, d_out)
    return p, (X, a1, p)
def cross_entropy(p, y):
    """Average cross-entropy. y is integer labels of shape (n,)."""
    n = len(y)
    return -np.log(p[np.arange(n), y]).mean()

Backward pass

def backward(cache, y, W2):
    """Returns gradients dW1, db1, dW2, db2."""
    X, a1, p = cache
    n = len(y)

    # softmax + cross-entropy combined gradient (the classical simplification):
    #   dL/dz2 = (p - one_hot(y)) / n
    dz2 = p.copy()
    dz2[np.arange(n), y] -= 1.0
    dz2 /= n                                   # (n, d_out)

    dW2 = a1.T @ dz2                           # (d_h, d_out)
    db2 = dz2.sum(axis=0)                      # (d_out,)

    da1 = dz2 @ W2.T                           # (n, d_h)
    dz1 = da1 * sigmoid_prime(a1)              # (n, d_h)

    dW1 = X.T @ dz1                            # (d_in, d_h)
    db1 = dz1.sum(axis=0)                      # (d_h,)

    return dW1, db1, dW2, db2

Read the shapes carefully — the most common backprop bug is a transposed matmul. The rule of thumb: weights gradient = (input to that layer).T @ (gradient of that layer's output). For dW2: the input to layer 2 is a1, so dW2 = a1.T @ dz2. The bias gradient is just dz.sum(axis=0) because the bias is broadcast across the batch.

One training step

def train_step(X, y, W1, b1, W2, b2, lr=0.1):
    p, cache = forward(X, W1, b1, W2, b2)
    loss = cross_entropy(p, y)
    dW1, db1, dW2, db2 = backward(cache, y, W2)
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2
    return loss

Verifying with finite differences

Always test your analytical gradient against finite differences on a small example. If |analytical - numerical| / (|analytical| + |numerical|) < 1e-5 for every parameter, you're good. Discrepancy > 1e-3 means a real bug. Don't ship a custom backward pass without this check.

def numerical_grad(f, x, eps=1e-5):
    """Two-sided finite-difference gradient. For testing only."""
    g = np.zeros_like(x)
    it = np.nditer(x, flags=['multi_index'], op_flags=['readwrite'])
    while not it.finished:
        i = it.multi_index
        orig = x[i].copy()
        x[i] = orig + eps; f_plus  = f()
        x[i] = orig - eps; f_minus = f()
        x[i] = orig
        g[i] = (f_plus - f_minus) / (2 * eps)
        it.iternext()
    return g

Complexity per step

What the interviewer is testing

Variations worth knowing