Dropout From Scratch


The problem

Implement dropout, the regularization layer that randomly zeroes a fraction p of activations during training. Implement both the forward and backward passes, and explain exactly when the rescaling factor 1/(1−p) is applied.

Pattern: stochastic noise + rescaling for unbiased expectation

Dropout has a simple goal: prevent co-adaptation by forcing every unit to function with a random subset of its peers. Mechanically, each training-time forward pass multiplies the activations elementwise by a mask of independent Bernoulli(1 − p) draws. Applied naively, this shrinks the expected value of every activation by a factor of (1 − p), so we have to compensate, either by scaling up at training time (inverted dropout) or by scaling down at inference time (old-style dropout).

The major modern frameworks, including PyTorch and TensorFlow/Keras, use inverted dropout: scale by 1/(1 − p) at training time and do nothing at inference. Inference is then identical to a plain forward pass, which simplifies deployment considerably.
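The expectation argument can be checked numerically. The sketch below is a small Monte Carlo simulation, not part of the layer itself; the drop probability, seed, trial count, and input values are arbitrary choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)   # seed chosen arbitrarily for reproducibility
p = 0.3                          # drop probability (illustrative value)
keep = 1.0 - p
x = np.array([1.0, 2.0, 3.0, 4.0])

# Draw many independent Bernoulli(keep) masks, one row per simulated forward pass.
n_trials = 200_000
masks = (rng.random((n_trials, x.size)) < keep).astype(x.dtype)

unscaled_mean = (masks * x).mean(axis=0)          # approaches keep * x
inverted_mean = (masks * x / keep).mean(axis=0)   # approaches x

print("no rescaling :", unscaled_mean)
print("inverted     :", inverted_mean)
```

Without rescaling the averages settle near 0.7 * x; dividing by keep brings them back to x, which is why an inverted-dropout layer can be the identity at inference.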

Inverted dropout (modern)

import numpy as np

class Dropout:
    """Inverted dropout: scaling happens at TRAIN time so inference is identity."""
    def __init__(self, p: float, seed: int = 0):
        assert 0.0 <= p < 1.0
        self.p = p                       # probability of DROPPING a unit
        self.rng = np.random.default_rng(seed)
        self.mask = None

    def forward(self, x: np.ndarray, train: bool) -> np.ndarray:
        # Assumes x has a floating-point dtype.
        if not train or self.p == 0.0:
            self.mask = None             # eval / p == 0: no-op, clear any stale mask
            return x
        keep = 1.0 - self.p
        # Bernoulli(keep) mask, scaled by 1/keep so E[mask * x] = x
        self.mask = (self.rng.random(x.shape) < keep).astype(x.dtype) / keep
        return x * self.mask

    def backward(self, dout: np.ndarray) -> np.ndarray:
        if self.mask is None:
            return dout                  # forward was the identity, so backward is too
        # The same scaled mask multiplies the upstream gradient
        return dout * self.mask

For contrast: old-style

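# "Old-style" dropout: scale at inference instead of training.
# Both conventions make the expected training-time activation match the
# inference-time activation; they differ only by a constant factor of keep.
# The MODERN convention is inverted dropout.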

def forward_oldstyle(x, p, rng, train):
    keep = 1.0 - p
    if train:
        mask = (rng.random(x.shape) < keep).astype(x.dtype)
        return x * mask, mask                    # NO scaling here
    else:
        return x * keep, None                    # scaling at inference instead
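To make the relationship between the two conventions concrete, the following sketch applies both to the same input and the same mask. The helper names `inverted` and `oldstyle` are hypothetical and exist only for this comparison; the seed and keep probability are arbitrary.

```python
import numpy as np

def inverted(x, mask, keep, train):
    # Hypothetical helper: scale at train time, identity at inference.
    return x * mask / keep if train else x

def oldstyle(x, mask, keep, train):
    # Hypothetical helper: raw mask at train time, scale by keep at inference.
    return x * mask if train else x * keep

rng = np.random.default_rng(1)   # arbitrary seed
keep = 0.8
x = rng.normal(size=(2, 5))
mask = (rng.random(x.shape) < keep).astype(x.dtype)   # unscaled 0/1 mask

# In both modes the inverted output is the old-style output divided by keep.
train_match = np.allclose(inverted(x, mask, keep, True),
                          oldstyle(x, mask, keep, True) / keep)
eval_match = np.allclose(inverted(x, mask, keep, False),
                         oldstyle(x, mask, keep, False) / keep)
print(train_match, eval_match)
```

Because the factor is the same in training and inference, either convention keeps the two modes consistent with each other; inverted dropout simply puts the cost on the training side.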

Trace

x = [[1.0, 2.0, 3.0, 4.0]],  p = 0.5

keep = 0.5
Bernoulli(0.5) draw, say [1, 0, 1, 1]   (3 out of 4 kept)
mask (after divide by keep) = [2.0, 0.0, 2.0, 2.0]

out = x * mask = [[2.0, 0.0, 6.0, 8.0]]

E[out] entrywise:
  E[out_i] = E[mask_i] * x_i
           = (keep * (1/keep) + (1-keep)*0) * x_i
           = x_i        ✓   expected output equals input

Backward (dout=[[1,1,1,1]]):
  dx = dout * mask = [[2, 0, 2, 2]]
  Same scaled mask passes through.

The backward pass

Dropout is a pointwise multiplication by a mask that is constant once sampled for a given forward pass. The chain rule is therefore trivial: multiply the upstream gradient by the same rescaled mask. The backward pass must not draw new randomness. Resampling would be a bug, because the resulting gradient would belong to a different sub-network than the one whose output produced the loss.
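A finite-difference check is one way to confirm this. The sketch below freezes a single mask, compares the analytic gradient `dout * mask` against central differences, and then shows what happens with the resampling bug described above. The shapes, seed, and the stand-in upstream gradient `dout` are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)           # arbitrary seed
keep = 0.5
x = rng.normal(size=(3, 4))
dout = rng.normal(size=x.shape)          # stand-in upstream gradient

# The mask is sampled once and then frozen for this forward/backward pair.
mask = (rng.random(x.shape) < keep).astype(x.dtype) / keep

def loss(x_in):
    # Scalar whose gradient w.r.t. the dropout output is exactly dout.
    return np.sum(x_in * mask * dout)

analytic = dout * mask                   # what backward() should return

# Central finite differences, one coordinate at a time.
eps = 1e-6
numeric = np.zeros_like(x)
for idx in np.ndindex(x.shape):
    step = np.zeros_like(x)
    step[idx] = eps
    numeric[idx] = (loss(x + step) - loss(x - step)) / (2 * eps)

# The bug: drawing a fresh mask in the backward pass.
fresh_mask = (rng.random(x.shape) < keep).astype(x.dtype) / keep
buggy = dout * fresh_mask

print("cached mask matches numeric gradient:", np.allclose(analytic, numeric, atol=1e-6))
print("fresh mask matches numeric gradient: ", np.allclose(buggy, numeric, atol=1e-6))
```

The cached-mask gradient agrees with the numerical one. The resampled version agrees only in the unlikely event that the fresh mask happens to equal the original.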

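Complexity

For an input with n elements, the forward and backward passes each take O(n) time, and caching the mask costs O(n) extra memory during training. At inference the layer is the identity and adds no cost.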

Why dropout reduces overfitting

The original framing by Hinton and colleagues: training with dropout approximately trains an ensemble of the 2^N sub-networks obtainable by deleting subsets of the N units, all sharing one set of weights. Each forward pass samples one sub-network; inference with the full network approximates averaging their predictions, and the rescaling ensures that expected activation magnitudes match between the two. This is why dropout regularizes like an averaged ensemble without the cost of training many separate models.
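For a single linear output unit the ensemble picture can be verified exactly by brute force. The sketch below enumerates all 2^N masks for a tiny layer with p = 0.5, where every mask is equally likely; the activations `h`, weights `w`, and layer size are made-up values. With nonlinearities after the dropped layer the equality becomes an approximation, so treat this as the idealized case only.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)   # arbitrary seed
n_units = 4
h = rng.normal(size=n_units)     # hidden activations (made-up values)
w = rng.normal(size=n_units)     # weights of a single linear output unit
keep = 0.5                       # with p = 0.5 all 2^N masks are equally likely

# Enumerate every sub-network: each binary mask deletes a subset of units.
outputs = []
for bits in itertools.product([0.0, 1.0], repeat=n_units):
    mask = np.array(bits) / keep          # inverted-dropout scaling
    outputs.append(w @ (h * mask))

ensemble_mean = np.mean(outputs)   # equal-weight average over all 16 sub-networks
full_network = w @ h               # single inference pass with no mask

print(ensemble_mean, full_network)
```

Each unit is kept in exactly half of the masks, so the scaled mask averages to 1 per unit and the two printed values agree up to floating-point rounding.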

Variations worth knowing