Dropout From Scratch

Interview Prep


The problem

Implement dropout — the regularization layer that, during training, randomly zeroes out a fraction p of activations. Implement both the forward and backward passes, and explain correctly when the rescaling factor 1/(1−p) is applied.

Pattern: stochastic noise + rescaling for unbiased expectation

Dropout has a simple goal: prevent co-adaptation by forcing every unit to function with a random subset of its peers. Mechanically, each forward pass at training time multiplies the activations elementwise by a mask of independent Bernoulli(1 − p) draws, so each unit is kept with probability 1 − p. Left uncorrected, this would shrink the expected value of every activation by a factor of (1 − p), so we have to compensate, either by scaling at training time (inverted dropout) or at inference time (old-style dropout).

Mainstream frameworks such as PyTorch and TensorFlow/Keras use inverted dropout: scale by 1/(1 − p) at train time, do nothing at inference. This keeps inference identical to a plain forward pass, which simplifies deployment considerably.
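The expectation argument above can be checked numerically. The following is a minimal sketch, not a reference implementation: the values of x, p and n_trials are arbitrary choices for illustration. It averages many independently masked copies of x, with and without the 1/(1 − p) factor.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.array([1.0, 2.0, 3.0, 4.0])
p = 0.3                      # illustrative drop probability
keep = 1.0 - p
n_trials = 100_000           # illustrative sample count

# One Bernoulli(keep) mask per trial; True means the unit survives.
masks = rng.random((n_trials, x.size)) < keep

mean_unscaled = (x * masks).mean(axis=0)          # no compensation
mean_inverted = (x * masks / keep).mean(axis=0)   # inverted dropout

print(mean_unscaled / x)   # each entry close to keep = 0.7
print(mean_inverted / x)   # each entry close to 1.0
```

The unscaled average settles near (1 − p) · x, while the inverted-dropout average settles near x itself.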

Inverted dropout (modern)

import numpy as np

class Dropout:
    """Inverted dropout: scaling happens at TRAIN time so inference is identity."""
    def __init__(self, p: float, seed: int = 0):
        assert 0.0 <= p < 1.0
        self.p = p                       # probability of DROPPING a unit
        self.rng = np.random.default_rng(seed)
        self.mask = None

    def forward(self, x: np.ndarray, train: bool) -> np.ndarray:
        if not train or self.p == 0.0:
            self.mask = None             # eval / p == 0: identity, clear any stale mask
            return x
        keep = 1.0 - self.p
        # Bernoulli mask, scaled by 1/keep so E[mask * x] = x
        self.mask = (self.rng.random(x.shape) < keep).astype(x.dtype) / keep
        return x * self.mask

    def backward(self, dout: np.ndarray) -> np.ndarray:
        if self.mask is None:
            return dout                  # forward was the identity, so is backward
        # Same mask multiplies the upstream gradient
        return dout * self.mask

For contrast: old-style

# "Old-style" dropout: scale at inference instead of training.
# The two conventions agree in expectation; for a given mask their activations
# differ only by the constant factor keep. The MODERN convention is inverted dropout.

def forward_oldstyle(x, p, rng, train):
    keep = 1.0 - p
    if train:
        mask = (rng.random(x.shape) < keep).astype(x.dtype)
        return x * mask, mask                    # NO scaling here
    else:
        return x * keep, None                    # scaling at inference instead
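To make the relationship between the two conventions concrete, here is a small sketch that shares one Bernoulli draw between them. The input x and the value of p are arbitrary assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 5))
p = 0.4                      # illustrative drop probability
keep = 1.0 - p

# Share one Bernoulli draw between the two conventions.
bern = (rng.random(x.shape) < keep).astype(x.dtype)

train_inverted = x * bern / keep   # scale at train time
train_oldstyle = x * bern          # no scaling at train time
eval_inverted = x                  # identity at inference
eval_oldstyle = x * keep           # scale at inference

# Same mask: the conventions differ by the constant factor keep,
# in training and at inference alike.
assert np.allclose(train_oldstyle, keep * train_inverted)
assert np.allclose(eval_oldstyle, keep * eval_inverted)
```

Because the factor is the same in both modes, the train-to-inference ratio of expected activations is 1 under either convention. The practical difference is where the multiplication happens.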

Trace

x = [[1.0, 2.0, 3.0, 4.0]],  p = 0.5

keep = 0.5
Bernoulli(0.5) draw, say [1, 0, 1, 1]   (3 out of 4 kept)
mask (after divide by keep) = [2.0, 0.0, 2.0, 2.0]

out = x * mask = [[2.0, 0.0, 6.0, 8.0]]

E[out] entrywise:
  E[out_i] = E[mask_i] * x_i
           = (keep * (1/keep) + (1-keep)*0) * x_i
           = x_i        ✓   expected output equals input

Backward (dout=[[1,1,1,1]]):
  dx = dout * mask = [[2, 0, 2, 2]]
  Same scaled mask passes through.
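The trace can be reproduced in a few lines. This sketch fixes the Bernoulli draw by hand rather than sampling it, so the numbers match the ones above exactly.

```python
import numpy as np

x = np.array([[1.0, 2.0, 3.0, 4.0]])
p = 0.5
keep = 1.0 - p

# Fix the Bernoulli draw by hand so the numbers match the trace.
draw = np.array([[1.0, 0.0, 1.0, 1.0]])
mask = draw / keep            # [[2., 0., 2., 2.]]

out = x * mask                # forward
dout = np.ones_like(x)
dx = dout * mask              # backward reuses the same mask

print(out)   # [[2. 0. 6. 8.]]
print(dx)    # [[2. 0. 2. 2.]]
```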

The backward pass

Dropout is a pointwise multiplication by a mask that is fixed once it has been sampled in the forward pass. The chain rule is therefore trivial: multiply the upstream gradient by the same rescaled mask. Do not draw new randomness in the backward pass: a fresh mask would give the gradient of a different function from the one the forward pass computed, sending gradient to units that were dropped and possibly withholding it from units that were kept.
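One way to convince yourself of the backward rule is a finite-difference check with the mask held fixed. The sketch below assumes a hypothetical scalar loss L = sum(w * dropout(x)), where w stands in for whatever comes downstream. It is an illustration, not part of the layer itself.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(3, 4))
w = rng.normal(size=(3, 4))     # hypothetical downstream weights
p = 0.5
keep = 1.0 - p
mask = (rng.random(x.shape) < keep) / keep   # sampled once, then frozen

def loss(x_in):
    # Scalar loss L = sum(w * dropout(x)) with the mask held fixed.
    return np.sum(w * (x_in * mask))

# Analytic gradient: upstream gradient dL/dout = w, times the cached mask.
dx_analytic = w * mask

# Central finite differences, one coordinate at a time.
eps = 1e-6
dx_numeric = np.zeros_like(x)
for idx in np.ndindex(x.shape):
    step = np.zeros_like(x)
    step[idx] = eps
    dx_numeric[idx] = (loss(x + step) - loss(x - step)) / (2 * eps)

print(np.max(np.abs(dx_analytic - dx_numeric)))   # tiny, rounding error only
```

The check only passes because loss reuses the frozen mask. Resampling inside loss would make the function itself random, and the numeric gradient would be meaningless.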

Complexity

Time: O(N) forward and backward, where N is the number of activations.
Space: O(N) for the mask, which must be cached for the backward pass.
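The O(N) cache does not have to be a floating-point array. As a hedged aside (an optimization assumed here, not something the implementation above does), one can cache the boolean keep/drop pattern and apply the 1/keep factor on the fly. The array size is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(256, 1024))   # float64 activations, illustrative size
p = 0.5
keep = 1.0 - p

keep_mask = rng.random(x.shape) < keep     # bool: 1 byte per element
scaled_mask = keep_mask / keep             # float64: 8 bytes per element

out_from_bool = x * keep_mask / keep       # rescale on the fly
out_from_float = x * scaled_mask           # pre-scaled mask, as in the class above

print(keep_mask.nbytes, scaled_mask.nbytes)
```

Both variants are still O(N) in space. Only the constant factor changes.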

Why dropout reduces overfitting

Hinton's original framing: training with dropout approximately trains an ensemble of the 2^N sub-networks obtainable by deleting subsets of units, all of which share weights. Each forward pass samples one sub-network; inference runs the full network once, which approximates averaging the predictions of all of them (the rescaling makes the expected activations match). This is why dropout regularizes like an averaged ensemble without the cost of training many models.
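For a single linear unit the ensemble view is exact, and small enough to enumerate. The sketch below assumes p = 0.5 (so every sub-network is equally likely) and a hypothetical weight vector w. With nonlinearities in between, the single full-network pass is only an approximation to the ensemble mean.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
n_units = 4
x = rng.normal(size=n_units)    # activations feeding one linear unit
w = rng.normal(size=n_units)    # hypothetical weights of that unit
p = 0.5
keep = 1.0 - p

# With p = 0.5 every one of the 2^N keep/drop patterns is equally likely,
# so the ensemble mean is a plain average over all of them.
outputs = []
for bits in itertools.product([0.0, 1.0], repeat=n_units):
    mask = np.array(bits) / keep
    outputs.append(w @ (x * mask))

ensemble_mean = np.mean(outputs)
full_network = w @ x            # inference: no mask, no scaling

print(len(outputs), ensemble_mean, full_network)
```

Each unit appears in exactly half of the 16 sub-networks with weight 1/keep = 2, so the average recovers the unmasked output.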

Variations worth knowing

Spatial dropout (dropout2d): for CNNs. Instead of dropping individual pixels, drop entire feature maps (channels). Neighboring pixels in a feature map are strongly correlated, so dropping single pixels removes little information and regularizes only weakly.
DropConnect: drop weights (not activations). A close cousin; rarely used in modern practice.
Variational / locked dropout (RNNs): sample one mask per sequence and reuse it at every timestep. Resampling the mask at each step injects fresh noise into the recurrent state every timestep, which tends to hurt long-range gradients.
Inference-time dropout (MC dropout): keep dropout active at inference and average many forward passes. The spread of the predictions gives a cheap, approximate Bayesian uncertainty estimate, a popular tool in uncertainty quantification.
Stochastic depth: drop entire residual blocks with some probability. Used as a regularizer in very deep ResNets and in modern transformer training.
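Several of these variants differ from plain dropout only in the shape of the mask. The sketch below uses a hypothetical helper, dropout_shared, whose mask broadcasts over the input. The tensor layouts (batch, channels, height, width) and (time, batch, hidden) are assumptions chosen for illustration, as is the tiny linear model in the MC dropout part.

```python
import numpy as np

def dropout_shared(x, p, rng, mask_shape):
    """Inverted dropout whose mask has shape mask_shape and broadcasts over x."""
    keep = 1.0 - p
    mask = (rng.random(mask_shape) < keep) / keep
    return x * mask

rng = np.random.default_rng(0)

# Spatial dropout: one draw per (sample, channel), shared across height and width.
feat = np.ones((2, 3, 4, 4))                    # (batch, channels, height, width)
spatial = dropout_shared(feat, 0.5, rng, mask_shape=(2, 3, 1, 1))

# Locked dropout: one draw per (sample, hidden unit), reused at every timestep.
seq = np.ones((5, 2, 6))                        # (time, batch, hidden)
locked = dropout_shared(seq, 0.5, rng, mask_shape=(1, 2, 6))

# MC dropout: keep dropout on at inference and summarize repeated passes.
w = np.array([0.5, -1.0, 2.0])                  # hypothetical linear model
inp = np.array([1.0, 2.0, 3.0])
samples = np.array([w @ dropout_shared(inp, 0.2, rng, inp.shape)
                    for _ in range(2000)])
mc_mean, mc_std = samples.mean(), samples.std()

print(spatial[0, :, 0, 0], locked[0, 0], mc_mean, mc_std)
```

In the spatial case each feature map is either kept whole or zeroed whole. In the locked case every timestep sees the same pattern. In the MC case mc_std is the uncertainty proxy.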