Dropout From Scratch
Interview Prep
The problem
Implement dropout — the regularization layer that, during training,
randomly zeroes out a fraction p of activations.
Implement both the forward and backward passes, and explain
when the rescaling factor 1/(1 − p) is applied.
Pattern: stochastic noise + rescaling for unbiased expectation
Dropout has a simple goal: prevent co-adaptation by forcing every
unit to function with a random subset of its peers. Mechanically,
each forward pass at training time multiplies the activations by
an independent Bernoulli(1 − p) mask. Done naively, this
would shrink the expected magnitude of every activation by a
factor of (1 − p), so we have to compensate, either by
scaling at training time (inverted dropout) or at
inference time (old-style dropout).
Modern frameworks all use inverted dropout: scale by
1/(1 − p) at train time, do nothing at
inference. This keeps inference identical to a plain forward pass,
which is a huge simplification when models are deployed.
Inverted dropout (modern)
import numpy as np

class Dropout:
    """Inverted dropout: scaling happens at TRAIN time so inference is identity."""

    def __init__(self, p: float, seed: int = 0):
        assert 0.0 <= p < 1.0
        self.p = p  # probability of DROPPING a unit
        self.rng = np.random.default_rng(seed)
        self.mask = None

    def forward(self, x: np.ndarray, train: bool) -> np.ndarray:
        if not train or self.p == 0.0:
            self.mask = None  # clear any stale mask from an earlier train pass
            return x          # eval: no-op
        keep = 1.0 - self.p
        # Bernoulli mask, scaled by 1/keep so E[mask * x] = x
        self.mask = (self.rng.random(x.shape) < keep).astype(x.dtype) / keep
        return x * self.mask

    def backward(self, dout: np.ndarray) -> np.ndarray:
        if self.mask is None:
            return dout       # forward was the identity, so the gradient is too
        # Same mask multiplies the upstream gradient
        return dout * self.mask

For contrast: old-style
# "Old-style" dropout: scale at inference instead of training.
# Both conventions give the same expected activations. The MODERN convention is inverted dropout.
def forward_oldstyle(x, p, rng, train):
    keep = 1.0 - p
    if train:
        mask = (rng.random(x.shape) < keep).astype(x.dtype)
        return x * mask, mask   # NO scaling here
    else:
        return x * keep, None   # scaling at inference instead

Trace
x = [[1.0, 2.0, 3.0, 4.0]], p = 0.5
keep = 0.5
Bernoulli(0.5) draw, say [1, 0, 1, 1]   (3 out of 4 kept)
mask (after divide by keep) = [2.0, 0.0, 2.0, 2.0]
out = x * mask = [[2.0, 0.0, 6.0, 8.0]]

E[out] entrywise:
  E[out_i] = E[mask_i] * x_i
           = (keep * (1/keep) + (1-keep)*0) * x_i
           = x_i   ✓ expected output equals input

Backward (dout=[[1,1,1,1]]):
  dx = dout * mask = [[2, 0, 2, 2]]
  Same scaled mask passes through.

The backward pass
Dropout is a pointwise multiplication by a mask that is held constant for the duration of one forward/backward pass. The chain rule is therefore trivial: multiply the upstream gradient by the same rescaled mask. Do not draw new randomness in the backward pass; that would be a bug, because the gradient would then belong to a different function from the one the forward pass actually evaluated.
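One way to convince yourself of the backward rule is a finite-difference gradient check. The sketch below is illustrative only: the helper `dropout_forward`, the shapes, and the seed are assumptions made for the example, and the mask is sampled once and then held fixed, as it would be within a single training step.

```python
import numpy as np

def dropout_forward(x, mask):
    # Hypothetical helper: the mask is already scaled by 1/keep and is fixed.
    return x * mask

rng = np.random.default_rng(0)
p = 0.5
keep = 1.0 - p

x = rng.standard_normal((3, 4))
mask = (rng.random(x.shape) < keep).astype(x.dtype) / keep
dout = rng.standard_normal(x.shape)  # arbitrary upstream gradient

# Analytic gradient: the same scaled mask multiplies the upstream gradient.
dx_analytic = dout * mask

# Numerical gradient of the scalar L = sum(dout * forward(x)) by central differences.
eps = 1e-6
dx_numeric = np.zeros_like(x)
for idx in np.ndindex(x.shape):
    x_plus = x.copy()
    x_plus[idx] += eps
    x_minus = x.copy()
    x_minus[idx] -= eps
    loss_plus = np.sum(dout * dropout_forward(x_plus, mask))
    loss_minus = np.sum(dout * dropout_forward(x_minus, mask))
    dx_numeric[idx] = (loss_plus - loss_minus) / (2 * eps)

max_err = np.max(np.abs(dx_analytic - dx_numeric))
print("max abs difference:", max_err)
```

Because the forward pass is linear in x for a fixed mask, the two gradients should agree up to floating-point error. Resampling the mask between the forward and backward passes would break that agreement.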
Complexity
- Time: O(N) forward and backward.
- Space: O(N) for the mask, which must be cached for the backward pass.
Why dropout reduces overfitting
Hinton's original framing: training with dropout is an approximate
ensemble over the 2^N sub-networks obtainable by
deleting subsets of units, all of which share weights. Each forward
pass samples one sub-network; inference with the full network
approximates averaging over all of them (with equal weights when
p = 0.5), and the rescaling keeps the expected activation magnitudes
matched between training and inference. This is why dropout
regularizes much like an averaged ensemble
without the cost of training many models.
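The ensemble view can be checked exactly in a toy case. The sketch below assumes a single linear layer with hypothetical weights `W` and only 8 input units, so all 2^8 masks can be enumerated; with p = 0.5 every mask is equally likely, so a plain mean is the ensemble average. For a purely linear layer the match is exact; once nonlinearities follow the dropout layer, the full-network output is only an approximation of the ensemble average.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
n_units = 8
p = 0.5            # with p = 0.5 all 2^N masks have equal probability
keep = 1.0 - p

x = rng.standard_normal(n_units)        # hypothetical input activations
W = rng.standard_normal((n_units, 3))   # hypothetical linear-layer weights

# Inference with inverted dropout: no mask, no scaling.
full_output = x @ W

# Enumerate every sub-network (every binary mask over the 8 units).
all_masks = np.array(list(itertools.product([0.0, 1.0], repeat=n_units)))  # (256, 8)

# Train-time output of each sub-network, with the 1/keep rescaling applied.
subnet_outputs = (all_masks / keep * x) @ W   # (256, 3)
ensemble_mean = subnet_outputs.mean(axis=0)

gap = np.max(np.abs(ensemble_mean - full_output))
print("max abs gap between ensemble mean and full network:", gap)
```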
Variations worth knowing
- Spatial dropout (dropout2d): for CNNs. Instead of dropping individual pixels, drop entire feature maps. Neighboring pixels are strongly correlated, so per-pixel dropout removes little information and regularizes weakly.
- DropConnect: drop weights (not activations). A close cousin; rarely used in modern practice.
- Variational / locked dropout (RNNs): use the same mask at every timestep within a sequence. Resampling the mask at each timestep injects fresh noise into the recurrent state at every step, which tends to disrupt long-range gradients; a locked mask avoids this.
- Inference-time dropout (MC dropout): keep dropout active at inference and average many forward passes. The spread across passes gives a cheap, approximate Bayesian uncertainty estimate, a popular tool in uncertainty quantification.
- Stochastic depth: drop entire residual blocks with some probability. Used in very-deep ResNets and modern transformer training as a regularizer.
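Spatial dropout and locked dropout from the list above can both be seen as ordinary inverted dropout with a smaller mask that is broadcast over the tied dimensions. The sketch below is a minimal illustration, not a framework API: the function name `dropout_with_mask_shape` and the tensor layouts, (N, C, H, W) for feature maps and (T, B, D) for sequences, are assumptions made for the example.

```python
import numpy as np

def dropout_with_mask_shape(x, p, mask_shape, rng):
    """Inverted dropout whose mask has shape `mask_shape` and broadcasts over x."""
    keep = 1.0 - p
    mask = (rng.random(mask_shape) < keep).astype(x.dtype) / keep
    return x * mask, mask

rng = np.random.default_rng(0)

# Spatial dropout: one Bernoulli draw per (sample, channel),
# shared across every pixel of that feature map.
feature_maps = np.ones((2, 4, 3, 3))                 # assumed (N, C, H, W)
spatial_out, spatial_mask = dropout_with_mask_shape(
    feature_maps, 0.5, (2, 4, 1, 1), rng
)

# Locked dropout: one draw per (batch item, feature),
# reused at every timestep of the sequence.
sequence = np.ones((5, 2, 6))                        # assumed (T, B, D)
locked_out, locked_mask = dropout_with_mask_shape(
    sequence, 0.5, (1, 2, 6), rng
)

# Each feature map is either entirely zero or entirely scaled by 1/keep.
print(spatial_out[0, :, 0, 0])
# The pattern at timestep 0 repeats at every later timestep.
print(np.array_equal(locked_out[0], locked_out[-1]))
```

The only design choice is the mask shape: a size-1 axis marks a dimension along which the same keep/drop decision is shared.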