Online Mean and Variance (Welford)
Interview Prep
The problem
Compute the mean and variance of a stream of values in one pass,
without storing the data and without numerical instability. Each
incoming value updates the running estimates in O(1).
Why not the textbook formula?
The "obvious" online formula is to keep
Σx and Σx², then compute
var = Σx²/n − (Σx/n)². This works on paper. In floating
point it's a disaster: when the mean is large relative to the spread
(e.g., temperatures around 1000 with variance 0.01), you subtract
two huge nearly-equal numbers and lose every digit of precision.
This is catastrophic cancellation.
def mean_var_two_pass(xs):
n = len(xs)
s1 = sum(xs); s2 = sum(x*x for x in xs)
mean = s1 / n
var = s2 / n - mean * mean # numerically unstable!
return mean, var Pattern: incremental updates, no cancellation
Welford's recurrence updates the mean and a quantity M2
(the running sum of squared deviations) directly, never forming
large intermediate sums. The trick is the asymmetric pair of deltas:
delta uses the old mean, delta2 uses the
new mean. The product telescopes the contribution to the
variance exactly.
Solution
class RunningStats:
"""Welford's online algorithm for mean and variance."""
def __init__(self):
self.n = 0
self.mean = 0.0
self.M2 = 0.0 # sum of squared deviations from current mean
def update(self, x: float) -> None:
self.n += 1
delta = x - self.mean
self.mean += delta / self.n
delta2 = x - self.mean # NOTE: uses the NEW mean
self.M2 += delta * delta2
@property
def variance(self) -> float:
return self.M2 / self.n if self.n else 0.0 # population variance
@property
def sample_variance(self) -> float:
return self.M2 / (self.n - 1) if self.n > 1 else 0.0 Trace
xs = [4, 7, 13, 16]
n=1: x=4 delta=4-0=4 mean=0+4/1=4 delta2=4-4=0 M2=4*0=0
n=2: x=7 delta=7-4=3 mean=4+3/2=5.5 delta2=7-5.5=1.5 M2=0+3*1.5=4.5
n=3: x=13 delta=13-5.5=7.5 mean=5.5+7.5/3=8.0 delta2=13-8=5 M2=4.5+7.5*5=42
n=4: x=16 delta=16-8=8 mean=8+8/4=10 delta2=16-10=6 M2=42+8*6=90
mean = 10
variance = M2/n = 90/4 = 22.5 Sanity check: mean of [4, 7, 13, 16] is 10. Deviations are [−6, −3, 3, 6], squared and summed: 36 + 9 + 9 + 36 = 90. Population variance = 90/4 = 22.5. ✓
Reference: naive two-pass
def mean_var_naive(xs: list[float]) -> tuple[float, float]:
n = len(xs)
mean = sum(xs) / n
var = sum((x - mean) ** 2 for x in xs) / n
return mean, var Complexity
- Time:
O(1)per update,O(n)total. - Space:
O(1). - Numerical: backward stable, no cancellation.
Variations worth knowing
- Combining two running stats: if you've computed
(n_a, mean_a, M2_a)and(n_b, mean_b, M2_b)on disjoint partitions, you can merge them inO(1):delta = mean_b − mean_a,n = n_a + n_b,mean = mean_a + delta · n_b / n,M2 = M2_a + M2_b + delta² · n_a · n_b / n. This is the basis for parallel reductions of variance. - Weighted / exponentially-decayed: swap
1/nfor a weight or a fixed learning rateαto get an exponential moving average / variance. Standard in streaming anomaly detection and Adam-style optimizers. - Higher moments: Welford generalizes to skewness
and kurtosis with extra
M3,M4running sums (Pébay 2008). Same numerical advantage. - Co-variance and regression: the same recurrence
with a
Cxyterm gives streaming covariance and streaming least-squares slope/intercept.