Reservoir Sampling
Interview Prep
The problem
You're given a stream of items of unknown length. You see each item exactly once and can keep only items in memory at a time. When the stream ends, return a sample of items chosen uniformly at random from all items in the stream.
The naive "store everything, then sample" needs memory. The naive "pick when you see it" can't know the right probability because the total length isn't known yet. Reservoir sampling threads the needle: memory, one pass, every item ends up with probability exactly of being in the final sample.
Warm-up: k = 1
Start with the special case. The current item is the candidate; when item arrives, replace the candidate with probability . After all items, the held candidate is uniform over the stream.
import random
def sample_one(stream):
"""Return a uniformly random element from a stream of unknown length."""
chosen = None
for i, x in enumerate(stream, start=1):
if random.random() < 1 / i:
chosen = x
return chosen Why this works: item is the final candidate iff it was chosen at step (probability ) and not displaced at any later step (probability ). Multiply: . Uniform. ✓
The general algorithm (Algorithm R)
For general : fill the reservoir with the first
items. For each subsequent item at
index , generate a random index
; if , replace
reservoir[j] with . Done.
import random
def reservoir_sample(stream, k: int):
"""Return k uniformly-random items from a stream of unknown length."""
reservoir = []
for i, x in enumerate(stream):
if i < k:
reservoir.append(x) # fill the reservoir
else:
j = random.randrange(i + 1) # uniform in [0, i]
if j < k:
reservoir[j] = x # replace a random slot
return reservoir Why it's uniform
Claim: every element in the stream has probability k/n of being in the
final reservoir (where n is the stream length).
Proof for element x at position i ≥ k:
P(x enters) = k / (i + 1)
[random.randrange(i+1) returns one of {0,...,i};
j < k with probability k/(i+1)]
P(x stays after seeing element at j > i, j ≥ k):
P(x not evicted at step j) = 1 - (1/k) · (k/(j+1)) = 1 - 1/(j+1) = j/(j+1)
P(x in reservoir at end of stream of length n)
= (k/(i+1)) · ∏_{j=i+1}^{n-1} (j/(j+1))
= (k/(i+1)) · ((i+1)/(i+2)) · ((i+2)/(i+3)) · ... · ((n-1)/n)
= k/n. ✓
(The first k elements get the same k/n by a slightly different argument:
they enter with probability 1, then each can be evicted at every later step.) Complexity
- Time:
O(n)— one pass, constant work per item. - Space:
O(k)— the reservoir. - Random calls:
O(n)calls to the RNG. Algorithm L (Li, 1994) improves this toO(k(1 + log(n/k)))RNG calls by sampling a geometric distribution for the gap to the next item that enters the reservoir.
Why this matters for ML
Reservoir sampling is the foundational primitive for any "uniform sample of a stream" — training-data subsampling from a log file too big to fit in memory, monitoring a fraction of production inference requests, A/B-test logging at fixed rate, sampling cells from a massive single-cell RNA-seq output, sampling random walks from a knowledge graph. Any time the dataset is bigger than memory or the stream is unbounded, this is what you reach for.
Variations worth knowing
- Weighted reservoir sampling: each item has a weight and you want probability proportional to weight. Algorithm A-Res: assign each item a key with , keep the largest keys (priority queue).
- Distributed reservoir sampling: each worker maintains a local reservoir; merge them by sampling proportionally to (n_local) across workers. Standard pattern in MapReduce/Spark sampling jobs.
- Sliding-window sampling: want a uniform sample of the last items in the stream. The basic reservoir doesn't handle eviction; you need a more elaborate data structure (chain-sampler or biased reservoir).