Reservoir Sampling

Interview Prep

Warm-up | ML Engineering | ml | streaming | probability

The problem

You're given a stream of items of unknown length. You see each item exactly once and can keep only $k$ items in memory at a time. When the stream ends, return a sample of $k$ items chosen uniformly at random from all $n$ items in the stream.

The naive "store everything, then sample" needs $O(n)$ memory. The naive "pick when you see it" fails because the right keep-probability depends on $n$, which isn't known yet. Reservoir sampling threads the needle: $O(k)$ memory, one pass, and every item ends up in the final sample with probability exactly $k/n$.

Warm-up: k = 1

Start with the special case. Hold a single candidate; when item $i$ arrives (counting from 1), replace the candidate with it with probability $1/i$. The first item is therefore always taken. After all $n$ items, the held candidate is uniform over the stream.

import random

def sample_one(stream):
    """Return a uniformly random element from a stream of unknown length."""
    chosen = None
    for i, x in enumerate(stream, start=1):
        if random.random() < 1 / i:
            chosen = x
    return chosen

Why this works: item $i$ is the final candidate iff it was chosen at step $i$ (probability $1/i$) and not displaced at any later step $j = i+1, \dots, n$ (probability $\prod_{j=i+1}^{n} (1 - 1/j) = \prod_{j=i+1}^{n} (j - 1)/j = i/n$, since the product telescopes). Multiply: $(1/i) \cdot (i/n) = 1/n$. Uniform. ✓
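The telescoping argument can be checked with exact rational arithmetic. The sketch below is illustrative only: the helper name final_candidate_probability and the choice n = 6 are assumptions made for the example, not part of the algorithm.

```python
from fractions import Fraction

def final_candidate_probability(i: int, n: int) -> Fraction:
    """Exact probability that item i (1-indexed) is the final candidate of n items."""
    p = Fraction(1, i)                 # chosen at step i
    for j in range(i + 1, n + 1):      # must survive steps i+1 .. n
        p *= Fraction(j - 1, j)        # not displaced at step j
    return p

n = 6
probs = [final_candidate_probability(i, n) for i in range(1, n + 1)]
print(probs)  # every entry is Fraction(1, 6)
```

Using Fraction avoids floating-point noise, so the equality with $1/n$ is exact rather than approximate.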

The general algorithm (Algorithm R)

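For general $k$: fill the reservoir with the first $k$ items. For each subsequent item $x$ at 0-based index $i \geq k$, draw an integer $j$ uniformly from $\{0, \dots, i\}$; if $j < k$, replace reservoir[j] with $x$, otherwise discard $x$. If the stream has fewer than $k$ items, the reservoir simply holds all of them. Done.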

import random

def reservoir_sample(stream, k: int):
    """Return k uniformly-random items from a stream of unknown length."""
    reservoir = []
    for i, x in enumerate(stream):
        if i < k:
            reservoir.append(x)              # fill the reservoir
        else:
            j = random.randrange(i + 1)      # uniform in [0, i]
            if j < k:
                reservoir[j] = x             # replace a random slot
    return reservoir
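In a long-running pipeline there is often no single "end of stream", so it can help to keep the same logic in an object that accepts one item at a time. The following is a minimal sketch under that assumption; the class name Reservoir and its attributes (seen, items, rng) are hypothetical names chosen for illustration.

```python
import random

class Reservoir:
    """Incremental form of Algorithm R: call add() per item, read .items at any time."""

    def __init__(self, k, rng=None):
        self.k = k
        self.seen = 0                      # number of items offered so far
        self.items = []                    # current sample, at most k items
        self.rng = rng if rng is not None else random.Random()

    def add(self, x):
        if self.seen < self.k:
            self.items.append(x)           # still filling the reservoir
        else:
            j = self.rng.randrange(self.seen + 1)   # uniform in [0, seen]
            if j < self.k:
                self.items[j] = x          # overwrite slot j
        self.seen += 1

# Usage: feed 100 values, keep a uniform sample of 3.
r = Reservoir(k=3, rng=random.Random(0))
for value in range(100):
    r.add(value)
print(r.items)
```

At any point, r.items is a uniform sample of the r.seen items offered so far, which is the same invariant the loop version maintains.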

Why it's uniform

Claim: every element in the stream has probability k/n of being in the
       final reservoir (where n is the stream length).

Proof for element x at position i ≥ k:
  P(x enters)            = k / (i + 1)
                           [random.randrange(i+1) returns one of {0,...,i};
                            j < k with probability k/(i+1)]

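  P(x survives the element arriving at position j > i):
    P(x not evicted at step j) = 1 - (k/(j+1)) · (1/k) = 1 - 1/(j+1) = j/(j+1)
                           [the new element enters with probability k/(j+1)
                            and then overwrites x's slot with probability 1/k]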

  P(x in reservoir at end of stream of length n)
    = (k/(i+1)) · ∏_{j=i+1}^{n-1} (j/(j+1))
    = (k/(i+1)) · ((i+1)/(i+2)) · ((i+2)/(i+3)) · ... · ((n-1)/n)
    = k/n.   ✓

(The first k elements, at positions i < k, get the same k/n: they enter with
 probability 1 and must survive steps j = k, ..., n-1, which happens with
 probability ∏_{j=k}^{n-1} (j/(j+1)) = k/n.)
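The claim can also be checked empirically. The sketch below repeats Algorithm R many times and measures how often each position ends up in the reservoir. The helper inclusion_frequencies, the explicit rng parameter, and the values n = 10, k = 3, 20,000 trials are assumptions for the illustration.

```python
import random
from collections import Counter

def reservoir_sample(stream, k, rng):
    """Algorithm R, with an explicit RNG so runs are reproducible."""
    reservoir = []
    for i, x in enumerate(stream):
        if i < k:
            reservoir.append(x)
        else:
            j = rng.randrange(i + 1)       # uniform in [0, i]
            if j < k:
                reservoir[j] = x
    return reservoir

def inclusion_frequencies(n, k, trials, seed=0):
    """Estimate P(item ends in the reservoir) for each of n stream positions."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(trials):
        counts.update(reservoir_sample(range(n), k, rng))
    return [counts[item] / trials for item in range(n)]

freqs = inclusion_frequencies(n=10, k=3, trials=20_000)
print([round(f, 3) for f in freqs])   # each value should be close to k/n = 0.3
```

Early and late positions should show the same frequency up to sampling noise; a visible trend across positions would indicate an off-by-one in the index range.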

Complexity

Time: O(n) — one pass, constant work per item.
Space: O(k) — the reservoir.
Random calls: O(n) calls to the RNG. Algorithm L (Li, 1994) improves this to O(k(1 + log(n/k))) RNG calls by sampling a geometric distribution for the gap to the next item that enters the reservoir.
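For comparison, here is a minimal sketch of the skip-based idea behind Algorithm L, following the commonly published pseudocode. The function name reservoir_sample_l, the sentinel _END, and the rng parameter are assumptions for this example. Note that with a plain Python iterator the skipped items are still consumed one by one, so the saving is in RNG calls, not in items read.

```python
import itertools
import math
import random

_END = object()  # sentinel marking an exhausted stream

def reservoir_sample_l(stream, k, rng=None):
    """Sketch of Algorithm L: jump over items instead of drawing one number per item."""
    if rng is None:
        rng = random.Random()
    it = iter(stream)
    reservoir = list(itertools.islice(it, k))    # first k items fill the reservoir
    if k <= 0 or len(reservoir) < k:
        return reservoir                         # empty sample or short stream

    def unit():
        return 1.0 - rng.random()                # uniform in (0, 1], so log() is safe

    w = math.exp(math.log(unit()) / k)
    while True:
        # Number of items to skip before the next one that enters the reservoir.
        skip = 0 if w >= 1.0 else math.floor(math.log(unit()) / math.log1p(-w))
        x = next(itertools.islice(it, skip, None), _END)
        if x is _END:
            return reservoir
        reservoir[rng.randrange(k)] = x          # evict a uniformly random slot
        w *= math.exp(math.log(unit()) / k)

sample = reservoir_sample_l(range(100_000), k=10, rng=random.Random(0))
print(sample)
```

Each accepted item costs three random draws (skip length, slot, and the update to w), and the number of accepted items grows only logarithmically in the stream length.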

Why this matters for ML

Reservoir sampling is the standard primitive for drawing a fixed-size uniform sample from a stream: subsampling training data from a log file too big to fit in memory, keeping a bounded sample of production inference requests for monitoring, retaining a bounded sample of A/B-test events, sampling cells from a massive single-cell RNA-seq output, sampling random walks from a knowledge graph. Any time the dataset is bigger than memory or the stream is unbounded, this is what you reach for. (If you need a fixed sampling rate rather than a fixed sample size, an independent coin flip per item is enough.)

Variations worth knowing

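Weighted reservoir sampling: each item has a weight $w_{i} > 0$ and heavier items should be more likely to be sampled. Algorithm A-Res (Efraimidis and Spirakis): assign each item a key $u_{i}^{1/w_{i}}$ with $u_{i} \sim \mathrm{Uniform}(0, 1)$, and keep the $k$ items with the largest keys in a size-$k$ min-heap. For $k = 1$ the selection probability is exactly proportional to weight; for $k > 1$ the result matches weighted sampling without replacement.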
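Distributed reservoir sampling: each worker maintains a local reservoir of size $k$ together with the count n_local of items it has seen. To merge, fill the $k$ output slots one at a time: pick a worker with probability proportional to its remaining count, move a random item from that worker's reservoir into the output, and decrement that count. Standard pattern in MapReduce/Spark sampling jobs.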
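Sliding-window sampling: want a uniform sample of the last $W$ items in the stream. The basic reservoir never expires old items, so you need a more elaborate structure such as chain sampling. Biased reservoir sampling is a related option when a recency-weighted sample, rather than a uniform one over the window, is acceptable.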
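The first two variations can be sketched briefly. Everything named here is an assumption for illustration: weighted_reservoir_sample takes (item, weight) pairs and uses $\log(u)/w$ as the key, which orders items the same way as $u^{1/w}$ but avoids underflow for small weights. merge_reservoirs assumes two uniform reservoirs drawn from disjoint streams, each holding min(k, n) items, and fills the output by picking a source with probability proportional to its remaining stream count.

```python
import heapq
import math
import random

def weighted_reservoir_sample(stream, k, rng=None):
    """A-Res sketch: stream yields (item, weight) pairs with weight > 0."""
    if rng is None:
        rng = random.Random()
    if k <= 0:
        return []
    heap = []  # min-heap of (key, arrival order, item); smallest key is evicted first
    for order, (item, weight) in enumerate(stream):
        u = 1.0 - rng.random()             # uniform in (0, 1]
        key = math.log(u) / weight         # monotone transform of u ** (1 / weight)
        entry = (key, order, item)         # order breaks ties without comparing items
        if len(heap) < k:
            heapq.heappush(heap, entry)
        elif key > heap[0][0]:
            heapq.heapreplace(heap, entry)
    return [item for _, _, item in heap]

def merge_reservoirs(sample_a, n_a, sample_b, n_b, k, rng=None):
    """Merge two uniform reservoirs taken from disjoint streams of n_a and n_b items."""
    if rng is None:
        rng = random.Random()
    a, b = list(sample_a), list(sample_b)
    merged = []
    while len(merged) < k and n_a + n_b > 0:
        # Choose the source stream in proportion to its remaining item count.
        if rng.randrange(n_a + n_b) < n_a:
            merged.append(a.pop(rng.randrange(len(a))))
            n_a -= 1
        else:
            merged.append(b.pop(rng.randrange(len(b))))
            n_b -= 1
    return merged

events = [("rare", 0.1), ("common", 5.0), ("typical", 1.0)]
print(weighted_reservoir_sample(events, k=2, rng=random.Random(0)))
print(merge_reservoirs([1, 2], 50, [7, 8], 150, k=2, rng=random.Random(0)))
```

Decrementing the count after each draw is what makes the merged result a sample without replacement from the combined stream; drawing every slot with the original proportions would slightly over-represent the smaller stream's items.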