Reservoir Sampling

Interview Prep

Warm-upML Engineeringmlstreamingprobability

The problem

You're given a stream of items of unknown length. You see each item exactly once and can keep only items in memory at a time. When the stream ends, return a sample of items chosen uniformly at random from all items in the stream.

The naive "store everything, then sample" needs memory. The naive "pick when you see it" can't know the right probability because the total length isn't known yet. Reservoir sampling threads the needle: memory, one pass, every item ends up with probability exactly of being in the final sample.

Warm-up: k = 1

Start with the special case. The current item is the candidate; when item arrives, replace the candidate with probability . After all items, the held candidate is uniform over the stream.

import random

def sample_one(stream):
    """Return a uniformly random element from a stream of unknown length."""
    chosen = None
    for i, x in enumerate(stream, start=1):
        if random.random() < 1 / i:
            chosen = x
    return chosen

Why this works: item is the final candidate iff it was chosen at step (probability ) and not displaced at any later step (probability ). Multiply: . Uniform. ✓

The general algorithm (Algorithm R)

For general : fill the reservoir with the first items. For each subsequent item at index , generate a random index ; if , replace reservoir[j] with . Done.

import random

def reservoir_sample(stream, k: int):
    """Return k uniformly-random items from a stream of unknown length."""
    reservoir = []
    for i, x in enumerate(stream):
        if i < k:
            reservoir.append(x)              # fill the reservoir
        else:
            j = random.randrange(i + 1)      # uniform in [0, i]
            if j < k:
                reservoir[j] = x             # replace a random slot
    return reservoir

Why it's uniform

Claim: every element in the stream has probability k/n of being in the
       final reservoir (where n is the stream length).

Proof for element x at position i ≥ k:
  P(x enters)            = k / (i + 1)
                           [random.randrange(i+1) returns one of {0,...,i};
                            j < k with probability k/(i+1)]

  P(x stays after seeing element at j > i, j ≥ k):
    P(x not evicted at step j) = 1 - (1/k) · (k/(j+1)) = 1 - 1/(j+1) = j/(j+1)

  P(x in reservoir at end of stream of length n)
    = (k/(i+1)) · ∏_{j=i+1}^{n-1} (j/(j+1))
    = (k/(i+1)) · ((i+1)/(i+2)) · ((i+2)/(i+3)) · ... · ((n-1)/n)
    = k/n.   ✓

(The first k elements get the same k/n by a slightly different argument:
 they enter with probability 1, then each can be evicted at every later step.)

Complexity

Why this matters for ML

Reservoir sampling is the foundational primitive for any "uniform sample of a stream" — training-data subsampling from a log file too big to fit in memory, monitoring a fraction of production inference requests, A/B-test logging at fixed rate, sampling cells from a massive single-cell RNA-seq output, sampling random walks from a knowledge graph. Any time the dataset is bigger than memory or the stream is unbounded, this is what you reach for.

Variations worth knowing