Daily Software Engineering: The Reservoir Sampling Algorithm: Picking K Random Items From a Stream You Can't Fit in Memory

2026-06-01

You have a log stream of unknown length (could be a million entries, could be a billion) and you need to pick 100 of them uniformly at random. The naive approach: load everything, shuffle, take the first 100. Congratulations, you just OOM'd your service. Reservoir sampling solves this in a single pass: O(n) time and O(k) memory, where k is your sample size, independent of the stream's length.

The algorithm (Algorithm R, credited to Alan Waterman and analyzed in Jeffrey Vitter's 1985 paper "Random Sampling with a Reservoir"):

Fill a reservoir array with the first k items from the stream (zero-based indices 0 through k-1).
For each subsequent item at zero-based index i (i ≥ k), pick a random integer j uniformly from [0, i], both ends inclusive.
If j < k, replace reservoir[j] with the new item. Otherwise, discard it.
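
The three steps above can be sketched in Python as follows. This is a minimal illustration, not a tuned implementation; the function name `reservoir_sample` and the optional `rng` parameter are assumptions made for the example.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(stream: Iterable[T], k: int,
                     rng: Optional[random.Random] = None) -> List[T]:
    """Return up to k items drawn uniformly at random from stream (Algorithm R)."""
    if k < 0:
        raise ValueError("k must be non-negative")
    rng = rng or random.Random()
    reservoir: List[T] = []
    for i, item in enumerate(stream):  # i is the zero-based index
        if i < k:
            reservoir.append(item)     # step 1: initial fill
        else:
            j = rng.randint(0, i)      # step 2: randint is inclusive on both ends
            if j < k:
                reservoir[j] = item    # step 3: overwrite slot j
            # otherwise the item is discarded
    return reservoir


if __name__ == "__main__":
    # Works on any iterable, including a generator that is never held in memory.
    print(reservoir_sample((n * n for n in range(1_000_000)), 5))
```

If the stream has fewer than k items, the sketch simply returns all of them. Passing `random.SystemRandom()` as `rng` swaps in Python's OS-backed generator without changing the rest of the code.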

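Why it works: When you've seen n items, any given item's probability of being in the reservoir is exactly k/n. The proof is by induction. Count items from 1. The i-th item (i > k) enters with probability k/i. When a later item m arrives, it overwrites one specific slot with probability (k/m) × (1/k) = 1/m, so the i-th item survives all subsequent steps with probability (1 - 1/(i+1)) × (1 - 1/(i+2)) × ... × (1 - 1/n), which telescopes to i/n. Multiply: k/i × i/n = k/n. For the first k items, the entry probability is 1 and the survival product starts at k+1, which gives k/n again. Uniform.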
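
Written out, the telescoping step looks like this. The derivation counts items from 1 and takes i > k; the index m (an auxiliary symbol introduced here) ranges over the items that arrive after item i, each of which overwrites a given slot with probability (k/m)(1/k) = 1/m.

```latex
\begin{align*}
P(\text{item } i \text{ in reservoir after } n \text{ items})
  &= \frac{k}{i} \cdot \prod_{m=i+1}^{n} \left(1 - \frac{1}{m}\right) \\
  &= \frac{k}{i} \cdot \prod_{m=i+1}^{n} \frac{m-1}{m} \\
  &= \frac{k}{i} \cdot \frac{i}{i+1} \cdot \frac{i+1}{i+2} \cdots \frac{n-1}{n} \\
  &= \frac{k}{i} \cdot \frac{i}{n} \;=\; \frac{k}{n}.
\end{align*}
```

For an item among the first k, the leading factor is 1 and the product starts at m = k+1, which collapses to k/n as well.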

Real-world example: You're operating a payment processor handling 50,000 transactions per second. The fraud team wants 1,000 transactions per minute sampled for manual review, drawn uniformly from the whole minute, not just "the first 1,000." You can't buffer 3 million transactions per minute. So each worker maintains a 1,000-slot reservoir, applies Algorithm R as transactions stream past, and at the minute boundary ships the reservoir, along with its count of transactions seen, to the review queue and resets. With more than one worker, the per-worker reservoirs still have to be merged (see Pitfalls). Memory cost: ~1,000 transaction records per worker. Constant. Forever.
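
A per-worker sampler for this scenario might look roughly like the sketch below. The class name `WindowedReservoir` and its `offer`/`flush` methods are hypothetical, and the code assumes the caller decides when the minute boundary has been reached; no real queue or payment API is involved.

```python
import random
from typing import Generic, List, Optional, Tuple, TypeVar

T = TypeVar("T")


class WindowedReservoir(Generic[T]):
    """Hypothetical per-worker sampler: one Algorithm R reservoir per window."""

    def __init__(self, k: int, rng: Optional[random.Random] = None) -> None:
        self.k = k
        self.rng = rng or random.Random()
        self.seen = 0                 # items offered in the current window
        self.items: List[T] = []      # never grows beyond k entries

    def offer(self, item: T) -> None:
        """Feed one item from the stream into the current window."""
        i = self.seen                 # zero-based index within the window
        self.seen += 1
        if i < self.k:
            self.items.append(item)
        else:
            j = self.rng.randint(0, i)
            if j < self.k:
                self.items[j] = item

    def flush(self) -> Tuple[List[T], int]:
        """Return (sample, items_seen) for the window, then reset."""
        result = (self.items, self.seen)
        self.items, self.seen = [], 0
        return result


if __name__ == "__main__":
    worker = WindowedReservoir(k=1000)
    for tx_id in range(3_000_000):    # stand-in for one minute of transactions
        worker.offer(tx_id)
    sample, seen = worker.flush()     # ship these at the minute boundary
    print(len(sample), seen)
```

Returning the item count together with the sample is a deliberate choice in this sketch: the count is what a downstream merge step needs in order to weight reservoirs from different workers.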

Rule of thumb: If you need a uniform random sample of size k from a stream whose total length n is unknown or too large to materialize, reservoir sampling is almost always the right answer. The expected number of writes to the reservoir after the initial fill is approximately k × ln(n/k). For k=100 and n=1 billion, that's roughly 1,600 writes total. Cheap.
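
To check the write-count estimate, the sketch below compares three numbers: the exact expectation (the m-th item, counting from 1, is written with probability k/m, so the expectation is the sum of k/m for m from k+1 to n), the k × ln(n/k) approximation, and a simulated count. The function names are illustrative assumptions.

```python
import math
import random


def expected_writes_exact(n: int, k: int) -> float:
    """Sum of k/m for m = k+1..n, the exact expected number of overwrites."""
    return sum(k / m for m in range(k + 1, n + 1))


def expected_writes_approx(n: int, k: int) -> float:
    """Closed-form approximation k * ln(n / k)."""
    return k * math.log(n / k)


def count_writes(n: int, k: int, rng: random.Random) -> int:
    """Simulate Algorithm R's coin flips and count the overwrites."""
    writes = 0
    for i in range(k, n):             # zero-based indices after the initial fill
        if rng.randint(0, i) < k:     # same test Algorithm R uses
            writes += 1
    return writes


if __name__ == "__main__":
    n, k = 1_000_000, 100
    print(expected_writes_exact(n, k))
    print(expected_writes_approx(n, k))
    print(count_writes(n, k, random.Random()))
    print(expected_writes_approx(10**9, 100))   # the k=100, n=1 billion case
```

For k=100 and n=1 billion the approximation evaluates to about 1,612, in line with the "roughly 1,600" figure.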

Pitfalls: Don't rely on Math.random() or any other general-purpose PRNG if an adversary must not be able to predict or bias which items get sampled; use a CSPRNG instead. Don't use Algorithm R for weighted sampling; there's a variant called A-Res for that. And if you parallelize across workers, each worker produces a uniform sample of its own stream, not the global one. To recover global uniformity, merge the reservoirs with weights proportional to per-worker item counts, which means each worker has to report how many items it saw.
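
One way to do the merge for two workers is sketched below. It is an illustration under stated assumptions, not the only scheme: the function name `merge_reservoirs` is hypothetical, and each input reservoir is assumed to be a uniform sample holding min(k, items seen) entries from its own stream. Each output slot is drawn from worker A with probability proportional to the number of A's items not yet accounted for, and from worker B otherwise.

```python
import random
from typing import List, Optional, Sequence, TypeVar

T = TypeVar("T")


def merge_reservoirs(sample_a: Sequence[T], seen_a: int,
                     sample_b: Sequence[T], seen_b: int,
                     k: int, rng: Optional[random.Random] = None) -> List[T]:
    """Combine two per-stream reservoirs into one sample of up to k items."""
    rng = rng or random.Random()
    a, b = list(sample_a), list(sample_b)
    rng.shuffle(a)                    # reservoir order is not random, so shuffle
    rng.shuffle(b)
    remaining_a, remaining_b = seen_a, seen_b
    merged: List[T] = []
    while len(merged) < k and (a or b):
        # Pick the source stream in proportion to its not-yet-drawn item count.
        if rng.randrange(remaining_a + remaining_b) < remaining_a:
            merged.append(a.pop())
            remaining_a -= 1
        else:
            merged.append(b.pop())
            remaining_b -= 1
    return merged


if __name__ == "__main__":
    # Worker A saw 100 items, worker B saw 900; both kept 10.
    a = list(range(0, 10))
    b = list(range(1000, 1010))
    print(merge_reservoirs(a, 100, b, 900, k=10))
```

With these counts, about one in ten merged items should come from worker A on average. More than two workers can be handled by merging pairwise and carrying the summed item count forward.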

Key Takeaway: Reservoir sampling lets you draw a uniform k-sized random sample from a stream of unknown length using O(k) memory and a single pass — no buffering required.
