The HyperLogLog: Counting Billions of Unique Items in Kilobytes

2026-05-29

You're tracking unique visitors across a CDN. A naive approach: stash every visitor ID in a Set. With 100 million uniques and 16-byte IDs, that's 1.6 GB of RAM per counter — per region, per day, per campaign. Multiply by your dimensions and you're out of memory before lunch.

HyperLogLog (HLL) gives you an estimate of the cardinality of a set using a fixed, tiny amount of memory — typically 12 KB for ~2% error, regardless of whether you're counting 1,000 or 1,000,000,000 items.

The intuition: hash each item to a uniformly random bit string. Count the maximum number of leading zeros you've ever seen. If you've seen a hash starting with 10 zeros, you've probably observed roughly 2^10 = 1024 distinct items — because that pattern only shows up once in ~1024 random hashes. HLL refines this by splitting hashes into m buckets, tracking the max-leading-zeros per bucket, and harmonic-averaging the results to cancel out variance.

The cardinality estimate for m buckets is roughly: estimate ≈ α · m² / Σ(2^(-M[i])), where M[i] is the max leading zeros in bucket i. Standard error is 1.04 / √m. So 16,384 buckets (12 KB at 6 bits each) gives ~0.8% error.

Real-world example: Reddit uses HLL to count unique viewers per post. They can't store every user ID for every post — billions of view events daily. Instead, each post has a small HLL sketch updated on every view. The displayed "viewers" number is an estimate within ~1%, which is exactly what you want for social proof. Redis ships PFADD and PFCOUNT commands; BigQuery has HLL_COUNT.MERGE; Presto has approx_distinct.

The killer feature is mergeability. Two HLL sketches can be combined bucket-by-bucket (take the max per bucket) to get a sketch representing the union of both sets — without re-scanning the underlying data. This means you can shard collection across servers, store hourly sketches, then merge them into daily, weekly, monthly counts on demand. With exact sets, you can't do this cheaply: union of two sets of 100M items requires touching all 200M items.

When NOT to use it: when you need exact counts (billing, compliance, fraud detection), when cardinality is small enough that a Set fits comfortably, or when you need to enumerate the items — HLL gives you a count, never the elements themselves.

See it in action: Check out How HyperLogLog Actually Works — Counting Billions of Unique Items in 12KB by State
Flow to see this theory applied.
Key Takeaway: HyperLogLog trades ~1% accuracy for a constant ~12 KB memory footprint and free mergeability, making it the right tool for unique-counter dashboards at any scale.