2026-05-28
A Bloom filter is a probabilistic data structure that answers one question: "Have I seen this item before?" Its answer is either "definitely not" or "probably yes". It never says "definitely yes," and it never misses a real hit. False positives are allowed; false negatives are impossible. That asymmetry is exactly what makes it useful.
Mechanically, it's a bit array of size m and k independent hash functions. To insert an item, hash it k ways and set those bits to 1. To check membership, hash the same way and inspect those bits — if any bit is 0, the item was never inserted. If all are 1, it might have been inserted (or some combination of other items happened to set those exact bits).
Real-world example: Chrome's malicious URL detection. Chrome can't ship a list of every dangerous URL to your machine — that's gigabytes. Instead, it ships a Bloom filter. When you click a link, Chrome checks the filter locally. If the answer is "definitely not malicious," the page loads instantly with zero network calls. If it's "probably malicious," Chrome makes a (slower) authoritative API call to Google's Safe Browsing service to confirm. The filter eliminates 99%+ of those expensive calls.
Cassandra uses Bloom filters per SSTable to skip disk reads for keys that aren't there. CDNs use them to avoid origin fetches for known-missing assets. Databases use them in join optimizations.
The sizing rule of thumb: for n items and target false-positive rate p:
Concretely: 1 million items at 1% false-positive rate needs ~9.6 bits per item (~1.2 MB) and 7 hash functions. Compare that to a HashSet storing 1M 50-byte URLs — that's 50+ MB. You get a 40x memory reduction for accepting a 1% false-positive rate.
Where it bites you:
Reach for it when your hot path does many membership checks against a large set, the cost of a real lookup is high (disk, network), and a small false-positive rate is tolerable because you verify on the slow path anyway.
