Daily Software Engineering: The Bloom Filter: Saying "Definitely Not" Fast and Cheap

The Bloom Filter: Saying "Definitely Not" Fast and Cheap

2026-05-28

A Bloom filter is a probabilistic data structure that answers one question: "Have I seen this item before?" Its answer is either "definitely not" or "probably yes". It never says "definitely yes," and it never misses a real hit. False positives are allowed; false negatives are impossible. That asymmetry is exactly what makes it useful.

Mechanically, it's a bit array of size m and k independent hash functions. To insert an item, hash it k ways and set those bits to 1. To check membership, hash the same way and inspect those bits — if any bit is 0, the item was never inserted. If all are 1, it might have been inserted (or some combination of other items happened to set those exact bits).

Real-world example: Chrome's malicious URL detection. Chrome can't ship a list of every dangerous URL to your machine — that's gigabytes. Instead, it ships a Bloom filter. When you click a link, Chrome checks the filter locally. If the answer is "definitely not malicious," the page loads instantly with zero network calls. If it's "probably malicious," Chrome makes a (slower) authoritative API call to Google's Safe Browsing service to confirm. The filter eliminates 99%+ of those expensive calls.

Cassandra uses Bloom filters per SSTable to skip disk reads for keys that aren't there. CDNs use them to avoid origin fetches for known-missing assets. Databases use them in join optimizations.

The sizing rule of thumb: for n items and target false-positive rate p:

Bit array size: m ≈ -n · ln(p) / (ln 2)²
Hash functions: k ≈ (m/n) · ln 2

Concretely: 1 million items at 1% false-positive rate needs ~9.6 bits per item (~1.2 MB) and 7 hash functions. Compare that to a HashSet storing 1M 50-byte URLs — that's 50+ MB. You get a 40x memory reduction for accepting a 1% false-positive rate.

Where it bites you:

You can't delete from a standard Bloom filter — clearing bits would create false negatives for other items sharing those bits. Use a Counting Bloom filter (counters instead of bits) if you need deletes, at ~4x the memory cost.
Sizing is fixed at creation. If you exceed your planned n, the false-positive rate degrades fast. Scalable Bloom filters chain multiple filters to grow dynamically.
It's a filter, not a cache. A "probably yes" answer still requires the authoritative lookup. The value is in skipping the lookup on the "definitely not" path.

Reach for it when your hot path does many membership checks against a large set, the cost of a real lookup is high (disk, network), and a small false-positive rate is tolerable because you verify on the slow path anyway.

See it in action: Check out let s implement a bloom filter extremely random blog posts by CodeHelp to see this theory applied.

Key Takeaway: Bloom filters trade a tunable false-positive rate for massive memory savings, letting you cheaply reject the 99% of queries that would otherwise hit slow storage.

All newsletters