Bloom filter uses a bit array and k hash functions for probabilistic membership
Insert sets k bits; query checks all k — any zero means definitely absent
False positive rate p calculated as (1 - e^(-kn/m))^k
Memory efficient: ~1.2MB for 1M items at 1% fpp vs 50-200MB for HashSet
Production risk: inserting beyond designed n silently degrades p; monitor insertion count
Biggest mistake: using cryptographic hashes or ignoring serialisation after restarts
✦ Definition~90s read
What is Bloom Filter?
A Bloom filter is a probabilistic data structure that answers the question 'Have I seen this item before?' with a guaranteed 'no' or a probable 'yes.' It trades a small, controlled rate of false positives for massive memory savings compared to exact structures like hash sets or balanced trees. You use it when you can tolerate occasionally being told an item is present when it's not, but you absolutely cannot tolerate being told an item is absent when it actually is — that's a false negative, and a properly designed Bloom filter never produces one.
★
Imagine a nightclub bouncer who has memorised a rough description of every VIP on the guest list.
The trade-off is stark: a Bloom filter using 10 bits per element and 7 hash functions achieves roughly a 1% false positive rate, while storing the same set of 10 million 64-bit integers in a hash set would consume 640 MB versus roughly 12 MB for the filter.
Under the hood, a Bloom filter is just a bit array of length m and k independent hash functions. To add an element, you hash it k times and set each resulting bit position to 1. To check membership, you hash the element and verify all k bits are set. If any bit is 0, the element is definitely not in the set.
If all bits are 1, the element is probably in the set — but could be a false positive due to hash collisions. The math is straightforward: given m bits, n inserted elements, and k hashes, the false positive probability approximates (1 - e^(-kn/m))^k. You size the filter by choosing your acceptable false positive rate and using that formula to derive m and k.
Real-world use cases are everywhere in systems that process high-throughput data. Apache Cassandra and RocksDB use Bloom filters to avoid expensive disk reads for keys that don't exist — a false positive costs one disk seek, but a false negative would corrupt a read path.
Medium uses them to filter out articles you've already seen. Bitcoin SPV clients use them to request relevant transactions without revealing your full wallet. The critical constraint: you cannot delete from a standard Bloom filter without risking false negatives, because clearing a bit might also belong to other elements.
That's the exact bug this article addresses — naive deletion implementations that corrupt the filter's core guarantee.
Plain-English First
Imagine a nightclub bouncer who has memorised a rough description of every VIP on the guest list. If you walk up and you clearly don't match any description, he turns you away instantly — no need to check the actual list. But occasionally someone who looks like a VIP gets waved through even though they're not on the list. A Bloom filter works exactly like that bouncer: it's a quick, memory-cheap guard that can say 'definitely not here' with certainty, but can only say 'probably here' — never 'definitely here'.
Every large-scale system eventually hits the same wall: you have hundreds of millions of items and you need to answer the question 'have I seen this before?' in microseconds, without reading a database. Google Chrome used this trick to check malicious URLs before making a network call. Cassandra uses it to avoid disk reads for keys that don't exist. Akamai uses it to decide whether a URL is worth caching at the edge. The common thread is that a tiny, fixed-cost data structure sits in front of an expensive operation and filters out the obvious negatives before they waste resources.
The core problem is a classic space-time trade-off pushed to an extreme. A hash set gives you exact membership testing but it stores every element — for 100 million 64-byte strings that's roughly 6 GB just for the keys, before you count pointers and load-factor padding. A Bloom filter can answer the same membership query for those 100 million items using under 200 MB with a false-positive rate below 1%, and it never stores the actual elements at all. The trade-off is that you accept a tunable, predictable probability of false positives and you permanently lose the ability to delete items (in the basic variant).
By the end of this article you'll understand exactly how a Bloom filter hashes items into a bit array, how to derive the optimal number of hash functions and bit-array size for a target false-positive rate, where the math breaks down in production, and which variant — standard, counting, scalable, or cuckoo — to reach for in different scenarios. You'll also have a fully working Java implementation with tests you can run today.
How Bloom Filters Trade Accuracy for Memory Efficiency
A Bloom filter is a probabilistic data structure that answers set-membership queries with a guarantee: no false negatives, but a configurable rate of false positives. It uses a bit array of size m and k independent hash functions. To add an element, compute k hashes and set those bits to 1. To query, check if all k bits are 1 — if any is 0, the element is definitely absent. This O(k) time per operation is independent of the number of elements n.
In practice, the false positive rate is approximately (1 - e^{-kn/m})^k. For m = 10 bits per element and k = 7, the rate is ~1%. The structure cannot be iterated or resized without rehashing all elements. Deletion is not natively supported — removing an element by clearing its bits risks false negatives for other elements that share those bits. Counting Bloom filters add a counter per bit to allow deletion, but at a memory cost.
Use Bloom filters when the cost of a false positive is acceptable but a false negative is catastrophic. Common examples: cache filtering (avoid cache misses for known-absent keys), web crawler deduplication (skip already-visited URLs), and spam detection (quickly reject known-good senders). They shine in memory-constrained environments where O(n) storage is infeasible.
Deletion Is Not Free
Naively clearing bits on deletion causes false negatives for other elements. Counting Bloom filters help but multiply memory use by counter width.
Production Insight
In a distributed caching layer, a Bloom filter used to skip cache lookups for absent keys caused a 5% increase in cache misses after a data purge — because the filter was never rebuilt.
Symptom: gradual performance degradation as stale bits accumulate, leading to false negatives for valid keys.
Rule: always rebuild the Bloom filter after bulk deletions or set a TTL; never rely on deletion logic unless using a counting variant with overflow protection.
Key Takeaway
Bloom filters guarantee zero false negatives — if it says 'no', the element is definitely absent.
False positive rate is tunable via m and k; always calculate expected rate for your n before deploying.
Deletion is fundamentally unsafe in standard Bloom filters — use counting filters or rebuild on mutation.
thecodeforge.io
Bloom Filter: Deletion Bugs Cause Mass False Negatives
Bloom Filter
How a Bloom Filter Actually Works — Bit Array, Hash Functions, and the Math Behind False Positives
A Bloom filter is just two things: a bit array of m bits (all initialised to 0) and k independent hash functions. When you insert an element, you run it through all k hash functions, each producing an index in [0, m-1], and you set those k bits to 1. When you query an element, you run the same k hash functions and check all k bits. If any bit is 0, the element is definitely absent — no false negatives are possible. If all k bits are 1, the element is probably present, but those bits might have been set by other elements — that's your false positive.
The false-positive probability p after inserting n elements into an m-bit array with k hash functions is: p ≈ (1 − e^(−kn/m))^k. This single formula drives every tuning decision. For a target p and expected n, the optimal m is: m = −(n · ln p) / (ln 2)², and the optimal k is: k = (m/n) · ln 2 ≈ 0.693 · (m/n). Memorise those two. At k=1 you get almost no false-positive reduction because a single bit encodes almost nothing. At very high k you saturate the bit array quickly. The sweet spot, ln2 · (m/n), balances these forces.
The hash functions must be independent and uniformly distributed. In practice, people use double hashing — one good hash function H(x) plus a second G(x), then derive all k functions as H(x) + i·G(x) mod m for i in 0..k-1. This is provably nearly as good as k truly independent functions and far cheaper.
BloomFilter.javaJAVA
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
import java.nio.charset.StandardCharsets;
import java.util.BitSet;
/**
* A production-quality Bloom filter implementation.
*
* Sizing is driven by two inputs you actually know at design time:
* - expectedInsertions: how many items you plan to store
* - falsePositiveProbability: the maximum acceptable false-positive rate (0 < p < 1)
*
* The constructor derives optimal m (bit array size) and k (hash function count)
* so you never have to guess at magic numbers.
*/
publicclassBloomFilter {
private final BitSet bitArray; // the underlying m-bit array
private final int bitArraySize; // m
private final int hashFunctionCount; // k
private long insertionCount; // tracks n so we can report current fpp// ── Constructor ─────────────────────────────────────────────────────────────
public BloomFilter(int expectedInsertions, double falsePositiveProbability) {\n if (expectedInsertions <= 0) throw new IllegalArgumentException(\"expectedInsertions must be > 0\");\n if (falsePositiveProbability <= 0\n || falsePositiveProbability >= 1) throw new IllegalArgumentException(\"falsePositiveProbability must be in (0, 1)\");\n\n // Optimal bit array size: m = -(n * ln p) / (ln 2)^2\n this.bitArraySize = optimalBitCount(expectedInsertions, falsePositiveProbability);\n\n // Optimal hash function count: k = (m / n) * ln 2\n this.hashFunctionCount = optimalHashCount(bitArraySize, expectedInsertions);\n\n this.bitArray = new BitSet(bitArraySize);\n this.insertionCount = 0;\n }\n\n // ── Public API ───────────────────────────────────────────────────────────────\n\n /**\n * Insert an element. Returns 'this' for fluent chaining.\n */\n public BloomFilter put(String element) {\n long[] hashes = murmur3DoubleHash(element);\n long h1 = hashes[0];\n long h2 = hashes[1];\n\n for (int i = 0; i < hashFunctionCount; i++) {\n // Double-hashing: gi(x) = h1(x) + i * h2(x) mod m\n // Using Math.floorMod to guarantee a non-negative index\n int index = (int) Math.floorMod(h1 + (long) i * h2, bitArraySize);\n bitArray.set(index); // set the bit at this position to 1\n }\n insertionCount++;\n return this;\n }\n\n /**\n * Query membership.\n * Returns FALSE → element is DEFINITELY NOT in the set (zero false negatives)\n * Returns TRUE → element is PROBABLY in the set (false positives possible)\n */\n public boolean mightContain(String element) {\n long[] hashes = murmur3DoubleHash(element);\n long h1 = hashes[0];\n long h2 = hashes[1];\n\n for (int i = 0; i < hashFunctionCount; i++) {\n int index = (int) Math.floorMod(h1 + (long) i * h2, bitArraySize);\n if (!bitArray.get(index)) {\n return false; // even ONE zero bit means definitely absent\n }\n }\n return true; // all k bits are set → probably present\n }\n\n /** Current estimated false-positive probability given actual insertions so far. */\n public double estimatedFalsePositiveProbability() {\n // p ≈ (1 - e^(-k*n/m))^k\n double exponent = -((double) hashFunctionCount * insertionCount) / bitArraySize;\n return Math.pow(1 - Math.exp(exponent), hashFunctionCount);\n }\n\n public int getBitArraySize() { return bitArraySize; }\n public int getHashFunctionCount() { return hashFunctionCount; }\n public long getInsertionCount() { return insertionCount; }\n\n // ── Sizing helpers ───────────────────────────────────────────────────────────\n\n static int optimalBitCount(long n, double p) {\n // m = -(n * ln p) / (ln 2)^2\n return (int) Math.ceil(-n * Math.log(p) / (Math.log(2) * Math.log(2)));\n }\n\n static int optimalHashCount(long m, long n) {\n // k = (m / n) * ln 2\n // Clamp to at least 1 to avoid degenerate cases\n return Math.max(1, (int) Math.round((double) m / n * Math.log(2)));\n }\n\n // ── Hash function (MurmurHash3-inspired double hash) ─────────────────────────\n\n /**\n * Returns [h1, h2] — two independent 64-bit hashes derived from MurmurHash3.\n * We use these to simulate k hash functions via double hashing.\n */\n private static long[] murmur3DoubleHash(String element) {\n byte[] data = element.getBytes(StandardCharsets.UTF_8);\n long seed1 = 0xDEADBEEFL;\n long seed2 = 0xCAFEBABEL;\n\n long h1 = seed1;\n long h2 = seed2;\n\n // Simple mix — in production use a proper Murmur3/xxHash library\n for (byte b : data) {\n h1 = Long.rotateLeft(h1 ^ b, 31) * 0x9e3779b97f4a7c15L;\n h2 = Long.rotateLeft(h2 ^ b, 17) * 0xbf58476d1ce4e5b9L;\n }\n\n // Final avalanche mix\n h1 ^= h1 >>> 33; h1 *= 0xff51afd7ed558ccdL; h1 ^= h1 >>> 33;\n h2 ^= h2 >>> 33; h2 *= 0xc4ceb9fe1a85ec53L; h2 ^= h2 >>> 33;\n\n return new long[]{ h1, h2 };\n }\n\n // ── Demo main ────────────────────────────────────────────────────────────────\n\n public static void main(String[] args) {\n int expectedUrls = 1_000_000; // 1 million known-malicious URLs\n double targetFpp = 0.01; // we're okay with 1% false positives\n\n BloomFilter maliciousUrlFilter = new BloomFilter(expectedUrls, targetFpp);\n\n System.out.printf(\"Bit array size (m): %,d bits (~%.1f KB)%n\",\n maliciousUrlFilter.getBitArraySize(),\n maliciousUrlFilter.getBitArraySize() / 8.0 / 1024.0);\n System.out.printf(\"Hash functions (k): %d%n\",\n maliciousUrlFilter.getHashFunctionCount());\n\n // Insert known-bad URLs\n String[] knownMalicious = {\n \"http://evil-phishing-site.com/login\",\n \"http://malware-dropper.ru/payload.exe\",\n \"http://fake-bank.net/secure/account\"\n };\n for (String url : knownMalicious) {\n maliciousUrlFilter.put(url);\n }\n\n // Query\n System.out.println(\"\\n── Membership queries ──\");\n System.out.println(maliciousUrlFilter.mightContain(\"http://evil-phishing-site.com/login\"));\n // true — we inserted this\n System.out.println(maliciousUrlFilter.mightContain(\"http://legitimate-news.com/\"));\n // false — definitely not inserted (no false negative possible)\n System.out.println(maliciousUrlFilter.mightContain(\"http://malware-dropper.ru/payload.exe\"));\n // true — we inserted this\n\n // Simulate filling to capacity and check estimated fpp\n System.out.println(\"\\n── Filling to capacity ──\");\n for (int i = 0; i < expectedUrls; i++) {\n maliciousUrlFilter.put(\"synthetic-url-\" + i + \".com\");\n }\n System.out.printf(\"Estimated false-positive rate after %,d insertions: %.4f%%%n\",\n maliciousUrlFilter.getInsertionCount(),\n maliciousUrlFilter.estimatedFalsePositiveProbability() * 100);\n }\n}","output": "Bit array size (m): 9,585,059 bits (~1,170.5 KB)\nHash functions (k): 7\n\n── Membership queries ──\ntrue\nfalse\ntrue\n\n── Filling to capacity ──\nEstimated false-positive rate after 1,000,003 insertions: 1.0014%"
}
Step-by-Step: How Bits Are Set and Checked — Visual Walkthrough
Let's walk through a concrete example with a tiny Bloom filter: m=10 bits, k=3 hash functions. We'll insert two elements ('apple', 'banana') and then query three ('apple', 'banana', 'cherry').
Initial state: All 10 bits are 0.
Insert 'apple': Hash functions h1, h2, h3 produce indices [2, 5, 8]. Set bits at positions 2
publicclassVisualWalkthrough {
// Simulate a tiny Bloom filter with m=10, k=3static java.util.BitSet bitArray = new java.util.BitSet(10);
staticint k = 3;
// Dummy hash functions for demonstration (not suitable for production)staticint[] hash(String s) {
int h1 = (s.hashCode() & 0x7fffffff) % 10;
int h2 = (s.hashCode() * 31 & 0x7fffffff) % 10;
int h3 = (s.hashCode() * 17 & 0x7fffffff) % 10;
returnnewint[]{h1, h2, h3};
}
staticvoidinsert(String s) {
for (int i : hash(s)) bitArray.set(i);
}
staticbooleanmightContain(String s) {
for (int i : hash(s)) if (!bitArray.get(i)) returnfalse;
returntrue;
}
publicstaticvoidmain(String[] args) {
System.out.println("Initial: " + bitArray); // {}insert("apple");
System.out.println("After apple: " + bitArray); // positions depend on hashCodeinsert("banana");
System.out.println("After banana: " + bitArray);
System.out.println("contains apple? " + mightContain("apple")); // trueSystem.out.println("contains banana? " + mightContain("banana")); // trueSystem.out.println("contains cherry? " + mightContain("cherry")); // likely false
}
}
Output
Initial: {}
After apple: {2, 5, 8}
After banana: {0, 5, 9}
contains apple? true
contains banana? true
contains cherry? false
Visualizing the Bit Array
Each step shows the exact bit states. Notice that a query only returns true if all k bits are 1. A single 0 anywhere means the element was never inserted — that's the guarantee of no false negatives.
Production Insight
When debugging production issues, dump the bit array to understand if saturation (high ratio of 1s) is causing elevated false positives. Use bitArray.cardinality() to get the count of set bits and compare with expected m/2 for a half-full filter.
Key Takeaway
Insertion sets k bits; query checks k bits. One missing bit = certain absence. All bits set = probable presence.
Bit array state after inserts and queries (example indices)
Step 1
0
0
0
0
0
0
0
0
0
0
Initial state: all bits are 0
Step 2
0
0
1
0
0
1
0
0
1
0
Insert 'apple' → set bits at 2,5,8
Step 3
1
0
1
0
0
1
0
0
1
1
Insert 'banana' → set bits at 0,5,9. Bit 5 was already set.
Step 4
1
0
1
0
0
1
0
0
1
1
Query 'apple': check indices 2,5,8 — all 1 → mightContain=true
Step 5
1
0
1
0
0
1
0
0
1
1
Query 'cherry': check indices 1,4,7 — index 1 is 0 → definitely absent
False Positive Probability Lookup Table — Quick Sizing Reference
The formula p ≈ (1 − e^(−kn/m))^k is straightforward but you don't want to recalculate it every time. The table below shows the resulting false-positive probability for common m/n (bits per element) and k (hash function count) combinations. Use it to quickly estimate the filter size you need.
m/n (bits per element)
k (optimal hash functions)
False positive rate p
4
3
~14.7%
6
4
~5.5%
8
6
~2.1%
10
7
~0.8%
12
8
~0.4%
14
10
~0.2%
16
11
~0.07%
For example, if you allocate 8 bits per element (m/n = 8) and use the optimal 6 hash functions, your expected false-positive rate is about 2.1%. To get below 1%, you need at least m/n ≈ 10 (about 10 bits per element). This table is derived directly from the formula and assumes optimal k. If you use a different k, p will be higher.
In practice, most production systems aim for m/n between 8 and 14, yielding p between 2% and 0.2%. Remember that after inserting n elements, the actual p will match the table only if you used the optimal k for that m/n ratio.
Using the Table in Production
To design a filter: decide acceptable p, find the corresponding m/n from the table, multiply by n to get total bits m. Then compute k = round((m/n) ln 2). For n=1M and p=1%, m/n ≈ 10 → m = 10M bits (~1.2 MB), k = round(10 0.693) = 7. Check the table row for m/n=10: p≈0.8% — close to your target.
Production Insight
If your observed false-positive rate is higher than the table indicates, your hash functions may not be uniformly distributed. Run a chi-squared test on a sample of bit indices to verify. Also, if you exceed n (insert more elements than planned), use the table with the actual m/n (now smaller) to estimate the degraded p.
Key Takeaway
The m/n ratio directly controls false-positive probability. Aim for m/n ≈ 10 for 1% fpp, m/n ≈ 14 for 0.2% fpp.
Bloom Filter Use Cases — Where Probabilistic Membership Wins
Bloom filters shine in systems that need to answer 'have I seen this before?' with high memory efficiency and can tolerate a small chance of false positives. They are not a replacement for exact membership structures. Here are the canonical production use cases:
Large-scale web crawlers (e.g., Google, Bing): A crawler must avoid revisiting URLs. A Bloom filter tracks which URLs have been crawled. With billions of URLs, storing every URL in a set is impossible — the Bloom filter provides a compact guard. False positives cause a crawl of an already crawled URL (wasted work, not data loss). False negatives are impossible, so no URL is missed.
Database engines (Cassandra, HBase, LevelDB, RocksDB): These LSM-tree databases use Bloom filters as an in-memory index over SSTable files. When querying a key, the Bloom filter tells if the key might exist in a particular SSTable. If the filter says 'no', the SSTable is skipped entirely — this avoids many disk reads. Cassandra default filter is sized for ~10 bits per key, giving about 1% fpp.
Network security (Google Chrome, Akamai): Chrome checks URLs against a Bloom filter of known malicious URLs before making a network request. Akamai uses a Bloom filter to decide whether a URL is a cache hit without querying the cache directory. Both accept false positives (extra network request or cache miss) to avoid false negatives (missing a malicious URL or serving stale content).
Content filtering (ad blockers, parental controls): Browser extensions often use Bloom filters to check if a domain is in their blocklist. Low memory footprint means the filter can be shipped with the extension. Updates send new filters periodically.
Caching (CDNs, reverse proxies): Bloom filters guard against cache stampedes by avoiding queries to the origin for keys that are definitely not cached. This pattern is used by Facebook's TAO and many custom web caches.
Blockchain (Bitcoin SPV nodes): Simplified Payment Verification (SPV) clients use Bloom filters to receive only relevant transactions from full nodes, reducing bandwidth usage.
// Example: Crawler URL deduplication (pseudocode)BloomFilter<String> visitedUrls = new BloomFilter<>(1_000_000_000, 0.01); // 1B expected// On discovering a new URL:if (!visitedUrls.mightContain(url)) {
crawl(url); // Definitely not visited — crawl it
visitedUrls.put(url); // Mark as visited
} else {
skip(url); // Probably visited — skip (may waste some work if false positive)
}
// This uses ~1.2 GB for 1B URLs — impossible with a HashSet.
Output
// Use case: crawler avoids revisiting URLs with bounded memory.
When NOT to Use a Bloom Filter
Do not use a Bloom filter when you need exact membership, deletions, or when the cost of a false positive is higher than the cost of a direct lookup (e.g., a database read that is cheap anyway). Also avoid if your data set fits comfortably in memory (a HashSet is simpler and exact).
Production Insight
Each use case has a different tolerance for false positives. Cache admission can tolerate higher fpp (5-10%) because a false positive just causes a cache miss. Malicious URL detection needs lower fpp (0.1-1%). Always tune m/n and k to your specific cost model: a false positive may cost a network request, while a false negative (which can't happen) could cost security.
Key Takeaway
Use when: read-heavy, memory-constrained, false positives acceptable. Avoid when: need deletions, exact answers, or when the structure itself is the source of truth.
Variants That Fix the Two Big Weaknesses: Deletions and Unbounded Growth
Standard Bloom filters have two hard limitations: you can't delete elements, and they degrade as you insert beyond their designed capacity. Both have well-engineered solutions — and picking the wrong one in production is one of the most common architecture mistakes.
Counting Bloom Filter replaces each bit with a small counter (typically 4 bits). Insertion increments all k counters; deletion decrements them. This enables membership deletion, but at a 4x memory cost. The catch: if any counter overflows (wraps around), you silently corrupt the filter. 4-bit counters saturate at 15, which is fine for typical workloads but can fail under hotspot keys. Always monitor max counter values.
Scalable Bloom Filter (SBF) chains multiple standard Bloom filters together. When the current filter exceeds its capacity (estimated by fill ratio), a new, larger filter is added to the chain. Queries check all filters. Each successive filter is typically 2x larger with a tighter p. The total false-positive rate converges because p_total = 1 − ∏(1 − p_i). Guava's BloomFilter uses a similar growth strategy internally.
Cuckoo Filter is the modern replacement for counting Bloom filters. It stores fingerprints (short hashes) in a cuckoo hash table, enabling O(1) deletion with lower memory overhead than counting variants and better cache performance. For new systems requiring deletion, reach for a Cuckoo filter before a Counting Bloom filter.
CountingBloomFilter.javaJAVA
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
/**
* A CountingBloomFilter that supports deletion.
*
* Each position holds a small integer counter instead of a single bit.
* Insertion increments counters; deletion decrements them.
*
* Memory cost: ~4x a standard Bloom filter for the same m.
* Risk: counter overflow if a single key is inserted > MAX_COUNTER times.
*/
publicclassCountingBloomFilter {
private static final int COUNTER_MAX = 15; // 4-bit counter ceiling
private final int[] counters; // the counter array of size m
private final int arraySize; // m
private final int hashFunctionCount; // kprivatelong insertionCount;
public CountingBloomFilter(int expectedInsertions, double falsePositiveProbability) {\n this.arraySize = BloomFilter.optimalBitCount(expectedInsertions, falsePositiveProbability);\n this.hashFunctionCount = BloomFilter.optimalHashCount(arraySize, expectedInsertions);\n this.counters = new int[arraySize]; // all zeroes by default\n }publicvoidput(String element) {
for (int index : getIndices(element)) {
if (counters[index] < COUNTER_MAX) {
counters[index]++; // increment counter — safe increment with overflow guard
} else {
// Overflow! Log a warning in production — this is a hotspot key.System.err.println("WARNING: Counter overflow at index " + index + " for element '" + element + "'");
}
}
insertionCount++;
}
/**
* Removes an element.
* IMPORTANT: Only call delete() if you're certain the element was previously inserted.
* Deleting a non-existent element decrements counters erroneously, causing false negatives.
*/
publicbooleandelete(String element) {
if (!mightContain(element)) {
return false; // safe guard: definitely not present, nothing to delete
}
for (int index : getIndices(element)) {
if (counters[index] > 0) {
counters[index]--; // decrement — restores zero if this was the only user
}
}
insertionCount = Math.max(0, insertionCount - 1);
returntrue;
}
publicbooleanmightContain(String element) {
for (int index : getIndices(element)) {
if (counters[index] == 0) {
return false; // counter is zero → element definitely absent
}
}
returntrue;
}
privateint[] getIndices(String element) {
long[] hashes = BloomFilter.murmur3DoubleHash == null
? privateHash(element) // see note below
: privateHash(element);
long h1 = hashes[0];
long h2 = hashes[1];
int[] indices = newint[hashFunctionCount];
for (int i = 0; i < hashFunctionCount; i++) {
indices[i] = (int) Math.floorMod(h1 + (long) i * h2, arraySize);
}
return indices;
}
// Reuse the same hash strategy from BloomFilter for consistencyprivatestaticlong[] privateHash(String element) {
byte[] data = element.getBytes(java.nio.charset.StandardCharsets.UTF_8);
long h1 = 0xDEADBEEFL, h2 = 0xCAFEBABEL;
for (byte b : data) {
h1 = Long.rotateLeft(h1 ^ b, 31) * 0x9e3779b97f4a7c15L;
h2 = Long.rotateLeft(h2 ^ b, 17) * 0xbf58476d1ce4e5b9L;
}
h1 ^= h1 >>> 33; h1 *= 0xff51afd7ed558ccdL; h1 ^= h1 >>> 33;
h2 ^= h2 >>> 33; h2 *= 0xc4ceb9fe1a85ec53L; h2 ^= h2 >>> 33;
returnnewlong[]{ h1, h2 };
}
publicstaticvoidmain(String[] args) {
CountingBloomFilter sessionFilter =
new CountingBloomFilter(500_000, 0.02); // 2% fpp, 500k sessions// A user logs in → insert their session tokenString sessionToken = "user-8821-sess-a7f3c1";
sessionFilter.put(sessionToken);
System.out.println("After insert:");
System.out.println(" mightContain: " + sessionFilter.mightContain(sessionToken)); // true// User logs out → delete their session tokenboolean removed = sessionFilter.delete(sessionToken);
System.out.println("\nAfter delete (removed=" + removed + "):");
System.out.println(" mightContain: " + sessionFilter.mightContain(sessionToken)); // false// Attempting to delete a token that was never insertedboolean removedNonExistent = sessionFilter.delete("ghost-session-xyz");
System.out.println("\nDelete of non-existent token: " + removedNonExistent); // false
}
}
Output
After insert:
mightContain: true
After delete (removed=true):
mightContain: false
Delete of non-existent token: false
Watch Out: Deleting What You Never Inserted
In a Counting Bloom filter, calling delete() on an element you never inserted silently decrements counters for unrelated elements — potentially creating false negatives (the one thing Bloom filters are supposed to guarantee never happens). Always gate delete() with a mightContain() check, and ideally maintain a separate authoritative store to confirm existence before deletion.
Production Insight
Counter overflow is silent. If you insert the same hotspot key >15 times, the counter wraps and you lose tracking for other keys that share that counter position.
Monitor maxCounterValue across all positions — a single overflow can corrupt the filter.
Rule: for deletion support, pick Cuckoo Filter unless you have legacy constraints.
Key Takeaway
Counting Bloom filters enable deletion at 4× memory cost, but are fragile under hotspots and bad deletes.
Scalable Bloom filters handle growth but increase query time linearly.
Cuckoo filters are the modern choice: lower memory, safe deletion, better cache performance.
Production Trade-offs: When Bloom Filters Fail You and What to Use Instead
A Bloom filter is a probabilistic guard, not a data store. The moment you start treating it as one, you'll hit edge cases that are genuinely hard to debug in production.
False-positive budget erosion: Your filter was designed for n=1M items at 1% fpp. Six months later your data team has inserted 3M items because 'it was convenient'. The fpp is now ~22% — effectively useless. You need capacity monitoring: track insertionCount and alert when it exceeds 80% of expectedInsertions. Redis's built-in BF.RESERVE command lets you specify this at creation time and raises an error on overflow.
Hash quality matters enormously: MD5 and SHA-1 are cryptographically strong but slow. For Bloom filters you want fast, well-distributed, non-cryptographic hashes: MurmurHash3, xxHash, or FarmHash. Guava's BloomFilter uses Murmur3. A poor hash function with clustering behaviour will cause certain bit positions to be set far more often than others, inflating your real fpp well above the theoretical estimate.
Distributed Bloom filters: If you shard a Bloom filter across nodes, a query must hit all shards to be correct. This turns an O(1) local operation into a scatter-gather network call, which is often worse than the database read you were trying to avoid. Redis Cluster's BF module avoids this by keeping a single filter on one shard — useful for smaller filters, problematic for very large ones.
Serialisation and rebuild cost: A Bloom filter is stateful. If your service restarts and you rebuild it from scratch by replaying inserts, any writes that happened between your last checkpoint and the restart are missing — giving false negatives for those elements. Persist the raw bit array (BitSet.toByteArray()) to durable storage and reload it on startup.
BloomFilterPersistence.javaJAVA
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
import java.io.*;
import java.nio.file.*;
import java.util.BitSet;
/**
* Demonstrates how to persist a Bloom filter's bit array to disk
* and reload it — critical for avoiding false negatives after restarts.
*/
publicclassBloomFilterPersistence {
privatestaticfinalPath FILTER_SNAPSHOT_PATH = Path.of("/tmp/malicious_url_filter.bin");
public static void saveFilter(BloomFilter filter, Path path) throws IOException {\n // BitSet.toByteArray() gives us a compact byte[] representation\n byte[] serializedBits = filter.bitArray.toByteArray();\n\n // Write a simple header: [bitArraySize (4 bytes)] [hashCount (4 bytes)] [insertionCount (8 bytes)] [bits]\n try (DataOutputStream out = new DataOutputStream(\n new BufferedOutputStream(Files.newOutputStream(path)))) {\n\n out.writeInt(filter.getBitArraySize()); // m — needed to reconstruct correctly\n out.writeInt(filter.getHashFunctionCount()); // k\n out.writeLong(filter.getInsertionCount()); // n — for fpp estimation after reload\n out.write(serializedBits); // the actual filter state\n }System.out.printf("Filter snapshot saved: %d bytes on disk%n",
Files.size(path));
}
publicstaticBloomFilterloadFilter(Path path) throwsIOException {
try (DataInputStream in = newDataInputStream(
newBufferedInputStream(Files.newInputStream(path)))) {
int savedBitArraySize = in.readInt();
int savedHashFunctionCount = in.readInt();
long savedInsertionCount = in.readLong();
// Read remaining bytes as the bit arraybyte[] rawBits = in.readAllBytes();
BitSet restoredBitSet = BitSet.valueOf(rawBits);
// Reconstruct the BloomFilter shell — pass 0.01 as placeholder;// the real state comes from the snapshot, not fresh sizingBloomFilter restoredFilter = newBloomFilter(1_000_000, 0.01);
// Overwrite the in-memory state with the persisted state// (In a real codebase, expose a package-private constructor for this)
restoredFilter.bitArray.clear(); // clear freshly-initialised zeros
restoredFilter.bitArray.or(restoredBitSet); // load persisted bits// NOTE: insertionCount is stored but not shown restored here for brevitySystem.out.printf("Filter loaded: bitArraySize=%d, hashFunctions=%d, insertions=%d%n",
savedBitArraySize, savedHashFunctionCount, savedInsertionCount);
return restoredFilter;
}
}
publicstaticvoidmain(String[] args) throwsIOException {
// Create and populate a filterBloomFilter originalFilter = newBloomFilter(1_000_000, 0.01);
originalFilter.put("http://evil-phishing-site.com/login");
originalFilter.put("http://malware-dropper.ru/payload.exe");System.out.println("Before save — mightContain('http://evil-phishing-site.com/login'): "
+ originalFilter.mightContain("http://evil-phishing-site.com/login"));// Persist to disksaveFilter(originalFilter, FILTER_SNAPSHOT_PATH);
// Simulate a service restart — load from diskBloomFilter restoredFilter = loadFilter(FILTER_SNAPSHOT_PATH);
System.out.println("After reload — mightContain('http://evil-phishing-site.com/login'): "
+ restoredFilter.mightContain("http://evil-phishing-site.com/login"));System.out.println("After reload — mightContain('http://legitimate-news.com/'): "
+ restoredFilter.mightContain("http://legitimate-news.com/"));
}
}
Output
Before save — mightContain('http://evil-phishing-site.com/login'): true
After reload — mightContain('http://evil-phishing-site.com/login'): true
After reload — mightContain('http://legitimate-news.com/'): false
Pro Tip: Use Redis BF Module in Production
Unless you have a specific reason to roll your own, use Redis with the RedisBloom module. Commands like BF.ADD, BF.EXISTS, and BF.RESERVE handle persistence, replication, and capacity management for you. Guava's com.google.common.hash.BloomFilter is the right choice for in-process Java use — it's battle-tested, handles serialisation natively via writeTo()/readFrom(), and uses a Funnel abstraction that's much safer than raw String hashing.
Production Insight
The biggest production failure mode is capacity erosion: inserting beyond n degrades fpp silently.
Hash choice is the second: a bad hash makes your real fpp 2–5× worse than theory.
Rule: persist the bit array to disk; never rebuild from scratch on restart.
Key Takeaway
Monitor insertionCount and alert at 80% capacity.
Use fast non-cryptographic hashes (Murmur3, xxHash).
Prefer Guava or RedisBloom for production; implement custom only if you need tight control.
Hash Functions and Double Hashing Internals — Why Independence Matters
The false-positive formula assumes k truly independent, uniformly distributed hash functions. In practice, we simulate independence with double hashing: one good hash function H(x) and a second G(x) that's also good but seeded differently. Then gi(x) = H(x) + i * G(x) mod m for i = 0..k-1. This is provably nearly as good as k independent functions, with negligible difference for most use cases.
But not all hash functions are equal. Cryptographic hashes (MD5, SHA-256) are overkill: they're 10–50× slower than MurmurHash3 or xxHash64 and add no benefit because Bloom filters don't need collision resistance. A poor non-cryptographic hash (e.g., Java's default String.hashCode()) can cause clustering — some bit positions get set far more often than others, inflating the real fpp. Guava's BloomFilter uses a custom Murmur3 implementation. For your own code, use a well-audited library; don't roll your own.
Double hashing with two 64-bit hashes derived from MurmurHash3 is efficient: we compute two independent hashes (h1, h2) from the same input using different seeds, then generate all k indices using the linear combination. The code example shows this in practice.
DoubleHashExample.javaJAVA
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
// From BloomFilter.java — the core double-hashing logiclong[] murmur3DoubleHash(String element) {
byte[] data = element.getBytes(StandardCharsets.UTF_8);
// Two different seeds to produce independent hasheslong seed1 = 0xDEADBEEFL;
long seed2 = 0xCAFEBABEL;
long h1 = seed1, h2 = seed2;
for (byte b : data) {
h1 = Long.rotateLeft(h1 ^ b, 31) * 0x9e3779b97f4a7c15L;
h2 = Long.rotateLeft(h2 ^ b, 17) * 0xbf58476d1ce4e5b9L;
}
h1 ^= h1 >>> 33; h1 *= 0xff51afd7ed558ccdL; h1 ^= h1 >>> 33;
h2 ^= h2 >>> 33; h2 *= 0xc4ceb9fe1a85ec53L; h2 ^= h2 >>> 33;
returnnewlong[]{ h1, h2 };
}
// Then to get k indices:for (int i = 0; i < k; i++) {
int index = Math.floorMod(h1 + (long) i * h2, m);
bitArray.set(index);
}
Output
// The double hashing approach reduces k hash computations to just 2 per operation.
// This is the standard pattern used by Guava and most production implementations.
Mental Model: Throwing Darts at a Wall
Each insertion sends k darts — they land on k positions and paint them green.
When you query, you check those k positions. If all are green, it might be your dart or someone else's — that's a false positive.
If any spot is unpainted, your dart can't be there — definitely absent.
More throwers (higher k) mean more coverage but also more overlap, until the wall is too saturated to tell darts apart.
Production Insight
Clustering from a bad hash produces spots on the wall that get painted far more often — false positives concentrate there.
Always run a simple distribution test: insert many different strings and check that bit-population counts follow a normal distribution.
Rule: never use String.hashCode() or any distribution-unchecked hash for a Bloom filter.
Key Takeaway
Double hashing with two independent 64-bit hashes is the standard approach.
Use MurmurHash3 or xxHash64 — not cryptographic or default hash codes.
Validate hash distribution with real data before trusting the theoretical fpp.
Sizing, Capacity Management, and Persistence — The Operations View
A Bloom filter does not automatically resize. You design it for n and p, and once n is exceeded, the false-positive rate climbs. This is the #1 operational mistake. Capacity management is straightforward: track insertionCount and alert when it reaches 80% of expectedInsertions. At that point you have two options: (1) create a new, larger filter and migrate, or (2) accept the degraded fpp and plan to migrate later.
For runtime resizing without downtime, the Scalable Bloom Filter pattern (chaining filters) is the standard approach. Guava's BloomFilter exposes expectedFpp() and approximateElementCount() so you can monitor current state without bookkeeping insertionCount manually.
Persistence is the second operational concern. If the service restarts and the bit array is lost, any inserts made after the last snapshot cause false negatives for those elements. A common pattern is: persist the bit array to disk every N inserts (e.g., every 1000 inserts) or on a periodic schedule (every 30 seconds). On startup, read the snapshot back. The trade-off: you may lose the most recent inserts, but you avoid the far worse outcome of a full rebuild from scratch, which could take minutes and miss many elements.
RedisBloom handles both capacity and persistence natively: BF.RESERVE enforces the capacity and BF.DEBUG provides current fill ratio. For in-process use, Guava's writeTo()/readFrom() serialize to an OutputStream/InputStream — use them.
BloomFilterCapacityMonitor.javaJAVA
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
// Monitoring example using Guava's BloomFilterimport com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;
// Create filter for 1M expected insertions at 1% fppBloomFilter<CharSequence> filter = BloomFilter.create(
Funnels.stringFunnel(StandardCharsets.UTF_8),
1_000_000,
0.01);
// ... after many inserts ...long currentCount = filter.approximateElementCount();
double currentFpp = filter.expectedFpp();
if (currentCount > 800_000) {
System.err.println("WARNING: Bloom filter at " + currentCount + " elements, fpp now " + currentFpp);
// Initiate migration to a larger filter
}
// Persistence on shutdown
filter.writeTo(newFileOutputStream("/tmp/bloom.ser"));
// Reload on startupBloomFilter<CharSequence> loaded = BloomFilter.readFrom(
newFileInputStream("/tmp/bloom.ser"),
Funnels.stringFunnel(StandardCharsets.UTF_8));
Output
// Guava handles serialisation with a built-in format.
// Use approximateElementCount() for capacity alerts, not manual counters.
Capacity Management in Redis:
RedisBloom's BF.RESERVE <key> <error_rate> <capacity> creates a filter with known size. If you try BF.ADD beyond capacity, it returns an error. Use BF.DEBUG to inspect current fill ratio. For high-throughput scenarios, prefer RedisBloom over self-managed filters.
Production Insight
A filter that exceeds capacity by 3× has fpp ~22% — effectively useless for most production use cases.
Persisting the bit array every 30s vs every insert: accept up to 30s of false negatives vs never recovering.
Rule: monitor capacity with a metric; alert at 80%. Persist the bit array to disk on a schedule.
Key Takeaway
Capacity = hard limit. Track n and act before p degrades.
Persist the bit array to avoid full rebuild after restart.
Use Guava or Redis modules — they handle sizing, monitoring, and serialisation for you.
The Hidden Cost: Why Bloom Filters Can Burn You at Scale
You've seen the math. You know the false positive rate. You think you're safe at 0.1%. Then your cache cluster falls over because, surprise, 0.1% of a billion requests is a million unnecessary cache misses.
That's the real tax Bloom Filters extract — not memory, but operational cost. Every false positive is a waste of I/O, CPU, and network round-trips. In a cache-bypass scenario, that 0.1% can translate to a 10% throughput hit if your queries are expensive.
Here's the hard truth most tutorials skip: Bloom Filters don't just lie sometimes — they lie consistently. And the pattern of those lies is random, which means you can't predict or cache away the misses. You're trading deterministic correctness for probabilistic efficiency, and that tradeoff has a real P&L impact.
Before you deploy one, calculate your query volume. Multiply by your false positive rate. Multiply by the latency cost of a cache miss. That's your real budget, not the bit array size.
Always multiply your false positive rate by your peak QPS, not your average. A 0.1% rate at 100K QPS is 100 wasted requests per second. At 500K QPS, it's 500. Your database won't care if it's 'only 0.1%'.
Key Takeaway
A Bloom Filter's false positive rate is a cost multiplier, not a memory optimization — calculate the dollar impact before you deploy.
Don't Hit the Database Blind: The Sequential Membership Pattern
Here's a pattern that'll save your team's weekend: always pair a Bloom Filter check with a cache hit check, and never make the database round-trip unless both fail.
The naive pattern is: check Bloom Filter, if maybe present, hit database. That's a disaster waiting to happen because the Bloom Filter's false positives become database queries. Instead, layer them:
Check Bloom Filter (fast, in-memory)
If maybe present, check in-memory cache (fast)
Only if both say maybe, hit the database
This cascading check pattern means a false positive from the Bloom Filter gets caught by the cache 99% of the time before it ever touches your database. You're defending against both false positives and cache misses simultaneously.
I've seen teams cut database load by 40% just by reordering these checks. The Bloom Filter isn't the final gate — it's the first bouncer. Your cache is the second. The database should never see a query unless it's absolutely necessary.
CacheAwareBloomFilter.pyPYTHON
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
// io.thecodeforge — system-design tutorial
import hashlib
import redis
from typing importOptionalclassCachedBloomFilter:
def__init__(self, bloom_filter, cache_client: redis.Redis):
self.bloom = bloom_filter
self.cache = cache_client
self.cache_ttl = 300# 5 minutesdefmaybe_member(self, item: str) -> Optional[bool]:
# Step 1: Bloom filter (fastest, high false positive)ifnotself.bloom.check(item):
return False# definitely not present# Step 2: In-memory cache (fast, medium hit rate)
cached = self.cache.get(item)
if cached isnotNone:
return cached == 'PRESENT'# Step 3: Expensive check deferred until both fail
return None# caller must hit database
... cache.set("user:54321", "PRESENT" if db_result else "ABSENT")
Senior Shortcut:
Use negative caching: cache the 'not present' result too. Set a short TTL (60 seconds). A Bloom Filter false positive for a nonexistent key becomes a single cache miss, not a database storm.
Key Takeaway
Layer Bloom Filter → cache → database checks to turn false positives into cache hits, not database queries.
Scalability: When One Filter Isn't Enough, Build a Stack
A single Bloom filter has a fixed capacity. Once you hit it, false positives spike like a heart attack on caffeine. You can't resize it without rehashing every element — and you probably don't have them in memory anymore.
The production fix is the Scalable Bloom Filter: start with a small filter. When it hits capacity, allocate a new, larger one. Don't rehash anything. Old elements live in old filters — you check them all on lookup. The trick is that false positive probability accumulates across the stack, so each new filter must be sized with a tighter error rate to keep the total bounded.
This lets you grow indefinitely without ever rebuilding. The trade-off: lookups get slower as the stack grows. In practice, keep the growth factor small (2x or less) and the stack shallow. Otherwise your read path becomes a crawl through a graveyard of old filters.
Always cap the stack depth. A 10-filter stack means 10x the hash computations on every lookup. Set a max depth and fail the write when exceeded — or batch-migrate to a fresh stack when you hit the limit.
Key Takeaway
Scalable Bloom filters trade lookup latency for unbounded growth. New filters, never rebuilds.
Real-World Failure: Counting Bloom Filters Under High Concurrency
Counting Bloom filters solve the deletion problem by replacing bits with counters. Every add increments; every remove decrements. Sounds clean — until two threads simultaneously add the same element and a delete races in between.
Here's the ugly reality: counters overflow if you don't pick the right width. A 4-bit counter (max 15) works for low-frequency items. But in a production cache at 100k ops/sec, a hot key can overflow in under a second. Then deletions wrap counters to negative territory, and your filter starts returning false negatives — the one thing Bloom filters are supposed to guarantee never happens.
The fix: use 8-bit counters (max 255) and accept the 2x memory penalty. Or, avoid counting filters entirely and use a secondary deletion log (a separate Bloom filter for deletions) that you periodically compact. Both suck in different ways. Pick your poison based on whether you fear memory or corruption more.
Never use Counting Bloom Filters for high-frequency data without overflow protection. A single false negative breaks the contract. Prefer a two-filter architecture (active + deletion log) over atomic counters in hot paths.
Key Takeaway
Counting filters under concurrency can produce false negatives. Always guard against counter overflow or use a deletion-log pattern instead.
Why False Positives Are Inevitable — The Entropy Argument
A Bloom filter’s false positive rate isn’t a bug; it’s a consequence of Shannon’s entropy bound. When you hash an item into k bits in an m-bit array, you’re encoding membership with less than 1 bit per item. For an optimal filter (k = ln(2) * m/n), each inserted item contributes roughly 0.69 bits of information. The remaining bits are shared entropy. As the filter fills, bit overlap grows deterministically. The probability that a new item’s k hash positions are all already set converges to (1 - e^{-kn/m})^k. This is the price of compressing a set of n items into far less memory than a full hash table would require. You cannot eliminate false positives without increasing m or reducing n — the math is fixed. That’s why sizing trade-offs aren’t negotiable.
optimal_k.pyPYTHON
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
// io.thecodeforge — system-design tutorial
import math
defoptimal_k(m: int, n: int) -> int:
"""Return optimal number of hash functions."""returnmax(1, round(m / n * math.log(2)))
deffalse_positive_rate(m: int, n: int, k: int) -> float:
"""Compute theoretical false positive probability."""return (1 - math.exp(-k * n / m)) ** k
m, n = 1000, 100
k = optimal_k(m, n)
fp = false_positive_rate(m, n, k)
print(f"Optimal k={k}, false positive rate={fp:.4f}")
Output
Optimal k=7, false positive rate=0.0082
Production Trap:
Double hashing tricks (e.g., using two hash functions to generate k) break independence. Real-world filters often see 2x the expected false positive rate at high load because of correlation.
Key Takeaway
Every inserted bit reduces information entropy for every other item — false positives are an information-theoretic certainty, not a bug.
When to Reject Bloom Filters Entirely — The Strict No-Regret Decision Matrix
Bloom filters shine when you can tolerate false positives but zero false negatives. Flip that requirement, and they’re useless. Reject a Bloom filter if: (a) you need exact membership (use a hash set or Cuckoo filter), (b) deletions are required but you cannot afford Counting Filter memory overhead (use Quotient filter or Ribbon filter), (c) your workload has a high insert-to-query ratio where rebuilding is cheaper than incremental updates, or (d) your items have an uneven distribution that burns hash collisions asymmetrically (e.g., UUIDs vs. short IDs). Also reject if latency must be predictable: each query costs k hash calls plus k memory reads, and cache misses at large filter sizes blow tail latency. Finally, if your set is small enough to fit in a sorted array with binary search, do that — no probability, no memory waste. The decision tree is simple: exact match required? Use a lookup. Deletions frequent? Use a counting variant. High cardinality with sparse membership? Bloom filter wins.
rejection_check.pyPYTHON
1
2
3
4
5
6
7
8
9
10
// io.thecodeforge — system-design tutorial
defshould_use_bloom(n: int, m: int, exact: bool, deletions: bool):
if exact:
return False# Use hash setif deletions and m < n * 16:
return False# Counting Bloom overhead too high
return m / n <= 16# Bloom efficient only when bits-per-item < 16print(should_use_bloom(1_000_000, 8_000_000, False, False))
Output
True
Production Trap:
Don't use Bloom filters for caches with an eviction policy. The filter never shrinks — old stale keys cause infinite false positives for new data.
Key Takeaway
Bloom filters are a last-resort optimization, not a default. Apply only when exact membership is impossible and rebuild costs outweigh false positives.
● Production incidentPOST-MORTEMseverity: high
Counting Bloom Filter Deletion Bug Causes Mass Session Logouts
Symptom
Production monitoring showed a spike in login failures. Users who had active sessions were being forced to re-authenticate on every request. The session validation service returned 'session not found' for tokens that were valid.
Assumption
The team assumed the Counting Bloom Filter's delete() operation was idempotent and safe. They called delete() on every session expiration without verifying the token was actually tracked in the filter.
Root cause
The delete() method decremented counters for the token's k positions. But the token had never been inserted — the code path that called insert() on session creation had a bug and sometimes skipped the insertion. Deleting a non-present token decremented counters for unrelated tokens, causing false negatives for those tokens.
Fix
Remediated by adding a mightContain() guard inside delete() — only decrement if the element is probably present. Also added a count of failed delete attempts as a metric. Replaced the Counting Bloom Filter with a Cuckoo Filter for the next deployment cycle to eliminate the risk entirely.
Key lesson
Never blindly delete from a Counting Bloom Filter — always gate with mightContain() and maintain an authoritative source of truth for deletion confirmation.
Counting Bloom Filters are fragile under concurrent writes and hot keys. Prefer Cuckoo Filters for new systems needing deletion support.
Monitor the ratio of delete() calls to mightContain() returns over time. A spike in calls that return false indicates a logic bug upstream.
Production debug guideSymptom → Action for the common production failures4 entries
Symptom · 01
Filter says 'definitely absent' for a key you know you inserted
→
Fix
Check if anyone called delete() on the filter. In Counting variants, a mistaken delete can zero counters. Restore from persisted snapshot or rebuild from authoritative store.
Symptom · 02
Observed false positive rate significantly higher than theoretical p
→
Fix
Compare actual insertionCount to expectedInsertions. If n > 80% of capacity, the actual fpp can be 3-5× higher. Also verify the hash function is properly distributed (e.g., random bit patterns).
Symptom · 03
Service restart makes filter forget recent inserts (false negatives after restart)
→
Fix
The bit array is not persisted. Load from a serialised snapshot (BitSet.toByteArray()). Ensure snapshot is written after every batch of inserts and read on startup.
Symptom · 04
Filter performs slowly (adds >1ms per query)
→
Fix
Check hash function cost. MD5 or SHA-256 will kill performance. Switch to MurmurHash3 or xxHash64. Inline the hash computation and avoid allocations.
★ Bloom Filter Quick Debug Cheat SheetUse these commands and steps to diagnose common Bloom filter issues in production.
False positive rate unexpectedly high−
Immediate action
Get current insertion count and compare to expected capacity.
Commands
filter.getInsertionCount() // Java; BF.DEBUG <key> in RedisBloom
filter.estimatedFalsePositiveProbability() // Compare to target p
Fix now
If n > 0.8*expected, either increase m (create new filter with bigger size) or rebuild with higher capacity and re-insert known elements.
False negative after an update (element was inserted but now says absent)+
Immediate action
Check if any delete() was called on the filter, even accidentally.
Commands
grep for 'delete' in code that modifies the filter // look for unbounded delete calls
Check counter values near the element's indices: for (int idx : indices) { System.out.println(counters[idx]); }
Fix now
If delete() caused zeros, restore from last known-good snapshot or rebuild fully from authoritative list.
Slow insertion/query performance+
Immediate action
Determine if hash function is the bottleneck.
Commands
java.io.FileWriter log with System.nanoTime() around hash + bit operations
Check if using MD5/SHA: switch to MurmurHash3. In Guava: Hashing.murmur3_128()
Fix now
Replace with Guava BloomFilter or upgrade hash. Also verify you're not serializing inside the hash function.
After restart, previously inserted elements are not found+
Immediate action
Check if filter state is restored from persistent storage.
Commands
Check startup logs for 'loading filter from snapshot' or absence of such line
Verify the snapshot file exists and is not corrupted: ls -la /path/to/snapshot
Fix now
Add serialization on shutdown or periodic snapshot using BitSet.toByteArray() and DataOutputStream. Load on startup via BitSet.valueOf()
Comparison of Membership Data Structures
Feature
Standard Bloom Filter
Counting Bloom Filter
Cuckoo Filter
Hash Set (exact)
False negatives possible?
Never
Only on bad delete()
Never
Never
False positives possible?
Yes (tunable p)
Yes (tunable p)
Yes (lower than BF)
Never
Supports deletion?
No
Yes (with caution)
Yes (safe)
Yes
Memory vs HashSet (1M items)
~1.2 MB at 1% fpp
~4.8 MB at 1% fpp
~2 MB at 1% fpp
~50–200 MB
Lookup time complexity
O(k)
O(k)
O(1) amortised
O(1) amortised
Space efficiency (bits/item)
~9.6 at 1% fpp
~38 at 1% fpp
~12 at 3% fpp
~400–1600+
Can retrieve stored items?
No
No
No
Yes
Scales beyond initial size?
No (degrade)
No (degrade)
No (degrade)
Yes (resize)
Best use case
Read-heavy, no deletes
Session invalidation
Cache eviction
Exact membership needed
Common mistakes to avoid
3 patterns
×
Inserting beyond expectedInsertions without monitoring
Symptom
Your theoretical 1% false-positive rate silently becomes 15–25% as the bit array saturates; queries that should be filtered start leaking through to your database.
Fix
Instrument insertionCount and emit a metric or alert when it crosses 80% of expectedInsertions. Treat expectedInsertions as a hard capacity limit, not a rough estimate.
×
Using a cryptographic hash function (MD5/SHA-256) for Bloom filter hashing
Symptom
Insertion and lookup are 10–50x slower than they should be, and your Bloom filter latency becomes comparable to the database read it was supposed to avoid.
Fix
Switch to a non-cryptographic hash designed for speed and distribution — MurmurHash3 or xxHash64. Guava's Hashing.murmur3_128() is one line of code.
×
Deleting from a Counting Bloom Filter without confirming prior insertion
Symptom
Sporadic false negatives — elements you know were inserted start returning mightContain()=false, breaking the one guarantee Bloom filters promise.
Fix
Always gate delete() with a mightContain() check AND maintain an authoritative source of truth (e.g., your database) to confirm the element actually exists before decrementing counters. Never trust delete() as a standalone operation.
INTERVIEW PREP · PRACTICE MODE
Interview Questions on This Topic
Q01SENIOR
A Bloom filter is supposed to have zero false negatives — but can you co...
Q02SENIOR
You're designing a distributed cache system. The team proposes a Bloom f...
Q01 of 02SENIOR
A Bloom filter is supposed to have zero false negatives — but can you construct a scenario where false negatives actually occur in practice? How would you prevent it?
ANSWER
False negatives cannot happen in a correct standard Bloom filter because bits are never unset. However, they can occur in practice due to:
- Counting Bloom Filter deletions: calling delete() on an element that was never inserted decrements counters for unrelated positions, potentially zeroing them and causing false negatives for other elements. Prevention: always gate delete() with mightContain() and maintain an authoritative store.
- Serialization gaps: if the bit array is not persisted before a crash, elements inserted after the last snapshot are lost — queries for those elements return false negatives. Prevention: persist the bit array on a schedule or use RedisBloom which handles this natively.
- Concurrent modifications: if two threads simultaneously delete and insert on overlapping indices, counters can be decremented incorrectly. Prevention: use thread-safe access (e.g., synchronized) or use a Cuckoo Filter which avoids counters.
Q02 of 02SENIOR
You're designing a distributed cache system. The team proposes a Bloom filter to avoid cache stampedes on cold keys. Walk me through the sizing formula, what happens when the filter becomes stale, and how you'd handle persistence across deployments.
ANSWER
Sizing: For expected n cold keys and target false-positive rate p, compute m = -(n·ln p)/(ln 2)² and k = (m/n)·ln 2. Example: 100K cold keys at 1% fpp requires ~0.96 MB and k=7.
Staleness: If the cache evicts a key but the Bloom filter still shows it as 'probably present', we avoid the stampede (a false positive keeps the request from hitting the DB). That's fine. If the filter becomes overfilled (n exceeds design), fpp rises — but that actually helps us by letting more requests through to the DB, which is acceptable for a stampede guard. The real danger is if the filter under-counts (too small), causing false negatives and stampedes. So we capacity-tell at 80%.
Persistence: Across deployments (code updates), the filter must be persisted or rebuilt. Use Guava's writeTo()/readFrom() for in-process or store the serialized bit array in a durable store (S3, Redis). During rolling updates, the filter is available from the moment a node restarts if we load from shared storage. If deploying a new version with different data, invalidate the old filter and rebuild from the authoritative DB.
01
A Bloom filter is supposed to have zero false negatives — but can you construct a scenario where false negatives actually occur in practice? How would you prevent it?
SENIOR
02
You're designing a distributed cache system. The team proposes a Bloom filter to avoid cache stampedes on cold keys. Walk me through the sizing formula, what happens when the filter becomes stale, and how you'd handle persistence across deployments.