Exercise: Bloom Filter Probabilistic Data Structure
Difficulty - Advanced
Estimated Time - 2-4 hours
Learning Objectives
- Understand probabilistic data structures and their trade-offs
- Implement Bloom filters with optimal hash function selection
- Master bit manipulation and efficient memory usage
- Calculate and tune false positive rates
- Apply Bloom filters to real-world performance optimization problems
- Implement advanced variants (counting and scalable Bloom filters)
Problem Statement
How can you check if an element exists in a massive dataset using minimal memory? A naive hash set might require gigabytes of RAM for millions of items, but a Bloom filter can answer "definitely not present" or "probably present" using just kilobytes of memory.
A Bloom filter is a space-efficient probabilistic data structure that tests whether an element is a member of a set. It can produce false positives but never false negatives. This trade-off makes Bloom filters ideal for applications where memory is constrained and occasional false positives are acceptable.
Real-World Scenario:
Consider a web crawler that visits billions of URLs. Before crawling a URL, it needs to check if it's already been visited. Storing billions of URLs in a hash set would require enormous memory:
// Without Bloom filter: huge memory usage
type NaiveCrawler struct {
	visited map[string]bool // 1 billion URLs * 100 bytes = 100 GB!
}

func (c *NaiveCrawler) AlreadyVisited(url string) bool {
	return c.visited[url]
}

// With Bloom filter: minimal memory usage
type EfficientCrawler struct {
	bloom *BloomFilter // 1 billion URLs * 10 bits = 1.25 GB
	// If false positive, re-crawl
}

func (c *EfficientCrawler) ProbablyVisited(url string) bool {
	return c.bloom.MayContainString(url)
}
With a 1% false positive rate, the Bloom filter might incorrectly say 1% of URLs were visited when they weren't, causing the crawler to skip them. But it uses 98% less memory than a hash set!
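The arithmetic behind those figures is easy to verify; a quick sketch (the helper name savingsPercent and the 100-bytes-per-entry / 10-bits-per-item estimates are just the scenario's assumptions):

```go
package main

import "fmt"

// savingsPercent compares a hash set at bytesPerEntry per URL with a
// Bloom filter at bitsPerItem per URL, returning the memory saved in percent.
func savingsPercent(bytesPerEntry, bitsPerItem float64) float64 {
	return 100 * (1 - (bitsPerItem/8)/bytesPerEntry)
}

func main() {
	const n = 1_000_000_000 // one billion URLs
	fmt.Printf("hash set: %.0f GB\n", float64(n)*100/1e9)  // 100 GB
	fmt.Printf("bloom:    %.2f GB\n", float64(n)*10/8/1e9) // 1.25 GB
	fmt.Printf("savings:  %.2f%%\n", savingsPercent(100, 10))
}
```

At 10 bits per item the saving is 98.75%, which matches the "98% less memory" claim above.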
Requirements
Functional Requirements
- Basic Bloom Filter
  - Initialize with size and number of hash functions
  - Add(item): Insert an item into the filter
  - MayContain(item): Check if an item might be in the filter
  - Clear(): Reset all bits to zero
- Optimal Parameter Calculation
  - Given an expected number of items and a desired false positive rate
  - Calculate optimal bit array size
  - Calculate optimal number of hash functions
  - Formulas: m = -(n * ln(p)) / (ln(2))^2, k = (m/n) * ln(2)
- Multiple Hash Functions
  - Implement k independent hash functions
  - Use the double hashing technique for efficiency
  - Ensure uniform distribution across the bit array
- Counting Bloom Filter
  - Support deletion by using counters instead of bits
  - Each position is a counter
  - Increment on add, decrement on remove
  - Handle counter overflow
- Scalable Bloom Filter
  - Dynamically grow as more items are added
  - Chain multiple Bloom filters with increasing sizes
  - Maintain a consistent false positive rate as the filter scales
Non-Functional Requirements
- Space Efficiency: Use bit-level operations for minimal memory
- Performance: O(k) time for add and query operations
- Accuracy: Achieve specified false positive rate
- Concurrency: Thread-safe for concurrent access
- Serialization: Support saving/loading filter state
Background and Theory
Bloom filters were invented by Burton Howard Bloom in 1970. They work by using multiple hash functions to set bits in a bit array.
How Bloom Filters Work
Data Structure:
- Bit array of size m
- k different hash functions
Add Operation:
- Hash the item with each of k hash functions
- Set the bit at each hash value to 1
Query Operation:
- Hash the item with each of k hash functions
- Check if all k bits are set to 1
- If any bit is 0: definitely not in set
- If all bits are 1: probably in set
Why False Positives?
Multiple different items can hash to the same bits. After adding many items, many bits are set to 1. A query for a non-existent item might find all its bits set to 1 by chance.
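A toy illustration of this mechanism, with m = 8 bits, k = 2, and hand-picked bit positions standing in for hash values (no real hash functions involved):

```go
package main

import "fmt"

// set turns on the given bit positions in an 8-bit "bit array".
func set(bits uint8, positions ...uint) uint8 {
	for _, p := range positions {
		bits |= 1 << p
	}
	return bits
}

// allSet reports whether every given bit position is 1 --
// the Bloom filter query rule.
func allSet(bits uint8, positions ...uint) bool {
	for _, p := range positions {
		if bits&(1<<p) == 0 {
			return false
		}
	}
	return true
}

func main() {
	var bits uint8
	bits = set(bits, 1, 4) // add "cat" -> bits {1, 4}
	bits = set(bits, 2, 5) // add "dog" -> bits {2, 5}

	fmt.Println(allSet(bits, 1, 4)) // "cat" -> true (actually present)
	fmt.Println(allSet(bits, 1, 5)) // "fox" -> {1, 5}: true -- a false positive
	fmt.Println(allSet(bits, 3, 6)) // "emu" -> {3, 6}: false, definitely absent
}
```

"fox" was never added, but "cat" contributed bit 1 and "dog" contributed bit 5, so its query finds all its bits set.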
Mathematical Analysis
False Positive Probability:
After inserting n items into a Bloom filter with m bits and k hash functions:
P(false positive) ≈ (1 - e^(-kn/m))^k
Optimal Number of Hash Functions:
k = (m/n) * ln(2) ≈ 0.693 * m/n
Optimal Bit Array Size:
For desired false positive rate p and n items:
m = -(n * ln(p)) / (ln(2))^2 ≈ -2.08 * n * ln(p)
Example:
- n = 1,000,000 items
- p = 0.01
- m ≈ 9,585,059 bits ≈ 1.14 MB
- k ≈ 7 hash functions
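These numbers can be reproduced directly from the formulas above; a minimal sketch (the function name optimalParams is only for this example):

```go
package main

import (
	"fmt"
	"math"
)

// optimalParams applies m = -n*ln(p)/(ln 2)^2 and k = (m/n)*ln 2.
func optimalParams(n uint64, p float64) (m uint64, k uint) {
	ln2 := math.Log(2)
	m = uint64(math.Ceil(-(float64(n) * math.Log(p)) / (ln2 * ln2)))
	k = uint(math.Round(float64(m) / float64(n) * ln2))
	return m, k
}

func main() {
	m, k := optimalParams(1_000_000, 0.01)
	// m ≈ 9,585,059 bits ≈ 1.14 MB, k = 7
	fmt.Printf("m = %d bits (%.2f MB), k = %d\n", m, float64(m)/8/1024/1024, k)
}
```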
Use Cases
- Google Chrome: Uses Bloom filter to check if URL is malicious before full lookup
- Bitcoin: Uses Bloom filters in SPV clients
- Cassandra & HBase: Use Bloom filters to avoid disk reads for non-existent data
- Akamai CDN: Uses Bloom filters for cache filtering
- Spell Checkers: Quick check if word is in dictionary
- Network Routers: Packet filtering and duplicate detection
- Blockchain: Ethereum uses Bloom filters for log filtering
Implementation Challenges
Challenge 1: Efficient Hash Function Generation
Computing k independent hash functions is expensive. How do you efficiently generate multiple hashes?
Solution Approach:
Use double hashing technique: compute only 2 hash functions, then simulate k hashes:
hash_i(x) = (h1(x) + i * h2(x)) mod m
This gives good distribution while being much faster than computing k separate hashes.
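A self-contained sketch of the technique (two FNV variants from the standard library, FNV-1a and FNV-1, stand in here for the two independent base hashes; the function name doubleHashIndices is illustrative):

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// doubleHashIndices derives k bit positions from only two base hashes,
// using hash_i(x) = (h1 + i*h2) mod m.
func doubleHashIndices(data []byte, k uint, m uint64) []uint64 {
	a := fnv.New64a() // FNV-1a as h1
	a.Write(data)
	h1 := a.Sum64()

	b := fnv.New64() // FNV-1 as h2
	b.Write(data)
	h2 := b.Sum64()

	idx := make([]uint64, k)
	for i := uint(0); i < k; i++ {
		idx[i] = (h1 + uint64(i)*h2) % m // unsigned overflow simply wraps
	}
	return idx
}

func main() {
	// Seven positions for one key, each in [0, 1024)
	for _, pos := range doubleHashIndices([]byte("example.com"), 7, 1024) {
		fmt.Println(pos)
	}
}
```

Only two hash computations happen per item, regardless of k; the loop is pure integer arithmetic.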
Challenge 2: Bit-Level Operations
To minimize memory, you need to pack bits tightly and manipulate individual bits efficiently.
Solution Approach:
- Use []uint64 as the underlying storage
- Calculate the word index: index / 64
- Calculate the bit offset: index % 64
- Set a bit with bitwise OR: array[wordIdx] |= 1 << offset
- Check a bit with bitwise AND: array[wordIdx] & (1 << offset) != 0
Challenge 3: Optimal Parameter Selection
Users shouldn't need to understand Bloom filter mathematics. Provide easy configuration.
Solution Approach:
- Provide a NewOptimal(expectedItems, falsePositiveRate) constructor
- Automatically calculate optimal m and k
- Validate parameters and warn about unrealistic configurations
- Provide helper functions to estimate performance
Challenge 4: Thread Safety
Multiple goroutines might query and update the Bloom filter concurrently.
Solution Approach:
- Use sync.RWMutex for concurrent access
- Read operations use the read lock
- Write operations use write lock
- Consider lock-free implementations using atomic operations for higher performance
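The lock-free direction can be sketched as a compare-and-swap loop on the word containing the target bit (setBitAtomic is an illustrative helper, not part of the required API):

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// setBitAtomic sets bit i without a mutex: it retries a compare-and-swap
// until the containing word is updated, or returns early if the bit is
// already set (a common, benign case in Bloom filters).
func setBitAtomic(bits []uint64, i uint64) {
	idx, mask := i/64, uint64(1)<<(i%64)
	for {
		old := atomic.LoadUint64(&bits[idx])
		if old&mask != 0 || atomic.CompareAndSwapUint64(&bits[idx], old, old|mask) {
			return
		}
	}
}

func main() {
	bits := make([]uint64, 16) // 1024 bits
	var wg sync.WaitGroup
	for g := 0; g < 8; g++ { // 8 goroutines racing on the same words
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := uint64(0); i < 1024; i++ {
				setBitAtomic(bits, i)
			}
		}()
	}
	wg.Wait()
	for _, w := range bits {
		if w != ^uint64(0) {
			fmt.Println("missing bits!")
			return
		}
	}
	fmt.Println("all 1024 bits set") // no updates lost despite the races
}
```

Queries can then use atomic loads, avoiding lock contention entirely; the trade-off is that multi-bit operations (like Clear) are no longer atomic as a unit.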
Hints
Hint 1: Bit Array Design
Use a slice of uint64 for efficient bit storage and operations:
type BloomFilter struct {
	bits []uint64 // Bit array
	size uint64   // Total number of bits
	k    uint     // Number of hash functions
	n    uint64   // Number of items added
	mu   sync.RWMutex
}

// Set bit at position i
func (bf *BloomFilter) setBit(i uint64) {
	idx := i / 64
	offset := i % 64
	bf.bits[idx] |= 1 << offset
}

// Check if bit at position i is set
func (bf *BloomFilter) getBit(i uint64) bool {
	idx := i / 64
	offset := i % 64
	return bf.bits[idx]&(1<<offset) != 0
}
Hint 2: Double Hashing for Multiple Hash Functions
Use two hash functions to simulate k hash functions:
func (bf *BloomFilter) getHashValues(data []byte) []uint64 {
	// Use two base hash functions: FNV and Murmur
	h1 := fnvHash(data)
	h2 := murmurHash(data)

	hashes := make([]uint64, bf.k)
	for i := uint(0); i < bf.k; i++ {
		// Double hashing: h1 + i*h2
		hashes[i] = (h1 + uint64(i)*h2) % bf.size
	}
	return hashes
}
This is much faster than computing k separate hash functions.
Hint 3: Optimal Parameter Calculation
Implement the mathematical formulas:
func OptimalParameters(n uint64, p float64) (m uint64, k uint) {
	// m = -(n * ln(p)) / (ln(2))^2
	ln2 := math.Log(2)
	mFloat := -(float64(n) * math.Log(p)) / (ln2 * ln2)
	m = uint64(math.Ceil(mFloat))

	// k = (m/n) * ln(2)
	kFloat := (float64(m) / float64(n)) * ln2
	k = uint(math.Round(kFloat))

	// Ensure reasonable bounds
	if k < 1 {
		k = 1
	}
	if k > 10 {
		k = 10 // More than 10 hash functions is rarely optimal
	}

	return m, k
}
Hint 4: False Positive Rate Estimation
Calculate actual false positive rate based on current state:
func (bf *BloomFilter) EstimateFalsePositiveRate() float64 {
	// p ≈ (1 - e^(-k*n/m))^k
	if bf.n == 0 {
		return 0
	}

	exponent := -float64(bf.k) * float64(bf.n) / float64(bf.size)
	return math.Pow(1-math.Exp(exponent), float64(bf.k))
}
Hint 5: Counting Bloom Filter
For deletions, use counters instead of bits:
type CountingBloomFilter struct {
	counters []uint8 // 4-bit or 8-bit counters
	size     uint64
	k        uint
	mu       sync.RWMutex
}

func (cbf *CountingBloomFilter) Add(data []byte) {
	cbf.mu.Lock()
	defer cbf.mu.Unlock()

	for _, hash := range cbf.getHashValues(data) {
		if cbf.counters[hash] < 255 { // Prevent overflow
			cbf.counters[hash]++
		}
	}
}

func (cbf *CountingBloomFilter) Remove(data []byte) {
	cbf.mu.Lock()
	defer cbf.mu.Unlock()

	for _, hash := range cbf.getHashValues(data) {
		if cbf.counters[hash] > 0 {
			cbf.counters[hash]--
		}
	}
}
Note: Uses more memory but supports deletion.
Solution
Show Complete Solution
Approach
This implementation provides:
- Standard Bloom Filter: Space-efficient with optimal parameters
- Counting Bloom Filter: Supports deletion with counter-based storage
- Scalable Bloom Filter: Dynamically grows to maintain false positive rate
- Hash Functions: Efficient double hashing technique
- Serialization: Save/load filter state
- Statistics: Track performance metrics and estimate false positive rate
Implementation
package bloomfilter

import (
	"bytes"
	"encoding/binary"
	"fmt"
	"hash/fnv"
	"math"
	"sync"

	"github.com/spaolacci/murmur3"
)

// BloomFilter is a space-efficient probabilistic data structure
type BloomFilter struct {
	bits []uint64 // Bit array
	size uint64   // Total number of bits
	k    uint     // Number of hash functions
	n    uint64   // Number of items added
	mu   sync.RWMutex
}

// NewBloomFilter creates a new Bloom filter with specified parameters
func NewBloomFilter(size uint64, k uint) *BloomFilter {
	// Round up size to the nearest multiple of 64 for efficiency
	numWords := (size + 63) / 64
	actualSize := numWords * 64

	return &BloomFilter{
		bits: make([]uint64, numWords),
		size: actualSize,
		k:    k,
		n:    0,
	}
}

// NewOptimalBloomFilter creates a Bloom filter with optimal parameters
// for the expected number of items and desired false positive rate
func NewOptimalBloomFilter(expectedItems uint64, falsePositiveRate float64) *BloomFilter {
	m, k := OptimalParameters(expectedItems, falsePositiveRate)
	return NewBloomFilter(m, k)
}

// OptimalParameters calculates the optimal size and hash count
func OptimalParameters(n uint64, p float64) (m uint64, k uint) {
	if p <= 0 || p >= 1 {
		panic("false positive rate must be between 0 and 1")
	}

	// m = -(n * ln(p)) / (ln(2))^2
	ln2 := math.Log(2)
	mFloat := -(float64(n) * math.Log(p)) / (ln2 * ln2)
	m = uint64(math.Ceil(mFloat))

	// k = (m/n) * ln(2)
	kFloat := (float64(m) / float64(n)) * ln2
	k = uint(math.Round(kFloat))

	// Reasonable bounds
	if k < 1 {
		k = 1
	}
	if k > 20 {
		k = 20
	}

	return m, k
}

// Add inserts an item into the Bloom filter
func (bf *BloomFilter) Add(data []byte) {
	bf.mu.Lock()
	defer bf.mu.Unlock()

	hashes := bf.getHashValues(data)
	for _, hash := range hashes {
		bf.setBit(hash)
	}
	bf.n++
}

// AddString is a convenience method to add a string
func (bf *BloomFilter) AddString(s string) {
	bf.Add([]byte(s))
}

// MayContain checks if an item might be in the filter.
// Returns false: definitely not in filter.
// Returns true: probably in filter.
func (bf *BloomFilter) MayContain(data []byte) bool {
	bf.mu.RLock()
	defer bf.mu.RUnlock()

	hashes := bf.getHashValues(data)
	for _, hash := range hashes {
		if !bf.getBit(hash) {
			return false // Definitely not present
		}
	}
	return true // Probably present
}

// MayContainString is a convenience method for strings
func (bf *BloomFilter) MayContainString(s string) bool {
	return bf.MayContain([]byte(s))
}

// Clear resets the Bloom filter
func (bf *BloomFilter) Clear() {
	bf.mu.Lock()
	defer bf.mu.Unlock()

	for i := range bf.bits {
		bf.bits[i] = 0
	}
	bf.n = 0
}

// Count returns the number of items added
func (bf *BloomFilter) Count() uint64 {
	bf.mu.RLock()
	defer bf.mu.RUnlock()
	return bf.n
}

// Size returns the size of the bit array
func (bf *BloomFilter) Size() uint64 {
	return bf.size
}

// K returns the number of hash functions
func (bf *BloomFilter) K() uint {
	return bf.k
}

// EstimateFalsePositiveRate estimates the current false positive probability
func (bf *BloomFilter) EstimateFalsePositiveRate() float64 {
	bf.mu.RLock()
	defer bf.mu.RUnlock()

	if bf.n == 0 {
		return 0
	}

	// p ≈ (1 - e^(-k*n/m))^k
	exponent := -float64(bf.k) * float64(bf.n) / float64(bf.size)
	return math.Pow(1-math.Exp(exponent), float64(bf.k))
}

// FillRatio returns the proportion of bits set to 1
func (bf *BloomFilter) FillRatio() float64 {
	bf.mu.RLock()
	defer bf.mu.RUnlock()

	setBits := uint64(0)
	for _, word := range bf.bits {
		setBits += uint64(popcount(word))
	}

	return float64(setBits) / float64(bf.size)
}

// Internal methods

func (bf *BloomFilter) setBit(i uint64) {
	idx := i / 64
	offset := i % 64
	bf.bits[idx] |= 1 << offset
}

func (bf *BloomFilter) getBit(i uint64) bool {
	idx := i / 64
	offset := i % 64
	return bf.bits[idx]&(1<<offset) != 0
}

func (bf *BloomFilter) getHashValues(data []byte) []uint64 {
	// Use double hashing: two base hash functions simulate k hash functions
	h1 := hashFNV(data)
	h2 := hashMurmur(data)

	hashes := make([]uint64, bf.k)
	for i := uint(0); i < bf.k; i++ {
		// Combined hash: h1 + i*h2
		hashes[i] = (h1 + uint64(i)*h2) % bf.size
	}

	return hashes
}

// Hash functions

func hashFNV(data []byte) uint64 {
	h := fnv.New64a()
	h.Write(data)
	return h.Sum64()
}

func hashMurmur(data []byte) uint64 {
	h := murmur3.New64()
	h.Write(data)
	return h.Sum64()
}

// popcount counts the number of set bits in a uint64
// (math/bits.OnesCount64 from the standard library does the same)
func popcount(x uint64) int {
	count := 0
	for x != 0 {
		count++
		x &= x - 1 // Clear the lowest set bit
	}
	return count
}

// Serialization

// MarshalBinary implements encoding.BinaryMarshaler
func (bf *BloomFilter) MarshalBinary() ([]byte, error) {
	bf.mu.RLock()
	defer bf.mu.RUnlock()

	buf := new(bytes.Buffer)

	// Write metadata
	binary.Write(buf, binary.LittleEndian, bf.size)
	binary.Write(buf, binary.LittleEndian, uint32(bf.k))
	binary.Write(buf, binary.LittleEndian, bf.n)

	// Write bit array
	for _, word := range bf.bits {
		binary.Write(buf, binary.LittleEndian, word)
	}

	return buf.Bytes(), nil
}

// UnmarshalBinary implements encoding.BinaryUnmarshaler
func (bf *BloomFilter) UnmarshalBinary(data []byte) error {
	bf.mu.Lock()
	defer bf.mu.Unlock()

	buf := bytes.NewReader(data)

	// Read metadata
	binary.Read(buf, binary.LittleEndian, &bf.size)
	var k32 uint32
	binary.Read(buf, binary.LittleEndian, &k32)
	bf.k = uint(k32)
	binary.Read(buf, binary.LittleEndian, &bf.n)

	// Read bit array
	numWords := (bf.size + 63) / 64
	bf.bits = make([]uint64, numWords)
	for i := range bf.bits {
		binary.Read(buf, binary.LittleEndian, &bf.bits[i])
	}

	return nil
}

// CountingBloomFilter supports deletion using counters
type CountingBloomFilter struct {
	counters []uint8 // 8-bit counters
	size     uint64
	k        uint
	n        uint64
	mu       sync.RWMutex
}

// NewCountingBloomFilter creates a counting Bloom filter
func NewCountingBloomFilter(size uint64, k uint) *CountingBloomFilter {
	return &CountingBloomFilter{
		counters: make([]uint8, size),
		size:     size,
		k:        k,
		n:        0,
	}
}

// Add inserts an item
func (cbf *CountingBloomFilter) Add(data []byte) {
	cbf.mu.Lock()
	defer cbf.mu.Unlock()

	hashes := cbf.getHashValues(data)
	for _, hash := range hashes {
		if cbf.counters[hash] < 255 { // Prevent overflow
			cbf.counters[hash]++
		}
	}
	cbf.n++
}

// Remove deletes an item
func (cbf *CountingBloomFilter) Remove(data []byte) {
	cbf.mu.Lock()
	defer cbf.mu.Unlock()

	hashes := cbf.getHashValues(data)
	for _, hash := range hashes {
		if cbf.counters[hash] > 0 {
			cbf.counters[hash]--
		}
	}
	if cbf.n > 0 {
		cbf.n--
	}
}

// MayContain checks if an item might be present
func (cbf *CountingBloomFilter) MayContain(data []byte) bool {
	cbf.mu.RLock()
	defer cbf.mu.RUnlock()

	hashes := cbf.getHashValues(data)
	for _, hash := range hashes {
		if cbf.counters[hash] == 0 {
			return false
		}
	}
	return true
}

func (cbf *CountingBloomFilter) getHashValues(data []byte) []uint64 {
	h1 := hashFNV(data)
	h2 := hashMurmur(data)

	hashes := make([]uint64, cbf.k)
	for i := uint(0); i < cbf.k; i++ {
		hashes[i] = (h1 + uint64(i)*h2) % cbf.size
	}
	return hashes
}

// ScalableBloomFilter grows dynamically to maintain the false positive rate
type ScalableBloomFilter struct {
	filters []*BloomFilter
	p       float64 // Target false positive rate
	r       float64 // Tightening ratio
	s       uint    // Growth factor
	mu      sync.RWMutex
}

// NewScalableBloomFilter creates a scalable Bloom filter
func NewScalableBloomFilter(initialCapacity uint64, falsePositiveRate, tighteningRatio float64) *ScalableBloomFilter {
	sbf := &ScalableBloomFilter{
		filters: make([]*BloomFilter, 0),
		p:       falsePositiveRate,
		r:       tighteningRatio,
		s:       2, // Double size each time
	}

	// Create the initial filter
	sbf.addFilter(initialCapacity)

	return sbf
}

func (sbf *ScalableBloomFilter) addFilter(capacity uint64) {
	// Each subsequent filter has a tighter false positive rate
	p := sbf.p * math.Pow(sbf.r, float64(len(sbf.filters)))
	filter := NewOptimalBloomFilter(capacity, p)
	sbf.filters = append(sbf.filters, filter)
}

// Add inserts an item
func (sbf *ScalableBloomFilter) Add(data []byte) {
	sbf.mu.Lock()
	defer sbf.mu.Unlock()

	// Add to the current filter
	currentFilter := sbf.filters[len(sbf.filters)-1]
	currentFilter.Add(data)

	// Check if we need to grow
	if currentFilter.EstimateFalsePositiveRate() > sbf.p {
		// Create a new filter with larger capacity
		newCapacity := currentFilter.n * uint64(sbf.s)
		sbf.addFilter(newCapacity)
	}
}

// MayContain checks all filters
func (sbf *ScalableBloomFilter) MayContain(data []byte) bool {
	sbf.mu.RLock()
	defer sbf.mu.RUnlock()

	// An item may live in any generation, so check every filter
	for _, filter := range sbf.filters {
		if filter.MayContain(data) {
			return true
		}
	}
	return false
}

// Count returns the total number of items added
func (sbf *ScalableBloomFilter) Count() uint64 {
	sbf.mu.RLock()
	defer sbf.mu.RUnlock()

	total := uint64(0)
	for _, filter := range sbf.filters {
		total += filter.Count()
	}
	return total
}

// Utility functions

// Union creates a new Bloom filter that is the union of two filters.
// Both filters must have the same size and number of hash functions.
func Union(bf1, bf2 *BloomFilter) (*BloomFilter, error) {
	if bf1.size != bf2.size || bf1.k != bf2.k {
		return nil, fmt.Errorf("filters must have same size and k")
	}

	result := NewBloomFilter(bf1.size, bf1.k)

	bf1.mu.RLock()
	bf2.mu.RLock()
	defer bf1.mu.RUnlock()
	defer bf2.mu.RUnlock()

	// Bitwise OR
	for i := range result.bits {
		result.bits[i] = bf1.bits[i] | bf2.bits[i]
	}

	result.n = bf1.n + bf2.n

	return result, nil
}

// Intersection creates a new Bloom filter that is the intersection of two filters
func Intersection(bf1, bf2 *BloomFilter) (*BloomFilter, error) {
	if bf1.size != bf2.size || bf1.k != bf2.k {
		return nil, fmt.Errorf("filters must have same size and k")
	}

	result := NewBloomFilter(bf1.size, bf1.k)

	bf1.mu.RLock()
	bf2.mu.RLock()
	defer bf1.mu.RUnlock()
	defer bf2.mu.RUnlock()

	// Bitwise AND
	for i := range result.bits {
		result.bits[i] = bf1.bits[i] & bf2.bits[i]
	}

	// Note: n is approximate after intersection
	if bf1.n < bf2.n {
		result.n = bf1.n
	} else {
		result.n = bf2.n
	}

	return result, nil
}
Testing
package bloomfilter

import (
	"fmt"
	"testing"
)

func TestBasicOperations(t *testing.T) {
	bf := NewOptimalBloomFilter(1000, 0.01)

	// Add items
	items := []string{"apple", "banana", "cherry", "date"}
	for _, item := range items {
		bf.AddString(item)
	}

	// Test membership
	for _, item := range items {
		if !bf.MayContainString(item) {
			t.Errorf("Item %s should be in filter", item)
		}
	}

	// Test non-membership
	if bf.MayContainString("grape") {
		t.Log("False positive for 'grape'")
	}
}

func TestFalsePositiveRate(t *testing.T) {
	n := uint64(10000)
	p := 0.01 // 1% target false positive rate

	bf := NewOptimalBloomFilter(n, p)

	// Add n items
	for i := uint64(0); i < n; i++ {
		bf.AddString(fmt.Sprintf("item%d", i))
	}

	// Test false positive rate
	falsePositives := 0
	testSize := 10000

	for i := 0; i < testSize; i++ {
		// Test items not in filter
		if bf.MayContainString(fmt.Sprintf("notinset%d", i)) {
			falsePositives++
		}
	}

	actualRate := float64(falsePositives) / float64(testSize)
	estimatedRate := bf.EstimateFalsePositiveRate()

	t.Logf("Target FP rate: %.4f", p)
	t.Logf("Actual FP rate: %.4f", actualRate)
	t.Logf("Estimated FP rate: %.4f", estimatedRate)

	// Allow some tolerance
	if actualRate > p*2 {
		t.Errorf("False positive rate too high: %.4f > %.4f", actualRate, p*2)
	}
}

func TestOptimalParameters(t *testing.T) {
	tests := []struct {
		n uint64
		p float64
	}{
		{1000, 0.01},
		{10000, 0.001},
		{100000, 0.01},
	}

	for _, test := range tests {
		m, k := OptimalParameters(test.n, test.p)
		t.Logf("n=%d, p=%.4f => m=%d, k=%d, memory=%d KB",
			test.n, test.p, m, k, m/8/1024)
	}
}

func TestCountingBloomFilter(t *testing.T) {
	cbf := NewCountingBloomFilter(10000, 4)

	// Add and remove
	cbf.Add([]byte("test"))
	if !cbf.MayContain([]byte("test")) {
		t.Error("Item should be present after add")
	}

	cbf.Remove([]byte("test"))
	if cbf.MayContain([]byte("test")) {
		t.Error("Item should not be present after remove")
	}
}

func TestScalableBloomFilter(t *testing.T) {
	sbf := NewScalableBloomFilter(100, 0.01, 0.9)

	// Add many items
	for i := 0; i < 1000; i++ {
		sbf.Add([]byte(fmt.Sprintf("item%d", i)))
	}

	// Verify all items present
	for i := 0; i < 1000; i++ {
		if !sbf.MayContain([]byte(fmt.Sprintf("item%d", i))) {
			t.Errorf("Item %d should be present", i)
		}
	}

	t.Logf("Scalable filter grew to %d filters", len(sbf.filters))
}

func TestSerialization(t *testing.T) {
	bf := NewOptimalBloomFilter(1000, 0.01)

	// Add items
	for i := 0; i < 100; i++ {
		bf.AddString(fmt.Sprintf("item%d", i))
	}

	// Serialize
	data, err := bf.MarshalBinary()
	if err != nil {
		t.Fatalf("Failed to marshal: %v", err)
	}

	// Deserialize
	bf2 := &BloomFilter{}
	if err := bf2.UnmarshalBinary(data); err != nil {
		t.Fatalf("Failed to unmarshal: %v", err)
	}

	// Verify items present in deserialized filter
	for i := 0; i < 100; i++ {
		if !bf2.MayContainString(fmt.Sprintf("item%d", i)) {
			t.Errorf("Item %d should be present after deserialization", i)
		}
	}
}

func TestUnionIntersection(t *testing.T) {
	bf1 := NewOptimalBloomFilter(1000, 0.01)
	bf2 := NewOptimalBloomFilter(1000, 0.01)

	// Add overlapping ranges of items to each
	for i := 0; i < 50; i++ {
		bf1.AddString(fmt.Sprintf("item%d", i))
		bf2.AddString(fmt.Sprintf("item%d", i+25))
	}

	// Union should contain all items
	union, err := Union(bf1, bf2)
	if err != nil {
		t.Fatalf("Union failed: %v", err)
	}

	for i := 0; i < 75; i++ {
		if !union.MayContainString(fmt.Sprintf("item%d", i)) {
			t.Errorf("Union should contain item%d", i)
		}
	}

	// Intersection should contain the overlapping items
	intersection, err := Intersection(bf1, bf2)
	if err != nil {
		t.Fatalf("Intersection failed: %v", err)
	}

	for i := 25; i < 50; i++ {
		if !intersection.MayContainString(fmt.Sprintf("item%d", i)) {
			t.Errorf("Intersection should contain item%d", i)
		}
	}
}

func BenchmarkAdd(b *testing.B) {
	bf := NewOptimalBloomFilter(uint64(b.N), 0.01)

	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		bf.AddString(fmt.Sprintf("item%d", i))
	}
}

func BenchmarkMayContain(b *testing.B) {
	bf := NewOptimalBloomFilter(10000, 0.01)

	// Pre-populate
	for i := 0; i < 10000; i++ {
		bf.AddString(fmt.Sprintf("item%d", i))
	}

	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		bf.MayContainString(fmt.Sprintf("item%d", i%10000))
	}
}
Usage Example
package main

import (
	"fmt"

	"bloomfilter"
)

func main() {
	// Example 1: Web crawler URL deduplication
	fmt.Println("=== Web Crawler Example ===")

	// Expect 1 million URLs, 1% false positive rate
	crawler := bloomfilter.NewOptimalBloomFilter(1000000, 0.01)

	urls := []string{
		"https://example.com/page1",
		"https://example.com/page2",
		"https://example.com/page3",
	}

	for _, url := range urls {
		if crawler.MayContainString(url) {
			fmt.Printf("Skipping: %s\n", url)
		} else {
			fmt.Printf("Crawling: %s\n", url)
			crawler.AddString(url)
		}
	}

	fmt.Printf("\nMemory usage: %d KB\n", crawler.Size()/8/1024)
	fmt.Printf("Fill ratio: %.2f%%\n", crawler.FillRatio()*100)
	fmt.Printf("False positive rate: %.4f%%\n", crawler.EstimateFalsePositiveRate()*100)

	// Example 2: Spell checker
	fmt.Println("\n=== Spell Checker Example ===")

	dictionary := bloomfilter.NewOptimalBloomFilter(100000, 0.001)

	// Add dictionary words
	words := []string{"hello", "world", "golang", "programming"}
	for _, word := range words {
		dictionary.AddString(word)
	}

	// Check spelling
	testWords := []string{"hello", "wrld", "golang", "xyz"}
	for _, word := range testWords {
		if dictionary.MayContainString(word) {
			fmt.Printf("'%s' - probably correct\n", word)
		} else {
			fmt.Printf("'%s' - DEFINITELY misspelled\n", word)
		}
	}

	// Example 3: Database cache filter
	fmt.Println("\n=== Database Cache Filter Example ===")

	cacheFilter := bloomfilter.NewOptimalBloomFilter(10000, 0.01)

	// Simulate cached keys
	cachedKeys := []string{"user:1001", "user:1002", "user:1003"}
	for _, key := range cachedKeys {
		cacheFilter.AddString(key)
	}

	// Check before database query
	checkKey := func(key string) {
		if cacheFilter.MayContainString(key) {
			fmt.Printf("%s - Check cache first\n", key)
		} else {
			fmt.Printf("%s - Skip cache, query database directly\n", key)
		}
	}

	checkKey("user:1001") // In cache
	checkKey("user:9999") // Not in cache

	// Example 4: Scalable Bloom filter
	fmt.Println("\n=== Scalable Bloom Filter Example ===")

	scalable := bloomfilter.NewScalableBloomFilter(100, 0.01, 0.9)

	// Add items beyond initial capacity
	for i := 0; i < 500; i++ {
		scalable.Add([]byte(fmt.Sprintf("item%d", i)))
	}

	fmt.Printf("Added %d items\n", scalable.Count())
	// Reporting how many internal filters the scalable filter grew to
	// would require an exported accessor, since the filters field is unexported.
}
Testing Your Solution
1. Basic Functionality
- Add items and verify they return true for MayContain
- Verify items not added return false
- Test that Clear() resets the filter
2. False Positive Rate Validation
- Add n items
- Query 10,000 items NOT in the filter
- Count false positives
- Verify rate is close to target rate
3. Optimal Parameters
- Test that calculated parameters produce expected false positive rate
- Verify memory usage is reasonable
- Check that hash count is in reasonable range
4. Serialization
- Add items to filter
- Serialize to bytes
- Deserialize to new filter
- Verify all items still detected
5. Performance
- Benchmark add operations
- Benchmark query operations
- Compare memory usage vs hash set
- Test with millions of items
Edge Cases
- Empty filter
- Single item
- Many collisions
- Maximum capacity
- Invalid parameters
Bonus Challenges
- Compressed Bloom Filter: Implement compression to further reduce memory usage
- Partitioned Bloom Filter: Split into multiple partitions for better cache locality
- Cuckoo Filter: Implement an alternative that supports deletion without counters
- Bloom Filter Index: Use a Bloom filter to index a database for fast negative lookups
- Network Protocol: Implement Bloom filter synchronization between distributed nodes
- Dynamic Optimal K: Adjust the number of hash functions based on fill ratio
- Bloom Filter Visualization: Create a web UI to visualize the bit array and collisions
Key Takeaways
- Probabilistic data structures trade perfect accuracy for massive space savings
- False positives are possible but false negatives are impossible
- Optimal parameters can be mathematically calculated from n and p
- Double hashing efficiently simulates multiple independent hash functions
- Bit manipulation enables extremely compact storage
- Real-world applications include caching, databases, networks, and blockchain
- Memory savings can be 90-99% compared to hash sets
- Counting Bloom filters enable deletion at cost of more memory
- Scalable Bloom filters maintain false positive rate as they grow
- Trade-offs between accuracy, memory, and query time must be understood
References
- Original Paper: "Space/Time Trade-offs in Hash Coding with Allowable Errors" by Burton H. Bloom
- Analysis: "Network Applications of Bloom Filters: A Survey" by Broder and Mitzenmacher
- Counting Bloom Filters: "Summary Cache: A Scalable Wide-Area Web Cache Sharing Protocol" by Fan et al.
- Scalable Bloom Filters: "Scalable Bloom Filters" by Almeida et al.
- Cuckoo Filters: "Cuckoo Filter: Practically Better Than Bloom" by Fan et al.
- Applications: Google Chrome safe browsing, Bitcoin SPV, Cassandra, Akamai
- Hash Functions: MurmurHash3 and FNV-1a documentation
- Wikipedia: Comprehensive overview with mathematical analysis