Exercise: Bloom Filter Probabilistic Data Structure
Difficulty - Advanced
Estimated Time - 2-4 hours
Learning Objectives
- Understand probabilistic data structures and their trade-offs
- Implement Bloom filters with optimal hash function selection
- Master bit manipulation and efficient memory usage
- Calculate and tune false positive rates
- Apply Bloom filters to real-world performance optimization problems
- Implement advanced variants (counting and scalable Bloom filters)
Problem Statement
How can you check if an element exists in a massive dataset using minimal memory? A naive hash set might require gigabytes of RAM for millions of items, but a Bloom filter can answer "definitely not present" or "probably present" using just kilobytes of memory.
A Bloom filter is a space-efficient probabilistic data structure that tests whether an element is a member of a set. It can produce false positives but never false negatives. This trade-off makes Bloom filters ideal for applications where memory is constrained and occasional false positives are acceptable.
Real-World Scenario:
Consider a web crawler that visits billions of URLs. Before crawling a URL, it needs to check if it's already been visited. Storing billions of URLs in a hash set would require enormous memory:
// Without Bloom filter: huge memory usage
type NaiveCrawler struct {
	visited map[string]bool // 1 billion URLs * 100 bytes = 100 GB!
}

func (c *NaiveCrawler) AlreadyVisited(url string) bool {
	return c.visited[url]
}

// With Bloom filter: minimal memory usage
type EfficientCrawler struct {
	bloom *BloomFilter // 1 billion URLs * 10 bits = 1.25 GB
	// If false positive, re-crawl
}

func (c *EfficientCrawler) ProbablyVisited(url string) bool {
	return c.bloom.MayContainString(url)
}
With a 1% false positive rate, the Bloom filter might incorrectly say 1% of URLs were visited when they weren't, causing the crawler to skip them. But it uses 98% less memory than a hash set!
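The arithmetic behind those figures is easy to verify; a quick sketch (the helper name savingsPercent and the 100-bytes-per-entry / 10-bits-per-item estimates are just the scenario's assumptions):

```go
package main

import "fmt"

// savingsPercent compares a hash set at bytesPerEntry per URL with a
// Bloom filter at bitsPerItem per URL, returning the memory saved in percent.
func savingsPercent(bytesPerEntry, bitsPerItem float64) float64 {
	return 100 * (1 - (bitsPerItem/8)/bytesPerEntry)
}

func main() {
	const n = 1_000_000_000 // one billion URLs
	fmt.Printf("hash set: %.0f GB\n", float64(n)*100/1e9)  // 100 GB
	fmt.Printf("bloom:    %.2f GB\n", float64(n)*10/8/1e9) // 1.25 GB
	fmt.Printf("savings:  %.2f%%\n", savingsPercent(100, 10))
}
```

At 10 bits per item the saving is 98.75%, which matches the "98% less memory" claim above.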
Requirements
Functional Requirements
- Basic Bloom Filter
  - Initialize with size and number of hash functions
  - Add(item): Insert an item into the filter
  - MayContain(item): Check if an item might be in the filter
  - Clear(): Reset all bits to zero
- Optimal Parameter Calculation
  - Given an expected number of items and a desired false positive rate
  - Calculate optimal bit array size
  - Calculate optimal number of hash functions
  - Formulas: m = -(n * ln(p)) / (ln(2))^2, k = (m/n) * ln(2)
- Multiple Hash Functions
  - Implement k independent hash functions
  - Use the double hashing technique for efficiency
  - Ensure uniform distribution across the bit array
- Counting Bloom Filter
  - Support deletion by using counters instead of bits
  - Each position is a counter
  - Increment on add, decrement on remove
  - Handle counter overflow
- Scalable Bloom Filter
  - Dynamically grow as more items are added
  - Chain multiple Bloom filters with increasing sizes
  - Maintain a consistent false positive rate as the filter scales
Non-Functional Requirements
- Space Efficiency: Use bit-level operations for minimal memory
- Performance: O(k) time for add and query operations
- Accuracy: Achieve specified false positive rate
- Concurrency: Thread-safe for concurrent access
- Serialization: Support saving/loading filter state
Background and Theory
Bloom filters were invented by Burton Howard Bloom in 1970. They work by using multiple hash functions to set bits in a bit array.
How Bloom Filters Work
Data Structure:
- Bit array of size m
- k different hash functions
Add Operation:
- Hash the item with each of k hash functions
- Set the bit at each hash value to 1
Query Operation:
- Hash the item with each of k hash functions
- Check if all k bits are set to 1
- If any bit is 0: definitely not in set
- If all bits are 1: probably in set
Why False Positives?
Multiple different items can hash to the same bits. After adding many items, many bits are set to 1. A query for a non-existent item might find all its bits set to 1 by chance.
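A toy illustration of this mechanism, with m = 8 bits, k = 2, and hand-picked bit positions standing in for hash values (no real hash functions involved):

```go
package main

import "fmt"

// set turns on the given bit positions in an 8-bit "bit array".
func set(bits uint8, positions ...uint) uint8 {
	for _, p := range positions {
		bits |= 1 << p
	}
	return bits
}

// allSet reports whether every given bit position is 1 --
// the Bloom filter query rule.
func allSet(bits uint8, positions ...uint) bool {
	for _, p := range positions {
		if bits&(1<<p) == 0 {
			return false
		}
	}
	return true
}

func main() {
	var bits uint8
	bits = set(bits, 1, 4) // add "cat" -> bits {1, 4}
	bits = set(bits, 2, 5) // add "dog" -> bits {2, 5}

	fmt.Println(allSet(bits, 1, 4)) // "cat" -> true (actually present)
	fmt.Println(allSet(bits, 1, 5)) // "fox" -> {1, 5}: true -- a false positive
	fmt.Println(allSet(bits, 3, 6)) // "emu" -> {3, 6}: false, definitely absent
}
```

"fox" was never added, but "cat" contributed bit 1 and "dog" contributed bit 5, so its query finds all its bits set.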
Mathematical Analysis
False Positive Probability:
After inserting n items into a Bloom filter with m bits and k hash functions:
P(false positive) ≈ (1 - e^(-kn/m))^k
Optimal Number of Hash Functions:
k = (m/n) * ln(2) ≈ 0.693 * m/n
Optimal Bit Array Size:
For desired false positive rate p and n items:
m = -(n * ln(p)) / (ln(2))^2 ≈ -2.08 * n * ln(p)
Example:
- n = 1,000,000 items
- p = 0.01
- m ≈ 9,585,059 bits ≈ 1.14 MB
- k ≈ 7 hash functions
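These numbers can be reproduced directly from the formulas above; a minimal sketch (the function name optimalParams is only for this example):

```go
package main

import (
	"fmt"
	"math"
)

// optimalParams applies m = -n*ln(p)/(ln 2)^2 and k = (m/n)*ln 2.
func optimalParams(n uint64, p float64) (m uint64, k uint) {
	ln2 := math.Log(2)
	m = uint64(math.Ceil(-(float64(n) * math.Log(p)) / (ln2 * ln2)))
	k = uint(math.Round(float64(m) / float64(n) * ln2))
	return m, k
}

func main() {
	m, k := optimalParams(1_000_000, 0.01)
	// m ≈ 9,585,059 bits ≈ 1.14 MB, k = 7
	fmt.Printf("m = %d bits (%.2f MB), k = %d\n", m, float64(m)/8/1024/1024, k)
}
```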
Use Cases
- Google Chrome: Uses Bloom filter to check if URL is malicious before full lookup
- Bitcoin: Uses Bloom filters in SPV clients
- Cassandra & HBase: Use Bloom filters to avoid disk reads for non-existent data
- Akamai CDN: Uses Bloom filters for cache filtering
- Spell Checkers: Quick check if word is in dictionary
- Network Routers: Packet filtering and duplicate detection
- Blockchain: Ethereum uses Bloom filters for log filtering
Implementation Challenges
Challenge 1: Efficient Hash Function Generation
Computing k independent hash functions is expensive. How do you efficiently generate multiple hashes?
Solution Approach:
Use double hashing technique: compute only 2 hash functions, then simulate k hashes:
hash_i(x) = (h1(x) + i * h2(x)) mod m
This gives good distribution while being much faster than computing k separate hashes.
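A self-contained sketch of the technique (two FNV variants from the standard library, FNV-1a and FNV-1, stand in here for the two independent base hashes; the function name doubleHashIndices is illustrative):

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// doubleHashIndices derives k bit positions from only two base hashes,
// using hash_i(x) = (h1 + i*h2) mod m.
func doubleHashIndices(data []byte, k uint, m uint64) []uint64 {
	a := fnv.New64a() // FNV-1a as h1
	a.Write(data)
	h1 := a.Sum64()

	b := fnv.New64() // FNV-1 as h2
	b.Write(data)
	h2 := b.Sum64()

	idx := make([]uint64, k)
	for i := uint(0); i < k; i++ {
		idx[i] = (h1 + uint64(i)*h2) % m // unsigned overflow simply wraps
	}
	return idx
}

func main() {
	// Seven positions for one key, each in [0, 1024)
	for _, pos := range doubleHashIndices([]byte("example.com"), 7, 1024) {
		fmt.Println(pos)
	}
}
```

Only two hash computations happen per item, regardless of k; the loop is pure integer arithmetic.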
Challenge 2: Bit-Level Operations
To minimize memory, you need to pack bits tightly and manipulate individual bits efficiently.
Solution Approach:
- Use []uint64 as the underlying storage
- Calculate the word index: index / 64
- Calculate the bit offset: index % 64
- Set a bit with bitwise OR: array[wordIdx] |= 1 << offset
- Check a bit with bitwise AND: array[wordIdx] & (1 << offset) != 0
Challenge 3: Optimal Parameter Selection
Users shouldn't need to understand Bloom filter mathematics. Provide easy configuration.
Solution Approach:
- Provide a NewOptimal(expectedItems, falsePositiveRate) constructor
- Automatically calculate optimal m and k
- Validate parameters and warn about unrealistic configurations
- Provide helper functions to estimate performance
Challenge 4: Thread Safety
Multiple goroutines might query and update the Bloom filter concurrently.
Solution Approach:
- Use sync.RWMutex for concurrent access
- Read operations use the read lock
- Write operations use write lock
- Consider lock-free implementations using atomic operations for higher performance
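The lock-free direction can be sketched as a compare-and-swap loop on the word containing the target bit (setBitAtomic is an illustrative helper, not part of the required API):

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// setBitAtomic sets bit i without a mutex: it retries a compare-and-swap
// until the containing word is updated, or returns early if the bit is
// already set (a common, benign case in Bloom filters).
func setBitAtomic(bits []uint64, i uint64) {
	idx, mask := i/64, uint64(1)<<(i%64)
	for {
		old := atomic.LoadUint64(&bits[idx])
		if old&mask != 0 || atomic.CompareAndSwapUint64(&bits[idx], old, old|mask) {
			return
		}
	}
}

func main() {
	bits := make([]uint64, 16) // 1024 bits
	var wg sync.WaitGroup
	for g := 0; g < 8; g++ { // 8 goroutines racing on the same words
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := uint64(0); i < 1024; i++ {
				setBitAtomic(bits, i)
			}
		}()
	}
	wg.Wait()
	for _, w := range bits {
		if w != ^uint64(0) {
			fmt.Println("missing bits!")
			return
		}
	}
	fmt.Println("all 1024 bits set") // no updates lost despite the races
}
```

Queries can then use atomic loads, avoiding lock contention entirely; the trade-off is that multi-bit operations (like Clear) are no longer atomic as a unit.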
Hints
Hint 1: Bit Array Design
Use a slice of uint64 for efficient bit storage and operations:
type BloomFilter struct {
	bits []uint64 // Bit array
	size uint64   // Total number of bits
	k    uint     // Number of hash functions
	n    uint64   // Number of items added
	mu   sync.RWMutex
}

// Set bit at position i
func (bf *BloomFilter) setBit(i uint64) {
	idx := i / 64
	offset := i % 64
	bf.bits[idx] |= 1 << offset
}

// Check if bit at position i is set
func (bf *BloomFilter) getBit(i uint64) bool {
	idx := i / 64
	offset := i % 64
	return bf.bits[idx]&(1<<offset) != 0
}
Hint 2: Double Hashing for Multiple Hash Functions
Use two hash functions to simulate k hash functions:
func (bf *BloomFilter) getHashValues(data []byte) []uint64 {
	// Use two base hash functions: FNV and Murmur
	h1 := fnvHash(data)
	h2 := murmurHash(data)

	hashes := make([]uint64, bf.k)
	for i := uint(0); i < bf.k; i++ {
		// Double hashing: h1 + i*h2
		hashes[i] = (h1 + uint64(i)*h2) % bf.size
	}
	return hashes
}
This is much faster than computing k separate hash functions.
Hint 3: Optimal Parameter Calculation
Implement the mathematical formulas:
func OptimalParameters(n uint64, p float64) (m uint64, k uint) {
	// m = -(n * ln(p)) / (ln(2))^2
	ln2 := math.Log(2)
	mFloat := -(float64(n) * math.Log(p)) / (ln2 * ln2)
	m = uint64(math.Ceil(mFloat))

	// k = (m/n) * ln(2)
	kFloat := (float64(m) / float64(n)) * ln2
	k = uint(math.Round(kFloat))

	// Ensure reasonable bounds
	if k < 1 {
		k = 1
	}
	if k > 10 {
		k = 10 // More than 10 hash functions is rarely optimal
	}

	return m, k
}
Hint 4: False Positive Rate Estimation
Calculate actual false positive rate based on current state:
func (bf *BloomFilter) EstimateFalsePositiveRate() float64 {
	// p ≈ (1 - e^(-k*n/m))^k
	if bf.n == 0 {
		return 0
	}

	exponent := -float64(bf.k) * float64(bf.n) / float64(bf.size)
	return math.Pow(1-math.Exp(exponent), float64(bf.k))
}
Hint 5: Counting Bloom Filter
For deletions, use counters instead of bits:
type CountingBloomFilter struct {
	counters []uint8 // 4-bit or 8-bit counters
	size     uint64
	k        uint
	mu       sync.RWMutex
}

func (cbf *CountingBloomFilter) Add(data []byte) {
	cbf.mu.Lock()
	defer cbf.mu.Unlock()

	for _, hash := range cbf.getHashValues(data) {
		if cbf.counters[hash] < 255 { // Prevent overflow
			cbf.counters[hash]++
		}
	}
}

func (cbf *CountingBloomFilter) Remove(data []byte) {
	cbf.mu.Lock()
	defer cbf.mu.Unlock()

	for _, hash := range cbf.getHashValues(data) {
		if cbf.counters[hash] > 0 {
			cbf.counters[hash]--
		}
	}
}
Note: Uses more memory but supports deletion.
Solution
Show Complete Solution
Approach
This implementation provides:
- Standard Bloom Filter: Space-efficient with optimal parameters
- Counting Bloom Filter: Supports deletion with counter-based storage
- Scalable Bloom Filter: Dynamically grows to maintain false positive rate
- Hash Functions: Efficient double hashing technique
- Serialization: Save/load filter state
- Statistics: Track performance metrics and estimate false positive rate
Implementation
package bloomfilter

import (
	"bytes"
	"encoding/binary"
	"fmt"
	"hash/fnv"
	"math"
	"sync"

	"github.com/spaolacci/murmur3"
)

// BloomFilter is a space-efficient probabilistic data structure
type BloomFilter struct {
	bits []uint64 // Bit array
	size uint64   // Total number of bits
	k    uint     // Number of hash functions
	n    uint64   // Number of items added
	mu   sync.RWMutex
}

// NewBloomFilter creates a new Bloom filter with specified parameters
func NewBloomFilter(size uint64, k uint) *BloomFilter {
	// Round up size to the nearest multiple of 64 for efficiency
	numWords := (size + 63) / 64
	actualSize := numWords * 64

	return &BloomFilter{
		bits: make([]uint64, numWords),
		size: actualSize,
		k:    k,
		n:    0,
	}
}

// NewOptimalBloomFilter creates a Bloom filter with optimal parameters
// for the expected number of items and desired false positive rate
func NewOptimalBloomFilter(expectedItems uint64, falsePositiveRate float64) *BloomFilter {
	m, k := OptimalParameters(expectedItems, falsePositiveRate)
	return NewBloomFilter(m, k)
}

// OptimalParameters calculates the optimal size and hash count
func OptimalParameters(n uint64, p float64) (m uint64, k uint) {
	if p <= 0 || p >= 1 {
		panic("false positive rate must be between 0 and 1")
	}

	// m = -(n * ln(p)) / (ln(2))^2
	ln2 := math.Log(2)
	mFloat := -(float64(n) * math.Log(p)) / (ln2 * ln2)
	m = uint64(math.Ceil(mFloat))

	// k = (m/n) * ln(2)
	kFloat := (float64(m) / float64(n)) * ln2
	k = uint(math.Round(kFloat))

	// Reasonable bounds
	if k < 1 {
		k = 1
	}
	if k > 20 {
		k = 20
	}

	return m, k
}

// Add inserts an item into the Bloom filter
func (bf *BloomFilter) Add(data []byte) {
	bf.mu.Lock()
	defer bf.mu.Unlock()

	hashes := bf.getHashValues(data)
	for _, hash := range hashes {
		bf.setBit(hash)
	}
	bf.n++
}

// AddString is a convenience method to add a string
func (bf *BloomFilter) AddString(s string) {
	bf.Add([]byte(s))
}

// MayContain checks if an item might be in the filter.
// Returns false: definitely not in filter.
// Returns true: probably in filter.
func (bf *BloomFilter) MayContain(data []byte) bool {
	bf.mu.RLock()
	defer bf.mu.RUnlock()

	hashes := bf.getHashValues(data)
	for _, hash := range hashes {
		if !bf.getBit(hash) {
			return false // Definitely not present
		}
	}
	return true // Probably present
}

// MayContainString is a convenience method for strings
func (bf *BloomFilter) MayContainString(s string) bool {
	return bf.MayContain([]byte(s))
}

// Clear resets the Bloom filter
func (bf *BloomFilter) Clear() {
	bf.mu.Lock()
	defer bf.mu.Unlock()

	for i := range bf.bits {
		bf.bits[i] = 0
	}
	bf.n = 0
}

// Count returns the number of items added
func (bf *BloomFilter) Count() uint64 {
	bf.mu.RLock()
	defer bf.mu.RUnlock()
	return bf.n
}

// Size returns the size of the bit array
func (bf *BloomFilter) Size() uint64 {
	return bf.size
}

// K returns the number of hash functions
func (bf *BloomFilter) K() uint {
	return bf.k
}

// EstimateFalsePositiveRate estimates the current false positive probability
func (bf *BloomFilter) EstimateFalsePositiveRate() float64 {
	bf.mu.RLock()
	defer bf.mu.RUnlock()

	if bf.n == 0 {
		return 0
	}

	// p ≈ (1 - e^(-k*n/m))^k
	exponent := -float64(bf.k) * float64(bf.n) / float64(bf.size)
	return math.Pow(1-math.Exp(exponent), float64(bf.k))
}

// FillRatio returns the proportion of bits set to 1
func (bf *BloomFilter) FillRatio() float64 {
	bf.mu.RLock()
	defer bf.mu.RUnlock()

	setBits := uint64(0)
	for _, word := range bf.bits {
		setBits += uint64(popcount(word))
	}

	return float64(setBits) / float64(bf.size)
}

// Internal methods

func (bf *BloomFilter) setBit(i uint64) {
	idx := i / 64
	offset := i % 64
	bf.bits[idx] |= 1 << offset
}

func (bf *BloomFilter) getBit(i uint64) bool {
	idx := i / 64
	offset := i % 64
	return bf.bits[idx]&(1<<offset) != 0
}

func (bf *BloomFilter) getHashValues(data []byte) []uint64 {
	// Use double hashing: two base hash functions simulate k hash functions
	h1 := hashFNV(data)
	h2 := hashMurmur(data)

	hashes := make([]uint64, bf.k)
	for i := uint(0); i < bf.k; i++ {
		// Combined hash: h1 + i*h2
		hashes[i] = (h1 + uint64(i)*h2) % bf.size
	}

	return hashes
}

// Hash functions

func hashFNV(data []byte) uint64 {
	h := fnv.New64a()
	h.Write(data)
	return h.Sum64()
}

func hashMurmur(data []byte) uint64 {
	h := murmur3.New64()
	h.Write(data)
	return h.Sum64()
}

// popcount counts the number of set bits in a uint64
// (math/bits.OnesCount64 from the standard library does the same)
func popcount(x uint64) int {
	count := 0
	for x != 0 {
		count++
		x &= x - 1 // Clear the lowest set bit
	}
	return count
}

// Serialization

// MarshalBinary implements encoding.BinaryMarshaler
func (bf *BloomFilter) MarshalBinary() ([]byte, error) {
	bf.mu.RLock()
	defer bf.mu.RUnlock()

	buf := new(bytes.Buffer)

	// Write metadata
	binary.Write(buf, binary.LittleEndian, bf.size)
	binary.Write(buf, binary.LittleEndian, uint32(bf.k))
	binary.Write(buf, binary.LittleEndian, bf.n)

	// Write bit array
	for _, word := range bf.bits {
		binary.Write(buf, binary.LittleEndian, word)
	}

	return buf.Bytes(), nil
}

// UnmarshalBinary implements encoding.BinaryUnmarshaler
func (bf *BloomFilter) UnmarshalBinary(data []byte) error {
	bf.mu.Lock()
	defer bf.mu.Unlock()

	buf := bytes.NewReader(data)

	// Read metadata
	binary.Read(buf, binary.LittleEndian, &bf.size)
	var k32 uint32
	binary.Read(buf, binary.LittleEndian, &k32)
	bf.k = uint(k32)
	binary.Read(buf, binary.LittleEndian, &bf.n)

	// Read bit array
	numWords := (bf.size + 63) / 64
	bf.bits = make([]uint64, numWords)
	for i := range bf.bits {
		binary.Read(buf, binary.LittleEndian, &bf.bits[i])
	}

	return nil
}

// CountingBloomFilter supports deletion using counters
type CountingBloomFilter struct {
	counters []uint8 // 8-bit counters
	size     uint64
	k        uint
	n        uint64
	mu       sync.RWMutex
}

// NewCountingBloomFilter creates a counting Bloom filter
func NewCountingBloomFilter(size uint64, k uint) *CountingBloomFilter {
	return &CountingBloomFilter{
		counters: make([]uint8, size),
		size:     size,
		k:        k,
		n:        0,
	}
}

// Add inserts an item
func (cbf *CountingBloomFilter) Add(data []byte) {
	cbf.mu.Lock()
	defer cbf.mu.Unlock()

	hashes := cbf.getHashValues(data)
	for _, hash := range hashes {
		if cbf.counters[hash] < 255 { // Prevent overflow
			cbf.counters[hash]++
		}
	}
	cbf.n++
}

// Remove deletes an item
func (cbf *CountingBloomFilter) Remove(data []byte) {
	cbf.mu.Lock()
	defer cbf.mu.Unlock()

	hashes := cbf.getHashValues(data)
	for _, hash := range hashes {
		if cbf.counters[hash] > 0 {
			cbf.counters[hash]--
		}
	}
	if cbf.n > 0 {
		cbf.n--
	}
}

// MayContain checks if an item might be present
func (cbf *CountingBloomFilter) MayContain(data []byte) bool {
	cbf.mu.RLock()
	defer cbf.mu.RUnlock()

	hashes := cbf.getHashValues(data)
	for _, hash := range hashes {
		if cbf.counters[hash] == 0 {
			return false
		}
	}
	return true
}

func (cbf *CountingBloomFilter) getHashValues(data []byte) []uint64 {
	h1 := hashFNV(data)
	h2 := hashMurmur(data)

	hashes := make([]uint64, cbf.k)
	for i := uint(0); i < cbf.k; i++ {
		hashes[i] = (h1 + uint64(i)*h2) % cbf.size
	}
	return hashes
}

// ScalableBloomFilter grows dynamically to maintain the false positive rate
type ScalableBloomFilter struct {
	filters []*BloomFilter
	p       float64 // Target false positive rate
	r       float64 // Tightening ratio
	s       uint    // Growth factor
	mu      sync.RWMutex
}

// NewScalableBloomFilter creates a scalable Bloom filter
func NewScalableBloomFilter(initialCapacity uint64, falsePositiveRate, tighteningRatio float64) *ScalableBloomFilter {
	sbf := &ScalableBloomFilter{
		filters: make([]*BloomFilter, 0),
		p:       falsePositiveRate,
		r:       tighteningRatio,
		s:       2, // Double size each time
	}

	// Create the initial filter
	sbf.addFilter(initialCapacity)

	return sbf
}

func (sbf *ScalableBloomFilter) addFilter(capacity uint64) {
	// Each subsequent filter has a tighter false positive rate
	p := sbf.p * math.Pow(sbf.r, float64(len(sbf.filters)))
	filter := NewOptimalBloomFilter(capacity, p)
	sbf.filters = append(sbf.filters, filter)
}

// Add inserts an item
func (sbf *ScalableBloomFilter) Add(data []byte) {
	sbf.mu.Lock()
	defer sbf.mu.Unlock()

	// Add to the current filter
	currentFilter := sbf.filters[len(sbf.filters)-1]
	currentFilter.Add(data)

	// Check if we need to grow
	if currentFilter.EstimateFalsePositiveRate() > sbf.p {
		// Create a new filter with larger capacity
		newCapacity := currentFilter.n * uint64(sbf.s)
		sbf.addFilter(newCapacity)
	}
}

// MayContain checks all filters
func (sbf *ScalableBloomFilter) MayContain(data []byte) bool {
	sbf.mu.RLock()
	defer sbf.mu.RUnlock()

	// An item may live in any generation, so check every filter
	for _, filter := range sbf.filters {
		if filter.MayContain(data) {
			return true
		}
	}
	return false
}

// Count returns the total number of items added
func (sbf *ScalableBloomFilter) Count() uint64 {
	sbf.mu.RLock()
	defer sbf.mu.RUnlock()

	total := uint64(0)
	for _, filter := range sbf.filters {
		total += filter.Count()
	}
	return total
}

// Utility functions

// Union creates a new Bloom filter that is the union of two filters.
// Both filters must have the same size and number of hash functions.
func Union(bf1, bf2 *BloomFilter) (*BloomFilter, error) {
	if bf1.size != bf2.size || bf1.k != bf2.k {
		return nil, fmt.Errorf("filters must have same size and k")
	}

	result := NewBloomFilter(bf1.size, bf1.k)

	bf1.mu.RLock()
	bf2.mu.RLock()
	defer bf1.mu.RUnlock()
	defer bf2.mu.RUnlock()

	// Bitwise OR
	for i := range result.bits {
		result.bits[i] = bf1.bits[i] | bf2.bits[i]
	}

	result.n = bf1.n + bf2.n

	return result, nil
}

// Intersection creates a new Bloom filter that is the intersection of two filters
func Intersection(bf1, bf2 *BloomFilter) (*BloomFilter, error) {
	if bf1.size != bf2.size || bf1.k != bf2.k {
		return nil, fmt.Errorf("filters must have same size and k")
	}

	result := NewBloomFilter(bf1.size, bf1.k)

	bf1.mu.RLock()
	bf2.mu.RLock()
	defer bf1.mu.RUnlock()
	defer bf2.mu.RUnlock()

	// Bitwise AND
	for i := range result.bits {
		result.bits[i] = bf1.bits[i] & bf2.bits[i]
	}

	// Note: n is approximate after intersection
	if bf1.n < bf2.n {
		result.n = bf1.n
	} else {
		result.n = bf2.n
	}

	return result, nil
}
Testing
package bloomfilter

import (
	"fmt"
	"testing"
)

func TestBasicOperations(t *testing.T) {
	bf := NewOptimalBloomFilter(1000, 0.01)

	// Add items
	items := []string{"apple", "banana", "cherry", "date"}
	for _, item := range items {
		bf.AddString(item)
	}

	// Test membership
	for _, item := range items {
		if !bf.MayContainString(item) {
			t.Errorf("Item %s should be in filter", item)
		}
	}

	// Test non-membership
	if bf.MayContainString("grape") {
		t.Log("False positive for 'grape'")
	}
}

func TestFalsePositiveRate(t *testing.T) {
	n := uint64(10000)
	p := 0.01 // 1% target false positive rate

	bf := NewOptimalBloomFilter(n, p)

	// Add n items
	for i := uint64(0); i < n; i++ {
		bf.AddString(fmt.Sprintf("item%d", i))
	}

	// Test false positive rate
	falsePositives := 0
	testSize := 10000

	for i := 0; i < testSize; i++ {
		// Test items not in filter
		if bf.MayContainString(fmt.Sprintf("notinset%d", i)) {
			falsePositives++
		}
	}

	actualRate := float64(falsePositives) / float64(testSize)
	estimatedRate := bf.EstimateFalsePositiveRate()

	t.Logf("Target FP rate: %.4f", p)
	t.Logf("Actual FP rate: %.4f", actualRate)
	t.Logf("Estimated FP rate: %.4f", estimatedRate)

	// Allow some tolerance
	if actualRate > p*2 {
		t.Errorf("False positive rate too high: %.4f > %.4f", actualRate, p*2)
	}
}

func TestOptimalParameters(t *testing.T) {
	tests := []struct {
		n uint64
		p float64
	}{
		{1000, 0.01},
		{10000, 0.001},
		{100000, 0.01},
	}

	for _, test := range tests {
		m, k := OptimalParameters(test.n, test.p)
		t.Logf("n=%d, p=%.4f => m=%d, k=%d, memory=%d KB",
			test.n, test.p, m, k, m/8/1024)
	}
}

func TestCountingBloomFilter(t *testing.T) {
	cbf := NewCountingBloomFilter(10000, 4)

	// Add and remove
	cbf.Add([]byte("test"))
	if !cbf.MayContain([]byte("test")) {
		t.Error("Item should be present after add")
	}

	cbf.Remove([]byte("test"))
	if cbf.MayContain([]byte("test")) {
		t.Error("Item should not be present after remove")
	}
}

func TestScalableBloomFilter(t *testing.T) {
	sbf := NewScalableBloomFilter(100, 0.01, 0.9)

	// Add many items
	for i := 0; i < 1000; i++ {
		sbf.Add([]byte(fmt.Sprintf("item%d", i)))
	}

	// Verify all items present
	for i := 0; i < 1000; i++ {
		if !sbf.MayContain([]byte(fmt.Sprintf("item%d", i))) {
			t.Errorf("Item %d should be present", i)
		}
	}

	t.Logf("Scalable filter grew to %d filters", len(sbf.filters))
}

func TestSerialization(t *testing.T) {
	bf := NewOptimalBloomFilter(1000, 0.01)

	// Add items
	for i := 0; i < 100; i++ {
		bf.AddString(fmt.Sprintf("item%d", i))
	}

	// Serialize
	data, err := bf.MarshalBinary()
	if err != nil {
		t.Fatalf("Failed to marshal: %v", err)
	}

	// Deserialize
	bf2 := &BloomFilter{}
	if err := bf2.UnmarshalBinary(data); err != nil {
		t.Fatalf("Failed to unmarshal: %v", err)
	}

	// Verify items present in deserialized filter
	for i := 0; i < 100; i++ {
		if !bf2.MayContainString(fmt.Sprintf("item%d", i)) {
			t.Errorf("Item %d should be present after deserialization", i)
		}
	}
}

func TestUnionIntersection(t *testing.T) {
	bf1 := NewOptimalBloomFilter(1000, 0.01)
	bf2 := NewOptimalBloomFilter(1000, 0.01)

	// Add overlapping ranges of items to each
	for i := 0; i < 50; i++ {
		bf1.AddString(fmt.Sprintf("item%d", i))
		bf2.AddString(fmt.Sprintf("item%d", i+25))
	}

	// Union should contain all items
	union, err := Union(bf1, bf2)
	if err != nil {
		t.Fatalf("Union failed: %v", err)
	}

	for i := 0; i < 75; i++ {
		if !union.MayContainString(fmt.Sprintf("item%d", i)) {
			t.Errorf("Union should contain item%d", i)
		}
	}

	// Intersection should contain the overlapping items
	intersection, err := Intersection(bf1, bf2)
	if err != nil {
		t.Fatalf("Intersection failed: %v", err)
	}

	for i := 25; i < 50; i++ {
		if !intersection.MayContainString(fmt.Sprintf("item%d", i)) {
			t.Errorf("Intersection should contain item%d", i)
		}
	}
}

func BenchmarkAdd(b *testing.B) {
	bf := NewOptimalBloomFilter(uint64(b.N), 0.01)

	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		bf.AddString(fmt.Sprintf("item%d", i))
	}
}

func BenchmarkMayContain(b *testing.B) {
	bf := NewOptimalBloomFilter(10000, 0.01)

	// Pre-populate
	for i := 0; i < 10000; i++ {
		bf.AddString(fmt.Sprintf("item%d", i))
	}

	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		bf.MayContainString(fmt.Sprintf("item%d", i%10000))
	}
}
Usage Example
package main

import (
	"fmt"

	"bloomfilter"
)

func main() {
	// Example 1: Web crawler URL deduplication
	fmt.Println("=== Web Crawler Example ===")

	// Expect 1 million URLs, 1% false positive rate
	crawler := bloomfilter.NewOptimalBloomFilter(1000000, 0.01)

	urls := []string{
		"https://example.com/page1",
		"https://example.com/page2",
		"https://example.com/page3",
	}

	for _, url := range urls {
		if crawler.MayContainString(url) {
			fmt.Printf("Skipping: %s\n", url)
		} else {
			fmt.Printf("Crawling: %s\n", url)
			crawler.AddString(url)
		}
	}

	fmt.Printf("\nMemory usage: %d KB\n", crawler.Size()/8/1024)
	fmt.Printf("Fill ratio: %.2f%%\n", crawler.FillRatio()*100)
	fmt.Printf("False positive rate: %.4f%%\n", crawler.EstimateFalsePositiveRate()*100)

	// Example 2: Spell checker
	fmt.Println("\n=== Spell Checker Example ===")

	dictionary := bloomfilter.NewOptimalBloomFilter(100000, 0.001)

	// Add dictionary words
	words := []string{"hello", "world", "golang", "programming"}
	for _, word := range words {
		dictionary.AddString(word)
	}

	// Check spelling
	testWords := []string{"hello", "wrld", "golang", "xyz"}
	for _, word := range testWords {
		if dictionary.MayContainString(word) {
			fmt.Printf("'%s' - probably correct\n", word)
		} else {
			fmt.Printf("'%s' - DEFINITELY misspelled\n", word)
		}
	}

	// Example 3: Database cache filter
	fmt.Println("\n=== Database Cache Filter Example ===")

	cacheFilter := bloomfilter.NewOptimalBloomFilter(10000, 0.01)

	// Simulate cached keys
	cachedKeys := []string{"user:1001", "user:1002", "user:1003"}
	for _, key := range cachedKeys {
		cacheFilter.AddString(key)
	}

	// Check before database query
	checkKey := func(key string) {
		if cacheFilter.MayContainString(key) {
			fmt.Printf("%s - Check cache first\n", key)
		} else {
			fmt.Printf("%s - Skip cache, query database directly\n", key)
		}
	}

	checkKey("user:1001") // In cache
	checkKey("user:9999") // Not in cache

	// Example 4: Scalable Bloom filter
	fmt.Println("\n=== Scalable Bloom Filter Example ===")

	scalable := bloomfilter.NewScalableBloomFilter(100, 0.01, 0.9)

	// Add items beyond initial capacity
	for i := 0; i < 500; i++ {
		scalable.Add([]byte(fmt.Sprintf("item%d", i)))
	}

	fmt.Printf("Added %d items\n", scalable.Count())
	// Reporting how many internal filters the scalable filter grew to
	// would require an exported accessor, since the filters field is unexported.
}
Testing Your Solution
1. Basic Functionality
- Add items and verify they return true for MayContain
- Verify items not added return false
- Test that Clear() resets the filter
2. False Positive Rate Validation
- Add n items
- Query 10,000 items NOT in the filter
- Count false positives
- Verify rate is close to target rate
3. Optimal Parameters
- Test that calculated parameters produce expected false positive rate
- Verify memory usage is reasonable
- Check that hash count is in reasonable range
4. Serialization
- Add items to filter
- Serialize to bytes
- Deserialize to new filter
- Verify all items still detected
5. Performance
- Benchmark add operations
- Benchmark query operations
- Compare memory usage vs hash set
- Test with millions of items
Edge Cases
- Empty filter
- Single item
- Many collisions
- Maximum capacity
- Invalid parameters
Bonus Challenges
- Compressed Bloom Filter: Implement compression to further reduce memory usage
- Partitioned Bloom Filter: Split into multiple partitions for better cache locality
- Cuckoo Filter: Implement an alternative that supports deletion without counters
- Bloom Filter Index: Use a Bloom filter to index a database for fast negative lookups
- Network Protocol: Implement Bloom filter synchronization between distributed nodes
- Dynamic Optimal K: Adjust the number of hash functions based on fill ratio
- Bloom Filter Visualization: Create a web UI to visualize the bit array and collisions
Key Takeaways
- Probabilistic data structures trade perfect accuracy for massive space savings
- False positives are possible but false negatives are impossible
- Optimal parameters can be mathematically calculated from n and p
- Double hashing efficiently simulates multiple independent hash functions
- Bit manipulation enables extremely compact storage
- Real-world applications include caching, databases, networks, and blockchain
- Memory savings can be 90-99% compared to hash sets
- Counting Bloom filters enable deletion at cost of more memory
- Scalable Bloom filters maintain false positive rate as they grow
- Trade-offs between accuracy, memory, and query time must be understood
References
- Original Paper: "Space/Time Trade-offs in Hash Coding with Allowable Errors" by Burton H. Bloom
- Analysis: "Network Applications of Bloom Filters: A Survey" by Broder and Mitzenmacher
- Counting Bloom Filters: "Summary Cache: A Scalable Wide-Area Web Cache Sharing Protocol" by Fan et al.
- Scalable Bloom Filters: "Scalable Bloom Filters" by Almeida et al.
- Cuckoo Filters: "Cuckoo Filter: Practically Better Than Bloom" by Fan et al.
- Applications: Google Chrome safe browsing, Bitcoin SPV, Cassandra, Akamai
- Hash Functions: MurmurHash3 and FNV-1a documentation
- Wikipedia: Comprehensive overview with mathematical analysis