Bloom Filter Probabilistic Data Structure

Exercise: Bloom Filter Probabilistic Data Structure

Difficulty - Advanced

Estimated Time - 2-4 hours

Learning Objectives

  • Understand probabilistic data structures and their trade-offs
  • Implement Bloom filters with optimal hash function selection
  • Master bit manipulation and efficient memory usage
  • Calculate and tune false positive rates
  • Apply Bloom filters to real-world performance optimization problems
  • Implement advanced variants (counting and scalable Bloom filters)

Problem Statement

How can you check if an element exists in a massive dataset using minimal memory? A naive hash set might require gigabytes of RAM for millions of items, but a Bloom filter can answer "definitely not present" or "probably present" using just kilobytes of memory.

A Bloom filter is a space-efficient probabilistic data structure that tests whether an element is a member of a set. It can produce false positives but never false negatives. This trade-off makes Bloom filters ideal for applications where memory is constrained and occasional false positives are acceptable.

Real-World Scenario:

Consider a web crawler that visits billions of URLs. Before crawling a URL, it needs to check if it's already been visited. Storing billions of URLs in a hash set would require enormous memory:

// Without Bloom filter: huge memory usage
type NaiveCrawler struct {
    visited map[string]bool // 1 billion URLs * 100 bytes = 100 GB!
}

func (c *NaiveCrawler) AlreadyVisited(url string) bool {
    return c.visited[url]
}

// With Bloom filter: minimal memory usage
type EfficientCrawler struct {
    bloom *BloomFilter // 1 billion URLs * 10 bits = 1.25 GB
    // If false positive, re-crawl
}

func (c *EfficientCrawler) ProbablyVisited(url string) bool {
    return c.bloom.MayContainString(url)
}

With a 1% false positive rate, the Bloom filter might incorrectly say 1% of URLs were visited when they weren't, causing the crawler to skip them. But it uses nearly 99% less memory than a hash set!

Requirements

Functional Requirements

  1. Basic Bloom Filter

    • Initialize with size and number of hash functions
    • Add(item): Insert an item into the filter
    • MayContain(item): Check if item might be in the filter
    • Clear(): Reset all bits to zero
  2. Optimal Parameter Calculation

    • Given expected number of items and desired false positive rate
    • Calculate optimal bit array size
    • Calculate optimal number of hash functions
    • Formula: m = -(n * ln(p)) / (ln(2)^2), k = (m/n) * ln(2)
  3. Multiple Hash Functions

    • Implement k independent hash functions
    • Use double hashing technique for efficiency
    • Ensure uniform distribution across bit array
  4. Counting Bloom Filter

    • Support deletion by using counters instead of bits
    • Each position is a counter
    • Increment on add, decrement on remove
    • Handle counter overflow
  5. Scalable Bloom Filter

    • Dynamically grow as more items are added
    • Chain multiple Bloom filters with increasing sizes
    • Maintain consistent false positive rate as filter scales

Non-Functional Requirements

  • Space Efficiency: Use bit-level operations for minimal memory
  • Performance: O(k) time for add and query operations
  • Accuracy: Achieve specified false positive rate
  • Concurrency: Thread-safe for concurrent access
  • Serialization: Support saving/loading filter state

Background and Theory

Bloom filters were invented by Burton Howard Bloom in 1970. They work by using multiple hash functions to set bits in a bit array.

How Bloom Filters Work

Data Structure:

  • Bit array of size m
  • k different hash functions

Add Operation:

  1. Hash the item with each of k hash functions
  2. Set the bit at each hash value to 1

Query Operation:

  1. Hash the item with each of k hash functions
  2. Check if all k bits are set to 1
  3. If any bit is 0: definitely not in set
  4. If all bits are 1: probably in set

Why False Positives?
Multiple different items can hash to the same bits. After adding many items, many bits are set to 1. A query for a non-existent item might find all its bits set to 1 by chance.
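The collision mechanism can be seen in a deliberately tiny sketch: a 10-bit array with two trivially weak hash functions (the names `h1`, `h2`, `add`, and `mayContain` and the modular "hashes" are illustrative inventions, not the real technique — production filters use strong hashes over a large array). Item 14 was never added, yet every bit it maps to was set by other items:

```go
package main

import "fmt"

const m = 10 // tiny bit array, for illustration only

// Two deliberately weak "hash" functions over small integers:
// last decimal digit and second-to-last decimal digit.
func h1(x int) int { return x % m }
func h2(x int) int { return (x / 10) % m }

func add(bits []bool, x int) {
	bits[h1(x)] = true
	bits[h2(x)] = true
}

func mayContain(bits []bool, x int) bool {
	return bits[h1(x)] && bits[h2(x)]
}

func main() {
	bits := make([]bool, m)
	add(bits, 12) // sets bits 2 and 1
	add(bits, 34) // sets bits 4 and 3

	fmt.Println(mayContain(bits, 12)) // true: really added
	fmt.Println(mayContain(bits, 55)) // false: bit 5 is 0, definitely absent
	fmt.Println(mayContain(bits, 14)) // true: bits 4 and 1 were set by 34 and 12 -> false positive
}
```

The query for 14 finds bit 4 (set when adding 34) and bit 1 (set when adding 12) both on, so the filter reports "probably present" for an item it never saw.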

Mathematical Analysis

False Positive Probability:
After inserting n items into a Bloom filter with m bits and k hash functions:

P(false positive) ≈ (1 - e^(-kn/m))^k

Optimal Number of Hash Functions:

k = (m/n) * ln(2) ≈ 0.693 * m/n

Optimal Bit Array Size:
For desired false positive rate p and n items:

m = -(n * ln(p)) / (ln(2)^2) ≈ 1.44 * n * log2(1/p)

Example:

  • n = 1,000,000 items
  • p = 0.01
  • m ≈ 9,585,059 bits ≈ 1.14 MB
  • k ≈ 7 hash functions

Use Cases

  1. Google Chrome: Uses Bloom filter to check if URL is malicious before full lookup
  2. Bitcoin: Uses Bloom filters in SPV clients
  3. Cassandra & HBase: Use Bloom filters to avoid disk reads for non-existent data
  4. Akamai CDN: Uses Bloom filters for cache filtering
  5. Spell Checkers: Quick check if word is in dictionary
  6. Network Routers: Packet filtering and duplicate detection
  7. Blockchain: Ethereum uses Bloom filters for log filtering

Implementation Challenges

Challenge 1: Efficient Hash Function Generation

Computing k independent hash functions is expensive. How do you efficiently generate multiple hashes?

Solution Approach:
Use double hashing technique: compute only 2 hash functions, then simulate k hashes:

hash_i(x) = (h1(x) + i * h2(x)) mod m

This gives good distribution while being much faster than computing k separate hashes.

Challenge 2: Bit-Level Operations

To minimize memory, you need to pack bits tightly and manipulate individual bits efficiently.

Solution Approach:

  • Use []uint64 as underlying storage
  • Calculate word index: index / 64
  • Calculate bit offset: index % 64
  • Set a bit with bitwise OR: array[wordIdx] |= 1 << bitOffset
  • Check a bit with bitwise AND: array[wordIdx] & (1 << bitOffset) != 0

Challenge 3: Optimal Parameter Selection

Users shouldn't need to understand Bloom filter mathematics. Provide easy configuration.

Solution Approach:

  • Provide NewOptimal(expectedItems, falsePositiveRate) constructor
  • Automatically calculate optimal m and k
  • Validate parameters and warn about unrealistic configurations
  • Provide helper functions to estimate performance

Challenge 4: Thread Safety

Multiple goroutines might query and update the Bloom filter concurrently.

Solution Approach:

  • Use sync.RWMutex for concurrent access
  • Read operations use read lock
  • Write operations use write lock
  • Consider lock-free implementations using atomic operations for higher performance

Hints

Hint 1: Bit Array Design

Use a slice of uint64 for efficient bit storage and operations:

type BloomFilter struct {
    bits []uint64 // Bit array
    size uint64   // Total number of bits
    k    uint     // Number of hash functions
    n    uint64   // Number of items added
    mu   sync.RWMutex
}

// Set bit at position i
func (bf *BloomFilter) setBit(i uint64) {
    idx := i / 64
    offset := i % 64
    bf.bits[idx] |= 1 << offset
}

// Check if bit at position i is set
func (bf *BloomFilter) getBit(i uint64) bool {
    idx := i / 64
    offset := i % 64
    return bf.bits[idx]&(1<<offset) != 0
}
Hint 2: Double Hashing for Multiple Hash Functions

Use two hash functions to simulate k hash functions:

func (bf *BloomFilter) getHashValues(data []byte) []uint64 {
    // Use two hash functions: FNV and Murmur
    h1 := fnvHash(data)
    h2 := murmurHash(data)

    hashes := make([]uint64, bf.k)
    for i := uint(0); i < bf.k; i++ {
        // Double hashing: h1 + i*h2
        hash := (h1 + uint64(i)*h2) % bf.size
        hashes[i] = hash
    }
    return hashes
}

This is much faster than computing k separate hash functions.

Hint 3: Optimal Parameter Calculation

Implement the mathematical formulas:

func OptimalParameters(n uint64, p float64) (m uint64, k uint) {
    // m = -(n * ln(p)) / (ln(2)^2)
    ln2 := math.Log(2)
    mFloat := -(float64(n) * math.Log(p)) / (ln2 * ln2)
    m = uint64(math.Ceil(mFloat))

    // k = (m/n) * ln(2)
    kFloat := (float64(m) / float64(n)) * ln2
    k = uint(math.Round(kFloat))

    // Ensure reasonable bounds
    if k < 1 {
        k = 1
    }
    if k > 10 {
        k = 10 // More than 10 hash functions is rarely optimal
    }

    return m, k
}
Hint 4: False Positive Rate Estimation

Calculate actual false positive rate based on current state:

func (bf *BloomFilter) EstimateFalsePositiveRate() float64 {
    // p ≈ (1 - e^(-kn/m))^k
    if bf.n == 0 {
        return 0
    }

    exponent := -float64(bf.k) * float64(bf.n) / float64(bf.size)
    p := math.Pow(1-math.Exp(exponent), float64(bf.k))

    return p
}
Hint 5: Counting Bloom Filter

For deletions, use counters instead of bits:

type CountingBloomFilter struct {
    counters []uint8 // 4-bit or 8-bit counters
    size     uint64
    k        uint
    mu       sync.RWMutex
}

func (cbf *CountingBloomFilter) Add(data []byte) {
    cbf.mu.Lock()
    defer cbf.mu.Unlock()

    for _, hash := range cbf.getHashValues(data) {
        if cbf.counters[hash] < 255 { // Prevent overflow
            cbf.counters[hash]++
        }
    }
}

func (cbf *CountingBloomFilter) Remove(data []byte) {
    cbf.mu.Lock()
    defer cbf.mu.Unlock()

    for _, hash := range cbf.getHashValues(data) {
        if cbf.counters[hash] > 0 {
            cbf.counters[hash]--
        }
    }
}

Note: Uses more memory but supports deletion.

Solution


Approach

This implementation provides:

  1. Standard Bloom Filter: Space-efficient with optimal parameters
  2. Counting Bloom Filter: Supports deletion with counter-based storage
  3. Scalable Bloom Filter: Dynamically grows to maintain false positive rate
  4. Hash Functions: Efficient double hashing technique
  5. Serialization: Save/load filter state
  6. Statistics: Track performance metrics and estimate false positive rate

Implementation

package bloomfilter

import (
	"bytes"
	"encoding/binary"
	"fmt"
	"hash/fnv"
	"math"
	"sync"

	"github.com/spaolacci/murmur3"
)

// BloomFilter is a space-efficient probabilistic data structure
type BloomFilter struct {
	bits []uint64 // Bit array
	size uint64   // Total number of bits
	k    uint     // Number of hash functions
	n    uint64   // Number of items added
	mu   sync.RWMutex
}

// NewBloomFilter creates a new Bloom filter with specified parameters
func NewBloomFilter(size uint64, k uint) *BloomFilter {
	// Round up size to the nearest multiple of 64 for efficiency
	numWords := (size + 63) / 64
	actualSize := numWords * 64

	return &BloomFilter{
		bits: make([]uint64, numWords),
		size: actualSize,
		k:    k,
		n:    0,
	}
}

// NewOptimalBloomFilter creates a Bloom filter with optimal parameters
// for the expected number of items and desired false positive rate
func NewOptimalBloomFilter(expectedItems uint64, falsePositiveRate float64) *BloomFilter {
	m, k := OptimalParameters(expectedItems, falsePositiveRate)
	return NewBloomFilter(m, k)
}

// OptimalParameters calculates optimal size and hash count
func OptimalParameters(n uint64, p float64) (m uint64, k uint) {
	if p <= 0 || p >= 1 {
		panic("false positive rate must be between 0 and 1")
	}

	// m = -(n * ln(p)) / (ln(2)^2)
	ln2 := math.Log(2)
	mFloat := -(float64(n) * math.Log(p)) / (ln2 * ln2)
	m = uint64(math.Ceil(mFloat))

	// k = (m/n) * ln(2)
	kFloat := (float64(m) / float64(n)) * ln2
	k = uint(math.Round(kFloat))

	// Reasonable bounds
	if k < 1 {
		k = 1
	}
	if k > 20 {
		k = 20
	}

	return m, k
}

// Add inserts an item into the Bloom filter
func (bf *BloomFilter) Add(data []byte) {
	bf.mu.Lock()
	defer bf.mu.Unlock()

	hashes := bf.getHashValues(data)
	for _, hash := range hashes {
		bf.setBit(hash)
	}
	bf.n++
}

// AddString is a convenience method to add a string
func (bf *BloomFilter) AddString(s string) {
	bf.Add([]byte(s))
}

// MayContain checks if an item might be in the filter
// Returns false: definitely not in filter
// Returns true: probably in filter
func (bf *BloomFilter) MayContain(data []byte) bool {
	bf.mu.RLock()
	defer bf.mu.RUnlock()

	hashes := bf.getHashValues(data)
	for _, hash := range hashes {
		if !bf.getBit(hash) {
			return false // Definitely not present
		}
	}
	return true // Probably present
}

// MayContainString is a convenience method for strings
func (bf *BloomFilter) MayContainString(s string) bool {
	return bf.MayContain([]byte(s))
}

// Clear resets the Bloom filter
func (bf *BloomFilter) Clear() {
	bf.mu.Lock()
	defer bf.mu.Unlock()

	for i := range bf.bits {
		bf.bits[i] = 0
	}
	bf.n = 0
}

// Count returns the number of items added
func (bf *BloomFilter) Count() uint64 {
	bf.mu.RLock()
	defer bf.mu.RUnlock()
	return bf.n
}

// Size returns the size of the bit array
func (bf *BloomFilter) Size() uint64 {
	return bf.size
}

// K returns the number of hash functions
func (bf *BloomFilter) K() uint {
	return bf.k
}

// EstimateFalsePositiveRate estimates current false positive probability
func (bf *BloomFilter) EstimateFalsePositiveRate() float64 {
	bf.mu.RLock()
	defer bf.mu.RUnlock()

	if bf.n == 0 {
		return 0
	}

	// p ≈ (1 - e^(-kn/m))^k
	exponent := -float64(bf.k) * float64(bf.n) / float64(bf.size)
	p := math.Pow(1-math.Exp(exponent), float64(bf.k))

	return p
}

// FillRatio returns the proportion of bits set to 1
func (bf *BloomFilter) FillRatio() float64 {
	bf.mu.RLock()
	defer bf.mu.RUnlock()

	setBits := uint64(0)
	for _, word := range bf.bits {
		setBits += uint64(popcount(word))
	}

	return float64(setBits) / float64(bf.size)
}

// Internal methods

func (bf *BloomFilter) setBit(i uint64) {
	idx := i / 64
	offset := i % 64
	bf.bits[idx] |= 1 << offset
}

func (bf *BloomFilter) getBit(i uint64) bool {
	idx := i / 64
	offset := i % 64
	return bf.bits[idx]&(1<<offset) != 0
}

func (bf *BloomFilter) getHashValues(data []byte) []uint64 {
	// Use double hashing: two hash functions to simulate k hash functions
	h1 := hashFNV(data)
	h2 := hashMurmur(data)

	hashes := make([]uint64, bf.k)
	for i := uint(0); i < bf.k; i++ {
		// Combined hash: h1 + i*h2
		hash := (h1 + uint64(i)*h2) % bf.size
		hashes[i] = hash
	}

	return hashes
}

// Hash functions

func hashFNV(data []byte) uint64 {
	h := fnv.New64a()
	h.Write(data)
	return h.Sum64()
}

func hashMurmur(data []byte) uint64 {
	h := murmur3.New64()
	h.Write(data)
	return h.Sum64()
}

// popcount counts the number of set bits in a uint64
func popcount(x uint64) int {
	count := 0
	for x != 0 {
		count++
		x &= x - 1 // Clear the lowest set bit
	}
	return count
}

// Serialization

// MarshalBinary implements encoding.BinaryMarshaler
func (bf *BloomFilter) MarshalBinary() ([]byte, error) {
	bf.mu.RLock()
	defer bf.mu.RUnlock()

	buf := new(bytes.Buffer)

	// Write metadata
	binary.Write(buf, binary.LittleEndian, bf.size)
	binary.Write(buf, binary.LittleEndian, uint32(bf.k))
	binary.Write(buf, binary.LittleEndian, bf.n)

	// Write bit array
	for _, word := range bf.bits {
		binary.Write(buf, binary.LittleEndian, word)
	}

	return buf.Bytes(), nil
}

// UnmarshalBinary implements encoding.BinaryUnmarshaler
func (bf *BloomFilter) UnmarshalBinary(data []byte) error {
	bf.mu.Lock()
	defer bf.mu.Unlock()

	buf := bytes.NewReader(data)

	// Read metadata
	binary.Read(buf, binary.LittleEndian, &bf.size)
	var k32 uint32
	binary.Read(buf, binary.LittleEndian, &k32)
	bf.k = uint(k32)
	binary.Read(buf, binary.LittleEndian, &bf.n)

	// Read bit array (size is always a multiple of 64)
	numWords := bf.size / 64
	bf.bits = make([]uint64, numWords)
	for i := range bf.bits {
		binary.Read(buf, binary.LittleEndian, &bf.bits[i])
	}

	return nil
}

// CountingBloomFilter supports deletion using counters
type CountingBloomFilter struct {
	counters []uint8 // 8-bit counters
	size     uint64
	k        uint
	n        uint64
	mu       sync.RWMutex
}

// NewCountingBloomFilter creates a counting Bloom filter
func NewCountingBloomFilter(size uint64, k uint) *CountingBloomFilter {
	return &CountingBloomFilter{
		counters: make([]uint8, size),
		size:     size,
		k:        k,
		n:        0,
	}
}

// Add inserts an item
func (cbf *CountingBloomFilter) Add(data []byte) {
	cbf.mu.Lock()
	defer cbf.mu.Unlock()

	hashes := cbf.getHashValues(data)
	for _, hash := range hashes {
		if cbf.counters[hash] < 255 { // Prevent overflow
			cbf.counters[hash]++
		}
	}
	cbf.n++
}

// Remove deletes an item
func (cbf *CountingBloomFilter) Remove(data []byte) {
	cbf.mu.Lock()
	defer cbf.mu.Unlock()

	hashes := cbf.getHashValues(data)
	for _, hash := range hashes {
		if cbf.counters[hash] > 0 {
			cbf.counters[hash]--
		}
	}
	if cbf.n > 0 {
		cbf.n--
	}
}

// MayContain checks if item might be present
func (cbf *CountingBloomFilter) MayContain(data []byte) bool {
	cbf.mu.RLock()
	defer cbf.mu.RUnlock()

	hashes := cbf.getHashValues(data)
	for _, hash := range hashes {
		if cbf.counters[hash] == 0 {
			return false
		}
	}
	return true
}

func (cbf *CountingBloomFilter) getHashValues(data []byte) []uint64 {
	h1 := hashFNV(data)
	h2 := hashMurmur(data)

	hashes := make([]uint64, cbf.k)
	for i := uint(0); i < cbf.k; i++ {
		hash := (h1 + uint64(i)*h2) % cbf.size
		hashes[i] = hash
	}
	return hashes
}

// ScalableBloomFilter grows dynamically to maintain false positive rate
type ScalableBloomFilter struct {
	filters []*BloomFilter
	p       float64 // Target false positive rate
	r       float64 // Tightening ratio
	s       uint    // Growth factor
	mu      sync.RWMutex
}

// NewScalableBloomFilter creates a scalable Bloom filter
func NewScalableBloomFilter(initialCapacity uint64, falsePositiveRate, tighteningRatio float64) *ScalableBloomFilter {
	sbf := &ScalableBloomFilter{
		filters: make([]*BloomFilter, 0),
		p:       falsePositiveRate,
		r:       tighteningRatio,
		s:       2, // Double size each time
	}

	// Create initial filter
	sbf.addFilter(initialCapacity)

	return sbf
}

func (sbf *ScalableBloomFilter) addFilter(capacity uint64) {
	// Each subsequent filter has tighter false positive rate
	p := sbf.p * math.Pow(sbf.r, float64(len(sbf.filters)))
	filter := NewOptimalBloomFilter(capacity, p)
	sbf.filters = append(sbf.filters, filter)
}

// Add inserts an item
func (sbf *ScalableBloomFilter) Add(data []byte) {
	sbf.mu.Lock()
	defer sbf.mu.Unlock()

	// Add to current filter
	currentFilter := sbf.filters[len(sbf.filters)-1]
	currentFilter.Add(data)

	// Check if we need to grow
	if currentFilter.EstimateFalsePositiveRate() > sbf.p {
		// Create new filter with larger capacity
		newCapacity := currentFilter.n * uint64(sbf.s)
		sbf.addFilter(newCapacity)
	}
}

// MayContain checks all filters
func (sbf *ScalableBloomFilter) MayContain(data []byte) bool {
	sbf.mu.RLock()
	defer sbf.mu.RUnlock()

	// Check all filters
	for _, filter := range sbf.filters {
		if filter.MayContain(data) {
			return true
		}
	}
	return false
}

// Count returns total items added
func (sbf *ScalableBloomFilter) Count() uint64 {
	sbf.mu.RLock()
	defer sbf.mu.RUnlock()

	total := uint64(0)
	for _, filter := range sbf.filters {
		total += filter.Count()
	}
	return total
}

// Utility functions

// Union creates a new Bloom filter that is the union of two filters.
// Both filters must have the same size and number of hash functions.
func Union(bf1, bf2 *BloomFilter) (*BloomFilter, error) {
	if bf1.size != bf2.size || bf1.k != bf2.k {
		return nil, fmt.Errorf("filters must have same size and k")
	}

	result := NewBloomFilter(bf1.size, bf1.k)

	bf1.mu.RLock()
	bf2.mu.RLock()
	defer bf1.mu.RUnlock()
	defer bf2.mu.RUnlock()

	// Bitwise OR
	for i := range result.bits {
		result.bits[i] = bf1.bits[i] | bf2.bits[i]
	}

	result.n = bf1.n + bf2.n

	return result, nil
}

// Intersection creates a new Bloom filter that is the intersection of two filters
func Intersection(bf1, bf2 *BloomFilter) (*BloomFilter, error) {
	if bf1.size != bf2.size || bf1.k != bf2.k {
		return nil, fmt.Errorf("filters must have same size and k")
	}

	result := NewBloomFilter(bf1.size, bf1.k)

	bf1.mu.RLock()
	bf2.mu.RLock()
	defer bf1.mu.RUnlock()
	defer bf2.mu.RUnlock()

	// Bitwise AND
	for i := range result.bits {
		result.bits[i] = bf1.bits[i] & bf2.bits[i]
	}

	// Note: n is approximate after intersection
	if bf1.n < bf2.n {
		result.n = bf1.n
	} else {
		result.n = bf2.n
	}

	return result, nil
}

Testing

package bloomfilter

import (
	"fmt"
	"testing"
)

func TestBasicOperations(t *testing.T) {
	bf := NewOptimalBloomFilter(1000, 0.01)

	// Add items
	items := []string{"apple", "banana", "cherry", "date"}
	for _, item := range items {
		bf.AddString(item)
	}

	// Test membership
	for _, item := range items {
		if !bf.MayContainString(item) {
			t.Errorf("Item %s should be in filter", item)
		}
	}

	// Test non-membership
	if bf.MayContainString("grape") {
		t.Log("False positive for 'grape'")
	}
}

func TestFalsePositiveRate(t *testing.T) {
	n := uint64(10000)
	p := 0.01 // 1% target false positive rate

	bf := NewOptimalBloomFilter(n, p)

	// Add n items
	for i := uint64(0); i < n; i++ {
		bf.AddString(fmt.Sprintf("item%d", i))
	}

	// Test false positive rate
	falsePositives := 0
	testSize := 10000

	for i := 0; i < testSize; i++ {
		// Test items not in filter
		if bf.MayContainString(fmt.Sprintf("notinset%d", i)) {
			falsePositives++
		}
	}

	actualRate := float64(falsePositives) / float64(testSize)
	estimatedRate := bf.EstimateFalsePositiveRate()

	t.Logf("Target FP rate: %.4f", p)
	t.Logf("Actual FP rate: %.4f", actualRate)
	t.Logf("Estimated FP rate: %.4f", estimatedRate)

	// Allow some tolerance
	if actualRate > p*2 {
		t.Errorf("False positive rate too high: %.4f > %.4f", actualRate, p*2)
	}
}

func TestOptimalParameters(t *testing.T) {
	tests := []struct {
		n uint64
		p float64
	}{
		{1000, 0.01},
		{10000, 0.001},
		{100000, 0.01},
	}

	for _, test := range tests {
		m, k := OptimalParameters(test.n, test.p)
		t.Logf("n=%d, p=%.4f => m=%d, k=%d, memory=%d KB",
			test.n, test.p, m, k, m/8/1024)
	}
}

func TestCountingBloomFilter(t *testing.T) {
	cbf := NewCountingBloomFilter(10000, 4)

	// Add and remove
	cbf.Add([]byte("test"))
	if !cbf.MayContain([]byte("test")) {
		t.Error("Item should be present after add")
	}

	cbf.Remove([]byte("test"))
	if cbf.MayContain([]byte("test")) {
		t.Error("Item should not be present after remove")
	}
}

func TestScalableBloomFilter(t *testing.T) {
	sbf := NewScalableBloomFilter(100, 0.01, 0.9)

	// Add many items
	for i := 0; i < 1000; i++ {
		sbf.Add([]byte(fmt.Sprintf("item%d", i)))
	}

	// Verify all items present
	for i := 0; i < 1000; i++ {
		if !sbf.MayContain([]byte(fmt.Sprintf("item%d", i))) {
			t.Errorf("Item %d should be present", i)
		}
	}

	t.Logf("Scalable filter grew to %d filters", len(sbf.filters))
}

func TestSerialization(t *testing.T) {
	bf := NewOptimalBloomFilter(1000, 0.01)

	// Add items
	for i := 0; i < 100; i++ {
		bf.AddString(fmt.Sprintf("item%d", i))
	}

	// Serialize
	data, err := bf.MarshalBinary()
	if err != nil {
		t.Fatalf("Failed to marshal: %v", err)
	}

	// Deserialize
	bf2 := &BloomFilter{}
	if err := bf2.UnmarshalBinary(data); err != nil {
		t.Fatalf("Failed to unmarshal: %v", err)
	}

	// Verify items present in deserialized filter
	for i := 0; i < 100; i++ {
		if !bf2.MayContainString(fmt.Sprintf("item%d", i)) {
			t.Errorf("Item %d should be present after deserialization", i)
		}
	}
}

func TestUnionIntersection(t *testing.T) {
	bf1 := NewOptimalBloomFilter(1000, 0.01)
	bf2 := NewOptimalBloomFilter(1000, 0.01)

	// Add different items to each
	for i := 0; i < 50; i++ {
		bf1.AddString(fmt.Sprintf("item%d", i))
		bf2.AddString(fmt.Sprintf("item%d", i+25))
	}

	// Union should contain all items
	union, err := Union(bf1, bf2)
	if err != nil {
		t.Fatalf("Union failed: %v", err)
	}

	for i := 0; i < 75; i++ {
		if !union.MayContainString(fmt.Sprintf("item%d", i)) {
			t.Errorf("Union should contain item%d", i)
		}
	}

	// Intersection should contain overlapping items
	intersection, err := Intersection(bf1, bf2)
	if err != nil {
		t.Fatalf("Intersection failed: %v", err)
	}

	for i := 25; i < 50; i++ {
		if !intersection.MayContainString(fmt.Sprintf("item%d", i)) {
			t.Errorf("Intersection should contain item%d", i)
		}
	}
}

func BenchmarkAdd(b *testing.B) {
	bf := NewOptimalBloomFilter(uint64(b.N), 0.01)

	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		bf.AddString(fmt.Sprintf("item%d", i))
	}
}

func BenchmarkMayContain(b *testing.B) {
	bf := NewOptimalBloomFilter(10000, 0.01)

	// Pre-populate
	for i := 0; i < 10000; i++ {
		bf.AddString(fmt.Sprintf("item%d", i))
	}

	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		bf.MayContainString(fmt.Sprintf("item%d", i%10000))
	}
}

Usage Example

package main

import (
	"fmt"

	"bloomfilter"
)

func main() {
	// Example 1: Web crawler URL deduplication
	fmt.Println("=== Web Crawler Example ===")

	// Expect 1 million URLs, 1% false positive rate
	crawler := bloomfilter.NewOptimalBloomFilter(1000000, 0.01)

	urls := []string{
		"https://example.com/page1",
		"https://example.com/page2",
		"https://example.com/page3",
	}

	for _, url := range urls {
		if crawler.MayContainString(url) {
			fmt.Printf("Skipping: %s\n", url)
		} else {
			fmt.Printf("Crawling: %s\n", url)
			crawler.AddString(url)
		}
	}

	fmt.Printf("\nMemory usage: %d KB\n", crawler.Size()/8/1024)
	fmt.Printf("Fill ratio: %.2f%%\n", crawler.FillRatio()*100)
	fmt.Printf("False positive rate: %.4f%%\n", crawler.EstimateFalsePositiveRate()*100)

	// Example 2: Spell checker
	fmt.Println("\n=== Spell Checker Example ===")

	dictionary := bloomfilter.NewOptimalBloomFilter(100000, 0.001)

	// Add dictionary words
	words := []string{"hello", "world", "golang", "programming"}
	for _, word := range words {
		dictionary.AddString(word)
	}

	// Check spelling
	testWords := []string{"hello", "wrld", "golang", "xyz"}
	for _, word := range testWords {
		if dictionary.MayContainString(word) {
			fmt.Printf("'%s' - probably correct\n", word)
		} else {
			fmt.Printf("'%s' - DEFINITELY misspelled\n", word)
		}
	}

	// Example 3: Database cache filter
	fmt.Println("\n=== Database Cache Filter Example ===")

	cacheFilter := bloomfilter.NewOptimalBloomFilter(10000, 0.01)

	// Simulate cached keys
	cachedKeys := []string{"user:1001", "user:1002", "user:1003"}
	for _, key := range cachedKeys {
		cacheFilter.AddString(key)
	}

	// Check before database query
	checkKey := func(key string) {
		if cacheFilter.MayContainString(key) {
			fmt.Printf("%s - Check cache first\n", key)
		} else {
			fmt.Printf("%s - Skip cache, query database directly\n", key)
		}
	}

	checkKey("user:1001") // In cache
	checkKey("user:9999") // Not in cache

	// Example 4: Scalable Bloom filter
	fmt.Println("\n=== Scalable Bloom Filter Example ===")

	scalable := bloomfilter.NewScalableBloomFilter(100, 0.01, 0.9)

	// Add items beyond initial capacity
	for i := 0; i < 500; i++ {
		scalable.Add([]byte(fmt.Sprintf("item%d", i)))
	}

	fmt.Printf("Added %d items\n", scalable.Count())
	// The number of internal filter levels is unexported; it can be
	// inspected from within the bloomfilter package (see TestScalableBloomFilter).
}

Testing Your Solution

1. Basic Functionality

  • Add items and verify they return true for MayContain
  • Verify items not added return false
  • Test Clear() resets the filter

2. False Positive Rate Validation

  • Add n items
  • Query 10,000 items NOT in the filter
  • Count false positives
  • Verify rate is close to target rate

3. Optimal Parameters

  • Test that calculated parameters produce expected false positive rate
  • Verify memory usage is reasonable
  • Check that hash count is in reasonable range

4. Serialization

  • Add items to filter
  • Serialize to bytes
  • Deserialize to new filter
  • Verify all items still detected

5. Performance

  • Benchmark add operations
  • Benchmark query operations
  • Compare memory usage vs hash set
  • Test with millions of items

Edge Cases

  • Empty filter
  • Single item
  • Many collisions
  • Maximum capacity
  • Invalid parameters
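For the invalid-parameters edge case, the solution's parameter calculation panics on an out-of-range rate. The sketch below demonstrates catching that with recover; it uses a local copy of the formula (the lowercase name `optimalParameters` and the `check` helper are illustrative inventions) so the snippet runs standalone:

```go
package main

import (
	"fmt"
	"math"
)

// Local copy of the solution's parameter calculation, including its
// validation panic, so this snippet runs standalone.
func optimalParameters(n uint64, p float64) (uint64, uint) {
	if p <= 0 || p >= 1 {
		panic("false positive rate must be between 0 and 1")
	}
	ln2 := math.Log(2)
	m := uint64(math.Ceil(-(float64(n) * math.Log(p)) / (ln2 * ln2)))
	k := uint(math.Round(float64(m) / float64(n) * ln2))
	if k < 1 {
		k = 1
	}
	return m, k
}

// check reports whether a configuration is accepted or rejected.
func check(n uint64, p float64) {
	defer func() {
		if r := recover(); r != nil {
			fmt.Printf("p=%v rejected: %v\n", p, r)
		}
	}()
	m, k := optimalParameters(n, p)
	fmt.Printf("p=%v ok: m=%d k=%d\n", p, m, k)
}

func main() {
	check(1000, 0.01) // valid: m=9586, k=7
	check(1000, 0)    // invalid: panic recovered
	check(1000, 1.5)  // invalid: panic recovered
}
```

In a real test suite the same pattern (deferred recover around the call) pins down that invalid rates fail loudly instead of producing a silently broken filter.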

Bonus Challenges

  1. Compressed Bloom Filter: Implement compression to further reduce memory usage

  2. Partitioned Bloom Filter: Split into multiple partitions for better cache locality

  3. Cuckoo Filter: Implement alternative that supports deletion without counters

  4. Bloom Filter Index: Use Bloom filter to index database for fast negative lookups

  5. Network Protocol: Implement Bloom filter synchronization between distributed nodes

  6. Dynamic Optimal K: Adjust number of hash functions based on fill ratio

  7. Bloom Filter Visualization: Create web UI to visualize bit array and collisions

Key Takeaways

  • Probabilistic data structures trade perfect accuracy for massive space savings
  • False positives are possible but false negatives are impossible
  • Optimal parameters can be mathematically calculated from n and p
  • Double hashing efficiently simulates multiple independent hash functions
  • Bit manipulation enables extremely compact storage
  • Real-world applications include caching, databases, networks, and blockchain
  • Memory savings can be 90-99% compared to hash sets
  • Counting Bloom filters enable deletion at cost of more memory
  • Scalable Bloom filters maintain false positive rate as they grow
  • Trade-offs between accuracy, memory, and query time must be understood

References

  • Original Paper: "Space/Time Trade-offs in Hash Coding with Allowable Errors" by Burton H. Bloom
  • Analysis: "Network Applications of Bloom Filters: A Survey" by Broder and Mitzenmacher
  • Counting Bloom Filters: "Summary Cache: A Scalable Wide-Area Web Cache Sharing Protocol" by Fan et al.
  • Scalable Bloom Filters: "Scalable Bloom Filters" by Almeida et al.
  • Cuckoo Filters: "Cuckoo Filter: Practically Better Than Bloom" by Fan et al.
  • Applications: Google Chrome safe browsing, Bitcoin SPV, Cassandra, Akamai
  • Hash Functions: MurmurHash3 and FNV-1a documentation
  • Wikipedia: Comprehensive overview with mathematical analysis