How to Implement and Use Bloom Filters in Java

Question

What is a Bloom filter and how can it be implemented in Java?

Answer

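A Bloom filter is a space-efficient probabilistic data structure for testing whether an element is a member of a set. A lookup can return a false positive (reporting an element that was never added) but never a false negative, so a "not present" answer is always definitive. It stores only bits, not the elements themselves, which makes it a good fit when membership checks must be fast and memory is tight. A minimal implementation backed by java.util.BitSet: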

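import java.util.BitSet;

class BloomFilter {
    private final BitSet bitSet;
    private final int size;          // number of bits in the filter
    private final int[] hashSeeds;   // one seed per hash function

    public BloomFilter(int size, int numHashFunctions) {
        if (size <= 0 || numHashFunctions <= 0) {
            throw new IllegalArgumentException("size and numHashFunctions must be positive");
        }
        this.size = size;
        this.bitSet = new BitSet(size);
        this.hashSeeds = new int[numHashFunctions];
        for (int i = 0; i < numHashFunctions; i++) {
            this.hashSeeds[i] = i + 1;
        }
    }

    // Sets the bit at each hashed position for this value.
    public void add(String value) {
        for (int seed : hashSeeds) {
            bitSet.set(getHash(value, seed));
        }
    }

    // Returns false if the value was definitely never added;
    // true means "possibly added" (false positives can occur).
    public boolean contains(String value) {
        for (int seed : hashSeeds) {
            if (!bitSet.get(getHash(value, seed))) {
                return false;
            }
        }
        return true;
    }

    // Maps (value, seed) to a bit index in [0, size).
    // Simply adding the seed to hashCode() would pick adjacent bits, so the
    // seed is mixed in with multiply/shift steps instead (one simple choice
    // among many). Math.floorMod keeps the index non-negative, whereas
    // Math.abs(x) % size is negative when x is Integer.MIN_VALUE.
    private int getHash(String value, int seed) {
        int h = value.hashCode() ^ (seed * 0x9E3779B9);
        h ^= h >>> 16;
        h *= 0x85EBCA6B;
        h ^= h >>> 13;
        return Math.floorMod(h, size);
    }
}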

Causes

Large datasets need membership checks that stay fast as the number of elements grows.
Storing every item (as a HashSet would) uses too much memory, so a compact approximate structure is preferable.

Solutions

Implement a Bloom filter using a bit array and a set of hash functions to map elements to positions in the array.
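Balance the size of the bit array against the number of hash functions: for a fixed number of elements, more bits lower the false positive rate, while the number of hash functions has an optimum that depends on how many bits are available per element.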
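
The trade-off can be made concrete with the standard approximation for the false positive rate, p ≈ (1 - e^(-k·n/m))^k, where m is the number of bits, n the number of inserted elements, and k the number of hash functions. The sketch below is only illustrative: the class name, method name, and the sample values of m and n are assumptions, and the formula itself assumes the hash functions behave independently and uniformly.

```java
class BloomFalsePositiveRate {

    /**
     * Approximate false positive probability of a Bloom filter with
     * m bits, n inserted elements, and k hash functions:
     * p = (1 - e^(-k * n / m))^k
     */
    static double falsePositiveRate(long m, long n, int k) {
        return Math.pow(1.0 - Math.exp(-(double) k * n / m), k);
    }

    public static void main(String[] args) {
        long m = 10_000; // bits in the filter (hypothetical value)
        long n = 1_000;  // elements inserted (hypothetical value)

        // Print the estimated rate for a range of k to see where it bottoms out.
        for (int k = 1; k <= 12; k++) {
            System.out.printf("k=%2d  p=%.5f%n", k, falsePositiveRate(m, n, k));
        }
    }
}
```

Running it for a fixed m and n shows the rate falling as k increases from 1, reaching a minimum, and then rising again as extra hash functions fill the bit array faster than they add discriminating power.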

Common Mistakes

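Mistake: Using too few (or too many) hash functions, either of which increases the false positive rate.

Solution: Choose the number of hash functions k from the ratio of the filter size m (in bits) to the expected number of elements n. The optimum is approximately k = (m / n) ln 2.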
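
As a rough sketch of that calculation, the helper below applies the usual rule k ≈ (m / n) ln 2 and rounds to the nearest integer. The class and method names are hypothetical, and the result is a starting point rather than a guarantee, since real hash functions are not perfectly independent.

```java
class BloomHashCount {

    /**
     * Suggested number of hash functions for a filter with the given
     * number of bits and expected number of elements: k = (m / n) * ln 2,
     * rounded to the nearest integer and never less than 1.
     */
    static int optimalHashFunctions(long bits, long expectedElements) {
        double k = (double) bits / expectedElements * Math.log(2);
        return Math.max(1, (int) Math.round(k));
    }

    public static void main(String[] args) {
        // Hypothetical example: 10 bits of filter per expected element.
        System.out.println(optimalHashFunctions(10_000, 1_000));
    }
}
```

The returned value could be passed as numHashFunctions to the BloomFilter constructor shown in the answer.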

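Mistake: Choosing a filter that is too small for the data. A Bloom filter does not overflow; instead, most of its bits end up set and nearly every lookup returns a false positive.

Solution: Estimate the size from the expected number of elements n and the acceptable false positive probability p. The usual estimate is m = -n ln p / (ln 2)^2 bits.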
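
A minimal sketch of that estimate, assuming the standard formula m = -n ln p / (ln 2)^2; the class name, method name, and sample inputs are illustrative and not part of any library.

```java
class BloomSizing {

    /**
     * Estimated number of bits needed to hold the expected number of
     * elements at the target false positive probability:
     * m = -n * ln(p) / (ln 2)^2, rounded up.
     */
    static long optimalBits(long expectedElements, double falsePositiveProbability) {
        if (expectedElements <= 0 || falsePositiveProbability <= 0 || falsePositiveProbability >= 1) {
            throw new IllegalArgumentException("need n > 0 and 0 < p < 1");
        }
        double ln2 = Math.log(2);
        return (long) Math.ceil(-expectedElements * Math.log(falsePositiveProbability) / (ln2 * ln2));
    }

    public static void main(String[] args) {
        long n = 1_000_000; // expected elements (hypothetical value)
        double p = 0.01;    // acceptable false positive probability (hypothetical value)

        long bits = optimalBits(n, p);
        System.out.printf("bits=%d (about %.2f MiB)%n", bits, bits / 8.0 / (1024 * 1024));
    }
}
```

Because java.util.BitSet is indexed by int, a result above Integer.MAX_VALUE would not fit the BloomFilter class shown in the answer without splitting the filter into several bit sets.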
