We have a service with billions of key-value pairs stored in some storage. Before actually querying the data, we query a Bloom filter to determine whether a key may exist or definitely does not exist in the underlying store.

The problem is that with ~1 billion keys and a 0.1% false positive rate, the size of the Bloom filter can easily exceed 2 GB. We also need to load many such filters (there is a different filter for every category of data).

In such a scenario, what is an efficient way to load a Bloom filter (or any data structure, for that matter) whose size exceeds the JVM heap itself?

  1. Should I avoid loading the filter into memory, and instead operate from disk itself?
  2. Is there a standard library that provides this abstraction of keeping the Bloom filter on disk while still being able to query it?
  3. Is there a standard way to deal with such an issue? I have heard about off-heap storage, but don't have much understanding of it.
  • Is there any particular reason why you need to do this in one JVM, i.e. one process on one machine? Usually, people start sharding or otherwise parallelizing their processing long before the "billions of records" point. Commented Mar 2, 2020 at 16:25
  • Have you considered memory-mapped files for the filters? They are loaded outside of the JVM heap, and you read from them what you need on demand. Commented Mar 2, 2020 at 16:27
  • I had an answer, but deleted it because I started thinking about this a bit more. The biggest question is what problem the Bloom filter is solving. In other words, why does it matter whether the key may exist or not? Is the index so large it is no longer performant? Are you working with a sharded key-value database? You may find the lookup time for a key (or set of keys) is shorter than what the Bloom filter buys you, particularly if it is so large. Commented Mar 2, 2020 at 19:22
  • The short answer, though, is that you load the data into memory in small chunks and iterate through it. But you've got to answer the basic question of whether the Bloom filter is causing more problems than it is solving. Commented Mar 2, 2020 at 19:25
  • While you probably want to rethink how you are using a Bloom filter, 2GB isn't a terribly large JVM heap and hasn't been for nearly a decade. I've overseen apps that exceed 12GB running for weeks, even months at a time with negligible GC overhead. Are you running a 32-bit JVM? The G1 collector also makes using large heaps much more efficient. Commented Mar 3, 2020 at 17:08

2 Answers

As this wonderful Bloom filter tutorial by llimllib will show you, when you test an element you have to recalculate its hashes. These tell you which bits the element would have set in the Bloom filter.

If your Bloom filter is outside of system RAM but on a storage device that still has random access, you can pull only the pertinent bytes from storage and test them against the calculated bits. If all the tested bits are set after ANDing them with your masks, you have a possible match. This works best if the bits an element lights up are sparse.
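
As a concrete illustration, here is a minimal sketch of that idea in Java, using a memory-mapped file (as one commenter suggested) so the bit array stays off the JVM heap. The file layout (a raw, flat bit array) and the double-hashing scheme are my own assumptions for the sketch, not a standard format:

    import java.io.IOException;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.file.Path;
    import java.nio.file.StandardOpenOption;

    // Hypothetical sketch: probes a Bloom filter stored as a flat bit array in
    // a file, without loading the whole filter onto the JVM heap.
    public class DiskBloomFilter {
        private static final long CHUNK = Integer.MAX_VALUE; // a MappedByteBuffer is int-indexed

        private final MappedByteBuffer[] regions; // memory-mapped chunks of the filter file
        private final long numBits;               // m
        private final int numHashes;              // k

        public DiskBloomFilter(Path file, long numBits, int numHashes) throws IOException {
            this.numBits = numBits;
            this.numHashes = numHashes;
            try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
                long size = ch.size();
                regions = new MappedByteBuffer[(int) ((size + CHUNK - 1) / CHUNK)];
                for (int i = 0; i < regions.length; i++) {
                    long offset = i * CHUNK;
                    regions[i] = ch.map(FileChannel.MapMode.READ_ONLY,
                                        offset, Math.min(CHUNK, size - offset));
                }
            }
        }

        // True if the key may be present; false if it is definitely absent.
        public boolean mightContain(byte[] key) {
            long h1 = hash(key, 0xcafebabe);
            long h2 = hash(key, 0xdeadbeef);
            for (int i = 0; i < numHashes; i++) {
                long bit = Math.floorMod(h1 + i * h2, numBits); // double hashing: k bit positions from 2 hashes
                long byteIndex = bit >>> 3;
                byte b = regions[(int) (byteIndex / CHUNK)].get((int) (byteIndex % CHUNK));
                if ((b & (1 << (int) (bit & 7))) == 0) {
                    return false; // one clear bit proves the key was never inserted
                }
            }
            return true;
        }

        // Placeholder hash; a real filter would use something like Murmur3 with two seeds.
        private static long hash(byte[] key, int seed) {
            long h = seed;
            for (byte b : key) h = h * 31 + (b & 0xff);
            return h;
        }
    }

Each lookup touches at most numHashes bytes of the mapping, so the operating system's page cache, rather than the JVM heap, decides how much of the filter is resident at any time.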

Physical hard drives would not perform well here, since each probe incurs seek time to move the heads to the correct sector. Solid-state drives would likely perform better, but confirm this with testing.

But, as Berin Lorittsch points out in the comments, you may be in a case where the Bloom filter is doing more harm than good. This is randomized IO access, and IO is slow enough on its own. It's hard to say how bad this is without comparing against a solution with no Bloom filter, but seriously: ouch.

Before giving up on the Bloom filter, keep in mind that Bloom filters are very customizable. Give this a close read:

How big should I make my Bloom filter?

It's a nice property of Bloom filters that you can modify the false positive rate of your filter. A larger filter will have fewer false positives, and a smaller one more.

Your false positive rate will be approximately (1 - e^(-kn/m))^k, so you can just plug in the number n of elements you expect to insert, and try various values of k and m to configure your filter for your application.

This leads to an obvious question:

How many hash functions should I use?

The more hash functions you have, the slower your bloom filter, and the quicker it fills up. If you have too few, however, you may suffer too many false positives.

Since you have to pick k when you create the filter, you'll have to ballpark what range you expect n to be in. Once you have that, you still have to choose a potential m (the number of bits) and k (the number of hash functions).

It seems a difficult optimization problem, but fortunately, given an m and an n, we have a function to choose the optimal value of k: k = (m/n) ln(2)

So, to choose the size of a Bloom filter, we:

  1. Choose a ballpark value for n.
  2. Choose a value for m.
  3. Calculate the optimal value of k.
  4. Calculate the error rate for our chosen values of n, m, and k. If it's unacceptable, return to step 2 and change m; otherwise we're done.

bloomfilter-tutorial - llimllib
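
To make the recipe concrete, here is a quick sketch of those four steps in Java. The particular n and m values are illustrative, not taken from the question:

    public class BloomSizing {
        // Optimal number of hash functions for m bits and n elements: k = (m/n) ln 2
        static int optimalK(long m, long n) {
            return Math.max(1, (int) Math.round((double) m / n * Math.log(2)));
        }

        // Approximate false positive rate: (1 - e^(-kn/m))^k
        static double falsePositiveRate(long m, long n, int k) {
            return Math.pow(1 - Math.exp(-(double) k * n / m), k);
        }

        public static void main(String[] args) {
            long n = 1_000_000_000L;               // step 1: ballpark the element count
            long m = 8L * 2 * 1024 * 1024 * 1024;  // step 2: try 2 GB worth of bits
            int k = optimalK(m, n);                // step 3: optimal k for this m and n
            double p = falsePositiveRate(m, n, k); // step 4: check the error rate
            System.out.printf("m = %d bits, k = %d, p = %.4f%%%n", m, k, p * 100);
            // If p is unacceptable, return to step 2 and try a larger m.
        }
    }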

This should make it clear that if your Bloom filter is too big for system RAM, it's because you chose to make it too big. Seriously, weigh the time cost of hitting IO against the time cost of more frequent false positives on your first check. This is very much a tuning variable.

I noted above in a comment that 2GB seems like an obsolete JVM limit, but I'm going to answer here assuming that this is a real hard limit you must adhere to.

After thinking about this for a while and playing around with a spreadsheet for various options, it occurred to me that you could avoid your specific issue by scaling out to multiple VMs.

To start with, I assumed an M of 1.5 GB (roughly 13 billion bits) and an N of exactly 1 billion. If my formulas are right, a K of 3 is the smallest value that gets you under 0.1%. I think that's about where you are now.

Then I considered a different approach. Instead of one big heap with 3 hash functions, you could have multiple JVMs, each maintaining one hash index. The math for predicting false positives in this scenario is to calculate as above for K=1, then take that result and raise it to the power of the number of instances. For example, with K=1 as above you get a 7% false positive rate. If I run 2 instances, my chance of a false positive on both is 0.56%, and with 3 it drops to 0.04%.

Of course, running 3 JVMs at 2GB each is pretty heavyweight. There might be a more optimal configuration. I set up a few different models and got the following:

[Chart: false positive rate (log scale) vs. total heap space in GB, for index sizes of 1GB, 0.5GB, and 0.25GB; point labels show the number of instances]

The vertical axis is the false positive rate (note the log scale; I usually wouldn't use a logarithmic scale, but it really didn't look good otherwise). The horizontal axis is the total heap space needed for the Bloom filters, in GB. The labels on each point show the number of instances.

There are three series: 1GB, 0.5GB, and 0.25GB. These are the index sizes. For example, the point on the bottom left represents 6 1-GB indexes, for a total of 6GB and a false positive rate of 0.0002%. That's massive overkill.

Based on this, a good choice might be 7 (or 8) 0.25-GB instances (the orange series). This gets you to your 0.1% false positive rate. You end up with a total of around 2GB of heap but all your VMs are pretty small.

One nice thing about this approach is that it's simple to parallelize the searches across the independent instances. One bad thing is that if any one of them goes down, the whole cluster is fubared. If total memory usage is not a concern, you might want to choose 4 1-GB indexes instead.
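
For reference, here is a small sketch of the K=1 combined-probability math used above, with the same assumptions as in this answer (1.5 GB of bits per instance, 1 billion keys):

    public class ShardedBloomMath {
        // False positive rate of a single K=1 filter with m bits and n keys: 1 - e^(-n/m)
        static double singleInstanceRate(long m, long n) {
            return 1 - Math.exp(-(double) n / m);
        }

        public static void main(String[] args) {
            long n = 1_000_000_000L;          // 1 billion keys
            long m = 8L * 1536 * 1024 * 1024; // 1.5 GB of bits per instance
            double p = singleInstanceRate(m, n);
            for (int instances = 1; instances <= 3; instances++) {
                // A false positive must slip past every independent instance,
                // so the combined rate is p raised to the instance count.
                System.out.printf("%d instance(s): %.2f%%%n",
                                  instances, Math.pow(p, instances) * 100);
            }
            // Prints roughly 7.47%, 0.56%, 0.04% -- in line with the figures above.
        }
    }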

I learned a good bit doing this exercise and if you have any questions or think I've made a major error here, let me know.
