11 events
when | what | by | license | comment
Mar 7, 2020 at 8:39 comment added rwong Using multiple machines, using multiple JVMs (processes), using large-array-aware (and large-memory-aware) JVMs, and using arrays-of-arrays on the JVM should be the four default answers to this question.
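The arrays-of-arrays idea mentioned in the comment above can be sketched as follows. This is a minimal illustration (class and field names are my own, not from the thread): Java caps a single array at about 2^31-1 elements, so a bit array for billions of entries has to be split across several inner arrays.

```java
// Hypothetical sketch: a bit set backed by an array of long[] chunks,
// sidestepping Java's 2^31-1 single-array element limit.
class ChunkedBitSet {
    private static final long CHUNK_BITS = 1L << 30; // bits per chunk (illustrative)
    private final long[][] chunks;

    ChunkedBitSet(long sizeInBits) {
        int numChunks = (int) ((sizeInBits + CHUNK_BITS - 1) / CHUNK_BITS);
        chunks = new long[numChunks][];
        long remaining = sizeInBits;
        for (int i = 0; i < numChunks; i++) {
            long bitsHere = Math.min(remaining, CHUNK_BITS);
            chunks[i] = new long[(int) ((bitsHere + 63) / 64)]; // 64 bits per long
            remaining -= bitsHere;
        }
    }

    void set(long bitIndex) {
        // Locate the chunk, then the 64-bit word inside it, then the bit.
        chunks[(int) (bitIndex / CHUNK_BITS)][(int) ((bitIndex % CHUNK_BITS) / 64)]
                |= 1L << (bitIndex % 64);
    }

    boolean get(long bitIndex) {
        return (chunks[(int) (bitIndex / CHUNK_BITS)][(int) ((bitIndex % CHUNK_BITS) / 64)]
                & (1L << (bitIndex % 64))) != 0;
    }
}
```

A Bloom filter's bit array could sit on top of such a structure; each chunk stays under the single-array limit while the overall index space is addressed with a `long`.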
Mar 4, 2020 at 21:00 history tweeted twitter.com/StackSoftEng/status/1235309178656305155
Mar 3, 2020 at 23:03 answer added JimmyJames timeline score: 1
Mar 3, 2020 at 17:08 comment added JimmyJames While you probably want to rethink how you are using a Bloom filter, 2 GB isn't a terribly large JVM heap and hasn't been for nearly a decade. I've overseen apps that exceed 12 GB running for weeks, even months at a time, with negligible GC overhead. Are you running a 32-bit JVM? The G1 collector also makes using large heaps much more efficient.
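The heap-sizing advice above can be illustrated with the standard HotSpot flags. This is a hedged sketch, not from the thread: the sizes and `app.jar` are placeholders, and `-XX:+UseG1GC` is redundant on JDK 9+ where G1 is already the default.

```shell
# Run on a 64-bit JVM with a 12 GB heap and the G1 collector.
# "app.jar" is a placeholder for the application being discussed.
java -Xms12g -Xmx12g -XX:+UseG1GC -jar app.jar

# Check whether the JVM is 32- or 64-bit (the comment's question):
java -version   # a 64-bit build reports "64-Bit Server VM"
```

A 32-bit JVM cannot address heaps much beyond 2-4 GB, which is one reason the comment asks about it.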
Mar 2, 2020 at 22:06 answer added candied_orange timeline score: 4
Mar 2, 2020 at 19:25 comment added Berin Loritsch The short answer though is that you load the data in memory in small chunks and iterate through the data. But you've got to answer the basic question of whether the bloom filter is causing more problems than it is solving.
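The load-in-small-chunks-and-iterate approach from the comment above can be sketched like this. It is a minimal, assumed implementation (the class name, batch size, and file layout of one key per line are my own): keys stream through the heap a batch at a time instead of billions of records being resident at once.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical sketch: stream keys from a file in fixed-size batches so only
// one batch is ever held on the heap.
class BatchedReader {
    static void forEachBatch(Path file, int batchSize, Consumer<List<String>> handler)
            throws IOException {
        try (BufferedReader reader = Files.newBufferedReader(file)) {
            List<String> batch = new ArrayList<>(batchSize);
            String line;
            while ((line = reader.readLine()) != null) {
                batch.add(line);
                if (batch.size() == batchSize) {
                    handler.accept(batch);            // process a full batch
                    batch = new ArrayList<>(batchSize); // then start a fresh one
                }
            }
            if (!batch.isEmpty()) {
                handler.accept(batch); // final partial batch
            }
        }
    }
}
```

Each batch could be tested against (or inserted into) the Bloom filter inside the handler, so peak heap use is bounded by the batch size rather than the record count.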
Mar 2, 2020 at 19:22 comment added Berin Loritsch I had an answer, but deleted it because I started thinking about this a bit more. The biggest question is: what problem is the Bloom filter solving? In other words, why does it matter whether the key may exist or not? Is the index so large it is no longer performant? Are you working with a sharded key-value database? You may find the lookup time for a key (or set of keys) is shorter than what the Bloom filter buys you, particularly if it is so large.
Mar 2, 2020 at 16:27 comment added Erik Eidt Have you considered memory mapped files for the filters? They are loaded outside of the JVM heap, and you read from them what you need on demand.
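The memory-mapped-file suggestion above can be sketched with `FileChannel.map`. This is an assumed, minimal illustration (class and method names are my own): the filter's bits live in a file mapped outside the JVM heap, and the OS pages them in on demand. Note that a single `map()` region is limited to 2 GB, so a multi-gigabyte filter would need several mapped regions.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

// Hypothetical sketch: a bit array backed by a memory-mapped file, so the
// bits are stored off-heap and paged in by the OS as they are touched.
class MappedBits {
    private final MappedByteBuffer buffer;

    MappedBits(String path, long sizeInBytes) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(path, "rw");
             FileChannel channel = raf.getChannel()) {
            // The mapping stays valid after the channel is closed.
            buffer = channel.map(FileChannel.MapMode.READ_WRITE, 0, sizeInBytes);
        }
    }

    void set(long bitIndex) {
        int byteIndex = (int) (bitIndex / 8);
        buffer.put(byteIndex, (byte) (buffer.get(byteIndex) | (1 << (int) (bitIndex % 8))));
    }

    boolean get(long bitIndex) {
        int byteIndex = (int) (bitIndex / 8);
        return (buffer.get(byteIndex) & (1 << (int) (bitIndex % 8))) != 0;
    }
}
```

A Bloom filter's hash probes map naturally onto `set`/`get` here, and the heap only ever holds the buffer handle, not the bits themselves.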
Mar 2, 2020 at 16:25 comment added Kilian Foth Is there any particular reason why you need to do this in one JVM, i.e. one process on one machine? Usually, people start sharding or otherwise parallelizing their processing long before the "billions of records" point.
Mar 2, 2020 at 16:10 review First posts (completed Mar 4, 2020 at 20:10)
Mar 2, 2020 at 16:09 history asked Sumit Jha CC BY-SA 4.0