Timeline for Operate on data that doesn't fit into JVM

Current License: CC BY-SA 4.0

11 events

when	what		by	license	comment
Mar 7, 2020 at 8:39	comment	added	rwong		Using multiple machines, using multiple JVMs (processes), using large-array-aware (and large-memory-aware) JVMs, using arrays-of-arrays on JVMs, should be the first four default answer(s) to this question.
Mar 4, 2020 at 21:00	history	tweeted			twitter.com/StackSoftEng/status/1235309178656305155
Mar 3, 2020 at 23:03	answer	added	JimmyJames		timeline score: 1
Mar 3, 2020 at 17:08	comment	added	JimmyJames		While you probably want to rethink how you are using a Bloom filter, 2GB isn't a terribly large JVM heap and hasn't been for nearly a decade. I've overseen apps that exceed 12GB running for weeks, even months at a time with negligible GC overhead. Are you running a 32-bit JVM? The G1 collector also makes using large heaps much more efficient.
Mar 2, 2020 at 22:06	answer	added	candied_orange		timeline score: 4
Mar 2, 2020 at 19:25	comment	added	Berin Loritsch		The short answer, though, is that you load the data into memory in small chunks and iterate through it. But you've got to answer the basic question of whether the Bloom filter is causing more problems than it is solving.
Mar 2, 2020 at 19:22	comment	added	Berin Loritsch		I had an answer, but deleted it because I started thinking about this a bit more. The biggest question is what problem the Bloom filter is solving. In other words, why does it matter if the key may exist or not? Is the index so large it is no longer performant? Are you working with a sharded key-value database? You may find the lookup time for a key (or set of keys) is shorter than what the Bloom filter buys you, particularly if the filter is so large.
Mar 2, 2020 at 16:27	comment	added	Erik Eidt		Have you considered memory-mapped files for the filters? They are loaded outside of the JVM heap, and you read from them what you need on demand.
Mar 2, 2020 at 16:25	comment	added	Kilian Foth		Is there any particular reason why you need to do this in one JVM, i.e. one process on one machine? Usually, people start sharding or otherwise parallelizing their processing long before the "billions of records" point.
Mar 2, 2020 at 16:10	review	First posts (Mar 4, 2020 at 20:10)
Mar 2, 2020 at 16:09	history	asked	Sumit Jha	CC BY-SA 4.0
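
Erik Eidt's comment above suggests keeping the filters in memory-mapped files so that they live outside the JVM heap. A minimal sketch of that idea follows, assuming the filter can be treated as a plain bit array. The class name `MappedBitArray` and its methods are hypothetical and do not come from the question or its answers. Note that a single `MappedByteBuffer` cannot exceed `Integer.MAX_VALUE` bytes, so a larger filter would need several mappings.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;

/** Hypothetical helper: a fixed-size bit array backed by a memory-mapped file. */
class MappedBitArray implements AutoCloseable {
    private final RandomAccessFile file;
    private final MappedByteBuffer buffer;
    private final long bitCount;

    MappedBitArray(Path path, long bitCount) throws IOException {
        long byteCount = (bitCount + 7) / 8;
        if (bitCount <= 0 || byteCount > Integer.MAX_VALUE) {
            // One mapping is limited to 2 GB; bigger filters need several mappings.
            throw new IllegalArgumentException("bitCount out of range for one mapping");
        }
        this.bitCount = bitCount;
        this.file = new RandomAccessFile(path.toFile(), "rw");
        // Grow the file if needed. Assumption: the platform zero-fills the new bytes,
        // which is the usual behaviour but is not guaranteed by the Java API.
        if (file.length() < byteCount) {
            file.setLength(byteCount);
        }
        // The mapped pages are managed by the OS and are not part of the Java heap.
        this.buffer = file.getChannel().map(FileChannel.MapMode.READ_WRITE, 0, byteCount);
    }

    void set(long bitIndex) {
        checkIndex(bitIndex);
        int byteIndex = (int) (bitIndex >>> 3);
        int mask = 1 << (int) (bitIndex & 7);
        buffer.put(byteIndex, (byte) (buffer.get(byteIndex) | mask));
    }

    boolean get(long bitIndex) {
        checkIndex(bitIndex);
        int byteIndex = (int) (bitIndex >>> 3);
        int mask = 1 << (int) (bitIndex & 7);
        return (buffer.get(byteIndex) & mask) != 0;
    }

    private void checkIndex(long bitIndex) {
        if (bitIndex < 0 || bitIndex >= bitCount) {
            throw new IndexOutOfBoundsException("bit index " + bitIndex);
        }
    }

    @Override
    public void close() throws IOException {
        buffer.force(); // flush modified pages to the file
        file.close();
    }
}
```

A Bloom filter built on such an array would call `set` once per hash function when adding a key and `get` once per hash function when testing one. Only the pages actually touched need to be resident, which is the on-demand reading the comment describes.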