Timeline for Operate on data that doesn't fit into JVM
Current License: CC BY-SA 4.0
11 events
| when | what | action | by | license | comment |
|---|---|---|---|---|---|
| Mar 7, 2020 at 8:39 | comment | added | rwong | | Using multiple machines, using multiple JVMs (processes), using large-array-aware (and large-memory-aware) JVMs, using arrays-of-arrays on JVMs, should be the first four default answer(s) to this question. |
| Mar 4, 2020 at 21:00 | history | tweeted | | | twitter.com/StackSoftEng/status/1235309178656305155 |
| Mar 3, 2020 at 23:03 | answer | added | JimmyJames | | timeline score: 1 |
| Mar 3, 2020 at 17:08 | comment | added | JimmyJames | | While you probably want to rethink how you are using a Bloom filter, 2GB isn't a terribly large JVM heap and hasn't been for nearly a decade. I've overseen apps that exceed 12GB running for weeks, even months at a time with negligible GC overhead. Are you running a 32 bit JVM? The G1 collector also makes using large heaps much more efficient. |
| Mar 2, 2020 at 22:06 | answer | added | candied_orange | | timeline score: 4 |
| Mar 2, 2020 at 19:25 | comment | added | Berin Loritsch | | The short answer though is that you load the data in memory in small chunks and iterate through the data. But you've got to answer the basic question of whether the bloom filter is causing more problems than it is solving. |
| Mar 2, 2020 at 19:22 | comment | added | Berin Loritsch | | I had an answer, but deleted it because I started thinking about this a bit more. Biggest question is what problem is the Bloom filter solving? In other words, why does it matter if the key may exist or not? Is the index so large it is no longer performant? Are you working with a sharded key-value database? You may find the lookup time for a key (or set of keys) is shorter than what the bloom filter buys you--particularly if it is so large. |
| Mar 2, 2020 at 16:27 | comment | added | Erik Eidt | | Have you considered memory mapped files for the filters? They are loaded outside of the JVM heap, and you read from them what you need on demand. |
| Mar 2, 2020 at 16:25 | comment | added | Kilian Foth | | Is there any particular reason why you need to do this in one JVM, i.e. one process on one machine? Usually, people start sharding or otherwise parallelizing their processing long before the "billions of records" point. |
| Mar 2, 2020 at 16:10 | review | First posts | | | completed Mar 4, 2020 at 20:10 |
| Mar 2, 2020 at 16:09 | history | asked | Sumit Jha | CC BY-SA 4.0 | |
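
Erik Eidt's comment about memory mapped files can be illustrated with a rough sketch. The class below is hypothetical (the name `MappedBitSet` and its methods are not from the question or its answers); it assumes the filter's bits live in a single file of under 2 GB, since one `MappedByteBuffer` cannot address more than `Integer.MAX_VALUE` bytes. A larger filter would presumably need several mappings over regions of the same file.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical bit array backed by a memory-mapped file instead of the JVM heap. */
public class MappedBitSet implements AutoCloseable {
    private final RandomAccessFile file;
    private final MappedByteBuffer buffer;

    public MappedBitSet(Path path, int byteCount) throws IOException {
        file = new RandomAccessFile(path.toFile(), "rw");
        // Size the file up front so the whole mapped region exists on disk.
        if (file.length() < byteCount) {
            file.setLength(byteCount);
        }
        // The mapping lives outside the heap; the OS pages data in on demand.
        buffer = file.getChannel().map(FileChannel.MapMode.READ_WRITE, 0, byteCount);
    }

    /** Sets the bit at the given index. */
    public void set(long bit) {
        int index = (int) (bit >>> 3);          // byte that holds the bit
        int mask = 1 << (int) (bit & 7);        // position of the bit in that byte
        buffer.put(index, (byte) (buffer.get(index) | mask));
    }

    /** Returns true if the bit at the given index is set. */
    public boolean get(long bit) {
        int index = (int) (bit >>> 3);
        int mask = 1 << (int) (bit & 7);
        return (buffer.get(index) & mask) != 0;
    }

    @Override
    public void close() throws IOException {
        buffer.force();                         // flush dirty pages to the file
        file.close();                           // also closes the channel
    }

    public static void main(String[] args) throws IOException {
        Path path = Files.createTempFile("bloom-bits", ".bin");
        try (MappedBitSet bits = new MappedBitSet(path, 1 << 20)) { // 1 MiB of bits
            bits.set(123_456);
            System.out.println(bits.get(123_456));
            System.out.println(bits.get(123_457));
        }
    }
}
```

A Bloom filter would sit on top of this by hashing each key to several bit indexes and calling `set` or `get` on them. Whether this beats a larger heap, as JimmyJames suggests, depends on access patterns and available RAM.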