
I'm in charge of developing an application that sometimes needs to process massive amounts of data from a Greenplum (a PostgreSQL-derived) database. The process involves a Java 8 program running on a server that fetches this data, processes it, and sends the results to another Greenplum database.

I already know that sending data is better done in batches, but what about receiving it? Currently, my program fetches all the data from Database A in one shot, which sometimes causes OutOfMemoryErrors because the dataset can be enormous.

I recently read about database cursors and how they are often presented as a magic solution for fetching large datasets. This seems like it could solve my exact problem.

However, I'm concerned about the trade-offs. I don't have administrative access to the servers—they are legacy systems. Database A is read-only for me, and Database B is read-write, and both have critical resource management needs that I cannot disrupt.

If I start using cursors, what would the impact be on Database A? How is the data actually batched on the server side? For context, I don't believe I have useful indexes on the tables I'm querying. I don't need to sort the fetched data or do anything like that: I just need all the data produced by my query to be processed in full and without duplication.

EDIT

Thanks everyone - your answers helped me understand the topic better. Through further research, I discovered that my database library actually supports automatic cursor handling when certain requirements are met. Given the massive dataset sizes, I'm now exploring streams and iterators as potential solutions.

  • A big one is that you keep a connection occupied for the entire time you need the cursor. Depending on the database, and the length of the job, that might be a big ask, especially if you are in a setup with a connection pooling proxy. Commented Oct 16 at 18:38

2 Answers


I recently read about database cursors and how they are often presented as a magic solution for fetching large datasets. This seems like it could solve my exact problem.

This is a confused idea on a number of levels. It's a common misconception that using a cursor is a choice you make when running a query. In reality, there's always a cursor involved. The only choice you are making is whether you control it explicitly or not. When you query a large set of results, the JDBC library will fetch results in batches. The database will keep an open cursor on your results until you finish reading all the results (batches) or cancel the query.
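To make the batching concrete, below is a minimal sketch of controlling that fetch size through plain JDBC. The connection URL, table, and column names are invented, and one assumption is worth checking against your driver's documentation: with the standard PostgreSQL JDBC driver (which Greenplum is typically reached through), the fetch size is only honored on a forward-only result set with autocommit turned off.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class FetchSizeSketch {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://db-a.example.com:5432/somedb", "reader", "secret")) {
            conn.setAutoCommit(false); // pgjdbc only uses a server-side cursor when autocommit is off
            try (PreparedStatement stmt = conn.prepareStatement(
                    "SELECT id, payload FROM big_table")) {
                stmt.setFetchSize(5000); // rows per round trip, instead of the whole result at once
                try (ResultSet rs = stmt.executeQuery()) {
                    while (rs.next()) {
                        process(rs.getLong("id"), rs.getString("payload"));
                    }
                }
            }
        }
    }

    private static void process(long id, String payload) {
        // stand-in for the real per-row processing
    }
}
```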

Moreover, cursors have nothing to do with your issue and can't solve it. Cursors exist on the database. The 'cursor' object in your Java program is just an abstract representation of that database object. It's not really 'the cursor'.

An OutOfMemoryError (OOME) is a client-side issue. Specifically, it is thrown by a JVM when your program tries to allocate new Objects and there is not enough space in the heap even after a full garbage collection. In other words, the OOME is being caused because you are trying to store all the results in the heap and you either: don't have enough RAM or, your max heap size has been reached.

The simplest possible solution is to increase your max heap size. If you have enough available memory on your system, this should resolve it.

This may work but it's not really considered a great solution. It uses a lot of system resources (RAM) and it's really slow. Instead, if you can, you should try to process the data as you retrieve it. For example, let's say you were trying to find the sum of some field in a table. Instead of storing all the results and then calculating the sum, you could have a running total. This will be faster and use a tiny fraction of the memory. I've seen a lot of confusion around this approach where people think you need to explicitly use a cursor to do this. That might explain where you are getting this idea from.
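As a hedged sketch of that running-total idea (the column name and the prepared statement are assumed, not taken from the question):

```java
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

final class RunningTotal {
    // Sums a numeric column while iterating, so only one row is held in memory
    // at a time. The memory-hungry alternative, collecting every value into an
    // ArrayList and summing it afterwards, is what blows the heap on huge results.
    static long sumAmounts(PreparedStatement stmt) throws SQLException {
        long total = 0;
        try (ResultSet rs = stmt.executeQuery()) {
            while (rs.next()) {
                total += rs.getLong("amount"); // hypothetical column name
            }
        }
        return total;
    }
}
```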

If you are using some sort of tooling that puts you a step (or more) away from interacting with the DB drivers directly, take a look at its documentation for things like 'paging', 'batching', or 'streaming' results.

If you really need all of these results in memory at one time and you simply don't have the space to store them, make sure you aren't storing unnecessary information or storing it in an inefficient way. For example, if you have a lot of UUIDs in your data and they are in string format, you could convert them to UUID objects. A String UUID in the standard 36-character format takes up around 76 bytes, while a UUID object is more like 16 (plus a little overhead). It might not seem like much, but a 75% reduction adds up if you have a lot of UUIDs. If you have a lot of repeating values, make sure you aren't storing a separate object for each one. This can be as simple as a local HashMap or, if you need to get fancy, you could use the Flyweight pattern.
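For the repeating-values case, here is a minimal sketch of such a local cache; the class name is invented and this is a lightweight stand-in for a full Flyweight implementation:

```java
import java.util.HashMap;
import java.util.Map;

final class ValueCache<T> {
    private final Map<T, T> cache = new HashMap<>();

    // Returns a canonical instance: the first time a value is seen it is stored,
    // and later equal values reuse that single stored instance instead of keeping
    // their own copy on the heap.
    T canonical(T value) {
        T existing = cache.putIfAbsent(value, value);
        return existing != null ? existing : value;
    }
}

// Hypothetical usage: countryCache.canonical(rs.getString("country")), so a
// million rows with 200 distinct countries hold on to roughly 200 String objects.
```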

  • This is all true in general, but possibly not helpful in OP's situation. If you naively try to read all records and then try to process them, then you may indeed see an OutOfMemoryError that using a Java cursor and interleaving fetching and processing would solve (but of course, the essential change is the interleaving, not the cursor object). Commented Oct 18 at 7:35
  • @KilianFoth "OutOfMemoryError that using a Java cursor and interleaving fetching and processing" I'm not sure I follow. You can do this in Java without using an explicit cursor. You can set your fetch size to 1 if you like but I wouldn't recommend it. Actually the default of 10 per the link is probably way too low for common use cases. Commented Oct 19 at 20:43

Currently, my program fetches all the data from Database A in one shot, which sometimes causes OutOfMemoryErrors because the dataset can be enormous.

JimmyJames already explained that the cursor is present anyway, and the major distinction is how you use the data that comes from the database—do you put it all in memory, or do you process it row by row.

One important aspect is that in general, that is, unless you have specific reasons to do otherwise, you should try to process the data as it arrives. This guarantees that the process will scale as the quantity of data grows. If the entire process was using 10 KB of memory when you had only a thousand rows in your table, chances are it will still use the same 10 KB of memory with a billion rows.

Two terms you may look for are streaming and iterators. Both concepts are very similar and related to processing the data on the fly.
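To tie that back to the database case, here is a minimal sketch of exposing a JDBC ResultSet as a Java Iterator so downstream code consumes rows lazily. The class is invented for illustration, only reads the first column as a String, and glosses over resource cleanup.

```java
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.Iterator;
import java.util.NoSuchElementException;

final class ResultSetIterator implements Iterator<String> {
    private final ResultSet rs;
    private Boolean hasNextCached; // remembers the last rs.next() result

    ResultSetIterator(ResultSet rs) {
        this.rs = rs;
    }

    @Override
    public boolean hasNext() {
        if (hasNextCached == null) {
            try {
                hasNextCached = rs.next(); // advances the underlying cursor
            } catch (SQLException e) {
                throw new IllegalStateException(e);
            }
        }
        return hasNextCached;
    }

    @Override
    public String next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        hasNextCached = null; // force a fresh rs.next() on the following call
        try {
            return rs.getString(1); // only the current row is materialized
        } catch (SQLException e) {
            throw new IllegalStateException(e);
        }
    }
}
```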

Say you want to encode a video. There is little use in loading the whole video into memory. Instead, you can encode it as you read it, and write it to the destination as you encode it. The offset into the source file plays the same role as your database cursor here, with the same ability to move forward (and, very rarely, backward). Treating the files as streams makes it possible to process the video on virtually any low-grade PC with very limited RAM.

Or, say, you have a CSV file that you need to analyze to create a bunch of charts for a report. Chances are, such analysis can be done while processing the CSV file line by line. There is a possibility that you would need to use quite a lot of memory for the actual cumulative data (say you're grouping the results by country—you would need to keep those countries and the corresponding data in memory, and the number of countries will grow as you traverse the CSV file). Nevertheless, you can process even a 10 GB CSV file with an iterator with minimal memory impact.
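Here is a minimal sketch of that CSV scenario, assuming a plain comma-separated layout with the country in the first column, a numeric value in the second, and a header line (all of which are assumptions for illustration):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;
import java.util.stream.Stream;

final class CsvByCountry {
    // Reads the file line by line and keeps only one running total per country,
    // so memory grows with the number of distinct countries, not with the file size.
    static Map<String, Long> totalsByCountry(String path) throws IOException {
        Map<String, Long> totals = new HashMap<>();
        try (Stream<String> lines = Files.lines(Paths.get(path))) {
            lines.skip(1) // skip the header line
                 .forEach(line -> {
                     String[] cols = line.split(",");
                     totals.merge(cols[0], Long.parseLong(cols[1].trim()), Long::sum);
                 });
        }
        return totals;
    }
}
```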
