
I am trying to read a very large file (~2 GB). The content is one continuous string made of sentences, which I would like to split on '.'. No matter what I try, I end up with an OutOfMemoryError.

    BufferedReader in = new BufferedReader(new FileReader("a.txt"));
    String read = null;
    int i = 0;
    while((read = in.readLine())!=null) {
        String[] splitted = read.split("\\.");
        for (String part: splitted) {
            i+=1;
            users.add(new User(i,part));
            repository.saveAll(users);
        }
    }

I also tried:

    FileInputStream inputStream = new FileInputStream(path);
    Scanner sc = new Scanner(inputStream, "UTF-8");
    while (sc.hasNextLine()) {
        String line = sc.nextLine();
        // System.out.println(line);
    }
    // note that Scanner suppresses exceptions
    if (sc.ioException() != null) {
        throw sc.ioException();
    }

Content of the file (composed of random words, with a full stop after every 10 words):

fmfbqi .xcdqnjqln kvjhw pexrbunnr cgvrqlr fpaczdegnb puqzjdbp gcfxne jawml aaiwwmo ugzoxn .opjc fmfbqi .xcdqnjqln kvjhw pexrbunnr cgvrqlr fpaczdegnb puqzjdbp gcfxne jawml aaiwwmo ugzoxn .opjc (and so on)

Please help!

  • The naked truth is: if the file is too big for your memory and you try to read it all in at once, then it'll fail no matter what. Do you actually need it all in memory at once for some reason? Or can you just process it chunk by chunk (in whatever chunks you want: maybe lines, maybe cut off after every full stop)? The latter will always be easier on the resource requirements, if it's possible. (See the sketch after these comments.)
  • 3
    Have you tried java.nio.file.Files.lines()? That streams it... But if you are storing the split values somewhere in your memory, you might get out of memory no matter how you read the file. Commented Feb 26, 2020 at 15:13
  • 2
    If the line doesn't have occasionaly newline characters in it (i.e. it's all one line), then Files.lines() will run into basically the same problem and you'll need to process chunk-by-chunk in some other way. Commented Feb 26, 2020 at 15:15
  • 1
    Then use a database or store parts of it in files... Commented Feb 26, 2020 at 15:15
  • 2
    If there are no newlines, then there is only a single line and thus only one line number. Commented Feb 26, 2020 at 15:27

1 Answer


So, first and foremost, based on the comments on your question, as Joachim Sauer stated:

If there are no newlines, then there is only a single line and thus only one line number.

So your use case is faulty, at best.

Let's move past that and assume maybe there are newline characters - or better yet, assume that the . character you're splitting on is intended to be a pseudo-newline replacement.

Scanner is not a bad approach here, though there are others. Since you provided a Scanner, let's continue with that, but you want to make sure you're wrapping it around a BufferedReader. You clearly don't have a lot of memory, and a BufferedReader lets you read 'chunks' of the file while keeping the buffering completely transparent to you as a caller, so you can use the Scanner's functionality as usual:

    Scanner sc = new Scanner(new BufferedReader(new FileReader(new File("a.txt")), 10 * 1024 * 1024)); // buffer size is in chars

What this is basically doing is letting the Scanner function as you expect, while allowing you to buffer 10MB at a time and keep your overall memory footprint bounded. Now, you just keep calling:

    sc.useDelimiter("\\.");
    for (int i = 0; sc.hasNext(); i++) {
        String pseudoLine = sc.next();
        // store line 'i' in your database for this pseudo-line
        // DO NOT store pseudoLine anywhere else - you don't have memory for it
    }

Since you don't have enough memory, the point to state (and re-state) is: don't store any part of the file in your JVM's heap space after reading it. Read it, use it how you need it, and allow it to be marked for JVM garbage collection. In your case, you mention you want to store the pseudo-lines in a database, so you want to read a pseudo-line, store it in the database, and just discard it.
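
For illustration, that read-store-discard loop could look like the following sketch, where repository and User are the (hypothetical) persistence types from your question - note that it saves each row as it goes instead of accumulating a users list:

    // assuming 'sc' is the Scanner set up above, with the '.' delimiter applied
    int i = 0;
    while (sc.hasNext()) {
        String pseudoLine = sc.next();
        repository.save(new User(++i, pseudoLine)); // persist one row, retain nothing
    }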

There are other things to point out here, such as configuring your JVM arguments, but I hesitate to even mention them, because just setting your JVM memory high is a bad idea too - another brute-force approach. There's nothing wrong with setting your JVM's max heap size higher, but learning memory management is better if you're still learning how to write software. You'll get in less trouble later when you get into professional development.
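
For reference, the max heap size is set with the standard -Xmx flag when launching the JVM (the jar name here is just a placeholder):

    java -Xmx4g -jar your-app.jar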

Also, I mentioned Scanner and BufferedReader because you mentioned them in your question, but I think checking out java.nio.file.Files.lines() as pointed out by deHaar is also a good idea. It basically does the same thing as the code I've explicitly laid out, with the caveat that it still only does one line at a time, without the ability to change what you're 'splitting' on. So if your text file has one single line in it, this will still cause you a problem, and you will still need something like a Scanner to fragment the line out.
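
A short sketch of that approach, assuming the file actually contains newlines (the file name is a placeholder):

    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.stream.Stream;

    // Files.lines() streams the file lazily, one line at a time
    try (Stream<String> lines = Files.lines(Paths.get("a.txt"))) {
        lines.forEach(line -> {
            // process 'line' (e.g. save it to the database), then let it go
        });
    }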


4 Comments

  • Thank you so much for this detailed answer! :)
  • The advice on buffering could be expanded. The point of a BufferedReader is to limit the number of calls to the underlying "source" (ultimately, a syscall to a file read here). It is an optimization: reading chunks of bytes is more efficient than reading single bytes. BUT: you already have a FileReader, which converts bytes to chars and uses an internal buffer to do so. So you're buffering a buffer. This buffering is private to the Readers, so the Scanner does not profit from it. (Plus, the Scanner has its own buffer to accumulate the data it reads.) So these 10MBs are probably kind of "wasted".
  • Well, sort of. The Scanner is there more to abstract away the token parsing. You could eliminate the use of the Scanner if you wanted, but then you could also eliminate the use of Java altogether and read the bytes directly from the file with JNI and an fopen. I'm not saying any approach is wrong - I'm just suggesting that there is some middle ground between ease of coding, speed, and memory management. Eliminating the Scanner is fair, sure, but then you have to go ahead and do hands-on token management beyond the bounds of the buffer yourself.
  • @searchengine27 I'm talking about removing the BufferedReader, which only serves here, IMO, to buffer (BufferedReader) a buffer (the StreamDecoder inside FileReader) that will still be buffered (by the Scanner) no matter what. That seems like a waste of memory for little to no performance gain. I do agree that removing the Scanner would incur more work and open up possible implementation errors compared to keeping it. I did not mean to imply that one should dispose of it. (See the sketch below.)
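
A minimal sketch of what that last comment describes - dropping the BufferedReader and letting the Scanner read (and buffer) the file directly:

    import java.io.File;
    import java.io.FileNotFoundException;
    import java.util.Scanner;

    // Scanner keeps its own internal buffer, so it can read the file directly
    try (Scanner sc = new Scanner(new File("a.txt"), "UTF-8")) {
        sc.useDelimiter("\\.");
        while (sc.hasNext()) {
            String sentence = sc.next();
            // use 'sentence', then let it be garbage collected
        }
    }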
