I have written this code:

try (BufferedReader file = new BufferedReader(new FileReader("C:\\Users\\User\\Desktop\\big50m.txt"))) {
    String line;
    StringTokenizer st;

    while ((line = file.readLine()) != null) {
        st = new StringTokenizer(line); // Split the line into integer tokens
        while (st.hasMoreTokens())
            numbers.add(Integer.parseInt(st.nextToken())); // Convert and add to the list of numbers
    }
} catch (Exception e) {
    System.out.println("Can't read the file...");
}

The big50m file has 50,000,000 integers, and I get this runtime error:

Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
    at java.util.Arrays.copyOf(Arrays.java:3332)
    at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:124)
    at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:596)
    at java.lang.StringBuffer.append(StringBuffer.java:367)
    at java.io.BufferedReader.readLine(BufferedReader.java:370)
    at java.io.BufferedReader.readLine(BufferedReader.java:389)
    at unsortedfilesapp.UnsortedFilesApp.main(UnsortedFilesApp.java:37)
C:\Users\User\AppData\Local\NetBeans\Cache\8.2\executor-snippets\run.xml:53: Java returned: 1
BUILD FAILED (total time: 5 seconds)

I think the problem is the String variable named line. Can you tell me how to fix it? I use StringTokenizer because I want fast reading.

  • Have you looked at the file structure? Commented Apr 6, 2017 at 16:22
  • Yes, for example: 100 5 55 75 13 ... (integer1, one space, integer2, ...) Commented Apr 6, 2017 at 16:24
  • Does the file have any \n's in it? Commented Apr 6, 2017 at 16:24
  • Well, that could be your problem. You are trying to read all 50,000,000 numbers at once. Commented Apr 6, 2017 at 16:28
  • @LeAdErQ can you share your way to increase the heap size? Commented Apr 6, 2017 at 16:44

5 Answers

1

Create a BufferedReader from the file and read() it char by char. Collect digit chars into a String, then call Integer.parseInt() on it; skip any non-digit chars and continue parsing at the next digit, and so on.
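The steps above could be sketched roughly as follows. This is only an illustration, not tested against the 50,000,000-integer file: the class name CharByCharParser and the method parse are made up for the example, and, as described, it treats every non-digit as a separator, so negative numbers are not handled.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

// Hypothetical example class; the name is not from the question.
public class CharByCharParser {

    // Reads integers separated by any non-digit characters, one char at a time.
    public static List<Integer> parse(Reader source) throws IOException {
        List<Integer> numbers = new ArrayList<>();
        StringBuilder digits = new StringBuilder(); // reused for every number
        try (BufferedReader in = new BufferedReader(source)) {
            int c;
            while ((c = in.read()) != -1) {
                if (c >= '0' && c <= '9') {
                    digits.append((char) c);
                } else if (digits.length() > 0) {
                    // A non-digit ends the current number
                    numbers.add(Integer.parseInt(digits.toString()));
                    digits.setLength(0);
                }
            }
            // The file may end directly after the last digit
            if (digits.length() > 0) {
                numbers.add(Integer.parseInt(digits.toString()));
            }
        }
        return numbers;
    }

    public static void main(String[] args) throws IOException {
        // A StringReader stands in for new FileReader(...) here
        System.out.println(parse(new StringReader("100 5 55 75 13")));
    }
}
```

Only one short StringBuilder is alive while reading, so the input is never held in memory as a whole. The resulting List still grows with the number of integers.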

0

The readLine() method reads a whole line at once. If the file has no line breaks, that is the entire file, which eats up a lot of memory. This is highly inefficient and does not scale to an arbitrarily big file.

You can use a StreamTokenizer like this:

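// Stream tokens from the file instead of loading one huge line into memory
try (Reader reader = new BufferedReader(new FileReader("bigfile.txt"))) {
    StreamTokenizer tokenizer = new StreamTokenizer(reader);
    tokenizer.parseNumbers(); // default behaviour
    while (tokenizer.nextToken() != StreamTokenizer.TT_EOF) {
        if (tokenizer.ttype == StreamTokenizer.TT_NUMBER) {
            // nval is a double; round it back to an int
            numbers.add((int) Math.round(tokenizer.nval));
        }
    }
}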

I have not tested this code but it gives you the general idea.

1 Comment

Sorry, I didn't realize you are storing the numbers in a List. Unless this is just a toy program, increasing the heap space is no real solution. Memory is an expensive and limited resource. In real problems you will have to read a reasonable amount of data into memory, process it, and proceed until the data is exhausted. The JVM can access only a limited amount of memory at once, so reading all the numbers into a List does not scale. If this is a real-world problem, you should consider caching techniques and process the numbers in parts. I estimate that your approach uses 500 MB+ of memory.
0

Here is a version that minimizes memory usage: no byte-to-char conversion and no String operations. Note that this version does not handle negative numbers.

    public static void main(final String[] args) {
        final Set<Integer> number = new HashSet<>();
        int v = 0;            // value of the number currently being parsed
        boolean use = false;  // true once at least one digit of a number has been read
        int c;
        // Read raw bytes to avoid char conversion; buffer them to avoid one system call per byte
        try (InputStream s = new BufferedInputStream(new FileInputStream("C:\\Users\\User\\Desktop\\big50m.txt"))) {
            // No String allocation in the loop
            while ((c = s.read()) != -1) {
                if (c >= '0' && c <= '9') {
                    v = v * 10 + (c - '0');
                    use = true;
                    continue;
                }
                if (use) number.add(v);
                use = false;
                v = 0;
            }
            if (use) number.add(v); // last number, if the file does not end with a separator
        } catch (final Exception e) {
            System.out.println("Can't read the file...");
        }
    }

2 Comments

Hi SkateScout, I saw that you used a Set, but I don't want a Set; I want to keep duplicate values.
Hi, I wanted a working sample. Since your question did not say what numbers is, it could be any Collection or a self-defined class. The choice does not affect the optimization.
0

When running the program with -Xmx2048m, the provided snippet worked (with one adjustment: numbers declared as List<Integer> numbers = new ArrayList<>(50000000);).

1 Comment

I don't know the number of integers.
0

Since all numbers are within one line, the BufferedReader approach does not work or scale well. The complete file will be read into memory. Therefore the streaming approach (e.g. from @whbogado) is indeed the way to go.

StreamTokenizer tokenizer = new StreamTokenizer(new FileReader("bigfile.txt"));
tokenizer.parseNumbers(); // default behaviour
while (tokenizer.nextToken() != StreamTokenizer.TT_EOF) {
    if (tokenizer.ttype == StreamTokenizer.TT_NUMBER) {
        numbers.add((int)Math.round(tokenizer.nval));
    }
}

Since you write that you are getting a heap space error with this as well, I assume that the streaming is no longer the problem. Unfortunately, you are storing all values in a List; I think that is the problem now. You say in a comment that you do not know the actual count of numbers. Hence you should avoid storing them in a list and apply some kind of streaming here as well.
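To illustrate what "some kind of streaming" could look like, here is a minimal sketch that hands each number to a callback as soon as it is parsed and keeps nothing else. The class name StreamingNumbers and the method forEachInt are assumptions made up for this example; whether it fits depends on what you actually need to do with the numbers.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.util.function.IntConsumer;

// Hypothetical example class; the name is not from the question.
public class StreamingNumbers {

    // Passes each integer token to the consumer right after parsing; nothing is retained here.
    // Returns how many numbers were seen.
    public static long forEachInt(Reader source, IntConsumer consumer) throws IOException {
        long count = 0;
        try (Reader in = new BufferedReader(source)) {
            StreamTokenizer tokenizer = new StreamTokenizer(in);
            while (tokenizer.nextToken() != StreamTokenizer.TT_EOF) {
                if (tokenizer.ttype == StreamTokenizer.TT_NUMBER) {
                    consumer.accept((int) tokenizer.nval);
                    count++;
                }
            }
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        long[] sum = {0}; // one-element array so the lambda can update it
        // A StringReader stands in for new FileReader(...) here
        long count = forEachInt(new StringReader("100 5 55 75 13"), v -> sum[0] += v);
        System.out.println(count + " numbers, sum " + sum[0]); // prints "5 numbers, sum 248"
    }
}
```

With this shape, memory use stays constant no matter how many integers the file contains, as long as the consumer itself does not store them all.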

For all who are interested, here is my little test code (Java 8), which generates a test file containing USED_INT_VALUES integers. I limited it to 5,000,000 integers for now. As you can see when running it, memory use increases steadily while reading through the file. The only place that holds that much memory is the numbers List.

Be aware that initializing an ArrayList with an initial capacity only allocates the internal array of references; it does not allocate the memory needed for the stored objects themselves, in your case the Integers.

import java.io.File;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.io.StreamTokenizer;
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.logging.Level;
import java.util.logging.Logger;

public class TestBigFiles {

    public static void main(String args[]) throws IOException {
        heapStatistics("program start");
        final int USED_INT_VALUES = 5000000;
        File tempFile = File.createTempFile("testdata_big_50m", ".txt");
        System.out.println("using file " + tempFile.getAbsolutePath());
        tempFile.deleteOnExit();

        Random rand = new Random();
        FileWriter writer = new FileWriter(tempFile);
        rand.ints(USED_INT_VALUES).forEach(i -> {
            try {
                writer.write(i + " ");
            } catch (IOException ex) {
                Logger.getLogger(TestBigFiles.class.getName()).log(Level.SEVERE, null, ex);
            }
        });
        writer.close();
        heapStatistics("large file generated - size=" + tempFile.length() + "Bytes");
        List<Integer> numbers = new ArrayList<>(USED_INT_VALUES);

        heapStatistics("large array allocated (to avoid array copy)");

        int c = 0;
        try (FileReader fileReader = new FileReader(tempFile);) {
            StreamTokenizer tokenizer = new StreamTokenizer(fileReader);

            while (tokenizer.nextToken() != StreamTokenizer.TT_EOF) {
                if (tokenizer.ttype == StreamTokenizer.TT_NUMBER) {
                    numbers.add((int) tokenizer.nval);
                    c++;
                    // Report only right after a number was added, so a non-number token cannot repeat the report
                    if (c % 100000 == 0) {
                        heapStatistics("within loop count " + c);
                    }
                }
            }
        }

        heapStatistics("large file parsed, number list size is " + numbers.size());
    }

    private static void heapStatistics(String message) {
        int MEGABYTE = 1024 * 1024;
        //clean up unused stuff
        System.gc();
        Runtime runtime = Runtime.getRuntime();
        System.out.println("##### " + message + " #####");

        System.out.println("Used Memory:" + (runtime.totalMemory() - runtime.freeMemory()) / MEGABYTE + "MB"
                + " Free Memory:" + runtime.freeMemory() / MEGABYTE + "MB"
                + " Total Memory:" + runtime.totalMemory() / MEGABYTE + "MB"
                + " Max Memory:" + runtime.maxMemory() / MEGABYTE + "MB");
    }
}
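
If all values really must be kept in memory, duplicates included, and the count is not known up front, one option is to store primitive ints instead of Integer objects. The sketch below is a hypothetical helper (GrowableIntArray is not a JDK class) with an assumed 1.5x growth factor. An int takes 4 bytes, so 50,000,000 values need roughly 200 MB in the backing array, and each growth step briefly needs room for both the old and the new array.

```java
import java.util.Arrays;

// Hypothetical helper: a minimal growable array of primitive ints (no Integer objects).
public class GrowableIntArray {
    private int[] data = new int[1024];
    private int size = 0;

    public void add(int value) {
        if (size == data.length) {
            // Grow by about 1.5x; the old array becomes garbage after the copy
            data = Arrays.copyOf(data, data.length + (data.length >> 1));
        }
        data[size++] = value;
    }

    public int get(int index) {
        if (index >= size) {
            throw new IndexOutOfBoundsException("index " + index + ", size " + size);
        }
        return data[index]; // a negative index fails with ArrayIndexOutOfBoundsException
    }

    public int size() {
        return size;
    }

    // Returns a copy trimmed to the number of stored values.
    public int[] toArray() {
        return Arrays.copyOf(data, size);
    }

    public static void main(String[] args) {
        GrowableIntArray numbers = new GrowableIntArray();
        for (int i = 0; i < 5000; i++) {
            numbers.add(i % 7); // duplicates are kept, unlike with a Set
        }
        System.out.println(numbers.size() + " values, last = " + numbers.get(4999));
    }
}
```

In the parsing loop, numbers.add((int) tokenizer.nval) would then call this add method instead of boxing each value into an Integer.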
