
I'm trying to load large CSV-formatted files (typically 200-600 MB) efficiently with Java, using as little memory as possible while keeping access as fast as possible. Currently, the program uses a List of String arrays. This operation was previously handled by a Lua program that used a table for each CSV row and an outer table to hold the row tables.

Below is an example of the memory differences and load times:

  • CSV file - 232 MB
  • Lua - 549 MB in memory - 157 seconds to load
  • Java - 1,378 MB in memory - 12 seconds to load

If I remember correctly, duplicate items in a Lua table exist as references to a single actual value. I suspect that in the Java version the List holds separate copies of each duplicate value, which may explain the larger memory usage.

Below is some background on the data within the CSV files:

  • Each field consists of a String
  • Specific fields within each row may hold one of a small set of Strings (e.g. field 3 could be "red", "green", or "blue").
  • There are many duplicate Strings within the content.

Below are some examples of what may be required of the loaded data:

  • Search through all Strings, attempting to match a given String, and return the matches.
  • Display matches in a GUI table (sortable by field).
  • Alter or replace Strings.

My question: is there a collection that will require less memory to hold the data, yet still offer features to easily and quickly search and sort it?

  • If you know that column 3 only holds a few possible values, you could intern them to reduce the memory usage (see the sketch after these comments). See also: stackoverflow.com/a/1855195/829571 Commented Nov 11, 2012 at 15:46
  • Thanks assylias I will run some tests using that. Do you know if it's efficient for short Strings, e.g. "To" or "Go"? Most of the fields contain strings of 45+ characters; however, some are quite short (4 or fewer). Commented Nov 11, 2012 at 15:52
  • Have a look at stackoverflow.com/questions/12792942/… Commented Nov 11, 2012 at 15:52
  • @PeterLawrey Nice one - how does it perform vs. intern()? Commented Nov 11, 2012 at 15:56
  • @assylias It is faster and scales better, but it only works on a best-effort basis; you would still get duplicates if your pool size is smaller than the number of unique objects. Commented Nov 11, 2012 at 16:05
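
A minimal sketch of the interning idea from the first comment, assuming a hypothetical parseRow helper, a naive comma split, and that only the known low-cardinality column (field 3) is interned. Note that on older JVMs the intern pool lives in PermGen, so interning unbounded sets of values can be risky:

// Hedged sketch: intern only the low-cardinality column so that every row
// holding "red", "green" or "blue" shares one String instance per value.
String[] parseRow(String line) {
    String[] fields = line.split(","); // naive CSV split, for brevity
    fields[2] = fields[2].intern();    // field 3 comes from a small fixed set
    return fields;
}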

4 Answers


One easy solution: keep a HashMap in which you put a reference to each unique string, and in the ArrayList store only references to the existing unique strings from the HashMap.

Something like:

private final HashMap<String, String> hashMap = new HashMap<String, String>();

public String getUniqueString(String ns) {
    String oldValue = hashMap.get(ns);
    if (oldValue != null) { // I assume there will be no null strings inside the CSV
        return oldValue;    // reuse the instance already in the pool
    }
    hashMap.put(ns, ns);    // first occurrence: this instance becomes the canonical copy
    return ns;
}

Simple usage:

List<String> s = Arrays.asList("Pera", "Zdera", "Pera", "Kobac", "Pera", "Zdera", "rus");
List<String> finS = new ArrayList<String>();
for (String er : s) {
    String ns = getUniqueString(er); // assumes this loop runs in the same class as the pool above
    finS.add(ns);
}
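
A note on the design: the map stores each string as both key and value so that the canonical instance can be retrieved, which a plain HashSet cannot do. On Java 8 and later (newer than this answer), the same pool can be written more compactly with Map.putIfAbsent:

public String getUniqueString(String ns) {
    String oldValue = hashMap.putIfAbsent(ns, ns); // returns the previous value, or null if ns was just inserted
    return oldValue != null ? oldValue : ns;
}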

1 Comment

Sounds like you're trying to optimize something Java already optimizes (saving memory for duplicate strings in memory); no need for such an implementation, see my answer.

DAWG

A directed acyclic word graph is the most efficient way to store words (best for memory consumption anyway).

But it is probably overkill here; as others have said, don't create duplicates, just keep multiple references to the same instance (a rough sketch of the prefix-sharing idea follows).
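
For illustration only, here is a minimal prefix-tree (trie) sketch; this is a deliberate simplification, since a true DAWG also merges common suffixes by collapsing identical subtrees, which is omitted here:

import java.util.HashMap;
import java.util.Map;

// Simplified sketch: a trie shares common prefixes between stored words.
// A real DAWG goes further and also shares common suffixes.
class TrieNode {
    private final Map<Character, TrieNode> children = new HashMap<Character, TrieNode>();
    private boolean endOfWord;

    void insert(String word) {
        TrieNode node = this;
        for (char c : word.toCharArray()) {
            TrieNode next = node.children.get(c);
            if (next == null) {
                next = new TrieNode();
                node.children.put(c, next);
            }
            node = next;
        }
        node.endOfWord = true; // mark that a complete word ends here
    }

    boolean contains(String word) {
        TrieNode node = this;
        for (char c : word.toCharArray()) {
            node = node.children.get(c);
            if (node == null) {
                return false; // prefix not present
            }
        }
        return node.endOfWord;
    }
}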

1 Comment

Thanks, I will look into this option some more. I wouldn't consider anything overkill yet - the more efficient this is, the more data can be loaded per session, and that's better for the end user.

To address your memory problem, I advise using the Flyweight pattern, especially for fields that have a lot of duplicates.

As a collection you can use a TreeSet or TreeMap.

If you give a good implementation to your LineItem class (implement equals, hashCode and Comparable) you can optimise the memory use a lot. A sketch follows.
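
A hedged sketch of what that might look like; the field names and the pooling helper here are illustrative assumptions, not part of the original answer:

import java.util.HashMap;
import java.util.Map;

// Flyweight sketch: field values are routed through a shared pool so that
// duplicate strings are stored once; equals/hashCode/compareTo make the
// class usable in a TreeSet or TreeMap. Field names are placeholders.
final class LineItem implements Comparable<LineItem> {
    private static final Map<String, String> POOL = new HashMap<String, String>();

    private static String pooled(String s) { // not thread-safe; for illustration only
        String existing = POOL.get(s);
        if (existing != null) {
            return existing;
        }
        POOL.put(s, s);
        return s;
    }

    private final String name;  // e.g. field 1
    private final String color; // e.g. field 3: "red", "green" or "blue"

    LineItem(String name, String color) {
        this.name = pooled(name);
        this.color = pooled(color);
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof LineItem)) return false;
        LineItem other = (LineItem) o;
        return name.equals(other.name) && color.equals(other.color);
    }

    @Override
    public int hashCode() {
        return 31 * name.hashCode() + color.hashCode();
    }

    @Override
    public int compareTo(LineItem other) { // sort by name, then color
        int c = name.compareTo(other.name);
        return c != 0 ? c : color.compareTo(other.color);
    }
}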



Just as a side note.

For the duplicate string data you are worried about, you don't need to do anything, as Java itself takes care of that: all strings are final, and all references target the same object in memory.

So I'm not sure how Lua does the job, but in Java it should also be quite efficient.

8 Comments

  • But if this is true, then equals is unnecessary at all and == will do the job for comparison.
  • Well, equals is the correct way, as it's how you should compare objects in Java; == would work too, but only as a kind of side effect, due to the way the JVM internally handles strings.
  • Well, I'm not sure how much memory the JVM internally reserves for keeping string references, but I'm pretty sure that in a large enough program == will not work.
  • You're joking, right? See: stackoverflow.com/questions/767372/java-string-equals-versus (Michal Bernhard's reply). Why would the JVM reference only some strings (not all) in such an optimized way?
  • Yup, but in this problem example, I'm pretty sure @user1816198 will not have a bunch of static "Some String" strings but a full load of dynamic strings (I suppose he will be using StringBuilders or something to parse the CSV). Try, for example, this simple program: String a = "Hello,what s up,baby,Hello,baby"; String[] b = a.split(","); for (String c : b) { System.out.println(c); } if (b[0] == b[3]) { System.out.println("Equals Hello"); } if (b[2] == b[4]) { System.out.println("Equals baby"); } Both ifs are false on my machine. (A runnable version follows these comments.)
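
For reference, a runnable version of the snippet from the last comment, illustrating the literal-vs-runtime-string distinction being debated above:

public class StringPoolDemo {
    public static void main(String[] args) {
        // String literals are pooled by the JVM at class-loading time:
        String x = "Hello";
        String y = "Hello";
        System.out.println(x == y); // true: both refer to the same pooled literal

        // Strings produced at runtime (e.g. by split) are NOT pooled automatically:
        String a = "Hello,what s up,baby,Hello,baby";
        String[] b = a.split(",");
        System.out.println(b[0] == b[3]);      // false: equal content, distinct objects
        System.out.println(b[2] == b[4]);      // false
        System.out.println(b[0].equals(b[3])); // true: equals compares content

        // intern() maps a runtime string onto the shared pool:
        System.out.println(b[0].intern() == b[3].intern()); // true
    }
}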
