
I'm trying to load large CSV-formatted files (typically 200-600 MB) efficiently with Java, using as little memory as possible while keeping access as fast as possible. Currently, the program uses a List of String arrays. This operation was previously handled by a Lua program that used a table for each CSV row and an outer table to hold the row tables.

Below is an example of the memory differences and load times:

  • CSV file - 232 MB
  • Lua - 549 MB in memory - 157 seconds to load
  • Java - 1,378 MB in memory - 12 seconds to load

If I remember correctly, duplicate items in a Lua table exist as references to a single actual value. I suspect that in the Java version the List holds separate copies of each duplicate value, which may explain the larger memory usage.

Below is some background on the data within the CSV files:

  • Each field consists of a String
  • Specific fields within each row may hold one of a small set of Strings (e.g. field 3 could be "red", "green", or "blue").
  • There are many duplicate Strings within the content.

Below are some examples of what may be required of the loaded data:

  • Search through all Strings, attempting to match a given String, and return the matches.
  • Display matches in a GUI table (sortable by field).
  • Alter or replace Strings.

My question: is there a collection that will require less memory to hold the data, yet still offer features to easily and quickly search and sort it?

  • If you know that column 3 only holds a few possible values, you could intern them to reduce the memory usage (see the sketch after these comments). See also: stackoverflow.com/a/1855195/829571 Commented Nov 11, 2012 at 15:46
  • Thanks assylias I will run some tests using that. Do you know if it's efficient for short Strings, e.g. "To" or "Go"? Most of the fields contain strings of 45+ characters; however, some are quite short (4 or fewer). Commented Nov 11, 2012 at 15:52
  • Have a look at stackoverflow.com/questions/12792942/… Commented Nov 11, 2012 at 15:52
  • @PeterLawrey Nice one - how does it perform vs. intern()? Commented Nov 11, 2012 at 15:56
  • @assylias It is faster and scales better, but it only works on a best-effort basis; you would still get duplicates if your pool size is smaller than the number of unique objects. Commented Nov 11, 2012 at 16:05
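
A minimal sketch of the interning idea from the first comment, assuming a hypothetical parseRow helper, a naive comma split, and that only the known low-cardinality column (field 3) is interned. Note that on older JVMs the intern pool lives in PermGen, so interning unbounded sets of values can be risky:

// Hedged sketch: intern only the low-cardinality column so that every row
// holding "red", "green" or "blue" shares one String instance per value.
String[] parseRow(String line) {
    String[] fields = line.split(","); // naive CSV split, for brevity
    fields[2] = fields[2].intern();    // field 3 comes from a small fixed set
    return fields;
}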

4 Answers


One easy solution: keep a HashMap in which you put a reference to each unique string, and in the ArrayList store only references to the existing unique strings from the HashMap.

Something like:

private final HashMap<String, String> hashMap = new HashMap<String, String>();

public String getUniqueString(String ns) {
    String oldValue = hashMap.get(ns);
    if (oldValue != null) { // I assume there will be no null strings inside the CSV
        return oldValue;    // reuse the instance already in the pool
    }
    hashMap.put(ns, ns);    // first occurrence: this instance becomes the canonical copy
    return ns;
}

Simple usage:

List<String> s = Arrays.asList("Pera", "Zdera", "Pera", "Kobac", "Pera", "Zdera", "rus");
List<String> finS = new ArrayList<String>();
for (String er : s) {
    String ns = getUniqueString(er); // assumes this loop runs in the same class as the pool above
    finS.add(ns);
}
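
A note on the design: the map stores each string as both key and value so that the canonical instance can be retrieved, which a plain HashSet cannot do. On Java 8 and later (newer than this answer), the same pool can be written more compactly with Map.putIfAbsent:

public String getUniqueString(String ns) {
    String oldValue = hashMap.putIfAbsent(ns, ns); // returns the previous value, or null if ns was just inserted
    return oldValue != null ? oldValue : ns;
}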

1 Comment

Sounds like you're trying to optimize something Java already optimizes (saving memory for duplicate strings in memory); no need for such an implementation, see my answer.

DAWG

A directed acyclic word graph is the most efficient way to store words (best for memory consumption anyway).

But it is probably overkill here; as others have said, don't create duplicates, just keep multiple references to the same instance (a rough sketch of the prefix-sharing idea follows).
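
For illustration only, here is a minimal prefix-tree (trie) sketch; this is a deliberate simplification, since a true DAWG also merges common suffixes by collapsing identical subtrees, which is omitted here:

import java.util.HashMap;
import java.util.Map;

// Simplified sketch: a trie shares common prefixes between stored words.
// A real DAWG goes further and also shares common suffixes.
class TrieNode {
    private final Map<Character, TrieNode> children = new HashMap<Character, TrieNode>();
    private boolean endOfWord;

    void insert(String word) {
        TrieNode node = this;
        for (char c : word.toCharArray()) {
            TrieNode next = node.children.get(c);
            if (next == null) {
                next = new TrieNode();
                node.children.put(c, next);
            }
            node = next;
        }
        node.endOfWord = true; // mark that a complete word ends here
    }

    boolean contains(String word) {
        TrieNode node = this;
        for (char c : word.toCharArray()) {
            node = node.children.get(c);
            if (node == null) {
                return false; // prefix not present
            }
        }
        return node.endOfWord;
    }
}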

1 Comment

Thanks, I will look into this option some more. I wouldn't consider anything overkill yet - the more efficient this is, the more data can be loaded per session, and that's better for the end user.

To address your memory problem, I advise using the Flyweight pattern, especially for fields that have a lot of duplicates.

As a collection you can use a TreeSet or TreeMap.

If you give a good implementation to your LineItem class (implement equals, hashCode and Comparable) you can optimise the memory use a lot. A sketch follows.
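
A hedged sketch of what that might look like; the field names and the pooling helper here are illustrative assumptions, not part of the original answer:

import java.util.HashMap;
import java.util.Map;

// Flyweight sketch: field values are routed through a shared pool so that
// duplicate strings are stored once; equals/hashCode/compareTo make the
// class usable in a TreeSet or TreeMap. Field names are placeholders.
final class LineItem implements Comparable<LineItem> {
    private static final Map<String, String> POOL = new HashMap<String, String>();

    private static String pooled(String s) { // not thread-safe; for illustration only
        String existing = POOL.get(s);
        if (existing != null) {
            return existing;
        }
        POOL.put(s, s);
        return s;
    }

    private final String name;  // e.g. field 1
    private final String color; // e.g. field 3: "red", "green" or "blue"

    LineItem(String name, String color) {
        this.name = pooled(name);
        this.color = pooled(color);
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof LineItem)) return false;
        LineItem other = (LineItem) o;
        return name.equals(other.name) && color.equals(other.color);
    }

    @Override
    public int hashCode() {
        return 31 * name.hashCode() + color.hashCode();
    }

    @Override
    public int compareTo(LineItem other) { // sort by name, then color
        int c = name.compareTo(other.name);
        return c != 0 ? c : color.compareTo(other.color);
    }
}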



Just as a side note.

For the duplicate string data you are worried about, you don't need to do anything, as Java itself takes care of that: all strings are final, and all references target the same object in memory.

So I'm not sure how Lua does the job, but in Java it should also be quite efficient.

8 Comments

  • But if this is true, then equals is unnecessary at all and == will do the job for comparison.
  • Well, equals is the correct way, as it's how you should compare objects in Java; == would work too, but only as a kind of side effect, due to the way the JVM internally handles strings.
  • Well, I'm not sure how much memory the JVM internally reserves for keeping string references, but I'm pretty sure that in a large enough program == will not work.
  • You're joking, right? See: stackoverflow.com/questions/767372/java-string-equals-versus (Michal Bernhard's reply). Why would the JVM reference only some strings (not all) in such an optimized way?
  • Yup, but in this problem example, I'm pretty sure @user1816198 will not have a bunch of static "Some String" strings but a full load of dynamic strings (I suppose he will be using StringBuilders or something to parse the CSV). Try, for example, this simple program: String a = "Hello,what s up,baby,Hello,baby"; String[] b = a.split(","); for (String c : b) { System.out.println(c); } if (b[0] == b[3]) { System.out.println("Equals Hello"); } if (b[2] == b[4]) { System.out.println("Equals baby"); } Both ifs are false on my machine. (A runnable version follows these comments.)
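
For reference, a runnable version of the snippet from the last comment, illustrating the literal-vs-runtime-string distinction being debated above:

public class StringPoolDemo {
    public static void main(String[] args) {
        // String literals are pooled by the JVM at class-loading time:
        String x = "Hello";
        String y = "Hello";
        System.out.println(x == y); // true: both refer to the same pooled literal

        // Strings produced at runtime (e.g. by split) are NOT pooled automatically:
        String a = "Hello,what s up,baby,Hello,baby";
        String[] b = a.split(",");
        System.out.println(b[0] == b[3]);      // false: equal content, distinct objects
        System.out.println(b[2] == b[4]);      // false
        System.out.println(b[0].equals(b[3])); // true: equals compares content

        // intern() maps a runtime string onto the shared pool:
        System.out.println(b[0].intern() == b[3].intern()); // true
    }
}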
