1

I have a dictionary as a text file mapping from 2M words to 50k words. I load this file into memory as HashMap<String, String> by reading the file line by line, splitting on a separator and invoking myMap.put(line[0], line[1]). The size of the text file is 45MB, while the HashMap uses 350MB of the heap. My goal is to reduce memory usage without harming lookup speed. myMap.values().size() returns 2M instead of 50k, suggesting that the values are stored as duplicates. Is there a way to make identical values point to the same String object?

Map<String, String> dict = new HashMap<>();
try (FileReader fr = new FileReader(FILE);
        BufferedReader br = new BufferedReader(fr)) {
    String line;
    while ((line = br.readLine()) != null) {
        String key_value[] = line.split(":");
        dict.put(key_value[0], key_value[1].intern());
    }
} catch (Exception e) {
    e.printStackTrace();
}
10
  • 5
    If you have 2M unique words that are mapped to 50k (non unique) words, then you hashmap's size will be 2M. Commented Jul 10, 2013 at 15:30
  • 1
    The hashmaps size is based on the entries therefore the number of keys. Regarding the duplicate values: The JVM does some optimization with string values. As a string is immutable it often uses the same object for equal strings. You can't rely on that but probably your strings are already not duplicated. Commented Jul 10, 2013 at 15:32
  • @assylias I know. My question is how to avoid storing duplicate values. That is allowing multiple keys to point to map to the same object value. Commented Jul 10, 2013 at 15:33
  • @stonedsquirrel. I have already verified that I have 50k values. So there are a lot of duplicated values. Commented Jul 10, 2013 at 15:34
  • Yes, because you have 2M keys. But if a key points to an equals string as another key, it is highly likely that they are pointing to the same string object. Commented Jul 10, 2013 at 15:35

2 Answers 2

5

Regardless of whether or not duplicates point to the same objects, there will still need to be references to these objects, so size should still return the size with duplicates included.

A simple example showing this.

If you want duplicates to point to the same objects, you'll have to do this outside of the HashMap or hope the optimizer takes care of it.

Alternatives to String.intern() as joe776 suggested are possibly with a self-written collection extending some Set (since Set doesn't have a Object get(Object) method) or another HashMap (having objects point to themselves) that allows you to get a reference to the common object.

Sign up to request clarification or add additional context in comments.

1 Comment

I vote for this answer. I gave credit to joe776, though, as he answered first.
2

You can use String.intern() on the values to make them all point to the same instance. But this has other problems like using the PermGenSpace, which is not garbage collected pre-Java 1.7. You would call it like this: myMap.put(line[0], line[1].intern()).

Maybe a map based on a trie is more efficient, but I haven't used that, yet. Also depends on the nature of your Strings. The more alike your keys are, the more space the trie can save.

http://code.google.com/p/trie-map/

Also see Dukeling's answer concerning keys().size() and values().size() and the use of another map to avoid duplicate values.

12 Comments

I'm on Java 1.7, and have just tried line[1].intern(). myMap.values().size() still returns 2M, and memory usage remains the same. I'll try trie if no canonical solutions are provided.
+1 An aternative is to have a Map<String, String> where the key and value are the same. You can lookup the value to see if it has been used before and reuse the same String object. This "interner" map can be discarded when you finish.
@mossaab myMap.values().size() will always return 2M if there are 2M keys.
@mossaab The 2M keys won't change. That's the actual number you have. To check for the number of different values you could try something like new HashSet(myMap.values()).size(). This should give you 50k.
@assylias I now realize that myMap.values().size() will always return 2M. But as memory usage remains the same, this means that either Java already does not store duplicate values, or String.intern() does not do what I need to.
|

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.