Duplicate values stored in HashMap

Question

I have a dictionary as a text file mapping from 2M words to 50k words. I load this file into memory as HashMap<String, String> by reading the file line by line, splitting on a separator and invoking myMap.put(line[0], line[1]). The size of the text file is 45MB, while the HashMap uses 350MB of the heap. My goal is to reduce memory usage without harming lookup speed. myMap.values().size() returns 2M instead of 50k, suggesting that the values are stored as duplicates. Is there a way to make identical values point to the same String object?

Map<String, String> dict = new HashMap<>();
try (FileReader fr = new FileReader(FILE);
        BufferedReader br = new BufferedReader(fr)) {
    String line;
    while ((line = br.readLine()) != null) {
        // Each line has the form key:value.
        String[] keyValue = line.split(":");
        // intern() returns the canonical instance, so equal values share one String object.
        dict.put(keyValue[0], keyValue[1].intern());
    }
} catch (Exception e) {
    e.printStackTrace();
}

If you have 2M unique words that are mapped to 50k (non-unique) words, then your HashMap's size will be 2M.
– assylias, Commented Jul 10, 2013 at 15:30
The HashMap's size is based on its entries, and therefore on the number of keys. Regarding the duplicate values: the JVM does some optimization with string values. Because a String is immutable, the JVM often uses the same object for equal strings. You can't rely on that, but your strings are probably not duplicated already.
– André Stannek, Commented Jul 10, 2013 at 15:32
@assylias I know. My question is how to avoid storing duplicate values, that is, how to let multiple keys map to the same value object.
– mossaab, Commented Jul 10, 2013 at 15:33
@stonedsquirrel I have already verified that I have 50k values. So there are a lot of duplicated values.
– mossaab, Commented Jul 10, 2013 at 15:34
Yes, because you have 2M keys. But if one key maps to a string equal to another key's value, it is highly likely that both point to the same String object.
– André Stannek, Commented Jul 10, 2013 at 15:35

Community · Answer · 2017-05-23 10:32:08Z

5

Regardless of whether duplicates point to the same objects, the map still needs one reference per entry to these objects, so size() still returns the count with duplicates included.

A simple example showing this.
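A minimal sketch of such an example follows. The class name, method name, and sample words are made up for illustration. Both keys refer to the very same String object, yet values().size() still reports 2, because it counts entries rather than distinct objects.

```java
import java.util.HashMap;
import java.util.Map;

class ValuesSizeDemo {
    /** Builds a map in which both keys refer to one shared String instance. */
    static Map<String, String> buildSharedValueMap() {
        String shared = new String("animal");
        Map<String, String> map = new HashMap<>();
        map.put("cat", shared);
        map.put("dog", shared);
        return map;
    }

    public static void main(String[] args) {
        Map<String, String> map = buildSharedValueMap();
        // One entry per key, regardless of how many value objects exist.
        System.out.println(map.values().size());              // 2
        // Identity comparison: both entries reference the same object.
        System.out.println(map.get("cat") == map.get("dog")); // true
    }
}
```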

If you want duplicates to point to the same objects, you'll have to do this outside of the HashMap or hope the optimizer takes care of it.

Alternatives to String.intern(), which joe776 suggested, include a self-written collection that extends some Set implementation (needed because Set has no Object get(Object) method) or a second HashMap in which each object maps to itself. Either one lets you obtain a reference to the common object.
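The second-HashMap idea could look roughly like the sketch below. The class InternerDemo, its canonical method, and the sample lines are hypothetical names chosen for illustration, and Map.putIfAbsent assumes Java 8 or later.

```java
import java.util.HashMap;
import java.util.Map;

class InternerDemo {
    // Pool in which every string maps to itself (the first instance seen).
    private final Map<String, String> pool = new HashMap<>();

    /** Returns the first-seen instance that equals s, registering s if it is new. */
    String canonical(String s) {
        String existing = pool.putIfAbsent(s, s);
        return existing != null ? existing : s;
    }

    public static void main(String[] args) {
        InternerDemo interner = new InternerDemo();
        Map<String, String> dict = new HashMap<>();
        String[] lines = {"cat:animal", "dog:animal", "oak:plant"};
        for (String line : lines) {
            String[] kv = line.split(":");
            // Store the canonical instance instead of the freshly split one.
            dict.put(kv[0], interner.canonical(kv[1]));
        }
        System.out.println(dict.get("cat") == dict.get("dog")); // true
    }
}
```

Once loading has finished, the pool is no longer referenced and can be garbage collected, while the dictionary keeps only one String object per distinct value.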

edited May 23, 2017 at 10:32 by Community Bot

answered Jul 10, 2013 at 15:39

Bernhard Barker


1 Comment

mossaab Over a year ago

I vote for this answer. I gave credit to joe776, though, as he answered first.

Community · Accepted Answer · 2017-05-23 12:05:28Z

2

You can use String.intern() on the values to make all equal values point to the same instance. This has its own problems, though: before Java 7, interned strings are stored in PermGen space, a fixed-size area that is easy to exhaust; since Java 7 they are allocated on the regular heap. You would call it like this: myMap.put(line[0], line[1].intern()).
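A small sketch of what intern() changes follows. The class InternDemo and the helper parseValue are hypothetical, as are the sample lines. Values split from different lines are equal but are separate objects until they are interned.

```java
class InternDemo {
    /** Extracts the value part of a key:value line (hypothetical helper). */
    static String parseValue(String line) {
        return line.split(":")[1];
    }

    public static void main(String[] args) {
        String a = parseValue("cat:animal");
        String b = parseValue("dog:animal");
        System.out.println(a.equals(b));              // true: same characters
        System.out.println(a == b);                   // false: two separate objects
        System.out.println(a.intern() == b.intern()); // true: one canonical object
    }
}
```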

A map based on a trie may be more memory-efficient, but I haven't used one yet. It also depends on the nature of your Strings: the more alike your keys are, the more space the trie can save.

http://code.google.com/p/trie-map/

Also see Dukeling's answer concerning keySet().size() and values().size() and the use of another map to avoid duplicate values.

edited May 23, 2017 at 12:05 by Community Bot

answered Jul 10, 2013 at 15:35

joe776


12 Comments

mossaab Over a year ago

I'm on Java 1.7, and have just tried line[1].intern(). myMap.values().size() still returns 2M, and memory usage remains the same. I'll try a trie if no canonical solution is provided.

Peter Lawrey Over a year ago

+1 An alternative is to have a Map<String, String> where the key and value are the same. You can look up the value to see if it has been used before and reuse the same String object. This "interner" map can be discarded when you finish.

assylias Over a year ago

@mossaab myMap.values().size() will always return 2M if there are 2M keys.

joe776 Over a year ago

@mossaab The 2M keys won't change. That's the actual number you have. To check the number of different values you could try something like new HashSet<>(myMap.values()).size(). This should give you 50k.
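A HashSet counts values that differ by equals(), so it reports 50k whether or not the String objects are shared. To check how many distinct objects are actually stored, an identity-based set can be used instead. The sketch below assumes nothing beyond the JDK; the class and method names are made up for illustration.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.IdentityHashMap;
import java.util.Map;
import java.util.Set;

class DistinctValuesDemo {
    /** Number of values that differ according to equals(). */
    static int distinctByEquals(Map<String, String> map) {
        return new HashSet<>(map.values()).size();
    }

    /** Number of distinct value objects, compared by reference (==). */
    static int distinctByIdentity(Map<String, String> map) {
        Set<String> seen = Collections.newSetFromMap(new IdentityHashMap<String, Boolean>());
        seen.addAll(map.values());
        return seen.size();
    }

    public static void main(String[] args) {
        Map<String, String> map = new HashMap<>();
        map.put("cat", new String("animal"));
        map.put("dog", new String("animal"));
        System.out.println(distinctByEquals(map));   // 1
        System.out.println(distinctByIdentity(map)); // 2: duplicates are stored

        // Replace every value with its interned instance.
        map.replaceAll((key, value) -> value.intern());
        System.out.println(distinctByIdentity(map)); // 1: values are now shared
    }
}
```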

mossaab Over a year ago

@assylias I now realize that myMap.values().size() will always return 2M. But as memory usage remains the same, this means that either Java already does not store duplicate values, or String.intern() does not do what I need to.
