5

This is Sun JDK 1.6u21, x64.

I have a class for the purpose of experimenting with perm gen usage which contains only a single large string (512k characters):

public class Big0 {
     public String bigString =
         "A string with 2^19 characters, should be 1 MB in size";
}

I check the perm gen usage using getUsage().toString() on the MemoryPoolMXBean object for the permanent generation (called "PS Perm Gen" in u21, although it has slightly different names in different versions or with different garbage collectors).
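For reference, the measurement described above can be sketched roughly as follows. This is a minimal sketch, not the exact code I ran: the class name PermGenProbe and the helper findPool are made up for illustration, and matching on the substring "Perm Gen" is an assumption based on the pool name seen in u21. On JVMs without a permanent generation (Java 8 and later) no pool will match, so the sketch lists the available pool names instead.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;

public class PermGenProbe {

    /** Returns the first memory pool whose name contains the fragment, or null if none does. */
    static MemoryPoolMXBean findPool(String nameFragment) {
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if (pool.getName().contains(nameFragment)) {
                return pool;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // "Perm Gen" is an assumed fragment; the full name varies by JVM version and collector.
        MemoryPoolMXBean perm = findPool("Perm Gen");
        if (perm == null) {
            System.out.println("No pool matching \"Perm Gen\"; available pools:");
            for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
                System.out.println("  " + pool.getName());
            }
            return;
        }
        MemoryUsage usage = perm.getUsage();
        System.out.println(perm.getName() + ": " + usage);
    }
}
```

Calling this before and after touching Big0.class, and again after creating an instance, gives the deltas quoted below.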

When I first reference the class, say by reading Big0.class, perm gen jumps by ~500 KB - that's what I'd expect as the constant pool encoding of the string is UTF-8, and I'm using only ASCII characters.

When I actually create an instance of this class, however, perm gen jumps by ~2 MB. Since this is a 1 MB string in memory (2 bytes per UTF-16 character, certainly no surrogates), I'm confused about why the memory usage is double.
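To make the expected numbers concrete, here is a small sketch of the arithmetic. The class and method names are hypothetical, and it uses standard UTF-8 as a stand-in for the class file's modified UTF-8 (the two agree for plain ASCII letters). It only computes the size of the character data and ignores object headers.

```java
import java.nio.charset.Charset;
import java.util.Arrays;

public class StringSizeEstimate {

    /** Builds an ASCII-only string of the given length. */
    static String asciiString(int length) {
        char[] chars = new char[length];
        Arrays.fill(chars, 'A');
        return new String(chars);
    }

    /** Bytes needed for the UTF-8 encoding (one byte per ASCII character). */
    static long utf8Bytes(String s) {
        return s.getBytes(Charset.forName("UTF-8")).length;
    }

    /** Bytes needed for the UTF-16 character data (two bytes per char). */
    static long utf16Bytes(String s) {
        return 2L * s.length();
    }

    public static void main(String[] args) {
        String big = asciiString(1 << 19); // 2^19 = 524288 characters
        System.out.println("UTF-8 bytes:  " + utf8Bytes(big));  // 524288
        System.out.println("UTF-16 bytes: " + utf16Bytes(big)); // 1048576
    }
}
```

So ~500 KB at class load and ~1 MB for the in-memory string are what I would expect; the observed ~2 MB is twice the second figure.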

The same effect occurs if I make the string static. If I used final, it fails to compile as I exceed the limit for constant pool items of 65535 bytes (not sure why leaving final off avoids that either - consider that a bonus question).

Any insight appreciated!

Edit: I should also point out that this occurs with non-static, final non-static, and static strings, but not for final static strings. Since that's already a best practice for string constants, maybe this is of mostly academic interest.

3
  • What if you ran System.gc() a bunch of times after you create an instance of this class, in an effort to clear out all unnecessary cruft from permgen? That is, could there be a fleeting temporary footprint in permgen that leads us to incorrectly conclude there's a higher impact? Commented Feb 23, 2011 at 3:20
  • I did that, no effect unfortunately. I also did the ultimate test - filled up permgen - the app OOMed with it full of these 2.5 MB blocks, without recovering any, so we can pretty much assume they cannot be collected in the current implementation. Commented Feb 23, 2011 at 4:38
  • Is it possible that there are two copies of the string when you make that assignment? One for the literal between the quotes (all strings are immutable) and one stored in "bigString". Because "bigString" has a strong reference to the literal, garbage collection isn't destroying the first copy (the one to the right of the equals sign). The reason why final and static are working is because the compiler is creating a phantom reference. This is low-level stuff for me, so I'm hesitant to post it as an answer. Commented Feb 23, 2011 at 18:28

4 Answers

2

I think it's an artefact of your test class. I created a similar class, then decompiled it with javap.

The [eclipse] java compiler breaks the String literal into chunks, each no longer than 64k. The bytecode for initializing the non-constant field consists of cobbling the source string together with a sequence of StringBuilder operations. Although it is this final gigantic string that is interned, the large atoms it is made of take up space in the constant pool.
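A source-level sketch of what that bytecode amounts to is below. It is an illustration under assumptions, not the compiler's actual output: the class name, the helpers, and the use of 65535 as the chunk size (counted in chars, which equals bytes for ASCII) are all made up for the example.

```java
import java.util.ArrayList;
import java.util.List;

public class ChunkedLiteralSketch {

    // Assumed per-constant limit; counted in chars here, which matches bytes for ASCII text.
    static final int MAX_CHUNK = 65535;

    /** Builds an ASCII-only string of the given length. */
    static String ascii(int length) {
        StringBuilder sb = new StringBuilder(length);
        for (int i = 0; i < length; i++) {
            sb.append('A');
        }
        return sb.toString();
    }

    /** Splits a string into pieces of at most maxChunk chars, like the separate constants. */
    static List<String> split(String s, int maxChunk) {
        List<String> chunks = new ArrayList<String>();
        for (int start = 0; start < s.length(); start += maxChunk) {
            int end = Math.min(s.length(), start + maxChunk);
            chunks.add(s.substring(start, end));
        }
        return chunks;
    }

    /** Glues the pieces back together, as the field initializer does with StringBuilder. */
    static String reassemble(List<String> chunks) {
        StringBuilder sb = new StringBuilder();
        for (String chunk : chunks) {
            sb.append(chunk);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String big = ascii(1 << 19);
        List<String> chunks = split(big, MAX_CHUNK);
        System.out.println("chunks: " + chunks.size()); // 9 for 2^19 chars
        System.out.println("round trip ok: " + reassemble(chunks).equals(big));
    }
}
```

If each chunk is kept alive as its own string constant and the reassembled string is kept as well, the character data exists twice, which would be consistent with roughly 2 MB for a 1 MB string.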


3 Comments

That makes a heck of a lot of sense. I found also that 1 MB of the 2.5 MB is recoverable, if all instances are garbage (non-static case as above), and in that case I guess it's the final string which is released to save that, but the atoms are left behind.
Bonus question: How does static final differ? In this case it only used 1.5 MB. Are the chunks discarded in this case - or is the method completely different?
static final (and private non-static final) permit the java compiler to represent the string solely as a constant in the constant pool. I used jmap -histo:live to measure the size of my constant pool for each of the test cases. YMMV IANAL FWIW.
0

Java characters have a width of 2 bytes per character (regardless of whether it's ASCII or a code point above 255). I think that what you are seeing is the Java VM translating the internal class file storage (modified UTF-8) version of the string into its internal expanded form as soon as the class is initialized (which is done prior to instance creation).

2 Comments

Sure, I accounted for that. My strings were 512k characters, so I would expect them to be 1 MB in-memory (2 bytes per character).
Note also that this doesn't occur at class init in my example above. If I access the class, but don't create an instance, the memory footprint never goes above 500k. Only when I create an instance of my class does it jump another 2 MB.
0

A good memory profiler (I personally use and really like YourKit Java Profiler) should be able to show you where the memory is being used.

1 Comment

I'd like to think so too - I tried MAT, but information on permgen is lacking. In fact, they document that information on interned strings is unfortunately not even available from dumps.
0

While the class file format specifies modified UTF-8 as its storage format for String literals, the internal format of the runtime is UTF-16. A String stores its data in UTF-16 encoding in a char[] (usually; it's implementation-dependent, however). Most characters take up 2 bytes in this encoding (characters outside the BMP take up 4 bytes, as a surrogate pair).

I've seen references to a modified rt.jar that contains a java.lang.String implementation with a specialized code-path/storage for ASCII-only Strings, which cut down on the memory requirement significantly.

Edit: it seems this option has found its way into the normal Oracle JRE since Java 6 Update 21 according to this reference:

-XX:+UseCompressedStrings

Use a byte[] for Strings which can be represented as pure ASCII. (Introduced in Java 6 Update 21 Performance Release)

(Found through this answer).

1 Comment

Sure - but see my numbers above: I am well aware of the storage of characters at runtime. I would expect a 512k character string to take 1 MB, but actually 2 MB are used.
