17

I'm writing fairly low level code that must be highly optimized for speed. Every CPU cycle counts. Since the code is in Java I can't write as low level as in C for example, but I want to get everything out of the VM that I can.

I'm processing an array of bytes. There are two parts of my code that I'm primarily interested in at the moment. The first one is:

int key =  (data[i]     & 0xff)
        | ((data[i + 1] & 0xff) <<  8)
        | ((data[i + 2] & 0xff) << 16)
        | ((data[i + 3] & 0xff) << 24);

and the second one is:

key = (key << 15) | (key >>> 17);

Judging from the performance I'm guessing that these statements aren't optimized the way I expect. The second statement is basically a ROTL 15, key. The first statement loads 4 bytes into an int. The 0xff masks are there only to compensate for the added sign bits resulting from the implicit cast to int if the accessed byte happens to be negative. This should be easy to translate to efficient machine code, but to my surprise performance goes up if I remove the masks. (Which of course breaks my code, but I was interested to see what happens.)

What's going on here? Do the most common Java VMs optimize this code during JIT in the way one would expect a good C++ compiler to optimize the equivalent C++ code? Can I influence this process? Setting -XX:+AggressiveOpts seems to make no difference.

(CPU: x64, Platform: Linux/HotSpot)

9
  • 6
    I don't understand how you get to the point where you have to worry about performance, and you know your performance is not already good enough, and you've already tried the broad algorithmic stuff and have determined that you need to micro-optimize, and you have to do it in a specific language that's not super-low-level... yet you don't know how to determine for yourself what the language is doing behind the scenes, and haven't already done a fair amount of research on the topic as a simple matter of personal interest. Optimization of this sort is a pretty specialized skill. Commented Nov 24, 2011 at 11:31
  • 10
    Wow. I didn't know my Q would offend anyone. You don't seem to assume good faith. I'm putting much effort into this problem, but maybe my efforts would seem pathetic to you. In any case I'd be happy to learn from you. To answer your Qs: the language is a given from the product owner. The performance is critical by the nature of the prod. I don't want to use JNI in order not to break the platform indep. The code I'm talking about is deep in the core and where the CPU spends most of its time. That merits my question if you ask me. But even if not: wouldn't it be interesting in its own right? Commented Nov 24, 2011 at 12:27
  • 1
    Oh, and I get 15 times the speed in C++ on the same machine (in terms of data-throughput). That's at least a clue that performance isn't good enough, isn't it? Commented Nov 24, 2011 at 12:33
  • 1
    @Rinke Are you sure these line are the reason why the java version so much slower than the C++ version? Commented Nov 24, 2011 at 15:10
  • 2
    If you get 15times the speedup in C++ chances are extremely good that you're just measuring wrong so post your microbenchmarks for the problem. @jmg Hotspot -c2 generates code about equal to a C compiler for this stuff sure, but last time I checked it didn't do much SSE - which depending on the larger code at hand could certainly give noticeable speedups for the C++ program. Again not much to say without more code. Commented Nov 24, 2011 at 16:00

4 Answers 4

7

How do you test the performance?

Here is a good article:

http://www.ibm.com/developerworks/java/library/j-benchmark1/index.html

http://www.ibm.com/developerworks/java/library/j-benchmark2/index.html

http://ellipticgroup.com/html/benchmarkingArticle.html

Sign up to request clarification or add additional context in comments.

1 Comment

Thanks a lot. Indeed, my measurements turned out to be flawed. Fixed this, but I'm still not as fast as the C++ version.
5

I've done a lot of performance code in Java, I've even coded directly in Bytecode, enough to be sure of a couple of thing : the JIT is a black box with obscure behaviours, the JIT and compilers are incredibly efficient, and the simplest code usually yield the best performance.

This is normal when you think about the GOAL of the JIT: extract the best possible performance from any Java code. When you add that Java is quite a simple and plain language, the simple code will be optimized, and any further trick will generally do no good.

Of course, there are some common pitfalls and gotchas you ought to know, but I see none in your code samples. Were I to optimize your code, I would go straight to the higher level: algorithm. What is the complexity of your code? Can some data be cached? What APIs are used? Etc... There's a seemingly endless pit of performance to be extracted from algorithmic tricks alone.

And if even this is not sufficient, if the language is not fast enough, if your machine is not fast enough, if your algorithm cannot be made any faster, the answer won't lie in "clock cycles", because you might squeeze 20% of efficiency, but 20% will never be enough when your data grow. To be sure you will never hit (again) a performance wall, the ultimate answer lies in scalability: make your algorithm and your data endlessly distributable so you can just throw more workers to the task.

1 Comment

Thanks. I'm sorry ignore the more interesting points of your answer (i.e. in this comment), but can I conclude that from your experience you'd say that (1) the masks in the first statement shouldn't be explicit in the compiled machine code and (2) the second statement should be translated into a single ROTL (or ROTR)?
3

I do agree with solendil, but if you want to dig deeper at the low-level, try getting the code produced by the JIT as described here.

1 Comment

That link is now broken, here's the new one
0

You don't need to (& 0xff) before shifting 24 bits to the left.

Comments

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.