
EDIT: following the first answer, I removed the myexp() function, since it contained a bug and was not the main point of the discussion.

I have one simple piece of code that I compiled for different platforms, and I get very different performance results (execution times):

  • Java 8 / Linux: 3.5 seconds

    Execution Command: java -server Test

  • C++ / gcc 4.8.3: 6.22 seconds

Compilation options: -O3

  • C++ / Visual Studio 2015: 1.7 seconds

    Compiler Options: /Og /Ob2 /Oi

It seems that VS has additional options that are not available for the g++ compiler.

My question is: why is Visual Studio (with those compiler options) so much faster than both Java and g++-compiled C++ (with -O3, which I believe is the most aggressive optimization level)?

Below you can find both Java and C++ code.

C++ Code:

#include <cstdio>
#include <ctime>
#include <cstdlib>
#include <cmath>


static unsigned int g_seed;

//Used to seed the generator.
inline void fast_srand( int seed )
{
    g_seed = seed;
}

//fastrand routine returns one integer, similar output value range as C lib.
inline int fastrand()
{
    g_seed = ( 214013 * g_seed + 2531011 );
    return ( g_seed >> 16 ) & 0x7FFF;
}

int main()
{
    static const int NUM_RESULTS = 10000;
    static const int NUM_INPUTS  = 10000;

    double dInput[NUM_INPUTS];
    double dRes[NUM_RESULTS];

    fast_srand(10);

    clock_t begin = clock();

    for ( int i = 0; i < NUM_RESULTS; i++ )
    {
        dRes[i] = 0;

        for ( int j = 0; j < NUM_INPUTS; j++ )
        {
           dInput[j] = fastrand() * 1000;
           dInput[j] = log10( dInput[j] );
           dRes[i] += dInput[j];
        }
     }


    clock_t end = clock();

    double elapsed_secs = double(end - begin) / CLOCKS_PER_SEC;

    printf( "Total execution time: %f sec - %f\n", elapsed_secs, dRes[0]);

    return 0;
}

Java Code:

import java.util.concurrent.TimeUnit;


public class Test
{

    static int g_seed;

    static void fast_srand( int seed )
    {
        g_seed = seed;
    }

    //fastrand routine returns one integer, similar output value range as C lib.
    static int fastrand()
    {
        g_seed = ( 214013 * g_seed + 2531011 );
        return ( g_seed >> 16 ) & 0x7FFF;
    }


    public static void main(String[] args)
    {
        final int NUM_RESULTS = 10000;
        final int NUM_INPUTS  = 10000;


        double[] dRes = new double[NUM_RESULTS];
        double[] dInput = new double[NUM_INPUTS];


        fast_srand(10);

        long nStartTime = System.nanoTime();

        for ( int i = 0; i < NUM_RESULTS; i++ )
        {
            dRes[i] = 0;

            for ( int j = 0; j < NUM_INPUTS; j++ )
            {
               dInput[j] = fastrand() * 1000;
               dInput[j] = Math.log( dInput[j] );
               dRes[i] += dInput[j];
            }
        }

        long nDifference = System.nanoTime() - nStartTime;

        System.out.printf( "Total execution time: %f sec - %f\n", TimeUnit.NANOSECONDS.toMillis(nDifference) / 1000.0, dRes[0]);
    }
}
  • How did you test your performance? How many loops did you test? Did you do a warm-up? Java is a VM-based, non-native language, and as such there is overhead for the JVM loading, and some time is required for the optimizer to kick in. Did you take that into account?
  • The main point is not Java warm-up (consider that the loop is executed 100 million times, and I also executed the same code multiple times without any difference in the result). Java and C++ are already comparable on Linux. I was wondering why VS is so much faster.
  • I assume that you have different hardware on the Linux system and the Windows system? Could you run the Java program on the Windows system, just for comparison's sake?
  • Actually, the Windows system is my machine, while Linux is a 16-core server. I already tried on my machine with the same result. Moreover, I also profiled the Java application and tried different JVM performance arguments (code cache, heap size, compile threshold).
  • Your benchmark is bogus: a very smart compiler, such as a modern C++ compiler, can to varying degrees detect that the input is static, and can either prove that the output does not depend on the input (and is thus allowed to optimize everything away, leaving only the single rand() that influences the printed dRes[0]) or compute the output at compile time. For proper measurements, you need to pass the arguments to your program at run time (a sketch of this follows below).
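
A minimal sketch of that last suggestion - taking the seed at run time so the compiler cannot prove the input is static (an illustration added here, not code from the question):

#include <cstdlib>

static unsigned int g_seed;

inline void fast_srand( int seed ) { g_seed = seed; }

int main( int argc, char* argv[] )
{
    // Reading the seed from the command line prevents the optimizer from
    // treating the input as a compile-time constant and folding the
    // benchmark away.
    fast_srand( argc > 1 ? atoi( argv[1] ) : 10 );

    // ... the rest of the benchmark would go here, unchanged ...

    return 0;
}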

1 Answer


The function

static inline double myexp( double val )
{
    const long tmp = (long)( 1512775 * val + 1072632447 );
    return double( tmp << 32 );
}

gives the warning in MSVC

warning C4293: '<<' : shift count negative or too big, undefined behavior

After changing to:

static inline double myexp(double val)
{
    const long long tmp = (long long)(1512775 * val + 1072632447);
    return double(tmp << 32);
}

the code also takes around 4 seconds in MSVC.

So apparently MSVC optimized a whole lot of stuff out, possibly the entire myexp() function (and maybe other code depending on its result as well) - because it can (remember: undefined behavior).

The lesson: check (and fix) the warnings as well.
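
One of the comments below suggests the portable route: using a fixed-width type from <cstdint> rather than a platform-dependent long. A sketch of that variant (added for illustration, not part of the original answer):

#include <cstdint>

static inline double myexp( double val )
{
    // int64_t is 64 bits on every platform, so shifting by 32 no longer
    // exceeds the width of the type (which was the undefined behavior).
    const int64_t tmp = (int64_t)( 1512775 * val + 1072632447 );
    return double( tmp << 32 );
}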


Note that if I try to print the result inside the function, the MSVC-optimized version gives me (for every call):

tmp: -2147483648
result: 0.000000

I.e., MSVC optimized the undefined behavior into always returning 0. It might also be interesting to look at the assembly output to see what else has been optimized out because of this.


So, after checking the assembly, the fixed version has this code:

; 52   :             dInput[j] = myexp(dInput[j]);
; 53   :             dInput[j] = log10(dInput[j]);

    mov eax, esi
    shr eax, 16                 ; 00000010H
    and eax, 32767              ; 00007fffH
    imul    eax, eax, 1000
    movd    xmm0, eax
    cvtdq2pd xmm0, xmm0
    mulsd   xmm0, QWORD PTR __real@4137154700000000
    addsd   xmm0, QWORD PTR __real@41cff7893f800000
    call    __dtol3
    mov edx, eax
    xor ecx, ecx
    call    __ltod3
    call    __libm_sse2_log10_precise

; 54   :             dRes[i] += dInput[j];

In the original version this entire block is missing; i.e., the call to log10() has apparently been optimized out as well and replaced by a constant at the end (apparently -INF, which is the result of log10(0.0) - in fact, that result might itself be undefined or implementation-defined). Also, the entire myexp() function was replaced by an fldz instruction (basically, "load zero"). So that explains the extra speed :)
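
To make that concrete, here is a rough sketch of what the undefined-behavior build effectively computes, based on my reading of the assembly (not literal compiler output):

#include <cstdio>
#include <cmath>

int main()
{
    static const int NUM_RESULTS = 10000;
    double dRes[NUM_RESULTS];

    // With myexp() folded to 0.0 and log10(0.0) folded to a constant
    // (-INF), every inner-loop iteration adds the same value, so no
    // fastrand() or log10() calls remain - hence the "impossible" speed.
    for ( int i = 0; i < NUM_RESULTS; i++ )
        dRes[i] = 10000 * log10( 0.0 );   // NUM_INPUTS additions of -INF is still -INF

    printf( "%f\n", dRes[0] );   // prints -inf

    return 0;
}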


EDIT

Regarding the performance difference when using the real exp(): the assembly output might give some clues.

In particular, for MSVC you can use these additional parameters:

/FAs /Qvec-report:2

/FAs produces the assembly listing (along with the source code)

/Qvec-report:2 provides useful information about the vectorization status:

test.cpp(49) : info C5002: loop not vectorized due to reason '1304'
test.cpp(45) : info C5002: loop not vectorized due to reason '1106'

The reason codes are documented here: https://msdn.microsoft.com/en-us/library/jj658585.aspx - in particular, MSVC does not seem to be able to vectorize these loops properly. According to the assembly listing, though, it still uses the SSE2 functions (which is still a kind of "vectorization" and improves the speed significantly).
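
For reference, the full MSVC command line with the question's optimization options plus these reporting switches might look something like:

cl /Og /Ob2 /Oi /FAs /Qvec-report:2 Test.cpp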

The corresponding parameters for GCC are:

-funroll-loops -ftree-vectorizer-verbose=1

This gives the following result for me:

Analyzing loop at test.cpp:42
Analyzing loop at test.cpp:46
test.cpp:30: note: vectorized 0 loops in function.
test.cpp:46: note: Unroll loop 3 times

So apparently g++ is not able to vectorize these loops either, but it does unroll them (in the assembly I can see that the loop code is duplicated 3 times), which may also explain its better performance.
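
For reference, the corresponding g++ invocation might look like:

g++ -O3 -funroll-loops -ftree-vectorizer-verbose=1 -o Test Test.cpp

(-ftree-vectorizer-verbose applies to the GCC 4.x series used here; newer GCC versions replace it with the -fopt-info-vec options.)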

Unfortunately, this is where Java falls short, AFAIK: Java does not do any vectorization, SSE2 code generation, or loop unrolling here, so it ends up much slower than the optimized C++ version. See e.g. Do any JVM's JIT compilers generate code that uses vectorized floating point instructions?, where JNI is recommended for better performance (i.e., doing the computation in a C/C++ DLL called from the Java app through the JNI interface).


Comments

  • Does VS give the same warning when compiled as 64-bit?
  • Not sure, I didn't try (I tested with VS2013). But AFAIK long is 4 bytes on 64-bit Windows as well (LLP64).
  • But 64-bit g++ has an 8-byte long, so that might be the reason it works there.
  • Thanks for the reminder. I now remember yet another reason why I love my Mac =P
  • You shouldn't blame the platform when you're using implementation-specific types (the C++ standard only says long is at least 32 bits, possibly more); if you want consistent cross-platform behavior, you should use a fixed-width type like int64_t. See stackoverflow.com/a/13604190/1362755
