3

I wrote a small benchmark in which the program creates 10^8 two-dimensional {float, float} vector structures (stored in a std::vector) and then sums up the squares of their lengths.

Here is the C++ code:

#include <iostream>
#include <chrono>
#include <vector>
#include <array>
#include <cmath>
    
using namespace std;
using namespace std::chrono;
    
const int COUNT = pow(10, 8);
    
class Vec {
public:
    float x, y;
    
    Vec() {}
    
    Vec(float x, float y) : x(x), y(y) {}
    
    float len() {
        return x * x + y * y;
    }
};
    
int main() {
    vector <Vec> vecs;
    
    for(int i = 0; i < COUNT; ++i) {
        vecs.emplace_back(i / 3, i / 5);
    }
    
    auto start = high_resolution_clock::now();
    
    // This loop is timed
    float sum = 0;
    for(int i = 0; i < COUNT; ++i) {
        sum += vecs[i].len();
    }
    
    auto stop = high_resolution_clock::now();
    
    cout << "finished in " << duration_cast <milliseconds> (stop - start).count()
         << " milliseconds" << endl;
    cout << "result: " << sum << endl;
    
    return 0;
}

For which I used this makefile (g++ version 7.5.0):

build:
 g++ -std=c++17 -O3 main.cpp -o program #-ffast-math 
    
run: build
 clear
 ./program

Here is my Java code:

public class MainClass {
    static final int COUNT = (int) Math.pow(10, 8);

    static class Vec {
        float x, y;

        Vec(float x, float y) {
            this.x = x;
            this.y = y;
        }

        float len() {
            return x * x + y * y;
        }
    }

    public static void main(String[] args) throws InterruptedException {

        Vec[] vecs = new Vec[COUNT];

        for (int i = 0; i < COUNT; ++i) {
            vecs[i] = new Vec(i / 3, i / 5);
        }

        long start = System.nanoTime();

        // This loop is timed
        float sum = 0;
        for (int i = 0; i < COUNT; ++i) {
            sum += vecs[i].len();
        }

        long duration = System.nanoTime() - start;
        System.out.println("finished in " + duration / 1000000 + " milliseconds");
        System.out.println("result: " + sum);
    }
}

This was compiled and run using Java 11.0.4.

And here are the results (the average of a few runs, run on Ubuntu 18.04 64-bit):

c++:  262 ms
java: 230 ms

Here are a few things I have tried in order to make the C++ code faster:

  • Use std::array instead of std::vector
  • Use a plain array instead of std::vector (a sketch of this variant follows the list)
  • Use an iterator in the for loop

However, none of the above resulted in any improvement.
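
A rough sketch of that plain-array variant (my reconstruction, reusing the Vec class and COUNT from the code above; not the exact code that was benchmarked):

// Same data and loops, but in a heap-allocated plain array instead of std::vector.
Vec* arr = new Vec[COUNT];
for(int i = 0; i < COUNT; ++i) {
    arr[i] = Vec(i / 3, i / 5);
}

float sum = 0;
for(int i = 0; i < COUNT; ++i) {  // this is the timed part
    sum += arr[i].len();
}
delete[] arr;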

I have noticed a few interesting things:

  1. When I time the whole main() function (allocation + computation), C++ is much better. However, this might be due to the warm-up time of the JVM.
  2. For a lower number of objects, like 10^7, C++ was slightly faster (by a few milliseconds).
  3. Turning on -ffast-math makes the C++ program a few times faster than Java; however, the result of the computation is slightly different. Furthermore, I have read in a few posts that using this flag is unsafe.

Can I somehow modify my C++ code and make it as fast or faster than Java in this case?

  • I don't think such a 15% difference is significant. To answer such a question, one needs to go down to the assembly code, which is even harder with the JVM... So without having all the details (compiler version, OS, ...) I don't think anyone can answer. Commented Feb 20, 2021 at 15:46
  • Possibly due to one binary running 32 bit code, and the other running 64 bit? Commented Feb 20, 2021 at 15:48
  • Thank you for your comments. Both programs indeed run in 64-bit mode. Using reserve did not increase the performance. COUNT is identical in both programs. Commented Feb 20, 2021 at 15:55
  • Yes, as written in the Makefile above, I indeed used -O3 Commented Feb 20, 2021 at 15:57
  • @RoyVaron Using reserve would probably (definitely) improve the performance if the benchmark was of the whole program. I deleted my comment after realizing that the filling-up of the vector isn't actually in the timed section... Commented Feb 20, 2021 at 15:57

1 Answer

5

Try this:

    float sum = std::transform_reduce(
        std::execution::par_unseq,
        begin(vecs), end(vecs),
        0.f,
        std::plus<>{},
        [](auto&& x){
            return x.len();
        }
    );

This explicitly tells the C++ compiler what you are doing: that you are OK with using extra threads, that each loop iteration doesn't depend on the others, and that you want to do the work in floats.

It does mean that the addition may occur out of order compared to what you asked for, so the output value may not be exactly the same.

Live example with the raw loop on one side, and permission to out-of-order add on the other.
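
For completeness, a self-contained version of this suggestion might look like the sketch below (my addition, not the answerer's live example). It assumes a standard library that ships the parallel algorithms; as far as I know, libstdc++ only does so from GCC 9 onward and requires linking with -ltbb, so the g++ 7.5 from the question's Makefile won't build it:

    #include <chrono>
    #include <execution>   // std::execution::par_unseq
    #include <functional>  // std::plus
    #include <iostream>
    #include <numeric>     // std::transform_reduce
    #include <vector>

    struct Vec {
        float x, y;
        float len() const { return x * x + y * y; }
    };

    int main() {
        const int COUNT = 100'000'000;

        std::vector<Vec> vecs;
        vecs.reserve(COUNT);
        for (int i = 0; i < COUNT; ++i) {
            vecs.push_back({float(i / 3), float(i / 5)});
        }

        auto start = std::chrono::high_resolution_clock::now();

        // Parallel + vectorized reduction over the squared lengths.
        float sum = std::transform_reduce(
            std::execution::par_unseq,
            vecs.begin(), vecs.end(),
            0.f,
            std::plus<>{},
            [](const Vec& v) { return v.len(); });

        auto stop = std::chrono::high_resolution_clock::now();

        std::cout << "finished in "
                  << std::chrono::duration_cast<std::chrono::milliseconds>(stop - start).count()
                  << " milliseconds\n";
        std::cout << "result: " << sum << "\n";
    }

Note that dropping the execution policy gives a sequential std::transform_reduce, which still leaves the order of the additions unspecified.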


Further investigation:

So I spun up a godbolt.

In it, I compared this with and without forced vectorization and -ffast-math. Forcing vectorization and -ffast-math resulted in the same assembly code.

The issue is the accumulator. Adding things one at a time into the sum and doing all of the IEEE rounding gives you a different value than accumulating them N at a time in higher-precision floating point values, then storing the result in bulk back into the float.
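
To illustrate what "accumulating them N at a time" means, here is a hand-written 4-way version of the timed loop (my illustration of the idea, reusing vecs and COUNT from the question's code; it is not the compiler's actual output). Each partial sum forms its own dependency chain, roughly like four SIMD lanes, and the additions happen in a different order than the strictly sequential loop, which is why the result can differ slightly:

    // Four independent partial sums, accumulated in interleaved order.
    float partial[4] = {0.f, 0.f, 0.f, 0.f};
    int i = 0;
    for (; i + 4 <= COUNT; i += 4) {
        partial[0] += vecs[i + 0].len();
        partial[1] += vecs[i + 1].len();
        partial[2] += vecs[i + 2].len();
        partial[3] += vecs[i + 3].len();
    }
    // Combine the partial sums, then handle any leftover tail.
    float sum = partial[0] + partial[1] + partial[2] + partial[3];
    for (; i < COUNT; ++i) {
        sum += vecs[i].len();
    }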

If you do -ffast-math you'll get 2x speed and a different accumulation. If you replace the float sum with a double sum, you'll get the same answer as with -ffast-math and vectorization on.

Basically, the clang vectorizer doesn't see an easy way to vectorize the accumulation of the sums without breaking the exact float-precision floating point requirements.
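
As an aside (my addition, not part of the original answer): if the only thing wanted from -ffast-math is permission to reorder this reduction, GCC and Clang also accept the narrower combination -fassociative-math -fno-signed-zeros -fno-trapping-math, which to my understanding is enough for the vectorizer to reassociate the sum without the rest of the fast-math behaviour. The build line from the question's Makefile would then become roughly:

    g++ -std=c++17 -O3 -fassociative-math -fno-signed-zeros -fno-trapping-math main.cpp -o program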


5 Comments

I forgot transform_reduce even exists! How much difference does __attribute__((always_inline)) make? I would expect the compiler to inline that one-line function regardless.
Does the fact that you removed the constructors matter? It makes the class trivially-constructible, so that must matter a bit.
@AyxanHaqverdili I was just iteratively removing "cruft" that could get in the way of the compiler knowing that Vec was just some flat binary data (trivially copyable etc). I doubt it matters. Note that I'm simulating with only 10^7 elements, as godbolt complains if you ask for 10^8. ;)
#pragma omp simd reduction(+:sum) compiled with -fopenmp will give the compiler permission to pretend that FP math is associative for that reduction, without having to use -ffast-math for the whole file if you don't want to (a sketch of this follows these comments).
Also note that using multiple accumulators (elements of a SIMD vector, and multiple SIMD vectors) is generally a good thing for numerical accuracy. The result is different because it probably has less rounding error than the naive scalar version, closer to pairwise summation. (Simd matmul program gives different numerical results / Efficient stable sum of a sorted array in AVX2). It's not true in general that -ffast-math will make your results more accurate, though!
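
To make the OpenMP suggestion from the comment above concrete, here is a minimal sketch (my addition; it reuses vecs and COUNT from the question's code and should be compiled with -fopenmp, or, I believe, -fopenmp-simd if you only want the pragma without pulling in the OpenMP runtime):

    // Allows the compiler to reassociate this one reduction
    // without enabling -ffast-math for the whole file.
    float sum = 0;
    #pragma omp simd reduction(+:sum)
    for (int i = 0; i < COUNT; ++i) {
        sum += vecs[i].len();
    }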
