9

This one puzzles me for 3 days. I have an application which needs to evaluate a certain set of integer polynomials (of multiple args) which have very few elements. I already have an implementation written in Java and I am currently porting to C++.

During testing, I noticed that the C++ version is orders of magnitudes slower that the Java variant. I know of course about JIT-ing and that this scenario is particulary well-posed for this kind of compilers, but what I see is way off from what I had expected.

The sample code is below, you'll need boost to compile the C++ code (but that dependency is only required for simple time measurement).

choeger@daishi ~/uebb % clang++ -O3 -std=c++11 polytest.cpp -lboost_timer -lboost_system
choeger@daishi ~/uebb % ./a.out                                                         
 0.011694s wall, 0.010000s user + 0.000000s system = 0.010000s CPU (85.5%)
Ideal Result: 1e+07
 0.421986s wall, 0.420000s user + 0.000000s system = 0.420000s CPU (99.5%)
Result: 1e+07
choeger@daishi ~/uebb % javac PolyTest.java                                             
choeger@daishi ~/uebb % java PolyTest                                                   
evals: 10000000 runtime: 17ms
Ideal Result: 1.0E7
evals: 10000000 runtime: 78ms
Result: 1.0E7

Apparently, the C++ version (compiled with clang-3.3) runs slightly faster when it comes to pure computational power, but Java (openjdk 1.7.0.60) does much better when the polynomial is interpreted. My guess so far is, that my C++ code is not quite optimal due to the iteration over small (in the sample 1-element) vectors. I assume the JVM does much better here when it comes to cache-hits misses.

Is there any way to make my C++ version perform better? Is there a different cause I just did not see? And as a side note: is there way to measure cache-coherence for a C++ and a Java process?

The C++ code looks like this:

#include <boost/timer/timer.hpp>

#include <iostream>
#include <vector>

using namespace std;

struct Product {
  int factor;
  vector<int> fields;
};


class SumOfProducts {
public:
  vector<Product> sum;

  /**
   * evaluate the polynomial with arguments separated by width
   */
  inline double eval(const double* arg, const int width) const {
    double res = 0.0;
    for (Product p : sum) {
      double prod = p.factor;
      for (int f : p.fields) {
    prod *= arg[f*width];
      }
      res += prod;
    }
    return res;
  };
};


double idealBenchmark(const double* arg, const int width) {
  boost::timer::auto_cpu_timer t;
  double res = 0.0;
  // run 10M evaluations 
  for (long l = 0; l < 10000000; l++) {
    res = res + arg[width] * arg[width];
  }
  return res;
}

double benchmark(const double* arg, const SumOfProducts& poly) {
  boost::timer::auto_cpu_timer t;
  double res = 0.0;
  // run 10M evaluations 
  for (long l = 0; l < 10000000; l++) {
    res = res + poly.eval(arg, 1);
  }
  return res;
}

int main() {
  //simple polynomial: x_1^2
  Product p;
  p.factor = 1;
  p.fields.push_back(1);
  p.fields.push_back(1);

  SumOfProducts poly;
  poly.sum.push_back(p);

  double arg[] = { 0, 1 };

  double res = idealBenchmark(arg, 1);
  cout << "Ideal Result: " << res << endl;

  res = benchmark(arg, poly);
  cout << "Result: " << res << endl;
}

The Java version like this:

public class PolyTest {

    static class Product {
    public final int factor;
    public final int[] fields;

    public Product(int pFactor, int[] pFields) {
        factor = pFactor;
        fields = pFields;
    }
    }

    static class SumOfProducts {
    final Product[] sum;

    public SumOfProducts(Product[] pSum) {
        sum = pSum;
    }

    /**
     * evaluate the polynomial with arguments separated by width
     */
    double eval(final double[] arg, final int width) {
        double res = 0.0;
        for (Product p : sum) {
        double prod = p.factor;
        for (int f : p.fields) {
            prod *= arg[f*width];
        }
        res += prod;
        }
        return res;
    }
    }

    static double idealBenchmark(final double[] arg, final int width) {
    final long start = System.currentTimeMillis();
    double res = 0.0;
    long evals = 0;
    // run 10M evaluations 
    for (long l = 0; l < 10000000; l++) {
        evals++;
        res = res + arg[width] * arg[width];
    }
    System.out.println("evals: " + evals + " runtime: " + (System.currentTimeMillis() - start) + "ms");
    return res;
    }

    static double benchmark(final double[] arg, final SumOfProducts poly) {
    final long start = System.currentTimeMillis();
    double res = 0.0;
    long evals = 0;
    // run 10M evaluations 
    for (long l = 0; l < 10000000; l++) {
        evals++;
        res = res + poly.eval(arg, 1);
    }
    System.out.println("evals: " + evals + " runtime: " + (System.currentTimeMillis() - start) + "ms");
    return res;
    }

    public static void main(String[] args) {
    //simple polynomial: x_1^2
    Product p = new Product(1, new int[]{1, 1});

    SumOfProducts poly = new SumOfProducts(new Product[]{p});

    double arg[] = { 0, 1 };

    double res = idealBenchmark(arg, 1);
    System.out.println("Ideal Result: " + res);

    res = benchmark(arg, poly);
    System.out.println("Result: " + res);
    }

}
7
  • 4
    Your Java code use arrays while your C++ code uses vectors. This is not an equal comparison. Commented Jan 8, 2014 at 21:38
  • Orders of magnitude - that's at least 100 times slower then? Commented Jan 8, 2014 at 21:40
  • 1
    Now, the next time some C-hacker tells you that other languages are too slow, you can tell him with first-hand experience: it's not the performance of the language that matters, but how easy the language allows to write efficient programs ;) Commented Jan 8, 2014 at 21:53
  • If I understand your code, you are using the fields vector to indicate the degree of the polynomial value. Is that correct? Commented Jan 8, 2014 at 22:03
  • 2
    @Arian No statement that is not backed up by evidence should be believed, be it from a C-hacker or some other type of programmer. Commented Jan 8, 2014 at 22:08

3 Answers 3

15

You are making expensive copies here:

for (Product p : sum)

Each copy means fully copying the std::vector<int> data member of each element. Use references instead:

for (const Product& p : sum)

Note that I made them const, because you do not need to change the elements of the range.

Sign up to request clarification or add additional context in comments.

9 Comments

And also use an array, not a vector
@DanielNuriyev only if profiling shows that there is an appreciable hit in performance from using a vector. But other things to try first could include changing the range based loop for an index based one.
@DanielNuriyev The addition of move semantics, and proper use of vector's constructors nullifies 3 of the arguments made in that blog post. As always, C++ does not guarantee that your poorly written code will be optimized (I don't know of a single language that does, though).
That nails it. Thanks a lot. Its even not like I did not know about that particular piece of C++ semantics, I just didn't see it in the loop. 3 days of hassling and saved by 1hr of prep for the question + 10min waiting on stackoverflow. Let this never be known to my employers ;).
|
8

For starters, you should change this line

for (Product p : sum)

to become

for (Product const& p: sum)

Every iteration a new Product with its contained std::vector<int> is allocated, copied, and deallocated. I didn't see any other of that but since it is close to the inner loop I'd expect a large impact.

Comments

2

Based on the answer to my question, it looks like you are using the following structure:

struct Product 
{
    int factor;
    vector<int> fields;
};

in a highly inefficient manner. That is, the polynomial 4 x ^ 3 would be stored as

Product p { 4, {1, 1, 1} };

This is incredibly inefficient both in terms of processing power and memory. Instead, if you stored a given term of the polynomial in a predetermined vector:

vector<int> Polynomial { 1, 4, 3, 5 }; // 5x^3 + 3x^2 + 4x + 1

Where the degree of the term is determined by the index. Then, your function to evaluate the polynomial is just:

int evaluate(int x, const std::vector<int>& polynomial)
{
    int result = 0;
    for (std::size_t i = 0; i < polynomial.size(); ++i)
    {
        //        coefficient     x to the correct power
        result += polynomial[i] * std::pow(x, i);
    }
    return result;
}

As a side note: the same optimization can be applied to your Java code.

If you don't want to use std:pow for whatever reason, it is simple enough to implement yourself:

int pow(int x, unsigned int p)
{
    int result = 1;
    for (unsigned int i = 0; i < p; ++i)
    {
        result *= x;
    }
    return result;
}

And if sparse polynomials are your concern:

struct SubPolynomial
{
    int Coefficient;
    unsigned int Degree;
};

std::vector<SubPolynomial> polynomial;

int evaluate(int x, const std::vector<int>& polynomial)
{
    int result = 0;
    std::for_each(polynomial.begin(), polynomial.end(), [&](const SubPolynomial& s)
    {
        //        coefficient     x to the correct power
        result += s.Coefficient * pow(x, s.Degree);
    });
    return result;
}

Note that if you have a full polynomial, you'll be using twice the memory required of the first example. But if you have a sparse polynomial (e.g. a polynomial of degree N with less than N / 2 coefficients being non-zero), you'll be using at most the same amount of memory.

4 Comments

Sorry, but this does not apply. First, I am talking about integer polynomials, so std::pow would probably be a waste of time. Second (and worse), I have multiple arguments in very sparse terms. Thus, I either have to use a lot of ^0 or to jump around the given vector. That being said, I have a Horner-Scheme implementation (using integer pow()) lying around and can simply test it now.
@choeger Even if you didn't like std::pow (for whatever reason), it is simple enough to implement it yourself. Using a vector to hold the degree of a polynomial is a severe waste of space - as I show here, you can use the index for the degree so you only need 1 integer for each coefficient (instead of M(1 + Nk) integers, where M is the number of non-zero coefficients, and Nk is the degree of a kth element). If sparse terms are problematic, you can modify it to hold 2 integers for each term: the coefficient and the degree, which would still be vastly more efficient than what you have.
As I said: I will test an efficient integer-power implementation, but do not expect severe benefits, as my degrees are quite low. Regarding space: Any polynomial can have multiple arguments. I could store the index alongside the degree, but that still requires a vector. It's just my example that had only one argument.
@choeger "Any polynomial can have multiple arguments." If you mean you have have multiple indeterminants, then your example does not match your desired goal. { N {1, 1} } would mean Nxy, not x_1^2 (whatever that is).

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.