I am creating a c++ library implementing Java Functional Programming alike interface. In short, the code will look like this:
vector<string> buffer = ... ; // A buffer contains some strings
new IntStream(0, 100).map([](int a){
return (a * 0x2344DDEF) & 0xF;
}).map([=](int a) {
return buffer[a];
}).foreach([](string a) {
cout << a << '\n';
});
Now I want to support parallel evaluation. For the example above, I want to get 100 execution tasks and send them to a thread pool. To do this, I have created EvalOp classes. The stream returns a list of EvalOp objects. They will only perform the actual computation when you invoke EvalOp::eval
template <typename T>
class EvalOp {
public:
virtual T eval() = 0;
};
template <typename FROM, typename TO>
class TransformOp : public EvalOp<TO> {
public:
TO eval() override {
return mapper_(previous_->eval());
}
protected:
unique_ptr<EvalOp<FROM>> previous_;
function<TO(FROM&)> mapper_;
};
template <typename T>
class Stream {
public:
virtual bool isEmpty() = 0;
virtual EvalOp<T> next() = 0;
Stream<N> map(function<N(T)> mapper) {
return new MapStream<N,T>(this, mapper);
}
}
template <typename FROM, typename TO>
class MapStream : public Stream<TO> {
protected:
Stream<FROM>* previous_;
function<TO(FROM)> mapper_;
public:
EvalOp<TO> next() override {
return new TransformOp<FROM, TO>(previous_->next(), mapper_);
}
}
My stream will now return a bunch of EvalOp objects, which you can throw in a thread pool.
This code gets me the correct result. But as it creates many wrapper classes (the EvalOps), the execution is slower. I did a benchmark of the following two tasks:
uint32_t __attribute__ ((noinline)) hash1(uint32_t input) {
return input * 0x12345768;
}
uint32_t __attribute__ ((noinline)) hash2(uint32_t input) {
return ((input * 0x2FDF1234) << 12) * 0x23429459;
}
uint32_t sum = 0;
void summer(uint32_t input) {
sum += input;
}
BENCHMARK(StreamBenchmark, Serial)(State& state) {
for(auto _:state) {
for(int i = 0 ; i < 10000; ++i)
sum += hash2(hash1(i));
}
}
}
BENCHMARK(StreamBenchmark, Wrapper)(State& state) {
for(auto _:state) {
IntStream stream(0, 10000);
stream.map(hash1).map(hash2).foreach(summer);
}
}
From the benchmark result, I see for each element, only 1ns is spent on actual computation and 40ns overhead is spent on the Stream and EvalOp. I am looking for some suggestions to make a more efficient design. Thank you very much!