UPDATE: I tried experimenting with the code myself on 8 cores and found:
In debug build, the insertion sort will trigger some asserts because b-1 can go off the end of the base vector. I just used std::sort() instead of insertion sort.
My above code suggestion doesn't make any difference. Either the compiler or the CPU must already be avoiding the extra fetching I was trying to get rid of.
It did make a difference to avoid the push_back calls in merge(). Apparently they cause a reallocation of the vectors. Instead I adding bounds checking to the merge loop. I guess an alternative would be to use reserve to make the vectors big enough before the copy and append.
I also tried parallelizing merge to see if it really was an issue doing it serially. I could only see to split the work into two parts (not recursively) by merging n/2 elements from the beginning and n-n/2 elements from the end in parallel. That didn't make any difference in total timing values, which further supports memory access being the bottleneck (though an error on my part is quote likely).
Below is the parallel merge code:
template <class In>
void merge_partial(In itl, In itle, In itr, In itre, In b, In e, int n, bool le) {
auto itm = b;
for (;n != 0;n--){
if( le ? (*itl <= *itr) : (*itl >= *itr)) {
*itm++ = *itl++;
if (itl == itle) { // all the rest from the right
for (;n != 0;n--)
*itm++ = *itr++;
break;
}
}
else {
*itm++ = *itr++;
if (itr == itre) { // all the rest from the left
for (;n != 0;n--)
*itm++ = *itl++;
break;
}
}
}
}
...
if(dis > 500) {
tbb::parallel_invoke([&] () { merge_sort_parallel(b, m); },
[&] () { merge_sort_parallel(m, e); });
std::vector<typename In::value_type> left(b, m);
std::vector<typename In::value_type> right(m, e);
auto n = std::distance(b, e);
auto itl = std::begin(left);
auto itle = std::end(left);
auto itr = std::begin(right);
auto itre = std::end(right);
tbb::parallel_invoke(
[&] () { merge_partial(itl, itle, itr, itre, b, e, n / 2, true); },
[&] () { merge_partial(reverse_iterator<In>(itle), reverse_iterator<In>(itl),
reverse_iterator<In>(itre), reverse_iterator<In>(itr),
reverse_iterator<In>(e), reverse_iterator<In>(b),
n - n / 2, false); });
}