UPDATE: I tried experimenting with the code myself on 8 cores and found:

  1. In a debug build, the insertion sort will trigger some asserts because b-1 can go off the end of the base vector. I just used std::sort() instead of the insertion sort.

  2. My code suggestion below (in the original answer) doesn't make any difference. Either the compiler or the CPU must already be avoiding the extra fetching I was trying to get rid of.

  3. It did make a difference to avoid the push_back calls in merge(); apparently they cause a reallocation of the vectors. Instead of using a sentinel, I added bounds checking to the merge loop. I guess an alternative would be to use reserve to make the vectors big enough before the copy and append (see the sketch after this list).

  4. I also tried parallelizing merge to see whether doing it serially really was an issue. The only way I could see to split the work (without doing it recursively) was into two parts: merging n/2 elements from the beginning and n-n/2 elements from the end in parallel. That didn't make any difference in the total timing, which further supports memory access being the bottleneck (though an error on my part is quite likely).
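
As an illustration of the reserve alternative in point 3, here is a minimal sketch. The helper name and the numeric-limits sentinel are my own, and the original merge() is not shown:

  #include <iterator>
  #include <limits>
  #include <vector>

  // Hypothetical helper: copy one half of the range into a temporary that
  // already has room for the half plus a sentinel, so the push_back at the
  // end cannot reallocate.  Assumes a numeric value_type for the sentinel.
  template <class In>
  std::vector<typename In::value_type> copy_half_with_sentinel(In first, In last) {
      using T = typename In::value_type;
      std::vector<T> half;
      half.reserve(std::distance(first, last) + 1);   // one allocation up front
      half.assign(first, last);                       // copy the half
      half.push_back(std::numeric_limits<T>::max());  // append the sentinel
      return half;
  }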

Below is the parallel merge code:

 // Merge n elements from [itl, itle) and [itr, itre) into the range starting at b.
 // le == true takes the smaller element first (ascending merge from the front);
 // le == false is used with reverse iterators to merge the largest elements from the end.
 template <class In>
 void merge_partial(In itl, In itle, In itr, In itre, In b, In e, int n, bool le) {
   auto itm = b;
   for (; n != 0; n--) {
     if (le ? (*itl <= *itr) : (*itl >= *itr)) {
       *itm++ = *itl++;
       if (itl == itle) {          // left exhausted: all the rest from the right
         for (--n; n != 0; n--)    // --n: the element just copied is already counted
           *itm++ = *itr++;
         break;
       }
     }
     else {
       *itm++ = *itr++;
       if (itr == itre) {          // right exhausted: all the rest from the left
         for (--n; n != 0; n--)
           *itm++ = *itl++;
         break;
       }
     }
   }
 }

...

  if (dis > 500) {
      // Sort the two halves in parallel, then merge them from both ends in parallel.
      tbb::parallel_invoke([&] () { merge_sort_parallel(b, m); },
                           [&] () { merge_sort_parallel(m, e); });
      std::vector<typename In::value_type> left(b, m);
      std::vector<typename In::value_type> right(m, e);
      auto n = std::distance(b, e);

      auto itl = std::begin(left);
      auto itle = std::end(left);
      auto itr = std::begin(right);
      auto itre = std::end(right);
      // The first task merges the smallest n/2 elements forward from b; the second
      // merges the largest n - n/2 elements backward from e via reverse iterators.
      tbb::parallel_invoke(
            [&] () { merge_partial(itl, itle, itr, itre, b, e, n / 2, true); },
            [&] () { merge_partial(std::reverse_iterator<In>(itle), std::reverse_iterator<In>(itl),
                         std::reverse_iterator<In>(itre), std::reverse_iterator<In>(itr),
                         std::reverse_iterator<In>(e), std::reverse_iterator<In>(b),
                         n - n / 2, false); });
  }

ORIGINAL ANSWER:

Maybe you're at a memory bandwidth limit. I've done similar experiments with Java threads on an 8-core machine (two quad-core Xeons). Non-memory tasks scaled very well, but memory-intensive tasks did not. Try something more computational just to see whether memory is a limiting factor for you, too.
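
For that first suggestion, here is a minimal compute-only sketch (my own, not from the question's code): it does essentially no memory traffic, so if it scales to 8 cores while the sort does not, memory bandwidth is the likely limit.

  #include <tbb/parallel_for.h>
  #include <chrono>
  #include <cstdio>

  // Pure arithmetic, essentially no memory traffic.
  static double spin(long iters) {
      double x = 1.0;
      for (long i = 0; i < iters; ++i)
          x = x * 1.0000001 + 1e-9;
      return x;
  }

  int main() {
      const long iters = 100000000;   // tune so one spin() takes around a second
      double r[8] = {};
      auto t0 = std::chrono::steady_clock::now();
      tbb::parallel_for(0, 8, [&](int i) { r[i] = spin(iters); });
      auto t1 = std::chrono::steady_clock::now();
      std::printf("8 spins: %.2fs (checksum %.1f)\n",
                  std::chrono::duration<double>(t1 - t0).count(),
                  r[0] + r[1] + r[2] + r[3] + r[4] + r[5] + r[6] + r[7]);
      // Compare with 8 serial spin() calls: a near-8x speedup here but little
      // speedup in the sort points to memory bandwidth as the limit.
  }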

Or, in the same spirit, try to make your code less memory-intensive, either with a higher optimization level or by hand-optimizing. For instance, the inner merge loop appears to do three memory reads each iteration when it could do one (plus one write):

  // Assumes the sentinel values the original merge() appends to each vector,
  // so the read-ahead of itl/itr never dereferences past a real end.
  typename In::value_type kl = *itl++;   // keep the two current candidates in
  typename In::value_type kr = *itr++;   // locals instead of re-reading memory
  while (b != e) {
    if (kl <= kr) {
      *b++ = kl;
      kl = *itl++;
    }
    else {
      *b++ = kr;
      kr = *itr++;
    }
  }