Skip to main content
Spelling and grammar (including changing archaic "make you" imperative syntax)
Source Link
Toby Speight
  • 88.4k
  • 14
  • 104
  • 327

Summary

Summary

Because the number of threads that a system can create is finite, if your data set is large enough eventually the thread constructor will throw an exception (system_error) when a thread cannot be created. There is also a performance hit when creating a thread, so making threads for very small intervals is counterproductive.

merge

merge

When copying elements into the local vectors lvec and rvec, you are copying elements sequentially. Using one loop to copy all the elements adds extra overhead and is unnecessary. You can use two loops (and reserve memory space for each), use an appropriate vector constructor (that takes a pair of iterators), or use the insert function to block copy elements from one vector to another. Reserving space for the vector, instead of allowing it to grow, will reduce the contention in the memory allocator (which is not concurrent, creating a bottleneck when two threads both try to allocate memory).

Since you are writing to local vectors, and reading from a region of vec that is only being read by the current thread, the mutex is unnecessary.

To avoid repeated calculation of the size, store lvec.size() and rvec.size() in local variables. Or you can use iterators for all vector accesses. The two loops that finish copying the tail elements from lvec or rvec can use vector's insert function.

There is also some inconsistent indentation in the code, and inconsistent use of braces for single line if-line if/elseelse blocks.

There may be a slight performance hit if multiple threads are writing to elements in vec that are near each other (that share a cache line), but this should be rare and with appropriate choice of range to not create threads (see below) should be avoidable.

mergesort

mergesort

Thread creation is expensive, and the number of threads available in a system is finite. Once the number of running threads exceeds the number available in hardware, there is no performance gain from creating more (and usually a slight hit).

Therefore, once the interval to sort is small enough (64 or 128 elements), make you recursive calls directly, rather than create threads for them. Also including an upper limit on the number of threads to create would avoid the penalty of creating threads that will not execute right away. Determination of appropriate values for these limits would require experimentation to check the effect on the performance.

Summary

Because the number of threads that a system can create is finite, if your data set is large enough eventually the thread constructor will throw an exception (system_error) when a thread cannot be created. There is also a performance hit when creating a thread, so making threads for very small intervals is counterproductive.

merge

When copying elements into the local vectors lvec and rvec, you are copying elements sequentially. Using one loop to copy all the elements adds extra overhead and is unnecessary. You can use two loops (and reserve memory space for each), use an appropriate vector constructor (that takes a pair of iterators), or use the insert function to block copy elements from one vector to another. Reserving space for the vector, instead of allowing it to grow, will reduce the contention in the memory allocator (which is not concurrent, creating a bottleneck when two threads both try to allocate memory).

Since you are writing to local vectors, and reading from a region of vec that is only being read by the current thread, the mutex is unnecessary.

To avoid repeated calculation of the size, store lvec.size() and rvec.size() in local variables. Or you can use iterators for all vector accesses. The two loops that finish copying the tail elements from lvec or rvec can use vector's insert function.

There is also some inconsistent indentation in the code, and inconsistent use of braces for single line if/else blocks.

There may be a slight performance hit if multiple threads are writing to elements in vec that are near each other (that share a cache line), but this should be rare and with appropriate choice of range to not create threads (see below) should be avoidable.

mergesort

Thread creation is expensive, and the number of threads available in a system is finite. Once the number of running threads exceeds the number available in hardware there is no performance gain from creating more (and usually a slight hit).

Therefore, once the interval to sort is small enough (64 or 128 elements), make you recursive calls directly, rather than create threads for them. Also including an upper limit on the number of threads to create would avoid the penalty of creating threads that will not execute right away. Determination of appropriate values for these limits would require experimentation to check the effect on the performance.

Summary

Because the number of threads that a system can create is finite, if your data set is large enough eventually the thread constructor will throw an exception (system_error) when a thread cannot be created. There is also a performance hit when creating a thread, so making threads for very small intervals is counterproductive.

merge

When copying elements into the local vectors lvec and rvec, you are copying elements sequentially. Using one loop to copy all the elements adds extra overhead and is unnecessary. You can use two loops (and reserve memory space for each), use an appropriate vector constructor (that takes a pair of iterators), or use the insert function to block copy elements from one vector to another. Reserving space for the vector, instead of allowing it to grow, will reduce the contention in the memory allocator (which is not concurrent, creating a bottleneck when two threads both try to allocate memory).

Since you are writing to local vectors, and reading from a region of vec that is only being read by the current thread, the mutex is unnecessary.

To avoid repeated calculation of the size, store lvec.size() and rvec.size() in local variables. Or you can use iterators for all vector accesses. The two loops that finish copying the tail elements from lvec or rvec can use vector's insert function.

There is also some inconsistent indentation in the code, and inconsistent use of braces for single-line if/else blocks.

There may be a slight performance hit if multiple threads are writing to elements in vec that are near each other (that share a cache line), but this should be rare and with appropriate choice of range to not create threads (see below) should be avoidable.

mergesort

Thread creation is expensive, and the number of threads available in a system is finite. Once the number of running threads exceeds the number available in hardware, there is no performance gain from creating more (and usually a slight hit).

Therefore, once the interval to sort is small enough (64 or 128 elements), make recursive calls directly, rather than create threads for them. Also including an upper limit on the number of threads to create would avoid the penalty of creating threads that will not execute right away. Determination of appropriate values for these limits would require experimentation to check the effect on the performance.

Source Link
1201ProgramAlarm
  • 7.8k
  • 2
  • 23
  • 39

Summary

Because the number of threads that a system can create is finite, if your data set is large enough eventually the thread constructor will throw an exception (system_error) when a thread cannot be created. There is also a performance hit when creating a thread, so making threads for very small intervals is counterproductive.

merge

When copying elements into the local vectors lvec and rvec, you are copying elements sequentially. Using one loop to copy all the elements adds extra overhead and is unnecessary. You can use two loops (and reserve memory space for each), use an appropriate vector constructor (that takes a pair of iterators), or use the insert function to block copy elements from one vector to another. Reserving space for the vector, instead of allowing it to grow, will reduce the contention in the memory allocator (which is not concurrent, creating a bottleneck when two threads both try to allocate memory).

Since you are writing to local vectors, and reading from a region of vec that is only being read by the current thread, the mutex is unnecessary.

To avoid repeated calculation of the size, store lvec.size() and rvec.size() in local variables. Or you can use iterators for all vector accesses. The two loops that finish copying the tail elements from lvec or rvec can use vector's insert function.

There is also some inconsistent indentation in the code, and inconsistent use of braces for single line if/else blocks.

There may be a slight performance hit if multiple threads are writing to elements in vec that are near each other (that share a cache line), but this should be rare and with appropriate choice of range to not create threads (see below) should be avoidable.

mergesort

Thread creation is expensive, and the number of threads available in a system is finite. Once the number of running threads exceeds the number available in hardware there is no performance gain from creating more (and usually a slight hit).

Therefore, once the interval to sort is small enough (64 or 128 elements), make you recursive calls directly, rather than create threads for them. Also including an upper limit on the number of threads to create would avoid the penalty of creating threads that will not execute right away. Determination of appropriate values for these limits would require experimentation to check the effect on the performance.