Skip to main content
15 events
when toggle format what by license comment
Feb 17, 2023 at 18:57 comment added Peter Cordes @TobySpeight: Yeah. And have a separate function to expand that into an array in case you need that.
Feb 17, 2023 at 15:16 comment added Toby Speight @Peter, I have answered with a suggestion to just return a histogram object - is that the approach you're recommending?
Feb 12, 2023 at 11:51 history edited Matthieu M. CC BY-SA 4.0
added 3 characters in body
Feb 12, 2023 at 9:25 comment added Toby Speight It might be better to avoid your debug builds doing an extra pass over the input - and simultaneously reduce the Magic Number count: for (auto i: nums) { assert(static_cast<unsigned>(i) < sizeof counts); } (there's a sneaky implicit i >=0 hidden there, which warrants a comment). Or use a standard library array, giving us std::array::at() for the debug build.
Feb 12, 2023 at 9:18 comment added Toby Speight Since C++11, we can replace the std::fill and it += with a simple it = std::fill_n(it, counts[i], i), since that function now returns the iterator immediately following the written values (and does the right thing if the count is zero).
Feb 12, 2023 at 4:54 comment added Peter Cordes @xyf: Anyway, your algorithm is interesting and might win on a low-end CPU like ARM Cortex-M3 or something (short pipeline so cheap branching, and only running 1 instruction at a time in-order). Or maybe an old x86 CPU like 486 or Pentium. But counting is fast, and memset / std::fill tends to be highly optimized to run faster than 1 element at a time, e.g. filling 32 bytes at a time. Also, if you don't do that step, you have your data in a much more compact representation that can be used for various things, like prefix-summed so indexing the original data is just a matter of bsearch.
Feb 12, 2023 at 4:52 comment added Peter Cordes And histogramming can be optimized much further with SIMD when there are only 3 unique values. (But not when there are many more; then you'd just use memory and SIMD doesn't help you.) You can optimize the standard histogram algorithm with small numbers of buckets by unrolling over multiple arrays of counts (which you combine at the end, cheaply with SIMD) to hide those store/reload bottlenecks.
Feb 12, 2023 at 4:42 comment added Peter Cordes @xyf: Yours has a bunch of data-dependent branching, which is relatively slow on modern high-performance CPUs compared to loads/stores that hit in L1d cache. Some CPUs (Zen 2 and Zen 4, and Ice Lake) can even remove the store/reload forwarding latency bottleneck of a standard scalar histogram incrementing the same memory location multiple times in a row, by optimizing store-forwarding for the same address (accessed with the same addressing-mode). agner.org/forum/viewtopic.php?t=41 / chipsandcheese.com/2022/11/08/…
Feb 11, 2023 at 21:04 comment added xyf Can you please walk through your algo and how it's better than mines in terms of efficiency specially the iterative part?
Feb 11, 2023 at 20:45 comment added Peter Cordes For highly-redundant data like a large array of 3 unique values, a compact data structure like a 3-entry histogram is often usable instead of a sorted array, so I'd recommend making that available as an output of this function, or split to multiple functions so you can just histogram without wasting time expanding again.
Feb 11, 2023 at 20:43 comment added Peter Cordes For high-performance histogramming of a very small number of unique values, see See Micro Optimization of a 4-bucket histogram of a large array or list for how to count very fast with SIMD for 32-bit elements, in that case C++ with AVX2. With two pcmpeq/psubd in the loop to count 0s and 1s, you can derive the count for 2 from size - ones - zeros; Zen 2 or Ice Lake could potentially handle 8 ints per clock cycle.
Feb 11, 2023 at 20:40 comment added Peter Cordes @pacmaninbw: I think you meant to comment under the question, not this answer.
Feb 11, 2023 at 20:39 comment added pacmaninbw Please do not edit the question, especially the code, after an answer has been posted. Changing the question may cause answer invalidation. Everyone needs to be able to see what the reviewer was referring to. What to do after the question has been answered.
Feb 11, 2023 at 20:37 comment added Peter Cordes That's en.wikipedia.org/wiki/Counting_sort not Radix Sort, as discussed in comments under J_H's answer. Re: choice of var name: I'd avoid i for the loop over nums, e.g. for (auto num : nums) { ++counts[num]; }. Normally i is a loop counter for counted loops, not an arbitrary integer loaded from another array.
Feb 11, 2023 at 16:36 history answered Matthieu M. CC BY-SA 4.0