Yay, a performance question that comes with profiler measurements! Excellent.
function call overhead
Thank you for those three tiny helper functions; they do a good job of explaining what's going on.
However, maybe we'd like to inline them? With your current optimization flags, the compiler may not be able to "see through" the function call boundary, so it could be having a tough time proving some very basic arithmetic facts.
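One way to sketch that (the helper's name and signature here are my invention, not from your code): define tiny helpers `static inline` in the same translation unit, or in a header, so the optimizer can replace the call with the arithmetic itself at `-O2`.

```cpp
// Hypothetical helper, purely illustrative. With the definition visible
// at the call site, the optimizer can fold the call away and prove facts
// like "result < width * height" about the surrounding loop.
static inline int CellIndex(int x, int y, int width) {
    return y * width + x;
}
```

Cross-translation-unit calls can get the same treatment with link-time optimization (`-flto` on gcc/clang).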
Which leads us to...
1-D versus 2-D
Passing in a one-dimensional array seems like a pretty inconvenient way of representing your higher-level business concepts. Couldn't we cast it to a two-dimensional array? I'd even be willing to suffer the cost of a single big memcpy() if it meant the compiler could easily see that (x, y) remain within bounds.
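A minimal sketch of the 2-D view, assuming a compile-time width (the width, function name, and element type are all assumptions for illustration):

```cpp
constexpr int kWidth = 8;  // assumed compile-time width, purely illustrative

// View a flat buffer as rows of kWidth. With the row type visible, the
// compiler can reason about x and y separately, instead of about one big
// multiplied-and-divided index.
inline int ReadCell(const int* flat, int x, int y) {
    const int (*grid)[kWidth] =
        reinterpret_cast<const int (*)[kWidth]>(flat);
    return grid[y][x];  // same storage as flat[y * kWidth + x]
}
```

If the width is only known at runtime, a thin accessor class (or C++23 `std::mdspan`) plays the same role without the cast.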
I know division isn't quite as expensive as it used to be on earlier CPUs. But still, that cellY / anchorStepY expression to recover a y-coordinate seems inconvenient. I'm sad that the row number isn't already available for use inside that loop. (Or perhaps you've arranged for anchorStepY to be a power of two, so it reduces to a simple bit shift.)
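Both alternatives can be sketched like this; the loop shape and the `kAnchorStepY` value are assumptions, only the cellY / anchorStepY expression comes from your code:

```cpp
constexpr int kAnchorStepY = 8;  // assumed power of two, 2^3
constexpr int kShift = 3;        // log2(kAnchorStepY)

// For non-negative cellY the division is just a shift (and the compiler
// will do this rewrite itself once it can prove cellY >= 0).
inline int RowOf(int cellY) {
    return cellY >> kShift;  // == cellY / kAnchorStepY for cellY >= 0
}

// Better yet: never divide at all; carry the row counter along.
inline int SumOfRows(int totalCells) {
    int sum = 0;
    int row = 0;
    for (int cellY = 0; cellY < totalCells; cellY += kAnchorStepY, ++row)
        sum += row;  // the row number is already available, no division
    return sum;
}
```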
in-bounds by construction
if (!IsInBounds(index0, min, max) || ... )
continue;
Does this check ever actually trigger? Does it ever report out-of-bounds?
The y < lastYCell + max and x < width + max guards seem like they're already performing the same work, no?
If that predicate does sometimes report false, then consider restructuring the loop, not unlike loop splitting: iterate over slightly fewer cells with no check at all, and then at the end take care of the remaining cells while carefully checking.
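Here is a sketch of that split on a deliberately simple stand-in problem (the function and data are hypothetical, not your code): summing each cell with its right-hand neighbour. The hot interior loop is in-bounds by construction; only the final cell takes the careful path.

```cpp
#include <vector>

int SumWithRightNeighbour(const std::vector<int>& cells) {
    const int n = static_cast<int>(cells.size());
    int sum = 0;
    // Hot loop: i + 1 < n guarantees both reads are in bounds,
    // so no per-element predicate is needed.
    for (int i = 0; i + 1 < n; ++i)
        sum += cells[i] + cells[i + 1];
    // Checked epilogue: only the edge cell, which has no right neighbour.
    if (n > 0)
        sum += cells[n - 1];
    return sum;
}
```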
I wonder if a few judicious assert statements would help the compiler to see that indexes are provably within-bounds. Comparing generated object code for very slightly different functions over at https://godbolt.org may prove instructive.
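For instance (names hypothetical), an assert both documents the invariant for readers and, in debug builds, verifies it; comparing the emitted code with and without it on godbolt shows whether your compiler exploits the hint:

```cpp
#include <cassert>

// The assert states the "in-bounds by construction" invariant explicitly.
// In NDEBUG builds it compiles to nothing, but some optimizers will use
// the implied range information while it is present.
inline int GetCell(const int* cells, int index, int count) {
    assert(index >= 0 && index < count);
    return cells[index];
}
```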
This code achieves most of its design goals.
I would be willing to delegate or accept maintenance tasks on it.