Skip to main content
added 249 characters in body
Source Link
user555045
  • 12.4k
  • 1
  • 19
  • 39

So both of those (shuffles and vdpps) should be a focus if you want to make this faster. Unless you care mostly about AMD Zen (especially Zen 4 or 5) which don't suffer from the "shuffle problem" nearly as much, eg Zen 4 can do 3 vinsertps per cycle and Zen 5 can do 4, it's a whole different thing. It's not strictly the case that on every Intel processor, every shuffle µop must go to p5 and p5 only (for example shufps can go to p1 or p5 on Ice Lake). But it's a fairly common thing to happen that shuffles pile up on p5 and form a bottleneck.

So both of those (shuffles and vdpps) should be a focus if you want to make this faster. Unless you care mostly about AMD Zen (especially Zen 4 or 5) which don't suffer from the "shuffle problem" nearly as much, eg Zen 4 can do 3 vinsertps per cycle and Zen 5 can do 4, it's a whole different thing.

So both of those (shuffles and vdpps) should be a focus if you want to make this faster. Unless you care mostly about AMD Zen (especially Zen 4 or 5) which don't suffer from the "shuffle problem" nearly as much, eg Zen 4 can do 3 vinsertps per cycle and Zen 5 can do 4, it's a whole different thing. It's not strictly the case that on every Intel processor, every shuffle µop must go to p5 and p5 only (for example shufps can go to p1 or p5 on Ice Lake). But it's a fairly common thing to happen that shuffles pile up on p5 and form a bottleneck.

Source Link
user555045
  • 12.4k
  • 1
  • 19
  • 39

Putting the code through LLVM MCA or https://uica.uops.info/ yields one unsurprising (I think) result and one surprise (for me anyway).

The not-surprise

The bottleneck (on Intel Skylake in this example, various Intel processors are roughly similar) is p5, due to shuffles. "Shuffle" including all of the following:

  • vshufps of course, but it's not in the code
  • pack/unpack
  • vinsertps
  • vmovlhps/vmovhlps
  • broadcasting from a register (broadcast from memory would be a "free" broadcast, costing only a load µop)
  • vdpps includes a shuffle µop (and a bunch of other stuff)
  • vhaddps includes two shuffle µops, so it's not necessarily good. On the other hand, it may be able to win anyway, see the surprise.

(details may differ somewhat between different CPU families, I'm writing from mainly a Skylake-like perspective, a lot of stuff is Skylake-like..)

The surprise

On Intel Ice Lake and newer, vdpps has been nerfed so much that it requires the microcode sequencer, which is something I vaguely knew because vdpps requires 6 µops now and anything over 4 goes through the MS. But I hadn't quite realize how bad that would be. From Cascade Lake to Ice Lake, despite Ice Lake being a significant upgrade in almost every way, the code goes from being able to be executed once every 14 cycles to once every 24 cycles (that's not a latency, it's the time per iteration if that code was in a loop), mostly due to vdpps being relegated to the MS.

One way to avoid the vdpps here would be to build something similar out of vhaddps, but there may be other options, actually I'll mention one further down.

So both of those (shuffles and vdpps) should be a focus if you want to make this faster. Unless you care mostly about AMD Zen (especially Zen 4 or 5) which don't suffer from the "shuffle problem" nearly as much, eg Zen 4 can do 3 vinsertps per cycle and Zen 5 can do 4, it's a whole different thing.

Another potential issue, which is not localized to this code per-se, is that each row of the result is being stored to with two overlapping stores. There is nothing wrong with that in vacuum, but it means that loading the result back from memory (if done sufficiently soon after the store and with vector loads) may incur a store-forwarding failure which costs significantly extra latency, on CPUs that cannot forward data from two stores simultaneously to one load (which is most of them). Sadly the main way to avoid that would be adding even more shuffles, which are already a problem. The way the stores are implemented by the compiler, may also depend on the context from which you call this function (esp. Clang might try to insert extra shuffles of its own to "merge" the overlapping stores, I'm not saying that it will happen but Clang tries to be clever with shuffles in general, so it might). Could go either way, it's something to check in the real context, not in a vacuum.

This line:

const __m128 mul00 = _mm_set_ps(1, -t.r3.w, -t.r2.w, -t.r1.w);

Costs a bunch of shuffles. GCC and Clang implemented it a little differently, but either way it's far more expensive than it looks. Clang generate 3 vinsertps for it, GCC generated a more varied vunpcklps/vinsertps/vmovlhps combo, either way that's 3 shuffles. If you pass the translation as a vector, these shuffles go away.

_mm_mul_ps(_mm_movelh_ps(tmp0, tmp2), _mm_set1_ps(1.f / s.r1.x));

These multiply-by-reciprocals aren't saving any divisions, so you may as well divide directly. Actually that would generally be better, because _mm_set1_ps(1.f / s.r1.x) costs not only the scalar division but also a broadcast-from-register (which is a shuffle, unlike a broadcast-from-memory). Dividing by _mm_set1_ps(s.r1.x) feels wasteful but a 128-bit vector division costs the same as a scalar division[1], and this removes the shuffle because a broadcast-from-memory (which the set1 can now compile into) only costs a load µop. As a bonus we also save the multiplication.

But that relies on the scale matrix actually being in memory and being loaded from there. If this function gets inlined, compilers may try to "optimize" by keeping the scale matrix in registers, and then you still get shuffles to implement the element-broadcasts. Then you may as well pass the scale as a vector and then use the multiply-by-reciprocal trick (this time it does save something, because there would be only 1 reciprocal and 3 multiplications .. and also 3 shuffles but in this scenario we would have had those shuffles anyway). Or even if this doesn't apply, you can choose to pass the scale as a vector anyway so you can use the following trick.

If we have s and t in vectors, an alternative computation of the "hard part" (the 4th column) could go like this:

  • take a row of the rotation matrix (from before transposing it)
  • multiply it by the reciprocal of the scale vector
  • then multiply that by a broadcasted entry (use _mm_shuffle_ps) of the translation vector

Do that 3 times, sum the results and negate. No _mm_hadd_ps, no _mm_dp_ps, costs some shuffles still but not more than before. Avoiding _mm_dp_ps is big on Ice Lake and later Intel processors (more due to decoding via the microcode sequencer than the actual µops), and doing it without introducing _mm_hadd_ps (which costs 2 shuffles each) is even better.

You can consider passing the matrices in by address instead of by value (as suggested by chux), but if you change any argument to a vector then that vector should be passed by value (vector arguments would be passed in vector registers and they're trivial to copy).

_mm_movehl_ps(tmp3, tmp1); // [0, 0, 0, 1]

Given that shuffles are a bottleneck here, I recommend _mm_set_ps(1, 0, 0, 0) or _mm_setr_ps(0, 0, 0, 1), which should translate into a load. In other cases it may be better to do the reverse ie replace loading a constant with synthesizing it arithmetically, but this shuffle is expensive in this context.

I know that one thing I could do to make it faster is to store my rotation matrix in column-order as opposed to row-order, that way, it wouldn't have to be transposed.

I think it's worth looking into, but I haven't done it.


[1]: Except on Alder Lake E-cores (maybe newer E-cores too but I don't have that information at this time), various Intel Atoms, AMD Jaguar and Bobcat, AMD K10 and K8, Pentium M, Pentium 4, Pentium 3.. so yes a bunch of exceptions, but they're all old or low-power CPUs, probably not what you're optimizing for.