C - SIMD Code to invert a transformation matrix

Question

I am writing a maths library for a raytracer project, so I'm trying to optimise my heavy operations (like matrix inverse). After some research, I found a trick for inverting a transformation matrix (described in more detail in this document I made, if anyone's interested: https://docs.google.com/document/d/1ok8dzMk7EZiZaVRB61zGDlxRSDoelX3z6ixaCRlg0yM/edit). I store my transformation as three separate matrices (scaling, rotation, and translation), invert each one of them, and combine the results to get the final inverse. Here's the code (also available on godbolt): https://godbolt.org/z/cjsfGzW3c

#include <immintrin.h>

typedef union u_vec4s
{
    float       a[4];
    __m128      simd;
    struct
    {
        float   x;
        float   y;
        float   z;
        float   w;
    };
}__attribute((aligned(16))) t_vec4s;

typedef union u_mat4s
{
    float       a[4][4];
    __m128      simd[4];
    __m256      _ymm[2];
    struct
    {
        t_vec4s   r1;
        t_vec4s   r2;
        t_vec4s   r3;
        t_vec4s   r4;
    };
}__attribute((aligned(16))) t_mat4s;

t_mat4s lag_get_transform_matrix_inverse(const t_mat4s s, const t_mat4s r, const t_mat4s t)
{
    // const __m128 zeros = _mm_set1_ps(0);
    const __m128 mul00 = _mm_set_ps(1, -t.r3.w, -t.r2.w, -t.r1.w);
    t_mat4s t1w0;
    t_mat4s s1rr;

    __m128 tmp0 = _mm_unpacklo_ps(r.simd[0], r.simd[1]); // [r00, r10, r01, r11]
    __m128 tmp1 = _mm_unpackhi_ps(r.simd[0], r.simd[1]); // [r02, r12, r03, r13]
    __m128 tmp2 = _mm_unpacklo_ps(r.simd[2], r.simd[3]); // [r20, r30, r21, r31]
    __m128 tmp3 = _mm_unpackhi_ps(r.simd[2], r.simd[3]); // [r22, r32, r23, r33]

    s1rr.simd[0] = _mm_mul_ps(_mm_movelh_ps(tmp0, tmp2), _mm_set1_ps(1.f / s.r1.x)); // [r00/sx, r10/sx, r20/sx, r30/sx]
    s1rr.simd[1] = _mm_mul_ps(_mm_movehl_ps(tmp2, tmp0), _mm_set1_ps(1.f / s.r2.y)); // [r01/sy, r11/sy, r21/sy, r31/sy]
    s1rr.simd[2] = _mm_mul_ps(_mm_movelh_ps(tmp1, tmp3), _mm_set1_ps(1.f / s.r3.z)); // [r02/sz, r12/sz, r22/sz, r32/sz]
    s1rr.simd[3] = _mm_movehl_ps(tmp3, tmp1); // [0, 0, 0, 1]

    t1w0.simd[0] = /*_mm_sub_ps(zeros, */_mm_dp_ps(s1rr.simd[0], mul00, 0xF1)/*)*/;
    t1w0.simd[1] = /*_mm_sub_ps(zeros, */_mm_dp_ps(s1rr.simd[1], mul00, 0xF1)/*)*/;
    t1w0.simd[2] = /*_mm_sub_ps(zeros, */_mm_dp_ps(s1rr.simd[2], mul00, 0xF1)/*)*/;

    s1rr.r1.w = _mm_cvtss_f32(t1w0.simd[0]);
    s1rr.r2.w = _mm_cvtss_f32(t1w0.simd[1]);
    s1rr.r3.w = _mm_cvtss_f32(t1w0.simd[2]);

    return (s1rr);
}

Now, my question is: is there a better way to do this? Perhaps a more efficient method to achieve the same result? Or maybe I'm doing some unneeded operations. I know that one thing I could do to make it faster is to store my rotation matrix in column-order as opposed to row-order, that way, it wouldn't have to be transposed. One other thing, possibly, is to avoid using _mm_dp_ps, as that function is kinda heavy, and it's possibly better to just use a horizontal add instead. Something like:

    t1w0.simd[0] = _mm_mul_ps(s1rr.simd[0], mul00);
    _mm_storeu_ps((float *)&t1w0.r1, _mm_hadd_ps(t1w0.simd[0], t1w0.simd[0]));
    s1rr.r1.w = t1w0.r1.a[0] + t1w0.r1.a[1]; // hadd(v, v) = [v0+v1, v2+v3, v0+v1, v2+v3]

Is that a worthwhile thing to do? Are there any thoughts or suggestions? Feel free to comment on everything I mentioned before as well. Any and every suggestion would be greatly appreciated.

Something for [simd] folks to review: Perhaps passing by *: const t_mat4s s --> const t_mat4s *s? — chux
– chux, Commented Oct 1, 2024 at 16:38
Welcome to Code Review! Please don't modify the code in your question once it has been answered. You could post improved code as a new question, as an answer, or as a link to an external site, as described in "I improved my code based on the reviews. What next?". I have rolled back the edit, so that it's clear exactly what version has been reviewed. — Toby Speight
– Toby Speight, Commented Oct 2, 2024 at 16:55

user555045 · Accepted Answer · 2024-10-02 01:21:11Z

Putting the code through LLVM MCA or https://uica.uops.info/ yields one unsurprising (I think) result and one surprise (for me anyway).

The not-surprise

The bottleneck (on Intel Skylake in this example; various Intel processors are roughly similar) is p5, due to shuffles. "Shuffle" here includes all of the following:

vshufps of course, but it's not in the code
pack/unpack
vinsertps
vmovlhps/vmovhlps
broadcasting from a register (broadcast from memory would be a "free" broadcast, costing only a load µop)
vdpps includes a shuffle µop (and a bunch of other stuff)
vhaddps includes two shuffle µops, so it's not necessarily good. On the other hand, it may be able to win anyway, see the surprise.

(Details may differ somewhat between CPU families. I'm writing mainly from a Skylake-like perspective, and a lot of CPUs are Skylake-like.)

The surprise

On Intel Ice Lake and newer, vdpps has been nerfed so much that it requires the microcode sequencer (MS), which is something I vaguely knew because vdpps requires 6 µops now and anything over 4 goes through the MS. But I hadn't quite realized how bad that would be. From Cascade Lake to Ice Lake, despite Ice Lake being a significant upgrade in almost every way, the code goes from being able to be executed once every 14 cycles to once every 24 cycles (that's not a latency, it's the time per iteration if that code was in a loop), mostly due to vdpps being relegated to the MS.

One way to avoid the vdpps here would be to build something similar out of vhaddps, but there may be other options, actually I'll mention one further down.

So both of those (shuffles and vdpps) should be a focus if you want to make this faster. The exception is if you care mostly about AMD Zen (especially Zen 4 or 5), which doesn't suffer from the "shuffle problem" nearly as much: e.g. Zen 4 can do 3 vinsertps per cycle and Zen 5 can do 4, so it's a whole different thing. It's not strictly the case that on every Intel processor, every shuffle µop must go to p5 and p5 only (for example shufps can go to p1 or p5 on Ice Lake). But it's fairly common for shuffles to pile up on p5 and form a bottleneck.

Another potential issue, which is not localized to this code per se, is that each row of the result is being stored to with two overlapping stores. There is nothing wrong with that in a vacuum, but it means that loading the result back from memory (if done sufficiently soon after the store and with vector loads) may incur a store-forwarding failure, which costs significant extra latency on CPUs that cannot forward data from two stores simultaneously to one load (which is most of them). Sadly the main way to avoid that would be adding even more shuffles, which are already a problem. The way the compiler implements the stores may also depend on the context from which you call this function (esp. Clang might try to insert extra shuffles of its own to "merge" the overlapping stores; I'm not saying that it will happen, but Clang tries to be clever with shuffles in general, so it might). It could go either way, so it's something to check in the real context, not in a vacuum.

This line:

const __m128 mul00 = _mm_set_ps(1, -t.r3.w, -t.r2.w, -t.r1.w);

It costs a bunch of shuffles. GCC and Clang implement it a little differently, but either way it's far more expensive than it looks. Clang generates 3 vinsertps for it, GCC generates a more varied vunpcklps/vinsertps/vmovlhps combo; either way that's 3 shuffles. If you pass the translation as a vector, these shuffles go away.

_mm_mul_ps(_mm_movelh_ps(tmp0, tmp2), _mm_set1_ps(1.f / s.r1.x));

These multiply-by-reciprocals aren't saving any divisions, so you may as well divide directly. Actually that would generally be better, because _mm_set1_ps(1.f / s.r1.x) costs not only the scalar division but also a broadcast-from-register (which is a shuffle, unlike a broadcast-from-memory). Dividing by _mm_set1_ps(s.r1.x) feels wasteful but a 128-bit vector division costs the same as a scalar division^[1], and this removes the shuffle because a broadcast-from-memory (which the set1 can now compile into) only costs a load µop. As a bonus we also save the multiplication.
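As a rough sketch of the two forms being compared (the helper names `scale_row_rcp` and `scale_row_div` are made up for illustration, and the scale is passed by pointer only to suggest that it lives in memory), assuming AVX is enabled so that the `set1` of a memory operand can compile to a broadcast load:

```c
#include <xmmintrin.h>

/* Hypothetical helper: the form used in the question.
 * Scalar division, then a broadcast from a register (a shuffle), then a multiply. */
static inline __m128 scale_row_rcp(__m128 row, const float *s)
{
    return _mm_mul_ps(row, _mm_set1_ps(1.f / *s));
}

/* Hypothetical helper: divide directly.
 * The set1 reads straight from memory, so it may become a broadcast load,
 * and the separate multiply disappears. */
static inline __m128 scale_row_div(__m128 row, const float *s)
{
    return _mm_div_ps(row, _mm_set1_ps(*s));
}
```

The two can differ in the last bit, because the reciprocal form rounds twice. Whether the broadcast really comes from memory depends on inlining, so the generated assembly is worth checking.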

But that relies on the scale matrix actually being in memory and being loaded from there. If this function gets inlined, compilers may try to "optimize" by keeping the scale matrix in registers, and then you still get shuffles to implement the element-broadcasts. Then you may as well pass the scale as a vector and then use the multiply-by-reciprocal trick (this time it does save something, because there would be only 1 reciprocal and 3 multiplications .. and also 3 shuffles but in this scenario we would have had those shuffles anyway). Or even if this doesn't apply, you can choose to pass the scale as a vector anyway so you can use the following trick.

If we have s and t in vectors, an alternative computation of the "hard part" (the 4th column) could go like this:

take a row of the rotation matrix (from before transposing it)
multiply it by the reciprocal of the scale vector
then multiply that by a broadcasted entry (use _mm_shuffle_ps) of the translation vector

Do that 3 times, sum the results and negate. No _mm_hadd_ps, no _mm_dp_ps, costs some shuffles still but not more than before. Avoiding _mm_dp_ps is big on Ice Lake and later Intel processors (more due to decoding via the microcode sequencer than the actual µops), and doing it without introducing _mm_hadd_ps (which costs 2 shuffles each) is even better.
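The steps above could be sketched roughly like this. The function name `inverse_translation` and its parameter layout are assumptions made for this example: `rot0`..`rot2` are the rotation rows before transposing, and `s` and `t` hold the scale and translation in lanes 0 to 2.

```c
#include <xmmintrin.h>

/* Hypothetical helper: lanes 0..2 of the result are the x, y, z entries
 * of the inverse's 4th column. Lane 3 is not meaningful. */
static inline __m128 inverse_translation(__m128 rot0, __m128 rot1, __m128 rot2,
                                         __m128 s, __m128 t)
{
    /* One vector division gives all the reciprocal scales at once. */
    const __m128 rcps = _mm_div_ps(_mm_set1_ps(1.f), s);

    /* Row i, scaled per lane, times translation entry i broadcast to all lanes. */
    const __m128 c0 = _mm_mul_ps(_mm_mul_ps(rot0, rcps),
                                 _mm_shuffle_ps(t, t, _MM_SHUFFLE(0, 0, 0, 0)));
    const __m128 c1 = _mm_mul_ps(_mm_mul_ps(rot1, rcps),
                                 _mm_shuffle_ps(t, t, _MM_SHUFFLE(1, 1, 1, 1)));
    const __m128 c2 = _mm_mul_ps(_mm_mul_ps(rot2, rcps),
                                 _mm_shuffle_ps(t, t, _MM_SHUFFLE(2, 2, 2, 2)));

    /* Sum the three partial results, then negate by flipping the sign bit. */
    const __m128 sum = _mm_add_ps(_mm_add_ps(c0, c1), c2);
    return _mm_xor_ps(sum, _mm_set1_ps(-0.f));
}
```

The sums run vertically across three vectors, so no horizontal operation is needed. The remaining shuffles are the three broadcasts.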

You can consider passing the matrices in by address instead of by value (as suggested by chux), but if you change any argument to a vector then that vector should be passed by value (vector arguments would be passed in vector registers and they're trivial to copy).

_mm_movehl_ps(tmp3, tmp1); // [0, 0, 0, 1]

Given that shuffles are a bottleneck here, I recommend _mm_set_ps(1, 0, 0, 0) or _mm_setr_ps(0, 0, 0, 1), which should translate into a load. In other cases it may be better to do the reverse, i.e. replace loading a constant with synthesizing it arithmetically, but this shuffle is expensive in this context.

I know that one thing I could do to make it faster is to store my rotation matrix in column-order as opposed to row-order, that way, it wouldn't have to be transposed.

I think it's worth looking into, but I haven't done it.

[1]: Except on Alder Lake E-cores (maybe newer E-cores too but I don't have that information at this time), various Intel Atoms, AMD Jaguar and Bobcat, AMD K10 and K8, Pentium M, Pentium 4, Pentium 3.. so yes a bunch of exceptions, but they're all old or low-power CPUs, probably not what you're optimizing for.

I see. I appreciate the depth you went to in answering the question. I, sadly, don't have enough rep points on this stack community to cast a vote, but do know that your answer is very much appreciated! — Astranged T'fyer
– Astranged T'fyer, Commented Oct 2, 2024 at 9:36
I must've just gotten the rep for it. Thanks a lot! @toolic. Can I ask what you mean by a broadcasted entry of the translation vector? I understand what you mean with the rest. I also agree with the point that passing in the translation and scaling as vectors is way better. I was contemplating it, but didn't have enough knowledge about SIMD to justify doing it. My main reasoning was "less space", but now I feel better about it, so thanks! — Astranged T'fyer
– Astranged T'fyer, Commented Oct 2, 2024 at 11:00
@AstrangedT'fyer I mean take one entry of the vector and put it in all slots eg _mm_shuffle_ps(translation, translation, _MM_SHUFFLE(2, 2, 2, 2)) copies element 2 (the z coordinate usually) to all slots — user555045
– user555045, Commented Oct 2, 2024 at 11:04
is the implementation I added to the question body in line with your recommendations? I'm just trying to make sure everything is proper before I accept the answer — Astranged T'fyer
– Astranged T'fyer, Commented Oct 2, 2024 at 16:47
@AstrangedT'fyer maybe but while negation can be implemented with XORing by -0.0, ORing by -0.0 makes the number negative which doesn't negate negative numbers — user555045
– user555045, Commented Oct 2, 2024 at 17:00
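
To illustrate the XOR-versus-OR point from the last comment, here is a small sketch (the helper names `negate_ps` and `force_negative_ps` are made up for this example). The constant -0.0f has only the sign bit set, so XOR flips the sign of every lane while OR sets it unconditionally:

```c
#include <xmmintrin.h>

/* Hypothetical helper: flips the sign bit, so x becomes -x in every lane. */
static inline __m128 negate_ps(__m128 v)
{
    return _mm_xor_ps(v, _mm_set1_ps(-0.f));
}

/* Hypothetical helper: sets the sign bit, so every lane becomes -|x|.
 * Lanes that were already negative stay negative, which is not a negation. */
static inline __m128 force_negative_ps(__m128 v)
{
    return _mm_or_ps(v, _mm_set1_ps(-0.f));
}
```

With the input (1, -2, 3, -4), the XOR version gives (-1, 2, -3, 4) and the OR version gives (-1, -2, -3, -4).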

Astranged T'fyer · Accepted Answer · 2024-10-02 18:36:40Z

To expand on user555045's answer, here's a newer version of the code. As recommended, I pass both the translation and scaling as packed-singles (__m128), and I avoid using _mm_dp_ps and _mm_hadd_ps.

static inline t_mat4s lag_mat4s_get_transform_inverse(const t_mat4s rot, const __m128 s, const __m128 t)
{
    t_mat4s ret0;
    t_vec4s trns;
    __m128  tmp0, tmp1, tmp2, tmp3;

    const __m128 rcps = _mm_div_ps(_mm_set1_ps(1.f), s);
    const __m128 tinv = _mm_xor_ps(_mm_set1_ps(-0.f), t);

    // Scale each rotation row lane-wise: lane j is divided by s[j].
    ret0.simd[0] = _mm_mul_ps(rot.simd[0], rcps);
    ret0.simd[1] = _mm_mul_ps(rot.simd[1], rcps);
    ret0.simd[2] = _mm_mul_ps(rot.simd[2], rcps);

    // Translation part: scaled rows weighted by -t.x, -t.y, -t.z, then summed.
    // The scaled rows stay in ret0 because the transpose below needs them.
    tmp0 = _mm_mul_ps(ret0.simd[0], _mm_shuffle_ps(tinv, tinv, _MM_SHUFFLE(0, 0, 0, 0)));
    tmp1 = _mm_mul_ps(ret0.simd[1], _mm_shuffle_ps(tinv, tinv, _MM_SHUFFLE(1, 1, 1, 1)));
    tmp2 = _mm_mul_ps(ret0.simd[2], _mm_shuffle_ps(tinv, tinv, _MM_SHUFFLE(2, 2, 2, 2)));

    trns.simd = _mm_add_ps(_mm_add_ps(tmp0, tmp1), tmp2);

    // Transpose the scaled rows; row 3 of rot is taken unscaled.
    tmp0 = _mm_unpacklo_ps(ret0.simd[0], ret0.simd[1]); // [r00/sx, r10/sx, r01/sy, r11/sy]
    tmp1 = _mm_unpackhi_ps(ret0.simd[0], ret0.simd[1]); // [r02/sz, r12/sz, r03/sw, r13/sw]
    tmp2 = _mm_unpacklo_ps(ret0.simd[2], rot.simd[3]);  // [r20/sx, r30, r21/sy, r31]
    tmp3 = _mm_unpackhi_ps(ret0.simd[2], rot.simd[3]);  // [r22/sz, r32, r23/sw, r33]

    ret0.simd[0] = _mm_movelh_ps(tmp0, tmp2);
    ret0.simd[1] = _mm_movehl_ps(tmp2, tmp0);
    ret0.simd[2] = _mm_movelh_ps(tmp1, tmp3);
    ret0.simd[3] = _mm_set_ps(1, 0, 0, 0);

    ret0.r1.w = trns.x;
    ret0.r2.w = trns.y;
    ret0.r3.w = trns.z;
    return (ret0);
}

The code now solves for the inverse by working on a copy of the rotation matrix, taking advantage of the fact that the scaling vector is applied "vertically" to all elements when calculating the scaled-rotation part of the matrix. It also takes advantage of that arrangement when calculating the translation component of each row. Since it's a homogeneous matrix, the last row is just [0, 0, 0, 1].
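
A plain scalar reference can be handy for checking any SIMD version against. The sketch below is only that: it assumes the transform is composed as M = T * R * S with row-major storage and column vectors, which is what the listings in this thread appear to compute, and all function names are made up for the example.

```c
/* Scalar reference for checking a SIMD version.
 * Assumes M = T * R * S and a non-zero scale on every axis. */

/* Builds M = T * R * S: column j of R is scaled by s[j], t fills column 3. */
static void ref_transform(float rot[3][3], float s[3], float t[3], float m[4][4])
{
    for (int i = 0; i < 3; i++) {
        for (int j = 0; j < 3; j++)
            m[i][j] = rot[i][j] * s[j];
        m[i][3] = t[i];
    }
    m[3][0] = m[3][1] = m[3][2] = 0.f;
    m[3][3] = 1.f;
}

/* Builds inv = S^-1 * R^T * T^-1: row i of R^T is divided by s[i],
 * and the 4th column is minus that scaled row dotted with t. */
static void ref_transform_inverse(float rot[3][3], float s[3], float t[3], float inv[4][4])
{
    for (int i = 0; i < 3; i++) {
        float dot = 0.f;
        for (int j = 0; j < 3; j++) {
            inv[i][j] = rot[j][i] / s[i];
            dot += inv[i][j] * t[j];
        }
        inv[i][3] = -dot;
    }
    inv[3][0] = inv[3][1] = inv[3][2] = 0.f;
    inv[3][3] = 1.f;
}

/* Returns the largest absolute deviation of a * b from the identity matrix. */
static float max_identity_error(float a[4][4], float b[4][4])
{
    float worst = 0.f;
    for (int i = 0; i < 4; i++) {
        for (int j = 0; j < 4; j++) {
            float acc = 0.f;
            for (int k = 0; k < 4; k++)
                acc += a[i][k] * b[k][j];
            float d = acc - (i == j ? 1.f : 0.f);
            if (d < 0.f)
                d = -d;
            if (d > worst)
                worst = d;
        }
    }
    return worst;
}
```

Feeding the same rotation, scale and translation to the SIMD function and comparing its output with `ref_transform_inverse`, or checking `max_identity_error` against a small tolerance, should catch mistakes such as a missing scale or a wrong sign.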
