I am writing a maths library for a raytracer project, and so I'm trying to make my heavy operations (like matrix inverse) more optimised. After doing some research, I discovered this trick to invert a transformation matrix (described in more detail in this document I made if anyone's interested: https://docs.google.com/document/d/1ok8dzMk7EZiZaVRB61zGDlxRSDoelX3z6ixaCRlg0yM/edit) where I store my transformation matrix as three different matrices (scaling, rotation, and translation), and then I just invert each one of them, and combine the matrices to get the final result, the inverse. Here's the code (also available on godbolt): https://godbolt.org/z/cjsfGzW3c
#include <immintrin.h>
typedef union u_vec4s
{
float a[4];
__m128 simd;
struct
{
float x;
float y;
float z;
float w;
};
}__attribute((aligned(16))) t_vec4s;
typedef union u_mat4s
{
float a[4][4];
__m128 simd[4];
__m256 _ymm[2];
struct
{
t_vec4s r1;
t_vec4s r2;
t_vec4s r3;
t_vec4s r4;
};
}__attribute((aligned(16))) t_mat4s;
t_mat4s lag_get_transform_matrix_inverse(const t_mat4s s, const t_mat4s r, const t_mat4s t)
{
// const __m128 zeros = _mm_set1_ps(0);
const __m128 mul00 = _mm_set_ps(1, -t.r3.w, -t.r2.w, -t.r1.w);
t_mat4s t1w0;
t_mat4s s1rr;
__m128 tmp0 = _mm_unpacklo_ps(r.simd[0], r.simd[1]); // [r00, r10, r01, r11]
__m128 tmp1 = _mm_unpackhi_ps(r.simd[0], r.simd[1]); // [r02, r12, r03, r13]
__m128 tmp2 = _mm_unpacklo_ps(r.simd[2], r.simd[3]); // [r20, r30, r21, r31]
__m128 tmp3 = _mm_unpackhi_ps(r.simd[2], r.simd[3]); // [r22, r32, r23, r33]
s1rr.simd[0] = _mm_mul_ps(_mm_movelh_ps(tmp0, tmp2), _mm_set1_ps(1.f / s.r1.x)); // [r00/sx, r10/sx, r20/sx, r30/sx]
s1rr.simd[1] = _mm_mul_ps(_mm_movehl_ps(tmp2, tmp0), _mm_set1_ps(1.f / s.r2.y)); // [r01/sy, r11/sy, r21/sy, r31/sy]
s1rr.simd[2] = _mm_mul_ps(_mm_movelh_ps(tmp1, tmp3), _mm_set1_ps(1.f / s.r3.z)); // [r02/sz, r12/sz, r22/sz, r32/sz]
s1rr.simd[3] = _mm_movehl_ps(tmp3, tmp1); // [0, 0, 0, 1]
t1w0.simd[0] = /*_mm_sub_ps(zeros, */_mm_dp_ps(s1rr.simd[0], mul00, 0xF1)/*)*/;
t1w0.simd[1] = /*_mm_sub_ps(zeros, */_mm_dp_ps(s1rr.simd[1], mul00, 0xF1)/*)*/;
t1w0.simd[2] = /*_mm_sub_ps(zeros, */_mm_dp_ps(s1rr.simd[2], mul00, 0xF1)/*)*/;
s1rr.r1.w = _mm_cvtss_f32(t1w0.simd[0]);
s1rr.r2.w = _mm_cvtss_f32(t1w0.simd[1]);
s1rr.r3.w = _mm_cvtss_f32(t1w0.simd[2]);
return (s1rr);
}
Now, my question is: is there a better way to do this? Perhaps a more efficient method to achieve the same result? Or maybe I'm doing some unneeded operations. I know that one thing I could do to make it faster is to store my rotation matrix in column-order as opposed to row-order, that way, it wouldn't have to be transposed. One other thing, possibly, is to avoid using _mm_dp_ps, as that function is kinda heavy, and it's possibly better to just use a horizontal add instead. Something like:
t1w0.simd[0] = _mm_mul_ps(s1rr.simd[0], mul00);
_mm_storeu_ps((float *)&t1w0.r1, _mm_hadd_ps(t1w0.simd[0], t1w0.simd[0]));
s1rr.r1.w = t1w0.r1.a[0] + t1w0.r1.a[2];
Is that a worthwhile thing to do? Are there any thoughts or suggestions? Feel free to comment on everything I mentioned before as well. Any and every suggestion would be greatly appreciated.
*:const t_mat4s s-->const t_mat4s *s? \$\endgroup\$