30
\$\begingroup\$

I recently started toying with SIMD and came up with the following code for matrix multiplication.

First I attempted to implement it using SIMD the same way I did in SISD, just using SIMD for things like the dot product for each particular entry, which was actually slower (still trying to figure this one out).

After giving it some thought I realized I could do the calculation for the resulting matrix row-by-row instead, by lining up the registers something like this (each row is one SIMD register, and each column is the x, y, z, w parts):

With matrices \$A\$, \$B\$ and computing \$C = A * B\$:

            A_00 * B_00                 A_00 * B_01
                 +                           +
            A_01 * B_10                 A_01 * B_11
                 +                           +
            A_02 * B_20                 A_02 * B_21
                 +                           +
            A_03 * B_30                 A_03 * B_31
                 =                           =
C_00 = Dot(A_Row0, B_Col0), C_01 = Dot(A_Row0, B_Col1), ...


            A_10 * B_00                 A_10 * B_01
                 +                           +
            A_11 * B_10                 A_11 * B_11
                 +                           +
                ...                         ...
C_10 = Dot(A_Row1, B_Col0), C_11 = Dot(A_Row1, B_Col1), ...

I would greatly appreciate if someone with more experience with these things could tell me how far off I am from a good solution.

__m128 BCx = _mm_load_ps((float*)&B.Row0);
__m128 BCy = _mm_load_ps((float*)&B.Row1);
__m128 BCz = _mm_load_ps((float*)&B.Row2);
__m128 BCw = _mm_load_ps((float*)&B.Row3);


// Calculate Row0 in resulting matrix
__m128 ARx = _mm_set1_ps(A.Row0.X);
__m128 ARy = _mm_set1_ps(A.Row0.Y);
__m128 ARz = _mm_set1_ps(A.Row0.Z);
__m128 ARw = _mm_set1_ps(A.Row0.W);

__m128 X = _mm_mul_ps(ARx, BCx);
__m128 Y = _mm_mul_ps(ARy, BCy);
__m128 Z = _mm_mul_ps(ARz, BCz);
__m128 W = _mm_mul_ps(ARw, BCw);

__m128 R = _mm_add_ps(X, _mm_add_ps(Y, _mm_add_ps(Z, W)));
_mm_storeu_ps((float*)&Result.Row0, R);

// Calculate Row1 in resulting matrix
ARx = _mm_set1_ps(A.Row1.X);
ARy = _mm_set1_ps(A.Row1.Y);
ARz = _mm_set1_ps(A.Row1.Z);
ARw = _mm_set1_ps(A.Row1.W);

X = _mm_mul_ps(ARx, BCx);
Y = _mm_mul_ps(ARy, BCy);
Z = _mm_mul_ps(ARz, BCz);
W = _mm_mul_ps(ARw, BCw);

R = _mm_add_ps(X, _mm_add_ps(Y, _mm_add_ps(Z, W)));
_mm_storeu_ps((float*)&Result.Row1, R);

// Calculate Row2 in resulting matrix
ARx = _mm_set1_ps(A.Row2.X);
ARy = _mm_set1_ps(A.Row2.Y);
ARz = _mm_set1_ps(A.Row2.Z);
ARw = _mm_set1_ps(A.Row2.W);

X = _mm_mul_ps(ARx, BCx);
Y = _mm_mul_ps(ARy, BCy);
Z = _mm_mul_ps(ARz, BCz);
W = _mm_mul_ps(ARw, BCw);

R = _mm_add_ps(X, _mm_add_ps(Y, _mm_add_ps(Z, W)));
_mm_storeu_ps((float*)&Result.Row2, R);

// Calculate Row3 in resulting matrix
ARx = _mm_set1_ps(A.Row3.X);
ARy = _mm_set1_ps(A.Row3.Y);
ARz = _mm_set1_ps(A.Row3.Z);
ARw = _mm_set1_ps(A.Row3.W);

X = _mm_mul_ps(ARx, BCx);
Y = _mm_mul_ps(ARy, BCy);
Z = _mm_mul_ps(ARz, BCz);
W = _mm_mul_ps(ARw, BCw);

R = _mm_add_ps(X, _mm_add_ps(Y, _mm_add_ps(Z, W)));
_mm_storeu_ps((float*)&Result.Row3, R);

Using full optimization (/Ox) on the Visual Studio 2013 compiler, this is roughly twice as fast as my typical SISD version (not really sure how much to expect?).

Here is my SISD version:

inline Mat4*
Mat4Mul(const Mat4 *M0, const Mat4 *M1, Mat4 *Out)
{
    Vec4 Col0 = {M1->M00, M1->M10, M1->M20, M1->M30};
    Vec4 Col1 = {M1->M01, M1->M11, M1->M21, M1->M31};
    Vec4 Col2 = {M1->M02, M1->M12, M1->M22, M1->M32};
    Vec4 Col3 = {M1->M03, M1->M13, M1->M23, M1->M33};

    Out->M00 = Vec4Dot(&M0->Row0, &Col0);
    Out->M01 = Vec4Dot(&M0->Row0, &Col1);
    Out->M02 = Vec4Dot(&M0->Row0, &Col2);
    Out->M03 = Vec4Dot(&M0->Row0, &Col3);

    Out->M10 = Vec4Dot(&M0->Row1, &Col0);
    Out->M11 = Vec4Dot(&M0->Row1, &Col1);
    Out->M12 = Vec4Dot(&M0->Row1, &Col2);
    Out->M13 = Vec4Dot(&M0->Row1, &Col3);

    Out->M20 = Vec4Dot(&M0->Row2, &Col0);
    Out->M21 = Vec4Dot(&M0->Row2, &Col1);
    Out->M22 = Vec4Dot(&M0->Row2, &Col2);
    Out->M23 = Vec4Dot(&M0->Row2, &Col3);

    Out->M30 = Vec4Dot(&M0->Row3, &Col0);
    Out->M31 = Vec4Dot(&M0->Row3, &Col1);
    Out->M32 = Vec4Dot(&M0->Row3, &Col2);
    Out->M33 = Vec4Dot(&M0->Row3, &Col3);

    return Out;
}
\$\endgroup\$
6
  • 7
    \$\begingroup\$ There is a dot product instruction available on SSE4: _mm_dp_ps. \$\endgroup\$ Commented Aug 17, 2015 at 3:36
  • 1
    \$\begingroup\$ Hi, thanks for your response, I should perhaps have mentioned that I'm looking to target SSE2 as is right now, I haven't fully looked into what kind of CPU I can reasonably expect with my target audience yet. I'm also unsure about the performance characteristics of the sse4 dot product, there were some mentions about it being slower on some hardware over at SO, if anyone could clarify that it would be great, since my google-fu failed me. \$\endgroup\$ Commented Aug 17, 2015 at 5:28
  • 2
    \$\begingroup\$ @glampert: DPPS isn't efficient, though. On Intel Haswell, it has 14 cycle latency, and 2 cycles per insn throughput. MULPS has 5 cycle latency and 0.5 cycles per insn throughput. (1/3 latency, 4x better throughput). See agner.org/optimize. If you have a lot of operations to do, do them vertically (fma, addps, mulps) and only operate horizontally (haddps / dpps) at the end. \$\endgroup\$ Commented Sep 18, 2015 at 18:51
  • 3
    \$\begingroup\$ Is this C or C++? They are both different languages. \$\endgroup\$ Commented Nov 1, 2015 at 0:37
  • \$\begingroup\$ This is compiled as C++ but written in more of a C style. The question is mostly about correct usage and order of simd instructions, hopefully that doesn't vary too much between C and C++. Feel free to assume whichever one helps you write an answer, bonus points for showing different SIMD solutions for both languages. \$\endgroup\$ Commented Dec 3, 2015 at 16:55

1 Answer 1

13
\$\begingroup\$

Restructure into a function to make it DRYer

Right now the structure of your code is pretty unpleasant, not very DRY, etc. The first thing I'd recommend is restructuring this to work in a function. I don't know for sure how your Mat4 structure is implemented, however you indicated in the comments that it is contiguous so I've based my assumptions off of that. I'd recommend encapsulating this into this sort of function

void dotFourByFourMatrix(const Mat4* left, const Mat4* right, Mat4* result) {
    const __m128 BCx = _mm_load_ps((float*)&B.Row0);
    const __m128 BCy = _mm_load_ps((float*)&B.Row1);
    const __m128 BCz = _mm_load_ps((float*)&B.Row2);
    const __m128 BCw = _mm_load_ps((float*)&B.Row3);

    float* leftRowPointer = &left->Row0;
    float* resultRowPointer = &result->Row0;

    for (unsigned int i = 0; i < 4; ++i, leftRowPointer += 4, resultRowPointer += 4) {
        __m128 ARx = _mm_set1_ps(leftRowPointer[0]);
        __m128 ARy = _mm_set1_ps(leftRowPointer[1]);
        __m128 ARz = _mm_set1_ps(leftRowPointer[2]);
        __m128 ARw = _mm_set1_ps(leftRowPointer[3]);

        __m128 X = ARx * BCx;
        __m128 Y = ARy * BCy;
        __m128 Z = ARz * BCz;
        __m128 W = ARw * BCw;

        __m128 R = X + Y + Z + W;
        _mm_store_ps(resultRowPointer, R);
    }
}

You'll notice that I've made some assumptions about how the pointers increment - my assumption was that member variables like Row0 or X were merely convenience pointers into a single 16-member contiguous array. If that isn't accurate, this will break. I've also labeled the BCx like variables as const because they don't seem as if they should ever change.

Potential undefined behavior in my suggestion

An important note is that there is some question about if this will work perfectly on every system - it would appear that the incrementing of leftRowPointer and resultRowPointer is undefined (after the last iteration) but unlikely to cause problems. I asked a question on StackOverfow about this that you may be interested in reading.

Performance boosts

SIMD is generally super easy to do and gain a speedup. As long as your data layout is good (aligned, contiguous, etc) then you should have pretty good caching behavior, which is the most important aspect of SIMD. The only way you could really improve this would be to add prefetching (that article is fantastic, and has a good section on prefetching), but you generally have to play around with your prefetch distance a bit - i.e. how far ahead you prefetch. This is because the prefetch still takes time to complete, so if you prefetch for the next matrix it probably won't get you any speed improvement unless your current computation takes enough time to mask the prefetching. Without timing it there's no way to be certain how many this is.

for each matrix in a lot of matrices:
    prefetch the matrix 40 matrices ahead, for example
    do some computation with the current matrix

If you don't do work on a lot of matrices sequentially, and you don't have a good way of knowing which matrix you'll work on well in advance, prefetching won't give you anything.

Also avoid pointer chasing - dereferencing a pointer isn't free, and almost always leads to more cache misses (of both the pointer and whatever it points to). Again, I don't see any evidence of that, unless that's what Row0 and such is doing.

You also need to avoid swizzling (favorite word ever) which is basically when you have to rearrange memory to put it into a simd register - using _mm_set_ps is usually a sign that you're swizzling. I don't see any evidence of that here, but for future reference it might be useful.

Use better and fewer intrinsics

I changed your function to not use _mm_mul_ps or _mm_add_ps and just use the operators - much easier to read. Honestly I think you could condense a lot of that even more without sacrificing too much readability - I don't know that all of the temporaries are necessary. I also got rid of the call to _mm_storeu_ps and replaced it with the aligned call - this is going to be substantially faster, and if you have any way of ensuring that you allocate your matrices in an aligned fashion that would be ideal.

Or drop intrinsics all together

In general I find the intrinsics to be a little hard to read, and they're less cross-platform. I strongly suggest using a library like Agner Fog's vectorclass library for both portability and readability.

If you can't or won't use a 3rd party library, it is really quite easy to write up a little wrapper class that is much more readable, and if you're into templates and macros you can make it very portable as well.

Casts

All of that casting makes me a little nervous - again, I don't know anything about the format of your data, but it seems like it should be unnecessary to use the explicit casts. If you must cast, don't use C-style casts - I usually prefer static_cast. You can learn more about why you shouldn't use C-style casts in C++ code here. In summary, C-style casts will do this:

C casts are casts using (type)object or type(object). A C-style cast is defined as the first of the following which succeeds:

const_cast
static_cast (though ignoring access restrictions)
static_cast (see above), then const_cast
reinterpret_cast
reinterpret_cast, then const_cast

This is dangerous because a C-style cast can turn into a reinterpret_cast (from the same link as above, emphasis added):

reinterpret_cast is the most dangerous cast, and should be used very sparingly. It turns one type directly into another - such as casting the value from one pointer to another, or storing a pointer in an int, or all sorts of other nasty things. Largely, the only guarantee you get with reinterpret_cast is that normally if you cast the result back to the original type, you will get the exact same value (but not if the intermediate type is smaller than the original type).

Problem size and algorithm choice

Right now the matrices are small, and if that stays the same don't mind this point, but if you think they're going to get larger you may want to look into tiling the multiplication - this will have much better caching behavior than a large row-major x row-major multiplication.

As-is, with matrices of this size, unless you're doing a ton of them you probably aren't going to get massive performance differences, and the readability hit that manual vectorization gives you might make it not as worthwhile. I see from your profile that you do game development, so you probably do a lot of these, but if so I'd recommend doing it on a GPU instead.

DPPS

Lastly, I'd like to highlight a point made in the comments to use the _mm_dp_ps (DPPS) instruction. As Peter Cordes indicated (also in the comments), this is a bad idea. Horizontal operations ruin the whole point of SIMD registers - I actually wrote up some timing tests some time ago for different SIMD implementations of the dot product including using that instruction, if I can find them I'll upload the results graph. As you might expect, using DPPS pretty much tanks your speed.

EDIT: Your comments point out that this isn't accurate, but I'm leaving it here because it's an important point.

I don't know how your matrices are implemented, but they don't seem to be continuous memory but rather a vector of vectors (or something). This is going to hurt your caching behavior - try for a one dimensional vector/array and then access using (assuming row-major storage) matrix[row*numColumns + col]. You should see an improvement in caching behavior.

\$\endgroup\$
10
  • \$\begingroup\$ Reading the code would reveal that DPPS was never used in my code it was suggested in another comment, and someone else replied that it was slow. The implementation of my matrices is not presented cause I'm just interested in the SIMD code, and can adapt the rest(but they are just continous memory fyi), the SISD sample code is also presented in the question. The casting is due to the fact the matrix is implemented as a struct containing 4 structs(a vector4 for each row). I will have to downvote this until the answer is based on the code that was actually presented. \$\endgroup\$ Commented Jan 7, 2016 at 21:20
  • \$\begingroup\$ @Peter I never said it was - I was just making it clear that using it would be a bad idea. Feel free to downvote - I don't plan on doing anything more on this answer. \$\endgroup\$ Commented Jan 7, 2016 at 21:22
  • \$\begingroup\$ "In the comments it was suggested that you use DPPS - as Peter Cordes indicated, this is usually a bad idea." Suggests to me you read the comments but not the actual code presented. \$\endgroup\$ Commented Jan 7, 2016 at 21:24
  • \$\begingroup\$ @Peter I would hope that the rest of the answer indicates that I did in fact read your code. \$\endgroup\$ Commented Jan 7, 2016 at 21:27
  • \$\begingroup\$ The only indication is that you commented on using _mm_storeu_ps, which is a valid point. I'm still confused about how you started your answer by commenting on something that was actually never in the code. At this point the question is so old and not relevant to me anymore, I just want it to be useful for anyone else looking for a good implementation. \$\endgroup\$ Commented Jan 7, 2016 at 21:30

You must log in to answer this question.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.