"it seems to be optimized quite well by the compiler, as it uses a lot of vectorized instructions"
Well, kind of. GCC didn't vectorize at all, and Clang vectorized some of it, but in a rather strange way. Let's look at what it's doing. I'm mostly reviewing what Clang did rather than your code directly, but that tells you something about your code by proxy, and I'll get to what you can do about it as well.
Snippet 1:
vpaddq xmm0, xmm1, xmm6
vpaddq xmm1, xmm8, xmm6
vinsertf128 ymm8, ymm1, xmm0, 1
vpaddq xmm0, xmm4, xmm6
vpaddq xmm1, xmm9, xmm6
vinsertf128 ymm9, ymm1, xmm0, 1
vpaddq xmm0, xmm13, xmm6
vpaddq xmm1, xmm10, xmm6
vinsertf128 ymm10, ymm1, xmm0, 1
vpaddq xmm0, xmm14, xmm6
vpaddq xmm1, xmm11, xmm6
vinsertf128 ymm11, ymm1, xmm0, 1
The weird mix of 128-bit and 256-bit instructions is due to Ivy Bridge supporting AVX but not AVX2: AVX1 has no 256-bit integer instructions, so a 256-bit integer add has to be stitched together from 128-bit halves with vinsertf128. Doing that is not really a good idea (usually it's better to just use 128-bit SIMD and forget about 256-bit unless you're working with floats), but it's understandable that a compiler would solve the puzzle that way.
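For reference, the pattern above corresponds roughly to the following intrinsics (just a sketch of what the emulation amounts to, not code taken from your program):

#include <immintrin.h>

/* Sketch: a "256-bit" 64-bit integer add on AVX1 (no AVX2). There are no
   256-bit integer instructions, so the add is done as two 128-bit vpaddq's
   and the halves are glued back together with vinsertf128. */
static __m256i add_epi64_avx1(__m256i a, __m256i b)
{
    __m128i lo = _mm_add_epi64(_mm256_castsi256_si128(a),
                               _mm256_castsi256_si128(b));
    __m128i hi = _mm_add_epi64(_mm256_extractf128_si256(a, 1),
                               _mm256_extractf128_si256(b, 1));
    return _mm256_insertf128_si256(_mm256_castsi128_si256(lo), hi, 1);
}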
Anyway, what's happening here is that SIMD is being used to... increment x. That's not necessarily bad, but it is a warning sign, because x really doesn't need to be a vector here (in other contexts it would have made sense).
Snippet 2:
vmovdqu xmm2, xmmword ptr [r13 + r14]
vpcmpeqb xmm3, xmm2, xmm5
vextractf128 xmm12, ymm7, 1
vextractf128 xmm1, ymm8, 1
vpaddq xmm0, xmm12, xmm1
vpaddq xmm4, xmm8, xmm7
vinsertf128 ymm4, ymm4, xmm0, 1
Here SIMD is used to load the pixels, evaluate the condition, and compute x + scanline_out_index. That part is fine.
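In intrinsic terms that comes down to roughly this (a sketch; src and key_value are placeholder names, not yours):

#include <immintrin.h>
#include <stdint.h>

/* Sketch: load 16 pixel bytes (vmovdqu) and build a byte mask that is 0xFF
   wherever the pixel equals the key value (vpcmpeqb). */
static __m128i match_mask(const uint8_t *src, uint8_t key_value)
{
    __m128i pixels = _mm_loadu_si128((const __m128i *)src);
    __m128i key    = _mm_set1_epi8((char)key_value);
    return _mm_cmpeq_epi8(pixels, key);
}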
Snippet 3 (this appears about a dozen times):
vpextrb eax, xmm3, 1
not al
test al, 1
je .LBB0_17
vpextrq rax, xmm4, 1
vpextrb byte ptr [rcx + rax], xmm2, 1
This is not vectorized code. It extracts scalars from the vectors, does a scalar test and branch, and then a scalar byte store. That's not good: the whole setup of computing things in vectors is undone by this part of the code.
I can't blame the compiler too much for this. Vectorizing a conditional store with a blend is simply not legal for a compiler to do on its own (the resulting unconditional store can introduce race conditions in multi-threaded code), and there is no good byte-granular conditional store in AVX (there is one with dword granularity, and maskmovdqu is byte-granular but slow due to its non-temporal hint). As the programmer, you can use a blend, and that's really the only way to make this fast.
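As a sketch of what that could look like (assuming, based on the not/test above, that the mask means "skip this byte"; dst and the other names are placeholders):

#include <immintrin.h>
#include <stdint.h>

/* Sketch: a conditional byte store done with a blend (SSE4.1 / AVX).
   Where skip_mask is 0xFF the destination keeps its old byte; everywhere
   else the new pixel byte is written.  Note that all 16 destination bytes
   get rewritten, which is exactly why the compiler can't do this for you:
   it is only safe if no other thread writes those bytes concurrently. */
static void blend_store(uint8_t *dst, __m128i new_pixels, __m128i skip_mask)
{
    __m128i old     = _mm_loadu_si128((const __m128i *)dst);
    __m128i blended = _mm_blendv_epi8(new_pixels, old, skip_mask);
    _mm_storeu_si128((__m128i *)dst, blended);
}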
Going outside the screen with blends may or may not be safe, depending on how much padding the buffer has. In any case it complicates the condition, since the bounds check doesn't have the same element size as the pixels; doing it in SIMD would cost a lot of code (similar to what Clang generated, actually, plus some packs to narrow the mask down to byte size). So I'd recommend keeping that bounds condition as a branch, i.e. as scalar loop control. You can handle the last chunk of every row by "stepping back" so the end of the vector lines up with the end of the row, overlapping a little with the previous chunk. That works in this case because applying the operation twice to the same pixel has the same effect as applying it once. If I have time later and you're interested, I may do a more elaborate sketch of how to put all of this together.
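In the meantime, here is a very rough sketch of just the step-back loop for one row (dst, src, width and key_value are placeholders, not your names; it assumes width >= 16 and uses the same blend idea as above):

#include <immintrin.h>
#include <stdint.h>
#include <stddef.h>

/* Sketch: process one row in 16-byte chunks with x as a plain scalar.
   The last chunk is aligned with the end of the row, overlapping the
   previous chunk a little; that's fine here because doing the operation
   twice on the same pixel has the same effect as doing it once. */
static void process_row(uint8_t *dst, const uint8_t *src, size_t width, uint8_t key_value)
{
    __m128i key = _mm_set1_epi8((char)key_value);
    size_t x = 0;
    for (;;) {
        if (x + 16 > width)
            x = width - 16;                            /* step back for the last chunk */
        __m128i pixels = _mm_loadu_si128((const __m128i *)&src[x]);
        __m128i skip   = _mm_cmpeq_epi8(pixels, key);  /* 0xFF where pixel == key */
        __m128i old    = _mm_loadu_si128((const __m128i *)&dst[x]);
        _mm_storeu_si128((__m128i *)&dst[x],
                         _mm_blendv_epi8(pixels, old, skip));
        if (x + 16 >= width)
            break;
        x += 16;
    }
}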
In conclusion: Clang vectorized the part that shouldn't have been vectorized, and didn't vectorize the part that should have been (it wasn't allowed to touch that part though; only you, the programmer, are).