"it seems to be optimized quite well by the compiler, as it uses a lot of vectorized instructions"
Well, kind of. GCC didn't vectorize at all, and Clang vectorized some of it, but in a rather strange way. Let's look at what it's doing. I'm mostly reviewing what Clang did rather than your code directly, but that tells you something about your code by proxy, and I'll get to what you can do about it as well.
Snippet 1:
vpaddq xmm0, xmm1, xmm6
vpaddq xmm1, xmm8, xmm6
vinsertf128 ymm8, ymm1, xmm0, 1
vpaddq xmm0, xmm4, xmm6
vpaddq xmm1, xmm9, xmm6
vinsertf128 ymm9, ymm1, xmm0, 1
vpaddq xmm0, xmm13, xmm6
vpaddq xmm1, xmm10, xmm6
vinsertf128 ymm10, ymm1, xmm0, 1
vpaddq xmm0, xmm14, xmm6
vpaddq xmm1, xmm11, xmm6
vinsertf128 ymm11, ymm1, xmm0, 1
The weird mix of 128-bit and 256-bit instructions is due to Ivy Bridge supporting AVX but not AVX2: AVX1 has no 256-bit integer instructions, so a 256-bit integer add has to be stitched together from 128-bit halves with vinsertf128. Doing that is not really a good idea (usually it's better to just use 128-bit SIMD and forget about 256-bit unless you're working with floats), but it's understandable that a compiler would solve the puzzle that way.
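For reference, the pattern above corresponds roughly to the following intrinsics (just a sketch of what the emulation amounts to, not code taken from your program):

#include <immintrin.h>

/* Sketch: a "256-bit" 64-bit integer add on AVX1 (no AVX2). There are no
   256-bit integer instructions, so the add is done as two 128-bit vpaddq's
   and the halves are glued back together with vinsertf128. */
static __m256i add_epi64_avx1(__m256i a, __m256i b)
{
    __m128i lo = _mm_add_epi64(_mm256_castsi256_si128(a),
                               _mm256_castsi256_si128(b));
    __m128i hi = _mm_add_epi64(_mm256_extractf128_si256(a, 1),
                               _mm256_extractf128_si256(b, 1));
    return _mm256_insertf128_si256(_mm256_castsi128_si256(lo), hi, 1);
}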
Anyway, what's happening here is that SIMD is being used to... increment x. That's not necessarily bad, but it is a warning sign, because x really doesn't need to be a vector here (in other contexts it would have made sense).
Snippet 2:
vmovdqu xmm2, xmmword ptr [r13 + r14]
vpcmpeqb xmm3, xmm2, xmm5
vextractf128 xmm12, ymm7, 1
vextractf128 xmm1, ymm8, 1
vpaddq xmm0, xmm12, xmm1
vpaddq xmm4, xmm8, xmm7
vinsertf128 ymm4, ymm4, xmm0, 1
Here SIMD is used to load the pixels, evaluate the condition, and compute x + scanline_out_index. That part is fine.
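In intrinsic terms that comes down to roughly this (a sketch; src and key_value are placeholder names, not yours):

#include <immintrin.h>
#include <stdint.h>

/* Sketch: load 16 pixel bytes (vmovdqu) and build a byte mask that is 0xFF
   wherever the pixel equals the key value (vpcmpeqb). */
static __m128i match_mask(const uint8_t *src, uint8_t key_value)
{
    __m128i pixels = _mm_loadu_si128((const __m128i *)src);
    __m128i key    = _mm_set1_epi8((char)key_value);
    return _mm_cmpeq_epi8(pixels, key);
}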
Snippet 3 (this appears about a dozen times):
vpextrb eax, xmm3, 1
not al
test al, 1
je .LBB0_17
vpextrq rax, xmm4, 1
vpextrb byte ptr [rcx + rax], xmm2, 1
This is not vectorized code. It extracts scalars from the vectors, does a scalar test and branch, and then a scalar byte store. That's not good: the whole setup of computing things in vectors is undone by this part of the code.
I can't blame the compiler too much for this. Vectorizing a conditional store with a blend is simply not legal for a compiler to do on its own (the resulting unconditional store can introduce race conditions in multi-threaded code), and there is no good byte-granular conditional store in AVX (there is one with dword granularity, and maskmovdqu is byte-granular but slow due to its non-temporal hint). As the programmer, you can use a blend, and that's really the only way to make this fast.
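As a sketch of what that could look like (assuming, based on the not/test above, that the mask means "skip this byte"; dst and the other names are placeholders):

#include <immintrin.h>
#include <stdint.h>

/* Sketch: a conditional byte store done with a blend (SSE4.1 / AVX).
   Where skip_mask is 0xFF the destination keeps its old byte; everywhere
   else the new pixel byte is written.  Note that all 16 destination bytes
   get rewritten, which is exactly why the compiler can't do this for you:
   it is only safe if no other thread writes those bytes concurrently. */
static void blend_store(uint8_t *dst, __m128i new_pixels, __m128i skip_mask)
{
    __m128i old     = _mm_loadu_si128((const __m128i *)dst);
    __m128i blended = _mm_blendv_epi8(new_pixels, old, skip_mask);
    _mm_storeu_si128((__m128i *)dst, blended);
}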
Going outside the screen with blends may or may not be safe, depending on how much padding the buffer has. In any case it complicates the condition, since the bounds check doesn't have the same element size as the pixels; doing it in SIMD would cost a lot of code (similar to what Clang generated, actually, plus some packs to narrow the mask down to byte size). So I'd recommend keeping that bounds condition as a branch, i.e. as scalar loop control. You can handle the last chunk of every row by "stepping back" so the end of the vector lines up with the end of the row, overlapping a little with the previous chunk. That works in this case because applying the operation twice to the same pixel has the same effect as applying it once. If I have time later and you're interested, I may do a more elaborate sketch of how to put all of this together.
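In the meantime, here is a very rough sketch of just the step-back loop for one row (dst, src, width and key_value are placeholders, not your names; it assumes width >= 16 and uses the same blend idea as above):

#include <immintrin.h>
#include <stdint.h>
#include <stddef.h>

/* Sketch: process one row in 16-byte chunks with x as a plain scalar.
   The last chunk is aligned with the end of the row, overlapping the
   previous chunk a little; that's fine here because doing the operation
   twice on the same pixel has the same effect as doing it once. */
static void process_row(uint8_t *dst, const uint8_t *src, size_t width, uint8_t key_value)
{
    __m128i key = _mm_set1_epi8((char)key_value);
    size_t x = 0;
    for (;;) {
        if (x + 16 > width)
            x = width - 16;                            /* step back for the last chunk */
        __m128i pixels = _mm_loadu_si128((const __m128i *)&src[x]);
        __m128i skip   = _mm_cmpeq_epi8(pixels, key);  /* 0xFF where pixel == key */
        __m128i old    = _mm_loadu_si128((const __m128i *)&dst[x]);
        _mm_storeu_si128((__m128i *)&dst[x],
                         _mm_blendv_epi8(pixels, old, skip));
        if (x + 16 >= width)
            break;
        x += 16;
    }
}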
In conclusion: Clang vectorized the part that shouldn't have been vectorized, and didn't vectorize the part that should have been (it wasn't allowed to touch that part though; only you, the programmer, are).