It's possible to some extent to trick compilers into doing (almost) the right thing, by implementing the "separate inner loop" idea in scalar code and then relying on auto-vectorization to get SIMD out of it. Such trickery is generally not reliable, but this time it seems to work out reasonably well.

#include <stddef.h>
#include <stdint.h>

size_t difference(size_t n, const char* a, const char* b) {
    size_t res = 0;
    size_t i;
    // Whole 128-byte chunks.
    for (i = 0; (i + 127) < n; i += 128) {
        // Per-chunk match count in a byte: at most 128, so it cannot overflow.
        uint8_t inner_count = 0;
        for (size_t j = 0; j < 128; j++)
            inner_count += a[i + j] == b[i + j];
        res += inner_count;
    }
    // Scalar tail for the remaining bytes.
    for (; i < n; i++)
        res += a[i] == b[i];
    return n - res;
}

The result from Clang (the important part) is this:

    vmovdqu ymm2, ymmword ptr [rsi + rcx]
    vmovdqu ymm3, ymmword ptr [rsi + rcx + 32]
    vmovdqu ymm4, ymmword ptr [rsi + rcx + 64]
    vmovdqu ymm5, ymmword ptr [rsi + rcx + 96]
    vpcmpeqb        ymm2, ymm2, ymmword ptr [rdx + rcx]
    vpand   ymm2, ymm2, ymm0
    vpcmpeqb        ymm3, ymm3, ymmword ptr [rdx + rcx + 32]
    vpcmpeqb        ymm4, ymm4, ymmword ptr [rdx + rcx + 64]
    vpsubb  ymm2, ymm2, ymm3
    vpsubb  ymm2, ymm2, ymm4
    vpcmpeqb        ymm3, ymm5, ymmword ptr [rdx + rcx + 96]
    vpsubb  ymm2, ymm2, ymm3
    vextracti128    xmm3, ymm2, 1
    vpaddb  xmm2, xmm2, xmm3
    vpshufd xmm3, xmm2, 238                 # xmm3 = xmm2[2,3,2,3]
    vpaddb  xmm2, xmm2, xmm3
    vpsadbw xmm2, xmm2, xmm1
    vpextrb r8d, xmm2, 0

That is not ideal (quick-bench suggests it takes about 1.3x as long as the version with intrinsics further down), but it's not horrible. If that's what Clang (or some other compiler, for that matter) had done with the original code, I would have been impressed.

This time Clang turns out to know about the vpsadbw trick to horizontally add groups of 8 bytes. If it had known that earlier (when compiling the original code), it could have avoided most of the vpaddq and all of the vpmovzx. Putting the vpsadbw in every iteration (which I suppose is what would have happened in that hypothetical scenario) is not ideal either, but at least it's not as bad.
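
For reference, the "vpsadbw in every iteration" shape described above would correspond roughly to this intrinsics loop - my own sketch of that hypothetical (untested, and the function name is mine), not actual compiler output:

#include <stddef.h>
#include <immintrin.h>

size_t difference_sad_every_iteration(size_t n, const char* a, const char* b) {
    size_t i;
    __m256i count = _mm256_setzero_si256();
    for (i = 0; (i + 31) < n; i += 32) {
        __m256i A = _mm256_loadu_si256((const __m256i*)&a[i]);
        __m256i B = _mm256_loadu_si256((const __m256i*)&b[i]);
        // Turn the -1/0 compare mask into 1/0 bytes, then immediately widen
        // each group of 8 bytes into a 64-bit partial sum with vpsadbw.
        __m256i ones = _mm256_abs_epi8(_mm256_cmpeq_epi8(A, B));
        count = _mm256_add_epi64(count, _mm256_sad_epu8(ones, _mm256_setzero_si256()));
    }
    size_t res = _mm256_extract_epi64(count, 0) + _mm256_extract_epi64(count, 1) +
                 _mm256_extract_epi64(count, 2) + _mm256_extract_epi64(count, 3);
    for (; i < n; i++)
        res += a[i] == b[i];
    return n - res;
}

Compared to the separate-inner-loop version further down, this pays for a vpsadbw and a 64-bit add on every 32 bytes instead of once per 4096-byte chunk.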


At first I wasn't going to do this, in part because the cross-posted version of this question on Stack Overflow was closed as a duplicate of various pre-existing questions about counting bytes, but I changed my mind.

Actually making it fast

Auto-vectorization takes a big L in this case. Changing the compiler or the targeted ISA level re-rolls the dice, but none of the results I've seen are anywhere near "good". The common theme is that all compilers (at this time) get super hung up on res being 64-bit, while the key to doing this efficiently is avoiding 64-bit additions as much as possible - not because they're slow, but because a vector of any given size holds 8 times fewer 64-bit integers than 8-bit integers.
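
For context, the code being auto-vectorized is the one under review in the question; it is not quoted in this answer, but judging by the discussion and the assembly below it is essentially the obvious byte-by-byte loop, something like this (my reconstruction, not the exact original):

#include <stddef.h>

// Rough reconstruction of the straightforward scalar version (not the exact
// code under review): count differing bytes one at a time into a 64-bit total.
size_t difference_naive(size_t n, const char* a, const char* b) {
    size_t res = 0;                // the 64-bit accumulator the compilers fixate on
    for (size_t i = 0; i < n; i++)
        res += a[i] != b[i];
    return res;
}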

So with auto-vectorization we end up with code like this (from ICX, with -O3 -march=haswell; not allowing it to use AVX512 because that's not very widespread yet - allowing AVX512 anyway just makes it do other silly things that are still hyper-focused on 64-bit operations for no good reason):

.LBB0_6:                                # =>This Inner Loop Header: Depth=1
        vmovdqu xmm6, xmmword ptr [rsi + rax]
        vpcmpeqb        xmm6, xmm6, xmmword ptr [rdx + rax]
        vpxor   xmm6, xmm6, xmm1
        vpmovzxbw       ymm6, xmm6              # ymm6 = xmm6[0],zero,xmm6[1],zero,xmm6[2],zero,xmm6[3],zero,xmm6[4],zero,xmm6[5],zero,xmm6[6],zero,xmm6[7],zero,xmm6[8],zero,xmm6[9],zero,xmm6[10],zero,xmm6[11],zero,xmm6[12],zero,xmm6[13],zero,xmm6[14],zero,xmm6[15],zero
        vpmovzxwd       ymm7, xmm6              # ymm7 = xmm6[0],zero,xmm6[1],zero,xmm6[2],zero,xmm6[3],zero,xmm6[4],zero,xmm6[5],zero,xmm6[6],zero,xmm6[7],zero
        vpmovzxdq       ymm8, xmm7              # ymm8 = xmm7[0],zero,xmm7[1],zero,xmm7[2],zero,xmm7[3],zero
        vpand   ymm8, ymm8, ymm3
        vpaddq  ymm0, ymm8, ymm0
        vextracti128    xmm7, ymm7, 1
        vpmovzxdq       ymm7, xmm7              # ymm7 = xmm7[0],zero,xmm7[1],zero,xmm7[2],zero,xmm7[3],zero
        vpand   ymm7, ymm7, ymm3
        vpaddq  ymm2, ymm2, ymm7
        vextracti128    xmm6, ymm6, 1
        vpmovzxwd       ymm6, xmm6              # ymm6 = xmm6[0],zero,xmm6[1],zero,xmm6[2],zero,xmm6[3],zero,xmm6[4],zero,xmm6[5],zero,xmm6[6],zero,xmm6[7],zero
        vpmovzxdq       ymm7, xmm6              # ymm7 = xmm6[0],zero,xmm6[1],zero,xmm6[2],zero,xmm6[3],zero
        vpand   ymm7, ymm7, ymm3
        vpaddq  ymm4, ymm4, ymm7
        vextracti128    xmm6, ymm6, 1
        vpmovzxdq       ymm6, xmm6              # ymm6 = xmm6[0],zero,xmm6[1],zero,xmm6[2],zero,xmm6[3],zero
        vpand   ymm6, ymm6, ymm3
        vpaddq  ymm5, ymm5, ymm6
        add     rax, 16
        cmp     rax, r8
        jbe     .LBB0_6

All of this just to compare and count 16 bytes from each input (32 bytes of loads) per iteration.

With the techniques mentioned on Stack Overflow, you could do it like this (not tested):

#include <stddef.h>
#include <immintrin.h>

size_t difference_avx2(size_t n, const char a[n], const char b[n]) {
    size_t i;
    __m256i count = _mm256_setzero_si256();
    // Whole 4096-byte chunks: 128 inner iterations of 32 bytes each.
    for (i = 0; (i + 4095) < n; i += 4096) {
        __m256i inner_count = _mm256_setzero_si256();
        for (size_t j = 0; j < 4096; j += 32) {
            __m256i A = _mm256_loadu_si256((__m256i*)&a[i + j]);
            __m256i B = _mm256_loadu_si256((__m256i*)&b[i + j]);
            // cmpeq gives 0 or -1 per byte; subtracting accumulates matches per byte lane.
            __m256i comparemask = _mm256_cmpeq_epi8(A, B);
            inner_count = _mm256_sub_epi8(inner_count, comparemask);
        }
        // Horizontally add each group of 8 byte-counts into a 64-bit lane.
        count = _mm256_add_epi64(count, _mm256_sad_epu8(inner_count, _mm256_setzero_si256()));
    }
    // Tail of whole 32-byte blocks, widened to 64-bit immediately.
    for (; (i + 31) < n; i += 32) {
        __m256i A = _mm256_loadu_si256((__m256i*)&a[i]);
        __m256i B = _mm256_loadu_si256((__m256i*)&b[i]);
        __m256i comparemask = _mm256_abs_epi8(_mm256_cmpeq_epi8(A, B));
        count = _mm256_add_epi64(count, _mm256_sad_epu8(comparemask, _mm256_setzero_si256()));
    }
    size_t count_scalar = _mm256_extract_epi64(count, 0) +
        _mm256_extract_epi64(count, 1) +
        _mm256_extract_epi64(count, 2) +
        _mm256_extract_epi64(count, 3);
    // Scalar tail for the last few bytes.
    for (; i < n; i++) {
        count_scalar += a[i] == b[i];
    }
    return n - count_scalar;
}

The key trick is creating an extra inner loop, which runs 128 times (we could go up to 255, but that's such a non-round number - do it if you want). In that loop, we are free to subtract (or add) the boolean results as bytes, which avoids the 8x expansion factor that converting them to 64-bit numbers entails, and everything that comes along with it (the extra vpmovzx's and so on).

The _mm256_sad_epu8 trick to horizontally add groups of 8 bytes is really just a bonus; it doesn't get executed all that often.
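
As a reminder of what that intrinsic does: _mm256_sad_epu8(x, zero) sums each aligned group of 8 unsigned bytes of x into the corresponding 64-bit lane (sum of absolute differences against zero). A tiny standalone check, my own:

#include <stdio.h>
#include <stdint.h>
#include <immintrin.h>

int main(void) {
    // 32 bytes, all set to 3: every group of 8 bytes sums to 24.
    __m256i x = _mm256_set1_epi8(3);
    __m256i sums = _mm256_sad_epu8(x, _mm256_setzero_si256());
    uint64_t lanes[4];
    _mm256_storeu_si256((__m256i*)lanes, sums);
    printf("%llu %llu %llu %llu\n",   // prints: 24 24 24 24
           (unsigned long long)lanes[0], (unsigned long long)lanes[1],
           (unsigned long long)lanes[2], (unsigned long long)lanes[3]);
    return 0;
}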

Is it fast? quick-bench says yes. Unfortunately the disassembly of the AVX2 version broke, but I'm inclined to believe the result: it should be about 8x as fast as what Clang produced on its own, because we no longer have the 8x expansion factor. That reasoning is a bit simplistic and frankly optimistic, but still, the result does not imply that the benchmark is broken; it really could be that much faster.
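
Since the intrinsics version above is marked as not tested, here is a minimal sanity check against a plain scalar reference (my own sketch, assuming difference_avx2 from above is in the same file; the size is chosen so that the 4096-byte chunks, the 32-byte tail and the scalar tail all run):

#include <stdio.h>
#include <stdlib.h>

// Plain scalar reference to check difference_avx2 against.
static size_t difference_ref(size_t n, const char* a, const char* b) {
    size_t diff = 0;
    for (size_t i = 0; i < n; i++)
        diff += a[i] != b[i];
    return diff;
}

int main(void) {
    size_t n = 3 * 4096 + 57;          // non-multiple of 4096 and of 32 on purpose
    char* a = malloc(n);
    char* b = malloc(n);
    for (size_t i = 0; i < n; i++) {
        a[i] = (char)rand();
        b[i] = (rand() % 4 == 0) ? a[i] : (char)rand();
    }
    size_t expected = difference_ref(n, a, b);
    size_t actual = difference_avx2(n, a, b);
    printf("expected %zu, got %zu: %s\n", expected, actual,
           expected == actual ? "OK" : "MISMATCH");
    free(a);
    free(b);
    return expected != actual;
}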

If you cannot use AVX2, you should easily be able to cut this down to an SSSE3 version with 128-bit vectors.
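
For example, a 128-bit variant could look something like this - same structure, xmm-sized vectors, with _mm_abs_epi8 being the part that needs SSSE3 (again my own untested sketch, function name included):

#include <stddef.h>
#include <immintrin.h>

size_t difference_ssse3(size_t n, const char* a, const char* b) {
    size_t i;
    __m128i count = _mm_setzero_si128();
    // Whole 2048-byte chunks: 128 inner iterations of 16 bytes each,
    // so the per-byte counters still cannot overflow.
    for (i = 0; (i + 2047) < n; i += 2048) {
        __m128i inner_count = _mm_setzero_si128();
        for (size_t j = 0; j < 2048; j += 16) {
            __m128i A = _mm_loadu_si128((const __m128i*)&a[i + j]);
            __m128i B = _mm_loadu_si128((const __m128i*)&b[i + j]);
            inner_count = _mm_sub_epi8(inner_count, _mm_cmpeq_epi8(A, B));
        }
        count = _mm_add_epi64(count, _mm_sad_epu8(inner_count, _mm_setzero_si128()));
    }
    // 16-byte tail blocks.
    for (; (i + 15) < n; i += 16) {
        __m128i A = _mm_loadu_si128((const __m128i*)&a[i]);
        __m128i B = _mm_loadu_si128((const __m128i*)&b[i]);
        __m128i ones = _mm_abs_epi8(_mm_cmpeq_epi8(A, B));   // SSSE3
        count = _mm_add_epi64(count, _mm_sad_epu8(ones, _mm_setzero_si128()));
    }
    // Horizontal sum of the two 64-bit lanes (SSE2 only, no extract needed).
    __m128i hi = _mm_unpackhi_epi64(count, count);
    size_t count_scalar = (size_t)_mm_cvtsi128_si64(_mm_add_epi64(count, hi));
    // Scalar tail.
    for (; i < n; i++)
        count_scalar += a[i] == b[i];
    return n - count_scalar;
}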