At first I wasn't going to do this, in part because the cross-posted version of this question on Stack Overflow was closed as a duplicate of various pre-existing questions about counting bytes, but I changed my mind.
Actually making it fast
Auto-vectorization takes a big L in this case. Which compiler is used and which ISA level is targeted re-rolls the dice, but none of the results that I've seen are anywhere near "good". The common theme is that all compilers (at this time) get super hung up on res being 64-bit, while the key to doing this efficiently is avoiding 64-bit additions as much as possible - not because they're slow, but because a vector of any given size fits 8 times fewer 64-bit integers in it compared to 8-bit integers.
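For reference, the scalar loop being discussed has roughly this shape (my reconstruction, not copied from the question; the function name is made up, but res is the 64-bit accumulator that the compilers fixate on):

#include <stddef.h>

// Hypothetical reconstruction of the scalar byte-difference counter.
// res is 64-bit on x86-64, which is what the auto-vectorizers below
// get hung up on.
size_t difference_scalar(size_t n, const char a[n], const char b[n]) {
    size_t res = 0;
    for (size_t i = 0; i < n; i++)
        res += a[i] != b[i];   // 0 or 1 per byte, widened to 64 bits
    return res;
}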
So with auto-vectorization, we end up with code like this (from ICX, with -O3 -march=haswell - not letting it use AVX512 because that's not very widespread yet, and besides, allowing AVX512 just makes it do other silly things that are still hyper-focused on 64-bit operations for no good reason):
.LBB0_6: # =>This Inner Loop Header: Depth=1
vmovdqu xmm6, xmmword ptr [rsi + rax]
vpcmpeqb xmm6, xmm6, xmmword ptr [rdx + rax]
vpxor xmm6, xmm6, xmm1
vpmovzxbw ymm6, xmm6 # ymm6 = xmm6[0],zero,xmm6[1],zero,xmm6[2],zero,xmm6[3],zero,xmm6[4],zero,xmm6[5],zero,xmm6[6],zero,xmm6[7],zero,xmm6[8],zero,xmm6[9],zero,xmm6[10],zero,xmm6[11],zero,xmm6[12],zero,xmm6[13],zero,xmm6[14],zero,xmm6[15],zero
vpmovzxwd ymm7, xmm6 # ymm7 = xmm6[0],zero,xmm6[1],zero,xmm6[2],zero,xmm6[3],zero,xmm6[4],zero,xmm6[5],zero,xmm6[6],zero,xmm6[7],zero
vpmovzxdq ymm8, xmm7 # ymm8 = xmm7[0],zero,xmm7[1],zero,xmm7[2],zero,xmm7[3],zero
vpand ymm8, ymm8, ymm3
vpaddq ymm0, ymm8, ymm0
vextracti128 xmm7, ymm7, 1
vpmovzxdq ymm7, xmm7 # ymm7 = xmm7[0],zero,xmm7[1],zero,xmm7[2],zero,xmm7[3],zero
vpand ymm7, ymm7, ymm3
vpaddq ymm2, ymm2, ymm7
vextracti128 xmm6, ymm6, 1
vpmovzxwd ymm6, xmm6 # ymm6 = xmm6[0],zero,xmm6[1],zero,xmm6[2],zero,xmm6[3],zero,xmm6[4],zero,xmm6[5],zero,xmm6[6],zero,xmm6[7],zero
vpmovzxdq ymm7, xmm6 # ymm7 = xmm6[0],zero,xmm6[1],zero,xmm6[2],zero,xmm6[3],zero
vpand ymm7, ymm7, ymm3
vpaddq ymm4, ymm4, ymm7
vextracti128 xmm6, ymm6, 1
vpmovzxdq ymm6, xmm6 # ymm6 = xmm6[0],zero,xmm6[1],zero,xmm6[2],zero,xmm6[3],zero
vpand ymm6, ymm6, ymm3
vpaddq ymm5, ymm5, ymm6
add rax, 16
cmp rax, r8
jbe .LBB0_6
All of this just to compare and count two blocks of 16 bytes - note that the loads are only xmm-sized and the index advances by 16 per iteration.
With the techniques mentioned on Stack Overflow, you could do it like this (not tested):
#include <stddef.h>
#include <immintrin.h>

size_t difference_avx2(size_t n, const char a[n], const char b[n]) {
    size_t i;
    __m256i count = _mm256_setzero_si256();
    // Process 4096-byte blocks. The inner loop runs 128 times, so the
    // per-byte counters in inner_count never exceed 128 and cannot overflow.
    for (i = 0; (i + 4095) < n; i += 4096) {
        __m256i inner_count = _mm256_setzero_si256();
        for (size_t j = 0; j < 4096; j += 32) {
            __m256i A = _mm256_loadu_si256((const __m256i*)&a[i + j]);
            __m256i B = _mm256_loadu_si256((const __m256i*)&b[i + j]);
            __m256i comparemask = _mm256_cmpeq_epi8(A, B);
            // comparemask bytes are 0 or -1; subtracting adds 0 or 1 per byte.
            inner_count = _mm256_sub_epi8(inner_count, comparemask);
        }
        // Horizontally sum groups of 8 byte-counters into 64-bit lanes.
        count = _mm256_add_epi64(count, _mm256_sad_epu8(inner_count, _mm256_setzero_si256()));
    }
    // Leftover whole vectors, fewer than 128 of them, handled one at a time.
    for (; (i + 31) < n; i += 32) {
        __m256i A = _mm256_loadu_si256((const __m256i*)&a[i]);
        __m256i B = _mm256_loadu_si256((const __m256i*)&b[i]);
        __m256i comparemask = _mm256_abs_epi8(_mm256_cmpeq_epi8(A, B));
        count = _mm256_add_epi64(count, _mm256_sad_epu8(comparemask, _mm256_setzero_si256()));
    }
    size_t count_scalar = _mm256_extract_epi64(count, 0) +
                          _mm256_extract_epi64(count, 1) +
                          _mm256_extract_epi64(count, 2) +
                          _mm256_extract_epi64(count, 3);
    // Scalar tail for the last n % 32 bytes.
    for (; i < n; i++) {
        count_scalar += a[i] == b[i];
    }
    // count_scalar counts equal bytes, so the number of differences is n minus that.
    return n - count_scalar;
}
The key trick is the extra inner loop, which runs 128 times (each byte counter gains at most 1 per iteration, so we could go up to 255 iterations before one could overflow, but that's such a non-round number - do it if you want). In that loop, we are free to subtract (or add) the boolean results as bytes, which avoids the 8x expansion factor that converting them to 64-bit numbers entails, and everything that comes along with it (the extra vpmovzx's and so on).
The _mm256_sad_epu8 trick to horizontally add groups of 8 bytes into 64-bit lanes is just a bonus really; it doesn't get executed all that often (once per 4096-byte block in the main loop).
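If the idiom is unfamiliar: SAD against an all-zero vector computes, for each group of 8 bytes, the sum of |x - 0| = x, which is just the sum of those 8 unsigned bytes, deposited in the corresponding 64-bit lane. A minimal illustration (the helper name is mine, for clarity only):

#include <immintrin.h>

// Sums each group of 8 unsigned bytes of v into the corresponding 64-bit lane.
// Hypothetical helper, just to name the idiom used above.
static inline __m256i sum_bytes_per_qword(__m256i v) {
    return _mm256_sad_epu8(v, _mm256_setzero_si256());
}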
Is it fast? quick-bench says yes. Unfortunately the disassembly of the AVX2 version broke, but I'm inclined to believe the result: it should be about 8x as fast as what Clang did on its own, because we avoid the 8x expansion factor. That reasoning is a bit simplistic and frankly optimistic, but still, this result does not imply that the benchmark is broken; it really could be that much faster.
If you cannot use AVX2, it should be easy to cut this down to an SSSE3 version with 128-bit vectors.
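A minimal sketch of that 128-bit variant (untested; same structure as above, assuming SSSE3 for _mm_abs_epi8 and a 64-bit target for _mm_cvtsi128_si64; the block size drops to 2048 so the inner loop still runs 128 times and the byte counters cannot overflow):

#include <stddef.h>
#include <immintrin.h>

// Untested 128-bit sketch. Requires SSSE3 (for _mm_abs_epi8); everything
// else is SSE2. With 16-byte vectors, a 2048-byte block gives 128 inner
// iterations, keeping each per-byte counter at 128 or less.
size_t difference_sse(size_t n, const char a[n], const char b[n]) {
    size_t i;
    __m128i count = _mm_setzero_si128();
    for (i = 0; (i + 2047) < n; i += 2048) {
        __m128i inner_count = _mm_setzero_si128();
        for (size_t j = 0; j < 2048; j += 16) {
            __m128i A = _mm_loadu_si128((const __m128i*)&a[i + j]);
            __m128i B = _mm_loadu_si128((const __m128i*)&b[i + j]);
            inner_count = _mm_sub_epi8(inner_count, _mm_cmpeq_epi8(A, B));
        }
        count = _mm_add_epi64(count, _mm_sad_epu8(inner_count, _mm_setzero_si128()));
    }
    for (; (i + 15) < n; i += 16) {
        __m128i A = _mm_loadu_si128((const __m128i*)&a[i]);
        __m128i B = _mm_loadu_si128((const __m128i*)&b[i]);
        __m128i ones = _mm_abs_epi8(_mm_cmpeq_epi8(A, B));  // SSSE3
        count = _mm_add_epi64(count, _mm_sad_epu8(ones, _mm_setzero_si128()));
    }
    // Add the two 64-bit lanes of count, then finish the tail in scalar code.
    size_t count_scalar = (size_t)_mm_cvtsi128_si64(count) +
                          (size_t)_mm_cvtsi128_si64(_mm_unpackhi_epi64(count, count));
    for (; i < n; i++) {
        count_scalar += a[i] == b[i];
    }
    return n - count_scalar;
}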