(This isn't so much a code-review as what I was going to comment on other answers that mentioned optimized asm libc implementation of memchr and so on. And/or how I'd go about solving it, especially in non-portable C that could use <immintrin.h> or <arm_neon.h>. It's disappointing that compilers are almost completely unable to make truly efficient code from simple source like yours for simple problems like these.)
Performance of brute-force loops
There's no portable way with C to take full advantage of the SIMD brute force modern CPUs can apply to problems like this, comparing 16 or 32 bytes at once and doing one test/branch to check for any matches. For non-huge problems, carefully applied brute force should be very good, e.g. for strings that stay hot in L2 cache (often 256K on older CPUs, 1M or larger on some recent x86). You have the option of building some index data structures, but depending on typical queries (how far we have to scan to get a match), they might not be worth it if we could write search loops anywhere near as good as what's possible with x86 AVX2 or AArch64 ASIMD intrinsics (or hand-written asm).
(SIMD = Single Instruction Multiple Data, like SSE2 pcmpeqb to do 16 byte-compares with one instruction that modern CPUs can run at 3 per clock cycle, not much worse than integer add. Processing the results so we can branch on them is non-trivial and should be amortized over multiple SIMD vectors if we don't expect a match very soon. See Why does glibc's strlen need to be so complicated to run quickly? for discussion of both the SIMD hand-written asm it uses on x86, and the "portable" bithack long-at-a-time C version (not really portable; strict-aliasing UB) used as a fallback on a few targets like MIPS.)
Current compilers unfortunately won't auto-vectorize loops whose iteration count can't be calculated ahead of the first iteration. That rules out search loops with an early-out. Library implementations of memchr and strchr using hand-written asm can go 16 or 32x faster than current compilers with byte-at-a-time C, on typical x86 with AVX2. (glibc detects CPU features at run time, so it can take advantage of AVX2 even in a program compiled without -march=native.)
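For example, a search loop like the following has a data-dependent trip count, so current GCC and clang will leave it as scalar byte-at-a-time code. (This is just an illustration of the kind of loop that won't auto-vectorize, with a made-up name, not code from the question.)

#include <stddef.h>

// Index of the first c in p[0..n-1], or n if not found.
// The early-out means the trip count isn't known before the first iteration,
// so compilers won't auto-vectorize this.
size_t find_byte_scalar(const char *p, size_t n, char c)
{
    for (size_t i = 0; i < n; i++)
        if (p[i] == c)
            return i;
    return n;
}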
Take advantage of hand-written asm library functions
C's library exposes a few of the things one can do fast in non-portable asm, like finding the position of a given byte (memchr). It's unfortunately missing some very useful things, like the position of the first mismatch between two strings / arrays, but fortunately we don't need that here.
You know N, the length of the string, so there's no need to be searching for the terminating 0 all the time. Use memchr not strchr: it has lower startup overhead since it can just check the size and know it's allowed to read a whole vector, vs. strchr having to worry that a 32 byte load could go off the end of a page into an unmapped page and segfault. So it has to check pointer alignment and branch on a bunch of special cases.
To amortize call overhead for query type 2 (X or Y next), you might use memchr(pos, x, 512) and memchr(pos, y, 512) as building-blocks in a loop doing pos+=512 (and checking if we're getting near the end of the string). If there's actually a Y 20 bytes from pos, we'll check up to 512 bytes looking for an X before we call the second memchr to find the Y, but there is some startup overhead in each memchr.
512 bytes is only 8x 64-byte cache lines, or 16x 32-byte vectors. The block size is a tunable parameter you can play with. Perhaps 128 bytes (2 lines, 4 YMM vectors) would be good, depending on the library implementation.
I'd guess this might run 10x faster than the simple version on a typical modern x86 with AVX2 like Skylake or Zen 2, for data hot in L2 cache, if the next-match position is a few thousand bytes away.
#include <stddef.h>   // size_t
#include <string.h>   // memchr

char findFirstOccurrence(const char *str, size_t len, size_t k, char x, char y)
{
    const size_t chunk_size = 512;
    size_t len_remain = len - k;    // bytes from pos to end of string
    const char *pos = str + k;      // declared outside the loop so the tail code can use it
    for ( ; len_remain >= chunk_size ; pos += chunk_size, len_remain -= chunk_size) {
        const char *xpos = memchr(pos, x, chunk_size);
        const char *ypos = memchr(pos, y, chunk_size);
        if (xpos && ypos) {
            return xpos < ypos ? x : y;
        } else if (xpos) {
            return x;
        } else if (ypos) {
            return y;
        }
        // else neither found a match: keep scanning the next chunk
    }
    if (len_remain) {
        // same as the loop body but with len_remain instead of a full chunk_size.
        // We could have done min(len_remain, chunk_size) inside the loop instead of peeling this.
        const char *xpos = memchr(pos, x, len_remain);
        const char *ypos = memchr(pos, y, len_remain);
        // For production use I'd probably factor this into a helper function or macro.
        if (xpos && ypos) {
            return xpos < ypos ? x : y;
        } else if (xpos) {
            return x;
        } else if (ypos) {
            return y;
        }
        // else neither found a match
    }
    // return '0';   // why ASCII decimal digit '0'?
    return 0;        // ASCII nul '\0' makes more sense to me.
}
This is significantly worse than what you could do with SSE2 intrinsics, like the sketch below (using the same str / len / k / x / y interface as above). It's like memchr but with a bitwise OR of two compare results. If desired, we can unroll the way glibc does. An AVX2 version would be identical but with __m256i and _mm256_... intrinsics. An AVX-512 / AVX10 version using compare-into-mask could use kortest to branch on either compare result having a set bit.
#include <immintrin.h>
#include <stddef.h>

// Same str / len / k / x / y interface as findFirstOccurrence above.
char findfirst_SSE2(const char *str, size_t len, size_t k, char x, char y)
{
    __m128i vx = _mm_set1_epi8(x);   // vector with all 16 elements = x
    __m128i vy = _mm_set1_epi8(y);
    const char *pos = str + k;
    const char *endpos = str + len;

    for ( ; endpos - pos >= 16 ; pos += 16) {
        __m128i v = _mm_loadu_si128((const __m128i*)pos);  // 16-byte unaligned load
        __m128i cmpx = _mm_cmpeq_epi8(v, vx);     // vector of 0 or -1 bytes
        __m128i cmpy = _mm_cmpeq_epi8(v, vy);
        __m128i match = _mm_or_si128(cmpx, cmpy); // true in elements where either had a match
        // and then the rest is identical to a memchr loop.
        // Optionally do a couple vectors unconditionally to amortize test+branch,
        // and then sort out which vector had the match later.
        unsigned mask = _mm_movemask_epi8(match); // high bit of each element -> integer bitmask
        if (mask) {
            // test eax,eax / jcc then tzcnt or BSF
            return pos[ __builtin_ctz(mask) ];    // find the match position and get the char there
        }
    }

    if (pos < endpos) {                // 1 to 15 leftover bytes
        if (len - k >= 16) {
            // The total search region was at least 16 bytes: check the last 16 bytes
            // of the string, partially overlapping with the last vector from the loop.
            const char *last = endpos - 16;
            __m128i v = _mm_loadu_si128((const __m128i*)last);
            __m128i match = _mm_or_si128(_mm_cmpeq_epi8(v, vx), _mm_cmpeq_epi8(v, vy));
            unsigned mask = _mm_movemask_epi8(match);
            mask >>= (pos - last);     // drop bytes the loop already checked (no matches there anyway)
            if (mask)
                return pos[ __builtin_ctz(mask) ];
        } else {
            // tiny search region: scalar fallback (or _mm_loadu_si64 then scalar)
            for ( ; pos < endpos ; pos++)
                if (*pos == x || *pos == y)
                    return *pos;
        }
    }
    return 0;
}
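To make the AVX2 comment concrete, here's a hedged sketch of the same function widened to 32-byte vectors (findfirst_AVX2 is my name, and the tail is left scalar just to keep it short; needs -mavx2 or -march=native to compile):

#include <immintrin.h>
#include <stddef.h>

// Same interface again; only the main loop differs from the SSE2 version.
char findfirst_AVX2(const char *str, size_t len, size_t k, char x, char y)
{
    __m256i vx = _mm256_set1_epi8(x);
    __m256i vy = _mm256_set1_epi8(y);
    const char *pos = str + k;
    const char *endpos = str + len;
    for ( ; endpos - pos >= 32 ; pos += 32) {
        __m256i v = _mm256_loadu_si256((const __m256i*)pos);
        __m256i match = _mm256_or_si256(_mm256_cmpeq_epi8(v, vx), _mm256_cmpeq_epi8(v, vy));
        unsigned mask = _mm256_movemask_epi8(match);  // 32-bit mask, one bit per byte
        if (mask)
            return pos[ __builtin_ctz(mask) ];
    }
    // Tail: same idea as the SSE2 version (overlapping final vector, or scalar);
    // shown here as plain scalar to keep the sketch short.
    for ( ; pos < endpos ; pos++)
        if (*pos == x || *pos == y)
            return *pos;
    return 0;
}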
Counting matching chars in a range
You can get compilers to vectorize things like counting matching bytes. (Counting matches between two arrays, or of a fixed character against arr[i], is basically the same problem: accumulating the compare results is the hard part. a[i] == b[i] vs. a[i] == c is just the difference between another vector load vs. broadcasting a scalar ahead of the loop.) Compilers can sometimes do an ok job, like what @harold managed to get clang to do in Count the number of mismatches between two arrays, with some careful programming to hand-hold the compiler toward the asm he knew he wanted it to make. Otherwise expect quite a bit worse, but still hopefully better than scalar.
But you said programs are only compiled with -O2. If that's GCC, you won't get auto-vectorization except of the easiest loops, like a constant iteration count that's a multiple of 32 or something.
It might help to write the source like count += (str[i] == c); to encourage the compiler to make branchless asm. This pattern of conditionally adding based on a compare result is basically the same as Stack Overflow's canonical branch-prediction Q&A, https://stackoverflow.com/questions/11227809/why-is-processing-a-sorted-array-faster-than-processing-an-unsorted-array . At least for scalar code.
(For vectorization, that code is adding a 32-bit array element to a 64-bit counter, but here we're just incrementing by 1 or not, so you want to work with 16x 1-byte elements as long as possible, not unpacking to four vectors of 4x 32-bit elements right away.)
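If you end up writing it manually with intrinsics instead of hoping the compiler gets there, the shape I'd aim for is something like this sketch: keep the counts in 1-byte elements for up to 255 inner iterations, then widen with psadbw (_mm_sad_epu8). (The name count_byte_sse2 and the assumption that n is a multiple of 16 are mine, just to keep the sketch short.)

#include <immintrin.h>
#include <stddef.h>

// Count occurrences of c in buf[0..n-1].  Sketch only: assumes n is a multiple of 16
// (a real version needs a scalar or overlapping-vector tail) and a 64-bit target
// for _mm_cvtsi128_si64.
static size_t count_byte_sse2(const char *buf, size_t n, char c)
{
    const __m128i vc = _mm_set1_epi8(c);
    __m128i total = _mm_setzero_si128();          // two 64-bit accumulators
    size_t i = 0;
    while (i < n) {
        // Inner loop: accumulate 0 / -1 compare results in 1-byte elements.
        // At most 255 iterations before a byte counter could wrap.
        size_t inner_end = i + 255 * 16;
        if (inner_end > n) inner_end = n;
        __m128i byte_counts = _mm_setzero_si128();
        for ( ; i < inner_end ; i += 16) {
            __m128i v = _mm_loadu_si128((const __m128i*)(buf + i));
            __m128i eq = _mm_cmpeq_epi8(v, vc);           // -1 where equal, 0 elsewhere
            byte_counts = _mm_sub_epi8(byte_counts, eq);  // x -= -1  is  x += 1
        }
        // psadbw against zero sums each group of 8 bytes into a 64-bit element
        total = _mm_add_epi64(total, _mm_sad_epu8(byte_counts, _mm_setzero_si128()));
    }
    __m128i hi = _mm_unpackhi_epi64(total, total);        // bring the high half down
    return (size_t)_mm_cvtsi128_si64(_mm_add_epi64(total, hi));
}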
Query type 1: kth occurrence
You could just call memchr k times. But if the density of occurrences is high-ish, like more than one per 32 bytes, that's much worse for large k than what we can do with SIMD brute force. You'd maybe want to build this out of the counting code, perhaps counting over a range like 512 or 1024 bytes and then continuing forward if we haven't reached k, or backtracking if we've gone past.
It's non-trivial even with manual vectorization using SIMD intrinsics. To count efficiently, you normally want to avoid reducing to scalar very often. Probably work in large batches, and only switch to a loop that reduces and checks every 16 bytes once you're getting close.
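A rough sketch of that chunked count-then-scan structure, using a plain scalar counting helper as a stand-in for whatever counting routine you end up with (the names kth_occurrence and count_byte are mine, and this version only scans forward rather than backtracking):

#include <stddef.h>
#include <string.h>   // memchr

// Branchless-friendly scalar stand-in; swap in a vectorized count for real use.
static size_t count_byte(const char *p, size_t n, char c)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++)
        count += (p[i] == c);
    return count;
}

// Position of the kth occurrence (k >= 1) of c in str[0..len-1],
// or len if there are fewer than k occurrences.
static size_t kth_occurrence(const char *str, size_t len, size_t k, char c)
{
    const size_t chunk = 1024;
    size_t pos = 0;
    // Skip whole chunks that don't contain the kth occurrence.
    while (len - pos >= chunk) {
        size_t cnt = count_byte(str + pos, chunk, c);
        if (cnt >= k)
            break;              // the kth occurrence is somewhere in this chunk
        k -= cnt;
        pos += chunk;
    }
    // Finish with memchr inside the final chunk (or the tail of the string).
    while (k > 0) {
        const char *p = memchr(str + pos, c, len - pos);
        if (!p)
            return len;         // fewer than k occurrences in total
        pos = (size_t)(p - str);
        if (--k == 0)
            return pos;
        pos++;                  // keep searching after this match
    }
    return pos;                 // not reached for k >= 1
}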