
(This isn't so much a code review as the comments I was going to make on other answers that mentioned optimized asm libc implementations of memchr and so on, and/or how I'd go about solving it, especially in non-portable C that could use <immintrin.h> or <arm_neon.h>. It's disappointing that compilers are almost completely unable to make truly efficient code from simple source like yours for simple problems like these.)

Performance of brute-force loops

There's no portable way in C to take full advantage of the SIMD brute force modern CPUs can apply to problems like this: comparing 16 or 32 bytes at once and doing one test/branch to check for any matches. For non-huge problems, carefully applied brute force should be very good, e.g. for strings that stay hot in L2 cache (often 256K on older CPUs, 1M or larger on some recent x86). You have the option of building some index data structures, but depending on typical queries (how far we have to scan to get a match), it might not be worth it if we could write a brute-force search anywhere near as good as what's possible with x86 AVX2 or AArch64 ASIMD intrinsics (or hand-written asm).

(SIMD = Single Instruction Multiple Data, like SSE2 pcmpeqb to do 16 byte-compares with one instruction that modern CPUs can run at 3 per clock cycle, not much worse than integer add. Processing the results so we can branch on them is non-trivial and should be amortized over multiple SIMD vectors if we don't expect a match very soon. See Why does glibc's strlen need to be so complicated to run quickly? for discussion of both the SIMD hand-written asm it uses on x86, and the "portable" bithack long-at-a-time C version (not really portable; strict-aliasing UB) used as a fallback on a few targets like MIPS.)

Current compilers unfortunately won't auto-vectorize loops whose iteration-count couldn't be calculated ahead of the first iteration. So that rules out search loops that have an early-out. Library implementations of memchr and strchr using hand-written asm can go 16 or 32x faster than current compilers with byte-at-a-time C, on typical x86 with AVX2. (glibc detects CPU features at run time, so can also take advantage of AVX2 in a program compiled without -march=native).

Take advantage of hand-written asm library functions

C's library exposes a few of the things one can do fast in non-portable asm, like finding the position of a given byte (memchr). Unfortunately it's missing some very useful things, like the position of the first mismatch between two strings / arrays, but fortunately we don't need that here.

You know N, the length of the string, so there's no need to keep searching for the terminating 0. Use memchr, not strchr: it has lower startup overhead since it can just check the size and know it's allowed to read a whole vector, vs. strchr having to worry that a 32-byte load could go off the end of a page into an unmapped page and segfault. So strchr has to check pointer alignment and branch on a bunch of special cases.

To amortize call overhead for query type 2 (X or Y next), you might use memchr(pos, x, 512) and memchr(pos, y, 512) as building-blocks in a loop doing pos+=512 (and checking if we're getting near the end of the string). If there's actually a Y 20 bytes from pos, we'll check up to 512 bytes looking for an X before we call the second memchr to find the Y, but there is some startup overhead in each memchr.

512 bytes is only 8x 64-byte cache lines, or 16x 32-byte vectors. The block size is a tunable parameter you can play with. Perhaps 128 bytes (2 lines, 4 YMM vectors) would be good, depending on the library implementation.

I'd guess this might run 10x faster than the simple version on a typical modern x86 with AVX2 like Skylake or Zen 2, for data hot in L2 cache, if the next-match position is a few thousand bytes away.

#include <string.h>   // memchr, size_t

char findFirstOccurrence(const char *str, size_t len, size_t k, char x, char y)
{
    const size_t chunk_size = 512;

    const char *pos = str + k;
    size_t len_remain = len - k;  // bytes from pos to end of string
    for ( ; len_remain >= chunk_size ; pos += chunk_size, len_remain -= chunk_size) {
        const char *xpos = memchr(pos, x, chunk_size);
        const char *ypos = memchr(pos, y, chunk_size);
        if (xpos && ypos) {
            return xpos < ypos ? x : y;
        } else if (xpos) {
            return x;
        } else if (ypos) {
            return y;
        }
        // else neither memchr found a match; try the next chunk
    }

    if (len_remain) {
        // Same as the loop body but with len_remain instead of a full chunk_size.
        // We could have done min(len_remain, chunk_size) inside the loop instead of peeling this.
        // For production use I'd probably factor this into a helper function or macro.
        const char *xpos = memchr(pos, x, len_remain);
        const char *ypos = memchr(pos, y, len_remain);
        if (xpos && ypos) {
            return xpos < ypos ? x : y;
        } else if (xpos) {
            return x;
        } else if (ypos) {
            return y;
        }
        // else neither found a match
    }
    // return '0';   // why ASCII decimal digit '0'?
    return 0;        // ASCII nul '\0' makes more sense to me.
}

This is significantly worse than what you could do with SSE2 intrinsics, like this sketch that omits some of the pointer math. It's like memchr but with bitwise OR of two compare results. If desired, we can unroll the way glibc does. An AVX2 version would be identical but with __m256i and _mm256_... intrinsics. An AVX-512 / AVX10 version using compare-into-mask could use kortest to branch on either compare result having a set bit.

#include <immintrin.h>
#include <stddef.h>

char findfirst_SSE2(const char *str, size_t len, size_t k, char x, char y)
{
  __m128i vx = _mm_set1_epi8(x);  // vector with all 16 elements = x
  __m128i vy = _mm_set1_epi8(y);

  const char *pos = str + k;
  size_t remain = len - k;        // bytes left to search
  for(; remain >= 16 ; pos += 16, remain -= 16){
      __m128i v = _mm_loadu_si128((const __m128i*)pos);  // unaligned 16-byte load
      __m128i cmpx = _mm_cmpeq_epi8(v, vx);  // vector of 0 or -1 bytes
      __m128i cmpy = _mm_cmpeq_epi8(v, vy);
      __m128i match = _mm_or_si128(cmpx, cmpy);  // true in elements where either had a match
      // and then the rest is identical to a memchr loop.
      // Optionally do a couple vectors unconditionally to amortize test+branch, and sort out which vector had the match later.
      unsigned mask = _mm_movemask_epi8(match);   // extract the high bit of each element to an integer bitmask
      if (mask) {
          // test eax,eax / jcc  then  tzcnt or BSF
          return pos[ __builtin_ctz(mask) ];  // find the match position and get the char there
      }
  }
  if (len - k >= 16) {
     // check the last 16 bytes of the string,
     // partially overlapping with the last vector from the loop
     __m128i v = _mm_loadu_si128((const __m128i*)(str + len - 16));
     __m128i match = _mm_or_si128(_mm_cmpeq_epi8(v, vx), _mm_cmpeq_epi8(v, vy));
     unsigned mask = _mm_movemask_epi8(match);
     if (mask)
         return (str + len - 16)[ __builtin_ctz(mask) ];
  } else {
     // scalar fallback for search regions shorter than one vector (or _mm_loadu_si64 then scalar)
     for (; pos < str + len ; pos++)
         if (*pos == x || *pos == y)
             return *pos;
  }
  return 0;
}

ARM / AArch64 don't have a pmovmskb instruction that pulls the high bit from each vector element into an integer bitmask. For AArch64 memchr and similar, you normally reduce the 128-bit compare result to 64 bits with a narrowing shift, then move that from SIMD to a general-purpose integer register and check it for non-zero. (If so, bit-scan it for the match position with rbit (bit-reverse) / clz (count leading zeros), or a bithack to isolate the lowest set bit + clz.)
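A minimal sketch of that reduction with <arm_neon.h> intrinsics, assuming the shrn (shift-right-narrow) variant some memchr implementations use (the helper name and the usage snippet are mine, not from any particular libc):

#include <arm_neon.h>
#include <stdint.h>

// Pack a 16-byte compare result (0x00 / 0xFF per element) down to a 64-bit
// integer with 4 bits per input byte, cheap to test and bit-scan in a GP register.
static inline uint64_t neon_match_mask(uint8x16_t cmp)
{
    uint8x8_t nibbles = vshrn_n_u16(vreinterpretq_u16_u8(cmp), 4);
    return vget_lane_u64(vreinterpret_u64_u8(nibbles), 0);  // SIMD -> integer transfer
}

// Usage in a memchr-style loop (hypothetical names):
//   uint8x16_t v    = vld1q_u8((const uint8_t*)pos);
//   uint64_t   mask = neon_match_mask(vceqq_u8(v, vdupq_n_u8(c)));
//   if (mask) return pos[ __builtin_ctzll(mask) >> 2 ];  // 4 mask bits per input byte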

AArch64 CPUs usually have efficient SIMD-to-integer transfers, but many 32-bit ARMv7 CPUs did not, so it wasn't as easy or profitable to use NEON for memchr-type problems. (The core had to stall and wait for SIMD instructions to finish because the SIMD pipeline normally wasn't tightly coupled enough to track dependencies.) Counting matches over a range is fine, but branching every 32 or 64 bytes on compare results was a problem. Not so for AArch64 usually, and not for x86. On some x86 CPUs, SIMD-to-integer data transfers have highish latency, but they're always pipelined, so it's not a big throughput problem.


Counting matching chars in a range

Efficient manual vectorization looks like this with AVX2: https://stackoverflow.com/questions/54541129/how-to-count-character-occurrences-using-simd - pcmpeqb / psubb in the inner loop, using byte compare results as 0/-1 integers to increment a vector of uint8_t counts. And psadbw against zero in an outer loop to horizontally sum unsigned bytes into uint64_t in 8-byte chunks, before the uint8_t counters can overflow. This can run at about 1.5x 32 bytes per cycle on modern Intel, or 2x 32 bytes per cycle on modern AMD (4 SIMD ALU ports), if data is hot in L1d cache.
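For reference, a minimal sketch of that idiom (the function name and re-widening interval are mine; compile with -mavx2, and a real implementation would unroll the inner loop):

#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

size_t count_matches_avx2(const char *str, size_t len, char c)
{
    __m256i vc = _mm256_set1_epi8(c);
    __m256i totals = _mm256_setzero_si256();       // 4x uint64_t accumulators
    size_t i = 0;
    while (len - i >= 32) {
        __m256i byte_counts = _mm256_setzero_si256();
        size_t vecs = (len - i) / 32;              // whole vectors left
        if (vecs > 128) vecs = 128;                // stay below uint8_t overflow at 256
        for (size_t j = 0; j < vecs; j++, i += 32) {
            __m256i v = _mm256_loadu_si256((const __m256i*)(str + i));
            __m256i cmp = _mm256_cmpeq_epi8(v, vc);          // 0 or -1 per byte
            byte_counts = _mm256_sub_epi8(byte_counts, cmp); // count -= -1 on match
        }
        // psadbw against zero: horizontal-sum groups of 8 unsigned bytes into 64-bit lanes
        totals = _mm256_add_epi64(totals,
                     _mm256_sad_epu8(byte_counts, _mm256_setzero_si256()));
    }
    uint64_t lanes[4];
    _mm256_storeu_si256((__m256i*)lanes, totals);
    size_t count = lanes[0] + lanes[1] + lanes[2] + lanes[3];
    for (; i < len; i++)            // scalar cleanup for the last < 32 bytes
        count += (str[i] == c);
    return count;
}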

You can get compilers to vectorize things like counting matching bytes. (Counting matches between two arrays, or of a fixed character against arr[i], is basically the same problem: accumulating the compare results is the hard part. a[i] == b[i] vs. a[i] == c is just the difference between another vector load and broadcasting a scalar ahead of the loop.) Compilers can sometimes do an OK job, like what @harold managed to get clang to do in Count the number of mismatches between two arrays, with some careful programming to hand-hold the compiler toward the asm he knew he wanted it to make. Otherwise expect quite a bit worse, but still hopefully better than scalar.

But you said programs are only compiled with -O2. If that's GCC, you won't get auto-vectorization except of the easiest loops, like a constant iteration count that's a multiple of 32 or something.

It might help to write the source like count += (str[i] == c); to encourage the compiler to make branchless asm. This pattern of conditionally adding based on a compare result is basically the same as Stack Overflow's canonical branch-prediction Q&A, https://stackoverflow.com/questions/11227809/why-is-processing-a-sorted-array-faster-than-processing-an-unsorted-array, at least for scalar code.

(For vectorization, that code is adding a 32-bit array element to a 64-bit counter, but here we're just incrementing by 1 or not, so you want to work with 16x 1-byte elements as long as possible, not unpacking to four vectors of 4x 32-bit elements right away.)
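For example, the whole counting loop written in that branchless style (a sketch; the function name is mine):

#include <stddef.h>

size_t count_scalar(const char *str, size_t len, char c)
{
    size_t count = 0;
    for (size_t i = 0; i < len; i++)
        count += (str[i] == c);   // 0 or 1: compiles to cmp/setcc rather than a branch
    return count;
}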

Query type 1: kth occurrence

You could just call memchr k times. But if the density of occurrences is high-ish, like more than one per 32 bytes, this is much worse for large k than we can do with SIMD brute force. You'd maybe want to build this out of the counting code, perhaps counting over a range like 512 or 1024 bytes and then continuing forward if we haven't reached k, or backtracking if we've gone past.

It's non-trivial to do this well even with manual vectorization using SIMD intrinsics. To count efficiently, you normally want to avoid reducing to scalar very often: probably work in large batches, and when getting close to the kth occurrence, switch to a loop that reduces every 16 bytes.
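A sketch of that block-counting strategy, assuming a fast SIMD counter like the count_matches_avx2 sketched earlier (the function name and block size are mine):

#include <string.h>
#include <stddef.h>

// Position of the kth occurrence of c, or NULL if there are fewer than k.
const char *find_kth(const char *str, size_t len, char c, size_t k)
{
    const size_t block = 512;
    size_t pos = 0;
    while (pos < len) {
        size_t n = len - pos < block ? len - pos : block;
        size_t cnt = count_matches_avx2(str + pos, n, c);  // assumed SIMD helper
        if (cnt >= k) {
            // the kth remaining occurrence is inside this block: narrow down with memchr
            const char *p = str + pos;
            const char *end = str + pos + n;
            while ((p = memchr(p, c, (size_t)(end - p)))) {
                if (--k == 0)
                    return p;
                p++;
            }
        }
        k -= cnt;          // skip this whole block's worth of matches
        pos += n;
    }
    return NULL;
}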


For all of these, building arrays of symbol-occurrence positions can be a big win, but for type 1 maybe even more so, especially if you're trying to hack around the lack of portable SIMD search in languages like C.

C23 _BitInt(24) might be useful to save some cache footprint on arrays of occurrence positions, if compilers pack that into 3 bytes. Or not, since spatial locality is poor when binary searching; but if it lets your index stay hot in cache more of the time across queries, it could be worth making each access cost more CPU instructions.

Speaking of searchable sorted data structures, an array isn't necessarily optimal, especially if you can use SIMD efficiently for searching (like modern x86 can, and also AArch64). https://stackoverflow.com/questions/56521133/what-is-the-most-efficient-way-to-implement-a-bst-in-such-a-way-the-findvalue - an implicit tree has good locality without wasting any space on pointers. And it doesn't have to be a binary tree; each node can be a whole cache line of an n-ary tree, so you could have a 17-ary tree with 32-bit integers and 64-byte cache lines (16 keys per node selecting one of 17 children). (Use aligned_alloc to get alignas(64) memory for dynamic allocation.)

Turning the arrays of symbol positions into implicit trees would add some preprocessing work, so it might only be worth it if Q is large. But we know Q before we even read the string, so we can decide ahead of time how much preprocessing the string is worth!

For low Q we might even just brute-force search the original string, especially if N is large. (That means writing multiple versions of the rest of the program, with if(strategy) or function pointers to plug in different implementations, dispatching after parsing each query.)

Presumably a fixed strategy is totally fine for this assignment, and struct { uint32_t *pos; size_t len; } symbol_occurrences[UCHAR_MAX+1]; (sorted arrays of indices) would be plenty fast.
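A sketch of that fixed strategy, assuming N fits in uint32_t (the helper names are mine):

#include <stdint.h>
#include <stdlib.h>
#include <limits.h>

struct occ_list { uint32_t *pos; size_t len; };
static struct occ_list symbol_occurrences[UCHAR_MAX + 1];

// One preprocessing pass: count, allocate, then fill per-byte position arrays.
// Positions are appended in increasing order, so each array is already sorted.
void build_index(const char *str, size_t n)
{
    size_t counts[UCHAR_MAX + 1] = {0};
    for (size_t i = 0; i < n; i++)
        counts[(unsigned char)str[i]]++;
    for (int c = 0; c <= UCHAR_MAX; c++) {
        symbol_occurrences[c].pos = malloc(counts[c] * sizeof(uint32_t));
        symbol_occurrences[c].len = 0;
    }
    for (size_t i = 0; i < n; i++) {
        struct occ_list *l = &symbol_occurrences[(unsigned char)str[i]];
        l->pos[l->len++] = (uint32_t)i;
    }
}

// First occurrence of c at or after position k, or SIZE_MAX if none:
// a lower_bound binary search over the sorted array.
size_t first_at_or_after(unsigned char c, uint32_t k)
{
    const struct occ_list *l = &symbol_occurrences[c];
    size_t lo = 0, hi = l->len;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (l->pos[mid] < k) lo = mid + 1; else hi = mid;
    }
    return lo < l->len ? l->pos[lo] : SIZE_MAX;
}

Query type 2 falls out of the same lower_bound: run it once for X and once for Y and take the smaller position. The kth-occurrence query is just symbol_occurrences[c].pos[k-1] after a bounds check, no search needed at all.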

Peter Cordes