(This isn't so much a code-review as what I was going to comment on other answers that mentioned optimized asm libc implementation of memchr and so on. And/or how I'd go about solving it, especially in non-portable C that could use <immintrin.h> or <arm_neon.h>. It's disappointing that compilers are almost completely unable to make truly efficient code from simple source like yours for simple problems like these.)
Performance of brute-force loops
There's no portable way in C to take full advantage of the SIMD brute force modern CPUs can apply to problems like this, comparing 16 or 32 bytes at once and doing one test/branch to check for any matches. For non-huge problems, carefully applied brute force should be very good, e.g. for strings that stay hot in L2 cache (often 256K on older CPUs, 1M or larger on some recent x86). You have the option of building some index data structures, but depending on typical queries (how far we have to scan to get a match), that might not be worth it if we could write brute-force code anywhere near as good as what's possible with x86 AVX2 or AArch64 ASIMD intrinsics (or hand-written asm).
(SIMD = Single Instruction Multiple Data, like SSE2 pcmpeqb to do 16 byte-compares with one instruction that modern CPUs can run at 3 per clock cycle, not much worse than integer add. Processing the results so we can branch on them is non-trivial and should be amortized over multiple SIMD vectors if we don't expect a match very soon. See Why does glibc's strlen need to be so complicated to run quickly? for discussion of both the SIMD hand-written asm it uses on x86, and the "portable" bithack long-at-a-time C version (but not really; strict-aliasing UB) used as a fallback on a few targets like MIPS.)
Current compilers unfortunately won't auto-vectorize loops whose iteration count can't be calculated before the first iteration. That rules out search loops with an early-out. Library implementations of memchr and strchr using hand-written asm can go 16 or 32x faster than current compilers' byte-at-a-time C, on typical x86 with AVX2. (glibc detects CPU features at run time, so it can take advantage of AVX2 even in a program compiled without -march=native.)
Take advantage of hand-written asm library functions
C's library exposes a few of the things one can do fast in non-portable asm, like finding the position of a given byte (memchr). Unfortunately it's missing some very useful things, like the position of the first mismatch between two strings / arrays, but fortunately we don't need that here.
You know N, the length of the string, so there's no need to be searching for the terminating 0 all the time. Use memchr not strchr: it has lower startup overhead since it can just check the size and know it's allowed to read a whole vector, vs. strchr having to worry that a 32 byte load could go off the end of a page into an unmapped page and segfault. So it has to check pointer alignment and branch on a bunch of special cases.
To amortize call overhead for query type 2 (X or Y next), you might use memchr(pos, x, 512) and memchr(pos, y, 512) as building-blocks in a loop doing pos+=512 (and checking if we're getting near the end of the string). If there's actually a Y 20 bytes from pos, we'll check up to 512 bytes looking for an X before we call the second memchr to find the Y, but there is some startup overhead in each memchr.
512 bytes is only 8x 64-byte cache lines, or 16x 32-byte vectors. The block size is a tunable parameter you can play with. Perhaps 128 bytes (2 lines, 4 YMM vectors) would be good, depending on the library implementation.
I'd guess this might run 10x faster than the simple version on a typical modern x86 with AVX2 like Skylake or Zen 2, for data hot in L2 cache, if the next-match position is a few thousand bytes away.
#include <string.h>
#include <stddef.h>

char findFirstOccurrence(const char *str, size_t len, size_t k, char x, char y)
{
    const size_t chunk_size = 512;
    size_t len_remain = len - k;   // bytes from pos to end of string
    const char *pos = str + k;     // declared outside the loop so the cleanup below can use it
    for ( ; len_remain >= chunk_size ; pos += chunk_size, len_remain -= chunk_size) {
        const char *xpos = memchr(pos, x, chunk_size);
        const char *ypos = memchr(pos, y, chunk_size);
        if (xpos && ypos) {
            return xpos < ypos ? x : y;
        } else if (xpos) {
            return x;
        } else if (ypos) {
            return y;
        }
        // else neither found a match
    }
    if (len_remain) {
        // same as loop body but with len_remain instead of a full chunk_size.
        // We could have done min(len_remain, chunk_size) inside the loop instead of peeling this.
        const char *xpos = memchr(pos, x, len_remain);
        const char *ypos = memchr(pos, y, len_remain);
        // For production use I'd probably factor this into a helper function or macro.
        if (xpos && ypos) {
            return xpos < ypos ? x : y;
        } else if (xpos) {
            return x;
        } else if (ypos) {
            return y;
        }
        // else neither found a match
    }
    // return '0'; // why ASCII decimal digit '0'?
    return 0;      // ASCII nul '\0' makes more sense to me.
}
Toby suggested a good way to avoid the if (len_remain){ cleanup part: round up the allocation for the string buffer to a multiple of the chunk size, padding the end with something like 0 (ASCII '\0') that can't be a false-positive match. This works because we're only handling ASCII string data, not binary where every char value is possible.
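A minimal sketch of that padding idea (the helper name and the exact buffer setup are my assumptions about how the string gets read, not part of the original code):

#include <stdlib.h>
#include <string.h>

// Hypothetical helper: allocate the string buffer rounded up to a multiple of the
// chunk size, zero-filling the tail so the per-chunk memchr loop never needs a
// smaller final chunk ('\0' can't be a false-positive match for any query symbol).
static char *alloc_padded(size_t len, size_t chunk_size)
{
    size_t padded = (len + chunk_size - 1) / chunk_size * chunk_size;
    char *buf = malloc(padded);
    if (!buf)
        abort();
    memset(buf + len, 0, padded - len);   // pad bytes after the real string data
    return buf;   // caller copies/reads the len real string bytes into buf[0..len-1]
}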
This is significantly worse than what you could do with SSE2 intrinsics, like the findfirst_SSE2 sketch below. It's like memchr but with bitwise OR of two compare results. If desired, we can unroll the way glibc does. An AVX2 version would be identical but with __m256i and _mm256_... intrinsics. An AVX-512 / AVX10 version using compare-into-mask could use kortest to branch on either compare result having a set bit.
#include <immintrin.h>
#include <stddef.h>

// Parameters assumed to match findFirstOccurrence above (search [str+k, str+len) for x or y).
char findfirst_SSE2(const char *str, size_t len, size_t k, char x, char y)
{
    const char *pos = str + k;          // assumes k <= len
    const char *end = str + len;
    __m128i vx = _mm_set1_epi8(x);      // vector with all 16 elements = x
    __m128i vy = _mm_set1_epi8(y);
    for ( ; pos + 16 <= end ; pos += 16) {
        __m128i v = _mm_loadu_si128((const __m128i*)pos);   // unaligned 16-byte load
        __m128i cmpx = _mm_cmpeq_epi8(v, vx);      // vector of 0 or -1 bytes
        __m128i cmpy = _mm_cmpeq_epi8(v, vy);
        __m128i match = _mm_or_si128(cmpx, cmpy);  // true in elements where either had a match
        // and then the rest is identical to a memchr loop.
        // Optionally do a couple vectors unconditionally to amortize test+branch,
        // and then sort out which vector had the match later.
        unsigned mask = _mm_movemask_epi8(match);  // extract the high bit of each element to an integer bitmask
        if (mask) {
            // test eax,eax / jcc then tzcnt or BSF
            return pos[ __builtin_ctz(mask) ];     // find the match position and get the char there
        }
    }
    if (len - k >= 16) {   // the total search region was at least 16 bytes
        // check the last 16 bytes of the string, partially overlapping with the last
        // vector from the loop.  Bytes the loop already checked can't match, so the
        // lowest set bit is still the first match in the unchecked tail.
        __m128i v = _mm_loadu_si128((const __m128i*)(end - 16));
        __m128i match = _mm_or_si128(_mm_cmpeq_epi8(v, vx), _mm_cmpeq_epi8(v, vy));
        unsigned mask = _mm_movemask_epi8(match);
        if (mask) {
            return (end - 16)[ __builtin_ctz(mask) ];
        }
    } else {
        // scalar fallback for a tiny search region (or _mm_loadu_si64 then scalar)
        for ( ; pos < end ; pos++) {
            if (*pos == x || *pos == y)
                return *pos;
        }
    }
    return 0;
}
ARM / AArch64 don't have a pmovmskb instruction that pulls the high bit of each vector element into an integer bitmask. For AArch64 memchr and similar, you normally reduce the 128-bit compare result to 64 bits with a shift-right-and-insert or shift-right-and-narrow instruction, then move that from SIMD to a general-purpose integer register and check it for non-zero. (If so, then bit-scan it for the match position with rbit (bit-reverse) / clz (count leading zeros), or a bithack to isolate the lowest set bit + clz.)
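For example, the shrn (shift right and narrow) by 4 flavour of that reduction looks something like this sketch; match is assumed to be the 0x00 / 0xFF per-byte compare result (e.g. vorrq_u8 of two vceqq_u8 results), analogous to the SSE2 loop above, and the helper name is mine:

#include <arm_neon.h>
#include <stdint.h>

// Hypothetical helper: return the byte index of the first match in a 16-byte vector
// of 0x00 / 0xFF compare results, or -1 if there was no match.
static inline int first_match_index(uint8x16_t match)
{
    // Shift each 16-bit lane right by 4 and narrow to 8 bits: every input byte
    // contributes 4 bits to a 64-bit mask, in order.
    uint8x8_t nibbles = vshrn_n_u16(vreinterpretq_u16_u8(match), 4);
    uint64_t mask = vget_lane_u64(vreinterpret_u64_u8(nibbles), 0);
    if (mask == 0)
        return -1;                        // no match in these 16 bytes
    return __builtin_ctzll(mask) >> 2;    // rbit/clz on AArch64; 4 mask bits per byte
}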
AArch64 CPUs usually have efficient SIMD → integer transfers, but many 32-bit ARMv7 CPUs did not, so it wasn't as easy or profitable to use NEON for memchr-type problems. (The core had to stall and wait for SIMD instructions to finish because the SIMD unit normally wasn't tightly coupled enough to track dependencies.) Counting matches over a range is fine, but branching every 32 or 64 bytes on compare results was a problem. That's not usually an issue for AArch64, and not for x86. On some x86 CPUs, SIMD-vector-to-integer transfers have high-ish latency, but they're always pipelined, so it's not a big throughput problem.
Counting matching chars in a range
Efficient manual vectorization looks like this with AVX2: https://stackoverflow.com/questions/54541129/how-to-count-character-occurrences-using-simd - pcmpeqb / psubb in the inner loop, using byte compare results as 0/-1 integers to increment a vector of uint8_t counts, and psadbw against zero in an outer loop to horizontally sum unsigned bytes into uint64_t in 8-byte chunks before the uint8_t counters can overflow. This can run at about 1.5x 32 bytes per cycle on modern Intel, or 2x 32 bytes per cycle on modern AMD (4 SIMD ALU ports), if data is hot in L1d cache.
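A sketch of that technique with intrinsics (my own cut-down version of the idea from the linked Q&A, assuming AVX2; not a drop-in copy of the code there):

#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

// Count occurrences of c in str[0..len): per-byte counters in a vector, widened with
// psadbw (_mm256_sad_epu8) before they can overflow.
static size_t count_char_avx2(const char *str, size_t len, char c)
{
    const __m256i vc = _mm256_set1_epi8(c);
    __m256i totals = _mm256_setzero_si256();      // 4x uint64_t running totals
    size_t i = 0;
    while (len - i >= 32) {
        size_t vecs = (len - i) / 32;
        if (vecs > 255) vecs = 255;               // uint8_t counters: at most 255 increments per batch
        __m256i counts = _mm256_setzero_si256();  // 32x uint8_t per-lane counters
        for (size_t j = 0; j < vecs; j++, i += 32) {
            __m256i v  = _mm256_loadu_si256((const __m256i*)(str + i));
            __m256i eq = _mm256_cmpeq_epi8(v, vc);   // 0 or -1 per byte
            counts = _mm256_sub_epi8(counts, eq);    // counts += 1 where equal
        }
        // horizontal sum of unsigned bytes into 4x uint64_t, added to the running totals
        totals = _mm256_add_epi64(totals, _mm256_sad_epu8(counts, _mm256_setzero_si256()));
    }
    uint64_t tmp[4];
    _mm256_storeu_si256((__m256i*)tmp, totals);
    size_t total = tmp[0] + tmp[1] + tmp[2] + tmp[3];
    for ( ; i < len ; i++)                        // scalar cleanup for the last <32 bytes
        total += (str[i] == c);
    return total;
}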
You can get compilers to vectorize things like counting matching bytes. (Counting matches between two arrays, or of a fixed character against arr[i], is basically the same problem: accumulating the compare results is the hard part. a[i] == b[i] vs. a[i] == c is just the difference between doing another vector load vs. broadcasting a scalar ahead of the loop.) Compilers can sometimes do an ok job, like what @harold managed to get clang to do in Count the number of mismatches between two arrays, with some careful programming to hand-hold the compiler toward the asm he knew he wanted it to make. Otherwise expect quite a bit worse, but still hopefully better than scalar.
But you said programs are only compiled with -O2. If that's GCC, you won't get auto-vectorization except of the easiest loops, like a constant iteration count that's a multiple of 32 or something.
It might help to write the source like count += (str[i] == c); to encourage the compiler to make branchless asm. This pattern of conditionally adding based on a compare result is basically the same as Stack Overflow's canonical branch-prediction Q&A, https://stackoverflow.com/questions/11227809/why-is-processing-a-sorted-array-faster-than-processing-an-unsorted-array . At least for scalar code.
(For vectorization, that code is adding a 32-bit array element to a 64-bit counter, but here we're just incrementing by 1 or not so you want to work with 16x 1-byte elements as long as possible, not unpacking to four vectors of 4x 32-bit elements right away.)
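For reference, the whole scalar loop I mean is just this (wrapped in a hypothetical helper so it's self-contained):

#include <stddef.h>

// Branchless scalar count; summing the compare result this way lets the compiler use
// cmp/setcc/add, or auto-vectorize it at higher optimization levels.
static size_t count_char_scalar(const char *str, size_t len, char c)
{
    size_t count = 0;
    for (size_t i = 0; i < len; i++)
        count += (str[i] == c);   // compare result is 0 or 1
    return count;
}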
Query type 1: kth occurrence
You could just call memchr k times. But if the density of occurrences is high-ish, like more than one per 32 bytes, this is much worse for large k than we can do with SIMD brute force. You'd maybe want to build this out of the counting code, perhaps counting over a range like 512 or 1024 bytes and then continuing forward if we haven't reached k, or backtracking if we've gone past.
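The simple memchr-k-times baseline looks something like this sketch (helper name is mine; it assumes a 1-based k and returns -1 if there aren't k occurrences):

#include <string.h>
#include <stddef.h>

static long kth_occurrence_memchr(const char *str, size_t len, char c, size_t k)
{
    const char *pos = str;
    const char *end = str + len;
    for (size_t found = 0; found < k; found++) {
        pos = memchr(pos, c, (size_t)(end - pos));
        if (!pos)
            return -1;       // fewer than k occurrences
        pos++;               // continue searching just past this match
    }
    return (long)((pos - 1) - str);   // position of the kth occurrence
}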
The counting approach is non-trivial even with manual vectorization using SIMD intrinsics. To count efficiently, you normally want to avoid reducing to scalar very often, so probably work in large batches, then switch to a reduce-every-16-bytes loop when you're getting close to the kth occurrence.
For all of these, building arrays of symbol-occurrence positions can be a big win, but for type 1 maybe more so, especially if you're trying to hack around the lack of portable SIMD search in languages like C.
Tuning an array of symbol_occurrences positions
C23 unsigned _BitInt(24) might be useful to save some cache footprint for arrays of occurrence positions, if compilers pack that into 3 bytes. Or not, since spatial locality is poor when binary searching; but if it lets more of your "index" stay hot in cache across queries, it could be worth making each access cost more CPU instructions.
Speaking of searchable sorted data structures, an array isn't necessarily optimal, especially if you can use SIMD efficiently for searching (like modern x86 can, and also AArch64). https://stackoverflow.com/questions/56521133/what-is-the-most-efficient-way-to-implement-a-bst-in-such-a-way-the-findvalue - an implicit tree has good locality without wasting any space on pointers. And it doesn't have to be a binary tree; each node can be a whole cache line of an n-ary tree, so you could have a 17-ary tree with 32-bit integers and 64-byte cache lines. (Use aligned_alloc to get alignas(64) for dynamic allocation.)
Turning the arrays of symbol positions into implicit trees would add some preprocessing work, so it might only be worth it if Q is large. But we know Q before we even read the string, so we can decide ahead of time how much preprocessing it's worth doing on the string!
For low Q we might even just brute-force search the original string, especially if N is large. (That means writing multiple versions of the rest of the program with if(strategy) or switch(strategy), or using function pointers to plug in different implementations to dispatch after parsing a query.)
Presumably for this assignment, a fixed strategy is totally fine, and struct { uint32_t *pos; size_t len; } symbol_occurrences[UCHAR_MAX+1]; would be plenty fast, sorted arrays of indices. (No actual sorting work necessary, one pass over the string appending positions to each symbol array will make sorted output.)
Actually we're guaranteed that the symbols are non-whitespace printable ASCII, so the range of symbols is 33..126 ('!' = ' '+1 up to '~', though in this case character constants aren't a more meaningful way to write these). Instead of UCHAR_MAX+1ULL = 1ULL<<CHAR_BIT (typically 256), our array size can be 126+1-33 = 94, and we index it with arr[c - 33] (where we'd want to put the magic numbers into a named constant like #define SYMBOL_MIN 33 // one past ASCII space).
#include <stdint.h>
#include <stdlib.h>
#define SYMBOL_MIN 33 // one past ASCII space
#define SYMBOL_MAX 126 // one before ASCII DEL
#define SYMBOL_RANGE (SYMBOL_MAX + 1 - SYMBOL_MIN) // array size
// TODO: use as narrow a type as possible for pos, perhaps unsigned on a machine where that's 20 or more bits (enough for max N) but less than 32, like some DSPs
// or with C23, perhaps unsigned _BitInt(20) or unsigned _BitInt(24)
// .len might as well also be uint32_t, but size_t will usually be pointer width and structs will be padded for alignment anyway if we use a narrower type.
static struct { uint32_t *pos; size_t len; }
symbol_occurrences[SYMBOL_RANGE] = {};
// we use size_t for local vars, only the narrow position type in the array.
// Using uint32_t everywhere for the string size would also be fine, but not unsigned; it could be too short.
void build_idx(const char *str, size_t str_N)
{
// allocate max size, resize later.
// Lazy allocation by the kernel avoids wasting actual RAM for pages we never touch
for (int i = 0 ; i < SYMBOL_RANGE ; i++){
symbol_occurrences[i].pos = malloc(str_N * sizeof(symbol_occurrences[0].pos[0]));
// .len already zeroed
if (! symbol_occurrences[i].pos) {
abort();
}
}
// Alternative: could start with one big allocation and carve it up with pointers into it.
for (size_t pos = 0 ; pos < str_N ; pos++){
unsigned c = (unsigned char)str[pos]; // array indices in 0..255 not -128..127 range!
if (c < SYMBOL_MIN || c > SYMBOL_MAX){
return;
// error, abort() or make this non-void and return 0
}
size_t list_idx = symbol_occurrences[c - SYMBOL_MIN].len++;
symbol_occurrences[c - SYMBOL_MIN].pos[list_idx] = pos;
}
// Optionally free up space in the tails of our arrays.
// This is unlikely to actually compact them closer together for dTLB locality or transparent hugepages
// If we wanted that, we'd have to manually memcpy, or memmove if we started with one huge allocation.
for (int i = 0 ; i < SYMBOL_RANGE ; i++){
void *new_alloc = realloc(symbol_occurrences[i].pos,
sizeof(symbol_occurrences[0].pos[0]) * symbol_occurrences[i].len);
if (new_alloc) {
symbol_occurrences[i].pos = new_alloc;
} // else realloc failed, leaving the old allocation still there.
// For most implementations, impossible when shrinking
}
}
Notice the use of unsigned char for indexing arrays. In cases where you want a lookup table of all UCHAR_MAX+1 possible byte values instead of a range-check, make sure you use unsigned char so your indexes are non-negative even with unexpected input.
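To show how the index gets used, here's a sketch of answering both query types from the sorted position arrays (the helper names are mine, including a hypothetical lower_bound_u32; it assumes query type 1 wants the position of the kth occurrence of one symbol, 1-based, and type 2 starts scanning at position k like findFirstOccurrence above):

#include <stdint.h>
#include <stddef.h>

// first index in a[0..n) whose value is >= key (binary search)
static size_t lower_bound_u32(const uint32_t *a, size_t n, uint32_t key)
{
    size_t lo = 0, hi = n;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (a[mid] < key) lo = mid + 1; else hi = mid;
    }
    return lo;
}

// Query type 1: position of the kth occurrence of symbol c (1-based k), or -1 if none.
static long kth_occurrence_idx(unsigned char c, size_t k)
{
    const uint32_t *p = symbol_occurrences[c - SYMBOL_MIN].pos;
    size_t n = symbol_occurrences[c - SYMBOL_MIN].len;
    return (k >= 1 && k <= n) ? (long)p[k - 1] : -1;
}

// Query type 2: which of x or y occurs first at or after position k? 0 if neither does.
static char first_of_two(unsigned char x, unsigned char y, uint32_t k)
{
    const uint32_t *xp = symbol_occurrences[x - SYMBOL_MIN].pos;
    const uint32_t *yp = symbol_occurrences[y - SYMBOL_MIN].pos;
    size_t xn = symbol_occurrences[x - SYMBOL_MIN].len;
    size_t yn = symbol_occurrences[y - SYMBOL_MIN].len;
    size_t xi = lower_bound_u32(xp, xn, k);
    size_t yi = lower_bound_u32(yp, yn, k);
    uint32_t xpos = (xi < xn) ? xp[xi] : UINT32_MAX;   // sentinel: no occurrence at/after k
    uint32_t ypos = (yi < yn) ? yp[yi] : UINT32_MAX;
    if (xpos == UINT32_MAX && ypos == UINT32_MAX)
        return 0;
    return xpos <= ypos ? (char)x : (char)y;
}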
OSes like Linux normally allow unlimited overcommit of memory allocations, so there's not much downside to just allocating enough space for the worst case with malloc(sizeof(uint_least32_t) * N) for each symbol, where N is the string length. Any virtual pages we don't touch won't ever get physical pages allocated to back them, and virtual address-space isn't a limiting factor even on a 32-bit system: for a string with max length 1MiB, that's 94 * 4B * 1Mi = 376 MiB of virtual address-space. (But even a rare symbol's array still occupies at least a whole page by itself.)
Instead of relying on overcommit, another way to avoid separate size and capacity struct members (like std::vector has) would be a fixed growth increment, e.g. 8192: check .len for crossing a multiple of 8192 and realloc. Capacity is then inferred from .len as (.len + 8191) & -8192, i.e. .len rounded up to the next multiple of 8192.
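A sketch of that, reusing the symbol_occurrences table from above (the helper name is mine, and I'm counting the increment in elements rather than bytes so the capacity check is just .len % GROWTH):

#include <stdint.h>
#include <stdlib.h>

#define GROWTH 8192   // growth increment in elements; capacity = .len rounded up to a multiple of this

static void append_position(unsigned idx, uint32_t position)   // idx = c - SYMBOL_MIN
{
    size_t len = symbol_occurrences[idx].len;
    if (len % GROWTH == 0) {   // at a capacity boundary (including the empty case): grow
        // realloc(NULL, n) on the first append acts like malloc
        uint32_t *p = realloc(symbol_occurrences[idx].pos, (len + GROWTH) * sizeof(uint32_t));
        if (!p)
            abort();
        symbol_occurrences[idx].pos = p;
    }
    symbol_occurrences[idx].pos[len] = position;
    symbol_occurrences[idx].len = len + 1;
}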
One downside to big initial allocations is L1dTLB misses (L2 TLB hits) from random access to different pages. Modern x86 L2 TLBs are big enough that the L1dTLB misses would hit in the second-level TLB instead of causing a page-walk so it's not a total disaster. On low-end ARM, IDK, could be worse. (Skylake stats - L1dTLB size of 64 entries for 4K or 2M pages, L2 TLB 1536 entries.) And if some symbols are rare, those arrays can just stay cold.
You can realloc each array to its size when we're done, but that won't compact them. So answering queries will also suffer TLB misses which we maybe could have avoided if we compacted the allocations, especially on x86-64 Linux where the hugepage size is 2MiB and Linux will opportunistically use transparent hugepages when most of a 2M region is populated. Compaction into a single allocation with memmove is possible. (Or instead of just copying, compact into an implicit tree for cache locality of searching. As discussed above in the SIMD section, an implicit 17-ary tree is good if you're SIMD searching it, otherwise maybe a binary or 3-ary or 5-ary tree for scalar code that doesn't have SIMD's fixed overhead for finding which element of a group matched.)
Glibc malloc uses mmap for whole pages for big allocs, returning a pointer that's 16 bytes past the start of a page. So all the arrays would have the same alignment relative to a 4K boundary. So their first bytes would all alias the same set in L1d cache.
But since, unlike Toby's answer, I'm keeping the size outside the dynamic allocation, we don't need to access that position for most queries. Presumably the array of pointer/len structs would stay hot in cache, so the first access to a dynamic allocation would be at a position that depends on the query, or in the middle (for a binary search), which depends on the length of that symbol's position array. With different counts for different symbols, we hopefully don't get too many conflict misses in cache.
TODO: make a single large allocation (or even a static uint32_t pos_storage[SYMBOL_RANGE * MAX_N];). As discussed in comments with @MatthieuM., you don't even need an array of pointers; you can just calculate the start of each symbol's portion as idx = (c - SYMBOL_MIN) * N. (Using the current N, not MAX_N, to get good locality for small N, and to avoid having all arrays start a multiple of 4K from each other, for cache aliasing and other effects.)
This is also very good for very small N like the test-cases. Zero or one calls to a library allocation function and minimal malloc bookkeeping overhead, and not too sparse use of the space.
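A sketch of that single-allocation layout (reusing the SYMBOL_MIN / SYMBOL_RANGE constants from build_idx above; the names pos_storage and occurrence_count and this exact shape are mine):

#include <stdint.h>
#include <stdlib.h>

static uint32_t *pos_storage;                      // SYMBOL_RANGE regions of N slots each
static size_t    occurrence_count[SYMBOL_RANGE];   // per-symbol lengths, like .len before

static void build_idx_flat(const char *str, size_t N)
{
    pos_storage = malloc((size_t)SYMBOL_RANGE * N * sizeof(uint32_t));
    if (!pos_storage)
        abort();
    for (size_t p = 0; p < N; p++) {
        unsigned c = (unsigned char)str[p];          // range check as in build_idx omitted for brevity
        size_t base = (size_t)(c - SYMBOL_MIN) * N;  // start of this symbol's region
        pos_storage[base + occurrence_count[c - SYMBOL_MIN]++] = (uint32_t)p;
    }
}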