19

Suppose I have this value, with underscores for clarity:

0x12_23_45_09_45_11_23_10

I want to zero out all instances of 0x10, 0x23 and 0x45 inside it, making this number:

0x12_00_00_09_00_11_00_00

Is there a bit twiddling hack that can do this without looping over the bytes of the value?

I am aware of how to detect a byte equal to N, but I couldn't glue the byte-clearing together for multiple target values.
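
For reference, the detection trick I'm referring to is the classic one (along the lines of Bit Twiddling Hacks), roughly:

// Non-zero result iff some byte of v equals n.
#define haszero(v)     (((v) - 0x0101010101010101ULL) & ~(v) & 0x8080808080808080ULL)
#define hasvalue(v, n) haszero((v) ^ (0x0101010101010101ULL * (n)))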

11
  • 7
    You could use the digit separators since C23: 0x12'23'45'09'45'11'23'10 Commented Sep 15 at 6:22
  • 2
    Do you mean to zero out specific byte values (0x10 etc) , or specific byte "positions" (2nd, 3rd etc.) ? Commented Sep 15 at 6:22
  • 2
    @cup, Neither. 0x12'00'44'51'02'33'9A'22 Commented Sep 15 at 13:46
  • 3
    Would you upvote an answer that uses SSE intrinsics (but is written in C, without inline assembly), and thus only works on x86_64? It's pretty straightforward to do, because SSE has a bytewise comparison instruction. Commented Sep 15 at 18:26
  • 3
    What's the goal for this? Many of these answers are extremely ingenious and instructive, but I'm not sure I'd want to see any of them in production code I was maintaining... Commented Sep 17 at 7:54

9 Answers

20

You can do it but I doubt it will be faster than the looping method.

Here's how you could zero out one particular value (note this is not hand-optimized, there could be better ways):

uint64_t zero_out(uint64_t src, uint8_t val) {
    // 1. Replicate val in every byte
    uint64_t v = val;
    v |= v << 8;
    v |= v << 16;
    v |= v << 32;
    // 2. XOR out the value
    uint64_t dst = src ^ v;
    // 3. Turn every non-zero byte to 1.
    dst |= dst >> 4; dst &= 0x0F0F0F0F0F0F0F0F;
    dst |= dst >> 2; dst &= 0x0303030303030303;
    dst |= dst >> 1; dst &= 0x0101010101010101;
    // 4. Multiply by 255 turning every 1 to FF
    dst *= 255;
    // 5. Mask out 
    return src & dst;
}
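
For the three values from the question, one way to use it is to chain the calls (a sketch; chaining is safe because a cleared byte becomes 0x00 and no longer matches any non-zero target):

uint64_t x = 0x1223450945112310;
x = zero_out(zero_out(zero_out(x, 0x10), 0x23), 0x45);
// x == 0x1200000900110000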


3 Comments

Brilliant! Another user expanded on it, but it is a very clean solution
Not sure if it would actually be better than the multiplication by 255, but the step that converts bytes which are 00 or 01 into 00 or ff can also be achieved with lighter steps: subtract from the high-bit pattern, i.e. x = 0x8080808080808080 - x; x &= 0x7f7f7f7f7f7f7f7f; x |= x + x; (see the sketch just below these comments). Moreover, since this method shifts and masks anyway after the subtraction, one can simplify the first part a bit
Multiply is a lot cheaper on modern CPUs like all x86 (3 cycle latency, 1/clock or better throughput, single uop so for throughput it only costs as much as an add if you don't bottleneck on multiply throughput). And I think all AArch64 will also have faster HW multipliers. On a machine with slow multiply, x = (x<<8) - x does x*255 as x*256 - x*1. (The intermediate result overflows, but this is unsigned so it safely wraps back to the same result. You can also look at how the borrow propagates to get the right result in each byte, including the top one.)
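
A sketch of the alternative from the comment above (my transcription, and the function name is mine; it assumes every byte of dst is already 0x00 or 0x01, i.e. it replaces step 4):

#include <stdint.h>

// Convert bytes that are 0x00 or 0x01 into 0x00 or 0xFF without a multiply.
static uint64_t ones_to_ff(uint64_t dst) {
    dst = 0x8080808080808080U - dst;   // 0x01 -> 0x7F, 0x00 -> 0x80 (no inter-byte borrow)
    dst &= 0x7F7F7F7F7F7F7F7FU;        // 0x7F -> 0x7F, 0x80 -> 0x00
    return dst | (dst + dst);          // 0x7F -> 0xFF, 0x00 -> 0x00
}
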
18

SIMD byte compares can efficiently produce the masks you want. A SIMD compare (on all ISAs except SVE and AVX-512) produces a vector where the elements are all-0 or all-1 bits (so 0x00 or 0xff for a byte compare) according to the predicate being false or true, respectively. For example a compare-for-equal will make a vector that has 0xff bytes at the matches, all-zero elsewhere.

We just need to get the scalar integer into (the bottom of) a vector register and back out. On x86 and AArch64, that should be worth it, especially if you can keep the constants around across uses.

In GNU C with its ISA-independent vector extensions, we can write code that compiles nicely for x86 and AArch64, and hopefully most other SIMD ISAs. At least with GCC. Clang manages to waste instructions especially for x86.

#ifdef __GNUC__

// clang for x86-64 wastes multiple instructions!
// clang for AArch64 wastes 1
// GCC compiles nicely.
uint64_t zero_out_stuff (uint64_t x)
{
    typedef uint8_t v8u8 __attribute__((vector_size(8)));  // total vector size in bytes, of uint8_t elements

    const unsigned char stuff[] = { 0x10, 0x23, 0x45};

    v8u8 v8x = (v8u8)x;  // like memcpy
    v8u8 clearmask = {0};
    for(size_t i=0; i<sizeof(stuff); i++) {
        v8u8 cmp = (v8x == stuff[i]);  // implicit broadcast of scalar
        clearmask |= cmp;
        // don't update v8x until after all the compares, for better ILP
    }
    v8x &= ~clearmask;
    return (uint64_t)v8x;  // type-pun back to scalar
}
#endif

AArch64 natively supports 64-bit vectors. If you were doing this for more than one uint64_t at once, you'd want to define and use v2u64 vx = {x1, x2}; or something and cast it to v16u8.
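
For example, a sketch of that two-at-a-time variant (my code, untested; the same pattern as zero_out_stuff above, just with 16-byte vectors):

#include <stdint.h>
#include <stddef.h>

void zero_out_stuff_x2 (uint64_t *p)   // assumes p points to at least 2 elements
{
    typedef uint64_t v2u64 __attribute__((vector_size(16)));
    typedef uint8_t  v16u8 __attribute__((vector_size(16)));

    const unsigned char stuff[] = { 0x10, 0x23, 0x45 };

    v2u64 vx = { p[0], p[1] };
    v16u8 v16x = (v16u8)vx;
    v16u8 clearmask = {0};
    for (size_t i = 0; i < sizeof(stuff); i++) {
        v16u8 cmp = (v16x == stuff[i]);   // implicit broadcast, as in the 8-byte version
        clearmask |= cmp;
    }
    v16x &= ~clearmask;
    vx = (v2u64)v16x;
    p[0] = vx[0];
    p[1] = vx[1];
}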

Clang will auto-vectorize the above if you loop over an array, but not very efficiently: 4x 64-bit loads that it shuffles together, yet only one vector store. (With even more unnecessary shuffling than you'd expect from that description, because it's using vector constants with every other 64-bit chunk = 0.) Godbolt:

# AArch64 GCC 15.2  -O3 
zero_out_stuff:
        movi    v0.8b, 0x10          # broadcast the constants
        movi    v29.8b, 0x23
        fmov    d31, x0              # copy incoming scalar arg to 8-byte vector
        movi    v30.8b, 0x45
        cmeq    v0.8b, v31.8b, v0.8b
        cmeq    v29.8b, v31.8b, v29.8b       // the compares: all-0 or all-1 bits within each byte element
        cmeq    v30.8b, v31.8b, v30.8b
        orr     v29.8b, v0.8b, v29.8b        // reduce to one mask
        orr     v30.8b, v29.8b, v30.8b
        bic     v31.8b, v31.8b, v30.8b       // apply the mask: x & ~mask
        umov    x0, v31.d[0]               // move back to scalar
        ret

Unfortunately Clang for x86-64 really over-complicates this. (Except with -march=x86-64-v4 for AVX-512, then it does pretty well, although could maybe be even better with masked compare-into-mask using compare-not-equal, saving the two kand instructions.)

# Clang 21.1 -O3 -march=x86-64-v3
zero_out_stuff:
        vmovq   xmm0, rdi
        vpcmpeqb        xmm1, xmm0, xmmword ptr [rip + .LCPI0_0]
        vpcmpeqb        xmm2, xmm0, xmmword ptr [rip + .LCPI0_1]   # 16-byte constants from memory
        vpcmpeqb        xmm3, xmm0, xmmword ptr [rip + .LCPI0_2]
        vpor    xmm2, xmm2, xmm3
        vpor    xmm1, xmm1, xmm2       # reduce to one mask
        vpcmpeqd        xmm2, xmm2, xmm2   # all-ones
        vpxor   xmm1, xmm1, xmm2           # mask = ~mask
        vpsllw  xmm1, xmm1, 7              # 16-bit element, left-shift by 7
        vpand   xmm1, xmm1, xmmword ptr [rip + .LCPI0_3]   # set1(0x80)
        vpxor   xmm2, xmm2, xmm2           # all-zero
        vpcmpgtb        xmm1, xmm2, xmm1   # get back to bytes being all-0 or all-1 with signed compare against 0
        vpand   xmm0, xmm1, xmm0           # apply the mask
        vmovq   rax, xmm0

Much less goofy with Clang for AArch64, but still one wasted instruction:

# ARMv8 Clang 21  -O3
zero_out_stuff:
        movi    v0.8b, #35
        fmov    d1, x0
        movi    v2.8b, #16
        movi    v3.8b, #69
        cmeq    v0.8b, v1.8b, v0.8b
        cmeq    v2.8b, v1.8b, v2.8b
        cmeq    v3.8b, v1.8b, v3.8b
        mvn     v0.8b, v0.8b           // mask0 = ~mask1
        bic     v0.8b, v0.8b, v2.8b    // mask0 &= ~mask2
        bic     v0.8b, v0.8b, v3.8b    // mask0 &= ~mask3
        and     v0.8b, v0.8b, v1.8b    // apply the mask
        fmov    x0, d0                 // IDK if fmov is better than umov x0, v0.d[0]  like GCC does.  Hopefully not worse.
        ret

A worse version of the function applies each mask right away, creating one serial dep chain of compare then bitwise. So it has less instruction-level parallelism if compiled as written (which GCC and Clang do for x86-64 and AArch64).

The more verbose scalar -> vector -> scalar compiles equivalently; I didn't realize the simpler version above was allowed until I tried it.

// slower.
// Except x86-64 Clang doesn't go nuts with it.
uint64_t zero_out_stuff_dep_chain (uint64_t x)
{
    typedef uint64_t v1u64 __attribute__((vector_size(8)));
    typedef uint8_t v8u8 __attribute__((vector_size(8)));

    const unsigned char stuff[] = { 0x10, 0x23, 0x45};

    v1u64 vx = {x};    // vector of 1 scalar element
    v8u8 v8x = (v8u8)vx;  // or just memcpy scalar to v8u8
    for(size_t i=0; i<sizeof(stuff); i++) {
        v8u8 cmp = (v8x == stuff[i]);  // implicit broadcast of scalar
        v8x &= ~cmp;
    }
    vx = (v1u64)v8x;  // cast back to a vector of 1x u64
    return vx[0];     // take the low scalar element
}

1 Comment

(For x86, the relevant C intrinsics are _mm_cmpeq_epi8 and _mm_andn_si128. And for getting data in/out, _mm_loadu_si128 from an array of uint64_t, or for a single value _mm_cvtsi64x_si128(__int64) (movq xmm, r/m64) and _mm_cvtsi128_si64x(__m128i) to get the low scalar value back out. intel.com/content/www/us/en/docs/intrinsics-guide/…). Or _mm_loadu_si64(void*) with a memory source for the movq xmm, xmm/m64 opcode.
14

I managed to combine the algorithm from the source link into a mask that works the reverse way from the others presented here, while using fewer operations: roughly 4*N + 3 operations for N distinct byte values to detect. In my tests, this is the fastest non-SIMD implementation here.

uint64_t maskbytes(uint64_t v) {
    const uint64_t ones = 0x0101010101010101U;
    const uint64_t high = 0x8080808080808080U;

    // per byte: the msb of maskXX ends up clear only where that byte of v equals 0xXX
    // (the OR back into maskXX prevents false matches on bytes that differ only in the msb, e.g. 0x90 vs 0x10)
    uint64_t mask10 = v ^ 0x1010101010101010U;
    mask10 |= (mask10 | high) - ones;
    uint64_t mask23 = v ^ 0x2323232323232323U;
    mask23 |= (mask23 | high) - ones;
    uint64_t mask45 = v ^ 0x4545454545454545U;
    mask45 |= (mask45 | high) - ones;

    uint64_t mask = high & ~(mask10 & mask23 & mask45);
    // mask is all zeros if no special byte was found
    return v & ~((mask >> 7) * 255);
}
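
A quick sanity check against the value from the question (a hypothetical test harness; maskbytes() is the function above):

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

uint64_t maskbytes(uint64_t v);   // defined above

int main(void) {
    uint64_t v = 0x1223450945112310U;
    printf("%016" PRIx64 "\n", maskbytes(v));   // prints 1200000900110000
    return 0;
}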

Thanks everyone for your contributions

3 Comments

Minor: LL not needed in = 0x.................ULL;. = 0x.................U; is sufficient.
The ~(...) & high can be factored out (due to associativity and De Morgan's law), saving 2*N-2 operations, i.e., maskX = v ^ (X*ones); maskX |= (maskX|high)-ones; mask = ~(mask10 & mask23 & mask45) & high; mask = (mask >> 7)*255;
That worked, thanks! I'm editing the solution with your suggestion, it seems like the best solution without going into SSE (which tbh seems very interesting, I should study it more)
12

Building on @n.m.couldbeanAI's answer, here is a function that uses the hack to clear the specific values 0x10, 0x23 and 0x45 as posted:

uint64_t cmpne8(uint64_t src, uint8_t val) {
    // 1. replicate the value
    uint64_t pat = val * 0x0101010101010101U;
    // 2. XOR out the replicated value
    uint64_t dst = src ^ pat;
    // 3. Turn every non-zero byte to 1.
    dst |= dst >> 4; dst &= 0x0F0F0F0F0F0F0F0F;
    dst |= dst >> 2; dst &= 0x0303030303030303;
    dst |= dst >> 1; dst &= 0x0101010101010101;
    return dst;
}

uint64_t mask_off_10_23_45(uint64_t src) {
    // Compute the combined mask
    uint64_t mask = cmpne8(src, 0x10) & cmpne8(src, 0x23) & cmpne8(src, 0x45);
    // 4. Multiply by 255 turning every 1 to FF
    mask *= 255;
    // 5. Mask out 
    return src & mask;
}

1 Comment

Cool optimization you used! If you treat pat as a compile-time expression, this is roughly 11*N + 2 operations for N distinct byte values to detect. I just found a faster solution here myself though, but really appreciate it
5

Should * be expensive (and byte shifts not), the following recursive approach offers another way to look at things.

Break the 64-bit problem down to 2 32-bit ones, then 32-bit to 16-bit, then 16-bit to 8-bit.

#include <stdint.h>

uint8_t clear_specific_byte8(uint8_t x, uint8_t val) {
  return (x == val) ? 0 : x;   // keep x unless it equals val
}

uint16_t clear_specific_bytes16(uint16_t x, uint8_t val) {
  uint16_t y = clear_specific_byte8((uint8_t)(x >> 8), val);
  return (y << 8) | clear_specific_byte8((uint8_t)x, val);
}

uint32_t clear_specific_bytes32(uint32_t x, uint8_t val) {
  uint32_t y = clear_specific_bytes16((uint16_t)(x >> 16), val);
  return (y << 16) | clear_specific_bytes16((uint16_t)x, val);
}

uint64_t clear_specific_bytes64(uint64_t x, uint8_t val) {
  uint64_t y = clear_specific_bytes32((uint32_t)(x >> 32), val);
  return (y << 32) | clear_specific_bytes32((uint32_t)x, val);
}
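
Since a cleared byte becomes 0x00 and no longer matches a non-zero target, the question's three values can then be handled by chaining calls (a sketch; the wrapper name is mine):

uint64_t clear_10_23_45(uint64_t x) {
  x = clear_specific_bytes64(x, 0x10);
  x = clear_specific_bytes64(x, 0x23);
  x = clear_specific_bytes64(x, 0x45);
  return x;
}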

Comments

3

Here is another option based on @n.m.couldbeanAI's and chqrlie's answers, for the case when multiplication is an expensive operation:

uint64_t cmpne8(uint64_t src, uint8_t val) {
    // 1. Replicate the byte
    uint64_t v = val;
    v |= v << 8;
    v |= v << 16;
    v |= v << 32;
    // 2. Reset matching bytes
    uint64_t dst = src ^ v;
    // 3. Turn every non-zero byte to FF
    uint64_t even = dst & 0x00FF00FF00FF00FF;
    even |= even << 1;
    even |= even << 2;
    even |= even << 4;
    even |= even >> 8;
    even &= 0x00FF00FF00FF00FF;
    uint64_t odd = dst & 0xFF00FF00FF00FF00;
    odd |= odd >> 1;
    odd |= odd >> 2;
    odd |= odd >> 4;
    odd |= odd << 8;
    odd &= 0xFF00FF00FF00FF00;
    return even | odd;
}

uint64_t mask_off_10_23_45(uint64_t src) {
    // Compute the combined mask
    uint64_t mask = cmpne8(src, 0x10) & cmpne8(src, 0x23) & cmpne8(src, 0x45);
    // 4. Mask out 
    return src & mask;
}

Comments

2

Pretty much every x86_64 processor where people will be installing new software these days has SSE4 instructions that let you operate on 128-bit vector registers. C compilers provide intrinsics that make it easy to use those instructions, as documented in the Intel Intrinsics Guide.

This works with MinGW GCC on Windows:

// GCC options: -msse4 -O1
#include <stdint.h>
#include <stdio.h>
#include <smmintrin.h>

uint64_t clear_10_23_45(uint64_t value)
{
  __m128i v = _mm_set_epi64x(0, value);
  v = _mm_andnot_si128(_mm_cmpeq_epi8(v, _mm_set1_epi8(0x10)), v);
  v = _mm_andnot_si128(_mm_cmpeq_epi8(v, _mm_set1_epi8(0x23)), v);
  v = _mm_andnot_si128(_mm_cmpeq_epi8(v, _mm_set1_epi8(0x45)), v);
  return _mm_extract_epi64(v, 0);
}

int main()
{
  uint64_t value = 0x1223450945112310;
  printf("before: %llx\n", value);
  value = clear_10_23_45(value);
  printf("after:  %llx\n", value);
}

Assembly generated by GCC:

clear_10_23_45:
        movq    %rcx, %xmm1
        movl    $269488144, %eax
        movd    %eax, %xmm0
        pshufd  $0, %xmm0, %xmm0
        pcmpeqb %xmm1, %xmm0
        pandn   %xmm1, %xmm0
        movl    $589505315, %eax
        movd    %eax, %xmm1
        pshufd  $0, %xmm1, %xmm1
        pcmpeqb %xmm0, %xmm1
        pandn   %xmm0, %xmm1
        movl    $1162167621, %eax
        movd    %eax, %xmm0
        pshufd  $0, %xmm0, %xmm0
        pcmpeqb %xmm1, %xmm0
        pandn   %xmm1, %xmm0
        movq    %xmm0, %rax
        ret

You can make this code even better by adding more variables and structuring it differently to reduce serial dependency chains, allowing the CPU to do multiple instructions at once. Something like this should be faster:

uint64_t clear_10_23_45(uint64_t value)
{
  __m128i v = _mm_set_epi64x(0, value);
  __m128i mask1 = _mm_cmpeq_epi8(v, _mm_set1_epi8(0x10));
  __m128i mask2 = _mm_cmpeq_epi8(v, _mm_set1_epi8(0x23));
  __m128i mask3 = _mm_cmpeq_epi8(v, _mm_set1_epi8(0x45));
  v = _mm_andnot_si128(_mm_or_si128(mask1, mask2), v);
  v = _mm_andnot_si128(mask3, v);
  return _mm_extract_epi64(v, 0);
}

2 Comments

Aside: as "%llx" matches unsigned long long and not certainly uint64_t, to maintain correctness now and in the future, better to use printf("before: %llx\n", (unsigned long long) value);, printf("before: %" PRIx64 "\n", value); or the like.
You don't need SSE4 for this. godbolt.org/z/96rfc6e5G _mm_extract_epi64(v, 0); even optimizes to SSE2 movq. If you use _mm_cvtsi128_si64 in the source, you can compile it with only baseline x86-64 (which includes SSE2). Also, your version has a serial dependency chain through all the compares and andnots, since you update v before the next compare. Two separate dep chains, comparing v_orig and updating v_masked, is faster. (clear_10_23_45_v2 in my godbolt link. Or clear_10_23_45 combines the masks with OR before using. 1 OR + 2x ANDN is probably best.)
2

Assuming you know the values which are to be cleared at compile time, all values differ in their lower 4 bits, and none of these values has the highest bit set (all conditions are met in your example), you can get a quite elegant solution using a pshufb-lookup (requires SSSE3) and just 3 uops in total (besides moving values into and from SSE registers):

#include <immintrin.h>

uint64_t clear_10_23_45(uint64_t value)
{
    // Look-up Table: If the bottom nibble matches one of the to-be-cleared values, put the corresponding value at that position
    // All other cases (as well as bytes with the highest bit set) will be zero.
    const __m128i LUT = _mm_setr_epi8(0x10, 0, 0, 0x23, 0, 0x45, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0);
    __m128i v = _mm_cvtsi64_si128(value);
    // mask will be 0xFF only if one of 0x10, 0x23, or 0x45 was at that position
    __m128i mask = _mm_cmpeq_epi8(_mm_shuffle_epi8(LUT, v), v);
    return _mm_cvtsi128_si64(_mm_andnot_si128(mask, v));
}

As long as all masked-out values differ within some 4 bits (ideally the lowest nibble, or at least 4 consecutive bits), this can easily be adapted to other values (this may need some bit-twiddling before the pshufb lookup); see the sketch below the Godbolt link for one example. For more than 16 values to be masked out, two or more pshufbs could be combined.

Godbolt-Demo: https://godbolt.org/z/jjGh5cWP6
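
For example, here is a sketch of one such adaptation for the hypothetical values 0x10, 0x20 and 0x40, which differ only in their high nibble, so the lookup index is taken from the high nibble of each byte instead (my code, not from the original answer, untested):

#include <stdint.h>
#include <immintrin.h>

uint64_t clear_10_20_40(uint64_t value)
{
    // LUT indexed by the high nibble: LUT[1] = 0x10, LUT[2] = 0x20, LUT[4] = 0x40
    const __m128i LUT = _mm_setr_epi8(0, 0x10, 0x20, 0, 0x40, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0);
    __m128i v = _mm_cvtsi64_si128(value);
    // extract the high nibble of each byte (SSE has no byte shift, so shift 16-bit lanes and mask)
    __m128i idx = _mm_and_si128(_mm_srli_epi16(v, 4), _mm_set1_epi8(0x0F));
    // mask is 0xFF only where the whole byte equals one of 0x10, 0x20, 0x40
    __m128i mask = _mm_cmpeq_epi8(_mm_shuffle_epi8(LUT, idx), v);
    return _mm_cvtsi128_si64(_mm_andnot_si128(mask, v));
}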

Comments

2

When the byte values to be zeroed all have their most significant bit clear (true of 0x10, 0x23 and 0x45):

uint64_t clear_10_23_45 (uint64_t v) {
    const uint64_t x7F = 0x7F7F7F7F7F7F7F7F;
    uint64_t m = v & x7F; // clear msb of each byte

    // detect the target values in the low 7 bits of each byte:
    // if the xor doesn't produce `0x00`, the msb of that byte becomes set
    m = ((m ^ 0x1010101010101010) + x7F) &
        ((m ^ 0x2323232323232323) + x7F) &
        ((m ^ 0x4545454545454545) + x7F);

    m = (v | m) & ~x7F; // keep only the msb of each byte: set if it was set in `v` or in `m`
    m ^= m - (m >> 7); // convert 0x80 to 0xFF
    return v & m;
}

1 Comment

Mr. aqrit, I've seen you improve Daniel Lemire's byte-detection code before! I'm glad you shared your knowledge here too (: I just refactored my submission; it is now faster than yours by around 12% (15ns vs 17ns). My previous submission had some redundancies and took around 20ns.
