SIMD byte compares can efficiently produce the masks you want. A SIMD compare (on all ISAs except SVE and AVX-512) produces a vector where the elements are all-0 or all-1 bits (so 0x00 or 0xff for a byte compare) according to the predicate being false or true, respectively. For example, a compare-for-equal will make a vector that has 0xff bytes at the matches, all-zero elsewhere.
We just need to get the scalar integer into (the bottom of) a vector register and back out. On x86 and AArch64, that should be worth it, especially if you can keep the constants around across uses.
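Here's a tiny standalone demo of those two points, just as an illustration (it uses the GNU C vector syntax introduced below, with one of the test values from the end of this answer):

#include <stdio.h>
#include <stdint.h>

typedef uint8_t v8u8 __attribute__((vector_size(8)));

int main(void)
{
    uint64_t x = 0x1223450945112310;    // bytes 12 23 45 09 45 11 23 10
    v8u8 m = ((v8u8)x == 0x23);         // 0xff bytes where x has a 0x23 byte, 0x00 elsewhere
    printf("%#018llx\n", (unsigned long long)(uint64_t)m);  // should print 0x00ff00000000ff00
    return 0;
}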
In GNU C with its ISA-independent vector extensions, we can write code that compiles nicely for x86 and AArch64, and hopefully most other SIMD ISAs. At least with GCC; Clang manages to waste instructions, especially for x86.
#include <stdint.h>
#include <stddef.h>

#ifdef __GNUC__
// clang for x86-64 wastes multiple instructions!
// clang for AArch64 wastes 1
// GCC compiles nicely.
uint64_t zero_out_stuff (uint64_t x)
{
    typedef uint8_t v8u8 __attribute__((vector_size(8)));  // total vector size in bytes, of uint8_t elements
    const unsigned char stuff[] = { 0x10, 0x23, 0x45 };

    v8u8 v8x = (v8u8)x;      // like memcpy
    v8u8 clearmask = {0};
    for (size_t i = 0; i < sizeof(stuff); i++) {
        v8u8 cmp = (v8x == stuff[i]);   // implicit broadcast of scalar
        clearmask |= cmp;
        // don't update v8x until after all the compares, for better ILP
    }
    v8x &= ~clearmask;
    return (uint64_t)v8x;    // type-pun back to scalar
}
#endif
AArch64 natively supports 64-bit vectors. If you were doing this for more than one uint64_t at once, you'd want to define and use v2u64 vx = {x1, x2}; or something and cast it to v16u8.
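Untested sketch of what that might look like (the name zero_out_stuff_x2 and the exact structure are mine, following the same pattern as above but on a full 16-byte vector):

#include <stdint.h>
#include <stddef.h>

typedef uint64_t v2u64 __attribute__((vector_size(16)));
typedef uint8_t  v16u8 __attribute__((vector_size(16)));

v2u64 zero_out_stuff_x2 (uint64_t x1, uint64_t x2)
{
    const unsigned char stuff[] = { 0x10, 0x23, 0x45 };
    v2u64 vx = { x1, x2 };
    v16u8 v16x = (v16u8)vx;     // reinterpret the pair as 16 bytes
    v16u8 clearmask = {0};
    for (size_t i = 0; i < sizeof(stuff); i++) {
        v16u8 cmp = (v16x == stuff[i]);   // implicit broadcast, as before
        clearmask |= cmp;
    }
    v16x &= ~clearmask;
    return (v2u64)v16x;         // elements [0] and [1] are the two cleaned values
}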
Clang will auto-vectorize zero_out_stuff if you loop over an array of uint64_t, but not very efficiently: 4x 64-bit loads which it shuffles together, but only one vector store. (With even more unnecessary shuffling than you'd expect from that description, because it's using vector constants with every other 64-bit chunk = 0.) Godbolt:
# AArch64 GCC 15.2 -O3
zero_out_stuff:
movi v0.8b, 0x10 # broadcast the constants
movi v29.8b, 0x23
fmov d31, x0 # copy incoming scalar arg to 8-byte vector
movi v30.8b, 0x45
cmeq v0.8b, v31.8b, v0.8b
cmeq v29.8b, v31.8b, v29.8b // the compares: all-0 or all-1 bits within each byte element
cmeq v30.8b, v31.8b, v30.8b
orr v29.8b, v0.8b, v29.8b // reduce to one mask
orr v30.8b, v29.8b, v30.8b
bic v31.8b, v31.8b, v30.8b // apply the mask: x & ~mask
umov x0, v31.d[0] // move back to scalar
ret
Unfortunately Clang for x86-64 really over-complicates this. (Except with -march=x86-64-v4 for AVX-512, then it does pretty well, although it could maybe be even better with masked compare-into-mask using compare-not-equal, saving the two kand instructions; there's an intrinsics sketch of that idea after the asm.)
# Clang 21.1 -O3 -march=x86-64-v3
zero_out_stuff:
vmovq xmm0, rdi
vpcmpeqb xmm1, xmm0, xmmword ptr [rip + .LCPI0_0]
vpcmpeqb xmm2, xmm0, xmmword ptr [rip + .LCPI0_1] # 16-byte constants from memory
vpcmpeqb xmm3, xmm0, xmmword ptr [rip + .LCPI0_2]
vpor xmm2, xmm2, xmm3
vpor xmm1, xmm1, xmm2 # reduce to one mask
vpcmpeqd xmm2, xmm2, xmm2 # all-ones
vpxor xmm1, xmm1, xmm2 # mask = ~mask
vpsllw xmm1, xmm1, 7 # 16-bit element, left-shift by 7
vpand xmm1, xmm1, xmmword ptr [rip + .LCPI0_3] # set1(0x80)
vpxor xmm2, xmm2, xmm2 # all-zero
vpcmpgtb xmm1, xmm2, xmm1 # get back to bytes being all-0 or all-1 with signed compare against 0
vpand xmm0, xmm1, xmm0 # apply the mask
vmovq rax, xmm0
ret
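Untested intrinsics sketch of that AVX-512 idea (the function name is mine; needs AVX-512BW+VL, e.g. -march=x86-64-v4): each masked compare-not-equal ANDs its result into the accumulating mask register, so the separate kand instructions go away, at the cost of a serial dependency chain through k.

#include <immintrin.h>
#include <stdint.h>

uint64_t zero_out_stuff_avx512 (uint64_t x)
{
    __m128i v = _mm_cvtsi64_si128(x);
    __mmask16 keep = _mm_cmpneq_epi8_mask(v, _mm_set1_epi8(0x10));    // 1 = byte to keep
    keep = _mm_mask_cmpneq_epi8_mask(keep, v, _mm_set1_epi8(0x23));   // AND in via the mask operand
    keep = _mm_mask_cmpneq_epi8_mask(keep, v, _mm_set1_epi8(0x45));
    return _mm_cvtsi128_si64(_mm_maskz_mov_epi8(keep, v));            // zero the bytes where keep==0
}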
Much less goofy with Clang for AArch64, but still one wasted instruction:
# ARMv8 Clang 21 -O3
zero_out_stuff:
movi v0.8b, #35
fmov d1, x0
movi v2.8b, #16
movi v3.8b, #69
cmeq v0.8b, v1.8b, v0.8b
cmeq v2.8b, v1.8b, v2.8b
cmeq v3.8b, v1.8b, v3.8b
mvn v0.8b, v0.8b // mask0 = ~mask1
bic v0.8b, v0.8b, v2.8b // mask0 &= ~mask2
bic v0.8b, v0.8b, v3.8b // mask0 &= ~mask3
and v0.8b, v0.8b, v1.8b // apply the mask
fmov x0, d0 // IDK if fmov is better than umov x0, v0.d[0] like GCC does. Hopefully not worse.
ret
A worse version of the function applies each mask right away, creating one serial dependency chain of compare then bitwise op. So it has less instruction-level parallelism if compiled as written (which GCC and Clang do, for both x86-64 and AArch64).
The more verbose scalar -> vector of 1x u64 -> vector of bytes conversion (and back) compiles equivalently to the direct casts; I didn't realize the simpler version above was allowed until I tried it.
#include <stdint.h>
#include <stddef.h>

// slower.
// Except x86-64 Clang doesn't go nuts with it.
uint64_t zero_out_stuff_dep_chain (uint64_t x)
{
    typedef uint64_t v1u64 __attribute__((vector_size(8)));
    typedef uint8_t  v8u8  __attribute__((vector_size(8)));
    const unsigned char stuff[] = { 0x10, 0x23, 0x45 };

    v1u64 vx = {x};         // vector of 1 scalar element
    v8u8 v8x = (v8u8)vx;    // or just memcpy scalar to v8u8
    for (size_t i = 0; i < sizeof(stuff); i++) {
        v8u8 cmp = (v8x == stuff[i]);   // implicit broadcast of scalar
        v8x &= ~cmp;        // apply each mask right away: serial dependency chain
    }
    vx = (v1u64)v8x;        // cast back to a vector of 1x u64
    return vx[0];           // take the low scalar element
}
A couple of test values: 0x12'23'45'09'45'11'23'10 (contains several bytes from stuff[]) and 0x12'00'44'51'02'33'9A'22 (no matching bytes).
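A minimal harness to print the results for those inputs (the expected values in the comments are hand-computed, not compiler output):

#include <stdio.h>
#include <stdint.h>

uint64_t zero_out_stuff(uint64_t x);    // the function from above

int main(void)
{
    uint64_t tests[] = { 0x1223450945112310, 0x1200445102339A22 };
    for (int i = 0; i < 2; i++)
        printf("%#018llx -> %#018llx\n",
               (unsigned long long)tests[i],
               (unsigned long long)zero_out_stuff(tests[i]));
    // expected:
    //   0x1223450945112310 -> 0x1200000900110000   (the 0x10 / 0x23 / 0x45 bytes cleared)
    //   0x1200445102339a22 -> 0x1200445102339a22   (no matching bytes, unchanged)
    return 0;
}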