I am trying to determine the parity of the result of a complex calculation involving two uint64_t variables, value_a and value_b, along with two distinct masks: NO_SHIFT_MASK and SHIFT_MASK. These masks are non-overlapping and can be redefined or padded with zeros to optimize the calculation. I have written multiple different versions of the code, each compiled for modern x86 processors using instructions like popcnt. I assumed out-of-order, parallel execution, yet the version with less parallelism runs faster, which I find confusing. I am seeking insights into why the less parallelized version is faster and whether there are better optimization strategies for this problem.
- The format of value_a and value_b must be the same.
- The masks are two arbitrary, non-overlapping masks (i.e. (NO_SHIFT_MASK & SHIFT_MASK) == 0), where each mask has zero or more contiguous bits set to 1.
- It is possible to redefine the format by reordering the masks or adding zero padding to the variables, if that saves steps in the calculation. I suspect that adding some zero padding might save some bit shifts, but it is not clear how to do this.
- I wrote multiple different versions of this code, which need to be compiled for modern x86 processors to run quickly (in gcc, compile with -march=x86-64-v4), using instructions like popcnt. I assumed out-of-order, parallel execution. However, the version with less parallelization runs faster, and I don't understand why. Maybe my processor is too old. If you have a modern CPU, please tell me about your benchmark results.
I considered storing both the value and its gray code in the 64 bits, when the masks are short enough, to save the conversion step. However, in general the masks can cover all the 64 bits. I find it acceptable to reserve a small number of bits for zero padding, but storing the gray code would double the space and halve the capacity. I can make multiple code paths for small and large masks, but here I am interested in the worst cases, where the masks occupy up to 60 bits.
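For reference, the packed layout I considered would look something like this; it is only an illustrative sketch of the small-mask fast path, where the 30-bit field width and the helper names are made up for the example:
#include <cstdint>

uint64_t pack_with_gray(uint64_t value)
{
    // hypothetical field width: keep only the low 30 bits of the value
    uint64_t v = value & ((1ULL << 30) - 1);
    // gray code as used above: v ^ (v << 1); needs 31 bits for a 30-bit value
    uint64_t gray = v ^ (v << 1);
    // low 30 bits hold the value, bits 32..62 hold its precomputed gray code,
    // so the conversion step is paid once at store time instead of per query
    return v | (gray << 32);
}

uint64_t packed_value(uint64_t packed) { return packed & ((1ULL << 30) - 1); }
uint64_t packed_gray(uint64_t packed)  { return packed >> 32; }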
Steps:
- Calculate the gray code of value_a.
- Right shift the bits of the gray code that are masked by SHIFT_MASK.
- Perform a bitwise AND with value_b.
- Compute the parity of the resulting bits.
(This is pseudocode; I provide the full C++ code below.)
modified_a = value_a ^ (value_a << 1);
no_shift_modified_a = modified_a & NO_SHIFT_MASK;
shift_modified_a = (modified_a & SHIFT_MASK) >> 1;
1ULL & (popcount(no_shift_modified_a & value_b) + popcount(shift_modified_a & value_b));
Example:
uint64_t value_a = 0b0111'010111;
uint64_t value_b = 0b0111'101010;
uint64_t SHIFT_MASK = 0b0000'111111; // The rightmost bit of value_a matching SHIFT_MASK doesn't change the calculation
uint64_t NO_SHIFT_MASK = 0b1111'000000;
//modified_a
//0b0111'010111 value_a
//0b1110'101110 value_a << 1
//0b1001'111001 value_a ^ (value_a << 1)
modified_a = value_a ^ (value_a << 1);
//no_shift_modified_a
//0b1001'111001 modified_a
//0b1111'000000 NO_SHIFT_MASK
//0b1001'000000 modified_a & NO_SHIFT_MASK
no_shift_modified_a = modified_a & NO_SHIFT_MASK;
//shift_modified_a
//0b1001'111001 modified_a
//0b0000'111111 SHIFT_MASK
//0b0000'111001 modified_a & SHIFT_MASK;
//0b0000'011100 (modified_a & SHIFT_MASK) >> 1 // It doesn't matter that the rightmost bit is lost
shift_modified_a = (modified_a & SHIFT_MASK) >> 1;
//popcount b masked by no_shift_modified_a
//0b1001'000000 no_shift_modified_a
//0b0111'101010 value_b
//0b0001'000000 no_shift_modified_a & value_b
//1 popcount(no_shift_modified_a & value_b)
//popcount b masked by shift_modified_a
//0b0000'011100 shift_modified_a
//0b0111'101010 value_b
//0b0000'001000 shift_modified_a & value_b
//1 popcount(shift_modified_a & value_b)
//Find parity of total popcount
//1ULL & (1+1)==0 // 1ULL & (popcount(no_shift_modified_a & value_b) + popcount(shift_modified_a & value_b))
1 & (popcount(no_shift_modified_a & value_b) + popcount(shift_modified_a & value_b));
Note that the rightmost bit of value_a & SHIFT_MASK doesn't change the result.
I tried to do a single popcount of the entire result:
modified_a = value_a ^ (value_a << 1); // two clocks
no_shift_modified_a = modified_a & NO_SHIFT_MASK; // one clock
shift_modified_a = (modified_a & SHIFT_MASK) >> 1; // two clocks
both_shifts_modified_a = no_shift_modified_a | shift_modified_a; // one clock
1ULL & popcount(both_shifts_modified_a & value_b); // three clocks
This sums to 9 clocks, but on x86, independent variables are executed in parallel out of order, so it would go as:
modified_a = value_a ^ (value_a << 1); // two clocks
//subtotal: 2 clocks
//parallel out of order
no_shift_modified_a = modified_a & NO_SHIFT_MASK; // one clock
shift_modified_a = (modified_a & SHIFT_MASK) >> 1; // two clocks
// subtotal: 2 clocks
both_shifts_modified_a = no_shift_modified_a | shift_modified_a; // one clock
1ULL & popcount(both_shifts_modified_a & value_b); // three clocks
So in total it would be 8 clocks, since apart from the two masking operations that run in parallel, every instruction depends on the previous one, and everything ultimately depends on value_a.
However, I can break the dependency chain on value_a by shifting the masked value_b to the left instead of shifting modified_a to the right.
Since SHIFT_MASK is a constant, I can right-shift it ahead of time at no runtime cost, and then use the pre-shifted mask to move the masked bits of value_b back to the left.
// Padding with 0
uint64_t value_a = 0b0111'010111;
uint64_t value_b = 0b0111'101010;
uint64_t SHIFT_MASK_SHIFTED = 0b0000'011111; // The SHIFT_MASK has been right shifted
uint64_t NO_SHIFT_MASK = 0b1111'000000;
uint64_t BOTH_MASK = 0b1111'011111; // == NO_SHIFT_MASK | SHIFT_MASK_SHIFTED
modified_a = value_a ^ (value_a << 1); // two clocks
shift_b = (value_b & BOTH_MASK) + (value_b & SHIFT_MASK_SHIFTED); // two clocks (BOTH_MASK is a precalculated constant, and the two ANDs are independent, so they get calculated in parallel)
1ULL & popcount(modified_a & shift_b); // three clocks
Total: 5 clocks
This version should be faster than the previous one, because it has fewer dependencies, but benchmarks show it's slower, and I'm confused.
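A side note on why the addition in the padded version is valid: adding value_b & SHIFT_MASK_SHIFTED on top of value_b & BOTH_MASK doubles those bits, which is the same as shifting them left by one, and since SHIFT_MASK_SHIFTED is SHIFT_MASK shifted right, the doubled bits stay inside the original SHIFT_MASK region and never carry into NO_SHIFT_MASK. Below is a small standalone sanity check of that identity with the example masks from above (this is the difference between version3 and version4 in the code that follows), separate from the benchmark code:
#include <cstdint>
#include <cassert>

int main()
{
    const uint64_t NO_SHIFT_MASK = 0b1111'000000;
    const uint64_t SHIFT_MASK = 0b0000'111111;
    const uint64_t SHIFT_MASK_SHIFTED = SHIFT_MASK >> 1;
    const uint64_t BOTH_MASK = NO_SHIFT_MASK | SHIFT_MASK_SHIFTED;
    // The addition form and the explicit-shift form must build the same
    // shifted word for every 10-bit value_b.
    for (uint64_t value_b = 0; value_b < (1ULL << 10); ++value_b)
    {
        uint64_t via_add = (value_b & BOTH_MASK) + (value_b & SHIFT_MASK_SHIFTED);
        uint64_t via_shift = (value_b & NO_SHIFT_MASK) | ((value_b & SHIFT_MASK_SHIFTED) << 1);
        assert(via_add == via_shift);
    }
    return 0;
}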
Here is the C++ code:
#include <iostream>
#include <cstdint>
#include <bitset>
#include <vector>
#include <chrono>
#include <random>
#include <iomanip>
#include <stdexcept>
// Version 1: Original calculation
uint64_t version1(const uint64_t VALUE_A, const uint64_t VALUE_B,
const uint64_t NO_SHIFT_MASK, const uint64_t SHIFT_MASK)
{
uint64_t modified_a = VALUE_A ^ (VALUE_A << 1);
uint64_t no_shift_modified_a = modified_a & NO_SHIFT_MASK;
uint64_t shift_modified_a = (modified_a & SHIFT_MASK) >> 1;
return 1ULL & (__builtin_popcountll(no_shift_modified_a & VALUE_B) + __builtin_popcountll(shift_modified_a & VALUE_B));
}
// Version 2: Single popcount of the combined result
// This is the fastest despite unavoidable dependencies from VALUE_A.
// (╯°□°)╯︵ ┻━┻ Why?!
uint64_t version2(const uint64_t VALUE_A, const uint64_t VALUE_B,
const uint64_t NO_SHIFT_MASK, const uint64_t SHIFT_MASK)
{
uint64_t modified_a = VALUE_A ^ (VALUE_A << 1);
uint64_t both_shifts_modified_a = (modified_a & NO_SHIFT_MASK) | ((modified_a & SHIFT_MASK) >> 1);
return __builtin_parityll(both_shifts_modified_a & VALUE_B);
}
// Version 3: Shift the masked VALUE_B (via addition) instead of modified_a
// Despite both input lines being computable in parallel and out of order, this is slower.
// (╯°□°)╯︵ ┻━┻ Why?!
uint64_t version3(const uint64_t VALUE_A, const uint64_t VALUE_B,
const uint64_t BOTH_MASK, const uint64_t SHIFT_MASK_SHIFTED)
{
uint64_t modified_a = VALUE_A ^ (VALUE_A << 1);
uint64_t shifted_b = (VALUE_B & BOTH_MASK) + (VALUE_B & SHIFT_MASK_SHIFTED);
// return 1ULL & popcount(modified_a & shifted_b);
return __builtin_parityll(modified_a & shifted_b); // GCC/Clang builtin
// TODO: check if __builtin_parity is faster for smaller bits
}
// This version is slowest, despite having fewer dependencies and shorter assembly
// https://godbolt.org/z/ac54M1YK9
// (╯°□°)╯︵ ┻━┻ Why?!
uint64_t version4(const uint64_t VALUE_A, const uint64_t VALUE_B,
const uint64_t NO_SHIFT_MASK, const uint64_t SHIFT_MASK_SHIFTED)
{
uint64_t modified_a = VALUE_A ^ (VALUE_A << 1);
uint64_t shifted_b = (VALUE_B & NO_SHIFT_MASK) | ((VALUE_B & SHIFT_MASK_SHIFTED) << 1);
// return 1ULL & popcount(modified_a & shifted_b);
return __builtin_parityll(modified_a & shifted_b); // GCC/Clang builtin
// TODO: check if __builtin_parity is faster for smaller bits
}
void validate(const std::vector<uint64_t> &VALUES,
const uint64_t NO_SHIFT_MASK, const uint64_t SHIFT_MASK,
const uint64_t BOTH_MASKS, const uint64_t SHIFT_MASK_SHIFTED)
{
std::cout << "Initializing validation..." << std::endl;
auto start = std::chrono::high_resolution_clock::now();
for (const auto &value_a : VALUES)
{
for (const auto &value_b : VALUES)
{
uint64_t result_v1 = version1(value_a, value_b, NO_SHIFT_MASK, SHIFT_MASK);
uint64_t result_v2 = version2(value_a, value_b, NO_SHIFT_MASK, SHIFT_MASK);
uint64_t result_v3 = version3(value_a, value_b, BOTH_MASKS, SHIFT_MASK_SHIFTED);
uint64_t result_v4 = version4(value_a, value_b, NO_SHIFT_MASK, SHIFT_MASK_SHIFTED);
if (__builtin_expect(!!(result_v2 != result_v1 ||
result_v3 != result_v1 ||
result_v4 != result_v1),
false))
{
std::cout << "Mismatch between versions." << std::endl
<< "Value A: " << value_a << " (0b" << std::bitset<64>(value_a) << ")" << std::endl
<< "Value B: " << value_b << " (0b" << std::bitset<64>(value_b) << ")" << std::endl;
std::cout << "Results: " << std::endl
<< "Version 1: " << result_v1 << std::endl
<< "Version 2: " << result_v2 << std::endl
<< "Version 3: " << result_v3 << std::endl
<< "Version 4: " << result_v4 << std::endl;
throw std::runtime_error("Results from different versions do not match");
}
}
static size_t iteration_count = 0;
static size_t total_iterations = VALUES.size();
++iteration_count;
if ((iteration_count & ((1 << 10) - 1)) == 0)
{ // Update every 2^10 iterations to avoid excessive output
double percent_complete = (static_cast<double>(iteration_count) / total_iterations) * 100;
std::cout << "\r" << "Progress: " << std::fixed << std::setprecision(2) << percent_complete << "% completed" << std::flush;
}
}
std::cout << std::endl
<< "Validation complete." << std::endl;
}
void benchmark(const std::vector<uint64_t> &VALUES,
const uint64_t NO_SHIFT_MASK, const uint64_t SHIFT_MASK,
const uint64_t BOTH_MASKS, const uint64_t SHIFT_MASK_SHIFTED)
{
std::cout << "Starting benchmark." << std::endl;
const uint16_t REPETITIONS = 10u;
auto start = std::chrono::high_resolution_clock::now();
for (uint16_t repetition = 0; repetition < REPETITIONS; ++repetition)
{
for (const auto &value_a : VALUES)
{
for (const auto &value_b : VALUES)
{
volatile uint64_t result = version1(value_a, value_b, NO_SHIFT_MASK, SHIFT_MASK);
}
}
}
auto end = std::chrono::high_resolution_clock::now();
std::chrono::duration<double> duration = end - start;
std::cout << "Version 1 took " << duration.count() << " seconds\n";
start = std::chrono::high_resolution_clock::now();
for (uint16_t repetition = 0; repetition < REPETITIONS; ++repetition)
{
for (const auto &value_a : VALUES)
{
for (const auto &value_b : VALUES)
{
volatile uint64_t result = version2(value_a, value_b, NO_SHIFT_MASK, SHIFT_MASK);
}
}
}
end = std::chrono::high_resolution_clock::now();
duration = end - start;
std::cout << "Version 2 took " << duration.count() << " seconds\n";
start = std::chrono::high_resolution_clock::now();
for (uint16_t repetition = 0; repetition < REPETITIONS; ++repetition)
{
for (const auto &value_a : VALUES)
{
for (const auto &value_b : VALUES)
{
volatile uint64_t result = version3(value_a, value_b, BOTH_MASKS, SHIFT_MASK_SHIFTED);
}
}
}
end = std::chrono::high_resolution_clock::now();
duration = end - start;
std::cout << "Version 3 took " << duration.count() << " seconds\n";
start = std::chrono::high_resolution_clock::now();
for (uint16_t repetition = 0; repetition < REPETITIONS; ++repetition)
{
for (const auto &value_a : VALUES)
{
for (const auto &value_b : VALUES)
{
volatile uint64_t result = version4(value_a, value_b, NO_SHIFT_MASK, SHIFT_MASK_SHIFTED);
}
}
}
end = std::chrono::high_resolution_clock::now();
duration = end - start;
std::cout << "Version 4 took " << duration.count() << " seconds\n";
}
int main()
{
const uint8_t SIZE = 14; // how many bits the values will have
std::random_device rd;
std::mt19937 gen(rd());
std::uniform_int_distribution<int> dis(0, SIZE - 1);
const uint8_t MASK_SEPARATOR = dis(gen); // bit position where the masks will be split
const uint64_t SHIFT_MASK = (1ULL << MASK_SEPARATOR) - 1;
const uint64_t NO_SHIFT_MASK = ~SHIFT_MASK;
const uint64_t SHIFT_MASK_SHIFTED = SHIFT_MASK >> 1ULL;
const uint64_t BOTH_MASKS = NO_SHIFT_MASK | SHIFT_MASK_SHIFTED;
// Generate a vector with all values from 0 to 2^SIZE - 1
std::vector<uint64_t> values(1ULL << SIZE);
for (size_t i = 0; i < values.size(); ++i)
{
values[i] = i;
}
// Validate that all versions return the same result
validate(values, NO_SHIFT_MASK, SHIFT_MASK, BOTH_MASKS, SHIFT_MASK_SHIFTED);
// Benchmark each version
benchmark(values, NO_SHIFT_MASK, SHIFT_MASK, BOTH_MASKS, SHIFT_MASK_SHIFTED);
return 0;
}
The assembler is on this compiler explorer: https://godbolt.org/z/ac54M1YK9
For version2() and version3(), it generates this code:
version2(unsigned long, unsigned long, unsigned long, unsigned long):
lea rax, [rdi+rdi]
xor rax, rdi
and rcx, rax
and rax, rdx
shr rcx
or rcx, rax
xor eax, eax
and rcx, rsi
popcnt rax, rcx
and eax, 1
ret
version3(unsigned long, unsigned long, unsigned long, unsigned long):
and rdx, rsi
lea rax, [rdi+rdi]
and rsi, rcx
xor rax, rdi
add rdx, rsi
and rdx, rax
xor eax, eax
popcnt rax, rdx
and eax, 1
ret
Version 3 has fewer instructions and fewer dependency levels, but the benchmark shows that it is 40% slower than version 2.
I'm not an expert in assembly, but I think they should execute like this:
Version2:
Level 0: lea rax, [rdi+rdi]
Level 1: xor rax, rdi
Level 2: and rcx, rax
and rax, rdx
Level 3: shr rcx
Level 4: or rcx, rax
Level 5: xor eax, eax
and rcx, rsi
Level 6: popcnt rax, rcx
Level 7: and eax, 1
Level 8: ret
Version3:
Level 0: and rdx, rsi
lea rax, [rdi+rdi]
Level 1: and rsi, rcx
xor rax, rdi
Level 2: add rdx, rsi
Level 3: and rdx, rax
Level 4: xor eax, eax
Level 5: popcnt rax, rdx
Level 6: and eax, 1
Level 7: ret