I am trying to determine the parity of the result of a complex calculation involving two uint64_t variables, value_a and value_b, along with two distinct masks: NO_SHIFT_MASK and SHIFT_MASK. These masks are non-overlapping and can be redefined or padded with zeros to optimize the calculation. I have written multiple different versions of the code, each compiled for modern x86 processors using instructions like popcnt. I assumed out-of-order, parallel execution, yet the version with less parallelism runs faster, which I find confusing. I am seeking insights into why the less parallelized version is faster and whether there are better optimization strategies for this problem.
- The format of value_a and value_b must be the same.
- The masks are two arbitrary, non-overlapping masks (i.e. (NO_SHIFT_MASK & SHIFT_MASK) == 0), where each mask has zero or more contiguous bits set to 1.
- It is possible to redefine the format by reordering the masks or adding zero padding to the variables, if that saves steps in the calculation. I suspect that adding some zero padding might save some bit shifts, but it is not clear how to do this.
- I wrote multiple different versions of this code, which need to be compiled for modern x86 processors to run quickly (in gcc, compile with -march=x86-64-v4), using instructions like popcnt. I assumed out-of-order, parallel execution. However, the version with less parallelization runs faster, and I don't understand why. Maybe my processor is too old. If you have a modern CPU, please tell me about your benchmark results.
I considered storing both the value and its gray code in the 64 bits, when the masks are short enough, to save the conversion step. However, in general the masks can cover all the 64 bits. I find it acceptable to reserve a small number of bits for zero padding, but storing the gray code would double the space and halve the capacity. I can make multiple code paths for small and large masks, but here I am interested in the worst cases, where the masks occupy up to 60 bits.
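For reference, the packed layout I considered would look something like this; it is only an illustrative sketch of the small-mask fast path, where the 30-bit field width and the helper names are made up for the example:
#include <cstdint>

uint64_t pack_with_gray(uint64_t value)
{
    // hypothetical field width: keep only the low 30 bits of the value
    uint64_t v = value & ((1ULL << 30) - 1);
    // gray code as used above: v ^ (v << 1); needs 31 bits for a 30-bit value
    uint64_t gray = v ^ (v << 1);
    // low 30 bits hold the value, bits 32..62 hold its precomputed gray code,
    // so the conversion step is paid once at store time instead of per query
    return v | (gray << 32);
}

uint64_t packed_value(uint64_t packed) { return packed & ((1ULL << 30) - 1); }
uint64_t packed_gray(uint64_t packed)  { return packed >> 32; }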
Steps:
- Calculate the gray code of value_a.
- Right shift the bits of the gray code that are masked by SHIFT_MASK.
- Perform a bitwise AND with value_b.
- Compute the parity of the resulting bits.
(This is pseudocode; I provide the full C++ code below.)
modified_a = value_a ^ (value_a << 1);
no_shift_modified_a = modified_a & NO_SHIFT_MASK;
shift_modified_a = (modified_a & SHIFT_MASK) >> 1;
1ULL & (popcount(no_shift_modified_a & value_b) + popcount(shift_modified_a & value_b));
Example:
uint64_t value_a = 0b0111'010111;
uint64_t value_b = 0b0111'101010;
uint64_t SHIFT_MASK = 0b0000'111111; // The rightmost bit of value_a matching SHIFT_MASK doesn't change the calculation
uint64_t NO_SHIFT_MASK = 0b1111'000000;
//modified_a
//0b0111'010111 value_a
//0b1110'101110 value_a << 1
//0b1001'111001 value_a ^ (value_a << 1)
modified_a = value_a ^ (value_a << 1);
//no_shift_modified_a
//0b1001'111001 modified_a
//0b1111'000000 NO_SHIFT_MASK
//0b1001'000000 modified_a & NO_SHIFT_MASK
no_shift_modified_a = modified_a & NO_SHIFT_MASK;
//shift_modified_a
//0b1001'111001 modified_a
//0b0000'111111 SHIFT_MASK
//0b0000'111001 modified_a & SHIFT_MASK;
//0b0000'011100 (modified_a & SHIFT_MASK) >> 1 // It doesn't matter that the rightmost bit is lost
shift_modified_a = (modified_a & SHIFT_MASK) >> 1;
//popcount b masked by no_shift_modified_a
//0b1001'000000 no_shift_modified_a
//0b0111'101010 value_b
//0b0001'000000 no_shift_modified_a & value_b
//1 popcount(no_shift_modified_a & value_b)
//popcount b masked by shift_modified_a
//0b0000'011100 shift_modified_a
//0b0111'101010 value_b
//0b0000'001000 shift_modified_a & value_b
//1 popcount(shift_modified_a & value_b)
//Find parity of total popcount
//1ULL & (1+1)==0 // 1ULL & (popcount(no_shift_modified_a & value_b) + popcount(shift_modified_a & value_b))
1 & (popcount(no_shift_modified_a & value_b) + popcount(shift_modified_a & value_b));
Note that the rightmost bit of value_a & SHIFT_MASK doesn't change the result.
I tried to do a single popcount of the entire result:
modified_a = value_a ^ (value_a << 1); // two clocks
no_shift_modified_a = modified_a & NO_SHIFT_MASK; // one clock
shift_modified_a = (modified_a & SHIFT_MASK) >> 1; // two clocks
both_shifts_modified_a = no_shift_modified_a | shift_modified_a; // one clock
1ULL & popcount(both_shifts_modified_a & value_b); // three clocks
This sums to 9 clocks, but on x86, independent variables are executed in parallel out of order, so it would go as:
modified_a = value_a ^ (value_a << 1); // two clocks
//subtotal: 2 clocks
//parallel out of order
no_shift_modified_a = modified_a & NO_SHIFT_MASK; // one clock
shift_modified_a = (modified_a & SHIFT_MASK) >> 1; // two clocks
// subtotal: 2 clocks
both_shifts_modified_a = no_shift_modified_a | shift_modified_a; // one clock
1ULL & popcount(both_shifts_modified_a & value_b); // three clocks
So in total it would be 8 clocks, since apart from the two masking operations that run in parallel, every instruction depends on the previous one, and everything ultimately depends on value_a.
However, I can break the dependency chain on value_a by shifting the masked value_b to the left instead of shifting modified_a to the right.
Since SHIFT_MASK is a constant, I can right-shift it ahead of time at no runtime cost, and then use the pre-shifted mask to move the masked bits of value_b back to the left.
// Padding with 0
uint64_t value_a = 0b0111'010111;
uint64_t value_b = 0b0111'101010;
uint64_t SHIFT_MASK_SHIFTED = 0b0000'011111; // The SHIFT_MASK has been right shifted
uint64_t NO_SHIFT_MASK = 0b1111'000000;
uint64_t BOTH_MASK = 0b1111'011111; // == NO_SHIFT_MASK | SHIFT_MASK_SHIFTED
modified_a = value_a ^ (value_a << 1); // two clocks
shift_b = (value_b & BOTH_MASK) + (value_b & SHIFT_MASK_SHIFTED); // two clocks (BOTH_MASK is a precalculated constant, and the two ANDs are independent, so they get calculated in parallel)
1ULL & popcount(modified_a & shift_b); // three clocks
Total: 5 clocks
This version should be faster than the previous one, because it has fewer dependencies, but benchmarks show it's slower, and I'm confused.
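A side note on why the addition in the padded version is valid: adding value_b & SHIFT_MASK_SHIFTED on top of value_b & BOTH_MASK doubles those bits, which is the same as shifting them left by one, and since SHIFT_MASK_SHIFTED is SHIFT_MASK shifted right, the doubled bits stay inside the original SHIFT_MASK region and never carry into NO_SHIFT_MASK. Below is a small standalone sanity check of that identity with the example masks from above (this is the difference between version3 and version4 in the code that follows), separate from the benchmark code:
#include <cstdint>
#include <cassert>

int main()
{
    const uint64_t NO_SHIFT_MASK = 0b1111'000000;
    const uint64_t SHIFT_MASK = 0b0000'111111;
    const uint64_t SHIFT_MASK_SHIFTED = SHIFT_MASK >> 1;
    const uint64_t BOTH_MASK = NO_SHIFT_MASK | SHIFT_MASK_SHIFTED;
    // The addition form and the explicit-shift form must build the same
    // shifted word for every 10-bit value_b.
    for (uint64_t value_b = 0; value_b < (1ULL << 10); ++value_b)
    {
        uint64_t via_add = (value_b & BOTH_MASK) + (value_b & SHIFT_MASK_SHIFTED);
        uint64_t via_shift = (value_b & NO_SHIFT_MASK) | ((value_b & SHIFT_MASK_SHIFTED) << 1);
        assert(via_add == via_shift);
    }
    return 0;
}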
Here is the C++ code:
#include <iostream>
#include <cstdint>
#include <bitset>
#include <vector>
#include <chrono>
#include <random>
#include <iomanip>
#include <stdexcept>
// Version 1: Original calculation
uint64_t version1(const uint64_t VALUE_A, const uint64_t VALUE_B,
const uint64_t NO_SHIFT_MASK, const uint64_t SHIFT_MASK)
{
uint64_t modified_a = VALUE_A ^ (VALUE_A << 1);
uint64_t no_shift_modified_a = modified_a & NO_SHIFT_MASK;
uint64_t shift_modified_a = (modified_a & SHIFT_MASK) >> 1;
return 1ULL & (__builtin_popcountll(no_shift_modified_a & VALUE_B) + __builtin_popcountll(shift_modified_a & VALUE_B));
}
// Version 2: Single popcount of the combined result
// This is the fastest despite unavoidable dependencies from VALUE_A.
// (╯°□°)╯︵ ┻━┻ Why?!
uint64_t version2(const uint64_t VALUE_A, const uint64_t VALUE_B,
const uint64_t NO_SHIFT_MASK, const uint64_t SHIFT_MASK)
{
uint64_t modified_a = VALUE_A ^ (VALUE_A << 1);
uint64_t both_shifts_modified_a = (modified_a & NO_SHIFT_MASK) | ((modified_a & SHIFT_MASK) >> 1);
return __builtin_parityll(both_shifts_modified_a & VALUE_B);
}
// Version 3: Shift the masked VALUE_B (via addition) instead of modified_a
// Despite both input lines being computable in parallel and out of order, this is slower.
// (╯°□°)╯︵ ┻━┻ Why?!
uint64_t version3(const uint64_t VALUE_A, const uint64_t VALUE_B,
const uint64_t BOTH_MASK, const uint64_t SHIFT_MASK_SHIFTED)
{
uint64_t modified_a = VALUE_A ^ (VALUE_A << 1);
uint64_t shifted_b = (VALUE_B & BOTH_MASK) + (VALUE_B & SHIFT_MASK_SHIFTED);
// return 1ULL & popcount(modified_a & shifted_b);
return __builtin_parityll(modified_a & shifted_b); // GCC/Clang builtin
// TODO: check if __builtin_parity is faster for smaller bits
}
// This version is slowest, despite having fewer dependencies and shorter assembly
// https://godbolt.org/z/ac54M1YK9
// (╯°□°)╯︵ ┻━┻ Why?!
uint64_t version4(const uint64_t VALUE_A, const uint64_t VALUE_B,
const uint64_t NO_SHIFT_MASK, const uint64_t SHIFT_MASK_SHIFTED)
{
uint64_t modified_a = VALUE_A ^ (VALUE_A << 1);
uint64_t shifted_b = (VALUE_B & NO_SHIFT_MASK) | ((VALUE_B & SHIFT_MASK_SHIFTED) << 1);
// return 1ULL & popcount(modified_a & shifted_b);
return __builtin_parityll(modified_a & shifted_b); // GCC/Clang builtin
// TODO: check if __builtin_parity is faster for smaller bits
}
void validate(const std::vector<uint64_t> &VALUES,
const uint64_t NO_SHIFT_MASK, const uint64_t SHIFT_MASK,
const uint64_t BOTH_MASKS, const uint64_t SHIFT_MASK_SHIFTED)
{
std::cout << "Initializing validation..." << std::endl;
auto start = std::chrono::high_resolution_clock::now();
for (const auto &value_a : VALUES)
{
for (const auto &value_b : VALUES)
{
uint64_t result_v1 = version1(value_a, value_b, NO_SHIFT_MASK, SHIFT_MASK);
uint64_t result_v2 = version2(value_a, value_b, NO_SHIFT_MASK, SHIFT_MASK);
uint64_t result_v3 = version3(value_a, value_b, BOTH_MASKS, SHIFT_MASK_SHIFTED);
uint64_t result_v4 = version4(value_a, value_b, NO_SHIFT_MASK, SHIFT_MASK_SHIFTED);
if (__builtin_expect(!!(result_v2 != result_v1 ||
result_v3 != result_v1 ||
result_v4 != result_v1),
false))
{
std::cout << "Mismatch between versions." << std::endl
<< "Value A: " << value_a << " (0b" << std::bitset<64>(value_a) << ")" << std::endl
<< "Value B: " << value_b << " (0b" << std::bitset<64>(value_b) << ")" << std::endl;
std::cout << "Results: " << std::endl
<< "Version 1: " << result_v1 << std::endl
<< "Version 2: " << result_v2 << std::endl
<< "Version 3: " << result_v3 << std::endl
<< "Version 4: " << result_v4 << std::endl;
throw std::runtime_error("Results from different versions do not match");
}
}
static size_t iteration_count = 0;
static size_t total_iterations = VALUES.size();
++iteration_count;
if ((iteration_count & ((1 << 10) - 1)) == 0)
{ // Update every 2^10 iterations to avoid excessive output
double percent_complete = (static_cast<double>(iteration_count) / total_iterations) * 100;
std::cout << "\r" << "Progress: " << std::fixed << std::setprecision(2) << percent_complete << "% completed" << std::flush;
}
}
std::cout << std::endl
<< "Validation complete." << std::endl;
}
void benchmark(const std::vector<uint64_t> &VALUES,
const uint64_t NO_SHIFT_MASK, const uint64_t SHIFT_MASK,
const uint64_t BOTH_MASKS, const uint64_t SHIFT_MASK_SHIFTED)
{
std::cout << "Starting benchmark." << std::endl;
const uint16_t REPETITIONS = 10u;
auto start = std::chrono::high_resolution_clock::now();
for (uint16_t repetition = 0; repetition < REPETITIONS; ++repetition)
{
for (const auto &value_a : VALUES)
{
for (const auto &value_b : VALUES)
{
volatile uint64_t result = version1(value_a, value_b, NO_SHIFT_MASK, SHIFT_MASK);
}
}
}
auto end = std::chrono::high_resolution_clock::now();
std::chrono::duration<double> duration = end - start;
std::cout << "Version 1 took " << duration.count() << " seconds\n";
start = std::chrono::high_resolution_clock::now();
for (uint16_t repetition = 0; repetition < REPETITIONS; ++repetition)
{
for (const auto &value_a : VALUES)
{
for (const auto &value_b : VALUES)
{
volatile uint64_t result = version2(value_a, value_b, NO_SHIFT_MASK, SHIFT_MASK);
}
}
}
end = std::chrono::high_resolution_clock::now();
duration = end - start;
std::cout << "Version 2 took " << duration.count() << " seconds\n";
start = std::chrono::high_resolution_clock::now();
for (uint16_t repetition = 0; repetition < REPETITIONS; ++repetition)
{
for (const auto &value_a : VALUES)
{
for (const auto &value_b : VALUES)
{
volatile uint64_t result = version3(value_a, value_b, BOTH_MASKS, SHIFT_MASK_SHIFTED);
}
}
}
end = std::chrono::high_resolution_clock::now();
duration = end - start;
std::cout << "Version 3 took " << duration.count() << " seconds\n";
start = std::chrono::high_resolution_clock::now();
for (uint16_t repetition = 0; repetition < REPETITIONS; ++repetition)
{
for (const auto &value_a : VALUES)
{
for (const auto &value_b : VALUES)
{
volatile uint64_t result = version4(value_a, value_b, NO_SHIFT_MASK, SHIFT_MASK_SHIFTED);
}
}
}
end = std::chrono::high_resolution_clock::now();
duration = end - start;
std::cout << "Version 4 took " << duration.count() << " seconds\n";
}
int main()
{
const uint8_t SIZE = 14; // how many bits the values will have
std::random_device rd;
std::mt19937 gen(rd());
std::uniform_int_distribution<int> dis(0, SIZE - 1);
const uint8_t MASK_SEPARATOR = dis(gen); // bit position where the masks will be split
const uint64_t SHIFT_MASK = (1ULL << MASK_SEPARATOR) - 1;
const uint64_t NO_SHIFT_MASK = ~SHIFT_MASK;
const uint64_t SHIFT_MASK_SHIFTED = SHIFT_MASK >> 1ULL;
const uint64_t BOTH_MASKS = NO_SHIFT_MASK | SHIFT_MASK_SHIFTED;
// Generate a vector with all values from 0 to 2^SIZE - 1
std::vector<uint64_t> values(1ULL << SIZE);
for (size_t i = 0; i < values.size(); ++i)
{
values[i] = i;
}
// Validate that all versions return the same result
validate(values, NO_SHIFT_MASK, SHIFT_MASK, BOTH_MASKS, SHIFT_MASK_SHIFTED);
// Benchmark each version
benchmark(values, NO_SHIFT_MASK, SHIFT_MASK, BOTH_MASKS, SHIFT_MASK_SHIFTED);
return 0;
}
The assembler is on this compiler explorer: https://godbolt.org/z/ac54M1YK9
For version2() and version3(), it generates this code:
version2(unsigned long, unsigned long, unsigned long, unsigned long):
lea rax, [rdi+rdi]
xor rax, rdi
and rcx, rax
and rax, rdx
shr rcx
or rcx, rax
xor eax, eax
and rcx, rsi
popcnt rax, rcx
and eax, 1
ret
version3(unsigned long, unsigned long, unsigned long, unsigned long):
and rdx, rsi
lea rax, [rdi+rdi]
and rsi, rcx
xor rax, rdi
add rdx, rsi
and rdx, rax
xor eax, eax
popcnt rax, rdx
and eax, 1
ret
Version 3 has fewer instructions and fewer dependency levels, but the benchmark shows that it is 40% slower than version 2.
I'm not an expert in assembly, but I think they should execute like this:
Version2:
Level 0: lea rax, [rdi+rdi]
Level 1: xor rax, rdi
Level 2: and rcx, rax
and rax, rdx
Level 3: shr rcx
Level 4: or rcx, rax
Level 5: xor eax, eax
and rcx, rsi
Level 6: popcnt rax, rcx
Level 7: and eax, 1
Level 8: ret
Version3:
Level 0: and rdx, rsi
lea rax, [rdi+rdi]
Level 1: and rsi, rcx
xor rax, rdi
Level 2: add rdx, rsi
Level 3: and rdx, rax
Level 4: xor eax, eax
Level 5: popcnt rax, rdx
Level 6: and eax, 1
Level 7: ret