Skip to main content
1 of 2
Joop Eggen
  • 4.7k
  • 15
  • 19

Stylistic (not asked) the wording can be more to the point. Sometimes that gives an optimisation idea.

Much elegance one might not expect, shifting in arrays in C. But I did find a spot.

As such:

void shl(uint8_t * const words, uint_fast_t n) {
    const int word_bits = 8;
    const int word_count = 256 / word_bits;
    const int word_shift = n / word_bits;
    const int bit_shift = n % word_bits;
    if (word_shift != 0) {
        for (int i = word_count - 1 - word_shift; i >= 0; --i) {
            words[i + word_shift] = (words[i] << bit_shift) | ;
        }
        for (int i = word_shift - 1; i >= 0; --i) { // Or memset
            words[i] = 0;
        }
    }
    uint8_t carry = 0;
    uint8_t mask = (1 << word_bits) - 1;
    for (int_fast8_t i = word_shift; i < word_count; ++i) {
        uint8_t m = carry;
        carry = (words[i] >> (word_bits - bit_shift)) & mask;
        words[i] = (words[i] << bit_shift) | m;
    }
}

As you see I replaced one decreasing loop + plus handling of [0] with a single increasing loop with a carry.

I used uint8_t instead of uint64_t (which is generally is faster), as the carry could then be done inside a larger uint16_t:

    uint16_t carry = 0;
    uint16_t mask = (1 << word_bits) - 1;
    for (int_fast8_t i = word_shift; i < word_count; ++i) {
        uint8_t m = carry;
        carry = ((uint16_t)words[i]) << bit_shift;
        words[i] = ((uint8)carry) | m;
        carry >>= bits_shift;
    }

This is my spotted "improvement" (which has to be proven by timing).

uint32_t instead of uint64_t would be a middle way.

Which is realy faster has to be determined.

By the way in C++'s std::bitset would be more elegant.

Joop Eggen
  • 4.7k
  • 15
  • 19