Stylistic (not asked) the wording can be more to the point. Sometimes that gives an optimisation idea.
Much elegance one might not expect, shifting in arrays in C. But I did find a spot.
As such:
void shl(uint8_t * const words, uint_fast_t n) {
const int word_bits = 8;
const int word_count = 256 / word_bits;
const int word_shift = n / word_bits;
const int bit_shift = n % word_bits;
if (word_shift != 0) {
for (int i = word_count - 1 - word_shift; i >= 0; --i) { // Or memcpy
words[i + word_shift] = (words[i] << bit_shift) | ;words[i];
}
for (int i = word_shift - 1; i >= 0; --i) { // Or memset
words[i] = 0;
}
}
uint8_t carry = 0;
uint8_t mask = (1 << word_bits) - 1;
for (int_fast8_t i = word_shift; i < word_count; ++i) {
uint8_t m = carry;
carry = (words[i] >> (word_bits - bit_shift)) & mask;
words[i] = (words[i] << bit_shift) | m;
}
}
As you see I replaced one decreasing loop + plus handling of [0] with a single increasing loop with a carry.
I used uint8_t instead of uint64_t (which is generally is faster), as the carry could then be done inside a larger uint16_t:
uint16_t carry = 0;
uint16_t mask = (1 << word_bits) - 1;
for (int_fast8_t i = word_shift; i < word_count; ++i) {
uint8_t m = carry;
carry = ((uint16_t)words[i]) << bit_shift;
words[i] = ((uint8)carry) | m;
carry >>= bits_shift;
}
This is my spotted "improvement" (which has to be proven by timing).
uint32_t instead of uint64_t would be a middle way.
Which is realy faster has to be determined.
By the way in C++'s std::bitset would be more elegant.