edited Feb 19, 2020 at 14:17

Stylistically (though not asked), the wording can be more to the point. Sometimes that gives an optimisation idea.

One should not expect much elegance when shifting arrays in C, but I did find a spot.

As such:

#include <stdint.h>

// Shift the 256-bit value in words[] (words[0] least significant) left by n bits.
void shl(uint8_t * const words, uint_fast8_t n) {
    const int word_bits = 8;
    const int word_count = 256 / word_bits;
    const int word_shift = n / word_bits;
    const int bit_shift = n % word_bits;
    if (word_shift != 0) {
        // Move whole words up, highest first. Or memmove (the ranges overlap, so not memcpy).
        for (int i = word_count - 1 - word_shift; i >= 0; --i) {
            words[i + word_shift] = words[i];
        }
        for (int i = word_shift - 1; i >= 0; --i) { // Or memset
            words[i] = 0;
        }
    }
    uint8_t carry = 0;
    uint8_t mask = (1 << word_bits) - 1;
    for (int_fast8_t i = word_shift; i < word_count; ++i) {
        uint8_t m = carry; // bits shifted out of the previous word
        carry = (words[i] >> (word_bits - bit_shift)) & mask;
        words[i] = (words[i] << bit_shift) | m;
    }
}

As you can see, I replaced the decreasing loop plus the separate handling of [0] with a single increasing loop that uses a carry.
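The whole-word part of the shift can also be written with the library calls hinted at in the code comments. The following is only a sketch: the helper name `shift_whole_words` and the constant `WORD_COUNT` are made up for illustration, and it assumes the 256 bits are stored as 32 bytes with `words[0]` least significant. Because source and destination overlap, it uses `memmove` rather than `memcpy`.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { WORD_COUNT = 32 }; /* assumption: 256 bits stored as 32 bytes */

/* Hypothetical helper: move every byte up by word_shift positions and
 * zero-fill the vacated low bytes. */
static void shift_whole_words(uint8_t *words, size_t word_shift) {
    if (word_shift == 0) {
        return;                         /* nothing to move */
    }
    if (word_shift >= WORD_COUNT) {
        memset(words, 0, WORD_COUNT);   /* everything is shifted out */
        return;
    }
    /* Overlapping ranges: memmove is defined here, memcpy is not. */
    memmove(words + word_shift, words, WORD_COUNT - word_shift);
    memset(words, 0, word_shift);
}
```

Whether this beats the hand-written loops for only 32 bytes would have to be measured as well.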

I used uint8_t words instead of uint64_t (which is generally faster), as the carry can then be handled inside a larger uint16_t:

    uint16_t carry = 0;
    for (int_fast8_t i = word_shift; i < word_count; ++i) {
        uint8_t m = carry;                          // bits carried out of the previous word
        carry = ((uint16_t)words[i]) << bit_shift;  // overflow lands in the high byte
        words[i] = ((uint8_t)carry) | m;
        carry >>= word_bits;                        // keep only the overflow for the next word
    }
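For reference, here is how the wide-carry idea might look as one complete function. This is a sketch under assumptions, not a tested replacement: the name `shl_wide_carry`, the `unsigned` parameter, and the guard for `n >= 256` are my own additions, and `words[0]` is taken to be the least significant byte.

```c
#include <stdint.h>
#include <string.h>

enum { WORD_BITS = 8, WORD_COUNT = 256 / WORD_BITS };

/* Hypothetical name. Shifts the 256-bit value in words[] left by n bits. */
static void shl_wide_carry(uint8_t *words, unsigned n) {
    if (n >= 256) {                       /* assumption: over-shift yields zero */
        memset(words, 0, WORD_COUNT);
        return;
    }
    const unsigned word_shift = n / WORD_BITS;
    const unsigned bit_shift = n % WORD_BITS;
    if (word_shift != 0) {
        memmove(words + word_shift, words, WORD_COUNT - word_shift);
        memset(words, 0, word_shift);
    }
    uint16_t carry = 0;
    for (unsigned i = word_shift; i < WORD_COUNT; ++i) {
        /* Shift inside 16 bits; the low bits are free for the old carry. */
        uint16_t wide = (uint16_t)(((uint16_t)words[i] << bit_shift) | carry);
        words[i] = (uint8_t)wide;             /* low byte is the new word */
        carry = (uint16_t)(wide >> WORD_BITS); /* high byte carries over */
    }
}
```

With `bit_shift == 0` the loop leaves every word unchanged, so no special case is needed.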

This is the "improvement" I spotted (which still has to be proven by timing).

uint32_t words instead of uint64_t (with the carry held in a uint64_t) would be a middle way.
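A sketch of that middle way, assuming the 256 bits are stored as eight `uint32_t` words with `words[0]` least significant. The function name `shl_u32` and the `n >= 256` guard are illustrative assumptions, not part of the original code.

```c
#include <stdint.h>
#include <string.h>

enum { WORD_BITS32 = 32, WORD_COUNT32 = 256 / WORD_BITS32 };

/* Hypothetical name. Same carry scheme, 32-bit words, 64-bit carry. */
static void shl_u32(uint32_t *words, unsigned n) {
    if (n >= 256) {                       /* assumption: over-shift yields zero */
        memset(words, 0, WORD_COUNT32 * sizeof words[0]);
        return;
    }
    const unsigned word_shift = n / WORD_BITS32;
    const unsigned bit_shift = n % WORD_BITS32;
    if (word_shift != 0) {
        memmove(words + word_shift, words,
                (WORD_COUNT32 - word_shift) * sizeof words[0]);
        memset(words, 0, word_shift * sizeof words[0]);
    }
    uint64_t carry = 0;
    for (unsigned i = word_shift; i < WORD_COUNT32; ++i) {
        uint64_t wide = ((uint64_t)words[i] << bit_shift) | carry;
        words[i] = (uint32_t)wide;        /* low 32 bits are the new word */
        carry = wide >> WORD_BITS32;      /* high 32 bits carry over */
    }
}
```

The loop now runs at most 8 times instead of 32, which is the reason this might be faster. Only a measurement can confirm it.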

Which is really faster has to be measured.

By the way, in C++ std::bitset would be more elegant.

answered Feb 19, 2020 at 13:01

Joop Eggen
