
The bottleneck of my code is the repeated calling of pow(base, exponent, modulus) for very large integers (numpy does not support such large integers, which are about 100 to 256 bits). However, my exponent and modulus are always the same. Could I somehow utilize this to speed the computation up with a custom function? I tried to define a function as seen below (the function below is for a general modulus and exponent).

However, even if I hardcode every operation without the while-loop and if-statements for a fixed exponent and modulus, it is slower than pow.

def modular_pow(base, exponent, modulus):
    # right-to-left binary (square-and-multiply) exponentiation
    result = 1
    base = base % modulus
    while exponent > 0:
        if exponent % 2 == 1:               # low bit set: multiply it in
            result = (result * base) % modulus
        exponent = exponent >> 1
        base = (base * base) % modulus      # square for the next bit
    return result
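
(For illustration, by "hardcoding" I mean unrolling the loop into a fixed square-and-multiply chain; e.g. for a hypothetical exponent of 13 = 0b1101:)

def modular_pow_13(base, modulus):
    b = base % modulus
    r = b                   # leading 1 bit
    r = r * r % modulus     # next bit is 1: square...
    r = r * b % modulus     # ...and multiply
    r = r * r % modulus     # next bit is 0: square only
    r = r * r % modulus     # last bit is 1: square...
    return r * b % modulus  # ...and multiply

assert modular_pow_13(4, 497) == pow(4, 13, 497)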

Another option would be to somehow "vectorize" it. I have to compute pow for about 100000000 different base values. While these values often change between runs of my script (so a lookup table will not be useful), I will know them all the moment I hit run (I could compute them all at once).

Any ideas? I got some speedup by using mpz datatypes from gmpy2, but it is still too slow.

  • The code isn't indented correctly. Commented May 23, 2021 at 21:16
  • I think by vectorize you mean memoization. Commented May 23, 2021 at 21:59
  • If by memoization you mean look-up table, this won't work, the base values are unique and change too often. Commented May 24, 2021 at 6:31
  • Try gmpy2 - Python's arbitrary-precision int implementation isn't designed for speed. Commented May 24, 2021 at 6:39
  • Can you please give us some information about the data that you are using? Is base an integer? Can it be stored in a standard 32 bit integer? And what about exponent and modulus? Thank you! Commented May 30, 2021 at 17:32

5 Answers


Good news, bad news. Good news is that when the modulus m is fixed, there are ways to speed computing a*b % m. Search for "Barrett reduction" and "Montgomery reduction". They work, in different ways, by precomputing constants related to m such that % m can be computed via multiplication and shifting, without needing division.

The bad news: to find the remainder, both ways require (in addition to cheaper operations) two multiplications. So they don't pay overall unless multiplication is much cheaper than division.

For that reason, they're generally slower unless the modulus is "truly" large - your "about 100 to 256 bits" is still on the small side by modern standards, just a few times wider than native 64-bit machine ints. Things like speedy FFT-based multiplication require much larger integers before they pay off.
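
To make the multiply-and-shift idea concrete, here's a minimal Barrett sketch in Python (not production code; the shift s is picked so that, for the product of two already-reduced values, the approximate quotient is off by at most 2):

def barrett_setup(m):
    s = 2 * m.bit_length()
    return s, (1 << s) // m        # both precomputed once per fixed modulus

def barrett_mod(x, m, s, r):
    # valid for 0 <= x < m*m, i.e. for the product of two reduced values
    q = (x * r) >> s               # approximate quotient: multiply + shift
    u = x - q * m
    while u >= m:                  # at most a couple of corrections
        u -= m
    return u

m = 12345678901234567891           # an arbitrary modulus, just for show
s, r = barrett_setup(m)
x = (m - 1) * (m - 2)
assert barrett_mod(x, m, s, r) == x % m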

CPython's built-in modular pow is already using a "binary" scheme, along the lines of what you coded in Python, but fancier (if the exponent is "large enough", the built-in pow views it as being in base 32, consuming 5 exponent bits per loop iteration).
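
Here's a sketch of that windowed ("2**w-ary") scheme with w as a parameter, just to show the idea (CPython hardcodes w=5 and a C-level table):

def window_pow(base, exponent, modulus, w=5):
    # table[d] = base**d % modulus for every possible w-bit digit d
    table = [1] * (1 << w)
    table[1] = base % modulus
    for i in range(2, 1 << w):
        table[i] = table[i - 1] * table[1] % modulus
    # split the exponent into w-bit digits, most significant first
    digits = []
    while exponent:
        digits.append(exponent & ((1 << w) - 1))
        exponent >>= w
    result = 1 % modulus
    for d in reversed(digits):
        for _ in range(w):                       # w squarings per digit
            result = result * result % modulus
        if d:
            result = result * table[d] % modulus # one multiply per digit
    return result

assert window_pow(345, 1198, 2347) == pow(345, 1198, 2347)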

In a quick stab implementing Montgomery reduction in Python, and replacing the modular multiplies in your code by the Montgomery spellings, modular_pow() didn't get faster than the built-in before the modulus grew to tens of thousands of bits. For inputs around 256 bits, it was about 3x slower.

That's a mixed bag: the Python code didn't exploit the "base 32" tricks, which can give substantial benefits. But for large enough inputs, CPython uses faster-than-naïve Karatsuba multiplication, which the division-free Montgomery spelling can benefit from (CPython's int division has no speedup tricks regardless of input sizes, and CPython's built-in modular pow always uses division to find remainders).
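
For reference, a minimal textbook sketch of the Montgomery "spelling" (not the quick stab I timed; odd modulus only, and pow(m, -1, R) needs Python 3.8+):

def montgomery_setup(m):
    # R = 2**s > m; k satisfies m*k ≡ -1 (mod R)
    s = m.bit_length()
    R = 1 << s
    k = (-pow(m, -1, R)) % R
    return s, k, (R * R) % m       # shift, k, and R**2 % m for conversions

def redc(T, m, s, k):
    # returns T/R % m for 0 <= T < m * 2**s, using only mul/shift/mask
    mask = (1 << s) - 1
    t = ((T & mask) * k) & mask
    u = (T + t * m) >> s           # numerator is exactly divisible by R
    return u - m if u >= m else u

m = 2347                                     # the small prime used below
s, k, r2 = montgomery_setup(m)
to = lambda a: redc(a * r2, m, s, k)         # a -> a*R % m (enter the domain)
prod = redc(to(123) * to(456), m, s, k)      # (aR)(bR) -> abR (stay in domain)
assert redc(prod, m, s, k) == 123 * 456 % m  # leave the domain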

So, short course: there's nothing obvious I know of that you can do in CPython to speed a single instance of pow(a, b, c). It's possible some C-coded crypto library out there has something suitable, but not that I know of.

But the other good news is that your problem is "embarrassingly parallel". If you have N processors, you can give each of them 100000000/N of your inputs, and they can all run full speed in parallel. That would give a speedup of about a factor of N.

But the bad news is that your integers really aren't "large" (they're small enough that I bet you can still compute thousands of modular pows per second with the built-in pow), and interprocess communication costs may wipe out the benefits of doing N computations in parallel. That all depends on exactly how you get your inputs and what you want to do with the results.
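
If you want to experiment with that anyway, here's a minimal sketch (EXP and MOD are hypothetical stand-ins for your fixed exponent and modulus; chunking keeps interprocess traffic to one round trip per worker):

import multiprocessing as mp

EXP = 0x10001                 # hypothetical fixed exponent
MOD = (1 << 255) - 19         # hypothetical fixed modulus

def pow_chunk(chunk):         # must be top-level so it can be pickled
    return [pow(b, EXP, MOD) for b in chunk]

def pow_all(bases, nproc=None):
    nproc = nproc or mp.cpu_count()
    size = (len(bases) + nproc - 1) // nproc
    chunks = [bases[i:i + size] for i in range(0, len(bases), size)]
    with mp.Pool(nproc) as pool:
        return [r for part in pool.map(pow_chunk, chunks) for r in part]

if __name__ == '__main__':
    print(pow_all(list(range(2, 12))))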

FOLLOW UP

The "Handbook of Applied Cryptography" (HAC), chapter 14, essentially spells out the state of the art for gonzo modular exponentiation algorithms.

Looking into the code, GMP already implements every trick they have. This includes the things I mentioned (Montgomery reduction, and using a power-of-2 base higher than 2 to chew up more exponent bits per loop iteration). And others I didn't mention (e.g., GMP has a special internal routine for squaring, which saves cycles over a general product of possibly unequal integers). In all, it's a small mountain of implementation code.

I expect that's why you're not getting more answers: GMP is already doing, at worst, close to the best anyone has ever figured out how to do. The speedup for you isn't truly dramatic because, as already noted, the integers you're using are actually on the small side.

So if you need to pursue this, using GMP is likely the fastest you're going to get. As noted, multiprocessing is an obvious way to get a theoretical N-fold speedup with N processors, but as also noted you haven't said anything about the context (where these inputs come from or what you need to do with the outputs). So it's impossible to guess whether that might pay off for you. The more interprocess communication you need, the more that damages potential multiprocessing speedups.

Note: what you're doing is exactly what, e.g., RSA public-key cryptosystems do, although they generally use larger integers. That is, your "base" is their "message", and a public (or private) RSA key consists of a fixed exponent and fixed modulus. Only the base (message or encrypted bits) varies across en/de-cryption instances. For a given key, the exponent and modulus are always the same.
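
A toy illustration of that correspondence (tiny primes purely for shape; real RSA moduli are far larger):

p, q = 2347, 2351                  # toy "secret" primes
n, e = p * q, 65537                # fixed public modulus and exponent
d = pow(e, -1, (p - 1) * (q - 1))  # fixed private exponent
message = 1234                     # only this varies per instance
assert pow(pow(message, e, n), d, n) == message  # encrypt, then decrypt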

Many world-class mathematicians have studied the problem, and world-class hackers coded the algorithms for peak speed. That's why you should give up hope that there's a faster way HAC just forgot to mention ;-)

Speculative

Drawing the connection to RSA reminded me: RSA decryption, in practice, does not proceed in "the obvious" way. Instead, the holder of the private key knows the prime factorization of the key's modulus (in RSA, the modulus is the product of two distinct - but kept secret - large primes), and that can be used to significantly speed exponentiation with respect to that modulus.

So (I can't guess) if the way you get your modulus instances lets you compute their prime factorizations efficiently, that can be used to get significant speedups when they're composite.

Not so much for a prime modulus, though. About the only potentially valuable trick then is that, for p prime and a not divisible by p,

pow(a, b, p) == pow(a, b % (p-1), p)

That can save unbounded time if b can be much larger than p. It works because, by Fermat's little theorem,

pow(a, p-1, p) == 1

for p prime and a not divisible by p. For example,

>>> p
2347
>>> assert all(p % i != 0 for i in range(2, p))  # that is, p is prime
>>> pow(345, 1000000000000000000000000000000, p)
301
>>> 1000000000000000000000000000000 % (p-1)
1198
>>> pow(345, 1198, p) # same thing, but much faster
301

For a composite modulus, much the same is done for each of its prime power factors, and then the results are pasted together via the Chinese remainder theorem.

If you think your problem can be worked to exploit that, search for "modular exponentiation Chinese remainder" to find a number of good expositions.
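
Here's a minimal sketch of the two-prime case (the values of b, p, q are hypothetical; requires gcd(a, p*q) == 1, and Python 3.8+ for the modular inverse):

def crt_pow(a, b, p, q):
    rp = pow(a, b % (p - 1), p)   # Fermat reduces the exponent mod p-1
    rq = pow(a, b % (q - 1), q)   # ... and mod q-1
    # recombine: the unique x mod p*q with x ≡ rp (mod p), x ≡ rq (mod q)
    return rq + q * ((pow(q, -1, p) * (rp - rq)) % p)

p, q = 2347, 2351                 # both prime; chosen just for the example
b = 1000000000000000000000000000000
assert crt_pow(345, b, p, q) == pow(345, b, p * q)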




I decided to implement and compare several methods in Python (without C/C++). To tell in advance: with my code below I reached a 25x speedup compared to using Python's pow(a, b, c). Some methods are parallelized, some are not.

Note: See updated Part 2 of my post below with C++ code.

See the code at the very bottom of this post. The timings below are for my i7 laptop with 4 cores at 1.2 GHz (8 hardware threads).

  1. Python - Pure Python, single-process. This method does a list comprehension with the built-in Python pow(a, b, c) function.

  2. Python_Parallel - Multi-process pure Python. Same as method 1) but parallel. This and further methods use the multiprocessing module to occupy all CPU cores.

  3. Gmp - GMPY2-based single-process solution. This uses the powmod() function and a list comprehension.

  4. Gmp_Parallel - Parallel GMPY2-based. Uses method 3) but on all cores.

  5. Redc_Div_Parallel - Binary exponentiation (also known as exponentiation by squaring) with regular modulus computation through the % mod operation. This and further methods also use Numpy arrays with Python objects (object dtype) and implement the Windowed Method to boost exponentiation.

  6. Redc_Barrett_Parallel - Same as 5) but uses Barrett Reduction to replace division by multiplications.

  7. Redc_Montgomery_Parallel - Same as 5) but uses Montgomery Reduction to replace division.


Timings:

Python                    : time  78.74 sec, boost  1.00
Python_Parallel           : time  19.09 sec, boost  4.12, Correct
Gmp                       : time  27.76 sec, boost  2.84, Correct
Gmp_Parallel              : time   7.57 sec, boost 10.40, Correct
Redc_Div_Parallel         : time  17.38 sec, boost  4.53, Correct, w = 4 (324 redcs)
Redc_Barrett_Parallel     : time  33.15 sec, boost  2.38, Correct, w = 4 (324 redcs)
Redc_Montgomery_Parallel  : time  42.24 sec, boost  1.86, Correct, w = 4 (324 redcs)

All boosts above are measured relative to the single-threaded Python baseline. The correctness of all computed results is checked against the Python output.

One can see that Gmp_Parallel gives the biggest boost, and it is also very simple - just two lines of code. It is probably the best and simplest of the Python-only solutions.

Of course, if you convert the Redc_... methods to C/C++, they should become much faster, possibly faster than the GMP C function mpz_powm (which is the basis of GMPY2's powmod()). You can use the fixed-width _ExtInt type available in Clang: just do using u512 = unsigned _ExtInt(512); and you get a fixed-size 512-bit integer type for free; you can even use an odd number of bits like _ExtInt(12345). Boost also has uint128_t/uint256_t/uint512_t available in C++.

I imagine the only reason the Barrett/Montgomery solutions are slower is that they are Python-object based, so most of the time is spent organizing Python interactions instead of doing the simple mul/shift operations that replace division. Hence, if you use C++ you'll get a much bigger speed improvement.

In my code below you can tweak the variables num_bits and pow_bits to set the size of the numbers and of the powers (in bits); cnt_nums sets the size of the array (the number of bases to be raised to the power).

Try it online!

def PowModPython(bs, e, m):
    return [pow(b, e, m) for b in bs]

def PowModGmp(bs, e, m):
    import gmpy2, numpy as np
    return [gmpy2.powmod(b, e, m) for b in bs]
    #return list(np.vectorize(lambda x: gmpy2.powmod(x, e, m))(np.asarray(bs, dtype = object)))

def PowModParallel(func, bs, e, m, *, processes = None, extra = ()):
    # split bases into blocks (roughly 8 per process) and concatenate results
    import multiprocessing as mp, functools
    if processes is None:
        processes = mp.cpu_count()
    with mp.Pool(processes) as pool:
        try:
            block = (len(bs) + processes * 8 - 1) // (processes * 8)
            return functools.reduce(
                lambda a, b: a + b,
                pool.starmap(func,
                    [(bs[i : i + block], e, m) + extra for i in range(0, len(bs), block)]),
                []
            )
        finally:
            pool.close()
            pool.join()

def PowModRedc(bs, e, m, bits, redc_type = None, w = None):
    import gmpy2, numpy as np, math

    def Int(x):
        return gmpy2.mpz(x)
    def Mask(x, s):
        #return x & ((1 << s) - 1)
        return np.vectorize(lambda x: gmpy2.t_mod_2exp(x, s))(x)

    e = Int(e)
    m = Int(m)
    
    def CreateMod():
        if redc_type == 'div':
            def Mod(x):
                return x % m
            def To(x):
                return x
            def From(x):
                return x
            def Adjust(x):
                return x
        else:
            if redc_type is None and m & 1 != 0 or redc_type == 'montgomery':
                # https://www.nayuki.io/page/montgomery-reduction-algorithm

                assert m & 1 != 0, 'Montgomery can not handle even modulus!'

                def MontgomeryPrec(n, s):
                    n, s = Int(n), Int(s)
                    r = 1 << s
                    r2 = (r * r) % n
                    ri = pow(r, -1, n)
                    k = (r * ri - 1) // n
                    return n, s, Int(k), Int(r2), Int(r - 1)
                
                Mg = MontgomeryPrec(m, bits + 2)
                
                def Mod(x):
                    n, s, k, _, mask = Mg
                    t = Mask(Mask(x, s) * k, s)
                    u = (x + t * n) >> s
                    return u
                    
                def To(x):
                    _, _, _, r2, _ = Mg
                    return Mod(x * r2)

                def From(x):
                    return Mod(x)
            elif redc_type == 'barrett':
                # https://www.nayuki.io/page/barrett-reduction-algorithm

                assert m > 0 and m & (m - 1) != 0, 'Modulus is a power of two! Not possible for Barrett.'

                def BarrettPrec(n, s):
                    n, s = Int(n), Int(s)
                    return n, s, (1 << s) // n

                Br = BarrettPrec(m, bits * 2 + 2)

                def Mod(x):
                    n, s, r = Br
                    q = (x * r) >> s
                    u = x - q * n
                    return u

                def To(x):
                    return x
                
                def From(x):
                    return x
            else:
                assert False, f'Unknown reduction type {redc_type}!'

            def Adjust(x):
                gm = x >= m
                x[gm] -= m
                return x

        return Mod, To, From, Adjust
    
    def Arr(x):
        return np.array([Int(v) for v in x], dtype = object)
    
    Mod, To, From, Adjust = CreateMod()
    bs = To(Mod(To(Arr(bs))))

    if w is None:
        w = OptW(e)[1]
    
    def PreComputePowers(d, w, bs):
        d = Int(d)
        digits = []
        mask = 2 ** w - 1
        while d != 0:
            digits.append(d & mask)
            d >>= w
        digits = digits[::-1]
        powers = [None, np.copy(bs)]
        for i in range(2, 2 ** w):
            powers.append(Mod(powers[-1] * bs))
        return digits, powers
    
    digits, powers = PreComputePowers(e, w, bs)
    res = powers[digits[0]]
    
    for digit in digits[1:]:
        for i in range(w):
            res = Mod(res * res)
        if digit != 0:
            res = Mod(res * powers[digit])
    
    return list(Adjust(From(res)))

def OptW(e):
    n = max(1, e.bit_length())
    minv = None
    for w in range(1, 11):
        # Number of multiplication-reductions
        c = (2 ** w - 2) + ((n - 1) // w) * (w + 1)
        if minv is None or c < minv[0]:
            minv = (c, w)
    return minv

def Test():
    import time, random, numpy as np
    random.seed(0)
    
    num_bits = 496
    pow_bits = 252
    cnt_nums = 1 << 16
    cnt_pows = 1 << 0
    
    def Rand(bits):
        return random.randrange(1 << bits)
    
    bs = [Rand(num_bits) for i in range(cnt_nums)]
    em = [(Rand(pow_bits), Rand(num_bits) | 1) for i in range(cnt_pows)]

    opt_w = OptW(em[0][0])
    ref_tim, res = None, None
    for fname, f, w in [
        ('Python', PowModPython, None),
        ('Python_Parallel', lambda bs, e, m: PowModParallel(PowModPython, bs, e, m), None),
        ('Gmp', PowModGmp, None),
        ('Gmp_Parallel', lambda bs, e, m: PowModParallel(PowModGmp, bs, e, m), None),
        ('Redc_Div_Parallel', lambda bs, e, m: PowModParallel(PowModRedc, bs, e, m, extra = (num_bits, 'div')), opt_w),
        ('Redc_Barrett_Parallel', lambda bs, e, m: PowModParallel(PowModRedc, bs, e, m, extra = (num_bits, 'barrett')), opt_w),
        ('Redc_Montgomery_Parallel', lambda bs, e, m: PowModParallel(PowModRedc, bs, e, m, extra = (num_bits, 'montgomery')), opt_w),
    ]:
        tim = time.time()
        for e, m in em:
            resc = f(bs, e, m)
        tim = time.time() - tim
        if ref_tim is None:
            ref_tim = tim
        print(f'{fname:<26}: time {tim:>6.02f} sec, boost {ref_tim / tim:>5.02f}', end = '', flush = True)
        if res is not None:
            assert np.all(np.equal(res, resc))
            print(', Correct', end = '')
        else:
            res = resc
        print(f', w = {w[1]} ({w[0]} redcs)' if w is not None else '')

if __name__ == '__main__':
    Test()

Update. I also implemented a C++ version, together with Cython to adapt it to run from Python. See the code linked at the bottom. I also used a Sliding Window approach instead of the Regular Window used above.

The C++ parallel version (Montgomery-based) appeared to be 2x faster than the parallel GMPY2 version implemented above (and 19x faster than the single-threaded Python version). The Barrett version appeared 1.1x faster than parallel GMP. Precise timings are below; see the Cython_... variants, they are the fastest:

reg_win = 4 (328 redcs), slide_win = 5 (320 redcs)

Python                           : time  78.90 sec, boost  1.00
Python_Parallel                  : time  20.22 sec, boost  3.90, Correct
Gmp                              : time  27.81 sec, boost  2.84, Correct
Gmp_Parallel                     : time   8.19 sec, boost  9.63, Correct
Redc_Div_Parallel                : time  17.42 sec, boost  4.53, Correct
Redc_Barrett_Parallel            : time  33.08 sec, boost  2.39, Correct
Redc_Montgomery_Parallel         : time  43.05 sec, boost  1.83, Correct
Cython_Redc_Barrett_Parallel     : time   7.71 sec, boost 10.23, Correct
Cython_Redc_Montgomery_Parallel  : time   4.15 sec, boost 19.00, Correct

The C++ version uses features of the C++20 standard. It also needs the Clang compiler, because it uses the _ExtInt(N) type available only in Clang. For long arithmetic it uses Boost. For parallelization it uses the OpenMP technology built into the compiler.

The Cython package should be installed in Python, because Cython is responsible for wrapping and compiling the C++. Install it through python -m pip install cython.

On Windows you can install Clang from the releases page: download and install LLVM-13.0.0-win64.exe. Boost can be installed by Chocolatey (a great package manager for Windows) or downloaded from the Boost website. To install it with Chocolatey, just type choco install boost-msvc-14.2 in the console. On Windows you should also install Visual Studio (2019 or 2022; the Community edition is enough), because Clang depends on its libraries on Windows; only on Linux/MacOS is Clang independent.

Under Linux, Clang and Boost can be installed simply through sudo apt install clang libboost-all-dev.

In my code below you can see clang_dir = 'c:/bin/llvm/bin/', boost_dir = None in the cython_import() function. You need to set these two options only under Windows; for Linux, set them to clang_dir = '', boost_dir = ''. Chocolatey installs Boost into the c:/local/boost_1_74_0/ directory by default.

Sources are here. The source code with C++ appeared to be too large for the StackOverflow post size limit (30,000 chars), hence I uploaded it to PasteBin, from where it can be downloaded (a Tio.run mirror is here too, but it is not runnable).


Update 2. I implemented one more version, also in C++/Cython. This one is purely based on GMP (the C library), and there is no Python interaction while computing.

This solution turned out to be both quite simple and the fastest of all. As a reminder, my fastest solution above was the C++ one based on Montgomery Reduction and the Sliding Window variant of binary exponentiation, which was 19x faster than the pure single-threaded Python variant.

The current C++/GMP-based solution is 25.4x faster than single-threaded Python. Of course, it is parallelized using OpenMP. Its code is quite simple; due to the StackOverflow post size limit I provide only this single C++ function below, with the full code hosted at the external PasteBin web service.

Precise timings of all methods (new method is called Cython_Gmp_Parallel):

reg_win = 4 (330 redcs), slide_win = 5 (321 redcs)

Python                           : time 162.81 sec, boost  1.00
Python_Parallel                  : time  37.92 sec, boost  4.29, Correct
Gmp                              : time  56.57 sec, boost  2.88, Correct
Gmp_Parallel                     : time  15.13 sec, boost 10.76, Correct
Redc_Div_Parallel                : time  34.81 sec, boost  4.68, Correct
Redc_Barrett_Parallel            : time  67.23 sec, boost  2.42, Correct
Redc_Montgomery_Parallel         : time  86.69 sec, boost  1.88, Correct
Cython_Redc_Barrett_Parallel     : time  13.96 sec, boost 11.66, Correct
Cython_Redc_Montgomery_Parallel  : time   8.31 sec, boost 19.58, Correct
Cython_Gmp_Parallel              : time   6.40 sec, boost 25.42, Correct

Full code hosted at PasteBin is here.

And just a code of single C++ function, which is quite short:

#include <gmpxx.h>
// .... And several other includes
using u64 = uint64_t; // alias used below (provided by the elided includes in the full code)

void PowModGmpC(
    uint64_t * bs0, uint64_t bs_words, uint64_t cnt, const uint64_t * e0,
    uint64_t e_words, const uint64_t * m0, uint64_t m_words) {
    
    auto ToL = [](void const * x, size_t c) {
        mpz_class r;
        mpz_import(r.get_mpz_t(), c / 8, -1, 8, 0, 0, x);
        return r;
    };
    auto ToW = [](mpz_class const & x, void * y, size_t c) {
        size_t cnt = 0;
        mpz_export(y, &cnt, -1, 8, 0, 0, x.get_mpz_t());
    };
    
    mpz_class const
        e = ToL(e0, e_words * sizeof(u64)),
        m = ToL(m0, m_words * sizeof(u64));
    
    std::span<u64> bs(bs0, bs_words * cnt);
    
    #pragma omp parallel for
    for (size_t i = 0; i < cnt * bs_words; i += bs_words) {
        mpz_class b = ToL(&bs[i], bs_words * sizeof(u64)), r;
        mpz_powm(r.get_mpz_t(), b.get_mpz_t(), e.get_mpz_t(), m.get_mpz_t());
        ToW(r, &bs[i], bs_words * sizeof(u64));
    }
}



Looking at the wiki page, it doesn't seem like your implementation is correct. Moving those two statements out of the else improved the performance significantly.

This is what I have from Wikipedia:

def modular_pow(base, exponent, modulus):
    if modulus == 1:
        return 0
    else:
        result = 1
        base = base % modulus
        while exponent > 0:
            if exponent % 2 == 1:
                result = (result * base) % modulus
            exponent = exponent >> 1
            base = (base * base) % modulus
        return result

Output:

print(modular_pow(4, 13, 497))

445

3 Comments

  • This is the version I have. The above was a copy error; as you can see from the indentation, the statements were already outside of the else-statement. The else-statement contained a debugging print for test purposes, which I removed. This is slower than Python's pow.
  • @torpedo give me an input where it was significantly slow. For me it ran pretty fast.
  • @torpedo It took me ~1.14 minutes to run print(modular_pow(4, 13, 497)) 100000000 times.

You can use the windowed NAF method to precompute a^2, a^3, ..., a^(2^w-1). Now, instead of n products plus n squarings, you have about n/w products. For example, in a 256-bit modexp with w=4, instead of up to 256 products on top of the squarings we have 256/4 = 64 products. At the cost of 14 precomputations it is a serious savings. Now, 4 bits give 2^4 = 16 possible values, but NAF represents these with digits in the range -w+1..w-1. Note that negative digits require the modular inverse a^(-1), so just using the encoding to scale positive values is more optimal than extra multiplications or the need to compute a modular inverse. Note that a^0 and a^1 do not require computation.
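
For reference, a minimal sketch of plain NAF recoding (digits in {-1, 0, 1}, no two adjacent digits nonzero; the windowed variant groups these signed digits into w-wide windows):

def naf(e):
    digits = []
    while e:
        if e & 1:
            d = 2 - (e & 3)       # +1 if e ≡ 1 (mod 4), -1 if e ≡ 3 (mod 4)
            e -= d
        else:
            d = 0
        digits.append(d)
        e >>= 1
    return digits[::-1]           # most significant digit first

assert naf(7) == [1, 0, 0, -1]    # 7 = 8 - 1: fewer nonzero digits than 0b111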

Some precomputations might be optimized away based on the exponent in NAF form, since you know beforehand exactly which values will be needed.

The w value must be adjusted based on the exponent bit size. The optimal value can be computed by comparing n/w against 2^w-1, or determined empirically.
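
A rough, illustrative cost model for that tuning (the constants are assumptions, not measurements): 2^w - 2 table entries up front, then roughly one multiply per w-bit window on top of the per-bit squarings.

def best_window(n_bits):
    cost = lambda w: (2 ** w - 2) + (n_bits // w) * (w + 1)
    return min(range(1, 9), key=cost)

for n in (256, 512, 1024, 2048):
    print(n, best_window(n))      # optimal w grows slowly with exponent size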

I'm surprised the fixed-exponent aspect of the problem was not addressed yet; there are a couple of papers on this exact topic. The same techniques are used in elliptic curve scalar multiplication, although there it is typically the point (analogous to the base) that is fixed, rather than the scalar (equivalent to the exponent here). The same method works on a fixed base, but then the precomputations can be done offline and reused more efficiently, whereas with a fixed exponent they are done on the fly.



If you want to speed it up, terminate slightly earlier: once the exponent is down to {0, 1, 2}, the extra calculations needed are straightforward.

Using these assumptions...

  1. _b_ :: base - integer
  2. _e_ :: exponent - non-negative integer
  3. _m_ :: modulus - non-zero integer

...the code below matches the output of Python 3's built-in pow(_b_, _e_, _m_), based on limited testing:

def mod_pow(_b_, _e_, _m_):

    if ((-(_ := 1) <= _m_) and _m_ <= _) or ((_b_ == -_b_) and _b_ < _e_):
        return  _ - _
    elif  _e_ < _  or  ((_b_ >= -_) and _b_ <= _):
        return (_ - (int(_b_ == -_)  & _e_) * (_  +  _)) % _m_
    elif ((-_ < (_b_ := _b_ % _m_)) and _b_ <= _) or _ >=  _e_:
        return _b_
    else:
        _e_ <<= _

    while  2 < (_e_ := _e_ >> 1):
        if 1 &  _e_:
            _ = _ * _b_ % _m_
        _b_ = _b_ * _b_ % _m_

    if 1 < _e_:
        _b_ *= _b_

    return _ * _b_ % _m_ if _e_ else _

3 Comments

  • That's an unconventional approach to variable naming...
  • My view would be: function-local variables aren't externally accessible, so it's pretty meaningless to mark them as "internal" anyway; none of these clash with Python keywords, so there's no need to avoid a conflict; and _ is usually used to indicate a variable you're deliberately ignoring, rather than your most-used temporary variable. If you're concerned about searchability, then exponent might be clearer than e. It doesn't really matter, so name things how you like in your own code, but it is unconventional.
  • @DavidW : if there's no typical use case for _, like in this function, then why not use it? It's sitting around idling anyway.
