I decided to implement and compare several methods in Python (without C/C++). To say in advance: with the code below I reached a 25x speedup compared to using Python's built-in pow(a, b, c). Some methods are parallelized, some are not.
Note: See updated Part 2 of my post below with C++ code.
See code at the very bottom of this post. Timings below are for my i7 laptop with 4 cores at 1.2 GHz (8 hardware threads).
Python
- Pure Python single-process. A list comprehension over the built-in Python function pow(a, b, c).
Python_Parallel
- Multi-process pure Python. Same as 1) but parallel. This and the other parallel methods use the multiprocessing module to occupy all CPU cores.
Gmp
- GMPY2-based single-process solution. Uses the powmod() function in a list comprehension.
Gmp_Parallel
- Parallel GMPY2-based solution. Same as 3) but on all cores.
Redc_Div_Parallel
- Binary exponentiation (also known as exponentiation by squaring) with regular modulus computation through the % operator. This and the following methods use NumPy arrays with Python objects (object dtype) and implement the windowed method to speed up exponentiation.
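To illustrate the windowed method, here is a minimal standalone sketch (the names are mine, not from the code below): the exponent is split into w-bit digits, the small powers b^1 .. b^(2^w - 1) are precomputed once, and then each digit costs w squarings plus at most one multiplication.

```python
def pow_mod_windowed(b, e, m, w=4):
    # Split the exponent into w-bit digits, most significant first.
    digits = []
    while e:
        digits.append(e & ((1 << w) - 1))
        e >>= w
    digits.reverse()
    # Precompute b^0 .. b^(2^w - 1) modulo m.
    powers = [1]
    for _ in range(2 ** w - 1):
        powers.append(powers[-1] * b % m)
    res = 1
    for d in digits:
        for _ in range(w):              # w squarings per digit
            res = res * res % m
        if d:                           # one multiply per non-zero digit
            res = res * powers[d] % m
    return res
```

Larger w means fewer multiplications but more precomputation, which is why the code below picks w by minimizing an operation-count formula (see OptW).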
Redc_Barrett_Parallel
- Same as 5) but uses Barrett reduction to replace division with multiplications and shifts.
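For reference, Barrett reduction in a nutshell (a minimal standalone sketch with my own names, not the exact implementation below): precompute r = floor(2^s / n) once per modulus, then x mod n costs one multiplication, one shift, and a small correction.

```python
def barrett_reduce(x, n, s, r):
    # q approximates x // n from below; the error is at most a few units
    q = (x * r) >> s
    u = x - q * n
    while u >= n:       # final correction step
        u -= n
    return u

n = 0xFFFFFFFB
s = 2 * n.bit_length() + 2   # large enough for inputs up to ~n^2
r = (1 << s) // n            # precomputed once per modulus
x = 123456789123456789
assert barrett_reduce(x, n, s, r) == x % n
```

The code below defers the final correction to a single Adjust() pass at the very end instead of correcting after every reduction.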
Redc_Montgomery_Parallel
- Same as 5) but uses Montgomery reduction to replace division.
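And a minimal self-contained sketch of Montgomery reduction (helper names are mine; the real code below additionally vectorizes the masking over NumPy arrays): with R = 2^s, odd n, and k = -n^(-1) mod R precomputed, REDC(x) returns x * R^(-1) mod n using only multiplications, masks and shifts.

```python
def montgomery_setup(n, s):
    # n must be odd; R = 2^s must exceed n
    r = 1 << s
    k = (r * pow(r, -1, n) - 1) // n     # k = -n^(-1) mod R
    r2 = (r * r) % n                     # used to enter Montgomery form
    return k, r2

def redc(x, n, s, k):
    # returns x * R^(-1) mod n for any x < n * R
    t = ((x & ((1 << s) - 1)) * k) & ((1 << s) - 1)
    u = (x + t * n) >> s                 # exact division by R
    return u - n if u >= n else u

n = 1000000007                  # odd modulus
s = n.bit_length() + 2
k, r2 = montgomery_setup(n, s)
a, b = 123456789, 987654321
am = redc(a * r2, n, s, k)      # a in Montgomery form: a*R mod n
bm = redc(b * r2, n, s, k)
prod = redc(redc(am * bm, n, s, k), n, s, k)   # multiply, then leave Montgomery form
assert prod == (a * b) % n
```

The entry/exit conversions cost extra reductions, which is why Montgomery only pays off when many multiplications share one modulus, as in exponentiation.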
Timings:
Python : time 78.74 sec, boost 1.00
Python_Parallel : time 19.09 sec, boost 4.12, Correct
Gmp : time 27.76 sec, boost 2.84, Correct
Gmp_Parallel : time 7.57 sec, boost 10.40, Correct
Redc_Div_Parallel : time 17.38 sec, boost 4.53, Correct, w = 4 (324 redcs)
Redc_Barrett_Parallel : time 33.15 sec, boost 2.38, Correct, w = 4 (324 redcs)
Redc_Montgomery_Parallel : time 42.24 sec, boost 1.86, Correct, w = 4 (324 redcs)
All boosts above are relative to the single-threaded Python baseline. Correctness of every computed result is checked against the Python output.
One can see that Gmp_Parallel gives the biggest boost; it is also very simple, just two lines of code. Among Python-only solutions it is probably the best and simplest.
Of course, if you convert the Redc_... methods to C/C++, they should become much faster, possibly faster than the GMP C function mpz_powm (which underlies GMPY2's powmod()). You can use the fixed-width _ExtInt type available in Clang: just write using u512 = unsigned _ExtInt(512); and you get a fixed-size 512-bit integer type for free; you can even use an arbitrary, non-power-of-two bit width like _ExtInt(12345). Boost also has uint128_t/uint256_t/uint512_t available in C++.
I can imagine the only reason the Barrett/Montgomery solutions are slower is that they are Python-object based, so most of the time is spent organizing Python interactions instead of doing simple mul/shift operations without division. Hence in C++ you would get a big speed improvement.
In my code below you can tweak the variables num_bits and pow_bits to set the size of the numbers and powers (in bits); cnt_nums sets the size of the array (the number of bases to be raised to the power).
Try it online!
def PowModPython(bs, e, m):
return [pow(b, e, m) for b in bs]
def PowModGmp(bs, e, m):
import gmpy2, numpy as np
return [gmpy2.powmod(b, e, m) for b in bs]
#return list(np.vectorize(lambda x: gmpy2.powmod(x, e, m))(np.asarray(bs, dtype = object)))
def PowModParallel(func, bs, e, m, *, processes = None, extra = ()):
import multiprocessing as mp, functools
if processes is None:
processes = mp.cpu_count()
with mp.Pool(processes) as pool:
try:
block = (len(bs) + processes * 8 - 1) // (processes * 8)
return functools.reduce(
lambda a, b: a + b,
pool.starmap(func,
[(bs[i : i + block], e, m) + extra for i in range(0, len(bs), block)]),
[]
)
finally:
pool.close()
pool.join()
def PowModRedc(bs, e, m, bits, redc_type = None, w = None):
import gmpy2, numpy as np, math
def Int(x):
return gmpy2.mpz(x)
def Mask(x, s):
#return x & ((1 << s) - 1)
return np.vectorize(lambda x: gmpy2.t_mod_2exp(x, s))(x)
e = Int(e)
m = Int(m)
def CreateMod():
if redc_type == 'div':
def Mod(x):
return x % m
def To(x):
return x
def From(x):
return x
def Adjust(x):
return x
else:
if (redc_type is None and m & 1 != 0) or redc_type == 'montgomery':
# https://www.nayuki.io/page/montgomery-reduction-algorithm
assert m & 1 != 0, 'Montgomery can not handle even modulus!'
def MontgomeryPrec(n, s):
n, s = Int(n), Int(s)
r = 1 << s
r2 = (r * r) % n
ri = pow(r, -1, n)
k = (r * ri - 1) // n
return n, s, Int(k), Int(r2), Int(r - 1)
Mg = MontgomeryPrec(m, bits + 2)
def Mod(x):
n, s, k, _, mask = Mg
t = Mask(Mask(x, s) * k, s)
u = (x + t * n) >> s
return u
def To(x):
_, _, _, r2, _ = Mg
return Mod(x * r2)
def From(x):
return Mod(x)
elif redc_type == 'barrett':
# https://www.nayuki.io/page/barrett-reduction-algorithm
assert m > 0 and m & (m - 1) != 0, 'Modulus is a power of two! Not possible for Barrett.'
def BarrettPrec(n, s):
n, s = Int(n), Int(s)
return n, s, (1 << s) // n
Br = BarrettPrec(m, bits * 2 + 2)
def Mod(x):
n, s, r = Br
q = (x * r) >> s
u = x - q * n
return u
def To(x):
return x
def From(x):
return x
else:
assert False, f'Unknown reduction type {redc_type}!'
def Adjust(x):
gm = x >= m
x[gm] -= m
return x
return Mod, To, From, Adjust
def Arr(x):
return np.array([Int(e) for e in x], dtype = object)
Mod, To, From, Adjust = CreateMod()
bs = To(Mod(To(Arr(bs))))
if w is None:
w = OptW(e)[1]
def PreComputePowers(d, w, bs):
d = Int(d)
digits = []
mask = 2 ** w - 1
while d != 0:
digits.append(d & mask)
d >>= w
digits = digits[::-1]
powers = [None, np.copy(bs)]
for i in range(2, 2 ** w):
powers.append(Mod(powers[-1] * bs))
return digits, powers
digits, powers = PreComputePowers(e, w, bs)
res = powers[digits[0]]
for digit in digits[1:]:
for i in range(w):
res = Mod(res * res)
if digit != 0:
res = Mod(res * powers[digit])
return list(Adjust(From(res)))
def OptW(e):
n = max(1, e.bit_length())
minv = None
for w in range(1, 11):
# Number of multiplication-reductions
c = (2 ** w - 2) + ((n - 1) // w) * (w + 1)
if minv is None or c < minv[0]:
minv = (c, w)
return minv
def Test():
import time, random, numpy as np
random.seed(0)
num_bits = 496
pow_bits = 252
cnt_nums = 1 << 16
cnt_pows = 1 << 0
def Rand(bits):
return random.randrange(1 << bits)
bs = [Rand(num_bits) for i in range(cnt_nums)]
em = [(Rand(pow_bits), Rand(num_bits) | 1) for i in range(cnt_pows)]
opt_w = OptW(em[0][0])
ref_tim, res = None, None
for fname, f, w in [
('Python', PowModPython, None),
('Python_Parallel', lambda bs, e, m: PowModParallel(PowModPython, bs, e, m), None),
('Gmp', PowModGmp, None),
('Gmp_Parallel', lambda bs, e, m: PowModParallel(PowModGmp, bs, e, m), None),
('Redc_Div_Parallel', lambda bs, e, m: PowModParallel(PowModRedc, bs, e, m, extra = (num_bits, 'div')), opt_w),
('Redc_Barrett_Parallel', lambda bs, e, m: PowModParallel(PowModRedc, bs, e, m, extra = (num_bits, 'barrett')), opt_w),
('Redc_Montgomery_Parallel', lambda bs, e, m: PowModParallel(PowModRedc, bs, e, m, extra = (num_bits, 'montgomery')), opt_w),
]:
tim = time.time()
for e, m in em:
resc = f(bs, e, m)
tim = time.time() - tim
if ref_tim is None:
ref_tim = tim
print(f'{fname:<26}: time {tim:>6.02f} sec, boost {ref_tim / tim:>5.02f}', end = '', flush = True)
if res is not None:
assert np.all(np.equal(res, resc))
print(', Correct', end = '')
else:
res = resc
print(f', w = {w[1]} ({w[0]} redcs)' if w is not None else '')
if __name__ == '__main__':
Test()
Update. I also implemented a C++ version, together with Cython to adapt it to run from Python. See the code linked at the bottom. I also used the sliding-window approach instead of the regular window used above.
The C++ parallel version (Montgomery-based) turned out to be 2x faster than the parallel GMPY2 version implemented above (and 19x faster than the single-threaded Python version). The Barrett version turned out to be 1.1x faster than parallel GMP. Precise timings are below; see the Cython_... variants, they are the fastest:
reg_win = 4 (328 redcs), slide_win = 5 (320 redcs)
Python : time 78.90 sec, boost 1.00
Python_Parallel : time 20.22 sec, boost 3.90, Correct
Gmp : time 27.81 sec, boost 2.84, Correct
Gmp_Parallel : time 8.19 sec, boost 9.63, Correct
Redc_Div_Parallel : time 17.42 sec, boost 4.53, Correct
Redc_Barrett_Parallel : time 33.08 sec, boost 2.39, Correct
Redc_Montgomery_Parallel : time 43.05 sec, boost 1.83, Correct
Cython_Redc_Barrett_Parallel : time 7.71 sec, boost 10.23, Correct
Cython_Redc_Montgomery_Parallel : time 4.15 sec, boost 19.00, Correct
The C++ version uses features of the C++20 standard. It also needs the Clang compiler, because it uses the _ExtInt(N) type available only in Clang. For long arithmetic it uses Boost, and for parallelization the OpenMP support built into the compiler. The Cython package should be installed in Python, because Cython is responsible for wrapping and compiling the C++. Install it through python -m pip install cython.
On Windows you can install Clang from the LLVM releases page: download and install LLVM-13.0.0-win64.exe. Boost can be installed with Chocolatey (a great package manager for Windows) or downloaded from the Boost website. To install it with Chocolatey, just type choco install boost-msvc-14.2 in a console. On Windows you should also install Visual Studio (2019 or 2022; the Community edition is enough), because Clang depends on its libraries on Windows; only on Linux/macOS is Clang self-contained. Under Linux, Clang and Boost can be installed simply through sudo apt install clang libboost-all-dev.
In my code you can see clang_dir = 'c:/bin/llvm/bin/', boost_dir = None in the cython_import() function. You need to set these two options only under Windows; for Linux set them to clang_dir = '', boost_dir = ''. Chocolatey installs Boost into the c:/local/boost_1_74_0/ directory by default.
Sources here. The source code with C++ turned out to be too large for the StackOverflow post size limit (30,000 chars), hence I uploaded it to PasteBin; it can be downloaded from here (a Tio.run mirror is here; it is not runnable there).
Update 2. I implemented one more version, also in C++/Cython, this time based purely on GMP (the C library), with no Python interaction during computation. This solution turned out to be quite simple and the fastest of all. As a reminder, my fastest solution above in C++ was based on Montgomery reduction and the sliding-window variant of binary exponentiation, and it was 19x faster than the single-threaded pure-Python variant. The current C++/GMP-based solution is 25.4x faster than single-threaded Python. Of course it is also parallelized using OpenMP. Its code is quite simple; due to the StackOverflow post size limit I provide only this single C++ function below. The full code is hosted at the external PasteBin web service, linked below.
Precise timings of all methods (the new method is called Cython_Gmp_Parallel):
reg_win = 4 (330 redcs), slide_win = 5 (321 redcs)
Python : time 162.81 sec, boost 1.00
Python_Parallel : time 37.92 sec, boost 4.29, Correct
Gmp : time 56.57 sec, boost 2.88, Correct
Gmp_Parallel : time 15.13 sec, boost 10.76, Correct
Redc_Div_Parallel : time 34.81 sec, boost 4.68, Correct
Redc_Barrett_Parallel : time 67.23 sec, boost 2.42, Correct
Redc_Montgomery_Parallel : time 86.69 sec, boost 1.88, Correct
Cython_Redc_Barrett_Parallel : time 13.96 sec, boost 11.66, Correct
Cython_Redc_Montgomery_Parallel : time 8.31 sec, boost 19.58, Correct
Cython_Gmp_Parallel : time 6.40 sec, boost 25.42, Correct
The full code hosted at PasteBin is here.
And here is the code of the single C++ function, which is quite short:
#include <gmpxx.h>
// .... And several other includes
using u64 = uint64_t; // alias used below; defined among the omitted includes in the full code
void PowModGmpC(
uint64_t * bs0, uint64_t bs_words, uint64_t cnt, const uint64_t * e0,
uint64_t e_words, const uint64_t * m0, uint64_t m_words) {
auto ToL = [](void const * x, size_t c) {
mpz_class r;
mpz_import(r.get_mpz_t(), c / 8, -1, 8, 0, 0, x);
return r;
};
auto ToW = [](mpz_class const & x, void * y, size_t c) {
size_t cnt = 0;
mpz_export(y, &cnt, -1, 8, 0, 0, x.get_mpz_t());
};
mpz_class const
e = ToL(e0, e_words * sizeof(u64)),
m = ToL(m0, m_words * sizeof(u64));
std::span<u64> bs(bs0, bs_words * cnt);
#pragma omp parallel for
for (size_t i = 0; i < cnt * bs_words; i += bs_words) {
mpz_class b = ToL(&bs[i], bs_words * sizeof(u64)), r;
mpz_powm(r.get_mpz_t(), b.get_mpz_t(), e.get_mpz_t(), m.get_mpz_t());
ToW(r, &bs[i], bs_words * sizeof(u64));
}
}