Revisions to Numerical integration via Newton-Cotes formulas using integer arithmetic

added 1 character in body

Source Link

edited Feb 24 at 7:52

4.3k
14
50

The above lists eight different ways to generate a list containing the first 1025 powers of two (1 to \$2^{1024}\$). The execution time measures fluctuate, but generally the performance hierarchy doesn't change. The left-shift method and the addition and index assignment method both seem to be the fastest, but for large numbers, repeatedly shifingshifting the big integers will be slow, there is a lot of extra work to be done.

deprettify program output

Source Link

edited Feb 24 at 7:48

Toby Speight

92.1k
15
106
337

In [5]: n = 9460536207068016

In [6]: %timeit n + n
48.6 ns ± 0.206 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

In [7]: %timeit n * 2
68.4 ns ± 0.159 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

In [8]: %timeit n + n + n
80.4 ns ± 0.23 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

In [9]: %timeit n * 3
69.7 ns ± 0.95 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

In [10]: %timeit n + n + n + n
111 ns ± 1.27 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

In [11]: %timeit n * 4
68.2 ns ± 0.0599 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

In [12]: %timeit n << 2
57.8 ns ± 0.243 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

In [13]: %timeit n << 1
57.8 ns ± 0.15 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

In [14]: %timeit n * 8
69.6 ns ± 0.348 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

In [15]: %timeit n << 3
59.9 ns ± 0.655 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

In [5]: n = 9460536207068016

In [6]: %timeit n + n
48.6 ns ± 0.206 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

In [7]: %timeit n * 2
68.4 ns ± 0.159 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

In [8]: %timeit n + n + n
80.4 ns ± 0.23 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

In [9]: %timeit n * 3
69.7 ns ± 0.95 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

In [10]: %timeit n + n + n + n
111 ns ± 1.27 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

In [11]: %timeit n * 4
68.2 ns ± 0.0599 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

In [12]: %timeit n << 2
57.8 ns ± 0.243 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

In [13]: %timeit n << 1
57.8 ns ± 0.15 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

In [14]: %timeit n * 8
69.6 ns ± 0.348 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

In [15]: %timeit n << 3
59.9 ns ± 0.655 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

In [5]: n = 9460536207068016

In [6]: %timeit n + n
48.6 ns ± 0.206 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

In [7]: %timeit n * 2
68.4 ns ± 0.159 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

In [8]: %timeit n + n + n
80.4 ns ± 0.23 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

In [9]: %timeit n * 3
69.7 ns ± 0.95 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

In [10]: %timeit n + n + n + n
111 ns ± 1.27 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

In [11]: %timeit n * 4
68.2 ns ± 0.0599 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

In [12]: %timeit n << 2
57.8 ns ± 0.243 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

In [13]: %timeit n << 1
57.8 ns ± 0.15 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

In [14]: %timeit n * 8
69.6 ns ± 0.348 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

In [15]: %timeit n << 3
59.9 ns ± 0.655 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

In [5]: n = 9460536207068016

In [6]: %timeit n + n
48.6 ns ± 0.206 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

In [7]: %timeit n * 2
68.4 ns ± 0.159 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

In [8]: %timeit n + n + n
80.4 ns ± 0.23 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

In [9]: %timeit n * 3
69.7 ns ± 0.95 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

In [10]: %timeit n + n + n + n
111 ns ± 1.27 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

In [11]: %timeit n * 4
68.2 ns ± 0.0599 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

In [12]: %timeit n << 2
57.8 ns ± 0.243 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

In [13]: %timeit n << 1
57.8 ns ± 0.15 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

In [14]: %timeit n * 8
69.6 ns ± 0.348 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

In [15]: %timeit n << 3
59.9 ns ± 0.655 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

added 6958 characters in body

Source Link

edited Feb 24 at 7:43

Ξένη Γήινος

4.3k
14
50

Update 1

I chose the my code structure based purely on maximizing of performance (shortest execution time), and all else are disposable and should be sacrificed for even small gains in making the code faster. All my choices in this program are based on empirical evidence.

In [41]: %%timeit
    ...: lst = [1] * 1025
    ...: m = 1
    ...: for i in range(1, 1025):
    ...:     lst[i] = m = m + m
93.3 μs ± 1.28 μs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

In [42]: %%timeit
    ...: m = 1
    ...: lst = [1, *((m := m + m) for _ in range(1024))]
129 μs ± 1.25 μs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

In [43]: %%timeit
    ...: m = 1
    ...: lst = [1, *[(m := m + m) for _ in range(1024)]]
100 μs ± 465 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

In [44]: %%timeit
    ...: m = 1
    ...: lst = [(m := m + m) for _ in range(1024)]
    ...: lst.insert(0, 1)
94.9 μs ± 975 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

In [45]: %%timeit
    ...: m = 1
    ...: lst = [(m := m * 2) for _ in range(1024)]
    ...: lst.insert(0, 1)
132 μs ± 1.55 μs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

In [46]: %timeit [1 << i for i in range(1025)]
94.5 μs ± 643 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

In [47]: %%timeit
    ...: lst = [1] * 1025
    ...: m = 1
    ...: for i in range(1, 1025):
    ...:     lst[i] = m = m + m
94 μs ± 919 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

In [48]: %%timeit
    ...: m = 1
    ...: [1] + [(m := m * 2) for _ in range(1024)]
139 μs ± 1.21 μs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

In [49]: %%timeit
    ...: m = 1
    ...: [1] + [(m := m + m) for _ in range(1024)]
101 μs ± 611 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

The above lists eight different ways to generate a list containing the first 1025 powers of two (1 to \$2^{1024}\$). The execution time measures fluctuate, but generally the performance hierarchy doesn't change. The left-shift method and the addition and index assignment method both seem to be the fastest, but for large numbers, repeatedly shifing the big integers will be slow, there is a lot of extra work to be done.

I will explain each in detail.

In the first method, a list of length 1025 is created, initially filled with 1s. Then starting at index 1, each element is mutated by indexing assignment, the code first doubles multiple and assigns the result to multiple and the list at the given index. There is one single list throughout, no list.append calls and therefore no reallocation, it doesn't mutate the length of the list thus the memory access is contiguous, it is just pointer increments, therefore it is the fastest loop scheme.

The second method uses the same logic and operator, except it uses a generator, it doubles multiple at each iteration and reassigns that to multiple and yields the result at the same time, and then unpacks the generator and copies the contents of the yielded items to a list initialized with first item being 1. It has to first construct a generator, a generator is a Python object with all the overheads and it doesn't have a length attribute. The code has to first create a list and exhaust the generator and extend the list via list.extend which copies the values and mutates the length of the list so that memory access isn't contiguous. Therefore it is slow.

The third method is almost identical to the second, except it uses list comprehension instead of a generator. Generally a list comprehension is faster than the equivalent generator (changing [] to () without changing anything else), because it doesn't have Python object overheads. It has the same problem with the second. It has to create two lists and do bulk memory copying.

The fourth method creates a list using list comprehension and inserts 1 at position 0. It creates only one list, but it has to move all 1024 references down by one position to get one free space and then assign the first one. It has to copy all these elements one by one, and it has to move the starting position of the list if there isn't enough memory to store 1025 references because the position at start + 1024 is occupied by some other program. It has to mutate the length of the list therefore memory access isn't contiguous.

The fifth method is almost identical to the fourth, except it uses a multiplication instead of an addition. School book multiplication has time complexity of \$O(n^2)\$, but actually it is false. It is actually \$O(\lfloor \log_{base} n \rfloor + 1)^2\$, where base is the maximum digit plus one, \$\lfloor \log_{base} n \rfloor + 1\$ is the number of digits of n in base base numeral expansion, multiplication has to multiply every digit of the left operand with every digit of the right operand in the numeral expansions. For CPython the base is \$2^{30} = 1073741824\$, which fits inside a uint32_t. For large integers Python uses Karatsuba multiplication which has lower time complexity and also higher cost for small inputs, it multiplies two n-digit numbers in \$O(n^{\log_2 3})\$ time, which triggers in the example. For even larger numbers Python uses Number Theoretic Transform which has time complexity \$O(n \log n)\$ for n-digit multiplication but here the bit length is below the cutoff so it doesn't trigger. NTT is extremely slow for small inputs.

Now what is the time complexity of addition? Adding two n-digit integers can be done in n iterations, so it is \$O(n)\$ where n is the number of digits or \$\lfloor \log_{base} N \rfloor + 1\$ for input N. So for adding one integer with itself, n + n is strictly faster than n * 2, because n < n log n. Of course for adding n with itself m times n * (m + 1) for m > 1 is strictly faster than chain addition, because they have the same time complexity and the former has far fewer function calls and overheads.

The sixth method is almost as fast as the first, but the timings are unreliable. It uses list comprehension to generate a single list containing 1<<0, 1<<1, 1<<2, 1<<3, 1<<4, 1<<5... 1<<1024. For an n-digit number left-shift is \$O(n)\$ in time, it has to propagate the carry bit from lowest digit to highest digit (+ 1 digit if highest bit is set) in one single loop, so it is strictly faster than multiplication for multiplying with powers of 2. For some reason n << 1 is slower than n + n. The sixth method does a whole lot of redundant work because each immediate result can be obtained by a single operation from the previous result. So it is strictly slower than the first.

The eighth and nineth generates a list using list comprehension and concatenates it with another list. It creates two lists and then creates another list and copies the contents of the two lists to the third list, so they are the slowest.

Update 1

I chose the my code structure based purely on maximizing of performance (shortest execution time), and all else are disposable and should be sacrificed for even small gains in making the code faster. All my choices in this program are based on empirical evidence.

In [41]: %%timeit
    ...: lst = [1] * 1025
    ...: m = 1
    ...: for i in range(1, 1025):
    ...:     lst[i] = m = m + m
93.3 μs ± 1.28 μs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

In [42]: %%timeit
    ...: m = 1
    ...: lst = [1, *((m := m + m) for _ in range(1024))]
129 μs ± 1.25 μs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

In [43]: %%timeit
    ...: m = 1
    ...: lst = [1, *[(m := m + m) for _ in range(1024)]]
100 μs ± 465 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

In [44]: %%timeit
    ...: m = 1
    ...: lst = [(m := m + m) for _ in range(1024)]
    ...: lst.insert(0, 1)
94.9 μs ± 975 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

In [45]: %%timeit
    ...: m = 1
    ...: lst = [(m := m * 2) for _ in range(1024)]
    ...: lst.insert(0, 1)
132 μs ± 1.55 μs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

In [46]: %timeit [1 << i for i in range(1025)]
94.5 μs ± 643 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

In [47]: %%timeit
    ...: lst = [1] * 1025
    ...: m = 1
    ...: for i in range(1, 1025):
    ...:     lst[i] = m = m + m
94 μs ± 919 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

In [48]: %%timeit
    ...: m = 1
    ...: [1] + [(m := m * 2) for _ in range(1024)]
139 μs ± 1.21 μs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

In [49]: %%timeit
    ...: m = 1
    ...: [1] + [(m := m + m) for _ in range(1024)]
101 μs ± 611 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)