Return to Answer

Post Timeline

added 7 characters in body

Source Link

edited Feb 28 at 18:56

4.3k
14
50

Thus if I set a fixed number of bits, and I truncate every arithmetic result to the highest number of bits, the numbers will never blow up and I get \$\log_{10}(2)\$ times the precision number of accurate decimal digits at minimum, this can ensure the calculations stay fast no matter how many fractions I add. And as a bonus, this beats floats, because if the numerator and denominator are both below the limit, then the fraction is exact, unlike in float in which every fraction with reduced denominator containing any prime factor other than 2 gets infinite binary expansion thus only approximations can be stored.

Source Link

answered Feb 28 at 18:50

Ξένη Γήινος

4.3k
14
50

Thanks for the answers, but they don't improve the performance so I won't accept them.

And I have discovered three tricks that can drastically increase the performance of the program and I have also implemented two minor optimizations.

I will explain each in detail.

First, the simplest optimization I made, as some integrals start at the origin, it doesn't make sense to build the multiples of the numerator of the start times the denominator of the end (low_num * high_den in my code), as it is 0 if low_num is 0, and all its multiples are 0, there is a lot of redundant computation, so I made it a special case and I made the code compile dedicated functions just for them.

Second, I have added the option to reduce the starting fractions to the lowest terms (making the fractions coprime pairs), the idea is simple, I add the fractions using \$\frac{a}{b} + \frac{c}{d} = \frac{ad + bc}{bd}\$, as you can see both b and d cannot be 0, and a and c are likely positive, which means in general the numerator and denominator of the sum will be larger than the original fractions, and the rate of increase is extremely fast, the larger the numbers the faster the growth rate, and adding many fractions together will make the numbers explode.

Now, the elementary arithmetic functions used here: addition and multiplication, both have growing cost as the numbers get larger, the larger the numbers the more time they will take, with multiplication scale as the square of the number of digits.

So I made the code to generate functions that reduce the end points and the common denominator up front, and I used 3 GCD calls and 7 floor divisions, they do have some cost, but compared to the cost of arithmetic on large numbers their costs are small, and they ensure the starting values are as small as possible, which means the numbers won't grow too fast.

Now, the first trick I used, is this, instead of adding the fractions one by one linearly, I split the list of fractions into pairs, and add the fractions in each pair, and collect the resulting fractions and carry fraction if the count of starting fractions is odd, this halves the number of fractions each iteration, the output of each iteration is the input of the next iteration.

Eventually all the fractions will be added and there will be only one fraction in the stack, and so the loop terminates. This method is much faster than bruteforce, for more information, see this answer.

There are two main reasons that the second method is faster, in the first method it is always a big number multiplied with a small number, the big numbers grow in a factorial like fashion. In the second method, the fractions are added in pairs, which means the numbers will be smaller than if three and more fractions are added together, this much should be obvious. So the numbers stay smaller for longer, thus operations on them remain faster for longer.

The second reason is this, in the second method as the fractions are added in pairs, in each iteration all the fractions will be comparable in size. Why does this matter? Because Python multiplies digit by digit, CPython uses 30 bits in a digit, so a digit can hold \$2^{30} = 1073741824\$ values, and for large numbers Python uses Karatsuba multiplication which multiplies two n-digit numbers in \$O(n^{\log_2(3)})\$ time, and CPython uses Karatsuba multiplication when both operands are at least 70 digits long (2100 bits), whereas for cases where at least one of the operands is less than 70 digits long, Python uses schoolbook multiplication which has time complexity of \$O(n^2)\$, maybe it doesn't seem much, but it is exponentiation which means it grows extremely fast, so Karatsuba multiplication will be much faster than schoolbook multiplication as the input get larger. For example, \$70^2 = 4900, 70^{\log_2(3)} \approx 840.2551898497885\$, the difference is huge.

Proof:

import numpy as np
from itertools import batched
from math import gcd


def make_test_data(num_digits: int, num_fractions: int) -> list[tuple[int, int]]:
    result = []
    data = np.random.randint(
        0x20000000, 0x40000000, (num_fractions, num_digits << 1), dtype=np.uint32
    )
    for row in data:
        pair = [0, 0]
        for i, e in enumerate(row.tolist()):
            pair[which] = (pair[which := i < num_digits] << 30) | e

        result.append(pair)

    return result


def test1(fracs: list[tuple[int, int]]) -> tuple[int, int]:
    num, den = 0, 1
    for tnum, tden in fracs:
        num, den = num * tden + tnum * den, den * tden

    return num, den


def test2(fracs: list[tuple[int, int]]) -> tuple[int, int]:
    while (length := len(fracs)) > 1:
        carry = fracs.pop(-1) if length & 1 else None
        fracs = [(a * d + b * c, b * d) for (a, b), (c, d) in batched(fracs, 2)]
        if carry:
            fracs.append(carry)

    return fracs[0]

In [46]: data = make_test_data(16, 16)

In [47]: %timeit test1(data)
126 μs ± 811 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

In [48]: %timeit test2(data)
129 μs ± 763 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

In [49]: data = make_test_data(32, 16)

In [50]: %timeit test1(data)
449 μs ± 9.22 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

In [51]: %timeit test2(data)
400 μs ± 7.51 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

In [52]: data = make_test_data(64, 16)

In [53]: %timeit test1(data)
1.71 ms ± 10.6 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

In [54]: %timeit test2(data)
1.24 ms ± 22.8 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

In [55]: data = make_test_data(16, 32)

In [56]: %timeit test1(data)
484 μs ± 14.6 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

In [57]: %timeit test2(data)
431 μs ± 14.1 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

In [58]: data = make_test_data(16, 64)

In [59]: %timeit test1(data)
1.87 ms ± 56.2 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [60]: %timeit test2(data)
1.39 ms ± 12.7 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

As you can see, test2 is increasingly faster than test1 as the number of digits increase and the number of fractions increase, though for trivial cases test1 is faster, so I only use the method used in test2 in the integrate_by_parts function (and yes I am aware that it differs from school definitions, and the method is better described as "composite rule", but I don't care about academia and I call it integrate by parts because it splits the interval into many parts and integrates over each and then sum).

The second trick is much easier to follow. So we all know multiplying the numerator and denominator with some number simultaneously doesn't change the value of the fraction. Now, division is multiplication, division is multiplication with the multiplicative inverse element. And, as the fractions get larger, the cost of adding the fractions together will also get larger even faster, and the cost will only increase as we add the fractions together, so naturally if we divide both sides by some number the numerator and denominator will be smaller and thus arithmetic on them will be faster again.

Thus naturally we should find the greatest common denominator of the numerator and denominator and divide both sides by that number, right? WRONG! GCD is slow, and division is slow, we need one GCD and two floor divisions every iteration if we reduce the fraction every iteration, and the code will be much slower than before.

Instead we can do this, obviously a fraction with an even numerator and an even denominator isn't reduced, they have common factors of 2, a fraction with both sides odd isn't guaranteed to be reduced either, but a reduced fraction must not have both sides even, and from both sides even to at least one side odd we divide by a power of two on both sides, thus the numbers are smaller.

Now computers use binary, it is just ones and zeros, each place represents a power of two. For example, 9460536207068016 is \$100001100111000100111010111100000000011100101101110000_2\$ in binary, and the number of times two appears in a number's prime factorization (the exponent of the maximum power of two that divides a number) is simply the number of trailing zeros in a number's binary expansion, for example the number given is \$2^{53} + 2^{48} + 2^{47} + 2^{44} + 2^{43} + 2^{42} + 2^{38} + 2^{35} + 2^{34} + 2^{33} + 2^{31} + 2^{29} + 2^{28} + 2^{27} + 2^{26} + 2^{16} + 2^{15} + 2^{14} + 2^{11} + 2^{9} + 2^{8} + 2^{6} + 2^{5} + 2^{4}\$, the polynomial is what that binary expansion means, the polynomial can be expressed more efficiently as \$2^{53} + 2^{49} - 2^{47} + 2^{45} - 2^{42} + 2^{38} + 2^{36} - 2^{33} + 2^{31} + 2^{30} - 2^{26} + 2^{17} - 2^{14} + 2^{11} + 2^{10} - 2^{8} + 2^{7} - 2^{4}\$ but I digress.

To get the number of trailing zeros in Python, use (n ^ n - 1).bit_length() - 1 (subtraction has a higher precedence so it is evaluated first), for a number with n trailing zeros, after subtraction the last n bits will be 1, and the next higher bit will be onset because of the borrow, previously it was set by definition, therefore these will be the only bits that differ, and XOR is true when operands are different.

We take the minimum of the number of trail zeros of the numerator and denominator, this is the exponent of the largest power of two that divides both, to get rid of it we simply right shift both sides by that amount. This preserves the exact fraction and makes the numbers smaller at very little cost.

The third trick is like magic. I don't fully understand why it works. But I can describe it in different ways. As we all know, if we divide both sides of a fraction by the same amount, the result doesn't change. But what if we do integer division on both sides, what if we divide both sides by an integer that isn't a common factor and get rid of the remainders? Well, we obviously get a new fraction, with the quotients as the new numerator and new denominator, but how different is the new fraction from the original?

In [102]: import gmpy2

In [103]: pi = gmpy2.const_pi(333)

In [104]: gmpy2.get_context().precision = 333

In [105]: n, d = pi.as_integer_ratio()

In [106]: n/d
Out[106]: mpfr('3.14159265358979323846264338327950288419716939937510582097494459230781640628620899862803482534211706807',333)

In [107]: (n>>1)/(d>>1)
Out[107]: mpfr('3.14159265358979323846264338327950288419716939937510582097494459230781640628620899862803482534211706784',333)

In [108]: (n>>2)/(d>>2)
Out[108]: mpfr('3.14159265358979323846264338327950288419716939937510582097494459230781640628620899862803482534211706738',333)

In [109]: (n>>4)/(d>>4)
Out[109]: mpfr('3.14159265358979323846264338327950288419716939937510582097494459230781640628620899862803482534211706738',333)

In [110]: (n>>8)/(d>>8)
Out[110]: mpfr('3.14159265358979323846264338327950288419716939937510582097494459230781640628620899862803482534211701252',333)

In [111]: (n>>16)/(d>>16)
Out[111]: mpfr('3.14159265358979323846264338327950288419716939937510582097494459230781640628620899862803482534210595206',333)

In [112]: (n>>32)/(d>>32)
Out[112]: mpfr('3.14159265358979323846264338327950288419716939937510582097494459230781640628620899862803482484887449405',333)

In [113]: (n>>64)/(d>>64)
Out[113]: mpfr('3.14159265358979323846264338327950288419716939937510582097494459230781640628620899755636516885771225893',333)

In [114]: for i in range(1, 9):
     ...:     shift = 1 << i
     ...:     print(shift, int(-gmpy2.log10(abs((n>>shift)/(d>>shift) - pi))))
2 99
4 99
8 97
16 95
32 90
64 81
128 61
256 23

In [115]: 8 * gmpy2.log10(2)
Out[115]: mpfr('2.40823996531184956170991115779594421414551905169686833048341968901686551419539607589541801694548937642',333)

In [116]: 16 * gmpy2.log10(2)
Out[116]: mpfr('4.81647993062369912341982231559188842829103810339373666096683937803373102839079215179083603389097875284',333)

In [117]: 32 * gmpy2.log10(2)
Out[117]: mpfr('9.63295986124739824683964463118377685658207620678747332193367875606746205678158430358167206778195750567',333)

In [118]: 64 * gmpy2.log10(2)
Out[118]: mpfr('19.2659197224947964936792892623675537131641524135749466438673575121349241135631686071633441355639150113',333)

If we right shift both sides by n bits, then in general we lose the last \$n \ \log_{10}(2)\$ digits. But all the other digits stay the same, they don't change.

Why is it so?

Well, we can approach this in different ways. First, we know \$\frac{a}{b} = \frac{2^na}{2^nb}\$, left-shifting both sides by n equals multiplying both sides by \$2^n\$. Conversely right-shifting both sides by n equals dividing both sides by the same amount. Now add a tiny number to the numerator after multiplying both sides by the power of two, this number is smaller than the power of two. Obviously the difference will be small, as the number is small and it is divided by a huge number, and the numerator is dominated by a. Similarly add some small number to the denominator, the difference will be small because the denominator is dominated by b.

Another way to look at this, is to think of the fraction not as a number, but as a convergent series. We have this series: \$\frac{N_0 \cdot 2^n + N_1 \cdot 2^{n-1} + N_2 \cdot 2^{n-2} + N_3 \cdot 2^{n-3} + N_4 \cdot 2^{n-4} + N_5 \cdot 2^{n-5}...}{D_0 \cdot 2^d + D_1 \cdot 2^{d-1} + D_2 \cdot 2^{d-2} + D_3 \cdot 2^{d-3} + D_4 \cdot 2^{d-4} + D_5 \cdot 2^{d-5}...}\$, in which capital N with subscript means the bits of the numerator, and capital D for the denominator bits, little n and little d are for the highest bit positions. As you can clearly see, the sums are dominated by the higher bits because they are multiplied with huge numbers, the more to the right a term is, the smaller its overall contribution to the sums, and so they don't make much of a difference, but the more terms there are, the more precision the partial sum is, therefore the later terms change the last few digits.

Thus if I set a fixed number of bits, and I truncate every arithmetic result to the highest number of bits, the numbers will never blow up and I get \$\log_{10}(2)\$ times the precision number of accurate decimal digits at minimum, this can ensure the calculations stay fast no matter how many fractions I add. And as a bonus, this beats floats, because if the numerator and denominator are both below the limit, then the fraction is exact, unlike in float in which every fraction with reduced denominator containing any prime other than 2 gets infinite binary expansion thus only approximations can be stored.

Code:

from itertools import batched, islice
from math import ceil, gcd
from gmpy2 import atan2, get_context, log10, mpfr, set_exp

NEWTON_COTES_FORMULAS = [
    {
        "steps": 3,
        "scale": (3, 8),
        "extrema": 1,
        "middle": {3: [(1, 3), (2, 3)]},
    },
    {
        "steps": 4,
        "scale": (2, 45),
        "extrema": 7,
        "middle": {32: [(1, 4), (3, 4)], 12: [(1, 2)]},
    },
    {
        "steps": 5,
        "scale": (5, 288),
        "extrema": 19,
        "middle": {75: [(1, 5), (4, 5)], 50: [(2, 5), (3, 5)]},
    },
    {
        "steps": 6,
        "scale": (1, 140),
        "extrema": 41,
        "middle": {216: [(1, 6), (5, 6)], 27: [(1, 3), (2, 3)], 272: [(1, 2)]},
    },
    {
        "steps": 7,
        "scale": (7, 17280),
        "extrema": 751,
        "middle": {
            3577: [(1, 7), (6, 7)],
            1323: [(2, 7), (5, 7)],
            2989: [(3, 7), (4, 7)],
        },
    },
    {
        "steps": 8,
        "scale": (4, 14175),
        "extrema": 989,
        "middle": {
            5888: [(1, 8), (7, 8)],
            -928: [(1, 4), (3, 4)],
            10496: [(3, 8), (5, 8)],
            -4540: [(1, 2)],
        },
    },
    {
        "steps": 9,
        "scale": (9, 89600),
        "extrema": 2857,
        "middle": {
            15741: [(1, 9), (8, 9)],
            1080: [(2, 9), (7, 9)],
            19344: [(1, 3), (2, 3)],
            5778: [(4, 9), (5, 9)],
        },
    },
    {
        "steps": 10,
        "scale": (5, 299376),
        "extrema": 16067,
        "middle": {
            106300: [(1, 10), (9, 10)],
            -48525: [(1, 5), (4, 5)],
            272400: [(3, 10), (7, 10)],
            -260550: [(2, 5), (3, 5)],
            427368: [(1, 2)],
        },
    },
]


def generate_indices(formula: dict) -> dict:
    num_mults = set()
    den_mults = set()
    middle = formula["middle"]
    for points in middle.values():
        num, den = points[0]
        num_mults.add(num)
        den_mults.add(den)
        if len(points) > 1:
            num, den = points[1]
            num_mults.add(num)
            den_mults.add(den)

    num_mults = sorted(num_mults)
    den_mults = sorted(den_mults)
    num_indices = {e: i for i, e in enumerate(num_mults)}
    if len(den_mults) == 1 and den_mults[0] > 2:
        return {
            "num_mults": num_mults,
            "den_mults": den_mults,
            "indices": [
                (weight, num_indices[a], num_indices[b])
                for weight, ((a, _), (b, _)) in middle.items()
            ],
            "middle": None,
        }

    den_indices = {e: i for i, e in enumerate(den_mults)}
    indices = [
        (weight, [(num_indices[num], den_indices[den]) for num, den in points])
        for weight, points in middle.items()
    ]
    weight = None
    if den_mults[0] == 2:
        weight, _ = indices.pop(-1)

    return {
        "num_mults": num_mults,
        "den_mults": den_mults,
        "indices": indices,
        "middle": weight,
    }


def assemble_strategy(mults: list[int]) -> dict:
    strategy = {"add": [], "mul": []}
    if (e := mults[0]) == 1:
        diffs = [b - a for a, b in zip(mults, mults[1:])]
        if len(set(diffs)) == 1:
            return {"type": "range", "step": diffs[0], "increments": len(diffs)}
    else:
        strategy["mul"].append((0, e))

    for i, e in enumerate(mults[1:], start=1):
        found = False
        for j, f in enumerate(mults[:i]):
            for k, g in enumerate(mults[j:i], start=j):
                if f + g == e:
                    strategy["add"].append((i, j, k))
                    found = True
                    break
            if found:
                break
        else:
            strategy["mul"].append((i, e))

    return {"type": "general", "length": len(mults), "strategy": strategy}


INITIALIZE_TYPE_0 = """
    left = low_num * high_den
    span = high_num * low_den - left
    divisor = low_den * high_den
"""

INITIALIZE_TYPE_1 = """
    low_num //= (common := gcd(low_num, low_den))
    low_den //= common
    high_num //= (common := gcd(high_num, high_den))
    high_den //= common
    left = low_num * high_den // (common := gcd(low_den, high_den))
    span = high_num * low_den // common - left
    divisor = low_den * high_den // common
"""


INITIALIZE_TYPE_2 = """
    span = high_num
    divisor = high_den
"""

INITIALIZE_TYPE_3 = """
    span = high_num // (common := gcd(high_num, high_den))
    divisor = high_den // common
"""


FUNCTION_TEMPLATE_0 = """def {func_name}(
    low_num: int, low_den: int, high_num: int, high_den: int, integrand: callable
) -> tuple[int, int]:{initialize}
{assemble}
    result_num, result_den = integrand(low_num, low_den)
    term_num, term_den = integrand(high_num, high_den)
    result_num, result_den = result_num * term_den + term_num * result_den, result_den * term_den
    result_num {extrema_weight}
    indices = NEWTON_COTES_MIDDLE_INDICES[{indices_length}]{loop_body}{extra_iteration}
    return result_num * span {num_scale}, result_den * {last_den} {den_scale}
"""


FUNCTION_TEMPLATE_1 = """def {func_name}(
    high_num: int, high_den: int, integrand: callable
) -> tuple[int, int]:{initialize}
{assemble}
    result_num, result_den = integrand(0, 1)
    term_num, term_den = integrand(high_num, high_den)
    result_num, result_den = result_num * term_den + term_num * result_den, result_den * term_den
    result_num {extrema_weight}
    indices = NEWTON_COTES_MIDDLE_INDICES[{indices_length}]{loop_body}{extra_iteration}
    return result_num * span {num_scale}, result_den * {last_den} {den_scale}
"""


FUNCTION_MAINLOOP = """
    for weight, ((x1, y), (x2, _)) in indices:
        term_num, term_den = integrand((start := mults_left[y]) + mults_span[x1], (den := mults_div[y]))
        next_num, next_den = integrand(start + mults_span[x2], den)
        term_num, term_den = term_num * next_den + next_num * term_den, term_den * next_den
        result_num, result_den = result_num * term_den + term_num * result_den * weight, result_den * term_den
"""

FUNCTION_MAINLOOP_1 = """
    for weight, ((x1, y), (x2, _)) in indices:
        term_num, term_den = integrand(mults_span[x1], (den := mults_div[y]))
        next_num, next_den = integrand(mults_span[x2], den)
        term_num, term_den = term_num * next_den + next_num * term_den, term_den * next_den
        result_num, result_den = result_num * term_den + term_num * result_den * weight, result_den * term_den
"""


FUNCTION_ALTLOOP = """
    for weight, x1, x2 in indices:
        term_num, term_den = integrand(max_left + mults_span[x1], max_div)
        next_num, next_den = integrand(max_left + mults_span[x2], max_div)
        term_num, term_den = term_num * next_den + next_num * term_den, term_den * next_den
        result_num, result_den = result_num * term_den + term_num * result_den * weight, result_den * term_den
"""


FUNCTION_ALTLOOP_1 = """
    for weight, x1, x2 in indices:
        term_num, term_den = integrand(mults_span[x1], max_div)
        next_num, next_den = integrand(mults_span[x2], max_div)
        term_num, term_den = term_num * next_den + next_num * term_den, term_den * next_den
        result_num, result_den = result_num * term_den + term_num * result_den * weight, result_den * term_den
"""


FUNCTION_LAST_ITERATION = """
    term_num, term_den = integrand(mults_left[0] + span, mults_div[0])
    result_num, result_den = result_num * term_den + term_num * result_den {weight}, result_den * term_den
"""

FUNCTION_LAST_ITERATION_1 = """
    term_num, term_den = integrand(span, mults_div[0])
    result_num, result_den = result_num * term_den + term_num * result_den {weight}, result_den * term_den
"""


def mul_or_lshift(n: int) -> str:
    return f"* {n}" if n & -n != n else f"<< {n.bit_length() - 1}"


def imul_or_ilshift(n: int) -> str:
    return f"*= {n}" if n & -n != n else f"<<= {n.bit_length() - 1}"


def assemble_numerator_multiples(strategy: dict) -> list:
    if strategy["type"] == "range":
        return [
            (
                "    step = span"
                if (step := strategy["step"]) == 1
                else (
                    "    step = span + span"
                    if step == 2
                    else f"    step = span {mul_or_lshift(step)}"
                )
            ),
            "    number = span",
            f"    mults_span = [span] * {strategy['increments']+1}\n    for i in range(1, {strategy['increments']+1}):\n        mults_span[i] = number = number + step\n",
        ]

    result = [f"    mults_span = [span] * {strategy['length']}"]
    strats = strategy["strategy"]
    for a, b in strats["mul"]:
        result.append(f"    mults_span[{a}] {imul_or_ilshift(b)}")

    for i, j, k in strats["add"]:
        if j != k:
            result.append(
                f"    mults_span[{i}] = "
                + (f"mults_span[{j}]" if j else "span")
                + " + "
                + (f"mults_span[{k}]" if k else "span")
            )
        elif not j:
            result.append(f"    mults_span[{i}] += span")

        else:
            result.append(f"    mults_span[{i}] = (val := mults_span[{j}]) + val")

    return result


def assemble_denominator_multiples(strategy: dict) -> dict:
    strats = strategy["strategy"]
    if strategy["length"] == 1:
        mult = strats["mul"][0][1]
        return {
            "left": [f"    max_left = left {(op := mul_or_lshift(mult))}"],
            "div": [f"    max_div = divisor {op}"],
        }

    result = {
        "left": [f"    mults_left = [left] * {(length := strategy['length'])}"],
        "div": [f"    mults_div = [divisor] * {length}"],
    }

    for a, b in strats["mul"]:
        result["left"].append(f"    mults_left[{a}] {(op := imul_or_ilshift(b))}")
        result["div"].append(f"    mults_div[{a}] {op}")

    for i, j, k in strats["add"]:
        if j != k:
            result["left"].append(
                f"    mults_left[{i}] = mults_left[{j}] + mults_left[{k}]"
            )
            result["div"].append(
                f"    mults_div[{i}] = mults_div[{j}] + mults_div[{k}]"
            )
        else:
            result["left"].append(
                f"    mults_left[{i}] = (val := mults_left[{j}]) + val"
            )
            result["div"].append(f"    mults_div[{i}] = (val := mults_div[{j}]) + val")

    return result


def strip(a: int, b: int, c: int) -> tuple[int, int]:
    return (
        a
        >> (
            tail := (
                (t1 := (a ^ a - 1).bit_length() - 1),
                (t2 := (b ^ b - 1).bit_length() - 1),
            )[t1 > t2]
        ),
        b >> tail,
    )


def truncate(a: int, b: int, max_bits: int) -> tuple[int, int]:
    return (
        a
        >> (
            shift := (
                (
                    overflow := (
                        diff := ((l1 := a.bit_length()), (l2 := b.bit_length()))[
                            l1 > l2
                        ]
                        - max_bits
                    )
                    * (diff > 0)
                ),
                (
                    tail := (
                        (t1 := (a ^ a - 1).bit_length() - 1),
                        (t2 := (b ^ b - 1).bit_length() - 1),
                    )[t1 > t2]
                ),
            )[overflow < tail]
        ),
        b >> shift,
    )


SHORTEN = (strip, truncate)


NAMESPACE = {
    "NEWTON_COTES_MIDDLE_INDICES": [],
    "ROMBERG_WEIGHTS": [],
    "gcd": gcd,
    "batched": batched,
    "mpfr": mpfr,
    "set_exp": set_exp,
    "SHORTEN": SHORTEN,
}


def generate_source_Newton(level: int) -> list[tuple[str, str]]:
    assert isinstance(level, int) and 0 <= level <= 7
    base_fname = f"Newton_Cotes_integrate_{level+3}"
    formula = NEWTON_COTES_FORMULAS[level]
    data = generate_indices(formula)
    lst = NAMESPACE["NEWTON_COTES_MIDDLE_INDICES"]
    length = len(lst)
    lst.append(data["indices"])
    den_mults = data["den_mults"]
    num_scale, den_scale = formula["scale"]
    l = len(den_mults)
    special_assembly = assemble_numerator_multiples(
        assemble_strategy(data["num_mults"])
    )
    denominator_assembly = assemble_denominator_multiples(assemble_strategy(den_mults))
    special_assembly.extend(denominator_assembly["div"])
    general_assembly = special_assembly + denominator_assembly["left"]
    special_assembly = "\n".join(special_assembly)
    general_assembly = "\n".join(general_assembly)
    base_args = {
        "num_scale": mul_or_lshift(num_scale),
        "den_scale": mul_or_lshift(den_scale),
        "last_den": "max_div" if l == 1 else "mults_div[-1]",
        "extrema_weight": imul_or_ilshift(formula["extrema"]),
        "indices_length": length,
    }
    if l > 1:
        loop_body0 = FUNCTION_MAINLOOP
        loop_body1 = FUNCTION_MAINLOOP_1
    else:
        loop_body0 = FUNCTION_ALTLOOP
        loop_body1 = FUNCTION_ALTLOOP_1
    if middle := data["middle"]:
        middle_weight = mul_or_lshift(middle)
        extra_iteration0 = FUNCTION_LAST_ITERATION.format(weight=middle_weight)
        extra_iteration1 = FUNCTION_LAST_ITERATION_1.format(weight=middle_weight)
    else:
        extra_iteration0 = extra_iteration1 = ""
    return [
        (
            (fname_0 := f"{base_fname}_type_0"),
            FUNCTION_TEMPLATE_0.format_map(
                base_args
                | {
                    "func_name": fname_0,
                    "initialize": INITIALIZE_TYPE_0,
                    "assemble": general_assembly,
                    "loop_body": loop_body0,
                    "extra_iteration": extra_iteration0,
                }
            ),
        ),
        (
            (fname_1 := f"{base_fname}_type_1"),
            FUNCTION_TEMPLATE_0.format_map(
                base_args
                | {
                    "func_name": fname_1,
                    "initialize": INITIALIZE_TYPE_1,
                    "assemble": general_assembly,
                    "loop_body": loop_body0,
                    "extra_iteration": extra_iteration0,
                }
            ),
        ),
        (
            (fname_2 := f"{base_fname}_type_2"),
            FUNCTION_TEMPLATE_1.format_map(
                base_args
                | {
                    "func_name": fname_2,
                    "initialize": INITIALIZE_TYPE_2,
                    "assemble": special_assembly,
                    "loop_body": loop_body1,
                    "extra_iteration": extra_iteration1,
                }
            ),
        ),
        (
            (fname_3 := f"{base_fname}_type_3"),
            FUNCTION_TEMPLATE_1.format_map(
                base_args
                | {
                    "func_name": fname_3,
                    "initialize": INITIALIZE_TYPE_3,
                    "assemble": special_assembly,
                    "loop_body": loop_body1,
                    "extra_iteration": extra_iteration1,
                }
            ),
        ),
    ]


FUNCTION_SOURCES = {}
for i in range(8):
    for fname, source in generate_source_Newton(i):
        exec(compile(source, "<dynamic>", "exec"), NAMESPACE)
        FUNCTION_SOURCES[fname] = source


def integrate_by_parts(
    low_num: int,
    low_den: int,
    high_num: int,
    high_den: int,
    level: int,
    integrand: callable,
    integrator: callable,
    max_digits: int = None,
    reduce: bool = False,
) -> tuple[int, int]:
    max_bits = max_digits * 30 if max_digits else 0
    shorten = SHORTEN[bool(max_digits)]
    low_num //= (common := gcd(low_num, low_den))
    low_den //= common
    high_num //= (common := gcd(high_num, high_den))
    high_den //= common
    left = low_num * high_den // (common := gcd(low_den, high_den))
    span = high_num * low_den // common - left
    divisor = low_den * high_den // common
    total = 1 << level
    points = [(low_num, low_den)] * (total + 1)
    points[-1] = (high_num, high_den)
    level_up = level + 1
    multiple = left
    starts = [left] * level_up
    for i in range(1, level_up):
        starts[i] = multiple = multiple + multiple

    multiple = divisor
    denominators = [divisor] * level_up
    for i in range(1, level_up):
        denominators[i] = multiple = multiple + multiple

    step = span + span
    multiple = span
    lim = total >> 1
    odds = [span] * lim
    for i in range(1, lim):
        odds[i] = multiple = multiple + step

    power = 1
    powers = [1] * level_up
    for i in range(1, level_up):
        powers[i] = power = power + power

    for index, start, step in zip(range(level, 0, -1), powers, powers[1:]):
        initial = starts[index]
        denominator = denominators[index]
        for i, odd in zip(range(start, total, step), odds):
            points[i] = (initial + odd, denominator)

        del odds[lim:]
        lim >>= 1

    result_num, result_den = 0, 1
    it = iter(points)
    next(it)
    fracs = [
        integrator(num1, den1, num2, den2, integrand)
        for (num1, den1), (num2, den2) in zip(points, it)
    ]
    while (length := len(fracs)) > 1:
        carry = fracs.pop(-1) if length & 1 else None
        fracs = [
            shorten(a * d + b * c, b * d, max_bits)
            for (a, b), (c, d) in batched(fracs, 2)
        ]
        if carry:
            fracs.append(carry)

    result_num, result_den = fracs[0]

    if reduce:
        common = gcd(result_num, result_den)
        result_num //= common
        result_den //= common

    return result_num, result_den


def arctan_derivative(x: int, y: int) -> tuple[int, int]:
    return y * y, x * x + y * y


def arctan_Newton(x: int, y: int, order: int) -> tuple[int, int]:
    assert 3 <= order <= 10
    return NAMESPACE[f"Newton_Cotes_integrate_{order}_type_3"](x, y, arctan_derivative)


def arctan_parts(
    x: int, y: int, order: int, level: int, max_digits: int = None
) -> tuple[int, int]:
    assert 3 <= order <= 10
    return integrate_by_parts(
        0,
        1,
        x,
        y,
        level,
        arctan_derivative,
        NAMESPACE[f"Newton_Cotes_integrate_{order}_type_0"],
        max_digits,
    )

Test:

In [48]: n, d = arctan_parts(123, 456, 4, 6); n/d
Out[48]: 0.26346654103491746

In [49]: math.atan2(123, 456)
Out[49]: 0.26346654103491746

In [50]: n, d = arctan_parts(123, 456, 4, 6, 2); n/d
Out[50]: 0.26346654103491746

In [51]: n, d = arctan_parts(123, 456, 5, 5, 2); n/d
Out[51]: 0.26346654103491746

In [52]: n, d = arctan_parts(123, 456, 6, 4, 2); n/d
Out[52]: 0.26346654103491746

In [53]: n, d = arctan_parts(123, 456, 7, 3, 2); n/d
Out[53]: 0.26346654103491746

In [54]: n, d = arctan_parts(123, 456, 8, 2, 2); n/d
Out[54]: 0.26346654103491746

In [55]: n, d = arctan_parts(123, 456, 9, 2, 2); n/d
Out[55]: 0.26346654103491746

In [56]: n, d = arctan_parts(123, 456, 10, 1, 2); n/d
Out[56]: 0.26346654103491746

In [57]: %timeit arctan_parts(123, 456, 4, 6)
655 μs ± 11.2 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

In [58]: %timeit arctan_parts(123, 456, 4, 6, 2)
398 μs ± 15.7 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

In [59]: %timeit arctan_parts(123, 456, 4, 7, 2)
778 μs ± 7.98 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

In [60]: %timeit arctan_parts(123, 456, 4, 8, 2)
1.7 ms ± 20.6 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

In [61]: %timeit arctan_parts(123, 456, 4, 8)
4.77 ms ± 92.1 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [62]: %timeit arctan_parts(123, 456, 4, 7)
1.67 ms ± 12.7 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

In [63]: %timeit arctan_parts(123, 456, 5, 5)
315 μs ± 18.1 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

In [64]: %timeit arctan_parts(123, 456, 5, 5, 2)
224 μs ± 9.12 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

In [65]: %timeit arctan_parts(123, 456, 6, 4)
167 μs ± 4.52 μs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

In [66]: %timeit arctan_parts(123, 456, 6, 4, 2)
131 μs ± 1.36 μs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

In [67]: %timeit arctan_parts(123, 456, 7, 3)
76.4 μs ± 1.17 μs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

In [68]: %timeit arctan_parts(123, 456, 7, 3, 2)
70.2 μs ± 1.35 μs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

In [69]: %timeit arctan_parts(123, 456, 8, 2)
44 μs ± 1.89 μs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

In [70]: %timeit arctan_parts(123, 456, 8, 2, 2)
41.7 μs ± 1.61 μs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

In [71]: %timeit arctan_parts(123, 456, 9, 2)
45.5 μs ± 1.25 μs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

In [72]: %timeit arctan_parts(123, 456, 9, 2, 2)
44.3 μs ± 1.29 μs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

In [73]: %timeit arctan_parts(123, 456, 10, 1)
23.7 μs ± 1.01 μs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

In [74]: %timeit arctan_parts(123, 456, 10, 1, 2)
24.2 μs ± 1.08 μs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

In [75]: %timeit arctan_parts(123, 456, 10, 2)
50.1 μs ± 1.3 μs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

In [76]: %timeit arctan_parts(123, 456, 10, 3)
110 μs ± 375 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

In [77]: %timeit arctan_parts(123, 456, 10, 4)
266 μs ± 9.49 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

In [78]: %timeit arctan_parts(123, 456, 10, 5)
704 μs ± 8.64 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

In [79]: %timeit arctan_parts(123, 456, 10, 6)
1.88 ms ± 23.8 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

In [80]: %timeit arctan_parts(123, 456, 10, 2, 2)
49 μs ± 1.44 μs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

In [81]: %timeit arctan_parts(123, 456, 10, 3, 2)
97.1 μs ± 415 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

In [82]: %timeit arctan_parts(123, 456, 10, 4, 2)
195 μs ± 1.55 μs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

In [83]: %timeit arctan_parts(123, 456, 10, 5, 2)
395 μs ± 14.7 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

In [84]: %timeit arctan_parts(123, 456, 10, 6, 2)
774 μs ± 10.5 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

In [85]: %timeit arctan_parts(123, 456, 4, 5)
273 μs ± 11.1 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

In [86]: %timeit arctan_parts(123, 456, 4, 5, 2)
212 μs ± 9.68 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

Previously arctan_parts(123, 456, 10, 5) 1.08 milliseconds (1080 microseconds) and arctan_parts(123, 456, 4, 5) took 303 microseconds, now the first test without truncation, only takes 704 microseconds, it is \$(1080-704)/1080=0.34\overline{814}\$ increase in performance, and with the optimization it is \$(1080-395)/1080=0.634\overline{259}\$ increase in performance, the second test takes 273 microseconds without truncation, \$(303-273)/303=0.099\overline{0099}\$ increase, with the final optimization now it is 212 microseconds, \$(303-212)/303=0.\overline{3003}\$ increase in performance.