Fast algorithm to calculate Pi in parallel

Question

I am starting to learn CUDA and I think calculating long digits of pi would be a nice, introductory project.

I have already implemented the simple Monte Carlo method which is easily parallelize-able. I simply have each thread randomly generate points on the unit square, figure out how many lie within the unit circle, and tally up the results using a reduction operation.

But that is certainly not the fastest algorithm for calculating the constant. Before, when I did this exercise on a single threaded CPU, I used Machin-like formulae to do the calculation for far faster convergence. For those interested, this involves expressing pi as the sum of arctangents and using Taylor series to evaluate the expression.

An example of such a formula:

enter image description here

Unfortunately, I found that parallelizing this technique to thousands of GPU threads is not easy. The problem is that the majority of the operations are simply doing high precision math as opposed to doing floating point operations on long vectors of data.

So I'm wondering, what is the most efficient way to calculate arbitrarily long digits of pi on a GPU?

Have you looked at this: sites.google.com/a/nirmauni.ac.in/cudacodes/ongoing-projects/… — James Black
– James Black, Commented Jun 5, 2012 at 2:00
I don't think that one does arbitrary precision calculations. — tskuzzy
– tskuzzy, Commented Jun 5, 2012 at 2:05
@JamesBlack: the code you have linked to is utter nonsense. It seems to be an incredibly naive automatic translation of a serial piece of C code into a serial piece of GPU code where many threads compute the identical first 1000 elements of the series expansion. Literally 99.99% of the computation performed by the code is redundant. — talonmies
– talonmies, Commented Jun 5, 2012 at 8:23
Erlang? I think you could use it for parallel processing. Not sure if it helps with algorithm implementation. — Code Droid
– Code Droid, Commented Jul 17, 2012 at 22:33
See also: stackoverflow.com/questions/19/fastest-way-to-get-value-of-pi and stackoverflow.com/questions/14283270/… — assylias
– assylias, Commented May 16, 2013 at 9:45

B. D · Accepted Answer · 2012-06-05 02:22:54Z

You should use the Bailey–Borwein–Plouffe formula

Why? First of all, you need an algorithm that can be broken down. So, the first thing that came to my mind is having a representation of pi as an infinite sum. Then, each processor just computes one term, and you sum them all in the end.

Then, it is preferable that each processor manipulates small-precision values, as opposed to very high precision ones. For example, if you want one billion decimals, and you use some of the expressions used here, like the Chudnovsky algorithm, each of your processor will need to manipulate a billion long number. That's simply not the appropriate method for a GPU.

So, all in all, the BBP formula will allow you to compute the digits of pi separately (the algorithm is very cool), and with "low precision" processors! Read the "BBP digit-extraction algorithm for π"

Advantages of the BBP algorithm for computing π This algorithm computes π without requiring custom data types having thousands or even millions of digits. The method calculates the nth digit without calculating the first n − 1 digits, and can use small, efficient data types. The algorithm is the fastest way to compute the nth digit (or a few digits in a neighborhood of the nth), but π-computing algorithms using large data types remain faster when the goal is to compute all the digits from 1 to n.

So I understand the idea that you compute all the digits you want in (embarassing) parallel. But that isn't a guarantee that this algorithm is efficient; each processor/GPU might be computing information that others could share. Maybe this algorithm is efficient and you just haven't told us how. But if not, you don't wan't to parallelize an inefficient algorithm just because you can. (Perhaps a more useful measure would be digits/transistor or digits/watt produced).
Well, it's a "decent" algorithm. It is not the best one (records are held by other algorithms) but it is still decent. And let's also remember that OP does not wish to break records, but I am starting to learn CUDA and I think calculating long digits of pi would be a nice, introductory project.
Then its a fine scheme to try out. (I've seen people trying to make parallel programs in Python, which is an interpreter. Eh what?)
Keep in mind that BBP doesn't give you decimal digits, only binary.

Andy Richter · Accepted Answer · 2025-09-06 00:42:13Z

Here is a bit of my code. The rest is here: https://github.com/Overboard-code/Pi-Pourri/blob/main/pi-pourri.py You call it with the number of desired digits, like 1_000_000 and the denominators, multipliers and operators. For your example use:

PiMachin(1_000_000,"John Machin 1706",[5,239],[4,1],[1,-1])

Here is the class. It needed some setup to work outside of my program.

from datetime import timedelta
from functools import partial
import sys,time,multiprocessing,logging,os,argparse
try:
    # https://stackoverflow.com/questions/384076/how-can-i-color-python-logging-output
    import Colorer
except ImportError:
    pass
try:
    from gmpy2 import mpz,isqrt,mpfr,atan2,sqrt,get_context,const_pi  # Gumpy2 mpz large ints are ten times faster than python large int
except ImportError:
    raise ImportError('This program requires gmpy2, please insatll. exiting....')


# CONSTANTS
LOG_LEVELS = ["DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL"]
DEFAULT_LOG_LEVEL = "INFO"
LOG2_10 = 3.32192809488736


class PiMachin:

    def __init__(self,ndigits,name,denoms,mults,operators):
        """ Initialization
        :param int digits: digits of PI computation
        :param string name: name for credit on formula
        :param list  denoms a lis of ints for the denomiators
        :param list  mults: a list of ints as multipliyers for the Machin formula
        :param list operators: a list of 1 or -1 to cause addition or subtraction
        """
        self.name = name  # Not used.  Planned to use in logging
        self.ndigits = ndigits
        self.denoms = denoms
        self.mults = mults
        self.operators = operators
        self.xdigits = len(denoms)+7              # Extra digits to reduce trailing error More factors means more error
        self.start_time = 0

    def ArctanDenom(self,d):

        cdigits = self.ndigits+self.xdigits
        get_context().precision=int(cdigits * LOG2_10)
        # Calculates arctan(1/d) = 1/d - 1/(3*d^3) + 1/(5*d^5) - 1/(7*d^7) + ...
        logging.debug('arctan(1/%d) Started  ',d )
        total = mpfr(0)
        arc_start_time = time.time()  # Start the clock for this arctan calulation
        total = mpfr(atan2(mpfr(1),mpfr(d)))
        logging.debug('arctan(1/{}) Done!   {:.2f} seconds.'
            .format(int(d),time.time() - arc_start_time))
        return total,int(1) # I used to calulate arctan by hand.  Now I just use atan2() so just one iteration here
    #
    def compute(self):
        self.start_time = time.time()  # Start the clock for total time
        ndigits = self.ndigits
        cdigits = self.ndigits+self.xdigits
        logging.info("Starting Machin-Like formula to {:,} decimal places"
            .format(ndigits) )
        get_context().precision=int(cdigits * LOG2_10)
        logging.debug("Starting %d Pool threads to calculate arctan values.",
            len(self.denoms) )
        p =  multiprocessing.Pool(processes=(len(self.denoms))) # get some threads for our pool
        results=p.map(self.ArctanDenom, self.denoms) # one thread per arctan(1/xxxx)
        p.close()
        p.join()  # wait for them to finish
        # Now we have the arctan calculations from the pool threads in results[]
        # Apply chosen Formula to the results and calculate pi using mults and signs
        logging.debug ("Now multiplying and summing all arctan results")
        arctanSum = pi = mpfr(0)
        iters = 0
        for i, result in enumerate(results):
            iters += result[1]  # Keep track of this thread's iterations for later
            arctanSum += mpfr(mpfr(self.mults[i])*mpfr(result[0])*mpfr(self.operators[i])) # Add or subtract the product from the accumulated arctans
        pi = mpfr(4) * arctanSum # change pi/4 = x to pi = 4 * x
        # We calculated extra digits to compensate for roundoff error.
        # Chop off the extra digits with format() using D to basically trunc() to ndigits
        return "{0:.{1}Df}".format(pi,self.ndigits),iters,time.time()-self.start_time


ndigits= 1_000_000

obj = PiMachin(ndigits,"John Machin 1706",[5,239],[4,1],[1,-1]) 
# Calculate Pi using selected formula
pi,iters,time_to_calc = obj.compute()
print(f"First 200 of {ndigits:,}\n{pi[:200]}\nTook {time_to_calc:.2f} seconds")

Collectives™ on Stack Overflow

Fast algorithm to calculate Pi in parallel

2 Answers 2

4 Comments

Comments

Linked

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

4 Comments

Comments

Linked

Related