# Pi Day at UTS

The Faculty of Engineering and Information Technology celebrated Pi-Day March 14 2011 with IBM.

Well-known celebrity mathematician Professor Jonathan Borwein, an architect of the Borwein Algorithm for computing π, entertained us with an insightful expose of the weird and wonderful number π.

Professor Borwein is one of the world's most imaginative and creative mathematicians. He shared his passion for π, while explaining how it connects the scientific and engineering wonders of the ancient Greeks to 21^{st} century advanced computing.

The Pi Day seminar was introduced by ABC Science TV host Paul Willis.

During the lead up to the seminar Glenn Wightwick and Andrew Mattingly (IBM Australia) worked with Professor Borwein and David Bailey (Lawrence Berkeley National Laboratory) to calculate the 60 trillionth digit of π² in hexidecimal. The three major calculations made by IBM include (i) 106 digits of π² base 2 at the ten trillionth place base 64, (ii) 94 digits of π² base 3 at the ten trillionth place base 729, and (iii) 141 digits of Catalan's constant base 2 at the ten trillionth place base 409. Catalan's constant is the simplest number not proven to be irrational and so on the frontier of rationality. According to Professor Borwein this work represents the largest single computation done for any mathematical object to date.

Key technical results were reported by the US Department of Energy and a related article about cracking an impossible calculation appeared in the Australian. Professor Borwein's slides are available. You can find details about the famous computation in a paper(~6 MBytes) published with the American Mathematical Society.

The Computational aspects of the Calculations were described by Glenn Wightwick and Andrew Mattingly (IBM Australia) as follows:

#### Partitioning and Restartability

These calculations were performed on a 4-rack BlueGene/P system at IBM's Benchmarking Centre in Rochester, Minnesota, USA. This is a shared facility, so over the several months that we have been carrying out these calculations, at any given time, we had access to none, some or all of the system. Accordingly, the code was designed to allow a calculation to be split into any number of equal-sized "partitions". We typically chose a partition size of 2048 threads, which corresponds to half a rack, called a "midplane". Provided at least one midplane was available on the system, our calculation could proceed.

To take account of the fact that we did not have continuous access to any portion of the system, the code was designed to recover from a crash or deliberate termination of the calculation and pick up where it left off, when the calculation was restarted.

#### Complexity and Windowing

The BBP "spigot" algorithm to compute the d+1^{st} digit of P(s, b, n, A) to base b, is dominated by a requirement to perform n' x d "modular exponentiations", where n' is the number of non-zero coefficients in A. For each non-zero coefficient, A_{j}, d+1 modular exponentiations are required, of the form b^{(d-k)} mod (kn+j)^{s}, for k= 0, 1, ..., d. Each modular exponentiation is of approximate computational effort, O(log(d-k)), so the total calculation is of complexity O(n'.log(d!)), which is approximately O(n'.d.log(d)).

It is desirable to divide the computation among the available threads (say 2048 threads in 4 partitions) algorithmically, rather than using a queuing system, where each thread completes a modular exponentiation then gets the next one off a queue. The overheads involved in implementing a queuing system for this calculation prove to be very significant. However, when we allocated the work to each thread, using a naive scheme, like "thread 1 does the 1^{st}, 2049^{th}, 4097^{th}, ..., thread 2 does 2^{nd}, 2050^{th}, 4098^{th}, etc), we found that the slowest thread took 20% longer to complete its allotted work than the fastest thread. To get around this problem, we implemented a "windowing" scheme, where each thread performed a block of consecutive modular exponentiations, then skipped over the other threads' blocks to do another block, and so on. By experimentation, we arrived at an optimal block size, or "window" of 1031. Using this window reduced the range of execution times between the slowest and fastest threads to around 2% for the pi-squared calculations and around 6% for the Catalan's constant calculation.

#### Zero Relations

For the calculation of Catalan's constant, David Bailey's BBP Compendium provides two candidate formulae for s=2, b=4096, n=24. One of these formulae has 22 non-zero coefficients, while the other has 18. Clearly, we should choose the latter formula to reduce the computing requirement. However, the existence of multiple, independent formulae for the same constant and the same s, b, n, implies there are one or more "zero relations" which satisfy P(s, b, n, A) = 0. Indeed, for s=2, b=4096, n=24, there are two independent zero relations. By combining these zero relations with one of the formulae for Catalan's constant, we arrived at a formula with only 16 non-zero coefficients, further reducing the computing requirement.