Thursday, June 17, 2010

Playing with Primes in R (Part II)

Popping Part III off the stack—where I ended up unexpectedly discovering that the primes and primlist functions are broken in the schoolmath package on CRAN—let's see what prime numbers look like when computed correctly in R. To do this, I've had to roll my own prime number generating function.

Personalizing primes in R

For what I want to show, I mostly need a list of primes in some arbitrary range. Here is a recursive function to do just that:

# "To iterate is human, to recurse, divine." --Peter Deutsch
Prime numbers only have themselves and 1 as factors. All other numbers are called composites. Here are the first 10 primes:

> nmax<-30
> primelist <- (1:nmax)[primegen(1:nmax)]
> primelist
 [1]  2  3  5  7 11 13 17 19 23 29
> length(primelist)
[1] 10
Not everything in schoolmath is broken, so I don't have to abandon it altogether. I can use it to do things like check that this list of numbers are all primes:


> is.prim(primelist)

The point of primes

It was once thought by mathematicians that prime numbers could never have any application to the real world. As usual, that view has turned out to be quite mistaken—especially since the advent of computers. Today, prime numbers are used in cryptography, hash tables, pseudo-random number generation and distributed computing.

My interest, however, is in none of these things. I want to show a connection between prime numbers and scalability in highly parallel computing, which I will get to in Part I. But before I do that, I need to review some of the properties of prime numbers that I will be referring to in Part I. Remember, I'm going backwards because I'm now popping this stack.

Mersenne primes

Since all primes (except 2) are odd numbers, any number M having the form 2p − 1 will be odd, but will it also be prime? Here, you have to be a bit careful. What if we choose the exponent p to be prime, will that guarantee that M is prime? Sounds like a good idea. Let's try it for p = 17, since that's one of the first ten primes we generated:

> p<-17
> M17<-2^p-1
> M17
[1] 131071
> is.prim(M17)
[1] TRUE
Seems to work. What about p = 29; the last of the primes on our list of ten?

> p<-29
> M29<-2^p-1
> M29
[1] 536870911
> is.prim(M29)
> prime.factor(M29)
[1]  233 1103 2089
> 233 * 1103 * 2089
[1] 536870911
M29 has factors other than itself and unity, so it's not prime, even though p is prime. There goes that idea. :(

On the other hand, if you find an M that is prime, then it's guaranteed that p will be prime. This makes it difficult to find Ms that are prime, but computers can help. These particular prime numbers are known as Mersenne primes, which you may have heard of because computers are now used collaboratively to discover and validate their existence using software like GIMPS (the SETI@home of mathematics). [Who said math isn't experimental?] The largest validated prime number of this type contains more than 12 million digits.

A paucity of provable patterns

Mathematicians like to find patterns in order to write down equations and theorems and, more importantly, prove them. When it comes to the primes, however, it is exceedingly difficult to find patterns that can be expressed as equations. But sometimes you can get lucky, as did the eminent mathematician Stan Ulam while doodling at a conference. He discovered a grid pattern amongst the primes when laid out on a logarithmic spiral. See! Daydreaming and doodling are a good thing, despite what your boss might think.

Three fundamental results about primes are:
  1. Goldbach's conjecture (1742)
  2. Gauss' prime number theorem (1792, extended in 1849 and published in 1863)
  3. Riemann's hypothesis (1859)
The first two are easily understood by non-mathematicians and I'll use R to discuss them here. The third (aka RH) is the deepest result and would take us too far afield to discuss here. RH would also explain Gauss' result if it could be proven. To this day, all three results remain unproven. BTW, proving RH has $1 million attached to it, if you have nothing better to do some weekend.

Numbers are polyatomic primes

We can think of every prime number as being atomic, because it's not decomposable into any other factors (by definition). The fundamental theorem of arithmetic states that all numbers are molecules comprised of prime atoms. To see this, let's take a random set of integers between 20 an 30:

> smpl <- sample(20:30)
> cat(smpl,"\n")

25 21 20 26 28 29 23 27 30 22 24 
Now, let's apply the schoolmath functions to see which atoms comprise the composite numbers:

> for(i in 1:length(smpl)) {
+  if(!is.prim(smpl[i])) {
+   pf<-prime.factor(smpl[i])
+   cat(paste(smpl[i],":",sep=""),pf,"\n")
+  }
+ }
25: 5 5 
21: 3 7 
20: 2 2 5 
26: 2 13 
28: 2 2 7 
27: 3 3 3 
30: 2 3 5 
22: 2 11 
24: 2 2 2 3 
Every composite number in the list has been expressed as a molecule composed of only prime numbers as factors. The numbers 23 and 29 in the original list do not appear because they are atomic primes. This is a multiplicative property of prime numbers. In the section called Goldbach's Comet, we'll look at an additivity property of primes.

Gauss' counter

What is both engrossing and exasperating about the primes, is summed up in this quote:
"There are two facts about the distribution of prime numbers which I hope to convince you so overwhelmingly that they will be permanently engraved in your hearts.

The first is that despite their simple definition and role as the building blocks of the natural numbers, the prime numbers... grow like weeds among the natural numbers, seeming to obey no other law than that of chance, and nobody can predict where the next one will sprout.

The second fact is even more astonishing, for it states just the opposite: that the prime numbers exhibit stunning regularity, that there are laws governing their behaviour, and that they obey these laws with almost military precision."
—Don Zagier (1975)

To get a sense of the way the prime numbers are distributed amongst the composite numbers, we can count the number of primes up to an arbitrary number (n), for successively greater values of n. We saw earlier that there were 10 primes below n = 30. I wrote the following R code to count the primes up to n = 200. Since I'm going to the trouble of doing that, I've also computed some related interesting things in R, which I'll explain shortly.

nmax <- 200
intgnd <- function(x) {1/log(x)}        # logarithmic integral
np <- c(0)                              # number of primes up to n
gp <- rep(0,9)                          # Gauss' early estimate
li <- c(NULL)                           # Gauss' later estimate

for (n in 1:nmax) {
        if(n > 1) {
            i <- which(primelist[1:length(primelist)] <= n)
            np <- append(np,max(i))
            li <- append(li,integrate(intgnd,2,n)$value)
        if(n >= 10) gp <- append(gp, n/log(n))

plot(np,type="s",lwd=2,xlab="Number (n)",ylab="Count of primes < n")
The black step-curve in Figure 1 is the count of primes (y-axis) up to n = 200 (x-axis). It belongs to the values of np in the R code. The reason it has a step structure arises from the fact that the same number of primes may exist below successive values of n. For example, the same 4 primes (i.e., 2, 3, 5, 7) are counted when going from n = 7 to n = 10. It's a discrete function. But it should be clear, even from this relatively small range of n that the width of the steps is highly irregular or random looking.

Figure 1. Prime counting functions

The red curve is Gauss' original estimate about the distribution of the primes. It belongs to the values of gp in the R code. The first thing that is remarkable about this curve is just the idea that a discrete counting function could be represented by a continuous function: n/log(n) in R. Granted it is not an exact fit but rather a lower bound (i.e., the count of primes can never be less than the red line). But that's the second remarkable thing. Gauss came up with this lower bound when he was a teenager. What did you do when you were a teenager? So, let's not be too critical.

Years later, around 1849, Gauss managed to come up with a vastly improved approximation, shown in Figure 1 as the blue curve. It belongs to the values of li in the R code. This time it's an upper bound and it's an even tighter bound than the red curve.

In 1859, one of Gauss' former students, Georg Bernhard Riemann (from whom Einstein later borrowed heavily), extended the study of prime numbers from 1-dimension to 2-dimensions, viz., the complex plane; another simple idea that turned out to be revolutionary. It leads to something called the Riemann zeta-function, which we are most definitely not going into here. A consequence of Riemann's function is that it can be used to generate even better approximations to the prime counting function. In Figure 1, I've shown it as the green curve for comparison. You may need to click on the figure to enlarge it. Actually, this is only the leading order approximation. Higher order terms in the Riemann zeta-function make the curve actually take on the step structure of prime counter with more and more accuracy. This is truly incredible! We can see that it works, we just can't prove that it works ... yet. That's why there's a million dollars attached to a proof.

Skewes in the data

Gauss' blue curve is an upper bound on the prime count. It looks like the black and blue curves would never collide, and Gauss thought they didn't. But Figure 1 only goes up to n = 200, which is very small potatoes to a mathematician. In 1914 it was shown that not only do the two curves cross, they cross infinitely many times! OK, so at what value of n does the first intersection occur?

You know what a google is. You know what a googol is, right? Well, that's numerical peanuts. An initial estimate for the first intersection was given by
n = 10101034
a special number known as Skewes number. More recent estimates (2005) now put the first crossing at around 10316; big, but much smaller than originally thought.

Previously, I mentioned that we can see that RH works, we just haven't managed to prove that it works. But seeing depends on where you are looking. In Figure 1, there's no hint that that black and blue curve intersect, but we know for sure that there are an infinite number of such intersections. That's why proofs are important in mathematics.

Goldbach's comet

Earlier, I mentioned that the fundamental theorem of arithmetic relies on the multiplicative property of primes. Here, we look at the additivity properties of primes. In 1742, the Prussian amateur mathematician Christian Goldbach wrote a letter to the great Leonhard Euler and conjectured that every even number greater than 2 can be expressed as the sum of two primes.

Let's check that out in R:


gbp<-function(x) {
    i <- 1
    parts <- NULL
    while (i <= x/2) {
        if (is.prim(i) && is.prim(x-i)) {
            if (i > 1) { 
                parts <- append(parts,sprintf("%d+%d",i,x-i))
        i <- ifelse(x == 4,i+1,i+2)

for (k in seq(4,14,2))  cat(sprintf("%2d: ", k), gbp(k), "\n")
for (k in seq(52,60,2)) cat(sprintf("%2d: ", k), gbp(k), "\n")
which produces:
4:  2+2 
 6:  3+3 
 8:  3+5 
10:  3+7 5+5 
12:  5+7 
14:  3+11 7+7 
52:  5+47 11+41 23+29 
54:  7+47 11+43 13+41 17+37 23+31 
56:  3+53 13+43 19+37 
58:  5+53 11+47 17+41 29+29 
60:  7+53 13+47 17+43 19+41 23+37 29+31 
Notice that there can be more than one way to write the sum of primes as the even numbers get bigger. If we count the number of sums (partitions), much like counting the primes themselves:

i  <- 1
ge <- NULL
ne <- NULL
for (k in seq(4,2000,2)) {
    ne <- append(ne,k)
    ge <- append(ge,length(gbp(k)))

plot(ne,ge,pch=4,col="blue",xlab="Number (n)",ylab="Distinct partitions")
and plot them in Figure 2, we see something called Goldbach's comet.

The trend in the distribution of these points (partitions) is very similar the trends in Figure 1; a kind of regularity in the mess:
"One of the remarkable aspects of the distribution of prime numbers is their tendency to exhibit global regularity and local irregularity. The prime numbers behave like the 'ideal gases' which physicists are so fond of. Considered from an external point of view, the distribution is - in broad terms - deterministic, but as soon as we try to describe the situation at a given point, statistical fluctuations occur as in a game of chance where it is known that on average the heads will match the tail but where, at any one moment, the next throw cannot be predicted. Prime numbers try to occupy all the room available (meaning that they behave as randomly as possible), given that they need to be compatible with the drastic constraint imposed on them, namely to generate the ultra-regular sequence of integers.

This idea underpins the majority of conjectures concerning prime numbers: everything which is not trivially forbidden should actually happen..."
—G. Tenenbaum and M. Mendèfs France (2000)

If you look back over the shapes of the data in Figures 1 and 2, and especially the continuous curves in Figure 1, you will notice that they resemble the shape of throughput curves in performance analysis. And that will bring the next post to the top of the stack: how all this relates to Amdahl's law.


metasoft said...

can't wait to see the next post, viz., relating throughput to prime numbers.

another major contribution of prime numbers is in godel's incompleteness theorem.

Neil Gunther said...

Nice one!

For the curious, here's an elaboration.

Neil Gunther said...

Gödel's proof relies on enumerating theorems (statements) in arithmetic; treated as an axiomatic system. The Gödel numbers (that metasoft is referring to), like any enumeration, are a kind of meta-system. Here are some simple-minded renditions of Gödel's theorem.

Moreover, Gödel's theorem is related to the Halthing problem in computation theory. All very heavy stuff.

In an ironic twist, I just came across this statement about IBM's latest efforts in the AI arena: "Computation is easy. Meta-computation is a bitch."

metasoft said...

prime numbers are also related to quantum mechanics.

just want to share and see if you know anything about this.

Neil Gunther said...

Yes, I was aware of the application of random matrix theory to RH, but it really has not born the kind of fruit that people were hoping for: neither the original Montgomery-Dyson observation nor the more recent quantum-chaos model (to which the Seed piece refers).

See "Quantum physics sheds light on Riemann hypothesis" for a less overstated summary. I wasn't familiar with the RH moment sequence A039622: 1, 1, 2, 42, 24024, 701149020, ... so, thanks for that.

BTW: I find Dyson's comment a bit fatuous. Using that kind of logic, one could go further and say that the photon (the "original" quantum) could have been recognized in the 1860s. Maxwell's equations are the correct equations for the photon field, but it's not just a matter of having the correct equations. Understanding their deeper ramifications, both mathematical and physical, often takes significant amounts of time.

My guess is that, like FLT, RH will be cracked by mathematicians, not physicists. Brian Conrey puts it this way:
"It is my belief that RH is a genuinely arithmetic
question that likely will not succumb to methods
of analysis. There is a growing body of evidence indicating that one needs to consider families of
L-functions in order to make progress on this
difficult question. If so, then number theorists
are on the right track to an eventual proof of RH,
but we are still lacking many of the tools."

Swetha said...

Great work...following all your other posts. Thanks a lot

Neil Gunther said...

Welcome. Feel free to ask any questions.

Patrick said...


also works:



Patrick said...

According to one Henrik there is/was a bug in primes from library(schoolmath), just saying you might double check if it's been fixed or use some other method in this post:

Neil Gunther said...

Thanks for the updated info.