Tuesday, June 22, 2010

Linear Modeling in R and the Hubble Bubble

Here is a scatter plot with the coordinate labels deliberately omitted.

Figure 1.

Do you see any trends? How would you model these data?

It just so happens that this scatterplot is arguably the most famous scatterplot in history. One aficionado, writing more than forty years after its publication, commented skeptically [1]:
"[The] data points were consequently spread all over the [x-y] plot. Nevertheless, he was somehow able to deduce a 'roughly linear' relation between them."
There's a hint: a linear fit. "What's the big deal?" you ask. "Isn't that the sort of obvious modeling assumption any statistician, worth his or her salt, would make?" Good point but these ain't just any old data. These data contain the mind of God, as an Einstein or Hawking might say.

To see this, let's label the coordinates and do a linear fit through the origin.

Figure 2.

Now you know. Figure 2 shows the data that Edwin Hubble used in 1929 to deduce that there was a simple linear relationship between the radial distance to a galaxy (x-axis) and the speed with which it is moving relative to us (y-axis). It says that the further away it is, the faster it seems to be moving. For calibration, a Megaparsec (Mpc) is about 3 million light years. The slope of the line corresponds to Hubble's constant, H0 = 423.94 km.s-1.Mpc-1 in Figure 2. The units don't matter too much for our discussion.

Today, it's difficult to appreciate the impact of this conclusion. In 1929, it was only just being realized that the universe extended beyond our own Milky Way galaxy. Today, we know that the Milky Way is a mere speck in the vastness of the universe. Also, the Big Bang theory was a very new idea and had to compete with other theories like the Steady State model. The point is, if you claimed a linear relationship in 1929, you were saying something very profound about the structure of the universe so, you'd better have your ducks lined up. Hubble worked pretty hard at herding ducks. How do I know that? Because here's the plot that appears in Hubble's 1929 paper (see Figure 1).

Figure 3.

You can see immediately that these data points are more tightly clustered around the line of best fit than the data points in Figure 2. There are actually two lines in Figure 3 because Hubble used two different aggregations of his data to further support his conclusions, but that need not concern us here. The important difference is that my Figure 2 comes from simply plotting the distance-velocity (r,v) data in Table 1 of Hubble's paper and it clearly doesn't match his improved Figure 3. Was Hubble stacking his data?

The telescopes used by Hubble are located on Earth, which is part of the solar system of planets. Hubble knew that the solar system is also moving through space relative to the stars. Think of our solar system as being like a car going down the highway. When looking at scenery out of the car window, closer objects in the scene, e.g., signage, cows, etc., appear to move past us more rapidly than the more distant mountains. Similarly with the apparent motion of the stars or galaxies, except that motion in space is 3-dimensional (x,y,z) rather than primarily 2-dimensional (x,y) on a highway. Put differently, the bigger the angle we move through relative to the object, the more it appears to move. Hubble took this angular effect into account with the three additional X, Y, Z terms in his linear model:

r K + X cos(α) cos(δ) + Y sin (α) cos(δ) + Z sin(δ) = v

which is unnumbered and appears about half way between Table 1 and Table 2 in his paper.

The factor K is his notation for H0. Hubble didn't provide the extra angular-motion data in his paper, but if you're an astronomer, you can determine it. I'm not an astronomer but Prof. Ned Wright at UCLA is, and he kindly provided me with those additional data so that I was able to apply the angular corrections to Table 1 and reproduce Hubble's Figure 3. Here's the R script that does all the tedious work.

hub <- read.table("~/Hubble-UCLA.data", header=TRUE)

# convert RA to decimal hrs
hrs <- NULL
for(i in 1:dim(hub)[1]) {

# convert DEC to decimal degrees
for(i in 1:dim(hub)[1]) {
    sgn<-ifelse(substr(decs,1,1) == "-",-1,+1)

# Add HRS and DEG as cols to dataframe
hub<-cbind(hub,HRS=hrs)     # adds col name "HRS"
hub<-cbind(hub,DEG=degs)    # adds col name "DEG"

# NOTE: trig funs in R want radians, not degrees.
A<-hub$HRS * (360/24) *  (pi/180)   # convert RA to radians.
D<-hub$DEG * (pi/180)               # convert DEC to radians
acx <- cos(A)*cos(D)                # angular x component
acy <- sin(A)*cos(D)                # angular y component
acz <- sin(D)                       # angular z component

hfit <- lm(hub$v ~ hub$r + acx + acy + acz + 0) # fits thru origin.
summary(hfit)       # the R^2 value
Ho<-coef(hfit)[1]   # Hubble's const for corrected data
cat(sprintf("Fitted Hubble's const: %.2f (cf. Hubble paper K = 465 +/-50)", H0))

# Cartesian components of solar peculiar velocity 

plot(hub$r,hub$v-(vox*acx + voy*acy + voz*acz),
    xlim=c(0,2.25),ylim=c(-250,1250),main="Hubble's 1929 Corrected Data",xlab="Galactic distance    (Mpc)",ylab="Recessional velocity (km/s)",pch=19)
abline(v=c(0,1,2),h=c(0,500,1000))  # gridlines
The black dots in the resulting plot (Figure 4) are an exact match with those in Figure 3.

Figure 4.

The data from Table 1 (in Hubble's paper) are shown in gray, while the corrected data are shown as black dots in Figure 4. It's clear that the closer galaxies (i.e., the data points nearer the origin of the plot) have been displaced more than those on the right-hand side of the plot because of the larger angular correction. But this is not Edwin Hubble stacking his data. This is Edwin Hubble herding his ducks. As a consequence of this herding, the line of best fit has rotated slightly anticlockwise, and consequently the corrected Hubble's constant: H0 = 465.19 (cf. K = 465 +/-50 in Hubble's paper) is bigger than the fit obtained from Figure 2.

Why is that small increase important? Because it's going in the wrong direction! It's getting bigger, when it should be smaller (as we know today). Hubble even rounded it up to H0 = 500. This turns out to be badly wrong because the inverse of Hubble's constant corresponds to the age of the universe (assuming the Big Bang theory) and, using a value of 500 predicts an age for the universe of about 2 billion years. However, it was already known that the rocks in the Earth are about 4.5 billion years old. How could the universe be younger than the Earth? This discrepancy fueled a lot of debate in the years immediately following publication of Hubble's paper.

You may also be wondering if Hubble was just lucky with his assumption of a linear regression model. Could it be that as we look further out into the universe that the linear model doesn't hold up? Amazingly, the answer is, no. Today, with vastly better telescopes and measurement techniques [2]—many coming from the space-based telescope named in honor of Hubble himself— not only is the relationship linear, but the slope is known to much greater accuracy. The accepted value is closer to H0 ~ 70; almost an order magnitude smaller than Hubble's original estimate!

And that's the important message for performance analysis. The data accessible to Hubble was not very good, by today's standards, and that state of affairs also caused him to misinterpret some of the measurements. Yet, because he already had certain expectations in his mind (the linear model), he was able to see his way through the rather messy data and reach the correct conclusion.

As I've said in Guerrilla training classes: having some expectations is better than not having any expectations. Having data is only half the story, at best. You need to have a framework or model by which to assess those data. I'll say more about this example in the upcoming Guerrilla Data Analysis Techniques class.

  1. S. Weinberg, Gravitation and Cosmology: Principles and Applications of the General Theory of Relativity, Wiley 1972
  2. This is what modern data (Proc. Natl. Acad. Sci., 2003) looks like like:

    It shows measurements out to more than 2 billion light years (700 Mpc). On that scale, the area of Figure 4 corresponds to the tiny red square at the origin.

1 comment:

AlexGilgur said...

Fascinating stuff! Visually, the data in the Footnotes graph seem to fall on a sublinear curve as the distance increases. Does it make sense? Would be interesting to look at the residuals and how they trend as a function of the distance.