Tuesday, April 1, 2014

Melbourne's Weather and Cross Correlations

During a lunchtime discussion among recent GCaP class attendees, the topic of weather came up and I casually mentioned that the weather in Melbourne, Australia, can be very changeable because the continent is so old that there is very little geographical relief to moderate the prevailing winds coming from the west.

In general, Melbourne is said to have a Mediterranean climate, but it can be hit by cold blasts of air coming up from Antarctic regions at any time, especially during the winter. Fortunately, the island state of Tasmania acts as something of a geographical barrier against those winds. Understanding possible relationships between these effects presents an interesting exercise in correlation analysis.

Gathering Weather Data

Weather data for all major Australian cities are available from the Bureau of Meteorology. The subsequent discussion will employ weather records for the past calendar year (2013) collected from Perth, in Western Australia, and Hobart and Launceston, in Tasmania. The city of Perth has been in the news lately because it's the base for aircraft searching for wreckage of Malaysia Airlines flight MH370. The available weather indicators include daily minimum and maximum temperatures and rainfall.

Figure 1 shows maximum temperatures in degrees Celsius. The trough occurs in the middle of the calendar year because that's the winter season in Australia.

Which city is most strongly correlated with Melbourne's temperatures? It's impossible to decide based on the raw data alone. To answer such questions more rigorously we can use the cross correlation function (CCF) in R.

Cross Correlation Plots

Applying the ccf function to the data in Fig. 1:

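# Read the 2013 daily weather records for each city
df.mel  <- read.table("~/.../mel.csv",header=TRUE,sep=",")
df.per  <- read.table("~/.../per.csv",header=TRUE,sep=",")
df.hob  <- read.table("~/.../hob.csv",header=TRUE,sep=",")
df.laun <- read.table("~/.../laun.csv",header=TRUE,sep=",")
# Convert the daily maximum temperatures to time series objects
mel.ts  <- ts(df.mel$MaxT)
per.ts  <- ts(df.per$MaxT)
hob.ts  <- ts(df.hob$MaxT)
laun.ts <- ts(df.laun$MaxT)
# Cross correlate each city with Melbourne (each call draws one plot)
ccf(per.ts,mel.ts)
ccf(hob.ts,mel.ts)
ccf(laun.ts,mel.ts)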


produces the plots shown in Fig. 2.

Like a ripple in a pond, there can be a delay or lag between an event exhibiting itself in one time series and its effect showing up in the other time series. So, simply calculating the correlation coefficient at the same point in time for both series is not sufficient.

The CCF is defined as the set of correlations (the heights of the vertical line segments in Fig. 2) between two time series $x_{t+h}$ and $y_t$ for lags $h = 0, \pm1, \pm2, \ldots$. A negative value for $h$ represents a correlation between the x-series at a time before $t$ and the y-series at time $t$. If, for example, the lag $h = -3$, then the cross correlation value would give the correlation between $x_{t-3}$ and $y_t$. Negative line segments correspond to events that are anti-correlated.

The CCF helps to identify lags of $x_t$ that could be predictors of the $y_t$ series.

  1. When $h < 0$ (left side of plots in Fig. 2), $x$ leads $y$.
  2. When $h > 0$ (right side of plots in Fig. 2), $x$ lags $y$.

For the weather correlation analysis, we would like to identify which series is leading or influencing the Melbourne time series.
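
The sign convention is easy to get backwards, so here is a minimal sketch on synthetic data (not the weather records). It assumes a made-up series y that simply repeats x three steps later, plus some noise; the names x.full, x and y, the three-step delay, and the noise level are all illustrative assumptions. R's ccf(x, y) reports at lag k the estimated correlation between x[t+k] and y[t], so the peak should land at a negative lag, meaning x leads y.

```r
# Synthetic illustration only: y is x delayed by three steps, plus noise.
set.seed(1)
n      <- 365
x.full <- rnorm(n + 3)                     # made-up white-noise signal
x      <- x.full[4:(n + 3)]                # x_t
y      <- x.full[1:n] + rnorm(n, sd = 0.5) # y_t = x_{t-3} + noise

# Cross correlation without plotting
r        <- ccf(x, y, lag.max = 10, plot = FALSE)
best.lag <- r$lag[which.max(r$acf)]        # lag with the largest correlation
print(best.lag)

# Hand-rolled check: correlate x three steps back against y now
manual <- cor(x[1:(n - 3)], y[4:n])
print(c(ccf = r$acf[r$lag == -3], manual = manual))
```

The two printed correlations should be close but not identical, because ccf uses the full-series means and divides by the full series length rather than the number of overlapping pairs.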

Interpreting the CCF Plots

The dominant or fundamental signal over 365 days in Fig. 1 resembles one period of a sine wave. The first row in Fig. 3 shows two pure sine waves (red and blue) that are in phase with each other (left column). The correlation plot (right column) shows a peak at $h=0$, in the middle of the plot, indicating that the two curves are most strongly correlated when there is no horizontal displacement between the curves.

The second row in Fig. 3 shows sine waves that are 90 degrees out of phase with each other (left column). The correlation plot (right column) shows that these two curves are most weakly correlated at zero lag. Conversely, they are more strongly correlated at $h=-16$ (left side of CCF plot) or anti-correlated at $h=+16$ (right side of CCF plot).

The third row in Fig. 3 is similar to the first row but with some Gaussian noise added to both signals. The correlation plot shows a slight loss of symmetry but otherwise doesn't indicate much additional structure because the randomness of the noise in both signals tends to cancel out.

Figure 4 has the same signals as Fig. 3 but with 365 sample points to match the weather data in Fig. 1. This has the effect of broadening out the correlation plots and, indeed, they do more closely resemble the correlation plots in Fig. 2.
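
The sine wave experiment is simple to reproduce. The sketch below is only an approximation of what Figs. 3 and 4 show: the number of sample points (n = 100), the lag window (lag.max = 30), and the variable names are assumptions, since the post does not give the values used for the figures. It builds one period of a sine wave, a copy delayed by 90 degrees, and compares the lag at which each cross correlation peaks.

```r
# Assumed setup: one period of a sine wave sampled at n points.
n  <- 100
tt <- 0:(n - 1)
s1 <- sin(2 * pi * tt / n)               # reference wave
s2 <- sin(2 * pi * tt / n - pi / 2)      # same wave delayed by 90 degrees

# Cross correlations without plotting
in.phase  <- ccf(s1, s1, lag.max = 30, plot = FALSE)
out.phase <- ccf(s1, s2, lag.max = 30, plot = FALSE)

lag.in   <- in.phase$lag[which.max(in.phase$acf)]
lag.out  <- out.phase$lag[which.max(out.phase$acf)]
zero.out <- out.phase$acf[out.phase$lag == 0]

print(lag.in)    # lag of the in-phase peak
print(lag.out)   # lag of the out-of-phase peak
print(zero.out)  # out-of-phase correlation at zero lag
```

Because s2 trails s1, the out-of-phase peak should appear on the left (negative lag) side, with essentially no correlation at zero lag. Dropping plot = FALSE draws the corresponding CCF plots.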

# Perth-Melbourne analysis
pm <- ccf(per.ts,mel.ts)
pm
# produces the numerical output:

Autocorrelations of series ‘X’, by lag

  -22   -21   -20   -19   -18   -17   -16   -15   -14   -13   -12   -11   -10    -9    -8    -7    -6 
0.542 0.498 0.488 0.511 0.525 0.545 0.563 0.550 0.549 0.554 0.588 0.576 0.599 0.594 0.549 0.540 0.615 
   -5    -4    -3    -2    -1     0     1     2     3     4     5     6     7     8     9    10    11 
0.631 0.617 0.656 0.595 0.508 0.475 0.512 0.559 0.605 0.618 0.555 0.500 0.512 0.543 0.548 0.536 0.533 
   12    13    14    15    16    17    18    19    20    21    22 
0.525 0.504 0.523 0.520 0.489 0.478 0.494 0.501 0.484 0.503 0.519 

# What is the largest correlation?
max.pmc <- max(pm$acf)
max.pmc
# produces the output: [1] 0.6564855
# At what lag?
pm$lag[which(pm$acf > max.pmc-0.01 & pm$acf < max.pmc+0.01)]
# produces the output: [1] -3

We can carry out the same analysis for the Hobart-Melbourne time series:

# Hobart-Melbourne analysis
hm <- ccf(hob.ts,mel.ts)
max.hmc <- max(hm$acf)
hm$lag[which(hm$acf > max.hmc-0.01 & hm$acf < max.hmc+0.01)]
# 0.8269252 occurs at lag h = 0

and for the Launceston-Melbourne time series:

# Launceston-Melbourne analysis
lm <- ccf(laun.ts,mel.ts)
max.lmc <- max(lm$acf)
lm$lag[which(lm$acf > max.lmc-0.01 & lm$acf < max.lmc+0.01)]
# Two lags satisfy this criterion
# 0.801 occurs at lag h = 0
# 0.791 occurs at lag h = -1
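
The tolerance window used above works, but sorting the correlations is a little more direct when several lags are close to the maximum. The following is a minimal sketch of that alternative; top.lags is a hypothetical helper, not part of the original analysis, and the demonstration uses a made-up seasonal series (a cosine plus noise) rather than the Bureau of Meteorology data.

```r
# Hypothetical helper: return the k lags with the largest cross correlations.
top.lags <- function(x, y, k = 3, lag.max = NULL) {
  cc  <- ccf(x, y, lag.max = lag.max, plot = FALSE)
  out <- data.frame(lag = as.vector(cc$lag), ccf = as.vector(cc$acf))
  out <- out[order(out$ccf, decreasing = TRUE), ]  # largest correlation first
  head(out, k)
}

# Synthetic stand-in for a year of daily maximum temperatures (assumed values)
set.seed(2014)
days   <- 1:366
x.full <- 20 + 8 * cos(2 * pi * days / 365) + rnorm(366, sd = 2)
x <- x.full[2:366]                         # "upstream" series
y <- x.full[1:365] + rnorm(365, sd = 0.1)  # same signal one day later

best <- top.lags(x, y, k = 2)
print(best)
```

With the series from the post, the equivalent call would be something like top.lags(laun.ts, mel.ts, k = 2), which should list the two leading lags together with their correlations.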

Next, we need to interpret all these statistics.

Analysis and Conclusions

It does indeed take about three days for weather to cross the 2000 miles between Perth and Melbourne. But the correlation at lag $h = -3$ is only 0.66, whereas it's closer to 0.80 for Hobart and Launceston. To help interpret these correlation values we note that the GIS coordinates of the respective cities are:

The prevailing westerly winds originate with the Roaring Forties in the Indian Ocean. That name is a reference to 40 degrees south latitude. Melbourne is located at about 38 degrees south latitude. Perth, on the other hand, is located at a latitude much further north; it's even north of Sydney! In addition, there is a considerable desert region (roughly two thirds of the breadth of the continent) between Perth and Melbourne. Therefore, we can expect the correlations between Perth and Melbourne temperatures to be weaker than those associated with the Tasmanian cities.

Hobart is further south than Melbourne and, although it's on the eastern side of the Tasmanian island, there is no other land mass between the longitudes at Perth and Hobart. Hence, Hobart and Melbourne are more strongly correlated than Perth at zero lag.

Launceston is closest to Melbourne by latitude and, at zero lag, has a similar correlation to that for Hobart. No surprise there. There is one difference, however. A similar correlation exists at $h = -1$, which means Launceston leads Melbourne by a day, even though it is slightly east of Melbourne by about two degrees longitude. How can that be? One possibility is that it represents the effect of more southerly winds, such as those originating with the Screaming Sixties, circulating approximately counter-clockwise around the east coast of Tasmania. Cross correlated cross winds.

I'll cover more about time series analysis in the upcoming Guerrilla Data Analysis Techniques class.

These are the same winds used by the early European traders, like the Dutch and Portuguese, to reach the "spice islands" in the Indonesian archipelago. Hence the term trade winds. The basic idea was, you sail down the west coast of Africa, round the Cape of Good Hope, catch the Roaring Forties across the Indian Ocean, and hang a left at about 100 degrees east longitude. All this was at a time before longitude could be determined accurately. In 1616, the Dutch naval explorer, Dirk Hartog, missed that off-ramp and found himself staring at the west coast of Australia. Hence, the name of the continent was changed from Terra Australis (Southern Land) to Neu Holland (New Holland) on maps of the day. In a similar way, another Dutch navigator, Abel Tasman, sighted the west coast of Tasmania and named it Van Diemen's Land. Two centuries later, it was renamed after Tasman.


Andrew Taylor said...

I had never come across the ccf function before.

Unfortunately, of course, my data is never so nice as to just be day-of-the-year type data. Instead it is longitudinal data. A few projects ago I had someone who had given different numbers of interventions to individuals each day they remained in the hospital and tracked their mobility level each day. While analyzing it, I kept thinking, well it doesn't make sense that interventions would affect day-of mobility level, but some future mobility level.

I spent the next few hours lagging at different intervals and re-running the regression and seeing if things were better.

The ccf plotting function gave me the inspiration to write something very similar but that takes into account that observations are being clustered within individuals.

Just wanted to say thanks for the inspiration.

Neil Gunther said...

Glad to be of service, Andrew.

Speaking of "longitudinal data," you inspired me to add a footnote about how the inability to accurately determine longitude led early Dutch navigators to accidentally discover Neu Holland (Australia) and Van Diemen's Land (Tasmania).

martin catt said...

wow! good service.what aGathering Weather Data!i like this.