Tuesday, April 1, 2014

Melbourne's Weather and Cross Correlations

During a lunchtime discussion among recent GCaP class attendees, the topic of weather came up and I casually mentioned that the weather in Melbourne, Australia, can be very changeable because the continent is so old that there is very little geographical relief to moderate the prevailing winds coming from the west.

In general, Melbourne is said to have a mediterranean climate, but it can also be subject to cold blasts of air coming up from Antarctic regions at any time, but especially during the winter. Fortunately, the island state of Tasmania acts as something of a geographical barrier against those winds. Understanding possible relationships between these effects presents an interesting exercise in correlation analysis.

Gathering Weather Data

Weather data for all major Australia cities are available from the Bureau of Meteorology. The subsequent discussion will employ weather records for the past calendar year (2013) collected from Perth, in Western Australia, and Hobart and Launceston, in Tasmania. The city of Perth has been in the news lately because it's the base for aircraft searching for wreckage of Malaysian Airlines flight MH 370. The available weather indicators include daily min and max temperatures and rainfall.

Figure 1 shows maximum temperatures in degrees Celsius. The trough occurs in the middle of the calendar year because that's the winter season in Australia.

Which city is most strongly correlated with Melbourne's temperatures? It's impossible to decide based on the raw data alone. To answer such questions more rigorously we can use the cross correlation function (CCF) in R.

Cross Correlation Plots

Applying the ccf function to the data in Fig. 1:

df.mel  <- read.table("~/.../mel.csv",header=TRUE,sep=",")
df.per  <- read.table("~/.../per.csv",header=TRUE,sep=",")
df.hob  <- read.table("~/.../hob.csv",header=TRUE,sep=",")
df.laun <- read.table("~/.../laun.csv",header=TRUE,sep=",")
mel.ts  <- ts(df.mel$MaxT)
per.ts  <- ts(df.per$MaxT)
hob.ts  <- ts(df.hob$MaxT)
laun.ts <- ts(df.laun$MaxT)


produces the plots shown in Fig. 2.

Like a ripple in a pond, there can be a delay or lag between an event exhibiting itself in one time series and it's effect showing up in the other time series. So, simply calculating the correlation coefficient at the same point in time for both series is not sufficient.

The CCF is defined as the set of correlations (height of the vertical line segments in Fig. 2) between two time series $x_t + h$ and $y_t$ for lags $h = 0, \pm1, \pm2, \ldots$. A negative value for $h$ represents a correlation between the x-series at a time before $t$ and the y-series at time $t$. If, for example, the lag $h = -3$, then the cross correlation value would give the correlation between $x_t - 3$ and $y_t$. Negative line segments correspond to events that are anti-correlated.

The CCF helps to identify lags of $x_t$ that could be predictors of the $y_t$ series.

  1. When $h < 0$ (left side of plots in Fig. 2), $x$ leads $y$.
  2. When $h > 0$ (right side of plots in Fig. 2), $x$ lags $y$.
For the weather correlation analysis, we would like to identify which series is leading or influencing the Melbourne time series.

Interpreting the CCF Plots

The dominant or fundamental signal over 365 days in Fig. 1 resembles one period of a sine wave. The first row in Fig. 3. shows two pure sine waves (red and blue) that are in phase with each other (left column). The correlation plot (right column ) shows a peak at $h=0$, in the middle of the plot, indicating that the two curves are most strongly correlated when there is no horizontal displacement between the curves.

The second row in Fig. 3. shows sine waves that are 90 degrees out of phase with each other (left column ). The correlation plot (right column ) shows that these two curves are most weakly correlated at zero lag. Conversely, they are more strongly correlated at $h=-16$ (left side of CCF plot) or anti-correlated at $h=+16$ (right side of CCF plot).

The third row in Fig. 3 is similar to the first row but with some Gaussian noise added to both signals. The correlation plot shows a slight loss of symmetry but otherwise doesn't indicate much additional structure because the randomness of the noise in both signals tends to cancel out.

Figure 4 has the same signals as Fig. 3 but with 365 sample points to match the weather data in Fig. 1. This has the effect of broadening out the correlation plots and, indeed, they do more closely resemble the correlation plots in Fig. 2.

# Perth-Melbourne analysis
# produces the numerical output:

Autocorrelations of series ‘X’, by lag

  -22   -21   -20   -19   -18   -17   -16   -15   -14   -13   -12   -11   -10    -9    -8    -7    -6 
0.542 0.498 0.488 0.511 0.525 0.545 0.563 0.550 0.549 0.554 0.588 0.576 0.599 0.594 0.549 0.540 0.615 
   -5    -4    -3    -2    -1     0     1     2     3     4     5     6     7     8     9    10    11 
0.631 0.617 0.656 0.595 0.508 0.475 0.512 0.559 0.605 0.618 0.555 0.500 0.512 0.543 0.548 0.536 0.533 
   12    13    14    15    16    17    18    19    20    21    22 
0.525 0.504 0.523 0.520 0.489 0.478 0.494 0.501 0.484 0.503 0.519 

pm <- ccf(per.ts,mel.ts)
max.pmc <- max(pm$acf)
# produces the output:[1] 0.6564855
# At what lag?
pm$lag[which(pm$acf > max.pmc-0.01 & pm$acf < max.pmc+0.01)]
# produces the output: [1] -3

We can carry out the same analysis for Hobart-Melbourne time series:

# Hobart-Melbourne analysis
hm <- ccf(hob.ts,mel.ts)
max.hmc <- max(hm$acf)
pm$lag[which(hm$acf > max.hmc-0.01 & hm$acf < max.hmc+0.01)]
# 0.8269252 occurs at lag h = 0

and Launceston-Melbourne time series:

# Launceston-Melbourne analysis
lm <- ccf(laun.ts,mel.ts)
max.lmc <- max(lm$acf)
lm$lag[which(lm$acf > max.lmc-0.01 & lm$acf < max.lmc+0.10)]
# Two lags satisfy this criterion
# 0.801 occurs at lag h = 0
# 0.791 occurs at lag h = -1

Next, we need to interpret all these statistics.

Analysis and Conclusions

It does indeed take about three days for weather to cross the 2000 miles between Perth and Melbourne. But the correlation at lag $h = -3$ is only 0.66, whereas it's closer to 0.80 for Hobart and Launceston. To help interpret these correlation values we note that the GIS coordinates of the respective cities are:

The prevailing westerly winds originate with the Roaring Forties in the Indian Ocean. That name is a reference to 40 degrees south latitude. Melbourne is located at about 38 degrees south latitude. Perth, on the other hand, is located at a latitude much further north; it's even north of Sydney! In addition, there is a considerable desert region (roughly two thirds of the breadth of the continent) between Perth and Melbourne. Therefore, we can expect the correlations between Perth and Melbourne temperatures to be weaker than those associated with the Tasmanian cities.

Hobart is further south than Melbourne and, although it's on the eastern side of the Tasmanian island, there is no other land mass between the longitudes at Perth and Hobart. Hence, Hobart and Melbourne are more strongly correlated than Perth at zero lag.

Launceston is closest to Melbourne by latitude and, at zero lag, has a similar correlation to that for Hobart. No surprise there. There is one difference, however. A similar correlation exists at $h = -1$, which means Launceston leads Melbourne by a day, even though it is slightly east of Melbourne by about two degrees longitude. How can that be? One possibility is that it represents the effect of more southerly winds, such as those originating with the Screaming Sixties, circulating approximately counter-clockwise around the east coast of Tasmania. Cross correlated cross winds.

I'll cover more about time series analysis in the upcoming Guerrilla Data Analysis Techniques class.

These are the same winds used by the early European traders, like the Dutch and Portugese, to reach the "spice islands" in the Indonesian archipelago. Hence the term trade winds. The basic idea was, you sail down the west coast of Africa, round the Cape of Good Hope, catch the Roaring Forties across the Indian Ocean, and hang a left at about 100 degrees east longitude. All this was at a time before longitude could be determined accurately. In 1616, the Dutch naval explorer, Dirk Hartog, missed that off-ramp and found himself staring at the west coast of Australia. Hence, the name of the continent was changed from Terra Australis (Southern Land) to Neu Holland (New Holland) on maps of the day. In a similar way, another Dutch navigator, Abel Tasman, sighted the west coast of Tasmania and named it Van Diemen's Land. Two centuries later, it was renamed after Tasman.

Thursday, March 13, 2014

Performance of N+1 Redundancy

How can you determine the performance impact on SLAs after an N+1 redundant hosting configuration fails over? This question came up in the Guerrilla Capacity Planning class, this week. It can be addressed by referring to a multi-server queueing model.

N+1 = 4 Redundancy

We begin by considering a small-N configuration of four hosts where the load is distributed equally to each of the hosts. For simplicity, the load distribution is assumed to be performed by some kind of load balancer with a buffer. The idea of N+1 redundancy is that the load balancer ensures all four hosts are utilized. The common misconception is that none of the hosts should consume more than 75% of their available capacity. Referring to Fig. 1, the total consumed capacity is assumed to be $4 \times 3/4 = 3$ or 300% of the configuration (rather than 400%) so that, when one of the hosts fails, its lost capacity is compensated by redistributing that load across the remaining three available hosts.

The circles in Fig. 1 represent hosts and rectangles represent incoming requests buffered at the load-balancer. The blue area in the circles signifies the available capacity of a host, whereas white signifies unavailable capacity. When one of the hosts fails, its load must be redistributed across the remaining three hosts. What Fig. 1 doesn't show is the performance impact of this capacity redistribution.

N+1 = 4 Performance

The performance metric of interest is response time as it pertains to service targets expressed in SLAs. To assess the performance impact of failover, we model the N+1 configuration as an M/M/4 queue with per-server utilization constrained to be no more than 75% busy.

When a failover event occurs, the configuration becomes an M/M/3 queue. The corresponding response time curves are shown in Fig. 2. The y-axis is the response time, R, expressed as multiples of the service period, S. A typical scenario is where the SLA (horizontal line) corresponds to maximum or near maximum utilization. The SLA in this case is a mean response time no greater than 1.45 service periods.

On failover, upon which only three hosts remain available, the SLA will be exceeded because the utilization of each host will be heading for 100% due to the additional load. (See Fig. 1) Correspondingly, this has the effect of pushing the response time very high up the M/M/3 curve. In order to maintain the SLA, the load would have to be reduced so that it corresponds to a lower utilization of 68.25%. Fig. 3 shows this effect in more detail.

In practice, proper capacity planning, such as the M/M/m queueing models employed in this discussion, would have revealed that the maximum host utilization should not have exceeded 68.25% busy in the N+1 configuration.

Large-N Performance

With a large number of hosts the difference in response time after failover becomes less significant. This follows from the fact that response-time curves for an M/M/m queue are flatter at high utilizations when the number of servers, m, is large. The effect is illustrated in Fig. 4 for N+1 = 16 hosts. (cf. Fig. 2)

However, the most common installations are small-N configurations of the type discussed in the previous section. Therefore, preserving your SLA requires capacity planning based on host utilizations that match your SLA targets.

Thanks to the GCaP class participants for doing a group-edit on this post in real time.

Wednesday, March 12, 2014

Modeling the Mythical Man-Month

This article was originally posted in 2007. When I updated the image today (in 2014), it reappeared with the more recent date and I don't know how to override that (wrong) timestamp. This seems to be a bug in Blogger.

In the aftermath of a discussion about software management with my colleague Steve Jenkin, I looked up the Mythical Man-Month concept in wikipedia. The main thesis of Fred Brooks, often referred to as "Brooks's law [sic]," is simply stated as:

Adding manpower to a late software project makes it later.
In other words, some number of cooks are necessary to prepare a dinner, but adding too many cooks in the kitchen can inflate the delivery schedule.

Wednesday, February 19, 2014

Facebook Meets Florence Nightingale and Enrico Fermi

Highlighting Facebook's mistakes and weaknesses is a popular sport. When you're the 800 lb gorilla of social networking, it's inevitable. The most recent rendition of FB bashing appeared in a serious study entitled, Epidemiological Modeling of Online Social Network Dynamics, authored by a couple of academics in the Department of Mechanical and Aerospace Engineering (???) at Princeton University.

They use epidemiological models to explain adoption and abandonment of social networks, where user adoption is analogous to infection and user abandonment is analogous to recovery from disease, e.g., the precipitous attrition witnessed by MySpace. To this end, they employ variants of an SIR (Susceptible Infected Removed) model to predict a precipitous decline in Facebook activity in the next few years.

Channeling Mark Twain, FB engineers lampooned this conclusion by pointing out that Princeton would suffer a similar demise under the same assumptions.

Irrespective of the merits of the Princeton paper, I was impressed that they used an SIR model. It's the same one I used, in R, last year to reinterpret Florence Nightingale's zymotic disease data during the Crimean War as resulting from epidemic spreading.

Another way in which FB was inadvertently dinged by incorrect interpretation of information—this time it was the math—occurred in the 2010 movie, "The Social Network" that tells the story of how FB (then called Facemash) came into being. While watching the movie, I noticed that the ranking metric that gets written on a dorm window (only in Hollywood) is wrong! The correct ranking formula is analogous to the Fermi-Dirac distribution, which is key to understanding how electrons "rank" themselves in atoms and semiconductors.

"The reports of my death have been greatly exaggerated."

Tuesday, February 18, 2014

Guerrilla Classes in March 2014

The upcoming GBoot and GCaP training classes are your fast track to enterprise performance and capacity management. You can now register entirely online using either your corporate or personal credit card.

New topics include:

  • The effect of think-time settings in load tests
  • Closed vs. open workloads in load testing

Classic topics include:

  • There are only 3 performance metrics
  • How performance metrics are related to each another
  • How to quantify scalability with the Universal Scalability Law (USL)
  • IT Infrastructure Library (ITIL) for Guerrillas
  • The Virtualization Spectrum from hyperthreads to hyperservices

Entrance Larkspur Landing hotel Pleasanton California

As usual, all classes are held at our lovely Larkspur Landing Pleasanton location in California. Attendees should bring their laptops to the class as course materials are provided on a flash drive. Larkspur Landing also provides free wi-fi Internet in their residence-style rooms as well as the training room.

Thursday, January 2, 2014

Monitoring CPU Utilization Under Hyper-threading

The question of accurately measuring processor utilization with hyper-threading (HT) enabled came up recently in a Performance Engineering Group discussion on Linked-in. Since I spent some considerable time looking into this issue while writing my Guerrilla Capacity Planning book, I thought I'd repeat my response here (slightly edited for this blog), in case it's useful to a broader audience interested in performance and capacity management. Not much has changed, it seems.

In a nutshell, the original question concerned whether or not it was possible for a single core to be observed running at 200% busy, as reported by Linux top, when HT is enabled.

This question is an old canard (well, "old" for multicore technology). I call it the "Missing MIPS" paradox. Regarding the question, "Is it really possible for a single core to be 200% busy?" the short answer is: never! So, you are quite right to be highly suspicious and confused.

You don't say which make of processor is running on your hardware platform, but I'll guess Intel. Very briefly, the OS (Linux in your case) is being lied to. Each core has 2 registers where inbound threads are stored for processing. Intel calls these AS (Architectural State) registers. With HT *disabled*, the OS only sees a single AS register as being available. In that case, the mapping between state registers and cores is 1:1. The idea behind HT is to allow a different application thread to run when the currently running app stalls; due to branch misprediction, bubbles in the pipeline, etc. To make that possible, there has to be another port or AS register. That register becomes visible to the OS when HT is enabled. However, the OS (and all the way up the food chain to whatever perf tools you are using) now thinks twice the processor capacity is available, i.e., 100% CPU at each AS port.

Wednesday, December 25, 2013

Response Time Percentiles for Multi-server Applications

In a previous post, I applied my rules-of-thumb for response time (RT) percentiles (or more accurately, residence time in queueing theory parlance), viz., 80th percentile: $R_{80}$, 90th percentile: $R_{90}$ and 95th percentile: $R_{95}$ to a cellphone application and found that the performance measurements were not completely consistent. Since the relevant data only appeared in a journal blog, I didn't have enough information to resolve the discrepancy; which is ok. The first job of the performance analyst is to flag performance anomalies but most probably let others resolve them—after all, I didn't build the system or collect the measurements.

More importantly, that analysis was for a single server application (viz., time-to-first-fix latency). At the end of my post, I hinted at adding percentiles to PDQ for multi-server applications. Here, I present the corresponding rules-of-thumb for the more ubiquitous multi-server or multi-core case.

Single-server Percentiles

First, let's summarize the Guerrilla rules-of-thumb for single-server percentiles (M/M/1 in queueing parlance): \begin{align} R_{1,80} &\simeq \dfrac{5}{3} \, R_{1} \label{eqn:mm1r80}\\ R_{1,90} &\simeq \dfrac{7}{3} \, R_{1}\\ R_{1,95} &\simeq \dfrac{9}{3} \, R_{1} \label{eqn:mm1r95} \end{align} where $R_{1}$ is the statistical mean of the measured or calculated RT and $\simeq$ denotes approximately equal. A useful mnemonic device is to notice the numerical pattern for the fractions. All denominators are 3 and the numerators are successive odd numbers starting with 5.