Tuesday, August 23, 2011

Subjugation to the Sigmas

No doubt you've heard about the tyranny of the 9s in reference to computer system availability. You're probably also familiar with the phrase six sigma, either in the context of manufacturing process quality control or the improvement of business processes. As we discovered in the recent Guerrilla Data Analysis Techniques class, the two concepts are related.

Nines   Percent       Downtime/Year         σ Level
4       99.99%        52.596 minutes
5       99.999%       5.2596 minutes        -
6       99.9999%      31.5576 seconds
7       99.99999%     3.15576 seconds       -
8       99.999999%    315.6 milliseconds

In this way, people like to talk about achieving "5 nines" availability or a "six sigma" quality level. These phrases are often bandied about without appreciating:
  1. that nines and sigmas refer to similar criteria.
  2. that high nines and high sigmas are very difficult to achieve consistently.
See the appended Comments below for more details and examples.

To arrive at the third column of the table, you can use the following R function, which computes the downtime per year allowed at a given number of nines. Each additional 9 cuts the allowed downtime by a factor of 10. Hence, the term tyranny.

downt <- function(nines, tunit=c('s','m','h')) {
	tunit <- match.arg(tunit)                  # falls back to 's' when tunit is omitted
	ds <- 10^(-nines) * 365.25*24*60*60        # allowed downtime in seconds per year
	if(tunit == 's') { ts <- 1; tu <- "seconds" }
	if(tunit == 'm') { ts <- 60; tu <- "minutes" }
	if(tunit == 'h') { ts <- 3600; tu <- "hours" }
	return(sprintf("Downtime per year at %d nines: %g %s", nines, ds/ts, tu))
}

> downt(5,'m')
[1] "Downtime per year at 5 nines: 5.2596 minutes"
> downt(8,'s')
[1] "Downtime per year at 8 nines: 0.315576 seconds"
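
The same arithmetic can be vectorized to reproduce the whole Downtime/Year column in one call. This is only a minimal sketch: downtime_seconds is a hypothetical helper name, not part of the original post, and it assumes the same 365.25-day year used in downt.

```r
# Hypothetical helper: allowed downtime in seconds per year for each nines level.
# Assumes a 365.25-day year, as in downt().
downtime_seconds <- function(nines) {
  10^(-nines) * 365.25 * 24 * 60 * 60
}

nines <- 4:8
downtime_table <- data.frame(
  nines   = nines,
  seconds = downtime_seconds(nines),
  minutes = downtime_seconds(nines) / 60
)
print(downtime_table)
```

Each row is one tenth of the row above it, which is the tyranny in numerical form.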

The associated σ levels correspond to the area under the Normal (Gaussian) or "bell shaped" curve within ±kσ of the mean (μ), that is, an interval of width 2kσ centered on μ. Here σ is the standard deviation, as usual.

The corresponding area under the Normal curve can be calculated using the following R function:

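erf <- function(x) 2*pnorm(x*sqrt(2)) - 1   # erf is not in base R; assumed definition via pnorm
sigp <- function(sigma) {
	sigma <- as.integer(sigma)
	apc <- erf(sigma/sqrt(2))                # area within +/- sigma standard deviations of the mean
	return(sprintf("%d-sigma bell area: %10.8f%%; Prob(chance): %e", sigma, apc*100, 1-apc))
}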

> sigp(2)
[1] "2-sigma bell area: 95.44997361%; Prob(chance): 4.550026e-02"
> sigp(5)
[1] "5-sigma bell area: 99.99994267%; Prob(chance): 5.733031e-07"

So, 5σ corresponds to slightly more than 99.9999% of the area under the bell curve, the total area being 100%. It therefore corresponds closely to six 9s availability. The second number computed by sigp is 1 minus that area, interpreted here as the probability that the achieved availability was a fluke. A reasonable mnemonic for some of these values is:
  • 3σ corresponds to a probability of about 0.0027, or roughly 1 in 370 (often rounded, very loosely, to 1 in 1,000), that the result occurred by chance. Four 9s availability sits closer to 4σ.
  • 5σ is roughly a 1 in a million chance, which is like flipping a fair coin and getting 20 heads in a row.
  • 6σ is roughly a 1 in a billion chance that it was a fluke.
Now you see why these goals are easy to covet but hard to achieve.
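
The mnemonic values can be checked numerically. The sketch below, which is not from the original post, tabulates the two-sided tail probability for 3σ through 6σ together with the equivalent odds, the equivalent number of nines, and the equivalent run of fair-coin heads. The function name sigma_tail is hypothetical, and the Normal distribution is assumed throughout.

```r
# Hypothetical helper: probability of falling outside +/- k sigma of the mean
# for a Normal distribution. Using lower.tail = FALSE keeps precision for large k.
sigma_tail <- function(k) 2 * pnorm(k, lower.tail = FALSE)

k <- 3:6
p_tail <- sigma_tail(k)
sigma_table <- data.frame(
  sigma    = k,
  p_chance = p_tail,
  one_in   = round(1 / p_tail),   # odds expressed as "1 in N"
  nines    = -log10(p_tail),      # equivalent number of nines of availability
  heads    = log2(1 / p_tail)     # fair-coin heads in a row with the same odds
)
print(sigma_table)
```

The nines column gives a direct way to line up a σ level with a row of the availability table.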


Efrique said...

How often are distributions - even of sample means - normal out to 6-sigmas?

Efrique said...

How often is it the case that actual data distributions - even of sample averages - are sufficiently close to normal out to 6 sigma that those percentages are at all meaningful?

Neil Gunther said...

That, of course, is part of the point of this post: big talk, little measurement.

Major web sites often do have enough data to support claims about 3 to 5 nines availability.

As an aside, particle physics measurements require 5sigma levels to be considered valid. Current measurements at the LHC, regarding the existence of the Higgs boson, are more like 2sigma.
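
To put a number on the question of Normality in the tails, here is a rough sketch comparing the probability of a 6-standard-deviation excursion under a Normal distribution with the same probability under a heavier-tailed one. The choice of a Student t distribution with 5 degrees of freedom is purely an illustrative assumption, not something measured from real availability data.

```r
k  <- 6   # number of standard deviations from the center
df <- 5   # assumed degrees of freedom for the heavy-tailed stand-in

# Two-sided tail probability under the Normal distribution.
p_normal <- 2 * pnorm(k, lower.tail = FALSE)

# A t variate with df > 2 has standard deviation sqrt(df / (df - 2)),
# so k standard deviations corresponds to this value on the t scale.
x_t <- k * sqrt(df / (df - 2))
p_heavy <- 2 * pt(x_t, df = df, lower.tail = FALSE)

# How many times more likely the excursion is under the heavy-tailed model.
ratio <- p_heavy / p_normal
print(c(normal = p_normal, heavy = p_heavy, ratio = ratio))
```

If the data have tails even moderately heavier than Normal, the "1 in a billion" reading of 6σ can be off by orders of magnitude.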

Efrique said...

My apologies about posting the question twice. The first time, the message I got seemed to suggest it hadn't posted.

Neil Gunther said...

No worries.

I could delete one, but I would have to get really worked up about it. :)

jeff said...

Here's another interesting tidbit: how is the expected deviation of a series, conditional upon being >= 6 sigmas, different from that at any other sigma?

Neil Gunther said...

I'll see that and raise you 5 ...sigma? :)

There's something of a technical contradiction in attempting to apply "6sigma" improvement (e.g., SPC) to computer performance data. The assumption is that the samples are discrete point events (defects), whereas almost all computer performance data are time series where the events have already been averaged over some predefined sampling time interval.

Neil Gunther said...

This just in from Twitter...

@OReillyMedia O'Reilly Media
100% uptime. It's not just about technology. CIO @Reichental discusses the value of good change management: oreil.ly/mTlYDw

Neil Gunther said...

Salesforce.com is holding a big conference in downtown San Francisco at the moment (30,000 attendees) and I happened across this comment in their Wikipedia entry:

"The service has suffered some downtime; during an outage in January 2009 services were unavailable for at least 40 minutes, affecting thousands of businesses."

But that's about four 9s availability!

Of course, they couldn't afford another outage in 2009 in order to maintain that availability level. And that level is the statistical mean, which says nothing about variance. :)
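
The arithmetic behind that remark can be sketched as follows, assuming the quoted 40 minutes was the only outage that year and using the same 365.25-day year as downt. The variable names are illustrative only.

```r
minutes_per_year <- 365.25 * 24 * 60        # same year length as downt()
outage_min <- 40                            # outage quoted from the Wikipedia entry

# Availability achieved if this were the only outage in the year.
availability <- 1 - outage_min / minutes_per_year
nines <- -log10(1 - availability)           # equivalent number of nines

# Downtime allowed per year at four 9s, and what is left after the outage.
budget_min <- 1e-4 * minutes_per_year
remaining_min <- budget_min - outage_min

print(c(availability_pct = 100 * availability, nines = nines,
        budget_min = budget_min, remaining_min = remaining_min))
```

The remaining budget is what any further outage in the same year would have to fit into.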

Unknown said...

Very nice site. I came across this on Google, and I am stoked that I did.
I will definitely be coming back here more often.
Wish I could add to the conversation and bring a bit more to the table, but am just taking in as much info as I can at the moment.

Neil Gunther said...

Hi Carl,

Thank you and welcome!

BTW, this sigmas business is going to become a hot topic on the web as of tomorrow when CERN makes their public statement about the status of the latest Higgs data from the LHC.

For example, and quoting:
"... 5 sigma, meaning that it has just a 0.00006% chance of being wrong. The ATLAS and CMS experiments are each seeing signals between 4.5 and 5 sigma, just a whisker away from a solid discovery claim."

5 sigma is the minimum bar for particle physics data. It's also just as well to keep in mind that the confidence level doesn't tell the whole story.

Last year, a different CERN-related experiment was seeing superluminal neutrinos (i.e., Einstein busters) at the 6-sigma level. After eventually finding the loose connector in their detector, those revolutionary neutrinos suddenly disappeared http://is.gd/WWaj7F

Forearmed by that fiasco, The Higgs boys are very unlikely to have that problem, but they are still going to have to demonstrate that their 4+ sigma bumps are really The Higgs.