Wednesday, August 3, 2011

Q-Q Plots and Power Laws in Database Performance Data

I'm in the process of putting together some slides on how to apply Quantile-Quantile plots to performance data. Q-Q plots are a handy tool for visually inspecting how well your data matches a known probability distribution (prob dsn). If the match is good, the data should line up more or less diagonally in the Q-Q plot. A common usage is to verify normality, i.e. how well the data matches a Normal or Gaussian dsn. In fact, this usage is so common that R even has a separate function called qqnorm() for doing just that.
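For anyone who wants to play along without R, here is a minimal Python sketch (using scipy, on synthetic data) of what qqnorm() is doing under the hood: pair the ordered sample values with the theoretical normal quantiles and see how close the points sit to a straight line.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
sample = rng.normal(loc=100.0, scale=15.0, size=500)  # fake "response times"

# probplot pairs the theoretical quantiles with the ordered data and also
# returns a least-squares line; r near 1 means the Q-Q points hug the diagonal.
(theo_q, ordered_vals), (slope, intercept, r) = stats.probplot(sample, dist="norm")
print(round(r, 3))
```

Since this sample really is normal, r comes out very close to 1; on real response-time data you would typically see the points peel away from the line in the tails.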

That's all textbook stuff, so I was wondering if anyone had tried it on computer performance data, e.g., response times. No sooner had I fired up the Google tab than blow me down if I don't see a blog post entitled "Q-Q plots to examine SQL execution time distributions". Not only that, but it's written by my colleague from the Hotsos Symposium, Dave Abercrombie, who has also been beavering away at applying my USL model to Oracle performance data. I recommend you take a look at Dave's post because he has tried to apply Q-Q plots to some tricky SQL performance data that looks benign when plotted as a histogram, but is actually very difficult to tame. This is demonstrated by the fact that Dave's attempts to model the underlying prob dsn using Q-Q plots don't meet with much success. Nonetheless, his point that it's better to use Q-Q plots than to eyeball histograms is a good one.

Since I'm a bit of an old hand at constructing models out of data, and Dave was kind enough to provide the data for download from his blog, I decided to see what I could do with it. After a little fooling around, it struck me that there are actually three separate regions in these data. Therefore, it's a lost cause trying to model them with a single prob dsn. The following cluster of 4 plots shows what I came up with.

The top-left plot shows a histogram of the raw data set (some 500 query times), but with one very important difference compared with what Dave did. I ranked the data according to decreasing SQL elapsed times. I then partitioned the data into 3 sub-regions (A, B and C). The top-right plot shows a portion of the data (region A) on double-log axes. The near linearity of those data demonstrates rather clearly that they most likely conform to a power law dsn, like Zipf's law. In fact, this looks like a textbook example. A power law means that the SQL queries are highly correlated in some way.
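As a sanity check on the idea (using synthetic data, not Dave's), here is a Python sketch of the same procedure: rank the times in decreasing order, then fit a straight line in log-log space. A roughly constant slope is the power-law signature.

```python
import numpy as np

rng = np.random.default_rng(1)
# Stand-in for region A: classical Pareto(alpha=1.5) "query times".
# (numpy's pareto draws a Lomax variate, so add 1 to get classical Pareto.)
times = rng.pareto(a=1.5, size=200) + 1.0

# Rank in decreasing order, as in the post, then regress log(time) on log(rank).
ranked = np.sort(times)[::-1]
ranks = np.arange(1, len(ranked) + 1)
slope, intercept = np.polyfit(np.log10(ranks), np.log10(ranked), 1)
print(round(slope, 2))  # roughly -1/alpha for Pareto data
```

If the data were exponential rather than power law, this log-log plot would curve downward instead of staying straight.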

Both plots on the bottom row (for regions B and C, respectively) also exhibit linear behavior, but those data are plotted on log-linear axes, so they cannot follow a power law. It does suggest, however, that those queries are exponentially distributed. Moreover, since these SQL query times are periods or intervals, that further suggests they belong to Poisson processes with two different rates. It also implies that whatever correlations are present in region A are not present in regions B and C.
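The log-linear check can be sketched the same way (again on synthetic data, not Dave's): for an exponential dsn the survival function is P(T > t) = exp(-t/scale), so plotting the log of the empirical survival fraction against t should give a straight line, whose slope -1/scale recovers the rate.

```python
import numpy as np

rng = np.random.default_rng(7)
# Stand-in for region B: exponential "query times" with scale (mean) 0.5.
times = rng.exponential(scale=0.5, size=300)

# Empirical log-survival vs. t should be linear for exponential data.
t = np.sort(times)
log_surv = np.log(1.0 - np.arange(len(t)) / len(t))
slope, intercept = np.polyfit(t, log_surv, 1)
print(round(-1.0 / slope, 2))  # recovered scale, near 0.5
```

Power-law data run through this same plot would bend upward, which is what distinguishes the two regimes.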

The blue lines in the above plots are the respective regression fits (using R) for each of the proposed models.

Now let's see what the Q-Q plots have to tell us about how well these 3 separate models do.

The top-left plot shows the entire data set compared with a normal dsn and, as Dave observed, it is clearly not normal (in every sense of the word). The top-right plot corresponds to the power law model. Although it doesn't produce a perfect diagonal line, it's better than the normal plot, and I know that power laws can be notoriously difficult to model. At the very least, I think it says not to reject the power law on these data alone.
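One way to make that "better than the normal plot" judgment less eyeball-driven (my own sketch, not something from Dave's post) is to compare the correlation coefficient of the Q-Q points for each candidate dsn, which scipy's probplot hands back directly:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
data = rng.exponential(scale=1.0, size=400)  # synthetic skewed "query times"

# The closer r is to 1, the better the candidate dsn's quantiles line up.
(_, _), (_, _, r_norm) = stats.probplot(data, dist="norm")
(_, _), (_, _, r_expon) = stats.probplot(data, dist="expon")
print(r_expon > r_norm)  # exponential quantiles fit exponential data better
```

The same comparison could be run per region against normal, exponential, and power-law candidates to rank the competing models numerically.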

The bottom 2 plots would seem to confirm the validity of the exponential models in their respective regions. I'll be interested to hear if there is any support for this multi-modal approach in the Oracle world.

1 comment:

Tom said...

I encountered the QQ-plot when writing a paper about think times: I was trying to fit data to a lognormal curve and I used the fitdistrplus package in R. Figure 2 in the paper has the QQ-plot as well as 3 other plots (PP-plot being one of them) that are automatically generated. The fitdist function does all of the work and computes the parameters for the desired distribution. The paper also includes the script (#1) and it is very simple. I cannot claim expertise in this area and may not have done everything correctly (I wrote up what I did, not necessarily the proper procedure).