Friday, February 24, 2012

On the Accuracy of Exponentials and Expositions

The following is a slightly edited version of my response to a Discussion on the Linkedin CPPE group, which is accessible to Members Only. It's written in the style of a journal reviewer. The original Discussion topic was based on a link to a blog-post. I've been asked to make my Linkedin review more widely available so, here tiz...

The blog-post Capacity Planning on a Cocktail Napkin is a really good example of a really bad explanation. There are so many things that are misleading, at best, and flat-out wrong, at worst, it's hard to know where to begin (or where to stop). Nevertheless, I'll try to keep it brief [I failed in that endeavor. — njg].

The author applies the equation:

\begin{equation} E = λ \end{equation}

Why? What is that equation? We don't know because the author does not say yet what all the symbols mean. It's all very well to impress people by slinging equations around, but it's more important to say what the equations actually mean. After all, the author might have chosen the wrong one.

Eventually, we learn that E stands for Erlangs; which is actually a teletraffic measure of capacity. The author possibly got it from here (or similar) which states, quite clearly, what the symbols mean, viz., "average call-holding time, h" and later, "...average holding time/service time, h". It is very important to understand that 'h' in eqn. (1) is a service time (I'll call it 'S' hereafter) that does not include buffering or wait time (W). The connection between the S and W metrics is the residence time (R) where:

\begin{equation} R = S + W \end{equation}

Why is this important? Because the author is using only S [or h in eqn. (1)], which is typically a shorter time than R in (2), and that means the author is possibly underestimating capacity.

Little's Law
Equation (1) is actually a version of Little's law which I can rewrite slightly as:

\begin{equation} ρ = λ S \end{equation}

In this definition, E corresponds to the utilization (ρ) of the server or switch. The more general case of Little's law, however, includes buffering time (W) and is therefore expressed in terms of R as:

\begin{equation} N = λ R \end{equation}

where N is the level of concurrency or requests in the entire system, not just the occupancy at the telephone switch or service process.

Little's law in eqn.(4) applied to this load average data tells us that the average number resident in the system during the 20 minute window (indicated by the width of the black box) is N = 3.845 processes. This result can be seen by adding up all the blocks belonging to the 20 brown columns (1-minute load average data) between 21:30 and 21:50. That total brown area is 3.845 and the equivalent rectangular area (Little's law) is obtained by making the black box 0.1923 units high—the average load during that period. Clearly, some columns are above the top of the black box and some are below. In this picture, the column heights correspond to the instantaneous number of processes in the system during each minute. We can also think of those heights as representing the instantaneous arrival rate during each minute. The average arrival rate is λ = 0.1923 processes per minute and R = 20 minutes. Hence, N = λ R, which yields N = 3.845 processes.

To summarize, so far: eqns. (1) or (3) tell us the average number of requests in service, whereas eqn. (4) tells us the average number of requests in residence. The latter includes those not yet being serviced but waiting for service. Clearly therefore: N ≥ E.

Exponentials et al.
The other really confusing part of the "Napkin" post is the way the author's explanation jumps between Pareto, Poisson, Exponential and Normal dsns. (Pity the poor student.)

  1. If the arrivals are Pareto distributed (i.e., highly correlated), as the author assumes so as to get his 0.93 users/sec number, then the Poisson/Exponential assumption (i.e., statistical independence) cannot apply. But the author inexplicably decides to use the Exponential dsn anyway.
  2. It's a property of the (continuous) Exponential dsn for interarrival times that the mean == standard deviation == √var* (cf. the discrete Poisson dsn for event occurrences, where the mean == variance). These properties do not apply to the Normal dsn.
  3. The author claims that the Normal dsn approximates the Exponential dsn, to wit: "an arrivals rate of about 0.93 per second—then we can approximate with a normal distribution, where the mean is 0.93 per second, and the variance is also 0.93 seconds." That's news to me!

Referring to the above plot, the Normal or Gaussian dsn is an exponential function with an exponent that is quadratic in the random variable (X), i.e., has the form exp(−X2). The Exponential dsn, on the other hand, has an exponent that is linear in the positive random variable (X ≥ 0), i.e., has the form exp(−X). Because of the X2 in the exponent of the Normal dsn, it is symmetric about it's mean (μ = 1 in the above plot). Conversely, the Exponential dsn is highly asymmetric about it's mean (λ = 1 in the above plot). It is therefore groundless and confusing to suggest that the Exponential dsn can be approximated by a Normal dsn without a significant loss of accuracy.

Apparently unaware of such details, the author proceeds to use a Normal approximation with "Mean is λ, variance σ2 = λ". However, for an Exponential dsn with rate λ (i.e., inverse time), the mean is 1/λ and the variance is (1/λ)2. That's because 1/λ is the interarrival time; the average time between the arrival of requests.

This is a very important distinction because bursts of arrivals will be statistically much greater with Exp-distributed traffic (due to the above-mentioned asymmetry), and that could be very significant for any capacity planning assumptions‡ §. To see this, think percentiles.

The 95th percentile is approximately 2σ. Whether you use his (incorrect) identification σ2 = λ or σ2 = (1/λ)2—it's hard to know what to use here because everything is going wrong by this stage of his exposition—you get values near 2. If instead you apply the Exponential dsn, the 95th percentile is approximately 3 times the mean or a value around 3.25 (in the above plot); a very big difference. Therefore, applying the author's (incorrect) Normal approximation will generally incur a large underestimation error.

Cocktail-napkin discussions are only as useful as their accuracy. Otherwise, I can think of better uses for napkins.

Related posts: Load Testing Think Time Distributions

* Erlang (the engineer) reported in his 1909 paper that his measurements of arriving telephone calls were indeed Poisson distributed. The same is true today for Help Center calls and so on.
See Perl:::PDQ Section 3.5 for more about using the Gamma dsn as an approximation to the Exponential dsn.
I find it astounding the FOSS load-test tools like JMeter and, even worse, very expensive tools like LoadRunner do not implement the Exponential dsn as an option. Keep in mind that JM and LR are performance modeling tools: they are workload simulators. As I point out in my GCaP classes, the Exponential dsn is important for several reasons:
  1. It offers a higher degree of variation (fluctuations) from the client request-load than either Uniform or Gaussian dsns, thereby promoting the detection of unanticipated fails, like buffer overflow.
  2. It offers a convenient opportunity to compare with other performance tools, like PDQ, which also makes use of Exponential arrival patterns.
  3. It is the correct load profile if you are modeling a Help Center, teletraffic system or similar.
§ Bursts of Exponential arrivals should not be confused with Pareto-type correlations (as the "Napkin" author does). In my GCaP classes, I show a movie of Exponential arrivals where clusters of arrivals are clearly visible. That suggests the presence of correlations in what is otherwise supposed to be statistically independent arrivals. This paradox is resolved by watching the movie for a longer time and noting the appearance of gaps (i.e., no arrivals) of the same or similar length to the previously observed clusters. In other words, the bursts are balanced out by gaps so that overall, the arrivals are indeed statistically independent.


Neil Gunther said...

Added a new plot to explain the connection between Little's law (N, the average number in the system) and Unix/Linux load-average data represented by the usual time series one sees coming from modern performance monitoring tools.

AlexGilgur said...

Neil, great reminder! Independence between the lambda and the h is a very often overlooked assumption indeed.

Matteo said...

Hi Dr Gunther, I agree with your comment on load tools that do not implement exponential distribution to generate the workload. Do you think is it possible to emulate an exponential distribution of requests by combining more runs of a tools in parallel using another distribution (e.g. uniform), different average and different quantities? I was thinking something like:
to emulate 10 requests per seconds
run 1: 10 events with delay 1 sec
run 2: 18 events with delay 2 sec
run 3: 25 events with delay 3 sec


Neil Gunther said...

Hi Matteo, Interesting idea, but I suspect it will be thwarted by a well-known theorem in probability theory, which states roughly that adding the resulting collection metrics (random variables with whatever underlying dsn---uniform in your suggestion) will merely tend to emulate a Normal dsn. And that is arguably less effective than a uniform dsn.

Neil Gunther said...

For more details about how to generate exponentially distributed delays, see my latest blog post.

Matteo said...

Hi Dr Gunther,
I've seen your new post, it seems should be easy to include a development on the tools of they wanted to. Meanwhile I tried to see if my original idea was feasible with a simulation. Central limit theorem should apply when the summed ramdom variables have the same average. In this case I tried to do something like this,, rotated by 90 degrees clockwise. I've loaded the data (x, y) from a exp dsn (generated with excel) into a table and run a script to generate, for each x, a set of random uniform numbers (to fulfill each sub rectangle area as per the picture) with a range from 0 to y. I then summed the occurrences for each x and the graph resembles the original exponential curve.
This is the PL/SQL code I used:
type tab is table of number index by binary_integer;
l_t tab;
l_n number;
i number;
j number;
for a in (select x, round((exp_dsn - nvl(lag(exp_dsn) over(order by x desc),0))*x*1000) ex from mat_exp order by x desc) loop

for i in 1..a.ex loop
select abs(mod(dbms_random.random, a.x)) into l_n from dual;
if (l_t.exists(l_n)) then
l_t(l_n):= l_t(l_n) +1;
l_t(l_n):= 0;
end if;
end loop;
end loop;

for j in (select x from mat_exp order by x) loop

end loop;

Sorry for not using R.
How much wrong is this approach? :-)


Matteo said...

errata corrige:
...with a range from 0 to y...
should be
...with a range from 0 to x...

Neil Gunther said...

Matteo, I'm not about to debug your code here but, if you are starting with a table of Exp values and by "rotation" you actually mean pivoting about the diagonal if the xy-axes then, reading off using a uniform dsn is the same as coin flipping and should produce Exp-distd variates.

In other words, it sounds like you are implementing the inverse transform, described in my blog post, as a lookup table.