Wednesday, March 21, 2012

How to Generate Exponential Delays

This question arose while addressing Comments on a previous blog post about exponentially distributed delays. One of my ongoing complaints is that many, if not most, popular load-test generation tools do not provide exponential variates as part of a library of time delays or think-time distributions. Not only is this situation bizarre, given that all load tests are actually performance models (and who doesn't love an exponential distribution in their performance models?), but without the exponential distribution you are less likely to observe such things as buffer overflow conditons due to larger than normal (or uniform) queueing fluctuations. Exponential delays are both simple and useful for that purpose, but we are often left to roll our own code and then debug it.

Listing 2.2 on p. 35 of my Perl::PDQ book shows you how to generate exponential variates in Perl. Here, however, I want to use R to compare exponential delays with both the uniform distribution (the default distribution available in all load-test simulators) and the normal distribution (the familiar "bell curve").

The R function that generates exponential variates directly is rexp(n, rate = 1) where, for example, the parameter called rate might correspond to the arrival rate of requests going into your test rig or system under test (SUT). The R programming language uses the same notation as p. 57 of my Perl::PDQ book. In my books and classes, I usually write that rate as $\lambda$ to match conventional queueing theory symbology. The probability density function (PDF), or dexp() in R, is usually written as:

\begin{equation} f(t) = \lambda e^{-\lambda t} \end{equation}

and shown in Figure 1.

The rate is $\lambda$, but the average or statistical mean of (1) is given by the inverse rate or $1/\lambda$. Since $\lambda$ is the average arrival rate, $1/\lambda$ is the average interarrival time as would be seen by the SUT. The view from the load-test client corresponds to a think-time delay of $Z = 1/\lambda$ in your script. In either case, the delay is the time interval between requests, whether departing the client or arriving at the SUT. If you could apply the R function rexp() directly to produce 10 exponentially distributed delays with a mean time of $Z=30$ seconds, you would write rexp(10,1/30) with the result:


> rexp(10,1/30)
 [1] 126.841780  78.137417  12.989666  31.544992   9.598437  89.299508   2.063628   4.597275
 [9]  24.792890  11.950121

Note that some delays are much smaller than the mean while other delays are much greater. Unfortunately, this R function is not available to you in load-test scripts so, you have to code your own. Here's how that works.

Inverse Transformation
In eqn. (1), we have the output $f(t)$ on the left and the corresponding delay $(t)$ on the right side (in the exponent). However, we would really prefer to have things the other way around: flip a coin to get an input on the right and find out what delay that corresponds to as an output on the left.

We can use the inverse transform to do precisely that. The inverse function does not necessarily exist for an arbitrary probability distribution but, thankfully, the exponential distribution has a very simple form which allows it. The inverse of the exponential function is the natural logarithm function. The PDF in (1) lies in the range $0 \le f < \lambda$ on the $y$-axis, but we need to work with probabilities. The function which does this is the cumulative distribution function $F(t)$ in Figure 2:

\begin{equation} F(t) = 1 - e^{-\lambda t} \end{equation}

which is strictly bounded by the range $0 \le F < 1$. $F(t)$ is the corresponding area under $f(t)$ and corresponds to pexp(q, rate = 1) in R.

Typically, we would look along the $t$-axis (horizontal) for a particular time $(t)$ and then look up (to the curve) and across to the y-axis $(F)$ to find out the probability of that time occurring. Here, instead, we pick a random point on y-axis interval corresponding to $F$ (e.g., by flipping a coin). The corresponding delay is read off from the t-axis by following the dashed arrow in Figure 2, which shows this inversion process for probability values $0.90$, $0.80$ and $0.30$. BTW, those probability values also correspond respectively to $90$th, $80$th, and $30$th percentiles, if you prefer to think of them that way. We can simulate the coin flip by using a variate $u \sim U(0,1)$ chosen from a uniform distribution $0 \le u < 1$.

Based on Figure 2, how can we calculate the corresponding interarrival delay $(t)$ that the load generator should use? Letting $u$ represent $F$ in (2) and transposing produces:

\begin{equation} e^{-\lambda t} = 1 - u \end{equation}

Next, we solve (3) for $t$ by taking natural logs of both sides—the inverse function:

\begin{equation} \lambda t = - \ln(1 - u) \end{equation}

But the value of $u$ lies in the same interval as $(1-u)$, since they have the same uniform distribution. Hence, we can use the slightly simpler form:

\begin{equation} t = - \frac{\ln(u)}{\lambda} \end{equation}

Finally, we have arrived at the place where we wanted to be: flip a coin to get a random input on the right hand side of (5) and find out what delay the client script should use as an output on the left. For load testing, the random delay $(t)$ is associated with a mean think time $Z = 1/\lambda$ and is therefore computed using:

\begin{equation} t = -Z \ln(u) \end{equation}

Equation (6) is what rexp() uses under the covers, and it's what you need to code in your client test scripts. A rather simple formula which, again, underscores the lunacy of not having it integrated into the load-test simulator.

Examples in R
Using R, we first generate $10$ random variates (coin tosses) from a uniform distribution:


> uVariates <- runif(10) # create $10$ uniform variates
> uVariates
 [1] 0.13737362 0.23143301 0.59315715 0.56576222 0.48420383 0.66065404 0.77418521 0.07673259
 [9] 0.39985820 0.99247847
Next, we use these values as inputs into eqn. (6), with a mean time of 30 seconds, to calculate 10 exponentially distributed delays:


> thinkZ  <- 30 # specify the average desired delay
> tDelays <- -thinkZ*log(uVariates) # see eqn. (6)
> tDelays
 [1] 59.5515274 43.9039444 15.6688773 17.0874419 21.7574800 12.4357489  7.6783242 77.0228628
 [9] 27.4993587  0.2264988

Note the spread of delay times, which would also create significant fluctuations in queue depth as seen by buffers on the SUT side.

For comparison, here are $10$ delay samples produced by a uniform distribution with the same mean as used for the exponential samples, i.e., the arithmetic mean $\frac{0+60}{2}=30$ seconds:


> runif(10,0,60)
 [1] 25.90247 51.94069 30.43872 10.11337 29.17908 25.90095 31.91967 42.46816 49.94629 34.62311

Similarly, here are $10$ delay samples produced by a normal distribution with a mean of $30$ seconds:


> rnorm(10,30)
 [1] 30.57623 30.63393 30.45814 29.40551 29.59376 30.38025 29.91399 30.12652 29.67296 29.00941
Clearly, the exponential distribution produces a greater spread of delay times.

Related Posts

10 comments:

metasoft said...

I was also reviewing the literatures on creating Exponential, Normal, Gamma and Poisson random number generators.

Both Numerical Recipes 3rd Edition and James E. Gentle's Random Number Generation and Monte Carlo Methods 2e were useful in coding non-uniformly distributed random number generators.

metasoft said...

I was also reviewing the literatures on creating Exponential, Normal, Gamma and Poisson random number generators.

Both Numerical Recipes 3rd Edition and James E. Gentle's Random Number Generation and Monte Carlo Methods 2e were useful in coding non-uniformly distributed random number generators.

Neil Gunther said...

I guess the issue with eqn.(6) is the logical equivalent of how you import log() from math.h or whatever numerical libs are in your load-testing environment.

It's still pathetic, when you consider that this is supposed to be the 21st century and yet, we are discussing such issues.

metasoft said...

A free e-book on non-uniform random number generator.

http://luc.devroye.org/rnbookindex.html

Neil Gunther said...

Just discovered how to do LaTeX style equations in Blogger.

The math processing is done on a remote server and therefore may take a little while to finalize, depending on the amount of math content and the server latency.

However, I think you'll agree the more readable typesetting is worth the wait.

Mohan Radhakrishnan said...

Hi meta,
I have searched long for a statistics book that can help me understand distributions etc. and code Java. I have looked at James E. Gentle's Amazon review.

Is this book the right one in your opinion ?

Thanks.

Mohan Radhakrishnan said...

I have a general question about generating distributions of data and sending it to the server and exponential delays etc.

What is the explanation for these kinds of loads with certain types of delays ? Are they supposed to highlight some weak points in the server like exponential delays and buffer sizes you have mentioned ?

Neil Gunther said...

Mohan, I'm thinking about a good way to demonstrate the impact on buffering due to the differences in arrival dsns. It will likely be too involved to write as a Comment here, so I will post it as a new blog entry after the GCaP class completes next week.

In the meantime, you might want to re-read this blog post—especially the footnotes.

Neil Gunther said...

Mohan et al.,

Take a look at Load Testing with Uniform vs. Exponential Arrivals.

metasoft said...

Mohan,

To me, James E Gentle's book has easier to understand exposition than Numerical Recipes. You may be able to find sample chapters of both to see which one you like better. Also check out the link to the free book by Devroye.