Listing 2.2 on p. 35 of my Perl::PDQ book shows you how to generate exponential variates in Perl. Here, however, I want to use R to compare exponential delays with both the uniform distribution (the default distribution available in all load-test simulators) and the normal distribution (the familiar "bell curve").
The R function that generates exponential variates directly is rexp(n, rate = 1) where, for example, the parameter called rate might correspond to the arrival rate of requests going into your test rig or system under test (SUT). The R programming language uses the same notation as p. 57 of my Perl::PDQ book. In my books and classes, I usually write that rate as $\lambda$ to match conventional queueing theory symbology. The probability density function (PDF), or dexp() in R, is usually written as:
\begin{equation} f(t) = \lambda e^{-\lambda t} \end{equation}
and shown in Figure 1.
The rate is $\lambda$, but the average or statistical mean of (1) is given by the inverse rate or $1/\lambda$. Since $\lambda$ is the average arrival rate, $1/\lambda$ is the average interarrival time as would be seen by the SUT. The view from the load-test client corresponds to a think-time delay of $Z = 1/\lambda$ in your script. In either case, the delay is the time interval between requests, whether departing the client or arriving at the SUT. If you could apply the R function rexp() directly to produce 10 exponentially distributed delays with a mean time of $Z=30$ seconds, you would write rexp(10,1/30) with the result:
> rexp(10,1/30)
[1] 126.841780 78.137417 12.989666 31.544992 9.598437 89.299508 2.063628 4.597275
[9] 24.792890 11.950121
Note that some delays are much smaller than the mean while other delays are much greater. Unfortunately, this R function is not available to you in load-test scripts so, you have to code your own. Here's how that works.
Inverse Transformation
In eqn. (1), we have the output $f(t)$ on the left and the corresponding delay $(t)$ on the right side (in the exponent). However, we would really prefer to have things the other way around: flip a coin to get an input on the right and find out what delay that corresponds to as an output on the left.
We can use the inverse transform to do precisely that. The inverse function does not necessarily exist for an arbitrary probability distribution but, thankfully, the exponential distribution has a very simple form which allows it. The inverse of the exponential function is the natural logarithm function. The PDF in (1) lies in the range $0 \le f < \lambda$ on the $y$-axis, but we need to work with probabilities. The function which does this is the cumulative distribution function $F(t)$ in Figure 2:
\begin{equation} F(t) = 1 - e^{-\lambda t} \end{equation}
which is strictly bounded by the range $0 \le F < 1$. $F(t)$ is the corresponding area under $f(t)$ and corresponds to pexp(q, rate = 1) in R.
Typically, we would look along the $t$-axis (horizontal) for a particular time $(t)$ and then look up (to the curve) and across to the y-axis $(F)$ to find out the probability of that time occurring. Here, instead, we pick a random point on y-axis interval corresponding to $F$ (e.g., by flipping a coin). The corresponding delay is read off from the t-axis by following the dashed arrow in Figure 2, which shows this inversion process for probability values $0.90$, $0.80$ and $0.30$. BTW, those probability values also correspond respectively to $90$th, $80$th, and $30$th percentiles, if you prefer to think of them that way. We can simulate the coin flip by using a variate $u \sim U(0,1)$ chosen from a uniform distribution $0 \le u < 1$.
Based on Figure 2, how can we calculate the corresponding interarrival delay $(t)$ that the load generator should use? Letting $u$ represent $F$ in (2) and transposing produces:
\begin{equation} e^{-\lambda t} = 1 - u \end{equation}
Next, we solve (3) for $t$ by taking natural logs of both sides—the inverse function:
\begin{equation} \lambda t = - \ln(1 - u) \end{equation}
But the value of $u$ lies in the same interval as $(1-u)$, since they have the same uniform distribution. Hence, we can use the slightly simpler form:
\begin{equation} t = - \frac{\ln(u)}{\lambda} \end{equation}
Finally, we have arrived at the place where we wanted to be: flip a coin to get a random input on the right hand side of (5) and find out what delay the client script should use as an output on the left. For load testing, the random delay $(t)$ is associated with a mean think time $Z = 1/\lambda$ and is therefore computed using:
\begin{equation} t = -Z \ln(u) \end{equation}
Equation (6) is what rexp() uses under the covers, and it's what you need to code in your client test scripts. A rather simple formula which, again, underscores the lunacy of not having it integrated into the load-test simulator.
Examples in R
Using R, we first generate $10$ random variates (coin tosses) from a uniform distribution:
> uVariates <- runif(10) # create $10$ uniform variates
> uVariates
[1] 0.13737362 0.23143301 0.59315715 0.56576222 0.48420383 0.66065404 0.77418521 0.07673259
[9] 0.39985820 0.99247847
Next, we use these values as inputs into eqn. (6), with a mean time of 30 seconds, to calculate 10 exponentially distributed delays:
> thinkZ <- 30 # specify the average desired delay
> tDelays <- -thinkZ*log(uVariates) # see eqn. (6)
> tDelays
[1] 59.5515274 43.9039444 15.6688773 17.0874419 21.7574800 12.4357489 7.6783242 77.0228628
[9] 27.4993587 0.2264988
Note the spread of delay times, which would also create significant fluctuations in queue depth as seen by buffers on the SUT side.
For comparison, here are $10$ delay samples produced by a uniform distribution with the same mean as used for the exponential samples, i.e., the arithmetic mean $\frac{0+60}{2}=30$ seconds:
> runif(10,0,60)
[1] 25.90247 51.94069 30.43872 10.11337 29.17908 25.90095 31.91967 42.46816 49.94629 34.62311
Similarly, here are $10$ delay samples produced by a normal distribution with a mean of $30$ seconds:
> rnorm(10,30)
[1] 30.57623 30.63393 30.45814 29.40551 29.59376 30.38025 29.91399 30.12652 29.67296 29.00941
Clearly, the exponential distribution produces a greater spread of delay times.
Related Posts
I was also reviewing the literatures on creating Exponential, Normal, Gamma and Poisson random number generators.
ReplyDeleteBoth Numerical Recipes 3rd Edition and James E. Gentle's Random Number Generation and Monte Carlo Methods 2e were useful in coding non-uniformly distributed random number generators.
I was also reviewing the literatures on creating Exponential, Normal, Gamma and Poisson random number generators.
ReplyDeleteBoth Numerical Recipes 3rd Edition and James E. Gentle's Random Number Generation and Monte Carlo Methods 2e were useful in coding non-uniformly distributed random number generators.
I guess the issue with eqn.(6) is the logical equivalent of how you import log() from math.h or whatever numerical libs are in your load-testing environment.
ReplyDeleteIt's still pathetic, when you consider that this is supposed to be the 21st century and yet, we are discussing such issues.
A free e-book on non-uniform random number generator.
ReplyDeletehttp://luc.devroye.org/rnbookindex.html
Just discovered how to do LaTeX style equations in Blogger.
ReplyDeleteThe math processing is done on a remote server and therefore may take a little while to finalize, depending on the amount of math content and the server latency.
However, I think you'll agree the more readable typesetting is worth the wait.
Hi meta,
ReplyDeleteI have searched long for a statistics book that can help me understand distributions etc. and code Java. I have looked at James E. Gentle's Amazon review.
Is this book the right one in your opinion ?
Thanks.
I have a general question about generating distributions of data and sending it to the server and exponential delays etc.
ReplyDeleteWhat is the explanation for these kinds of loads with certain types of delays ? Are they supposed to highlight some weak points in the server like exponential delays and buffer sizes you have mentioned ?
Mohan, I'm thinking about a good way to demonstrate the impact on buffering due to the differences in arrival dsns. It will likely be too involved to write as a Comment here, so I will post it as a new blog entry after the GCaP class completes next week.
ReplyDeleteIn the meantime, you might want to re-read this blog post—especially the footnotes.
Mohan et al.,
ReplyDeleteTake a look at Load Testing with Uniform vs. Exponential Arrivals.
Mohan,
ReplyDeleteTo me, James E Gentle's book has easier to understand exposition than Numerical Recipes. You may be able to find sample chapters of both to see which one you like better. Also check out the link to the free book by Devroye.