The 2-parameter USL model
The original USL model, presented in my GCAP book and updated in the blog post How to Quantify Scalability, is defined in terms of two fitting parameters $\alpha$ (contention) and $\beta$ (coherency). \begin{equation} X(N) = \frac{N \, X(1)}{1 + \alpha (N - 1) + \beta N (N - 1)} \label{eqn: usl2} \end{equation}Fitting this nonlinear USL equational model to data requires several steps:
- normalizing the throughput data, $X$, to determine relative capacity, $C(N)$.
- equation (\ref{eqn: usl2}) is equivalent to $X(N) = C(N) \, X(1)$.
- if the $X(1)$ measurement is missing or simply not available—as is often the case with data collected from production systems—the GCAP book describes an elaborate technique for interpolating the value.
- providing each term of the USL with a proper physical meaning, i.e., not treat the USL like a conventional multivariate statistical model (statistics is not math)
- satisfying the von Neumann criterion: minimal number of modeling parameters
The 3-parameter USL model
Going back to equation (\ref{eqn: usl2}), let's just consider the simplest case where scaling is linear-rising, as would be the case for ideal parallelism. In the linear region, where $\alpha = \beta = 0$, equation (\ref{eqn: usl2}) simplifies to \begin{equation} X(N) = N \, X(1) \label{eqn: usl1} \end{equation}In other words, the overall throughput $X(N)$ increases in simple proportion to $N$. The "single-user" throughput, $X(1)$, doesn't change and therefore acts like a constant of proportionality.
But what happens when we don't know the value of $X(1)$? That means the $X(1)$ factor in equations (\ref{eqn: usl2}) and (\ref{eqn: usl1}) is undefined. We might denote this situation by writing
\begin{equation} X(N) = N \, ? \label{eqn: uslx} \end{equation}
Of course, that makes no sense, mathematically speaking. As already mentioned, the conventional way out of this situation is to estimate the value of $X(1)$ using mathematical interpolation. But here's the epiphany.
Rather than using the more complicated interpolation procedure, we can simply appeal to statistical regression! Yes, that's right, we treat the USL equation as a conventional multivariate statistical model. After all, we're already using nonlinear statistical regression to determine the $\alpha$ and $\beta$ parameters. More importantly, since statistics is not math, we can replace equation ($\ref{eqn: uslx}$) with a statement about correlation, rather than strict equality. In statistical models, that's accomplished by introducing another parameter (I'll call it $\gamma$, since that's the third letter of the Greek alphabet) to replace the question mark in equation ($\ref{eqn: uslx}$), namely
\begin{equation} X(N) = N \, \gamma \label{eqn: uslg} \end{equation}
The new parameter $\gamma$ is just a constant of proportionality that represents the slope of the line associated with ideal parallel scaling. See the plots below.
And here's a little piece of magic. If we choose $N = 1$ in equation ($\ref{eqn: uslg}$), it becomes $X(1) = \gamma$. So, when the $\gamma$ parameter is determined by statistical regression, it also tells us the estimated value of $X(1)$, whether it was measured or missing. In other words, we don't need to do any explicit interpolation because the nonlinear regression procedure does it automatically by fitting the third parameter.
Equation (\ref{eqn: usl2}) is now replaced by a 3-parameter version of the USL model: \begin{equation} X(N) = \frac{N \, \gamma}{1 + \alpha (N - 1) + \beta N (N - 1)} \label{eqn: usl3} \end{equation}
Unlike the 2-parameter USL, equation (\ref{eqn: usl3}) can be fitted directly to your throughput measurements without the need to do any data normalization or interpolation. The following examples show the results of fitting the 3-parameter USL model.
Load-test data
These are load-test data and the "single-user" throughput was measured as $X(1) = 955.16$ per unit time. The 3-parameter USL fit is summarized in the following plot.
The fitted value of $\gamma = 995.65$, which is the estimated value of $X(1)$. It can also be regarded as the slope of the linear-rising throughput indicated by the sloping red line on the left of the plot.
Production data
These data are from a continuously running production system and thus, no $X(1)$ was ever produced.
The fitted value of $\gamma = 3.22$ is also equivalent to the estimated value of $X(1)$. Similarly, it can be regarded as the slope of the linear-rising throughput on the left of the plot. Interestingly, in these data, $\alpha = 0$, while $beta$ is non-zero. That suggests there is no significant contention in the workload but there is some data exchange coherency at play.
One word of caution. Fitting the 3-parameter USL can be more sensitive to the actual data, especially with a large number of production data scatter points. I'll go into all this, and more, in the upcoming Guerrilla training classes.
4 comments:
I switched from MathJax to KaTeX and got much better performance.
https://www.xaprb.com/blog/switching-mathjax-katex/
"One parameter is always enough" (See Fig. 1a) Doh!
There goes my Winking Pink Elephant :(
We produce USL models using data from our production systems using this method to estimate X(1). We have thousands of "real" load generators from which we feed the model ~30 days of hourly data points accross a range of load levels. This helps us 1.) get a feel for how much throughput capacity is remaining and 2.) observe how scaling changes between software releases (as defined quantitatively via the contention and coherency parameters). We're also toying with modeling impacts to response time by leveraging the predicted throughput from the USL models.
Post a Comment