Sunday, May 20, 2018

USL Scalability Modeling with Three Parameters

NOTE: Annoyingly, the remote mathjax server often takes it's sweet time rendering LaTex equations (like, maybe a minute!!!). I don't know if this is deliberate on the part of Google or a bug. It used to be faster. If anyone knows, I'd be interested to hear; especially if there is a way to speed it up. And no, I'm not planning to move to WordPress.

The 2-parameter USL model

The original USL model, presented in my GCAP book and updated in the blog post How to Quantify Scalability, is defined in terms of two fitting parameters $\alpha$ (contention) and $\beta$ (coherency). \begin{equation} X(N) = \frac{N \, X(1)}{1 + \alpha (N - 1) + \beta N (N - 1)} \label{eqn: usl2} \end{equation}

Fitting this nonlinear USL equational model to data requires several steps:

  1. normalizing the throughput data, $X$, to determine relative capacity, $C(N)$.
  2. equation (\ref{eqn: usl2}) is equivalent to $X(N) = C(N) \, X(1)$.
  3. if the $X(1)$ measurement is missing or simply not available—as is often the case with data collected from production systems—the GCAP book describes an elaborate technique for interpolating the value.
The motivation for a 2-parameter model arose out of a desire to meet the twin goals of:
  1. providing each term of the USL with a proper physical meaning, i.e., not treat the USL like a conventional multivariate statistical model (statistics is not math)
  2. satisfying the von Neumann criterion: minimal number of modeling parameters
Last year, I realized the 2-paramater constraint is actually overly severe. Introducing a third parameter would make the statistical fitting process even more universal, as well as simplify the overall procedure. For the USL particularly, the von Neumann criterion should not be taken too literally. It's really more of a guideline: fewer is generally better. Additionally, Baron Schwarz told me that he'd had better luck fitting production RDBMS data in Excel by substituting a third parameter into the numerator of the USL. As ever, the question remained: How could this actually work?

The 3-parameter USL model

Going back to equation (\ref{eqn: usl2}), let's just consider the simplest case where scaling is linear-rising, as would be the case for ideal parallelism. In the linear region, where $\alpha = \beta = 0$, equation (\ref{eqn: usl2}) simplifies to \begin{equation} X(N) = N \, X(1) \label{eqn: usl1} \end{equation}

In other words, the overall throughput $X(N)$ increases in simple proportion to $N$. The "single-user" throughput, $X(1)$, doesn't change and therefore acts like a constant of proportionality.

But what happens when we don't know the value of $X(1)$? That means the $X(1)$ factor in equations (\ref{eqn: usl2}) and (\ref{eqn: usl1}) is undefined. We might denote this situation by writing

\begin{equation} X(N) = N \, ? \label{eqn: uslx} \end{equation}

Of course, that makes no sense, mathematically speaking. As already mentioned, the conventional way out of this situation is to estimate the value of $X(1)$ using mathematical interpolation. But here's the epiphany.

Rather than using the more complicated interpolation procedure, we can simply appeal to statistical regression! Yes, that's right, we treat the USL equation as a conventional multivariate statistical model. After all, we're already using nonlinear statistical regression to determine the $\alpha$ and $\beta$ parameters. More importantly, since statistics is not math, we can replace equation ($\ref{eqn: uslx}$) with a statement about correlation, rather than strict equality. In statistical models, that's accomplished by introducing another parameter (I'll call it $\gamma$, since that's the third letter of the Greek alphabet) to replace the question mark in equation ($\ref{eqn: uslx}$), namely

\begin{equation} X(N) = N \, \gamma \label{eqn: uslg} \end{equation}

The new parameter $\gamma$ is just a constant of proportionality that represents the slope of the line associated with ideal parallel scaling. See the plots below.

And here's a little piece of magic. If we choose $N = 1$ in equation ($\ref{eqn: uslg}$), it becomes $X(1) = \gamma$. So, when the $\gamma$ parameter is determined by statistical regression, it also tells us the estimated value of $X(1)$, whether it was measured or missing. In other words, we don't need to do any explicit interpolation because the nonlinear regression procedure does it automatically by fitting the third parameter.

Equation (\ref{eqn: usl2}) is now replaced by a 3-parameter version of the USL model: \begin{equation} X(N) = \frac{N \, \gamma}{1 + \alpha (N - 1) + \beta N (N - 1)} \label{eqn: usl3} \end{equation}

Unlike the 2-parameter USL, equation (\ref{eqn: usl3}) can be fitted directly to your throughput measurements without the need to do any data normalization or interpolation. The following examples show the results of fitting the 3-parameter USL model.

Load-test data

These are load-test data and the "single-user" throughput was measured as $X(1) = 955.16$ per unit time. The 3-parameter USL fit is summarized in the following plot.

The fitted value of $\gamma = 995.65$, which is the estimated value of $X(1)$. It can also be regarded as the slope of the linear-rising throughput indicated by the sloping red line on the left of the plot.

Production data

These data are from a continuously running production system and thus, no $X(1)$ was ever produced.

The fitted value of $\gamma = 3.22$ is also equivalent to the estimated value of $X(1)$. Similarly, it can be regarded as the slope of the linear-rising throughput on the left of the plot. Interestingly, in these data, $\alpha = 0$, while $beta$ is non-zero. That suggests there is no significant contention in the workload but there is some data exchange coherency at play.

One word of caution. Fitting the 3-parameter USL can be more sensitive to the actual data, especially with a large number of production data scatter points. I'll go into all this, and more, in the upcoming Guerrilla training classes.

4 comments:

Baron Schwartz said...

I switched from MathJax to KaTeX and got much better performance.
https://www.xaprb.com/blog/switching-mathjax-katex/

Neil Gunther said...
This comment has been removed by the author.
Neil Gunther said...

"One parameter is always enough" (See Fig. 1a) Doh!

There goes my Winking Pink Elephant :(

Justin Martin said...

We produce USL models using data from our production systems using this method to estimate X(1). We have thousands of "real" load generators from which we feed the model ~30 days of hourly data points accross a range of load levels. This helps us 1.) get a feel for how much throughput capacity is remaining and 2.) observe how scaling changes between software releases (as defined quantitatively via the contention and coherency parameters). We're also toying with modeling impacts to response time by leveraging the predicted throughput from the USL models.