## Tuesday, May 12, 2009

### Negative Scalability Coefficients in Excel

Recently, several performance engineers, who have been applying my universal scalability law (USL) to their throughput measurements, reported a problem whereby their Excel spreadsheet calculations produced a negative value for the coherency parameter (β < 0) on what otherwise appears to be an application that scales extremely well. You can download the Excel spreadsheet sscalc.xls from the Guerrilla class materials. Negative USL parameters are verboten.

Naturally, I've seen this kind of negative-numbers no-no before, but usually there has been some other problem with the measurements, and once that was corrected everything worked as advertised. Lately, however, something else seemed to be going on, and that caused me to investigate things more closely. I discovered that there is a genuine limitation when applying the USL in Excel, and that's what I'm going to explain here. It's a bit subtle, so fasten your seat belt.

The following set of plots shows a generic situation. The top row shows how things go in Excel, while the bottom row shows some idealized curves that I'll explain in a minute.

Excel Curves
Referring to the 3 plots in the top row: the series of dots in Figure A are the throughput values X(N) measured at each user load N (on the x-axis). Technically, these data have been normalized to produce the relative capacity C(N) = X(N)/X(1). This is the quantity that we eventually model with the USL function:

C(N) = N/{1 + α (N − 1) + β N(N − 1)}

where α represents the degree of contention, e.g., queueing, in the system and β is the coherency delay incurred, e.g., when keeping multiple data copies consistent across caches.
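To make the model concrete, here is a minimal sketch of the USL function in R. The parameter values alpha = 0.05 and beta = 0.001 are illustrative assumptions chosen for the demo, not fitted values:

```r
# Relative capacity C(N) under the USL. The parameters alpha = 0.05 and
# beta = 0.001 are assumptions for illustration only.
usl_capacity <- function(N, alpha, beta) {
  N / (1 + alpha * (N - 1) + beta * N * (N - 1))
}

usl_capacity(1, 0.05, 0.001)    # exactly 1: C(1) = 1 by construction
usl_capacity(1:8, 0.05, 0.001)  # capacity grows sub-linearly with N
```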

Most importantly, because these parameters (α, β) have a physical meaning, i.e., they're not just some curve-fitting parameters, they must ALWAYS be positive-valued (no ifs, ands or buts). It's obedience to this physical constraint that allows us to know when the performance measurement process is broken (which is more often than you might dare to imagine).

The other thing to notice in Figure A is that the data points are very close to linear, in that they have been statistically fitted to a straight line (solid line). Prima facie, that looks very good, but such linear data should not be confused with ideal linear scalability in the sense of maximal bang for the buck. Ideal linearity is represented by the dotted line, which has a higher slope than the solid line.

Figure B shows the corresponding efficiency C(N)/N or the amount of capacity per user. This is the 4th column in the spreadsheet (depending on how you have things set up). As expected, it decreases gracefully in the direction of the x-axis. You can't have more than 100% efficiency and, as a general rule, efficiency degrades with N because the degree of overhead increases with more users. In this case, the degradation is very small due to the linearity in Figure A.
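In R terms, that efficiency column is just C(N)/N. A small sketch (with assumed, illustrative parameter values) confirms the graceful degradation:

```r
# Efficiency C(N)/N, as in the spreadsheet's efficiency column.
# The parameters alpha = 0.05 and beta = 0.001 are illustrative assumptions.
N <- 1:6
C <- N / (1 + 0.05 * (N - 1) + 0.001 * N * (N - 1))
eff <- C / N
eff[1]              # 1: you can't have more than 100% efficiency
all(diff(eff) < 0)  # TRUE: efficiency degrades monotonically with N
```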

Now comes the fun part. In Excel, we must do some handstands to accommodate the USL model. The drawback with Excel is that it still doesn't know how to fit a Trend Line to a rational function like the USL model. It's not a dialog-box option. Sigh! Therefore, we have to tip the USL upside-down, fit the denominator as a 2nd-degree polynomial (quadratic equation) with coefficients (a, b, c=0), and then transform those coefficients back to the USL parameters using the definitions on pp. 80 and 104 of my GCaP book, viz., α = b − a and β = a.

The result of this inversion procedure in Excel appears in Figure C. It can be interpreted physically as showing the deviation from linearity. Here, I've simply labeled it L(N) but it's the same thing as (N/C) − 1 in the Excel spreadsheet. And this is where the trouble begins.
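For comparison, the same inversion procedure can be sketched in R with lm(), using synthetic data generated from known parameters (alpha = 0.05 and beta = 0.001 are assumptions for the demo). With clean, genuinely USL-shaped data, the round trip recovers the parameters exactly:

```r
# The Excel inversion procedure, sketched in R. The data are synthetic,
# generated from assumed parameters alpha = 0.05 and beta = 0.001.
alpha.true <- 0.05
beta.true  <- 0.001
N <- 1:8
C <- N / (1 + alpha.true * (N - 1) + beta.true * N * (N - 1))

x <- N - 1       # the deviation from linearity is quadratic in (N - 1)
L <- N / C - 1   # same as the (N/C) - 1 column in sscalc.xls

# Fit a*x^2 + b*x through the origin (c = 0), then map back to USL parameters.
fit <- lm(L ~ I(x^2) + x + 0)
a <- coef(fit)[["I(x^2)"]]
b <- coef(fit)[["x"]]
c(alpha = b - a, beta = a)   # recovers 0.05 and 0.001 for these clean data
```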

Notice that the data points tend to bend ever so slightly toward the x-axis as they increase. The denominator of C(N) in the USL model is quadratic in N. A general quadratic function can be written aN² + bN + c and has the shape of a parabola. We constrain our L(N) parabola to go through the origin by setting c = 0 in the Excel dialog box, which leaves only a and b to be calculated by Excel. But when we ask Excel to fit these points to that quadratic, it produces the dashed line: Excel thinks the parabola is upside-down, like a McDonald's arch. Whether a parabola looks like a cup (as demanded by the USL) or a cap (a McDonald's arch) is controlled by the sign of the a-coefficient. A McDonald's arch has a < 0 (negative), and that in turn causes β in the USL to incorrectly come out negative!

The problem lies with Excel, not the USL. There is no way to tell Excel that we want only positive a and b coefficients. At least, I don't know how to do that in Excel, and I don't plan on wasting time writing a macro either, because I'm already worried about numerical precision issues with Excel (as explained in the Preface to the book). Better to go with a tool like R or Mathematica in this case. So, why is Excel running into trouble?

Idealized Curves
Referring now to the 3 plots in the bottom row, Figure D shows the near-linear data points represented this time by a continuous red line (it's the same as the solid blue line in Figure A). The idea here is to see what happens to that continuous line as I push it through the same transformations as are applied to the data in Excel.

In Figure E the resulting theoretical efficiency curve looks very similar to Figure B. In fact, it looks like it could be a regression fit to those data. So, that's cool. It seems to be consistent. The drama occurs in the next plot.

Figure F is the result of applying the Excel transformation to determine the deviation from linearity. Very clearly it is a cap-shaped (concave) curve, rather than the cup-shaped (convex) curve required by the USL. Put another way, the red curve in Figure F looks more like a throughput curve than an inverted one! But that's crazy. How can an inverted throughput curve still look like a throughput curve?
1. If you start with linear data like the points in Figure A, those same data will tend to lie on the red curve shown in Figure F as a result of simply applying the inversion transformation in Excel. That's what you see happening in Figure C.
2. The red curve in Figure F is not a parabola. I won't bother explaining what function it is because it's not salient to this discussion.
3. Excel doesn't know any of this.
4. When applying the Trend Line fit in Excel, you tell Excel that you want it to fit the data points in Figure C to a parabola. That's the logical equivalent of telling Excel to fit to the USL model, but distanced by several levels of indirection. You do it by choosing "polynomial of degree 2" in the Excel dialog box.
5. When Excel charges off to fulfill your request by doing a linear least-squares fit to a parabola, it senses that the data points in Figure C are "curved" downward and comes back with a chunk of a "McDonald's arch": the dashed line in Figure C.
6. Since Excel knows nothing of the USL, it thinks it has done a good job, e.g., R² ≈ 0.99.
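The whole failure mode can be reproduced numerically. Here's a sketch in R: start from exactly linear relative capacity with an assumed slope m = 0.6 (any slope below 1 behaves the same way), apply the inversion transform, and fit the constrained quadratic. The curvature coefficient comes back negative, just as Excel finds:

```r
# Reproducing Excel's failure mode on exactly linear data. The slope
# m = 0.6 is an arbitrary assumption for this demo.
m <- 0.6
N <- 1:20
C <- 1 + m * (N - 1)   # exactly linear relative capacity, as in Figure A
x <- N - 1
L <- N / C - 1         # the inversion transform; cap-shaped, not cup-shaped

# Constrained quadratic fit through the origin, as in the Excel Trend Line.
fit <- lm(L ~ I(x^2) + x + 0)
a <- coef(fit)[["I(x^2)"]]
a < 0                  # TRUE: a McDonald's arch, hence a negative beta
```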
What To Do? Discretion being the better part of valor, I would not do any additional handstands on behalf of Excel. Rather, I suggest that you avoid all such problems having to do with the signs of a and b by avoiding a and b altogether. Instead, fit the USL parameters directly to the original data. You can do this in R, for example, using its nls function:
# Do a standard NLS fit ...
usl <- nls(Norm ~ n/(1 + alpha * (n-1) + beta* n*(n-1)), input,
start=c(alpha=0.1, beta=0.01))

Simultaneously, you can constrain the values of α and β to be positive. Here's how you do that in R:
# Set positivity constraints on USL parameters ...
usl <- nls(Norm ~ n/(1 + alpha * (n-1) + beta * n*(n-1)), input,
start=c(alpha=0.1, beta=0.01), algorithm="port", lower=c(0,0))

You can learn more about applying R to quantitative scalability analysis and other performance management problems in our upcoming GDAT class.

As a final sanity check, scalability measurements that rise linearly (like Figure A) suggest the overheads are small. In particular, you can expect that there is relatively little sharing of write-data, and therefore the coherency penalty (the β parameter) should be vanishingly small. That doesn't mean your system will scale linearly forever. It means that the maximum in the USL scalability curve is still there, but it lies off the scale somewhere to the right of the physically allowed system size.
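To put a number on "off the scale to the right": the location of the USL maximum is N* = sqrt((1 − α)/β), the standard result from the GCaP book. A one-line sketch in R, with illustrative parameter values:

```r
# The USL maximum occurs at N* = sqrt((1 - alpha)/beta), per the GCaP book.
# Parameter values below are illustrative assumptions.
usl_peak <- function(alpha, beta) floor(sqrt((1 - alpha) / beta))

usl_peak(alpha = 0.05, beta = 1e-5)   # 308 users: already off a 100-user scale
usl_peak(alpha = 0.05, beta = 1e-7)   # 3082: a smaller beta pushes it further out
```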

Anonymous said...

Yes ... you'd be surprised how often non-physical results happen when you use "ordinary" linear or non-linear least squares. I should dig up my R code to fit the USL and post it; it was a bit more complicated than this and I used an explicit optimization rather than the built-in constrained least squares for some reason I don't remember now.

metasoft said...

can you provide me with the data set? interested in applying that data set to other stat tools and seeing what happens.

Anonymous said...

Yeah ... in fact, if it's OK with the original poster, could you add this data set to the other sample data sets in the Excel spreadsheet that does the USL calculations and post that? Or perhaps include them in the next release of PDQ?

Neil Gunther said...

N C(N)
1 1.0
2 1.61413
3 2.13575
4 2.67178
5 3.11948
6 3.50256

If you prefer X(N) data, simply multiply C(N) by your favorite value of X(1).

metasoft said...

thx neil. i tested the data you posted with R 2.9.0 and got alpha = 0.1531 and beta = 0. however, as a sanity check, i tested the R call you provided against the GCaP book data on page 76, Fig 5.3, and got alpha = 0 and beta = 0, or an error, which is not good.

Neil Gunther said...

Send me your R script and I'll take a look.

People, quantitative scalability analysis is not always push-button. Fitting the USL can be very subtle depending on the data---especially if the data are bad. This is true for ANY kind of statistical analysis, and that's what this is: statistical analysis. You need a lot of practice, and you may also need a lot of patient fiddling around to get things consistent.

But here's the bottom line: It's the only quantitative scalability game in town. :)

That means, when it all comes together you can be supremely confident about your results and understanding. Is there any other way?

Anonymous said...

OK ... got the data from GCaP in R:

> d
N X C
1 1 20 1.0
2 4 78 3.9
3 8 130 6.5
4 12 170 8.5
5 16 190 9.5
6 20 200 10.0
7 24 210 10.5
8 28 230 11.5
9 32 260 13.0
10 48 280 14.0
11 64 310 15.5
>

There are three possible algorithms with NLS ... do 'help(nls)' for the details.

So:

> usl <- nls(C ~ N/(1 + alpha * (N-1) + beta *N*(N-1)), d,
+ start=c(alpha=0.1, beta=0.01), algorithm="port", lower=c(0,0))
Error in nls(C ~ N/(1 + alpha * (N - 1) + beta * N * (N - 1)), d, start = c(alpha = 0.1, :
Convergence failure: false convergence (8)
> usl <- nls(C ~ N/(1 + alpha * (N-1) + beta *N*(N-1)), d,
+ start=c(alpha=0.1, beta=0.01), algorithm="plinear", lower=c(0,0))
Warning message:
In nls(C ~ N/(1 + alpha * (N - 1) + beta * N * (N - 1)), d, start = c(alpha = 0.1, :
Upper or lower bounds ignored unless algorithm = "port"
> usl
Nonlinear regression model
model: C ~ N/(1 + alpha * (N - 1) + beta * N * (N - 1))
data: d
alpha beta .lin
0.0781476 -0.0002011 1.2383069
residual sum-of-squares: 1.21

Number of iterations to convergence: 7
Achieved convergence tolerance: 3.382e-06
> usl <- nls(C ~ N/(1 + alpha * (N-1) + beta *N*(N-1)), d,
+ start=c(alpha=0.1, beta=0.01), lower=c(0,0))
Warning message:
In nls(C ~ N/(1 + alpha * (N - 1) + beta * N * (N - 1)), d, start = c(alpha = 0.1, :
Upper or lower bounds ignored unless algorithm = "port"
> usl
Nonlinear regression model
model: C ~ N/(1 + alpha * (N - 1) + beta * N * (N - 1))
data: d
alpha beta
4.980e-02 1.143e-05
residual sum-of-squares: 2.184

Number of iterations to convergence: 7
Achieved convergence tolerance: 1.344e-06
>

What happened? The "port" algorithm crashed. The "plinear" algorithm converged, but, since it doesn't accept the constraints, it generated a beta with a negative value. And the default algorithm converged and generated positive alpha and beta.

For now, let's assume that the default answer is "best". What does R tell us about significance?

> summary(usl)

Formula: C ~ N/(1 + alpha * (N - 1) + beta * N * (N - 1))

Parameters:
Estimate Std. Error t value Pr(>|t|)
alpha 4.980e-02 3.207e-03 15.527 8.36e-08 ***
beta 1.143e-05 6.933e-05 0.165 0.873
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.4926 on 9 degrees of freedom

Number of iterations to convergence: 7
Achieved convergence tolerance: 1.344e-06

So ... the alpha is statistically significant but the beta isn't.

To do this any better, you'd need to use an algorithm that did constrained optimization *and* converged on this test case. There are some available for R, but I'm not sure they offer much more insight than what we've done already. Beta does not appear to be significantly different from zero for this dataset.

metasoft said...

thank you both.

i was able to get R to derive the same result in the GCaP book and Ed's results, by playing with the starting value, namely, alpha=0.00001 and beta=0.00001, and continue to use the "port" algorithm:

> usl2 <- nls(
+ formula = C_p ~ p/(1 + alpha * (p-1) + beta * p * (p-1))
+ ,data = mydf2
+ ,start=c(alpha=0.00001,beta=0.00001)
+ ,algorithm="port"
+ ,lower=c(0,0,0)
+ )
>
> summary(usl2)

Formula: C_p ~ p/(1 + alpha * (p - 1) + beta * p * (p - 1))

Parameters:
Estimate Std. Error t value Pr(>|t|)
alpha 4.980e-02 3.207e-03 15.527 8.36e-08 ***
beta 1.143e-05 6.933e-05 0.165 0.873
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.4926 on 9 degrees of freedom

Algorithm "port", convergence message: relative convergence (4)

i need to read up some, and figure that Bates and Watts, as mentioned in the nls man page, is as good a place to start as any.

Anonymous said...

Hmmm ... what happens if you start with alpha = beta = 0?

metasoft said...

hi ed,

also converges with alpha=beta=0 to the correct result. makes 0 a good start value.

metasoft said...

hi neil,

i understand your hesitation with excel, but just sharing an idea.
in theory we should be able to turn the curve fitting problem into an optimization problem (minimize sum of squares) and thereby use excel solver. will share if i get some good results with excel solver.

Neil Gunther said...

Yep, I had already thought it could be possible to use the Excel Solver, but didn't want to spend the time on it. If you do come up with something, please let me know. Questions of precision still remain, of course.

Anonymous said...

Even if you use a sum of squares minimization, I still don't think Excel will do a constrained minimization. I'd stick with a statistics package with a built-in constrained non-linear least squares, because you get the significance tests "for free". :)

metasoft said...

hi ed,

excel is far from perfect, however, it is both familiar and ubiquitous, which makes it the perfect tool to start, which is in the spirit of being guerrilla.

on the flip side, i have a strong distrust with blackboxes that output numbers. when possible, i like to check the numbers with a couple of other statistical packages.

metasoft said...

some good news. i tested both data sets with excel solver and the results look good (error <= 5%). i'll send you what i have, and then look into how to redo sscalc.xls with solver. besides percentage error, what other fitting diagnostics would be a good idea to have? of course, this is going to be limited by what excel can do. i am open to coding some VBA functions too, but must admit i only have limited knowledge of VBA.