Wednesday, September 9, 2009

This is Your Measurements on Models

When it comes to analyzing scalability data, I've stressed the importance of bringing measurements and models together. Some recent conversations with people who are just beginning to model their scalability data using the Universal Scalability Law (USL), have led me to realize that I have not made the steps behind the procedure as clear as I might have. So, let me address that here.

There are 2 sides to the performance modeling coin:
  1. Your data. To apply the USL model, this merely requires gathering a set of throughput measurements. Let's call them: X(1), X(2), X(3), ..., X(N), where N is the number of virtual users, for example.
  2. Your model. You are free to choose any model you like (but choose carefully). In my Guerrilla Capacity Planning book I claim the USL is the best choice for quantitatively assessing application scalability.
The objective of performance modeling can be thought of as being equivalent to matching your data to your model. The bridge between the world of data and the world of models is the quantity I call the relative capacity, C(N). Schematically, this can be represented as:

Data  ⇒   X(N)
        C(N)        N
1 + α (N − 1) + β N(N − 1) 
  ⇐   Model (1)

The left-hand side indicates how your raw data is first normalized to produce the actual values of C(N). The right-hand side indicates how the USL model provides the expected values of C(N).

Your data are matched to the USL model by adjusting the value of the α and β parameters to obtain the best fit. Technically, that's accomplished through the magic of statistical regression. This video shows how it's done for the case of a simple linear model rather than the nonlinear USL model.

Graphically, the procedure looks something like this.

Notice how the roll off or maximum in the relative throughput C(N) is only manifest at user loads (N) much higher than were actually measured. Magic! That kind of look-ahead insight is what makes USL modeling important.

Any questions?


Unknown said...

Hi Neil,

I've just finished your GCaP book; very thought-provoking. Thanks!

I've applied the procedure you lay out in chap 5 (using R) to some of my own application test measurements, and have a question.

My test measurements include response-time as well as throughput. The measured response-time vs N conforms to the expected "hockey stick". This makes me happy and gives me some confidence that my measurements are not completely wrong.

I perform the USL computations described in chap 5.

Using USL-modeled values to compute response-time (as in GCaP Sec 6.7.3), the resulting modeled-response-time vs N does not have a hockey-stick shape, and in some cases doesn't closely match the measured response-time values.

In general, the modeled-response-time vs N either looks like a straight line (if coherency coef = 0) or parabolic (if coherency coef > 0).

Can you shed some light on how/why measured response time curves are (or should be) like a hockey-stick, while the USL-modeled curves are not?

Could this be a problem with my calculations?

Thank you very much!

Neil Gunther said...

Very difficult for me to comment without a specific example. Can you send me some sample data?

Neil Gunther said...

Andy sent me some algebra via email, and now that I understand his question better, I will abbreviate it here.

We know from my previous post of AUGUST 22, 2009
like.html that, as a general matter, response time R(N) and throughput X(N) are inversely related. The relative throughput C(N) is given by the
USL model on the RHS of eqn.(1) above. By definition, the denominator of C(N) is a polynomial of degree 2 in N. Inverting C(N), to get the
corresponding R, will produce a quadratic form in N. A quadratic function is a curve, a parabola, and not the linear "hockey stick" handle that we expect from standard queueing theory (for a closed system). Andy's question is, "What gives!?"

First off, I want to acknowledge that this is an excellent question which, as far as I can recall, no one has asked before. Curiously perhaps,
although I do discuss this in my GCaP classes (which is why you should attend them), I haven't written it down anywhere, that I know of.

Andy is quite correct ... and so is the USL. Here's what it means. Let's assume for the moment that the β coefficient is zero (i.e., no coherency delays). Then, the corresponding response time would be linear, as expected. Above saturation, the throughput X(N) would plateau at a constant value and R(N) would climb linearly with the increasing user-load N.

When β > 0, however, the throughput becomes retrograde (not seen in standard queueing models). Since it takes longer to complete each request, the response time R(N) must increase faster than linear! The quadratic form of R accounts for this "super linear"
behavior by bending the expected hockey stick upward.

So, all is well with the USL. :)

MariuszW said...

Can USL be applied in all situations? In your GCAP (6.7.4 Why It Works) you write about "homogeneity". I write currently JMeter tests to measure scalability fo mutltitier application - load balance + weblogics + database - and homogeneity is still im my head:). Application has web services that will be loaded. That web services handle different functionality. So, can I use USL or rather PDQ (PPDQ) and queue model?

Thanks for your books and articles.


Neil Gunther said...

Hi MW,

I understand your confusion because the statements in my GCaP book are a bit more ambiguous than I intended. What I was trying to convey is this.

The power of the USL comes from its simplicity, e.g., you don't need any queueing models or simulations. Moreover, there is nothing in the formulation of the USL that restricts its utility to any specific systems, e.g., Linux, Intel, multi-cores or multi-tiers. It can model them all. It is truly universal.

On the other hand, everything involves trade-offs.
One such trade-off in the USL is that the workload is treated as an aggregation of whatever is actually executing on the system. As you describe, there will be many components of hardware and software in the real system. Nonetheless, the X(N) used as INPUT into the USL and the X(N) predicted as an OUTPUT of the USL, is the aggregate throughput.

There can be no resolution into individual workload components, as one might be able to do in PDQ. It's in that sense that the USL views the system as homogeneous. There is no way to explicitly express any heterogeneity, even though it exists in the real system.

How can the USL be USefuL if workload heterogeneity is not captured? The answer lies in my use of the word "explicit." I said it is not represented explicitly; I didn't say it's not represented as all. So, you may well ask, where is it? It's encoded in the 2 coefficients: α and β. This encoding is one-way and that's what makes the USL universally applicable but also limited in what information it can reveal predictively.

Another way to think of this limitation is the following. Suppose you apply the USL and it reveals a larger than desirable value of the β coefficient and therefore the throughput is predicted to roll off at some load value. You cannot then turn around and ask the USL model to tell you which of the multifarious workload components is responsible for that roll off. However, as described in my 2008 CMG "Zones" paper, the USL can certainly direct your attention at the right tuning effects to improve scalability. See

Thanks for your question.

MariuszW said...

So question (or two) again :)
I understand that system characteristic is in alpha and beta and the interpretation is the key. And that there is aggregation.

In GCaP (in table 5.1) there are ray tracing benchmark results - what is workload (users, tasks) described in such case? Numbers? Is it Xmax on given processor in this table? - so processor p1 is loaded to measure Imax, next p4 is loaded to measuer Imax? - isn't task or user number important for given p-Imax?

I want to use USL - C(p) version - but in my case p is web server numbers (Tomcat) hosting web application (java) run on three physical machines - I want to estimate scalability - how many new Tomcat (or physical machines) is required with growing workload (new users using web services on Tomcats) - there are three Tomcats on one physical machine and three physical machines now, so 3x3=9 Tomcats.
I want to construct C(p) chart gathering benchmark data using Jmeter as load generator.
Does treatment p as web server number is good idea? Does X for gven p is Xmax for that p (so different users number in load generator has to be used to find that Imax)?

Neil Gunther said...

Hi again MariuszW,

I actually started putting together a reply to your question on Fri, 16 Apr 2010, but ran into 2 problems:

1. I had to travel out of town that weekend/week, but I did manage to get a response from one of the authors of the BRLCAD benchmark that you asked about.

2. To answer your question(s) properly requires more space than these little Q&A boxes provide.

Therefore, I'm putting together a separate blog post, so please bear with me.