Tuesday, May 5, 2009

Queues, Schedulers and the Multicore Wall

The other day, I came across a blog post entitled "Server utilization: Joel on queuing", so naturally I had to stop and take a squiz. The blogger, John D. Cook, is an applied mathematician by training and a cancer researcher by trade, so he's quite capable of understanding a little queueing theory. What he had not heard of before was a rule-of-thumb (ROT) that was quoted in a podcast (skip to 00:26:35) by NYC software developer and raconteur, Joel Spolsky. Although rather garbled, as I think any Guerrilla graduate would agree, what Spolsky says is this:
If you have a line of people (e.g., in a bank or a coffee shop) and the utilization of the people serving them gets above 80% , things start to go wrong because the lines get very long and the average amount of time people spend waiting tends to get really, really bad.

No news to us performance weenies, and the way I've sometimes heard it expressed at CMG is:
No resource should be more than 75% busy.
Personally, I don't like this kind of statement because it is very misleading. Let me explain why.

As John Cook discovered, what's being referred to in these ROTs, is the classic knee that appears in all response time curves, such as those shown in the following diagram.

These curves assume an M/M/m queue, which means exponentially distributed inter-arrival periods and service periods. The x-axis is the server or processor utilization (ρ) and the y-axis is the response time (R) of the system (normalized by the service time S). The y-units are mean number of service periods. Strictly speaking, R is not identical to the waiting time (W), but that's a technical detail in this context.

The response time is always proportional to the queue length. Let's consider each case, starting with the top curve.

  1. Single Processor

    The top curve represents the response time for a single-server queue, e.g., the checkout at a grocery store. Once you've chosen a line, you're stuck with the service rate of that cashier. Here, I'm excluding the possibility of defection to another line, just to keep the discussion focused. In the case of a computer system, the waiting line is the scheduler's run-queue.

    Clearly, at around 80% utilization (ρ = 0.8) of the cashier, the response time curve does start to climb significantly. But that doesn't necessarily mean it's a problem or "things are going wrong." Why?

    First, this is a plot of average response time or, more formally, the statistical mean. It does not show fluctuations (i.e., variance about the mean). A system that is 80% utilized will always have response times that climb significantly. The real question is, how sustained are those longer response times? That's where things like percentile response times or queue lengths, come in.

    Second, there are better ROTs:

    1. Light Traffic: At low utilization (near ρ = 0), R/S = 1 which means that the response time is identical to 1 service period (S); the time it takes to ring up your own groceries because there was nobody ahead of you at the checkout.
    2. Moderate Traffic: At intermediate utilization (near ρ = 0.5), R/S = 1/2 which means that the response time R = 2S has climbed to 2 service units. On average, one person will be getting their groceries rung up when you arrive at the checkout.
    3. Heavy Traffic: At high utilization (near ρ = 0.75), R/S = 3/4 which means that the response time R = 4S has climbed to 4 service units. In other words, your response time has doubled again even though the server only became 50% busier than case (b), i.e., from 0.50 to 0.75. On average, 1 person will be getting their groceries rung up and 2 people will be waiting in line when you arrive.

    This breakdown is necessary due to the intrinsic nonlinear characteristic of the queue and that’s what makes queueing theory rather unintuitive (even for experts). That's also why we need performance models to understand performance data.

    Moreover, I find these ROTs (a)--(c) to be more practical for on-the-spot performance analysis. Believe me, I've caught quite a few people out by slinging these ROTs at their "pristine" performance measurements. Guerrilla Mantra 2.13: If the measurements don't agree with the model, change the measurements.

    Now let's look at the other curves.

  2. The Multicore Wall

    The other curves in the above plot correspond to a single waiting-line with multiple servers, like the multiple bank tellers mentioned by Joel Spolsky. Once again, the waiting line represents the scheduler's run-queue in a computer system. As the number of servers or cores is increased, it is evident that the knee in the curve becomes sharper in the direction of the lower right-hand corner of the above graph. In fact, the locus of the knee appears to move along the diagonal dashed-line. Immediately, you may be wondering if that dashed line provides yet another ROT by which to simply characterize these curves. That's a great question, but unfortunately that line turns out to be an optical illusion. But I digress.

    More importantly, notice that the lower curve (in blue) remains quite flat right up to about 95% busy. This curve corresponds to 64-way processors (virtual or physical, doesn't matter for this discussion). Only above that traffic level (ρ = 0.95) does the response time increase, and it increases very suddenly; much more suddenly that the single processor case. In fact, the gradient is nearly infinite!

    So, on this curve, R = S above ρ = 0.75 (CMG) and even at ρ = 0.80 (Joel). Neither the queue-length nor the response time increase significantly and therein lies the rub. If you've already plunked down the bucks for your 64-way multicore box, you had better be making use of them as often as possible in order to justify the expense of the hardware or software dev or both. In performance lingo, you had better utilize them to the max, which in this case means, you had better be running them near 95% busy. But ... and here's the catch, .... no busier!

    Put differently: You bought a Ferrari, you need to drive it at top speed (otherwise, what's the point?), but if you redline it for too long, the engine will blow up. The redline represents a barrier or wall; the multicore wall in this case.

To sum all this up, a better ROT in my view would be:
If a resource is more than 75% busy for sustained periods, take a closer look to see whether or not that level of utilization is acceptable in terms of response time or queue length.
Not as simple as CMG or Joel perhaps, but more realistic. Everything in performance analysis involves trade-offs, which are often subtle and defy common sense. But, as I say in the epigram of my Perl PDQ book,
Common sense is the pitfall of performance analysis.


John Allspaw said...

"If a resource is more than 75% busy for sustained periods, take a closer look to see whether or not that level of utilization is acceptable in terms of response time or queue length."

I couldn't agree more, and agree with your pointing out that ROT's can be notoriously misleading without the standard caveat of "your mileage may vary".

I might add further that response times aren't very valuable out of context of the entire system. Most growing web architectures have multiple tiers, each with their own load characteristics, latency profiles, and failure modes. So you can't really get the entire picture of your specific response time-to-resource-usage pain points without that inter-system context. It's coincidental that I wrote about that topic this morning:


p.s. I'm a fan of your book and blog. :)

michele said...

Very interesting post.
I have a question. Would it be correct to model a dual core machine as a multiserver queueing system with 2 servers?
And what about a cluster of multi-core servers?, e.g., should 10 dual-core nodes be modeled as (a) an M/M/20 system, or rather as (b) an M/M/10 queue whose service rate is twice that of option (a)?

michele said...

Very interesting post.
I have a question. Would it be correct to model a dual core machine as a multiserver queueing system with 2 servers?
And what about a cluster of multi-core servers?, e.g., should 10 dual-core nodes be modeled as (a) an M/M/20 system, or rather as (b) an M/M/10 queue whose service rate is twice that of option (a)?

Neil Gunther said...

michele: This is the kind of question everyone who tries to model any computer system asks; even me. :)

The first unhelpful thing to get out of your head is the word "correct." There is no correct model.
All models are wrong, but some are wronger than others. It all about finding the best approximation.

The real test is, how well any choice of model matches the data or whatever other constraints you are trying to meet. Within those boundary conditions, the simplest model that does the job is usually best choice. So, depending on your data or other goals, M/M/2 might be perfectly fine for a dual-core model.

For the cluster, you need to think about where requests wait if they can't get serviced (all cores busy); assuming they're not just dropped on the floor. In other words: Is there one or more waiting lines? M/M/20, for example, can only have a single waiting line, by definition. Now ask yourself: What does that single waiting line or buffer correspond to in the real cluster?

It might be the run-queue of the O/S or a load-balancer or ... it might not be there at all. In which case it can't be M/M/20. The important point here is, the queueing models are already forcing you to understand more clearly how the cluster operates from a performance standpoint.

For a broader view and more exampls along these lines, take a look at Chap. 7 of my Guerrilla CaP book. There, I model a dual core system HP ML530 server, not as M/M/4 but M/M/4/16. Reason: the system only has a finite number of active threads (16) in the test rig and that has more significant ramifications for performance than the number of servers.

All of this (and more) is discussd in great detail in my Guerrilla training classes.

michele said...

Hello Neil, thanks for your quick answer.
OK, I understand perfectly what you mean, so I'll go into further details.
There is actually no waiting (no queue), so if all servers are busy, further jobs are lost (i.e, Erlang-B).
Now, suppose that each server has a maximum of 10 connections, e.g., we can serve at most 200 concurrent connections with our 10 dual core servers.
Would Erlang-B be the best choice (either M/M/10/10 or M/M/20/20)? A truncated Erlang-C (M/M/n/K/FIFO) seems inaccurate to me, as the servers work in processor sharing.

Neil Gunther said...

It's very hard to make any further progress when you haven't stated what perf question you are trying to address.

Quoting: "at most 200 concurrent connections" suggests to me a finite N=200 or M/M/m/N/N queue, which isn't Erlang-anything.

A lossy Erlang-B queue suggests you might have a batch system in mind (viz., as if additional batch requests would be dropped). However, in steady state equilibrium, you expect to see some waiting time, even in a batch system, because the observation period T >> S mean service period. To avoid any waiting time at all in steady state, would require infinite servers (delay node).