The Pith of Performance: How Long Should My Queue Be?

A simple question; there should be a simple answer, right? Guerrilla alumus Sudarsan Kannan asked me if a rule-of-thumb could be constructed for quantitatively assessing the load average on both dual-core and multicore platforms. He had seen various remarks, from time to time, alluding to optimal load averages.

To make this question more concrete, let's focus on a particularly well-known queue; the system run-queue. As I explain in Chapter 4 of my Perl::PDQ book, the load average metric is a measure of run-queue length. As far as optimal load averages are concerned, let's see what the experts tell us.

Tim O'Reilly and crew state on p. 726 ff. of UNIX Power Tools:
"The load average tries to measure the number of active processes at any time. As a measure of CPU utilization, the load average is simplistic, poorly defined, but far from useless. .... Ideally, you'd like a load average under, say, 3..."

Wonder where that "3" comes from? And does the load average really measure CPU utilization? I've seen values of load average that are greater than 100, for example, and you can't have more than 100% busy! (See Example 1 below)
Blair Zajac, author of the ORCA performance tool, states on his web site:
"If long term trends indicate increasing values, more or faster CPUs will eventually be necessary unless load can be displaced. For ideal utilization of your CPU, the maximum value here should be equal to the number of CPUs in the box."

Although there is a certain amount of truth to this statement (see Example 2 below), it is misleading because it implies that queues are bad. Nothing could be further from the truth. (See Example 1 below)

These opinions also seem to conflict with one another. The reason for the confusion stems from the fact that the original question is ill-posed. Several points of clarification need to be made about the load average metric:

It does not measure CPU utilization (although the word load can also mean that in performance analysis).
It is not a simple arithmetic average (its a damped moving average).
It does measure time-averaged run-queue length.
It was one of the earliest performance metrics originally introduced for the Multics project in the late 1960s.
There are 3 numbers reported because it was an attempt to indicate trend information before GUIs arrived.
Queue length means the number of processes running plus the number that are runnable (waiting).
The code has been broken from time to time in some OS releases. Therefore, you need to run your own suite of regression tests on all the performance metrics you rely on, including the load average.
It is a rather meaningless metric on its own, e.g., is a load average of 30 good or bad?

As with all benign performance questions, they just look simple. The original question about queue length is like asking, How long is a piece of string? A slightly cynical answer is, not so long that you strangle yourself with unintended consequences. Similarly with queues. They shouldn't be so long that they introduce unintended consequences. What might those consequences be?

Long queues correspond to long response times and it's really this latter metric that should get your attention. So, one consequence might be that a long queue causes "poor response times", but that depends on what poor means. There is usually an empirical disconnect between measured queue lengths and user-perceived response times. Another problem is that queue length is an absolute measure, whereas what is really needed is a relative performance measure. Even the words, poor and good are relative terms. Such a relative measure is called the Stretch Factor, which measures the mean queue length relative to the mean number of requests already in service. It is expressed in terms of service units.

What makes the stretch factor really useful is that it can be compared easily with service level targets. Service targets are usually expressed in some business units, e.g., quotes per hour is a common business unit of work in the insurance industry. The expected service level is called the service level objective or SLO, and is expressed as multiples of the relevant service unit. An SLO might be documented as: The average user response time is not to exceed 15 service units between the peak operating hours of 10 am and 2 pm. That, to us performance weenies, is the same thing as saying the SLO shall not exceed a stretch factor of 15.

If we denote by Q the measured value of the load average, and let ρ be the measured CPU utilization on an m-way multiprocessor or multicore, then the stretch factor (SF) can be estimated using the ratio:

SF = Q / mρ

In other words, it's really this ratio of the load average to the utilization that is a more meaningful performance indicator than the load average on its own.

Let's look at a couple of examples using the stretch factor based on PDQ models of (1) a large-scale email spam-scanning farm, and (2) a scientific number-crunching application.

Spam Canner
All the large web portals have email-spam analyzers these days. A typical configuration might comprise specialized servers, each raking over email text using a filtering tool like SpamAssassin. One such large-scale web portal, that was a client of mine, had a battery of between 50 to 100 SMP 4-way servers doing this job full tilt 7 by 24. The performance metric they were monitoring was, of course, the load average.

A simple M/M/4 model in PDQ gave results that were in excellent agreement with monitored queue lengths. Here is that model in PyDQ (PDQ in python):
```
#!/usr/bin/env python
import pdq

# Input parameters 
servers  = 4
arivrate = 0.66 # per min
servtime = 6.0  # seconds

# PyDQ model 
pdq.Init("SPAM Analyzer")
nstreams = pdq.CreateOpen("Email", arivrate)
nnodes   = pdq.CreateNode("spamCan", int(servers), pdq.MSQ)
pdq.SetDemand("spamCan", "Email", servtime)
pdq.Solve(pdq.CANON)
pdq.Report()
```
Running this model produces the following output as a PDQ Report:
```
                ******   RESOURCE Performance   *******

Metric          Resource     Work              Value   Unit
------          --------     ----              -----   ----
Throughput      spamCan      Email            0.0660   Trans/Sec
Utilization     spamCan      Email           99.0000   Percent
Queue length    spamCan      Email          100.7726   Trans
Waiting line    spamCan      Email           96.8126   Trans
Waiting time    spamCan      Email          146.6858   Sec
Residence time  spamCan      Email          152.6858   Sec
Stretch factor                               25.4476
```
On average, each of the 4 CPUs is 99% busy reading email text, so that between 3 and 4 emails are being processed with 96 to 97 emails waiting to be processed. The predicted load average is therefore about 100 because the average queue length is 100.78. The predicted stretch factor is 25.45, which means that it takes about 25 service units to get an email through the system. Since each email takes about 6 seconds to scan, the stretch factor corresponds to just over 2 minutes from the time an email message arrives at the spam canner until it gets deposited into the recipient's mailbox. This prediction matches the numerical estimate obtained from Q / mρ.

This stretch factor was considered acceptible under peak load conditions. In other words, the battery of 4-ways was meeting the agreed upon service level or SLA. That's now. What about the future? Since the CPUs are pretty much pegged out in this situation, it would be prudent to use the same PDQ model to start assessing what server capacity might be needed to meet the same service level objectives (SLOs) as email traffic grows over the next 6 to 12 months. The growth would be reflected in increasing CPU utlization. Since they're already near 100% busy (on average) at peak, you would need to add more 4-way boxes (for example) to handle the additional work. PDQ helps you to figure out how many additional boxes.
Number Cruncher
Now, let's return to the earlier remark (2) made in the context of the ORCA tool. We can use a similar PDQ model to see what it means to have no waiting line with all the CPUs busy. This time, we assume that each request is a serious number crunching job (e.g., pre-processing oil exploration data) that takes 10 hours to complete.
```
#!/usr/bin/env python
import pdq

# Measured parameters 
servers  = 4
arivrate = 0.099  # per hr
servtime = 10.0   # hrs

pdq.Init("Orca LA Model")
nstreams = pdq.CreateOpen("Crunch", arivrate)
pdq.SetWUnit("Jobs")
pdq.SetTUnit("Hours")

nnodes   = pdq.CreateNode("HPC", int(servers), pdq.MSQ)

pdq.SetDemand("HPC", "Crunch", servtime)
pdq.Solve(pdq.CANON)
pdq.Report()
```
The corresponding PDQ Report contains the following output:
```
                ******   RESOURCE Performance   *******

Metric          Resource     Work              Value   Unit
------          --------     ----              -----   ----
Throughput      HPC          Crunch           0.0990   Jobs/Hours
Utilization     HPC          Crunch          24.7500   Percent
Queue length    HPC          Crunch           0.9965   Jobs
Waiting line    HPC          Crunch           0.0065   Jobs
Waiting time    HPC          Crunch           0.0656   Hours
Residence time  HPC          Crunch          10.0656   Hours
```
There is no waiting line, and all 4 CPUs are busy; though only 25% utilized. How can this be? If we were to look at the CPU stats while the system was running, we observe that each CPU was actually 100% busy. To understand what PDQ is telling us, we need to look at the System Performance section of the PDQ report.
```
                ******   SYSTEM Performance     *******

Metric                     Value    Unit
------                     -----    ----
Workload: "Crunch"
Number in system          0.9965    Jobs
Mean throughput           0.0990    Jobs/Hours
Response time            10.0656    Hours
Stretch factor            1.0066
```
The stretch factor is 1, so there is no waiting line. On the other hand, it takes 10 hrs for each job to get through the system, so the Response Time is about 10 hours. But, remember, PDQ makes predictions based on steady state behavior, i.e., the behavior over the long term. Therefore, with a service time of 10 hours, we really need to observe the system for much longer than that to see what it looks like in steady state. Much longer here, means on the order of 100 hours or longer. We don't really need to do that, but PDQ is telling about how things would look if we did. In such a long measurement period, we would so no new arrivals. By Little's law: ρ = λ S, and ρ < 1. Since S is relatively large, λ has to be correspondingly small, which means ρ = 25%. Looking at the system for a few minutes while it is running, is really an instantaneous snapshot of the system, rather than a steady-state view.

In summary, to address the original question about queue length, we have shown that the stretch factor SF is a better relative performance indicator than the absolute load average (LA).

SF = Q / mρ

The reasons SF is better than the LA are:

The load average metric is a measure of run-queue length (Q)
Queue length measures the total number of requests; both waiting and in service
Queue length on its own is not very meaningful because its an absolute number
You also need to know the number of configured CPUs (m) and their average utilization (ρ)
The stretch factor is compared with established service level targets (SLOs, SLAs)

1 comment:

Anonymous said...: A couple of comments:

1. A little digging in the kernel source shows that the load average is the moving average of the total number of threads in the R (running or runnable) and the D (uninterruptible sleep/disk wait) state. I'm still trying to understand why that is, but in the mean time, if you want a moving average of the processes in just the R state, you have to do it yourself.

2. The "slow-medium-fast exponential moving average" trend finding trick has been well known for many years among technical stock and futures traders, and they apparently discovered it from artillery tracking or something like that.

3. The 64 dollar question in my mind is: you have a machine with four hyperthreaded processors, giving eight logical processors. Should the formula be

a. Q/(4*rho),
b. Q/(8*rho), or
c. Q/(x*rho) for some x between 4 and 8?; Thursday, June 28, 2007 at 9:14:00 PM PDT

Wednesday, April 18, 2007

How Long Should My Queue Be?

1 comment: