*Sudarsan Kannan* asked me if a **rule-of-thumb** could be constructed for quantitatively assessing the load average on both dual-core and multicore platforms. He had seen various remarks, from time to time, alluding to optimal load averages.

To make this question more concrete, let's focus on a particularly well-known queue: the system **run-queue**. As I explain in Chapter 4 of my Perl::PDQ book, the load average metric is a measure of run-queue length. As far as optimal load averages are concerned, let's see what the experts tell us.

- Tim O'Reilly and crew state on p. 726 ff. of *UNIX Power Tools*:
  *"The load average tries to measure the number of active processes at any time. As a measure of CPU utilization, the load average is simplistic, poorly defined, but far from useless. ... Ideally, you'd like a load average under, say, 3 ..."*

  Wonder where that "3" comes from? And does the load average really measure CPU utilization? I've seen load average values greater than 100, for example, and you can't be more than 100% busy! (See **Example 1** below.)

- Blair Zajac, author of the ORCA performance tool, states on his web site:
  *"If long term trends indicate increasing values, more or faster CPUs will eventually be necessary unless load can be displaced. For ideal utilization of your CPU, the maximum value here should be equal to the number of CPUs in the box."*

  Although there is a certain amount of truth to this statement (see **Example 2** below), it is misleading because it implies that queues are bad. Nothing could be further from the truth. (See **Example 1** below.)

These opinions also seem to conflict with one another. The confusion stems from the fact that the original question is **ill-posed**. Several points of clarification need to be made about the load average metric:

- It does **not** measure CPU utilization (although the word *load* can also mean that in performance analysis).
- It is **not** a simple arithmetic average (it's a damped moving average).
- It **does** measure **time-averaged** run-queue length.
- It was one of the earliest performance metrics, originally introduced for the Multics project in the **late 1960s**.
- There are **3 numbers** reported because it was an attempt to indicate trend information before GUIs arrived.
- Queue length means the number of processes **running** plus the number that are **runnable** (waiting).
- The code has been **broken** from time to time in some OS releases. Therefore, you need to run your own suite of **regression tests** on all the performance metrics you rely on, including the load average.
- It is a rather **meaningless** metric on its own, e.g., is a load average of 30 good or bad?
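Since the load average is a damped moving average rather than an arithmetic one, a minimal sketch may make the distinction concrete. This is illustrative only, not kernel code: the 5-second sampling interval and 1-minute damping window are assumptions chosen to mimic the 1-minute load average.

```python
import math

def damped_load(samples, period=60.0, interval=5.0):
    """Exponentially damped moving average of run-queue samples,
    in the spirit of the UNIX 1-minute load average (illustrative)."""
    decay = math.exp(-interval / period)
    avg = 0.0
    for n in samples:  # n = instantaneous run-queue length at each sample
        avg = decay * avg + (1.0 - decay) * n
    return avg

# A run-queue pinned at 4 converges toward a load average of 4,
# whereas a single brief spike barely moves the average at all.
print(round(damped_load([4] * 200), 2))       # 4.0
print(damped_load([0] * 100 + [10]) < 1.0)    # True
```

The damping is why a sustained queue dominates the reported number while transient spikes decay away, and why the three reported values (1-, 5-, 15-minute) differ only in the damping window.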

As with all benign performance questions, they just *look* simple. The original question about queue length is like asking: how long is a piece of string? A slightly cynical answer is, not so long that you strangle yourself with unintended consequences. Similarly with queues: they shouldn't be so long that they introduce unintended consequences. What might those consequences be?

Long queues correspond to long **response times**, and it's really this latter metric that should get your attention. So, one consequence might be that a long queue causes "poor response times", but that depends on what *poor* means. There is usually an empirical disconnect between measured queue lengths and user-perceived response times. Another problem is that queue length is an **absolute** measure, whereas what is really needed is a **relative** performance measure. Even the words *poor* and *good* are relative terms. Such a relative measure is called the **Stretch Factor**, which measures the mean queue length relative to the mean number of requests already in service. It is expressed in terms of *service units*.

What makes the stretch factor really useful is that it can be compared easily with **service level targets**. Service targets are usually expressed in some business units; e.g., *quotes per hour* is a common business unit of work in the insurance industry. The expected service level is called the **service level objective** or SLO, and is expressed as multiples of the relevant service unit. An SLO might be documented as: *The average user response time is not to exceed 15 service units between the peak operating hours of 10 am and 2 pm.* That, to us performance weenies, is the same thing as saying the SLO shall not exceed a **stretch factor** of 15.
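Because an SLO stated in service units is just a stretch-factor bound, converting it back to a raw response-time target is a single multiplication. A tiny sketch; the 6-second service time here is a hypothetical value, not part of the SLO quoted above:

```python
def slo_response_bound(sf_target, service_time):
    """Maximum mean response time implied by a stretch-factor SLO."""
    return sf_target * service_time

# Hypothetical: an SLO of 15 service units, with a 6-second service
# time per request, caps the mean response time at 90 seconds.
print(slo_response_bound(15, 6.0))  # 90.0
```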

If we denote by Q the *measured* value of the load average, and let ρ be the *measured* CPU utilization on an m-way multiprocessor or multicore, then the **stretch factor** (SF) can be *estimated* using the ratio:

SF = Q / (m ρ)

In other words, it's really this **ratio** of the load average to the utilization that is a more meaningful performance indicator than the load average on its own.
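The estimator is trivial to compute from monitored values. A minimal sketch; the numbers below are hypothetical measurements, not taken from either of the examples that follow:

```python
def stretch_factor(Q, m, rho):
    """Estimate SF from the measured load average Q, the number of
    CPUs/cores m, and the per-CPU utilization rho (as a fraction)."""
    return Q / (m * rho)

# Hypothetical: a load average of 12 on a 4-way box running 75% busy
print(round(stretch_factor(12.0, 4, 0.75), 2))  # 4.0
```

Note that ρ must be the per-CPU utilization expressed as a fraction, not a percentage, or the ratio will be off by a factor of 100.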

Let's look at a couple of examples using the **stretch factor**, based on PDQ models of (1) a large-scale email spam-scanning farm, and (2) a scientific number-crunching application.

**Spam Canner**

All the large web portals have email-spam analyzers these days. A typical configuration might comprise specialized servers, each raking over email text using a filtering tool like SpamAssassin. One such large-scale web portal, a client of mine, had a battery of between 50 and 100 SMP 4-way servers doing this job full tilt, 24 by 7. The performance metric they were monitoring was, of course, the load average.

A simple M/M/4 model in PDQ gave results in excellent agreement with the monitored queue lengths. Here is that model in PyDQ (PDQ in Python):

```python
#!/usr/bin/env python
import pdq

# Input parameters
servers   = 4
arrivrate = 0.66   # emails per second
servtime  = 6.0    # seconds per email

# PyDQ model
pdq.Init("SPAM Analyzer")
nstreams = pdq.CreateOpen("Email", arrivrate)
nnodes   = pdq.CreateNode("spamCan", int(servers), pdq.MSQ)
pdq.SetDemand("spamCan", "Email", servtime)
pdq.Solve(pdq.CANON)
pdq.Report()
```

Running this model produces the following output in the PDQ report:

```
****** RESOURCE Performance *******
Metric          Resource   Work    Value      Unit
------          --------   ----    -----      ----
Throughput      spamCan    Email   0.0660     Trans/Sec
Utilization     spamCan    Email   99.0000    Percent
Queue length    spamCan    Email   100.7726   Trans
Waiting line    spamCan    Email   96.8126    Trans
Waiting time    spamCan    Email   146.6858   Sec
Residence time  spamCan    Email   152.6858   Sec

Stretch factor                     25.4476
```

On average, each of the 4 CPUs is 99% busy reading email text, so that between 3 and 4 emails are being processed, with 96 to 97 emails waiting to be processed. The predicted **load average** is therefore about 100, because the average queue length is 100.78. The predicted **stretch factor** is 25.45, which means it takes about 25 service units to get an email through the system. Since each email takes about 6 seconds to scan, the stretch factor corresponds to just over 2 minutes from the time an email message arrives at the spam canner until it gets deposited into the recipient's mailbox. This prediction matches the numerical estimate obtained from Q / (m ρ).
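As a cross-check on these PDQ numbers, the same M/M/4 queue can be solved directly with the classic Erlang-C formula. The sketch below reproduces the reported stretch factor; the 0.5% and 2% traffic-growth factors are hypothetical, added only to show how violently the stretch factor degrades this close to saturation:

```python
from math import factorial

def erlang_c(m, a):
    """Probability that an arrival must wait in an M/M/m queue,
    where a = lambda * S is the offered traffic in Erlangs (a < m)."""
    idle_terms = sum(a**k / factorial(k) for k in range(m))
    wait_term = a**m / (factorial(m) * (1.0 - a / m))
    return wait_term / (idle_terms + wait_term)

def stretch_factor_mmm(m, lam, S):
    """SF = mean residence time / mean service time for M/M/m."""
    a = lam * S
    if a >= m:
        return float("inf")            # unstable: queue grows without bound
    W = erlang_c(m, a) * S / (m - a)   # mean waiting time
    return (W + S) / S

# Today's traffic (0.66 emails/s, 6 s each, 4 CPUs), then hypothetical
# growth of 0.5% and 2%:
for growth in (1.00, 1.005, 1.02):
    print(growth, round(stretch_factor_mmm(4, 0.66 * growth, 6.0), 2))
```

At today's load this returns a stretch factor of about 25.45, matching the PDQ report; a mere 0.5% more traffic roughly doubles it, and 2% pushes the 4-way past saturation entirely. That cliff-edge behavior at ρ ≈ 1 is why capacity planning here is a question of adding boxes, not tuning.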

This stretch factor was considered acceptable under peak load conditions. In other words, the battery of 4-ways was meeting the agreed-upon **service level agreement** or SLA. That's now. What about the future? Since the CPUs are pretty much pegged in this situation, it would be prudent to use the same PDQ model to start assessing what server capacity might be needed to meet the same **service level objectives** (SLOs) as email traffic grows over the next 6 to 12 months. The growth would be reflected in increasing CPU utilization. Since they're already near 100% busy (on average) at peak, you would need to add more 4-way boxes (for example) to handle the additional work. PDQ helps you figure out how many additional boxes.

**Number Cruncher**

Now, let's return to the earlier remark (2) made in the context of the ORCA tool. We can use a similar PDQ model to see what it means to have no waiting line with all the CPUs busy. This time, we assume that each request is a serious number crunching job (e.g., pre-processing oil exploration data) that takes 10 hours to complete.

```python
#!/usr/bin/env python
import pdq

# Measured parameters
servers   = 4
arrivrate = 0.099  # jobs per hour
servtime  = 10.0   # hours per job

pdq.Init("Orca LA Model")
nstreams = pdq.CreateOpen("Crunch", arrivrate)
pdq.SetWUnit("Jobs")
pdq.SetTUnit("Hours")
nnodes   = pdq.CreateNode("HPC", int(servers), pdq.MSQ)
pdq.SetDemand("HPC", "Crunch", servtime)
pdq.Solve(pdq.CANON)
pdq.Report()
```

The corresponding PDQ report contains the following output:

```
****** RESOURCE Performance *******
Metric          Resource   Work     Value     Unit
------          --------   ----     -----     ----
Throughput      HPC        Crunch   0.0990    Jobs/Hours
Utilization     HPC        Crunch   24.7500   Percent
Queue length    HPC        Crunch   0.9965    Jobs
Waiting line    HPC        Crunch   0.0065    Jobs
Waiting time    HPC        Crunch   0.0656    Hours
Residence time  HPC        Crunch   10.0656   Hours
```

There is virtually no waiting line, yet the reported utilization is only 25%, even though each CPU is flat out while it's crunching a job. How can this be? If we were to look at the CPU stats while the system was running, we would observe each busy CPU at 100%. To understand what PDQ is telling us, we need to look at the System Performance section of the PDQ report.

```
****** SYSTEM Performance *******
Metric             Value     Unit
------             -----     ----
Workload: "Crunch"
Number in system   0.9965    Jobs
Mean throughput    0.0990    Jobs/Hours
Response time      10.0656   Hours

Stretch factor     1.0066
```

The stretch factor is 1, so there is no waiting line. On the other hand, it takes 10 hours for each job to get through the system, so the response time is about 10 hours. But remember, PDQ makes predictions based on **steady-state** behavior, i.e., the behavior over the long term. Therefore, with a service time of 10 hours, we really need to observe the system for much longer than that to see what it looks like in steady state. Much longer here means on the order of 100 hours or more. We don't really need to do that, but PDQ is telling us how things would look if we did. In such a long measurement period, we would also see long stretches with no new arrivals. By Little's law in its utilization form, ρ = λS/m, with ρ < 1. Since S is relatively large, λ has to be correspondingly small, which means ρ = 25%. Looking at the system for a few minutes while it is running is really an **instantaneous** snapshot of the system, rather than a steady-state view.
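The arithmetic behind that last step can be checked directly against the PDQ report:

```python
lam, S, m = 0.099, 10.0, 4   # jobs/hour, hours/job, CPUs
rho = lam * S / m            # utilization law: per-CPU utilization
print(round(100 * rho, 2))   # 24.75, matching the reported Percent
```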

In summary, to address the original question about queue length, we have shown that the **stretch factor** SF is a better relative performance indicator than the absolute load average (LA):

SF = Q / (m ρ)

The reasons SF is better than the LA are:

- The **load average** metric is a measure of run-queue length (Q).
- Queue length measures the **total number of requests**: both waiting and in service.
- Queue length on its own is not very meaningful because it's an **absolute** number.
- You also need to know the number of configured CPUs (m) and their average utilization (ρ).
- The stretch factor can be compared with established service level targets (**SLOs, SLAs**).

## 1 comment:

A couple of comments:

1. A little digging in the kernel source shows that the load average is the moving average of the total number of threads in the R (running or runnable) *and* the D (uninterruptible sleep/disk wait) states. I'm still trying to understand why that is, but in the meantime, if you want a moving average of the processes in just the R state, you have to do it yourself.

2. The "slow-medium-fast exponential moving average" trend-finding trick has been well known for many years among technical stock and futures traders, and they apparently discovered it from artillery tracking or something like that.

3. The 64 dollar question in my mind is: you have a machine with four hyperthreaded processors, giving eight logical processors. Should the formula be

   a. Q/(4*rho),
   b. Q/(8*rho), or
   c. Q/(x*rho) for some x between 4 and 8?
