The Pith of Performance: Hockey Elbow and Other Response Time Injuries

You've heard of tennis elbow. Well, there's a non-sports, performance injury that I like to call hockey elbow. An example of such an "injury" is shown in Figure 1, which appeared in a recent computer performance analysis presentation. It's a reminder of how easy it is to become complacent when doing performance analysis and possibly end up reaching the wrong conclusion.

Figure 1. Injured response time performance

Figure 1 is seriously flawed for two reasons:

It incorrectly shows the response time curve with a vertical asymptote.
It compounds the first error by employing a logarithmic x-axis.

The relationship between performance metrics is generally nonlinear. That's what makes performance analysis both interesting and hard. Your brain has evolved to think linearly, whereas computer systems behave nonlinearly. As a consequence, your brain needs help to comprehend those nonlinear effects. That's one reason plotting performance data can be so helpful—as long as it's done correctly.

Nonlinearity is a class, not a thing.

Response-time data belongs to a class of nonlinearity that is best characterized by a convex function. That's a mathematical term that simply means the plotted data tends to curve upward and away from the x-axis.

Figure 2a. Elbow response time profile

Figure 2b. Hockey-stick response time profile

Although there are many possible ways data can trend upward, the convexity of response-time data is limited to the two monotonically increasing cases shown in Figure 2, viz., an elbow curve (Fig. 2a) or a hockey stick curve (Fig. 2b). The distinction is easily understood by noting the following visual characteristics:

Elbow form: the blue curve representing the response time data trends upward so as to follow the 90-degree bend formed by the x-axis and the vertical line on the right side of the plot, i.e., a vertical asymptote.
Hockey stick: the blue curve representing the response time data trends upward but at an angle much wider than 90 degrees and runs along the asymptote represented, quite literally, by the hockey stick handle.

As the "load" increases, both response time curves go off to infinity, but at very different rates. The elbow slope is vertical (cf. Figure 1) while the hockey stick slope is inclined at a shallower angle.

The burning question now becomes, what determines whether the response time data follows the elbow or the hockey stick profile? The answer hinges on the use of the word "load"—one of the most overloaded words in the performance lexicon.

Pay very close attention to the metric displayed on the x-axis.

It should be emphasized that the one metric "load" cannot represent is time. When carrying out load tests or running internal benchmarks, tools like LoadRunner or JMeter will usually present performance metrics as a time series, like Figure 3, while the data are being collected.

Figure 3. Monitored response times

Here, however, I am talking about post-processed performance metrics, where the collected data has been time-averaged. In Figure 3, the time average is represented by the height of the horizontal red line, i.e., a single number that represents the average response time during that measurement period. An example of multiple post-processed response-time measurements is shown in Figure 4. The timestamps have been eliminated. Each data point in Figure 4 is derived from the equivalent of the red line in Figure 3.
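The time-averaging step can be sketched in a few lines of R. The sampled response times and the load value below are invented for illustration and are not the data behind Figure 3; the point is only that one whole measurement period collapses to a single number.

```r
# Hypothetical response-time samples (msecs) collected during ONE
# measurement period at a fixed load. These values are invented.
samples <- data.frame(
  t  = 1:8,                                # timestamps (secs)
  rt = c(12, 15, 11, 14, 13, 16, 12, 11)   # response times (msecs)
)

# Time average over the period: the height of the horizontal line.
rt_avg <- mean(samples$rt)

# One post-processed data point per load level; the timestamps are dropped.
point <- data.frame(load = 10, R = rt_avg)  # load = 10 is an assumed value
print(point)
```

Repeating this for each load level in a test gives one row per level, which is the kind of table that gets plotted in place of the raw time series.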

Time series performance data has to be time-averaged.

Note the resemblance of the curves in Figure 4 to the hockey stick curve in Figure 2b. In this case, the load represents the "number of concurrent requests" in the system under test (SUT). The testing harness was the Apache HTTP server benchmarking tool.

Figure 4. Measured response time hockey sticks [Source: Juicebox Games]

Why do these data have a hockey stick profile? Each request that is issued by a load generator traverses the SUT in order to be processed and returns its result (e.g., status or data) to the client load script. The script may be programmed to represent an additional user delay, called the think time, before submitting the next request. The foot of the hockey stick corresponds to light load due to only a few load generators (between 1 and 10) issuing a relatively small number of requests into the SUT. Hence, the size of any queues is relatively small. The response time corresponds closely to the sum of the service demands at each of the processing resources (whatever they might be). Since lower is better for response times, that's also the best time you can achieve on the SUT.

At some point, however, as the number of requests is increased, one of the servicing resources will become 100% busy: the bottleneck resource. In Figure 4, this seems to occur around 50 generators. Beyond that load, the resource queues increase almost linearly with the number of concurrent requests. Consequently, the response time approaches the linear handle of the hockey stick. The slope of the handle is determined by the service time at the bottleneck. For that slope to be vertical (i.e., to become an elbow) would require the bottleneck service demand to be infinite! In practical terms, the SUT would not be functional. If there is no think-time delay on the client side, the queues begin to grow immediately in the SUT and the hockey stick appears to have no foot, i.e., it's all handle.
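The foot and the handle can be written down as two straight-line asymptotes, using the standard asymptotic bounds for a closed queueing system. The sketch below is only an illustration: the service demands D and the think time Z are assumed values, not measurements from Figure 4.

```r
# Assumed service demands (secs) at three resources, and an assumed think time.
D <- c(cpu = 0.010, disk = 0.020, net = 0.005)   # hypothetical values
Z <- 1.0                                         # hypothetical think time (secs)

N <- 1:200                       # number of load generators

foot   <- sum(D)                 # light-load floor: sum of the service demands
handle <- N * max(D) - Z         # heavy-load asymptote: slope = bottleneck demand
Rmin   <- pmax(foot, handle)     # lower bound on the response time

# Load at which the two asymptotes cross, i.e., where the stick bends
Nstar <- (sum(D) + Z) / max(D)
print(Nstar)
```

Calling plot(N, Rmin, type = "l") draws the resulting stick. Setting Z to zero moves the crossing point down to sum(D) / max(D), fewer than two generators with these numbers, which is the "all handle" case. The handle only becomes vertical if max(D) is infinite.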

Response times plotted as a function of the number of client generators or users will always have a hockey stick profile.

With that pronouncement firmly in mind, you might think that a load-test system can never exhibit an elbow profile. That turns out not to be true. Figure 1 can occur, just not as shown. Once again, it depends on the definition of the load metric.

Figure 5. Measured NFS response time elbow [Source: SPEC.org]

The elbow profile in Figure 5 is taken from the Standard Performance Evaluation Corporation (SPEC) web site and shows the response time in milliseconds for an EMC Isilon NFS server as a (convex) function of throughput measured in ops/second. Many similar curves can be found online in the sanctioned SPECsfs2008_nfs.v3 benchmark results. The important point here is that the load metric is defined to be the throughput, not the number of users or generators, as it is in Figure 4 (or Figure 1).

Figure 6. Extrapolated NFS response time curve

Unfortunately, the SPEC benchmark run-rules only require 10 load points to be reported, so Figure 5 is somewhat visually ambiguous when it comes to deciding if that response time curve is an elbow or a hockey stick. However, I can use my PDQ queueing analyzer in R to produce an approximate rendition of Figure 5. I don't know all the exact service times in the EMC benchmark, so my PDQ model cannot be exact. But I only want to extrapolate the basic profile to show the "shoulder" above the elbow (as it were) in Figure 5, so precise reproduction is unnecessary. The resulting PDQ data in Figure 6 indeed demonstrate that the NFS response times do follow an elbow curve, like Figure 2a.

Once the bottleneck resource saturates at 100% utilization, the throughput becomes throttled at around 275,000 ops per second in the EMC benchmark (Figure 5) or 200,000 ops per second in the PDQ model (Figure 6). In other words, the throughput cannot exceed that value because it's rate-limited by the bottleneck resource. Nonetheless, the queues will continue to grow under increasing request load, so the response time curve has no choice but to increase in the vertical direction and thereby produce the elbow profile.
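The throttling effect can be reproduced with a few lines of R. This is not the PDQ model behind Figure 6. It is a generic mean value analysis (MVA) of a closed queueing system, swapped in here because it needs no extra packages, and the service demands and think time are made-up values chosen only for illustration.

```r
# Exact mean value analysis (MVA) for a closed system of queueing resources.
# D = service demands (secs), Z = think time (secs). All values are assumed.
mva <- function(N, D, Z = 0) {
  Q <- rep(0, length(D))         # queue lengths when the system is empty
  X <- numeric(N)
  R <- numeric(N)
  for (n in seq_len(N)) {
    Rk   <- D * (1 + Q)          # residence time at each resource
    R[n] <- sum(Rk)              # system response time
    X[n] <- n / (R[n] + Z)       # throughput from Little's law
    Q    <- X[n] * Rk            # queue lengths seen at the next load level
  }
  data.frame(N = seq_len(N), X = X, R = R)
}

D    <- c(0.010, 0.020, 0.005)   # hypothetical service demands (secs)
res  <- mva(N = 500, D = D, Z = 1.0)
Xmax <- 1 / max(D)               # bottleneck limit on the throughput

# Throughput flattens out near Xmax while the response time keeps growing.
print(res[c(1, 50, 100, 500), ])
```

Plotting res$R against res$N gives a hockey stick, whereas plotting the same res$R against res$X gives an elbow, because the X values pile up against Xmax while R continues to climb.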

Response times plotted as a function of the throughput will always have an elbow profile.

To convince you that this must be true, I can apply Little's law to the reported EMC benchmark data to determine the number of requests resident in the SUT, i.e., the otherwise latent independent variable.


df <- read.table(text="X  R
25504  0.7
51054 0.6
76667 0.7
102288 0.8
127879 0.9
153497 1.0
179261 1.2
205226 1.4
231069 2.0
253357 5.7",  
header=TRUE
)

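# Estimate N from Little's law, N = X * R (R is in msecs, hence the 1e-3):
df$Nest <- df$X * df$R * 1e-3
# Rescale so that the largest estimate equals 7 * 192 = 1344 actual generators:
df$Nact <- df$Nest * 7 * 192 / max(df$Nest)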

> df

        X   R      Nest       Nact
1   25504 0.7   17.8528   16.61490
2   51054 0.6   30.6324   28.50838
3   76667 0.7   53.6669   49.94569
4  102288 0.8   81.8304   76.15636
5  127879 0.9  115.0911  107.11080
6  153497 1.0  153.4970  142.85367
7  179261 1.2  215.1132  200.19746
8  205226 1.4  287.3164  267.39416
9  231069 2.0  462.1380  430.09380
10 253357 5.7 1444.1349 1344.00000

plot(df$Nact,df$R,type="b", 
   xlab="SFS client generators (actual)",
   ylab="Response (msecs)"
)

The response times in Figure 7 are plotted as a function of the inferred number of generated requests (the independent variable). Since the SPEC SFS benchmark rules permit the think time to be set to zero (to achieve maximal throughput; competitive benchmarking is war, after all), the response times immediately start climbing up the hockey stick handle, such that the foot of the hockey stick in Figure 2b appears to be amputated.
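As a rough consistency check on the earlier claim that the slope of the handle is set by the bottleneck service time, the last two reported load points can be compared with the reciprocal of the highest reported throughput. Treating that highest reported throughput as a stand-in for the saturation throughput is an assumption made for this sketch, so the agreement can only be approximate.

```r
# Last two load points from the reported data (X in ops/sec, R in msecs)
X <- c(231069, 253357)
R <- c(2.0, 5.7)

N <- X * R * 1e-3               # Little's law estimate of requests in the SUT

slope   <- diff(R) / diff(N)    # msecs per additional request on the handle
implied <- 1e3 / max(X)         # msecs per op at the highest reported throughput

print(c(slope = slope, implied = implied))
```

The two numbers agree to within about ten percent, which is what a handle whose slope is governed by the bottleneck would look like with only ten reported load points.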

Figure 7. SPEC response times plotted as a function of the number of generators

In Figures 4 and 7 the load, expressed as the number of generators, is an independent variable. It's a system configuration parameter, not a measured quantity. The number of generators is established independently, and prior to running the test, in order to cause a certain load to be impressed on the SUT at runtime. The SUT's response to that load is the measured dependent variable, viz., the response time. In Figures 5 and 6, the throughput is not an independent variable because it also depends on the impressed load. Therefore, it is a dependent variable. Like the response time, the throughput is also a measured quantity, not an independent test parameter. In Figures 5 and 6, the independent variable is implicit rather than explicit.

From the preceding discussion, it should be clear that the elbow curve in Figure 1 cannot be correct, since the response time is plotted as a function of the number of users, not the throughput.

If you'd like to know more about how to analyze performance data in this way, you should consider attending the upcoming GCaP class on September 21.

Oh! I almost forgot. Here's the computed PDQ version of Figure 1 with a logarithmic x-axis taken out to 5000 users. Clearly, it is not an elbow curve with a vertical asymptote at around 1000 users.

Figure 8. Calculated rendition of Figure 1

Naively applying log transforms to performance data is something I've cautioned against before because it usually alters the original nonlinear curve into a different nonlinear curve. Since nonlinear transformations are unintuitive, the altered curve then gives the wrong visual cues, amongst other potential problems. See "Sex, Lies and Log Plots" and "What's Wrong with This Picture?" If you do decide to use log axes, then clearly label them as such so as to warn the reader about the additional nonlinear effects that you are introducing.

Although the hockey stick shape appears to be retained under the log transform in Figure 8, that's mostly an optical illusion: the details are very different.
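One way to see how the details change is to compare the slope of a perfectly straight handle on a linear x-axis with its apparent slope on a log10 x-axis. The intercept a and slope b below are assumed values, not fitted to any figure in this post.

```r
# A purely linear "handle": R = a + b * N (a and b are assumed values)
a <- 0.05
b <- 0.02
N <- c(10, 100, 1000, 5000)
R <- a + b * N

# Slope seen on a linear x-axis: the same everywhere
slope_linear <- rep(b, length(N))

# Slope seen on a log10 x-axis: dR/d(log10 N) = b * N * log(10)
slope_log <- b * N * log(10)

print(data.frame(N, R, slope_linear, slope_log))
```

On the log axis the apparent slope grows in proportion to N, so a straight line is drawn as a curve that bends ever more sharply upward. Comparing plot(N, R) with plot(N, R, log = "x") shows the same effect visually.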

Don't use a logarithmic scale in a data plot without thinking carefully about its nonlinear side-effects.

One redeeming feature of Figure 1 is that it correctly depicts the service level target ("SLA") as a horizontal line through the response time curve rather than as a knee in the curve that doesn't actually exist. See "Response Time Knees and Queues" and "Mind Your Knees and Queues: Responding to Hyperbole with Hyperbolae" for more about that topic.


Wednesday, July 29, 2015
