Wednesday, July 29, 2015

Hockey Elbow and Other Response Time Injuries

You've heard of tennis elbow. Well, there's a non-sports, performance injury that I like to call hockey elbow. An example of such an "injury" is shown in Figure 1, which appeared in a recent computer performance analysis presentation. It's a reminder of how easy it is to become complacent when doing performance analysis and possibly end up reaching the wrong conclusion.

Figure 1. Injured response time performance

Figure 1 is seriously flawed for two reasons:

1. It incorrectly shows the response time curve with a vertical asymptote.
2. It compounds the first error by employing a logarithmic x-axis.

The relationship between performance metrics is generally nonlinear. That's what makes performance analysis both interesting and hard. Your brain has evolved to think linearly, whereas computer systems behave nonlinearly. As a consequence, your brain needs help to comprehend those nonlinear effects. That's one reason plotting performance data can be so helpful—as long as it's done correctly.

Nonlinearity is a class, not a thing.

Response-time data belongs to a class of nonlinearity that is best characterized by a convex function. That's a mathematical term that simply means the plotted data tends to curve upward and away from the x-axis.

 Figure 2a. Elbow response time profile Figure 2b. Hockey-stick response time profile

Although there are many possible ways data can trend upward, the convexity of response-time data is limited to the two monotonically increasing cases shown in Figure 2, viz., an elbow curve (Fig. 2a) or a hockey stick curve (Fig. 2b). The distinction is easily understood by noting the following visual characteristics:

1. Elbow form: the blue curve representing the response time data trends upward so as to follow the 90-degree bend formed by the x-axis and the y-axis on the right side of the plot, i.e., a vertical asymptote.
2. Hockey stick: the blue curve representing the response time data trends upward but at an angle much wider than 90 degrees and runs along the asymptote represented, quite literally, by the hockey stick handle.

As the "load" increases, both response time curves go off to infinity, but at very different rates. The elbow slope is vertical (cf. Figure 1) while the hockey stick slope is inclined at a shallower angle.

The burning question now becomes, what determines whether the response time data follows the elbow or the hockey stick profile? The answer hinges on the use of the word "load"—one of the most overloaded words in the performance lexicon.

Pay very close attention to the metric displayed on the x-axis.

It should be emphasized that the one metric "load" cannot represent is time. When carrying out load tests or running internal benchmarks, tools like LoadRunner or JMeter will usually present performance metrics as a time series, like Figure 3, while the data are being collected.

Figure 3. Monitored response times

Here, however, I am talking about post-processed performance metrics, where the collected data has been time-averaged. In Figure 3, the time average is represented by the height of the horizontal red line, i.e., a single number that represents the average response time during that measurement period. An example of multiple post-processed response-time measurements is shown in Figure 4. The timestamps have been eliminated. Each data point in Figure 4 is derived from the equivalent of the red line in Figure 3.

Time series performance data has to be time averaged.
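As a minimal sketch of that post-processing step in R, the following collapses a monitored time series into one time-averaged response time per load level. The data frame `samples` and every value in it are hypothetical, invented purely for illustration; they are not taken from Figure 3 or Figure 4.

```r
# Hypothetical monitored samples (NOT measured data):
#   t    = sample timestamp in seconds
#   load = number of load generators active during that run
#   rt   = response time in milliseconds
samples <- data.frame(
  t    = rep(1:5, times = 3),
  load = rep(c(10, 50, 100), each = 5),
  rt   = c(12, 11, 13, 12, 12,
           14, 16, 15, 17, 13,
           41, 39, 44, 40, 36)
)

# Time-average each run: the timestamps drop out and one number
# (the analog of the red line in a time-series plot) remains per load level.
avg <- aggregate(rt ~ load, data = samples, FUN = mean)
print(avg)
```

Each row of `avg` then becomes a single point on a response time versus load plot.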

Note the resemblance of the curves in Figure 4 to the hockey stick curve in Figure 2b. In this case, the load represents the "number of concurrent requests" in the system under test (SUT). The testing harness was the Apache HTTP server benchmarking tool.

Figure 4. Measured response time hockey sticks [Source: Juicebox Games]

Why do these data have a hockey stick profile? Each request that is issued by a load generator traverses the SUT in order to be processed and returns its result (e.g., status or data) to the client load script. The script may be programmed to represent an additional user delay, called the think time, before submitting the next request. The foot of the hockey stick corresponds to light load due to only a few load generators (between 1 and 10) issuing a relatively small number of requests into the SUT. Hence, the size of any queues is relatively small. The response time corresponds closely to the sum of the service demand at each of the processing resources (whatever they might be). Since lower is better for response times, that's also the best time you can achieve on the SUT.

At some point, however, as the number of requests is increased, one of the servicing resources will become 100% busy: the bottleneck resource. In Figure 4, this seems to occur around 50 generators. Beyond that load, the resource queues increase almost linearly with the number of concurrent requests. Consequently, the response time approaches the linear handle of the hockey stick. The slope of the handle is determined by the service time at the bottleneck. For that slope to be vertical (i.e., to become an elbow) would require its service demand be infinite! In practical terms, the SUT would not be functional. If there is no think-time delay on the client side, the queues begin to grow immediately in the SUT and the hockey-stick appears to have no foot, i.e., it's all handle.
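The foot and the handle can be reproduced with a small closed queueing model. The sketch below is a textbook exact Mean Value Analysis (MVA) recursion written in base R, not the PDQ model used elsewhere in this post. The function name `mva`, the service demands `D`, and the think time `Z` are all assumptions chosen for illustration.

```r
# Exact MVA for a single-class closed queueing network.
#   N = maximum number of load generators
#   D = service demand (seconds) at each queueing resource  [hypothetical]
#   Z = client think time (seconds)                         [hypothetical]
mva <- function(N, D, Z = 0) {
  Q   <- rep(0, length(D))            # queue lengths with zero requests
  out <- data.frame(N = 1:N, X = NA_real_, R = NA_real_)
  for (n in 1:N) {
    Rk <- D * (1 + Q)                 # residence time at each resource
    R  <- sum(Rk)                     # total response time
    X  <- n / (R + Z)                 # throughput via Little's law
    Q  <- X * Rk                      # updated queue lengths
    out$X[n] <- X
    out$R[n] <- R
  }
  out
}

D   <- c(0.005, 0.020, 0.010)         # assumed demands; bottleneck is 0.020 s
Z   <- 1.0                            # assumed think time
res <- mva(200, D, Z)

foot   <- res$R[1]                    # light-load response time
handle <- diff(tail(res$R, 2))        # slope of R at heavy load
cat("foot =", foot, " handle slope =", handle, "\n")
```

With these assumed values the foot sits at sum(D) = 0.035 s, the handle climbs by about max(D) = 0.020 s per added generator, and saturation sets in near (sum(D) + Z) / max(D), roughly 52 generators. Setting `Z = 0` removes the foot, as described above.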

Response times plotted as a function of the number of client generators or users will always have a hockey stick profile.

With that pronouncement firmly in mind, you might think that a load-test system can never exhibit an elbow profile. That turns out not to be true. Figure 1 can occur, just not as shown. Once again, it depends on the definition of the load metric.

Figure 5. Measured NFS response time elbow [Source: SPEC.org]

The elbow profile in Figure 5 is taken from the Standard Performance Evaluation Corporation (SPEC) web site and shows the response time in milliseconds for an EMC Isilon NFS server as a (convex) function of throughput measured in ops/second. Many similar curves can be found online in the sanctioned SPECsfs2008_nfs.v3 benchmark results. The important point here is that the load metric is defined to be the throughput, not the number of users or generators, as it is in Figure 4 (or Figure 1).

Figure 6. Extrapolated NFS response time curve

Unfortunately, the SPEC benchmark run-rules only require 10 load points to be reported, so Figure 5 is somewhat visually ambiguous when it comes to deciding if that response time curve is an elbow or a hockey stick. However, I can use my PDQ queueing analyzer in R to produce an approximate rendition of Figure 5. I don't know all the exact service times in the EMC benchmark, so my PDQ model cannot be exact. But I only want to extrapolate the basic profile to show the "shoulder" above the elbow (as it were) in Figure 5, so precise reproduction is unnecessary. The resulting PDQ data in Figure 6 indeed demonstrate that the NFS response times do follow an elbow curve, like Figure 2a.

Once the bottleneck resource saturates at 100% utilization, the throughput becomes throttled at around 275,000 ops per second in the EMC benchmark (Figure 5) or 200,000 ops per second in the PDQ model (Figure 6). In other words, the throughput cannot exceed that value because it's rate-limited by the bottleneck resource. Nonetheless, the queues will continue to grow under increasing request load, so the response time curve has no choice but to increase in the vertical direction and thereby produce the elbow profile.
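A single open queue is enough to show why the curve must turn vertical when throughput is on the x-axis. The sketch below uses the standard M/M/1 residence-time formula R = S / (1 - X S) as a stand-in. It is not the PDQ model behind Figure 6, and the service time `S` is a hypothetical value.

```r
# M/M/1 response time as a function of throughput X.
# S is an assumed (hypothetical) service time at the bottleneck, in seconds.
S    <- 0.01
Xmax <- 1 / S                          # throughput ceiling set by the bottleneck

# Approach the ceiling: 10%, 50%, 90%, 99%, and 99.9% of Xmax
X <- Xmax * c(0.10, 0.50, 0.90, 0.99, 0.999)
R <- S / (1 - X * S)                   # diverges as X approaches Xmax

print(data.frame(X = X, R = R))
```

The throughput barely moves over the last three rows while the response time grows by two orders of magnitude, which is the elbow.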

Response times plotted as a function of the throughput will always have an elbow profile.

To convince you that this must be true, I can apply Little's law to the reported EMC benchmark data to determine the number of requests resident in the SUT, i.e., the otherwise latent independent variable.


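# SPEC SFS data from Figure 5: throughput X (ops/s) and response time R (ms)
df <- read.table(text="X R
25504 0.7
51054 0.6
76667 0.7
102288 0.8
127879 0.9
153497 1.0
179261 1.2
205226 1.4
231069 2.0
253357 5.7",
header=TRUE
)

# Estimate N from Little's law X x R (1e-3 converts R from ms to seconds):
df$Nest <- df$X * df$R * 1e-3
# Rescale so that the largest estimate equals 7 * 192 = 1344:
df$Nact <- df$Nest * 7 * 192 / max(df$Nest)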

> df
        X   R      Nest       Nact
1   25504 0.7   17.8528   16.61490
2   51054 0.6   30.6324   28.50838
3   76667 0.7   53.6669   49.94569
4  102288 0.8   81.8304   76.15636
5  127879 0.9  115.0911  107.11080
6  153497 1.0  153.4970  142.85367
7  179261 1.2  215.1132  200.19746
8  205226 1.4  287.3164  267.39416
9  231069 2.0  462.1380  430.09380
10 253357 5.7 1444.1349 1344.00000

plot(df$Nact,df$R,type="b",
xlab="SFS client generators (actual)",
ylab="Response (msecs)"
)

The inferred response times in Figure 7 are plotted as a function of the number of generated requests (the independent variable). Since the SPEC SFS benchmark rules permit the think-time to be set to zero (to achieve maximal throughput—competitive benchmarking is war, after all) the response times immediately start climbing up the hockey stick handle such that the foot of the hockey stick in Figure 2b appears to be amputated.

Figure 7. SPEC response times plotted as a function of the number of generators

In Figures 4 and 7 the load, expressed as the number of generators, is an independent variable. It's a system configuration parameter, not a measured quantity. The number of generators is established independently, and prior to running the test, in order to cause a certain load to be impressed on the SUT at runtime. The SUT's response to that load is the measured dependent variable, viz., the response time. In Figures 5 and 6, the throughput is not an independent variable because it also depends on the impressed load. Therefore, it is a dependent variable. Like the response time, the throughput is also a measured quantity, not an independent test parameter. In Figures 5 and 6, the independent variable is implicit rather than explicit.

From the preceding discussion, it should be clear that the elbow curve in Figure 1 cannot be correct, since the response time is plotted as a function of the number of users, not the throughput.

If you'd like to know about how to analyze performance data in this way, you should consider attending the upcoming GCaP class on September 21.

Oh! I almost forgot. Here's the computed PDQ version of Figure 1 with a logarithmic x-axis taken out to 5000 users. Clearly, it is not an elbow curve with a vertical asymptote at around 1000 users.

Figure 8. Calculated rendition of Figure 1

Naively applying log transforms to performance data is something I've cautioned against because it usually alters the original nonlinear curve into a different nonlinear curve. Since nonlinear transformations are unintuitive, the altered curve then gives the wrong visual cues, amongst other potential problems. See "Sex, Lies and Log Plots" and "What's Wrong with This Picture?" If you do decide to use log-axes, then clearly label them as such so as to warn the reader about the additional nonlinear effects that you are introducing.

Although the hockey stick shape appears to be retained under the log transform in Figure 8, that's mostly an optical illusion: the details are very different.
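One way to see what the log axis does to the handle: beyond saturation the response time grows roughly as N * Dmax - Z, which is linear in N but exponential in log10(N). The sketch below uses the standard asymptotic bound for a closed system, with assumed values for `Dmax`, `Rmin`, and `Z`. It is not the PDQ calculation behind Figure 8.

```r
# Asymptotic hockey-stick bound for a closed system (all values hypothetical):
#   Rmin = light-load response time (s), Dmax = bottleneck demand (s),
#   Z    = think time (s)
Rmin <- 0.035
Dmax <- 0.020
Z    <- 1.0

N <- 10^(1:4)                          # equally spaced ticks on a log10 x-axis
R <- pmax(Rmin, N * Dmax - Z)          # foot, then linear handle

# dR is the rise in R between successive log-axis ticks
print(data.frame(N = N, log10N = log10(N), R = R, dR = c(NA, diff(R))))
```

Equal steps along the log axis produce rises in R that grow about tenfold each time, so a handle that is a straight line on a linear axis is drawn as a steepening curve on a log axis.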

Don't use a logarithmic scale in a data plot without thinking carefully about its nonlinear side-effects.

One redeeming feature of Figure 1 is that it correctly depicts the service level target ("SLA") as a horizontal line through the response time curve rather than as a knee in the curve, which doesn't actually exist. See "Response Time Knees and Queues" and "Mind Your Knees and Queues: Responding to Hyperbole with Hyperbolae" for more about that topic.

Sunday, July 26, 2015

Next GCaP Class: September 21, 2015

The next Guerrilla Capacity Planning class will be held during the week of September 21, 2015 at our new Sheraton Four Points location in Pleasanton, California. Early bird rate ends August 21st.

During the class, I will bust some entrenched CaP management myths (in no particular order):

• All performance measurements are wrong by definition.
• There is no response-time knee.
• Throughput is not the same as execution rate.
• Throughput and latency metrics are related — nonlinearly.
• There is no parallel computing.

No particular knowledge about capacity and performance management is assumed.

Attendees should bring their laptops as course materials are provided on CD or flash drive. The Sheraton provides free wi-fi to the internet.

We look forward to seeing you there!

Monday, March 23, 2015

Hadoop is hot, not because it necessarily represents cutting edge technology, but because it's being rapidly adopted by more and more companies as a solution for engaging in the big data trend. It may be coming to your company sooner than you think.

The Hadoop framework is designed to facilitate the parallel processing of massive amounts of unstructured data. Originally intended to be the basis of Yahoo's search-engine, it is now open sourced at Apache. Since Hadoop now has a broad range of corporate users, a number of companies offer commercial implementations of Hadoop.

However, certain aspects of Hadoop performance, especially scalability, are not well understood. These include:

1. So-called flat development scalability
2. Super scaling performance
3. New TPC big data benchmark

Therefore, I've added a new module on Hadoop performance and capacity management to the Guerrilla Capacity Planning course material that also includes such topics as:

• There are only 3 performance metrics you need to know
• How performance metrics are related to one another
• How to quantify scalability with the Universal Scalability Law
• IT Infrastructure Library (ITIL) for Guerrillas
• The Virtualization Spectrum from hyperthreads to hyperservices
• Hadoop performance and capacity management

The course outline has more details.

Early bird registration ends in 5 days.

I'm also interested in hearing from anyone who plans to adopt Hadoop or has experience using it from a performance and capacity perspective.

Friday, March 20, 2015

Performance Analysis vs. Capacity Planning

This question came up in a (members only) LinkedIn discussion group:
Often found a misconception about these terms. I'm sure this must be written in a book, but for informal discussions is always preferable to cite sources from standardization institutes or IT industry referents.

Gian Piero

This is a very good question that few people ever ask, let alone try to answer correctly.

Don't quote me on this, but I view it as the difference between how long vs. how much. ;) Yes, that's intended to be somewhat ambiguous, because performance management and capacity management are rather ambiguous concepts, in that there's considerable overlap between them.

Most people who proffer an answer will tend to incorporate a lot of details that reflect their own history with the subject. In my classes, I try to boil it down to fundamentals that can then be elaborated on with the specifics related to your particular context.

1. Performance analysis or performance management is fundamentally about time: how long does it take? (BTW, throughput is just an inverse-time metric.)
2. Capacity planning or capacity management is fundamentally about size: how much resource is needed?

To make things a little more concrete, consider a freeway. The number of lanes (and length between ramps) represents capacity (bandwidth). The unstated assumption is that the freeway has enough capacity to allow the traffic to travel in the shortest time or near the speed limit (throughput), i.e., maximal performance. Of course, in California we know all about that ruse. At peak traffic hours the freeway often approximates a parking lot.

The point is that performance and capacity are intimately related: how much resource is available to achieve a specified performance goal or service level at a given load (like traffic)? The reason we consider any distinction at all is mostly one of perspective.

• If you're coming at it from a capacity management standpoint, you're usually assessing/measuring capacity under a set of assumptions about performance (current or projected).

• If you're coming at it from a performance management standpoint, you're assessing/measuring performance under a set of assumptions about capacity.

The other important point to stress is that the relationship between capacity and performance metrics is generally nonlinear, e.g., the relationship between response time and resource utilization (an oft used proxy for size) is nonlinear, although it can look linear at low loads. That's what makes the subject both interesting and difficult. And, as I say in the epigram to the 1st edition of my Perl::PDQ book: Common sense is the pitfall of all performance analysis.
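To put rough numbers on "looks linear at low loads", the sketch below compares the M/M/1 stretch factor R/S = 1 / (1 - ρ) with its straight-line approximation 1 + ρ. The single-queue model and the utilization values in `rho` are assumptions picked for illustration, not measurements.

```r
# Utilization levels to compare (hypothetical sample points)
rho <- c(0.05, 0.10, 0.20, 0.50, 0.80, 0.90, 0.95)

exact  <- 1 / (1 - rho)                # M/M/1 stretch factor R/S
linear <- 1 + rho                      # first-order (linear) approximation
relerr <- (exact - linear) / exact     # fraction of R/S the line misses

print(data.frame(rho = rho, exact = exact, linear = linear, relerr = relerr))
```

Algebraically the relative error is ρ squared, so the straight line is within 1% at 10% utilization but misses about 81% of the true value at 90% utilization.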

To go back to the freeway example, the usual "solution" to the parking-lot effect is to simply add more capacity, in the form of more freeways, which we already know doesn't work because adding more freeways just creates more cars! Another unintuitive relationship. Mainframers call this unexpected capacity consumption latent demand.

Beyond that, it's all about trade-offs; including meeting budgetary constraints and so forth.

Postscript:
Doctor Gunther, it's hard not to quote your opinion if we consider that your book: Guerrilla Capacity Planning was one of the first that I read as an introduction to the topic of IT capacity planning.

GP

Monday, March 9, 2015

Guerrilla Training: New Location

Finally! We have a new location for our Guerrilla training classes in Pleasanton, California: Sheraton Four Points.

We had some complaints last year about noise from the car parks of surrounding restaurants during the night at the previous location. Four Points is much more secluded. It also has its own restaurant, which some of you will recognize if you've attended previous Guerrilla classes (more than likely, we did lunch and/or dinner there).

The current 2015 schedule and registration page is now posted. The classroom is intimate and only holds about 10-12 people, so book early, book often.

Monday, October 6, 2014

Tactical Capacity Management for Sysadmins at LISA14

On November 9th I'll be presenting a full-day tutorial on performance analysis and capacity planning at the USENIX Large Installation System Administration (LISA) conference in Seattle, WA.

The registration code is S4 in the System Engineering section.

Hope to see you there.

Wednesday, August 13, 2014

Intel TSX Multicore Scalability in the Wild

Multicore processors were introduced to an unsuspecting marketplace more than a decade ago, but really became mainstream circa 2005. Multicore was presented as the next big thing in microprocessor technology. No mention of falling off the Moore's law (uniprocessor) curve. A 2007 PR event—held jointly between Intel, IBM and AMD—announced a VLSI fabrication technology that broke through the 65 nm barrier. The high-κ Hafnium gate enabled building smaller transistors at 45 nm feature size and thus, more cores per die. I tracked and analyzed the repercussions of that event in these 2007 blog posts:

Intel met or beat their projected availability schedule (depending on how you count) for Penryn by essentially rebooting their foundries. Very impressive.

In my Guerrilla classes I like to pose the question: Can you procure a 10 GHz microprocessor? On thinking about it (usually for the first time), most people begin to realize that they can't, but they don't know why not. Clearly, clock frequency limitations have an impact on both performance and server-side capacity. Then, I like to point out that programming multicores (since that decision has already been made for you) is much harder than it is for uniprocessors. Moreover, there is not too much in the way of help from compilers and other development tools, at the moment, although that situation will continually improve, presumably. Intel TSX (Transactional Synchronization Extensions) for Haswell multicores offers assistance of that type at the hardware level. In particular, TSX instructions are built into Haswell cores to boost the performance and scalability of certain types of multithreaded applications. But more about that in a minute.

I also like to point out that Intel and other microprocessor vendors (of which there are fewer and fewer due to the enormous cost of entry), have little interest in how well your database, web site, or commercial application runs on their multicores. Rather, their goal is to produce the cheapest chip for the largest commodity market, e.g., PCs, laptops, and more recently mobile. Since that's where the profits are, the emphasis is on simplest design, not best design.

Fast, cheap, reliable: pick two.

Server-side performance is usually relegated to low man on the totem pole because of its relatively smaller market share. The implicit notion is that if you want more performance, just add more cores. But that depends on the threadedness of the applications running on those cores. Of course, there can also be side benefits, such as inheriting lower power servers from advances in mobile chip technology.

Intel officially announced multicore processors based on the Haswell architecture in 2013. Because scalability analysis can reveal a lot about limitations of the architecture, it's generally difficult to come across any quantitative data in the public domain. In their 2012 marketing build up, however, Intel showed some qualitative scalability characteristics of the Haswell multicore with TSX. See figure above. You can take it as read that these plots are based on actual measurements.

Most significantly, note the classic USL scaling profiles of transaction throughput vs. number of threads. For example, going from coarse-grain locking without TSX (red curve exhibiting retrograde throughput) to coarse-grain locking with TSX (green curve) has reduced the amount of contention (i.e., USL α coefficient). Without the actual data, it's hard to say what impact TSX has on coherency delay (i.e., USL β coefficient). As expected, however, the impact of TSX on fine-grain locking seems to be far more moderate. A 2012 AnandTech review summed things up this way:

TSX will be supported by GCC v4.8, Microsoft's latest Visual Studio 2012, and of course Intel's C compiler v13. GLIBC support (rtm-2.17 branch) is also available. So it looks like the software ecosystem is ready for TSX. The coolest thing about TSX (especially HLE) is that it enables good scaling on our current multi-core CPUs without an enormous investment of time in the fine tuning of locks. In other words, it can give developers "fine grained performance at coarse grained effort" as Intel likes to put it.

In theory, most application developers will not even have to change their code besides linking to a TSX enabled library. Time will tell if unlocking good multi-core scaling will be that easy in most cases. If everything goes according to Intel's plan, TSX could enable a much wider variety of software to take advantage of the steadily increasing core counts inside our servers, desktops, and portables.
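For readers who want to experiment with the USL profiles mentioned earlier, here is a minimal R sketch of the USL function with contention coefficient α (`alpha`) and coherency coefficient β (`beta`). The coefficient values are hypothetical, chosen only to produce the two qualitative shapes (retrograde versus monotonic); they are not fitted to Intel's data.

```r
# Universal Scalability Law: relative capacity at N threads
usl <- function(N, alpha, beta) {
  N / (1 + alpha * (N - 1) + beta * N * (N - 1))
}

N <- 1:64

# Hypothetical coefficients, for illustration only
retro <- usl(N, alpha = 0.10, beta = 0.005)  # contention plus coherency delay
mono  <- usl(N, alpha = 0.02, beta = 0)      # light contention, no coherency delay

# With beta > 0 the curve peaks near sqrt((1 - alpha) / beta), then declines
Npeak <- sqrt((1 - 0.10) / 0.005)
cat("continuous peak near N =", round(Npeak, 1),
    "; discrete maximum at N =", which.max(retro), "\n")
```

A nonzero β is what produces retrograde throughput. With β = 0 the curve only flattens toward a ceiling of 1/α.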

With claimed clock frequencies of 4.6 GHz (i.e., nominal 5000 MIPS), Haswell with TSX offers superior performance at the usual price point. That's two. What about reliability? Ah, there's the rub. TSX has been disabled in the current manufacturing schedule due to a design bug.