The Pith of Performance: 2014

Monday, October 6, 2014

Tactical Capacity Management for Sysadmins at LISA14

On November 9th I'll be presenting a full-day tutorial on performance analysis and capacity planning at the USENIX Large Scale System Administration (LISA) conference in Seattle, WA.

The registration code is S4 in System Engineering section.

Hope to see you there.

Wednesday, August 13, 2014

Intel TSX Multicore Scalability in the Wild

Multicore processors were introduced to an unsuspecting marketplace more than a decade ago, but really became mainstream circa 2005. Multicore was presented as the next big thing in microprocessor technology. No mention of falling off the Moore's law (uniprocessor) curve. A 2007 PR event—held jointly between Intel, IBM and AMD—announced a VLSI fabrication technology that broke through the 65 nm barrier. The high-κ Hafnium gate enabled building smaller transistors at 45 nm feature size and thus, more cores per die. I tracked and analyzed the repercussions of that event in these 2007 blog posts:

Intel met or beat their projected availability schedule (depending on how you count) for Penryn by essentially rebooting their foundries. Very impressive.

In my Guerrilla classes I like to pose the question: Can you procure a 10 GHz microprocessor? On thinking about it (usually for the first time), most people begin to realize that they can't, but they don't know why not. Clearly, clock frequency limitations have an impact on both performance and server-side capacity. Then, I like to point out that programming multicores (since that decision has already been made for you) is much harder than it is for uniprocessors. Moreover, there is not too much in the way of help from compilers and other development tools, at the moment, although that situation will continually improve, presumably. Intel TSX (Transactional Synchronization Extensions) for Haswell multicores offers assistance of that type at the hardware level. In particular, TSX instructions are built into Haswell cores to boost the performance and scalability of certain types of multithreaded applications. But more about that in a minute.

I also like to point out that Intel and other microprocessor vendors (of which there are fewer and fewer due the enormous cost of entry), have little interest in how well your database, web site, or commercial application runs on their multicores. Rather, their goal is to produce the cheapest chip for the largest commodity market, e.g., PCs, laptops, and more recently mobile. Since that's where the profits are, the emphasis is on simplest design, not best design.

Fast, cheap, reliable: pick two.

Server-side performance is usually relegated to low man on the totem pole because of its relatively smaller market share. The implicit notion is that if you want more performance, just add more cores. But that depends on the threadedness of the applications running on those cores. Of course, there can also be side benefits, such as inheriting lower power servers from advances in mobile chip technology.

Source: AnandTech, Inc.

Intel officially announced multicore processors based on the Haswell architecture in 2013. Because scalability analysis can reveal a lot about limitations of the architecture, it's generally difficult to come across any quantitative data in the public domain. In their 2012 marketing build up, however, Intel showed some qualitative scalability characteristics of the Haswell multicore with TSX. See figure above. You can take it as read that these plots are based on actual measurements.

Most significantly, note the classic USL scaling profiles of transaction throughput vs. number of threads. For example, going from coarse-grain locking without TSX (red curve exhibiting retrograde throughput) to coarse-grain locking with TSX (green curve) has reduced the amount of contention (i.e., USL α coefficient). It's hard to say what is the impact of TSX on coherency delay (i.e., USL β coefficient) without being in possession of the actual data. As expected, however, the impact of TSX on fine-grain locking seems to be far more moderate. A 2012 AnandTech review summed things up this way:

TSX will be supported by GCC v4.8, Microsoft's latest Visual Studio 2012, and of course Intel's C compiler v13. GLIBC support (rtm-2.17 branch) is also available. So it looks like the software ecosystem is ready for TSX. The coolest thing about TSX (especially HLE) is that it enables good scaling on our current multi-core CPUs without an enormous investment of time in the fine tuning of locks. In other words, it can give developers "fined grained performance at coarse grained effort" as Intel likes to put it.
In theory, most application developers will not even have to change their code besides linking to a TSX enabled library. Time will tell if unlocking good multi-core scaling will be that easy in most cases. If everything goes according to Intel's plan, TSX could enable a much wider variety of software to take advantage of the steadily increasing core counts inside our servers, desktops, and portables.

With claimed clock frequencies of 4.6 GHz (i.e., nominal 5000 MIPS), Haswell with TSX offers superior performance at the usual price point. That's two. What about reliability? Ah, there's the rub. TSX has been disabled in the current manufacturing schedule due to a design bug.

Wednesday, July 30, 2014

A Little Triplet

Little's law appears in various guises in performance analysis. It was known to Agner Erlang (the father of queueing theory) in 1909 to be intuitively correct but was not proven mathematically until 1961 by John Little. Even though you experience it all the time, queueing is not such a trivial phenomenon as it may seem. In the subsequent discussion, I'll show you that there is actually a triplet of such laws, where each version refers to a slightly different aspect of queueing. Although they have a common general form, the less than obvious interpretation of each version is handy to know for solving almost any problem in performance analysis.

To see the Little's law triplet, consider the line of customers at the grocery store checkout lane shown in Figure 1. Following the usual queueing theory convention, the queue includes not only the customers waiting but also the customer currently in service.

Figure 1. Checkout lane decomposed into its space and time components

As an aside, it is useful to keep in mind that there are only three types of performance metric:

Time $T$ (the fundamental performance metric), e.g., minutes
Count or a number $N$ (no formal dimensions), e.g., transactions
Rate $N/T$ (inverse time dimension), e.g., transactions per minute

Continuous Integration Gets Performance Testing Radar

As companies embrace continuous integration (CI) and fast release cycles, a serious problem has emerged in the development pipeline: Conventional performance testing is the new bottleneck. Every load testing environment is actually a highly complex simulation assumed to be a model of the intended production environment. System performance testing is so complex that the cost of modifying test scripts and hardware has become a liability for meeting CI schedules. One reaction to this situation is to treat performance testing as a checkbox item, but that exposes the new application to unknown performance idiosyncrasies in production.

In this webinar, Neil Gunther (Performance Dynamics Company) and Sai Subramanian (Cognizant Technology Solutions) will present a new type of model that is not a simulation, but instead acts like continuous radar that warns developers of potential performance and scalability issues during the CI process. This radar model corresponds to a virtual testing framework that precludes the need for developing performance test scripts or setting up a separate load testing environment. Far from being a mere idea, radar methodology is based on a strong analytic foundation that will be demonstrated by examining a successful case study.

Broadcast Date and Time: Tuesday, July 22, 2014, at 11 am Pacific

Thursday, July 17, 2014

Restaurant Performance Sunk by Selfies

An interesting story appeared last weekend about a popular NYC restaurant realizing that, although the number of customers they served on a daily basis is about the same today as it was ten years ago, the overall service had slowed significantly. Naturally, this situation has led to poor online reviews to the point where the restaurant actually hired a firm to investigate the problem. The analysis of surveillance tapes led to a surprising conclusion. The unexpected culprit behind the slowdown was neither the kitchen staff nor the waiters, but customers taking photos and otherwise playing around with their smartphones.

Using the data supplied in the story, I wanted to see how the restaurant performance would look when expressed as a PDQ model. First, I created a summary data frame in R, based on the observed times:


> df
           obs.2004 obs.2014
wifi.data         0        5
menu.data         8        8
menu.pix          0       13
order.data        6        6
eat.mins         46       43
eat.pix           0       20
paymt.data        5        5
paymt.pix         0       15
total.mins       65      115

The 2004 situation can be represented schematically by the following queueing network

Figure 1. 2004 queueing stages

Referring to Figure 1:

How to Remember the Poisson Distribution

The Poisson cumulative distribution function (CDF) \begin{equation} F(α,n) = \sum_{k=0}^n \dfrac{α^k}{k!} \; e^{-α} \label{eqn:pcdf} \end{equation} is the probability of at most $n$ events occurring when the average number of events is α, i.e., $\Pr(X \le n)$. Since \eqref{eqn:pcdf} is a probability function, it cannot have a value greater than 1. In R, the CDF is given by the function ppois(). For example, with α = 4 the first 16 values are


> ppois(0:15,4)
 [1] 0.01831564 0.09157819 0.23810331 0.43347012 0.62883694 0.78513039 0.88932602 0.94886638
 [9] 0.97863657 0.99186776 0.99716023 0.99908477 0.99972628 0.99992367 0.99998007 0.99999511

As the number of events increases from 0 to 15 the CDF approaches 1. See Figure.

Load Average in FreeBSD

In my Perl PDQ book there is a chapter, entitled Linux Load Average, where I dissect how the load average metric (or metrics, since there are three reported numbers) is computed in shell commands like uptime, viz.,

[njg]~/Desktop% uptime
16:18  up 9 days, 15 mins, 4 users, load averages: 2.11 1.99 1.99

For the book, I used Linux 2.6 source because it was accessible on the web with convenient hyperlinks to navigate the code. Somewhere in the kernel scheduler, the following C code appeared:


#define FSHIFT    11          /* nr of bits of precision */ 
#define FIXED_1   (1<<FSHIFT) /* 1.0 as fixed-point */ 
#define LOAD_FREQ (5*HZ)      /* 5 sec intervals */ 
#define EXP_1     1884        /* 1/exp(5sec/1min) fixed-pt */ 
#define EXP_5     2014        /* 1/exp(5sec/5min) */ 
#define EXP_15    2037        /* 1/exp(5sec/15min) */ 

#define CALC_LOAD(load,exp,n) \
 load *= exp; \
 load += n*(FIXED_1-exp); \
 load >>= FSHIFT;

where the C macro CALC_LOAD computes the following equation

The Visual Connection Between Capacity And Performance

Whether or not computer system performance and capacity are related is a question that comes up from time to time, especially from those with little experience in either discipline. Most recently, it appeared on a Linked-in discussion group:

"...the topic was raised about the notion that we are Capacity Management not Performance Management. It made me think about whether performance is indeed a facet of Capacity, or if it belongs completely separate."

As a matter of course, I address this question in my Guerrilla training classes. There, I like to appeal to a simple example—a multiserver queue—to exhibit how the performance characteristics are intimately related to system capacity. Not only are they related but, as the multiserver queue illustrates, the relationship is nonlinear. In terms of daily operations, however, you may choose to focus on one aspect more than the other, but they are still related nonetheless.

Importing an Excel Workbook into R

The usual route for importing data from spreadsheet applications like Excel or OpenOffice into R involves first exporting the data in CSV format. A newer and more efficient CRAN package, called XLConnect (c. 2011), facilitates reading an entire Excel workbook and manipulating worksheets and cells programmatically from within R.

XLConnect doesn't require a running installation of Microsoft Excel or any other special drivers to be able to read and write Excel files. The only requirement is a recent version of a Java Runtime Environment (JRE). Moreover, XLConnect can handle older .xls (BIFF) as well as the newer .xlsx (Office XML) file formats. Internally, XLConnect uses Apache POI (Poor Obfuscation Implementation) to manipulate Microsoft Office documents.

As a simple demonstration, the following worksheet, from a Guerrilla Capacity Planning workbook, will be displayed in R.

First, the Excel workbook is loaded as an R object:

Guerrilla Training in July 2014

Time to register online for the upcoming GBoot (July 17) and GCaP (July 21) training classes. This is your fast track for learning how to do enterprise performance and capacity management.

Entrance Larkspur Landing hotel Pleasanton California

Classic topics include:

How performance metrics are related to each another
Queueing theory for those who can't wait
How to quantify scalability with the Universal Scalability Law (USL)
IT Infrastructure Library (ITIL) for Guerrillas
The Virtualization Spectrum from hyperthreads to hyperservices

All classes are held at the Larkspur Landing hotel in Pleasanton, California. Directions are available on the registration page. Larkspur Landing also provides free wi-fi Internet in their residence-style rooms as well as the training room.

Saturday, April 26, 2014

Guerrilla Data Analysis Class, May 19th

This is your fast track to enterprise performance and capacity management with an emphasis on applying R statistical tools to your performance data.

Discounts are available for 3 or more people from the same company. Enquire for details.

All classes are held at our lovely Larkspur Landing location in Pleasanton. Larkspur Landing also provides free Wi-Fi Internet access in their residence-style rooms, as well as our classroom.

Some of the topics covered will include:

Introduction to using R
How to Detect Bad Data
Measurement Errors and Analysis
Multivariate Linear and Nonlinear Regression
Quantitative Scalability Analysis
Statistical forecasting techniques for CaP
Machine learning algorithms applied to CaP Data
Wild Not Mild Data: SQL access, web traffic, data recovery
Taming the Data Torrent with PCA and PerfViz techniques

Attendees should bring their laptops to the class as course materials are provided on a flash drive and calculational tools like OpenOffice, Excel and R will be useful.

Before registering online, take a look at what former students have said about their Guerrilla training experience.

Tuesday, April 1, 2014

Melbourne's Weather and Cross Correlations

During a lunchtime discussion among recent GCaP class attendees, the topic of weather came up and I casually mentioned that the weather in Melbourne, Australia, can be very changeable because the continent is so old that there is very little geographical relief to moderate the prevailing winds coming from the west.

In general, Melbourne is said to have a mediterranean climate, but it can also be subject to cold blasts of air coming up from Antarctic regions at any time, but especially during the winter. Fortunately, the island state of Tasmania acts as something of a geographical barrier against those winds. Understanding possible relationships between these effects presents an interesting exercise in correlation analysis.

Performance of N+1 Redundancy

How can you determine the performance impact on SLAs after an N+1 redundant hosting configuration fails over? This question came up in the Guerrilla Capacity Planning class, this week. It can be addressed by referring to a multi-server queueing model.

N+1 = 4 Redundancy

We begin by considering a small-N configuration of four hosts where the load is distributed equally to each of the hosts. For simplicity, the load distribution is assumed to be performed by some kind of load balancer with a buffer. The idea of N+1 redundancy is that the load balancer ensures all four hosts are equally utilized prior to any failover.

Consider the constraint that none of the hosts should use more than 75% of their available capacity: the blue areas on the left side of Fig. 1. The total consumed capacity is assumed to be $4 \times 3/4 = 3$ or 300% of the total host configuration (rather than all 4 hosts or 400% capacity). Then, when any single host fails, its lost capacity is compensated by redistributing that same load across the remaining three available hosts (each running 100% busy after failover). As we shall show in the next section, this concept involves a misconception.

Figure 1. N+1 redundant hosts before (left) and after failover (right)

The circles in Fig. 1 represent hosts and boxes represent incoming requests buffered at the load-balancer. The blue area in the circles signifies the available capacity of a host, whereas white signifies unavailable capacity. When one of the hosts fails, its load must be redistributed across the remaining three hosts. What Fig. 1 doesn't show is the performance impact of this capacity redistribution.

Modeling the Mythical Man-Month

This article was originally posted in 2007. When I updated the image today (in 2014), it reappeared with the more recent date and I don't know how to override that (wrong) timestamp. This seems to be a bug in Blogger.

In the aftermath of a discussion about software management, I looked up the Mythical Man-Month concept on Wikipedia. The main thesis of Fred Brooks, often referred to as "Brooks's law [sic]," can simply be stated as:

Adding manpower to a late software project makes it later.

In other words, some number of cooks are necessary to prepare a dinner, but adding too many cooks in the kitchen can inflate the delivery schedule.

Facebook Meets Florence Nightingale and Enrico Fermi

Highlighting Facebook's mistakes and weaknesses is a popular sport. When you're the 800 lb gorilla of social networking, it's inevitable. The most recent rendition of FB bashing appeared in a serious study entitled, Epidemiological Modeling of Online Social Network Dynamics, authored by a couple of academics in the Department of Mechanical and Aerospace Engineering (???) at Princeton University.

They use epidemiological models to explain adoption and abandonment of social networks, where user adoption is analogous to infection and user abandonment is analogous to recovery from disease, e.g., the precipitous attrition witnessed by MySpace. To this end, they employ variants of an SIR (Susceptible Infected Removed) model to predict a precipitous decline in Facebook activity in the next few years.

Channeling Mark Twain^†, FB engineers lampooned this conclusion by pointing out that Princeton would suffer a similar demise under the same assumptions.

Irrespective of the merits of the Princeton paper, I was impressed that they used an SIR model. It's the same one I used, in R, last year to reinterpret Florence Nightingale's zymotic disease data during the Crimean War as resulting from epidemic spreading.

Another way in which FB was inadvertently dinged by incorrect interpretation of information—this time it was the math—occurred in the 2010 movie, "The Social Network" that tells the story of how FB (then called Facemash) came into being. While watching the movie, I noticed that the ranking metric that gets written on a dorm window (only in Hollywood) is wrong! The correct ranking formula is analogous to the Fermi-Dirac distribution, which is key to understanding how electrons "rank" themselves in atoms and semiconductors.

^†"The reports of my death have been greatly exaggerated."

Tuesday, February 18, 2014

Guerrilla Classes in March 2014

The upcoming GBoot and GCaP training classes are your fast track to enterprise performance and capacity management. You can now register entirely online using either your corporate or personal credit card.

New topics include:

The effect of think-time settings in load tests
Closed vs. open workloads in load testing

Classic topics include:

There are only 3 performance metrics
How performance metrics are related to each another
How to quantify scalability with the Universal Scalability Law (USL)
IT Infrastructure Library (ITIL) for Guerrillas
The Virtualization Spectrum from hyperthreads to hyperservices

As usual, all classes are held at our lovely Larkspur Landing Pleasanton location in California. Attendees should bring their laptops to the class as course materials are provided on a flash drive. Larkspur Landing also provides free wi-fi Internet in their residence-style rooms as well as the training room.

Thursday, January 2, 2014

Monitoring CPU Utilization Under Hyper-threading

The question of accurately measuring processor utilization with hyper-threading (HT) enabled came up recently in a Performance Engineering Group discussion on Linked-in. Since I spent some considerable time looking into this issue while writing my Guerrilla Capacity Planning book, I thought I'd repeat my response here (slightly edited for this blog), in case it's useful to a broader audience interested in performance and capacity management. Not much has changed, it seems.

In a nutshell, the original question concerned whether or not it was possible for a single core to be observed running at 200% busy, as reported by Linux top, when HT is enabled.

This question is an old canard (well, "old" for multicore technology). I call it the "Missing MIPS" paradox. Regarding the question, "Is it really possible for a single core to be 200% busy?" the short answer is: never! So, you are quite right to be highly suspicious and confused.
You don't say which make of processor is running on your hardware platform, but I'll guess Intel. Very briefly, the OS (Linux in your case) is being lied to. Each core has 2 registers where inbound threads are stored for processing. Intel calls these AS (Architectural State) registers. With HT *disabled*, the OS only sees a single AS register as being available. In that case, the mapping between state registers and cores is 1:1. The idea behind HT is to allow a different application thread to run when the currently running app stalls; due to branch misprediction, bubbles in the pipeline, etc. To make that possible, there has to be another port or AS register. That register becomes visible to the OS when HT is enabled. However, the OS (and all the way up the food chain to whatever perf tools you are using) now thinks twice the processor capacity is available, i.e., 100% CPU at each AS port.

The Pith of Performance