Monday, October 6, 2014

Tactical Capacity Management for Sysadmins at LISA14

On November 9th I'll be presenting a full-day tutorial on performance analysis and capacity planning at the USENIX Large Scale System Administration (LISA) conference in Seattle, WA.

The registration code is S4 in System Engineering section.

Hope to see you there.

Wednesday, August 13, 2014

Intel TSX Multicore Scalability in the Wild

Multicore processors were introduced to an unsuspecting marketplace more than a decade ago, but really became mainstream circa 2005. Multicore was presented as the next big thing in microprocessor technology. No mention of falling off the Moore's law (uniprocessor) curve. A 2007 PR event—held jointly between Intel, IBM and AMD—announced a VLSI fabrication technology that broke through the 65 nm barrier. The high-κ Hafnium gate enabled building smaller transistors at 45 nm feature size and thus, more cores per die. I tracked and analyzed the repercussions of that event in these 2007 blog posts:
  1. Moore's Law II: More or Less?
  2. More on Moore
  3. Programming Multicores Ain't Easy

Intel met or beat their projected availability schedule (depending on how you count) for Penryn by essentially rebooting their foundries. Very impressive.

In my Guerrilla classes I like to pose the question: Can you procure a 10 GHz microprocessor? On thinking about it (usually for the first time), most people begin to realize that they can't, but they don't know why not. Clearly, clock frequency limitations have an impact on both performance and server-side capacity. Then, I like to point out that programming multicores (since that decision has already been made for you) is much harder than it is for uniprocessors. Moreover, there is not too much in the way of help from compilers and other development tools, at the moment, although that situation will continually improve, presumably. Intel TSX (Transactional Synchronization Extensions) for Haswell multicores offers assistance of that type at the hardware level. In particular, TSX instructions are built into Haswell cores to boost the performance and scalability of certain types of multithreaded applications. But more about that in a minute.

I also like to point out that Intel and other microprocessor vendors (of which there are fewer and fewer due the enormous cost of entry), have little interest in how well your database, web site, or commercial application runs on their multicores. Rather, their goal is to produce the cheapest chip for the largest commodity market, e.g., PCs, laptops, and more recently mobile. Since that's where the profits are, the emphasis is on simplest design, not best design.

Fast, cheap, reliable: pick two.
Server-side performance is usually relegated to low man on the totem pole because of its relatively smaller market share. The implicit notion is that if you want more performance, just add more cores. But that depends on the threadedness of the applications running on those cores. Of course, there can also be side benefits, such as inheriting lower power servers from advances in mobile chip technology.

Intel officially announced multicore processors based on the Haswell architecture in 2013. Because scalability analysis can reveal a lot about limitations of the architecture, it's generally difficult to come across any quantitative data in the public domain. In their 2012 marketing build up, however, Intel showed some qualitative scalability characteristics of the Haswell multicore with TSX. See figure above. You can take it as read that these plots are based on actual measurements.

Most significantly, note the classic USL scaling profiles of transaction throughput vs. number of threads. For example, going from coarse-grain locking without TSX (red curve exhibiting retrograde throughput) to coarse-grain locking with TSX (green curve) has reduced the amount of contention (i.e., USL α coefficient). It's hard to say what is the impact of TSX on coherency delay (i.e., USL β coefficient) without being in possession of the actual data. As expected, however, the impact of TSX on fine-grain locking seems to be far more moderate. A 2012 AnandTech review summed things up this way:

TSX will be supported by GCC v4.8, Microsoft's latest Visual Studio 2012, and of course Intel's C compiler v13. GLIBC support (rtm-2.17 branch) is also available. So it looks like the software ecosystem is ready for TSX. The coolest thing about TSX (especially HLE) is that it enables good scaling on our current multi-core CPUs without an enormous investment of time in the fine tuning of locks. In other words, it can give developers "fined grained performance at coarse grained effort" as Intel likes to put it.

In theory, most application developers will not even have to change their code besides linking to a TSX enabled library. Time will tell if unlocking good multi-core scaling will be that easy in most cases. If everything goes according to Intel's plan, TSX could enable a much wider variety of software to take advantage of the steadily increasing core counts inside our servers, desktops, and portables.

With claimed clock frequencies of 4.6 GHz (i.e., nominal 5000 MIPS), Haswell with TSX offers superior performance at the usual price point. That's two. What about reliability? Ah, there's the rub. TSX has been disabled in the current manufacturing schedule due to a design bug.

Wednesday, July 30, 2014

A Little Triplet

Little's law appears in various guises in performance analysis. It was known to Agner Erlang (the father of queueing theory) in 1909 to be intuitively correct but was not proven mathematically until 1961 by John Little. Even though you experience it all the time, queueing is not such a trivial phenomenon as it may seem. In the subsequent discussion, I'll show you that there is actually a triplet of such laws, where each version refers to a slightly different aspect of queueing. Although they have a common general form, the less than obvious interpretation of each version is handy to know for solving almost any problem in performance analysis.

To see the Little's law triplet, consider the line of customers at the grocery store checkout lane shown in Figure 1. Following the usual queueing theory convention, the queue includes not only the customers waiting but also the customer currently in service.

Figure 1. Checkout lane decomposed into its space and time components

As an aside, it is useful to keep in mind that there are only three types of performance metric:

  1. Time $T$ (the fundamental performance metric), e.g., minutes
  2. Count or a number $N$ (no formal dimensions), e.g., transactions
  3. Rate $N/T$ (inverse time dimension), e.g., transactions per minute
From this standpoint, Little's law has the simple general form \begin{equation} N = \dfrac{N}{T} ~\times ~T \label{eqn:dims} \end{equation} which says: a rate (type- C metric) multiplied by time (type- A metric) produces a number (type- B metric), because the $T$s cancel out. The three metric types (A,B,C) should not be confused with the three forms of Little's law being discussed here. That's just a coincidence because good things often come in threes. :-)

Sunday, July 20, 2014

Continuous Integration Gets Performance Testing Radar

As companies embrace continuous integration (CI) and fast release cycles, a serious problem has emerged in the development pipeline: Conventional performance testing is the new bottleneck. Every load testing environment is actually a highly complex simulation assumed to be a model of the intended production environment. System performance testing is so complex that the cost of modifying test scripts and hardware has become a liability for meeting CI schedules. One reaction to this situation is to treat performance testing as a checkbox item, but that exposes the new application to unknown performance idiosyncrasies in production.

In this webinar, Neil Gunther (Performance Dynamics Company) and Sai Subramanian (Cognizant Technology Solutions) will present a new type of model that is not a simulation, but instead acts like continuous radar that warns developers of potential performance and scalability issues during the CI process. This radar model corresponds to a virtual testing framework that precludes the need for developing performance test scripts or setting up a separate load testing environment. Far from being a mere idea, radar methodology is based on a strong analytic foundation that will be demonstrated by examining a successful case study.

Broadcast Date and Time: Tuesday, July 22, 2014, at 11 am Pacific

Thursday, July 17, 2014

Restaurant Performance Sunk by Selfies

An interesting story appeared last weekend about a popular NYC restaurant realizing that, although the number of customers they served on a daily basis is about the same today as it was ten years ago, the overall service had slowed significantly. Naturally, this situation has led to poor online reviews to the point where the restaurant actually hired a firm to investigate the problem. The analysis of surveillance tapes led to a surprising conclusion. The unexpected culprit behind the slowdown was neither the kitchen staff nor the waiters, but customers taking photos and otherwise playing around with their smartphones.

Using the data supplied in the story, I wanted to see how the restaurant performance would look when expressed as a PDQ model. First, I created a summary data frame in R, based on the observed times:

> df
           obs.2004 obs.2014         0        5         8        8
menu.pix          0       13        6        6
eat.mins         46       43
eat.pix           0       20        5        5
paymt.pix         0       15
total.mins       65      115

The 2004 situation can be represented schematically by the following queueing network

Referring to Figure 1:

Thursday, July 3, 2014

How to Remember the Poisson Distribution

The Poisson cumulative distribution function (CDF) \begin{equation} F(α,n) = \sum_{k=0}^n \dfrac{α^k}{k!} \; e^{-α} \label{eqn:pcdf} \end{equation} is the probability of at most $n$ events occurring when the average number of events is α, i.e., $\Pr(X \le n)$. Since \eqref{eqn:pcdf} is a probability function, it cannot have a value greater than 1. In R, the CDF is given by the function ppois(). For example, with α = 4 the first 16 values are

> ppois(0:15,4)
 [1] 0.01831564 0.09157819 0.23810331 0.43347012 0.62883694 0.78513039 0.88932602 0.94886638
 [9] 0.97863657 0.99186776 0.99716023 0.99908477 0.99972628 0.99992367 0.99998007 0.99999511
As the number of events increases from 0 to 15 the CDF approaches 1. See Figure.

Thursday, June 26, 2014

Load Average in FreeBSD

In my Perl PDQ book there is a chapter, entitled Linux Load Average, where I dissect how the load average metric (or metrics, since there are three reported numbers) is computed in shell commands like uptime, viz.,
[njg]~/Desktop% uptime
16:18  up 9 days, 15 mins, 4 users, load averages: 2.11 1.99 1.99

For the book, I used Linux 2.6 source because it was accessible on the web with convenient hyperlinks to navigate the code. Somewhere in the kernel scheduler, the following C code appeared:

#define FSHIFT    11          /* nr of bits of precision */ 
#define FIXED_1   (1<<FSHIFT) /* 1.0 as fixed-point */ 
#define LOAD_FREQ (5*HZ)      /* 5 sec intervals */ 
#define EXP_1     1884        /* 1/exp(5sec/1min) fixed-pt */ 
#define EXP_5     2014        /* 1/exp(5sec/5min) */ 
#define EXP_15    2037        /* 1/exp(5sec/15min) */ 

#define CALC_LOAD(load,exp,n) \
 load *= exp; \
 load += n*(FIXED_1-exp); \
 load >>= FSHIFT;

where the C macro CALC_LOAD computes the following equation