Wednesday, August 3, 2016

PDQ as a Performance Periscope

This is a guest post by performance modeling enthusiast Mohit Chawla, who brought the following very interesting example to my attention. In contrast to many of the examples in my Perl::PDQ book, these performance data come from a production web server, not a test rig. —NJG

Performance analysts usually build performance models based on their understanding of the software application's behavior. However, this post describes how a PDQ model acted like a periscope and also turned out to be pedagogical by uncovering otherwise hidden details about the application's inner workings and performance characteristics.

Some Background

A systems engineer needed some help analyzing the current performance of a web application server, as well as its capacity consumption. Time-series data and plots provided a qualitative impression for this purpose, mostly as a sanity check on the data. Observing and interpreting these time series helped to form an empirical understanding of the application's traffic patterns and resource utilization, but it wasn't sufficient for making an accurate judgment about the expected performance of the web server after changing the application configuration or system resources, i.e., real capacity planning.

In fact, what was needed was some kind of "periscope" to see above the "surface waves" of the time-series performance data. In addition, a more quantitative framework would be useful to go beyond the initial qualitative review of the time-series. All of this was needed pretty damn quick! ... Enter PDQ.

Transforming the Data

Let's start by examining the general structure of the steady-state data before applying any queueing models to it. In other words, the original time-series data for the measured response times, $R$ (see the above Figure), were transformed to show the relationship between $R$ and the corresponding load, $N$, on the web server during the same time buckets. This transformation is achieved by applying Little's law \begin{equation} N = X R, \end{equation} where $X$ is the measured throughput of the application.
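As a minimal sketch of that transformation (assuming a hypothetical whitespace-separated input with one time bucket per line: a timestamp, the throughput $X$ in requests per second, and the response time $R$ in seconds), the following Perl emits the $(N, R)$ pairs that are plotted below:

    #!/usr/bin/perl
    # Little's law transformation, N = X * R, applied to each time bucket.
    # Hypothetical input format per line: <timestamp> <X in req/s> <R in seconds>
    use strict;
    use warnings;

    while (my $line = <>) {
        chomp $line;
        my ($ts, $X, $R) = split ' ', $line;
        my $N = $X * $R;                        # concurrent requests in the system
        printf "%s %.2f %.4f\n", $ts, $N, $R;   # emit (N, R) pairs for plotting
    }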

It's now immediately apparent that these transformed data have the general shape of a statistical hockey stick as a function of the number of threads actively processing requests. The hockey stick shows that resource saturation sets in around $N = 250$ concurrent threads, after which the response times start climbing steadily up the hockey-stick handle.

The PDQ Model

What was known about the application is that each request is handled by a thread in a blocking fashion, so the number of threads in the system answering requests will be rate-limited by that blocking behavior. This makes it appropriate to construct a closed PDQ model for this application.

In the first attempt at constructing the PDQ model for these data, the predicted $R$ values were not close to the measured response times. Subsequently, we added 250 "dummy queueing nodes" to account for missing response times. This technique is described in detail in Chapter 12 "Web Application Analysis with PDQ" of the Perl::PDQ book. The physical interpretation of these additional PDQ nodes was not understood initially. Nevertheless, the PDQ model now accurately matched the response time data, as shown in this plot:
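For concreteness, here is a minimal sketch of such a closed PDQ model in Perl::PDQ. The node names, thread population, zero think time and service-time values are illustrative assumptions, not the exact parameters of the production model:

    #!/usr/bin/perl
    # Sketch: closed PDQ model of the web server plus ~250 dummy nodes.
    use strict;
    use warnings;
    use pdq;

    my $threads = 300;        # active thread population, N (assumed value)
    my $Z       = 0.0;        # think time (assumed negligible)
    my $S_web   = 0.00088;    # web-server service time from S = U/X (seconds)
    my $S_dummy = 0.001;      # ~1 ms per dummy queue ("hidden" latency)
    my $n_dummy = 250;        # number of dummy queueing nodes

    pdq::Init("Web app with dummy nodes");
    pdq::CreateClosed("httpReqs", $pdq::TERM, $threads, $Z);

    pdq::CreateNode("webServer", $pdq::CEN, $pdq::FCFS);
    pdq::SetDemand("webServer", "httpReqs", $S_web);

    foreach my $k (1 .. $n_dummy) {
        pdq::CreateNode("dummy$k", $pdq::CEN, $pdq::FCFS);
        pdq::SetDemand("dummy$k", "httpReqs", $S_dummy);
    }

    pdq::Solve($pdq::APPROX);    # approximate MVA solver for large N
    printf "X = %.2f req/s, R = %.4f s\n",
        pdq::GetThruput($pdq::TERM, "httpReqs"),
        pdq::GetResponse($pdq::TERM, "httpReqs");

Sweeping the thread population across the measured range of $N$ then produces the predicted response-time and throughput curves that are compared with the measurements in these plots.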

Similarly, the mean throughput values predicted by PDQ also matched the measured throughput data:

This is where the PDQ model also became pedagogical and exposed some previously hidden performance aspects of this web application. But before getting into that, a word about service times.

The performance data contained estimated service times for the various workloads based, once again, on Little's law: \begin{equation} \rho = X \, S, \end{equation} or \begin{equation} S = \frac{\rho}{X}, \label{eqn:stimes} \end{equation} where $\rho$ is the CPU utilization, $X$ is the throughput and $S$ is the service time. The slope of the hockey-stick handle corresponds to $S_{max}$, the bottleneck service time that limits the maximum possible throughput of the system according to $X_{max} = 1 / S_{max}$. See Section 7.3 of the Perl::PDQ book.
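To make that throughput bound concrete: the handle slope in these data turns out to be about $S_{max} = 1.25$ milliseconds (discussed further below), which caps the throughput at roughly \begin{equation} X_{max} = \frac{1}{S_{max}} = \frac{1}{0.00125~\text{s}} = 800~\text{requests per second}. \end{equation}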

Here are a few selected data values, where $U_{dat}$ refers to the CPU utilization of the server, $S_{est}$ is the estimated service time (in seconds), $X_{dat}$ is the throughput (in requests per second) and $N_{est}$ is the estimated concurrency:

N_est         X_dat          S_est       U_dat
133.833750    416.460510     0.000880    0.366420
137.725453    425.790466     0.000864    0.367760
138.430266    413.350861     0.000842    0.348060

As you can see, the mean service times derived using equation (\ref{eqn:stimes}) are less than one millisecond each, whereas the response times are clearly much higher, viz., in the hundreds of milliseconds range. This takes us back to the "hidden latencies" exposed by the PDQ model.
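As a quick sanity check, plugging the first row of the table into equation (\ref{eqn:stimes}) reproduces the tabulated $S_{est}$ value:

    #!/usr/bin/perl
    # Check S = U / X against the first row of the table above.
    use strict;
    use warnings;

    my $U = 0.366420;      # CPU utilization (fraction of one CPU)
    my $X = 416.460510;    # throughput (requests per second)
    my $S = $U / $X;       # per-request service time (seconds)

    printf "S = %.6f s (%.2f ms)\n", $S, 1000 * $S;   # prints S = 0.000880 s (0.88 ms)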

We added around 250 "dummy queues", each with a per-request service time of about one millisecond. In this way, the sum of the residence times across all 250 dummy queues matched the total response time seen by a user, yet none of the dummy-queue service times exceeded the $S_{max} = 1.25$ milliseconds given by the slope of the hockey-stick handle. The conjectured explanation was that the dummy queues represented some sort of fast lookup in cache memory, or possibly some kind of polling performed by each thread.
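A back-of-envelope check shows why this closes the gap between the sub-millisecond service times and the measured response times. Assuming each dummy queue contributes a residence time close to its service time (i.e., negligible queueing at those nodes), the aggregate hidden latency is roughly \begin{equation} R_{hidden} \approx \sum_{k=1}^{250} S_k \approx 250 \times 1~\text{ms} = 250~\text{ms}, \end{equation} which places the total response time in the hundreds of milliseconds, consistent with the data.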

Confirmation

After discussing this conjecture with the application developers, it turned out that each thread blocks until responses from various external services are received, and this waiting happens inside a polling loop: each thread spins in that loop until its response arrives. That's exactly the kind of polling PDQ predicted. And so, we now knew what limited the application's performance and where it could be improved.
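Schematically, and purely as a hypothetical illustration rather than the application's actual code, each worker thread does something like the following, where poll_response and the roughly one-millisecond poll interval are assumptions chosen to match the dummy-queue service times:

    # Hypothetical polling loop -- illustrative only, not the real application code.
    sub wait_for_response {
        my ($svc, $req_id) = @_;
        my $resp;
        until (defined($resp = $svc->poll_response($req_id))) {   # non-blocking check
            select(undef, undef, undef, 0.001);                   # sleep ~1 ms, then poll again
        }
        return $resp;
    }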

That PDQ periscope turned out to be pretty damn pedagogical!
