Wednesday, July 4, 2012

Characterizing Performance Bottlenecks

If you do a Google search using keywords like performance, bottleneck, and analysis, you get quite a bewildering list of results, and none of them seems to define clearly what is meant by the term bottleneck.

The word bottleneck refers to a choke point or narrowing, literally like the neck of a bottle, that causes the flow to take longer than it would otherwise. The effect on performance is commonly seen on the freeway in an area undergoing roadwork. Multiple lanes of traffic are forced to converge into a single lane and proceed past the roadwork in single file. Going from parallel traffic flow to serial flow means the same number of cars will take longer to get through that same section of road. As we all know, the delay at a freeway bottleneck can be very significant.

The same is true on a single-lane country road. If you come to a section where roadwork slows down every car, it takes longer to traverse that section of the road. Bottlenecks are synonymous with slowdowns and delays, but they really determine a lot more than delay.

Queueing Analysis

When it comes to computer performance analysis, we can draw on the terminology of queues to characterize a performance bottleneck more accurately. Figure 1 shows an arrangement of three queues in tandem with arrival rate $\lambda$ and respective service times: $S_a, S_b$ and $S_c$. In steady state, what goes in must come out, so the throughput ($X$) of the tandem system is also $\lambda$, i.e., $X=\lambda$.

Let's suppose that the service time of the middle queue, $S_b$, is the longest of the three service times (indicated in red). Service completions at the left queue must occur more rapidly than completions at the middle queue because $S_a < S_b$, so those completions simply feed into the waiting line at the middle queue.

Service completions at the right queue also occur faster than completions at the middle queue because $S_c < S_b$. But that queue is downstream from the middle queue and therefore can have no impact on its performance. The right queue is generally waiting to service arrivals from the middle queue.

In Figure 1, the middle queue is the bottleneck resource. It is identifiable by having the longest service time $S_b > S_a,S_c$, and by Little's law, $\rho = \lambda S$, it is the busiest server.
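To see the utilization law at work numerically, here's a minimal R sketch for the tandem system in Figure 1. The arrival rate and service times are assumed values for illustration only; the text does not specify them:

# Assumed values: arrival rate (requests/sec) and service times (sec)
lambda <- 0.5
S <- c(Sa = 0.4, Sb = 1.2, Sc = 0.6)   # Sb is the longest service time

rho <- lambda * S        # utilization law: rho = lambda * S
rho                      # Sa = 0.2, Sb = 0.6, Sc = 0.3
names(which.max(rho))    # "Sb" -- the busiest server is the bottleneck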

A Parallel Puzzle

That all seems obvious enough because it's akin to driving on the country road mentioned earlier. Everything is flowing in a single file and one section, the bottleneck, is slower than all the others. But what does the queueing analysis look like if there's a bottleneck in parallel flow? Such a situation is shown in Figure 2.

Figure 2. Parallel bottleneck based on service times.

The flow out of the queueing resource on the left splits into two flows, with a fraction of the original flow, say $f = 95\%$, going to the bottom resource with service time $S_c$, and the remaining $5\%$ going to the upper resource with the longest service time $S_b$ (also indicated in red, as in Fig. 1). This is equivalent to the bottom resource seeing a request rate of $0.95\lambda$ and the upper (bottleneck) resource seeing a request rate of $0.05\lambda$.

The utilization of the top resource is given by: \begin{equation} \rho_b = (0.05\lambda) S_b \label{eqn:utilb} \end{equation} and the utilization of the bottom resource is given by: \begin{equation} \rho_c = (0.95\lambda) S_c. \label{eqn:utilcc} \end{equation} To keep the arithmetic simple, let's assume the service time at the bottleneck resource is twice as long as the service time at the lower resource, i.e., $S_b = 2 S_c$. Substituting, we can rewrite \eqref{eqn:utilb} as: \begin{equation} \rho_b = (0.10\lambda) S_c \label{eqn:utilbc} \end{equation}
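Plugging numbers into those equations makes the comparison concrete. A minimal R sketch, assuming illustrative values for $\lambda$ and $S_c$ (the text only fixes the ratios between them):

lambda <- 10       # assumed arrival rate (requests/sec)
Sc <- 0.05         # assumed service time at the bottom resource (sec)
Sb <- 2 * Sc       # the bottleneck service time is twice as long

rho_b <- (0.05 * lambda) * Sb   # utilization of the top resource
rho_c <- (0.95 * lambda) * Sc   # utilization of the bottom resource
rho_c / rho_b                   # 9.5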

Since all the other parameters are equivalent and 95% of the original traffic flow is going to the bottom resource, it's clear that $\rho_c$ is 9.5 times bigger than $\rho_b$. But wait!

If the bottom resource is busier than the bottleneck resource, how can the top resource be the bottleneck!?

A similar question was raised by Guerrilla alumnus Ken Beer, in another context, and it demonstrates how easy it is to become bamboozled when doing performance analysis. Nothing wrong with that; it happens to all of us. The key is to unwind the line of argument to find the flaw in the logic. For me, that's where the equations are invaluable.

The Resolution

To make everything consistent again, we need to rearrange the utilization equations \eqref{eqn:utilcc} and \eqref{eqn:utilbc} as follows: \begin{equation} \rho_b = \lambda (0.10 S_c) = \lambda D_b \end{equation} \begin{equation} \rho_c = \lambda (0.95 S_c) = \lambda D_c \end{equation} where I've written $D_b = 0.10 S_c$ and $D_c = 0.95 S_c$. But what do those $D$s mean?

Figure 3. Parallel bottleneck based on service demands.

In general, $D$ denotes the service demand, i.e., the total time it takes to service a request, allowing for the possibility that it makes more than a single visit to the server. It's the original service time, $S$, multiplied by the visit ratio or visit fraction, $V$. In other words, it reflects the fact that it takes $V_c = 95$ visits out of 100 to the bottom resource for every $V_b = 5$ visits out of 100 to the top resource, in order to service a request. Since the bottom resource has the longest service demand, $D_c > D_b$, it is the true bottleneck resource, now indicated correctly by the red server in Figure 3.
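In R, the resolution amounts to ranking the resources by service demand $D = V S$ rather than by raw service time. A sketch using the visit fractions from Figure 3, with $S_c$ assumed as before:

Sc <- 0.05                   # assumed service time (sec); Sb = 2*Sc
S <- c(b = 2 * Sc, c = Sc)   # raw service times
V <- c(b = 0.05, c = 0.95)   # visit fractions from the flow split

D <- V * S             # service demands: D_b = 0.005, D_c = 0.0475
names(which.max(S))    # "b" -- longest service time (the false bottleneck)
names(which.max(D))    # "c" -- longest service demand (the true bottleneck)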

Figure 4. Serial bottleneck based on service demands.

So, a bottleneck queueing resource is properly determined by its service demand, not its service time. This distinction between service time and service demand is discussed in more detail in my Perl::PDQ book. As Figure 4 shows for the case of sequential queues, the bottleneck resource (in red, again) is the same irrespective of whether the service time or the service demand definition is applied. This is only true in the sequential case, however, because the arrival rate is the same at each resource.

Bottlenecks in Practice

In practice, the approach taken to detecting bottlenecks is very much a function of the system being analyzed, the available instrumentation and the type of performance statistics provided in that environment. That's the reason for the bewildering Google list mentioned in the introduction.

That said, all bottleneck detection boils down to measuring, either by sampling or tracing, any of the following fundamental performance metrics:

  • delays like waiting time, service time, or response time
  • resource utilization
  • queue length or buffer occupancy
and comparing them to find the largest one.

We know from the foregoing discussion that all these metrics are actually a function of the service demand. That dependency can be written schematically as: \begin{equation} \max(D_1,D_2,\ldots) \rightarrow \max(\rho_1,\rho_2,\ldots) \rightarrow \max(Q_1,Q_2,\ldots) \end{equation} Moreover, it's very valuable to establish the maximum service demand, if at all possible, because it controls the maximum system throughput according to: \begin{equation} \lambda \equiv X < \dfrac{1}{\max(D_1,D_2,\ldots)} \label{eqn:maxx} \end{equation} From the standpoint of queueing models, like those expressed in PDQ, the arrival rate and the service demand are essential input metrics; everything else is an output metric.
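As a quick numerical illustration of the throughput bound \eqref{eqn:maxx}, here is an R one-liner using assumed service demands:

D <- c(0.005, 0.0475, 0.020)   # assumed service demands (sec)
1 / max(D)                     # upper bound on X: about 21.05 requests/sec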

Equation \eqref{eqn:maxx} also accounts for the following Guerrilla mantra:

You never remove a bottleneck, you just move it.

The red server in Figures 3 and 4 is called the primary bottleneck. If that service demand can be reduced by performance tuning, for example, then the secondary bottleneck becomes the new primary bottleneck. It's also possible to have different bottlenecks existing simultaneously in the same system.
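The effect of moving a bottleneck can be seen directly in the throughput bound. A sketch with assumed demands, where tuning halves the primary bottleneck's demand:

D <- c(cpu = 0.030, disk = 0.020, net = 0.005)   # assumed demands (sec)
names(which.max(D))        # "cpu" -- the primary bottleneck
1 / max(D)                 # throughput bound: about 33.3 requests/sec

D["cpu"] <- D["cpu"] / 2   # tuning halves the CPU demand
names(which.max(D))        # "disk" -- the secondary bottleneck is now primary
1 / max(D)                 # the bound improves to 50 requests/sec, set by the disk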


The same comment about creating confusion also applies to the book The Goal, by E. M. Goldratt and J. Cox, which some appreciative Guerrilla alumni graciously presented to me as a gift in 1998. Unfortunately, I had trouble seeing the merits of this book. Any author who takes 140 pages to explain the concept of a bottleneck has already created a bottleneck for understanding bottlenecks.


POSTSCRIPT

There is a potential ambiguity in Figure 3. When you draw such diagrams, you have a choice:

  1. Split the flow according to fractions $f$ and $(1 - f)$, but use the service time $S$ (not $D$) to label each server, as shown in Figure 3. The reason for doing this is to avoid a double multiplication by $f$. That would happen, for example, when calculating the resource utilization: one $f$ coming from the split flow into the lower resource, and another $f$ coming from the demand $D = f S$ at that server. The utilization would then be calculated incorrectly as $\rho = (\lambda f) \times (f S)$.
  2. Use the service demand $D$ (not $S$) to label each server, but do not split the total flow $\lambda$. The trouble with this approach is that it looks wrong, even when drawn correctly. Since it's logically impossible for $\lambda$ to be going to only one resource and be split at the same time, you would have to show the total flow going to one resource for a fraction $f$ of the time, and to the other resource for the remaining fraction $(1-f)$ of the time. That probably requires two diagrams.
Option 1 is recommended because it is more intuitive. PDQ, on the other hand, uses option 2 internally.
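The double-multiplication pitfall in option 1 is easy to check numerically. A sketch in R, reusing the assumed values from the earlier examples:

lambda <- 10; Sc <- 0.05; f <- 0.95    # assumed values

rho_ok    <- (lambda * f) * Sc         # option 1: split flow times service time S
rho_wrong <- (lambda * f) * (f * Sc)   # pitfall: f applied twice
c(rho_ok, rho_wrong)                   # 0.475 versus 0.45125

lambda * (f * Sc)                      # option 2: total flow times demand D = f*S
                                       # 0.475 -- agrees with option 1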

10 comments:

Matteo said...

Hi Dr. Gunther,
It seems that in the case of parallel flow, to maximize X we need to minimize the demand. The equilibrium should be where the two loads are equal, i.e., when (1-f)*lambda*Sb = f*lambda*Sc. In this case, where we have only two parallel servers and we know that Sb = 2Sc, I would say we should balance the load as 1/3 to the b node and 2/3 to the c node, that is, f = 2/3. But how do you calculate the optimum balance in a general way, with an arbitrary number of parallel nodes?
Many Thanks
MatteoP

Neil Gunther said...

Super question, Matteo! How about this?

Encode the representation of a bottleneck using Postscript Option 1 (above) into an R function that computes the throughput branching ratios:

bratios <- function(stimes) {
    # Branching fractions inversely proportional to the service times,
    # normalized by the harmonic mean so that they sum to one.
    n <- length(stimes)
    h <- 1 / mean(1 / stimes)   # harmonic mean of the service times
    return((h / n) * (1 / stimes))
}

Try it out for the case of the blog example in Fig. 3:
> bratios(c(2,1))
[1] 0.3333333 0.6666667
which seems to agree with your calculations.

Let's try it using service times in milliseconds:
> st <- c(5,15)*1e-3
> bratios(st)
[1] 0.75 0.25
which agrees with a 3:1 ratio.

And for the general case of an arbitrary number of disks, which you asked about:
> bratios(c(1,2,3,4,5))
[1] 0.43795620 0.21897810 0.14598540 0.10948905 0.08759124

Finally, check that those branching ratios sum to one:
> b <- bratios(c(1,2,3,4,5))
> sum(b)
[1] 1

Mohan Radhakrishnan said...

Do these queues also model DNS lookups and web server times before hitting a web application? Sometimes browsers cache pages or images. Should I model at this granularity if I were to measure login times or menu access times?

Neil Gunther said...

Mohan,

Without more information and data, it's impossible for me to say.

In general, however, as a performance modeling abstraction, queues are dumber than a hod of bricks; they merely do what you tell them to do. Therefore, their application is only limited by your imagination and the available data.

Mohan Radhakrishnan said...

Maybe this is explained in your class :-) I will ignore the browser caches. The test system I am trying to model, based on the PDQ book, does not have a huge DNS delay. It does not have a web server now, but in production both loads will be significant. Is there a recommended way to model it by predicting the missing pieces reasonably?

Neil Gunther said...

Two options:

1) Come to class http://is.gd/qLZXIE :)

and/or

2) Try it in PDQ http://is.gd/hxCjL4

and then we can discuss something concrete.

Mohan Radhakrishnan said...

Tried and plotted samples, because we have only 1 measurement for one user login at this time. The system is not open for testing. My thought is that there should be dummy PDQ nodes (p. 379 in Chap. 12). Line 5 from the bottom of that page seems to point to PDQ code but ends with a :
But I didn't get throughput rolloff. Will read again.

Neil Gunther said...

Mohan,

a) What looks like a typo is just a result of code listing 12.6 having floated from its old position in the 1st edn. There should really be a callout to that listing. Will fix.

b) Just to be clear, the additional dummy queues/nodes in PDQ do not induce throughput rolloff. That's a separate effect discussed in Section 12.4.6. The dummy queues only add latency to the system, so that all the other parameters match up with the test rig.

Mohan Radhakrishnan said...

I picked up a bottleneck analysis problem from pages 241-242 of "The Operational Analysis of Queueing Network Models" by Peter J. Denning and Jeffrey P. Buzen (Computing Surveys, Vol. 10, No. 3, September 1978).

My initial attempt is this.

# Bottleneck analysis.

library(pdq)

# Model parameters
load  <- 20      # number of terminal loads to evaluate
think <- 20      # think time, Z (sec)

cpu  <- "cpunode"
disk <- "disknode"
drum <- "drumnode"

cpustime  <- 5
diskstime <- 8
drumstime <- 4

workload <- "workload"

xc <- 0
yc <- 0

for (n in 1:load) {
    Init("")

    CreateClosed(workload, TERM, as.double(n), think)

    CreateNode(cpu, CEN, FCFS)
    # SetVisits(cpu, workload, 20.0, cpustime)
    SetDemand(cpu, workload, cpustime)

    CreateNode(disk, CEN, FCFS)
    # SetVisits(disk, workload, 11.0, diskstime)
    SetDemand(disk, workload, diskstime)

    CreateNode(drum, CEN, FCFS)
    # SetVisits(drum, workload, 8.0, drumstime)
    SetDemand(drum, workload, drumstime)

    Solve(EXACT)

    xc[n] <- as.double(n)
    yc[n] <- GetResponse(TERM, workload)
}

M1 <- GetLoadOpt(TERM, workload)

plot(xc, yc, type="l", ylim=c(0,50), lwd=2, xlab="M", ylab="seconds")
abline(a=-think, b=cpustime, lty="dashed", col="red")
# the sum of the demands gives the minimal response time of 2.2 seconds
abline(2.2, 0, lty="dashed", col="red")
text(M1+6, 0, paste("M1=", as.numeric(M1)))

Even though this generates a graph with an asymptote, I think the calculation is not working out correctly.

Visit ratios for the CPU, disk, and drum are:

V0 = 1 = 0.05 V1
V1 = V0 + V2 + V3
V2 = 0.55 V1
V3 = 0.40 V1

Solving these equations: the first gives V1 = 1/0.05 = 20, and therefore V2 = 0.55 V1 = 11 and V3 = 0.40 V1 = 8, i.e., V1/V2 = 20/11 and V3/V2 = 8/11.

V1 S1 = (20)(0.05) = 1.00 seconds (total CPU time: the bottleneck)

V2 S2 = (11)(0.08) = 0.88 seconds (total disk time)

V3 S3 = (8)(0.04) = 0.32 seconds (total drum time)

These products sum to the minimal response time of 2.2 seconds.

M1 = Z/(V1 S1) = 20/1 = 20 terminals

I am getting M1 = 4 and not 20. Is this because the visit ratios are not used in the PDQ script?

Matteo said...

Hi Dr. Gunther,
thanks a lot for your bratios function. Harmonic mean is the solution.
Regarding bottlenecks, I'm experiencing some behaviour that is confusing me. I have a queue that feeds 16 parallel processes. The maximum throughput I'm getting is 15 requests completed per second. Each request is completed in 663 ms, and the processing of each request consists of three sub-periods: first, a series of read-only DB queries, lasting 149 ms on average; then an enqueue wait to acquire an exclusive lock on a table row (SELECT FOR UPDATE), lasting 374 ms on average; and finally, once the lock is acquired, another 140 ms of processing up to the commit that completes the request and releases the lock. The locked row is the same for all the processes, so each must wait while the row is held by another process.

What confuses me is that, by virtue of the USL, I'm expecting the maximum throughput to be a function of the serialized time: 1/0.140 -> 7 reqs/sec (or less). It seems that each process waits for less than the full serialized time because the requests are not processed simultaneously: each process arrives at the serialization point after the process ahead of it has already held the lock for a while, so waiting processes do not wait a full serialized period for each process ahead. This seems to be confirmed by the wait times logged in the DB trace file (I traced only one session): I see several 30 ms wait events before the process is able to acquire the lock (totalling the 374 ms mentioned above).

Hoping the description is clear (I have some doubts): do you see any inconsistencies in the data? Are they clearly inconsistent, or do I just need to understand how to derive the serialization parameter properly?
Thanks
MatteoP