## Saturday, January 12, 2008

### Erlang Explained

During the GCaP class last November, I mentioned to my students that I was working on a more intuitive derivation of the Erlang B and C formulae (N.B. the English wiki link to Erlang-C is wrong) associated with the multi-server or M/M/m queue. This queue is very important because it provides a simple performance model of a multiprocessor or multicore systems, amongst other things. When I say simple here, I mean the queueing model is "simple" relative to the complexities of a real multiprocessor (e.g., it does not include the effects of computational overhead). On the other hand, the mathematics of this particular queueing model is not simple. Not only do we need to resort to using some kind of computational aid to solve it (some people have built careers around finding efficient algorithms), the formulae themselves are completely unintuitive. During the class I thought I was pretty close to changing all that and be able to present it before the class finished. I was wrong. It took a little bit longer than a week. :-)

In the GCaP course notes, the Erlang equations occur in the section on a single common waiting line with m servers. Erlang, who was employed by the Danish Telephone Company, first formulated and solved this queueing model in 1917 for doing capacity planning of trunk calls; each trunk line being a separate server. As great as Erlang's accomplishments are, even his original paper offers little insight into what is going on physically, nor does it really explain why the M/M/m queue is so much more complicated mathematically than the M/M/1 queue. This summary by my colleague Mike Tanner gives you an idea what "complicated" means in this context. Consequently, you really need to program a calculator or computer application (e.g., Excel) a to solve for the M/M/m performance metrics e.g., the expected queue length (Q) or mean waiting time (W).

The complexity arises from the mathematical expression for the probability that an incoming request (e.g., a phone call) finds all m trunk switches occupied (busy) and therefore must wait ("Call waiting" in modern parlance). That probability is given by the Erlang-C function, which is expressed in terms of the number of servers (m) and how busy each server is (Greek letter: rho). Just to give you an idea, here's what the Erlang B and C functions look like as we progress from m = 1 to m = 6 servers:

What do these complicated rational functions mean physically? The fact that they can be derived from a bunch of inscrutable probability equations associated with Markov Chains is mere mathematics, not a physical explanation. Queues are real and therefore all those rho's and m's must be physically identifiable with what's going on in the queue at the bank or post office or help desk or multiprocessor or multicores. As far as I know, that level of explanation has never been given until now.

In the spirit of "going Guerrilla", I have always avoided such complications in my classes by using a simpler but approximate formula for the residence (or response) time, viz., R = S/(1 - rho^m) where S is the mean service time and rho is the per-server utilization. This simpler formula can easily go into a single Excel cell, for example. Remember, R = S + W. In Chapter 2 of my Perl::PDQ book, I derived this approximate formula using a finite geometric series. The question still remained, could I use my approximation to get to the exact formula? The idea I had was to adjust the approximation by adding in the appropriate corrections. That turns out to be rather fruitless because the corrections are as inscrutable as the exact expressions one is trying to derive. Moreover, any physical insight (the key objective) remains obscured behind a morass of polynomials, like those in the above table. However, now that I've solved it via a completely different route, I can look back with the benefit of hindsight and see the connection between my previous approximation and my new exact derivation.

Using a few simple diagrams, not only can I derive the Erlang-C formula, but remarkably it forces out the waiting-time equation as part of the process. They are intimately linked together. This approach also provides a completely different and IMHO a more comprehensible view of how the M/M/m queue relates to M/M/1. At no point do I invoke Markov Chains or any probably theory beyond the Poisson distribution, which you must grasp anyway to properly understand M/M/1 or any queueing concepts.

The next task is to write this out more clearly than my current writing-pad scribbles and Mathematica code ... Oh! And see if I still believe it. ;-)