Friday, October 30, 2009

Parallelism in PDQ

All so-called "analytic solvers" for queueing models, including PDQ, assume that the queueing system being modeled is in steady state. Steady state means that in the long run, the number of arrivals into a service facility, e.g., customers arriving at a grocery checkout, will be identical to the number of customers departing. Why is this important?

Think about observing a real checkout for just 30 seconds. In that time, the current customer might well complete service (i.e., finish having their groceries rung up) and depart for the car park. But, during that same period, 3 new customers might have arrived at the checkout and joined the queue. That's not good for our purposes, because the number of arrivals exceeds the number of departures: the flow in does not equal the flow out. It therefore looks like a transient condition, and transient systems are more difficult to solve than those in steady state. Moreover, reliable statistics only make sense in steady state.

Performance engineers face the same problem of detecting steady-state throughput when load testing an application. One way to address it is simply to observe the system for a longer time. If the shorter-term observations are close to the longer-term observations, then the system must be close to steady state. At the grocery store, you might need to observe the checkout for 30 minutes, instead of 30 seconds, to see the same number of customers departing as arrived during the observation period.

In PDQ, however, you don't have to observe steady state at all, because the solutions can only be produced by assuming the system is already in steady state.
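
To make the load-testing version of that check concrete, here is a minimal Perl sketch, with made-up window counts and an arbitrary 5% tolerance, of the kind of steady-state test a performance engineer might script:

use strict;
use warnings;

# Departures counted in successive 30 s windows (illustrative numbers)
my @window_counts = (7, 12, 9, 11, 10, 10);

my ($cum, $n) = (0, 0);
for my $c (@window_counts) {
    $cum += $c;
    $n++;
    my $long_avg = $cum / $n;                        # long-run average
    my $rel_err  = abs($c - $long_avg) / $long_avg;  # short vs. long term
    printf "window %d: short=%2d long=%5.2f rel.err=%.2f\n",
        $n, $c, $long_avg, $rel_err;
}
# If the relative error stays within about 5% over several consecutive
# windows, the system can be treated as approximately steady.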

One side-effect of solving queueing systems in steady state is that every customer has to visit every service facility, whether or not they actually do so in the real world. This follows from the fact that in steady state, we don't know where each customer is at any instant because we can't resolve instants when everything is averaged out over the long run. In other words, in a PDQ model of a grocery store, every customer visits every checkout aisle! That's very different from the way things are set up in an event-based simulator or LoadRunner. In PDQ, and similar modeling tools, the solutions are derived by assuming that every customer is sequenced through every queue. Let me clarify: there is no parallelism in PDQ. You may well ask, "If there's no parallelism in PDQ, how can I make a performance model of things like a disk array or Hadoop?" That's a very good question and one that had better have an answer because disk arrays and Hadoop utilize parallelism by their very definition. The answer is, there's a trick to doing it in PDQ.

The trick is to recognize that 2 parallel disks servicing IO requests (incoming at total rate λ, split 50/50 so that each disk sees λ/2) produce the same response time as all the IO requests (λ) being serviced by 2 double-speed disks (S/2) in sequence. Diagrammatically, it looks like this:

[Diagram: left side shows two parallel queues, each fed λ/2 with service time S; right side shows two tandem queues, each fed the full λ with service time S/2.]

where S is the average disk service time (including seek, xfer, etc. times) and S/2 means the disk services IOs in half the time or runs twice as fast. This trick (it's really a theorem) generalizes to an m-way JBOD or array.
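
To see why this holds for any m, apply the M/M/1 response-time formula (eqn. (1) below) to both configurations:

    R_|| = S / (1 − (λ/m)S)    for m parallel disks, each fed λ/m,

    R_m = m × (S/m) / (1 − λ(S/m)) = S / (1 − (λ/m)S)    for m sequential m-times-faster disks.

The two expressions are identical, which is the content of the theorem.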

Let's take a simple example of a 7200 RPM SATA disk servicing IOs arriving at an average rate of λ = 50 IOPS. According to this blog, the average service time for a single random IO is S = 14.1 ms. We can usefully compare 3 cases:

  1. Single disk with buffering, using the simple M/M/1 formula:

    R_1 = S / (1 − λS)        (1)

    gives an average IO latency of 0.047619 s or 47.6 ms. The waiting time in the buffer accounts for the more than 3× increase in latency relative to the 14.1 ms single-IO service time.

  2. Twin parallel disks using the simple M/M/1 formula (left side of the above diagram):

    R_|| = S / (1 − (λ/2)S)        (2)

    The 50/50 split in the arriving IO requests shows up in the denominator of eqn. (2), which predicts an average latency of 0.0217391 s or 21.7 ms. That is closer to the best case time of 14.1 ms for a single random IO, and demonstrates that parallelism is a good thing.

  3. Sequential double-speed disks (right side of the above diagram):

    R_2 = (S/2) / (1 − λ(S/2)) + (S/2) / (1 − λ(S/2))        (3)

    This time there is no split in the arrival stream, but the factor of 2 now shows up in both the numerator and the denominator because it is associated with S. Eqn. (3) produces a response time identical to that obtained from eqn. (2), thereby corroborating the theorem. (A quick numerical check of all three cases appears in the sketch below.)
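
As a sanity check, here is a minimal Perl sketch (no PDQ required) that evaluates eqns. (1)-(3) directly. Note that it takes S = 14.1 ms exactly, so the printed values differ slightly from the figures above (about 47.8 ms and 21.8 ms), consistent with the arithmetic in the comments below.

use strict;
use warnings;

my $lambda = 50.0;     # arrival rate (IOPS)
my $S      = 0.0141;   # single-disk service time (s)

my $R1   = $S / (1 - $lambda * $S);                  # eqn. (1): single disk
my $Rpar = $S / (1 - ($lambda / 2) * $S);            # eqn. (2): twin parallel disks
my $R2   = 2 * ($S / 2) / (1 - $lambda * ($S / 2));  # eqn. (3): tandem double-speed disks

printf "R1 = %.4f s  R|| = %.4f s  R2 = %.4f s\n", $R1, $Rpar, $R2;
# R|| and R2 come out identical, corroborating the theorem.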

It's eqn. (3) that tells us how to implement parallel disks in Perl PDQ, for example. Create two PDQ nodes, each with half the service time (twice the speed) of a single disk; with the usual PDQ boilerplate added, a complete model looks like this:

use pdq;

pdq::Init("Parallel disk trick");
pdq::CreateOpen("IOrequests", 50.0);               # lambda = 50 IOPS
pdq::CreateNode("disk1", $pdq::CEN, $pdq::FCFS);
pdq::SetDemand("disk1", "IOrequests", 0.0141/2);   # S/2: runs twice as fast
pdq::CreateNode("disk2", $pdq::CEN, $pdq::FCFS);
pdq::SetDemand("disk2", "IOrequests", 0.0141/2);
pdq::Solve($pdq::CANON);
pdq::Report();

If there are m disks in parallel, just write a loop, as in the sketch below.
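
A minimal sketch of that loop, assuming the same 50 IOPS workload (the m = 8 value and the variable names are merely illustrative):

use pdq;

my $m = 8;        # number of parallel disks being modeled
my $S = 0.0141;   # single-disk service time (s)

pdq::Init("m-way parallel disks");
pdq::CreateOpen("IOrequests", 50.0);
for my $i (1 .. $m) {
    pdq::CreateNode("disk$i", $pdq::CEN, $pdq::FCFS);
    pdq::SetDemand("disk$i", "IOrequests", $S / $m);  # each node runs m-times faster
}
pdq::Solve($pdq::CANON);
pdq::Report();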

What I like about this trick is that it provides an unusual insight into how parallelism works. It tells you that the reason parallel disks are a good thing, is because m-disks effectively run m-times faster. The PDQ model cluster.pl, in Chap. 7 of my Perl::PDQ book, is a generalization of this twin-disk example that utilizes up to 240 disks.

The above trick also tells you how you can beat parallel disks. Instead of going through all m sequential fast stages, just go through the first one. That turns out to be a version of Amdahl's law, which I'll discuss in my upcoming Guerrilla Capacity Planning class.
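
One way to see the connection (a sketch, not the full Amdahl derivation): a request that could be serviced by just the first of the m fast stages would see a latency of

    R_first = (S/m) / (1 − λ(S/m)) = R_m / m,

an m-fold improvement over the full sequential circuit, and hence over the equivalent parallel configuration, since R_m = R_||.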

8 comments:

joshjordan said...

How are you getting 17.1 in (2)? I get R_II = 14.1/(1-(0.05/2)*14.1) =~ 21.776.

Neil Gunther said...

You are quite correct. Sorry about that.

I did it in Mathematica, but had changed the number of disks to 4 (instead of 2) while playing around and forgot to change it back.

Now fixed. I get R_|| = 0.0217391 seconds in MMA6.

Well spotted, Batman! Thank you.

MariuszW said...

Hi,
I have three questions regarding the PPDQ book (and PDQ in general).

First question:
In the examples directory of the PDQ distribution there is mm1n.pl. Changing the line

$pdq::nodes = pdq::CreateNode($nodeName, $pdq::CEN, $pdq::FCFS);

to

$pdq::nodes = pdq::CreateMultiNode(1, $nodeName, $pdq::CEN, $pdq::FCFS);

gives the same results except for utilization, which is "0.0" in the second case. Also, changing the number of servers from 1 to 2, for example, changes nothing in the other results (they are the same in both cases). Shouldn't there be a difference when a multiserver queueing node is used?


Second question:
In PPDQ (10.4.6 Adding Overdriven Throughput in PDQ) there is code:
for ($users = 1; $users <= $MAXUSERS; $users++) {
    PDQ::SetDemand($node1, $work,
        8.0 * ($users ** 0.085) * 1e-3);        # WS
    PDQ::SetDemand($node2, $work, 3.12 * 1e-3); # AS
    PDQ::SetDemand($node3, $work, 1.56 * 1e-3); # DB

Is everything (WS, AS, DB) inside the "for" loop?

Third question:
Maybe a stupid one...but in the example in PPDQ (3.7 Multiple Workloads in Closed Circuits) there are charts for N users (N in 0...35), but the data (C) is only for 35 users. Where is the data for the other user counts?

Regards,
MariuszW

MariuszW said...

Hi again,
When you use S/2 as in your example, PDQ uses it as the service demand to compute the residence time at the node. Do "Residence time" and "Waiting time" still have any meaning when this trick is used? The same question applies when SetVisits is used.

Regards,
Mariusz

Neil Gunther said...

Mariusz,

In a word, yes. The factor of 1/2 could be any numerical factor.

If we call the new value of the service time S', and substitute it back into the respective queueing formulae, everything still looks the same. In other words, the meaning is the same, only the numbers are different.

Neil Gunther said...

MariuszW. Re: PDQ CreateNode vs CreateMulti

I just did a test in PDQ-R and I don't see the problem.


Node  Sched  Resource  Workload  Class  Demand
----  -----  --------  --------  -----  ------
   1  MSQ    multin1   tranxns   TRANS  1.0000

Queueing Circuit Totals:
  Streams: 1
  Nodes:   1

WORKLOAD Parameters:
  tranxns    0.7500    1.0000

Metric       Resource  Work     Value    Unit
------       --------  ----     -----    ----
Throughput   multin1   tranxns   0.7500  Trans/Sec
Utilization  multin1   tranxns  75.0000  Percent


If this is not it, please post your PDQ questions over at the PDQ dev forum https://sourceforge.net/projects/pdq-qnm-pkg/forums/forum/737917

Aater Suleman said...

"It tells you that the reason parallel disks are a good thing, is because m-disks effectively run m-times faster."

This statement is misleading. M-disks have the same throughput as a disk which is M-times faster. They don't get faster; the latency is still the same.

Let's suppose that each slow disk access takes 1 ms (a bogus number). If we have one slow disk, the latency of each access is 1 ms and the throughput is 1 access per ms.

With 8 slow disks, the latency is 1 ms and the throughput is 8 per ms.

With one 8x faster disk, the latency is 1/8 ms and the throughput is 8 per ms.

Neil Gunther said...

Aater,

I believe we are agreed that the math is ok, so it's a question of interpretation. I am probably guilty of trying to say too much in one sentence. My defense is that (hopefully) it gets presented more clearly in my classes.

The utilization for a single disk is ρ = λS and the latency is given by eqn. (1).

One goal for workloads where latency really counts, e.g., parallel database queries, is to reduce the single-processor time per query R_1 by striping the tables across m backend disks so that R_m = R_1/m, which would be equivalent to m-disks enabling the workload to run m-times faster.

Looking at the LHS of the diagram: Halving the arrival rate λ across the twin disks (m = 2) reduces the utilization by half, so ρ = (λ/2)S, which is the same as ρ = λS/2.

In this sense, the service time is reduced by a factor of 2 (or the service rate is doubled), which is effectively the goal of query parallelism. But this effect is thwarted from reaching the ideal case because the S in the numerator of eqn. (2) is not divided by 2. In that sense the disk speed (service time) is the same—as you point out. Only the utilization has been reduced due to the lower traffic rate.

The parallel latency in eqn. (2) is less than eqn. (1) because of the lower utilization, but it's not lower by a factor of 1/2.

Looking at the RHS of the diagram: By contrast, the disk speed is truly halved in eqn. (3), but now each request has to visit both double-speed disks. Hence, the RHS latency is equal to the LHS latency.

What I meant to convey was that although it appears like you should be able to achieve the ideal query response time goal of R_m = R_1/m, in practice you can't because of queueing.

You can see this more explicitly in Figure 2 of another post on The Mythical Man Month which shows the Amdahl bound on latency as a horizontal line above the x-axis. The corresponding curve for R_m = R_1/m is shown in Figure 1. The Amdahl bound prevents you from actually achieving that ideal parallel goal because of serialization or waiting time.