Monday, September 17, 2007

MVA, Upgrades, and Other Visitations Upon PDQ

Guerrilla alumnus Sudarsan Kannan asks:

"I'm trying to understand concepts behind performance modeling (analytical modeling) based on MVA algorithm. ...I'm also trying to understand WHAT IF scenarios such as: What if I increase my CPU speed or upgrade Disk I/O subsystem? What impact will the increase in CPU speed have on throughput, response time and more variables? What if I increase number of users. I have couple of questions to get a better picture of MVA algorithm:
  1. How to find visit ratios for CPU?
  2. Can I vary service time (S) for a resource (CPU or disk) if I increase/decrease the processor/disk speed to answer WHAT IF scenarios?"
Without getting too technical, let me briefly review what MVA is. MVA stands for "Mean Value Analysis", and is otherwise a rather poor choice of name (IMHO) for a particular algorithm used to solve certain queueing models analytically (as Sudarsan says). It usually refers to an iterative technique used to solve closed queueing circuits (See figure). Here, closed means that there is a finite number of requests (N) in the system. No requests enter from the outside and none leave the queueing circuit. Therefore, all queues are well-bounded. Closed queueing circuits are therefore a good model for things like an operating systems (with a fixed number N of active processes) and load-test platforms (with a fixed number N of active clients or virtual users).

The difficulty with solving a closed queueing model is that it contains a negative feedback loop: the more requests that are at service resources (out of the N possible requests), the fewer new requests that can be issued. On the one hand, this is a good thing because it is a self-regulating system (can't have unbounded queues), on the other is it impossible to solve analytically because although I could solve for the throughput X(N) using Little's law: X(N) = Q(N)/R(N), to do so I need to know R(N). I could use the residence-time formula for a closed queue: R(N) = [N/X(N)] - Z, but I need to know X(N) to do that. This circular logic merely expresses the existence of the feedback loop.

That's where the MVA algorithm comes in. It breaks this circular logic by first calculating all the performance metrics for n = 1, then n = 2, and so on, up to the value n = N. In other words, what we can't accomplish analytically, we can accomplish iteratively. The key observation that permits this iterative scheme to work is called the Arrival Theorem, and I won't go into that here since it's discussed in Section 3.5.1 of my Perl::PDQ book.

Now, to Sudarsan's questions. Let me address question (2) first. Once you have a baseline PDQ model, varying the service times (S) and the the user load (N) as part of answering "what-if" scenarios is exactly the correct idea because those quantities are input parameters to the PDQ model. For example, if you wanted to estimate the impact of upgrading certain disks to ones which had 10 times faster transfer rate, you would replace the baseline service time (S) at each disk by S/10 in the PDQ SetDemand() call. Running the PDQ model then provides you with output performance metrics such as: throughput, queue lengths, utilizations, etc., for your particular choice of N and S.

The answer to question (1) relies on a kind of trick. You are already aware of the service time (S). There is another quantity called the service demand which is defined as D = V × S. Here's how it works. Suppose I am modeling the scheduler in Linux. Since it's a time-share operating system, it uses the concept of time quanta. Any of the N processes that takes longer than the time quantum (e.g., 20 ms) gets kicked off the CPU and put back in the run-queue where it waits to return to the CPU for further service. Depending on the total amount of execution time (D) needed to service that process, it could return to the CPU more than once. It may make several visits (V), each of which gets a time quantum S at the CPU. The total time spent at the CPU will be the product V × S; the service demand.

Technically speaking, what I have just described is visit counts; a positive integer (V). Sudarsan asked about visit ratios; a real number <= 1. The distinction is that in the first instance I'm counting arrivals at a particular resource, e.g., a disk. The visit count (V_k) is the number of IO requests to that disk, now labelled with the subscript k. However, suppose I am considering a set of disks belonging to a database. One disk could be for the transaction log, others could be for data tables and so on. Various groups of disks will have different average IO request rates: V_k where k = 1, 2, 3, .... If I count all the IOs in the system (V) then the visit ratio to any particular disk will be V_k/V. This gives a measure of the probability that IOs go to one disk versus another. It acts just like branching probabilities and is included in the service demand in exactly the same way I described earlier; it's the product D_k = (V_k/V) × S_k. This is all discussed in more detail in Chapter 3 of my Perl::PDQ book. Interestingly enough, this is another of those differences between simulators and PDQ discussed in the previous blog item. In a simulation, you need to be able to specify the explicit workflow. In PDQ you can just use visit ratios instead.

Great questions and I hope my answers are helpful.


Unknown said...

I have more questions based on your post above.

Can we also express visit ratio as Vi = Ci/C0, where
Ci = Total number of service completions at resource i in time T
C0 = Total number of requests completed by the whole system in time T

For example if disk is a resource can I safely assume Ci to be the number of I/O's completed in time T
and C0 to be the total number of transactions that I measure using a tool like loadrunner.

In other words can we say Visit ratio as the average number of disk I/0's per transaction

Also can I apply the same concept for CPU too?

Neil Gunther said...

You have to be careful how you define a request. If you want to use visit ratios, then they have to be <= 1 because they are acting like branching probabilities. That means the system workload has to be homogeneous e.g., all IOs or all DB tx's. You can't mix them together, otherwise you could have 1 DB tx issuing 3 physical IOs (which may well be true), but the visit ratios would come out all wrong. Same logic holds for CPU work.