The Pith of Performance: May 2010

Sunday, May 30, 2010

Simulating a Queue in R

In the GCaP class earlier this month, we talked about the meaning of the load average (in Unix and Linux) and simulating a grocery store checkout lane, but I didn't actually do it. So, I decided to take a shot at constructing a discrete-event simulation (as opposed to Monte Carlo simulation) of a simple M/M/1 queue in R.
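
To give a flavor of what such a simulation involves, here is a minimal discrete-event sketch of an M/M/1 queue. It is written in Python rather than R, and it is not the code from the post: the function name, parameters and event-list layout are all assumptions made for illustration. Arrivals and departures are kept in a time-ordered event list, and the simulated mean residence time can be compared with the textbook M/M/1 result 1/(μ − λ).

```python
import heapq
import random
from collections import deque


def simulate_mm1(arrival_rate, service_rate, num_customers, seed=1):
    """Event-driven simulation of a FIFO M/M/1 queue.

    Returns the mean residence time (waiting plus service) per customer.
    All names here are illustrative, not taken from the original post.
    """
    rng = random.Random(seed)
    # Event list: (event time, event kind), ordered by time.
    events = [(rng.expovariate(arrival_rate), "arrival")]
    in_system = deque()  # arrival times; the head is the customer in service
    arrived = 0
    served = 0
    total_residence = 0.0

    while served < num_customers:
        now, kind = heapq.heappop(events)
        if kind == "arrival":
            arrived += 1
            in_system.append(now)
            if len(in_system) == 1:
                # Server was idle, so service starts immediately.
                heapq.heappush(
                    events, (now + rng.expovariate(service_rate), "departure"))
            if arrived < num_customers:
                # Schedule the next arrival.
                heapq.heappush(
                    events, (now + rng.expovariate(arrival_rate), "arrival"))
        else:
            # Departure: record residence time, then start the next service.
            total_residence += now - in_system.popleft()
            served += 1
            if in_system:
                heapq.heappush(
                    events, (now + rng.expovariate(service_rate), "departure"))

    return total_residence / num_customers


if __name__ == "__main__":
    lam, mu = 0.5, 1.0
    print("simulated  R:", simulate_mm1(lam, mu, 100_000))
    print("analytic   R:", 1.0 / (mu - lam))
```

With a long enough run, the simulated value should settle near the analytic one; how close depends on the run length and the random seed.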

Jackson's Theorem for the Cloud

Queueing theory, as a distinct discipline, just turned 100 last year. Compared with mathematics and physics, it's a relative youngster. Some seminal results include: Erlang's original solution for the M/D/1 queue (1909), his solutions for a multiserver queue without a waiting line (M/M/m/m) and with a waiting line (M/M/m/∞, AKA "call waiting") (1917), the Pollaczek–Khinchine formula for the M/G/1 queue (1930) and Little's proof (1961). These results were established in the context of individual queueing facilities.
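
For readers who want to see a few of these results in computable form, the following sketch implements the standard textbook expressions for Erlang B (M/M/m/m blocking), Erlang C (M/M/m probability of waiting) and the Pollaczek–Khinchine mean waiting time for M/G/1. The function and parameter names are assumptions chosen for illustration; the offered load is λ times the mean service time, and `scv` denotes the squared coefficient of variation of the service time.

```python
def erlang_b(servers, offered_load):
    """Blocking probability for M/M/m/m, using the standard recurrence."""
    b = 1.0
    for k in range(1, servers + 1):
        b = offered_load * b / (k + offered_load * b)
    return b


def erlang_c(servers, offered_load):
    """Probability that an arrival must wait in M/M/m (needs load < servers)."""
    if offered_load >= servers:
        raise ValueError("offered load must be less than the number of servers")
    b = erlang_b(servers, offered_load)
    rho = offered_load / servers
    return b / (1.0 - rho * (1.0 - b))


def pk_mean_wait(arrival_rate, mean_service, scv):
    """Pollaczek-Khinchine mean waiting time (excluding service) for M/G/1."""
    rho = arrival_rate * mean_service
    if rho >= 1.0:
        raise ValueError("utilization must be less than 1")
    return rho * mean_service * (1.0 + scv) / (2.0 * (1.0 - rho))


if __name__ == "__main__":
    print(erlang_b(1, 1.0))             # single server, no waiting room
    print(erlang_c(1, 0.5))             # reduces to the M/M/1 utilization
    print(pk_mean_wait(0.5, 1.0, 1.0))  # exponential service (M/M/1)
    print(pk_mean_wait(0.5, 1.0, 0.0))  # deterministic service (M/D/1)
```

Setting `scv` to 0 gives the M/D/1 case, which waits half as long in the queue as M/M/1 at the same utilization.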

Load Testing Think Time Distributions

One of my gripes about some commercial load testing tools is that they only provide a think time distribution (Z) that is equivalent to uniform variates in the client-script. If you want some other distribution, you have to code it and debug it yourself. Load test generators are essentially very expensive workload simulators; especially when you take into account the cost of the SUT platform. At those prices, a selection of distributions should be provided as a standard library—like they are in event-based simulators.

To make this point a bit clearer, I used the very convenient variate-generation functions in R to compare some of the distributions that I consider should be included in such a library for the convenience of workload-test designers and performance engineers. The statistical mean (i.e., the average think delay) is the same in all these plots and is shown as the red vertical line, but pay particular attention to the spread around the mean on the x-axis.
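
The same idea can be sketched in Python with the standard library. The four distributions below are an assumption (this excerpt does not say which ones the plots cover), as are the function names. Each sample set has the same mean think time Z, so any difference shows up only in the spread.

```python
import random
import statistics


def think_time_samples(mean_z, n, seed=1):
    """Draw n think times from several distributions sharing the mean mean_z.

    The choice of distributions is illustrative only.
    """
    rng = random.Random(seed)
    return {
        "constant": [mean_z] * n,
        # Uniform on (0, 2Z) has mean Z.
        "uniform": [rng.uniform(0.0, 2.0 * mean_z) for _ in range(n)],
        # expovariate takes the rate, i.e. 1 / mean.
        "exponential": [rng.expovariate(1.0 / mean_z) for _ in range(n)],
        # Gamma with shape 4 and scale Z/4 has mean 4 * (Z/4) = Z.
        "gamma(k=4)": [rng.gammavariate(4.0, mean_z / 4.0) for _ in range(n)],
    }


def summarize(samples):
    """Return {name: (sample mean, population standard deviation)}."""
    return {
        name: (statistics.mean(xs), statistics.pstdev(xs))
        for name, xs in samples.items()
    }


if __name__ == "__main__":
    for name, (mean, sd) in summarize(think_time_samples(10.0, 50_000)).items():
        print(f"{name:12s} mean={mean:6.2f}  stdev={sd:6.2f}")
```

The means should agree to within sampling error, while the standard deviations differ widely: zero for the constant delay, largest for the exponential.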

Intel's Cloud Computer on a Chip

Last week in the GCaP class, I underscored how important it is to "look out the window" and keep an eye on what is happening in the marketplace, because some of those developments may eventually impact capacity planning in your shop. Here's a good example:

This Intel processor (code named "Rock Creek") integrates 48 IA-32 cores, 4 DDR3 memory channels, and a voltage regulator controller in a 6×4 2D-mesh network-on-chip architecture. Located at each mesh node is a five-port virtual cut-through packet switched router shared between two cores. Core-to-core communication uses message passing while exploiting 384KB of on-die shared memory. Fine grain power management takes advantage of 8 voltage and 28 frequency islands to allow independent DVFS of cores and mesh. At the nominal 1.1V, cores operate at 1GHz while the 2D-mesh operates at 2GHz. As performance and voltage scales, the processor dissipates between 25W and 125W. The 567 sq-mm processor die is implemented in 45nm Hi-K CMOS and has 1,300,000,000 transistors.

The "cloud" reference is a marketing hook, but note that it uses a 2D mesh interconnect topology (like we discussed in class), contains 1.3 billion transistors with the new Hafnium metal gate (as we discussed in class), and produces up to 125 watts of heat.

The details of this processor were presented at the annual ISSCC meeting in San Francisco, February 2010.

Saturday, May 15, 2010

Emulating Web Traffic in Load Tests

One of the recurring questions in the GCaP class last week was: How can we make web-application load tests more representative of real Internet traffic? The sticking point is that conventional load-test simulators like LoadRunner, JMeter, and httperf represent the load in terms of a finite number of virtual user (or vuser) scripts, whereas the Internet has an indeterminately large number of real users creating load.

GCaP Class Highlights

It was sunny outside but we ended up staying in the shade for better lighting balance. Other grads had to catch earlier flights home by the time this shot was taken.

Some graduates of the May 2010 GCaP class. Photo courtesy Manu M.
Here are some of the interesting topics that popped up in class discussions this week:

How to emulate Internet traffic with load test tools like LoadRunner
Control Groups (cgroups) for fair-share resource allocation in Linux containers (not to be confused with CFS, the Completely Fair Scheduler)
Demo of JXinsight by the tool architect (and now Guerrilla graduate) William Louth
DTrace for Solaris and Mac OS X
Instruments in Mac OS X (Leopard and higher)
The performance and capacity implications of cloud computing
How to get started doing GCaP with VAM: Visualize, Analyze, Modelize
httperf web workload generator

This is why you too should consider attending an upcoming Guerrilla class.

Monday, May 10, 2010

BRL-CAD Benchmark and USL Modeling

MariuszW asked a question in a previous post entitled This is Your Measurements on Models. Since answering it is rather involved, I decided to address it here as a separate blog post. The context of the question concerns the application of my universal scalability model (USL).

"I understand that system characteristic is in α and β and the interpretation is the key. And that there is aggregation. In GCaP [book] (in table 5.1) there are ray tracing benchmark results - what is workload (users, tasks) described in such case? Numbers? Is it Xmax on given processor in this table? - so processor p1 is loaded to measure Imax, next p4 is loaded to measuer Imax? - isn't task or user number important for given p-Imax?"

Using Think Times to Determine Arrival Rates

This question came up at the NorCal CMG meeting last week. Hugh S. asked me: Is there a relationship between the choice of think time (Z) in a load-test client script and the rate at which requests will arrive into the system under test? The answer is yes, and it's easy to understand how by using the preceding blog post about mapping virtual users to real users.
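
As a rough illustration of that relationship, the sketch below uses the standard closed-system form of Little's law, λ = N / (R + Z), where N is the number of vusers and R is the mean response time. The function name and the numbers are assumptions for illustration, not values from the post.

```python
def arrival_rate(num_users, think_time, response_time):
    """Request rate into the SUT for a closed workload: N / (R + Z).

    num_users     -- number of vusers (N)
    think_time    -- mean think time per request (Z), in seconds
    response_time -- mean response time per request (R), in seconds
    """
    cycle_time = response_time + think_time
    if cycle_time <= 0.0:
        raise ValueError("R + Z must be positive")
    return num_users / cycle_time


if __name__ == "__main__":
    # Hypothetical test: 100 vusers, R = 1 s, with two choices of Z.
    print(arrival_rate(100, 9.0, 1.0))   # longer think time, lower rate
    print(arrival_rate(100, 4.0, 1.0))   # shorter think time, higher rate
```

Shortening Z raises the arrival rate for a fixed N, provided R itself does not grow much under the heavier load.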

Mapping Virtual Users to Real Users

In performance engineering scenarios that use commercial load testing tools, e.g., LoadRunner, the question often arises: How many virtual users (vusers) should be exercised in order to simulate some expected number of real users? This is important, more often than not, because the requirement might be to simulate thousands or even tens of thousands of real users, but the stiff licensing fees associated with each vuser (above some small default number) make that cost-prohibitive. As I intend to demonstrate here, we can apply Little's law to map vusers to real users.
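
One way to sketch that mapping is to require that the vusers and the real users drive the same throughput into the system, so that N_real / (R + Z_real) = N_vuser / (R + Z_vuser). This assumes the mean response time R is the same in both cases, which is reasonable only if the load on the system really is equivalent. The function name and example numbers below are hypothetical.

```python
def equivalent_vusers(real_users, real_think_time, response_time,
                      vuser_think_time=0.0):
    """Number of vusers producing the same throughput as the real users.

    Derived from Little's law applied to both populations:
        N_real / (R + Z_real) = N_vuser / (R + Z_vuser)
    Assumes the same mean response time R in both cases.
    """
    real_cycle = response_time + real_think_time
    if real_cycle <= 0.0:
        raise ValueError("R + Z_real must be positive")
    return real_users * (response_time + vuser_think_time) / real_cycle


if __name__ == "__main__":
    # Hypothetical: 1000 real users with Z = 9 s and R = 1 s,
    # emulated by vusers with shorter think times.
    print(equivalent_vusers(1000, 9.0, 1.0, vuser_think_time=0.0))
    print(equivalent_vusers(1000, 9.0, 1.0, vuser_think_time=4.0))
```

The shorter the vuser think time, the fewer vusers are needed to match a given real-user population, which is what makes the licensing cost manageable.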

A commonly used practical approach to ameliorate this circumstance is to run the load test scenarios with zero think time (i.e., Z = 0) in the client scripts on the driver (DVR) side of the test rig. This choice effectively increases the number of active transactions running on the system under test (SUT), which might include app servers and database servers. These two subsystems are usually connected by a local area network, as shown in the following diagram.