Tuesday, December 1, 2009

Guerrilla Capacity Planning: The Movie

For those of you who weren't able to attend the recent Guerrilla Capacity Planning training live in California, here's a small sampler of what you missed (not shown are the high-quality lunches we provide):

Guerrilla Capacity Planning class Nov 2009
Instructor: Dr. Neil Gunther

Thomas Crosman did an outstanding job of getting the entire 5-day class (that's more than 30 hours!) recorded as digital bits, all on very short notice, I might add. This ain't no YouTube vid. The plan is to make this GCaP class available online. Stay tuned to this blog for announcements about when it will appear at a theater near you.


But there's nothing like live! So, the 2010 training schedule has now been posted. The dates are tentative until we finalize the contracts with the hotel, but you may as well start harassing your management to cut that P/O now. :-)

Oh! And if they need a little extra convincing, they can check out the testimonials.

Season's Greetings!

Tuesday, November 24, 2009

GCaP Class: Sawzall Optimum

In a side discussion during last week's class, Greg S., now a Guerrilla alumnus (who used to work at Google a few years ago), informed me that Sawzall preprocessing-setup times typically lie in the range from around 500 ms to about 10 seconds, depending on such factors as cluster location, GFS chunkserver hit rate, borglet affinity hits, etc. This is the information that was missing from the original Google paper, and its absence prevented me from finding the optimal machine configuration in my previous post.

To see how these new numbers can be applied to estimating the corresponding optimal configuration of Sawzall machines, let's take the worst-case estimate of 10 seconds for the preprocessing time. First, we convert 10 s to 10/60 = 0.1666667 min (original units) and plot that constant as the horizontal line (gray) in the lower part of the figure at left (click to enlarge). Next, we extend the PDQ elapsed-time model (blue curve) until it intersects the horizontal line. That point is the optimum, as I explained in class, and it occurs at p = 18,600 machines (vertical line).

That's more than thirty times the number of machines reported in the original Google paper—those data points appear on the left side of the plot. Because of the huge scale involved, it is difficult to see the actual intersection, so the figure on the right shows a zoomed-in view of the encircled area. Increasing the number of parallel machines beyond the vertical line means that the elapsed time curve (blue) goes into the region below the horizontal line. The horizontal line represents the fixed preprocessing time, so it becomes the system bottleneck as the degree of parallelism is increased. Since the elapsed times in that region would always be less than the bottleneck service time, they can never be realized. Therefore, adding more parallel machines will not improve response time performance.

Conversely, a shorter preprocessing time of 500 ms (i.e., a shorter bottleneck service time) should permit a higher degree of parallelism.
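The intersection argument can be sketched numerically. The snippet below is only an illustration under a loud assumption: it replaces the PDQ elapsed-time model with the ideal curve T(p) = T1 / p, which is not the model behind the plot. The function name `crossover_machines` and the value `T1_MINUTES = 3100` are hypothetical; the latter was picked solely so that the ideal curve crosses the 10-second floor at p = 18,600. Under that assumption, the optimum scales inversely with the preprocessing time.

```python
def crossover_machines(t1_minutes, preprocessing_seconds):
    """Return p where the ideal curve T(p) = T1 / p meets the preprocessing floor.

    ASSUMPTION: ideal 1/p scaling stands in for the PDQ elapsed-time model.
    """
    if t1_minutes <= 0 or preprocessing_seconds <= 0:
        raise ValueError("times must be positive")
    # T1 / p = t_pre  =>  p = T1 / t_pre, with t_pre converted from seconds to minutes
    return t1_minutes * 60.0 / preprocessing_seconds


# HYPOTHETICAL single-machine elapsed time (minutes), not taken from the Google paper
T1_MINUTES = 3100.0

p_worst = crossover_machines(T1_MINUTES, 10.0)  # 10 s preprocessing floor
p_best = crossover_machines(T1_MINUTES, 0.5)    # 500 ms preprocessing floor

print("p at 10 s floor:  ", p_worst)
print("p at 500 ms floor:", p_best)
print("ratio:            ", p_best / p_worst)
```

With ideal scaling, shrinking the floor from 10 s to 500 ms raises the crossover by the same factor of 20. A real elapsed-time curve that flattens at large p would put the crossover elsewhere, so the numbers here indicate only the direction of the effect.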

Friday, November 20, 2009

GCaP Class: Odds and Sods

Some interesting side discussions came up in class:

  • Cat Brain: IBM Almaden announced a supercomputer brain simulation (called C2) in which the number of simulated neurons and synapses exceeds that in a cat brain† (from Josh B)
  • MARS: A MapReduce Framework on Graphics Processors (from Josh B)
  • MapReduce ported to R (from Josh B)
  • Tsung: FOSS distributed load-testing harness written in Erlang (from Greg S)
  • GPUs are the new CPUs (from NJG)
  • SmokePing: Another Tobi tool (from Josh B)

Today, we wrap up and Guerrilla Level-2 certificates will be awarded to all attendees (as proof of purchase).


† The claim has since elicited a clawing missive from researchers working on EPFL's BlueBrain project. Quote from the project leader: "[it's] not even close to an ant's brain." (from GCaP 2008 alumnus, Stefan P.)

Wednesday, November 18, 2009

GCaP Class: 3-Tier Queueing Model

Guerrilla Capacity Planning class attendee, Greg S., raised an interesting question during the section where I present a PDQ model of a 3-tier client/server system. The assumptions used to develop the model are summarized on this slide:


In the baseline configuration there are 125 desktop clients generating 3 types of database transactions corresponding to 3 different workload classes, or streams in PDQ parlance. This could be represented as either:

  1. 125 × 3 workload streams or
  2. 1 × 3 streams with 124 × 3 aggregated streams.
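As a rough sketch of the difference between the two representations, the snippet below builds both sets of streams as plain Python dictionaries. Only the counts (125 clients, 3 classes) come from the text. The class names, the per-client rates, and the reading of option 2 as three single-client streams plus three aggregate streams carrying 124 times the per-client rate are all assumptions made for illustration, and no PDQ calls are used.

```python
import math

CLIENTS = 125  # baseline configuration from the text

# HYPOTHETICAL per-client arrival rates (transactions per second) for the 3 classes
PER_CLIENT_RATE = {"class_a": 0.004, "class_b": 0.002, "class_c": 0.001}


def explicit_streams(rates, clients):
    """Option 1: one stream per client per class (clients x 3 streams)."""
    return {
        f"{name}_client{c}": rate
        for c in range(1, clients + 1)
        for name, rate in rates.items()
    }


def aggregated_streams(rates, clients):
    """Option 2 (assumed reading): 3 streams for one client, plus 3 aggregate
    streams that lump the remaining clients - 1 clients together."""
    streams = {}
    for name, rate in rates.items():
        streams[name] = rate                                 # the single client
        streams[name + "_aggregate"] = (clients - 1) * rate  # the other 124
    return streams


option_1 = explicit_streams(PER_CLIENT_RATE, CLIENTS)
option_2 = aggregated_streams(PER_CLIENT_RATE, CLIENTS)

print(len(option_1), "streams in option 1")
print(len(option_2), "streams in option 2")
# Both representations offer the same total load to the servers
print(math.isclose(sum(option_1.values()), sum(option_2.values())))
```

The aggregated form keeps the total arrival rate unchanged while cutting the number of streams the model has to carry, which is presumably the attraction of option 2.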

The respective arrival rates for the second case look like this:

Friday, November 13, 2009

Scalability of Sawzall, MapReduce and Hadoop

This is a follow-up to a reader comment by Paul P. on my previous post about MapReduce and Hadoop. Specifically, Paul pointed me at the 2005 Google paper entitled "Parallel Analysis with Sawzall," which states:

"The set of aggregations is limited but the query phase can involve more general computations, which we express in a new interpreted, procedural programming language called Sawzall"

Not related to the portable reciprocating power saw manufactured by the Milwaukee Electric Tool Corporation.

More important, for our purposes, is Section 12 Performance. It includes the following plot, which tells us something about Sawzall scalability, but not everything.

Figure 1.

Tuesday, November 10, 2009

EU Queries MySQL in Sun-Oracle Merger

The European Union's statement of objections expresses concerns that businesses might have fewer choices and see higher prices if Oracle (already the world's largest proprietary database vendor) ends up with MySQL by default.

In case you're getting a bit confused by all these fish eating each other, the Wikipedia entry for MySQL reminds us:
The project has made its source code available under the terms of the GNU General Public License, as well as under a variety of proprietary agreements. MySQL is owned and sponsored by a single for-profit firm, the Swedish company MySQL AB, now a subsidiary of Sun Microsystems. As of 2009 Oracle Corporation began the process of acquiring Sun Microsystems; Oracle holds the copyright to most of the MySQL codebase.
Oracle Corp. has stated that the commission's objection "reveals a profound misunderstanding of both database competition and open source dynamics," but some FOSS developers have a different take on that.

Monday, November 9, 2009

Last 2009 Guerrilla Class Next Week

Good news! You can still pile into the last Guerrilla Capacity Planning class for 2009 at the Early Bird rate. Since this class will be professionally videotaped for later distribution on the web, the more the merrier. It's also your chance to be digitally immortalized.

Entrance Larkspur Landing hotel Pleasanton California

As usual, it will be held at our lovely Larkspur Landing location. Click on the image for booking information.

Registered attendees, please bring your laptops: this time, course materials will be provided only on CD or flash drive. We will be distributing free notepads so you can also take hand-written notes. The venue also has free Wi-Fi internet access.