Tuesday, November 24, 2009

GCaP Class: Sawzall Optimum

In a side discussion during last week's class, Greg S., now a Guerrilla alumnus (who worked at Google a few years ago), informed me that Sawzall preprocessing-setup times typically lie in the range from around 500 ms to about 10 seconds, depending on such factors as cluster location, GFS chunkserver hit rate, borglet affinity hits, etc. This is the information that was missing from the original Google paper, and its absence prevented me from finding the optimal machine configuration in my previous post.

To see how these new numbers can be applied to estimating the corresponding optimal configuration of Sawzall machines, let's take the worst-case estimate of 10 seconds for the preprocessing time. First, we convert 10 s to 10/60 = 0.1666667 min (the units of the original plot) and draw that constant as the horizontal line (gray) in the lower part of the figure at left (click to enlarge). Next, we extend the PDQ elapsed-time model (blue curve) until it intersects the horizontal line. That point is the optimum, as I explained in class, and it occurs at p = 18,600 machines (vertical line).

That's more than thirty times the number of machines reported in the original Google paper—those data points appear on the left side of the plot. Because of the huge scale involved, it is difficult to see the actual intersection, so the figure on the right shows a zoomed-in view of the encircled area. Increasing the number of parallel machines beyond the vertical line means that the elapsed time curve (blue) goes into the region below the horizontal line. The horizontal line represents the fixed preprocessing time, so it becomes the system bottleneck as the degree of parallelism is increased. Since the elapsed times in that region would always be less than the bottleneck service time, they can never be realized. Therefore, adding more parallel machines will not improve response time performance.

Conversely, a shorter preprocessing time of 500 ms (i.e., a shorter bottleneck service time) should permit a higher degree of parallelism.
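The intersection argument can be sketched in a few lines of Python. This is only an illustration, not the PDQ model from the class: it assumes ideal scaling, T(p) = T1/p, and the single-machine time `T1_SECONDS` is a hypothetical value chosen so that the sketch reproduces p = 18,600 for the 10-second case. It works in seconds rather than the minutes used in the plot, and the names `find_optimum` and `ideal_elapsed` are made up for this example.

```python
from typing import Callable


def find_optimum(elapsed: Callable[[int], float], floor: float,
                 p_max: int = 10_000_000) -> int:
    """Return the smallest machine count p with elapsed(p) <= floor.

    Assumes elapsed(p) decreases monotonically as p grows, so a
    binary search over integer p is enough.
    """
    if elapsed(p_max) > floor:
        raise ValueError("elapsed time never reaches the floor within p_max")
    lo, hi = 1, p_max
    while lo < hi:
        mid = (lo + hi) // 2
        if elapsed(mid) <= floor:
            hi = mid          # curve is already at or below the floor
        else:
            lo = mid + 1      # still above the floor, need more machines
    return lo


# ASSUMPTION: hypothetical single-machine elapsed time, back-fitted so the
# 10-second case lands at 18,600 machines. It is not a measured value.
T1_SECONDS = 186_000.0


def ideal_elapsed(p: int) -> float:
    """Idealized elapsed time on p machines (perfect parallel speedup)."""
    return T1_SECONDS / p


p_worst = find_optimum(ideal_elapsed, floor=10.0)   # 10 s preprocessing
p_best = find_optimum(ideal_elapsed, floor=0.5)     # 500 ms preprocessing
print(p_worst, p_best)
```

Under this idealized curve the 500 ms floor is reached at twenty times as many machines as the 10 s floor. Any other monotonically decreasing elapsed-time function, such as one fitted with PDQ, can be passed to `find_optimum` in place of `ideal_elapsed`.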

Friday, November 20, 2009

GCaP Class: Odds and Sods

Some interesting side discussions came up in class:

  • Cat Brain: IBM Almaden announced a supercomputer brain simulation (called C2) in which the number of simulated neurons and synapses exceeds those in a cat brain† (from Josh B)
  • MARS: A MapReduce Framework on Graphics Processors (from Josh B)
  • MapReduce ported to R (from Josh B)
  • Tsung: FOSS distributed load-testing harness written in Erlang (from Greg S)
  • GPUs are the new CPUs (from NJG)
  • SmokePing: Another Tobi tool (from Josh B)

Today, we wrap up and Guerrilla Level-2 certificates will be awarded to all attendees (as proof of purchase).

† The claim has since elicited a clawing missive from researchers working on EPFL's BlueBrain project. Quote from the project leader: "[it's] not even close to an ant's brain." (from GCaP 2008 alumnus, Stefan P.)

Wednesday, November 18, 2009

GCaP Class: 3-Tier Queueing Model

Guerrilla Capacity Planning class attendee, Greg S., raised an interesting question during the section where I present a PDQ model of a 3-tier client/server system. The assumptions used to develop the model are summarized on this slide:

In the baseline configuration there are 125 desktop clients generating 3 types of database transactions corresponding to 3 different workload classes, or streams in PDQ parlance. This could be represented as either:

  1. 125 × 3 workload streams or
  2. 1 × 3 streams with 124 × 3 aggregated streams.

The respective arrival rates for the second case look like this:
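As a rough illustration only, the two representations can be compared in Python. The per-client rates below are hypothetical placeholders, not the values from the class model, and the stream names are invented for this sketch. It reads option 2 as three streams for one individually tracked client plus three aggregate streams that each carry the load of the other 124 clients.

```python
N_CLIENTS = 125

# HYPOTHETICAL per-client arrival rates (transactions per second) for the
# three workload classes; placeholders, not the values used in class.
PER_CLIENT_RATE = {"txn_1": 0.010, "txn_2": 0.020, "txn_3": 0.005}


def explicit_streams(n_clients, rates):
    """Option 1: one stream per client per workload class."""
    return {
        f"client{c}_{name}": rate
        for c in range(1, n_clients + 1)
        for name, rate in rates.items()
    }


def aggregated_streams(n_clients, rates):
    """Option 2: one tracked client plus an aggregate of the rest."""
    streams = {}
    for name, rate in rates.items():
        streams[name] = rate                                   # tracked client
        streams[f"{name}_aggregate"] = (n_clients - 1) * rate  # other clients
    return streams


full = explicit_streams(N_CLIENTS, PER_CLIENT_RATE)
compact = aggregated_streams(N_CLIENTS, PER_CLIENT_RATE)
print(len(full), len(compact))
```

Both dictionaries carry the same total arrival rate, but the aggregated form needs only 6 streams instead of 375, which keeps the model much smaller.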

Friday, November 13, 2009

Scalability of Sawzall, MapReduce and Hadoop

This is a follow-up to a reader comment by Paul P. on my previous post about MapReduce and Hadoop. Specifically, Paul pointed me at the 2005 Google paper entitled "Parallel Analysis with Sawzall," which states:

"The set of aggregations is limited but the query phase can involve more general computations, which we express in a new interpreted, procedural programming language called Sawzall"

Not related to the portable reciprocating power saw manufactured by the Milwaukee Electric Tool Corporation.

More important, for our purposes, is Section 12 Performance. It includes the following plot, which tells us something about Sawzall scalability, but not everything.

Figure 1.

Tuesday, November 10, 2009

EU Queries MySQL in Sun-Oracle Merger

The European Union's statement of objections expresses concerns that businesses might have fewer choices and see higher prices if Oracle (already the world's largest proprietary database vendor) ends up with MySQL by default.

In case you're getting a bit confused by all these fish eating each other, the Wikipedia entry for MySQL reminds us:
The project has made its source code available under the terms of the GNU General Public License, as well as under a variety of proprietary agreements. MySQL is owned and sponsored by a single for-profit firm, the Swedish company MySQL AB, now a subsidiary of Sun Microsystems. As of 2009 Oracle Corporation began the process of acquiring Sun Microsystems; Oracle holds the copyright to most of the MySQL codebase.
Oracle Corp. has stated that the commission's objection "reveals a profound misunderstanding of both database competition and open source dynamics," but some FOSS developers have a different take on that.

Monday, November 9, 2009

Last 2009 Guerrilla Class Next Week

Good news! You can still pile into the last Guerrilla Capacity Planning class for 2009 at the Early Bird rate. Since this class will be professionally videotaped for later distribution on the web, the more the merrier. It's also your chance to be digitally immortalized.

[Image: Entrance of the Larkspur Landing hotel, Pleasanton, California]

As usual, it will be held at our lovely Larkspur Landing location. Click on the image for booking information.

Registered attendees, please bring your laptops, as course materials will be provided only on CD or flash drive this time. We will be distributing free notepads so you can also take handwritten notes. The venue also has free Wi-Fi.

Tuesday, November 3, 2009

Len Kleinrock Reflects on Booting The Inter-(ARPA)-net

Len Kleinrock (Mr. Queueing Theory) discussed his role in the innovation of packet switching for the ARPANET on NPR last week.
Forty years ago this week, the first information was transmitted across the ARPANET, a test message routed from UCLA to the Stanford Research Institute. Though the message sent on the evening of Wednesday, Oct. 29, 1969 was incomplete—the system crashed after the 'L' and 'O' of 'LOGIN' were transmitted to SRI—that packet-switched transmission became the basis of much of our modern era of communications. In this segment, Ira talks with internet pioneer Leonard Kleinrock about that first transmission and what networked computing has become.
Here's the podcast (mp3).

Lucasian Litotes

Well, it wasn't a woman (Gee! I'm shocked), although it could have been but the committee wimped out. And it won't be long. At 63, Mike Green is the oldest appointment yet, and if the retirement rules are applied consistently (which they haven't always been), he only gets four years in the prestigious Chair once held by such luminaries as Newton (at 26) and Dirac (at 30). Conventional wisdom has it that theoreticians are past their "sell-by" date in their late twenties, but Newton didn't write The Principia until his mid forties and Hawking is still publishing in his sixties.