Thursday, October 15, 2009

Hadoop, MAA, ML, MR and Performance Data

Over the past few months, I've been attending a series of talks on machine learning (ML), sponsored by ACM.org at the NASA Ames Research Center, with an eye to applying such things to gobs of computer performance data. Two pieces of technology that kept cropping up were Google MapReduce and Apache Hadoop.

Hadoop is a FOSS version of MapReduce (MR). Although these are really data-flow applications, the terms Map and Reduce originate with Lisp and functional programming.
"Without understanding functional programming, you can't invent MapReduce, the algorithm that makes Google so massively scalable."JoelOnSoftware (August, 2006)
To grok MR, you need to understand the difference between procedural programming and functional programming. I'm actually more familiar with: Map and Fold in Mathematica (MMA) or the parallel versions: ParallelMap and ParallelCombine in the newer MMA 7, which are even closer to MR and Hadoop.

At a higher level:
"Google App Engine (GAE) is focused on making development easy, but limits your options. Amazon Web Services is focused on making development flexible, but complicates the development process. Real enterprise applications require both of these paradigms to achieve success, according to a debate at the JavaOne conference this week." (June 2009)

The functional part is not that difficult to understand. Just think of everything (including a '+' operator) as a function or procedure with args as inputs and the return as output. The main difference from procedural languages, like C or Java, is that the output of one function can be the input to another function: a LISP-ism.

Here's an example in Mathematica. Typing:
In[1]:= Times[Plus[a, b], c]

into the MMA interpreter, produces:
Out[1]:= (a + b) c

which is what a mathematician would've written in the first place (the multiplication operator being implicit in math notation). If 'a' and 'b' are given numerical values, MMA would produce a single number as the evaluation output, since it can compute both numbers or symbols. From a programming standpoint, the 'a' and 'b' are used as args into the function Plus and it's output (together with the input 'c') is an input into the function Times.

This example is pedestrian but it generalizes to some very cool and powerful algorithmic constructs that can be written with relatively little code. For example:
  1. A Quine:
    Print[# 1,FromCharacterCode[91], #1,FromCharacterCode[93]]&[Print[#1,FromCharacterCode[91],
    #1,FromCharacterCode[93]]&]

  2. A sequence generator using recursion:
    Nest[Join[#, ReplacePart[#, Length[#] -> Last[#] + 1]] &, {0, 1}, 5]

    produces the Binary Carry sequence: 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0, 4, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, ...

Regarding how Hadoop might be applied to performance data, here's a video of one of the ACM talks explaining how Google.com uses MR in their AdSense group (where a lot of the action is). The MR reference appears @ 00:37:37. This presentation also provides a quick overview of biz analytics and data mining, which can be hard to find, otherwise.

[This blog entry is also cross-posted at Posterous]

2 comments:

Unknown said...

You should also check out Pig (see http://wiki.apache.org/pig/ & http://hadoop.apache.org/pig/about.html). It's intended to be used like google's sawzall(http://labs.google.com/papers/sawzall.html). I have several friends (who all work for companies who have been recipients of our government's largesse
) that using Pig to mine application logs as part of an effort to track usage. An I'm also know that it'sbeing used as a critical component of a system that gathers resource demand data off of a very large server cluster of VMware hosts as part of a system to track and rebalance guest vm workload.

Neil Gunther said...

Some good pointers here. Thanks.

I see some interesting performance data in the google-sawzall paper that jives with the discussion in Chap. 7 (section 7.4.2) of my Perl::PDQ book. Might do a blog post about it.