Thursday, November 29, 2012

PDQ 6.0 from a Developer Standpoint

This is a guest post by Paul Puglia, who contibuted significant development effort for the PDQ 6.0 release; especially as it relates to interfacing with R. Here, Paul provides more details about the motivation described in my eariler announcement.

PDQ was designed and implemented around a couple of basic assumptions. First, the library would be a C-language API running on some variant of the Unix operating system where we could reasonably assume that we'd be able to link it against a standard C library. Second, programs built using this API would be "stand-alone" executables in the sense that they'd run in their own, dedicated memory address spaces, could route its I/O through the standard streams (stdout or stderr), and had complete control over how error conditions would be handled.

Not surprisingly, the above assumptions drove a set of implementation decisions for the library, namely:

  • All I/O would be pushed through the standard stream library functions like printf and fprintf
  • Memory for internal data structures would be allocated and released through calls to the standard library functions calloc and free
  • Error conditions would result in PDQ causing the model execution to stop with an explicit call to exit().
These aren’t usual decisions for a C API.

With the arrival of PDQ 2.0, we introduced foreign interfaces programming environments (PERL, Python and R) that allowed PDQ to be called from these other environments. All these new foreign interfaces were built and released using the SWIG interface building tool, which allows us to build these interfaces with absolutely no modification to the underlying PDQ code—a major benefit when you’ve got a mature, debugged API that you really want to remain that way. For the most part this arrangement worked pretty well—at least for those environments where it was natural to write and execute PDQ models like standalone C-programs (you can also read this as PERL and Python).

When it came to R, however, our early implementation decisions weren’t such a great fit for how R is commonly used, which is as an interactive environment, similar to programs like Mathematica, Maple, and Matlab. Like these other environments, R users do most of their interaction with a REPL (Read-Execute-Print Loop) usually wrapped in either full-fledged GUI interface or a terminal-like interface called the console.

It turns out that most of PDQ's implementation decisions could (and do) interfere with using R interactively. In particular:

  • Calling the exit() function results in the entire R environment exiting – not a good feature for an interactive environment.
  • Writing directly to the stdout and strerr using fprintf, bypasses R's own internal I/O mechanisms and prevents internal /O functions (like the sink() command) from working properly.
  • Using calloc() and free() functions interfere with R's own internal memory management mechanisms and would prove to be a major impediment for any Windows version of the interface.

Not only do these severely degrade the interactive experience for R users, their use also gets flagged by R’s extension building mechanism when it does a consistency check. And not passing that check would prove a major impediment for getting the PDQ's R interface accepted on CRAN (Comprehensive R Archive Network).

Luckily, none of the fixes for these issues are particularly hard to implement. Most are either fairly simple substitutions of the R API calls for C library routines or/and localized changes to PDQ library. And, while all of this does potentially create a risk of introducing bugs in the PDQ library, the reward for taking that risk is a stable R interface that can be eventually be submitted to CRAN. A version of the PDQ library can be easily built under Windows™ using the Rtools utilities.

Monday, November 12, 2012

PDQ 6.0 is On Its Way

PDQ (Pretty Damn Quick) version 6.0.β is in the QA pipeline. Although this is a major release, cosmetically, things won't look any different when it comes to writing PDQ models. All the big changes have taken place under the hood in order to make PDQ more consistent with the R statistical environment.

R version 2.15.2 (2012-10-26) -- "Trick or Treat"
Copyright (C) 2012 The R Foundation for Statistical Computing
ISBN 3-900051-07-0
Platform: i386-apple-darwin9.8.0/i386 (32-bit)

> library(pdq)
> source("/Users/njg/PDQ/Test Suites/R-Test/mm1.r")
                ****** Pretty Damn Quick REPORT *******
                ***  of : Thu Nov  8 17:42:48 2012  ***
                ***  for: M/M/1 Test                ***
                ***  Ver: PDQ Analyzer 6.0b 041112  ***

The main trick is that the Perl and Python versions of PDQ will remain entirely unchanged while at the same time invisibly incorporating significant changes to accommodate R.

Friday, November 9, 2012

Tuesday, November 6, 2012

Hotsos 2013: Superlinear Scalability

As readers of this blog know, the Universal Scalability Law (USL) is a framework for quantifying performance measurements and extrapolating load-test data. Applied as a statistical regression model, the two USL contention (α) and coherency (β) parameters numerically indicate the degree of sublinear scalability in the data, i.e., how much linear scaling you're losing due to sharing and consistency overheads. Some examples of USL scalability analysis applied to databases, include:

More recently, it was brought to my attention that the USL fails when it comes to modeling superlinear performance (e.g., see this Comments section). Superlinear scalability means you get more throughput than the available capacity would be expected to support. It's even discussed on the Wikipedia (so it must be true, right?). Nice stuff, if you can get it. But it also smacks of an effect like perpetual motion.

Every so often, you see a news report about someone discovering (again) how to beat the law of conservation of energy. They will swear up and down that it works and it will be accompanied by a contraption that proves it works. Seeing is believing, after all. The hard part is not whether to believe their claim, it's debugging their contraption to find the mistake that has led them to the wrong conclusion.

Similarly with superlinearity. Some data are just plain spurious. In other cases, however, certain superlinear measurements do appear to be correct, in that they are repeatable and not easily explained away. In that case, it was assumed that the USL needed to be corrected to accommodate superlinearity by introducing a third modeling parameter. This is bad news for many reasons, but primarily because it would weaken the universality of the universal scalability law.

To my great surprise, however, I eventually discovered that the USL can accommodate superlinear data without any modification to the equation. As an unexpected benefit, the USL also warns you that you're modeling an unphysical effect: like a perpetual-motion detector. A corollary of this new analysis is the existence of a payback penalty for incurring superlinear scalability. You can think of this as a mathematical statement of the old adage: If it looks too good to be true, it probably is.

I'll demonstrate this remarkable result with examples in my Hotsos presentation.