Wednesday, March 14, 2018

WTF is Modeling, Anyway?

A conversation with performance and capacity management veteran Boris Zibitsker, on his BEZnext channel, about how to save multiple millions of dollars with a one-line performance model (at 21:50 minutes into the video) that has less than 5% error. I wish my PDQ models were that good. :/

The strength of the model turns out to be its explanatory power, rather than prediction, per se. However, with the correct explanation of the performance problem in hand (which also proved that all other guesses were wrong), this model correctly predicted a threefold reduction in application response time for essentially no cost. Modeling doesn't get much better than this.


  1. According to Computerworld in 1999, a 32-node IBM SP2 cost $2 million to lease over 3 years. This SP2 cluster was about 6 times bigger.
  2. Because of my vain attempt to suppress details (in the interests of video length), Boris gets confused about the kind of files that are causing the performance problem (near 26:30 minutes). They're not regular data files and they're not executable files. The executable is already running but sometimes waits—for a long time. The question is, waits for what? They are, in fact, special font files that are requested by the X-windows application (the client, in X parlance). These remote files may also get cached, so it's complicated. In my GCAP class, I have more time to go into this level of detail. Despite all these potential complications, my 'log model' accurately predicts the mean application launch time.
  3. Question for the astute viewer. Since these geophysics applications were all developed in-house, how come the developers never saw the performance problems before the applications ever went into production?
  4. Some ppl have asked why there's no video of me. This was the first time Boris had recorded video of a Skype session and he pushed the wrong button (or something). It's prolly better this way. :P

Wednesday, February 21, 2018

CPU Idle Is Not Like White Space

This post seems like it ought to be too trite to write, but I keep seeing the following performance gotcha crop up over and over again.

Under pressure to consolidate resources, usually driven by management and especially where processor capacity is concerned, there is often an urge to "use up" any idle processor cycles. Idle processor capacity tends to be viewed like whitespace on a written page: just begging to be filled up.

The logical equivalent of filling up the "whitespace" is absorbing the idle processor capacity by migrating applications from other servers onto the underutilized one, and then turning those vacated servers off or using them for something else.

The blind spot, however, is that idle processor capacity is not like whitespace, and the rush to absorb it is likely to have unintended consequences. The reason is that performance metrics are neither isolated nor independent of one another. Most metrics are:

  1. related to one another due to interdependencies between various computing subsystems
  2. related in a nonlinear way: a small change in one metric can cause a large change in another

The first couple of days of my Guerrilla Capacity Planning course lay out a consistent framework that demonstrates how all the familiar performance metrics are related.
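
As a small taste of what "related" means here, the following Python sketch shows two textbook operational laws, the utilization law and Little's law, that tie the familiar metrics together. It is just an illustration with made-up numbers, not an excerpt from the course or from PDQ.

    # Two textbook operational laws that couple the familiar performance metrics.
    #   Utilization law: U = X * S   (throughput times service demand)
    #   Little's law:    N = X * R   (throughput times response time)
    # All numbers below are purely illustrative.

    X = 50.0    # throughput: 50 requests per second
    S = 0.012   # service demand: 12 ms of CPU per request
    R = 0.030   # measured response time: 30 ms per request

    U = X * S   # utilization law -> 0.60, i.e., the processor is 60% busy
    N = X * R   # Little's law    -> 1.5 requests in the system, on average

    print(f"utilization U = {U:.2f}, queue length N = {N:.2f}")

You can't change the throughput without changing the utilization, and you can't change the response time without changing the queue length. The metrics move together.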

For example, application response time depends nonlinearly on processor utilization. In the above case, it may have been forgotten that the processor utilization must be kept low in order for the application to meet its response-time SLA. A lot of idle processor cycles can look like unused capacity only because there is no obvious warning sign that low CPU utilization is a necessary condition for correct application performance.
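
To put rough numbers on that nonlinearity, here's a back-of-the-envelope sketch that treats the application as a single open queue with R = S / (1 - U). That's a deliberate simplification of any real system, and the 10 ms service time is an assumed value, not a measurement.

    # Response time for a simple open queue: R = S / (1 - U)
    #   S = service time per request, U = processor utilization (0 <= U < 1)
    # Simplified on purpose; the service time is an assumed value.

    def response_time(S, U):
        if not 0.0 <= U < 1.0:
            raise ValueError("utilization must lie in [0, 1)")
        return S / (1.0 - U)

    S = 0.010  # assume 10 ms of CPU per request

    # Before consolidation: the host is 40% busy, with plenty of idle "whitespace".
    print(f"R at 40% busy: {response_time(S, 0.40) * 1000:.1f} ms")  # about 16.7 ms

    # After migrating another workload onto the same host, it is 80% busy.
    print(f"R at 80% busy: {response_time(S, 0.80) * 1000:.1f} ms")  # about 50.0 ms

Doubling the utilization from 40% to 80% triples the response time in this toy model. The CPU still shows 20% idle, but the response-time SLA may already be blown.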

Even on a printed page, whitespace is not usually an invitation to start scribbling on it. Similarly, a notification equivalent to "This page intentionally left blank" would be a useful reminder in the context of potential application migration.

Of course, any page that says "This page left blank" isn't blank, but that's a topic for a different discussion. :)