Sunday, May 1, 2011

Fundamental Performance Metrics

Baron Schwartz invited me to comment on his latest blog post entitled "The four fundamental performance metrics," which I did. Coincidentally, I happened to address this same topic in my presentation at CMG Atlanta last week. As the following slide shows, I claim there are really only 3 fundamental performance metrics (actually 2, if you want to get truly fundamental about it).

Here are the rest of the relevant slides from that talk, which I hope you will be able to compare with Baron's perspective.

Baron seems to be getting his cue from certain Linux kernel metrics. However, I would strongly caution against using metrics defined by Linux kernel hackers (or any other hackers, for that matter). I've seen too many examples of bastardized nomenclature, not to mention broken code that incorrectly computes said metrics. Moreover, I see no virtue in reinventing the wheel only to end up with an irregular polygon.

With that in mind, what I think Baron is trying to elucidate is that just a small number of simple performance metrics (4?) is all that is required to parameterize certain performance models, such as the USL scalability model. In other words, performance modeling may not be as far away from your already collected performance data as you might think, and therefore there is little excuse not to do it. Amen to that!
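As a rough illustration of how little such a model needs, here is a minimal Python sketch of the USL in its commonly published two-parameter form, C(N) = N / (1 + sigma (N - 1) + kappa N (N - 1)). The function name and the sample parameter values are hypothetical; in practice sigma (contention) and kappa (coherency) would be fitted to measured throughput data rather than chosen by hand.

```python
def usl_capacity(n, sigma, kappa):
    """Relative capacity C(N) under the Universal Scalability Law.

    n     -- load or number of processors (N >= 1)
    sigma -- contention parameter (assumed value, normally fitted)
    kappa -- coherency parameter (assumed value, normally fitted)
    """
    return n / (1.0 + sigma * (n - 1) + kappa * n * (n - 1))


if __name__ == "__main__":
    # Hypothetical parameter values, for illustration only.
    for n in (1, 8, 32, 64):
        print(n, round(usl_capacity(n, sigma=0.05, kappa=0.001), 2))
```

With both parameters set to zero the function reduces to linear scaling, C(N) = N, which is a quick sanity check on any fitted model.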

I'll be discussing this topic again in my Guerrilla Boot Camp class, later this week.


Baron said...

I've updated my blog post to correct the things you pointed out. I still haven't found a good short phrase for what I call the weighted time. Caution about Linux kernel docs duly noted :-)

ks said...

Hi Neil,

In I asked a question which I think is more appropriate here.

If you cannot capture the arrival and completion time for every message flowing through a complex system like IoT solution; what are the essential metrics that you have to capture?

For example, say that millions of devices are writing to one IoT Hub (or Kafka cluster), and then 20 VMs are simultaneously processing messages from 20 partitions (devices are partitioned by device id hash). In this situation, IoT Hub (or Kafka cluster) is W(aiting) time, and Message Processor code is S(ervicing) time (

In the Message Processor running on each VM, what should one capture for time interval T (say 15 seconds)?
1. arrival messages
2. average latency of message in IoT Hub (i.e. average Waiting time): This can be calculated from timestamp in message when retrieved.
3. average latency to "process"/"complete" in Message Processor (i.e. average Service time)

Each processor on each VM would then send this telemetry to a central server.

Is there anything else one should capture? As suggested by

Neil Gunther said...

You raise a lot of sub-topics here.

The above post is intended to underscore the idea that all performance metrics can be broken down into these 3 fundamental metric types (time, rate, or number). Conversely, other derived metrics can be built up from them.

To address the questions about what to measure in an IoT application, I'd suggest drawing a functional block diagram, or diagrams, along the lines suggested in this post on PDQ. The goal is to fill in all the boxes you draw with times (either known or unknown). If there are boxes that cannot be assigned a time, then either the box is wrong or something needs to be measured that isn't. Educated guesses are also completely fine. Once you have the boxes all filled in, it's a fairly straightforward matter to generate the corresponding PDQ code. The PDQ model will soon let you know if your proposed processing times are consistent or not; that's one of its main values. And when I say "consistent", that includes a validation of the measurement instrumentation not being broken or misleading.

In cases where the time cannot be directly measured, it may be inferred or derived from Little's Law or other metric relationships.
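As a minimal sketch of that kind of inference, the Python below takes the three per-interval measurements listed in the question (arrival count, average waiting time, average service time) and derives the remaining quantities from Little's Law, N = X * R, and the utilization law, rho = X * S. The function name, dictionary keys, and sample numbers are hypothetical, and the sketch assumes the interval is long enough for the averages to be treated as steady state.

```python
def derive_metrics(arrivals, interval_s, avg_wait_s, avg_service_s):
    """Derive rate and number metrics from counts and times for one interval."""
    rate = arrivals / interval_s              # X: arrival rate (msgs/s)
    residence = avg_wait_s + avg_service_s    # R = W + S (seconds)
    return {
        "arrival_rate": rate,
        "residence_time": residence,
        "in_system": rate * residence,        # Little's Law: N = X * R
        "waiting": rate * avg_wait_s,         # Little's Law on the queue: Q = X * W
        "utilization": rate * avg_service_s,  # utilization law: rho = X * S
    }


if __name__ == "__main__":
    # Hypothetical numbers: 3000 messages in a 15 s interval,
    # 40 ms average wait in the hub, 4 ms average processing time.
    for name, value in derive_metrics(3000, 15.0, 0.040, 0.004).items():
        print(f"{name}: {value:.3f}")
```

The same relationships can be run in reverse: if the number in the system and the arrival rate are measured, the residence time follows as R = N / X, even when no per-message timestamps are available.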

ks said...

Thanks very much! I am diving in to these and the other great resources you have including your books to learn more.

Neil Gunther said...

Cool, but you really should come to my GCAP or PDQ Guerrilla class and learn in DAYS what will otherwise take you months (or longer).