I can't say I completely follow the latest per-entity load proposal, but I would suggest that change for the sake of change is not usually a good thing. And when it comes to metrics like load average, there are plenty of automated tools, as well as sysadmins and cap planners, that rely on it as an historical measure of server run-queue length for trend analysis.
Without having thought about it too deeply, and if I were actually involved with Linux kernel development (which I'm not), my response would probably go something like this:
Dear Linux kernel devs (or someone named Linus), if you are going to make changes to things like the load average metric, knock yourself out. But please make it a new metric collector that is separately accessible from the existing metrics. That way, historical performance data and CaP models will not get screwed by your latest (possibly evanescent) brainwave.
David's observation should also serve as a warning to all performance analysts and cap planners. It means that performance metrics can change underneath you. That's a very disturbing thought, which most of us remain blissfully unaware of. And even if you are aware of it, what's a body to do?
As I describe in Chap. 6 of my Perl::PDQ book, the load average is historically the first instantiation of performance instrumentation. Starting circa 1965, and continuing through the various releases of AT&T Unix and beyond, it must have been an extremely novel idea when it was introduced. But, because it's a metric defined in software, it's not guaranteed to be immutable. There are documented cases where the load average and other performance metrics have been broken by dint of some kernel dev's bright idea. I discuss some of them in the GCaP class.
Therefore, we performance analysts and cap planners need to independently verify that the metrics we collect (such as load average) have valid definitions and that they remain consistent over time. The most efficient way to do that is by regression testing the performance tools that we rely on. How to do that is a topic that I present in the GDAT class.
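To make the idea concrete, here is a minimal sketch of the kind of regression check I have in mind: validate that the load-average string a collector scrapes (the documented `/proc/loadavg` format on Linux) still has the shape and value ranges you baselined, so a silent change in the metric's definition or format gets flagged rather than quietly polluting your trend data. The rule set here is illustrative, not a complete test suite.

```python
import re

# Documented /proc/loadavg format: three damped averages, then
# running/total task counts, then the most recent PID.
LOADAVG_RE = re.compile(
    r"^(\d+\.\d{2}) (\d+\.\d{2}) (\d+\.\d{2}) (\d+)/(\d+) (\d+)$"
)

def check_loadavg(sample: str) -> list[float]:
    """Parse a /proc/loadavg line and apply sanity rules.

    Raises AssertionError if the metric's shape drifts from the
    documented format -- the kind of silent kernel-side change a
    regression suite should catch before it corrupts historical data.
    """
    m = LOADAVG_RE.match(sample.strip())
    assert m, f"unexpected /proc/loadavg format: {sample!r}"
    one, five, fifteen = (float(m.group(i)) for i in (1, 2, 3))
    running, total = int(m.group(4)), int(m.group(5))
    assert all(v >= 0.0 for v in (one, five, fifteen))
    assert 0 <= running <= total, "running tasks cannot exceed total"
    return [one, five, fifteen]

# A sample in the documented format passes and yields the three averages:
print(check_loadavg("0.42 0.30 0.25 1/123 4567"))
```

Run against each collection cycle (or at least each kernel upgrade), a check like this costs almost nothing and turns "the metric changed underneath us" from a forensic discovery into a test failure.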
The Linux load average already includes "IO," since processes blocked on disk are counted. This is unlike the original definition and other Unix implementations, which only count processes runnable on (or waiting for) the CPU. It means that disk-intensive Linux systems show a much higher load average, even when their CPU is idle. It's just a broken definition; I found this and traced it all the way into the kernel code back in 2007.
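The effect is easy to see by iterating the exponentially damped moving average that the kernel applies at each sampling tick. The sketch below is a simplified simulation (constant task counts, idealized 5-second ticks, 1-minute damping window), not the kernel's actual code path: one accumulator counts runnable plus disk-blocked (uninterruptible-sleep) tasks, Linux-style, while the other counts runnable tasks only, classic-Unix-style.

```python
import math

def simulate_load(n_cpu_runnable, n_disk_blocked,
                  ticks=200, interval=5.0, window=60.0):
    """Iterate an exponentially damped moving average of task counts.

    linux_load counts runnable + disk-blocked tasks (Linux-style);
    classic_load counts runnable tasks only (classic Unix-style).
    Assumes constant task counts over idealized 5 s ticks.
    """
    e = math.exp(-interval / window)  # damping factor per tick
    linux_load = 0.0
    classic_load = 0.0
    for _ in range(ticks):
        n_linux = n_cpu_runnable + n_disk_blocked
        linux_load = linux_load * e + n_linux * (1.0 - e)
        classic_load = classic_load * e + n_cpu_runnable * (1.0 - e)
    return linux_load, classic_load

# CPU completely idle (0 runnable), but 4 processes blocked on disk I/O:
linux, classic = simulate_load(n_cpu_runnable=0, n_disk_blocked=4)
print(f"Linux-style: {linux:.2f}, classic: {classic:.2f}")
```

With zero runnable tasks and four tasks stuck in disk wait, the Linux-style average converges to about 4.0 while the classic definition stays at 0.0, which is exactly why a disk-bound Linux box can report an alarming load average with an idle CPU.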
It would be an improvement to have per-entity load tracking, as long as CPU and disk are separated.