Friday, June 15, 2007

Linux Instrumentation: Is Linus Part of the Problem?

I was interested to read in a LinuxWorld article entitled "File System, Power and Instrumentation: Can Linux Close Its Technical Gaps?" that Linus Torvalds believes the current kernel instrumentation sufficiently addresses real-world performance problems.

This statement would be laughable, if he weren't serious. Consider the following:
  • How can current Linux instrumentation be sufficient when older UNIX performance instrumentation is still not sufficient?
  • UNIX instrumentation was not introduced to solve real-world performance problems. It was a hack by and for kernel developers to monitor the performance impact of code changes in a light-weight O/S. We're still living with that legacy. It might've been necessary, but that doesn't make it sufficient.
  • The level of instrumentation in Linux (and UNIX-es) is not greatly different from what it was 30 years ago. As I discuss in Chap. 4 of my Perl::PDQ book, the idea of instrumenting an O/S goes back (at least) to c.1965 at MIT.
  • Last time I checked, this was the 21st century. By now, I would have expected (foolishly, it seems) to have at my fingertips, a common set of useful performance metrics, together with a common means for accessing them across all variants of UNIX and Linux.
  • Several attempts have been made to standardize UNIX performance instrumentation. One was called the Universal Measurement Architecture (UMA), and another was presented at CMG in 1999.
  • The UMA spec arrived DOA because the UNIX vendors, although they helped to design it, didn't see any ROI where there was no apparent demand from users/analysts. Analysts, on the other hand, didn't demand what they had not conceived was missing. Such a Mexican standoff was cheaper for the UNIX vendors. (How conveeeeenient!) This remains a potential opportunity for Linux, in my view.
  • Rich Pettit wrote a very thoughtful paper entitled "Formalizing Performance Metrics in Linux", which was resoundingly ignored by Linux developers, as far as I know.
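To make the complaint about ad-hoc interfaces concrete, here is a minimal sketch (my own illustration, not from any standard) of how Linux actually exposes CPU counters: as positional columns of text in /proc/stat, whose meaning and order are pure convention. Deriving a utilization figure means sampling twice and differencing the cumulative tick counts yourself. This is exactly the kind of uncodified interface a UMA-style standard was meant to replace.

```python
def read_cpu_ticks(path="/proc/stat"):
    """Read the aggregate CPU line from /proc/stat.

    The first token is the literal "cpu"; the rest are cumulative
    tick counters whose order (user, nice, system, idle, iowait,
    irq, softirq, ...) is convention, not a self-describing schema.
    """
    with open(path) as f:
        fields = f.readline().split()
    names = ["user", "nice", "system", "idle", "iowait", "irq", "softirq"]
    return dict(zip(names, (int(v) for v in fields[1:1 + len(names)])))

def cpu_utilization(before, after):
    """Fraction of ticks spent non-idle between two samples."""
    busy = sum(after[k] - before[k] for k in before if k != "idle")
    total = busy + (after["idle"] - before["idle"])
    return busy / total if total else 0.0
```

Every tool that reports CPU utilization on Linux re-implements some variant of this parsing and differencing; the counters are there, but the access method is folklore.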
Elsewhere, I've read that Linus would like to keep Linux "lean and mean". This reflects a naive PC mentality. When it comes to mid-range SMP servers, it is not possible to symmetrize the kernel without making the code paths longer. Longer code paths are necessary for control and scalability. That's a good thing. And improved performance instrumentation is needed a fortiori for the layered software architectures that support virtual machines. Since Linus is the primary gate-keeper of Linux, I can't help wondering if he is part of the solution or part of the problem.


Ed Borasky said...

Well ... I saw the article too. As you may know from talking to some of my colleagues, I've been digging around in this area in the Linux kernel for a number of years. So, some comments:

1. "UNIX instrumentation was not introduced to solve real-world performance problems. It was a hack by and for kernel developers to monitor the performance impact of code changes in a light-weight O/S. We're still living with that legacy. It might've been necessary, but that doesn't make it sufficient."

Very true. But the very nature of the Linux development process more or less guarantees this. The kernel, the counters, and the performance tools that interpret the counters are developed by a mix of volunteers and paid professionals, mostly operating under the GPL and related open-source licenses. Some operate in nations barred by United States law from using certain technologies, and some must ultimately be accountable for corporate profits and losses. So yes, you get "good enough" rather than "outstanding" in some areas.

2. You've done quite a bit of digging into the load average and related metrics. In my case, I've done a similar amount of digging into the block I/O layer, and the Summer 2007 issue of the CMG Journal has a paper by Dominique Heger that goes even deeper -- she's actually built a deterministic/stochastic Petri net model of the 2.6 kernel and the I/O schedulers!

But on a more mundane level, if you have a few hours to spare, take a look at the "iostat" code for extended disk statistics on a modern Linux system. It's in the "sysstat" package.

The underlying code in the Linux kernel is in the source files "ll_rw_block.c" and "genhd.c". What you will find, once you decode all the variable names and the logic behind the code, is that "iostat" is doing a basic operational queuing analysis of the block I/O layer!
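That operational analysis can be sketched in a few lines. The sketch below is my own illustration (the field names follow the counters the kernel exports per disk; the function and dictionary keys are hypothetical): given two samples of cumulative completions and busy time, the operational laws U = B/T and S = B/C yield iostat's %util, IOPS, and service time.

```python
def disk_metrics(before, after, interval_s):
    """Derive iostat-style metrics from two samples of a disk's
    cumulative counters, using basic operational analysis.

    completions  C = I/Os finished during the interval
    busy time    B = seconds the device was non-idle (io_ticks)
    utilization  U = B / T        (iostat's %util / 100)
    throughput   X = C / T        (IOPS)
    service time S = B / C        (iostat's svctm)
    """
    completions = ((after["reads"] - before["reads"]) +
                   (after["writes"] - before["writes"]))
    busy_s = (after["io_ticks_ms"] - before["io_ticks_ms"]) / 1000.0
    util = busy_s / interval_s
    throughput = completions / interval_s
    service = busy_s / completions if completions else 0.0
    return util, throughput, service
```

Note that the three results are not independent: the utilization law U = X * S holds by construction, which is precisely why Ed can call this "operational" analysis rather than modeling.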

3. From the point of view of enterprise Linux IT managers, I don't think the situation is as bleak as you describe. The basic command-line tools that have been in most flavors of Unix for decades may be primitive, but thousands of people know how to use them and make real-time decisions based on them. "top", "sar", "vmstat", "iostat", etc. are all there.

And there are both commercial and open-source performance management tools that provide logging, graphics, analysis, reporting, etc. I don't know much about the commercial tools, but there are two open-source packages, "cricket" and "cacti", that are based on "rrdtool" and are highly customizable to any enterprise.

Rae Y said...

I mainly use Solaris, which is growing more and more unlike the other UNIXes. What I'd like to figure out is what the proper tools are for measuring Solaris 10.

The ZFS folks have already said that iostat device utilization and latency aren't useful, but didn't suggest alternative tools for identifying bottlenecks.

DTrace is a great tool, but not everything is instrumented, and it is better suited to relative measurement than to absolute numbers.

Solaris Zones pose yet another problem for planning and measurement, as CPU utilization is no longer helpful with virtualization.

Hyperthreading just makes me hyper-paranoid. Does that 1U rack really have 128 CPUs in it??

Seems like we are taking two steps backward in measurability and predictability for every step forward in functionality. Hopefully this can change.

Stefan Parvu said...

Rae Y:

Solaris 10 brings new technologies that the others lack, for example DTrace, ZFS, and SMF. Just remember the days when we had no NFS. Others are evaluating these and integrating them: Apple (ZFS, DTrace), FreeBSD (ZFS, DTrace), etc.

So this is good stuff which eases, or should ease, our pain. One message you hear, or think you hear, from marketing droids is: with all these tools we don't need humans controlling and handling the machines at all. That's a bit too extreme and sometimes bogus. SysAdmins will always exist, no matter what.

Coming back to monitoring and getting the most out of your machine's performance counters: take a look at SDR to see, for instance, how easily I integrated some scripts into Solaris using SMF.

SDR can help a bit. It is an open effort which I started just because I ended up in trouble trying to understand Tivoli, BMC, etc. TeamQuest might help you if you can afford it.

Zone utilisation is a bit primitive at this time, since we have no KSTAT interfaces to retrieve the needed data. This is something which I bet they are working on. Take a look at the OpenSolaris Performance group.

Cores: this as well will be polished and integrated into Solaris in the near future. Or at least that was the message I got.

So things are moving.

Paul Linehan said...

You make several good points. You should read Mogens Norgaard (an Oracle guru and an editor of Oracle Insights; see his article in that book). He expounds "his" law, whereby any system that becomes well instrumented automatically gets overwhelmed by additional (uninstrumented) tiers. His point is that the two best-instrumented systems in the world (IBM's z/OS and Oracle's database) are now merely links in a chain: browser -> router -> load balancer -> web server -> app server -> database, and back down again. He notes that Apache isn't instrumented, and neither are many of the other components of modern systems. Anyway, worth reading if you're the kind of propeller-head who is interested in this sort of stuff.