Thursday, January 27, 2011

Idleness Is Not Waste

A common fallacy is to view all idle CPU cycles as wasted server capacity. It's not unusual for management and various bean-counters to be reluctant to procure new hardware if unused cycles are clearly observable on existing hardware. This puts pressure on sysadmins to reduce idleness. Such is often the case during consolidation efforts: cram as many apps as possible onto a server to soak up every remaining CPU cycle.

All performance analysis and capacity planning is essentially about optimizing resource usage under a particular set of constraints. The fallacy is treating maximization as optimization. This mistake is exacerbated if only one performance metric, e.g., CPU utilization, is taken into account: a common situation promoted by the superficiality of performance dashboards. Nor does maximization necessarily mean 100% utilization. Even when some amount of CPU capacity is retained as headroom for workload growth, the tendency to "redline" whatever remains can still prevail.

You can't optimize a single number. Server utilization has to be optimized with respect to other measures, e.g., application response-time targets. We know from simple queueing theory that response time increases nonlinearly (the proverbial "hockey stick") with increasing server utilization. If the response-time targets are being met at 10% CPU busy, pre-consolidation, then response times will be longer at the higher CPU utilization, post-consolidation, and almost certainly those targets will be exceeded. The response-time metric is an example of a cost that has to be taken into account to satisfy all the constraints of the optimized capacity plan.
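To put rough numbers on the hockey stick, here is a minimal sketch that assumes the simplest textbook case, a single-server M/M/1 queue, where the mean response time is R = S / (1 - ρ) for mean service time S and utilization ρ. The function name and the 0.1-second service time are illustrative assumptions, not measurements from any real server, and a real multi-core box will follow a different (though still nonlinear) curve.

```python
def response_time(service_time: float, utilization: float) -> float:
    """Mean response time of an M/M/1 queue: R = S / (1 - rho).

    Assumes a single server with Poisson arrivals and exponentially
    distributed service times; utilization is a fraction in [0, 1).
    """
    if service_time <= 0.0:
        raise ValueError("service_time must be positive")
    if not 0.0 <= utilization < 1.0:
        raise ValueError("utilization must be in the range [0, 1)")
    return service_time / (1.0 - utilization)


if __name__ == "__main__":
    S = 0.1  # hypothetical mean service time, in seconds
    for rho in (0.10, 0.50, 0.75, 0.90, 0.95):
        r = response_time(S, rho)
        # Show the response time and its stretch factor relative to S.
        print(f"{rho:4.0%} busy: R = {r:.3f} s ({r / S:.1f}x service time)")
```

Under these assumptions the response time is about 1.1 times the service time at 10% busy, 2 times at 50%, and 10 times at 90%. Going from 10% to 90% utilization therefore stretches response time roughly ninefold, which is the cost that a utilization-only dashboard never shows.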

Maximizing server utilization is as foolhardy as maximizing revenue. Both goals look attractive on their face, but if you don't keep track of outgoing CapEx and OpEx costs incurred to generate revenue, you could lose the company!

Wednesday, January 26, 2011