The Pith of Performance: performance management

Showing posts with label performance management. Show all posts

Tuesday, May 6, 2025

Performance Ponderings Updated

Just updated my "Performance Ponderings" page. It covers my performance analysis papers published over the past 35 yrs. 😳 You may be thinking: Who cares about performance analysis cases that is that dated?

Amazingly, performance analysis (done right) is often not technology-dependent, per se. The right abstraction can remain invariant into perpetuity. For example, my 2025 analysis of the GPT Efficient Computer Frontier (ref. 8 in paper 1) is based partly on my 1989 analysis of virtual memory thrashing in paper 53. Same paradigm; vastly different context.

The connection between the two papers is by no means obvious but, quite striking (not to mention satisfying) when you realize it.

Monday, June 25, 2018

Guerrilla 2018 Classes Now Open

All Guerrilla training classes are now open for registration.

GCAP: Guerrilla Capacity and Performance — From Counters to Containers and Clouds
GDAT: Guerrilla Data Analytics — Everything from Linear Regression to Machine Learning
PDQW: Pretty Damn Quick Workshop — Personal tuition for performance and capacity mgmt

The following highlights indicate the kind of thing you'll learn. Most especially, how to make better use of all that monitoring and load-testing data you keep collecting.

How to save millions of dollars with a one-line performance model (video)
How to minimize chargeback after you lift and shift to the cloud (video)
How to correctly emulate web traffic on a load-testing rig (PDF)

See what Guerrilla grads are saying about these classes. And how many instructors do you know that are available for you from 9am to 9pm (or later) each day of your class?

Who should attend?

IT architects
Application developers
Performance engineers
Sysadmins (Linux, Unix, Windows)
System engineers
Test engineers
Mainframe sysops (IBM. Hitachi, Fujitsu, Unisys)
Database admins
Devops practitioners
SRE engineers
Anyone interested in getting beyond performance monitoring

As usual, Sheraton Four Points has bedrooms available at the Performance Dynamics discounted rate. The room-booking link is on the registration page.

Tell a colleague and see you in September!

Wednesday, June 20, 2018

Chargeback in the Cloud - The Movie

If you were unable to attend the live presentation on cost-effective defenses against chargeback in the cloud, or simply can't get enough performance and capacity analysis for the AWS cloud (which is completely understandable), here's a direct link to the video recording on CMG's YouTube channel.

The details concerning how you can do this kind of cost-benefit analysis for your cloud applications will be discussed in the upcoming GCAP class and the PDQW workshop. Check the corresponding class registration pages for dates and pricing.

Wednesday, March 14, 2018

WTF is Modeling, Anyway?

A conversation with performance and capacity management veteran Boris Zibitsker, on his BEZnext channel, about how to save multiple millions of dollars with a one-line performance model (at 21:50 minutes into the video) that has less than 5% error. I wish my PDQ models were that good. :/

The strength of the model turns out to be its explanatory power, rather than prediction, per se. However, with the correct explanation of the performance problem in hand (which also proved that all other guesses were wrong), this model correctly predicted a 300% reduction in application response time for essentially no cost. Modeling doesn't get much better than this.

Footnotes

According to Computer World in 1999, a 32-node IBM SP2 cost $2 million to lease over 3 years. This SP2 cluster was about 6 times bigger.
Because of my vain attempt to suppress details (in the interests of video length), Boris gets confused about the kind of files that are causing the performance problem (near 26:30 minutes). They're not regular data files and they're not executable files. The executable is already running but sometimes waits—for a long time. The question is, waits for what? They are, in fact, special font files that are requested by the X-windows application (the client, in X parlance). These remote files may also get cached, so it's complicated. In my GCAP class, I have more time to go into this level of detail. Despite all these potential complications, my 'log model' accurately predicts the mean application launch time.
Log_2 assumes a binary tree organization of font files whereas, Log_10 assumes a denary tree.
Question for the astute viewer. Since these geophysics applications were all developed in-house, how come the developers never saw the performance problems before they ever got into production? Here's a hint.
Some ppl have asked why there's no video of me. This was the first time Boris had recorded video of a Skype session and he pushed the wrong button (or something). It's prolly better this way. :P

Wednesday, February 21, 2018

CPU Idle Is Not Like White Space

This post seems like it ought to be too trite to write but, I see the following performance gotcha cropping up over and over again.

Under pressure to consolidate resources, usually driven by management and especially regarding processor capacity, there is often an urge to "use up" any idle processor cycles. Idle processor capacity tends to be viewed like it's whitespace on a written page—just begging to be filled up.

The logical equivalent of filling up the "whitespace" is absorbing idle processor capacity by migrating applications that are currently running on other servers and turning those excess servers off or using them for something else.

Performance Analysis vs. Capacity Planning

This question came up in a (members only) Linkedin discussion group:

Often found a misconception about these terms. I'm sure this must be written in a book, but for informal discussions is always preferable to cite sources from standardization institutes or IT industry referents.
Thanks in advance

Gian Piero

Here's how I answered it.

Continuous Integration Gets Performance Testing Radar

As companies embrace continuous integration (CI) and fast release cycles, a serious problem has emerged in the development pipeline: Conventional performance testing is the new bottleneck. Every load testing environment is actually a highly complex simulation assumed to be a model of the intended production environment. System performance testing is so complex that the cost of modifying test scripts and hardware has become a liability for meeting CI schedules. One reaction to this situation is to treat performance testing as a checkbox item, but that exposes the new application to unknown performance idiosyncrasies in production.

In this webinar, Neil Gunther (Performance Dynamics Company) and Sai Subramanian (Cognizant Technology Solutions) will present a new type of model that is not a simulation, but instead acts like continuous radar that warns developers of potential performance and scalability issues during the CI process. This radar model corresponds to a virtual testing framework that precludes the need for developing performance test scripts or setting up a separate load testing environment. Far from being a mere idea, radar methodology is based on a strong analytic foundation that will be demonstrated by examining a successful case study.

Broadcast Date and Time: Tuesday, July 22, 2014, at 11 am Pacific

Thursday, July 17, 2014

Restaurant Performance Sunk by Selfies

An interesting story appeared last weekend about a popular NYC restaurant realizing that, although the number of customers they served on a daily basis is about the same today as it was ten years ago, the overall service had slowed significantly. Naturally, this situation has led to poor online reviews to the point where the restaurant actually hired a firm to investigate the problem. The analysis of surveillance tapes led to a surprising conclusion. The unexpected culprit behind the slowdown was neither the kitchen staff nor the waiters, but customers taking photos and otherwise playing around with their smartphones.

Using the data supplied in the story, I wanted to see how the restaurant performance would look when expressed as a PDQ model. First, I created a summary data frame in R, based on the observed times:


> df
           obs.2004 obs.2014
wifi.data         0        5
menu.data         8        8
menu.pix          0       13
order.data        6        6
eat.mins         46       43
eat.pix           0       20
paymt.data        5        5
paymt.pix         0       15
total.mins       65      115

The 2004 situation can be represented schematically by the following queueing network

Figure 1. 2004 queueing stages

Referring to Figure 1:

Load Average in FreeBSD

In my Perl PDQ book there is a chapter, entitled Linux Load Average, where I dissect how the load average metric (or metrics, since there are three reported numbers) is computed in shell commands like uptime, viz.,

[njg]~/Desktop% uptime
16:18  up 9 days, 15 mins, 4 users, load averages: 2.11 1.99 1.99

For the book, I used Linux 2.6 source because it was accessible on the web with convenient hyperlinks to navigate the code. Somewhere in the kernel scheduler, the following C code appeared:


#define FSHIFT    11          /* nr of bits of precision */ 
#define FIXED_1   (1<<FSHIFT) /* 1.0 as fixed-point */ 
#define LOAD_FREQ (5*HZ)      /* 5 sec intervals */ 
#define EXP_1     1884        /* 1/exp(5sec/1min) fixed-pt */ 
#define EXP_5     2014        /* 1/exp(5sec/5min) */ 
#define EXP_15    2037        /* 1/exp(5sec/15min) */ 

#define CALC_LOAD(load,exp,n) \
 load *= exp; \
 load += n*(FIXED_1-exp); \
 load >>= FSHIFT;

where the C macro CALC_LOAD computes the following equation

The Visual Connection Between Capacity And Performance

Whether or not computer system performance and capacity are related is a question that comes up from time to time, especially from those with little experience in either discipline. Most recently, it appeared on a Linked-in discussion group:

"...the topic was raised about the notion that we are Capacity Management not Performance Management. It made me think about whether performance is indeed a facet of Capacity, or if it belongs completely separate."

As a matter of course, I address this question in my Guerrilla training classes. There, I like to appeal to a simple example—a multiserver queue—to exhibit how the performance characteristics are intimately related to system capacity. Not only are they related but, as the multiserver queue illustrates, the relationship is nonlinear. In terms of daily operations, however, you may choose to focus on one aspect more than the other, but they are still related nonetheless.

Monitoring CPU Utilization Under Hyper-threading

The question of accurately measuring processor utilization with hyper-threading (HT) enabled came up recently in a Performance Engineering Group discussion on Linked-in. Since I spent some considerable time looking into this issue while writing my Guerrilla Capacity Planning book, I thought I'd repeat my response here (slightly edited for this blog), in case it's useful to a broader audience interested in performance and capacity management. Not much has changed, it seems.

In a nutshell, the original question concerned whether or not it was possible for a single core to be observed running at 200% busy, as reported by Linux top, when HT is enabled.

This question is an old canard (well, "old" for multicore technology). I call it the "Missing MIPS" paradox. Regarding the question, "Is it really possible for a single core to be 200% busy?" the short answer is: never! So, you are quite right to be highly suspicious and confused.
You don't say which make of processor is running on your hardware platform, but I'll guess Intel. Very briefly, the OS (Linux in your case) is being lied to. Each core has 2 registers where inbound threads are stored for processing. Intel calls these AS (Architectural State) registers. With HT *disabled*, the OS only sees a single AS register as being available. In that case, the mapping between state registers and cores is 1:1. The idea behind HT is to allow a different application thread to run when the currently running app stalls; due to branch misprediction, bubbles in the pipeline, etc. To make that possible, there has to be another port or AS register. That register becomes visible to the OS when HT is enabled. However, the OS (and all the way up the food chain to whatever perf tools you are using) now thinks twice the processor capacity is available, i.e., 100% CPU at each AS port.

What happened at HealthCare.gov?

On Oct. 6th Federal officials admitted the online marketplace needed design changes, as well as more server capacity to improve efficiency on the federally run exchange that serves 36 states. More details in this WSJ article.

And finally, from the PR horse's mouth on Oct 20th:

"Initially, we implemented a virtual 'waiting room,' but many found this experience to be confusing. We continued to add more capacity in order to meet demand and execute software fixes to address the sign up and log in issues, stabilizing those parts of the service and allowing us to remove the virtual 'waiting room.' "

Quite apart from the bizarre architectural description, a "virtual waiting room" implies a buffer or buffers where pending requests must wait for service because the necessary resources to complete those requests are not available due to being either busy or failed. A certain amount of waiting time can be tolerated by users (both applicants and providers) but if it becomes too long or simply fails to complete, that kind of poor performance points to grossly under-scaled capacity in the original design.

Harmonic Averaging of Monitored Rate Data

The following slides constitute evolving notes made in response to remarks that arose during the Monitorama Conference in Boston MA, March 28-29, 2013. Since they are evolving, the content will be updated continuously in place. So, get on RSS or Twitter or check back often to read the latest version.

During the Graphite workshop session at Monitorama, the topic of aggregating monitored rate data came up. This caused me to interject the cautionary comment:

Monitorama 2013 Conference

Here is my Keynote presentation that opened the first Monitorama conference and hackathon in Cambridge MA yesterday:

Comments from the #monitorama Twitter stream:

Wednesday, May 23, 2012

Queues in the News

You might not have noticed, but queues have been in the news a lot lately. Not from the standpoint of computer performance but people performance or more accurately, crowd control. Most recently, queues popped up in the context of long delays expected at Heathrow airport due to big crowd arrivals for the London Olympics.

Fears over Heathrow queues during Olympics
The Waiting Game by The Numbers Guy
QANTAS to Get No Queues

In a queue, at least everyone is pointing in the same logical direction. Moreover, if you snake the queue, as they do at Disneyland, people always feel close to the destination and can see it getting even closer as customers ahead of them are processed. That helps to minimize their level of frustration: the control part. For some reason, the post office hasn't figured this out yet.

And, last but not least, queues and computer performance still remain an inevitable perennial. Most recently having to do with the Internet.

The SSD World Will End in 2024

So says the Non-Volatile Systems Lab at UC San Diego. The claim is, in order to achieve higher densities, flash manufacturers must sacrifice both read and write latencies. I haven't had time to explore this claim in any detail, but I thought it might be useful for you to know about it. Some highlights include:

They tested 45 different NAND flash chips from six vendors that ranged in size from 72 nm circuitry to the current 25nm technology.
They then took their test results and extrapolated them to the year 2024, when NAND flash development road maps show flash circuitry is expected to be only 6.5 nm in size. At that point, read/write latency is expected to increase by a factor of two or more.
They did not use specialized NAND flash controllers such as those used by Intel, OCZ or Fusion-io. Their results can be viewed as "optimistic" because they didn't include latency added through error correction or garbage collection algorithms.
Considering the diminishing returns on performance versus capacity, Grupp said, "it's not going to be viable to go past 6.5 nm ... 2024 is the end."

The technical paper entitled, The Bleak Future of NAND Flash Memory (PDF), was presented and published at the FAST'12 conference held in San Jose, CA on February 14—17, 2012.

Related post: Green Disk Sizing

Thursday, October 20, 2011

Kanban Revived in an Agile Kind of Way

I just returned from a workshop on the latest in web technologies (invitation only) where I was surprised to hear reference made to kanban. Kanban is a card-based system originally developed by Toyota in the 1950s for controlling their manufacturing lines. I have a note about it in the "Brief History of Buffers" section of my Perl::PDQ book because it is a logical precursor to JIT scheduling and compiling. I will also discuss it in the up-coming Guerrilla classes.

Now, it seems kanban has been revived under the "agile" banner for making software development more efficient. Of course, the concept of using cards to capture dev state information is not entirely new, even in the context of software engineering. So-called Snow Cards are another card-based technique used to monitor the software development process.

The Pith of Performance