Showing posts with label capacity planning.

Tuesday, April 20, 2021

PDQ Online Workshop, May 17-21, 2021

PDQ (Pretty Damn Quick) is a free, open source, performance analyzer available from the Performance Dynamics web site.

All modern computer systems, no matter how complex, can be thought of as a directed graph of individual buffers that hold requests until they can be serviced at a shared computational resource, e.g., a CPU or disk. Since a buffer is just a queue, any computer infrastructure, from your laptop up to Facebook.com, can be represented as a directed graph of queues.

The directed arcs or arrows in such a graph correspond to workflows between the different queues. In the parlance of queueing theory, a directed graph of queues is called a queueing network model. PDQ is a tool for predicting performance metrics such as waiting time, throughput, and optimal user load.
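
As a minimal sketch (assuming the pdq R package is installed, and using illustrative workload values of my own), a one-queue open network in PDQ looks like this:

    library(pdq)

    Init("Web service sketch")            # start a new PDQ model
    CreateOpen("HTTPGet", 0.75)           # open workload arriving at 0.75 requests/sec
    CreateNode("WebCPU", CEN, FCFS)       # a single FCFS queueing center
    SetDemand("WebCPU", "HTTPGet", 0.5)   # each request demands 0.5 sec of service
    Solve(CANON)                          # solve the open network
    Report()                              # utilization, queue length, residence time

Running it in the R console prints the predicted metrics for comparison against your monitored values.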

Two major benefits of using PDQ are:

  1. confirming that monitored performance metrics have their expected values
  2. predicting performance for circumstances that lie beyond current measurements

Find out more about the workshop and register today.

Thursday, November 26, 2020

PDQ 7.0 is Not a Turkey

Giving Thanks for the release of PDQ 7.0, after a 5-year drought, and just in time for the PDQW workshop next week.

New Features

  1. The introduction of the STREAMING solution method for OPEN queueing networks. (cf. CANON, which can still be used).
  2. The CreateMultiNode() function is now defined for CLOSED queueing networks and distinguished via the MSC device type (cf. MSO for OPEN networks); see the sketch after this list.
  3. The format of Report() has been modified to make the various types of queueing network parameters clearer.
  4. See the R Help pages in RStudio for details.
  5. Run the demo(package="pdq") command in the R console to review a variety of PDQ 7 models.
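
As a hedged sketch of item 2 (assuming CreateMultiNode() keeps its earlier argument order, with the server count first, and using made-up parameter values), a closed network with a multiserver center might be set up like this:

    library(pdq)

    Init("PDQ 7 multiserver sketch")
    CreateClosed("Query", TERM, 20, 5.0)        # 20 users with 5 sec think time
    CreateMultiNode(4, "DBserver", MSC, FCFS)   # 4-server center; MSC marks it CLOSED
    SetDemand("DBserver", "Query", 0.8)         # 0.8 sec of service demand per visit
    Solve(EXACT)                                # solution method choice is illustrative
    Report()

Compare the Report() output with the same model built via CreateNode() to see the multiserver effect on residence time.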

Maintenance Changes

The migration from Python 2 to Python 3 has introduced maintenance complications for PDQ. Python 3 may be accommodated in a future PDQ release. Perl maintenance ended with PDQ release 6.2, which remains compatible with the Perl::PDQ book (2011).

Monday, June 25, 2018

Guerrilla 2018 Classes Now Open

All Guerrilla training classes are now open for registration.
  1. GCAP: Guerrilla Capacity and Performance — From Counters to Containers and Clouds
  2. GDAT: Guerrilla Data Analytics — Everything from Linear Regression to Machine Learning
  3. PDQW: Pretty Damn Quick Workshop — Personal tuition for performance and capacity management

The following highlights indicate the kinds of things you'll learn, most especially how to make better use of all that monitoring and load-testing data you keep collecting.

See what Guerrilla grads are saying about these classes. And how many instructors do you know who are available to you from 9am to 9pm (or later) on each day of your class?

Who should attend?

  • IT architects
  • Application developers
  • Performance engineers
  • Sysadmins (Linux, Unix, Windows)
  • System engineers
  • Test engineers
  • Mainframe sysops (IBM, Hitachi, Fujitsu, Unisys)
  • Database admins
  • Devops practitioners
  • Site reliability engineers (SREs)
  • Anyone interested in getting beyond performance monitoring

As usual, Sheraton Four Points has bedrooms available at the Performance Dynamics discounted rate. The room-booking link is on the registration page.

Tell a colleague and see you in September!

Tuesday, January 17, 2017

GitHub Growth Appears Scale Free

Update of Thursday, August 17, 2017: It looks like we can chalk up another one for the scale-free model (described below), as GitHub apparently surpasses 20 million users. Outgoing CEO Wanstrath mentioned this number in an emailed statement to Business Insider.
"As GitHub approaches 700 employees, with more than $200M in ARR, accelerating growth, and more than 20 million registered users, I'm confident that this is the moment to find a new CEO to lead us into the next stage of growth. ....."

The Original Analysis

In 2013, a Redmonk blogger claimed that the growth of GitHub (GH) users follows the Bass diffusion model. Here, growth refers to the number of unique user IDs as a function of time, not the number of project repositories, which can have a high degree of multiplicity.

In response, I tweeted a plot suggesting that GH growth might instead be following a power law, aka scale-free growth. The tell-tale sign is the asymptotic linearity of the growth data on double-log axes, which the original blog post did not discuss. The periods on the x-axis correspond to years, with the first period representing calendar year 2008 and the fifth period being the year 2012.
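
The check itself is easy to reproduce in R. As a hedged sketch (the user counts below are invented for illustration, not GitHub's actual numbers), a power law shows up as a straight line when fitted on double-log axes:

    # Hypothetical data: period (years since 2008) and unique user IDs
    period <- 1:5
    users  <- c(0.05, 0.25, 1.0, 2.8, 6.2) * 1e6

    fit <- lm(log10(users) ~ log10(period))   # linear on log-log axes => power law
    summary(fit)$coefficients                 # slope estimates the power-law exponent

    plot(log10(period), log10(users))         # look for asymptotic linearity
    abline(fit)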

Friday, March 20, 2015

Performance Analysis vs. Capacity Planning

This question came up in a (members only) LinkedIn discussion group:

"Often I have found a misconception about these terms. I'm sure this must be written in a book, but for informal discussions it is always preferable to cite sources from standardization institutes or IT industry referents.

Thanks in advance,
Gian Piero"

Here's how I answered it.

Monday, October 6, 2014

Tactical Capacity Management for Sysadmins at LISA14

On November 9th I'll be presenting a full-day tutorial on performance analysis and capacity planning at the USENIX Large Scale System Administration (LISA) conference in Seattle, WA.

The registration code is S4 in the System Engineering section.

Hope to see you there.

Friday, June 6, 2014

The Visual Connection Between Capacity And Performance

Whether or not computer system performance and capacity are related is a question that comes up from time to time, especially from those with little experience in either discipline. Most recently, it appeared in a LinkedIn discussion group:
"...the topic was raised about the notion that we are Capacity Management not Performance Management. It made me think about whether performance is indeed a facet of Capacity, or if it belongs completely separate."

As a matter of course, I address this question in my Guerrilla training classes. There, I like to appeal to a simple example—a multiserver queue—to exhibit how the performance characteristics are intimately related to system capacity. Not only are they related but, as the multiserver queue illustrates, the relationship is nonlinear. In terms of daily operations, you may choose to focus on one aspect more than the other, but they remain related nonetheless.
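
To make the relationship concrete, here is a hedged R sketch of the M/M/m multiserver queue (textbook Erlang C formulas; the service time is an invented value). Changing the server count m (capacity) shifts the whole response-time curve (performance), and nonlinearly so:

    # Erlang C: probability an arriving request must queue in an M/M/m system
    erlangC <- function(m, rho) {
        a   <- m * rho                               # offered load in Erlangs
        k   <- 0:(m - 1)
        top <- a^m / (factorial(m) * (1 - rho))
        top / (sum(a^k / factorial(k)) + top)
    }

    # Mean response time: service time S plus the expected waiting time
    mm.resp <- function(m, rho, S) S + erlangC(m, rho) * S / (m * (1 - rho))

    rho <- seq(0.1, 0.9, by = 0.2)                   # per-server utilization
    S   <- 0.5                                       # service time (sec), illustrative
    cbind(rho,
          m1 = sapply(rho, mm.resp, m = 1, S = S),
          m4 = sapply(rho, mm.resp, m = 4, S = S))

At the same per-server utilization, the four-server system holds response times well below the single server, until both hit the nonlinear wall as rho approaches 1.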

Monday, October 21, 2013

What happened at HealthCare.gov?

On Oct. 6th, Federal officials admitted the online marketplace needed design changes, as well as more server capacity, to improve efficiency on the federally run exchange that serves 36 states. More details in this WSJ article.

And finally, from the PR horse's mouth on Oct 20th:

"Initially, we implemented a virtual 'waiting room,' but many found this experience to be confusing. We continued to add more capacity in order to meet demand and execute software fixes to address the sign up and log in issues, stabilizing those parts of the service and allowing us to remove the virtual 'waiting room.' "
Quite apart from the bizarre architectural description, a "virtual waiting room" implies a buffer or buffers where pending requests must wait for service because the resources needed to complete those requests are unavailable, being either busy or failed. A certain amount of waiting time can be tolerated by users (both applicants and providers), but if it becomes too long, or requests simply fail to complete, that kind of poor performance points to grossly under-scaled capacity in the original design.

Friday, January 18, 2013

Linux Per-Entity Load Tracking: Plus ça change

Canadian capacity planner David Collier-Brown pointed me at this post about further proposed changes to how load is measured in the Linux kernel. He's not sure they're on the right track. David has written about such things as cgroups in Linux, and I'm sure he understands these things better than I do, so he might be right. I never understood the so-called CFS: Completely Fair Scheduler. Is it a fair-share scheduler or something else? Not only was there a certain amount of political fallout over CFS but, do we care about such things anymore? That was back in 2007. These days we are just as likely to run Linux in a VM under VMware or XenServer, or in the cloud. Others have proposed that the Linux load average metric be made "more accurate" by including IO load. Would that be local IO, remote IO, or both? Disk IO, network IO, etc., etc.?

Wednesday, March 7, 2012

The SSD World Will End in 2024

So says the Non-Volatile Systems Lab at UC San Diego. The claim is that, in order to achieve higher densities, flash manufacturers must sacrifice both read and write latencies. I haven't had time to explore this claim in any detail, but I thought it might be useful for you to know about it. Some highlights include:
  • They tested 45 different NAND flash chips from six vendors, ranging from 72 nm circuitry to the current 25 nm technology.
  • They then took their test results and extrapolated them to the year 2024, when NAND flash development road maps show flash circuitry is expected to be only 6.5 nm in size. At that point, read/write latency is expected to increase by a factor of two or more.
  • They did not use specialized NAND flash controllers such as those used by Intel, OCZ or Fusion-io. Their results can be viewed as "optimistic" because they didn't include latency added through error correction or garbage collection algorithms.
  • Considering the diminishing returns on performance versus capacity, Grupp said, "it's not going to be viable to go past 6.5 nm ... 2024 is the end."

The technical paper, entitled The Bleak Future of NAND Flash Memory (PDF), was presented and published at the FAST'12 conference held in San Jose, CA, February 14–17, 2012.

Related post: Green Disk Sizing

Friday, February 24, 2012

On the Accuracy of Exponentials and Expositions

The following is a slightly edited version of my response to a Discussion on the Linkedin CPPE group, which is accessible to Members Only. It's written in the style of a journal reviewer. The original Discussion topic was based on a link to a blog-post. I've been asked to make my Linkedin review more widely available so, here tiz...

The blog-post Capacity Planning on a Cocktail Napkin is a really good example of a really bad explanation. There are so many things that are misleading, at best, and flat-out wrong, at worst, it's hard to know where to begin (or where to stop). Nevertheless, I'll try to keep it brief [I failed in that endeavor. — njg].

The author applies the equation:

\begin{equation} E = \lambda \end{equation}

Why? What is that equation? We don't know, because the author has not yet said what the symbols mean. It's all very well to impress people by slinging equations around, but it's more important to say what those equations actually mean. After all, the author might have chosen the wrong one.

Tuesday, December 27, 2011

A List of CaP Skills

This question popped up recently on Linkedin:
"Can someone tell me what skill set should a Performance and Capacity Analyst have and develop throughout his career?"
and I realized that, although I have a kind of list in my head, and I talk about such skills in my classes, I have been too lazy to write them down anywhere, which is pretty dumb. I must try to do something about that (New Year resolution? What are the odds?). In some ways, my fallback is the online Guerrilla Manual. Anyway, here is my (slightly edited) response to the LI question; let it therefore constitute my first attempt at writing down such a list.

Tuesday, September 6, 2011

How Much Wayback for CaP?

How much data do you need to retain for meaningful capacity planning and performance analysis? It sounds like one of those "how long is a piece of string?" questions, and I've never really thought about it in any formal way, but it occurred to me that 5 years is not an unreasonable archival period.

[Image: Mister Peabody and Sherman in front of the WABAC machine]

My reasoning goes like this:

Thursday, January 27, 2011

Idleness Is Not Waste

A common fallacy is to view all idle CPU cycles as wasted server capacity. It's not unusual for management and various bean-counters to display a reluctance to procure new hardware if unused cycles are clearly observable on existing hardware. This puts pressure on sysadmins to reduce idleness. Such is often the case during consolidation efforts: cram as many apps as possible onto a server to soak up every remaining CPU cycle.

All performance analysis and capacity planning is essentially about optimizing resource usage under a particular set of constraints. The fallacy is treating maximization as optimization. The mistake is further exacerbated if only one performance metric, viz., CPU utilization, is taken into account: a common situation promoted by the superficiality of performance dashboards. Nor does maximization necessarily mean 100% utilization: even when some CPU capacity is retained as headroom for workload growth, the tendency to "redline" the rest can still prevail.

You can't optimize a single number. Server utilization has to be optimized with respect to other measures, e.g., application response-time targets. We know from simple queueing theory that response time increases nonlinearly (the proverbial "hockey stick") with increasing server utilization. If the response-time goals are being met at 10% CPU busy, pre-consolidation, then they will almost certainly be violated at higher CPU utilization, post-consolidation. The response-time metric is an example of a cost that has to be taken into account to satisfy all the constraints of the optimized capacity plan.
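
The hockey stick is easy to see numerically. A minimal M/M/1 sketch in R, with a service time chosen purely for illustration:

    S   <- 0.2                               # service time in seconds (illustrative)
    rho <- c(0.10, 0.50, 0.80, 0.90, 0.95)   # server utilization
    R   <- S / (1 - rho)                     # M/M/1 response time: R = S/(1 - rho)
    data.frame(utilization = rho, response.time = R)

Response time roughly doubles by 50% busy and reaches twenty times the service time at 95% busy, which is why a target met at 10% utilization can fail badly after consolidation.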

Maximizing server utilization is as foolhardy as maximizing revenue. Both goals look attractive on their face, but if you don't keep track of outgoing CapEx and OpEx costs incurred to generate revenue, you could lose the company!

Friday, June 25, 2010

Velocity 2010: The Aftermathglow

I was so impressed with Velocity 2009, I really wanted to present something at Velocity 2010.

Velocity 2010 Conference
Thread-limited scalability of memcached

Working with Shanti and Stefan of Oracle (née Sun Microsystems), I was able to accomplish that goal. Our session was rated 92.4%, which is an A+ in anyone's book. Congrats to us and the Velocity organizers, and thank you, crowd.

Tuesday, May 18, 2010

Intel's Cloud Computer on a Chip

Last week in the GCaP class, I underscored how important it is to "look out the window" and keep an eye on what is happening in the marketplace, because some of those developments may eventually impact capacity planning in your shop. Here's a good example:

This Intel processor (code named "Rock Creek") integrates 48 IA-32 cores, 4 DDR3 memory channels, and a voltage regulator controller in a 6×4 2D-mesh network-on-chip architecture. Located at each mesh node is a five-port virtual cut-through packet switched router shared between two cores. Core-to-core communication uses message passing while exploiting 384KB of on-die shared memory. Fine grain power management takes advantage of 8 voltage and 28 frequency islands to allow independent DVFS of cores and mesh. At the nominal 1.1V, cores operate at 1GHz while the 2D-mesh operates at 2GHz. As performance and voltage scales, the processor dissipates between 25W and 125W. The 567 sq-mm processor die is implemented in 45nm Hi-K CMOS and has 1,300,000,000 transistors.
The "cloud" reference is a marketing hook, but note that it uses a 2D mesh interconnect topology (like we discussed in class), contains 1.3 billion transistors with the new Hafnium metal gate (as we discussed in class), and produces up to 125 watts of heat.

The details of this processor were presented at the annual ISSCC meeting in San Francisco, February 2010.

Friday, April 16, 2010

Significant Figures in R and Rounding

This is a follow-on to my previous post about determining significant digits, or sigdigs, in performance and capacity management calculations. See Significant Figures in R and Info Zeros.

Once we know how to identify significant digits, inevitably we will be faced with rounding the result of a calculation to the least number of sigdigs. Whereas the signif() function in R suffered from truncating trailing info-zeros in measured values, when it comes to rounding, signif() shines. Better yet, it agrees with Algorithm 3.2 in my GCaP book. Let's see how well it does.
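
A quick illustration with numbers of my own choosing (not those from the book): rounding a product to the two sigdigs of its least-precise factor.

    x <- 3.1415927 * 2.3    # the least-precise factor has 2 significant digits
    signif(x, 2)            # 7.2 -- rounded to 2 sigdigs
    signif(123456, 3)       # 123000
    signif(0.0012345, 3)    # 0.00123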

Sunday, April 11, 2010

Significant Figures in R and Info Zeros

The other day, I stumbled upon the signif function in R, so I thought I'd take a look at what it does and compare it with some results discussed in Chap. 3 "Damaging Digits in Capacity Calculations" of my GCaP book, viz., Example 3.5 on page 31. The measured numbers in that example are reproduced here in Table 1 using read.table in R.
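
The truncation issue is easy to demonstrate with any measured value carrying a trailing info zero (the number below is my own, not one from Table 1):

    x <- 35.20                                         # 4 significant digits
    signif(x, 4)                                       # prints 35.2: the info zero is gone
    formatC(x, digits = 4, format = "fg", flag = "#")  # "35.20" preserves the info zero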

Monday, June 8, 2009

Bridges, Booms, Busts, Banks, Bailouts, ... Who Needs Capacity Planning?

Given that Wall Street management has proven once again that there are black swans, this time on a global scale, why would anyone be crazy enough to contemplate capacity management in the middle of such a mess?

See what Wall Street IT managers Simple CIO and Sal Viati have to say about it in "Let the Bridge Fall—As Long as It Falls on Time" (with apologies to Galileo).

"The commonly held idea that it's cheaper to over-engineer the hardware architecture to ensure adequate capacity is patently false. Here's the simple counter-example. If performance testing is skipped in order to meet the release schedule (and who knows if that's really valid?), and the deployed application ends up running single-threaded with lousy performance, a boat-load of the cheapest servers from China won't improve that."

...

"The bottom line is not really new. The sagacity of looking beyond the end of your nose is a truism, but incredibly that truth has been lost in the irrational exuberance of false Wall Street economics. A robust economy and IT customer satisfaction both come from foresight, not just eyesight. In fact, it's the second word in capacity planning.

Lest you think I'm being too hard on Wall St., listen to Peter Day of the BBC interviewing Philip Delves-Broughton about his new book, What They Teach You at Harvard Business School: My Two Years in the Cauldron of Capitalism. Some points to listen for:
  • MBAs are not taught to get their hands dirty with such sleazy activities as sales. That's sales as in: salesforce, sales people, the Fuller Brush Man.
  • Neither Steve Jobs nor Bill Gates has an MBA.
  • Too much devotion to spreadsheet calculations and PowerPoint presentations. This is why Robert McNamara (Harvard MBA) mismanaged the Vietnam War: too much faith in (manipulated) uncorroborated numbers.
  • Total disconnect between teaching abstract business models and the business of business, which is, err... like... selling stuff.
  • These are the people running things! (Let's read that again)
  • The Economist rankings: French model beats Anglo-Saxon model (of which Wall St. is obviously a subset).
Update (Tue, Jun 9, 2009): George Soros (hedge-funder extraordinaire) estimates this black swan could be 3-5 times bigger than the one seen in 1929.

Update (Wed, Jun 10, 2009): Nobel economist Joseph Stiglitz slams Wall St. for tarnishing the reputation of American-style capitalism, which may pose new threats to global stability and U.S. security.

Update (Mon, Jul 20, 2009): The associate director of the M.I.T. Media Lab considers institutional monocultures in his Boston Globe op-ed piece: "What can failures teach us?"