Wednesday, August 13, 2014

Intel TSX Multicore Scalability in the Wild

Multicore processors were introduced to an unsuspecting marketplace more than a decade ago, but really became mainstream circa 2005. Multicore was presented as the next big thing in microprocessor technology. No mention was made of falling off the Moore's law (uniprocessor) curve. A 2007 PR event—held jointly between Intel, IBM and AMD—announced a VLSI fabrication technology that broke through the 65 nm barrier. The high-κ hafnium gate enabled building smaller transistors at a 45 nm feature size and thus more cores per die. I tracked and analyzed the repercussions of that event in these 2007 blog posts:
  1. Moore's Law II: More or Less?
  2. More on Moore
  3. Programming Multicores Ain't Easy

Intel met or beat their projected availability schedule (depending on how you count) for Penryn by essentially rebooting their foundries. Very impressive.

In my Guerrilla classes I like to pose the question: Can you procure a 10 GHz microprocessor? On thinking about it (usually for the first time), most people begin to realize that they can't, but they don't know why not. Clearly, clock frequency limitations have an impact on both performance and server-side capacity. Then, I like to point out that programming multicores (since that decision has already been made for you) is much harder than it is for uniprocessors. Moreover, there is not much help from compilers and other development tools at the moment, although that situation will presumably keep improving. Intel TSX (Transactional Synchronization Extensions) for Haswell multicores offers assistance of that type at the hardware level. In particular, TSX instructions are built into Haswell cores to boost the performance and scalability of certain types of multithreaded applications. But more about that in a minute.
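
To make the idea concrete, here is a minimal sketch (mine, not Intel sample code) of the lock-elision idiom that RTM supports, using the _xbegin/_xend/_xabort intrinsics from immintrin.h. It assumes a compiler with RTM support (e.g., gcc -mrtm) and a processor on which TSX is actually present and enabled; the shared counter and spinlock are illustrative stand-ins for a real critical section.

/* Lock elision with Intel RTM intrinsics: a minimal sketch, not code
 * from this post. Assumes gcc or clang with -mrtm on a CPU whose
 * TSX/RTM support is present and enabled. The spinlock is only the
 * fallback path; in the common case it is merely read, not taken. */
#include <immintrin.h>   /* _xbegin, _xend, _xabort, _XBEGIN_STARTED */
#include <stdatomic.h>

static atomic_int fallback_lock = 0;   /* 0 = free, 1 = held */
static long shared_counter = 0;        /* state protected by the lock */

static void lock_fallback(void)
{
    while (atomic_exchange(&fallback_lock, 1))
        while (atomic_load(&fallback_lock))
            ;                          /* spin until the lock looks free */
}

static void unlock_fallback(void)
{
    atomic_store(&fallback_lock, 0);
}

void increment_elided(void)
{
    unsigned status = _xbegin();
    if (status == _XBEGIN_STARTED) {
        /* Transactional path: reading the lock puts it in our read set,
         * so a thread that takes the real lock will abort us automatically. */
        if (atomic_load(&fallback_lock))
            _xabort(0xff);             /* lock already held: bail out */
        shared_counter++;
        _xend();                       /* commit the transaction */
    } else {
        /* Aborted (conflict, capacity, no RTM): take the real lock. */
        lock_fallback();
        shared_counter++;
        unlock_fallback();
    }
}

The payoff is exactly the scalability effect discussed below: threads that do not actually conflict no longer serialize on the coarse-grained lock.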

I also like to point out that Intel and other microprocessor vendors (of which there are fewer and fewer due to the enormous cost of entry) have little interest in how well your database, web site, or commercial application runs on their multicores. Rather, their goal is to produce the cheapest chip for the largest commodity market, e.g., PCs, laptops, and more recently mobile. Since that's where the profits are, the emphasis is on the simplest design, not the best design.

Fast, cheap, reliable: pick two.
Server-side performance is usually relegated to low man on the totem pole because of its relatively smaller market share. The implicit notion is that if you want more performance, just add more cores. But that depends on the threadedness of the applications running on those cores. Of course, there can also be side benefits, such as inheriting lower power servers from advances in mobile chip technology.

Intel officially announced multicore processors based on the Haswell architecture in 2013. Because scalability analysis can reveal a lot about limitations of the architecture, it's generally difficult to come across any quantitative data in the public domain. In their 2012 marketing build up, however, Intel showed some qualitative scalability characteristics of the Haswell multicore with TSX. See figure above. You can take it as read that these plots are based on actual measurements.

Most significantly, note the classic USL scaling profiles of transaction throughput vs. number of threads. For example, going from coarse-grain locking without TSX (red curve exhibiting retrograde throughput) to coarse-grain locking with TSX (green curve) has reduced the amount of contention (i.e., the USL α coefficient). It's hard to say what the impact of TSX is on coherency delay (i.e., the USL β coefficient) without being in possession of the actual data. As expected, however, the impact of TSX on fine-grain locking seems to be far more moderate. A 2012 AnandTech review summed things up this way:

TSX will be supported by GCC v4.8, Microsoft's latest Visual Studio 2012, and of course Intel's C compiler v13. GLIBC support (rtm-2.17 branch) is also available. So it looks like the software ecosystem is ready for TSX. The coolest thing about TSX (especially HLE) is that it enables good scaling on our current multi-core CPUs without an enormous investment of time in the fine tuning of locks. In other words, it can give developers "fine-grained performance at coarse-grained effort" as Intel likes to put it.

In theory, most application developers will not even have to change their code besides linking to a TSX enabled library. Time will tell if unlocking good multi-core scaling will be that easy in most cases. If everything goes according to Intel's plan, TSX could enable a much wider variety of software to take advantage of the steadily increasing core counts inside our servers, desktops, and portables.
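
For readers who haven't met it, the USL profile referred to two paragraphs up takes the standard form (the formula itself is not reproduced in the original post) \begin{equation} C(N) = \dfrac{N}{1 + \alpha (N - 1) + \beta N (N - 1)} \end{equation} where $C(N)$ is the relative capacity (normalized throughput) at $N$ threads, $\alpha$ is the contention (serialization) coefficient and $\beta$ is the coherency-delay coefficient. A nonzero $\beta$ is what produces the retrograde throughput visible in the red curve, whereas reducing $\alpha$ alone pushes the knee of the curve out to higher thread counts.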

With claimed clock frequencies of 4.6 GHz (i.e., nominal 5000 MIPS), Haswell with TSX offers superior performance at the usual price point. That's two. What about reliability? Ah, there's the rub. TSX has been disabled in the current manufacturing schedule due to a design bug.

Thursday, March 13, 2014

Performance of N+1 Redundancy

How can you determine the performance impact on SLAs after an N+1 redundant hosting configuration fails over? This question came up in the Guerrilla Capacity Planning class this week. It can be addressed by referring to a multi-server queueing model.

N+1 = 4 Redundancy

We begin by considering a small-N configuration of four hosts where the load is distributed equally to each of the hosts. For simplicity, the load distribution is assumed to be performed by some kind of load balancer with a buffer. The idea of N+1 redundancy is that the load balancer ensures all four hosts are equally utilized prior to any failover.

The assumption is that none of the hosts should use more than 75% of their available capacity: the blue areas on the left side of Fig. 1. The total consumed capacity is therefore $4 \times 3/4 = 3$ hosts' worth, or 300%, rather than the full 400% capacity of all four hosts. Then, when any single host fails, its lost capacity is supposedly compensated by redistributing that same load across the remaining three available hosts (each running 100% busy after failover). As we shall show in the next section, this is a misconception.

The circles in Fig. 1 represent hosts and rectangles represent incoming requests buffered at the load-balancer. The blue area in the circles signifies the available capacity of a host, whereas white signifies unavailable capacity. When one of the hosts fails, its load must be redistributed across the remaining three hosts. What Fig. 1 doesn't show is the performance impact of this capacity redistribution.
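
That impact can be quantified with the standard M/M/m residence-time formula. The following sketch is mine, not part of the original post, and the service time and load values in it are purely illustrative; it uses the Erlang-B recurrence to get Erlang C and, from that, the mean residence time before and after failover.

/* Sketch (not from the original post) of the M/M/m residence-time
 * calculation behind the N+1 failover question. The service time S
 * and utilization levels are illustrative values, not measurements. */
#include <stdio.h>

/* Erlang C: probability an arrival must wait, for offered load a = lambda/mu
 * and m servers, computed via the numerically stable Erlang-B recurrence. */
static double erlang_c(int m, double a)
{
    double b = 1.0;                     /* Erlang B with 0 servers */
    for (int k = 1; k <= m; k++)
        b = a * b / (k + a * b);        /* B(k) from B(k-1) */
    return b / (1.0 - (a / m) * (1.0 - b));
}

/* Mean residence time R = S + Wq for M/M/m at per-server utilization rho. */
static double residence_time(int m, double rho, double S)
{
    double a = m * rho;                 /* offered load in Erlangs */
    double c = erlang_c(m, a);
    return S + c * S / (m * (1.0 - rho));
}

int main(void)
{
    double S = 0.5;                     /* illustrative service time (sec) */
    double total_load = 3.0;            /* 4 hosts at 75% busy = 3 Erlangs */

    /* Before failover: m = 4 hosts, rho = 0.75 each. */
    printf("m=4, rho=0.75: R = %.3f s\n", residence_time(4, total_load / 4.0, S));

    /* After failover: m = 3 hosts carry the same 3 Erlangs, so rho = 1.0
     * and the queue is unstable; residence time grows without bound.
     * Evaluate just below saturation to see the blow-up. */
    printf("m=3, rho=0.99: R = %.3f s\n", residence_time(3, 0.99, S));
    return 0;
}

Numerically, it makes the point argued above: with four hosts at 75% busy the residence time is modest, but after failover the three survivors are driven to saturation and the residence time grows without bound, so the "100% busy after failover" picture cannot hold.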

Wednesday, December 25, 2013

Response Time Percentiles for Multi-server Applications

In a previous post, I applied my rules-of-thumb for response time (RT) percentiles (or more accurately, residence time in queueing theory parlance), viz., the 80th percentile: $R_{80}$, the 90th percentile: $R_{90}$ and the 95th percentile: $R_{95}$, to a cellphone application and found that the performance measurements were not completely consistent. Since the relevant data only appeared in a journal blog, I didn't have enough information to resolve the discrepancy, which is OK. The first job of the performance analyst is to flag performance anomalies, and most probably to let others resolve them—after all, I didn't build the system or collect the measurements.

More importantly, that analysis was for a single server application (viz., time-to-first-fix latency). At the end of my post, I hinted at adding percentiles to PDQ for multi-server applications. Here, I present the corresponding rules-of-thumb for the more ubiquitous multi-server or multi-core case.

Single-server Percentiles

First, let's summarize the Guerrilla rules-of-thumb for single-server percentiles (M/M/1 in queueing parlance): \begin{align} R_{1,80} &\simeq \dfrac{5}{3} \, R_{1} \label{eqn:mm1r80}\\ R_{1,90} &\simeq \dfrac{7}{3} \, R_{1}\\ R_{1,95} &\simeq \dfrac{9}{3} \, R_{1} \label{eqn:mm1r95} \end{align} where $R_{1}$ is the statistical mean of the measured or calculated RT and $\simeq$ denotes approximately equal. A useful mnemonic device is to notice the numerical pattern for the fractions. All denominators are 3 and the numerators are successive odd numbers starting with 5.
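
Those particular fractions are not pulled out of thin air. For M/M/1 the residence time is exponentially distributed about its mean $R_1$, so (filling in a step the summary above skips) the $p$-th percentile is \begin{equation} R_{1,p} = R_1 \ln\!\left(\dfrac{100}{100 - p}\right) \end{equation} and since $\ln 5 \approx 1.61$, $\ln 10 \approx 2.30$ and $\ln 20 \approx 3.00$, rounding each logarithm to the nearest third gives the $5/3$, $7/3$ and $9/3$ multipliers in eqns. \eqref{eqn:mm1r80} through \eqref{eqn:mm1r95}.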

Sunday, July 10, 2011

The Multiserver Numbers Game

In a previous post, I explained why \begin{equation} R_m \neq \dfrac{R_1}{m} \label{eqn:badest} \end{equation} and therefore doesn't work as an estimator of the mean residence time in an M/M/m multi-server queue. Although we expect the extra server capacity with m-servers to produce a shorter residence time ($R_m$), it is not m-times smaller than the residence time ($R_1$) for a single-server (i.e., $m = 1$) queue.

M/M/m multiserver queue

The problem is that eqn. \eqref{eqn:badest} grossly underestimates $R_m$, which is precisely the wrong direction for capacity planning scenarios. For that purpose, it's generally better to overestimate performance metrics. That's too bad because it would be a handy Guerrilla-style formula if it did work. You would be able to do the calculation in your head and impress everyone on your team (not to mention performing it as a party trick).
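
For completeness (this is the textbook result, not something quoted from the earlier post), the exact mean residence time for M/M/m is \begin{equation} R_m = S + \dfrac{C(m, \rho)\, S}{m (1 - \rho)} \end{equation} where $S$ is the mean service time, $\rho = \lambda S / m$ is the per-server utilization, and $C(m, \rho)$ is the Erlang-C probability that an arrival has to wait. That Erlang-C factor is what makes the true $m$-dependence messier than the simple division by $m$ in eqn. \eqref{eqn:badest}.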

Given that eqn. \eqref{eqn:badest} is a poor estimator, you might wonder if there's a better one, and if you'd been working for Thomas Edison he would have told you: "There's a better way. Find it!" Easy for him to say. But if you did decide to take up Edison's challenge, how would you even begin to search for such a thing?

Tuesday, May 5, 2009

Queues, Schedulers and the Multicore Wall

The other day, I came across a blog post entitled "Server utilization: Joel on queuing", so naturally I had to stop and take a squiz. The blogger, John D. Cook, is an applied mathematician by training and a cancer researcher by trade, so he's quite capable of understanding a little queueing theory. What he had not heard of before was a rule-of-thumb (ROT) that was quoted in a podcast (skip to 00:26:35) by NYC software developer and raconteur, Joel Spolsky. Although rather garbled, as I think any Guerrilla graduate would agree, what Spolsky says is this:
If you have a line of people (e.g., in a bank or a coffee shop) and the utilization of the people serving them gets above 80%, things start to go wrong because the lines get very long and the average amount of time people spend waiting tends to get really, really bad.

No news to us performance weenies, and the way I've sometimes heard it expressed at CMG is:
No resource should be more than 75% busy.
Personally, I don't like this kind of statement because it is very misleading. Let me explain why.
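
The basic single-queue arithmetic behind remarks like Spolsky's is worth recalling (my gloss here, not a quote from the podcast): for M/M/1 the mean residence time is \begin{equation} R = \dfrac{S}{1 - \rho} \end{equation} so at $\rho = 0.80$ you are already at $5S$, and at $\rho = 0.95$ at $20S$, which is exactly the hockey-stick he describes. At least part of the problem with a blanket threshold, though, is that the knee of the corresponding M/M/m curve moves toward 100% utilization as the number of servers $m$ grows, so no single percentage can be right for every resource.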

Saturday, April 18, 2009

Intel Goes Giddy-up on GPUs

Parallel computing is not just about CPUs, anymore. Think GPUs and (perhaps more importantly) the dev tools to utilize them. From a piece by Michael Feldman at HPCwire:

"With the advent of general-purpose GPUs, the Cell BE processor, and the upcoming Larrabee chip from Intel, data parallel computing has become the hot new supermodel in HPC. And even though NVIDIA took the lead in this area with its CUDA development environment, Intel has been busy working on Ct, its own data parallel computing environment." ...

"Ct is a high-level software environment that applies data parallelism to current multicore and future manycore architectures. ... Ct allows scientists and mathematicians to construct algorithms in familiar-looking algebraic notation. Best of all, the programmer does not need to be concerned with mapping data structures and operations onto cores or vector units; Ct's high level of abstraction performs the mappings transparently. 'The two big challenges in parallel computing is getting it correct and getting it to scale, and Ct directly takes aim at both' ..."


Ct stands for C/C++ for throughput computing.

Sunday, February 1, 2009

Poor Scalability on Multicore Supercomputers

Guerrilla grad Paul P. sent me another gem in which Sandia scientists discover that more processor cores don't produce more parallelism for their supercomputer applications:
"16 multicores perform barely as well as 2 for complex applications."

Friday, May 2, 2008

Pay per VPU Application Development

So-called "Cloud Computing" (aka Cluster Computing, aka Grids, aka Utility Computing, etc.) is even making the morning news these days, where it is being presented as having a supercomputer with virtual processor units just a click away on your home PC. I don't know too many home PC users who need a supercomputer, but even if they did and it was readily available, how competitive would it be given the plummeting cost of multicores for PCs?

Ruby on de-Rails?

Here we go ... According to a recent post on TechCrunch, Twitter.com is planning to abandon Ruby on Rails after two years of fighting scalability issues. The candidates for replacing RoR are rumored to be PHP, Java, and/or just plain Ruby. On the other hand, Twitter is categorically denying the rumor saying that they use other development tools besides RoR. This is typical of the kind of argument one can get into over scalability issues when scalability is not analyzed in a QUANTIFIED way.

As I discuss in Chap. 9 of my Perl::PDQ book, the qualitative design rules for incrementally increasing the scalability of distributed apps go something like this:

  1. Move code, e.g., business logic, from the App Servers to the RDBMS backend and repartition the DB tables.
  2. Use load balancers between every tier. This step can accommodate multi-millions of pageviews per day.
  3. Partition users into groups of 100,000 across replicated clusters with partitioned DB tables.

All the big web sites (e.g., eBay.com and EA.com) do this kind of thing. But these rules-of-thumb raise the question: How can I quantify the expected performance improvement for each of these steps? Here, I hear only silence. But there is an answer: the Universal Scalability Law. However, it needs to be generalized to accommodate the concept of homogeneous clustering, and I do just that in Section 4.6 of my GCaP book.

The following slides (from 2001) give the basic idea from the standpoint of hardware scalability.


Think of each box as an SMP containing p processors or a CMP with p cores. These processors are connected by a local bus, e.g., a shared-memory bus; the intra-node bus. Therefore, we can apply the Universal Scalability model as usual, keeping in mind that the two model parameters refer to local effects only. The data for determining those parameters via regression could come from workload simulation tools like LoadRunner. To quantify what happens in a cluster with k nodes, an equivalent set of measurements has to be made using the global interconnect between cluster nodes; the inter-node bus. Applying the same statistical regression technique to those data gives a corresponding pair of parameters for global scalability.
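
As an illustration of what that regression looks like in practice, here is a small sketch (mine; the throughput numbers are invented, not LoadRunner output) that recovers the two intra-node USL parameters by the usual linearizing transformation of the measured capacity ratios.

/* Sketch (not from the post or book) of estimating the two USL
 * parameters, alpha (contention) and beta (coherency), from measured
 * throughput X(p) at p processors. The data below are made-up
 * illustrative numbers.
 *
 * With C(p) = X(p)/X(1), the identity
 *   p/C(p) - 1 = alpha*(p-1) + beta*p*(p-1)
 * reduces the fit to linear least squares in two unknowns. */
#include <stdio.h>

int main(void)
{
    const int    p[] = { 1, 2, 4, 8, 16, 32 };                    /* CPU counts  */
    const double X[] = { 20.0, 39.0, 75.0, 130.0, 195.0, 248.0 }; /* throughputs */
    const int n = sizeof p / sizeof p[0];

    double s11 = 0, s12 = 0, s22 = 0, s1y = 0, s2y = 0;
    for (int i = 1; i < n; i++) {                   /* skip p = 1 (y = 0)  */
        double cap = X[i] / X[0];                   /* relative capacity   */
        double y   = p[i] / cap - 1.0;              /* deviation from linear */
        double x1  = p[i] - 1.0;
        double x2  = (double)p[i] * (p[i] - 1.0);
        s11 += x1 * x1;  s12 += x1 * x2;  s22 += x2 * x2;
        s1y += x1 * y;   s2y += x2 * y;
    }
    /* Solve the 2x2 normal equations for alpha and beta. */
    double det   = s11 * s22 - s12 * s12;
    double alpha = (s1y * s22 - s2y * s12) / det;
    double beta  = (s11 * s2y - s12 * s1y) / det;

    printf("alpha (contention) = %.4f\n", alpha);
    printf("beta  (coherency)  = %.4f\n", beta);
    return 0;
}

The same few lines, fed with throughput measured across the inter-node bus instead, would yield the corresponding pair of global parameters.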


The generalized scalability law for clusters is shown in the 5th slide. If (in some perfect world) all the overheads were exactly zero, then the clusters would scale linearly (slide 6). However, things get more interesting in the real world because the scaling curves can cross each other in unintuitive ways. For example, slide 7 "CASE 4" shows the case where the level of contention is less in the global bus than it is in the local bus, but the (in)coherency is greater in the global bus than in the local bus. This is the kind of effect one might see with poor DB table partitioning causing more data than anticipated to be shared across the global bus. And it's precisely because it is so unintuitive that we need to do the quantitative analysis.

I intend to modify these slides to show how things scale with a variable number of users (i.e., software scalability) on a fixed hardware configuration per cluster node and present it in the upcoming GCaP class. If you have performance data for apps running on clusters of this type, I would be interested in throwing my scalability model at it.

Sunday, April 27, 2008

Don Knuth on Open Source, Multicores and Literate Programming

Donald Knuth, the father of TeX and author of the long unfinished multi-volume set of books entitled "The Art of Computer Programming", sounds off in this interesting interview.

His comment on unit testing (or an overzealous reliance on it) seems a bit obscured here. I am certainly more inclined to concur with his other quote: "premature optimization is the root of all evil," because unit testing tends to promote that bad strategy with respect to system performance. Some personal war stories describing exactly that will be delivered this week in my Guerrilla Boot Camp class.

However, do read and heed what he says about multicores and multithreaded programming. It should sound familiar. BTW, where he apparently says "Titanium", he means Itanium.

His riff on "literate programming" is a bit of a yawn for me because we had that capability at Xerox PARC, 25 years ago. Effectively, you wrote computer code (Mesa/Cedar in that case) using the same WYSIWYG editor that you used for writing standard, fully-formatted documentation. The compiler didn't care. This encouraged very readable commentary within programs. In fact, you often had to look twice to decide if you were looking at a static document or dynamic program source. As I understand it, Knuth's worthy objective is to have this capability available for other languages. The best example that I am aware of today, that comes closest to what we had at Xerox, is the Mathematica notebook. You can also use Mathematica Player to view any example Mathematica notebooks without purchasing the Mathematica product.

Tuesday, March 25, 2008

Hickory, Dickory, Dock. The Mouse Just Sniffed at the Clock

Following the arrival of Penryn on schedule, Intel has now announced its "tock" phase (Nehalem) successor to the current 45 nm "tick" phase (Penryn). This is all about more cores (up to 8) and the return of 2-threads per core (SMT), not increasing clock speeds. That game does seem to be over, for example:

  • Tick: Penryn XE, Core 2 Extreme X9000 45 nm @ 2.8 GHz
  • Tock: Bloomfield, Nehalem micro-architecture 45 nm @ 3.0 GHz

Note that Sun is already shipping 8 cores × 8 threads = 64 VPUs @ 1.4 GHz in its UltraSPARC T2.

Nehalem also signals the replacement of Intel's aging frontside bus architecture by its new QuickPath chip-to-chip interconnect; the long overdue competitor to AMD’s HyperTransport bus. A 4-core Nehalem processor will have three DDR3 channels and four QPI links.

What about performance benchmarks besides those previously mentioned? I have no patience with bogus SPECxx_rate benchmarks which simply run multiple instances of a single-threaded benchmark. Customers should be demanding that vendors run the SPEC SDM to get a more reasonable assessment of scalability. The TPC-C benchmark results are perhaps a little more revealing. Here's a sample:

  • An HP ProLiant DL380 G5 server 3.16GHz
    2 CPU × 4 cores × 1 threads/core = 8 VPU
    Pulled 273,666 tpmC on Oracle Enterprise Linux running Oracle 10g RDBMS (11/09/07)

  • HP ProLiant ML370G5 Intel X5460 3.16GHz
    2 CPU × 4 cores × 1 threads/core = 8 VPU
    Pulled 275,149 tpmC running SQL Server 2005 on Windows Server 2003 O/S (01/07/08)

  • IBM eServer xSeries 460 4P
    Intel Dual-Core Xeon Processor 7040 - 3.0 GHz
    4 CPU × 2 cores × 2 threads/core = 16 VPU
    Pulled 273,520 tpmC running DB2 on Windows Server 2003 O/S (05/01/06)

Roughly speaking, within this grouping, the 8-way Penryn TPC-C performance now matches a 16-way Xeon of 2 years ago. Note that the TPC-C Top Ten results, headed up by the HP Integrity Superdome-Itanium2/1.6GHz at 64 CPUs × 2 cores × 2 threads/core = 256 VPUs, are in the 1-4 million tpmC range.

The next step down is from 45 nm to 32 nm technology (code named Westmere), currently slated for 2009. Can't accuse Intel of not being aggressive.

Saturday, January 12, 2008

Erlang Explained

During the GCaP class last November, I mentioned to my students that I was working on a more intuitive derivation of the Erlang B and C formulae (the English wiki link to Erlang-C is wrong) associated with the multi-server or M/M/m queue. This queue is very important because it provides a simple performance model of multiprocessor or multicore systems, amongst other things. When I say simple here, I mean the queueing model is "simple" relative to the complexities of a real multiprocessor (e.g., it does not include the effects of computational overhead). On the other hand, the mathematics of this particular queueing model is not simple. Not only do we need to resort to using some kind of computational aid to solve it (some people have built careers around finding efficient algorithms), but the formulae themselves are completely unintuitive. During that class I thought I was pretty close to changing all that; to the point where I thought I might be in a position to present it before the class finished. I was wrong. It took a little bit longer than a week. :-)
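
For reference, the formulae in question are the standard ones: with offered load $a = \lambda/\mu$, per-server utilization $\rho = a/m$ and $m$ servers, \begin{equation} B(m, a) = \dfrac{a^m / m!}{\sum_{k=0}^{m} a^k / k!}, \qquad C(m, a) = \dfrac{B(m, a)}{1 - \rho\,[1 - B(m, a)]} \end{equation} where Erlang B is the probability that an arrival is blocked in the lossy M/M/m/m case and Erlang C is the probability that an arrival has to wait in the M/M/m case. The "computational aid" usually amounts to the numerically stable recurrence $B(k) = a B(k-1) / \big(k + a B(k-1)\big)$ with $B(0) = 1$, since evaluating the factorials directly overflows for large $m$. None of which, admittedly, makes the formulae any more intuitive, which is the point.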

Wednesday, April 18, 2007

How Long Should My Queue Be?

A simple question; there should be a simple answer, right? Guerrilla alumnus Sudarsan Kannan asked me if a rule-of-thumb could be constructed for quantitatively assessing the load average on both dual-core and multicore platforms. He had seen various remarks, from time to time, alluding to optimal load averages.
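
One way to start framing such a rule (my framing here, not the punch line of the post) is to note that the load average is essentially a damped estimate of the run-queue length, so the natural M/M/m yardstick for an m-way box is the mean number in system, \begin{equation} Q = m\rho + C(m, \rho)\,\dfrac{\rho}{1 - \rho} \end{equation} where $\rho$ is the per-core utilization and $C(m, \rho)$ is the Erlang-C function. Whatever the "optimal" load average turns out to be, it clearly has to scale with the number of cores $m$.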

Tuesday, April 17, 2007

More On Penryn

In a previous blog entry, I noted that Intel was planning to release "penryn" in the final quarter of this year (2007). During a conference call Monday morning, Intel executives provided an overview of the more than twenty new products and initiatives being announced later today at the Intel Developer Forum in Beijing, including new performance specs for the company's next generation Penryn processor family.

Intel said that early Penryn performance tests show a 15 percent increase in imaging related applications, a 25 percent performance increase for 3D rendering, and a more than 40 percent performance increase for gaming. The tests, according to Maloney, were based on pre-production 45 nm Intel quad core processors running at 3.33 GHz with a 1333 MHz front side bus and 12 MB cache versus a 2.93 GHz Intel Core 2 Extreme (QX6800) processor, just announced last week. Intel said that for high-performance computing, users can expect gains of up to 45 percent for bandwidth intensive applications, and a 25 percent increase for servers using Java. Those tests were based on 45 nm Xeon processors with 1,600 MHz front side buses for workstations and HPCs, and a 1,333 MHz front side bus for servers, versus current quad-core X5355 processors, the company said.

During the call, Intel execs also took the opportunity to reveal a few more details on Project Larrabee, a new "highly parallel, IA-based programmable" architecture that the company says it is now designing products around. While details were scant, Maloney did say that the architecture is designed to scale to trillions of floating point operations per second (teraflops) of performance and will include enhancements to accelerate applications such as scientific computing, recognition, mining, synthesis, visualization, financial analytics, and health applications.

Monday, April 16, 2007

Forget Multicores, Think Speckles

Prototypes of the so-called "speckled computers" will be presented at the forthcoming Edinburgh International Science Festival.

"Speckled Computing offers a radically new concept in information technology that has the potential to revolutionise the way we communicate and exchange information. Specks will be around 1 mm3 semiconductor grains; that's about the size of a matchhead, that can sense and compute locally and communicate wirelessly. Each speck will be autonomous, with its own captive, renewable energy source. Thousands of specks, scattered or sprayed on the person or surfaces, will collaborate as programmable computational networks called Specknets.

Computing with Specknets will enable linkages between the material and digital worlds with a finer degree of spatial and temporal resolution than hitherto possible; this will be both fundamental and enabling to the goal of truly ubiquitous computing.

Speckled Computing is the culmination of a greater trend. As the once-separate worlds of computing and wireless communications collide, a new class of information appliances will emerge. Where once they stood proud – the PDA bulging in the pocket, or the mobile phone nestling in one’s palm, the post-modern equivalent might not be explicit after all. Rather, data sensing and information processing capabilities will fragment and disappear into everyday objects and the living environment. At present there are sharp dislocations in information processing capability – the computer on a desk, the PDA/laptop, mobile phone, smart cards and smart appliances. In our vision of Speckled Computing, the sensing and processing of information will be highly diffused – the person, the artefacts and the surrounding space, become, at the same time, computational resources and interfaces to those resources. Surfaces, walls, floors, ceilings, articles, and clothes, when sprayed or “speckled” with specks will be invested with a “computational aura” and sensitised post hoc as props for rich interactions with the computational resources."

I have absolutely no idea what that last sentence means in English, but it sounds like an interesting research goal.

Tuesday, February 27, 2007

Moore's Law II: More or Less?

For the past few years, Intel, AMD, IBM, Sun, et al., have been promoting the concept of multicores i.e., no more single CPUs. A month ago, however, Intel and IBM made a joint announcement that they will produce single CPU parts using 45 nanometer (nm) technology. Intel says it is converting all its fab lines and will produce 45 nm parts (code named "penryn") by the end of this year. What's going on here?



We fell off the Moore's law curve, not because photolithography collided with limitations due to quantum physics or anything else exotic, but more mundanely because it ran into a largely unanticipated thermodynamic barrier. In other words, Moore's law was stopped dead in its tracks by old-fashioned 19th century physics.
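
To put one equation on that barrier (my gloss): the dominant term is dynamic switching power, \begin{equation} P_{dyn} \approx a\, C V^2 f \end{equation} where $a$ is the switching activity factor, $C$ the switched capacitance, $V$ the supply voltage and $f$ the clock frequency. Because pushing $f$ higher has historically also required a higher $V$, power density grows much faster than linearly with clock speed, and getting the resulting heat out of the package, rather than drawing finer lines, became the binding constraint. Hence: more cores at modest clocks instead of one core at 10 GHz.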