The Pith of Performance: August 2014

Multicore processors were introduced to an unsuspecting marketplace more than a decade ago, but really became mainstream circa 2005. Multicore was presented as the next big thing in microprocessor technology. No mention of falling off the Moore's law (uniprocessor) curve. A 2007 PR event—held jointly between Intel, IBM and AMD—announced a VLSI fabrication technology that broke through the 65 nm barrier. The high-κ Hafnium gate enabled building smaller transistors at 45 nm feature size and thus, more cores per die. I tracked and analyzed the repercussions of that event in these 2007 blog posts:

Intel met or beat their projected availability schedule (depending on how you count) for Penryn by essentially rebooting their foundries. Very impressive.

In my Guerrilla classes I like to pose the question: Can you procure a 10 GHz microprocessor? On thinking about it (usually for the first time), most people begin to realize that they can't, but they don't know why not. Clearly, clock frequency limitations have an impact on both performance and server-side capacity. Then, I like to point out that programming multicores (since that decision has already been made for you) is much harder than it is for uniprocessors. Moreover, there is not too much in the way of help from compilers and other development tools, at the moment, although that situation will continually improve, presumably. Intel TSX (Transactional Synchronization Extensions) for Haswell multicores offers assistance of that type at the hardware level. In particular, TSX instructions are built into Haswell cores to boost the performance and scalability of certain types of multithreaded applications. But more about that in a minute.

I also like to point out that Intel and other microprocessor vendors (of which there are fewer and fewer due the enormous cost of entry), have little interest in how well your database, web site, or commercial application runs on their multicores. Rather, their goal is to produce the cheapest chip for the largest commodity market, e.g., PCs, laptops, and more recently mobile. Since that's where the profits are, the emphasis is on simplest design, not best design.

Fast, cheap, reliable: pick two.

Server-side performance is usually relegated to low man on the totem pole because of its relatively smaller market share. The implicit notion is that if you want more performance, just add more cores. But that depends on the threadedness of the applications running on those cores. Of course, there can also be side benefits, such as inheriting lower power servers from advances in mobile chip technology.

Source: AnandTech, Inc.

Intel officially announced multicore processors based on the Haswell architecture in 2013. Because scalability analysis can reveal a lot about limitations of the architecture, it's generally difficult to come across any quantitative data in the public domain. In their 2012 marketing build up, however, Intel showed some qualitative scalability characteristics of the Haswell multicore with TSX. See figure above. You can take it as read that these plots are based on actual measurements.

Most significantly, note the classic USL scaling profiles of transaction throughput vs. number of threads. For example, going from coarse-grain locking without TSX (red curve exhibiting retrograde throughput) to coarse-grain locking with TSX (green curve) has reduced the amount of contention (i.e., USL α coefficient). It's hard to say what is the impact of TSX on coherency delay (i.e., USL β coefficient) without being in possession of the actual data. As expected, however, the impact of TSX on fine-grain locking seems to be far more moderate. A 2012 AnandTech review summed things up this way:

TSX will be supported by GCC v4.8, Microsoft's latest Visual Studio 2012, and of course Intel's C compiler v13. GLIBC support (rtm-2.17 branch) is also available. So it looks like the software ecosystem is ready for TSX. The coolest thing about TSX (especially HLE) is that it enables good scaling on our current multi-core CPUs without an enormous investment of time in the fine tuning of locks. In other words, it can give developers "fined grained performance at coarse grained effort" as Intel likes to put it.
In theory, most application developers will not even have to change their code besides linking to a TSX enabled library. Time will tell if unlocking good multi-core scaling will be that easy in most cases. If everything goes according to Intel's plan, TSX could enable a much wider variety of software to take advantage of the steadily increasing core counts inside our servers, desktops, and portables.

With claimed clock frequencies of 4.6 GHz (i.e., nominal 5000 MIPS), Haswell with TSX offers superior performance at the usual price point. That's two. What about reliability? Ah, there's the rub. TSX has been disabled in the current manufacturing schedule due to a design bug.

The Pith of Performance

Wednesday, August 13, 2014

Intel TSX Multicore Scalability in the Wild