The Pith of Performance: Moore's law

Showing posts with label Moore's law. Show all posts

Wednesday, August 13, 2014

Intel TSX Multicore Scalability in the Wild

Multicore processors were introduced to an unsuspecting marketplace more than a decade ago, but really became mainstream circa 2005. Multicore was presented as the next big thing in microprocessor technology. No mention of falling off the Moore's law (uniprocessor) curve. A 2007 PR event—held jointly between Intel, IBM and AMD—announced a VLSI fabrication technology that broke through the 65 nm barrier. The high-κ Hafnium gate enabled building smaller transistors at 45 nm feature size and thus, more cores per die. I tracked and analyzed the repercussions of that event in these 2007 blog posts:

Intel met or beat their projected availability schedule (depending on how you count) for Penryn by essentially rebooting their foundries. Very impressive.

In my Guerrilla classes I like to pose the question: Can you procure a 10 GHz microprocessor? On thinking about it (usually for the first time), most people begin to realize that they can't, but they don't know why not. Clearly, clock frequency limitations have an impact on both performance and server-side capacity. Then, I like to point out that programming multicores (since that decision has already been made for you) is much harder than it is for uniprocessors. Moreover, there is not too much in the way of help from compilers and other development tools, at the moment, although that situation will continually improve, presumably. Intel TSX (Transactional Synchronization Extensions) for Haswell multicores offers assistance of that type at the hardware level. In particular, TSX instructions are built into Haswell cores to boost the performance and scalability of certain types of multithreaded applications. But more about that in a minute.

I also like to point out that Intel and other microprocessor vendors (of which there are fewer and fewer due the enormous cost of entry), have little interest in how well your database, web site, or commercial application runs on their multicores. Rather, their goal is to produce the cheapest chip for the largest commodity market, e.g., PCs, laptops, and more recently mobile. Since that's where the profits are, the emphasis is on simplest design, not best design.

Fast, cheap, reliable: pick two.

Server-side performance is usually relegated to low man on the totem pole because of its relatively smaller market share. The implicit notion is that if you want more performance, just add more cores. But that depends on the threadedness of the applications running on those cores. Of course, there can also be side benefits, such as inheriting lower power servers from advances in mobile chip technology.

Source: AnandTech, Inc.

Intel officially announced multicore processors based on the Haswell architecture in 2013. Because scalability analysis can reveal a lot about limitations of the architecture, it's generally difficult to come across any quantitative data in the public domain. In their 2012 marketing build up, however, Intel showed some qualitative scalability characteristics of the Haswell multicore with TSX. See figure above. You can take it as read that these plots are based on actual measurements.

Most significantly, note the classic USL scaling profiles of transaction throughput vs. number of threads. For example, going from coarse-grain locking without TSX (red curve exhibiting retrograde throughput) to coarse-grain locking with TSX (green curve) has reduced the amount of contention (i.e., USL α coefficient). It's hard to say what is the impact of TSX on coherency delay (i.e., USL β coefficient) without being in possession of the actual data. As expected, however, the impact of TSX on fine-grain locking seems to be far more moderate. A 2012 AnandTech review summed things up this way:

TSX will be supported by GCC v4.8, Microsoft's latest Visual Studio 2012, and of course Intel's C compiler v13. GLIBC support (rtm-2.17 branch) is also available. So it looks like the software ecosystem is ready for TSX. The coolest thing about TSX (especially HLE) is that it enables good scaling on our current multi-core CPUs without an enormous investment of time in the fine tuning of locks. In other words, it can give developers "fined grained performance at coarse grained effort" as Intel likes to put it.
In theory, most application developers will not even have to change their code besides linking to a TSX enabled library. Time will tell if unlocking good multi-core scaling will be that easy in most cases. If everything goes according to Intel's plan, TSX could enable a much wider variety of software to take advantage of the steadily increasing core counts inside our servers, desktops, and portables.

With claimed clock frequencies of 4.6 GHz (i.e., nominal 5000 MIPS), Haswell with TSX offers superior performance at the usual price point. That's two. What about reliability? Ah, there's the rub. TSX has been disabled in the current manufacturing schedule due to a design bug.

Monday, October 29, 2007

Folsom: Not Just a Prison But A Cache

A nice update to my previous posts about the Intel Penryn microprocessor:

appears on a Dutch blog (in English---damn good English, BTW). The blogger was apparently invited to Intel's geographical home for the development of Penryn; not HQ in Santa Clara, California but Folsom (pronounced: 'full sum'), California. Consistent with Intel's January 2007 announcement, he notes that November looks to be show time for their 45 nm technology.

Since the author was a visitor, he failed to appreciate certain local ironies in his report. He missed was the fact that Penryn is a small town due north of Folsom, just off Interstate 80 on the way to Lake Tahoe. He refers to the huge Intel campus at the edge of the town. At the other end of town is an even better known campus; one of the state's major prisons immortalized in this Johnny Cash (not Cache) song. So, not only are criminals cached there but so also are some of Intel's best microprocessor designers (not as an intended punishment for the latter, presumably). OK, I'll stop there because I'm having way too much fun with this. Read the blog.

Thursday, June 28, 2007

Unnatural Numbers

What is the next number on the sequence: 180, 90, 45, ... ? Since each number appears to be obtained by halving the previous number in the sequence, it should be 22.5. This shows that even if you start randomly with a large natural number (which is not a power of 2) and you keep halving the successive values, you will reach an odd natural number and when you halve that, you end up with a fraction (a real number). Here's a little Python code that demonstrates the point:

#!/usr/bin/python
import random

r = random.randrange(100,1000,2)
print r

while True:
r /= 2
print r 
if r % 2 != 0: break

Run it several times and you will see that the sequence, although generally not long, contains an unpredictable number of iterates. This is easily understood by noting that division by 2 is equivalent to a bitwise right-shift (i.e., >> operator in C, Python and Perl), so each successive division simply marches the bit-pattern to the right until a '1' appears in the unit column. That defines an odd number and another right-shift produces '1' as a remainder, which I can either ignore as with integer division or carry as a fraction. It also means that you can predict when you will hit an odd number by first representing the starting number as binary digits, then counting the number of zeros from the rightmost position to the first occurrence of a 1-digit.

In the manufacture of microprocessors and memory chips, the increase in speed follows Moore's law, which is also a halving sequence like the above. Therefore, we see natural numbers like 90 nanometer (nm), 45 nm, and so on. For reference, the diameter of the HIV virus (a molecular machine) is about 100 nm. As I've discussed elsewhere, 45 nm is the next-generation technology that IBM, Intel and AMD will be using to fabricate their microprocessors. But the current smallest technology being used by AMD and Intel is not 90 nm, it's 65 nm. And the next generation after 45 nm will be 32 nm, not 20-something. So, where do these 'unnatural' numbers come from?

Each progression in shrinking the silicon-based photolithographic process is identified by a waypoint or node on a roadmap defined by the Semiconductor Industry Association or SIA. The current SIA roadmap looks like this:

Year  2004    2007    2010    2013    2016    2020 
nm     90      65      45      32      22      14

A technology node is defined primarily by the minimum metal pitch used on any product. Here, pitch refers to the spacing between the metal "wires" or interconnects. In a microprocessor, for example, the minimum metal pitch is the half-pitch of the first layer of metal used to connect the actual terminals of one transistor to another. Notice that IBM and Intel technology is moving faster than SIA expectations, so the roadmap already needs to be updated again.

Wednesday, May 23, 2007

More on Moore

In an opinion piece in this month's CMG MeasureIT e-zine, I examine what is possibly going on behind the joint IBM-Intel announcement and the imminent release of 45 nm 'penryn' parts in CMOS.

Some related blog entries are:

Tuesday, February 27, 2007

Moore's Law II: More or Less?

For the past few years, Intel, AMD, IBM, Sun, et al., have been promoting the concept of multicores i.e., no more single CPUs. A month ago, however, Intel and IBM made a joint announcement that they will produce single CPU parts using 45 nanometer (nm) technology. Intel says it is converting all its fab lines and will produce 45 nm parts (code named "penryn") by the end of this year. What's going on here?

We fell off the Moore's law curve, not because photolithography collided with limitations due to quantum physics or anything else exotic, but more mudanely because it ran into a largely unanticipated thermodynamic barrier. In other words, Moore's law was stopped dead in its tracks by old-fashioned 19th century physics.