CA(N) = N / [1 + α (N − 1)]        (1)
If there is no serialization in the workload, i.e., α = 0, then CA(N) = N, which signifies that the workload scales linearly with the number of physical processors. The important observation made by Gene Amdahl (more than 40 years ago) is that even if α is relatively small, viz., a few percent of the execution time, scalability cannot continue to increase linearly. For example, if α = 5%, then CA(N) will eventually reach a scalability ceiling given by 20 effective processors (1/α), even if there are hundreds of physical processors available in the system.
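That ceiling is easy to verify numerically. Here is a minimal sketch (the function name `amdahl` is just an illustrative choice) that evaluates eqn.(1) for α = 5% and shows the speedup creeping toward, but never crossing, the 1/α = 20 limit:

```python
def amdahl(N, alpha):
    """Amdahl scaling, eqn.(1): effective speedup on N processors
    when a fraction alpha of the execution time is serialized."""
    return N / (1 + alpha * (N - 1))

alpha = 0.05
for N in (1, 10, 100, 1000):
    # Speedup approaches the 1/alpha = 20 ceiling as N grows.
    print(N, amdahl(N, alpha))
```

Even with a million physical processors, CA(N) stays fractionally below 20 effective processors.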
An assumption in eqn.(1) is that the parallel fraction (1 − α) of the workload can be equipartitioned across all N processors. In other words, each of the N processors is assumed to run 100% busy for exactly the same elapsed time (TN). As you might imagine, partitioning a workload into identical subtasks is not always feasible. What happens to scalability CA(N) in that case?
Figure 1 shows a workload that utilizes various numbers of processors during the course of its execution. Let's take the time unit to be seconds. It uses a single processor between time t = 2 and 4 seconds as well as between t = 24 and 27 seconds; a total of 5 seconds. In between, the workload executes in parallel for an elapsed time of TN = (27 − 2) − 5 = 20 seconds using up to N = 8 processors, even if more than 8 physical processors are actually available. Moreover, unlike the Amdahl scaling assumption, the time for which the work executes on a particular number of processors (i.e., the width of each column) also varies.
The total area under all the columns in Figure 1 corresponds to the work W = 93 cpu-seconds. Although these columns represent a distribution of work, we can summarize this execution profile using a single metric called the average parallelism (denoted by A), shown as the height of the red rectangle in Figure 2. This rectangle has the same width as the total execution time (T = 25 seconds) and the same area as that under all the blue columns in Figure 1. Hence, the average number of active processors A = 93/25 = 3.72. In other words, this particular execution profile cannot make efficient use of more than about p = 4 physical processors, on average.
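The bookkeeping behind A = W/T is easy to mimic. The profile below is hypothetical: the individual segment durations and processor counts are invented for illustration, but they are chosen so the totals agree with the text (5 seconds of single-processor work, 20 seconds of parallel work on up to 8 processors, T = 25 seconds, W = 93 cpu-seconds):

```python
# Hypothetical execution profile: (duration_in_seconds, active_processors).
# The segments are illustrative, not the actual Figure 1 data; only the
# totals are made to match the text.
profile = [(2, 1), (3, 8), (4, 6), (5, 4), (4, 3), (4, 2), (3, 1)]

T = sum(dt for dt, _ in profile)      # total execution time (width of rectangle)
W = sum(dt * p for dt, p in profile)  # total work = area under the columns
A = W / T                             # average parallelism (height of rectangle)
print(T, W, A)                        # 25 93 3.72
```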
Since the parallel execution time TN in Figure 2 is longer than in the Amdahl case, the performance must be inferior to Amdahl's law. Let's call it sub-Amdahl scaling. The relationship between Amdahl scaling and sub-Amdahl scaling is shown in Figure 3, where the dashed line corresponds to the average parallelism A. Whereas Amdahl scaling has an upper bound at 1/α (horizontal line not shown) due to maximal parallelism, sub-Amdahl scaling has an upper bound at A (lower dashed horizontal line) due to average parallelism.
It's amusing to note that I discussed the concept of average parallelism in The Practical Performance Analyst, more than a decade ago, when massively parallel machines were making a comeback for commercial (rather than scientific) applications. I stopped talking about it because that trend seemed to disappear.
What is the importance of all this for the USL today? According to Chapter 6 of GCaP, a simple symmetry argument tells us that Amdahl's law can also be applied to software scalability, and similarly for the universal scalability law (USL) given by eqn.(2). Instead of N processors, we consider N processes.
C(N) = N / [1 + α (N − 1) + β N (N − 1)]        (2)
where the coefficient α is taken to represent the degree of contention in the application and β the amount of coherency delay, i.e., thrashing-like degradation.
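A direct transcription of eqn.(2), again as a sketch: with β = 0 it reduces to Amdahl's law, and for β > 0 the curve has a maximum near N* = √((1 − α)/β) (the standard USL result from GCaP), beyond which scalability is retrograde. The particular α and β values below are arbitrary, chosen only to make the peak visible:

```python
import math

def usl(N, alpha, beta):
    """Universal Scalability Law, eqn.(2): relative capacity at load N,
    with contention coefficient alpha and coherency coefficient beta."""
    return N / (1 + alpha * (N - 1) + beta * N * (N - 1))

alpha, beta = 0.05, 0.02                 # illustrative values only
n_star = math.sqrt((1 - alpha) / beta)   # peak concurrency, ~6.9 here
print(n_star, usl(7, alpha, beta), usl(20, alpha, beta))
```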
Imagine a sequence of load-test measurements where the scalability starts out following Amdahl's law (the top curve in Figure 3). At some load value, however, the performance becomes inferior to the expected Amdahl scaling because the workload partitioning has more execution-time skew than assumed in Amdahl's law. This effect could show up as a point of inflection in the data (dots) with successively higher loads producing CA(N) values that head down toward the sub-Amdahl limit (the lower curve in Figure 3).
If we took those data and tried to fit them to the USL model, the point of inflection would likely cause the coherency coefficient to be non-zero, i.e., β > 0. That's a problem because we would normally expect to interpret a non-zero coherency contribution as coming from delays due to some kind of pairwise interactions. Here, it is due to a sub-Amdahl execution profile (Figure 1), which has nothing to do with coherency delays. The regression analysis would therefore be ambiguous. In statistician-speak, the α and β contributions are confounded in the regression analysis.
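The confounding can be made concrete with a small synthetic experiment. Everything in it is invented for illustration: the saturating curve A·(1 − (1 − 1/A)^N) is merely one convenient way to fake a sub-Amdahl profile with average parallelism A, and the crude grid search stands in for a proper regression tool. Fitting eqn.(2) to data that simply saturate at A still pushes the fitted β above zero, even though no pairwise coherency delays exist in the "system" that generated the data:

```python
# Synthetic sub-Amdahl measurements: a curve that saturates at the average
# parallelism A = 4, with no coherency effects whatsoever.
A = 4.0
loads = [1, 2, 4, 8, 16, 32]
data = [A * (1 - (1 - 1 / A) ** N) for N in loads]  # illustrative, not real data

def usl(N, alpha, beta):
    """Universal Scalability Law, eqn.(2)."""
    return N / (1 + alpha * (N - 1) + beta * N * (N - 1))

# Crude grid-search least-squares fit (a stand-in for real regression tools).
best = None
for a_step in range(0, 101):            # alpha in [0, 0.5]
    for b_step in range(0, 41):         # beta in [0, 0.02]
        alpha, beta = a_step * 0.005, b_step * 0.0005
        sse = sum((usl(N, alpha, beta) - C) ** 2 for N, C in zip(loads, data))
        if best is None or sse < best[0]:
            best = (sse, alpha, beta)

sse, alpha_hat, beta_hat = best
# beta_hat comes out positive despite the absence of any coherency delay.
print(alpha_hat, beta_hat)
```

In other words, the regression happily "explains" the flattening toward A as coherency degradation, which is exactly the ambiguity described above.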
Although sub-Amdahl scaling is a real possibility, especially in the case of fine-grained parallelism, it can very likely be disambiguated by measuring at somewhat higher loads, as shown in Figure 4. At higher loads, sub-Amdahl data should become squeezed between the lower curve and the asymptote at A (the dashed line). This might even show up as a second point of inflection. To be sure, we would still not know whether we were measuring true Amdahl scaling or sub-Amdahl scaling; we would only know the maximal degree of parallelism due to either α or A.
Conversely, we would not expect to see this behavior if β > 0 were due to coherency delays. Coherency delays would simply get worse at higher loads (beyond the first point of inflection) and the measurements would therefore be expected to continue heading down toward the alternative asymptote, otherwise known as the x-axis. And, of course, it's possible that both sub-Amdahl and coherency effects could be at play in a given application. In the commercial IT world, however, we generally don't have the necessary instrumentation or tools to discern such fine-grained effects.
This is pretty subtle stuff—a fine point—but it's a good reminder that when doing statistical regression with the USL model, we can't be too categorical about assigning causes to the fitted values of the α and β coefficients.