## Friday, February 4, 2011

### USL Fine Point: Sub-Amdahl Scalability

As discussed in Chapter 4 of my GCaP book, Amdahl's law is defined by a single parameter called the serial fraction, denoted by the symbol α and signifying the proportion of the total workload (W) that is serialized during execution. From the standpoint of parallel processing (where reference to Amdahl's law is most frequent) serialization means that portion of the workload can only execute on a single processor out of N parallel processors. The parallel speedup or relative capacity CA(N) performance metric is given by: $$C_A(N) = \frac{N}{1 + \alpha \, (N-1)} \tag{1}$$ If there is no serialization in the workload, i.e., α = 0, then CA(N) = N, which signifies that the workload scales linearly with the number of physical processors. The important observation made by Gene Amdahl (more than 40 years ago) is that even if α is relatively small, viz., a few percent of the execution time, scalability cannot continue to increase linearly. For example, if α = 5%, then CA(N) will eventually reach a scalability ceiling given by 20 effective processors (1/α), even if there are hundreds of physical processors available in the system.
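The ceiling is easy to check numerically. Here is a minimal Python sketch of eqn.(1) (the function name is mine, not from GCaP):

```python
def amdahl(N, alpha):
    """Relative capacity C_A(N) under Amdahl's law, eqn.(1)."""
    return N / (1 + alpha * (N - 1))

alpha = 0.05  # 5% serial fraction
for N in (1, 10, 100, 1000):
    print(N, amdahl(N, alpha))
# As N grows, C_A(N) approaches, but never exceeds, the ceiling 1/alpha = 20.
```

Even at N = 1000 physical processors, the effective capacity remains below 20.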

An assumption in eqn.(1) is that the parallel fraction (1 − α) of the workload can be equipartitioned across all N processors. In other words, each of the N processors is assumed to run 100% busy for exactly the same elapsed time (TN). As you might imagine, partitioning a workload into identical subtasks is not always feasible. What happens to scalability CA(N) in that case?

Fig. 1. Distribution of processor activity

Figure 1 shows a workload that utilizes varying numbers of processors during the course of its execution. Let's take the time unit to be seconds. It uses a single processor between time t = 2 and 4 seconds as well as between t = 24 and 27 seconds; a total time of 5 seconds. In between, the workload executes in parallel for an elapsed time of TN = (27 − 2) − 5 = 20 seconds using up to N = 8 processors, even if more than 8 physical processors are actually available. Moreover, unlike the Amdahl scaling assumption, the time for which the work executes on a particular number of processors (i.e., the width of each column) also varies.

Fig. 2. Average degree of parallelism

The total area under all the columns in Figure 1 corresponds to the work W = 93 cpu-seconds. Although these columns represent a distribution of work, we can summarize this execution profile using a single metric called the average parallelism (denoted by A), shown as the height of the red rectangle in Figure 2. This rectangle has the same width as the total execution time (T = 25 seconds) and the same area as that under all the blue columns in Figure 1. Hence, the average number of active processors A = 93/25 = 3.72. In other words, this particular execution profile cannot make efficient use of more than about p = 4 physical processors, on average.
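The same calculation can be sketched in Python. The profile below is hypothetical (the actual column widths in Figure 1 are not reproduced here), but it is constructed to match the totals quoted above: W = 93 cpu-seconds, T = 25 seconds, 5 seconds on a single processor, and at most 8 processors in use:

```python
# Hypothetical execution profile: (processors_in_use, duration_in_seconds).
# Chosen to be consistent with the totals in the text, not with the exact
# column widths of Figure 1.
profile = [(1, 5), (2, 5), (3, 4), (4, 4), (6, 3), (8, 4)]

T = sum(d for _, d in profile)      # total elapsed time (width of the red rectangle)
W = sum(p * d for p, d in profile)  # work in cpu-seconds (area under the columns)
A = W / T                           # average parallelism

print(T, W, A)  # 25 93 3.72
```

The height of the red rectangle in Figure 2 is just the area divided by the width, whatever the particular shape of the columns.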

Since the parallel execution time TN in Figure 2 is longer than the Amdahl case, the performance must be inferior to Amdahl's law. Let's call it sub-Amdahl scaling. The relationship between Amdahl scaling and sub-Amdahl scaling is shown in Figure 3, where the dashed line corresponds to the average parallelism A. Whereas Amdahl scaling has an upper bound at 1/α (horizontal line not shown) due to maximal parallelism, sub-Amdahl scaling has an upper bound at A (lower dashed horizontal line) due to average parallelism.

Fig. 3. Sub-Amdahl scalability measurements

It's amusing to note that I discussed the concept of average parallelism in The Practical Performance Analyst, more than a decade ago, when massively parallel machines were making a comeback for commercial (rather than scientific) applications. I stopped talking about it because that trend seemed to disappear.

What is the importance of all this for the USL today? According to Chapter 6 of GCaP, a simple symmetry argument tells us that Amdahl's law can also be applied to software scalability, and similarly for the universal scalability law (USL) given by eqn.(2). Instead of N processors, we consider N processes. $$C(N) = \frac{N}{1 + \alpha \, (N-1) + \beta \, N (N-1)} \tag{2}$$ where the coefficient α is taken to represent the degree of contention in the application and β the amount of coherency delay, i.e., thrashing-like degradation.
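Eqn.(2) can be sketched alongside eqn.(1) in a few lines of Python. With β = 0 the USL reduces to Amdahl's law; with β > 0 the extra N(N − 1) term eventually makes the capacity retrograde (illustrative parameter values are mine):

```python
def usl(N, alpha, beta):
    """Universal scalability law C(N), eqn.(2); beta = 0 recovers Amdahl's law."""
    return N / (1 + alpha * (N - 1) + beta * N * (N - 1))

# Compare contention-only scaling (beta = 0) with a small coherency term.
for N in (1, 8, 32, 128):
    print(N, usl(N, 0.05, 0.0), usl(N, 0.05, 0.001))
# The beta > 0 column rises, peaks, and then degrades at large N,
# while the beta = 0 (Amdahl) column climbs monotonically toward 1/alpha.
```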

Imagine a sequence of load-test measurements where the scalability starts out following Amdahl's law (the top curve in Figure 3). At some load value, however, the performance becomes inferior to the expected Amdahl scaling because the workload partitioning has more execution-time skew than assumed in Amdahl's law. This effect could show up as a point of inflection in the data (dots) with successively higher loads producing CA(N) values that head down toward the sub-Amdahl limit (the lower curve in Figure 3).

If we took those data and tried to fit them to the USL model, the point of inflection would likely cause the coherency coefficient to be non-zero, i.e., β > 0. That's a problem because we would normally expect to interpret a non-zero coherency contribution as coming from delays due to some kind of pairwise interactions. Here, it is due to a sub-Amdahl execution profile (Figure 1), which has nothing to do with coherency delays. The regression analysis would therefore be ambiguous. In statistician-speak, the α and β contributions are confounded in the regression analysis.
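The confounding can be demonstrated with synthetic data. The sketch below caps Amdahl scaling (α = 0.05) at the average-parallelism ceiling A = 3.72, mimicking the dots in Figure 3, then fits the USL by brute-force least squares over a parameter grid (stdlib only; this is a stand-in for a proper regression, not the fitting procedure used in GCaP):

```python
def usl(N, alpha, beta):
    """Universal scalability law, eqn.(2); beta = 0 recovers Amdahl's law."""
    return N / (1 + alpha * (N - 1) + beta * N * (N - 1))

# Synthetic sub-Amdahl data: Amdahl scaling that flattens at A = 3.72.
A = 3.72
data = [(N, min(usl(N, 0.05, 0.0), A)) for N in range(1, 17)]

def sse(alpha, beta):
    """Sum of squared errors of the USL against the synthetic data."""
    return sum((usl(N, alpha, beta) - C) ** 2 for N, C in data)

# Crude grid search: alpha in [0, 0.25] step 0.001, beta in [0, 0.015] step 0.0001.
alpha_fit, beta_fit = min(
    ((a / 1000.0, b / 10000.0) for a in range(251) for b in range(151)),
    key=lambda ab: sse(*ab),
)
print(alpha_fit, beta_fit)  # beta_fit > 0, even though no coherency delay exists
```

The grid includes β = 0, yet the best fit prefers β > 0: the plateau in the data is cheaper to explain with a spurious coherency term than with contention alone, which is exactly the ambiguity described above.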