Thursday, March 13, 2014

Performance of N+1 Redundancy

How can you determine the performance impact on SLAs after an N+1 redundant hosting configuration fails over? This question came up in the Guerrilla Capacity Planning class this week. It can be addressed by referring to a multi-server queueing model.

N+1 = 4 Redundancy

We begin by considering a small-N configuration of four hosts where the load is distributed equally to each of the hosts. For simplicity, the load distribution is assumed to be performed by some kind of load balancer with a buffer. The idea of N+1 redundancy is that the load balancer ensures all four hosts are equally utilized prior to any failover.

The idea is that none of the hosts should use more than 75% of its available capacity: the blue areas on the left side of Fig. 1. The total consumed capacity is then $4 \times 3/4 = 3$ hosts' worth, or 300% out of the 400% that all 4 hosts could provide. When any single host fails, its lost capacity is compensated by redistributing that same load across the remaining three hosts, each of which then runs 100% busy. As we shall show in the next section, the belief that this arrangement is adequate is a misconception.
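The capacity arithmetic above can be sketched in a few lines of Python. The function name and its parameters are illustrative and not part of the original discussion; the sketch assumes the total load is unchanged by the failure and is spread evenly over the surviving hosts.

```python
from fractions import Fraction

def utilization_after_failover(hosts, rho_before, failed=1):
    """Per-host utilization once `failed` hosts drop out, assuming the
    same total load is spread evenly over the survivors."""
    total_load = hosts * rho_before      # traffic intensity in Erlangs
    return total_load / (hosts - failed)

rho = Fraction(3, 4)                     # each of the 4 hosts is 75% busy
print(4 * rho)                           # total load: 3 hosts' worth (300%)
print(utilization_after_failover(4, rho))  # per-host utilization on 3 hosts
```

Exact fractions are used only to keep the 3/4 figure free of rounding; ordinary floats would work just as well.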

The circles in Fig. 1 represent hosts and rectangles represent incoming requests buffered at the load-balancer. The blue area in the circles signifies the available capacity of a host, whereas white signifies unavailable capacity. When one of the hosts fails, its load must be redistributed across the remaining three hosts. What Fig. 1 doesn't show is the performance impact of this capacity redistribution.

N+1 = 4 Performance

The performance metric of interest is response time as it pertains to service targets expressed in SLAs. To assess the performance impact of failover, we model the N+1 configuration as an M/M/4 queue with per-server utilization constrained to be no more than 75% busy.

When a failover event occurs, the configuration becomes an M/M/3 queue. The corresponding response time curves are shown in Fig. 2. The y-axis is the response time, R, expressed as multiples of the service period, S. A typical scenario is where the SLA (horizontal line) corresponds to maximum or near maximum utilization. The SLA in this case is a mean response time no greater than 1.45 service periods.
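Response-time curves of this kind can be computed from the standard Erlang C formula for an M/M/m queue. The following is a minimal sketch, not the code used to draw the figures; the function names are my own, and the values it prints come from the exact formula, so they need not coincide with numbers read off the plots.

```python
def erlang_c(m, rho):
    """Probability that an arrival must wait in an M/M/m queue
    with per-server utilization rho (0 <= rho < 1)."""
    a = m * rho                      # offered load in Erlangs
    b = 1.0                          # Erlang B with zero servers
    for k in range(1, m + 1):        # standard Erlang B recurrence
        b = a * b / (k + a * b)
    return b / (1.0 - rho * (1.0 - b))

def response_time(m, rho):
    """Mean response time R expressed in service periods S."""
    if rho >= 1.0:
        return float("inf")          # saturated: the queue grows without bound
    return 1.0 + erlang_c(m, rho) / (m * (1.0 - rho))

# Compare the two curves at a few per-server utilizations.
for rho in (0.50, 0.75, 0.90):
    print(f"rho={rho:.2f}  M/M/4: {response_time(4, rho):.3f}  "
          f"M/M/3: {response_time(3, rho):.3f}")
```

At any given utilization the M/M/3 value sits above the M/M/4 value, which is the gap between the two curves in Fig. 2.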

On failover, only three hosts remain available, and the SLA will be exceeded because the utilization of each host will be heading for 100% due to the additional load. (See Figs. 1 and 2.) Correspondingly, this has the effect of pushing the response time very high up the M/M/3 curve. In order to maintain the SLA, the load would have to be reduced so that it corresponds to an even lower utilization than originally anticipated, viz., 68.25% instead of 75%. Fig. 3 shows this effect in more detail.

In practice, proper capacity planning, using M/M/m queueing models like those employed in this discussion, would have revealed that the maximum host utilization should not have exceeded 68.25% busy in the N+1 configuration.
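One way to find such a utilization ceiling numerically is to search for the largest per-server utilization at which the post-failover queue still meets the SLA. The sketch below uses bisection on the exact Erlang C response time. The helper names and the tolerance are assumptions of mine, and the result depends on the model details, so it is not guaranteed to reproduce the 68.25% read from Fig. 3.

```python
def erlang_c(m, rho):
    """Waiting probability for M/M/m at per-server utilization rho."""
    a, b = m * rho, 1.0
    for k in range(1, m + 1):        # Erlang B recurrence
        b = a * b / (k + a * b)
    return b / (1.0 - rho * (1.0 - b))

def response_time(m, rho):
    """Mean response time in service periods (infinite at saturation)."""
    if rho >= 1.0:
        return float("inf")
    return 1.0 + erlang_c(m, rho) / (m * (1.0 - rho))

def max_utilization(m, sla, tol=1e-6):
    """Largest per-server utilization on m hosts with R/S <= sla,
    found by bisection (R/S increases monotonically with utilization)."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if response_time(m, mid) <= sla:
            lo = mid                 # still within the SLA: move up
        else:
            hi = mid                 # SLA violated: move down
    return lo

sla = 1.45                           # SLA from the text, in service periods
print(f"max utilization on 3 hosts: {max_utilization(3, sla):.4f}")
```

Bisection is chosen here only because it needs no external libraries; any one-dimensional root finder would serve.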

Large-N Performance

With a large number of hosts, the difference in response time after failover becomes less significant. This follows from the fact that response-time curves for an M/M/m queue are flatter at high utilizations when the number of servers, m, is large. The effect is illustrated in Fig. 4 for N+1 = 16 hosts. (Cf. Fig. 2.)
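The small-N versus large-N contrast can be checked with a short calculation. The sketch below compares the mean response time before and after losing one host, for 4 and for 16 hosts, at the same starting utilization. The 60% starting utilization is an arbitrary illustrative value, not one taken from the figures, and the total load is assumed to stay constant across the failover.

```python
def erlang_c(m, rho):
    """Waiting probability for M/M/m at per-server utilization rho."""
    a, b = m * rho, 1.0
    for k in range(1, m + 1):        # Erlang B recurrence
        b = a * b / (k + a * b)
    return b / (1.0 - rho * (1.0 - b))

def response_time(m, rho):
    """Mean response time in service periods (infinite at saturation)."""
    if rho >= 1.0:
        return float("inf")
    return 1.0 + erlang_c(m, rho) / (m * (1.0 - rho))

def failover_penalty(hosts, rho_before):
    """Return (R before, R after) when one of `hosts` fails and the
    same total load is carried by the remaining hosts."""
    rho_after = hosts * rho_before / (hosts - 1)
    return response_time(hosts, rho_before), response_time(hosts - 1, rho_after)

rho_before = 0.60                    # illustrative starting utilization
for hosts in (4, 16):
    before, after = failover_penalty(hosts, rho_before)
    print(f"N+1={hosts:2d}: R/S before={before:.3f}  after={after:.3f}")
```

With 16 hosts the loss of one server raises per-server utilization only from 60% to 64%, whereas with 4 hosts it jumps to 80%, so the response-time penalty is far smaller in the larger configuration.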

However, the most common installations are small-N configurations of the type discussed in the previous section. Therefore, preserving your SLA requires capacity planning based on host utilizations that match your SLA targets.


Thanks to the GCaP class participants for doing a group-edit on this post in real time.

6 comments:

harry van der horst said...

A further complication occurs when the redundancy is in the form of virtual machines. If these machines are hosted on the same real hardware, then the real performance can max out even quicker.

Dmitry Agranat said...

I guess this is true for cases like RAC, where you want to guarantee your SLA. I wonder how it can be addressed when working with Cloud in terms of elasticity and auto-scaling.

Brett Allison said...

Thanks for the article. It reminds me of a customer experience I had recently. They saw that our throughput dashboard for the storage controller was green during the middle of the day (peak load), so they assumed they could perform a firmware upgrade in the middle of the day. The firmware upgrade took one controller down (dual-controller) at a time to perform the upgrade. This led to significant performance issues as the remaining controller could not handle the workload. They blamed our software because we had shown "green" or healthy during the peak workload with both controllers running.

Matteo said...

Hello Dr Gunther,
thanks for the article.
I have a question about the conclusions: when you say "the maximum host utilization should not have exceeded 68.25% busy in the N+1 configuration", do you mean that the workload should be reduced to 68.25% of the final configuration, after the failover, i.e., with only N machines active?
If the redundancy must be sized so that the SLA is respected after failover without reducing the workload, then we should first find the maximum allowed utilization in the reduced configuration (68.25% with 3 machines) and then redistribute that load over the full configuration (68.25 × 3/4 ≈ 51% for 4 active machines). Correct?
Thanks
MatteoP

Neil Gunther said...

Let's see if I can address your question this way, Blogger Matteo.

With 3/4 available capacity on each host in the N+1=4 config (as per Fig. 1), you might think you can run each host at a max utilization of ρ_max = 3/4 = 75%. However, it's already clear from Fig. 2 that you would expect to exceed the SLA in the N=3 config, after failover, by simply following the vertical line at ρ = 0.75.

But it's actually worse than that b/c, after failover, 3 hosts have to service (with mean service time S) the same inbound traffic, λ, that was previously being serviced by 4 hosts.

If the per-server utilization was ρ_max = 3/4 = λS/4 in the N+1 config, then the traffic intensity must be λS = 3 (Erlangs) or 300% of the total host capacity. After failover, however, 3 hosts must now service that same 3 Erlangs worth of traffic. The new utilization will become ρ_max = λS/3 = 3/3 = 1. In other words, each of the 3 hosts will be driven to 100% busy and the response time will become unbounded.

To ameliorate this potential disaster, the max per-server utilization should be something substantially lower, viz., ρ_max = 68%, so that the SLA is not exceeded *after* failover: the vertical arrow in Fig. 3. In practice, this would be accomplished by choosing N large enough to match the per-server utilization requirement.

Hopefully, this has helped to clarify rather than mudify.