Saturday, October 8, 2016

Crib Sheet for Emulating Web Traffic

Our paper, entitled "How to Emulate Web Traffic Using Standard Load Testing Tools," is now available online and will be presented at the upcoming CMG conference in November.

Presenter: James Brady
Session Number: 436
Subject Area: APM
Session Date: WED, November 9, 2016
Session Time: 1:00 PM - 2:00 PM
Session Room: Portofino B

The motivation for this work harks back to a Guerrilla forum in 2014 that essentially centered on the same topic as the title of our paper. It was clear from that discussion that commenters were talking at cross purposes because of misunderstandings on many levels. I had already written a much earlier blog post on the key queue-theoretic concept, viz., holding the $N/Z$ ratio constant as the load $N$ is increased, but I was incapable of describing how that concept should be implemented in a real load-testing environment.

On the other hand, I knew that Jim Brady had presented a similar implementation in his 2012 CMG paper, based on a statistical analysis of the load-generation traffic. There were a few details in Jim's paper that I couldn't quite reconcile but, at the CMG 2015 conference in San Antonio, I suggested that we combine our separate approaches and aim at a definitive work on the subject. After a nine-month gestation (ugh!), this 30-page paper is the result.

Although our paper doesn't contain any new invention, per se, the novelty lies in how we needed to bring together so many disparate and subtle concepts in precisely the correct way to reach a complete and consistent methodology. The complexity of this task was far greater than either of us had imagined at the outset. The hyperlinked Glossary should help with the terminology, but because there are so many interrelated parts, I've put together the following crib notes in an effort to help performance engineers get through it (since they're the ones that most stand to benefit).

  1. Standard load testing tools have a finite number of virtual users
  2. Web traffic is characterized by an indeterminate number of users
  3. Attention is usually focused on the performance of the SUT (system under test)
  4. We focus on the DVR (driver) side performance for web traffic
  5. Examine distribution of arriving requests and their mean rate
  6. Web traffic should be a Poisson process (just like A.K. Erlang used in 1909)
  7. That requires statistically independent arrivals (i.e., no correlations)
  8. We also refer to these as asynchronous requests
  9. Standard virtual users become correlated in the queues of the SUT
  10. We refer to these as synchronous requests
  11. We decouple them by reducing the length of queues in the SUT
  12. This is achieved by increasing think delay $Z$ as the load $N$ is increased (Principle A in the paper)
  13. Traffic then approaches a constant mean rate $\lambda_{rat} = N/Z$ as SUT queues decrease
  14. Check the traffic is indeed Poisson by measuring the coefficient of variation ($CoV$)
  15. Must have $CoV = 1$ for a Poisson process (Principle B in the paper)
Originally, I assumed the paper would be no more than a third of its current length but, try as we might, that was not to be. My only defense is: it's all there; you just need to read it. Apologies in advance but, hopefully, the crib notes will help.

4 comments:

test said...

Very interesting.
Have you thought about issues emulating IoT traffic?

Neil Gunther said...

Interesting question; never even occurred to me.

Can you provide some details on how IoT traffic differs from web traffic?

Neil Gunther said...

FYI: Just saw this on Twitter ... How performance testing the IoT is different.

ks said...

Thanks for the article.

Indeed, I have also seen articles and papers about improving benchmarking and performance testing for IoT. Some others that may be of interest:

• IoTAbench: an Internet of Things Analytics benchmark: http://www.hpl.hp.com/techreports/2014/HPL-2014-75.pdf
• RIoTBench: A Real-time IoT Benchmark for Distributed Stream Processing Platforms: https://arxiv.org/pdf/1701.08530.pdf
• A Model to Evaluate the Performance of IoT Applications http://www.iaeng.org/publication/IMECS2017/IMECS2017_pp147-150.pdf
• TPCx-IoT: http://www.tpc.org/tpc_documents_current_versions/pdf/tpcx-iot_v1.5.x.pdf

For emulating IoT traffic, I believe most are using non-standard tooling to emulate the devices, for example, a cluster running containers that contain the device logic. So, they may indeed test with the actual number of emulated devices rather than simulating load with a combination of virtual users and think time. This may be a difference in approach.

I think a potentially large difference between IoT solutions and web apps is that they:
1. Are messaging based: a data flow of message sources and sinks
2. Implement hot, warm, and cold paths
3. Leverage real-time analytics (CEP, stream processing), in-memory computing (Spark, etc.), indexed storage (Solr, etc.), and MapReduce (Spark, Hadoop)
4. Can be thought of as a data flow from sources to sinks

(1) Event Production -> (2) Event Queueing & Stream Ingestion -> (3) Stream Analytics -> (4) Storage & Batch Analysis -> (5) Presentation and Action

And technologies below:

1) Devices and Gateways
2) Azure Event Hubs, IoT Hub ; or Kafka
3) Azure Stream Analytics; or Spark Streaming
4) Azure Data Lake, CosmosDB, SQL Database, SQL Data Warehouse; or Spark, Hadoop [*]
5) Microsoft Power BI; or Tableau

[*] http://perfdynamics.blogspot.com/2015/03/hadoop-scalability-challenges.html. Big Data is usually part of IoT solution.

With this type of complexity, it seems critical to have the telemetry needed to do performance and scalability analysis.

1. Ideally, you would capture telemetry for each message, with timestamps at key points in the data flow, including a correlation id for end-to-end visibility
2. Or, if that is not possible (because the development has not been done, or the throughput is too high), just capture the metrics needed for performance and scalability analysis:
At important points along the message data flow, capture: time, count, rate. What metrics would you recommend capturing in a messaging system like this? For example, in the (#2) stream ingestion code, this may be a cluster of N VMs mapped 1:1 to partitions in IoT Hub, each processing messages in batches of X messages, then sending them for further processing to (#3) Stream Analytics, or for storage and analytics to (#4) Data