Monday, March 23, 2015

Hadoop Scalability Challenges

Hadoop is hot, not because it necessarily represents cutting-edge technology, but because it is being rapidly adopted by more and more companies as their way into the big data trend. It may be coming to your company sooner than you think.

The Hadoop framework is designed to facilitate the parallel processing of massive amounts of unstructured data. Originally intended to be the basis of Yahoo's search engine, it is now an open source project at Apache. Since Hadoop now has a broad range of corporate users, a number of companies offer commercial implementations of it.

However, certain aspects of Hadoop performance, especially scalability, are not well understood. These include:

  1. So-called flat development scalability
  2. Superlinear scaling performance
  3. New TPC big data benchmark

See "Hadoop Superlinear Scalability: The Perpetual Motion of Parallel Performance" for a more detailed discussion.

To address these questions, I've added a new module on Hadoop performance and capacity management to the Guerrilla Capacity Planning course material, which also covers such topics as:

  • There are only 3 performance metrics you need to know
  • How performance metrics are related to one another
  • How to quantify scalability with the Universal Scalability Law (see the sketch below)
  • IT Infrastructure Library (ITIL) for Guerrillas
  • The Virtualization Spectrum from hyperthreads to hyperservices
  • Hadoop performance and capacity management

The course outline has more details.
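
As a rough illustration of what quantifying scalability with the Universal Scalability Law looks like, the law models relative capacity as C(N) = N / (1 + α(N − 1) + βN(N − 1)), where α is the contention (serialization) penalty and β is the coherency (crosstalk) penalty. The short Python sketch below plugs in made-up coefficient values, not measured Hadoop numbers, simply to show how throughput rolls off as a cluster grows; in practice, α and β are estimated by fitting the model to measured throughput data.

  # Universal Scalability Law: C(N) = N / (1 + alpha*(N-1) + beta*N*(N-1))
  # alpha = contention (serialization) penalty, beta = coherency (crosstalk) penalty.
  # Coefficient values here are purely illustrative, not measured Hadoop data.

  def usl_capacity(n, alpha, beta):
      """Relative throughput at n workers, normalized to a single worker."""
      return n / (1 + alpha * (n - 1) + beta * n * (n - 1))

  if __name__ == "__main__":
      alpha, beta = 0.02, 0.0001   # hypothetical fitted coefficients
      for n in (1, 2, 4, 8, 16, 32, 64, 128):
          print("N = %3d   relative capacity = %7.2f" % (n, usl_capacity(n, alpha, beta)))

One point of contact with the superlinearity question: in the USL framework, a fitted α that comes out negative produces speedup better than linear over some range of N, which is one way the superlinear behavior discussed in the paper cited above can show up.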

Early-bird registration ends in 5 days.

I'm also interested in hearing from anyone who plans to adopt Hadoop or has experience using it from a performance and capacity perspective.
