## Monday, October 3, 2011

The ability to visualize data, enabled by the advent of graphical computer tools, has been a great boon to Cap and Perf. The power derives from the way graphical displays provide an efficient impedance match to the visual system in our brain. The weakness derives from the way graphical displays provide an efficient impedance match to the visual system in our brain. We can get carried away by visual representations alone. Every marketing organization exploits that weakness. Numbers do have poor cognitive impedance, but that doesn't mean numbers should be ignored altogether. In fact, we often need a combination of both numerical and visual data representations so that we don't suffer visual miscues and thus jump to the wrong conclusion. The following presents an example of how easily this can happen.

Recently, Guerrilla alumnus Scott J. pointed me at this Chart of the Day showing how Google revenue growth was outpacing that of both Facebook and Yahoo when compared over the first 7 years after each company's launch.

Clearly, this chart is intended to be an attention getter for the Silicon Alley Insider website, but it looks about right, and normally I might have just accepted the claim without giving it any more thought. The notion that Google growth is dominating is also consistent with a lot of other things one sees. No surprises there.

#### Exponential doubling period

In this particular case, however, I was struck by the shape of the data and curious to find out whether the growth of GOOG and FB revenue follows an exponential trend. Exponential growth is not unexpected because it's the continuous analog of compound interest. If they are growing exponentially, I can compare their doubling periods numerically and determine how their growth will look in the future.

The doubling period is an analysis technique that I use in Chapter 8 of my Guerrilla Capacity Planning book to determine the traffic growth of major websites. In section 8.7.5 the doubling time t2 is defined as:

t2 = Ln(2) / A

where A is the growth parameter of the fitted exponential curve (the rate at which it bends upward) and Ln(2) is the natural logarithm of 2 (2 for doubling). The only fly in the ointment is that I don't have the actual numeric values used in the histogram chart, but that need not be a showstopper. There are only a half dozen data points for each company, so I can estimate them visually. Then, I can use R to fit the exponential models and calculate the respective doubling times.
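As a quick numerical check of that definition, here is a minimal R sketch. The helper name `doubling_time` and the example value of A are assumptions made for illustration; neither comes from the book or the chart.

```r
# Doubling time of an exponential y = y0 * exp(A * t),
# expressed in the time units of 1/A.
doubling_time <- function(A) log(2) / A

# Illustrative growth parameter (assumed, not a fitted value): 0.5 per year
A  <- 0.5
t2 <- doubling_time(A)

# Growth factor over one doubling period; this should be 2 for any y0
growth <- exp(A * t2)

cat(sprintf("A = %.2f per year -> t2 = %.2f years (%.1f months)\n",
            A, t2, 12 * t2))
```

Because A is per year here, multiplying t2 by 12 expresses it in months, which is the same conversion used when the doubling times are reported in the R analysis.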

#### Analysis in R

First, we read the data (as eyeballed from the online chart) into R. Since the amount of data is small, I simply use the textConnection trick to write the data in situ, rather than using an external file.

gd <- read.table(textConnection("Year GOOG FB YAH
1 0.001 0.002 0.001
2 0.01 0.02 0.01
3 0.1 0.2 0.1
4 0.5 0.45 0.3
5 1.5 0.75 0.6
6 3.2 2.0 1.1
7 6.1 4.0 0.75"),
header=TRUE)
closeAllConnections()

I can now plot those estimated data points and compare them with the original chart.

plot(gd$Year,gd$GOOG,type="b",col="green",lwd=2,lty="dashed",
main="Annual revenues for GOOG (green), FB (blue), YAH (red)",
xlab="Years after launch", ylab="$billions")
points(gd$Year,gd$FB,type="b",col="blue",lwd=2,lty="dashed")
points(gd$Year,gd$YAH,type="b",col="red",lwd=2,lty="dashed")

The result looks like this:

The dashed lines simply connect related points together. The two solid lines are produced by performing the corresponding exponential fits to the GOOG and FB data.

# x-values for continuous exp curves
x<-seq(from=1, to=7, by=0.1)
ggfit<-nls(gd$GOOG ~ g0*exp(g1*gd$Year),data=gd,start=list(g0=1,g1=1))
gc<-coef(ggfit)
lines(x,y=gc[1]*exp(gc[2]*x))
fbfit<-nls(gd$FB ~ f0*exp(f1*gd$Year),data=gd,start=list(f0=1,f1=1))
fc<-coef(fbfit)
lines(x,y=fc[1]*exp(fc[2]*x))

# report the doubling periods
text(1,5.0,sprintf("%2s doubling time: %4.2f months", names(gd)[2],12*log(2)/gc[2]),adj=c(0,0))
text(1,4.5,sprintf("%2s doubling time: %4.2f months", names(gd)[3],12*log(2)/fc[2]),adj=c(0,0))

From the R analysis we see that the doubling period for Google (t2 = 11.39 months) is slightly longer than that for Facebook (t2 = 10.94 months). Despite the banner claim made by Silicon Alley Insider, based on these estimated data, Google is growing revenue at a slightly slower rate than Facebook. How can that be?

#### Conclusion

In the original histogram chart, it looks like Google is growing faster than Facebook. Well, looks can be deceiving. Your brain can be fooled (easily) by optical illusions. That's why we need to do analysis in the first place. When you view a chart uncritically, your brain can easily be led astray.

To resolve this paradox, let's do two things:

1. Project the growth models out further than the 7 years associated with the data
2. Plot the projected curves on log-linear axes (for reasons that will become clear shortly)

Here's the result (you might want to click on the image to magnify it).

The left-hand plot shows that the two curves cross somewhere between 7 years out and 40 years out. Whereas green (Google) is currently on top, according to the data, blue (Facebook) eventually ends up on top according to the exponential models, assuming nothing else changes in the future. The right-hand plot uses a log-scaled y-axis to reveal more clearly that the crossover occurs at t = 23.9 years. Once again, if you rely purely on visuals, you might think the crossover doesn't occur until after 30 years (what looks like a "knee" in the left-hand plot), but you'd be misled. It occurs almost 10 years earlier.
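The crossover can also be computed directly rather than read off a plot. Setting g0 exp(g1 t) = f0 exp(f1 t) and solving for t gives t = ln(g0/f0) / (f1 - g1). The R sketch below is illustrative only and uses placeholder coefficients. The rates are the rounded values of A quoted in Larry C's comment below. The prefactors g0 and f0 are made-up round numbers, not fitted values. The function name `crossover_time` is likewise hypothetical. With the actual fits, the coefficient vectors gc and fc would take their place.

```r
# Placeholder coefficients (assumptions for illustration, NOT the fitted values)
g0 <- 0.04; g1 <- 0.731   # GOOG: prefactor is made up, rate is rounded
f0 <- 0.02; f1 <- 0.760   # FB:   prefactor is made up, rate is rounded

# Time at which a0*exp(a1*t) and b0*exp(b1*t) intersect
crossover_time <- function(a0, a1, b0, b1) log(a0 / b0) / (b1 - a1)
t_cross <- crossover_time(g0, g1, f0, f1)

# Project both models well beyond the 7 years of data
yrs  <- seq(1, 40, by = 0.1)
goog <- g0 * exp(g1 * yrs)
fb   <- f0 * exp(f1 * yrs)

op <- par(mfrow = c(1, 2))
# Left: linear y-axis, where the crossing is hard to locate by eye
plot(yrs, goog, type = "l", col = "green", ylim = range(goog, fb),
     xlab = "Years after launch", ylab = "$billions", main = "Linear y-axis")
lines(yrs, fb, col = "blue")
# Right: log y-axis, where each exponential becomes a straight line
plot(yrs, goog, type = "l", col = "green", ylim = range(goog, fb), log = "y",
     xlab = "Years after launch", ylab = "$billions", main = "Log y-axis")
lines(yrs, fb, col = "blue")
abline(v = t_cross, lty = "dotted")
par(op)

cat(sprintf("Crossover at t = %.1f years\n", t_cross))
```

On the log-scaled axis each exponential becomes a straight line whose slope is its growth rate, so the crossing is simply where two straight lines meet.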

If, for example, you were only interested in short-term gains (as Wall St is wont to do), the original visual (histogram) is correct. If, on the other hand, you are in your 20s and investing longer term, e.g., for your retirement, you might get a surprise.

By now, you might be thinking that these projections are not very accurate, and I wouldn't completely disagree with you. But what is accurate here? The original data in the histogram (even the really real actual data) probably aren't very accurate either; we really can't know without deeper investigation. And that's my point: independent of the accuracy of the data, the numerical analysis can cause you to pay attention to, and possibly ask questions about, something you might otherwise have taken for granted on purely visual grounds.

Even wrong expectations are better than no expectations

I'm a big fan of data visualization, but not to the exclusion of numerical analysis. We need both and we need both to be easily accessible.

The art is in the science

Larry C said...

While I agree with your point and the message you are trying to get across, I think you need to look closer at your model fits.

Assuming you buy into the model, the value of A for the Google fit is 0.731 with a standard error of 0.043, and for the Facebook fit the value of A is 0.760 with a standard error of 0.033. With such a small difference in the fitted parameters and the relatively large standard errors, the numerical analysis would not conclude that the two model fits are different. Plus, there is a fairly wide range for what the doubling interval could be. In short, with seven data points it is really hard to trust the results of the model you have decided to use.

SteveJ said...

Excellent piece.
Might be one of your best.

- clear, concise & well written, good logical progression etc.
- starts with "Something Really Obvious" we can all see for ourselves and agree with
- pose a problem (no datapoints), get over it
- then proceed with straight-forward analyses
- and end up with a surprising, even counter-intuitive, result

By demonstration you've told us:
- things aren't always what they seem, don't just take things on first appearances
- "digging deeper" can be quick and easy
- the tools/techniques to do this are quick and easy, and simple to master.

You might go as far as saying, "always check your results"... Which is Just Good Science.

But what I love is the way you've quietly demonstrated that Xerox PARC observation:

"Point of View is worth 40 to 60 IQ points".

Good one. Works at many levels.

Neil Gunther said...

Belated response to Larry C's comment.

Point taken and indeed, I might express it a little differently. If we imagine that the modeled curves were drawn using fatter lines, then the whole notion of "crossing" comes into question. cf. Confidence bands for USL scaling curves.

However, I deliberately didn't show the summary stats on the log fit because I view the whole procedure a bit differently---perversely, perhaps.

Although I've never actually written this down before, the idea is to barrel through to an end point and then review. Roughly put, the steps are:
- Look at the plot
- Question: Is it log growth?
- Problem: No numeric values
- Solution: Guesstimate them
- Do the fit
- Question: What's the doubling period?
- Do the calculation
- Problem: Opposite from claim
- Question: Why?
- Solution: Curves cross at ~20 years (modulo above qualification).

In other words, I don't want to get distracted by numerical details in this phase. The goal is simply to reach a self-consistent explanation to the question about log growth. We may have to go back and revise the whole thing but, hopefully, we'll know what needs revision and why.