My reasoning goes like this:
To make statistically meaningful estimates with models like linear least squares, you need at minimum about half a dozen data points. That's the same kind of rule of thumb (RoT) I developed for USL scalability modeling to be meaningful.
Now consider that same RoT from another standpoint. If you are Amazon.com, for example, and you want to estimate your Christmas growth trend, then you need about half a dozen data points. That means 5 or 6 years of data.
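To make the half-dozen RoT concrete, here's a minimal sketch of the kind of linear least-squares trend fit I have in mind; the yearly volumes are invented purely for illustration, not anyone's real data:

```python
import numpy as np

# Hypothetical December order volumes (millions) for six consecutive years;
# the numbers are made up purely to illustrate the half-dozen-points RoT.
years  = np.array([2004, 2005, 2006, 2007, 2008, 2009])
volume = np.array([11.2, 13.1, 15.8, 18.4, 20.9, 24.3])

# Ordinary least-squares straight-line fit, with years centred on the
# first year to keep the design matrix well conditioned.
t = years - years[0]
slope, intercept = np.polyfit(t, volume, deg=1)

# Project the next Christmas peak from the fitted trend.
forecast_2010 = slope * (2010 - years[0]) + intercept
print(f"Fitted growth : {slope:.2f} million/year")
print(f"2010 forecast : {forecast_2010:.1f} million")
```

With six points there are only four degrees of freedom left over for estimating the error in that slope; throw away a year or two of history and the "trend" rapidly degrades into a guess.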
That time scale for a historical data repository certainly distinguishes CaP from simple performance monitoring (no storage) and performance analysis (short-term history), where you are looking for diagnostic patterns rather than trends.
How much data is kept during those 5 years is a secondary, sampling question. As surely as you throw away arbitrary periods of data, those will turn out to be the very periods you need at some later time for CaP purposes. In fact, it's probably more work to selectively remove certain periods than to automatically keep the lot. But keeping everything at full resolution might be overkill. That's typically where data aggregation comes in: the further back in time you go, the less likely you are to need events at fine time-granularity.
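As a sketch of that aggregation idea (assuming pandas, an imaginary DataFrame of 5-minute resource samples, and arbitrary 90-day and 1-year cut-offs), older data gets rolled up to coarser buckets while recent data stays at full resolution:

```python
import pandas as pd

def tiered_rollup(samples, now):
    """Keep recent data raw, roll older data up to coarser buckets.

    samples : DataFrame of fine-grained (say, 5-minute) measurements
              with a DatetimeIndex.
    now     : pd.Timestamp used for the age cut-offs.
    The 90-day and 1-year boundaries are illustrative, not a standard.
    """
    recent  = samples[samples.index >  now - pd.Timedelta(days=90)]
    middle  = samples[(samples.index <= now - pd.Timedelta(days=90)) &
                      (samples.index >  now - pd.Timedelta(days=365))]
    ancient = samples[samples.index <= now - pd.Timedelta(days=365)]

    return pd.concat([
        ancient.resample("1D").mean(),  # older than a year: daily averages
        middle.resample("1h").mean(),   # 3 to 12 months old: hourly averages
        recent,                         # last 90 days: full resolution
    ]).sort_index()
```

The point is that the rollup is automatic and uniform, so nobody ever has to decide which "arbitrary periods" to throw away.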
With this RoT in mind, I decided to google the topic but discovered there is very little that is relevant to CaP, since most commentators are typically considering data storage for applications (e.g., an RDBMS), not historical data for CaP analysis. FWIW, here's what I found during a relatively quick review:
- SMB planning mentions 5 years without any justification
"knowing what your business is doing...in an 18-month time frame, to a three-year time frame, to a five-year time frame. You really need to plan out that far, but if you do, it's fantastic. If you have a five-year plan, it will trickle down more easily into getting your hands around capacity and growth."
- Oracle example of 6 weeks
"Suppose, for example, you always want to be able to view hour data for the previous six weeks. ..."
- IBM up to 2 months before aggregating ("pruning")
"Capacity planning and predictive alerting: For capacity planning and predictive analytics, you typically perform long term trend analysis. The Performance Analyzer, for example, uses Daily summarization data for the predefined analytic functions. So, in most cases, configure daily summarization. You can define your own analytic functions and use Hourly or Weekly summarization data."
"For the analytic functions to perform well, ensure that you have an appropriate number of data points in the summarized table. If there are too few, the statistical analysis will not be very accurate. You will probably want at least 25 to 50 data points. To achieve 50 data points using Daily summarization, you must keep the data for 50 days before pruning. More data points will make the statistical predictions more accurate, but will affect the performance of your reporting and statistical analysis. Consider having no more than a few hundred data points per resource being evaluated. If you use Hourly summarization, you get 336 data points every 2 weeks."
"Adaptive monitoring (dynamic thresholding): Keep 7 to 30 days of detailed data when comparing all work days. If you compare Monday to Monday, then you need to keep the Detailed data much longer to be able to establish a trend. When comparing a specific day of the week, you will
probably need to have at least 60 days of data.
- RRDtool example of 2 years for consolidation decisions
"if now is March 1st, 2009, do you want to look at 2007-03-01 until 2009-03-01 or
do you want to be able to look at 2007-03-01 midnight to next midnight."
"What you need to understand here is consolidation. Say that you will be looking at two years worth of information, and that the available data is in a resolution of 300 seconds per bucket. This means you have more than 200,000 buckets."
"Example: Say you want to be able to display the last 2 years, the last 2 months, the last 2 weeks and the last 2 days. The database uses the default step size of 300
seconds per interval."
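To sanity-check the numbers quoted above, here's a small arithmetic sketch in plain Python. It reproduces IBM's 336 hourly data points per two weeks and RRDtool's 200,000-plus 300-second buckets over two years, and shows how daily consolidation brings the latter back down to a manageable count:

```python
DAY = 24 * 3600  # seconds per day

def data_points(retention_days, interval_seconds):
    """Number of samples a retention window yields at a given granularity."""
    return retention_days * DAY // interval_seconds

# IBM example: hourly summarization over 2 weeks -> 336 points.
print(data_points(14, 3600))        # 336

# RRDtool example: 300-second buckets over 2 years -> more than 200,000.
print(data_points(2 * 365, 300))    # 210240

# Consolidated to daily averages, the same 2 years is only 730 points,
# still far more than the half-dozen needed for a trend fit.
print(data_points(2 * 365, DAY))    # 730
```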
Ultimately, I'd like to turn my 5-year RoT into a Guerrilla Mantra. Send me your thoughts and comments to help me get there.
I'm posting this, more or less unedited, on behalf of Harry v.d. H:
As I have been involved in capacity management and performance optimization intermittently since 1968, I have some practical views on data retention.
--> For the service records (number of transactions, types, response time, number of bytes, monetary value) I advise retaining them for 10 years, condensed at an hourly level. For the response times and the number of bytes I advise storing both the average and the 95th percentile (see the rollup sketch after this comment).
I have found that, especially for presentations to senior management, it really helps to show what the situation was in the past.
--> For resource info (storage, CPU) the retention is less ambitious. In my experience the applications change so much over 5 years that even the data from years back is no longer relevant to me, so I am quite happy to scrap that after 2 years.
It is a pity, but financial info has a very short life span: the organisations change so quickly that even after 2 years it is impossible to reconstruct reliably what was paid for which application complex. There I normally just wing it by saying: in my memory the transaction cost price in 1998 was 21 guilder-cents. As nobody has better figures, and as long as my figures are correct for the checkable years, that is workable.
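A minimal sketch of the hourly condensation Harry describes, assuming pandas and a hypothetical DataFrame of per-transaction records with response_time and bytes columns; only the count, the average and the 95th percentile survive the rollup:

```python
import pandas as pd

def p95(series):
    """95th percentile of a series."""
    return series.quantile(0.95)

def hourly_service_rollup(tx):
    """Condense raw transaction records to one row per hour.

    tx : DataFrame with a DatetimeIndex and columns
         'response_time' (seconds) and 'bytes'.
    Keeps the transaction count plus the mean and 95th percentile of
    response time and bytes, per the 10-year retention advice above.
    """
    return tx.groupby(pd.Grouper(freq="1h")).agg(
        transactions=("response_time", "count"),
        resp_mean=("response_time", "mean"),
        resp_p95=("response_time", p95),
        bytes_mean=("bytes", "mean"),
        bytes_p95=("bytes", p95),
    )
```

At one row per hour, a decade of history is fewer than 90,000 rows per service, which is cheap compared with losing the ability to show senior management what the situation was in the past.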
In backtesting trading systems, it isn't really the number of data points or the amount of calendar time that matters. What matters is how many of the possible market scenarios you have captured. For example, if you have real-time data, you could have millions of data points for the year 2009 but would only be looking at data from the first year of the Obama administration. Ten years of data would get you back to the bursting of the dot-com bubble and 9/11, etc.
And now for a slightly different perspective: Which Telecoms Store Your Data the Longest?
Retention times range from a few days to 7 years.
Hey,
Love your blog BTW.
In my opinion this really has to do with what you expect from capacity planning. From my perspective it's about risk management in a sense: you want to protect against capacity issues that you can predict with some level of accuracy. So, for example, you would look at business cycles. For retail that might be a yearly cycle, the classic Christmas/summer cycle. One cycle almost becomes one "data point". So how many Christmas/summer cycles do you want to examine? In my opinion 5 years seems like a good start. And given there is a small weekday/weekend cycle inside that Christmas/summer cycle, that might dictate that a daily average is a good level of resolution. Another way to think about this is: how old does the data have to be before it's not worth anything to you (or the business)? Or at what resolution is the data no good to you? In the retail example a monthly average might be worthless 3 years later, but daily granularity would still be useful.
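One way to read the "one cycle becomes one data point" idea as code (assuming pandas and daily averages as the stored resolution): collapse each year's cycle to its peak, and five years of retention hands you back roughly the half-dozen points the RoT asks for.

```python
import pandas as pd

def cycle_data_points(daily):
    """Collapse each yearly business cycle to a single 'data point'.

    daily : Series of daily-average load or volume with a DatetimeIndex,
            i.e. the daily resolution argued for above.
    Returns one peak value per calendar year (e.g. the Christmas peak),
    ready for a trend fit across 5 or more cycles.
    """
    return daily.groupby(daily.index.year).max()
```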
Trying to get a system to do the retention/granularity you want after you've built it can be tough, but baking it right in can make it trivial. The two big bits on the design side are the storage of that data and the reporting of it. For me storage is peanuts, thanks to mostly converged storage. The reporting bit is typically harder: few systems "think" on the scale that the business wants. Being a techie, I'm inclined to "build it myself", and rrdtool is always front of the line for most little projects. It's just so well suited to its niche job.
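As a sketch of baking that retention in up front with rrdtool (assuming the Python rrdtool bindings; the same arguments work on the rrdtool create command line), the archives below keep two days of raw 300-second samples plus half-hourly, two-hourly and daily consolidations out to two years, matching the windows in the RRDtool example quoted in the post:

```python
import rrdtool  # Python bindings for RRDtool

# One gauge data source sampled every 300 s, with round-robin archives
# sized for 2-day / 2-week / 2-month / 2-year windows.
# Rows per archive = window_seconds / (300 s * steps-per-row).
rrdtool.create(
    "capacity.rrd",
    "--step", "300",
    "DS:load:GAUGE:600:0:U",     # heartbeat 600 s, no upper bound
    "RRA:AVERAGE:0.5:1:576",     # 2 days   of 5-minute averages
    "RRA:AVERAGE:0.5:6:672",     # 2 weeks  of 30-minute averages
    "RRA:AVERAGE:0.5:24:720",    # 2 months of 2-hour averages
    "RRA:AVERAGE:0.5:288:730",   # 2 years  of daily averages
)
```

Once the archives exist, the retention and granularity policy is enforced on every update with no further decisions to make.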
This also speaks to the importance of the IT guys talking directly with the business owner. Do not lose sight of the line from the technology to the business itself. Finance and accounting will have something to say about the resolution and retention of data.
---
David Thornton | Managed Services
Scalar Decisions
e: david.thornton at scalar.ca