As an analyst, one of the most common questions I hear is ‘How much data do you really need for an analysis?’ The answer I always give is: it depends! If you are analysing how someone’s behaviour changes from birth to death, you need data from their entire lifespan. In most instances, however, you are only looking at a specific period in someone’s life, so your data needs are far smaller, both in volume and in the length of time covered.
The Law of Large Numbers
Lots of data always helps with producing better analysis, and there is even a probability law to support this: the Law of Large Numbers. The law states that as a sample grows, its mean tends ever closer to the true expected value of the underlying distribution. From this we can deduce that more data will, in general, mean greater accuracy and better insights.
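You can see the Law of Large Numbers in action with a quick simulation. The sketch below (a toy example, not part of any real analysis) averages simulated fair coin flips: the more flips you take, the closer the sample mean sits to the true expected value of 0.5.

```python
import random

def sample_mean(n, seed=42):
    """Average of n simulated fair coin flips (1 = heads, 0 = tails)."""
    rng = random.Random(seed)
    return sum(rng.random() < 0.5 for _ in range(n)) / n

# The sample mean drifts toward the true expected value (0.5) as n grows.
for n in (10, 1_000, 100_000):
    print(f"n = {n:>7}: mean = {sample_mean(n):.4f}")
```

With only 10 flips the mean can land well away from 0.5; by 100,000 flips it is typically within a fraction of a percent.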
However, even large volumes of data need to be treated with caution. If I told you a sample size was 2.4 million, you would expect the results of any analysis or survey built on it to be accurate. Well, you wouldn’t be alone… that’s what the folks at the Literary Digest thought when they tried to predict the result of the 1936 Presidential election between Roosevelt and Landon. 57% of their respondents backed Landon, but Roosevelt won the election by a landslide. What went wrong?
Big Data Doesn’t Mean Accurate Data
What went wrong was that even large data samples aren’t immune to noise. They can also carry bias, depending on how the data is collected. Even within your own organisation, a shift in strategy – say, from pushing a specific product to encouraging guests to book earlier – can skew your data. Wider economic conditions like a recession will also have an indirect impact on your dataset, affecting the spending power of your customer base.
In the Roosevelt versus Landon survey, the Digest asked its reader base (largely upper-middle-class Americans) to mail in responses. This effectively excluded people from other socio-economic backgrounds – the very people largely responsible for making Roosevelt President.
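A toy simulation makes the point concrete. All the numbers below are illustrative assumptions, not the 1936 figures: a population where overall support for a candidate is 64%, but the affluent subgroup we happen to survey supports them far less. A huge, biased sample misses the truth by a wide margin, while a much smaller random sample lands close to it.

```python
import random

rng = random.Random(0)

# Hypothetical population: 20% affluent (40% support), 80% other (70% support).
population = (
    [("affluent", rng.random() < 0.40) for _ in range(20_000)]
    + [("other", rng.random() < 0.70) for _ in range(80_000)]
)

true_support = sum(v for _, v in population) / len(population)  # ~0.64

# A huge but biased sample: only affluent respondents mail back.
biased = [v for g, v in population if g == "affluent"]
biased_estimate = sum(biased) / len(biased)  # ~0.40 - far off, despite 20,000 responses

# A far smaller random sample gets much closer to the truth.
random_sample = rng.sample([v for _, v in population], 1_000)
random_estimate = sum(random_sample) / len(random_sample)
```

No amount of extra biased data fixes the gap: how the sample is drawn matters more than how big it is.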
So How Do We Generalise Data Requirements?
Well, firstly, make sure you have enough data to cover any seasonal differences. Ideally, that shouldn’t just mean Christmas and your annual sales, but at least one economic boom and bust cycle too.
There are a number of economic cycles which dictate the length of a boom and bust period. The one most applicable to Retail or B2C businesses is the Kitchin cycle. Discovered by Joseph Kitchin in the 1920s, it is a short economic cycle covering a period of around 40 months – so capturing one full cycle means holding at least three and a half years of data.
What About Small Data Sets?
But don’t fret: having a smaller set of data does not mean you cannot gain insight from what is at hand. If you are a new organisation, you could use the data you have to identify where your current, most valuable customers are coming from. Irrespective of the volume of data, simple database overviews will give you a wealth of information on your customers.
Having said this, any insight stemming from small volumes of data will need to be treated with caution. Analysis should include rigorous statistical testing to ensure outputs are valid, and analytical projects should have a refresh built into their scope, so outputs can be revisited and refined as you obtain more data.
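One simple way to keep small-sample insights honest is to report a confidence interval alongside any average. The sketch below is a minimal illustration using made-up weekly order counts and a normal approximation (a t-distribution would be stricter for samples this small): the interval is wide with few data points and narrows as more data arrives.

```python
import math
from statistics import NormalDist, mean, stdev

def mean_ci(values, confidence=0.95):
    """Approximate confidence interval for the mean of a sample
    (normal approximation; a t-distribution is stricter for small n)."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    m = mean(values)
    half_width = z * stdev(values) / math.sqrt(len(values))
    return m - half_width, m + half_width

# Hypothetical weekly order counts from a young business.
small = [112, 98, 105, 120, 87, 101, 95, 110]
lo, hi = mean_ci(small)
print(f"mean = {mean(small):.1f}, 95% CI = ({lo:.1f}, {hi:.1f})")
```

Rerunning the same calculation after each data refresh shows the interval tightening – a built-in reminder of how much (or little) the current sample can support.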
Any Data is Good Data
The long and short of it is that you need data for analysis, but you can make something out of any volume. Shortcomings of any kind can be considered and accounted for during the course of an analytical project.
It is what you do with what you have that matters!