With all the hype surrounding ‘big data’ analytics – the magical promises of actionable insight, the glitz and glamour of data visualisation – it’s easy to forget what goes on behind the scenes to make it all happen: data preparation and exploration. These are the powerhouses of analytics, and without them the insight gained will be marginal at best.
Most analysts will tell you that preparing and exploring the data takes around 70-80% of the time on any analytical project. This is true whether you are producing reports, creating dashboards or developing models.
As tempting as it may seem to skip the data preparation and exploration phase, or to do something ‘down and dirty’ – don’t. Putting the effort in up front will save the time and frustration of continual rework in the long run, and produce outputs that give real insight.
Data preparation involves:
- understanding the purpose and use of your report/dashboard/model
- formulating lots of questions and hypotheses
- getting data (often kicking and screaming from multiple data sources)
- deriving new variables from existing ones
- checking if inconsistencies exist within variables
- investigating the quality of the data.
For example, you may be tasked with investigating what impact the frequency of topping up a prepaid phone has on a customer’s total spend. Your data may include the date and amount of each top-up. You’ll need to derive the number of days between top-ups as a new field, and you may also want to group these into categories. A check on the dates may reveal a substantial number of records with exactly the same date (possibly the date the database was created!). This would cause tremendous problems in your analysis if not dealt with at the start.
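The top-up example can be sketched in a few lines of Python with pandas. This is only an illustration under stated assumptions: the table, the column names (`customer_id`, `topup_date`, `amount`), the band edges and the helper functions are all made up for the sketch, and real data would need its own choices.

```python
import pandas as pd

# Hypothetical top-up records; column names and values are invented for illustration.
topups = pd.DataFrame({
    "customer_id": [1, 1, 1, 2, 2, 3, 4],
    "topup_date": pd.to_datetime([
        "2023-01-05", "2023-01-12", "2023-02-20",
        "2023-01-01", "2023-01-31",
        "2000-01-01", "2000-01-01",
    ]),
    "amount": [10, 20, 10, 30, 30, 10, 20],
})


def add_days_between(df):
    """Add days since each customer's previous top-up, plus a banded version."""
    out = df.sort_values(["customer_id", "topup_date"]).copy()
    # The first top-up for each customer has no previous date, so it stays missing.
    out["days_since_prev"] = out.groupby("customer_id")["topup_date"].diff().dt.days
    # Band edges are arbitrary assumptions; pick ones that suit the question.
    out["gap_band"] = pd.cut(
        out["days_since_prev"],
        bins=[0, 7, 30, float("inf")],
        labels=["weekly or less", "within a month", "over a month"],
        include_lowest=True,
    )
    return out


def date_share(df):
    """Share of records falling on each date, most common first."""
    return df["topup_date"].value_counts(normalize=True)


prepared = add_days_between(topups)
shares = date_share(topups)
print(prepared)
print(shares.head())
```

If one date accounts for a suspiciously large share of the records, it is worth asking where that date came from before the gaps are used for anything.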
Data exploration involves:
- investigating relationships between variables – not just for variable selection in modelling but also for the ‘what if’ scenarios often included in dashboards, where you want to ensure the variables assigned to the dials interact with each other
- looking at distributions – useful for identifying which summary measures to use in a report, where to set alerts for dashboards and whether any transformations are needed for modelling
- identifying outliers – you’ll need to include these in your report, and you’ll want to check how your dashboards and models behave with and without them. Do they cause major changes? Do you include or exclude them?
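As a rough sketch of these three checks, the Python snippet below uses pandas on a small invented table. The column names, the values and the `iqr_outliers` helper are assumptions for illustration, and the 1.5 × IQR rule is just one common convention for flagging outliers, not the only reasonable one.

```python
import pandas as pd

# Hypothetical customer-level data; names and values are invented for illustration.
customers = pd.DataFrame({
    "avg_days_between_topups": [7, 10, 14, 21, 30, 30, 45, 60],
    "total_spend": [520, 400, 310, 240, 150, 160, 90, 2000],
})


def iqr_outliers(series, k=1.5):
    """Flag values more than k * IQR below the first or above the third quartile."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)


# Relationships: Pearson correlation between the two variables.
corr_all = customers["avg_days_between_topups"].corr(customers["total_spend"])

# Distributions: a mean well above the median hints at a long right tail.
summary = customers["total_spend"].describe()

# Outliers: flag them, then repeat the correlation without the flagged rows.
flagged = iqr_outliers(customers["total_spend"])
trimmed = customers[~flagged]
corr_trimmed = trimmed["avg_days_between_topups"].corr(trimmed["total_spend"])

print(summary)
print("flagged rows:")
print(customers[flagged])
print("correlation with outliers:", round(corr_all, 2))
print("correlation without outliers:", round(corr_trimmed, 2))
```

In this made-up data a single extreme customer is enough to flip the sign of the correlation, which is exactly the kind of major change the with-and-without comparison is meant to surface.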
Often, as you work through the exploration phase, you think of different ways you’d like to view the data, or different questions you’d like to explore, that require new fields to be created. Some iteration between preparation and exploration is not unusual. However, it is easy to fall into the ‘analysis paralysis’ trap – a great analyst knows when to move on to creating the report, dashboard or model. “You gotta know when to hold ’em, know when to fold ’em.”