So you’ve gone from the dizzying highs of the midnight idea that will illuminate all the gems hidden in your data, solve world hunger & bring peace to all – to the crashing lows where it turned out to be a dud. You’ve survived the roller coaster ride of exploring your data, and you are now ready to create the dataset to take into the next phase of the modelling process!
You want to come up with a dataset where:
- The variables are relevant to your objective
- The variables are not related to each other (or as little as possible) – don’t cross the streams!
- There are as few missing or erroneous pieces of data as possible
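One rough way to check the second point is to look at how strongly each pair of variables moves together. The snippet below is only a sketch: the column names, the made-up numbers and the 0.8 cut-off are all assumptions for illustration, and Pearson correlation only picks up straight-line relationships.

```python
from itertools import combinations
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def correlated_pairs(columns, threshold=0.8):
    """Return (name_a, name_b, r) for each pair with |r| at or above threshold.

    The 0.8 default is an arbitrary, hypothetical cut-off.
    """
    pairs = []
    for a, b in combinations(columns, 2):
        r = pearson(columns[a], columns[b])
        if abs(r) >= threshold:
            pairs.append((a, b, r))
    return pairs

# Made-up example data: height_cm and height_in carry the same information.
columns = {
    "height_cm": [150.0, 160.0, 170.0, 180.0, 190.0],
    "height_in": [59.1, 63.0, 66.9, 70.9, 74.8],
    "shoe_colour_code": [3.0, 1.0, 4.0, 1.0, 5.0],
}

flagged = correlated_pairs(columns)
for a, b, r in flagged:
    print(f"{a} and {b} look redundant (r = {r:.2f})")
```

If a pair gets flagged, you would normally keep whichever of the two is more relevant to your objective and drop the other.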
Curse of dimensionality – mwa-ah-ah!
This is when you get carried away with the plethora of variables that you have to play with – and put them all into your model! While this may be very good at explaining the dataset you currently have, it does not necessarily work well for future values, because the model ends up fitting the quirks of your sample rather than the underlying pattern (overfitting). You’re better off drawing a stick figure than producing the Mona Lisa. Your data exploration should have highlighted the key variables you need, as well as any variables that will need transforming.
Transformers – more than meets the eye!
Many modelling techniques assume, or at least work better when, the data roughly follows the ‘bell shaped’ normal curve. Skewed data (e.g. lots of records around the average with very few at the far end or vice versa) will probably need a transformation – such as taking the log of the value (which only works for values greater than zero). Transforming data like this also reduces the impact the outliers have. Keep both the original value and the transformed value in your data set.
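Here is a small sketch of the log transform, keeping the original value alongside the transformed one. The `income` field and its values are made up (each one is double the last, which gives a long right tail), and the skewness function is just one simple way to put a number on how lopsided the data is.

```python
import math

def skewness(values):
    """Population skewness: near 0 for symmetric data, positive for a long right tail."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Made-up incomes that double each time, so a few large values dominate.
incomes = [10_000, 20_000, 40_000, 80_000, 160_000, 320_000, 640_000]
records = [{"income": v} for v in incomes]

# Add the transformed value as a new field; the original stays in place.
# math.log raises ValueError for zero or negative inputs.
for rec in records:
    rec["log_income"] = math.log(rec["income"])

raw_skew = skewness([r["income"] for r in records])
log_skew = skewness([r["log_income"] for r in records])
print(f"skew before: {raw_skew:.2f}, after log: {log_skew:.2f}")
```

Real data will rarely come out this tidy, so it is worth re-plotting the transformed variable to see whether the log actually helped.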
Know thy quality!
You should also have a feel for the quality of the data. If a variable has a large number of missing records you probably need to exclude it from your dataset. If only a small number of records are missing, you could fill the gaps with an average or something similar.
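The two rules of thumb above might look something like this in code. Treat it as a sketch only: the field names and data are invented, `None` stands in for a missing value, and the 50% cut-off for "a large number" is a hypothetical choice you would tune for your own data.

```python
def clean_missing(rows, max_missing_share=0.5):
    """Drop variables that are mostly missing; fill the rest with the column mean.

    rows: list of dicts with the same keys, using None for a missing value.
    max_missing_share: hypothetical cut-off for "too many missing records".
    """
    n = len(rows)
    cleaned = [dict(row) for row in rows]  # copy so the original stays untouched
    for name in list(rows[0]):
        present = [row[name] for row in rows if row[name] is not None]
        missing_share = 1 - len(present) / n
        if missing_share > max_missing_share:
            for row in cleaned:
                del row[name]  # too sparse to be useful
        elif missing_share > 0:
            mean = sum(present) / len(present)
            for row in cleaned:
                if row[name] is None:
                    row[name] = mean  # substitute the average
    return cleaned

# Made-up records: one age is missing, most fax counts are missing.
rows = [
    {"age": 30, "income": 40_000, "fax_number_count": None},
    {"age": None, "income": 50_000, "fax_number_count": None},
    {"age": 50, "income": 60_000, "fax_number_count": 1},
    {"age": 40, "income": 70_000, "fax_number_count": None},
]

cleaned = clean_missing(rows)
print(cleaned[1])
```

The function returns a copy, so the untouched data is still there if you change your mind about the cut-off.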
If you suspect some of your data is erroneous, you should double check to see if you can determine how the error occurred – it could be something systemic that can be fixed. Otherwise, you could substitute an average or remove the record(s) entirely. However, you need to be really, really, really sure the data is an error before doing anything. You might need to try modelling with both an ‘untouched’ dataset and a ‘clean’ dataset and compare the resulting models to be really sure.
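To make the ‘untouched’ versus ‘clean’ comparison concrete, here is a toy sketch that fits a straight line to both versions of a dataset and compares the slopes. Everything in it is an assumption for illustration: the data is made up, the suspect record is one where it looks like an extra zero was typed, and a simple least-squares line stands in for whatever model you actually plan to use.

```python
def fit_line(points):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Made-up data following y = 2x, except the last record (120 instead of 12).
untouched = [(1, 2), (2, 4), (3, 6), (4, 8), (5, 10), (6, 120)]

# The 'clean' version removes the suspect record entirely.
clean = [p for p in untouched if p != (6, 120)]

untouched_slope, _ = fit_line(untouched)
clean_slope, _ = fit_line(clean)
print(f"slope with suspect record: {untouched_slope:.2f}")
print(f"slope without it:          {clean_slope:.2f}")
```

A big gap between the two models tells you the suspect record matters a lot, which is a reason to investigate it further, not proof that it is wrong.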
You are now ready to start playing with some modelling techniques. Finally!