I’ve done a few statistics and analytics courses over the years and almost always they tell you to look out for outliers. They’ll often tell you how to identify them graphically and give you numeric rules of thumb, such as anything more than 3 standard deviations from the mean, or anything more than 1.5 times the interquartile range below the first quartile or above the third. They’ll also talk about outliers that are influential points (these are ones that tend to drag the model line towards them) and how to identify those. But one thing you pretty much never get is what to do with them.
(Source: my husband – ‘cause I can’t even draw stick figures!)
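For anyone who wants to see the two rules of thumb above in action, here is a minimal sketch using only Python's standard library. The function name, the cutoffs as parameters and the sample numbers are all made up for illustration, and `statistics.quantiles` uses its default quartile method, which is only one of several conventions.

```python
import statistics

def flag_outliers(values, sd_cutoff=3.0, iqr_multiplier=1.5):
    """Return (flagged_by_sd_rule, flagged_by_iqr_rule) for a list of numbers."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)  # sample standard deviation

    # Quartiles via the module's default method (other conventions exist).
    q1, _median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower_fence = q1 - iqr_multiplier * iqr
    upper_fence = q3 + iqr_multiplier * iqr

    by_sd = [v for v in values if abs(v - mean) > sd_cutoff * sd]
    by_iqr = [v for v in values if v < lower_fence or v > upper_fence]
    return by_sd, by_iqr

# Made-up sample: nine ordinary values and one suspicious one.
sample = [12, 14, 15, 15, 16, 17, 18, 19, 20, 95]
by_sd, by_iqr = flag_outliers(sample)
print("3 SD rule flags:", by_sd)
print("1.5 x IQR rule flags:", by_iqr)
```

On this sample the IQR rule flags the 95 while the 3 standard deviation rule flags nothing, because the extreme value inflates the very mean and standard deviation it is being measured against. The rules of thumb are a starting point for looking, not a verdict.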
One thing I know not to do is to simply remove them – I know this because the only marks I ever lost on an assignment were for suggesting the removal of an outlier (I was young), even though the course notes said to remove them – I’m not bitter about it at all.
So what do you do with outliers???
Frustratingly for a black and white mathematician, the answer is (as with most things statistical)…
IT DEPENDS!
If you know beyond a shadow of a doubt, you’d stake your first born’s life on it (although there are days when this mightn’t be such a rigorous criterion), that the outlier is due to an error, then you may remove it. This, in my mind, is the only time you can remove outliers – although I recommend you actually get the data corrected for future use.
If the outliers are not errors, you need to check the impact they have on the model, on the inferences you make, on the statistical assumptions needed for the analytical technique you are using, and on the strength of the relationship between variables.
Yup, you guessed it – you’re going to have to run your analysis with and without the outliers.
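A rough sketch of what “with and without” can look like for a simple straight-line fit is below. It assumes ordinary least squares on one predictor, and the helper names (`fit_line`, `compare_with_and_without`) and the data are invented purely for illustration.

```python
import math

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x, plus Pearson's r."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return {
        "slope": slope,
        "intercept": mean_y - slope * mean_x,
        "r": sxy / math.sqrt(sxx * syy),
    }

def compare_with_and_without(xs, ys, outlier_indices):
    """Fit once on all points and once with the flagged points left out."""
    drop = set(outlier_indices)
    kept = [(x, y) for i, (x, y) in enumerate(zip(xs, ys)) if i not in drop]
    kept_xs = [x for x, _ in kept]
    kept_ys = [y for _, y in kept]
    return {"with": fit_line(xs, ys), "without": fit_line(kept_xs, kept_ys)}

# Made-up data: eight points close to y = 2x, plus one point far off the line.
xs = [1, 2, 3, 4, 5, 6, 7, 8, 20]
ys = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9, 16.1, 5.0]
result = compare_with_and_without(xs, ys, outlier_indices=[8])

for label, fit in result.items():
    print(f"{label:>8}: slope={fit['slope']:.2f}  "
          f"intercept={fit['intercept']:.2f}  r={fit['r']:.2f}")
```

The point is to put the two sets of numbers side by side, not to decide which fit is “right”. How big a shift counts as a real impact is still a judgement call.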
Some practitioners suggest that it’s ok to remove outliers if they do not impact the results and the relationship, even if they affect the assumptions. Personally I recommend leaving them in (outliers are often the most interesting part) – but perhaps I’m still scarred from my early days as a student!! If you do decide to remove them (perhaps because you end up with a blob of data points at one end just to fit the outlier in within the graph real-estate), then note which ones you removed, and their characteristics, somewhere within your report.
You need to keep in mind the danger of simply removing these outliers. It may seem of little consequence when the only thing being impacted is the assumptions underlying the modelling technique, but while the results appear ok in the short term, it could make a heck of a difference in the long term.
If you end up with very different models, inferences or relationships when you run your analysis without the outliers, you could try transforming your data, or segmenting your data and modelling for each group, or using different analytical techniques. It could well be that there is no clear cut ‘answer’ – this is when analytics becomes a bit of an art form and you may well have to make a judgement call or two. Be wary of relying on a model where a relationship only exists due to outliers; you are perhaps better off waiting until you have more concrete evidence of the relationship between variables.
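As one example of a “different analytical technique”, here is a small sketch of the Theil-Sen estimator, which takes the median of the slopes between every pair of points and so is much less swayed by a few extreme values. This is just one possible choice picked for illustration, not a recommendation from this post, and the data is made up.

```python
from itertools import combinations
from statistics import median

def theil_sen(xs, ys):
    """Robust line fit: median of pairwise slopes, then median residual offset."""
    slopes = [
        (y2 - y1) / (x2 - x1)
        for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2)
        if x2 != x1  # skip vertical pairs, which have no defined slope
    ]
    slope = median(slopes)
    intercept = median(y - slope * x for x, y in zip(xs, ys))
    return slope, intercept

# Made-up data: eight points close to y = 2x, plus one point far off the line.
xs = [1, 2, 3, 4, 5, 6, 7, 8, 20]
ys = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9, 16.1, 5.0]

robust_slope, robust_intercept = theil_sen(xs, ys)
print(f"Theil-Sen: slope={robust_slope:.2f}  intercept={robust_intercept:.2f}")
```

A robust fit like this keeps every observation in the data set while limiting how far a few of them can drag the line. It still doesn’t tell you why those points are different, so the outliers deserve a look either way.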
It’s unlikely you’ll only have one outlier (as most of the statistics books and courses only ever seem to have), so inspect the group of outliers – do they have any special characteristics that the non-outlier group doesn’t have? Perhaps they are a segment of their own. Sense check whether the relationships with and without the outliers ‘feel’ right with the business – are there particular business conditions under which a relationship holds that could be behind the outliers?
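One simple way to inspect the outlier group is to compare how some attribute is distributed among the outliers and among everything else. The sketch below is only illustrative: the records, the `spend` and `channel` fields, and the cutoff of 1000 are all hypothetical.

```python
from collections import Counter

def profile_groups(records, is_outlier, attribute):
    """Share of each attribute value among the outliers and among the rest."""
    counts = {"outliers": Counter(), "rest": Counter()}
    for record in records:
        group = "outliers" if is_outlier(record) else "rest"
        counts[group][record[attribute]] += 1

    shares = {}
    for group, counter in counts.items():
        total = sum(counter.values())
        shares[group] = {k: v / total for k, v in counter.items()} if total else {}
    return shares

# Hypothetical customer records; field names and values are invented.
customers = [
    {"spend": 120, "channel": "retail"},
    {"spend": 135, "channel": "online"},
    {"spend": 110, "channel": "retail"},
    {"spend": 140, "channel": "online"},
    {"spend": 125, "channel": "retail"},
    {"spend": 2400, "channel": "wholesale"},
    {"spend": 2650, "channel": "wholesale"},
]

# Hypothetical rule: treat any spend above 1000 as an outlier.
shares = profile_groups(customers, lambda r: r["spend"] > 1000, "channel")
for group, dist in shares.items():
    print(group, dist)
```

If the outliers turn out to share a characteristic the rest don’t, as the wholesale customers do in this invented example, that is a hint they may be a segment of their own.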
Unfortunately there is no clear cut, guaranteed ‘push the button’ approach – just a lot of monitoring, exploring and testing. No matter what though, you need to make sure you report on the differences and give possible explanations. It is always a good idea to include a summary of what you know about the outliers – it may well be they have characteristics that the business is after! After all, it’s the weird and wonderful that makes the world such an interesting place.
Michelle