How do you solve a problem like an outlier???

by | Feb 24, 2014

I’ve done a few statistics and analytics courses over the years and almost always they tell you to look out for outliers. They’ll often tell you how to identify them graphically and give you numeric rules of thumb, such as anything more than 3 standard deviations from the mean, or anything more than 1.5 times the interquartile range below the first quartile or above the third. They’ll also talk about outliers that are influential points (the ones that tend to drag the model line towards them) and how to identify those. But one thing you pretty much never get is what to do with them.
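For anyone who wants to see those two rules of thumb in action, here is a minimal sketch in Python using only the standard library. The function name `flag_outliers` and the data are made up for illustration, and the quartiles use the `statistics` module's default method; other quartile conventions will move the fences slightly.

```python
import statistics

def flag_outliers(values):
    """Return (z_flags, iqr_flags): the values flagged by each rule of thumb."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation
    # Rule 1: more than 3 standard deviations from the mean
    z_flags = [v for v in values if abs(v - mean) > 3 * sd]

    # Rule 2: more than 1.5 * IQR below Q1 or above Q3
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    iqr_flags = [v for v in values if v < lower or v > upper]
    return z_flags, iqr_flags

# Invented data: eleven ordinary values and one suspicious one
data = [10, 12, 11, 13, 12, 11, 10, 12, 13, 11, 12, 60]
z_flags, iqr_flags = flag_outliers(data)
print("3-SD rule flags:", z_flags)
print("1.5 x IQR rule flags:", iqr_flags)
```

One thing worth knowing: in small samples a single extreme value inflates the standard deviation, so the 3-SD rule can only just catch it (or miss it entirely), while the IQR rule is less affected.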

(Source: my husband – ‘cause I can’t even draw stick figures!)

One thing I know not to do is simply remove them – I know this because the only marks I lost on one assignment were for suggesting the removal of an outlier (I was young), even though the course notes said to remove them – I’m not bitter about it at all.
So what do you do with outliers???
Frustratingly for a black and white mathematician, the answer is (as with most things statistical)…

 IT DEPENDS!

If you know beyond a shadow of a doubt, you’d stake your first born’s life on it (although there are days when this mightn’t be such a rigorous criterion), that the outlier is due to an error, then you may remove it. This, in my mind, is the only time you can remove outliers – although I recommend you actually get the data corrected for future use.
If the outliers are not errors, you need to check the impact they have on the model, the inferences you make, the statistical assumptions needed for the analytical technique you are using, and the strength of the relationship between variables.
Yup, you guessed it – you’re going to have to run your analysis with and without the outliers.
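As a rough illustration of what "with and without" can look like, here is a small sketch that fits a straight line by ordinary least squares to the full data and then to the data with the flagged point held out. Everything here is an assumption for the sake of the example: the `fit_line` helper, the made-up numbers, and the choice of a simple linear model.

```python
import statistics

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x, plus Pearson's r."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    r = sxy / (sxx * syy) ** 0.5
    return slope, intercept, r

# Invented data: y is roughly 2x, except for the last point
xs = [1, 2, 3, 4, 5, 6, 7, 8, 20]
ys = [2.1, 3.9, 6.2, 8.0, 9.8, 12.1, 14.2, 15.9, 10.0]
is_outlier = [False] * 8 + [True]  # flagged by whatever rule you used

with_all = fit_line(xs, ys)
kept = [(x, y) for x, y, out in zip(xs, ys, is_outlier) if not out]
without = fit_line([x for x, _ in kept], [y for _, y in kept])

for label, (slope, intercept, r) in [("with", with_all), ("without", without)]:
    print(f"{label:>7} outlier: slope={slope:.2f} intercept={intercept:.2f} r={r:.2f}")
```

If the two fits tell much the same story, the outlier is not driving the result; if they differ as much as they do here, that difference is exactly what needs investigating and reporting.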
Some practitioners suggest that it’s ok to remove outliers if they do not impact the results and the relationship, even if they affect the assumptions. Personally I recommend leaving them in (outliers are often the most interesting part) – but perhaps I’m still scarred from my early days as a student!! If you do decide to remove them (perhaps because you end up with a blob of data points at one end just to fit the outlier within the graph real-estate), then note them and their characteristics somewhere within your report.
You need to keep in mind the danger of simply removing these outliers. While it seems of little consequence when the only thing being impacted is the assumptions underlying the modelling technique, it could be that the results appear ok in the short term but it makes a heck of a difference in the long term.
If you end up with very different models, inferences or relationships when you run your analysis without the outliers, you could try transforming your data, or segmenting your data and modelling each group, or using different analytical techniques. It could well be that there is no clear cut ‘answer’ – this is when analytics becomes a bit of an art form and you may well have to make a judgement call or two. Be wary of relying on a model where a relationship only exists due to outliers; you are perhaps better off waiting until you have more concrete evidence of the relationship between variables.
It’s unlikely you’ll only have one outlier (as most of the statistics books and courses only ever seem to have), so inspect the group of outliers – do they have any special characteristics that the non-outlier group doesn’t have? Perhaps they are a segment of their own. Sense check whether the relationships with and without the outliers ‘feel’ right with the business – are there particular business conditions under which a relationship holds that could be behind the outliers?
Unfortunately there is no clear cut, guaranteed ‘push the button’ approach – just a lot of monitoring, exploring and testing. No matter what though, you need to make sure you report on the differences and give possible explanations. It is always a good idea to include a summary of what you know about the outliers – it may well be they have characteristics that the business is after! After all, it’s the weird and wonderful that makes the world such an interesting place.
Michelle
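To make "try a different analytical technique" a bit more concrete, here is one possible option sketched in plain Python: the Theil-Sen estimator, which takes the median of all pairwise slopes and so is much less swayed by a handful of extreme points than least squares. This is just one robust method picked for illustration, not the only or necessarily the right choice for your data, and the numbers are invented.

```python
import statistics
from itertools import combinations

def theil_sen(xs, ys):
    """Slope = median of all pairwise slopes; intercept = median of y - slope * x."""
    slopes = [
        (y2 - y1) / (x2 - x1)
        for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2)
        if x2 != x1  # skip pairs with identical x to avoid dividing by zero
    ]
    slope = statistics.median(slopes)
    intercept = statistics.median(y - slope * x for x, y in zip(xs, ys))
    return slope, intercept

# Invented data: y is roughly 2x, except for the last point
xs = [1, 2, 3, 4, 5, 6, 7, 8, 20]
ys = [2.1, 3.9, 6.2, 8.0, 9.8, 12.1, 14.2, 15.9, 10.0]

robust_slope, robust_intercept = theil_sen(xs, ys)
print(f"Theil-Sen: slope={robust_slope:.2f} intercept={robust_intercept:.2f}")
```

The outlier stays in the data, but the fitted slope remains close to the pattern followed by the bulk of the points. You would still want to report that point and what you know about it.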
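A quick way to start that inspection is to compare how some business attribute is distributed among the outliers versus everyone else. The sketch below is purely hypothetical: the records, the `spend` and `channel` field names, the `profile` helper and the 1000 cut-off are all invented to show the idea.

```python
from collections import Counter

# Hypothetical customer records; every field name and value is invented
records = [
    {"spend": 120, "channel": "retail"},
    {"spend": 95, "channel": "retail"},
    {"spend": 140, "channel": "online"},
    {"spend": 110, "channel": "retail"},
    {"spend": 130, "channel": "online"},
    {"spend": 2400, "channel": "wholesale"},
    {"spend": 3100, "channel": "wholesale"},
]

def profile(records, is_outlier, attribute):
    """Share of each attribute value within the outlier group and the rest."""
    counts = {"outliers": Counter(), "rest": Counter()}
    for rec in records:
        group = "outliers" if is_outlier(rec) else "rest"
        counts[group][rec[attribute]] += 1
    return {
        group: {value: n / sum(c.values()) for value, n in c.items()}
        for group, c in counts.items()
        if c  # leave out a group that has no records
    }

# Assumed flagging rule for the example: spend above 1000
shares = profile(records, lambda rec: rec["spend"] > 1000, "channel")
for group, dist in shares.items():
    print(group, dist)
```

If the outliers turn out to share an attribute the rest of the data lacks, as the made-up wholesale customers do here, that is a hint they may be a segment of their own rather than noise.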

2 Comments
  1. Yves-Marie Lemaître

    Hi Michelle, nice (statistical) approach.
    Actually, I have a different approach, as in Market Research one does not only look for the trends, but also for abnormal events, which may be signs of new opportunities, differentiation, innovation… So if one starts the discussion on the global view, it (nearly always) ends up talking about outliers, being either risks or opportunities, but never neglected…
    Maybe this is linked to our Christian culture in Western Europe, as we leave the flock alone to look for the lost sheep…
    So, in my opinion, NEVER delete the outlier, and spend as much time as possible finding out why this tiny spot is where it is, instead of being hidden in the crowd.

  2. Michelle

    Absolutely agree – the only time I’d recommend removing an outlier completely is if it was an error (and even then I’d recommend fixing the error and re-running the analysis if time permits). You definitely want to spend time ferreting around to work out what lies beneath the outlier(s). With predictive modelling you do need to be cautious (which is the slant the post is coming from) and run the models with and without to understand the impact. It is always good to consider different modelling techniques, segmentations and transformations which may give a more robust model (which incorporates the outliers too). There are times when a judgement call is needed – but you should always highlight what you have done as well as the differences your decision gives.
