Missing in Action

by Roberto Garrido | Feb 11, 2014

Most data is not created perfect. One imperfection familiar to most data analysts is missing values. Records that have missing values in one or more variables are called incomplete cases. In SAS, most procedures that analyse data ignore records with missing values. Only records with a value for every variable, known as complete cases, are analysed.
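To make the distinction concrete, here is a minimal sketch in Python rather than SAS. The records, the field names and the is_complete helper are made up for illustration, and None stands in for a missing value.

```python
# Hypothetical records; None marks a missing value.
records = [
    {"id": 1, "age": 34, "income": 52000},
    {"id": 2, "age": None, "income": 61000},
    {"id": 3, "age": 45, "income": None},
    {"id": 4, "age": 29, "income": 48000},
]

def is_complete(record):
    """A record is a complete case when none of its values is missing."""
    return all(value is not None for value in record.values())

# Split the data the way a complete-case analysis implicitly does.
complete_cases = [r for r in records if is_complete(r)]
incomplete_cases = [r for r in records if not is_complete(r)]

print([r["id"] for r in complete_cases])    # records kept for analysis
print([r["id"] for r in incomplete_cases])  # records that would be ignored
```

Here half of the records would be dropped even though each of them is missing only one value.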
Removing incomplete cases is a simple approach to the missing values problem. But when a substantial share of the values is missing in a systematic way, removing incomplete records can yield incorrect results. The remaining “good” data are no longer an unbiased representation of the population, and inferences drawn from this sample may not hold true for the population.
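A small made-up example shows how this can happen. The income figures and the rule that high incomes go unreported are assumptions chosen purely to illustrate systematic missingness.

```python
from statistics import mean

# Hypothetical population of incomes (in thousands).
incomes = [20, 25, 30, 35, 40, 60, 80, 100, 150, 200]

# Assumed missingness mechanism: incomes above 90 go unreported.
observed = [x if x <= 90 else None for x in incomes]

# Complete-case analysis keeps only the reported values.
complete_cases = [x for x in observed if x is not None]

true_mean = mean(incomes)
complete_case_mean = mean(complete_cases)

print(true_mean, round(complete_case_mean, 1))
```

Because the missing values are all at the high end, the mean of the remaining records falls well below the mean of the full population.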
There are simple ways of addressing incomplete data. Replacing each missing value with the mean of the observed values for that variable is a common remedy. This allows the record's other, non-missing values to be used in an analysis. In this single imputation approach, however, the uncertainty about the imputed values is not taken into account.
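As a rough sketch of mean imputation and its side effect, again in Python with made-up numbers:

```python
from statistics import mean, stdev

# Hypothetical variable; None marks a missing value.
values = [10, None, 14, 18, None, 22]

observed = [v for v in values if v is not None]
fill = mean(observed)

# Single imputation: every missing entry gets the same value.
imputed = [v if v is not None else fill for v in values]

print(imputed)
print(round(stdev(observed), 2), round(stdev(imputed), 2))
```

The imputed series has a smaller standard deviation than the observed values alone, because the filled-in entries sit exactly on the mean. An analysis run on it would treat those entries as if they were known with certainty.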
SAS provides an alternative that is more robust, though more complex, for populating missing values. The procedure is called MI and performs multiple imputation of missing data. Rather than using a single value as a proxy, the methodology draws on a set of plausible values and their underlying distribution. This random sample of plausible values for each missing value lets standard statistical methods, including confidence intervals, reflect the uncertainty about the missing value.
The multiple imputation approach can be summarised in three steps:
1. The missing data are filled in n times to generate n complete data sets.
2. Each of the n complete data sets is analysed using standard statistical analyses.
3. The results of the n analyses are combined to perform statistical inference.
This approach is generally preferable to single imputation because it takes into account the uncertainty associated with the missing values.
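The three steps can be sketched in plain Python. This is an illustration under stated assumptions, not a description of what the MI procedure does internally: the data are made up, the variable is assumed to be normally distributed with a noninformative prior, the function names are hypothetical, and the results are combined with Rubin's rules, the usual way of pooling multiply imputed analyses.

```python
import random
from statistics import mean, variance

def impute_once(values, rng):
    """Step 1: fill in the missing entries (None) with random draws."""
    observed = [v for v in values if v is not None]
    n = len(observed)
    y_bar, s2 = mean(observed), variance(observed)
    # Assumed model: normal data with a noninformative prior. Drawing the
    # parameters first carries their uncertainty into the imputed values.
    chi2 = rng.gammavariate((n - 1) / 2, 2)   # chi-square with n-1 d.f.
    sigma2 = (n - 1) * s2 / chi2
    mu = rng.gauss(y_bar, (sigma2 / n) ** 0.5)
    return [v if v is not None else rng.gauss(mu, sigma2 ** 0.5) for v in values]

def analyse(completed):
    """Step 2: a standard analysis, here the mean and its sampling variance."""
    return mean(completed), variance(completed) / len(completed)

def pool(results):
    """Step 3: combine the separate analyses using Rubin's rules."""
    m = len(results)
    estimates = [q for q, _ in results]
    within = mean([u for _, u in results])   # average sampling variance
    between = variance(estimates)            # variance across imputations
    return mean(estimates), within + (1 + 1 / m) * between

# Hypothetical variable; None marks a missing value.
values = [12.1, None, 9.8, 11.4, None, 10.6, 13.0, None, 10.9, 11.7]
rng = random.Random(2014)                    # fixed seed for repeatability
completed_sets = [impute_once(values, rng) for _ in range(5)]
results = [analyse(c) for c in completed_sets]
pooled_mean, total_variance = pool(results)
print(pooled_mean, total_variance ** 0.5)    # pooled estimate and standard error
```

The pooled variance adds the spread between the imputed data sets to the average within-data-set variance, so the standard error reflects the fact that the filled-in values are not known.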
While this methodology takes more computing time, that is no longer a significant deterrent, since modern computers have enough processing speed to handle such tasks.
Roberto

2 Comments
  1. kenoconnordataconsultant

    Hi Roberto,
    Interesting approach, which I assume you recommend for use with numerical values.
    What, if anything, do you recommend when dealing with missing data such as phone numbers?
    Rgds Ken

  2. Roberto Garrido

    Phone numbers are a totally different class of variable. Replacing missing phone numbers would be an exercise in looking for other data that give some indication of the number, for example the phone numbers in use in an area or a zip code. The MI procedure is not just about filling in values based on probability; it shines brightest when the imputed values become part of a predictive model.
