Saturday, November 26, 2011

Handling NULLs and NAs

Real-world data almost always contains missing or blatantly incorrect values.

This becomes a painful issue when building predictive models. While there are multiple ways of imputing data, it is difficult to tell whether any of them is doing a good enough job. To make matters worse, the rows with missing data might not be missing at random. For example, all incomes above a certain threshold might be deliberately set to NA to preserve anonymity, and the model developer might not be aware of this censoring. Imputing such data with any central measure will not only fail to capture this bit of information, it will actually make the predictions worse.

A similar encoding might be present when a column has values outside its natural limits. For example, a column that records the number of questions answered out of 5 in a test might use the value -1 to mark absentees.

In the worst case, a model developed by dropping the offending column altogether might perform better than one trained on the imputed data-set.

In most cases, we can do better.



Ensemble techniques
Dealing with incomplete data is one of the strengths of ensemble techniques: one can afford some noise in the data-set and still get fairly good predictors which generalize well. Some of these techniques also handle very high dimensional data well; random forests, for example, train each tree on a random subspace of the dimensions. Sub-sampling the data while carefully avoiding the (row, column) pairs which contain missing values springs to mind as a possible solution.

At least in R, the default implementations of these techniques are sensitive to missing values. For example, the randomForest package does not allow the input data to contain any NAs. One way to deal with this is to drop the incomplete rows. It will work, but it means we are not using the entire data-set.
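
For instance, a minimal sketch of the problem, assuming a data frame df with an outcome column y and an Income column containing NAs (all three names are placeholders):

    library(randomForest)

    # randomForest does not accept NAs in the input, so this call stops
    # with an error:
    # rf <- randomForest(y ~ Income, data = df)

    # The blunt workaround is to keep only the complete rows...
    rf <- randomForest(y ~ Income, data = df[complete.cases(df), ])
    # ...which works, but silently throws away part of the data-set.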

Instead, the next time you encounter such a data-set (with missing values):


Users    Income
1        10
2        NA
3        1000
4        NA


    Add another column which indicates whether the actual value was missing (or incorrect) and then replace the missing (or incorrect) values with anything (mean, median, zero, any constant, etc.).


    Users    Income    IncomeNA
    1        10        0
    2        505       1
    3        1000      0
    4        505       1

    And then use this modified data-set (the NAs here have been replaced by the mean, 505) as input for training the model.
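
    In R, this step might look as follows; a minimal sketch, assuming the data frame is called df and using the mean as the filler value:

        df <- data.frame(Users = 1:4, Income = c(10, NA, 1000, NA))

        # First remember which rows were missing, then fill the gaps with
        # any constant (the mean, 505, here; the median or zero work too).
        df$IncomeNA <- as.numeric(is.na(df$Income))
        df$Income[is.na(df$Income)] <- mean(df$Income, na.rm = TRUE)

        df
        #   Users Income IncomeNA
        # 1     1     10        0
        # 2     2    505        1
        # 3     3   1000        0
        # 4     4    505        1

        # Both Income and IncomeNA then go into the model, e.g.
        # randomForest(y ~ Income + IncomeNA, data = df) once an outcome
        # column y is present.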

    The intuition
    An intuitive idea of why this might work: if there is some information associated with the missing data, then the IncomeNA column will become a significant predictor and will overshadow the effect of the (constant) imputed values. For example, if we were training a glm model, the coefficient of IncomeNA plays no role for rows where the data is available, since IncomeNA is 0 there. For rows with missing data, however, the coefficient adjusts against the imputed constant and provides the best prediction it can. Note that glm itself is not an ensemble method, but it makes the effect of the added column easier to interpret.
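
    To make this concrete, a hedged sketch with glm, reusing df from above and assuming an outcome column y:

        fit <- glm(y ~ Income + IncomeNA, data = df)

        # The fitted model is, roughly,
        #   y_hat = b0 + b1 * Income + b2 * IncomeNA
        #
        # For rows where Income was observed, IncomeNA = 0, so the b2 term
        # vanishes and the prediction tracks the real Income value.
        # For rows where Income was missing, Income is the imputed constant
        # (505 above), so the prediction collapses to b0 + 505 * b1 + b2:
        # a single learned offset for "income unknown", regardless of which
        # constant we happened to impute.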


    Conclusion
    This is a fairly general method of dealing with missing data, but it does have some shortcomings.

    Data could be missing for more than one reason (e.g. two-sided censoring), in which case a single indicator column will not do much to improve the performance over other kinds of imputation.

    Also, in the case of incorrect values, more than one encoding might be in play, a different sentinel for each reason. To extend the method to this case, we can add one extra indicator column per incorrect value (if they are discrete) to account for the different reasons.
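
    For example, a sketch for the test-score column from earlier, assuming the column is called Answered and that a second sentinel, -2, marks lost answer sheets (the name and the second sentinel are made up for illustration):

        # Answered holds the number of questions answered out of 5;
        # -1 marks absentees and -2 marks lost answer sheets.
        df$AnsweredAbsent <- as.numeric(df$Answered == -1)
        df$AnsweredLost   <- as.numeric(df$Answered == -2)

        # Replace both sentinels with a constant from the valid range.
        valid <- df$Answered >= 0
        df$Answered[!valid] <- median(df$Answered[valid])

        # The model then sees Answered, AnsweredAbsent and AnsweredLost.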

      Hope that helps you in dealing with your next real-world data-set.

      5 comments:

      MK said...

      Interesting.

      What if a categorical variable has missing values?

      musically_ut said...

      The simplest thing would be creating a new category which stands for the missing value. Or, if we know more about the reason for omission of those values, then one new category for each reason.

      Whether one would need a new numerical column to indicate missing values (the IncomeNA column) in this case depends on the technique being used to create the model. If it is a hybrid technique which handles numerical values as well as categorical values, the extra column will probably help. However, if the technique deals with only categorical data (e.g. decision trees) then the second column will be useless.
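
      A hedged sketch in R, assuming a factor column called City with NAs (the name is made up):

          # addNA() turns NA into an explicit level of its own, so "missing"
          # becomes just another category the model can split on.
          df$City <- addNA(df$City)
          levels(df$City)[is.na(levels(df$City))] <- "Missing"  # optional readable label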

      esha said...

      Very interesting! In fact this is the method that I usually use. However, I have some questions/comments:

      1. In case of continuous variables, would it give better results if we imputed the missing values with a user-defined value (say, the median)?

      2. In R is there a way of generating these "missing flags" in any package; because I have been creating them manually so far.

      3. A comment about the zeros: there can be two types of zeros, either they are missing from the data or the variable itself can take the value zero. In this case can we model the two separately, esp. in case of a binary outcome variable (something along the lines of a negative binomial distribution?)

      musically_ut said...

      Hi esha,

      Thanks for the comment.

      To answer your questions 1 and 3: for a continuous variable, any user-defined value should do for the imputation. The basic idea is to make the data-set NA-free while preserving the information that certain rows had NAs. Note that I am not overwriting the original values with 0 or 1; instead, I am adding a new column, IncomeNA, and I include both the original (imputed) data and this extra column in the models later.

      Also, the missing values should be explicitly marked in the input data. While reading from an SQL database, they'll probably arrive as NULLs. While reading from a csv in R, one can get NAs recognized by properly setting na.strings in read.csv.
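
      For example (the file name and the extra strings treated as missing are assumptions):

          # Empty strings and the -1 sentinel are read in as NA directly.
          df <- read.csv("incomes.csv", na.strings = c("NA", "", "-1"))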

      Hope those answer your questions?

      ~
      Ut.

      esha said...

      Thanks a lot... That really helps!!