Saturday, November 26, 2011

Handling NULLs and NAs

Real-world data almost always contains missing and blatantly incorrect values.

This becomes a painful issue when building predictive models. While there are multiple ways of imputing data, it is difficult to tell whether any of them is doing a good enough job. To make matters worse, the rows with missing data might not be missing at random. For example, all incomes above a certain threshold might be deliberately set to NA to preserve anonymity, and the model developer might not be aware of this censoring. Imputing such a column with a central measure (mean, median, or mode) will not only fail to capture the censoring, it will actively make predictions worse: every high earner gets assigned a middling income.
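To see why a central-measure fill erases the censoring signal, here is a minimal pandas sketch; the income column, the 60,000 threshold, and the synthetic distribution are all assumptions for illustration. Keeping an explicit missingness flag at least preserves the fact that these rows were special.

```python
import numpy as np
import pandas as pd

# Synthetic incomes; the distribution and threshold are made up.
rng = np.random.default_rng(0)
income = rng.lognormal(mean=10, sigma=1, size=1000)

# Censoring for anonymity: every high income becomes NA.
censored = income.copy()
censored[censored > 60000] = np.nan
df = pd.DataFrame({"income": censored})

# Record which rows were NA *before* filling, so the model can still
# see the censoring pattern, then fill with a central measure.
df["income_was_na"] = df["income"].isna().astype(int)
df["income"] = df["income"].fillna(df["income"].median())
```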

Similar encoding might be lurking wherever a column holds values outside its natural limits. For example, a column recording the number of questions answered out of 5 on a test might use -1 to mark absentees.
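A sketch of recoding such a sentinel, assuming a hypothetical "answered" column scored 0 to 5 with -1 for absentees:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"answered": [3, 5, -1, 0, 4, -1, 2]})

# Flag the sentinel before wiping it out, then convert it to a proper
# missing value so it cannot masquerade as a very low score.
df["absent"] = (df["answered"] == -1).astype(int)
df["answered"] = df["answered"].replace(-1, np.nan)

# Anything still outside the natural 0-5 range is suspect.
assert df["answered"].dropna().between(0, 5).all()
```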

In the worst case, a model built after dropping the offending column entirely might outperform one trained on the imputed data set.
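Which strategy wins on a given data set is an empirical question, so one sanity check worth running is a cross-validated comparison of the two. A sketch with scikit-learn follows; the helper name, the synthetic data, and the assumption that only one column contains NaNs are all illustrative.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

def compare_drop_vs_impute(X, y, col):
    """Mean CV R^2 with column `col` dropped vs mean-imputed.

    Assumes `col` is the only column containing NaNs.
    """
    dropped = np.delete(X, col, axis=1)
    drop_score = cross_val_score(LinearRegression(), dropped, y, cv=5).mean()

    pipe = make_pipeline(SimpleImputer(strategy="mean"), LinearRegression())
    impute_score = cross_val_score(pipe, X, y, cv=5).mean()
    return drop_score, impute_score

# Tiny synthetic demo: censor the first feature above a threshold.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = X @ np.array([2.0, 1.0, -1.0]) + rng.normal(size=500)
X[X[:, 0] > 1.0, 0] = np.nan  # informative, non-random missingness
print(compare_drop_vs_impute(X, y, col=0))
```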

In most cases, we can do better.
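One such improvement is to impute and also hand the model a binary flag per feature marking which values were originally missing, so the missingness pattern itself becomes a feature. scikit-learn's SimpleImputer supports this via add_indicator=True; the data below is made up for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 2 * X[:, 0] + X[:, 1] + rng.normal(size=200)
X[X[:, 0] > 1.0, 0] = np.nan  # censored, hence informative, NAs

# add_indicator=True appends one binary column per feature that had
# missing values, letting the regression learn an offset for them.
model = make_pipeline(
    SimpleImputer(strategy="median", add_indicator=True),
    LinearRegression(),
)
model.fit(X, y)
```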