Real-world data almost always contains missing and blatantly incorrect values.
This becomes a painful issue when building predictive models. There are multiple ways of imputing data, but it is difficult to tell whether any of them is doing a good enough job. To make matters worse, the values might not be missing at random. For example, all incomes above a certain threshold might have been deliberately set to NA to preserve anonymity, without the model developer ever being told about this censoring. Imputing such data with any central measure will not only fail to capture that piece of information, it will actually make predictions worse.
A similar encoding problem shows up when a column has values outside its natural limits. For example, a column that records the number of questions answered out of 5 on a test might use the value -1 to mark absentees.
In the worst case, a model built by dropping the offending column entirely might perform better than one trained on the imputed data-set.
In most cases, we can do better.
Handling incomplete data gracefully is one of the advantages of ensemble techniques: they can tolerate some noise in the data-set and still yield fairly good predictors that generalize well. Some of these techniques also cope well with very high-dimensional data. Random forests, for example, train each tree on a random subspace of the dimensions. Sub-sampling the data while carefully avoiding those (row, column) pairs that contain missing values springs to mind as a possible solution.
At least in R, the default implementations of these techniques are sensitive to missing values. For example, the randomForest package does not allow the input data to have any NAs. One way to deal with this is to drop the offending rows. It will work, but it means we are not using the entire data-set.
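Something like the following sketch shows the problem (df and y are placeholder names for your data frame and target column):

```r
library(randomForest)

## With the formula interface this fails outright, because the default
## na.action is na.fail and df contains NAs:
# randomForest(y ~ ., data = df)

## Asking it to drop incomplete rows works, but throws data away:
fit <- randomForest(y ~ ., data = df, na.action = na.omit)
```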
Instead, the next time you encounter such a data-set (with missing values):
Add another column which indicates whether the actual value was missing (or incorrect) and then replace the missing (or incorrect) values with anything (mean, median, zero, any constant, etc.).
And then use this modified data-set as input for training the model.
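A minimal sketch of the recipe in R, assuming a data frame df whose numeric Income column contains NAs (the names are placeholders, and the median is just one of many constants that would do):

```r
## Flag which rows had a missing Income, then fill the holes with a constant.
df$IncomeNA <- as.integer(is.na(df$Income))
df$Income[is.na(df$Income)] <- median(df$Income, na.rm = TRUE)

## df now has no NAs and can be fed to randomForest, glm, or anything else.
```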
An intuitive reason why this might work: if there is information associated with the missingness, the IncomeNA column will become a significant predictor and will override the effect of the constant imputed values. For example, if we were training a glm, the coefficient of IncomeNA plays no role for rows where Income is available, since the indicator is zero there. For rows with missing data, Income is a constant after imputation, so the IncomeNA coefficient adjusts to give the best prediction for those rows. Note that glm itself is not an ensemble method, but it makes the effect of the added column easy to interpret.
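To make that concrete, here is a hedged illustration with a made-up binary outcome called Default; the point is only how the added column enters the model:

```r
fit <- glm(Default ~ Income + IncomeNA, data = df, family = binomial)
summary(fit)
## For rows where Income was observed, IncomeNA is 0 and its coefficient
## contributes nothing; for imputed rows, Income is a constant, so the
## IncomeNA coefficient soaks up whatever signal the missingness carries.
## A significant IncomeNA term hints that the data is not missing at random.
```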
This is a fairly general method of dealing with missing data but it does have some shortcomings.
Data could be missing for more than one reason (e.g. two-sided censoring), in which case a single indicator column will not do much to improve performance over other kinds of imputation.
Similarly, when incorrect values are involved, more than one encoding might be in play, each meaning something different. To handle this case, we can add one extra column for each sentinel value (if they are discrete) to account for the different reasons, as sketched below.
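For example, the test-score column from earlier could be expanded like this (the second code, -2, is made up purely for illustration):

```r
## Suppose Answered holds the number of questions answered out of 5,
## with -1 marking absentees and -2 (hypothetically) marking disqualified candidates.
df$AnsweredAbsent <- as.integer(df$Answered == -1)
df$AnsweredDisq   <- as.integer(df$Answered == -2)
df$Answered[df$Answered < 0] <- 0   # replace the sentinels with any constant
```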
Hope that helps you in dealing with your next real-world data-set.