Abstract:
Since most data-driven systems, including classifiers, require large amounts of
complete data, the task of handling missing data has garnered much attention. When any of the
variables under study in a dataset contains incomplete values, the dataset presents a missing
data problem. Various methods exist in the literature for dealing with missing data, including
complete case analysis, listwise deletion, single imputation and multiple imputation. Of these,
mean imputation remains a favourite among researchers due to its simplicity and ease of use,
despite some glaring flaws. In this paper, we compare Mean imputation with a similar single
imputation method, Group Means imputation, and present our results on nine real-world
datasets with respect to the accuracy of the C5.0 classifier trained on the imputed data. We
show that while Group Means imputation fares better on the training data, the test set
accuracies fall in favour of Mean imputation, which handles novel data considerably better.
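
For illustration only, the sketch below contrasts the two single imputation strategies compared in this paper. It is not the authors' implementation; it assumes a pandas DataFrame with a hypothetical numeric column "feature" containing missing values and a categorical column "label" used as the grouping variable for Group Means imputation.

```python
import numpy as np
import pandas as pd

# Toy dataset: a numeric feature with missing values and a class label.
df = pd.DataFrame({
    "feature": [1.0, 2.0, np.nan, 4.0, np.nan, 6.0],
    "label":   ["a", "a", "a", "b", "b", "b"],
})

# Mean imputation: replace every missing value with the overall column mean.
mean_imputed = df.copy()
mean_imputed["feature"] = df["feature"].fillna(df["feature"].mean())

# Group Means imputation: replace each missing value with the mean of the
# group (here, the class label) that the record belongs to.
group_imputed = df.copy()
group_imputed["feature"] = df.groupby("label")["feature"].transform(
    lambda s: s.fillna(s.mean())
)

print(mean_imputed)
print(group_imputed)
```

In this toy example, mean imputation fills both gaps with the overall mean (3.25), whereas group means imputation fills them with 1.5 and 5.0 respectively, using only records from the same class.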