Clustering-based missing value imputation for data preprocessing
- Publication Type: Conference Proceeding
- 2006 IEEE International Conference on Industrial Informatics, INDIN'06, 2007, pp. 1081 - 1086
Missing value imputation is a practical yet challenging problem in machine learning and data mining. It is a procedure that replaces the missing values in a dataset with plausible values, which are generally generated from the dataset using a deterministic or random method. In this paper we propose a new and efficient missing value imputation method based on data clustering, called CRI (Clustering-based Random Imputation). In our approach, we fill in the missing values of an instance with plausible values generated from the data similar to this instance using a kernel-based random method. Specifically, we first divide the dataset (excluding instances with missing values) into clusters. Each instance with missing values is then assigned to the cluster most similar to it. Finally, the missing values of an instance A are filled in with plausible values generated by applying a kernel-based method to the instances in A's cluster. Our experiments (some of them with the decision tree induction system C5.0) demonstrate the effectiveness of the proposed method on the missing value imputation task. © 2006 IEEE.
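The three steps described in the abstract (cluster the complete instances, assign each incomplete instance to its most similar cluster, then draw plausible values from that cluster with a kernel-based random method) can be sketched roughly as follows. This is only an illustrative sketch, not the authors' implementation: the abstract does not specify the clustering algorithm or the kernel, so plain k-means, a Gaussian kernel, and random donor sampling weighted by that kernel are all assumptions, as are the names `cri_impute`, `kmeans`, `sq_dist`, and the `bandwidth` parameter. Missing values are represented as `None`.

```python
import math
import random


def sq_dist(a, b, dims):
    """Squared Euclidean distance restricted to the feature indices in dims."""
    return sum((a[d] - b[d]) ** 2 for d in dims)


def kmeans(rows, k, rng, iters=20):
    """Plain k-means on complete rows (assumed clustering step); returns centroids."""
    dims = range(len(rows[0]))
    centroids = [list(r) for r in rng.sample(rows, k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for r in rows:
            j = min(range(k), key=lambda c: sq_dist(r, centroids[c], dims))
            groups[j].append(r)
        for j, g in enumerate(groups):
            if g:  # keep the previous centroid if a cluster received no rows
                centroids[j] = [sum(col) / len(g) for col in zip(*g)]
    return centroids


def cri_impute(data, k=2, bandwidth=1.0, seed=0):
    """Hypothetical clustering-based random imputation; None marks a missing value."""
    rng = random.Random(seed)

    # Step 1: cluster only the instances that have no missing values.
    complete = [r for r in data if None not in r]
    centroids = kmeans(complete, k, rng)
    all_dims = range(len(complete[0]))
    members = [[] for _ in range(k)]
    for r in complete:
        j = min(range(k), key=lambda c: sq_dist(r, centroids[c], all_dims))
        members[j].append(r)
    usable = [c for c in range(k) if members[c]]  # ignore empty clusters

    result = []
    for row in data:
        if None not in row:
            result.append(list(row))
            continue
        # Step 2: assign the incomplete instance to the most similar cluster,
        # comparing only the features that are actually observed.
        obs = [d for d, v in enumerate(row) if v is not None]
        best = min(usable, key=lambda c: sq_dist(row, centroids[c], obs))
        cluster = members[best]
        # Step 3 (assumption): Gaussian kernel weights over cluster members,
        # then a random donor drawn in proportion to those weights.
        weights = [math.exp(-sq_dist(row, m, obs) / (2 * bandwidth ** 2))
                   for m in cluster]
        if sum(weights) == 0:  # all weights underflowed; fall back to uniform
            weights = [1.0] * len(cluster)
        donor = rng.choices(cluster, weights=weights, k=1)[0]
        result.append([donor[d] if v is None else v for d, v in enumerate(row)])
    return result


data = [
    [1.0, 1.1], [0.9, 1.0], [1.1, 0.9],        # one group of similar instances
    [10.0, 10.2], [9.9, 10.0], [10.1, 9.8],    # a second, distant group
    [1.0, None], [None, 10.0],                 # instances with missing values
]
filled = cri_impute(data, k=2, seed=0)
```

In this toy dataset, the instance `[1.0, None]` should receive its second value from one of the three instances near `(1, 1)`, and `[None, 10.0]` should receive its first value from one of the instances near `(10, 10)`. Drawing a donor at random, rather than always taking a cluster mean, is what makes this sketch a random imputation in the sense the abstract describes; the exact generation scheme in the paper may differ.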