Comparison of Data Imputation Techniques and their Impact

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the [Original Paper Viewer] below or the Original ArXiv Source.

Missing and incomplete information in surveys or databases can be imputed using different statistical and soft-computing techniques. This paper comprehensively compares auto-associative neural networks (NN), neuro-fuzzy (NF) systems, and hybrid combinations of these methods with hot-deck imputation. The tests are conducted on an eight-category antenatal survey, both directly and under principal component analysis (PCA) conditions. The neural network outperforms the neuro-fuzzy system in all tests by an average of 5.8%, while the hybrid method is on average 15.9% more accurate yet 50% less computationally efficient than the NN or NF systems acting alone. The global impact of the imputed data is assessed with several statistical tests. Although the imputation accuracy is high, the imputed data alters the PCA inter-relationships within the dataset. The standard deviation of the imputed dataset is on average 36.7% lower than that of the actual dataset, which may lead to an incorrect interpretation of the results.


💡 Research Summary

This paper conducts a comprehensive comparative study of three missing‑data imputation approaches: an auto‑associative neural network (AANN), a neuro‑fuzzy (NF) system, and a hybrid method that combines each of the former with hot‑deck (HD) statistical imputation. The authors evaluate the techniques on an eight‑category antenatal survey dataset (12,089 records) and also under principal component analysis (PCA) dimensionality‑reduction conditions.

The AANN is implemented as an auto‑encoder whose input and output layers are identical; a genetic algorithm (GA) optimizes the network by minimizing the reconstruction error, thereby estimating missing values. The NF system uses an Adaptive Neuro‑Fuzzy Inference System (ANFIS) that blends fuzzy‑logic rules with neural‑network learning, allowing both quantitative and qualitative relationships to be captured. The hybrid approach first applies hot‑deck imputation to generate a statistically plausible range for each missing entry (by averaging a set of similar cases) and then uses GA to search within that bounded interval for the value that best reduces the AANN error.
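
The hybrid idea described above can be illustrated with a minimal Python sketch. It is not the authors' implementation, and several pieces are assumptions made only for illustration: a linear PCA projection stands in for the trained AANN, the hot-deck bounds are taken as the minimum and maximum of the k nearest complete records (the paper's exact rule is not given here), and the genetic algorithm is a deliberately simple one with truncation selection, averaging crossover, and Gaussian mutation. All function names and the synthetic data are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_linear_autoassociator(X, n_components):
    """Assumed stand-in for a trained AANN: project onto the top
    principal directions and map back to the input space."""
    mean = X.mean(axis=0)
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    basis = vt[:n_components]  # rows are principal directions

    def reconstruct(x):
        return mean + (x - mean) @ basis.T @ basis

    return reconstruct

def hot_deck_bounds(X_complete, record, missing_idx, k=5):
    """Bounds for the missing feature from the k most similar complete
    records (similarity measured on the observed features only)."""
    observed = [j for j in range(X_complete.shape[1]) if j != missing_idx]
    dists = np.linalg.norm(X_complete[:, observed] - record[observed], axis=1)
    donors = X_complete[np.argsort(dists)[:k], missing_idx]
    return float(donors.min()), float(donors.max())

def ga_impute(reconstruct, record, missing_idx, low, high,
              pop_size=30, generations=40, mutation_scale=0.1):
    """Search [low, high] for the value minimizing the reconstruction error."""
    def error(value):
        x = record.copy()
        x[missing_idx] = value
        return float(np.sum((x - reconstruct(x)) ** 2))

    population = rng.uniform(low, high, size=pop_size)
    for _ in range(generations):
        fitness = np.array([error(v) for v in population])
        # Truncation selection: keep the better half as parents.
        parents = population[np.argsort(fitness)[: pop_size // 2]]
        # Crossover averages random parent pairs; mutation adds Gaussian noise.
        mates = rng.choice(parents, size=(pop_size - len(parents), 2))
        children = mates.mean(axis=1) + rng.normal(
            0.0, mutation_scale * (high - low), size=len(mates))
        population = np.clip(np.concatenate([parents, children]), low, high)
    fitness = np.array([error(v) for v in population])
    return float(population[np.argmin(fitness)])

# Synthetic, strongly correlated data (purely illustrative).
x0 = rng.normal(size=200)
X = np.column_stack([
    x0,
    2.0 * x0 + rng.normal(scale=0.05, size=200),
    -x0 + rng.normal(scale=0.05, size=200),
])

true_record = np.array([0.3, 0.6, -0.3])
record = true_record.copy()
record[1] = np.nan  # mark the second feature as missing

reconstruct = fit_linear_autoassociator(X, n_components=1)
low, high = hot_deck_bounds(X, record, missing_idx=1)
imputed = ga_impute(reconstruct, record, missing_idx=1, low=low, high=high)
print(f"hot-deck bounds: [{low:.3f}, {high:.3f}], imputed: {imputed:.3f}")
```

Restricting the search to the hot-deck interval narrows the space the GA must explore and keeps the estimate close to values seen in similar records. The price is an extra nearest-neighbour pass per missing entry, which is in line with the summary's remark that the hybrid method is more accurate but less computationally efficient.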

Data preprocessing includes binary encoding of categorical variables, outlier detection, normalization to the

