Article
Missing values in deduplication tasks
Published: September 20, 2011
Introduction: Systematic approaches to missing values in record linkage are still lacking. Commonly, ad-hoc solutions are applied, such as treating comparisons that involve unknown underlying attribute values as unequal or as equal. We examine whether leaving out attributes with high ratios of missing values leads to better results than the usual ad-hoc approach when decision trees are used for classification.
Material and Methods: Cancer registry data with 9 attributes are used for empirical comparisons. The assumed missingness scheme is MNAR (missing not at random). The following strategies are considered for the comparison patterns: transforming missing values (NAs) to 0, 0.5, or 1, respectively, and stepwise leaving out the attributes with the largest numbers of missing values. These strategies are compared by determining the minimal number of the most informative training samples that simple decision trees (CART) need to achieve a predetermined level of the so-called F-measure, the harmonic mean of precision and recall. The informativeness of a data item is measured as the entropy of the classification results that 50 randomly sampled decision trees produce for this item. We use R as the working environment. Other missingness schemes such as MAR (missing at random) and adequate imputation techniques are discussed theoretically.
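The strategies and measures described above can be illustrated with a minimal sketch. It is written in Python for illustration only (the study itself used R), and all function names, the `NA` marker, binary match labels, and the base-2 logarithm are assumptions of this sketch rather than details taken from the study.

```python
import math
from collections import Counter

NA = None  # hypothetical marker for a comparison value that is missing


def recode_missing(pattern, fill):
    """Replace every missing comparison value with a fixed number (e.g. 0, 0.5 or 1)."""
    return [fill if v is NA else v for v in pattern]


def drop_sparsest_attributes(patterns, k):
    """Remove the k attributes (columns) that contain the most missing values."""
    n_cols = len(patterns[0])
    na_counts = [sum(row[j] is NA for row in patterns) for j in range(n_cols)]
    # Columns ordered by descending NA count; ties keep their original order.
    worst = set(sorted(range(n_cols), key=lambda j: -na_counts[j])[:k])
    return [[row[j] for j in range(n_cols) if j not in worst] for row in patterns]


def f_measure(true_labels, predicted_labels):
    """Harmonic mean of precision and recall for binary labels (1 = match)."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    if tp == 0:
        return 0.0  # precision and recall are both zero or undefined
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def vote_entropy(votes):
    """Entropy (in bits) of the class labels that a set of trees assigns to one item."""
    n = len(votes)
    return -sum(c / n * math.log2(c / n) for c in Counter(votes).values())


patterns = [[1, NA, 0], [1, NA, NA], [0, NA, 1]]
print(recode_missing(patterns[0], 0.5))       # NA recoded to 0.5
print(drop_sparsest_attributes(patterns, 1))  # second attribute removed
print(vote_entropy([1] * 25 + [0] * 25))      # trees split evenly: maximal disagreement
```

In this sketch, an item on which the 50 trees disagree most has the highest entropy and would be selected first as a training sample, while an item that all trees classify identically has entropy zero.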
Results: The empirical comparisons show that, on our data, the strategy of leaving out all attributes with high numbers of missing values leads to the best results. This means that the information in these attributes is outweighed by the distortion they introduce into the data.
Conclusions: The ad-hoc approach to missing values should be critically questioned. Rather than preserving attributes by default, one should analyze whether the information in an attribute compensates for the ‘impurity’ introduced by its missing values. In future work, we want to provide measures for deciding whether an attribute that contains missing values should be discarded or preserved.