gms | German Medical Science

MAINZ//2011: 56th Annual Meeting of the GMDS and 6th Annual Meeting of the DGEpi

Deutsche Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie e. V.
Deutsche Gesellschaft für Epidemiologie e. V.

September 26-29, 2011, in Mainz

Missing values in deduplication tasks

Meeting Abstract


  • Murat Sariyar - Universitätsmedizin der Johannes Gutenberg-Universität Mainz, Mainz
  • Andreas Borg - Universitätsmedizin der Johannes Gutenberg-Universität Mainz, Mainz

Mainz//2011. 56. Jahrestagung der Deutschen Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie (gmds), 6. Jahrestagung der Deutschen Gesellschaft für Epidemiologie (DGEpi). Mainz, 26.-29.09.2011. Düsseldorf: German Medical Science GMS Publishing House; 2011. Doc11gmds511

doi: 10.3205/11gmds511, urn:nbn:de:0183-11gmds5110

Published: September 20, 2011

© 2011 Sariyar et al.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by-nc-nd/3.0/deed.en). You are free: to Share – to copy, distribute and transmit the work, provided the original author and source are credited.



Introduction: Systematic approaches to handling missing values in record linkage are still lacking. Commonly, ad-hoc solutions are applied, such as treating the comparison value for two unknown underlying attribute values as either unequal or equal. We examine whether leaving out attributes with high ratios of missing values leads to better results than the usual ad-hoc approach when decision trees are used for classification.
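The two options can be illustrated with a minimal Python sketch (the study itself was carried out in R). Comparison patterns are represented here as lists with `None` marking a missing comparison value. The function names, the toy patterns, and the `max_ratio` threshold of 0.5 are illustrative assumptions and are not taken from the study.

```python
from typing import List, Optional, Sequence, Tuple

# One comparison value per attribute; None marks a missing comparison value.
Pattern = List[Optional[float]]


def fill_missing(patterns: Sequence[Pattern], value: float) -> List[List[float]]:
    """Ad-hoc strategy: replace every missing comparison value by a fixed value
    (0.0 = treat as unequal, 1.0 = treat as equal)."""
    return [[value if v is None else v for v in p] for p in patterns]


def missing_ratio(patterns: Sequence[Pattern]) -> List[float]:
    """Fraction of missing comparison values per attribute (column)."""
    n = len(patterns)
    n_attr = len(patterns[0])
    return [sum(p[j] is None for p in patterns) / n for j in range(n_attr)]


def drop_attributes(
    patterns: Sequence[Pattern], max_ratio: float
) -> Tuple[List[Pattern], List[int]]:
    """Alternative: leave out attributes whose missing ratio exceeds max_ratio.
    Returns the reduced patterns and the indices of the attributes kept."""
    ratios = missing_ratio(patterns)
    keep = [j for j, r in enumerate(ratios) if r <= max_ratio]
    return [[p[j] for j in keep] for p in patterns], keep


# Toy comparison patterns for three attributes (hypothetical values).
patterns = [
    [1.0, None, 1.0],
    [0.0, None, 1.0],
    [1.0, 0.0, None],
    [1.0, None, 0.0],
]

as_unequal = fill_missing(patterns, 0.0)   # missing treated as "unequal"
as_equal = fill_missing(patterns, 1.0)     # missing treated as "equal"
reduced, kept = drop_attributes(patterns, max_ratio=0.5)
```

In this toy example only the second attribute exceeds the threshold and is left out. Missing values that remain in the retained attributes would still have to be handled before training a classifier.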

Material and Methods: Cancer registry data with 9 attributes are used for the empirical comparisons. The assumed missingness mechanism is MNAR (missing not at random). The following strategies are applied to the comparison patterns: transforming missing values (NAs) to 0, 0.5, or 1, respectively; and stepwise leaving out the attributes with the largest numbers of missing values. The strategies are compared by determining the minimal number of the most informative training samples that simple decision trees (CART) need in order to achieve a predetermined level of the so-called F-measure, the harmonic mean of precision and recall. The information content of a data item is measured by the entropy of the classifications that 50 randomly sampled decision trees assign to this item. R is used as the working environment. In addition, other missingness mechanisms such as MAR (missing at random) and adequate imputation techniques are discussed on a theoretical level.
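The two quantities named above, the F-measure and the entropy of the committee's classifications, can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the authors' R implementation: the labels 1 (duplicate) and 0 (non-duplicate), the function names, and the example votes are hypothetical, and the votes are written out by hand rather than produced by trained trees.

```python
import math
from collections import Counter
from typing import Sequence


def f_measure(true_labels: Sequence[int], predicted: Sequence[int]) -> float:
    """Harmonic mean of precision and recall for the positive class (1)."""
    pairs = list(zip(true_labels, predicted))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    if tp == 0:
        # No true positives: precision and recall are zero or undefined.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def vote_entropy(votes: Sequence[int]) -> float:
    """Shannon entropy (in bits) of the labels a committee assigns to one item."""
    n = len(votes)
    counts = Counter(votes)
    return sum(-(c / n) * math.log2(c / n) for c in counts.values())


# Hypothetical votes of a committee of 50 trees for three record pairs.
committee_votes = {
    "pair_a": [1] * 50,             # unanimous: entropy 0
    "pair_b": [1] * 25 + [0] * 25,  # evenly split: entropy 1 bit
    "pair_c": [1] * 45 + [0] * 5,   # mostly agreed
}

# Items with the highest entropy are the most informative training candidates.
ranked = sorted(
    committee_votes,
    key=lambda k: vote_entropy(committee_votes[k]),
    reverse=True,
)
```

Under this reading, record pairs on which the trees disagree most are added to the training set first, and the F-measure is recomputed until the predetermined level is reached.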

Results: The empirical comparisons show that, on our data, leaving out all attributes with high numbers of missing values leads to the best results. This means that the information in these attributes is outweighed by the distortion they introduce into the data.

Conclusions: The ad-hoc approach to missing values should be critically questioned. Rather than simply preserving attributes, one should analyze whether the information in an attribute compensates for the ‘impurity’ introduced by its missing values. In future work, we want to provide measures for deciding whether an attribute that contains missing values should be discarded or preserved.

