gms | German Medical Science

67. Jahrestagung der Deutschen Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie e. V. (GMDS), 13. Jahreskongress der Technologie- und Methodenplattform für die vernetzte medizinische Forschung e. V. (TMF)

21.08. - 25.08.2022, online

Data Quality Analysis of Input Data – Is a Dataset a Valid Subset of a Database Instance?

Meeting Abstract

  • Tobias Brix - Institute of Medical Informatics, University of Münster, Münster, Germany
  • Philipp Regier - Institute of Computer Science, University of Münster, Münster, Germany
  • Ludger Becker - Institute of Computer Science, University of Münster, Münster, Germany
  • Julian Varghese - Institute of Medical Informatics, University of Münster, Münster, Germany

Deutsche Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie. 67. Jahrestagung der Deutschen Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie e. V. (GMDS), 13. Jahreskongress der Technologie- und Methodenplattform für die vernetzte medizinische Forschung e.V. (TMF). sine loco [digital], 21.-25.08.2022. Düsseldorf: German Medical Science GMS Publishing House; 2022. DocAbstr. 53

doi: 10.3205/22gmds002, urn:nbn:de:0183-22gmds0025

Published: 19 August 2022

© 2022 Brix et al.
This is an Open Access article distributed under the terms of the Creative Commons Attribution 4.0 License. See license information at http://creativecommons.org/licenses/by/4.0/.


Text

Introduction: Retrospective studies are based on analyses of patient data from routine care, which is stored in the hospital’s electronic medical record (EMR). Physicians specify inclusion criteria and define data items of interest. Then, computer scientists programmatically detect suitable patients and export the requested data. Due to unstructured data, inclusion criteria cannot always be validated programmatically. In this case, physicians provide manually collected lists, e.g., CSV files, containing key data items such as patient and case IDs or important operation and laboratory dates. These files are the input for the requested data export. Frequently, they contain errors like transposed digits or mismatched dates. The aim of this ongoing work is to develop a tool that checks whether a provided CSV file contains the exact data from the EMR and that can correct transmission errors.

State of the art: Data quality is an important field of research in computer science and medicine [1], [2]. Usually, during the process of data export (ETL), only the source data and the exported data are validated. To the best of our knowledge, an automated validation and correction of input data against a database instance as the first step of the ETL process has not been performed. In computer science, the problem is related to the research field of schema matching [3]: the schemata of two databases are matched, whereby in our case one of the schemata is degenerate, consisting of a single table, i.e., the CSV file.

Concept: The workflow consists of five steps:
(a) Columns of the CSV file are matched with columns of database tables.
(b) All matched tables are joined along foreign keys to generate a valid result set.
(c) Rows of the CSV file are compared to the result set to determine perfect matches and flawed items.
(d) Heuristics detect minor issues, e.g., transposed digits, and provide automated correction suggestions.
(e) The corrected CSV file and detected errors can be exported.
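Step (c) can be illustrated with a minimal sketch: each CSV row that occurs verbatim in the joined database result set is a perfect match, and every other row is flagged as a flawed item. All class, method, and column names below are hypothetical and not taken from the actual prototype.

```java
import java.util.*;

// Illustrative sketch of workflow step (c): classifying CSV rows against the
// joined database result set. Names are invented for illustration.
public class RowMatcher {

    // A row found verbatim in the result set is a perfect match; all other
    // rows are flawed items that later pass through the correction heuristics.
    public static Map<String, List<List<String>>> classify(
            List<List<String>> csvRows, Set<List<String>> resultSet) {
        Map<String, List<List<String>>> out = new HashMap<>();
        out.put("perfect", new ArrayList<>());
        out.put("flawed", new ArrayList<>());
        for (List<String> row : csvRows) {
            out.get(resultSet.contains(row) ? "perfect" : "flawed").add(row);
        }
        return out;
    }

    public static void main(String[] args) {
        Set<List<String>> db = Set.of(
                List.of("P001", "2022-03-14"),   // patient ID, operation date
                List.of("P002", "2022-05-02"));
        List<List<String>> csv = List.of(
                List.of("P001", "2022-03-14"),   // exact match
                List.of("P002", "2022-05-20")); // mismatched date -> flawed
        Map<String, List<List<String>>> res = classify(csv, db);
        System.out.println(res.get("perfect").size() + " perfect, "
                + res.get("flawed").size() + " flawed"); // prints: 1 perfect, 1 flawed
    }
}
```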

Implementation: The current prototype is implemented in Java Spring as a web application. It supports all five steps of the conceptual workflow. In step (a), a comparison of column names, datatypes, and data values provides a suggestion mechanism for the manual column matching. Step (d) calculates the Levenshtein distance to suggest the best-matching alternatives when flawed items are detected [4].
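The correction heuristic of step (d) can be sketched as follows: the classic dynamic-programming Levenshtein distance [4] ranks candidate values from the database, and the closest candidate is suggested as the correction. The class and method names are illustrative, not the prototype's API.

```java
// Sketch of step (d): suggesting the closest database value for a flawed item
// via the standard Levenshtein edit distance (dynamic programming).
public class LevenshteinSuggester {

    // d[i][j] = minimum number of insertions, deletions, and substitutions
    // needed to turn the first i chars of a into the first j chars of b.
    public static int distance(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++)
            for (int j = 1; j <= b.length(); j++) {
                int sub = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                                   d[i - 1][j - 1] + sub);
            }
        return d[a.length()][b.length()];
    }

    // Suggest the candidate with the smallest edit distance to the flawed item.
    public static String bestMatch(String flawed, java.util.List<String> candidates) {
        return candidates.stream()
                .min(java.util.Comparator.comparingInt(c -> distance(flawed, c)))
                .orElseThrow();
    }

    public static void main(String[] args) {
        // A patient ID with transposed digits is closest to the correct value.
        System.out.println(bestMatch("12354",
                java.util.List.of("12345", "99999"))); // prints 12345
    }
}
```

Note that plain Levenshtein counts a transposition as two edits (one deletion plus one insertion, or two substitutions); a Damerau variant would count it as one, which may rank transposed-digit IDs even better.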

Lessons learned: The main challenge of our approach lies in workflow step (b). If matched tables are not directly linked via foreign-key constraints, additional tables must be determined to establish an unambiguous linkage. A brute-force approach using Cartesian products over all tables is not feasible for EMR systems with hundreds of tables and billions of rows. Therefore, we use a Steiner tree algorithm [5]: tables represent nodes and foreign-key constraints represent weighted edges in the graph, and the initially matched tables are the terminals of the Steiner tree. Determining the Steiner tree is equivalent to finding a valid join order. Multiple trees, incorporating additional tables as terminals, are computed to find alternative join paths. In addition, edge weights can be adjusted manually to customize the calculation even further.
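The idea behind step (b) can be sketched with the Dreyfus-Wagner algorithm [5] on a tiny invented schema: tables are nodes, foreign-key constraints are weighted edges, the matched tables are terminals, and the minimum Steiner tree pulls in the additional linking tables. The schema and all names below are hypothetical; the prototype's actual graph model may differ.

```java
import java.util.*;

// Sketch of the join-path search in step (b): a minimum Steiner tree over the
// schema graph yields a valid join order. Dreyfus-Wagner dynamic programming:
// dp[mask][v] = cost of a minimum tree spanning terminal subset `mask`, rooted at v.
public class JoinPathFinder {

    static final int INF = Integer.MAX_VALUE / 4;

    // w: symmetric adjacency matrix of edge weights (INF = no foreign key,
    // w[i][i] = 0). terminals: indices of the matched tables.
    // Returns the total edge weight of a minimum Steiner tree.
    public static int steinerCost(int[][] w, int[] terminals) {
        int n = w.length, k = terminals.length;

        // Metric closure via Floyd-Warshall (all-pairs shortest paths).
        int[][] dist = new int[n][n];
        for (int i = 0; i < n; i++) dist[i] = w[i].clone();
        for (int m = 0; m < n; m++)
            for (int i = 0; i < n; i++)
                for (int j = 0; j < n; j++)
                    dist[i][j] = Math.min(dist[i][j], dist[i][m] + dist[m][j]);

        int[][] dp = new int[1 << k][n];
        for (int[] row : dp) Arrays.fill(row, INF);
        for (int t = 0; t < k; t++)
            for (int v = 0; v < n; v++)
                dp[1 << t][v] = dist[terminals[t]][v];

        for (int mask = 1; mask < (1 << k); mask++) {
            // Merge two subtrees sharing root v.
            for (int v = 0; v < n; v++)
                for (int sub = (mask - 1) & mask; sub > 0; sub = (sub - 1) & mask)
                    dp[mask][v] = Math.min(dp[mask][v], dp[sub][v] + dp[mask ^ sub][v]);
            // Re-root along shortest paths (valid because dist is metric).
            for (int v = 0; v < n; v++)
                for (int u = 0; u < n; u++)
                    dp[mask][v] = Math.min(dp[mask][v], dp[mask][u] + dist[u][v]);
        }

        int best = INF;
        for (int v = 0; v < n; v++) best = Math.min(best, dp[(1 << k) - 1][v]);
        return best;
    }

    public static void main(String[] args) {
        // Hypothetical schema: PATIENT(0)-CASE(1), CASE-LAB(2), CASE-OPERATION(3).
        int[][] w = new int[4][4];
        for (int[] row : w) Arrays.fill(row, INF);
        for (int i = 0; i < 4; i++) w[i][i] = 0;
        w[0][1] = w[1][0] = 1;  // PATIENT - CASE foreign key
        w[1][2] = w[2][1] = 1;  // CASE - LAB foreign key
        w[1][3] = w[3][1] = 1;  // CASE - OPERATION foreign key
        // Matched tables: PATIENT, LAB, OPERATION. CASE is not matched but is
        // pulled in as the Steiner node that establishes the linkage.
        System.out.println(steinerCost(w, new int[]{0, 2, 3})); // prints 3
    }
}
```

Dreyfus-Wagner is exponential only in the number of terminals (O(3^k) subset merges), which stays small here because only the few matched tables are terminals, while the schema itself may have hundreds of tables.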

The authors declare that they have no competing interests.

The authors declare that an ethics committee vote is not required.


References

1. Kahn MG, Brown JS, Chun AT, Davidson BN, Meeker D, Ryan PB, et al. Transparent reporting of data quality in distributed data networks. EGEMS (Washington, DC). 2015;3(1):1052.
2. Weiskopf NG, Weng C. Methods and dimensions of electronic health record data quality assessment: enabling reuse for clinical research. Journal of the American Medical Informatics Association: JAMIA. 2013;20(1):144–51.
3. Rahm E, Bernstein PA. A survey of approaches to automatic schema matching. The VLDB Journal. 2001;10(4):334–50.
4. Levenshtein VI. Binary codes capable of correcting deletions, insertions, and reversals. Soviet Physics Doklady. 1966;10(8):707–10.
5. Dreyfus SE, Wagner RA. The Steiner problem in graphs. Networks. 1971;1(3):195–207.