gms | German Medical Science

65th Annual Meeting of the German Association for Medical Informatics, Biometry and Epidemiology (GMDS), Meeting of the Central European Network (CEN: German Region, Austro-Swiss Region and Polish Region) of the International Biometric Society (IBS)

06.09. - 09.09.2020, Berlin (online conference)

Fitness for use of EHR data: lessons learned from a study using histology data

Meeting Abstract

  • Adrian Richter - Institut für Community Medicine, Universitätsmedizin Greifswald, Greifswald, Germany
  • Jean-Francois Chenot - Institut für Community Medicine, Universitätsmedizin Greifswald, Greifswald, Germany
  • Elizabeth Sierocinski - Institut für Community Medicine, Universitätsmedizin Greifswald, Greifswald, Germany
  • Carsten Oliver Schmidt - Universität Greifswald, Greifswald, Germany

Deutsche Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie. 65th Annual Meeting of the German Association for Medical Informatics, Biometry and Epidemiology (GMDS), Meeting of the Central European Network (CEN: German Region, Austro-Swiss Region and Polish Region) of the International Biometric Society (IBS). Berlin, 06.-09.09.2020. Düsseldorf: German Medical Science GMS Publishing House; 2021. DocAbstr. 400

doi: 10.3205/20gmds014, urn:nbn:de:0183-20gmds0141

Published: February 26, 2021

© 2021 Richter et al.
This is an Open Access article distributed under the terms of the Creative Commons Attribution 4.0 License. See license information at http://creativecommons.org/licenses/by/4.0/.


Text

Background: Big data is often understood as real-world data such as electronic health records (EHR) (1). A common hope is that EHR data can be used for better care and personalized medicine. Ideally, data should be accessible and analyzed in real time. Yet routine data are documented in a vast number of proprietary electronic data capturing (EDC) systems that were not designed for research. Much of the information in EHR data is available only as unstandardized plain text, and the underlying data model is non-transparent. Using histological reports as an example, this work assessed the preprocessing required to make EHR data usable for research and the pitfalls involved.

Methods: Data from a population-based cohort, the Study of Health in Pomerania (SHIP), were linked to routine biopsy reports provided by the University Hospital Department of Pathology at the University Medicine Greifswald. The two data sources did not share a key for linkage. Therefore, a distance-based comparison of candidate matches was applied. A total of 322,000 biopsy reports were available from EHR data; of these, 8,599 reports were linked to participants of the SHIP study who had consented to record linkage (~98%). Biopsy reports were classified by affected organ system and biopsy outcome.
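The abstract does not state which distance measure or which identifying fields were used for the linkage. As a minimal sketch of the general idea, assuming string similarity over a few hypothetical fields (`last_name`, `first_name`, `birth_date`) and an arbitrary acceptance threshold, a distance-based comparison of candidate matches could look like this:

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Case-insensitive string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def record_distance(rec_a: dict, rec_b: dict, fields: tuple) -> float:
    """Mean dissimilarity over the compared fields (0.0 means identical)."""
    sims = [similarity(str(rec_a[f]), str(rec_b[f])) for f in fields]
    return 1.0 - sum(sims) / len(sims)


def best_match(report: dict, participants: list, fields: tuple,
               max_distance: float = 0.15):
    """Return (participant, distance) for the closest candidate,
    or None if no candidate is within max_distance.

    The threshold of 0.15 is an illustrative assumption, not a value
    taken from the study.
    """
    scored = [(record_distance(report, p, fields), i)
              for i, p in enumerate(participants)]
    if not scored:
        return None
    dist, idx = min(scored)
    if dist > max_distance:
        return None
    return participants[idx], dist


# Hypothetical example records
FIELDS = ("last_name", "first_name", "birth_date")
participants = [
    {"id": "P1", "last_name": "Schmidt", "first_name": "Anna",
     "birth_date": "1960-04-12"},
    {"id": "P2", "last_name": "Meyer", "first_name": "Karl",
     "birth_date": "1955-11-02"},
]
report = {"last_name": "Schmitt", "first_name": "Anna",
          "birth_date": "1960-04-12"}
match = best_match(report, participants, FIELDS)
```

In practice, candidates near the threshold would likely need manual review, since a single cut-off trades false links against missed links.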

Results: Histological EHR data had been documented in a proprietary EDC system from 2002 onwards. The database export was provided by a vendor as unstructured plain text with keywords indicating the date of the biopsy, the biopsied tissue and the final evaluation. Report length varied considerably, from a few dozen to several hundred words.
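The actual export format and its keywords are not given in the abstract. Purely as an illustration, assuming hypothetical line-initial keywords `DATE:`, `TISSUE:` and `EVALUATION:`, splitting such a plain-text report into fields might be sketched as follows:

```python
import re

# Hypothetical keywords; the real export used vendor-specific ones.
FIELD_PATTERN = re.compile(r"^(DATE|TISSUE|EVALUATION):\s*(.*)$")


def parse_report(raw_text: str) -> dict:
    """Split a keyword-tagged plain-text report into a dict of fields.

    Lines that do not start with a keyword are treated as a
    continuation of the most recent field.
    """
    fields = {}
    current = None
    for line in raw_text.splitlines():
        stripped = line.strip()
        match = FIELD_PATTERN.match(stripped)
        if match:
            current = match.group(1).lower()
            fields[current] = match.group(2).strip()
        elif current and stripped:
            fields[current] = (fields[current] + " " + stripped).strip()
    return fields


example = (
    "DATE: 2004-03-17\n"
    "TISSUE: gastric antrum\n"
    "EVALUATION: chronic gastritis,\n"
    "no evidence of malignancy\n"
)
parsed = parse_report(example)
```

A parser of this kind only works while the keywords stay stable, which is one reason to check field completeness after every new export.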

Machine-based, keyword-driven classification of the affected and targeted organ systems was accurate in 80% of the reports. It failed in longer and more complex reports, particularly when an extensive patient anamnesis had been documented and cross-references to previous reports were mentioned.
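The keyword lists used in the study are not reported. A minimal sketch of keyword-based organ classification, with hypothetical English keyword lists and a deliberately conservative rule that returns `None` whenever the keywords are absent or tie between organ systems, could look like this:

```python
import re

# Hypothetical keyword lists for illustration only.
ORGAN_KEYWORDS = {
    "skin": ("skin", "dermis", "epidermis"),
    "colon": ("colon", "rectum", "sigmoid"),
    "stomach": ("stomach", "gastric", "antrum"),
}


def classify_organ_system(report_text: str):
    """Return the organ system with the most keyword hits.

    Returns None if no keyword occurs or if several organ systems
    tie, so that such reports can be routed to manual reading.
    """
    tokens = re.findall(r"[a-zäöüß]+", report_text.lower())
    counts = {
        organ: sum(tokens.count(kw) for kw in keywords)
        for organ, keywords in ORGAN_KEYWORDS.items()
    }
    best = max(counts.values())
    winners = [organ for organ, n in counts.items() if n == best and n > 0]
    return winners[0] if len(winners) == 1 else None


label = classify_organ_system(
    "Biopsy of the gastric antrum, chronic gastritis."
)
ambiguous = classify_organ_system(
    "Colon biopsy; previous skin excision noted."
)
```

The second call illustrates the failure mode described above: a cross-reference to an earlier finding introduces keywords of another organ system, and a simple count can no longer decide.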

The classification of biopsy outcomes in terms of 1st to 5th malignancies or benign conditions was complicated because longitudinal relations between repeated biopsies of the same patient were not captured by the EDC system. Routine follow-ups of known malignancies made it harder to identify incident events: follow-up reports used wording similar to that of the report of the 1st occurrence of the event. Therefore, longitudinal patient trajectories had to be established based on readers' evaluation.
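In the study, trajectories were established by human readers. The following is only a sketch of how candidates could be pre-sorted for such a reading, assuming hypothetical fields `patient_id`, `biopsy_date`, `organ` and `malignant`, and assuming that the earliest malignant report per patient and organ is the most plausible incident event:

```python
from collections import defaultdict
from datetime import date


def flag_incident_candidates(reports: list) -> list:
    """Label malignant reports as 'incident' or 'follow-up' candidates.

    Within each (patient_id, organ) group, the earliest malignant
    report is labelled 'incident'; later ones are labelled
    'follow-up'. Grouping by organ is an assumption, and the labels
    are suggestions for readers, not final classifications.
    """
    groups = defaultdict(list)
    for report in reports:
        if report["malignant"]:
            groups[(report["patient_id"], report["organ"])].append(report)

    flagged = []
    for group in groups.values():
        group.sort(key=lambda r: r["biopsy_date"])
        for position, report in enumerate(group):
            label = "incident" if position == 0 else "follow-up"
            flagged.append({**report, "candidate": label})
    return flagged


# Hypothetical example data
reports = [
    {"patient_id": "A", "biopsy_date": date(2010, 5, 1),
     "organ": "colon", "malignant": True},
    {"patient_id": "A", "biopsy_date": date(2008, 2, 3),
     "organ": "colon", "malignant": True},
    {"patient_id": "A", "biopsy_date": date(2009, 1, 1),
     "organ": "skin", "malignant": False},
]
flagged = flag_incident_candidates(reports)
```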

A change in the database, i.e. the introduction of a new patient ID, caused a systematic loss of several hundred reports in the initial working database. This flaw was recognized after months of manual reading, through plausibility checks in interim analyses. Had this flaw gone unnoticed, severely biased effect estimates would have resulted.
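The abstract does not describe the plausibility checks that revealed the loss. One simple check of this kind, sketched here under the assumption that report volume per calendar year should be roughly stable, flags years whose report count falls well below the median:

```python
from collections import Counter
from statistics import median


def flag_low_volume_years(report_years: list,
                          min_fraction: float = 0.5) -> list:
    """Return years whose report count is below min_fraction * median.

    Years between the first and last observed year that have no
    reports at all are counted as zero. The 0.5 cut-off is an
    illustrative assumption.
    """
    if not report_years:
        return []
    counts = Counter(report_years)
    years = range(min(counts), max(counts) + 1)
    per_year = [counts.get(year, 0) for year in years]
    typical = median(per_year)
    return [year for year, n in zip(years, per_year)
            if n < min_fraction * typical]


# Hypothetical example: 100 reports per year, except 20 in 2005
example_years = ([2002] * 100 + [2003] * 100 + [2004] * 100
                 + [2005] * 20 + [2006] * 100)
suspicious = flag_low_volume_years(example_years)
```

A check like this can run automatically on every database export, so a systematic loss shows up before manual reading begins rather than months into it.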

Conclusion: The use of EHR histology data for research required substantial preprocessing and the implementation of plausibility checks. Some of these steps may be fully automated and therefore allow faster use of EHR data. Identifying correct longitudinal patient trajectories in versioned and unstandardized databases appears to be a difficult objective.

The authors declare that they have no competing interests.

The authors declare that a positive ethics committee vote has been obtained.


References

1. Dimitrov DV. Medical internet of things and big data in healthcare. Healthcare Informatics Research. 2016;22(3):156-63.