Generating GECCO instance data in HL7 FHIR and openEHR using Synthea
Published: August 19, 2022
Introduction: The COVID-19 pandemic and the scientific activity it triggered have led various initiatives to produce large quantities of research data. To standardize data collection, the Network of University Medicine (NUM) developed the GECCO dataset for uniformly documenting COVID-19 patients [1]. Several applications and projects [2], [3], [4] already use it to gather and provide relevant data. Their continuous and reliable development creates a growing demand for test data in various formats. Since access to real patient data is limited, a framework for flexible data generation (SyntheaGECCO) was created [5]. This work thus presents a novel method for creating test data while avoiding privacy issues.
Methods: The patient data generator Synthea serves as the baseline for this project. It employs demographic data, health statistics, and clinical practice guidelines to generate realistic data in a US health care context. To transform the data created by Synthea according to the GECCO specification, the terminology server Ontoserver [6] and the RxNorm API of the National Institutes of Health [7] were used. The core of the GECCO specification contains 80 data points and 280 associated response options covering all relevant information from admission to discharge of a COVID-19 case. Profiles of GECCO are available in HL7 FHIR R4 and openEHR [1].
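To illustrate how a FHIR terminology server can support such a transformation, the following minimal Python sketch builds a standard `ValueSet/$expand` request for all descendants of a SNOMED CT concept, using the implicit value set URI defined by the FHIR specification. The base URL, the function names, and the example concept (840539006, COVID-19) are assumptions made for illustration; the abstract does not state which requests SyntheaGECCO actually issues.

```python
from __future__ import annotations

from urllib.parse import urlencode

SNOMED_SYSTEM = "http://snomed.info/sct"


def build_expand_url(base_url: str, sctid: str, count: int = 1000) -> str:
    """Build a FHIR ValueSet/$expand URL for a concept and its descendants.

    Uses the implicit SNOMED CT value set '?fhir_vs=isa/<sctid>' from the
    FHIR specification. 'base_url' is a hypothetical server endpoint.
    """
    implicit_value_set = f"{SNOMED_SYSTEM}?fhir_vs=isa/{sctid}"
    query = urlencode({"url": implicit_value_set, "count": count})
    return f"{base_url.rstrip('/')}/ValueSet/$expand?{query}"


def codes_from_expansion(value_set: dict) -> set[str]:
    """Collect the codes listed in ValueSet.expansion.contains of a response."""
    contains = value_set.get("expansion", {}).get("contains", [])
    return {item["code"] for item in contains}
```

The resulting URL can be fetched with any HTTP client, and the JSON response passed to `codes_from_expansion` to obtain the set of codes covered by the concept.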
Results: The data generated by Synthea in HL7 FHIR R4 was analyzed against the requirements of the GECCO specification. Relevant resource instances were extracted from the FHIR bundles using previously created mappings. These mappings were derived from the structure definitions and templates of the respective profiles in HL7 FHIR and openEHR. With the help of Ontoserver, they were expanded according to the polyhierarchy of the SNOMED CT codes they contain. To make use of the medication administrations, which are initially coded in RxNorm, the RxNorm API was employed. In this manner, GECCO-conformant representations could be generated from the synthetic data. The resulting GECCO data instances were validated for both FHIR R4 and openEHR. This process used the FHIR-Bridge [8] to examine the equivalence of the output formats, and the profiles for resource validation.
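The extraction step described above can be sketched as follows. This is a minimal illustration in Python, not the SyntheaGECCO implementation: the mapping structure, the element name `covid19_diagnosis`, and the function names are hypothetical, and the sketch only inspects the `code` element of each resource (other elements, such as `medicationCodeableConcept`, would need their own handling).

```python
from __future__ import annotations

from typing import Iterator

SNOMED = "http://snomed.info/sct"

# Hypothetical mapping: GECCO data element -> accepted (system, code) pairs.
# In practice the code sets would already be expanded by a terminology server.
MAPPING: dict[str, set[tuple[str, str]]] = {
    "covid19_diagnosis": {(SNOMED, "840539006")},
}


def codings(resource: dict) -> Iterator[tuple[str, str]]:
    """Yield (system, code) pairs from the resource's 'code' element."""
    for coding in resource.get("code", {}).get("coding", []):
        yield coding.get("system"), coding.get("code")


def extract(bundle: dict, mapping: dict[str, set[tuple[str, str]]]) -> dict[str, list[dict]]:
    """Group the bundle's resources by the data element whose codes they match."""
    hits: dict[str, list[dict]] = {name: [] for name in mapping}
    for entry in bundle.get("entry", []):
        resource = entry.get("resource", {})
        found = set(codings(resource))
        for name, wanted in mapping.items():
            if found & wanted:  # at least one coding is in the mapping
                hits[name].append(resource)
    return hits
```

Calling `extract(bundle, MAPPING)` on a parsed Synthea bundle returns, for each mapped data element, the list of resources that carry one of its codes; resources without a matching code are ignored.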
Discussion: Using the described process, most of the data elements specified in GECCO could be generated from synthetic data. The exceptions are medical image data and contact data. Validation confirmed that the final data is valid and that both formats provide comparable information content. Although Synthea can generate extensive and detailed patient data for GECCO, some data elements could not be realized. Additionally, the population-based approach to data generation requires large data sets to provide adequate variety in patient histories. As for the standards, HL7 FHIR and openEHR are equally applicable to this use case.
Conclusions: The use of real patient data in secondary use cases, such as software development for the health sector, faces many hurdles. Synthetic data based on statistics avoids these hurdles and thus represents a viable alternative. This project shows that existing technologies simulate sufficiently complex processes to generate enough information for specific use cases such as GECCO.
Acknowledgment: The corresponding author was supported by a scholarship of the Friedrich-Wingert-Stiftung.
The authors declare that they have no competing interests.
The authors declare that an ethics committee vote is not required.
References
1. Sass J, Bartschke A, Lehne M, Essenwanger A, Rinaldi E, Rudolph S, et al. The German Corona Consensus Dataset (GECCO): a standardized dataset for COVID-19 research in university medicine and beyond. BMC Med Inform Decis Mak. 2020;20(1):341. DOI: 10.1186/s12911-020-01374-w
2. Orchestra Cohort. About Orchestra. Orchestra; [Updated 2022-02-18, Accessed 2022-03-24]. Available from: https://orchestra-cohort.eu/
3. Netzwerk Universitätsmedizin. COMPASS Steckbrief. Netzwerk Universitätsmedizin; [Updated 2020-10-04, Accessed 2022-03-24]. Available from: https://num-compass.science/de/compass/steckbrief/
4. Netzwerk Universitätsmedizin. NAPKON - Nationales Pandemie Kohorten Netz. Netzwerk Universitätsmedizin; [Accessed 2022-04-14]. Available from: https://napkon.de/
5. IT Center for Clinical Research - Universität zu Lübeck. Synthea-GECCO. IT Center for Clinical Research; [Accessed 2022-05-24]. Available from: https://github.com/itcr-uni-luebeck/Synthea-Gecco
6. Metke-Jimenez A, Steel J, Hansen D, Lawley M. Ontoserver: a syndicated terminology server. J Biomed Semantics. 2018;9(1):24. DOI: 10.1186/s13326-018-0191-z
7. National Institute of Health. RxNorm API. National Library of Medicine; [Accessed 2022-03-24]. Available from: https://lhncbc.nlm.nih.gov/RxNav/APIs/RxNormAPIs.html
8. EHRbase. FHIR Bridge. GitHub; [Accessed 2022-04-14]. Available from: https://github.com/ehrbase/ehrbase