gms | German Medical Science

67. Jahrestagung der Deutschen Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie e. V. (GMDS), 13. Jahreskongress der Technologie- und Methodenplattform für die vernetzte medizinische Forschung e. V. (TMF)

21.08. - 25.08.2022, online

Providing Publicly Available Medical Data Access under FAIR Principles for Distributed Analysis

Meeting Abstract

  • Macedo Maia - Department of Medical Data Science, University Medical Center Leipzig, Leipzig, Germany
  • Mehrshad Jaberansary - RWTH Aachen University, Aachen, Germany
  • Yeliz Ucer - Department of Data Science and Artificial Intelligence, Fraunhofer FIT, Sankt Augustin, Germany
  • Oya Beyan - Department of Data Science and Artificial Intelligence, Fraunhofer FIT, Sankt Augustin, Germany; Institute for Medical Informatics, University of Cologne, Cologne, Germany
  • Toralf Kirsten - Department of Medical Data Science, University Medical Center Leipzig, Leipzig, Germany

Deutsche Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie. 67. Jahrestagung der Deutschen Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie e. V. (GMDS), 13. Jahreskongress der Technologie- und Methodenplattform für die vernetzte medizinische Forschung e.V. (TMF). sine loco [digital], 21.-25.08.2022. Düsseldorf: German Medical Science GMS Publishing House; 2022. DocAbstr. 203

doi: 10.3205/22gmds071, urn:nbn:de:0183-22gmds0712

Published: August 19, 2022

© 2022 Maia et al.
This is an Open Access article distributed under the terms of the Creative Commons Attribution 4.0 License. See license information at http://creativecommons.org/licenses/by/4.0/.


Text

Introduction: Medical data is often not publicly available for privacy reasons. In the rare cases where anonymised medical data is available, it typically does not follow the well-known FAIR principles [1]. In particular, a lack of interoperability is a significant barrier to reusing medical data for different kinds of analysis. The effort needed to download such data and transform it into a format that can be integrated with one's own data is often enormous and falls on the health scientists interested in using it. Distributed data processing approaches help to analyse publicly open and private data held in different sources at scale.

State of the art: Medical data is available in many different formats, ranging from simple spreadsheets to proprietary formats, and hence is typically not interoperable out of the box. This holds for data published on websites and, in particular, for data provided by data-sharing platforms such as Dryad (https://datadryad.org/). At the same time, the analysis paradigm is shifting from pooling all relevant data locally at a data scientist's institution to keeping data at each data management site and running analyses in a distributed fashion. The Personal Health Train (PHT) [2], [3], [4] is an analysis infrastructure that follows this paradigm and supports analyses ranging from statistics to machine and deep learning.

Concept: Our approach aims to provide publicly available medical data under FAIR principles for distributed analysis. We started by wrapping medical data that is openly available on publishers' websites into a standard format, i.e., HL7 FHIR [5], and providing it through a single access point (AP). This data is managed and made accessible on the GAIA-X cloud infrastructure (https://www.data-infrastructure.eu/GAIAX/, [6]), which aims at transparency, portability, and interoperability based on European values of data and cloud sovereignty. The data can be included in analyses using the PHT as a distributed infrastructure, which sends analysis algorithms to the data access points relevant for the intended analysis.
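The idea of sending the algorithm to the data rather than pooling the data can be illustrated with a minimal Python sketch. It does not use the actual PHT interfaces: the `Station` class, the functions `local_sum_and_count` and `distributed_mean`, and the field name `glucose` are hypothetical and serve only to show that each access point returns aggregates while its records stay local.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Station:
    """Hypothetical data access point; its records are never returned."""
    name: str
    records: List[Dict[str, Optional[float]]]

    def run(self, algorithm: Callable[[List[dict]], dict]) -> dict:
        # The algorithm is executed where the data lives.
        return algorithm(self.records)


def local_sum_and_count(records: List[dict], field: str) -> dict:
    """Compute aggregates only; missing values are skipped."""
    values = [r[field] for r in records if r.get(field) is not None]
    return {"sum": float(sum(values)), "count": len(values)}


def distributed_mean(stations: List[Station], field: str) -> float:
    """Visit every station in turn and combine the partial aggregates."""
    total, count = 0.0, 0
    for station in stations:
        partial = station.run(lambda recs: local_sum_and_count(recs, field))
        total += partial["sum"]
        count += partial["count"]
    if count == 0:
        raise ValueError(f"no values found for field {field!r}")
    return total / count
```

In this sketch a public station and a private station would be passed together, e.g. `distributed_mean([public, private], "glucose")`, and only the per-station sum and count leave each station.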

Implementation: Our approach works in three steps: data conversion, data availability, and FAIR distributed data analysis. Data conversion consists of wrapping data from different formats (e.g., CSV and JSON) into the FHIR structure. Data availability provides public access to the medical data sources, using GAIA-X to keep the data access service available. We treat the data sources in the GAIA-X cloud and the private data sources (which remain under private control) as separate stations. We reuse the PHT to jointly analyse publicly available data, consistently managed in FHIR on the GAIA-X cloud, with private or institutional data that needs to be secured. The distributed analysis setting of the PHT preserves data privacy.
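The data conversion step might look roughly like the following Python sketch, which wraps CSV rows into a FHIR Bundle of Patient and Observation resources. This is not the authors' pipeline: the column names, units, and the `http://example.org/...` code system in `COLUMN_MAP` are placeholders, and a real mapping would presumably use a standard terminology and validate the output against the FHIR specification.

```python
import csv
import io
from typing import Dict, List, Optional

# Hypothetical mapping from CSV columns to Observation codes and units.
CODE_SYSTEM = "http://example.org/fhir/CodeSystem/demo-columns"
COLUMN_MAP = {
    "Glucose": {"code": "glucose", "unit": "mg/dL"},
    "BMI": {"code": "bmi", "unit": "kg/m2"},
}


def row_to_resources(row_id: int, row: Dict[str, Optional[str]]) -> List[dict]:
    """Turn one CSV row into a Patient plus one Observation per mapped column."""
    patient_id = f"patient-{row_id}"
    resources = [{"resourceType": "Patient", "id": patient_id}]
    for column, target in COLUMN_MAP.items():
        raw = row.get(column)
        if raw is None or raw.strip() == "":
            continue  # skip missing values instead of inventing them
        resources.append({
            "resourceType": "Observation",
            "id": f"{patient_id}-{target['code']}",
            "status": "final",
            "code": {"coding": [{"system": CODE_SYSTEM, "code": target["code"]}]},
            "subject": {"reference": f"Patient/{patient_id}"},
            "valueQuantity": {"value": float(raw), "unit": target["unit"]},
        })
    return resources


def csv_to_bundle(csv_text: str) -> dict:
    """Wrap all rows of a CSV document into a single FHIR collection Bundle."""
    reader = csv.DictReader(io.StringIO(csv_text))
    entries = []
    for i, row in enumerate(reader, start=1):
        entries.extend({"resource": r} for r in row_to_resources(i, row))
    return {"resourceType": "Bundle", "type": "collection", "entry": entries}
```

Keeping the mapping in a plain dictionary separates the format mapping from the conversion logic, so the same functions can be reused for another tabular data set by swapping `COLUMN_MAP`.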

Lessons learned: While the approach is relatively simple, the most critical part is building the transformation pipelines. We started with a manual format mapping for breast cancer data sets, including Wisconsin Breast Cancer [7], and for several diabetes data sets available on Kaggle, e.g. [8]. In future, we will also work on methods to semi-automatically transform tabular data structures into FHIR by determining the content of a data file and reusing format mappings that are already available. The PHT allows researchers to include this transformed data in their own analyses, saving effort, time and energy.
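One conceivable way to reuse existing format mappings is to compare the column headers of a new file with the columns each known mapping covers. The Python sketch below is only an assumption about how such a semi-automatic step could work, not the planned method: the registry layout, the function names, the Jaccard similarity, and the threshold of 0.8 are all chosen for illustration.

```python
from typing import Dict, Iterable, Optional, Tuple


def normalise(name: str) -> str:
    """Make column names comparable across files."""
    return name.strip().lower().replace(" ", "_")


def suggest_mapping(
    header: Iterable[str],
    registry: Dict[str, Dict[str, str]],
    threshold: float = 0.8,
) -> Tuple[Optional[str], float]:
    """Return the best-matching known mapping, or None if none is close enough.

    `registry` maps a mapping name to a {column name: target element} dict.
    The score is the Jaccard similarity between the two sets of column names.
    """
    columns = {normalise(c) for c in header}
    best_name, best_score = None, 0.0
    for name, mapping in registry.items():
        known = {normalise(c) for c in mapping}
        if not known:
            continue
        score = len(columns & known) / len(columns | known)
        if score > best_score:
            best_name, best_score = name, score
    if best_score >= threshold:
        return best_name, best_score
    return None, best_score  # fall back to a manual mapping
```

A file whose header does not reach the threshold would still need a manual mapping, which could then be added to the registry for later reuse.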

Acknowledgement: This work is supported by the BMBF project FAIR Data Spaces (FAIRDS14).

The authors declare that they have no competing interests.

The authors declare that a positive ethics committee vote has been obtained.


References

1. Wilkinson M, Dumontier M, Aalbersberg I, et al. The FAIR Guiding Principles for scientific data management and stewardship. Sci Data. 2016;3:160018. DOI: 10.1038/sdata.2016.18
2. Beyan O, Choudhury A, van Soest J, Kohlbacher O, Zimmermann L, Stenzhorn H, Karim MdR, Dumontier M, Decker S, Bonino da Silva Santos LO, Dekker A. Distributed Analytics on Sensitive Medical Data: The Personal Health Train. Data Intelligence. 2020;2:96–107. DOI: 10.1162/dint_a_00032
3. Welten S, Mou Y, Neumann L, Jaberansary M, Yediel Ucer Y, Kirsten T, Decker S, Beyan O. A Privacy-Preserving Distributed Analytics Platform for Health Care Data. Methods Inf Med. 2022 Jan 17. DOI: 10.1055/s-0041-1740564
4. Mou Y, Welten S, Jaberansary M, Ucer Yediel Y, Kirsten T, Decker S, Beyan O. Distributed Skin Lesion Analysis Across Decentralised Data Sources. Stud Health Technol Inform. 2021 May 27;281:352-356. DOI: 10.3233/SHTI210179
5. Ayaz M, Pasha MF, Alzahrani MY, Budiarto R, Stiawan D. The Fast Health Interoperability Resources (FHIR) Standard: Systematic Literature Review of Implementations, Applications, Challenges and Opportunities. JMIR Med Inform. 2021 Jul 30;9(7):e21929. DOI: 10.2196/21929
6. GAIA-X. Policy Rules and Architecture of Standards. Federal Ministry for Economic Affairs and Energy (BMWi); May 2020.
7. Street WN, Wolberg WH, Mangasarian OL. Nuclear feature extraction for breast tumor diagnosis. In: IS&T/SPIE 1993 International Symposium on Electronic Imaging: Science and Technology; 1993 Jan 31 - Feb 5; San Jose, CA. (SPIE conference proceedings; volume 1905). p. 861-870.
8. Smith JW, Everhart JE, Dickson WC, Knowler WC, Johannes RS. Using the ADAP learning algorithm to forecast the onset of diabetes mellitus. In: Proceedings of the Symposium on Computer Applications and Medical Care 1988. IEEE Computer Society Press; 1988. p. 261-265.