gms | German Medical Science

64. Jahrestagung der Deutschen Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie e. V. (GMDS)

Deutsche Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie

08. - 11.09.2019, Dortmund

A clinical test data generator for distributed research networks based on metadata repositories: the bridgehead approach

Meeting Abstract

  • David Juárez - Federated Information Systems, German Cancer Research Center (DKFZ), Heidelberg, Germany; German Cancer Consortium (DKTK), Heidelberg, Germany
  • Torben Brenner - Federated Information Systems, German Cancer Research Center (DKFZ), Heidelberg, Germany; German Cancer Consortium (DKTK), Heidelberg, Germany
  • Jori Kern - Federated Information Systems, German Cancer Research Center (DKFZ), Heidelberg, Germany; German Cancer Consortium (DKTK), Heidelberg, Germany
  • David Croft - Federated Information Systems, German Cancer Research Center (DKFZ), Heidelberg, Germany; German Cancer Consortium (DKTK), Heidelberg, Germany
  • Esther Erika Schmidt - Federated Information Systems, German Cancer Research Center (DKFZ), Heidelberg, Germany; German Cancer Consortium (DKTK), Heidelberg, Germany
  • Martin Lablans - Federated Information Systems, German Cancer Research Center (DKFZ), Heidelberg, Germany; German Cancer Consortium (DKTK), Heidelberg, Germany

Deutsche Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie. 64. Jahrestagung der Deutschen Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie e.V. (GMDS). Dortmund, 08.-11.09.2019. Düsseldorf: German Medical Science GMS Publishing House; 2019. DocAbstr. 296

doi: 10.3205/19gmds046, urn:nbn:de:0183-19gmds0468

Published: 6 September 2019

© 2019 Juárez et al.
This is an Open Access article distributed under the terms of the Creative Commons Attribution 4.0 License. For license information see http://creativecommons.org/licenses/by/4.0/.


Text

Introduction: The development of IT components for distributed research networks (DRN) [1] depends on realistic data for testing purposes [2]. For reasons of data privacy or data sovereignty, the provision of actual data from clinical routine is not always feasible [3]. Our aim was to create a generator of clinical test data conforming to the DRN’s data definitions stored in a central metadata repository (MDR) [4].

Methods: We chose the German Cancer Consortium (DKTK) [5] as an exemplary DRN. For data integration, the DKTK’s Clinical Communication Platform (CCP-IT) relies on a so-called “bridgehead”, which consists of a data warehouse and a connector [6]. The connector provides the platform’s central components with pseudonymized data that conform to the common format of our DRN, as described in a central metadata repository (Samply.MDR) [7]. The bridgehead’s data warehouse is built around the commercial software CentraXX [8]. Within this ecosystem, we created a Java-based test data generator that produces test data in the format expected by CentraXX.

Results: The test data generator features a web interface that allows the user to choose between several predefined data profiles [9] (e.g. oncological data) and to adjust the probability of a data element being created. For enumerated or range values, the interface also allows the user to specify the probability of a concrete value, or of a value within a defined range. The generated test data are then transferred into the bridgehead’s data warehouse. In our current CCP-IT implementation, test data are generated according to the proprietary CentraXX-XSD [10] definition. In a second step, the bridgehead’s existing routines present the data in the format defined by the DRN’s MDR.
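The profile mechanism described above can be illustrated with a minimal Java sketch. It is not taken from tdg-core: the class `DataProfileSketch`, its methods, and the example elements (`sex`, age bands) are hypothetical names chosen for illustration. The sketch assumes that a profile stores, per data element, a probability that the element is created at all, plus relative weights for concrete enumerated values or for integer sub-ranges.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;
import java.util.Random;

/** Hypothetical sketch of profile-driven generation; names are not taken from tdg-core. */
class DataProfileSketch {

    /** Inclusive integer interval, e.g. an age band. */
    static final class IntRange {
        final int min;
        final int max;

        IntRange(int min, int max) {
            if (min > max) {
                throw new IllegalArgumentException("min must not exceed max");
            }
            this.min = min;
            this.max = max;
        }
    }

    /** Decides whether a data element is created at all. */
    static boolean isCreated(double presenceProbability, Random rng) {
        return rng.nextDouble() < presenceProbability;
    }

    /** Picks one key with probability proportional to its non-negative weight. */
    static <T> T pickWeighted(Map<T, Double> weights, Random rng) {
        double total = 0.0;
        for (double w : weights.values()) {
            if (w < 0.0) {
                throw new IllegalArgumentException("negative weight");
            }
            total += w;
        }
        if (total <= 0.0) {
            throw new IllegalArgumentException("weights must sum to a positive value");
        }
        double pick = rng.nextDouble() * total;
        T last = null;
        for (Map.Entry<T, Double> e : weights.entrySet()) {
            if (e.getValue() == 0.0) {
                continue; // zero-weight entries are never selected
            }
            last = e.getKey();
            pick -= e.getValue();
            if (pick < 0.0) {
                return last;
            }
        }
        return last; // reached only through floating-point rounding
    }

    /** Enumerated element: empty if not created, otherwise a weighted permitted value. */
    static Optional<String> generateEnumerated(
            double presence, Map<String, Double> valueWeights, Random rng) {
        if (!isCreated(presence, rng)) {
            return Optional.empty();
        }
        return Optional.of(pickWeighted(valueWeights, rng));
    }

    /** Range element: empty if not created, otherwise uniform inside a weighted sub-range. */
    static Optional<Integer> generateInRange(
            double presence, Map<IntRange, Double> rangeWeights, Random rng) {
        if (!isCreated(presence, rng)) {
            return Optional.empty();
        }
        IntRange r = pickWeighted(rangeWeights, rng);
        return Optional.of(r.min + rng.nextInt(r.max - r.min + 1));
    }

    public static void main(String[] args) {
        Random rng = new Random(42L); // fixed seed so that a test run can be repeated

        Map<String, Double> sex = new LinkedHashMap<>(); // hypothetical enumerated element
        sex.put("female", 0.5);
        sex.put("male", 0.5);

        Map<IntRange, Double> age = new LinkedHashMap<>(); // hypothetical range element
        age.put(new IntRange(18, 49), 0.2);
        age.put(new IntRange(50, 90), 0.8);

        for (int i = 0; i < 5; i++) {
            System.out.println(generateEnumerated(0.9, sex, rng)
                    + " " + generateInRange(0.7, age, rng));
        }
    }
}
```

Each element is drawn independently in this sketch, so it does not reproduce statistical correlations between data elements.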

Discussion: Although created for a specific DRN, the test data generator can in principle generate data for any DRN or data warehouse, given an appropriate configuration via its data profile mechanism. However, designing profiles that generate plausible data reflecting clinical reality is not an easy task. For simplicity, we reduced test data quality to syntactical indicators [11], [12], leaving aside far more challenging criteria such as semantic correctness. Beyond these limitations, the test data generator covers neither negative testing nor statistical correlations, the latter being relevant for analytical components. In our case, creating data profiles was facilitated by relying on the CentraXX table definitions, which are already established in several institutions. On the technical side, the combination of a bridgehead and an MDR offers advantages for integrating test data. Firstly, the usually laborious task of adapting the test data generator to each data warehouse implementation is eased by the bridgehead’s existing data transformation and data loading capabilities. Secondly, integrating the test data generator into a bridgehead in the same way as an ETL (extract-transform-load) process [13] provides an instrument for improving the data quality of the ETL. One difficulty in designing an ETL process is gaining a good understanding of both the source and the target data warehouse. Providing a set of exemplary data in the correct target format thus facilitates the design of the ETL. The test data generator is open source and available online [14], [15].
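To illustrate what a purely syntactical quality check can and cannot detect, the following minimal Java sketch validates a generated record against an XML Schema using the standard `javax.xml.validation` API. The schema `TOY_XSD`, the class `SyntaxCheckSketch` and the element names are hypothetical stand-ins; the proprietary CentraXX-XSD is not reproduced here.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.xml.sax.SAXException;

/** Hypothetical sketch of a syntactic check; the toy schema is not the CentraXX-XSD. */
class SyntaxCheckSketch {

    // Toy schema: one enumerated element and one optional element with a value range.
    static final String TOY_XSD =
          "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
        + "<xs:element name='patient'><xs:complexType><xs:sequence>"
        + "<xs:element name='sex'><xs:simpleType>"
        + "<xs:restriction base='xs:string'>"
        + "<xs:enumeration value='female'/><xs:enumeration value='male'/>"
        + "</xs:restriction></xs:simpleType></xs:element>"
        + "<xs:element name='ageAtDiagnosis' minOccurs='0'><xs:simpleType>"
        + "<xs:restriction base='xs:int'>"
        + "<xs:minInclusive value='0'/><xs:maxInclusive value='120'/>"
        + "</xs:restriction></xs:simpleType></xs:element>"
        + "</xs:sequence></xs:complexType></xs:element>"
        + "</xs:schema>";

    /** Compiles the toy schema; a broken schema is a programming error, not a data error. */
    static Schema loadSchema() {
        try {
            SchemaFactory factory = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
            return factory.newSchema(new StreamSource(new StringReader(TOY_XSD)));
        } catch (SAXException e) {
            throw new IllegalStateException("toy schema could not be compiled", e);
        }
    }

    /** Returns true if the XML document is well-formed and conforms to the toy schema. */
    static boolean isSyntacticallyValid(String xml) {
        Validator validator = loadSchema().newValidator();
        try {
            validator.validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false; // malformed XML or a schema violation
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String ok = "<patient><sex>female</sex><ageAtDiagnosis>63</ageAtDiagnosis></patient>";
        String badEnum = "<patient><sex>unknown</sex></patient>";
        System.out.println(isSyntacticallyValid(ok));
        System.out.println(isSyntacticallyValid(badEnum));
    }
}
```

A check of this kind rejects values outside the permitted enumeration or range, but it accepts any combination of individually valid values, which is why semantic correctness remains a separate and harder criterion.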

The authors declare that they have no competing interests.

The authors declare that an ethics committee vote is not required.


References

1. Qualls LG, Phillips TA, Hammill BG, Topping J, Louzao DM, Brown JS, Curtis LH, Marsolo K. Evaluating Foundational Data Quality in the National Patient-Centered Clinical Research Network (PCORnet®). EGEMS (Wash DC). 2018;6(1):3.
2. El Emam K, Arbuckle L. Anonymizing health data: case studies and methods to get you started. O'Reilly Media; 2013.
3. Vucevic D, Yaddow W. Testing the data warehouse practicum: Assuring data content, data structures and quality. Trafford Publishing; 2012.
4. Kadioglu D. Institutionsübergreifende Nutzung Verteilter Metadata Repositories. Dortmund: Fachhochschule Dortmund; 2013.
5. German Cancer Research Center (DKFZ). DKTK - German Cancer Consortium. [Accessed 22 October 2018]. Available from: https://dktk.dkfz.de/en/home
6. Lablans M, Kadioglu D, Mate S, Leb I, Prokosch HU, Ückert F. Strategies for biobank networks. Classification of different approaches for locating samples and an outlook on the future within the BBMRI-ERIC. Bundesgesundheitsblatt – Gesundheitsforschung – Gesundheitsschutz. 2016;59(3):373–8.
7. Lablans M, Kadioglu D, Muscholl M, Ückert F. Exploiting Distributed, Heterogeneous and Sensitive Data Stocks while Maintaining the Owner's Data Sovereignty. Methods Inf Med. 2015;54(4):346–52.
8. Kairos GmbH. Centraxx. [Accessed 16 July 2019]. Available from: https://www.kairos.de/produkte/centraxx/
9. Naumann F. Data profiling revisited. ACM SIGMOD Record. 2014 Feb 28;42(4):40–9.
10. W3C. XML Schema. [Accessed 16 July 2019]. Available from: https://www.w3.org/XML/Schema
11. Batini C, Scannapieco M. Data and Information Quality: Dimensions, Principles and Techniques. Cham: Springer; 2016.
12. Nonnemacher M, Nasseh D, Stausberg J. Datenqualität in der medizinischen Forschung. Berlin: Medizinisch Wissenschaftliche Verlagsgesellschaft; 2014.
13. Kimball R, Caserta J. The data warehouse ETL toolkit: Practical techniques for extracting, cleaning, conforming, and delivering data. Indianapolis, Ind.: Wiley; 2009.
14. Bitbucket. Test Data Generator Web User Interface - tdg-web. [Accessed 16 July 2019]. Available from: https://bitbucket.org/medicalinformatics/tdg-web
15. Bitbucket. Test Data Generator Web User Interface - tdg-core. [Accessed 16 July 2019]. Available from: https://bitbucket.org/medicalinformatics/tdg-core