GMDS 2014: 59. Jahrestagung der Deutschen Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie e. V. (GMDS)

Deutsche Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie

07. - 10.09.2014, Göttingen

Construction of the Lung Cancer Phenotype Database (LCPD) for the German Center for Lung Research (DZL)

Meeting Abstract

  • D. Firnkorn - Universität Heidelberg, Heidelberg
  • M. Ganzinger - Universität Heidelberg, Heidelberg
  • M. Thomas - Universität Heidelberg, Heidelberg
  • T. Muley - Universität Heidelberg, Heidelberg
  • P. Knaup-Gregori - Universität Heidelberg, Heidelberg

GMDS 2014. 59. Jahrestagung der Deutschen Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie e.V. (GMDS). Göttingen, 07.-10.09.2014. Düsseldorf: German Medical Science GMS Publishing House; 2014. DocAbstr. 315

doi: 10.3205/14gmds061, urn:nbn:de:0183-14gmds0617

Published: 4 September 2014

© 2014 Firnkorn et al.
This is an Open Access article distributed under the terms of the Creative Commons license (http://creativecommons.org/licenses/by-nc-nd/3.0/deed.de). It may be reproduced, distributed and made publicly available, provided that the author and source are credited.


Text

Introduction: Lung cancer is one of the leading causes of cancer-related death worldwide. For the US alone, the cancer statistics of the American Cancer Society estimated 224,210 new cases and 159,260 deaths in both sexes for 2014 [1]. Networks like the German Center for Lung Research (DZL) pursue a translational research approach to combat this disease [2]. The DZL is a federally supported institution that was initiated in November 2011 to examine common lung diseases. Lung Cancer is one of eight disease areas, which are investigated at six locations in Germany [3]. To support translational, cross-linked research in Lung Cancer, adequate IT solutions have to be provided to manage clinical data. Currently, each location has its own database and data model to gather, store and maintain patient-specific data. These differing data models hamper cross-site analyses and make it difficult to gather enough cases for studies that address specific research questions. To overcome this lack of semantic interoperability within the disease area, a central data repository (data warehouse) is needed that provides harmonized data definitions for all participating locations. Within a pilot project, a Lung Cancer Phenotype Database (LCPD) has been constructed to integrate phenotype data for Lung Cancer from Heidelberg, Großhansdorf and Munich. The tasks required to achieve this objective can be summarized in three steps: (i) data harmonization for the data warehouse, (ii) implementation of reusable data transformation and integration tools, and (iii) installation of the central data warehouse platform and data import.

Materials and Methods:

1.
Harmonization process: Creating a target dataset definition for the resulting LCPD is the most crucial and time-consuming step in enabling cross-linked research. We used a Microsoft Excel-based spreadsheet to collect all necessary information from each of the three participating sites. The spreadsheet contains one table per site as well as a table for the general definition of the target dataset. Domain experts from the involved locations selected the parameters that enter the general definition. Afterwards, data managers at each site examined the current state of the source databases and described the agreed parameters with predefined properties to create the target dataset. At this point, we had to define mapping rules that combine, split or simply convert specific parameters from one location so that they can be evaluated together with the corresponding parameters from other locations. This process results in a harmonization table that serves both as documentation and as the basis for semantic interoperability.
2.
Extract-Transform-Load tool development: Based on the harmonization table, which contains the current and the target state, we developed component-based software tools to (i) extract the data from the source systems, (ii) transform them according to our mapping rules and (iii) load the generated target dataset into LCPD. This method is called Extract-Transform-Load (ETL) and was implemented with the freely available IDE Talend Open Studio [4]. Depending on local conditions, the data source can be Excel files, text files or the database systems themselves.
3.
Data Warehouse installation: The purpose of LCPD, besides general import and export functionality, is to manage a central phenotype-based patient cohort for the DZL disease area Lung Cancer. Within LCPD it should be possible to define sub-cohorts via user-defined queries by selecting the respective clinical facts. Because of privacy constraints, we are not allowed to store patient-identifying data together with medical data in LCPD. Instead, LCPD stores pseudonyms, which are needed to re-identify patients at the sites and to enable follow-up updates for patients already in LCPD. The free data warehouse platform i2b2 [5] meets our requirements best and has been used to set up LCPD. i2b2 offers an integrated ontology editor for creating a tree structure (i2b2 ontology) for the phenotype facts.
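The three steps above can be illustrated with a minimal Python sketch. It is not the Talend Open Studio implementation described in the abstract. All field names, code lists and site labels are hypothetical, and the keyed-hash pseudonym is only one possible scheme, assumed here because the abstract does not state how pseudonyms are generated. The sketch shows mapping rules that convert, split and combine source parameters, and a load step that updates existing patients via their pseudonym.

```python
import hashlib
import hmac

# Hypothetical code lists; the real harmonization table is not published here.
SEX_CODES_A = {"m": "male", "f": "female"}
SEX_CODES_B = {1: "male", 2: "female"}

# Mapping rules per site: target parameter -> function of one source record.
MAPPING_RULES = {
    "site_a": {
        "sex": lambda r: SEX_CODES_A[r["geschlecht"]],                   # convert
        "tnm_t": lambda r: r["tnm"].split()[0],                          # split
        "tnm_n": lambda r: r["tnm"].split()[1],                          # split
        "pack_years": lambda r: r["packs_per_day"] * r["years_smoked"],  # combine
    },
    "site_b": {
        "sex": lambda r: SEX_CODES_B[r["sex"]],
        "tnm_t": lambda r: r["t_stage"],
        "tnm_n": lambda r: r["n_stage"],
        "pack_years": lambda r: r["pack_years"],
    },
}


def pseudonymize(site_key: bytes, local_id: str) -> str:
    """Assumed scheme: keyed hash of the site-local ID. The key stays at the
    site, so only the site can recompute the pseudonym for a known patient."""
    digest = hmac.new(site_key, local_id.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


def transform(site: str, record: dict, site_key: bytes) -> dict:
    """Apply the site's mapping rules; the identifying ID is not copied over."""
    target = {name: rule(record) for name, rule in MAPPING_RULES[site].items()}
    target["site"] = site
    target["pseudonym"] = pseudonymize(site_key, record["patient_id"])
    return target


def load(warehouse: dict, rows) -> None:
    """Insert or update rows, keyed by site and pseudonym (follow-up updates)."""
    for row in rows:
        warehouse[(row["site"], row["pseudonym"])] = row


# Made-up example records, one per site.
site_a_rows = [{"patient_id": "A-001", "geschlecht": "f", "tnm": "T2 N1 M0",
                "packs_per_day": 1.5, "years_smoked": 20}]
site_b_rows = [{"patient_id": "B-042", "sex": 1, "t_stage": "T1",
                "n_stage": "N0", "pack_years": 12}]

warehouse = {}
load(warehouse, (transform("site_a", r, b"secret-a") for r in site_a_rows))
load(warehouse, (transform("site_b", r, b"secret-b") for r in site_b_rows))
```

Keeping the rules in a per-site table mirrors the harmonization table: adding a site means adding one more entry, while the transform and load steps stay unchanged.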

Results: A standardized clinical phenotype dataset for lung cancer has been established, derived from the clinical documentation of the three participating locations. Related parameters are grouped into sections such as patient information, tumor documentation, diagnostics, laboratory results, lung function and therapy details. Because of the modular table-based approach, the dataset can be extended with additional columns for the clinical data of other DZL sites.

We developed a filter-based, well-documented and reusable ETL process chain. Once the initial filters are set, the target structure for the LCPD can be created, and the ETL tool can then be deployed as a stand-alone program for the clinicians. We expect this approach to reduce the effort of filtering the relevant data for LCPD or other data warehouses, as an individual site no longer has to perform manual filter steps to define a general dataset by itself.

The i2b2 ontology editor has been used to create the target dataset in LCPD according to the general definition in the harmonized dataset. A query tool with a web-based interface lets users, including those without technical experience, select medical facts to define sub-cohorts. This is done by dragging ontology elements and dropping them into specific containers in the web interface. On executing the resulting query, a researcher obtains the number of patients who match the selected facts.
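The counting logic behind such a sub-cohort query can be sketched as follows. This is an illustration, not i2b2's actual implementation. It assumes that elements dropped into the same container are alternatives (OR), that containers are combined with AND, and that selecting an ontology folder matches all facts below it. The ontology paths and patients are made up.

```python
from collections import defaultdict

# Hypothetical phenotype facts: (pseudonym, ontology path of the fact).
FACTS = [
    ("p1", "/LCPD/Tumor/Histology/Adenocarcinoma/"),
    ("p1", "/LCPD/Therapy/Surgery/"),
    ("p2", "/LCPD/Tumor/Histology/Squamous/"),
    ("p2", "/LCPD/Therapy/Chemotherapy/"),
    ("p3", "/LCPD/Tumor/Histology/Adenocarcinoma/"),
    ("p3", "/LCPD/Therapy/Chemotherapy/"),
    ("p4", "/LCPD/Tumor/Histology/Adenocarcinoma/"),
]


def count_patients(facts, containers):
    """Count patients with at least one matching fact in every container.

    `containers` is a list of lists of ontology path prefixes. A fact matches
    a prefix if its path starts with it, so a folder covers its children.
    """
    paths_by_patient = defaultdict(set)
    for patient, path in facts:
        paths_by_patient[patient].add(path)

    def matches(paths, container):
        return any(p.startswith(prefix) for p in paths for prefix in container)

    return sum(
        1
        for paths in paths_by_patient.values()
        if all(matches(paths, container) for container in containers)
    )


# Adenocarcinoma patients with any documented therapy.
query = [["/LCPD/Tumor/Histology/Adenocarcinoma/"], ["/LCPD/Therapy/"]]
n_patients = count_patients(FACTS, query)
```

Only the patient count leaves the function, which matches the abstract's description that a researcher gets the number of patients who fulfil the selected facts.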

Discussion: The resulting three-step concept of data harmonization, filter-based ETL components and data integration into LCPD is a necessary process for enabling cross-linked research in the DZL disease area Lung Cancer and in similar projects. An initial table-based collection of relevant phenotype data across all sites simplifies communication, the creation of a target definition for the central repository and the subsequent ETL process. However, data harmonization is an iterative work in progress that requires substantial communication effort from all participants. We consider close in-house collaboration within an interdisciplinary team of domain experts and data managers to be the most promising strategy. Further efforts aim, on the one hand, at generating a dynamic ontology for the phenotype facts directly from the defined target dataset and, on the other hand, at correlating those facts with genotype parameters, e.g. mutations in a patient’s genome.


References

1.
Siegel R, Ma J, Zou Z, Jemal A. Cancer statistics, 2014. CA: A Cancer Journal for Clinicians. 2014; 64(1):9–29.
2.
Gruber K, Loo J. Profile: German centre breathes new life into lung research. The Lancet. 2012; 380(9856):1806.
3.
Seeger W, Welte T, Eickelberg O, Mall M, Rabe K, Keller B et al. Das Deutsche Zentrum für Lungenforschung - Translationale Forschung für Prävention, Diagnose und Therapie von Atemwegserkrankungen. Pneumologie. 2012; 66(08):464–9.
4.
Talend Open Studio, Talend Inc, 2014. Available from: http://www.talend.com
5.
Murphy SN, Mendis M, Hackett K, Kuttan R, Pan W, Phillips LC et al. Architecture of the open-source clinical research chart from Informatics for Integrating Biology and the Bedside. AMIA Annu Symp Proc. 2007:548–52.