gms | German Medical Science

GMDS 2015: 60. Jahrestagung der Deutschen Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie e. V. (GMDS)

Deutsche Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie

06.09. - 09.09.2015, Krefeld

Creating a Multicenter Phenotype Database for Rare Kidney Diseases

Meeting Abstract

  • Christian Karmen - Universität Heidelberg, Heidelberg, Deutschland
  • Christian Kohl
  • Franz Schaefer - Universitäts-Kinderklinik Heidelberg, Heidelberg, Deutschland
  • Petra Knaup-Gregori - Universität Heidelberg, Heidelberg, Deutschland
  • Matthias Ganzinger - Universität Heidelberg, Heidelberg, Deutschland

GMDS 2015. 60. Jahrestagung der Deutschen Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie e.V. (GMDS). Krefeld, 06.-09.09.2015. Düsseldorf: German Medical Science GMS Publishing House; 2015. DocAbstr. 247

doi: 10.3205/15gmds101, urn:nbn:de:0183-15gmds1012

Published: August 27, 2015

© 2015 Karmen et al.
This is an Open Access article distributed under the terms of the Creative Commons Attribution 4.0 License. See license information at



Introduction: Each rare disease affects a limited number of individuals; however, the number of distinct disorders can be very high. This is especially the case for rare kidney diseases, with at least 150 associated disorders [1], because of the kidney’s complex nature and functionality. For a deeper analysis of the etiology of rare kidney diseases, a comprehensive collection of all relevant phenotypes from several patient centers is essential.

Currently, we are establishing a phenotype database for multicenter research on rare kidney diseases in the EU-funded project EURenOmics [2]. The consortium consists of 17 European academic institutions, nine industry partners and one academic partner from the United States of America. The research efforts of EURenOmics focus on several groups of kidney diseases, such as steroid-resistant nephrotic syndrome (SRNS) and congenital abnormalities of the kidney and urinary tract (CAKUT).

The aim of this paper is to describe how a framework for integrating heterogeneous clinical data into a central data warehouse (DWH) [3] is applied for building a phenotype database of a particular disease area and to present the results for the EURenOmics project.

Material and Methods: The framework [3] consists of three elements:

  • Element 1 - Communication platform for domain experts: A spreadsheet-based template is used for collecting all relevant parameters for a specific research field. Since several patient centers are involved, each capturing data with a different syntax and semantics, we distinguish between two types of data: (i) Common Data Elements (CDEs), which describe the consolidated set of relevant clinical parameters including their metadata, and (ii) Source Data Elements (SDEs), which contain the parameters from a data-delivering site that match a CDE. This way, the mapping of semantically corresponding fields is documented.
  • Element 2 - Transformation of heterogeneous data: The idea is to re-use the definition and communication spreadsheet from framework element 1 as the input for parametrizing the logic of the mapping itself. This puts the spreadsheet at the center of the data integration process.
  • Element 3 - Data Warehouse integration: The framework is based on i2b2 [4], an open-source DWH designed especially for clinical data. It offers a flexible database backend by supporting three widely used database management systems (Microsoft SQL Server, Oracle Database and PostgreSQL). Its data model is the star schema commonly used in DWHs. It allows powerful and extensive queries, optimized for huge data sets, e.g. with Online Analytical Processing (OLAP) tools.
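
To illustrate how elements 1 and 2 fit together, the following minimal Python sketch treats each spreadsheet row as a documented SDE-to-CDE mapping and uses the same rows to drive the transformation. It is not the authors' implementation: the class name `MappingRow`, the function `harmonize`, the center identifiers and the example fields are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class MappingRow:
    """One spreadsheet row linking a site's source field (SDE) to a CDE."""
    center: str                 # data-delivering site (hypothetical identifier)
    sde_field: str              # field name as used by that site
    cde: str                    # consolidated common data element
    value_map: dict = field(default_factory=dict)  # source value -> harmonized value


# Hypothetical rows; in the framework these would come from the spreadsheet.
ROWS = [
    MappingRow("center_a", "sex", "gender", {"m": "male", "f": "female"}),
    MappingRow("center_b", "geschlecht", "gender", {"1": "male", "2": "female"}),
]


def harmonize(center, record, rows):
    """Translate one source record into CDE names and harmonized values."""
    result = {}
    for row in rows:
        if row.center == center and row.sde_field in record:
            raw = record[row.sde_field]
            # Values without an explicit mapping are passed through unchanged.
            result[row.cde] = row.value_map.get(raw, raw)
    return result
```

Under these assumptions, `harmonize("center_b", {"geschlecht": "2"}, ROWS)` yields the same CDE and value as `harmonize("center_a", {"sex": "f"}, ROWS)`, so adding a center only requires adding rows, not changing code.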

Results: Applying the three framework elements to the EURenOmics phenotype database led to the following results:

We used the spreadsheet for two purposes: (i) documentation and (ii) input for the semantic mapping, as explained in framework element 2. Up to now, we have integrated data from four centers that deliver phenotype data for SRNS and from four further centers that deliver CAKUT phenotypes. The phenotype definitions consist of 277 CDEs and 779 matching SDEs across all eight centers.

During the semantic mapping, we found that empty values of a disease concept need a mapping of their own when they actually carry a meaning. For example, a disease concept such as “bladder anomalies” may be coded in binary form as “x” (meaning yes) or as an empty string (meaning no).
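
A minimal sketch of such a mapping in Python is shown below. The function name `map_binary_flag` is hypothetical, and the distinction between an empty string (field delivered but blank) and `None` (field not delivered at all) is an assumption made for illustration, not something described for the project.

```python
def map_binary_flag(raw, yes_marker="x"):
    """Map a flag column where a marker means yes and an empty cell means no."""
    if raw is None:
        # Assumption: a field that was never delivered stays unknown.
        return None
    value = raw.strip().lower()
    if value == yes_marker:
        return "yes"
    if value == "":
        # The empty string is meaningful here and must not be dropped.
        return "no"
    raise ValueError(f"unexpected flag value: {raw!r}")
```

The point of the sketch is that the empty string is handled as an explicit case rather than being filtered out as missing data before the mapping runs.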

We implemented an in-memory mapping function that accelerated the mapping procedure by a factor of about 130 on average, compared to the mapping component provided by the programming environment “Talend OpenStudio” (TOS). The TOS component is not optimized for mappings larger than about 100,000 elements and thus eventually runs out of resources and crashes. Using our implementation, a sample set of 1,709 patients with 261,115 disease facts is matched in about 12 seconds, instead of 1,500 seconds using the standard method.
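
The abstract does not describe how the in-memory function works internally. One common way to obtain this kind of speed-up is to load the mapping once into a hash table and resolve each fact with a single lookup; the Python sketch below shows that general idea only. The names `build_index` and `map_facts` and the tuple layouts are hypothetical.

```python
def build_index(mapping_rows):
    """Index (center, source_field, source_value) -> target concept code."""
    return {(center, fld, value): code for center, fld, value, code in mapping_rows}


def map_facts(facts, index):
    """Resolve each fact with one dictionary lookup; collect unmatched facts."""
    mapped, unmatched = [], []
    for patient_id, center, fld, value in facts:
        code = index.get((center, fld, value))
        if code is None:
            unmatched.append((patient_id, center, fld, value))
        else:
            mapped.append((patient_id, code))
    return mapped, unmatched
```

With this layout the cost per fact does not grow with the size of the mapping, which is why a dictionary-based approach scales to several hundred thousand facts as long as the mapping fits in memory.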

The framework has a strong dependency on the i2b2 data schema and on Oracle as the backend. This is well suited for the development of a phenotype database. Nevertheless, the future aim of EURenOmics is to correlate phenotype and -omics data, which is better supported by tranSMART [5], a non-profit open-source platform for enhanced analysis that is based on i2b2 data storage technology. Therefore, we developed an extract, transform and load (ETL) process that uses generic data types and decision paths. Only at the end of the process chain is a DWH-specific transformation and load performed. This made it possible to adapt the load mechanism to the tranSMART-specific i2b2 enhancements.
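
The design idea of keeping the pipeline generic and isolating the DWH-specific step at the end can be sketched as follows in Python. This is an illustration under stated assumptions, not the project's ETL: all class and function names are hypothetical, the i2b2 column names are loosely modelled on its fact table, and the tranSMART row layout is a pure placeholder that does not reflect its real schema.

```python
from dataclasses import dataclass
from typing import Iterable, List, Protocol


@dataclass
class GenericFact:
    """DWH-independent representation used throughout the pipeline."""
    patient_id: str
    concept: str
    value: str


class Loader(Protocol):
    """Only loaders know about a concrete target schema."""
    def load(self, facts: Iterable[GenericFact]) -> List[dict]: ...


class I2b2Loader:
    def load(self, facts):
        # Simplified row layout, loosely modelled on an i2b2 fact table.
        return [{"patient_num": f.patient_id, "concept_cd": f.concept,
                 "tval_char": f.value} for f in facts]


class TranSmartLoader:
    def load(self, facts):
        # Placeholder layout; NOT the real tranSMART schema.
        return [{"subject": f.patient_id, "variable": f.concept,
                 "value": f.value} for f in facts]


def to_generic(record):
    """Hypothetical transform from a harmonized record to a generic fact."""
    return GenericFact(record["id"], record["concept"], record["value"])


def run_etl(records, transform, loader):
    """Extract and transform generically; only the last step is DWH-specific."""
    facts = [transform(r) for r in records]
    return loader.load(facts)
```

Switching the target then means passing a different loader to `run_etl`, while extraction and transformation remain unchanged.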

Discussion: The methods and tools of [3] proved feasible and especially helpful for the EURenOmics project. While applying the framework, we identified and implemented several improvements. These new features will enhance the established framework and can be used in other application areas. Nevertheless, further improvement is possible in several directions.

For general usability, a closed tool chain would be valuable. Although we offer our implementation free of charge to anyone interested, a user-friendly graphical interface to support the ETL process would probably further increase acceptance.

Using the spreadsheet template to manage large numbers of CDEs and SDEs can sometimes be confusing due to its size. Often, usability could be improved by hiding cells in the spreadsheet. Using plausibility checks for individual fields helped reduce false inputs. We further plan to store the spreadsheet centrally in a collaboration environment, such as Microsoft SharePoint. This would enable concurrent editing and discussion among the data managers of each participating center. We expect this to accelerate the development process, because even incomplete mappings can instantly be used for a partial integration.

As a next step, fully supported data integration into tranSMART is planned. tranSMART provides web applications for a number of analytical tasks that are useful for EURenOmics, for example investigating correlations between genotype and phenotype, or finding patterns of gene expression in healthy and diseased individuals and in human tissue samples.

In conclusion, we are confident that, for EURenOmics, the framework proved to be a valid guideline and led efficiently to a phenotype database that enables comprehensive center-spanning analysis.


[1] Devuyst O, Knoers NV, Remuzzi G, Schaefer F. Rare inherited kidney diseases: challenges, opportunities, and perspectives. The Lancet. 2014; 383(9931): 1844–1859.
[2] European Union 7th Framework Programme. EURenOmics. [last accessed: 2015 Mar 22]. Available from: URL: External link
[3] Karmen C, Ganzinger M, Kohl CD, Firnkorn D, Knaup-Gregori P. A framework for integrating heterogeneous clinical data for a disease area into a central data warehouse. Studies in Health Technology and Informatics. 2014; 205: 1060–1064.
[4] Murphy SN, Mendis M, Hackett K, Kuttan R, Pan W, Phillips LC, et al. Architecture of the open-source clinical research chart from Informatics for Integrating Biology and the Bedside. AMIA Annu Symp Proc. 2007: 548–552.
[5] Athey BD, Braxenthaler M, Haas M, Guo Y. tranSMART: An Open Source and Community-Driven Informatics and Data Sharing Platform for Clinical and Translational Research. AMIA Summits on Translational Science Proceedings. 2013: 6–8.