GMDS 2012: 57. Jahrestagung der Deutschen Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie e. V. (GMDS)

16. - 20.09.2012, Braunschweig

Pay-as-you-go data integration for large scale healthcare simulations

Meeting Abstract

  • Philipp Baumgärtel - Universität Erlangen-Nürnberg, Erlangen, Deutschland
  • Gregor Endler - Universität Erlangen-Nürnberg, Erlangen, Deutschland
  • Johannes Held - Universität Erlangen-Nürnberg, Erlangen, Deutschland
  • Richard Lenz - Universität Erlangen-Nürnberg, Erlangen, Deutschland

GMDS 2012. 57. Jahrestagung der Deutschen Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie e.V. (GMDS). Braunschweig, 16.-20.09.2012. Düsseldorf: German Medical Science GMS Publishing House; 2012. Doc12gmds047

doi: 10.3205/12gmds047, urn:nbn:de:0183-12gmds0475

Published: September 13, 2012

© 2012 Baumgärtel et al.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License. You are free to share (copy, distribute and transmit) the work, provided the original author and source are credited.



Introduction: ProHTA (Prospective Health Technology Assessment) [1] is a simulation project that aims to understand the impact of innovative medical processes and technologies at an early stage. To that end, large-scale healthcare simulations are employed to estimate the effects of potential innovations [2].

Besides simulation modeling, validation and optimization, our project also requires simulation data management. This includes semantic integration, since the simulation data stems from several heterogeneous sources. Although many techniques exist for the automatic integration of data [3], integration still requires considerable effort.

Howe et al. [4] identified several key barriers to adopting a data management system for science. They argue that the initial effort of designing a data schema is too complicated in most cases. Therefore, we develop a "data first, structure later" approach to reduce the initial effort.

Material and Methods: Since our pool of data sources is likely to expand rapidly, we investigate an integration strategy based on the dataspaces abstraction [5]. This approach allows the coexistence of heterogeneous data sources, which are initially integrated only as far as automatically possible. We investigated how an RDF (Resource Description Framework) [6] triplestore can be utilized to enable the gradual improvement of the degree of integration in a pay-as-you-go manner. Additionally, we analyzed the requirements of our simulation practitioners and developed a prototypical data provisioner for the simulation.
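
The pay-as-you-go idea described above can be illustrated with a minimal sketch. It is not the ProHTA implementation: the triplestore is reduced to a Python set of tuples, and the source names (`registryA`, `surveyB`), the `mapsTo` predicate and the `concept:Age` identifier are hypothetical names chosen for illustration. Two sources are imported with their own column names and coexist immediately; mappings to a shared concept are added only later, when they are needed.

```python
import csv
import io

# Hypothetical toy triplestore: a set of (subject, predicate, object) tuples.
Triple = tuple[str, str, str]


def import_csv(source: str, text: str) -> set[Triple]:
    """Import a CSV source as-is, using its own column names as predicates."""
    triples: set[Triple] = set()
    for i, row in enumerate(csv.DictReader(io.StringIO(text))):
        subject = f"{source}/row{i}"
        for column, value in row.items():
            triples.add((subject, f"{source}#{column}", value))
    return triples


def values_of(store: set[Triple], concept: str) -> list[str]:
    """Return the values of every source predicate mapped to a shared concept."""
    mapped = {s for (s, p, o) in store if p == "mapsTo" and o == concept}
    return sorted(o for (s, p, o) in store if p in mapped)


store: set[Triple] = set()
# Made-up sample rows; both sources coexist without any shared schema.
store |= import_csv("registryA", "age,cases\n60,12\n70,30\n")
store |= import_csv("surveyB", "alter,faelle\n65,20\n")

# Before any mapping exists, a concept-level query finds nothing.
before = values_of(store, "concept:Age")

# Later, integration is improved step by step by adding mapping triples.
store.add(("registryA#age", "mapsTo", "concept:Age"))
store.add(("surveyB#alter", "mapsTo", "concept:Age"))
after = values_of(store, "concept:Age")

print(before, after)
```

The design point is that the import step needs no schema decisions at all; every mapping triple added afterwards raises the degree of integration without touching the imported data.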

Results: We developed an ontology for storing statistical data in an RDF triplestore [7]. Data and metadata are stored using several generic upper ontologies. We also devised a domain-specific query language that helps simulation modelers load statistical data into simulation models.
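
As a rough illustration of storing statistical data as triples and loading it into a model, consider the following sketch. The vocabulary (`measure`, `dim:...`, `value`, `source`) and the `load` function are assumptions made for this example; they are neither the ontology from [7] nor the syntax of the domain-specific query language, and the numbers are placeholders rather than real data.

```python
Triple = tuple[str, str, str]

# Hypothetical vocabulary and placeholder numbers, for illustration only.
store: set[Triple] = {
    ("obs1", "measure", "incidence"),
    ("obs1", "dim:ageGroup", "60-69"),
    ("obs1", "dim:year", "2010"),
    ("obs1", "value", "0.1"),
    ("obs1", "source", "registryA"),
    ("obs2", "measure", "incidence"),
    ("obs2", "dim:ageGroup", "70-79"),
    ("obs2", "dim:year", "2010"),
    ("obs2", "value", "0.2"),
    ("obs2", "source", "registryA"),
}


def load(store: set[Triple], measure: str, **dims: str) -> list[float]:
    """Return the values of all observations matching a measure and dimensions."""
    wanted = {("measure", measure)} | {(f"dim:{k}", v) for k, v in dims.items()}
    hits: list[float] = []
    for subject in sorted({s for (s, _, _) in store}):
        facts = {(p, o) for (s, p, o) in store if s == subject}
        # An observation matches if it carries every requested fact.
        if wanted <= facts:
            hits.extend(float(o) for (p, o) in facts if p == "value")
    return hits


one_group = load(store, "incidence", ageGroup="70-79")
whole_year = load(store, "incidence", year="2010")
print(one_group, whole_year)
```

A modeler-facing query language would hide the triple patterns behind such a call, so that a simulation model only names the measure and the dimensions it needs.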

This framework has been extended to store data using the automatically imported schema of the data source. The user can query the data either with a keyword search or by using information about the source schema. The user can also view the data in a web front end, which allows annotations about the underlying semantic concepts of the data to be added. When the annotations are complete, application-specific views allow the usage of our domain-specific query language to load data into the simulation.
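
The workflow in this paragraph (keyword search over imported data, followed by annotation until a view becomes usable) can be sketched as follows. This is a simplified stand-in under stated assumptions: the `annotatedAs` predicate, the concept identifiers and the sample rows are hypothetical, and the completeness rule used here (every source predicate has an annotation) is only one plausible reading of "complete".

```python
Triple = tuple[str, str, str]

# Data imported with the source's own schema (made-up sample rows).
store: set[Triple] = {
    ("surveyB/row0", "surveyB#alter", "65"),
    ("surveyB/row0", "surveyB#faelle", "20"),
}


def keyword_search(store: set[Triple], keyword: str) -> list[Triple]:
    """Return all triples in which any part contains the keyword."""
    needle = keyword.lower()
    return sorted(t for t in store if any(needle in part.lower() for part in t))


def unannotated(store: set[Triple]) -> list[str]:
    """List source predicates that have no semantic annotation yet."""
    predicates = {p for (_, p, _) in store if p != "annotatedAs"}
    annotated = {s for (s, p, _) in store if p == "annotatedAs"}
    return sorted(predicates - annotated)


# The data can be found by keyword before any annotation exists.
hits = keyword_search(store, "alter")
missing_before = unannotated(store)

# Annotations added by a user, for example through a front end.
store.add(("surveyB#alter", "annotatedAs", "concept:Age"))
store.add(("surveyB#faelle", "annotatedAs", "concept:CaseCount"))
missing_after = unannotated(store)

# Assumed rule: a concept-level view is usable once nothing is missing.
view_ready = not missing_after
print(hits, missing_before, missing_after, view_ready)
```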

Conclusion: Because of the changing demands of the simulation and the amounts of heterogeneous data, the initial effort of integrating a data source should be minimized. Therefore, our framework enables automatic integration and schema import for sources with known formats. As the data has to be reusable for different simulation studies, the data management system provides the means of adding semantic information to initially integrated data. Additionally, the simulation modeler is able to find and query data easily using a keyword search and a domain-specific query language. Hence, the effort of using a data provider for simulation input data management is less than the effort of manual data input.

Acknowledgements: This project is supported by the German Federal Ministry of Education and Research (BMBF), project grant No. 01EX1013B.


References

[1] Djanatliev A, German R. ProHTA – Prospective Assessment of innovative Health Technology by Simulation. In: Proceedings of the 2011 Winter Simulation Conference; 2011.
[2] Djanatliev A, Kolominsky-Rabas P, Hofmann BM, Aisenbrey A, German R. Hybrid Simulation Approach for Prospective Assessment of Mobile Stroke Units. In: SIMULTECH 2012 – Proceedings of the 2nd International Conference on Simulation and Modeling Methodologies, Technologies and Applications; 2012. [in print]
[3] Rahm E, Bernstein PA. A survey of approaches to automatic schema matching. The VLDB Journal. 2001.
[4] Howe B, Cole G, Souroush E, Koutris P, Key A, Khoussainova N, et al. Database-as-a-Service for Long-Tail Science. In: Bayard Cushing J, French J, Bowers S, editors. Scientific and Statistical Database Management, volume 6809 of Lecture Notes in Computer Science. Berlin/Heidelberg: Springer; 2011. p. 480-9.
[5] Jeffery SR, Franklin MJ, Halevy AY. Pay-as-you-go user feedback for dataspace systems. In: Proceedings of the 2008 ACM SIGMOD international conference on Management of data, SIGMOD ’08. New York, NY, USA: ACM; 2008. p. 847-60.
[6] Lassila O, Swick RR. Resource Description Framework (RDF) Model and Syntax Specification. World Wide Web Consortium (W3C); 1999.
[7] Baumgärtel P, Lenz R. Towards Data and Data Quality Management for Large Scale Healthcare Simulations. In: Conchon E, Correia C, Fred A, Gamboa H, editors. Proceedings of the International Conference on Health Informatics. SciTePress – Science and Technology Publications; 2012. p. 275-80.