65th Annual Meeting of the German Association for Medical Informatics, Biometry and Epidemiology (GMDS), Meeting of the Central European Network (CEN: German Region, Austro-Swiss Region and Polish Region) of the International Biometric Society (IBS)

06.09. - 09.09.2020, Berlin (online conference)

Using parallelized computation in R to enable efficient data quality reporting

Meeting Abstract

  • Stephan Struckmann - Universitätsmedizin Greifswald, Greifswald, Germany
  • Adrian Richter - Institut für Community Medicine, Universitätsmedizin Greifswald, Greifswald, Germany
  • Carsten Oliver Schmidt - Universität Greifswald, Greifswald, Germany

Deutsche Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie. 65th Annual Meeting of the German Association for Medical Informatics, Biometry and Epidemiology (GMDS), Meeting of the Central European Network (CEN: German Region, Austro-Swiss Region and Polish Region) of the International Biometric Society (IBS). Berlin, 06.-09.09.2020. Düsseldorf: German Medical Science GMS Publishing House; 2021. DocAbstr. 407

doi: 10.3205/20gmds237, urn:nbn:de:0183-20gmds2376

Published: February 26, 2021

© 2021 Struckmann et al.
This is an Open Access article distributed under the terms of the Creative Commons Attribution 4.0 License. See license information at http://creativecommons.org/licenses/by/4.0/.


Text

Background: Data quality (DQ) reporting is a repetitive task in which similar indicators are computed for large numbers of study variables. This can be computationally intensive, particularly for advanced statistical procedures. In the Study of Health in Pomerania [1], DQ reporting comprises, among others, adjusted marginal means (e.g. observer effects adjusted for covariates) and stratified LOESS plots. Given the large number of variables and possible error sources, sequential computations may take hours or even days. Computations therefore need to be implemented in a way that ensures timely reports.

Methods: A generic R software library, dataquieR, has been developed to generate data quality reports. dataquieR combines information from study data with metadata from an extended data dictionary for data quality reporting [2]. Metadata include, for example, process information such as examiner or device IDs. For a selected quality indicator, all possible combinations of outcome variables (such as blood pressure) and their assigned process variables are computed. For this purpose, a calling plan is created. It can be understood as a table in which each row represents one call of the indicator function and each column represents one argument of that function, specifying which variables to assess. This approach was inspired by the pmap function [3].
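
The calling-plan idea can be sketched in base R. This is a minimal illustration only and not the dataquieR implementation: the toy data, the column names and the `indicator` function are hypothetical, and `Map()` stands in for a pmap-style row-wise evaluation.

```r
# Toy study data: two outcome variables and two process variables
# (all names are hypothetical, not taken from dataquieR).
set.seed(1)
study_data <- data.frame(
  sbp      = rnorm(100, mean = 120, sd = 15),  # systolic blood pressure
  dbp      = rnorm(100, mean = 80,  sd = 10),  # diastolic blood pressure
  examiner = sample(c("A", "B", "C"), 100, replace = TRUE),
  device   = sample(c("D1", "D2"), 100, replace = TRUE)
)

# Hypothetical indicator: spread of the group means of an outcome
# across the levels of one process variable.
indicator <- function(resp_var, group_var, data) {
  group_means <- tapply(data[[resp_var]], data[[group_var]], mean)
  diff(range(group_means))
}

# Calling plan: one row per call, one column per function argument.
calling_plan <- expand.grid(
  resp_var  = c("sbp", "dbp"),
  group_var = c("examiner", "device"),
  stringsAsFactors = FALSE
)

# Evaluate the plan row by row: Map() walks the columns in parallel,
# so each row becomes one call of indicator().
calling_plan$result <- unname(unlist(
  Map(indicator, calling_plan$resp_var, calling_plan$group_var,
      MoreArgs = list(data = study_data))
))

print(calling_plan)
```

Because every row is an independent call, the rows of such a plan can be handed to any map-like function, sequential or parallel, without changing the indicator itself.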

All computations can be performed in parallel. Via the R package parallelMap [4], dataquieR supports diverse backends, ranging from sequential computation through multicore parallelization to SLURM-driven HPC parallelization [5], with and without MPI [6]. It also supports parallel socket (PSOCK) clusters and therefore runs on Windows as well.
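
To illustrate what a PSOCK backend does, the following sketch uses only the base R package `parallel` rather than parallelMap or dataquieR. The function `slow_indicator` and the number of workers are assumptions chosen for the example.

```r
library(parallel)

# Hypothetical stand-in for an expensive indicator computation.
slow_indicator <- function(i) {
  Sys.sleep(0.01)
  i^2
}

tasks <- 1:8

# Sequential reference run.
res_seq <- lapply(tasks, slow_indicator)

# PSOCK cluster: fresh R worker processes connected via sockets.
# Unlike forking, this also works on Windows, but every object a
# worker needs has to be serialized and sent to it.
cl <- makeCluster(2, type = "PSOCK")
res_par <- tryCatch(
  parLapply(cl, tasks, slow_indicator),
  finally = stopCluster(cl)  # always release the workers
)

# Both runs yield the same results, only the execution differs.
print(identical(res_seq, res_par))
```

The explicit transfer of functions and data to the workers is the source of the distribution overhead of socket-based clusters, while forked workers share the parent's memory at the price of higher memory demands.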

The advantages of parallelized computing are illustrated using a dummy dataset with up to 300 metric variables and two grouping variables (up to 600 computations).

Results: To compute, for example, LOESS plots for all variables of a study, the following code suffices:

pipeline_vectorized(fct = acc_loess, study_data = study_data, meta_data = meta_data)

Users may also specify variables explicitly or use variable attributes to restrict the calculations.

Running in parallel on 3 cores reduced the computation time for the 600 computations from 15 to 5 minutes. For more than 128 computations, the memory demands of forking exceeded the machine's limits, whereas PSOCK still worked.

Computing reports also includes time for reading and writing data on each node of a compute cluster (except for multicore clusters). This increases runtimes depending on the size of the data and the results. Apart from this overhead, runtimes rise proportionally with the number of variables and fall in inverse proportion to the number of available CPUs. Memory problems can be avoided by dividing the data into smaller sets.
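
The scaling behaviour described above can be summarized in a simple runtime model. The symbols are introduced here for illustration only, and the model assumes that every computation costs about the same time; it is a rough sketch, not a fitted result.

```latex
% n: number of indicator computations, p: number of CPUs,
% t: mean time per computation,
% T_io: overhead for distributing, reading and writing data
T(n, p) \approx T_{\mathrm{io}} + \frac{n \, t}{p}

% Check against the reported figures, assuming T_io is negligible:
% sequential run: 600 t = 15 min, hence t = 1.5 s
T(600, 3) \approx \frac{600 \cdot 1.5\,\mathrm{s}}{3} = 300\,\mathrm{s} = 5\,\mathrm{min}
```

Under these assumptions the reported reduction from 15 to 5 minutes on 3 cores corresponds to a speedup of 3. On backends that have to ship data to each node, the overhead term grows with the size of the data and limits the achievable speedup.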

dataquieR is available from https://dfg-qa.ship-med.uni-greifswald.de/

Conclusion: With dataquieR, the advantages of parallelized computing for data quality assessments become routinely available in standard environments such as laptops. Large backends such as HPC clusters can be used to deliver indicator results in a timely manner even for large numbers of variables. Because of its memory demands, fork-based parallelization is limited to smaller tasks. PSOCK and HPC clusters need time to distribute all required data objects to the nodes, but their memory demands are lower. This data distribution overhead could be addressed using composed functions as a possible extension to dataquieR.

The authors declare that they have no competing interests.

The authors declare that an ethics committee vote is not required.


References

1. Völzke H, Alte D, Schmidt CO, Radke D, Lorbeer R, Friedrich N, et al. Cohort profile: the Study of Health in Pomerania. International Journal of Epidemiology. 2011;40(2):294-307.
2. Richter A, Schössow J, Werner A, Schauer B, Radke D, Henke J, et al. Data quality monitoring in clinical and observational epidemiologic studies: the role of metadata and process information. GMS Med Inform Biom Epidemiol. 2019;15(1):Doc08. DOI: 10.3205/mibe000202
3. Henry L, Wickham H. purrr: Functional Programming Tools. R package version 0.3.3. 2019. Available from: https://CRAN.R-project.org/package=purrr
4. Bischl B, Lang M. parallelMap: Unified Interface to Parallelization Back-Ends. R package version 1.4. 2019. Available from: https://CRAN.R-project.org/package=parallelMap
5. Yoo AB, Jette MA, Grondona M, editors. SLURM: Simple Linux Utility for Resource Management. 2003.
6. Clarke L, Glendinning I, Hempel R. The MPI Message Passing Interface Standard. In: Decker KM, Rehmann RM, editors. Programming Environments for Massively Parallel Distributed Systems. 1994.