gms | German Medical Science

66. Jahrestagung der Deutschen Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie e. V. (GMDS), 12. Jahreskongress der Technologie- und Methodenplattform für die vernetzte medizinische Forschung e. V. (TMF)

26-30 September 2021, online

Synthetic Complex Medication Data: From Software Testing to Privacy-Preserving Analytics

Meeting Abstract

  • Patric Tippmann - Institute of Medical Biometry and Statistics, Faculty of Medicine and Medical Center - University of Freiburg, Freiburg, Germany
  • Kiana Farhadyar - Institute of Medical Biometry and Statistics, Faculty of Medicine and Medical Center - University of Freiburg, Freiburg, Germany
  • Harald Binder - Institute of Medical Biometry and Statistics, Faculty of Medicine and Medical Center - University of Freiburg, Freiburg, Germany
  • Daniela Zöller - Institute of Medical Biometry and Statistics, Faculty of Medicine and Medical Center - University of Freiburg, Freiburg, Germany

Deutsche Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie. 66. Jahrestagung der Deutschen Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie e. V. (GMDS), 12. Jahreskongress der Technologie- und Methodenplattform für die vernetzte medizinische Forschung e.V. (TMF). sine loco [digital], 26.-30.09.2021. Düsseldorf: German Medical Science GMS Publishing House; 2021. DocAbstr. 203

doi: 10.3205/21gmds092, urn:nbn:de:0183-21gmds0929

Published: 24 September 2021

© 2021 Tippmann et al.
This is an open-access article distributed under the terms of the Creative Commons Attribution 4.0 License. For license information, see http://creativecommons.org/licenses/by/4.0/.


Text

Introduction: Manually creating realistic software tests for complex data is time-consuming [1]. At a minimum, the number of cases to be covered corresponds to the number of execution paths that depend on the input data [2], [3]. A formal specification or source code from which tests could be generated automatically may not be available, as in the case of test-driven development or proprietary software [4]. Further, in cross-institutional collaboration on clinical data and software, patient privacy prohibits sharing real cases [5]. Hence, although the benefit of testing for correctness and development rises with the realism of the test data [6], [7], practitioners in this domain resort to synthetic data.

Since medication (the administration of drugs) can greatly change the state of a patient, its record is an essential part of clinical data. Medication data can also be fairly complex: a record may include a mixture of drugs, each consisting of several ingredients, with quantities that vary over time, and additional information such as dosage form and application site. This structure combines hierarchical information and corresponds to complex, multivariate probability distributions.
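The hierarchy described above can be sketched as nested record types. This is only an illustration, not the data model used by the authors: all class names, field names, and units (for example `amount_mg` and `hour`) are assumptions introduced here.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Ingredient:
    name: str          # hypothetical field: name of the active substance
    amount_mg: float   # hypothetical field: amount per administered unit


@dataclass
class Administration:
    hour: float        # assumed time axis: hours since admission
    quantity: float    # units administered at that time


@dataclass
class Drug:
    name: str
    dosage_form: str        # e.g. "tablet" or "infusion"
    application_site: str   # e.g. "oral" or "intravenous"
    ingredients: List[Ingredient] = field(default_factory=list)
    administrations: List[Administration] = field(default_factory=list)

    def total_quantity(self) -> float:
        """Sum of all quantities administered over time."""
        return sum(a.quantity for a in self.administrations)


@dataclass
class MedicationRecord:
    patient_id: str
    drugs: List[Drug] = field(default_factory=list)

    def ingredient_names(self) -> Set[str]:
        """All distinct ingredient names across the drugs in this record."""
        return {i.name for d in self.drugs for i in d.ingredients}
```

Even in this simplified form, one record mixes categorical fields, numeric fields, and variable-length nested lists, which is why its joint distribution is hard to specify by hand.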

Methods: We capture medication data using deep generative models [8], [9]. This allows the creation of new, synthetic data from the learned distribution, which ideally matches the original data distribution without disclosing any personally identifiable information [10]. The value of the synthetic data increases further when relevant features are represented accurately. Therefore, we evaluate statistical properties of the simulated data and compare them to those of the original medication data. In this way, and in addition to the use case of software testing, we can also assess the possibility of sharing complex, simulated medication data for pooled analytics that add to single-site studies while preserving patient privacy.
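One simple way to compare statistical properties of original and simulated data is sketched below. It is not the evaluation used by the authors, which the abstract does not specify. It assumes, purely for illustration, that each dataset has been reduced to a binary patient-by-drug matrix (1 = drug administered), and it compares per-drug frequencies and pairwise co-occurrence frequencies. All function names are hypothetical.

```python
from itertools import combinations
from typing import Dict, List, Sequence, Tuple

Matrix = Sequence[Sequence[int]]  # rows = patients, columns = drugs (0 or 1)


def marginal_frequencies(rows: Matrix) -> List[float]:
    """Fraction of patients who received each drug."""
    n = len(rows)
    n_cols = len(rows[0])
    return [sum(r[j] for r in rows) / n for j in range(n_cols)]


def cooccurrence_frequencies(rows: Matrix) -> Dict[Tuple[int, int], float]:
    """Fraction of patients who received both drugs of each column pair."""
    n = len(rows)
    n_cols = len(rows[0])
    return {
        (i, j): sum(r[i] * r[j] for r in rows) / n
        for i, j in combinations(range(n_cols), 2)
    }


def max_abs_difference(original: Matrix, synthetic: Matrix) -> float:
    """Largest absolute gap between the two datasets over all marginal
    and pairwise co-occurrence frequencies (0.0 means identical)."""
    m_o, m_s = marginal_frequencies(original), marginal_frequencies(synthetic)
    c_o, c_s = cooccurrence_frequencies(original), cooccurrence_frequencies(synthetic)
    diffs = [abs(a - b) for a, b in zip(m_o, m_s)]
    diffs += [abs(c_o[k] - c_s[k]) for k in c_o]
    return max(diffs)


if __name__ == "__main__":
    original = [[1, 0], [1, 1], [0, 0], [1, 0]]
    synthetic = [[1, 0], [0, 1], [0, 0], [1, 1]]
    print(max_abs_difference(original, synthetic))
```

A small gap on such summaries is necessary but not sufficient for realism. It says nothing about dosage over time or about privacy, both of which would need separate checks.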

Conclusion: In summary, we generate complex, simulated medication data with software testing as a first use case, and evaluate its suitability for future use in pooled, privacy-preserving analytics.

The authors declare that they have no competing interests.

The authors declare that an ethics committee vote is not required.


References

1. Briand LC. A Critical Analysis of Empirical Research in Software Testing. In: First International Symposium on Empirical Software Engineering and Measurement (ESEM 2007). 2007. p. 1–8.
2. Whittaker JA. What is software testing? And why is it so hard? IEEE Softw. 2000 Jan;17(1):70–9.
3. Di Geronimo L, Ferrucci F, Murolo A, Sarro F. A Parallel Genetic Algorithm Based on Hadoop MapReduce for the Automatic Generation of JUnit Test Suites. In: Verification and Validation 2012. IEEE Fifth International Conference on Software Testing. 2012. p. 785–93.
4. Shull F, Melnik G, Turhan B, Layman L, Diep M, Erdogmus H. What Do We Know about Test-Driven Development? IEEE Softw. 2010 Nov;27(6):16–9.
5. Lenz S, Hess M, Binder H. Deep generative models in DataSHIELD. BMC Med Res Methodol. 2021 Dec [cited 2021 May 8];21(1):64. DOI: 10.1186/s12874-021-01237-6
6. Michael CC, McGraw G, Schatz MA. Generating software test data by evolution. IEEE Trans Softw Eng. 2001 Dec;27(12):1085–110.
7. Bozkurt M, Harman M. Automatically generating realistic test input from web services. In: Proceedings of 2011 IEEE 6th International Symposium on Service Oriented System (SOSE). 2011. p. 13–24.
8. Hu Z, Yang Z, Salakhutdinov R, Xing EP. On Unifying Deep Generative Models. In: Conference Track Proceedings. 6th International Conference on Learning Representations (ICLR 2018); 2018 April 30 - May 3; Vancouver, BC, Canada. OpenReview.net; 2018 [cited 2021 May 8]. Available from: https://openreview.net/forum?id=rylSzl-R-
9. Kingma DP, Welling M. Auto-Encoding Variational Bayes. In: Bengio Y, LeCun Y, editors. Conference Track Proceedings. 2nd International Conference on Learning Representations (ICLR 2014); 2014 April 14-16; Banff, AB, Canada. 2014 [cited 2021 May 8]. Available from: http://arxiv.org/abs/1312.6114
10. Beaulieu-Jones BK, et al. Privacy-Preserving Generative Deep Neural Networks Support Clinical Data Sharing. Circ Cardiovasc Qual Outcomes. 2019 Jul 1;12(7):e005122. DOI: 10.1161/CIRCOUTCOMES.118.005122