gms | German Medical Science

SMITH Science Day 2022

23.11.2022, Aachen

Generating structured data in the medical domain using generative adversarial networks

Meeting Abstract

  • Sina Sadeghi - Department for Medical Data Science, Leipzig University Medical Center, Leipzig, Germany; Institute for Medical Informatics, Statistics and Epidemiology, Leipzig University, Leipzig, Germany
  • Lars Hempel - Department for Medical Data Science, Leipzig University Medical Center, Leipzig, Germany; Institute for Medical Informatics, Statistics and Epidemiology, Leipzig University, Leipzig, Germany; Faculty Applied Computer and Bio Sciences, Mittweida University of Applied Sciences, Mittweida, Germany
  • Masoud Abedi - Department for Medical Data Science, Leipzig University Medical Center, Leipzig, Germany; Institute for Medical Informatics, Statistics and Epidemiology, Leipzig University, Leipzig, Germany; Faculty Applied Computer and Bio Sciences, Mittweida University of Applied Sciences, Mittweida, Germany
  • Toralf Kirsten - Department for Medical Data Science, Leipzig University Medical Center, Leipzig, Germany; Institute for Medical Informatics, Statistics and Epidemiology, Leipzig University, Leipzig, Germany; Faculty Applied Computer and Bio Sciences, Mittweida University of Applied Sciences, Mittweida, Germany

SMITH Science Day 2022. Aachen, 23.11.2022. Düsseldorf: German Medical Science GMS Publishing House; 2023. DocP29

doi: 10.3205/22smith40, urn:nbn:de:0183-22smith406

Published: January 31, 2023

© 2023 Sadeghi et al.
This is an Open Access article distributed under the terms of the Creative Commons Attribution 4.0 License. See license information at http://creativecommons.org/licenses/by/4.0/.


Text

Introduction: Synthetic data generation has attracted particular interest in the medical field for two main reasons. First, medical data are not readily accessible to researchers because of patient privacy and data protection regulations. Second, in some cases, such as rare diseases, only a few data records are available, making diagnosis or treatment difficult even for experts. Synthetic medical data generation can address these issues by providing artificial medical data that resemble real data without being associated with real patients. Since modern artificial intelligence methods require sufficiently large datasets to achieve optimal results, making more synthetic data available can improve the efficiency of data analysis in medical settings where real data are limited.

Among numerous generative models, we considered Generative Adversarial Networks (GANs), which employ deep learning to generate synthetic data [1]. A GAN typically comprises two networks: a generator and a discriminator. While the generator produces synthetic data from random noise, the discriminator tries to distinguish generated data from real data. Several GAN variations have been developed since the introduction of the original GAN, serving different purposes. GANs have shown impressive results in generating images and natural-language text; however, generating tabular data remains a challenge, especially in the medical domain, where only a small amount of data is accessible [2]. In this study, we addressed the generation of structured (tabular) synthetic data using several suitable GAN models for applications in the medical domain. We investigated how their synthetic data can improve classification performance compared to the case where only real data are used. To determine the quality of the generated data, we developed an evaluation framework that trains classifiers on an extended dataset consisting of both real and synthetic data [3]. This is in contrast to other studies, which use either real or synthetic data for training the classifier and the other kind for testing.
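For orientation, the adversarial game between generator and discriminator can be written as the minimax objective of the original GAN formulation [1]. The symbols are those of [1], not of this abstract: G is the generator, D the discriminator, p_data the distribution of real records, and p_z the noise prior. The GAN variants studied here modify this objective, so it should be read as background only.

```latex
\min_{G} \max_{D} \; V(D, G) =
  \mathbb{E}_{x \sim p_{\mathrm{data}}(x)} \bigl[ \log D(x) \bigr]
  + \mathbb{E}_{z \sim p_{z}(z)} \bigl[ \log \bigl( 1 - D(G(z)) \bigr) \bigr]
```

The discriminator maximizes V by assigning high probability to real records and low probability to generated ones, while the generator minimizes V by producing records the discriminator accepts as real.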

Methods: To generate synthetic data, we considered several GAN models relevant to structured data, namely Conditional GAN (CGAN), Conditional Tabular GAN (CTGAN), and Wasserstein GAN with Gradient Penalty (WGANGP), and investigated their applicability in the medical domain, where data are limited [3]. We used the publicly available Breast Cancer Wisconsin dataset (BCW) [4], which includes 569 patients with malignant or benign breast tumors. The primary task is then to classify tumors as malignant or benign.
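As background on WGANGP, the critic loss commonly given in the WGAN-GP literature is sketched below. The abstract does not state the loss that was actually used or the penalty weight, so the form and the symbols are assumptions for illustration: P_r is the real data distribution, P_g the generator distribution, and lambda the penalty coefficient.

```latex
L_{D} =
  \mathbb{E}_{\tilde{x} \sim P_{g}} \bigl[ D(\tilde{x}) \bigr]
  - \mathbb{E}_{x \sim P_{r}} \bigl[ D(x) \bigr]
  + \lambda \, \mathbb{E}_{\hat{x} \sim P_{\hat{x}}}
    \Bigl[ \bigl( \lVert \nabla_{\hat{x}} D(\hat{x}) \rVert_{2} - 1 \bigr)^{2} \Bigr],
\qquad
\hat{x} = \epsilon x + (1 - \epsilon) \tilde{x}, \quad \epsilon \sim U[0, 1]
```

The last term penalizes critic gradients whose norm deviates from 1 at points interpolated between real and generated records, which is what distinguishes this variant from the original Wasserstein GAN with weight clipping.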

Figure 1A [Fig. 1] depicts the evaluation framework schematically. As shown, the BCW is divided into Train and Test sets. To investigate the impact of the training data size on classification accuracy, the Train set is further divided into 10 subsets, each containing a given percentage of the original Train data. We used regular extension sampling, in which a small subset of the Train data is first selected randomly and then extended by adding more data to create a larger training dataset of a desired size. This is important for medical applications with a small amount of data, because it shows how newly available data can improve model performance.
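The regular extension sampling described above can be sketched as nested subsets taken from one random ordering of the Train set. This is a minimal illustration, not the authors' implementation: the function name, the ten fractions (10%, 20%, ..., 100%), and the Train size of 400 are assumptions, since the abstract does not give the exact percentages or split.

```python
import random


def regular_extension_subsets(train_indices, fractions, seed=0):
    """Return nested training subsets: each larger subset extends the smaller ones."""
    rng = random.Random(seed)
    shuffled = list(train_indices)
    rng.shuffle(shuffled)  # one fixed random order decides which records are added next
    subsets = {}
    for frac in sorted(fractions):
        size = max(1, round(frac * len(shuffled)))
        # Taking a prefix guarantees that smaller subsets are contained in larger ones.
        subsets[frac] = shuffled[:size]
    return subsets


# Hypothetical example: 400 training records, subsets of 10%, 20%, ..., 100%.
fractions = [k / 10 for k in range(1, 11)]
subsets = regular_extension_subsets(range(400), fractions)
```

Because every subset is a prefix of the same shuffled order, moving from one training size to the next only adds records, which mimics new patient data becoming available over time.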

To benchmark the synthetic data, we employed two classifiers suited to the BCW: Support Vector Machine (SVM) and Multi-Layer Perceptron (MLP). We also calculated the pairwise correlation difference (PCD) between synthetic and real data to measure statistically how much of the correlation among features in the real data is captured by the synthetic data.
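The abstract does not give a formula for the PCD. A common definition, assumed here, is the Frobenius norm of the difference between the Pearson correlation matrices of the real and the synthetic data; smaller values mean the feature correlations are better preserved. The sketch below uses NumPy and toy Gaussian data in place of the BCW and GAN output; the function name and all variable names are illustrative.

```python
import numpy as np


def pairwise_correlation_difference(real, synthetic):
    """Frobenius norm of the difference between the two Pearson correlation matrices."""
    corr_real = np.corrcoef(real, rowvar=False)  # columns are features
    corr_syn = np.corrcoef(synthetic, rowvar=False)
    return float(np.linalg.norm(corr_real - corr_syn, ord="fro"))


rng = np.random.default_rng(0)
# Toy stand-ins for real and synthetic tables (rows = records, columns = features).
cov = [[1.0, 0.8, 0.0], [0.8, 1.0, 0.0], [0.0, 0.0, 1.0]]
real = rng.multivariate_normal(np.zeros(3), cov, size=500)
good_synthetic = rng.multivariate_normal(np.zeros(3), cov, size=500)
poor_synthetic = rng.normal(size=(500, 3))  # ignores the correlation between features

pcd_good = pairwise_correlation_difference(real, good_synthetic)
pcd_poor = pairwise_correlation_difference(real, poor_synthetic)
```

In this toy setting the sample that reproduces the correlation structure yields the smaller PCD, which is the sense in which a larger PCD indicates that correlations in the real data are less well reflected.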

Results and discussions: The key concern in generating synthetic data is whether the complex patterns in the real data are represented in the synthetic data. To assess this, we measured the PCD between the synthetic data of each GAN variant and the real data as a function of the training data size, shown in Figure 1B [Fig. 1]. We note that WGANGP has a larger PCD for smaller training sizes, i.e., the correlation among features in the real data is less well reflected in its synthetic data. However, other metrics may lead to different results in favor of different generative models, especially for high-dimensional and complex data. Moreover, the evaluation results may vary depending on the application domain. Since we are mainly concerned with medical applications, we considered binary classification accuracy as the target evaluation metric.

Figures 1C and 1D [Fig. 1] show the classification accuracy of MLP and SVM for different training data sizes to determine how the inclusion of synthetic data can improve the accuracy. The results of the GAN variants were compared with the case using only real data, which is indicated as the silver standard in the figures. The MLP accuracy for WGANGP is superior to the silver standard, whereas for the SVM the accuracy decreases at smaller training data sizes and improves as the training size increases. This is consistent with the PCD results, where WGANGP shows a larger PCD for smaller training sizes. We note that GAN models generate low-quality data when only a very small amount of data is provided. It is therefore crucial to estimate, based on the available (limited) real data, how much synthetic data needs to be generated to improve the accuracy.
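The comparison above relies on training each classifier on an extended dataset of real plus synthetic records and testing it on held-out real data. A minimal sketch of assembling such a training set is given below, assuming NumPy arrays and a configurable real-to-synthetic ratio (the study states 1:1). The function name, parameters, and toy arrays are hypothetical and do not come from the authors' code.

```python
import numpy as np


def build_extended_dataset(X_real, y_real, X_syn, y_syn, ratio=1.0, seed=0):
    """Stack real records with `ratio` times as many synthetic records, then shuffle."""
    rng = np.random.default_rng(seed)
    n_syn = int(round(ratio * len(X_real)))
    if n_syn > len(X_syn):
        raise ValueError("not enough synthetic records for the requested ratio")
    picked = rng.choice(len(X_syn), size=n_syn, replace=False)
    X = np.vstack([X_real, X_syn[picked]])
    y = np.concatenate([y_real, y_syn[picked]])
    order = rng.permutation(len(X))  # mix real and synthetic rows before training
    return X[order], y[order]


# Toy stand-ins: 50 real and 200 synthetic records with 4 features and binary labels.
rng = np.random.default_rng(1)
X_real, y_real = rng.normal(size=(50, 4)), rng.integers(0, 2, size=50)
X_syn, y_syn = rng.normal(size=(200, 4)), rng.integers(0, 2, size=200)

X_ext, y_ext = build_extended_dataset(X_real, y_real, X_syn, y_syn, ratio=1.0)
```

A classifier fitted on `X_ext, y_ext` would then be compared, on the same real Test set, with one fitted on `X_real, y_real` alone (the silver standard). Changing `ratio` gives other combinations of real and synthetic data.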

Conclusions: We investigated different GAN variants to generate synthetic data for application in the medical domain, using the BCW. We performed generative experiments with training data subsets of different sizes to study the impact of the amount of available data on the quality of the corresponding synthetic data and, eventually, on binary classification accuracy. We also developed an evaluation framework that uses an extended dataset with both generated and real data for training classifiers. The results demonstrate that synthetic data from more advanced models such as WGANGP can improve the classification accuracy, even when only a small amount of data is available. This is observed over a larger range of training data sizes for the MLP classifier. The accuracy improves as the size of the training data increases. Here, a ratio of 1:1 was used for real and synthetic data in the extended dataset. In the future, other combinations should be explored to determine their impact on the classification accuracy.


References

1. Goodfellow IJ, Pouget-Abadie J, Mirza M, Xu B, Warde-Farley D, Ozair S, Courville A, Bengio Y. Generative Adversarial Networks. 2014 Jun. DOI: 10.48550/arXiv.1406.2661
2. Borisov V, Leemann T, Seßler K, Haug J, Pawelczyk M, Kasneci G. Deep neural networks and tabular data: A survey. arXiv. 2022. DOI: 10.48550/arXiv.2110.01889
3. Abedi M, Hempel L, Sadeghi S, Kirsten T. GAN-Based Approaches for Generating Structured Data in the Medical Domain. Appl. Sci. 2022;12(14):7075. DOI: 10.3390/app12147075
4. Wolberg W, Street W, Mangasarian O. Breast Cancer Wisconsin (Diagnostic). UCI Machine Learning Repository [Internet]. 1995. Available from: https://archive.ics.uci.edu/ml/datasets/breast+cancer+wisconsin+(diagnostic)