gms | German Medical Science

68. Jahrestagung der Deutschen Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie e. V. (GMDS)

17.09. - 21.09.23, Heilbronn

Hyperparameter Tuning Matters for Generative Models to Produce Improved Synthetic Data in Small Tabular Datasets

Meeting Abstract


  • Waldemar Hahn - Institute for Medical Informatics and Biometry, Carl Gustav Carus Faculty of Medicine, Technische Universität Dresden, Dresden, Germany; Center for Scalable Data Analytics and Artificial Intelligence (ScaDS.AI) Dresden/Leipzig, Dresden, Germany
  • Markus Wolfien - Institute for Medical Informatics and Biometry, Carl Gustav Carus Faculty of Medicine, Technische Universität Dresden, Dresden, Germany; Center for Scalable Data Analytics and Artificial Intelligence (ScaDS.AI) Dresden/Leipzig, Dresden, Germany

Deutsche Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie. 68. Jahrestagung der Deutschen Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie e. V. (GMDS). Heilbronn, 17.-21.09.2023. Düsseldorf: German Medical Science GMS Publishing House; 2023. DocAbstr. 290

doi: 10.3205/23gmds018, urn:nbn:de:0183-23gmds0180

Published: September 15, 2023

© 2023 Hahn et al.
This is an Open Access article distributed under the terms of the Creative Commons Attribution 4.0 License. See license information at http://creativecommons.org/licenses/by/4.0/.



Introduction: The generation of high-quality synthetic tabular data from small clinical datasets is of considerable importance in healthcare research. It enables sensitive health information to be used for advancing treatment and patient care while preserving patient privacy. However, the common practice of applying generative models such as CTGAN [1], TVAE [1], and CTAB-GAN+ [2] with default hyperparameters often leads to suboptimal performance.
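As a point of reference for this default-hyperparameter baseline, the following minimal sketch shows how such models are typically applied out of the box. It is not the authors' code: it assumes the open-source ctgan package, and the file and column names are purely hypothetical.

```python
# Minimal sketch of the default-hyperparameter baseline (not the authors' code).
# Assumes the open-source `ctgan` package; file and column names are hypothetical.
import pandas as pd
from ctgan import CTGAN, TVAE

real_df = pd.read_csv("heart_disease.csv")      # hypothetical file name
discrete_cols = ["sex", "cp", "target"]         # hypothetical categorical columns

model = CTGAN()                                 # all hyperparameters left at defaults
# model = TVAE()                                # the TVAE baseline works the same way
model.fit(real_df, discrete_columns=discrete_cols)
synthetic_df = model.sample(len(real_df))       # synthetic dataset of equal size
```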

Methods: We implement a hyperparameter tuning approach using 5-fold cross-validation on two small clinical datasets: the UCI Heart Disease Data (827 data points, 15 variables) and the Breast Cancer Wisconsin (Diagnostic) Data Set (699 data points, 10 variables). The tuning procedure is steered by a combined weighted score (inspired by [3]) composed of six metrics, each addressing a different facet of data similarity (a sketch of how such a score can be assembled follows the list):

1. Correlation Overlap Score evaluates the correlation matrices of the original and synthetic datasets, capturing the degree of relational match.
2. Support Coverage Metric quantifies the overlap in data distributions, ensuring both datasets share a similar support.
3. Discriminator Measure, assigned double weight due to its typically low values in the literature, uses a random forest classifier to assess the distinguishability between real and synthetic data.
4. Cluster Similarity Score employs the k-means clustering algorithm to examine the extent to which synthetic data points map onto clusters in the original data.
5. Machine Learning Efficiency measures how well a CatBoost model [4] trained on the synthetic data predicts a binary label on the original data.
6. Basic Statistical Measure compares the mean, median, and standard deviation between the numerical columns of both datasets.
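The abstract does not give the exact formulas behind these metrics, so the sketch below is only an illustration of how such a combined weighted score could be assembled: three of the six sub-scores (Correlation Overlap, the double-weighted Discriminator Measure, and the Basic Statistical Measure) are implemented with assumed normalisations, and the remaining metrics can be supplied in the same way.

```python
# Illustrative sketch of a combined weighted score (inspired by TabSynDex [3]).
# The normalisations below are assumptions, not the authors' definitions.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def correlation_overlap(real: pd.DataFrame, synth: pd.DataFrame) -> float:
    """Similarity of the correlation matrices of the numeric columns."""
    num = real.select_dtypes("number").columns
    diff = (real[num].corr() - synth[num].corr()).abs().to_numpy()
    return float(1.0 - np.nanmean(diff) / 2.0)       # correlations differ by at most 2

def discriminator_measure(real: pd.DataFrame, synth: pd.DataFrame) -> float:
    """1.0 means a random forest cannot tell real from synthetic rows."""
    X = pd.get_dummies(pd.concat([real, synth], ignore_index=True))
    y = np.r_[np.zeros(len(real)), np.ones(len(synth))]
    auc = cross_val_score(RandomForestClassifier(n_estimators=100),
                          X, y, cv=5, scoring="roc_auc").mean()
    return float(2.0 * (1.0 - max(auc, 0.5)))        # AUC 0.5 -> 1.0, AUC 1.0 -> 0.0

def basic_statistics(real: pd.DataFrame, synth: pd.DataFrame) -> float:
    """Agreement of mean, median, and standard deviation per numeric column."""
    num = real.select_dtypes("number").columns
    scores = []
    for col in num:
        for stat in (np.mean, np.median, np.std):
            r, s = stat(real[col]), stat(synth[col])
            scores.append(1.0 - min(abs(r - s) / (abs(r) + 1e-8), 1.0))
    return float(np.mean(scores))

def combined_score(real, synth, extra_scores=None) -> float:
    """Weighted average of all sub-scores; the Discriminator Measure counts twice."""
    scores = {"correlation": (correlation_overlap(real, synth), 1),
              "discriminator": (discriminator_measure(real, synth), 2),
              "basic_stats": (basic_statistics(real, synth), 1)}
    if extra_scores:                                 # support coverage, clustering,
        scores.update(extra_scores)                  # ML efficiency (CatBoost), ...
    total_weight = sum(w for _, w in scores.values())
    return sum(v * w for v, w in scores.values()) / total_weight
```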

The tuning itself is driven by these 5-fold cross-validation scores; the resulting models are then compared on an 80% training and 20% test split that is kept separate from the cross-validation folds.
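The abstract does not name a specific search strategy, so the following sketch uses plain random search as a stand-in: candidate configurations are scored by the mean combined score over the 5 cross-validation folds, and the best one is refit and evaluated on the held-out 20% split. The search space, trial count, and file/column names are illustrative assumptions, and combined_score refers to the scoring sketch above.

```python
# Hedged sketch of the tuning loop; random search is an assumption, not the
# authors' method. Hyperparameter ranges and file/column names are illustrative.
import random
import pandas as pd
from ctgan import CTGAN
from sklearn.model_selection import KFold, train_test_split

SEARCH_SPACE = {
    "epochs": [100, 300, 500],
    "batch_size": [100, 200, 500],
    "embedding_dim": [64, 128],
    "generator_dim": [(128, 128), (256, 256)],
}

def sample_config():
    return {name: random.choice(values) for name, values in SEARCH_SPACE.items()}

def cv_score(config, data, discrete_cols, n_splits=5):
    """Mean combined score of a CTGAN with `config` over the cross-validation folds."""
    fold_scores = []
    for train_idx, val_idx in KFold(n_splits, shuffle=True, random_state=0).split(data):
        model = CTGAN(**config)
        model.fit(data.iloc[train_idx], discrete_columns=discrete_cols)
        synth = model.sample(len(val_idx))
        fold_scores.append(combined_score(data.iloc[val_idx], synth))  # see sketch above
    return sum(fold_scores) / len(fold_scores)

real_df = pd.read_csv("heart_disease.csv")               # hypothetical file name
discrete_cols = ["sex", "cp", "target"]                  # hypothetical columns
train_df, test_df = train_test_split(real_df, test_size=0.2, random_state=0)

best_config = max((sample_config() for _ in range(20)),  # 20 random trials
                  key=lambda cfg: cv_score(cfg, train_df, discrete_cols))

final_model = CTGAN(**best_config)
final_model.fit(train_df, discrete_columns=discrete_cols)
print("held-out score:", combined_score(test_df, final_model.sample(len(test_df))))
```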

Results: The application of tuned hyperparameters led to higher average metric scores for all tested models on both datasets. For the Breast Cancer Wisconsin dataset, the average of all scores for the tuned TVAE model improved from 0.44 to 0.87 (a 99.21% improvement), while CTGAN's average score rose from 0.39 to 0.85 (a 116.51% improvement). On the UCI Heart Disease Data, the average score improved from 0.65 to 0.81 (a 24.68% improvement) for TVAE, and from 0.44 to 0.72 (a 65.73% improvement) for CTGAN. Moreover, CTAB-GAN+ demonstrated a 60.78% improvement on the same dataset.

Discussion: Our findings reveal considerable improvements in the average of the metrics with tuned hyperparameters across all models. Notably, the smallest improvement was seen with TVAE on the UCI Heart Disease Data; however, this was primarily due to TVAE's strong performance with default hyperparameters on that dataset. Furthermore, we were unable to tune CTAB-GAN+ on the Breast Cancer Wisconsin dataset due to a CUDA error. It is also worth noting that we treated every variable in the Breast Cancer Wisconsin dataset as categorical and, thus, did not use the Basic Statistical Measure for it.

This study emphasizes the potential of carefully tuned hyperparameters in generative models to significantly enhance the quality of the synthetic data produced, particularly for small clinical datasets. The findings contribute to the broader goal of enabling secure, privacy-preserving data sharing within the clinical field.

The authors declare that they have no competing interests.

The authors declare that an ethics committee vote is not required.


References

1. Xu L, Skoularidou M, Cuesta-Infante A, Veeramachaneni K. Modeling tabular data using conditional GAN. In: Advances in Neural Information Processing Systems 32. 33rd Conference on Neural Information Processing Systems (NeurIPS 2019). 2019. p. 7335-7345.
2. Zhao Z, Kunar A, Birke R, Chen LY. CTAB-GAN+: Enhancing tabular data synthesis [Preprint]. arXiv. 2022. arXiv:2204.00401.
3. Chundawat VS, Tarun AK, Mandal M, Lahoti M, Narang P. TabSynDex: A Universal Metric for Robust Evaluation of Synthetic Tabular Data [Preprint]. arXiv. 2022. arXiv:2207.05295.
4. Prokhorenkova L, Gusev G, Vorobev A, Dorogush AV, Gulin A. CatBoost: unbiased boosting with categorical features. In: Advances in Neural Information Processing Systems 31. 32nd Conference on Neural Information Processing Systems (NeurIPS 2018). 2018. p. 6638-6648.