Hyperparameter Tuning Matters for Generative Models to Produce Improved Synthetic Data in Small Tabular Datasets
Published: September 15, 2023
Introduction: Generating high-quality synthetic tabular data from small clinical datasets is of significant importance in healthcare research. It enables the use of sensitive health information for advancing treatment modalities and patient care while preserving patient privacy. However, the common practice of applying generative models such as CTGAN [1], TVAE [1], and CTAB-GAN+ [2] with default hyperparameters often leads to suboptimal performance.
Methods: We implement a hyperparameter tuning approach using 5-fold cross-validation on two small clinical datasets: UCI Heart Disease Data (827 data points, 15 variables) and Breast Cancer Wisconsin (Diagnostic) Data Set (699 data points, 10 variables). The tuning procedure is steered by a combined weighted score (inspired by [3]) composed of six metrics, each addressing a different facet of data similarity:
1. Correlation Overlap Score evaluates the correlation matrices of the original and synthetic datasets, capturing the degree of relational match.
2. Support Coverage Metric quantifies the overlap in data distributions, ensuring both datasets share a similar support.
3. Discriminator Measure, assigned double weight due to its typically low values in the literature, uses a random forest classifier to assess the distinguishability between real and synthetic data.
4. Cluster Similarity Score employs the k-means clustering algorithm to examine the extent to which synthetic data points map onto clusters in the original data.
5. Machine Learning Efficiency measures how well a CatBoost [4] model trained on the synthetic data predicts a binary label on the original data.
6. Basic Statistical Measure compares the mean, median, and standard deviation of the numerical columns of both datasets.
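To illustrate, the Correlation Overlap Score could be computed along the following lines. This is a hedged sketch assuming pandas DataFrames; the function name and the normalisation to [0, 1] are illustrative assumptions, not the paper's implementation:

```python
import numpy as np
import pandas as pd

def correlation_overlap(real: pd.DataFrame, synthetic: pd.DataFrame) -> float:
    """Compare the pairwise correlation matrices of real and synthetic data.

    Returns 1.0 when the correlations match exactly and decays toward
    0.0 as they diverge. Illustrative sketch, not the authors' exact metric.
    """
    diff = (real.corr() - synthetic.corr()).abs().to_numpy()
    # Mean absolute difference over the off-diagonal entries; two
    # correlations can differ by at most 2, so divide by 2 to land in [0, 1].
    off_diag = diff[~np.eye(diff.shape[0], dtype=bool)]
    return float(1.0 - off_diag.mean() / 2.0)
```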
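The Discriminator Measure, the double-weighted component, can be sketched with scikit-learn as follows. The function name, the choice of cross-validated accuracy, and the rescaling to [0, 1] are illustrative assumptions rather than the authors' code:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def discriminator_measure(real: pd.DataFrame, synthetic: pd.DataFrame) -> float:
    """Train a random forest to tell real rows from synthetic rows.

    A score near 1.0 means the classifier cannot distinguish the two
    datasets (accuracy near chance, 0.5); near 0.0 means they are
    trivially separable. Illustrative sketch, not the paper's exact metric.
    """
    X = pd.concat([real, synthetic], ignore_index=True).to_numpy()
    y = np.concatenate([np.zeros(len(real)), np.ones(len(synthetic))])
    acc = cross_val_score(
        RandomForestClassifier(n_estimators=100, random_state=0),
        X, y, cv=5, scoring="accuracy",
    ).mean()
    # Rescale: chance accuracy (0.5) -> 1.0, perfect separation (1.0) -> 0.0
    return float(np.clip(2.0 * (1.0 - acc), 0.0, 1.0))
```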
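Similarly, the Basic Statistical Measure could take the following shape. The normalisation by the real column's standard deviation and the clipping at zero are illustrative choices, not taken from the paper:

```python
import numpy as np
import pandas as pd

def basic_statistical_measure(real: pd.DataFrame, synthetic: pd.DataFrame) -> float:
    """Compare mean, median, and standard deviation per numerical column.

    Differences are normalised by each real column's standard deviation
    so the score is scale-free; 1.0 means the summary statistics coincide.
    Illustrative sketch, not the authors' exact metric.
    """
    diffs = []
    for col in real.select_dtypes(include="number"):
        scale = real[col].std() or 1.0  # guard against zero-variance columns
        for stat in (np.mean, np.median, np.std):
            diffs.append(abs(stat(real[col]) - stat(synthetic[col])) / scale)
    return float(max(0.0, 1.0 - np.mean(diffs)))
```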
Hyperparameters are selected within the 5-fold cross-validation procedure, and the tuned and default models are then compared on an 80% training / 20% test split that is distinct from the cross-validation folds.
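A minimal sketch of such a cross-validated tuning loop is given below. It assumes a `sample_synthetic(params, fit_df)` helper wrapping whichever generative model is being tuned and a `combined_score(real, synthetic)` computing the weighted average of the six metrics; both names, and the exhaustive grid search, are placeholders rather than the authors' implementation:

```python
import itertools
import numpy as np
from sklearn.model_selection import KFold

def tune(train_df, param_grid, sample_synthetic, combined_score, n_splits=5):
    """Grid search over generator hyperparameters with k-fold CV.

    For each hyperparameter setting, the generator is fit on the training
    folds and its synthetic sample is scored against the held-out fold;
    the setting with the best mean score wins. Illustrative sketch only.
    """
    keys = sorted(param_grid)
    best_params, best_score = None, -np.inf
    for values in itertools.product(*(param_grid[k] for k in keys)):
        params = dict(zip(keys, values))
        scores = []
        cv = KFold(n_splits, shuffle=True, random_state=0)
        for fit_idx, val_idx in cv.split(train_df):
            synthetic = sample_synthetic(params, train_df.iloc[fit_idx])
            scores.append(combined_score(train_df.iloc[val_idx], synthetic))
        mean = float(np.mean(scores))
        if mean > best_score:
            best_params, best_score = params, mean
    return best_params, best_score
```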
Results: The application of tuned hyperparameters led to higher average metric scores for all tested models on both datasets. For the Breast Cancer Wisconsin dataset, the average of all scores for the tuned TVAE model improved from 0.44 to 0.87 (a 99.21% improvement), while CTGAN's average score rose from 0.39 to 0.85 (a 116.51% improvement). On the UCI Heart Disease Data, the average score improved from 0.65 to 0.81 (a 24.68% improvement) for TVAE, and from 0.44 to 0.72 (a 65.73% improvement) for CTGAN. Moreover, CTAB-GAN+ demonstrated a 60.78% improvement on the same dataset.
Discussion: Our findings reveal considerable improvements in the average of the metrics with tuned hyperparameters across all models. Notably, the lowest improvement was seen with TVAE on the UCI Heart Disease Data. However, this was primarily due to TVAE's superior performance with default hyperparameters on this dataset. Furthermore, we were unable to tune CTAB-GAN+ on the Breast Cancer Wisconsin dataset due to a CUDA error. It is also worth noting that we treated each variable as categorical for this dataset and, thus, did not use the Basic Statistical Measure.
This study emphasizes the potential of finely-tuned hyperparameters in generative models to significantly enhance the quality of synthetic data produced, particularly for small clinical datasets. The findings contribute to the broader goal of enabling secure, privacy-preserving data sharing within the clinical field.
The authors declare that they have no competing interests.
The authors declare that an ethics committee vote is not required.
References
1. Xu L, Skoularidou M, Cuesta-Infante A, Veeramachaneni K. Modeling tabular data using conditional GAN. In: Advances in Neural Information Processing Systems 32 (NeurIPS 2019). 2019. p. 7335–7345.
2. Zhao Z, Kunar A, Birke R, Chen LY. CTAB-GAN+: Enhancing tabular data synthesis [Preprint]. arXiv. 2022. arXiv:2204.00401.
3. Chundawat VS, Tarun AK, Mandal M, Lahoti M, Narang P. TabSynDex: A Universal Metric for Robust Evaluation of Synthetic Tabular Data [Preprint]. arXiv. 2022. arXiv:2207.05295.
4. Prokhorenkova L, Gusev G, Vorobev A, Dorogush AV, Gulin A. CatBoost: unbiased boosting with categorical features. In: Advances in Neural Information Processing Systems 31 (NeurIPS 2018). 2018. p. 6638–6648.