Adapting Variational Autoencoders for Realistic Synthetic Data with Skewed and Bimodal Distributions
Published: February 26, 2021
Background: When data protection restrictions prohibit passing patient data directly to other researchers, one option is to create synthetic data based on the original data. Such synthetic data should ideally preserve the statistical relationships between the variables while protecting privacy, since they contain no original individual observations. In recent years, deep generative models have enabled significant progress in synthetic data generation. Variational autoencoders (VAEs) are a particularly popular class of such models. Plain VAEs are typically built around a latent space with a Gaussian distribution, which becomes a key limitation when they encounter more complex data distributions such as bimodal or skewed data. Some approaches try to improve the performance of VAEs by using different priors [1], but such changes are distribution-specific and cannot be used for clinical data, which may contain variables with many different distributions.
Methods: In this work, we propose a novel method for synthetic data generation that also handles bimodal and skewed data while keeping the overall VAE framework. We apply two transformations that bring the data into a form better suited to VAEs. First, we use Box-Cox transformations to bring skewed distributions closer to symmetry. Second, to deal with potentially bimodal data, we employ an inverse hyperbolic tangent transformation, which brings the peaks closer together and makes the tails lighter. After applying these two transformations and the VAE, the corresponding back-transformations of the VAE output should yield more realistic synthetic data. To evaluate our method, we use a simulation design dataset [2] that is based on a large breast cancer study [3], [4].
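The transformation pipeline described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how the Box-Cox parameter is chosen or how the data are scaled before the inverse hyperbolic tangent, so the fixed `lam`, the `pad` margin, the linear rescaling into (-1, 1), and the class name `SkewBimodalTransform` are all assumptions made for illustration. The sketch only shows that the two steps are invertible and that the back-transformation keeps generated values inside a bounded range.

```python
import math
import random


def box_cox(x, lam):
    """Box-Cox transform of a strictly positive value x with parameter lam."""
    if x <= 0:
        raise ValueError("Box-Cox requires strictly positive input")
    return math.log(x) if lam == 0 else (x ** lam - 1.0) / lam


def inv_box_cox(y, lam):
    """Inverse of box_cox for the same lam."""
    if lam == 0:
        return math.exp(y)
    base = lam * y + 1.0
    if base <= 0:
        raise ValueError("value lies outside the image of the Box-Cox transform")
    return base ** (1.0 / lam)


class SkewBimodalTransform:
    """Box-Cox, linear rescaling into (-1, 1), then atanh (hypothetical helper)."""

    def __init__(self, lam, pad=0.05):
        self.lam = lam  # assumed to be fixed by hand; in practice it would be estimated
        self.pad = pad  # assumed margin so scaled values stay strictly inside (-1, 1)

    def fit(self, values):
        bc = [box_cox(v, self.lam) for v in values]
        lo, hi = min(bc), max(bc)
        if hi == lo:
            raise ValueError("need at least two distinct values")
        width = hi - lo
        # Widen the observed range slightly so atanh never sees exactly -1 or 1.
        self.lo = lo - self.pad * width
        self.hi = hi + self.pad * width
        return self

    def forward(self, x):
        bc = box_cox(x, self.lam)
        scaled = 2.0 * (bc - self.lo) / (self.hi - self.lo) - 1.0
        return math.atanh(scaled)

    def backward(self, z):
        # tanh maps any real model output back into [-1, 1], which bounds the result.
        scaled = math.tanh(z)
        bc = (scaled + 1.0) / 2.0 * (self.hi - self.lo) + self.lo
        return inv_box_cox(bc, self.lam)


if __name__ == "__main__":
    random.seed(0)
    # Right-skewed toy data standing in for a clinical variable.
    data = [random.lognormvariate(0.0, 1.0) for _ in range(200)]
    transform = SkewBimodalTransform(lam=0.0).fit(data)
    encoded = [transform.forward(x) for x in data]      # values a VAE would be trained on
    decoded = [transform.backward(z) for z in encoded]  # back on the original scale
    print(max(abs(a - b) for a, b in zip(data, decoded)))
```

In such a setup, the generative model would be trained on the output of `forward`, and `backward` would be applied to whatever it generates. Because `tanh` is bounded, every back-transformed value falls within the slightly widened range fitted on the original data.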
Results: We show that our proposed transformation approach considerably improves the utility of synthetic data for skewed and bimodal distributions. We compare it with synthetic data generated by plain VAEs and by VAEs combined with an autoregressive implicit quantile network (AIQN) approach [5]. These two other methods do not reproduce bimodality as well; instead, they typically generate unimodal distributions, with or without skewness depending on whether the bimodality in the original data is asymmetric. For skewed data, these methods reduce the skewness of the synthetic data and bring it closer to a symmetric distribution. Furthermore, they generally do not respect the value range of the original data for skewed distributions. In contrast, our method can generate bimodality and skewness close to those of the original data while keeping the true range of the data.
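The properties compared here (skewness, bimodality, and value range) can each be summarized numerically. The abstract does not state which measures were used, so the following is only an illustrative sketch: sample skewness, a simplified form of Sarle's bimodality coefficient without small-sample correction, and the share of synthetic values outside the original range. All function names are hypothetical.

```python
import math


def central_moments(values):
    """Return the second, third, and fourth central moments (population form)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return m2, m3, m4


def skewness(values):
    """Sample skewness: 0 for symmetric data, positive for a long right tail."""
    m2, m3, _ = central_moments(values)
    return m3 / m2 ** 1.5


def bimodality_coefficient(values):
    """Simplified Sarle coefficient (skewness^2 + 1) / kurtosis.

    Values above 5/9 (the value for a uniform distribution) are commonly read
    as a hint of bimodality; this is a heuristic, not a formal test.
    """
    m2, _, m4 = central_moments(values)
    kurtosis = m4 / m2 ** 2  # non-excess kurtosis
    return (skewness(values) ** 2 + 1.0) / kurtosis


def fraction_outside_range(original, synthetic):
    """Share of synthetic values below the minimum or above the maximum of the original."""
    lo, hi = min(original), max(original)
    return sum(1 for v in synthetic if v < lo or v > hi) / len(synthetic)


if __name__ == "__main__":
    original = [1.0, 1.2, 1.1, 0.9, 5.0, 5.2, 4.9, 5.1]
    synthetic = [0.5, 1.0, 3.0, 3.1, 2.9, 5.0, 6.0, 3.0]
    print(skewness(original), skewness(synthetic))
    print(bimodality_coefficient(original), bimodality_coefficient(synthetic))
    print(fraction_outside_range(original, synthetic))
```

Comparing these summaries between original and synthetic data gives a rough picture of whether skewness is flattened, bimodality is lost, or the value range is violated.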
Conclusion: We developed a simple method that adapts VAEs to skewed and bimodal data by means of transformations. Because of its simplicity, it can be combined with many extensions of VAEs. It thus becomes feasible to generate high-quality synthetic clinical data for research under data protection constraints.
The authors declare that they have no competing interests.
The authors declare that an ethics committee vote is not required.
References
1. Serban IV, Ororbia II AG, Pineau J, Courville A. Multi-modal variational encoder-decoders. 2016.
2. Zöller D, Wockner L, Binder H. Modified ART study – Simulation design for an artifical but realistic human study dataset. 2020. DOI: 10.5281/ZENODO.3678736
3. Schmoor C, Olschewski M, Schumacher M. Randomized and non-randomized patients in clinical trials: experiences with comprehensive cohort studies. Stat Med. 1996;15:263–271.
4. Sauerbrei W, Royston P. Building multivariable prognostic and diagnostic models: transformation of the predictors by using fractional polynomials. J R Stat Soc Ser A. 1999;162:71–94.
5. Ostrovski G, Dabney W, Munos R. Autoregressive quantile networks for generative modeling [Preprint]. arXiv. 2018. arXiv:1806.05575.