gms | German Medical Science

65th Annual Meeting of the German Association for Medical Informatics, Biometry and Epidemiology (GMDS), Meeting of the Central European Network (CEN: German Region, Austro-Swiss Region and Polish Region) of the International Biometric Society (IBS)

06.09. - 09.09.2020, Berlin (online conference)

Adapting Variational Autoencoders for Realistic Synthetic Data with Skewed and Bimodal Distributions

Meeting Abstract

  • Kiana Farhadyar - Institute of Medical Biometry and Statistics, Faculty of Medicine and Medical Center – University of Freiburg, Freiburg, Germany
  • Harald Binder - Institute of Medical Biometry and Statistics, Faculty of Medicine and Medical Center – University of Freiburg, Freiburg, Germany

Deutsche Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie. 65th Annual Meeting of the German Association for Medical Informatics, Biometry and Epidemiology (GMDS), Meeting of the Central European Network (CEN: German Region, Austro-Swiss Region and Polish Region) of the International Biometric Society (IBS). Berlin, 06.-09.09.2020. Düsseldorf: German Medical Science GMS Publishing House; 2021. DocAbstr. 314

doi: 10.3205/20gmds170, urn:nbn:de:0183-20gmds1702

Published: February 26, 2021

© 2021 Farhadyar et al.
This is an Open Access article distributed under the terms of the Creative Commons Attribution 4.0 License. See license information at http://creativecommons.org/licenses/by/4.0/.


Text

Background: When data protection restrictions do not allow patient data to be passed on directly to other researchers, one option is to create synthetic data based on the original data. Such synthetic data should ideally preserve the statistical relationships between the variables while protecting privacy, because they contain no original individual observations. In recent years, deep generative models have enabled significant progress in synthetic data generation, and variational autoencoders (VAEs) are a popular class of such models. Plain VAEs are typically built around a latent space with a Gaussian distribution, which becomes a key challenge when they encounter more complex data distributions, such as bimodal or skewed data. Some approaches try to improve the performance of VAEs by using different priors [1], but such changes are distribution-specific and are therefore poorly suited to clinical data, which may contain variables with various distributions.

Methods: In this work, we propose a novel method for synthetic data generation that also handles bimodal and skewed data while keeping the overall VAE framework. We apply two transformations to convert the data into a form that is more compatible with VAEs. First, we use Box-Cox transformations to bring skewed distributions closer to symmetry. Then, to deal with potentially bimodal data, we employ an inverse hyperbolic tangent transformation, which brings the peaks closer together and makes the tails lighter. After applying these two transformations and the VAE, back-transforming the VAE output should yield more realistic synthetic data. To evaluate our method, we use a simulation design [2] that is based on a large breast cancer study [3], [4].
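
The two transformations and their inverses can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not say how the Box-Cox parameter is chosen or how values are brought into the domain of the inverse hyperbolic tangent, so the grid search in `fit_lambda`, the min-max scaling, and the margin `EPS` are all hypothetical choices. The VAE itself is omitted; in a full pipeline it would be trained on `z`, and its samples would be passed to `backward`.

```python
import math
import random

EPS = 1e-3  # assumed margin keeping scaled values strictly inside (-1, 1)

def boxcox(x, lam):
    """Box-Cox transform of strictly positive values for a given lambda."""
    if lam == 0.0:
        return [math.log(v) for v in x]
    return [(v ** lam - 1.0) / lam for v in x]

def inv_boxcox(y, lam):
    """Inverse Box-Cox; the base is floored so generated values stay valid."""
    if lam == 0.0:
        return [math.exp(v) for v in y]
    return [max(lam * v + 1.0, 1e-12) ** (1.0 / lam) for v in y]

def fit_lambda(x):
    """Assumed choice: grid search maximising the Box-Cox profile log-likelihood."""
    n = len(x)
    sum_log = sum(math.log(v) for v in x)
    best_lam, best_ll = 0.0, -math.inf
    for i in range(-20, 21):
        lam = i / 10
        y = boxcox(x, lam)
        mean = sum(y) / n
        var = sum((v - mean) ** 2 for v in y) / n
        ll = -0.5 * n * math.log(var) + (lam - 1.0) * sum_log
        if ll > best_ll:
            best_lam, best_ll = lam, ll
    return best_lam

def forward(x):
    """Box-Cox, then min-max scaling into (-1, 1), then atanh."""
    lam = fit_lambda(x)
    y = boxcox(x, lam)
    lo, hi = min(y), max(y)
    u = [(2.0 * (v - lo) / (hi - lo) - 1.0) * (1.0 - EPS) for v in y]
    return [math.atanh(v) for v in u], (lam, lo, hi)

def backward(z, params):
    """tanh, undo the scaling (clipped to [-1, 1]), then inverse Box-Cox."""
    lam, lo, hi = params
    u = [max(-1.0, min(1.0, math.tanh(v) / (1.0 - EPS))) for v in z]
    y = [lo + (v + 1.0) * (hi - lo) / 2.0 for v in u]
    return inv_boxcox(y, lam)

def skewness(x):
    """Moment-based sample skewness."""
    n = len(x)
    m = sum(x) / n
    s2 = sum((v - m) ** 2 for v in x) / n
    return sum((v - m) ** 3 for v in x) / n / s2 ** 1.5

# Toy right-skewed variable (illustrative only, not the study data).
random.seed(0)
x = [random.lognormvariate(0.0, 0.8) for _ in range(2000)]
z, params = forward(x)
x_back = backward(z, params)
```

In this sketch, `backward` maps any value back into the range observed in the training data, because `tanh` is bounded and the scaled value is clipped. Whether the authors rely on the same mechanism to keep the value range is not stated in the abstract.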

Results: We show that our proposed transformation approach considerably improves the utility of synthetic data for skewed and bimodal distributions. We compare it with synthetic data generated by plain VAEs and by VAEs with an autoregressive implicit quantile network (AIQN) approach [5]. These two other methods reproduce bimodality less well; instead, they typically generate unimodal distributions, which are skewed when the original bimodal distribution is asymmetric and symmetric otherwise. For skewed data, these methods reduce the skewness of the synthetic data, making it closer to a symmetric distribution. Furthermore, they generally do not respect the value range of the original data for skewed distributions. In contrast, our method can generate bimodality and skewness close to those of the original data while keeping the true range of the data.
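
The properties compared above (bimodality, skewness, value range) can be checked numerically. The sketch below is one possible way to do so and is not taken from the abstract, which does not name its utility measures: it uses the moment-based bimodality coefficient, (skewness² + 1) / kurtosis, where values above 5/9 hint at bimodality, plus the share of synthetic values that fall inside the original range. The helper names and the toy data are hypothetical, and the unimodal sample only stands in for a generator that smooths away the two modes.

```python
import random

def moments(x):
    """Return sample skewness and (non-excess) kurtosis."""
    n = len(x)
    m = sum(x) / n
    m2 = sum((v - m) ** 2 for v in x) / n
    m3 = sum((v - m) ** 3 for v in x) / n
    m4 = sum((v - m) ** 4 for v in x) / n
    return m3 / m2 ** 1.5, m4 / m2 ** 2

def bimodality_coefficient(x):
    """Moment-based coefficient; values above 5/9 hint at bimodality."""
    skew, kurt = moments(x)
    return (skew ** 2 + 1.0) / kurt

def range_coverage(original, synthetic):
    """Share of synthetic values inside the observed original range."""
    lo, hi = min(original), max(original)
    return sum(lo <= v <= hi for v in synthetic) / len(synthetic)

# Toy data (illustrative only): a symmetric two-component mixture and a
# unimodal Gaussian matched to its mean and standard deviation.
random.seed(1)
original = [
    random.gauss(-2.0, 0.5) if random.random() < 0.5 else random.gauss(2.0, 0.5)
    for _ in range(2000)
]
mean = sum(original) / len(original)
sd = (sum((v - mean) ** 2 for v in original) / len(original)) ** 0.5
unimodal = [random.gauss(mean, sd) for _ in range(2000)]

bc_original = bimodality_coefficient(original)
bc_unimodal = bimodality_coefficient(unimodal)
coverage = range_coverage(original, unimodal)
```

On this toy example, the coefficient separates the mixture from its unimodal stand-in, and the coverage falls below 1 because the matched Gaussian places some mass outside the original range.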

Conclusion: We developed a simple method that adapts VAEs through transformations to handle skewed and bimodal data. Due to its simplicity, it can be combined with many extensions of VAEs. Thus, it becomes feasible to generate high-quality synthetic clinical data for research under data protection constraints.

The authors declare that they have no competing interests.

The authors declare that an ethics committee vote is not required.


References

1. Serban IV, Ororbia II AG, Pineau J, Courville A. Multi-modal variational encoder-decoders. 2016.
2. Zöller D, Wockner L, Binder H. Modified ART study – Simulation design for an artifical but realistic human study dataset. 2020. DOI: 10.5281/ZENODO.3678736
3. Schmoor C, Olschewski M, Schumacher M. Randomized and non-randomized patients in clinical trials: experiences with comprehensive cohort studies. Stat Med. 1996;15:263–271.
4. Sauerbrei W, Royston P. Building multivariable prognostic and diagnostic models: transformation of the predictors by using fractional polynomials. J R Stat Soc Ser A. 1999;162:71–94.
5. Ostrovski G, Dabney W, Munos R. Autoregressive quantile networks for generative modeling [Preprint]. arXiv. 2018. arXiv:1806.05575.