Syndat – a platform for evaluation & visualization of synthetic patient level data
Published: 6 September 2024
Text
Introduction: With the growing volume of digital data documented every day, data-driven methods for predictive modeling are becoming increasingly important in healthcare [1]. These approaches hold considerable promise for supporting clinical decision-making and improving patient care in general. However, accessing and sharing sensitive patient data for such analyses poses significant challenges due to privacy concerns. AI-based generation of synthetic data aims to bridge this gap by enabling the sharing of realistic patient-level data while reducing the privacy concerns associated with sharing real data [2]. Synthetic data generation techniques nevertheless have limitations. Although the generative process aims to reproduce the underlying statistical characteristics of the original data as faithfully as possible, it does not always yield optimal results, so conclusions derived from synthetic data may differ from those drawn from the real data [3]. Moreover, clinical data are often structurally complex and high-dimensional, which makes evaluating and fine-tuning generative models a non-trivial and often time-consuming task.
Methods: Syndat (https://syndat.scai.fraunhofer.de/) aims to facilitate this process by offering a platform to evaluate, visualize and explore synthetic data based on different metrics and visualizations for both data quality and privacy. Synthetic data quality is evaluated based on three different metrics:
1. The ability to differentiate between the original and the synthetic data.
2. The similarity of the data in terms of marginal statistical distributions.
3. The ability to retain complex correlation structures in the synthetic data.
Syndat calculates a quality score ranging from 0 to 100 for each of the three metrics, which gives the user a detailed impression of the overall quality of the generated data.
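To make the three quality metrics more concrete, the following is a minimal sketch of how scores of this kind could be computed for purely numerical data. It is not Syndat's implementation: the function names, the logistic-regression discriminator, the Jensen-Shannon divergence on histograms, and the mappings onto a 0 to 100 scale are all assumptions chosen for illustration.

```python
import numpy as np


def _auc(scores, labels):
    # Fraction of (synthetic, real) pairs ranked correctly; ties count half.
    pos, neg = scores[labels == 1], scores[labels == 0]
    diff = pos[:, None] - neg[None, :]
    return ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size


def discrimination_score(real, synth, seed=0, epochs=500, lr=0.1):
    """Train a simple discriminator; 100 means real and synthetic look alike."""
    rng = np.random.default_rng(seed)
    X = np.vstack([real, synth])
    y = np.concatenate([np.zeros(len(real)), np.ones(len(synth))])
    X = (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-12)
    idx = rng.permutation(len(X))
    train, test = idx[: len(X) // 2], idx[len(X) // 2:]
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):  # plain gradient descent on the logistic loss
        p = 1.0 / (1.0 + np.exp(-(X[train] @ w + b)))
        grad = p - y[train]
        w -= lr * X[train].T @ grad / len(train)
        b -= lr * grad.mean()
    auc = _auc(X[test] @ w + b, y[test])
    # Assumed mapping: AUC 0.5 (chance) -> 100, AUC 0 or 1 -> 0.
    return 100.0 * (1.0 - abs(2.0 * auc - 1.0))


def distribution_score(real, synth, bins=20):
    """Mean per-feature Jensen-Shannon divergence (base 2), mapped to 0-100."""
    def kl(a, b):
        mask = a > 0
        return np.sum(a[mask] * np.log2(a[mask] / b[mask]))

    jsd = []
    for j in range(real.shape[1]):
        lo = min(real[:, j].min(), synth[:, j].min())
        hi = max(real[:, j].max(), synth[:, j].max())
        edges = np.linspace(lo, hi, bins + 1)  # shared bins for both samples
        p = np.histogram(real[:, j], bins=edges)[0] / len(real)
        q = np.histogram(synth[:, j], bins=edges)[0] / len(synth)
        m = 0.5 * (p + q)
        jsd.append(0.5 * kl(p, m) + 0.5 * kl(q, m))
    return 100.0 * (1.0 - float(np.mean(jsd)))


def correlation_score(real, synth):
    """Mean absolute gap between pairwise correlations, mapped to 0-100."""
    c_real = np.corrcoef(real, rowvar=False)
    c_syn = np.corrcoef(synth, rowvar=False)
    iu = np.triu_indices_from(c_real, k=1)  # off-diagonal pairs only
    gap = np.abs(c_real[iu] - c_syn[iu]).mean()
    return 100.0 * (1.0 - gap / 2.0)  # a correlation gap lies in [0, 2]


# Hypothetical usage with random stand-in data.
rng = np.random.default_rng(42)
real_data = rng.normal(size=(300, 3))
synthetic_data = rng.normal(size=(300, 3))
print(discrimination_score(real_data, synthetic_data))
print(distribution_score(real_data, synthetic_data))
print(correlation_score(real_data, synthetic_data))
```

A linear discriminator only detects differences that are linearly separable, so a stronger classifier would normally be preferred in practice; the sketch uses it only to keep the example dependency-free beyond NumPy.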
In addition to the data quality scores, Syndat reports risk scores for membership inference, linkability, and singling out, using the framework of Giomi et al. [4].
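As a rough illustration of what such a risk score expresses, the sketch below assumes that, as in the framework of [4], the risk is the attacker's success rate on records used for training in excess of the success rate on held-out control records, normalized by the remaining headroom. The function name and the clipping at zero are assumptions made here, not part of Syndat or of the cited framework's API.

```python
def privacy_risk(attack_rate: float, control_rate: float) -> float:
    """Excess attack success on training records over control records.

    Both rates are fractions of successful attacks in [0, 1]. The result is
    0 when the attack works no better on training records than on records
    the generator never saw, and 1 when it always succeeds on training
    records and never on control records.
    """
    if not (0.0 <= attack_rate <= 1.0 and 0.0 <= control_rate < 1.0):
        raise ValueError("rates must lie in [0, 1], with control_rate < 1")
    risk = (attack_rate - control_rate) / (1.0 - control_rate)
    return max(0.0, risk)  # assumed: negative excess is reported as no risk


# Hypothetical rates: 30 % success on training records, 10 % on controls.
print(privacy_risk(0.30, 0.10))
```

Comparing against a control set separates what an attacker learns from general population patterns from what leaks specifically about the training records.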
Results: Syndat gives the user a fast and easy way to get an overview of the quality of generated synthetic data, using the above-mentioned scoring system as well as a wide range of visualizations of both the general and the statistical composition of the data. Synthetic data points can be inspected in a two-dimensional, interactive embedding plot to see how they cluster relative to the real data. Outlier scores are computed and visualized for each synthetic data point; they can be used to single out and inspect synthetic data points that may not fit the original data well.
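The embedding and outlier inspection described above can be sketched as follows. The abstract does not state which embedding or outlier method Syndat uses, so this example assumes a PCA projection fitted on the pooled data and an outlier score defined as the distance from each synthetic point to its k-th nearest real neighbor; both choices and all function names are illustrative.

```python
import numpy as np


def pca_embed(real, synth):
    """Project real and synthetic rows onto the first two principal axes."""
    X = np.vstack([real, synth])
    Xc = X - X.mean(axis=0)
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    Z = Xc @ vt[:2].T  # coordinates along the top two components
    return Z[: len(real)], Z[len(real):]


def outlier_scores(real, synth, k=5):
    """Distance from each synthetic row to its k-th nearest real row."""
    mu, sd = real.mean(axis=0), real.std(axis=0) + 1e-12
    r = (real - mu) / sd  # scale features using the real data only
    s = (synth - mu) / sd
    # Pairwise Euclidean distances, shape (n_synth, n_real).
    d = np.linalg.norm(s[:, None, :] - r[None, :, :], axis=2)
    return np.partition(d, k - 1, axis=1)[:, k - 1]


# Hypothetical usage: plant one implausible synthetic record and find it.
rng = np.random.default_rng(0)
real_data = rng.normal(size=(200, 4))
synthetic_data = rng.normal(size=(50, 4))
synthetic_data[7] = 10.0
scores = outlier_scores(real_data, synthetic_data)
print(int(np.argmax(scores)))  # index of the most outlying synthetic row
```

The dense distance matrix keeps the sketch short but grows with the product of the two sample sizes; a nearest-neighbor index would be the usual choice for larger datasets.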
Distribution plots for both categorical and numerical features can be selected and compared against the distributions of the original data to identify features that have not been fitted correctly. Users can also compare correlation patterns between features in heatmap plots for real and synthetic data to check that correlation structures were learned correctly during training.
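The same comparisons can be made numerically to decide which plots deserve a closer look. The sketch below is an illustration under assumptions, not Syndat's code: it uses the total variation distance between category frequencies for a categorical feature and ranks feature pairs by the absolute gap between their real and synthetic correlations. The function and feature names are hypothetical.

```python
from collections import Counter

import numpy as np


def category_tvd(real_values, synth_values):
    """Total variation distance between category frequencies (0 = identical)."""
    p, q = Counter(real_values), Counter(synth_values)
    n_p, n_q = len(real_values), len(synth_values)
    return 0.5 * sum(abs(p[c] / n_p - q[c] / n_q) for c in set(p) | set(q))


def largest_correlation_gaps(real, synth, names, top=3):
    """Feature pairs whose correlation differs most between the datasets."""
    diff = np.abs(np.corrcoef(real, rowvar=False)
                  - np.corrcoef(synth, rowvar=False))
    i, j = np.triu_indices_from(diff, k=1)  # each pair once, no diagonal
    order = np.argsort(diff[i, j])[::-1][:top]
    return [(names[i[o]], names[j[o]], float(diff[i[o], j[o]])) for o in order]


# Hypothetical usage: the synthetic data lost the x-y correlation.
rng = np.random.default_rng(0)
x = rng.normal(size=500)
real_data = np.column_stack([x, x + 0.1 * rng.normal(size=500),
                             rng.normal(size=500)])
synthetic_data = rng.normal(size=(500, 3))
print(largest_correlation_gaps(real_data, synthetic_data, ["x", "y", "z"]))
print(category_tvd(["a", "a", "b", "b"], ["a", "a", "a", "b"]))
```

Pairs at the top of this ranking correspond to the cells that would stand out when the two heatmaps are viewed side by side.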
Conclusion: Syndat offers a comprehensive platform for synthetic data analysis that can be used to evaluate synthetic data and to find ways to fine-tune and further improve generative modeling results.
The authors declare that they have no competing interests.
The authors declare that an ethics committee vote is not required.
References
1. Yu KH, Beam AL, Kohane IS. Artificial intelligence in healthcare. Nat Biomed Eng. 2018;2:719-731. DOI: 10.1038/s41551-018-0305-z
2. Hernandez M, et al. Synthetic data generation for tabular health records: A systematic review. Neurocomputing. 2022;493:28-45.
3. Rankin D, et al. Reliability of supervised machine learning using synthetic data in health care: Model to preserve privacy for data sharing. JMIR Medical Informatics. 2020;8(7):e18910.
4. Giomi M, Boenisch F, Wehmeyer C, Tasnádi B. A unified framework for quantifying privacy risk in synthetic data. In: PoPETS 2023. Proceedings of the Privacy Enhancing Technologies Symposium. 2023. p. 312-328. DOI: 10.56553/popets-2023-0055