Potential of ChatGPT for generating synthetic German medical corpus: A comparison with real-world corpora
Published: 15 September 2023
Introduction: Access to medical text corpora is crucial for advancing research in health care. However, making medical data publicly available is a considerable challenge, especially in Germany, because of European data protection regulations [1]. One fundamental approach to overcoming this is to create diverse and representative synthetic text data. Language models (LMs) offer enormous potential for this purpose, as they do for other tasks in natural language processing. In this work, we explore one of the best-known LMs, “ChatGPT” [2], and its potential for creating German medical reports. Recent studies focus mainly on English-language solutions; for example, generative adversarial networks, vanilla Transformers, and GPT-2 models have been investigated for creating English medical data [3], [4]. Here, we compare synthetic medical reports generated by ChatGPT, which is based on GPT-3, with other medical and non-medical datasets.
Methods: For the experiments, 100 German discharge reports covering a wide variety of diseases and symptoms were obtained from ChatGPT. One example prompt is “Schreibe mir einen Arztbrief eines Patienten, der wegen Hypertonie ins Krankenhaus eingeliefert wurde.” (“Write me a discharge letter for a patient who was admitted to hospital because of hypertension.”) We analyzed the characteristics of these reports by comparing them with medical and non-medical corpora using t-Distributed Stochastic Neighbor Embedding (t-SNE), Uniform Manifold Approximation and Projection for Dimension Reduction (UMAP), and the K-Means algorithm. For comparison, the clinical guidelines and summaries “GGPONC 2.0” and “GraSCCo” and the discharge summaries “BRONCO150” served as medical datasets, while the narratives “WikiWarsDE” and the newspaper articles “KRAUTS” served as out-of-domain datasets. We extracted the embeddings based on the frequency of linguistic and semantic features, as well as the Unified Medical Language System semantic groups, which carry information about chemicals, disorders, anatomical structures, etc., similar to the work in [5]. Furthermore, named entity recognition categories (e.g., person, location) were combined with part-of-speech categories (e.g., noun, verb). Finally, 27 features from 480 documents were visually clustered using t-SNE and UMAP after normalization and scaling. Both techniques visualize high-dimensional data by projecting it into two- or three-dimensional space. K-Means clustering was also applied after principal component analysis had been used to reduce the data dimensionality.
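The frequency-based embedding and scaling step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the feature inventory `FEATURES`, the toy tag sequences, and the helper names `frequency_vector` and `standardize` are hypothetical, and z-score scaling is assumed because the abstract does not specify which normalization was used. The abstract mentions 27 features but does not list them, so only seven illustrative ones appear here.

```python
from collections import Counter
from math import sqrt

# Hypothetical feature inventory (part-of-speech, named entity, and
# UMLS-like semantic group labels). The abstract uses 27 features but
# does not enumerate them.
FEATURES = ["NOUN", "VERB", "PER", "LOC", "DISO", "CHEM", "ANAT"]


def frequency_vector(tags, features=FEATURES):
    """Return the relative frequency of each feature among a document's tags."""
    total = len(tags)
    if total == 0:
        return [0.0] * len(features)
    counts = Counter(tags)
    return [counts[f] / total for f in features]


def standardize(matrix):
    """Scale every column to zero mean and unit variance (z-score).

    Constant columns are mapped to 0.0 to avoid division by zero.
    """
    n = len(matrix)
    columns = list(zip(*matrix))
    means = [sum(col) / n for col in columns]
    stds = [
        sqrt(sum((x - m) ** 2 for x in col) / n)
        for col, m in zip(columns, means)
    ]
    return [
        [(x - m) / s if s > 0 else 0.0 for x, m, s in zip(row, means, stds)]
        for row in matrix
    ]


# Toy documents, assumed to be already annotated with one tag per token.
docs = {
    "synthetic_report": ["NOUN", "DISO", "VERB", "NOUN", "CHEM", "PER"],
    "clinical_summary": ["NOUN", "DISO", "DISO", "ANAT", "VERB", "CHEM"],
    "news_article": ["NOUN", "VERB", "PER", "LOC", "NOUN", "VERB"],
}

raw = [frequency_vector(tags) for tags in docs.values()]
scaled = standardize(raw)

for name, row in zip(docs, scaled):
    print(name, [round(x, 2) for x in row])
```

A matrix like `scaled`, with one row per document, is the kind of input that could then be passed to a projection or clustering routine, for example scikit-learn's `TSNE`, `PCA`, and `KMeans` classes.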
Results: Our analysis yielded distinguishable, separate clusters for each dataset, which indicates that the datasets have different characteristics. We observed that the ChatGPT and GraSCCo clusters overlap slightly. Ideally, however, the ChatGPT cluster would intertwine more with the medical ones, since synthetic texts should reflect the statistical distribution of real data. Nevertheless, the ChatGPT cluster is located near the medical data clusters.
Discussion and conclusion: The experimental results indicate that ChatGPT could be leveraged to create a more extensive corpus useful for researchers. However, there is still room for improvement, as we demonstrated a gap between the distribution of our synthetic texts and that of real ones. To narrow this gap, we aim to fine-tune ChatGPT with German medical datasets in the future. Moreover, more complex features that embed sequence-level semantics could be investigated, and the current dataset could be extended by automating the process of obtaining ChatGPT texts.
The authors declare that they have no competing interests.
The authors declare that an ethics committee vote is not required.
References
1. Buchner B. Forschungsdaten effektiver nutzen. Datenschutz und Datensicherheit. 2022;46:555-560. DOI: 10.1007/s11623-022-1658-8
2. OpenAI. ChatGPT. openai.com; November 2022. Available from: https://chat.openai.com
3. Guan J, Li R, Yu S, Zhang X. Generation of Synthetic Electronic Medical Record Text. In: IEEE International Conference on Bioinformatics and Biomedicine (BIBM); 2018 Dec 3-6; Madrid, Spain. p. 374-380. DOI: 10.1109/BIBM.2018.8621223
4. Amin-Nejad A, Ive J, Velupillai S. Exploring Transformer Text Generation for Medical Dataset Augmentation. In: Proceedings of the Twelfth Language Resources and Evaluation Conference. 2020. p. 4699-4708.
5. Modersohn L, Schulz S, Lohr C, Hahn U. GRASCCO — The First Publicly Shareable, Multiply-Alienated German Clinical Text Corpus. In: German Medical Data Sciences 2022 – Future Medicine: More Precise, More Integrative, More Sustainable! IOS; 2022. (Studies in Health Technology and Informatics; 296). p. 66-72. DOI: 10.3233/SHTI220805