gms | German Medical Science

64. Jahrestagung der Deutschen Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie e. V. (GMDS)

Deutsche Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie

08. - 11.09.2019, Dortmund

Generating synthetic data for use in research and teaching

Meeting Abstract

  • Lea Rosa Droese - University of Applied Sciences, Hannover, Germany
  • Svetlana Gerbel - Hannover Medical School, Center for Information Management, Hannover, Germany
  • Sonja Teppner - Hannover Medical School, Center for Information Management, Hannover, Germany
  • Johanna Fiebeck - Hannover Medical School, Center for Information Management, Hannover, Germany
  • Cornelia Frömke - University of Applied Sciences, Hannover, Germany

Deutsche Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie. 64. Jahrestagung der Deutschen Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie e.V. (GMDS). Dortmund, 08.-11.09.2019. Düsseldorf: German Medical Science GMS Publishing House; 2019. DocAbstr. 294

doi: 10.3205/19gmds124, urn:nbn:de:0183-19gmds1242

Published: 6 September 2019

© 2019 Droese et al.
This is an open-access article distributed under the terms of the Creative Commons Attribution 4.0 License. For license information see http://creativecommons.org/licenses/by/4.0/.


Text

Background: As part of a joint project between the Hannover Medical School (MHH) and the Hannover University of Applied Sciences (HsH), an early warning system is being developed to identify critical study processes that lead to extended study times or even termination of studies [1], [2]. An integral part of this collaboration is sharing data with HsH for teaching purposes. However, the intended development of machine learning methods for pattern recognition in critical study processes was not feasible with the original data because of MHH data protection regulations, as the data set contains sensitive information. An approach based on synthetic data was identified as a possible solution. Since the production of synthetic data is often not documented well enough in the literature for results to be reproduced [3], the aim of our work is to describe a possible procedure for generating synthetic data.

Methods: After the original data set had been validated, the R package synthpop was used to create a synthetic data set, which was then validated as well. The initial data set consisted of 112 columns and was reduced to 24 characteristics for clarity. Validation was performed by descriptively comparing the graphical and numerical characteristics of the original and synthetic data sets. The validation showed that the synthetic data imitated the original data set very well: the distributions of the individual values are comparable, and only a few characteristics deviate from the original. Certain rules, e.g. that an exam can only be passed if it was written, could not be implemented correctly in synthpop.
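A descriptive numerical comparison of one characteristic can be sketched as follows. This is only an illustration in Python using the standard library, not the authors' procedure (which relied on synthpop). The function names and the toy age values are hypothetical and do not come from the study data.

```python
from statistics import mean, median, stdev


def describe(values):
    """Return basic descriptive statistics for one numeric column."""
    return {
        "n": len(values),
        "mean": mean(values),
        "median": median(values),
        "sd": stdev(values),
    }


def compare_columns(original, synthetic):
    """Pair each statistic of the original column with its synthetic counterpart."""
    orig_stats = describe(original)
    syn_stats = describe(synthetic)
    return {
        name: {
            "original": orig_stats[name],
            "synthetic": syn_stats[name],
            "difference": syn_stats[name] - orig_stats[name],
        }
        for name in orig_stats
    }


# Hypothetical toy values standing in for one characteristic (e.g. age).
original_age = [19, 20, 21, 22, 23, 24, 25]
synthetic_age = [19, 20, 22, 22, 23, 25, 26]

for name, row in compare_columns(original_age, synthetic_age).items():
    print(name, row)
```

Repeating such a comparison for each of the retained characteristics gives a table of differences that can be read alongside graphical comparisons such as histograms.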

Results: In the first step, a synthetic data set was created from the validated original data set using synthpop and then validated. The initial data set consisted of 112 columns and was reduced to 24 characteristics for clarity. Validation was performed by descriptively comparing the characteristics of the original and synthetic data sets. This comparison showed that the synthetic data imitated the original data set very well. The distributions of individual values, such as the age structure, are very close to each other, and only a few characteristics deviate from the original. Certain rules, e.g. that an exam must have been written in order to be passed, could not be implemented correctly in synthpop.
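Whether a synthetic data set respects a logical rule such as the exam rule can be checked after generation. The following Python sketch assumes hypothetical field names (`exam_written`, `exam_passed`) and made-up records. It only detects violations and does not show how to enforce the rule during synthesis.

```python
def find_rule_violations(records):
    """Return records marked as passed although the exam was never written."""
    return [r for r in records if r["exam_passed"] and not r["exam_written"]]


# Hypothetical synthetic records; field names and values are illustrative only.
synthetic_records = [
    {"id": 1, "exam_written": True, "exam_passed": True},
    {"id": 2, "exam_written": True, "exam_passed": False},
    {"id": 3, "exam_written": False, "exam_passed": True},  # violates the rule
    {"id": 4, "exam_written": False, "exam_passed": False},
]

violations = find_rule_violations(synthetic_records)
print(f"{len(violations)} of {len(synthetic_records)} records violate the rule")
```

A non-zero count indicates that the synthesis did not preserve the dependency between the two characteristics.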

Discussion: Our work describes a possible procedure for creating and validating a synthetic data set for research purposes. The focus is on the challenges that arise at each step of the synthesis and on possible solutions. To verify consistency between the original and the synthetic data, rules for the evaluation of synthetic data have been described. While equivalence tests are the method of choice for this evaluation, they depend on pre-defined equivalence bounds. These bounds must be defined by the data owner and were not available in this application.
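To illustrate why equivalence bounds are needed, the following Python sketch applies two one-sided tests (TOST) to the means of one characteristic. It is a minimal example under stated assumptions, not a method used in this work: it uses a large-sample normal approximation instead of a t-distribution, and the data and the bounds of ±1 and ±0.1 are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean, variance


def tost_mean_equivalence(original, synthetic, lower, upper, alpha=0.05):
    """TOST for equivalence of two means (large-sample z-approximation).

    Equivalence is concluded if the mean difference is shown to lie
    inside the interval (lower, upper) at significance level alpha.
    """
    diff = mean(synthetic) - mean(original)
    se = sqrt(variance(original) / len(original)
              + variance(synthetic) / len(synthetic))
    z = NormalDist()
    p_lower = 1 - z.cdf((diff - lower) / se)  # H0: diff <= lower
    p_upper = z.cdf((diff - upper) / se)      # H0: diff >= upper
    p_value = max(p_lower, p_upper)
    return {"difference": diff, "p_value": p_value, "equivalent": p_value < alpha}


# Hypothetical toy data for one characteristic (e.g. age), 100 values each.
original_age = [20, 21, 22, 23, 24] * 20
synthetic_age = [20, 21, 22, 23, 25] * 20

wide = tost_mean_equivalence(original_age, synthetic_age, lower=-1.0, upper=1.0)
narrow = tost_mean_equivalence(original_age, synthetic_age, lower=-0.1, upper=0.1)
print("bounds +/-1.0:", wide["equivalent"])
print("bounds +/-0.1:", narrow["equivalent"])
```

The same pair of data sets is judged equivalent under the wide bounds and not equivalent under the narrow ones, so the conclusion depends on a choice that only the data owner can make.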

Each compilation of a synthetic data set is based on an individual question and requires an individual solution. The quality of the synthesized data always depends on the quality of the algorithm or procedure used to generate it. The question remains as to how similar the synthesized data should and may be to the original data.

The authors declare that they have no competing interests.

The authors declare that an ethics committee vote is not required.


References

1. Krohn M, Teppner S, Simon N, Melnik I, Pracht G, Müller J, Gerbel S. Kritische Studienverläufe mit Datawarehouse erkennen. In: Deutsche Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie, Hrsg. 63. Jahrestagung der Deutschen Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie (GMDS). Osnabrück, 02.-06.09.2018. Düsseldorf: German Medical Science GMS Publishing House; 2018.
2. Krohn M. Management studentischer Sozialisationsrisiken als strategische Studiengangsentwicklung. Hochschulmanagement. 2016 Dec;5:116-120.
3. MacLachlan S. Realism in synthetic data generation [Thesis]. Palmerston North, New Zealand: Massey University; 2017.