gms | German Medical Science

63. Jahrestagung der Deutschen Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie e. V. (GMDS)

Deutsche Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie

02. - 06.09.2018, Osnabrück

(Semi) automatic annotation and anonymization of medical text documents for secondary use

Meeting Abstract

  • Hans Laser - Hannover Medical School, Centre of Information Management (ZIMt), Hannover, Germany
  • Felix Struckmann - University of Applied Sciences Hannover, Faculty III – Media, Information and Design, Hannover, Germany
  • Yannik Wissner - University of Applied Sciences Hannover, Faculty III – Media, Information and Design, Hannover, Germany
  • Norman Schönfeld - Hannover Medical School, Centre of Information Management (ZIMt), Hannover, Germany
  • Christian Wartena - University of Applied Sciences Hannover, Faculty III – Media, Information and Design, Hannover, Germany
  • Svetlana Gerbel - Hannover Medical School, Centre of Information Management (ZIMt), Hannover, Germany

Deutsche Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie. 63. Jahrestagung der Deutschen Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie e.V. (GMDS). Osnabrück, 02.-06.09.2018. Düsseldorf: German Medical Science GMS Publishing House; 2018. DocAbstr. 248

doi: 10.3205/18gmds127, urn:nbn:de:0183-18gmds1272

Published: 27 August 2018

© 2018 Laser et al.
This is an Open Access article distributed under the terms of the Creative Commons Attribution 4.0 License. For license information see http://creativecommons.org/licenses/by/4.0/.


Text

Introduction: Electronic health records (EHR) are a common and important source for medical research. Parts of the EHR documentation are usually stored as plain text (e.g. medical reports and findings). Due to legal requirements (Article 4(5) of the EU GDPR), free-text documents must be adequately anonymized by removing personally identifiable information to protect patients' privacy before the data are shared with other researchers [1].

Manual anonymization, however, is a very time-consuming process [2]. The aim of this project was to develop a solution for the automated anonymization of unstructured clinical texts.

We used the named-entity recognizer (NER classifier) of Faruqui and Padó [3] for benchmarking. The out-of-the-box model of the classifier was trained on newspaper articles (CoNLL-2003 Shared Task corpus), with semantic generalization applied using a web corpus.

Methods: The EHR data required for evaluation were provided by the Enterprise Clinical Research Data Warehouse of Hannover Medical School. A rule-based approach based on regular expressions was used to recognize entities. This approach was implemented using Python and the Natural Language Toolkit (NLTK). Texts were first split into sentences and then into words (tokenization) using NLTK. Part-of-speech tagging was performed by the TreeTagger [4]. Additionally, several custom tags were added using handcrafted rules, for example a tag marking a word as a title (e.g. Ms., Dr.). Context rules using these tags made it possible to identify names.
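The interplay of custom tags and context rules can be illustrated with a minimal sketch. It is not the authors' implementation: the title list, the tag names (TITLE, CAP, O), the simplified whitespace tokenizer (used here instead of NLTK and the TreeTagger so that the example is self-contained) and the single context rule are all assumptions made for illustration.

```python
import re

# Hypothetical title list; the abstract names only "Ms." and "Dr." as examples.
TITLES = {"Mr.", "Ms.", "Mrs.", "Dr.", "Prof."}


def tokenize(sentence):
    """Whitespace tokenizer that keeps titles intact and splits off trailing punctuation."""
    tokens = []
    for raw in sentence.split():
        if raw in TITLES:
            tokens.append(raw)
            continue
        word, punct = re.match(r"^(.*?)([.,;:!?]*)$", raw).groups()
        if word:
            tokens.append(word)
        tokens.extend(punct)  # one token per punctuation character
    return tokens


def tag_tokens(tokens):
    """Attach a custom tag to each token: TITLE, CAP (capitalized) or O (other)."""
    tagged = []
    for tok in tokens:
        if tok in TITLES:
            tagged.append((tok, "TITLE"))
        elif tok[0].isupper():
            tagged.append((tok, "CAP"))
        else:
            tagged.append((tok, "O"))
    return tagged


def find_names(tagged):
    """Context rule: a run of CAP tokens directly after a TITLE is treated as a name."""
    spans = []
    i = 0
    while i < len(tagged):
        if tagged[i][1] == "TITLE":
            j = i + 1
            while j < len(tagged) and tagged[j][1] == "CAP":
                j += 1
            if j > i + 1:
                spans.append((i + 1, j))
            i = j
        else:
            i += 1
    return spans


def anonymize(sentence):
    """Return the token list with recognized name tokens replaced by a placeholder."""
    tokens = tokenize(sentence)
    for start, end in find_names(tag_tokens(tokens)):
        tokens[start:end] = ["[NAME]"] * (end - start)
    return tokens


print(anonymize("Dr. Anna Schmidt examined the patient."))
```

In German text every noun is capitalized, so a rule this simple would over-generate; a realistic rule set would also have to consult the part-of-speech tags.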

A semi-automated pipeline was created to apply the algorithm. The pipeline loads the documents, preprocesses them using the methods developed and makes the results available to a web-based dashboard that enables manual verification, correction and enhancement of the generated annotations. In addition, a full-text search engine based on Apache Solr [5] and PHP was implemented to make the anonymized texts searchable.
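One way to picture the hand-over between automatic annotation and manual review is sketched below. The `Annotation` class, its `status` values and the `redact` function are hypothetical names chosen for illustration; the abstract does not describe the data model used by the dashboard.

```python
from dataclasses import dataclass


@dataclass
class Annotation:
    start: int            # character offset where the span begins
    end: int              # character offset just past the span
    label: str            # e.g. "NAME"
    status: str = "auto"  # assumed states: "auto", "confirmed", "rejected"


def redact(text, annotations):
    """Replace every span that was not rejected during review with its label.

    Spans are processed from right to left so that earlier offsets stay valid.
    Overlapping spans are not handled in this sketch.
    """
    kept = [a for a in annotations if a.status != "rejected"]
    for a in sorted(kept, key=lambda a: a.start, reverse=True):
        text = text[:a.start] + f"[{a.label}]" + text[a.end:]
    return text


text = "Patient Max Mustermann was seen in Hannover."
name_start = text.index("Max Mustermann")
city_start = text.index("Hannover")
annotations = [
    Annotation(name_start, name_start + len("Max Mustermann"), "NAME", status="confirmed"),
    Annotation(city_start, city_start + len("Hannover"), "LOCATION", status="rejected"),
]
print(redact(text, annotations))  # Patient [NAME] was seen in Hannover.
```

Storing annotations as offsets next to the untouched source text keeps every reviewer decision reversible until the final redaction step.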

Finally, a gold standard was defined to evaluate the results of the automatic classification (NER vs. rule-based system). For this purpose, 100 documents were manually annotated.

Results: For the annotation of the 100 documents (gold standard), one person needed 415 minutes to identify 1148 relevant passages.

The out-of-the-box NER classifier using conditional random fields achieved a recall of 73.9% on the EHR documents (reported recall on the CoNLL corpus: 88%) and a precision of 43.3% (reported precision on CoNLL: 96.2%). False positives were usually due to domain-specific medical vocabulary or unknown abbreviations.

The rule-based system could be improved by using lookups over generated name lists and classification systems. It achieved a recall of 79.6% and a precision of 71.6% before manual corrections.
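For reference, recall and precision against a gold standard can be computed as in the following sketch. The exact-match criterion on (document, start, end) triples is an assumption, since the abstract does not state how predicted passages were matched to gold passages, and the spans below are invented toy data, not the study's annotations.

```python
def precision_recall(gold, predicted):
    """Precision and recall for exact span matches between gold and predicted sets."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Toy spans as (document id, start offset, end offset).
gold = {("doc1", 0, 5), ("doc1", 10, 14), ("doc2", 3, 9), ("doc2", 20, 25)}
predicted = {("doc1", 0, 5), ("doc2", 3, 9), ("doc2", 30, 34)}

precision, recall = precision_recall(gold, predicted)
print(f"precision={precision:.3f} recall={recall:.3f}")
```

For anonymization, recall is the more critical figure, because every missed passage leaves identifying information in the text, whereas a false positive only costs readability.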

Discussion: As there is currently no medical training data available in the German language, the classifier used was trained on an out-of-domain corpus. Creating suitable corpora requires a lot of effort.

Nevertheless, the semi-automated pipeline allows a significant acceleration of anonymization compared to purely manual annotation. It should be emphasized that manual checks remain absolutely necessary.

Anonymization leads to loss of information and readability. For some applications, a pattern-based approach [6], in which only medical information and stop words are retained and the rest is made unrecognizable, could be an alternative.
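The idea of retaining only medical information and stop words can be sketched as follows. The two word lists are tiny hypothetical placeholders (a real system would draw on a stop-word list and a medical terminology), and masking each remaining token by its character shape is only one assumed way of making it unrecognizable.

```python
import re

# Hypothetical whitelists for illustration only.
STOP_WORDS = {"the", "was", "with", "in", "and", "of", "a"}
MEDICAL_TERMS = {"patient", "diagnosed", "diabetes", "mellitus"}
KEEP = STOP_WORDS | MEDICAL_TERMS


def mask(token):
    """Replace letters by x/X and digits by 9; other characters are kept."""
    out = []
    for ch in token:
        if ch.isdigit():
            out.append("9")
        elif ch.isalpha():
            out.append("X" if ch.isupper() else "x")
        else:
            out.append(ch)
    return "".join(out)


def pattern_anonymize(text):
    """Keep whitelisted words and mask every other word in place."""
    def replace(match):
        word = match.group(0)
        return word if word.lower() in KEEP else mask(word)
    return re.sub(r"\w+", replace, text)


print(pattern_anonymize("Anna Schmidt was diagnosed with diabetes in 2015."))
# Xxxx Xxxxxxx was diagnosed with diabetes in 9999.
```

Because everything outside the whitelist is masked, unknown names cannot slip through, at the cost of also masking medical terms that are missing from the list.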

The authors declare that they have no competing interests.

The authors declare that an ethics committee vote is not required.


References

1. EU general data protection regulation 2016/679 (GDPR). [last access 25.05.2018]. Available from: http://www.privacy-regulation.eu/en/article-4-definitions-GDPR.htm
2. Douglass M. Computer-assisted de-identification of free-text nursing notes [Master thesis]. Massachusetts Institute of Technology; 2005.
3. Faruqui M, Padó S. Training and Evaluating a German Named Entity Recognizer with Semantic Generalization. Proceedings of KONVENS 2010. Saarbrücken, Germany; 2010.
4. TreeTagger. [last access 09.04.2018]. Available from: http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/data/tree-tagger1.pdf
5. Apache Solr. [last access 25.05.2018]. Available from: http://lucene.apache.org/solr/
6. Meystre SM, Friedlin FJ, South BR, Shen S, Samore MH. Automatic de-identification of textual documents in the electronic health record: a review of recent research. BMC Medical Research Methodology. 2010;10(1):70. DOI: 10.1186/1471-2288-10-70