gms | German Medical Science

22. Jahrestagung des Deutschen Netzwerks Evidenzbasierte Medizin e. V.

Deutsches Netzwerk Evidenzbasierte Medizin e. V.

24. - 26.02.2021, digital

Quality control of the machine-learning tool DeepL for translation of oncological guidelines from German to English language

Meeting Abstract


  • Gregor Wenzel - Deutsche Krebsgesellschaft e.V., Leitlinienprogramm Onkologie, Germany
  • Thomas Langer - Deutsche Krebsgesellschaft e.V., Leitlinienprogramm Onkologie, Germany
  • Markus Follmann - Deutsche Krebsgesellschaft e.V., Leitlinienprogramm Onkologie, Germany

Who cares? – EbM und Transformation im Gesundheitswesen. 22. Jahrestagung des Deutschen Netzwerks Evidenzbasierte Medizin. sine loco [digital], 24.-26.02.2021. Düsseldorf: German Medical Science GMS Publishing House; 2021. Doc21ebmPS-2-04

doi: 10.3205/21ebm058, urn:nbn:de:0183-21ebm0584

Published: 23 February 2021

© 2021 Wenzel et al.
This is an open-access article distributed under the terms of the Creative Commons Attribution 4.0 License.



Background/research question: Oncology guidelines are long documents; for example, the breast cancer guideline contains almost 153,000 words. Demand for English translations is high, which places a significant economic and resource burden on authors and professional societies. As an alternative to translating guidelines by hand, automated computer-based tools can be used, potentially reducing the required time and manpower. The feasibility of this approach for medical texts has been discussed before, and significant limitations were found. Herein, we resume this discussion with a focus on DeepL, a neural machine translation service using convolutional neural networks. We used it to translate guideline recommendations and quantified the differences from a gold standard delivered by a professional human translator.

Methods: The guidelines for breast cancer and endometrial cancer were translated from German to English by a professional human translator as well as by DeepL. The text of every recommendation was compared between the human and the machine translation using Levenshtein distance (LEV), bilingual evaluation understudy (BLEU) and WordNet semantic similarity (SES) scores as measures of text similarity. Additionally, a formal grammar check was performed using GrammarBot. Descriptive statistics of these measures were calculated and compared quantitatively.
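As an illustration of the character-based comparison, the following is a minimal Python sketch of a Levenshtein score expressed as a similarity between 0 and 1. The abstract does not state how LEV was scaled to a percentage; dividing the edit distance by the length of the longer string is an assumption made here, and the function names are hypothetical rather than taken from the study.

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions
    and substitutions needed to turn string a into string b."""
    # Iterate over the longer string so each row stays as short as possible.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from the first i characters of a to ""
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep if equal)
            ))
        previous = current
    return previous[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """Assumed normalisation: 1 - distance / length of the longer string.
    Returns 1.0 for identical strings and 0.0 for completely different ones."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty strings are identical
    return 1.0 - levenshtein_distance(a, b) / longest
```

Under this assumed scaling, two identical translations of a recommendation score 1.0 (100%), and every differing character lowers the score in proportion to the length of the longer text.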

Results: 535 recommendations were extracted from the two guidelines. LEV scores between the machine and human translation ranged from 18.4% to 100.0% with a mean of 94.8% ± 12.9% (standard deviation), and 358/535 = 66.9% of the translations were identical (LEV = 100%). BLEU scores ranged from 0.69% to 100.00% (mean 91.32% ± 18.39%). Semantic similarity ranged from 48.2% to 100.0% (mean 98.9% ± 4.5%), and 410/535 = 76.6% of the translations had complete semantic equality (SES = 100%). Formal grammar checking found no grammatical error in 530/535 = 99.1% of the translations and one grammatical error in 5/535 = 0.9%.
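The reported proportions can be re-derived from the counts stated in the results. The short Python sketch below does this; the helper name `percent_share` and the dictionary keys are hypothetical, and only the counts come from the abstract.

```python
def percent_share(count: int, total: int, digits: int = 1) -> float:
    """Share of count in total, as a percentage rounded to `digits` decimals."""
    return round(100.0 * count / total, digits)


TOTAL_RECOMMENDATIONS = 535  # recommendations extracted from both guidelines

# Counts as reported in the results.
reported_counts = {
    "identical translation": 358,
    "complete semantic equality": 410,
    "no grammatical error": 530,
    "one grammatical error": 5,
}

shares = {
    label: percent_share(count, TOTAL_RECOMMENDATIONS)
    for label, count in reported_counts.items()
}

for label, share in shares.items():
    print(f"{label}: {share}%")
```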

Conclusion: Most recommendations were translated identically by DeepL and the human translator. The high mean Levenshtein and BLEU scores between the translations suggest that the quality of automated translation has improved significantly within the last decade. The machine translations can serve as a high-quality basis for proof-reading as a final professional quality check, significantly reducing the required time and cost. Guideline background texts were not covered by this analysis and require further research.