Quality control of the machine-learning tool DeepL for translation of oncological guidelines from German to English language
Published: 23 February 2021
Background/research question: Guidelines in oncology are documents with a large text corpus; for example, the guideline for breast cancer contains almost 153,000 words. Demand for translations into English is high, resulting in a significant economic and resource burden for authors and professional societies. As an alternative to translating guidelines by hand, automated computer-based tools can be used, potentially reducing the required time and manpower. The feasibility of this approach for medical texts has been discussed before, and significant limitations were found. Herein, we resume this discussion, focussing on the translation tool DeepL, a neural machine translation service using convolutional neural networks. We use it to translate guideline recommendations and quantify the differences from a gold standard delivered by a professional human translator.
Methods: The guidelines for breast cancer and endometrial cancer were translated from German to English by a professional human translator as well as by DeepL. The text of every recommendation was compared between the human and the machine translation using Levenshtein distance (LEV), bilingual evaluation understudy (BLEU) and WordNet semantic similarity (SES) scores as measures of text similarity. Additionally, a formal grammar check was performed using GrammarBot. Descriptive statistics of these measures were calculated and compared quantitatively.
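As an illustration of the first measure, the following minimal Python sketch computes a normalized Levenshtein similarity for pairs of translations and summarizes the scores. It is not the authors' implementation: the abstract does not state how LEV was normalized, so dividing the edit distance by the length of the longer string is an assumption, and the example sentences and the names `levenshtein_similarity` and `summarize` are hypothetical.

```python
from statistics import mean, stdev


def levenshtein_distance(a: str, b: str) -> int:
    """Minimum number of single-character edits that turn a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the working row as short as possible
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1]; 1.0 means identical (assumed normalization)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein_distance(a, b) / longest


def summarize(scores):
    """Descriptive statistics of the kind reported in the abstract."""
    return {
        "min": min(scores),
        "max": max(scores),
        "mean": mean(scores),
        "sd": stdev(scores),  # sample standard deviation, needs >= 2 scores
        "identical_share": sum(1 for s in scores if s == 1.0) / len(scores),
    }


# Hypothetical (human, machine) pairs, not taken from the guidelines.
pairs = [
    ("Surgery should be offered.", "Surgery should be offered."),
    ("Radiotherapy is recommended.", "Radiation therapy is recommended."),
]
scores = [levenshtein_similarity(human, machine) for human, machine in pairs]
print(summarize(scores))
```

BLEU and WordNet-based similarity would typically be taken from an NLP library rather than written by hand; which library and which settings were used is not stated in the abstract.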
Results: A total of 535 recommendations were extracted from the two guidelines. LEV scores between the machine and human translations ranged from 18.4% to 100.0%, with a mean of 94.8% ± 12.9% (standard deviation), and 358/535 (66.9%) of the translations were identical (LEV = 100%). BLEU scores ranged from 0.69% to 100.00% (mean 91.32% ± 18.39%). Semantic similarity ranged from 48.2% to 100.0% (mean 98.9% ± 4.5%), and 410/535 (76.6%) of the translations had complete semantic equality (SES = 100%). Formal grammar checking found no grammatical errors in 530/535 (99.1%) recommendations and one grammatical error in each of the remaining 5/535 (0.9%).
Conclusion: Most recommendations were translated identically by DeepL and the human translator. Considering the high mean Levenshtein and BLEU scores between the translations, the quality of automated translation has improved significantly within the last decade. The machine translations can serve as a high-quality basis for proofreading as a final professional quality check, significantly reducing the required time and cost. Guideline background texts were not covered by this analysis and require further research.