gms | German Medical Science

From bacterial isolates to antibiotic resistograms ‒ towards an automated detection of recently acquired antibiotic resistance genes (ARGs) using machine learning

Meeting Abstract

  • Josefa Welling - Institute for Artificial Intelligence in Medicine, University Hospital Essen, Department of Medicine, University of Duisburg-Essen
  • Sultan Imangaliyev - Institute for Artificial Intelligence in Medicine, University Hospital Essen, Department of Medicine, University of Duisburg-Essen
  • Simon Magin - Institute for Artificial Intelligence in Medicine, University Hospital Essen, Department of Medicine, University of Duisburg-Essen
  • Jan Kehrmann - Institute for Medical Microbiology, University Hospital Essen, Department of Medicine, University of Duisburg-Essen
  • Ivana Kraiselburd - Institute for Artificial Intelligence in Medicine, University Hospital Essen, Department of Medicine, University of Duisburg-Essen
  • Folker Meyer - Institute for Artificial Intelligence in Medicine, University Hospital Essen, Department of Medicine, University of Duisburg-Essen

SMITH Science Day 2022. Aachen, 23.-23.11.2022. Düsseldorf: German Medical Science GMS Publishing House; 2023. DocP16

doi: 10.3205/22smith27, urn:nbn:de:0183-22smith278

Published: January 31, 2023

© 2023 Welling et al.
This is an Open Access article distributed under the terms of the Creative Commons Attribution 4.0 License. See license information at http://creativecommons.org/licenses/by/4.0/.


Text

Introduction: Antibiotic-resistant bacteria cause a growing number of fatalities each year [1], and the WHO expects this trend to continue [2]. Combined with the small number of newly developed antibiotic agents, this poses a serious threat to human health [3]. Rapid, accurate detection of antibiotic resistance is required to determine an appropriate antibiotic therapy. Current methods are based solely on cultivation and therefore test only a limited number of antibiotics. Our aim is to produce a detailed resistogram that helps physicians find a suitable treatment and makes therapy more effective. Notably, this requires a personalized medicine approach that covers both known antibiotic resistance genes (ARGs) and hitherto unknown ARGs that have been horizontally transferred and thus do not appear in the ARG-specific databases. We are training a machine learning classifier to reliably identify those recently transferred genes.

Methods: For high-fidelity, rapid determination of antibiotic resistance, we are developing a fully automated Snakemake workflow for genome analysis. Snakemake is a workflow management system designed for scalable and reproducible data analyses [4]. Our genome analysis is based on a combination of short and long reads generated by Illumina and Oxford Nanopore sequencing of bacterial isolates. The workflow includes a comprehensive set of QC tools (Cutadapt [5], Porechop [6], the FASTX-Toolkit [7] and NanoFilt [8]). Reports on read quality are created by FastQC [9] and NanoQC [8]. Once processed, the reads are used for a hybrid genome assembly by Unicycler [10], and CheckM [11] is then used to check assembly completeness. After assembly, genes are predicted and annotated using the well-established Prokka [12] approach. In addition, Abricate [13] screens the genome assembly for known ARGs using various ARG-specific databases such as ARG-ANNOT [14] and CARD [15].
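
The ordering of the steps described above can be pictured as a dependency graph, which is the kind of structure a workflow manager resolves before running anything. The following is a minimal Python sketch, not the authors' Snakemake workflow: the tool names come from the abstract, but the exact wiring between them (for example, which QC tool feeds which report) and the node names `illumina_reads` and `nanopore_reads` are assumptions made for illustration.

```python
from graphlib import TopologicalSorter

# Each step maps to the set of steps it depends on.
# Tool names are taken from the abstract; the wiring is an ASSUMPTION.
DEPENDENCIES = {
    "cutadapt": {"illumina_reads"},       # short-read adapter trimming
    "fastx_toolkit": {"cutadapt"},        # short-read filtering (assumed order)
    "fastqc": {"fastx_toolkit"},          # short-read quality report
    "porechop": {"nanopore_reads"},       # long-read adapter trimming
    "nanofilt": {"porechop"},             # long-read filtering
    "nanoqc": {"nanofilt"},               # long-read quality report
    "unicycler": {"fastx_toolkit", "nanofilt"},  # hybrid assembly
    "checkm": {"unicycler"},              # assembly completeness check
    "prokka": {"unicycler"},              # gene prediction and annotation
    "abricate": {"unicycler"},            # screening for known ARGs
}


def execution_order(dependencies):
    """Return one valid order in which every step runs after its inputs."""
    return list(TopologicalSorter(dependencies).static_order())


if __name__ == "__main__":
    for step in execution_order(DEPENDENCIES):
        print(step)
```

Steps without mutual dependencies, such as the short-read and long-read QC branches, may appear in either relative order; only the constraint that a step follows its inputs is guaranteed.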

This workflow is embedded in a continuous integration and continuous delivery (CI/CD) environment. In this type of production environment, software development and maintenance are supported by automation, testing and monitoring, which yields reliable software and reproducible, credible data.

To additionally identify newly acquired ARGs, we are currently working on a machine learning classifier that reliably detects recently transferred genes. We use a parametric approach, with features such as GC content and GC skew, because this is more likely to indicate recent transfers than a phylogenetic approach [16]. The Escherichia coli K12 dataset from the HGT-DB [17] was used to train a Histogram-based Gradient Boosting Classifier [18]. As expected, the class of horizontally transferred genes is considerably smaller, resulting in an imbalanced data set. To compensate for this, higher weights were assigned to the HGT-positive samples. The hyperparameters of the model were optimized with GridSearchCV [18].
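
To make the parametric features and the class weighting concrete, here is a minimal Python sketch. It is not the authors' implementation: the function names, the extra "deviation from the genome-wide GC content" feature, and the specific weighting formula (each sample weighted by n_samples / (n_classes * class_count)) are assumptions chosen for illustration. The abstract only states that GC content and GC skew are among the features and that HGT-positive samples received higher weights.

```python
from collections import Counter


def gc_content(seq):
    """Fraction of G and C among the A/C/G/T bases of a sequence."""
    seq = seq.upper()
    g, c = seq.count("G"), seq.count("C")
    total = g + c + seq.count("A") + seq.count("T")
    return (g + c) / total if total else 0.0


def gc_skew(seq):
    """GC skew (G - C) / (G + C); 0.0 if the sequence has no G or C."""
    seq = seq.upper()
    g, c = seq.count("G"), seq.count("C")
    return (g - c) / (g + c) if (g + c) else 0.0


def gene_features(gene_seq, genome_gc):
    """Feature vector for one gene.

    The second entry (deviation from the genome-wide GC content) is a
    hypothetical extra feature, not taken from the abstract.
    """
    gc = gc_content(gene_seq)
    return [gc, gc - genome_gc, gc_skew(gene_seq)]


def balanced_sample_weights(labels):
    """Give the rarer class larger weights (assumed weighting scheme)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return [n_samples / (n_classes * counts[y]) for y in labels]


if __name__ == "__main__":
    print(gene_features("ATGGGCGCTA", genome_gc=0.5))
    print(balanced_sample_weights([0, 0, 0, 1]))
```

Weights of this kind could then be passed as `sample_weight` when fitting a scikit-learn estimator, so that misclassified HGT-positive genes contribute more to the loss than the far more numerous negatives.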

Results and discussion: We have already assembled and analyzed 12 complete genomes with the workflow described above. The best model so far achieved an average ROC AUC of 0.990 in cross-validation on the training set and a ROC AUC of 0.959 on the test set. These values suggest reliable predictions without substantial overfitting of the training data. When this model is tested on other datasets from HGT-DB, the results vary widely. The model produces credible results for a different Escherichia coli strain (ROC AUC 0.929) and a Bacillus subtilis strain (ROC AUC 0.748). However, it currently fails to predict the transferred genes of a Haemophilus influenzae strain (ROC AUC 0.419, below the 0.5 expected from random guessing).
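
For readers less familiar with the metric, ROC AUC can be read as the probability that a randomly chosen positive (here, a horizontally transferred gene) receives a higher score than a randomly chosen negative. The short Python sketch below computes it directly from that definition. It is an illustration only: the function name and the toy labels and scores are made up, and the reported values were presumably computed with scikit-learn [18] rather than with code like this.

```python
def roc_auc(labels, scores):
    """ROC AUC via pairwise comparison of positive and negative scores.

    Counts the fraction of (positive, negative) pairs in which the
    positive sample scores higher; ties count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("ROC AUC needs at least one sample of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Toy data (made up): 1 = transferred gene, 0 = native gene.
    labels = [0, 0, 1, 1]
    scores = [0.1, 0.4, 0.35, 0.8]
    print(roc_auc(labels, scores))
```

Under this reading, a value of 1.0 means every positive outranks every negative, 0.5 corresponds to random ranking, and a value below 0.5 means the model ranks negatives above positives more often than not.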

Outlook: The next steps include improving and generalizing the machine learning classifier. For this purpose, we want to broaden the training set with different organisms from the HGT-DB database. The resulting model will then be used to classify all genes of the assemblies generated with the workflow. Combining all results, we will create a detailed resistogram containing known ARGs and recently acquired new ARGs. Furthermore, we will extend the workflow to metagenomic sequencing data so that it can work directly with a patient sample and skip the time-consuming cultivation.

In conclusion, this workflow will rapidly provide physicians with detailed information to select an appropriate treatment. This should avoid the unnecessary use of broad-spectrum antibiotics and thus help fight the spread of antibiotic resistance.


References

1. Martínez JL, Baquero F. Emergence and spread of antibiotic resistance: setting a parameter space. Ups J Med Sci. 2014 May;119(2):68-77. DOI: 10.3109/03009734.2014.901444
2. World Health Organization, eds. Global action plan on antimicrobial resistance [Internet]. 2015. Available from: https://www.who.int/publications/i/item/9789241509763
3. European Centre for Disease Prevention and Control. Antimicrobial resistance in the EU/EEA (EARS-Net) - Annual epidemiological report for 2019 [Internet]. 2020. Available from: https://www.ecdc.europa.eu/en/publications-data/surveillance-antimicrobial-resistance-europe-2019
4. Köster J, Rahmann S. Snakemake – a scalable bioinformatics workflow engine. Bioinformatics. 2012 Oct 1;28(19):2520-2. DOI: 10.1093/bioinformatics/bts480
5. Martin M. Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet.journal. 2011;17(1):10. DOI: 10.14806/ej.17.1.200
6. Wick RR. Porechop [Internet]. Available from: https://github.com/rrwick/Porechop
7. Hannon GJ. FASTX-Toolkit. Available from: http://hannonlab.cshl.edu/fastx_toolkit
8. De Coster W, D'Hert S, Schultz DT, Cruts M, Van Broeckhoven C. NanoPack: Visualizing and processing long-read sequencing data. Bioinformatics. 2018 Aug 1;34(15):2666-9. DOI: 10.1093/bioinformatics/bty149
9. Andrews S. FastQC: A quality control tool for high throughput sequence data [Internet]. Available from: http://www.bioinformatics.babraham.ac.uk/projects/fastqc
10. Wick RR, Judd LM, Gorrie CL, Holt KE. Unicycler: Resolving bacterial genome assemblies from short and long sequencing reads. PLoS Comput Biol. 2017 Jun 8;13(6):e1005595. DOI: 10.1371/journal.pcbi.1005595
11. Parks DH, Imelfort M, Skennerton CT, Hugenholtz P, Tyson GW. CheckM: Assessing the quality of microbial genomes recovered from isolates, single cells, and metagenomes. Genome Res. 2015 Jul;25(7):1043-55. DOI: 10.1101/gr.186072.114
12. Seemann T. Prokka: Rapid prokaryotic genome annotation. Bioinformatics. 2014 Jul 15;30(14):2068-9. DOI: 10.1093/bioinformatics/btu153
13. Seemann T. Abricate [Internet]. Available from: https://github.com/tseemann/abricate
14. Gupta SK, Padmanabhan BR, Diene SM, Lopez-Rojas R, Kempf M, Landraud L, Rolain JM. ARG-ANNOT, a new bioinformatic tool to discover antibiotic resistance genes in bacterial genomes. Antimicrob Agents Chemother. 2014;58(1):212-20. DOI: 10.1128/AAC.01310-13
15. Jia B, Raphenya AR, Alcock B, Waglechner N, Guo P, Tsang KK, Lago BA, Dave BM, Pereira S, Sharma AN, Doshi S, Courtot M, Lo R, Williams LE, Frye JG, Elsayegh T, Sardar D, Westman EL, Pawlowski AC, Johnson TA, Brinkman FS, Wright GD, McArthur AG. CARD 2017: expansion and model-centric curation of the comprehensive antibiotic resistance database. Nucleic Acids Res. 2017 Jan 4;45(D1):D566-D573. DOI: 10.1093/nar/gkw1004
16. Garcia-Vallvé S, Romeu A, Palau J. Horizontal gene transfer in bacterial and archaeal complete genomes. Genome Res. 2000 Nov;10(11):1719-25. DOI: 10.1101/gr.130000
17. Garcia-Vallve S, Guzman E, Montero MA, Romeu A. HGT-DB: A database of putative horizontally transferred genes in prokaryotic complete genomes. Nucleic Acids Res. 2003 Jan 1;31(1):187-9. DOI: 10.1093/nar/gkg004
18. Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V, Vanderplas J, Passos A, Cournapeau D, Brucher M, Perrot M, Duchesnay E. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research. 2011;12:2825-30.