gms | German Medical Science

65th Annual Meeting of the German Association for Medical Informatics, Biometry and Epidemiology (GMDS), Meeting of the Central European Network (CEN: German Region, Austro-Swiss Region and Polish Region) of the International Biometric Society (IBS)

06.09. - 09.09.2020, Berlin (online conference)

Using artificial neural networks for taxonomic classification of viral sequences

Meeting Abstract

  • Moritz Kohls - University of Veterinary Medicine Hannover, Hannover, Germany
  • Magdalena Kircher - University of Veterinary Medicine Hannover, Hannover, Germany
  • Jessica Krepel - University of Veterinary Medicine Hannover, Hannover, Germany
  • Pamela Liebig - University of Veterinary Medicine Hannover, Hannover, Germany
  • Klaus Jung - University of Veterinary Medicine Hannover, Hannover, Germany

Deutsche Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie. 65th Annual Meeting of the German Association for Medical Informatics, Biometry and Epidemiology (GMDS), Meeting of the Central European Network (CEN: German Region, Austro-Swiss Region and Polish Region) of the International Biometric Society (IBS). Berlin, 06.-09.09.2020. Düsseldorf: German Medical Science GMS Publishing House; 2021. DocAbstr. 141

doi: 10.3205/20gmds369, urn:nbn:de:0183-20gmds3695

Published: February 26, 2021

© 2021 Kohls et al.
This is an Open Access article distributed under the terms of the Creative Commons Attribution 4.0 License. See license information at http://creativecommons.org/licenses/by/4.0/.


Text

Background: Next-generation sequencing (NGS) of biological samples from infected hosts is frequently used to identify specific viruses that might cause a disease. For that purpose, sequencing reads are usually mapped against a database of known viral reference genomes ([1]; e.g. FASTA file available from ftp://ftp.ncbi.nlm.nih.gov/refseq/release/viral/). While this procedure makes inferences at the level of individual viruses, we instead want to use NGS read sequences to classify them at higher taxonomic levels using machine learning models such as artificial neural networks. Taxonomic classification can then provide a basis for studying the overall viral composition of a sample, in contrast to inferences regarding specific viruses. Furthermore, this new approach enables us to classify recently discovered viruses for which no reference genome is available yet.

Methods: Taxonomy and genome data from NCBI are used to sample viruses and to generate artificial reads for the taxonomic rank of order. As a machine learning tool, artificial neural networks implemented in the R package keras [2] are used to classify individual viral read sequences into different taxa. Model building includes different input features derived from the read sequences as possible predictors. Input features are, for example, k-mer frequencies (different software benchmarked in [3]), k-mer distances [4] or known sequence motifs, and are chosen by a feature selection method. The training, validation and test data consist of the selected input features, which are computed from the artificial read sequences. To summarise classification results, a generalised confusion matrix is proposed. Two new formulas for statistically estimating taxa counts are introduced to study the overall viral composition.
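To illustrate the kind of input features mentioned above, the following Python sketch computes relative k-mer frequencies and inter-nucleotide distances for a single read. It is a minimal illustration, not the authors' pipeline (which uses R and keras): the function names, the default k = 3 and the handling of ambiguous bases are assumptions made here, and the distance function follows only the basic idea of [4] (distances between consecutive occurrences of the same nucleotide).

```python
from collections import Counter
from itertools import product


def kmer_frequencies(read, k=3, alphabet="ACGT"):
    """Relative frequencies of all len(alphabet)**k k-mers, in a fixed order.

    Hypothetical helper: windows containing characters outside the
    alphabet (e.g. N) are ignored, which is an assumption of this sketch.
    """
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    read = read.upper()
    # Count every overlapping window of length k.
    counts = Counter(read[i:i + k] for i in range(len(read) - k + 1))
    total = sum(counts[km] for km in kmers)
    if total == 0:
        return [0.0] * len(kmers)
    return [counts[km] / total for km in kmers]


def inter_nucleotide_distances(read, nucleotide):
    """Distances between consecutive occurrences of one nucleotide."""
    positions = [i for i, base in enumerate(read.upper()) if base == nucleotide]
    return [b - a for a, b in zip(positions, positions[1:])]


if __name__ == "__main__":
    example_read = "ACAAGATTACA"  # made-up read, for illustration only
    features = kmer_frequencies(example_read, k=2)
    print(len(features))  # one feature per possible 2-mer
    print(inter_nucleotide_distances(example_read, "A"))
```

A fixed-length vector like this could serve as one row of the training, validation or test data; which features are actually retained would depend on the feature selection step described in the text.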

Results: After the model has been built and trained, its performance history is evaluated graphically by plotting the accuracy and loss functions on the validation data. The prediction accuracy of the derived methods is evaluated on test data, and classification results are summarised in a generalised confusion matrix that contains the counts of all possible misclassification combinations. From this confusion matrix, diagnostic measures such as true and false positive rates as well as positive and negative predictive values are calculated. The prediction accuracy of the artificial neural network is considerably higher than that of random classification, and the posterior estimation of taxa counts is relatively precise.
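The diagnostic measures named above can be derived from a multi-class confusion matrix by treating each taxon in turn as the "positive" class (one-vs-rest). The sketch below assumes a standard square confusion matrix with true taxa in rows and predicted taxa in columns; the abstract does not specify how the generalised confusion matrix is defined, so this layout, the function name and the example counts are illustrative assumptions, not the authors' results.

```python
def one_vs_rest_measures(confusion, labels):
    """Per-taxon TPR, FPR, PPV and NPV from a square confusion matrix.

    confusion[i][j] = number of reads of true taxon i predicted as taxon j
    (assumed layout for this sketch).
    """
    total = sum(sum(row) for row in confusion)

    def ratio(numerator, denominator):
        # Undefined ratios (empty denominator) are reported as NaN.
        return numerator / denominator if denominator else float("nan")

    measures = {}
    for idx, label in enumerate(labels):
        tp = confusion[idx][idx]
        fn = sum(confusion[idx]) - tp                  # rest of the row
        fp = sum(row[idx] for row in confusion) - tp   # rest of the column
        tn = total - tp - fn - fp
        measures[label] = {
            "TPR": ratio(tp, tp + fn),
            "FPR": ratio(fp, fp + tn),
            "PPV": ratio(tp, tp + fp),
            "NPV": ratio(tn, tn + fn),
        }
    return measures


def accuracy(confusion):
    """Share of reads on the diagonal, i.e. correctly classified."""
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    return correct / total if total else float("nan")


if __name__ == "__main__":
    # Made-up counts for three placeholder taxa, for illustration only.
    labels = ["order_A", "order_B", "order_C"]
    confusion = [
        [8, 1, 1],
        [2, 6, 2],
        [0, 2, 8],
    ]
    for label, values in one_vs_rest_measures(confusion, labels).items():
        print(label, values)
    print("accuracy:", accuracy(confusion))
```

For a balanced problem with m taxa, uniformly random classification would be expected to reach an accuracy of about 1/m, which is one possible baseline for the comparison mentioned in the text.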

Conclusion: Simulations and evaluations of the new approach show that deep learning methods are helpful for classifying mapped viral read sequences into taxa and thus for supporting the findings of the mapping approach. As a benefit, our new approach is not limited to already known viruses. In addition, statistical estimates of taxa counts provide insight into the taxa abundance of viral species.

The authors declare that they have no competing interests.

The authors declare that an ethics committee vote is not required.


References

1. Pruitt KD, Tatusova T, Maglott DR. NCBI reference sequences (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res. 2007;35(Database issue):D61-D65. DOI: 10.1093/nar/gkl842
2. Allaire JJ, Chollet F. keras: R Interface to 'Keras'. R package version 2.2.5.0. 2019. Available from: https://CRAN.R-project.org/package=keras
3. Sczyrba A, Hofmann P, Belmann P, et al. Critical Assessment of Metagenome Interpretation – a benchmark of metagenomics software. Nat Methods. 2017;14(11):1063-1071. DOI: 10.1038/nmeth.4458
4. Afreixo V, Bastos CA, Pinho AJ, Garcia SP, Ferreira PJ. Genome analysis with inter-nucleotide distances. Bioinformatics. 2009;25(23):3064-3070. DOI: 10.1093/bioinformatics/btp546