gms | German Medical Science

Towards an automated detection of minority variants and mutations in SARS-CoV-2 patient samples

Meeting Abstract

  • Katharina Block - Institute for Artificial Intelligence in Medicine, University Hospital Essen, University of Duisburg-Essen
  • Alexander Thomas - Institute for Artificial Intelligence in Medicine, University Hospital Essen, University of Duisburg-Essen
  • Vu Thuy Khanh Le-Trilling - Institute for Virology, University Hospital Essen, University of Duisburg-Essen
  • Olympia Anastasiou - Institute for Virology, University Hospital Essen, University of Duisburg-Essen
  • Mirko Trilling - Institute for Virology, University Hospital Essen, University of Duisburg-Essen
  • Ivana Kraiselburd - Institute for Artificial Intelligence in Medicine, University Hospital Essen, University of Duisburg-Essen
  • Ulf Dittmer - Institute for Virology, University Hospital Essen, University of Duisburg-Essen
  • Folker Meyer - Institute for Artificial Intelligence in Medicine, University Hospital Essen, University of Duisburg-Essen

SMITH Science Day 2022. Aachen, 23.-23.11.2022. Düsseldorf: German Medical Science GMS Publishing House; 2023. DocP24

doi: 10.3205/22smith35, urn:nbn:de:0183-22smith357

Published: 31 January 2023

© 2023 Block et al.
This is an Open Access article distributed under the terms of the Creative Commons Attribution 4.0 License. For license information see http://creativecommons.org/licenses/by/4.0/.


Text

Introduction: The severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) pandemic continues to be a health threat. Constantly emerging mutations in the SARS-CoV-2 genome enable effective evasion of available treatments. It is assumed that some infected patients carry multiple versions of the virus: a so-called majority variant (or consensus sequence) and one or more minority variants that carry additional mutations. These minority variants potentially lead to greater immune evasion capacities and higher rates of transmissibility [1].

Although the effect of minority variants on treatment is known and the Robert Koch Institute has stated the need to identify these variants automatically and accurately in SARS-CoV-2 sequence data from patient samples, a high-fidelity analysis for these variants does not currently exist [2]. We developed a first algorithm towards a high-fidelity analysis method to detect SARS-CoV-2 minority variants in sequencing data from patient samples. The algorithm is designed to identify all SARS-CoV-2 minority variants and mutations. It reports the mutation frequencies and positions as well as the current classification and pangolin lineage of the respective sample. The algorithm extends the UnCoVar workflow by Thomas and Battenfeld et al. (unpublished) [3], a transparent and robust workflow for analyzing SARS-CoV-2 sequencing data.

In contrast to other bioinformatic tools used to identify new SARS-CoV-2 variants, which rely on consensus sequences and therefore consider only majority events, the algorithm considers all mutations regardless of their frequency. This approach detects mutations at a higher resolution and provides a research platform to analyze single nucleotide variants (SNVs) and possible mutation patterns within SARS-CoV-2 sequencing data sets.

The algorithm will provide a platform for high-fidelity analyses of all detectable mutations and variants, which is crucial for the detection, monitoring and description of changes within the SARS-CoV-2 genome. This is essential because treatment is most effective when adjusted to patient-specific mutation patterns, including cases of infection with one or more variants.

Methods: The analysis conducted by the developed algorithm is based on the UnCoVar workflow by Thomas and Battenfeld et al. (unpublished), which uses Snakemake [3]. The UnCoVar workflow is embedded in a continuous integration and continuous delivery (CI/CD) environment. This type of production environment applies ongoing automation, testing and monitoring to improve the development process and to produce reliable, reproducible data.

Sequencing data (e.g., produced on the Illumina platform) are provided to the UnCoVar workflow in paired-end FASTQ format. Reads shorter than 30 base pairs and reads with a Phred quality below 20 are excluded. Pre-processing and quality control are conducted using fastp [4], bwa [5] and BAMClipper [6]. Artificially introduced sequencing adapters are trimmed, and non-viral sequencing reads are removed based on an alignment against a combined reference of the human genome (GRCh38.p13) and the SARS-CoV-2 reference genome (NC_045512.2). Afterwards, the raw reads are assembled into contigs using MEGAHIT [7], and the resulting contigs are scaffolded against the SARS-CoV-2 reference genome using RaGOO [8]. Freebayes [9] and VEP [10] are used to detect single and multiple nucleotide variants (SNVs, MNVs) as well as small insertions and deletions (indels) within the scaffolds, and structural variants are identified using Delly [11]. These variants are evaluated based on a unified statistical model using Varlociraptor [12]. Afterwards, pangolin [13] is used to apply the dynamic nomenclature for SARS-CoV-2 lineages suggested by Rambaut et al. (2020) [14].
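The read-level exclusion criteria can be illustrated with a short sketch. In the workflow this step is handled by the cited pre-processing tools; the pure-Python version below only shows the idea. It assumes that "Phred quality below 20" refers to the mean quality of a read and that qualities use the common ASCII offset of 33. The names Read, mean_phred and filter_reads are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, List

MIN_LENGTH = 30    # from the text: reads shorter than 30 bp are excluded
MIN_PHRED = 20     # from the text: reads with a Phred quality below 20 are excluded
PHRED_OFFSET = 33  # assumption: Phred+33 ASCII encoding of the FASTQ quality line


@dataclass(frozen=True)
class Read:
    """One FASTQ record (hypothetical helper type)."""
    name: str
    sequence: str
    quality: str  # one ASCII character per base


def mean_phred(quality: str, offset: int = PHRED_OFFSET) -> float:
    """Decode the quality string and return the mean Phred score."""
    if not quality:
        return 0.0
    return sum(ord(char) - offset for char in quality) / len(quality)


def passes_filter(read: Read) -> bool:
    """Keep a read only if it is long enough and of sufficient mean quality."""
    long_enough = len(read.sequence) >= MIN_LENGTH
    good_quality = mean_phred(read.quality) >= MIN_PHRED
    return long_enough and good_quality


def filter_reads(reads: Iterable[Read]) -> List[Read]:
    """Return the reads that survive both thresholds, in input order."""
    return [read for read in reads if passes_filter(read)]


# Invented example reads: 'I' encodes Phred 40, '+' encodes Phred 10.
example_reads = [
    Read("good", "A" * 40, "I" * 40),
    Read("too_short", "A" * 20, "I" * 20),
    Read("low_quality", "A" * 40, "+" * 40),
]
kept_names = [read.name for read in filter_reads(example_reads)]
```

Only the read named "good" survives in this example; the other two fail the length and the quality threshold, respectively.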

The output of the UnCoVar workflow provides the basis for the newly developed algorithm, which identifies and extracts minority mutations in SARS-CoV-2 qPCR-positive patient samples from University Hospital Essen, Germany, covering the period from February 2021 to October 2022. The Python-based algorithm extracts the read depth, frequency and position of the mutations as well as the pangolin lineage and its classification for the reconstructed genomes from the UnCoVar output and converts the data into a form that is accessible for further analysis of minority variants. Afterwards, the data can be structured and filtered to investigate specific timeframes, samples, mutations or similar use cases.
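The extraction and filtering step could look roughly like the following sketch. It is not the authors' implementation. The record layout, the function names and the example values are invented for illustration, and the cut-off for calling a mutation a minority variant (supported by fewer than half of the reads) is an assumption, since the abstract does not state one.

```python
from dataclasses import dataclass
from datetime import date
from typing import Iterable, List, Optional

# Assumption: a mutation is a minority variant when fewer than half of the
# reads covering its position support it. The abstract gives no threshold.
MINORITY_MAX_FREQUENCY = 0.5


@dataclass(frozen=True)
class MutationRecord:
    """One mutation call with the fields named in the text (hypothetical type)."""
    sample: str
    sampled_on: date
    lineage: str      # pangolin lineage of the reconstructed genome
    mutation: str     # e.g. "S:D820A"
    position: int     # genome position of the mutation
    read_depth: int
    frequency: float  # fraction of reads supporting the mutation


def is_minority(record: MutationRecord,
                max_frequency: float = MINORITY_MAX_FREQUENCY) -> bool:
    """True if the mutation is present but below the majority threshold."""
    return 0.0 < record.frequency < max_frequency


def select_minority(records: Iterable[MutationRecord],
                    start: Optional[date] = None,
                    end: Optional[date] = None,
                    mutation: Optional[str] = None) -> List[MutationRecord]:
    """Filter minority mutations by timeframe and, optionally, by mutation name."""
    selected = []
    for record in records:
        if not is_minority(record):
            continue
        if start is not None and record.sampled_on < start:
            continue
        if end is not None and record.sampled_on > end:
            continue
        if mutation is not None and record.mutation != mutation:
            continue
        selected.append(record)
    return selected


# Invented example records, not taken from the study data.
records = [
    MutationRecord("sample-1", date(2021, 3, 1), "B.1.1.7", "S:N501Y", 23063, 1200, 0.98),
    MutationRecord("sample-1", date(2021, 3, 1), "B.1.1.7", "S:D820A", 24021, 900, 0.07),
    MutationRecord("sample-2", date(2022, 9, 15), "BA.5", "S:D820A", 24021, 450, 0.12),
]
minority_2022 = select_minority(records, start=date(2022, 1, 1), mutation="S:D820A")
```

Here minority_2022 contains only the record from sample-2. A tool that works on the consensus sequence alone would report only the 0.98-frequency call in sample-1.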

Results: We developed and implemented an algorithm that filters different minority mutations and variants in a fast, reliable and high-throughput manner. In 2,188 samples comprising 357,847,205,526 base pairs, a total of 744,657 mutations were detected, including 50,050 instances of minority variants. First applications of the algorithm to study low-frequency mutation patterns in patient sample data identified a rare mutation (S:D820A), which evades a group of recently identified fusion peptide-binding broadly neutralizing antibodies (FP-bnAb) [15], [16].
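To put the reported counts in proportion, the following short calculation derives a few ratios from them. It assumes that the base-pair figure is the total across all samples and that the minority instances are a subset of the detected mutations; the derived ratios are not stated in the abstract itself.

```python
# Counts as reported in the Results paragraph.
SAMPLES = 2_188
SEQUENCED_BASES = 357_847_205_526
MUTATIONS = 744_657
MINORITY_INSTANCES = 50_050

# Derived ratios (assumption: minority instances are a subset of all mutations).
minority_share_percent = 100 * MINORITY_INSTANCES / MUTATIONS
mutations_per_sample = MUTATIONS / SAMPLES
minority_per_sample = MINORITY_INSTANCES / SAMPLES
megabases_per_sample = SEQUENCED_BASES / SAMPLES / 1_000_000

print(f"Minority share of all mutations: {minority_share_percent:.2f} %")
print(f"Mutations per sample (mean):     {mutations_per_sample:.1f}")
print(f"Minority instances per sample:   {minority_per_sample:.1f}")
print(f"Sequenced megabases per sample:  {megabases_per_sample:.1f}")
```

Under these assumptions, roughly 6.7 % of all detected mutations are minority instances, which corresponds to about 23 minority instances among about 340 mutations per sample on average.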

Outlook: The algorithm, or the resulting tool, for detecting and studying SARS-CoV-2 minority events needs to be benchmarked further and integrated into a continuous integration and continuous delivery (CI/CD) environment. This type of production environment applies ongoing automation, testing and monitoring to improve the development process, resulting in reliable, functioning software that produces reliable, reproducible data.

The implemented algorithm will provide a platform to search, accurately and at high throughput, for newly emerging SARS-CoV-2 variants that can evade currently applied treatment methods. This includes further investigation of recombination events and thereby enables more thorough monitoring of emerging SARS-CoV-2 variants.


References

1. Callaway E. What Omicron's BA.4 and BA.5 variants mean for the pandemic. Nature. 2022 Jun;606(7916):848-849. DOI: 10.1038/d41586-022-01730-y
2. Robert Koch-Institut, Abteilung für Epidemiologie und Gesundheitsmonitoring. Wöchentlicher Lagebericht des RKI zur Coronavirus-Krankheit-2019 (COVID-19) 25.05.2022 (Wochenbericht 25.05.2022) [Internet]. Available from: https://www.rki.de/DE/Content/InfAZ/N/Neuartiges_Coronavirus/Situationsberichte/Wochenbericht/Wochenbericht_2022-05-25.pdf?__blob=publicationFile
3. Battenfeld T, Thomas A, Anastasiou O, Dittmer U, Eisner C, Kraiselburd I, Le-Trilling VTK, Magin S, Scholtysik R, Trilling M, Yilmaz P, Köster J, Meyer F. UnCoVar: A reproducible and scalable workflow for transparent and robust SARS-CoV-2 variant calling and lineage assignment (unpublished) [Internet]. Available from: https://github.com/IKIM-Essen/uncovar
4. Chen S, Zhou Y, Chen Y, Gu J. fastp: An ultra-fast all-in-one FASTQ preprocessor. Bioinformatics. 2018 Sep 1;34(17):i884-90. DOI: 10.1093/bioinformatics/bty560
5. Li H. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arXiv. 2013 Mar. DOI: 10.48550/arXiv.1303.3997
6. Au CH, Ho DN, Kwong A, Chan TL, Ma ESK. BAMClipper: Removing primers from alignments to minimize false-negative mutations in amplicon next-generation sequencing. Sci Rep. 2017 May 8;7(1):1567. DOI: 10.1038/s41598-017-01703-6
7. Li D, Liu CM, Luo R, Sadakane K, Lam TW. MEGAHIT: An ultra-fast single-node solution for large and complex metagenomics assembly via succinct de Bruijn graph. Bioinformatics. 2015 May 15;31(10):1674-6. DOI: 10.1093/bioinformatics/btv033
8. Alonge M, Soyk S, Ramakrishnan S, Wang X, Goodwin S, Sedlazeck FJ, Lippman ZB, Schatz MC. RaGOO: Fast and accurate reference-guided scaffolding of draft genomes. Genome Biol. 2019 Oct 28;20(1):224. DOI: 10.1186/s13059-019-1829-6
9. Garrison E, Marth G. Haplotype-based variant detection from short-read sequencing. arXiv. 2012 Jul. DOI: 10.48550/arXiv.1207.3907
10. McLaren W, Gil L, Hunt SE, Riat HS, Ritchie GR, Thormann A, Flicek P, Cunningham F. The Ensembl Variant Effect Predictor. Genome Biol. 2016 Jun 6;17(1):122. DOI: 10.1186/s13059-016-0974-4
11. Rausch T, Zichner T, Schlattl A, Stütz AM, Benes V, Korbel JO. DELLY: Structural variant discovery by integrated paired-end and split-read analysis. Bioinformatics. 2012 Sep 15;28(18):i333-9. DOI: 10.1093/bioinformatics/bts378
12. Köster J, Dijkstra LJ, Marschall T, Schönhuth A. Varlociraptor: Enhancing sensitivity and controlling false discovery rate in somatic indel discovery. Genome Biol. 2020 Apr 28;21(1):98. DOI: 10.1186/s13059-020-01993-6
13. O'Toole Á, Scher E, Underwood A, Jackson B, Hill V, McCrone JT, Colquhoun R, Ruis C, Abu-Dahab K, Taylor B, Yeats C, du Plessis L, Maloney D, Medd N, Attwood SW, Aanensen DM, Holmes EC, Pybus OG, Rambaut A. Assignment of epidemiological lineages in an emerging pandemic using the pangolin tool. Virus Evol. 2021 Jul 30;7(2):veab064. DOI: 10.1093/ve/veab064
14. Rambaut A, Holmes EC, O'Toole Á, Hill V, McCrone JT, Ruis C, du Plessis L, Pybus OG. A dynamic nomenclature proposal for SARS-CoV-2 lineages to assist genomic epidemiology. Nat Microbiol. 2020 Nov;5(11):1403-7. DOI: 10.1038/s41564-020-0770-5
15. Dacon C, Tucker C, Peng L, Lee CD, Lin TH, Yuan M, Cong Y, Wang L, Purser L, Williams JK, Pyo CW, Kosik I, Hu Z, Zhao M, Mohan D, Cooper AJR, Peterson M, Skinner J, Dixit S, Kollins E, Huzella L, Perry D, Byrum R, Lembirik S, Drawbaugh D, Eaton B, Zhang Y, Yang ES, Chen M, Leung K, Weinberg RS, Pegu A, Geraghty DE, Davidson E, Douagi I, Moir S, Yewdell JW, Schmaljohn C, Crompton PD, Holbrook MR, Nemazee D, Mascola JR, Wilson IA, Tan J. Broadly neutralizing antibodies target the coronavirus fusion peptide. Science. 2022 Aug 12;377(6607):728-35. DOI: 10.1126/science.abq3773
16. Low JS, Jerak J, Tortorici MA, McCallum M, Pinto D, Cassotta A, Foglierini M, Mele F, Abdelnabi R, Weynand B, Noack J, Montiel-Ruiz M, Bianchi S, Benigni F, Sprugasci N, Joshi A, Bowen JE, Stewart C, Rexhepaj M, Walls AC, Jarrossay D, Morone D, Paparoditis P, Garzoni C, Ferrari P, Ceschi A, Neyts J, Purcell LA, Snell G, Corti D, Lanzavecchia A, Veesler D, Sallusto F. ACE2-binding exposes the SARS-CoV-2 fusion peptide to broadly neutralizing coronavirus antibodies. Science. 2022 Aug 12;377(6607):735-42. DOI: 10.1126/science.abq2679