gms | German Medical Science

68. Jahrestagung der Deutschen Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie e. V. (GMDS)

17.09. - 21.09.23, Heilbronn

The Coordinate Converting Service (CCS): A Toolbox for Sequence Data Transformation

Meeting Abstract

  • Kevin Kornrumpf - Dept. of Medical Bioinformatics, University Medical Center Göttingen, Göttingen, Germany
  • Tim Tucholski - Dept. of Medical Bioinformatics, University Medical Center Göttingen, Göttingen, Germany
  • Klara Drofenik - Dept. of Medical Bioinformatics, University Medical Center Göttingen, Göttingen, Germany
  • Tim Beißbarth - Dept. of Medical Bioinformatics, University Medical Center Göttingen, Göttingen, Germany; Comprehensive Cancer Center Lower Saxony (CCC-N), Hannover, Germany; Campus Institute Data Science (CIDAS), Göttingen, Germany
  • Jürgen Dönitz - Dept. of Medical Bioinformatics, University Medical Center Göttingen, Göttingen, Germany; Comprehensive Cancer Center Lower Saxony (CCC-N), Hannover, Germany; Campus Institute Data Science (CIDAS), Göttingen, Germany

Deutsche Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie. 68. Jahrestagung der Deutschen Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie e. V. (GMDS). Heilbronn, 17.-21.09.2023. Düsseldorf: German Medical Science GMS Publishing House; 2023. DocAbstr. 330

doi: 10.3205/23gmds149, urn:nbn:de:0183-23gmds1497

Published: September 15, 2023

© 2023 Kornrumpf et al.
This is an Open Access article distributed under the terms of the Creative Commons Attribution 4.0 License. See license information at http://creativecommons.org/licenses/by/4.0/.


Introduction: Sequence coordinates are represented in various formats and reference different genomes, which makes it difficult to process these data with tools that expect a specific format. Addressing this problem requires methods that standardize sequence data and convert them into the required format. Manual conversion, however, is time-consuming and requires knowledge of the biological background and of complex rules, so computer-assisted support is needed. In this context, the conversion of sequence coordinates is an essential tool for modern genetic research.

Methods: Here we introduce the Coordinate Converting Service (CCS), a tool that provides information on genomic, protein, and transcript data. The Universal Transcript Archive database, the hgvs package [1], [2], and the pyliftover package [3] give us access to sequence data and to basic functions for obtaining information about transcripts and genes or for switching between reference genomes. On this basis, we implemented more complex functionality, including sequence manipulation, sequence conversion, and verification of complex genetic events, which helps to obtain more information or to unify heterogeneous data.
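To illustrate what switching between reference genomes involves, the following minimal sketch maps a position through a list of ungapped alignment blocks, which is the general idea behind chain-based liftover. It is not the CCS implementation and does not use the pyliftover API; the `Block` type, the `lift_position` function, and the block coordinates are hypothetical and chosen only for illustration.

```python
from bisect import bisect_right
from typing import List, NamedTuple, Optional


class Block(NamedTuple):
    """One ungapped alignment block between two assemblies (toy data)."""
    src_start: int  # 0-based start of the block in the source assembly
    tgt_start: int  # 0-based start of the same block in the target assembly
    size: int       # number of aligned bases in the block


def lift_position(blocks: List[Block], pos: int) -> Optional[int]:
    """Map a 0-based source position to the target assembly.

    `blocks` must be sorted by src_start and must not overlap.
    Returns None if the position falls outside every block.
    """
    starts = [b.src_start for b in blocks]
    i = bisect_right(starts, pos) - 1  # last block starting at or before pos
    if i < 0:
        return None
    block = blocks[i]
    offset = pos - block.src_start
    if offset >= block.size:
        return None  # position lies in a gap between blocks
    return block.tgt_start + offset


# Hypothetical blocks; not taken from any real chain file.
TOY_BLOCKS = [
    Block(src_start=100, tgt_start=150, size=50),
    Block(src_start=200, tgt_start=260, size=30),
]

print(lift_position(TOY_BLOCKS, 120))
print(lift_position(TOY_BLOCKS, 155))
```

Positions that fall into a gap between blocks have no counterpart in the target assembly, which is why a liftover can legitimately return no result for some coordinates.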

Results: The CCS offers a choice of different conversions. Overall, nine methods are available. Examples are: GenomicToGene, which translates single nucleotide polymorphisms (SNPs) into the corresponding amino acid exchange; GeneToGenomic and TranscriptToGenome, which translate amino acid exchanges into possible genomic exchanges; GetProteinSequences, which manipulates protein sequences with given SNPs; Liftover, which maps genomic coordinates between different reference genomes; and FusionIsInFrame, which determines whether a genomic fusion event results in a frameshift for the genes involved. The CCS can be downloaded and installed locally or queried via an API. In addition, a front-end application allows sequence data to be converted easily with custom queries.
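The abstract does not describe how FusionIsInFrame works internally. As a rough illustration of the underlying question, the sketch below applies a simplified reading-frame rule: a fusion keeps the 3' partner in frame when the number of coding bases retained from the 5' partner and the coding-sequence offset at which the 3' partner resumes leave the same remainder modulo 3. The function name and its parameters are hypothetical, and the sketch assumes that both breakpoints lie within coding sequence and are already expressed as coding-sequence offsets.

```python
def fusion_is_in_frame(cds_bases_5p: int, cds_offset_3p: int) -> bool:
    """Simplified frame check for a gene fusion (illustrative only).

    cds_bases_5p:  number of coding bases retained from the 5' partner.
    cds_offset_3p: 0-based coding-sequence offset of the first retained
                   base of the 3' partner.
    """
    if cds_bases_5p < 0 or cds_offset_3p < 0:
        raise ValueError("coding-sequence offsets must be non-negative")
    # The 3' partner keeps its original frame only if it resumes at the
    # same codon phase at which the 5' partner stops.
    return cds_bases_5p % 3 == cds_offset_3p % 3


# Toy values, not derived from real genes.
print(fusion_is_in_frame(300, 99))   # both at codon phase 0
print(fusion_is_in_frame(301, 99))   # phases 1 and 0 differ
```

A production check would additionally have to handle breakpoints in introns or untranslated regions, strand orientation, and transcript selection.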

Discussion: The CCS is a useful Swiss army knife for working with nucleic acid and protein sequence data. It finds its place in projects with heterogeneous data and enables their harmonization by mapping them to the same reference genome and to the same genomic or protein positions. The API makes it possible to automate these processes within a pipeline, while the web version supports individual manual queries.

Conclusion: The CCS is a collection of methods for converting sequence coordinate data into the required format. It can easily be extended with further functions and resources, such as information on protein domains or gene regulatory domains. The service is currently used in data projects on human mutation data in a cancer context to map genomic coordinates with SNPs to the resulting protein exchange and vice versa, so that other tools requiring specific sequence annotations can be used. The front-end version is available at https://mtb.bioinf.med.uni-goettingen.de/CCS-web.
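The mapping from a SNP to the resulting protein exchange can be sketched on a toy coding sequence as follows. This is a simplified illustration and not the CCS code: it assumes the variant is already given as a 1-based position within a plus-strand coding sequence, whereas a real service first has to map the genomic coordinate onto a transcript. The `protein_exchange` function and the example sequence are hypothetical; the codon table is the standard genetic code.

```python
from itertools import product
from typing import Tuple

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)
}


def protein_exchange(cds: str, cds_pos: int, alt: str) -> Tuple[str, int, str]:
    """Return (reference amino acid, 1-based codon number, altered amino acid).

    cds:     coding sequence, plus strand, length a multiple of 3.
    cds_pos: 1-based position of the substituted base within the CDS.
    alt:     the alternative base (one of A, C, G, T).
    """
    cds = cds.upper()
    alt = alt.upper()
    if len(cds) % 3 != 0:
        raise ValueError("CDS length must be a multiple of 3")
    if not 1 <= cds_pos <= len(cds):
        raise ValueError("position outside the coding sequence")
    if len(alt) != 1 or alt not in BASES:
        raise ValueError("alt must be a single base")
    codon_index = (cds_pos - 1) // 3      # 0-based index of the affected codon
    offset = (cds_pos - 1) % 3            # position of the base inside the codon
    ref_codon = cds[codon_index * 3: codon_index * 3 + 3]
    alt_codon = ref_codon[:offset] + alt + ref_codon[offset + 1:]
    return CODON_TABLE[ref_codon], codon_index + 1, CODON_TABLE[alt_codon]


# Toy CDS encoding Met-Gly-Glu; position 5 G>T changes codon GGT to GTT.
print(protein_exchange("ATGGGTGAA", 5, "T"))
```

In the returned tuple, `*` denotes a stop codon, and identical reference and altered amino acids indicate a synonymous change.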

The authors declare that they have no competing interests.

The authors declare that an ethics committee vote is not required.


References

1. Hart RK, et al. A Python package for parsing, validating, mapping and formatting sequence variants using HGVS nomenclature. Bioinformatics. 2015 Jan 15;31(2):268-70. DOI: 10.1093/bioinformatics/btu630
2. Wang M, et al. hgvs: A Python package for manipulating sequence variants using HGVS nomenclature: 2018 Update. Hum Mutat. 2018 Dec;39(12):1803-1813. DOI: 10.1002/humu.23615
3. Kent WJ, et al. The Human Genome Browser at UCSC. Genome Res. 2002;12:996-1006. DOI: 10.1101/gr.229102