gms | German Medical Science

Gesundheit – gemeinsam. Joint conference of the Deutsche Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie (GMDS), Deutsche Gesellschaft für Sozialmedizin und Prävention (DGSMP), Deutsche Gesellschaft für Epidemiologie (DGEpi), Deutsche Gesellschaft für Medizinische Soziologie (DGMS), and Deutsche Gesellschaft für Public Health (DGPH)

September 8-13, 2024, Dresden

Streamlining TCGA Downloads and File Management

Meeting Abstract

  • Alexandra Anke Baumann - Department of Systems Biology and Bioinformatics, University of Rostock, Rostock, Germany
  • Markus Wolfien - Institute for Medical Informatics and Biometry, Faculty of Medicine Carl Gustav Carus, TUD Dresden University of Technology, Dresden, Germany; Center for Scalable Data Analytics and Artificial Intelligence, Dresden, Germany
  • Olaf Wolkenhauer - Department of Systems Biology and Bioinformatics, University of Rostock, Rostock, Germany; Leibniz-Institute for Food Systems Biology at the Technical University of Munich, Freising, Germany; Stellenbosch Institute of Advanced Study, Wallenberg Research Centre, Stellenbosch University, Stellenbosch, South Africa

Gesundheit – gemeinsam. Kooperationstagung der Deutschen Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie (GMDS), Deutschen Gesellschaft für Sozialmedizin und Prävention (DGSMP), Deutschen Gesellschaft für Epidemiologie (DGEpi), Deutschen Gesellschaft für Medizinische Soziologie (DGMS) und der Deutschen Gesellschaft für Public Health (DGPH). Dresden, 08.-13.09.2024. Düsseldorf: German Medical Science GMS Publishing House; 2024. DocAbstr. 824

doi: 10.3205/24gmds143, urn:nbn:de:0183-24gmds1431

Published: September 6, 2024

© 2024 Baumann et al.
This is an Open Access article distributed under the terms of the Creative Commons Attribution 4.0 License. For license information, see http://creativecommons.org/licenses/by/4.0/.


Text

Introduction: The Cancer Genome Atlas (TCGA) is an invaluable resource for cancer research, providing comprehensive genomic data across various cancer types. However, first-time users often find navigating the Genomic Data Commons (GDC) portal and handling TCGA data challenging because of complex file naming conventions and the need to link disparate data types to individual case IDs. The Waldron Lab has already introduced the tool TCGAutils (https://github.com/waldronlab/TCGAutils), which facilitates TCGA data handling and provides a function to map case IDs to file names, but it does not combine all of the required steps in a straightforward way.

Methods: To address these challenges, we developed a streamlined pipeline that uses the GDC portal's cart system to manage file selection and the GDC Data Transfer Tool to download the data. We use the Sample Sheet provided by the GDC portal to replace the opaque 36-character file IDs and file names with human-readable case IDs. Furthermore, we developed customizable Python scripts and Jupyter notebooks for ID mapping and introduced a novel pipeline, built on the Snakemake workflow management system, that automates the data preprocessing tasks (https://github.com/alex-baumann-ur/TCGADownloadHelp).
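
The ID mapping step described above might look roughly like the following minimal sketch. It is not the code from the TCGADownloadHelp repository. It assumes the Sample Sheet is a tab-separated file with the columns "File ID", "File Name", and "Case ID"; the sample rows, the function names `load_case_ids` and `readable_name`, and the `<case ID>_<file name>` naming scheme are hypothetical and chosen for illustration only.

```python
import csv
import io

# Made-up Sample Sheet excerpt; the column names are an assumption about the GDC export.
SAMPLE_SHEET_TSV = (
    "File ID\tFile Name\tCase ID\n"
    "00000000-0000-0000-0000-000000000001\tabc123.rna_seq.counts.tsv\tTCGA-00-0001\n"
    "00000000-0000-0000-0000-000000000002\tdef456.snv.maf.gz\tTCGA-00-0001, TCGA-00-0001\n"
)


def load_case_ids(sample_sheet_tsv):
    """Return a dict mapping each file name to its case ID."""
    reader = csv.DictReader(io.StringIO(sample_sheet_tsv), delimiter="\t")
    case_ids = {}
    for row in reader:
        # Assumption: a field may list the case more than once; keep the first entry.
        case_ids[row["File Name"]] = row["Case ID"].split(",")[0].strip()
    return case_ids


def readable_name(file_name, case_ids):
    """Prefix a file name with its case ID; files missing from the sheet stay unchanged."""
    case_id = case_ids.get(file_name)
    return file_name if case_id is None else f"{case_id}_{file_name}"


case_ids = load_case_ids(SAMPLE_SHEET_TSV)
for name in case_ids:
    print(name, "->", readable_name(name, case_ids))
```

In practice the same mapping could drive a rename of the downloaded files on disk. Keeping the original file name as a suffix is one way to avoid collisions when a case has several files.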

Results: Our pipeline simplifies the data download process by allowing for the modification of manifest files to focus on specific data subsets, facilitating the handling of multimodal data sets related to single patients. The use of the Snakemake pipeline tool significantly reduced the time and effort required to preprocess data, as demonstrated by use-case scenarios involving data aggregation and analysis from multiple lung cancer patients.
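
As an illustration of restricting a manifest to a data subset, the sketch below keeps only the manifest rows that belong to selected cases. It is an assumption-laden example rather than the authors' implementation: it assumes a tab-separated manifest with the columns `id`, `filename`, `md5`, `size`, and `state`, and a Sample Sheet with "File ID" and "Case ID" columns. All rows, the case IDs, and the function names `file_ids_for_cases` and `filter_manifest` are hypothetical.

```python
import csv
import io

# Made-up manifest and Sample Sheet excerpts; column names are assumptions.
MANIFEST_TSV = (
    "id\tfilename\tmd5\tsize\tstate\n"
    "00000000-0000-0000-0000-000000000001\ta.tsv\tmd5a\t10\treleased\n"
    "00000000-0000-0000-0000-000000000002\tb.tsv\tmd5b\t20\treleased\n"
)
SHEET_TSV = (
    "File ID\tFile Name\tCase ID\n"
    "00000000-0000-0000-0000-000000000001\ta.tsv\tTCGA-00-0001\n"
    "00000000-0000-0000-0000-000000000002\tb.tsv\tTCGA-00-0002\n"
)


def file_ids_for_cases(sample_sheet_tsv, cases):
    """Collect the file IDs whose case ID is in the requested set."""
    reader = csv.DictReader(io.StringIO(sample_sheet_tsv), delimiter="\t")
    return {row["File ID"] for row in reader if row["Case ID"] in cases}


def filter_manifest(manifest_tsv, keep_file_ids):
    """Return manifest text containing the header and only the wanted rows."""
    reader = csv.DictReader(io.StringIO(manifest_tsv), delimiter="\t")
    out = io.StringIO()
    writer = csv.DictWriter(
        out, fieldnames=reader.fieldnames, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    for row in reader:
        if row["id"] in keep_file_ids:
            writer.writerow(row)
    return out.getvalue()


wanted = file_ids_for_cases(SHEET_TSV, {"TCGA-00-0001"})
filtered = filter_manifest(MANIFEST_TSV, wanted)
print(filtered, end="")
```

The filtered text could then be written to a file and passed to the GDC Data Transfer Tool, for example with `gdc-client download -m <manifest file>`.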

Discussion and conclusion: The implementation of this pipeline enables researchers to efficiently navigate the complexities of TCGA data extraction and preprocessing. By integrating various tools and a clear step-by-step approach, we provide a streamlined methodology that minimizes errors, enhances data usability, and supports the broader utilization of TCGA data in cancer research. This approach is particularly beneficial for researchers new to the field of genomic data analysis, offering them a practical framework for conducting their studies.

The authors declare that they have no competing interests.

The authors declare that an ethics committee vote is not required.