Streamlining TCGA Downloads and File Management
Published: 6 September 2024
Introduction: The Cancer Genome Atlas (TCGA) is an invaluable resource for cancer research, providing comprehensive genomic data across various cancer types. However, first-time users often find it challenging to navigate the Genomic Data Commons (GDC) portal and to handle TCGA data, because the file naming conventions are complex and disparate data types must be linked to individual case IDs. The Waldron Lab has already introduced the tool TCGAutils (https://github.com/waldronlab/TCGAutils), which facilitates TCGA data handling and provides a function to map case IDs to file names, but it does not offer a straightforward combination of all steps.
Methods: To address these challenges, we developed a streamlined pipeline that uses the GDC portal's cart system to manage file selection and the GDC Data Transfer Tool to download the data. We use the Sample Sheet provided by the GDC portal to replace the opaque 36-character file IDs and file names with human-readable case IDs. Furthermore, we developed customizable Python scripts and Jupyter notebooks for ID mapping and introduced a novel pipeline, built on the Snakemake workflow management system, that automates the data preprocessing tasks (https://github.com/alex-baumann-ur/TCGADownloadHelp).
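The ID mapping step described above can be illustrated with a minimal sketch. It is not the code from the TCGADownloadHelp repository; it assumes the Sample Sheet is a tab-separated file with "File Name" and "Case ID" columns (check the header of your own export), and the function name, the example rows and the renaming scheme `<case ID>_<file name>` are hypothetical choices made for illustration.

```python
import csv
import io


def build_rename_map(sample_sheet_tsv: str) -> dict[str, str]:
    """Map each original file name to '<case ID>_<original file name>'."""
    reader = csv.DictReader(io.StringIO(sample_sheet_tsv), delimiter="\t")
    rename_map = {}
    for row in reader:
        # Assumption: if a file lists several comma-separated case IDs,
        # the first one is used as the label.
        case_id = row["Case ID"].split(",")[0].strip()
        file_name = row["File Name"]
        rename_map[file_name] = f"{case_id}_{file_name}"
    return rename_map


# Made-up example rows; the IDs and file names are placeholders, not real TCGA entries.
EXAMPLE_SHEET = (
    "File ID\tFile Name\tData Type\tCase ID\n"
    "11111111-2222-3333-4444-555555555555\taaa.counts.tsv\t"
    "Gene Expression Quantification\tTCGA-00-0001\n"
    "66666666-7777-8888-9999-000000000000\tbbb.maf.gz\t"
    "Masked Somatic Mutation\tTCGA-00-0002, TCGA-00-0002\n"
)

if __name__ == "__main__":
    for old_name, new_name in build_rename_map(EXAMPLE_SHEET).items():
        print(old_name, "->", new_name)
```

Keeping the original file name as a suffix is one possible design choice: it preserves the link back to the downloaded file while making the case visible at a glance.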
Results: Our pipeline simplifies the data download process by allowing for the modification of manifest files to focus on specific data subsets, facilitating the handling of multimodal data sets related to single patients. The use of the Snakemake pipeline tool significantly reduced the time and effort required to preprocess data, as demonstrated by use-case scenarios involving data aggregation and analysis from multiple lung cancer patients.
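Restricting a manifest to a data subset, as mentioned above, could look roughly like the following sketch. It assumes the manifest is a tab-separated file whose first column is named "id" and that the Sample Sheet has "File ID" and "Data Type" columns; the function name and all example values are hypothetical, and the authors' own scripts may work differently.

```python
import csv
import io


def filter_manifest(manifest_tsv: str, sample_sheet_tsv: str, data_type: str) -> str:
    """Keep only manifest rows whose file ID has the requested data type."""
    sheet = csv.DictReader(io.StringIO(sample_sheet_tsv), delimiter="\t")
    # Collect the file IDs that the Sample Sheet assigns to the wanted data type.
    wanted_ids = {row["File ID"] for row in sheet if row["Data Type"] == data_type}

    manifest = csv.DictReader(io.StringIO(manifest_tsv), delimiter="\t")
    out = io.StringIO()
    writer = csv.DictWriter(
        out, fieldnames=manifest.fieldnames, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    for row in manifest:
        if row["id"] in wanted_ids:
            writer.writerow(row)
    return out.getvalue()


# Made-up example data; IDs, checksums and sizes are placeholders.
EXAMPLE_MANIFEST = (
    "id\tfilename\tmd5\tsize\tstate\n"
    "11111111-2222-3333-4444-555555555555\taaa.counts.tsv\t"
    "d41d8cd98f00b204e9800998ecf8427e\t1024\treleased\n"
    "66666666-7777-8888-9999-000000000000\tbbb.maf.gz\t"
    "d41d8cd98f00b204e9800998ecf8427e\t2048\treleased\n"
)
EXAMPLE_SAMPLE_SHEET = (
    "File ID\tFile Name\tData Type\tCase ID\n"
    "11111111-2222-3333-4444-555555555555\taaa.counts.tsv\t"
    "Gene Expression Quantification\tTCGA-00-0001\n"
    "66666666-7777-8888-9999-000000000000\tbbb.maf.gz\t"
    "Masked Somatic Mutation\tTCGA-00-0002\n"
)

if __name__ == "__main__":
    print(filter_manifest(EXAMPLE_MANIFEST, EXAMPLE_SAMPLE_SHEET,
                          "Gene Expression Quantification"))
```

The filtered text can be written to a new manifest file and passed to the GDC Data Transfer Tool, so that only the selected subset is downloaded.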
Discussion and conclusion: The implementation of this pipeline enables researchers to efficiently navigate the complexities of TCGA data extraction and preprocessing. By integrating various tools and a clear step-by-step approach, we provide a streamlined methodology that minimizes errors, enhances data usability, and supports the broader utilization of TCGA data in cancer research. This approach is particularly beneficial for researchers new to the field of genomic data analysis, offering them a practical framework for conducting their studies.
The authors declare that they have no competing interests.
The authors declare that an ethics committee vote is not required.