This module imports data from TCGA by taking in a GDC manifest file, downloading the files listed on that manifest, renaming them to be human-friendly, and compiling them into a GCT file to be computer-friendly.
Author: Edwin Juarez
Contact:
https://groups.google.com/forum/?utm_medium=email&utm_source=footer#!forum/genepattern-help
Algorithm Version:
Summary
Remember that you will need to download a manifest file and a metadata file from the GDC data portal (https://portal.gdc.cancer.gov/). To download these two files, follow these instructions: https://github.com/genepattern/TCGAImporter/blob/master/how_to_download_a_manifest_and_metadata.pdf
If you'd like a more comprehensive tutorial of the GDC website, you can find it here: https://docs.gdc.cancer.gov/Data_Portal/Users_Guide/Getting_Started/
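As a rough illustration of what the manifest contains, the sketch below lists the files named in a manifest. It is not the module's own code. It assumes the manifest is a tab-separated file whose header row includes the columns `id` and `filename`; the function name `read_manifest` and the example values are made up for illustration.

```python
import csv
import io


def read_manifest(handle):
    """Return (file_id, filename) pairs from a tab-separated manifest.

    Assumption: the header row contains the columns 'id' and 'filename'.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    return [(row["id"], row["filename"]) for row in reader]


# Tiny in-memory manifest with made-up values, for illustration only.
example = io.StringIO(
    "id\tfilename\tmd5\tsize\tstate\n"
    "00000000-0000-0000-0000-000000000001\tsample_a.counts.gz\tabc\t123\tvalidated\n"
)
entries = read_manifest(example)
```

With a real manifest, you would pass an open file handle instead of the in-memory example.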
Version comments:
- Version 3.2: Changed the module name (from download_from_gdc to TCGAImporter) and updated the code to read metadata files downloaded after February 2018 (following GDC's metadata reformatting). This change is backwards compatible.
Functionality yet to be implemented:
- Parse copy number variation
Technical notes:
- This module has been tested to run in the Docker container genepattern/docker-download-from-gdc:0.1 which has build code benqhsshqwge8fsahuu47d5
- To create a conda environment (called GP_dfgdc_env) with the required dependencies, download the requirements.txt file from the GitHub repository genepattern/docker-python36 (file URL: https://raw.githubusercontent.com/genepattern/docker-python36/master/requirements.txt) and run these three commands in the folder where requirements.txt is located:
conda create --name GP_dfgdc_env pip
source activate GP_dfgdc_env
pip install -r requirements.txt
Note that you will need to have the GDC download client in the same folder. If you don't know what this means, read more here: https://docs.gdc.cancer.gov/Data_Portal/Users_Guide/Getting_Started
Parameters
Name | Description |
---|---|
imanifest * | The relative path of the manifest used to download the data. This file is obtained from the GDC data portal (https://portal.gdc.cancer.gov/). |
metadata * | The metadata file obtained from the GDC data portal (https://portal.gdc.cancer.gov/). |
output_file_name * | The base name to use for output files. E.g., if you type "TCGA_dataset" then the GCT file will be named "TCGA_dataset.gct". |
gct * | Whether or not to create a GCT file. |
translate_gene_id * | Whether or not to translate ENSEMBL IDs (e.g., ENSG00000012048) to Hugo Gene Symbol (e.g., BRCA1). |
cls * | Whether or not to create a CLS file separating Normal and Tumor classes based on TCGA Sample ID. |
* - required
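To make the translate_gene_id option more concrete, here is a minimal sketch of an ID-to-symbol lookup. It is not the module's implementation. The function name, the one-entry lookup table, the handling of a trailing version suffix (such as ".19"), and the fallback to the original ID are all assumptions made for illustration.

```python
def translate_gene_id(ensembl_id, symbol_by_id):
    """Return the gene symbol for an ENSEMBL ID, or the ID itself if unknown."""
    # Assumption: IDs may carry a version suffix (e.g. ".19"); drop it first.
    base_id = ensembl_id.split(".")[0]
    # Hypothetical design choice: keep the original ID when no symbol is known.
    return symbol_by_id.get(base_id, ensembl_id)


# One-entry lookup table built from the example in the parameter description.
lookup = {"ENSG00000012048": "BRCA1"}
known = translate_gene_id("ENSG00000012048.19", lookup)
unknown = translate_gene_id("ENSG_UNKNOWN", lookup)
```

Falling back to the original ID keeps every row of the dataset labeled even when the lookup table is incomplete.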
Output Files
- GCT file (if gct was set to True)
Contains all the data downloaded from GDC.
- TXT files (if gct was set to False)
Contains the data downloaded from GDC, scattered across multiple files.
- CLS
Created if cls was set to True. This CLS file contains the classification of the samples into either normal tissue or cancer tissue based on the TCGA ID.
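The sketch below shows one way the GCT and CLS outputs could be assembled. It is not the module's actual code. It assumes the sample-type code is the first two digits of the fourth hyphen-separated field of a TCGA sample ID, with codes below 10 meaning tumor, and it simplifies by treating every other code as normal. The function names and sample IDs are made up for illustration.

```python
def sample_class(tcga_id):
    """Classify a sample as 'Tumor' or 'Normal' from its TCGA sample ID."""
    # Assumption: fourth field starts with the two-digit sample-type code.
    code = int(tcga_id.split("-")[3][:2])
    # Simplification: codes below 10 are tumor, everything else is normal.
    return "Tumor" if code < 10 else "Normal"


def cls_text(sample_ids):
    """Build CLS content: a counts line, a class-name line, a label line."""
    classes = ["Normal", "Tumor"]
    labels = [classes.index(sample_class(s)) for s in sample_ids]
    lines = [
        f"{len(labels)} {len(classes)} 1",
        "# " + " ".join(classes),
        " ".join(str(label) for label in labels),
    ]
    return "\n".join(lines) + "\n"


def gct_text(sample_ids, rows):
    """Build GCT v1.2 content; rows maps a gene name to one value per sample."""
    lines = [
        "#1.2",
        f"{len(rows)}\t{len(sample_ids)}",
        "\t".join(["Name", "Description"] + list(sample_ids)),
    ]
    for gene, values in rows.items():
        # The gene name is reused as the description in this sketch.
        lines.append("\t".join([gene, gene] + [str(v) for v in values]))
    return "\n".join(lines) + "\n"


# Made-up sample IDs: the first has a tumor code, the second a normal code.
samples = ["TCGA-AA-0001-01A", "TCGA-AA-0002-11A"]
cls_out = cls_text(samples)
gct_out = gct_text(samples, {"BRCA1": [10, 20]})
```

Writing `gct_out` and `cls_out` to files named after output_file_name would give a matching GCT and CLS pair with samples in the same order.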
License
TCGAImporter is distributed under a modified BSD license available at https://raw.githubusercontent.com/genepattern/TCGAImporter/master/LICENSE
Platform Dependencies
Task Type:
Download dataset
CPU Type:
any
Operating System:
any
Language:
Python 3.6
Version Comments
Version | Release Date | Description |
---|---|---|
4 | 2018-05-16 | Renaming the module from download_from_gdc to TCGAImporter |
3 | 2018-04-16 | Preparing for prebuild |
1 | 2018-04-16 | Initial version |