GenePattern - TCGAImporter (v4)

This module imports data from TCGA by taking in a GDC manifest file, downloading the files listed on that manifest, renaming them to be human-friendly, and compiling them into a GCT file to be computer-friendly.

Author: Edwin Juarez

Contact:

https://groups.google.com/forum/?utm_medium=email&utm_source=footer#!forum/genepattern-help

genepattern.org/help

Algorithm Version:

Summary

Remember that you will need to download a manifest file and a metadata file from the GDC data portal (https://portal.gdc.cancer.gov/). To download these two files, follow these instructions: https://github.com/genepattern/TCGAImporter/blob/master/how_to_download_a_manifest_and_metadata.pdf

If you'd like a more comprehensive tutorial of the GDC website, you can find it here: https://docs.gdc.cancer.gov/Data_Portal/Users_Guide/Getting_Started/
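
For orientation, a GDC manifest is a tab-separated text file whose header row names the columns id, filename, md5, size, and state. The sketch below reads such a file with Python's csv module. The function name read_manifest and the sample row are illustrative assumptions, not part of TCGAImporter.

```python
import csv
import io


def read_manifest(handle):
    """Return one dict per file listed in a GDC manifest (tab-separated)."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [row for row in reader]


# Made-up manifest content for illustration only.
example = io.StringIO(
    "id\tfilename\tmd5\tsize\tstate\n"
    "00000000-0000-0000-0000-000000000000\texample.htseq.counts.gz\t"
    "d41d8cd98f00b204e9800998ecf8427e\t1024\tvalidated\n"
)
entries = read_manifest(example)
print(entries[0]["filename"])
```

With a real manifest, the same function can be called on a file opened with open(path, newline="").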

Version comments:

Version 3.2: Changed the module name (from download_from_gdc to TCGAImporter) and updated the code to read metadata files downloaded after February 2018 (following GDC's metadata reformatting). This change is backwards compatible.

Functionality yet to be implemented:

Parse copy number variation

Technical notes:

This module has been tested to run in the Docker container genepattern/docker-download-from-gdc:0.1, which has build code benqhsshqwge8fsahuu47d5.
To create a conda environment (called GP_dfgdc_env) with the required dependencies, download the requirements.txt file from the GitHub repository genepattern/docker-python36 (the file's URL is https://raw.githubusercontent.com/genepattern/docker-python36/master/requirements.txt) and run these three commands in the folder where requirements.txt is located:

conda create --name GP_dfgdc_env pip
source activate GP_dfgdc_env
pip install -r requirements.txt

Note that you will need to have the GDC download client in the same folder. If you don't know what this means, read more here: https://docs.gdc.cancer.gov/Data_Portal/Users_Guide/Getting_Started
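
As a rough illustration of why the client needs to sit in the working folder, the download step could be driven from Python as sketched below. The gdc-client executable accepts "download -m <manifest>" and "-d <directory>". The helper names, the ./gdc-client path, and the file names are assumptions made for this example, and the module's actual invocation may differ.

```python
import subprocess


def build_download_command(manifest, out_dir, client="./gdc-client"):
    """Build the argument list for downloading every file in a manifest."""
    return [client, "download", "-m", manifest, "-d", out_dir]


def download(manifest, out_dir):
    """Run the download; requires the client executable and network access."""
    subprocess.run(build_download_command(manifest, out_dir), check=True)


# Only build and show the command here; call download(...) to actually run it.
command = build_download_command("gdc_manifest.txt", "raw_files")
print(" ".join(command))
```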

Parameters

Name	Description
imanifest *	The relative path of the manifest used to download the data. This file is obtained from the GDC data portal (https://portal.gdc.cancer.gov/).
metadata *	The metadata file obtained from the GDC data portal (https://portal.gdc.cancer.gov/).
output_file_name *	The base name to use for output files. E.g., if you type "TCGA_dataset" then the GCT file will be named "TCGA_dataset.gct".
gct *	Whether or not to create a GCT file.
translate_gene_id *	Whether or not to translate ENSEMBL IDs (e.g., ENSG00000012048) to Hugo Gene Symbols (e.g., BRCA1).
cls *	Whether or not to create a CLS file separating Normal and Tumor classes based on the TCGA Sample ID.

* - required
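
The translate_gene_id option can be pictured as a lookup from ENSEMBL gene IDs to gene symbols. The following is a minimal sketch under stated assumptions: the one-entry lookup table and the function name translate are hypothetical, and it assumes that IDs may carry a version suffix (e.g., ENSG00000012048.19) that is dropped before the lookup. The module's own mapping may work differently.

```python
# Hypothetical lookup table; a real one would cover every annotated gene.
ENSEMBL_TO_HUGO = {"ENSG00000012048": "BRCA1"}


def translate(ensembl_id, table=ENSEMBL_TO_HUGO):
    """Map an ENSEMBL gene ID to its symbol, falling back to the input ID."""
    base_id = ensembl_id.split(".")[0]  # drop a version suffix such as ".19"
    return table.get(base_id, ensembl_id)


print(translate("ENSG00000012048.19"))
print(translate("ENSG00000000000"))  # not in the table, returned unchanged
```

Falling back to the original ID keeps every row labeled even when a gene has no known symbol, which is one possible design choice among several.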

Output Files

GCT file (if gct was set to True)
Contains all the data downloaded from GDC.
TXT files (if gct was set to False)
Contain the data downloaded from GDC, scattered across multiple files.
CLS
Created if cls was set to True. This CLS file contains the classification of the samples as either normal tissue or cancer tissue, based on the TCGA ID.
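
To make the two output formats concrete, the sketch below writes a tiny GCT (version 1.2) and CLS pair as text. It assumes the standard TCGA barcode layout, in which the two-digit sample type code at the start of the fourth dash-separated field is 01 to 09 for tumor samples and 10 to 19 for normal samples. The function names, barcodes, and counts are made up for illustration and are not taken from the module's code.

```python
def sample_class(barcode):
    """Return 'Tumor' or 'Normal' from the sample type code of a TCGA barcode."""
    code = int(barcode.split("-")[3][:2])
    # Simplification: control codes (20 to 29) are not handled separately.
    return "Tumor" if code < 10 else "Normal"


def to_gct(samples, rows):
    """Format (gene, values) pairs as GCT 1.2 text."""
    lines = ["#1.2", "{}\t{}".format(len(rows), len(samples))]
    lines.append("\t".join(["Name", "Description"] + samples))
    for gene, values in rows:
        lines.append("\t".join([gene, gene] + [str(v) for v in values]))
    return "\n".join(lines) + "\n"


def to_cls(samples):
    """Format CLS text: counts line, class names line, one label per sample."""
    names = [sample_class(s) for s in samples]
    classes = []  # class names in order of first appearance
    for name in names:
        if name not in classes:
            classes.append(name)
    labels = [str(classes.index(name)) for name in names]
    header = "{} {} 1".format(len(samples), len(classes))
    return "\n".join([header, "# " + " ".join(classes), " ".join(labels)]) + "\n"


# Made-up barcodes and counts for illustration only.
samples = ["TCGA-AA-0001-01A", "TCGA-AA-0002-11A"]
rows = [("BRCA1", [120, 95]), ("TP53", [300, 410])]
print(to_gct(samples, rows))
print(to_cls(samples))
```

Class names are listed in order of first appearance so that the numeric labels on the third CLS line line up with the names on the second line.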

License

TCGAImporter is distributed under a modified BSD license available at https://raw.githubusercontent.com/genepattern/TCGAImporter/master/LICENSE

Platform Dependencies

Task Type:
Download dataset

CPU Type:
any

Operating System:
any

Language:
Python 3.6

Version Comments

Version	Release Date	Description
4	2018-05-16	Renaming the module from download_from_gdc to TCGAImporter
3	2018-04-16	Preparing for prebuild
1	2018-04-16	Initial version