Preprocess RNA-Seq count data in a GCT file so that it is suitable for use in GenePattern analyses.
Author: Arthur Liberzon, David Eby, Broad Institute
This module is used to preprocess RNA-Seq data into a form suitable for use downstream in other GenePattern analyses such as GSEA, ComparativeMarkerSelection, NMFConsensus, as well as GENE-E and other visualizers. Many of these tools were originally designed to handle microarray data - particularly from Affymetrix arrays - and so we must be mindful of that origin when preprocessing data for use with them.
The module does this by using a mean-variance modeling technique to transform the dataset so that it approximates a normal distribution, which makes it possible to apply classic normal-based, microarray-oriented statistical methods and workflows.
This modeling technique is called 'voom' and is part of the 'limma' package of Bioconductor. Use of this method requires the user to supply raw read counts as produced by HTSeq or RSEM. These counts should not be normalized, and they should not be RPKM or FPKM values. The MergeHTSeqCounts module in GenePattern is capable of producing a suitable GCT from HTSeq output.
The module first performs a filtering pass on the dataset to remove any features (rows) without at least 1 read per million in n of the samples, where n is the size of the smallest group of replicates, which is the recommended choice. Note that this is not a simple threshold on the count but rather a filter on CPM (counts per million) values calculated just for this purpose. The raw values are still used for variance modeling; the CPM values are used only for filtering and are then discarded. The module automatically determines the size of the smallest group of samples (n) from the classifications in the user-supplied CLS file.
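The filtering pass described above can be sketched in a few lines of Python. This is only an illustration of the logic, not the module's own code (the module runs in R). The function names `cpm` and `filter_by_cpm` are hypothetical, and the sketch assumes a strict "greater than" comparison against the threshold and library sizes taken as simple column sums.

```python
from collections import Counter


def cpm(counts):
    """Convert a features x samples matrix of raw counts to counts per million.

    Library sizes are assumed to be the plain column sums of the matrix.
    """
    n_samples = len(counts[0])
    lib_sizes = [sum(row[j] for row in counts) for j in range(n_samples)]
    return [
        [row[j] * 1e6 / lib_sizes[j] for j in range(n_samples)]
        for row in counts
    ]


def filter_by_cpm(counts, labels, threshold=1.0):
    """Return the indices of rows that pass the CPM filter.

    A row is kept if its CPM exceeds `threshold` in at least n samples,
    where n is the size of the smallest class in `labels`.
    """
    n = min(Counter(labels).values())
    cpm_values = cpm(counts)
    return [
        i
        for i, row in enumerate(cpm_values)
        if sum(value > threshold for value in row) >= n
    ]


# Toy data: each column sums to 1,000,000, so CPM equals the raw count.
counts = [
    [999982, 999996, 999975, 999994],  # highly expressed feature
    [10, 0, 20, 0],                    # above threshold in 2 samples
    [5, 0, 0, 0],                      # above threshold in 1 sample only
    [3, 4, 5, 6],                      # above threshold in all samples
]
labels = ["A", "A", "B", "B"]          # smallest class has n = 2 samples
kept = filter_by_cpm(counts, labels)   # row 2 is removed
```

Only the row indices are returned, so the raw counts of the surviving rows remain available for the later modeling steps, in line with the note that the CPM values are discarded after filtering.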
Next, the module normalizes the dataset using Trimmed Mean of M-values (TMM) on the raw counts of the features that pass the filter. Finally, the module performs the mean-variance transformation to approximate a normal distribution using the 'voom' method of the 'limma' package, returning a new dataset with values in logCPM (log2 counts per million) that can be used with classic normal-based, microarray-oriented statistical methods and workflows.
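To make the logCPM scale concrete, here is a minimal Python sketch of a log2 counts-per-million transformation. It is not the module's code. It assumes the offsets used in voom's log-CPM definition as we understand it (0.5 added to each count and 1 added to the library size), it takes the normalization factors as given rather than computing TMM, and it does not compute the precision weights that voom also estimates. The function name `log2_cpm` is hypothetical.

```python
import math


def log2_cpm(counts, norm_factors=None):
    """Compute log2 counts per million for a features x samples matrix.

    norm_factors: optional per-sample scaling factors (for example, from a
    TMM normalization computed elsewhere). Defaults to 1.0 for every sample,
    which is an assumption made for this sketch.
    """
    n_samples = len(counts[0])
    lib_sizes = [sum(row[j] for row in counts) for j in range(n_samples)]
    if norm_factors is None:
        norm_factors = [1.0] * n_samples
    # Effective library size = raw library size scaled by the factor.
    effective = [size * f for size, f in zip(lib_sizes, norm_factors)]
    return [
        [
            math.log2((row[j] + 0.5) / (effective[j] + 1.0) * 1e6)
            for j in range(n_samples)
        ]
        for row in counts
    ]


# One sample with a library size of 999,999 reads.
values = log2_cpm([[0], [999999]])
# A zero count maps to a finite value (about -1) instead of -infinity.
```

The small offsets keep zero counts finite after the log transform, which matters when the output is passed to tools that expect real-valued, microarray-like expression data.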
|Name||Description|
|input file *||A GCT file containing raw RNA-Seq counts, such as is produced by MergeHTSeqCounts|
|cls file *||A categorical CLS file specifying the phenotype classes for the samples in the GCT file.|
|output file *||Output file name|
|expression value filter threshold *||Threshold to use when filtering CPM expression values; rows are kept only if the value (in CPM) is greater than this threshold in at least n columns, where n is the size of the smallest class in the CLS file|
* - required
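As an illustration of how the class information in a CLS file relates to the smallest group size, the following Python sketch parses a categorical CLS file held in a string. It assumes the standard three-line layout (a header with the sample count, class count, and the constant 1; a line of class names starting with '#'; and a line with one label per sample). The function name `parse_cls` is hypothetical, and this is not the module's own parser.

```python
from collections import Counter


def parse_cls(text):
    """Parse a categorical CLS file and return (class_names, labels).

    Raises ValueError if the header counts do not match the label line.
    """
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    n_samples, n_classes, _ = (int(field) for field in lines[0].split())
    class_names = lines[1].lstrip("#").split()
    labels = lines[2].split()
    if len(labels) != n_samples:
        raise ValueError("number of labels does not match the header")
    if len(set(labels)) != n_classes:
        raise ValueError("number of classes does not match the header")
    return class_names, labels


cls_text = "5 2 1\n# tumor normal\n0 0 0 1 1\n"
class_names, labels = parse_cls(cls_text)
# Size of the smallest class, i.e. the n used by the filtering step.
smallest = min(Counter(labels).values())
```

With three samples labeled 0 and two labeled 1, the smallest class has two samples, so a feature would need to pass the CPM threshold in at least two samples to be kept.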
The module requires R-3.1.3 with the 'getopt_1.20.0' and 'optparse_1.3.2' packages from CRAN and the 'limma' and 'edgeR' packages from Bioconductor 3.0.
Preprocess & Utilities
|Version||Release Date||Description|
|0.4||2015-11-24||Prerelease building towards Beta|