Performs several preprocessing steps on a res, gct, or odf input file
Author: Joshua Gould, Broad Institute
Most analyses of gene expression data derived from microarray experiments begin with preprocessing of the expression data. Preprocessing removes platform noise and genes that have little variation so the subsequent analysis can identify interesting variations, such as the differential expression between tumor and normal tissue. GenePattern provides the PreprocessDataset module for this purpose. While the module's default parameter values are tailored to Affymetrix expression arrays, we provide guidelines below for its use with Illumina expression arrays. This module has limited applicability to gene expression data derived from RNA-Seq experiments and is typically not employed in RNA-Seq analysis workflows.
This module performs a variety of pre-processing operations including thresholding/ceiling, variation filtering, normalization and log2 transform. It may be applied to datasets in .gct, .res, or .odf formatted files.
The algorithm conducts the following steps in order. Each step is optional, controlled by the module's parameter settings.
If thresholding and filtering are disabled, then rows may be selected for inclusion by random sampling (without replacement). The row sampling rate parameter specifies the fraction of genes that will be selected. If row sampling rate is set to 1, all genes will be selected.
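The row sampling described above can be sketched as follows. This is a minimal illustration and not the module's own code: the function name `sample_rows`, the list-of-pairs row layout, the `seed` argument, and the rounding rule used to turn the rate into a row count are all assumptions.

```python
import random


def sample_rows(rows, sampling_rate, seed=None):
    """Return a random subset of rows, drawn without replacement.

    rows: list of (gene_name, values) pairs (hypothetical in-memory layout).
    sampling_rate: fraction of rows to keep, in the range (0, 1].
    """
    if not 0 < sampling_rate <= 1:
        raise ValueError("sampling_rate must be in (0, 1]")
    if sampling_rate == 1:
        return list(rows)  # a rate of 1 keeps every row
    # Assumption: the number of rows kept is the rounded product.
    n_keep = round(len(rows) * sampling_rate)
    rng = random.Random(seed)
    # random.sample never picks the same index twice (no replacement);
    # sorting the indices keeps the surviving rows in their original order.
    keep = sorted(rng.sample(range(len(rows)), n_keep))
    return [rows[i] for i in keep]
```

For a 10-row dataset, `sample_rows(rows, 0.5)` would return 5 distinct rows, and `sample_rows(rows, 1)` would return all 10.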
As mentioned in the introduction, this module has limited applicability to expression data derived from RNA-Seq experiments. DNA microarrays have a limited dynamic range for detection due to high background levels arising from cross-hybridization and signal saturation. RNA-Seq data, on the other hand, have very low background signal and a higher dynamic range of expression levels. Due to RNA-Seq's larger dynamic range, setting floor and ceiling values is unnecessary, as is sample-count threshold filtering.
Variation filtering is also of questionable value when working with expression data derived from RNA-Seq experiments. Because RNA-Seq expression data is derived from read counts, researchers view gene or transcript expression measurements as legitimate regardless of their level of variability and may not want to eliminate genes or transcripts from consideration in the downstream analysis. Unlike microarray data, there are no default values for min fold change or min delta that would generally apply to RNA-Seq derived expression data; thus, rather than eliminate features whose variability falls below arbitrary thresholds, the current practice is to skip variation filtering and retain all features in RNA-Seq derived expression data.
In order to derive transcript or gene expression levels from RNA-Seq read counts, the counts must be normalized to remove biases arising from differences in transcript length and differences in sequencing depth between samples. For example, longer transcripts will produce more sequencing fragments, and thus more counts, than shorter transcripts. Similarly, differences in sequencing depth will be reflected in read counts. FPKM normalization (fragments per kilobase of transcript per million mapped fragments) divides transcript counts by the transcript length and total read count to eliminate these inherent biases. We assume that GCT-formatted expression data derived from RNA-Seq experiments is in units of FPKM (or RPKM for data derived from single-ended sequencing experiments) and has therefore undergone normalization and does not require PreprocessDataset normalization.
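To make the FPKM definition concrete, the calculation for a single sample can be sketched as below. This is only an illustration of the formula and is not something PreprocessDataset performs; the function name `fpkm` and the dict-based inputs are hypothetical.

```python
def fpkm(fragment_counts, transcript_lengths_bp):
    """Convert raw fragment counts for one sample to FPKM.

    fragment_counts: dict of transcript id -> mapped fragment count.
    transcript_lengths_bp: dict of transcript id -> length in base pairs.
    """
    total_fragments = sum(fragment_counts.values())
    if total_fragments == 0:
        raise ValueError("sample has no mapped fragments")
    per_million = total_fragments / 1e6  # sequencing-depth scaling factor
    result = {}
    for transcript, count in fragment_counts.items():
        length_kb = transcript_lengths_bp[transcript] / 1e3  # length scaling
        result[transcript] = count / length_kb / per_million
    return result
```

With equal counts for two transcripts, the one that is twice as long receives half the FPKM value, which is the length correction described above.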
For downstream analyses that employ correlation metrics (e.g., clustering, feature selection) it may be useful to log transform the data first. Due to the wide dynamic range of RNA-Seq data, highly expressed outliers could dominate the calculated correlations, and log transforming the data would be one approach to working around this issue (see Adiconis et al., 2013). However, if the expression data is to be log transformed, it would first be necessary to add a small number (e.g., 1) to each expression value. When calculating correlation, this would give more weight to genes with lower expression values. An alternative approach to the outlier issue that does not require log transformation of the data would be to use a rank correlation metric such as Spearman correlation.
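The two approaches above can be compared with a short sketch. The helper names (`log2_plus_one`, `pearson`, `ranks`, `spearman`) are hypothetical and written with the standard library only; a statistics package would normally be used instead.

```python
import math


def log2_plus_one(values, pseudocount=1.0):
    """Log2 transform after adding a small number so that zeros are defined."""
    return [math.log2(v + pseudocount) for v in values]


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))
```

For the made-up profiles `[0, 1, 2, 3, 1000]` and `[5, 4, 3, 2, 900]`, a single highly expressed sample drives `pearson` on the raw values to nearly 1 even though the remaining samples trend in opposite directions, while `spearman` returns 0. Applying `pearson` to the `log2_plus_one` output is the log-transform alternative.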
While this module has default values which pertain to Affy expression data, it may also be effectively used with Illumina expression data, after first running that data through IlluminaNormalizer and changing the default values in this module to better suit Illumina data. Suggested values are as follows (with thanks to Yujin Hoshida of the Broad Institute):
*There is currently no module in GenePattern for this last method, probe filtering based on CV; however, Yujin has plans to release his own module for this purpose to GPARC (http://gparc.org).
Kuehn, H., Liberzon, A., Reich, M. and Mesirov, J. P. 2008. Using GenePattern for Gene Expression Analysis. Current Protocols in Bioinformatics. 22:7.12.1–7.12.39.
Adiconis, X., Borges-Rivera, D., et al. 2013. Comparative analysis of RNA sequencing methods for degraded or low-input samples. Nature Methods. 10:623-629.
|input filename *||Input filename - .res, .gct, .odf|
|threshold and filter||Flag controlling whether to apply thresholding and variation filter. The default value is yes.|
|floor||Value for floor threshold. The default value is 20, but this only applies to Affymetrix microarray data; the value is not appropriate for expression data derived from RNA-Seq experiments or alternative microarray platforms. For Illumina data this should be set to a value a little above the background signal.|
|ceiling||Value for ceiling threshold. The default value is 20,000, but this only applies to Affymetrix microarray data; the value is not appropriate for expression data derived from RNA-Seq experiments or alternative microarray platforms. For Illumina data this should be set to 0.|
|min fold change||Minimum fold change for variation filter. The default value is 3, but this only applies to Affymetrix microarray data; the value is not appropriate for expression data derived from RNA-Seq experiments or alternative microarray platforms. For Illumina data this should be set between 3 and 5.|
|min delta||Minimum delta for variation filter. The default value is 100, but this only applies to Affymetrix microarray data; the value is not appropriate for expression data derived from RNA-Seq experiments or alternative microarray platforms. For Illumina data this should be set between 300 and 500 (assuming you've run your data through IlluminaNormalizer and used cubic spline normalization).|
|num outliers to exclude||Number of outliers per row to ignore when calculating row min and max for variation filter. If this value is set to n, then the n smallest and the n largest expression values will be ignored.|
|row normalization||Perform row normalization. Row normalization and log2 transform are mutually exclusive.|
|row sampling rate||Sample rows without replacement to obtain this fraction of the total number of rows|
|threshold for removing rows||Threshold for removing rows.|
|number of columns above threshold||Remove a row unless at least this number of columns have values >= the given threshold|
|log2 transform||Apply log2 transform after all other preprocessing steps.|
|output file format||Output file format|
|output file *||Output file name|
* - required
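The thresholding and filtering parameters above can be read as the following sketch. It is an interpretation of the parameter descriptions and not the module's source code: the function names are hypothetical, and the exact comparisons (a row is kept when max/min >= min fold change and max - min >= min delta) are assumptions.

```python
def clamp_row(values, floor=20, ceiling=20000):
    """Raise values below the floor to the floor and cap values at the ceiling."""
    return [min(max(v, floor), ceiling) for v in values]


def passes_variation_filter(values, min_fold_change=3, min_delta=100,
                            num_outliers=0):
    """Assumed rule: keep a row only if it varies enough across samples."""
    ordered = sorted(values)
    if num_outliers:
        # Ignore the n smallest and n largest values when taking min and max.
        ordered = ordered[num_outliers:len(ordered) - num_outliers]
    if not ordered:
        return False
    row_min, row_max = ordered[0], ordered[-1]
    # Written as a product to avoid dividing by a zero minimum.
    enough_fold_change = row_max >= min_fold_change * row_min
    enough_delta = row_max - row_min >= min_delta
    return enough_fold_change and enough_delta


def passes_count_threshold(values, threshold, min_columns):
    """Keep a row only if at least min_columns values are >= threshold."""
    return sum(1 for v in values if v >= threshold) >= min_columns
```

With the Affymetrix defaults, the made-up row `[10, 50, 400, 30000]` is clamped to `[20, 50, 400, 20000]` and passes the variation filter under these assumptions, whereas a flat row such as `[100, 110, 120]` does not.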
Preprocess & Utilities
|5||2013-12-02||Update to new html doc|
|4||2013-11-11||Adds support for Illumina; performs log transform; deprecates max sigma binning|
|3||2005-05-26||Changed default value of ceiling to 20000|
|2||2005-05-26||Added additional filtering options|