Identify differentially expressed genes that can discriminate between distinct classes of samples.
Author: Joshua Gould, Gad Getz, Stefano Monti
When analyzing genome-wide transcription profiles from microarray or RNA-seq data, the first step is often to identify genes that can discriminate between distinct classes of samples (usually defined by a phenotype, such as tumor or normal). This process is commonly referred to as marker (or feature) selection. Marker genes are identified by calculating, for each profiled gene, a test statistic (e.g., t-test) that assesses the correlation of the gene's expression profile with a class template. If the value of the test statistic for a specific gene, and thus the degree of differential expression presented by that gene, is significantly greater than what one would expect to see under the null hypothesis (the gene is not differentially expressed between classes), that gene is identified as a statistically significant marker gene.
The ComparativeMarkerSelection module takes as input a dataset of expression profiles from samples belonging to two classes and, implementing the statistical tests described above, identifies marker genes which discriminate between the classes.
The ComparativeMarkerSelection module includes several approaches to determine the features that are most closely correlated with a class template and the significance of that correlation. The module computes significance values for features using several metrics, including FDR(BH), Q-Value, maxT, FWER, Feature-Specific P-Value, and Bonferroni. The results from the ComparativeMarkerSelection algorithm can be viewed with the ComparativeMarkerSelectionViewer. ExtractComparativeMarkerResults creates a derived dataset and feature list file from the results of ComparativeMarkerSelection.
By default ComparativeMarkerSelection expects the data in the input file to not be log transformed. Some of the calculations such as the fold change are not accurate when log transformed data is provided and not indicated. To indicate that your data is log transformed, be sure to set the “log transformed data” parameter to “yes”. Also, ComparativeMarkerSelection requires at least three samples per class to run successfully.
The analytic module takes as input a dataset of expression profiles from samples belonging to two phenotypes. If a dataset contains multiple phenotypes, then there is the option to perform all pairwise comparisons or all one-versus-all comparisons. A test statistic (e.g. t-test) is chosen to assess the differential expression between the two classes of samples. Note that technical and biological replicates are handled the same way as independent samples. The significance (nominal P-value) of marker genes is computed using a permutation test, which is a commonly used method for assessing the significance of marker genes; see (4) for details.
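The permutation test described above can be sketched for a single gene as follows. This is a minimal illustration, not the module's implementation: the function names, the choice of an unequal-variance (Welch-style) t statistic, the class 1 minus class 0 direction, and the example values are all assumptions made for the sketch.

```python
import random
from statistics import mean, variance

def t_statistic(values, labels):
    """Two-sample t statistic (unequal variances), class 1 minus class 0.

    The exact t-test variant used by the module is an assumption here.
    """
    a = [v for v, c in zip(values, labels) if c == 0]
    b = [v for v, c in zip(values, labels) if c == 1]
    standard_error = (variance(a) / len(a) + variance(b) / len(b)) ** 0.5
    return (mean(b) - mean(a)) / standard_error

def permutation_p_value(values, labels, n_perm=1000, seed=0):
    """Two-sided nominal p-value for one gene.

    Returns the fraction of label shuffles whose |t| is at least as
    large as the |t| observed with the true labels.
    """
    rng = random.Random(seed)
    observed = abs(t_statistic(values, labels))
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(t_statistic(values, shuffled)) >= observed:
            hits += 1
    return hits / n_perm

# Hypothetical expression values for one gene: four samples per class.
values = [1.0, 1.2, 0.9, 1.1, 3.0, 3.2, 2.9, 3.1]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
p_value = permutation_p_value(values, labels)
```

With only four samples per class there are few distinct label assignments, so the achievable p-values are coarse; this is one reason the sample-size guidance for the number of permutations parameter matters.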
Selecting class markers is a particular instance of the general multiple hypothesis testing problem. Since several thousand hypotheses are usually tested at once (one per gene), the nominal P-values have to be corrected to account for the increased number of potential false positives. For example, if we test 20,000 genes for differential expression, a nominal P-value threshold of 0.01 would only ensure that the expected number of false positives is at most 200 (0.01 x 20,000). ComparativeMarkerSelection includes several methods of correcting for multiple hypothesis testing, including FDR(BH), Q-Value, maxT, FWER, Feature-Specific P-Value, and Bonferroni; (4) describes their applicability.
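To make the correction step concrete, the sketch below reproduces the arithmetic from the example and applies two of the listed corrections, Bonferroni and FDR(BH), to a short list of nominal p-values. The function names and the input p-values are hypothetical, and the code follows the standard textbook definitions of these procedures rather than the module's own source.

```python
def bonferroni(p_values):
    """Multiply each p-value by the number of tests, capped at 1.0."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that adjusted
    # values never increase as the nominal p-value decreases.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        adjusted[index] = running_min
    return adjusted

# Expected false positives at a nominal threshold of 0.01 over 20,000 genes.
expected_false_positives = 0.01 * 20000

# Hypothetical nominal p-values for four genes.
nominal = [0.01, 0.04, 0.03, 0.005]
bonferroni_adjusted = bonferroni(nominal)
bh_adjusted = benjamini_hochberg(nominal)
```

Bonferroni controls the chance of any false positive and is the more conservative of the two; BH controls the expected proportion of false positives among the genes called significant.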
|input file *||
Note the following constraints:
|cls file *||
The class file. CLS
ComparativeMarkerSelection analyzes two phenotype classes at a time. If the expression data set includes samples from more than two classes, use the phenotype test parameter to analyze each class against all others (one-versus-all) or all class pairs (all pairs).
|confounding variable cls file||
The class file containing the confounding variable. CLS
If you are studying two variables and your data set contains a third variable that might distort the association between the variables of interest, you can use a confounding variable class file to correct for the effect of the third variable. For example, the data set in Lu, Getz, et al. (2005) contains tumor and normal samples from different tissue types. When studying the association between the tumor and normal samples, the authors use a confounding variable class file to correct for the effect of the different tissue types.
The phenotype class file identifies the tumor and normal samples:
75 2 1
# Normal Tumor
0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 0 0 0 1 1 1 1 1 0 0 0 0 0 0 0 0 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 0 0 0 0 1 1 1 1 1 1 0 0 0 1 1 1 1 1 1
The confounding variable class file identifies the tissue type of each sample:
75 6 1
# colon kidney prostate uterus human-lung breast
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 4 4 4 4 4 4 4 4 4 4 5 5 5 5 5 5 5 5 5
Given these two class files, when performing permutations, ComparativeMarkerSelection shuffles the tumor/normal labels only among samples with the same tissue type.
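A restricted shuffle of this kind can be sketched as below. The sketch assumes the CLS layout of a header line, a line of class names starting with #, and a line of labels; the helper names parse_cls and restricted_shuffle and the six-sample example are hypothetical, and the module's own permutation code may differ.

```python
import random

def parse_cls(text):
    """Parse a three-line CLS file into (class_names, labels)."""
    header, names_line, labels_line = text.strip().splitlines()
    n_samples, n_classes, _ = (int(field) for field in header.split())
    class_names = names_line.lstrip("#").split()
    labels = [int(field) for field in labels_line.split()]
    if len(labels) != n_samples or len(class_names) != n_classes:
        raise ValueError("CLS header does not match its contents")
    return class_names, labels

def restricted_shuffle(labels, strata, rng):
    """Shuffle labels only among samples that share a stratum value."""
    groups = {}
    for index, stratum in enumerate(strata):
        groups.setdefault(stratum, []).append(index)
    shuffled = list(labels)
    for indices in groups.values():
        values = [labels[i] for i in indices]
        rng.shuffle(values)
        for i, value in zip(indices, values):
            shuffled[i] = value
    return shuffled

# Hypothetical six-sample example: phenotype labels and tissue strata.
_, phenotype = parse_cls("6 2 1\n# Normal Tumor\n0 0 1 1 0 1\n")
_, tissue = parse_cls("6 2 1\n# colon kidney\n0 0 0 1 1 1\n")
permuted = restricted_shuffle(phenotype, tissue, random.Random(1))
```

Because labels move only within a tissue type, each tissue keeps its original count of tumor and normal samples in every permutation, so tissue-driven expression differences do not inflate the null distribution.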
|test direction *||The test to perform. By default, ComparativeMarkerSelection performs a two-sided test; that is, the test statistic score is calculated assuming that the differentially expressed gene can be up-regulated in either phenotype class. Optionally, use the test direction parameter to specify a one-sided test, where the differentially expressed gene must be up-regulated for class 0 or for class 1.|
|test statistic *||
The statistic to use:
The minimum standard deviation, applied when the selected test statistic includes the min std option. If σ is less than min std, σ is set to min std.
|number of permutations *||
The number of permutations to perform (use 0 to calculate asymptotic p-values using the standard independent two-sample t-test). ComparativeMarkerSelection uses a permutation test to estimate the significance (p-value) of the test statistic score. The number of permutations you specify depends on the number of hypotheses being tested and the significance level that you want to achieve (3). If the data set includes at least 10 samples per class, use the default value of 10000 permutations to ensure sufficiently accurate p-values.
If the data set includes fewer than 10 samples in any class, permuting the samples cannot give an accurate p-value. Specify a value of 0 permutations to use asymptotic p-values instead. In this case, ComparativeMarkerSelection computes p-values assuming the test statistic scores follow Student's t-distribution (rather than using the test statistic to create an empirical distribution of the scores). Asymptotic p-values are calculated using the p-value obtained from the standard independent two-sample t-test.
|log transformed data *||Whether the input data has been log transformed. By default ComparativeMarkerSelection expects the data in the input file to not be log transformed. Some of the calculations, such as the fold change, are not accurate when log transformed data is provided and not indicated. To indicate that your data is log transformed, set this parameter to “yes”.|
|complete *||Whether to perform all possible permutations. When the complete parameter is set to yes, ComparativeMarkerSelection ignores the number of permutations parameter and computes the p-value based on all possible sample permutations. Use this option only with small data sets, where the number of all possible permutations is less than 1000.|
|balanced *||Whether to perform balanced permutations. When the balanced parameter is set to yes, ComparativeMarkerSelection requires an equal and even number of samples in each class (e.g. 10 samples in each class, not 11 in each class or 10 in one class and 12 in the other).|
|random seed *||The seed of the random number generator used to produce permutations.|
|smooth p values *||
Whether to smooth p-values using Laplace’s Rule of Succession. By default, smooth p values is set to yes, which means p-values are always less than 1.0 and greater than 0.0.
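The effect of smoothing can be illustrated with the textbook form of Laplace's Rule of Succession, (s + 1) / (n + 2). The exact expression the module uses is not given here, so the function below is a hypothetical sketch of the idea rather than the module's formula.

```python
def smoothed_p_value(hits, n_perm):
    """Rule-of-succession estimate: (hits + 1) / (n_perm + 2).

    hits is the number of permutations whose score was at least as
    extreme as the observed score; n_perm is the total number run.
    """
    return (hits + 1) / (n_perm + 2)

# With 10000 permutations, neither extreme reaches 0.0 or 1.0.
lowest = smoothed_p_value(0, 10000)
highest = smoothed_p_value(10000, 10000)
```

Without smoothing, a gene whose observed score is never matched by any permutation would get a p-value of exactly 0, which overstates the evidence that a finite number of permutations can provide.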
|phenotype test||Tests to perform when cls file has more than two classes: one-versus-all, all pairs. (Note: The p-values obtained from the one-versus-all comparison are not fully corrected for multiple hypothesis testing.)|
|output filename *||The name of the output file.|
* - required
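To judge whether the complete parameter is appropriate for a data set, it can help to count the possible permutations before running the module. The sketch below assumes that "all possible permutations" means the distinct ways of assigning the two class labels to the samples while keeping class sizes fixed; that interpretation and the function name are assumptions, not taken from the module.

```python
from math import comb

def count_label_assignments(n_class0, n_class1):
    """Number of distinct two-class labelings with fixed class sizes."""
    return comb(n_class0 + n_class1, n_class0)

# Under this assumption, six samples per class stays below the
# suggested limit of 1000 and seven samples per class exceeds it.
six_per_class = count_label_assignments(6, 6)
seven_per_class = count_label_assignments(7, 7)
```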
Gene List Selection
|10||2013-12-04||Updated documentation from pdf to html|
|9||2012-03-26||Changed default number of permutations to 10000|
|8||2011-08-30||added parameter to specify whether data is log transformed|
|7||2010-05-28||Made improvements to error messages|
|6||2009-12-30||Fixed bug with using res file with paired t-test|
|5||2008-10-24||Added Paired T-Test|
|4||2008-02-19||Added Paired T-Test|
|3||2006-03-03||Added additional metrics|
|2||2005-06-08||Added restricted permutations option and maxT p-value|