Nearest neighbor prediction based on a list of marker genes
Author: Yujin Hoshida (Broad Institute)
Contact:
gp-help@broadinstitute.org
Algorithm Version:
Summary
This module performs class prediction using a predefined list of marker genes for two or more classes [1,2]. Such marker genes are usually selected based on a statistic such as fold change, t-statistic, correlation coefficient, or regression coefficient. These numerical values indicate the relative importance of each gene in the list and often help improve the accuracy of the class prediction. For 2-class prediction, the module can optionally use these values as gene weights. As measures of the significance of the prediction for each sample, the module computes a nominal p-value, a false discovery rate (FDR) [3], and a Bonferroni-corrected p-value based on a random resampling test.
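As a rough illustration of the idea described above, the following Python sketch assigns one sample to the nearest class template and estimates a nominal p-value by resampling random genes. It is not the module's R implementation: the 0/1 templates, the cosine distance, the add-one p-value estimate, and all function and variable names are assumptions made for this example, and gene weights are left out.

```python
import math
import random


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0  # treat a zero vector as uncorrelated with everything
    return 1.0 - dot / (norm_u * norm_v)


def predict_sample(sample, marker_rows, marker_classes,
                   n_resamplings=1000, seed=0):
    """Predict the class of one sample and estimate a nominal p-value.

    sample         -- per-gene expression values for one sample
    marker_rows    -- indices into `sample` of the marker genes
    marker_classes -- class label (1, 2, ...) of each marker gene
    """
    rng = random.Random(seed)
    classes = sorted(set(marker_classes))
    # Assumed template: 1 for a class's own markers, 0 for the others.
    templates = {
        c: [1.0 if mc == c else 0.0 for mc in marker_classes]
        for c in classes
    }
    observed = [sample[i] for i in marker_rows]
    distances = {c: cosine_distance(observed, templates[c]) for c in classes}
    predicted = min(classes, key=lambda c: distances[c])

    # Null distribution: distance from randomly drawn genes to the template.
    template = templates[predicted]
    null_hits = 0
    for _ in range(n_resamplings):
        random_rows = rng.sample(range(len(sample)), len(marker_rows))
        random_values = [sample[i] for i in random_rows]
        if cosine_distance(random_values, template) <= distances[predicted]:
            null_hits += 1
    # Add-one estimate, so the p-value is never exactly zero.
    p_value = (null_hits + 1) / (n_resamplings + 1)
    return predicted, distances[predicted], p_value


# Toy sample: genes 0-1 are class 1 markers, genes 2-3 are class 2 markers.
sample = [2.1, 1.8, -1.5, -2.0, 0.1, -0.2, 0.3, 0.0, -0.1, 0.2]
predicted, distance, p_value = predict_sample(
    sample, marker_rows=[0, 1, 2, 3], marker_classes=[1, 1, 2, 2]
)
```

In this toy sample the class 1 markers are high and the class 2 markers are low, so the class 1 template is the nearer one.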
Graphic output, including a heatmap of the marker genes, is generated if the user's graphics device supports PNG. Alternatively, users can run the HeatMapViewer module to generate a heatmap of the marker genes from the .gct and .cls files output by the NearestTemplatePrediction module.
References
1. Hoshida Y, et al. Gene Expression in Fixed Tissues and Outcome in Hepatocellular Carcinoma. N Engl J Med 2008;359(19):1995-2004.
2. Xu L, et al. Gene Expression Changes in an Animal Melanoma Model Correlate with Aggressiveness of Human Melanoma Metastases. Molecular Cancer Research 2008;6:760-769.
3. Reiner A, Yekutieli D, Benjamini Y. Identifying differentially expressed genes using false discovery rate controlling procedures. Bioinformatics 2003;19(3):368-375.
Parameters
Name | Description |
---|---|
input exp filename * | Gene expression data set (.gct) |
input features filename * | List of marker genes (.txt): Probe ID, Gene name, Class (1,2,...), Weight (optional) |
output name * | Name for output files |
distance selection * | Distance metric |
weight genes * | Whether to weight genes (by statistic, fold change, etc.); applies only to 2-class prediction |
num resamplings * | Number of resamplings used to generate the null distribution for the distance metric |
GenePattern output * | Whether to create .gct and .cls files for GenePattern |
random seed * | Random seed |
* - required
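The number of resamplings limits how small a nominal p-value can be, which matters once p-values are corrected across samples. For instance, if the p-value is estimated as (hits + 1) / (n + 1), which is an assumption here rather than a documented detail of the module, then 1000 resamplings cannot yield a value below roughly 0.001. The sketch below shows the two standard corrections named in the Summary, applied to a list of per-sample nominal p-values. The function names are hypothetical, and the Benjamini-Hochberg step-up procedure is assumed as the FDR method.

```python
def bonferroni(p_values):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical nominal p-values for four samples.
nominal = [0.001, 0.02, 0.03, 0.5]
fdr = benjamini_hochberg(nominal)
bonf = bonferroni(nominal)
```

Bonferroni is the stricter of the two: for the same input it never produces a smaller adjusted value than the Benjamini-Hochberg procedure.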
Input Files
- <input.exp.filename>
Gene expression dataset in GCT format.
- <input.features.filename>
Format for input features file:
Tab-delimited text file. The first row contains the column headers. The first column must use the same annotation system (identifiers) as the first column of the input gene expression (.gct) file.
Gene ID #1 | Gene ID #2 | Class (should be 1, 2, ...) | Weight value (optional) |
---|---|---|---|
id1 | gene1 | 1 | 4.3 |
id2 | gene2 | 1 | 3.8 |
id3 | gene3 | 2 | -1.2 |
id4 | gene4 | 2 | -3.2 |
... | ... | ... | ... |
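A file in this layout can be checked before it is uploaded. The following Python sketch reads the four columns described above and treats the weight as optional. The function name `read_features`, the dictionary keys, and the header text in the example are hypothetical and chosen only for illustration.

```python
import csv
import io


def read_features(handle):
    """Parse a marker-gene file into a list of dicts.

    Expects a header row followed by tab-delimited rows:
    probe ID, gene name, class (1, 2, ...), and an optional weight.
    """
    reader = csv.reader(handle, delimiter="\t")
    next(reader, None)  # skip the header row
    features = []
    for row in reader:
        if not row or not row[0].strip():
            continue  # ignore blank lines
        has_weight = len(row) > 3 and row[3].strip() != ""
        features.append(
            {
                "probe_id": row[0].strip(),
                "gene_name": row[1].strip(),
                "class": int(row[2]),
                "weight": float(row[3]) if has_weight else None,
            }
        )
    return features


# In-memory stand-in for a features file with the rows shown above.
example = io.StringIO(
    "Probe ID\tGene name\tClass\tWeight\n"
    "id1\tgene1\t1\t4.3\n"
    "id2\tgene2\t1\t3.8\n"
    "id3\tgene3\t2\t-1.2\n"
    "id4\tgene4\t2\t-3.2\n"
)
features = read_features(example)
```

For a real file, pass an open handle instead, for example `open(path, newline="")`, where `path` is a placeholder for the file location.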
Output Files
- <output name>_prediction_result.xls
Prediction result
- <output name>_features.xls
List of marker genes
- <output name>_heatmap.png
Heatmap of marker genes
- <output name>_FDR_sample_bar.png
Predicted sample labels at FDR < 0.05
- <output name>_FDR.png
Plot of FDR
- <output name>_heatmap_legend.png
Color map for SD -3 to +3
- <output name>_sorted.dataset.gct
- <output name>_predicted_(un)sorted.cls
- <output name>_sample_info.txt
Example Data
Example files from Hoshida, 2008 [1] are available: Train_Liver.gct, Hoshida_Survival_signature.txt
Platform Dependencies
Task Type:
Prediction
CPU Type:
any
Operating System:
any
Language:
R-2.15.3
Version Comments
Version | Release Date | Description |
---|---|---|
4 | 2015-12-02 | Updated to use R-3.1 and added HTML documentation |
3 | 2012-07-23 | Fixed bug in heatmap creation |
2 | 2011-03-30 | Fixed errors with creating the images |
1 | 2009-04-09 | |