NearestTemplatePrediction (v4)

Nearest neighbor prediction based on a list of marker genes

Author: Yujin Hoshida (Broad Institute)

Contact:

gp-help@broadinstitute.org

Algorithm Version:

Summary

This module performs class prediction using a predefined list of marker genes for two or more classes [1,2]. Such marker genes are usually selected by a statistic such as fold change, t-statistic, correlation coefficient, or regression coefficient. These values indicate the relative importance of each gene in the list and often help improve prediction accuracy. For 2-class prediction, the module can optionally use them as gene weights. As measures of the significance of each sample's prediction, the module computes a nominal p-value, a false discovery rate (FDR) [3], and a Bonferroni-corrected p-value, all based on a random resampling test.
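The idea in the summary above can be illustrated with a minimal Python sketch. It is not the module's own code: the unweighted 0/1 templates, the cosine distance, the way the null distribution is drawn, and all function names are assumptions made for illustration. The module itself lets the user choose the distance metric and can weight genes for 2-class prediction.

```python
import math
import random


def cosine_distance(u, v):
    """Return 1 - cosine similarity (1.0 if either vector is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


def build_templates(marker_classes):
    """One template per class: 1 for that class's marker genes, 0 elsewhere.

    Assumption: unweighted templates; gene weights are not modeled here.
    """
    return {
        c: [1.0 if k == c else 0.0 for k in marker_classes]
        for c in sorted(set(marker_classes))
    }


def predict_sample(marker_values, marker_classes, background_values,
                   n_resamplings=1000, rng=None):
    """Assign the class of the nearest template and estimate a nominal p-value.

    The null distribution is built by drawing the same number of values at
    random from the sample's background expression (an assumed scheme).
    """
    rng = rng or random.Random(0)
    templates = build_templates(marker_classes)
    distances = {c: cosine_distance(marker_values, t)
                 for c, t in templates.items()}
    predicted = min(distances, key=distances.get)
    observed = distances[predicted]

    hits = 0
    for _ in range(n_resamplings):
        drawn = rng.sample(background_values, len(marker_values))
        null_distance = min(cosine_distance(drawn, t)
                            for t in templates.values())
        if null_distance <= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    p_value = (hits + 1) / (n_resamplings + 1)
    return predicted, observed, p_value


# Toy sample: class 1 markers are high, class 2 markers are low.
rng = random.Random(42)
background = [rng.gauss(0, 1) for _ in range(500)]
predicted, observed, p_value = predict_sample(
    [2.0, 1.8, -1.5, -2.1], [1, 1, 2, 2], background,
    n_resamplings=200, rng=rng)
```

Fixing the random generator's seed, as the module's random seed parameter does, makes the resampled p-values reproducible between runs.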

 

Graphic output, including a heatmap of the marker genes, is generated if the user's graphics device supports PNG. Alternatively, users can run the HeatMapViewer module on the .gct and .cls files output by the NearestTemplatePrediction module to generate a heatmap of the marker genes.

References

  1. Hoshida Y, et al. Gene Expression in Fixed Tissues and Outcome in Hepatocellular Carcinoma. N Engl J Med 2008;359(19):1995-2004.
  2. Xu L, et al. Gene Expression Changes in an Animal Melanoma Model Correlate with Aggressiveness of Human Melanoma Metastases. Mol Cancer Res 2008;6:760-769.
  3. Reiner A, Yekutieli D, Benjamini Y. Identifying differentially expressed genes using false discovery rate controlling procedures. Bioinformatics 2003;19(3):368-75.

Parameters

Name                        Description
input exp filename *        Gene expression data set (.gct)
input features filename *   List of marker genes (.txt): Probe ID, Gene name, Class (1, 2, ...), Weight (optional)
output name *               Name for output files
distance selection *        Distance metric
weight genes *              Whether to weight genes (by statistic, fold change, etc.; 2-class prediction only)
num resamplings *           Number of resamplings used to generate the null distribution for the distance metric
GenePattern output *        Create .gct and .cls files for GenePattern
random seed *               Random seed

* - required

Input Files

  1. <input.exp.filename>
    Gene expression dataset in GCT format.
  2. <input.features.filename>
    Format for input features file:
    Tab-delimited text file. The first row contains the column headers. The first column must use the same annotation system (e.g. probe IDs) as the first column of the input gene expression (.gct) file.

    Gene ID #1         Gene ID #2          Class                   Weight value
    (e.g. Probe ID)    (e.g. Gene name)    (should be 1, 2, ...)   (optional)
    id1                gene1               1                       4.3
    id2                gene2               1                       3.8
    id3                gene3               2                       -1.2
    id4                gene4               2                       -3.2
    ...                ...                 ...                     ...
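As a rough illustration of the file layout described above, the following Python sketch reads a features file into a list of records and checks the marker IDs against a set of row IDs from the expression data. The function name `read_features`, the header labels in the example text, and the `gct_ids` set are hypothetical; only the column order (ID, gene name, class, optional weight) comes from this document.

```python
import csv
import io


def read_features(handle):
    """Parse a tab-delimited marker gene file; the first row is a header."""
    reader = csv.reader(handle, delimiter="\t")
    next(reader)  # skip the header row
    features = []
    for row in reader:
        if not row or not row[0].strip():
            continue  # ignore blank lines
        # The fourth column (weight) is optional.
        has_weight = len(row) > 3 and row[3].strip()
        features.append({
            "id": row[0].strip(),
            "name": row[1].strip(),
            "class": int(row[2]),
            "weight": float(row[3]) if has_weight else None,
        })
    return features


# Example content mirroring the table above (header labels are made up).
example = (
    "ProbeID\tGeneName\tClass\tWeight\n"
    "id1\tgene1\t1\t4.3\n"
    "id2\tgene2\t1\t3.8\n"
    "id3\tgene3\t2\t-1.2\n"
    "id4\tgene4\t2\t-3.2\n"
)
features = read_features(io.StringIO(example))

# Marker IDs must match the first column of the .gct file;
# gct_ids stands in for the IDs read from that file.
gct_ids = {"id1", "id2", "id3", "id5"}
missing = [f["id"] for f in features if f["id"] not in gct_ids]
```

Listing the IDs in `missing` before running the prediction is a cheap way to catch a mismatch between the annotation systems of the two input files.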

Output Files

  1. <output name>_prediction_result.xls
    Prediction result
  2. <output name>_features.xls
    List of marker genes
  3. <output name>_heatmap.png
    Heatmap of marker genes
  4. <output name>_FDR_sample_bar.png
    Predicted sample labels at FDR < 0.05
  5. <output name>_FDR.png
    Plot of FDR
  6. <output name>_heatmap_legend.png
    Color map for the heatmap (SD -3 to +3)
  7. <output name>_sorted.dataset.gct
  8. <output name>_predicted_(un)sorted.cls
  9. <output name>_sample_info.txt
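To show how per-sample p-values relate to the FDR < 0.05 labels in the output above, here is a small Python sketch of two standard corrections. It assumes the Benjamini-Hochberg step-up procedure for the FDR; this document cites Reiner et al. [3] but does not state which procedure the module uses, so treat that choice, and the function names, as assumptions.

```python
def bonferroni(p_values):
    """Bonferroni correction: multiply each p-value by the number of tests."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Nominal p-values for four samples (made-up numbers).
nominal = [0.001, 0.02, 0.03, 0.5]
fdr = benjamini_hochberg(nominal)
bonf = bonferroni(nominal)

# Keep a predicted label only when the sample passes FDR < 0.05.
confident = [q < 0.05 for q in fdr]
```

With these numbers, three samples pass the FDR threshold while only one passes a 0.05 cutoff on the Bonferroni-corrected values, which illustrates how much more conservative the Bonferroni correction is.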

Example Data

Example files from Hoshida, 2008 [1] are available: Train_Liver.gct, Hoshida_Survival_signature.txt

Platform Dependencies

Task Type:
Prediction

CPU Type:
any

Operating System:
any

Language:
R-2.15.3

Version Comments

Version  Release Date  Description
4        2015-12-02    Updated to use R-3.1 and added HTML documentation
3        2012-07-23    Fixed bug in heatmap creation
2        2011-03-30    Fixed errors with creating the images
1        2009-04-09