GenePattern - NearestTemplatePrediction (v4)

Nearest neighbor prediction based on a list of marker genes

Author: Yujin Hoshida (Broad Institute)

Contact:

gp-help@broadinstitute.org

Algorithm Version:

Summary

This module performs class prediction using a predefined list of marker genes for two or more classes [1,2]. Such marker genes are usually selected by a statistic such as fold change, t-statistic, correlation coefficient, or regression coefficient. These values indicate the relative importance of each gene in the list and often improve the accuracy of the class prediction. For 2-class prediction, the module can optionally use them as weights. As measures of the significance of the prediction for each sample, the module computes a nominal p-value, a false discovery rate (FDR) [3], and a Bonferroni-corrected p-value, all based on a random resampling test.
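The following is a minimal Python sketch of the general idea, not the module's actual R implementation. It assumes one binary template per class (1 for that class's marker genes, 0 otherwise), cosine distance as the metric, a null distribution built from randomly drawn gene sets of the same size, and a p-value computed as (hits + 1) / (resamplings + 1). None of these details are stated in this document, and gene weights are not handled. The function names `cosine_distance` and `predict_sample` are hypothetical.

```python
import math
import random


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 1.0  # treat an all-zero vector as uninformative
    return 1.0 - dot / (norm_u * norm_v)


def predict_sample(sample, marker_class, n_resamplings=1000, seed=0):
    """Predict the class of one sample and estimate a nominal p-value.

    sample:       dict of gene id -> standardized expression value
    marker_class: dict of marker gene id -> class label (1, 2, ...)
    """
    rng = random.Random(seed)
    markers = [g for g in marker_class if g in sample]
    classes = sorted(set(marker_class[g] for g in markers))

    # Assumed template: 1 for the class's own markers, 0 for the others.
    templates = {
        c: [1.0 if marker_class[g] == c else 0.0 for g in markers]
        for c in classes
    }
    observed = [sample[g] for g in markers]
    distances = {c: cosine_distance(observed, templates[c]) for c in classes}
    predicted = min(distances, key=distances.get)

    # Assumed null: distances obtained from random gene sets of equal size.
    all_values = list(sample.values())
    hits = 0
    for _ in range(n_resamplings):
        random_values = rng.sample(all_values, len(markers))
        null_distance = cosine_distance(random_values, templates[predicted])
        if null_distance <= distances[predicted]:
            hits += 1
    p_value = (hits + 1) / (n_resamplings + 1)
    return predicted, distances[predicted], p_value


# Toy sample: four marker genes plus fifty background genes.
sample = {"id1": 2.0, "id2": 1.5, "id3": -1.0, "id4": -1.2}
sample.update({f"bg{i}": ((i % 7) - 3) * 0.3 for i in range(50)})
marker_class = {"id1": 1, "id2": 1, "id3": 2, "id4": 2}

predicted, distance, p_value = predict_sample(sample, marker_class)
print(predicted, round(distance, 3), p_value)
```

In this sketch, the number of resamplings sets the smallest p-value that can be reported, and fixing the seed makes the result reproducible.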

Graphic output, including a heatmap of the marker genes, is generated if the user's graphics device supports PNG. Alternatively, users can run the HeatMapViewer module on the .gct and .cls files output by the NearestTemplatePrediction module to generate a heatmap of the marker genes.

References

1. Hoshida Y, et al. Gene Expression in Fixed Tissues and Outcome in Hepatocellular Carcinoma. N Engl J Med 2008;359(19):1995-2004.
2. Xu L, et al. Gene Expression Changes in an Animal Melanoma Model Correlate with Aggressiveness of Human Melanoma Metastases. Molecular Cancer Research 2008;6:760-769.
3. Reiner A, Yekutieli D, Benjamini Y. Identifying differentially expressed genes using false discovery rate controlling procedures. Bioinformatics 2003;19(3):368-375.

Parameters

Name	Description
input exp filename *	Gene expression data set (.gct)
input features filename *	List of marker genes (.txt): Probe ID, Gene name, Class (1,2,...), Weight (optional)
output name *	Name for output files
distance selection *	Distance metric
weight genes *	Whether to weight genes (by statistic, fold change, etc.; 2-class prediction only)
num resamplings *	Number of resamplings used to generate the null distribution for the distance metric
GenePattern output *	Create .gct and .cls files for GenePattern
random seed *	Random seed

* - required

Input Files

<input.exp.filename>
Gene expression dataset in GCT format.
<input.features.filename>
Format for input features file:
Tab-delimited text file. The first row contains column headers. The first column should use the same annotation system as the first column of the input gene expression (.gct) file.

Gene ID #1 (e.g. Probe ID)	Gene ID #2 (e.g. Gene name)	Class (should be 1, 2, ...)	Weight value (optional)
id1	gene1	1	4.3
id2	gene2	1	3.8
id3	gene3	2	-1.2
id4	gene4	2	-3.2
...	...	...	...
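As an illustration of the layout above, here is a small Python sketch that reads a features file into a list of records. The function name `read_features` and the dictionary keys are hypothetical, and the sketch assumes a single header row and an optional fourth column, as described above. It is not part of the module.

```python
import csv
import io


def read_features(handle):
    """Parse a marker gene list: ID, gene name, class, optional weight."""
    reader = csv.reader(handle, delimiter="\t")
    next(reader, None)  # skip the header row
    features = []
    for row in reader:
        if not row or not row[0].strip():
            continue  # ignore blank lines
        has_weight = len(row) > 3 and row[3].strip() != ""
        features.append({
            "id": row[0].strip(),
            "name": row[1].strip(),
            "class": int(row[2]),
            "weight": float(row[3]) if has_weight else None,
        })
    return features


# The example rows from the table above, as tab-delimited text.
example = (
    "Probe ID\tGene name\tClass\tWeight\n"
    "id1\tgene1\t1\t4.3\n"
    "id2\tgene2\t1\t3.8\n"
    "id3\tgene3\t2\t-1.2\n"
    "id4\tgene4\t2\t-3.2\n"
)
features = read_features(io.StringIO(example))
for feature in features:
    print(feature)
```

For a file on disk, the same function can be called with an open file handle, for example `open(path, newline="")`.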

Output Files

<output name>_prediction_result.xls
Prediction result
<output name>_features.xls
List of marker genes
<output name>_heatmap.png
Heatmap of marker genes
<output name>_FDR_sample_bar.png
Predicted sample labels at FDR < 0.05
<output name>_FDR.png
Plot of FDR
<output name>_heatmap_legend.png
Color map for the heatmap (SD -3 to +3)
<output name>_sorted.dataset.gct
<output name>_predicted_(un)sorted.cls
<output name>_sample_info.txt
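To make the "FDR < 0.05" cutoff used in the outputs above concrete, the sketch below adjusts a list of per-sample nominal p-values and keeps only the predictions that pass the cutoff. It assumes the Benjamini-Hochberg procedure for the FDR, which this document does not confirm, and the function names and example values are hypothetical.

```python
def bonferroni(p_values):
    """Multiply each p-value by the number of tests, capped at 1."""
    n = len(p_values)
    return [min(1.0, p * n) for p in p_values]


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical nominal p-values and predicted labels for four samples.
nominal_p = [0.001, 0.02, 0.03, 0.5]
labels = [1, 2, 1, 2]

fdr = benjamini_hochberg(nominal_p)
bonf = bonferroni(nominal_p)

# Keep a predicted label only when its FDR is below 0.05.
confident = [lab if q < 0.05 else None for lab, q in zip(labels, fdr)]
print(fdr, bonf, confident)
```

The Bonferroni correction is the more conservative of the two, so a sample can pass the FDR cutoff while failing the same cutoff on its Bonferroni-corrected p-value.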

Example Data

Example files from Hoshida, 2008 [1] are available: Train_Liver.gct, Hoshida_Survival_signature.txt

Platform Dependencies

Task Type:
Prediction

CPU Type:
any

Operating System:
any

Language:
R-3.1

Version Comments

Version	Release Date	Description
4	2015-12-02	Updated to use R-3.1 and added HTML documentation
3	2012-07-23	Fixed bug in heatmap creation
2	2011-03-30	Fixed errors with creating the images
1	2009-04-09