Performs single-sample GSEA projection
Project each sample within a data set onto a space of gene set enrichment scores using the ssGSEA projection methodology described in Barbie et al., 2009.
Single-sample GSEA (ssGSEA), an extension of Gene Set Enrichment Analysis (GSEA), calculates separate enrichment scores for each pairing of a sample and gene set. Each ssGSEA enrichment score represents the degree to which the genes in a particular gene set are coordinately up- or down-regulated within a sample.
When analyzing genome-wide transcription profiles from microarray data, a typical goal is to find genes significantly differentially correlated with distinct sample classes defined by a particular phenotype (e.g., tumor vs. normal). These findings can be used to provide insights into the underlying biological mechanisms or to classify (predict the phenotype of) a new sample. Gene Set Enrichment Analysis (GSEA) addressed this problem by evaluating whether a priori defined sets of genes, associated with particular biological processes (such as pathways), chromosomal locations, or experimental results are enriched at either the top or bottom of a list of differentially expressed genes ranked by some measure of differences in a gene’s expression across sample classes. Examples of ranking metrics are fold change for categorical phenotypes (e.g., tumor vs. normal) and Pearson correlation for continuous phenotypes (e.g., age). Enrichment provides evidence for the coordinate up- or down-regulation of a gene set’s members and the activation or repression of some corresponding biological process.
Where GSEA generates a gene set’s enrichment score with respect to phenotypic differences across a collection of samples within a dataset, ssGSEA calculates a separate enrichment score for each pairing of sample and gene set, independent of phenotype labeling. In this manner, ssGSEA transforms a single sample's gene expression profile to a gene set enrichment profile. A gene set's enrichment score represents the activity level of the biological process in which the gene set's members are coordinately up- or down-regulated. This transformation allows researchers to characterize cell state in terms of the activity levels of biological processes and pathways rather than through the expression levels of individual genes.
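To make the per-sample scoring concrete, here is a minimal sketch of how one enrichment score might be computed for one sample and one gene set. It is not the module's actual code. It assumes the score is the accumulated difference between a rank-weighted running distribution of the genes in the set and an unweighted running distribution of the remaining genes, and it assumes the weights are plain ranks raised to the weighting exponent. The names `ssgsea_score` and `project` are hypothetical.

```python
def ssgsea_score(expression, gene_set, weight=0.75):
    """Enrichment score of one gene set in one sample (illustrative sketch).

    expression: dict mapping gene symbol -> expression value for one sample.
    gene_set:   iterable of gene symbols.
    weight:     exponent applied to the rank-based weights (assumed form).
    """
    members = set(gene_set) & set(expression)
    n_genes = len(expression)
    if not members or len(members) == n_genes:
        raise ValueError("gene set must overlap the sample without covering it")

    # Order genes from highest to lowest expression; the top gene gets rank n_genes.
    ordered = sorted(expression, key=expression.get, reverse=True)
    weights = {g: (n_genes - i) ** weight for i, g in enumerate(ordered)}

    hit_total = sum(weights[g] for g in members)
    miss_step = 1.0 / (n_genes - len(members))

    hit_cdf = miss_cdf = score = 0.0
    for gene in ordered:
        if gene in members:
            hit_cdf += weights[gene] / hit_total
        else:
            miss_cdf += miss_step
        # Accumulate the gap between the two running distributions.
        score += hit_cdf - miss_cdf
    return score


def project(samples, gene_sets, weight=0.75):
    """Return {gene set name: {sample id: score}} for every sample/gene set pair."""
    return {
        name: {sid: ssgsea_score(expr, genes, weight) for sid, expr in samples.items()}
        for name, genes in gene_sets.items()
    }
```

With this sketch, a gene set whose members sit near the top of a sample's ranked list gets a positive score, and one whose members sit near the bottom gets a negative score. No phenotype labels are involved at any point.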
In working with the transformed data, the goal is to find biological processes that are differentially active across the phenotype of interest and to use these measures of process activity to characterize the phenotype. Thus, the benefit here is that the ssGSEA projection transforms the data to a higher-level (pathways instead of genes) space representing a more biologically interpretable set of features on which analytic methods can be applied.
As a practical matter, ssGSEAProjection essentially reduces the dimensionality of the data set. You can look for correlations between the gene set enrichment scores and the phenotype of interest (e.g., tumor vs. normal) using a module like ComparativeMarkerSelection. You could also try clustering the projected data set. Gene sets that stand out as strong predictors of the phenotype of interest, or that characterize specific clusters, can then be mapped to biochemical pathways, giving you insight into what is driving the phenotype.
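As a rough illustration of looking for gene sets that track a phenotype, the sketch below ranks gene sets by a signal-to-noise ratio between two sample classes. This is a simple stand-in chosen for brevity, not a description of what ComparativeMarkerSelection computes. The function names and the input layout (a dict of score lists plus a parallel list of class labels) are assumptions made for the example.

```python
from statistics import mean, stdev


def signal_to_noise(scores_a, scores_b):
    """(mean_a - mean_b) / (sd_a + sd_b) for two groups of enrichment scores."""
    spread = stdev(scores_a) + stdev(scores_b)
    if spread == 0:
        raise ValueError("both classes have zero variance")
    return (mean(scores_a) - mean(scores_b)) / spread


def rank_gene_sets(projection, labels, class_a, class_b):
    """Order gene sets by absolute signal-to-noise between two classes.

    projection: dict gene set name -> list of enrichment scores (one per sample).
    labels:     list of class labels, in the same sample order.
    """
    ranked = []
    for name, scores in projection.items():
        a = [s for s, lab in zip(scores, labels) if lab == class_a]
        b = [s for s, lab in zip(scores, labels) if lab == class_b]
        ranked.append((name, signal_to_noise(a, b)))
    return sorted(ranked, key=lambda item: abs(item[1]), reverse=True)
```

Each class needs at least two samples for the standard deviation to be defined. Gene sets at the top of the returned list are the ones whose enrichment scores differ most between the two classes relative to their spread.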
This module implements the single-sample GSEA projection methodology described in Barbie et al., 2009.
|input gct file *||GCT file containing input dataset’s gene expression data.|
|output file prefix||The prefix used for the name of the output GCT file. If unspecified, the output prefix is set to <prefix of input GCT file>.PROJ. The output GCT file will contain the projection of the input dataset onto a space of gene set enrichment scores.|
|gene sets database||Gene sets database from the GSEA website. Note: The gene sets database file and gene sets database list file parameters override this parameter.|
|gene sets database file|
|gene sets database list file||A .txt file containing a list of GMT and GMX gene set description files (one gene set description filename per line). Use this optional parameter when projecting expression data across gene sets that span multiple gene sets database files. The listed gene sets database files must be uploaded to the GenePattern server. This list file is typically generated using the GenePattern ListFiles module. Note: When set, this parameter overrides the gene sets database and gene sets database file parameters.|
|gene symbol column *||Input GCT file column containing gene symbol names. In most cases, this will be column 1. (default: Column 1)|
|gene set selection *||Comma-separated list of gene set names on which to project the input expression data. Alternatively, this field may be set to ALL, indicating that the input expression dataset is to be projected to all gene sets defined in the specified gene set database(s). (default: ALL)|
|sample normalization method *||Normalization method applied to expression data. Supported methods are rank, log.rank, and log. (Default: rank)|
|weighting exponent *||Exponential weight employed in the calculation of enrichment scores. The default value of 0.75 was selected after extensive testing. The module authors strongly recommend against changing it from the default. (Default: 0.75)|
|min gene set size *||Exclude from the projection any gene set whose overlap with the genes listed in the input GCT file is smaller than this value. (Default: 10)|
* - required
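The gene set selection and min gene set size parameters can be pictured with the sketch below. It is an illustration under stated assumptions, not the module's code. It assumes the standard tab-delimited GMT layout (gene set name, description, then one gene symbol per field) and treats the minimum size as a lower bound on the overlap with the dataset's genes. The function names `parse_gmt` and `select_gene_sets` are hypothetical.

```python
def parse_gmt(text):
    """Parse GMT text: name <tab> description <tab> gene1 <tab> gene2 ..."""
    gene_sets = {}
    for line in text.splitlines():
        fields = line.split("\t")
        if len(fields) < 3:
            continue  # skip blank lines and sets with no genes
        name, genes = fields[0], fields[2:]
        gene_sets[name] = {g for g in genes if g}
    return gene_sets


def select_gene_sets(gene_sets, dataset_genes, selection="ALL", min_size=10):
    """Keep the requested gene sets whose overlap with the dataset is large enough.

    selection: "ALL" or a comma-separated list of gene set names.
    min_size:  smallest allowed overlap with the dataset's gene symbols.
    """
    dataset_genes = set(dataset_genes)
    if selection == "ALL":
        wanted = list(gene_sets)
    else:
        wanted = [n.strip() for n in selection.split(",") if n.strip()]

    kept = {}
    for name in wanted:
        # Unknown names are treated as empty sets and therefore dropped.
        overlap = gene_sets.get(name, set()) & dataset_genes
        if len(overlap) >= min_size:
            kept[name] = overlap
    return kept
```

Note that the filter counts only the genes that also appear in the input dataset, so a large gene set can still be excluded if few of its members are present in the GCT file.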
|4||2013-06-17||Updated list of gene sets to include v4.0 MSigDB collections|
|3||2013-02-15||Updated list of gene sets databases to include v3.1 MSigDB collection, updated FTP download code, made documentation more biologist-friendly.|
|2||2012-09-19||Added support for reading of gmx-formatted gene set description files|