Performs single-sample GSEA projection
Project each sample within a data set onto a space of gene set enrichment scores using the ssGSEA projection methodology described in Barbie et al., 2009.
Single-sample GSEA (ssGSEA), an extension of Gene Set Enrichment Analysis (GSEA), calculates separate enrichment scores for each pairing of a sample and gene set. Each ssGSEA enrichment score represents the degree to which the genes in a particular gene set are coordinately up- or down-regulated within a sample.
When analyzing genome-wide transcription profiles from microarray data, a typical goal is to find genes significantly differentially correlated with distinct sample classes defined by a particular phenotype (e.g., tumor vs. normal). These findings can be used to provide insights into the underlying biological mechanisms or to classify (predict the phenotype of) a new sample. Gene Set Enrichment Analysis (GSEA) addressed this problem by evaluating whether a priori defined sets of genes, associated with particular biological processes (such as pathways), chromosomal locations, or experimental results are enriched at either the top or bottom of a list of differentially expressed genes ranked by some measure of differences in a gene’s expression across sample classes. Examples of ranking metrics are fold change for categorical phenotypes (e.g., tumor vs. normal) and Pearson correlation for continuous phenotypes (e.g., age). Enrichment provides evidence for the coordinate up- or down-regulation of a gene set’s members and the activation or repression of some corresponding biological process.
Where GSEA generates a gene set’s enrichment score with respect to phenotypic differences across a collection of samples within a dataset, ssGSEA calculates a separate enrichment score for each pairing of sample and gene set, independent of phenotype labeling. In this manner, ssGSEA transforms a single sample's gene expression profile to a gene set enrichment profile. A gene set's enrichment score represents the activity level of the biological process in which the gene set's members are coordinately up- or down-regulated. This transformation allows researchers to characterize cell state in terms of the activity levels of biological processes and pathways rather than through the expression levels of individual genes.
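To make the per-sample calculation concrete, here is a minimal Python sketch of an enrichment score in the style of Barbie et al., 2009, for one sample and one gene set. It is an illustration under stated assumptions, not the module's own code: the function name `ssgsea_score` is hypothetical, the expression values are rank-normalized within the sample, and the score is taken as the summed difference between a rank-weighted running distribution of genes in the set and a uniform running distribution of the remaining genes. The module's scaling and normalization may differ.

```python
import numpy as np

def ssgsea_score(expression, gene_names, gene_set, weight=0.75):
    """Sketch of a single-sample enrichment score (hypothetical helper)."""
    expression = np.asarray(expression, dtype=float)
    n = expression.size
    # Rank-normalize within the sample: lowest value -> 1, highest -> n.
    # Assumption: ties are broken by input order.
    ranks = np.empty(n)
    ranks[np.argsort(expression, kind="stable")] = np.arange(1, n + 1)
    # Walk the gene list from the highest-ranked gene to the lowest.
    order = np.argsort(-ranks, kind="stable")
    in_set = np.isin(np.asarray(gene_names)[order], list(gene_set))
    if not in_set.any() or in_set.all():
        raise ValueError("gene set must overlap the data set without covering all of it")
    # Genes in the set contribute rank**weight; genes outside contribute equally.
    hit_weights = np.where(in_set, ranks[order] ** weight, 0.0)
    hit_cdf = np.cumsum(hit_weights) / hit_weights.sum()
    miss_cdf = np.cumsum(~in_set) / (~in_set).sum()
    # Sum the difference between the two running distributions.
    return float(np.sum(hit_cdf - miss_cdf))

genes = ["a", "b", "c", "d", "e"]
sample = [5.0, 4.0, 3.0, 2.0, 1.0]
print(ssgsea_score(sample, genes, {"a", "b"}))  # set at the top of the ranking
print(ssgsea_score(sample, genes, {"d", "e"}))  # set at the bottom of the ranking
```

In this sketch a gene set concentrated among the most highly expressed genes yields a positive score, and one concentrated among the least expressed genes yields a negative score. Repeating the call for every sample and gene set gives the gene set by sample matrix that the projection produces.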
In working with the transformed data, the goal is to find biological processes that are differentially active across the phenotype of interest and to use these measures of process activity to characterize the phenotype. The benefit is that the ssGSEA projection transforms the data to a higher-level space (pathways instead of genes), which provides a more biologically interpretable set of features to which analytic methods can be applied.
As a practical matter, ssGSEAProjection essentially reduces the dimensionality of the data set. You can look for correlations between the gene set enrichment scores and the phenotype of interest (e.g., tumor vs. normal) by passing the GCT output to a module such as ComparativeMarkerSelection. You could also try clustering the projected data set. With either approach, the gene sets that stand out as strong predictors of the phenotype, or that characterize specific clusters, can then be mapped to biochemical pathways, giving you insight into what is driving the phenotype of interest.
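As a rough illustration of looking for gene sets whose scores differ between two classes, the following Python sketch ranks gene sets by a Welch t-statistic. It is a simple stand-in and does not reproduce ComparativeMarkerSelection. The function name `rank_gene_sets_by_class` and the toy data are hypothetical, and the sketch assumes exactly two classes with at least two samples each.

```python
import numpy as np

def rank_gene_sets_by_class(scores, set_names, labels):
    """Rank gene sets by a Welch t-statistic between two sample classes."""
    scores = np.asarray(scores, dtype=float)  # rows: gene sets, columns: samples
    labels = np.asarray(labels)
    classes = np.unique(labels)
    if classes.size != 2:
        raise ValueError("this sketch handles exactly two classes")
    a = scores[:, labels == classes[0]]
    b = scores[:, labels == classes[1]]
    # Standard error of the difference in means, without assuming equal variances.
    se = np.sqrt(a.var(axis=1, ddof=1) / a.shape[1] + b.var(axis=1, ddof=1) / b.shape[1])
    t = (a.mean(axis=1) - b.mean(axis=1)) / se
    # Strongest differences first, regardless of direction.
    order = np.argsort(-np.abs(t))
    return [(set_names[i], float(t[i])) for i in order]

# Toy enrichment scores for two gene sets across four samples.
scores = [[1.0, 1.1, 5.0, 5.2],
          [2.0, 2.1, 2.05, 2.0]]
labels = ["normal", "normal", "tumor", "tumor"]
for name, t in rank_gene_sets_by_class(scores, ["SET_A", "SET_B"], labels):
    print(name, round(t, 2))
```

The sign of each statistic follows the alphabetical order of the class labels, so here a negative value means the gene set scores higher in the tumor samples.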
While the output GCT file can be passed to any module that accepts that format, it does not make sense to run it through GSEA, because its rows are gene set enrichment scores rather than gene expression values.
This module implements the single-sample GSEA projection methodology described in Barbie et al., 2009.
|input gct file *||GCT file containing input dataset’s gene expression data. The dataset must contain gene expression data for a minimum of two samples.|
|output file prefix||The prefix used for the name of the output GCT file. If unspecified, the output prefix is set to <prefix of input GCT file>.PROJ. The output GCT file contains the projection of the input dataset onto a space of gene set enrichment scores.|
|gene sets database||Gene sets database from the GSEA website. Note: The gene sets database file and gene sets database list file parameters override this parameter.|
|gene sets database file|
|gene sets database list file||A .txt file containing a list of GMT and GMX gene set description files (one gene set description filename per line). Use this optional parameter when projecting expression data across gene sets that span multiple gene sets database files. The listed gene sets database files must be uploaded to the GenePattern server. This list file is typically generated using the GenePattern ListFiles module. Note: When set, this optional parameter overrides the gene sets database and gene sets database file parameters.|
|gene symbol column *||Input GCT file column containing gene symbol names. In most cases, this will be column 1. (Default: Column 1)|
|gene set selection *||Comma-separated list of gene set names on which to project the input expression data. Alternatively, this field may be set to ALL, indicating that the input expression dataset is to be projected to all gene sets defined in the specified gene set database(s). (Default: ALL)|
|sample normalization method *||Normalization method applied to expression data. Supported methods are rank, log.rank, and log. (Default: rank)|
|weighting exponent *||Exponential weight employed in the calculation of enrichment scores. The default value of 0.75 was selected after extensive testing. The module authors strongly recommend against changing it from the default. (Default: 0.75)|
|min gene set size *||Exclude from the projection any gene set whose overlap with the genes listed in the input GCT file is smaller than this value. (Default: 10)|
|combine mode *||
Options for combining enrichment scores for paired *_UP and *_DN gene sets. (Default: combine.add)
For gene set collections that do not use _UP and _DN suffixes at the ends of set names, the combine mode parameter is irrelevant, as all the modes give the same output.
For gene set collections that use _UP and _DN suffixes, which include MSigDB v5's C2.all, C2.CGP, C6.all, and C7.all, the paired sets can be recombined in two different ways:
* - required
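The gene set selection and min gene set size parameters together decide which gene sets are projected. The Python sketch below illustrates that filtering on GMT-formatted input (tab-separated: set name, description, then gene symbols). The helper names `parse_gmt` and `select_gene_sets` are hypothetical, and skipping unknown set names silently is an assumption rather than documented module behavior.

```python
def parse_gmt(lines):
    """Parse GMT lines into a dict of gene set name -> set of gene symbols."""
    gene_sets = {}
    for line in lines:
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) < 3:
            continue  # not a complete GMT row
        gene_sets[fields[0]] = {gene for gene in fields[2:] if gene}
    return gene_sets

def select_gene_sets(gene_sets, dataset_genes, selection="ALL", min_size=10):
    """Keep the requested gene sets whose overlap with the data set is large enough."""
    dataset_genes = set(dataset_genes)
    if selection == "ALL":
        wanted = list(gene_sets)
    else:
        wanted = [name.strip() for name in selection.split(",") if name.strip()]
    kept = {}
    for name in wanted:
        if name not in gene_sets:
            continue  # assumption: unknown names are ignored
        overlap = gene_sets[name] & dataset_genes
        if len(overlap) >= min_size:
            kept[name] = overlap
    return kept

gmt_lines = [
    "SET_A\tdemo set\tTP53\tMYC\tEGFR\n",
    "SET_B\tdemo set\tBRCA1\tKRAS\n",
]
dataset_genes = ["TP53", "MYC", "EGFR", "KRAS"]
print(select_gene_sets(parse_gmt(gmt_lines), dataset_genes, min_size=2))
```

With a minimum size of 2 in this toy example, SET_A is kept because three of its genes are in the data set, and SET_B is excluded because only one of its genes is present.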
The GCT file must contain gene expression data for at least two samples.
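The two-sample requirement can be checked before submitting a file. The Python sketch below reads the header of a GCT version 1.2 file (a `#1.2` line, a line with the row and sample counts, then a header row of Name, Description, and one column per sample). The function names are hypothetical, and other GCT versions are deliberately rejected rather than handled.

```python
import io

def gct_sample_names(handle):
    """Return the sample names from a GCT v1.2 file-like object."""
    version = handle.readline().strip()
    if version != "#1.2":
        raise ValueError(f"unsupported GCT version line: {version!r}")
    # Line 2 holds the number of data rows and the number of samples.
    _, n_samples = (int(x) for x in handle.readline().split()[:2])
    # Line 3 is the header: Name, Description, then one column per sample.
    header = handle.readline().rstrip("\r\n").split("\t")
    samples = header[2:]
    if len(samples) != n_samples:
        raise ValueError("sample count on line 2 does not match the header row")
    return samples

def check_minimum_samples(handle, minimum=2):
    """Raise ValueError if the GCT file has fewer samples than required."""
    samples = gct_sample_names(handle)
    if len(samples) < minimum:
        raise ValueError(f"need at least {minimum} samples, found {len(samples)}")
    return samples

demo = "#1.2\n2\t2\nName\tDescription\tS1\tS2\nTP53\tna\t5.1\t4.8\nMYC\tna\t2.0\t7.3\n"
print(check_minimum_samples(io.StringIO(demo)))
```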
In the case of experimentally derived gene sets with _UP and _DN suffixes appended to otherwise identical gene set names, the combine.add and combine.replace modes create a combined gene set whose name has the suffix removed. combine.add adds the combined set to the output alongside the original pair, while combine.replace replaces the original pair with it. Either way, the output contains new gene set names, which may affect downstream applications that use these files in combination with the original gene set collection file. For compatibility with combine.add output, check that downstream applications use subsets of the gene sets within a collection.
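The naming effect of the two modes can be sketched in Python as follows. The function name `combine_up_dn` is hypothetical, and computing the combined score as the _UP score minus the _DN score is an assumption made for illustration. This documentation states only that a combined set with the suffix removed is added or substituted.

```python
def combine_up_dn(scores, mode="combine.add"):
    """Combine paired *_UP / *_DN gene set scores (per-sample lists)."""
    if mode not in ("combine.add", "combine.replace"):
        raise ValueError(f"unsupported mode in this sketch: {mode}")
    result = dict(scores)
    for name in scores:
        if not name.endswith("_UP"):
            continue
        base = name[: -len("_UP")]
        dn_name = base + "_DN"
        if dn_name not in scores:
            continue  # unpaired sets are left unchanged
        # Assumption: the combined score is the UP score minus the DN score.
        result[base] = [up - dn for up, dn in zip(scores[name], scores[dn_name])]
        if mode == "combine.replace":
            # Replace the original pair with the combined set.
            del result[name]
            del result[dn_name]
    return result

scores = {"X_UP": [1.0, 2.0], "X_DN": [0.5, 0.5], "Y": [3.0, 3.0]}
print(sorted(combine_up_dn(scores, "combine.add")))
print(sorted(combine_up_dn(scores, "combine.replace")))
```

In both modes the output gains the name X, which does not exist in the original collection. This is the compatibility concern described above.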
|7||2016-02-04||Updated to give access to MSigDB v5.1. Updated to R-2.15 and made fixes for portability.|
|6||2015-06-16||Added built-in support for MSigDB v5.0, which includes new hallmark gene sets.|
|5||2014-08-11||Added combine mode parameter|
|4||2013-06-17||Updated list of gene sets to include v4.0 MSigDB collections|
|3||2013-02-15||Updated list of gene sets databases to include v3.1 MSigDB collection, updated FTP download code, made documentation more biologist-friendly.|
|2||2012-09-19||Added support for reading GMX-formatted gene set description files|