GISTIC_2.0 (v6.3)

Genomic Identification of Significant Targets in Cancer (version 2.0.22)

Author: Steven Schumacher, Jen Dobson, Rameen Beroukhim, Gad Getz

Contact:

Algorithm Version: 2.0.22

Summary

The GISTIC module identifies regions of the genome that are significantly amplified or deleted across a set of samples. Each aberration is assigned a G-score that considers the amplitude of the aberration as well as the frequency of its occurrence across samples. False Discovery Rate q-values are then calculated for the aberrant regions, and regions with q-values below a user-defined threshold are considered significant. 
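The idea of combining amplitude and frequency into a score, and then converting p-values to q-values, can be illustrated with a toy sketch. This is not the GISTIC 2.0 scoring model, which uses its own probabilistic framework. The function names, the "frequency times mean amplitude" score, and the use of the Benjamini-Hochberg procedure are all assumptions made for illustration.

```python
from typing import List


def g_scores(log2_ratios: List[List[float]], threshold: float) -> List[float]:
    """Toy amplification score per marker: frequency x mean amplitude.

    log2_ratios[s][m] is the log2 ratio of sample s at marker m.
    Only values above `threshold` count as amplified.
    """
    n_samples = len(log2_ratios)
    n_markers = len(log2_ratios[0])
    scores = []
    for m in range(n_markers):
        amplified = [row[m] for row in log2_ratios if row[m] > threshold]
        frequency = len(amplified) / n_samples
        mean_amplitude = sum(amplified) / len(amplified) if amplified else 0.0
        scores.append(frequency * mean_amplitude)
    return scores


def bh_q_values(p_values: List[float]) -> List[float]:
    """Benjamini-Hochberg q-values, returned in the input order."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    q = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down so the q-values stay monotone.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * n / rank)
        q[i] = running_min
    return q
```

Regions whose q-value falls below the user-defined threshold (the -qvt parameter) would then be reported as significant.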

For each significant region, a “peak region” is identified, which is the part of the aberrant region with greatest amplitude and frequency of alteration. In addition, a “wide peak” is determined using a leave-one-out algorithm to allow for errors in the boundaries in a single sample. The “wide peak” boundaries are more robust for identifying the most likely gene targets in the region. 

Each significantly aberrant region is also tested to determine whether it results primarily from broad events (longer than half a chromosome arm), focal events, or significant levels of both. The GISTIC module reports the genomic locations and calculated q-values for the aberrant regions. It identifies the samples that exhibit each significant amplification or deletion, and it lists genes found in each “wide peak” region. 
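The broad versus focal distinction can be sketched as a length comparison against the chromosome arm. The function name and coordinate arguments below are hypothetical, and the default cutoff of 0.5 simply mirrors the "longer than half a chromosome arm" wording above; the module's own classification may differ in detail.

```python
def classify_event(start: int, end: int, arm_start: int, arm_end: int,
                   broad_length_cutoff: float = 0.5) -> str:
    """Label an event 'broad' or 'focal' by the fraction of the arm it covers."""
    arm_length = arm_end - arm_start
    if arm_length <= 0:
        raise ValueError("arm_end must be greater than arm_start")
    # Only the part of the event that lies on this arm is counted.
    overlap = max(0, min(end, arm_end) - max(start, arm_start))
    fraction = overlap / arm_length
    return "broad" if fraction > broad_length_cutoff else "focal"
```

For example, an event covering 60% of an arm would be labeled broad under the default cutoff, while one covering 10% would be labeled focal.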

Note: The GISTIC module is memory-intensive. 

References

Mermel C, Schumacher S, et al. GISTIC2.0 facilitates sensitive and confident localization of the targets of focal somatic copy-number alteration in human cancers. Genome Biology. 2011;12:R41. 

Beroukhim R, Mermel C, et al. The landscape of somatic copy-number alteration across human cancers. Nature. 2010;463:899-905. 

Parameters

Name Flag Description
refgene file * -refgene The reference file including cytoband and gene location information.
seg file * -seg The segmentation file contains the segmented data for all the samples identified by GLAD, CBS, or some other segmentation algorithm. (See GLAD file format in the GenePattern file formats documentation.) It is a six-column, tab-delimited file with an optional first line identifying the columns. Positions are in base pair units.
markers file * -mk The markers file identifies the marker names and positions of the markers in the original dataset (before segmentation). It is a three-column, tab-delimited file with an optional header. If the markers are not already in genomic order, they are sorted by genomic position.
array list file  -alf The array list file is an optional file identifying the subset of samples to be used in the analysis. It is a one column file with an optional header. The sample identifiers listed in the array list file must match the sample names given in the segmentation file.
cnv file  -cnv There are two options for the cnv file. The first option allows CNVs to be identified by marker name. The second option allows the CNVs to be identified by genomic location.
gene gistic * -genegistic Flag indicating that the gene GISTIC algorithm should be used to calculate the significance of deletions at a gene level instead of a marker level.
amplifications threshold * -ta Threshold for copy number amplifications. Regions with a log2 ratio above this value are considered amplified.
deletions threshold * -td Threshold for copy number deletions. Regions with a log2 ratio below the negative of this value are considered deletions.
join segment size * -js Smallest number of markers to allow in segments from the segmented data. Segments that contain a number of markers less than or equal to this number are joined to the neighboring segment that is closest in copy number.
qv thresh * -qvt Significance threshold for q-values. Regions with q-values below this value are considered significant.
remove X * -rx Flag indicating whether to remove data from the X-chromosome before analysis.
cap val * -cap Minimum and maximum cap values on analyzed data. Regions with a log2 ratio greater than the cap are set to the cap value; regions with a log2 ratio less than -cap are set to -cap.
confidence level * -conf Confidence level used to calculate the region containing a driver.
run broad analysis * -broad Flag indicating whether an additional broad-level analysis should be performed.
broad length cutoff * -brlen Threshold used to distinguish broad from focal events, given in units of fraction of chromosome arm.
max sample segs * -maxseg Maximum number of segments allowed for a sample in the input data. Samples with more segments than this threshold are excluded from the analysis.
arm peel * -armpeel Flag indicating whether to perform arm-level peel-off, which helps separate peaks and reduces noise.
sample center * -scent Method for centering each sample prior to the GISTIC analysis.
gene collapse method * -gcm Method for reducing marker-level copy number data to the gene-level copy number data in the gene tables. Markers contained in the gene are used when available, otherwise the flanking marker or markers are used.
output prefix * -fname The prefix for the output file names.

* - required
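How the cap value and the amplification and deletion thresholds interact can be shown with a small sketch. The function names are hypothetical, and the numeric values in the usage note are illustrative assumptions rather than documented defaults.

```python
def cap_value(log2_ratio: float, cap: float) -> float:
    """Clamp a log2 ratio to the interval [-cap, cap], as described for -cap."""
    return max(-cap, min(cap, log2_ratio))


def call_segment(log2_ratio: float, amp_threshold: float,
                 del_threshold: float, cap: float) -> str:
    """Classify a segment after capping its log2 ratio.

    Amplified: capped value above amp_threshold (-ta).
    Deleted:   capped value below the negative of del_threshold (-td).
    """
    value = cap_value(log2_ratio, cap)
    if value > amp_threshold:
        return "amplified"
    if value < -del_threshold:
        return "deleted"
    return "neutral"
```

With illustrative settings of 0.1 for both thresholds and 1.5 for the cap, a log2 ratio of 3.2 would be capped to 1.5 and called amplified, and a log2 ratio of -0.05 would be called neutral.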

Input Files

  1. Reference Genome File (-refgene) (REQUIRED) 

    The reference genome file contains information about the location of genes and cytobands on a given build of the genome. Reference genome files are created in MATLAB and are not viewable with a text editor. The GISTIC 2.0 release includes the following reference genomes: hg16.mat, hg17.mat, hg18.mat, and hg19.mat.

  2. Segmentation File (-seg) (REQUIRED)

    The segmentation file contains the segmented data for all the samples identified by GLAD, CBS, or some other segmentation algorithm. (See GLAD file format in the GenePattern file formats documentation.) It is a six column, tab-delimited file with an optional first line identifying the columns. Positions are in base pair units. Seg.CN values should be log transformed; if not, GISTIC will automatically log transform the values. The column headers are:

    1. Sample (sample name)
    2. Chromosome (chromosome number)
    3. Start Position (segment start position, in bases)
    4. End Position (segment end position, in bases)
    5. Num markers (number of markers in segment)
    6. Seg.CN (log2(copy number) - 1)
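A minimal sketch of reading such a file and of the Seg.CN transform follows. The Segment type and both function names are hypothetical, and the header check (a non-numeric third field on the first line) is an assumption about how an optional header might be detected, not a description of what GISTIC does.

```python
import csv
import io
import math
from typing import List, NamedTuple


class Segment(NamedTuple):
    sample: str
    chromosome: str
    start: int
    end: int
    num_markers: int
    seg_cn: float


def copy_number_to_seg_cn(copy_number: float) -> float:
    """Convert an absolute copy number to log2(copy number) - 1."""
    return math.log2(copy_number) - 1.0


def read_segments(text: str) -> List[Segment]:
    """Parse six-column, tab-delimited segment data with an optional header."""
    segments = []
    for index, fields in enumerate(csv.reader(io.StringIO(text), delimiter="\t")):
        if not fields:
            continue  # skip blank lines
        if len(fields) != 6:
            raise ValueError(f"line {index + 1}: expected 6 columns, got {len(fields)}")
        # Assumption: a first line whose start position is not an integer is a header.
        if index == 0 and not fields[2].strip().isdigit():
            continue
        segments.append(Segment(fields[0], fields[1], int(fields[2]),
                                int(fields[3]), int(fields[4]), float(fields[5])))
    return segments
```

Under this transform a diploid segment (copy number 2) has a Seg.CN of 0, and a segment with copy number 4 has a Seg.CN of 1.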

  3. Markers File (-mk) (REQUIRED)

    The markers file identifies the marker names and positions of the markers in the original dataset (before segmentation). It is a three-column, tab-delimited file with an optional header. The column headers are:

    1. Marker Name (marker name)
    2. Chromosome (chromosome number)
    3. Marker Position (in bases)
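Ordering markers by genomic position could be sketched as below. The helper names are hypothetical, and the chromosome ordering (numbered chromosomes first, then named ones such as X and Y alphabetically) is an assumption for illustration.

```python
from typing import List, Tuple

Marker = Tuple[str, str, int]  # (marker name, chromosome, position in bases)


def chromosome_key(chromosome: str) -> Tuple[int, int, str]:
    """Sort key: numbered chromosomes in numeric order, then named ones."""
    name = chromosome[3:] if chromosome.lower().startswith("chr") else chromosome
    if name.isdigit():
        return (0, int(name), "")
    return (1, 0, name.upper())


def sort_markers(markers: List[Marker]) -> List[Marker]:
    """Return markers ordered by chromosome, then by position."""
    return sorted(markers, key=lambda m: (chromosome_key(m[1]), m[2]))
```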

  4. Array List File (-alf) (OPTIONAL)

    The array list file is an optional file identifying the subset of samples to be used in the analysis. It is a one-column file with an optional header (array). The sample identifiers listed in the array list file must match the sample names given in the segmentation file.
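Restricting an analysis to the listed samples could be sketched as follows. Both function names are hypothetical, segments are assumed to be tuples whose first element is the sample name, and raising an error for an unmatched identifier is an illustrative choice rather than documented module behavior.

```python
from typing import List, Sequence, Tuple


def read_array_list(text: str) -> List[str]:
    """Read one sample identifier per line, skipping an optional 'array' header."""
    names = [line.strip() for line in text.splitlines() if line.strip()]
    if names and names[0].lower() == "array":
        names = names[1:]
    return names


def select_samples(segments: Sequence[Tuple], array_list: Sequence[str]) -> List[Tuple]:
    """Keep only segments whose sample name (first field) is in the array list."""
    available = {segment[0] for segment in segments}
    missing = [name for name in array_list if name not in available]
    if missing:
        raise ValueError(f"samples not found in segmentation data: {missing}")
    wanted = set(array_list)
    return [segment for segment in segments if segment[0] in wanted]
```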

  5. CNV File (-cnv) (OPTIONAL)

    There are two options for the CNV file. The first option allows CNVs to be identified by marker name. The second option allows the CNVs to be identified by genomic location.

    Option #1: A two-column, tab-delimited file with an optional header row. The marker names given in this file must match the marker names given in the markers file. The CNV identifiers are for user use and can be arbitrary. The column headers are:

    1. Marker Name
    2. CNV Identifier

    Option #2: A six-column, tab-delimited file with an optional header row. The CNV Identifier, Narrow Region Start, and Narrow Region End are for user use and can be arbitrary. The column headers are:

    1. CNV Identifier
    2. Chromosome
    3. Narrow Region Start
    4. Narrow Region End
    5. Wide Region Start
    6. Wide Region End
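The two CNV file layouts can be told apart by column count, as in this sketch. The function name, the dictionary keys, and the explicit has_header argument are assumptions for illustration. The narrow-region fields are kept as text because they are described as arbitrary.

```python
from typing import Dict, List


def read_cnv_file(text: str, has_header: bool = False) -> List[Dict[str, object]]:
    """Parse a CNV file in either the two-column or the six-column layout."""
    lines = [line for line in text.splitlines() if line.strip()]
    if has_header:
        lines = lines[1:]
    records: List[Dict[str, object]] = []
    for line in lines:
        fields = line.split("\t")
        if len(fields) == 2:
            # Option #1: CNV identified by marker name.
            records.append({"marker": fields[0], "cnv_id": fields[1]})
        elif len(fields) == 6:
            # Option #2: CNV identified by genomic location.
            records.append({
                "cnv_id": fields[0],
                "chromosome": fields[1],
                "narrow_start": fields[2],
                "narrow_end": fields[3],
                "wide_start": int(fields[4]),
                "wide_end": int(fields[5]),
            })
        else:
            raise ValueError(f"expected 2 or 6 columns, got {len(fields)}")
    return records
```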

Output Files

  1. All Lesions File (all_lesions.conf_XX.txt, where XX is the confidence level)
    [Description of the content, file format, and how to interpret the results.]

Example Data

[provide example data, including input files and parameter settings]. (we’ll put the input files on the ftp site and link to them from the doc)

Requirements

[any software requirements for running this, e.g., version of R, licensing]

Platform Dependencies

Task Type:
SNP Analysis

CPU Type:
any

Operating System:
Linux

Language:
MATLAB

Version Comments

Version Release Date Description
2014-07-30