Extracts absolute copy numbers per cancer cell from a mixed DNA population. Use this module for the per-sample processing step in the workflow (usually after HAPSEG).
Author: Scott Carter, Matthew Meyerson, Gad Getz
Algorithm Version: ABSOLUTE 1.0.6
The ABSOLUTE module takes copy number data segmented by haplotype, such as the output of HAPSEG, and determines possible models for absolute copy numbers per cancer cell from a mixed DNA population. This output should be used as input for the ABSOLUTE.summarize module.
The human genome typically consists of a set of chromosome pairs, with one chromosome in each pair (a homolog) derived from each parent. Such a genome is referred to as diploid, whereas the set of chromosomes from a single parent is the haploid genome. For a given gene on a given chromosome, there is a comparable, if not identical, gene on the other chromosome in the pair, known as an allele.
Cancer cells frequently have large structural alterations in their chromosomes that change the number of copies of affected genes on those chromosomes. Thus, instead of having a homologous pair of alleles for a given gene, there may be deletions or duplications of those genes. At a marker where the two alleles are heterozygous, this can lead to unequal contribution of one allele over the other, altering the copy number of a given allele.
Variations in copy number, as reflected in the ratio of cancer cell copy number to normal cell copy number (the relative copy number), can be informative regarding the structure and history of the cancer, and are relatively easy to determine. However, when DNA is extracted from an admixed population of cancer and normal cells, the information regarding absolute copy number per cancer cell is lost in the mixing and must be inferred.
Inferring absolute copy number is difficult for three reasons:
Let's briefly discuss the biases introduced by purity and ploidy in cancer samples.
Tumor tissue usually consists of a mixture of many tumor clones — that is, cell lines originating from different sets of chromosomal rearrangements, or subclones — and normal diploid cells. Determining the degree of normal cell contamination is important because as the percentage of normal cells in a tumor sample increases, the ability to extract meaningful data from the sample regarding copy number and gene expression decreases.
A common method of determining the degree of contamination is direct microscopic review of the tumor pathology in tissue blocks taken from a tumor specimen. In many cases, however, pathological review is carried out on tissue blocks from tumor regions physically distant from the tissue block used for DNA extraction. Because of irregularities in the tumor shape and variability of normal cell contamination, this review is less accurate than could be desired. As a result, many researchers prefer to assay the DNA sample directly and estimate its purity mathematically.
The variable ploidy of cancer cells affects the preprocessing used for many copy number algorithms. When DNA is seeded to a microarray, a fixed mass of DNA is used. This means that for non-tumor samples derived from diploid cells, each microarray well represents a similar amount of DNA, and likewise, a similar number of cells. The signals derived from each single nucleotide polymorphism (SNP) on the microarray are directly proportional to the allelic copy number for all samples.
When the amount of DNA delivered to each microarray well is controlled by mass for cancer cells, the result can be variable cell numbers. For example, in a tetraploid cancer sample, which has twice as many chromosomes in each cancer cell, a microarray well represents half as many cells as it would for a diploid sample. If the tetraploid sample has a region with the same copy number as that region in a diploid sample, wells designed to hybridize in this region will produce half the signal in the tetraploid sample that they produce in the diploid sample. The signal is no longer proportional to copy number.
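The effect of purity and ploidy on the measured signal can be illustrated with a small calculation. The sketch below is not the ABSOLUTE implementation; it assumes a simple mixing model in which normal cells are diploid and the signal is normalized to the average copy number of the whole sample. The function name `relative_copy_number` and its parameter names are chosen here for illustration only.

```python
def relative_copy_number(q, purity, ploidy):
    """Expected relative copy number of a segment (illustrative model only).

    q      -- absolute copy number of the segment per cancer cell
    purity -- fraction of cells in the sample that are cancer cells (0..1)
    ploidy -- average copy number across the cancer genome
    Assumes normal cells contribute exactly 2 copies everywhere.
    """
    if not 0.0 <= purity <= 1.0:
        raise ValueError("purity must be between 0 and 1")
    # Average copies per cell in the mixed sample; a fixed DNA mass
    # effectively normalizes the signal by this quantity.
    average_copies = purity * ploidy + 2.0 * (1.0 - purity)
    # Copies of this particular segment per cell in the mixed sample.
    segment_copies = purity * q + 2.0 * (1.0 - purity)
    return segment_copies / average_copies


# A two-copy region in a pure diploid sample versus a pure tetraploid sample.
diploid = relative_copy_number(q=2, purity=1.0, ploidy=2.0)
tetraploid = relative_copy_number(q=2, purity=1.0, ploidy=4.0)
# The same tetraploid tumor diluted with 50% normal cells.
mixed = relative_copy_number(q=2, purity=0.5, ploidy=4.0)

print(diploid, tetraploid, mixed)
```

Under these assumptions the two-copy region gives a relative value of 1.0 in the diploid sample and 0.5 in the tetraploid sample, matching the halved signal described above, while normal cell contamination pulls the tetraploid value back toward 1.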
The purpose of ABSOLUTE is to extract the absolute copy number per cancer cell from the mixed DNA population. It does this in three steps:
The ABSOLUTE module accepts segmented copy number data as input, together with pre-computed models of recurrent cancer karyotypes and, optionally, allelic fraction values for somatic point mutations. The output of ABSOLUTE then estimates the absolute cellular copy number of local DNA segments and, for point mutations, the number of mutated alleles.
The common workflow is to process SNP data with the HAPSEG GenePattern module and pass the results to ABSOLUTE. Alternatively, you can supply a tab-delimited segmentation file (e.g., from array CGH or massively parallel sequencing experiments); this file must contain the columns "Chromosome", "Start", "End", "Num_Probes", and "Segment_Mean". Your file may contain other columns besides these, but at a minimum, these columns must be specified. To run with a file other than those produced by HAPSEG, you must also select "total" for the copy number type parameter. See the Example Data link below for sample HAPSEG output.
Multiple ABSOLUTE results can be summarized using the ABSOLUTE.summarize module and final solutions chosen – after analyst review – with the ABSOLUTE.review module.
Carter SL, Cibulskis K, Helman E, McKenna A, Shen H, Zack T, Laird PW, Onofrio RC, Winckler W, Weir BA, Beroukhim R, Pellman D, Levine DA, Lander ES, Meyerson M, Getz G. Absolute quantification of somatic DNA alterations in human cancer. Nat Biotechnol. 2012;30(5):413-21.
|seg dat file *||A HAPSEG output file (<plate.name>_<array.name>.segdat.RData) or other segmented copy number data file. If you supply a tab-delimited segmentation file, see the Input Files section for file details.|
|output file name base *||If specified, provides a base filename for all output files. The default value is the sample name parameter.|
|sigma p *||Provisional value of excess sample level variance used for mode search. Default: 0|
|max sigma h *||Maximum value of excess sample level variance. For more details, see equation 6 in the ABSOLUTE paper. Default: 0.015|
|min ploidy *||Minimum ploidy value for the algorithm to consider. Models implying lower ploidy values will be discarded. Default: 0.95|
|max ploidy *||Maximum ploidy value for the algorithm to consider. Models implying greater ploidy values will be discarded. Default: 10|
|primary disease *||Primary disease of the sample. This is used for display and reporting purposes only.|
The chip type used. Supported chips are:
|sample name *||The name of the sample. This is used for display and reporting purposes only.|
|max as seg count *||Maximum number of allelic segments. Samples with a higher segment count will be flagged as 'failed'. Default: 1500|
|max neg genome *||Sometimes, due to noise in the data, ABSOLUTE may model the fraction of the genome attributed to tumor subclones to be less than zero. This parameter specifies the maximum allowable fraction of the genome that can be modeled as being less than zero without discarding a given solution. Default: 0.005|
|max non clonal *||Maximum genome fraction that may be modeled as non-clonal — that is, as being derived from tumor subclones. Solutions implying greater values will be discarded. Default: 0.05|
|copy number type *||The copy number type to assess. Options include:|
|maf file||If available, a file of somatic point mutations in mutation annotation format (MAF) (see Input Files for more details). This specifies the data for somatic point mutations to be used by ABSOLUTE.|
|min mut af||If specified, a minimum mutation allelic fraction; that is, the fraction of alleles at a site that show the mutation. Mutations with lower allelic fractions will be filtered out before analysis. Note that if maf file is specified, min mut af must also be specified.|
* - required
A HAPSEG output file or tab-delimited segmentation file. If you supply a tab-delimited segmentation file not generated by HAPSEG (e.g., from array comparative genomic hybridization [CGH] or massively parallel sequencing experiments), this file must contain the columns "Chromosome", "Start", "End", "Num_Probes", and "Segment_Mean". Your file may contain other columns as well, but these five are the required minimum.
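Before uploading a segmentation file that was not produced by HAPSEG, it can help to confirm that the required columns are present. The following is a minimal sketch using only the Python standard library; the helper name `missing_columns` and the example row values are made up for illustration and are not part of ABSOLUTE or GenePattern.

```python
import csv
import io

# Column names required by the module for non-HAPSEG segmentation files.
REQUIRED_COLUMNS = ("Chromosome", "Start", "End", "Num_Probes", "Segment_Mean")


def missing_columns(handle):
    """Return the required column names absent from a tab-delimited header."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader, [])  # empty list if the file has no lines
    present = {name.strip() for name in header}
    return [name for name in REQUIRED_COLUMNS if name not in present]


# Illustrative in-memory file; extra columns such as "Sample" are allowed.
example = io.StringIO(
    "Sample\tChromosome\tStart\tEnd\tNum_Probes\tSegment_Mean\n"
    "S1\t1\t61735\t1510801\t226\t0.0435\n"
)
print(missing_columns(example))
```

For a real file, pass an open file handle instead, for example `open(path, newline="")`. An empty list means all required columns were found. Remember that such files also require "total" for the copy number type parameter.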
If available, a file in mutation annotation format (MAF) that specifies the data for somatic point mutations to be used by ABSOLUTE. The MAF format specification has changed over time, and no particular version is required, but this file must contain at least the following columns:
Plot showing the purity/ploidy values and the solutions
An R file containing an object ‘seg.dat’ which provides all of the information used to generate the plot.
A set of HAPSEG example data from the CGA group is available at:
This can be run through HAPSEG and the output supplied to ABSOLUTE. Note that there is a README file in the ZIP archive that provides the filenames and parameters you will need to run this example data through HAPSEG, ABSOLUTE, ABSOLUTE.summarize, and ABSOLUTE.review.
ABSOLUTE can only be used on the GenePattern public server, as it requires a specialized installation process that prevents distribution via the repository. Please contact the authors listed above if you have an interest in installing ABSOLUTE locally.
Acceptance of the module license is required for its use. A copy of the license text is available here: http://www.broadinstitute.org/cancer/cga/sites/default/files/images/ABSOLUTE_HAPSEG_license_2013.pdf
The ABSOLUTE module runs only on GenePattern 3.4.2 or above and requires R2.15.2 with the following packages:
Each of these R packages will be automatically downloaded and installed when the module is installed. R2.15.2 must be installed and configured independently.