This module is currently in beta release. The module and/or documentation may be incomplete.
Converts a Cufflinks read_group_tracking file into GCT format
Author: David Eby, Broad Institute
Contact:
gp-help@broadinstitute.org
Algorithm Version:
Introduction
This module will convert files in the Cufflinks read_group_tracking format into GenePattern's GCT format for use in downstream modules.
The Cuffdiff module performs a quantification step as a precursor to running its differential expression calculations. This provides independent FPKM and other quantification values for each replicate of each condition in the dataset, which are thus suitable as input to existing GenePattern modules for analysis and visualization such as ConsensusClustering, NMFConsensus, and so on. Cuffdiff stores these in several read_group_tracking files at the level of genes, isoforms, transcription start sites (tss_group), and coding sequences (cds), each named appropriately for the feature information it holds.
Usage
The module extracts FPKM values by default, but it can instead extract any of the other available types of quantification values:
- raw_frags: The estimated number of (unscaled) fragments originating from the object in this replicate
- internal_scaled_frags: Estimated number of fragments originating from the object, after transforming to the internal common count scale (for comparison between replicates of this condition)
- external_scaled_frags: Estimated number of fragments originating from the object, after transforming to the external common count scale (for comparison between conditions)
- FPKM (the default): FPKM of this object in this replicate
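As a rough illustration of the extraction step, the sketch below reads a tab-delimited read_group_tracking table by header name and collects one value column per tracking ID and replicate. It is not the module's implementation. The sample rows are made up, the `effective_length` column and the `condition_replicate` sample naming are assumptions, and `extract_values` is a hypothetical helper name.

```python
import csv
import io

# Made-up rows for illustration; the header names are assumed to match
# those in your own read_group_tracking file, so check them before use.
SAMPLE_TRACKING = (
    "tracking_id\tcondition\treplicate\traw_frags\tinternal_scaled_frags\t"
    "external_scaled_frags\tFPKM\teffective_length\tstatus\n"
    "XLOC_00001\tq1\t0\t10\t9.5\t9.8\t1.25\t-\tOK\n"
    "XLOC_00001\tq1\t1\t12\t11.0\t11.4\t1.5\t-\tOK\n"
    "XLOC_00001\tq2\t0\t30\t28.0\t29.0\t3.75\t-\tOK\n"
)


def extract_values(handle, value_column="FPKM"):
    """Return {tracking_id: {sample: (value, status)}} for one value column.

    The sample label "<condition>_<replicate>" is an assumption made for
    this sketch, not necessarily the naming the module uses.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    table = {}
    for row in reader:
        sample = f"{row['condition']}_{row['replicate']}"
        value = float(row[value_column])
        table.setdefault(row["tracking_id"], {})[sample] = (value, row["status"])
    return table


fpkm = extract_values(io.StringIO(SAMPLE_TRACKING))
raw = extract_values(io.StringIO(SAMPLE_TRACKING), value_column="raw_frags")
```

Reading by header name rather than by column position keeps the sketch independent of column order.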
Within the read_group_tracking file, Cuffdiff gives a status for the quantification calculation result per tracking ID and replicate:
- OK: deconvolution successful
- LOWDATA: too complex or shallowly sequenced
- HIDATA: too many fragments in locus
- FAIL: when an ill-conditioned covariance matrix or other numerical exception prevents deconvolution
In practical terms, any value associated with a non-OK status represents a quantification error and so will be either screened out of the resulting GCT or replaced with an "NA". The default is to screen, as most downstream modules cannot handle "NA" values. Note that screening removes all values for the entire feature, including those with OK status, rather than just those for the affected replicates.
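The two ways of handling non-OK values can be sketched as follows. This is a minimal illustration under assumed data structures, not the module's code: `apply_error_policy`, its parameter names, and the example values are all hypothetical.

```python
def apply_error_policy(table, screen_out_errors=True, error_value="NA"):
    """Apply the screening or replacement policy to extracted values.

    table maps feature -> {sample: (value, status)}. With screening on,
    a feature with any non-OK status is dropped entirely, including its
    OK values. With screening off, only the non-OK values are replaced.
    """
    result = {}
    for feature, samples in table.items():
        has_error = any(status != "OK" for _, status in samples.values())
        if has_error and screen_out_errors:
            continue  # drop the whole feature, not just the bad replicate
        result[feature] = {
            sample: (value if status == "OK" else error_value)
            for sample, (value, status) in samples.items()
        }
    return result


# Made-up values for illustration.
example = {
    "XLOC_00001": {"q1_0": (1.25, "OK"), "q1_1": (1.5, "OK")},
    "XLOC_00002": {"q1_0": (0.0, "LOWDATA"), "q1_1": (2.0, "OK")},
}
screened = apply_error_policy(example)
replaced = apply_error_policy(example, screen_out_errors=False)
```

Here `screened` keeps only `XLOC_00001`, while `replaced` keeps both features and marks the LOWDATA value as "NA".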
See this section of the Cufflinks manual for a full explanation of the read_group_tracking format, which is the source of the above description.
GTF Name Mapping
Depending on your prior workflow, your read_group_tracking files may contain internal Cufflinks tracking_id references rather than gene symbols; this is particularly the case if you used the merged.gtf file from Cuffmerge as an input to Cuffdiff. These tracking_ids refer to entries in the GTF file supplied to Cuffdiff and will take the form of XLOC_00001 for genes, TCONS_00000001 for isoforms, TSS1 for tss references, and P1 for cds references. Each line of the GTF will contain several named attributes that show the tracking_ids associated with that particular feature. While you can open your GTF to look up these references, you may find it more convenient to have them mapped to gene symbols in the GCT so that these are available in any downstream processing.
The module can optionally perform this mapping for you. To do so, you must supply it with the same GTF used in running Cuffdiff and tell it the feature level - gene, isoform, tss, cds - of the read_group_tracking file to be converted. The feature-level selection controls how the module will look up tracking_ids in the GTF and which column to use when writing them to the GCT. By default, it behaves like this:
- At the gene level: the gene_id attribute is used to look up tracking_ids, and the corresponding gene_name will be used in the GCT Name column. The Description column is left blank.
- At the isoform level: the transcript_id attribute is used to look up tracking_ids, and the corresponding gene_name will be used in the GCT Description column. The Name column will hold the tracking_id value.
- At the tss level: the tss_id attribute is used to look up tracking_ids, and the corresponding gene_name will be used in the GCT Description column. The Name column will hold the tracking_id value.
- At the cds level: the p_id attribute is used to look up tracking_ids, and the corresponding gene_name will be used in the GCT Description column. The Name column will hold the tracking_id value.
By default, the GCT columns are chosen this way because of the way tracking_ids generally map to the corresponding features. At the gene level, a single tracking_id will correspond to one gene (with a few possible exceptions discussed below), so it is generally safe to use the gene symbol in place of the tracking_id in the GCT. At the other feature levels, multiple tracking_ids may correspond to the same gene, so the gene symbol cannot serve as the Name, which must be unique in the GCT; instead, the tracking_id is used as the GCT Name and the gene symbol is given as the Description. You can use the 'choose gct column for mapping' parameter to override this choice.
Note that it is possible for the GTF to map a tracking_id to either no gene symbol at all or, alternatively, to multiple gene symbols. In either of these cases, the module will perform no mapping and instead will use the tracking_id in the GCT Name column with a blank Description. The module will also list these in the summary report, either as counts or with details (controlled by the 'report naming conflict details' parameter).
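The default column choice and the fallback for missing or conflicting symbols can be summarized in a short sketch. The function name and the representation of symbols as a set are assumptions made for illustration. The behavior follows the description above rather than the module's source.

```python
def name_and_description(tracking_id, symbols, feature_level):
    """Return the (Name, Description) pair for one GCT row.

    symbols is the set of gene symbols that the GTF associates with
    tracking_id; feature_level is "gene", "isoform", "tss", or "cds".
    """
    if len(symbols) != 1:
        # No symbol or several symbols: no mapping is performed.
        return tracking_id, ""
    symbol = next(iter(symbols))
    if feature_level == "gene":
        # One tracking_id per gene, so the symbol can serve as the Name.
        return symbol, ""
    # Several tracking_ids may share a gene, so the Name stays unique.
    return tracking_id, symbol
```

For example, a gene-level ID with the single hypothetical symbol `GENE_A` yields `("GENE_A", "")`, while an isoform-level ID with the same symbol keeps its tracking_id as the Name and carries `GENE_A` in the Description.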
If no matching gene symbol was found it may indicate a novel feature, while if multiple matching gene symbols were found it may indicate overlapping features in the annotation. To investigate these further, we recommend that you look them up in the GTF for more information. For a given gene_id, multiple features may be present, including protein-coding sequence, ncRNAs, etc. These features may be differentiated through the associated gene_name or nearest_ref attributes, and it may be useful to trace these back to the reference annotation supplied to Cuffmerge. Also consider continuing analyses at a more detailed feature level to differentiate expression of these features.
You can use the 'override attribute name for lookup' and 'override attribute name for retrieval' parameters if you have special mapping requirements or non-standard attributes in the GTF. To use these, both must be supplied, and each should contain the attribute name exactly as it appears in the GTF. For example, you could map transcription start sites to the nearest feature reference by setting these to tss_id and nearest_ref, respectively, assuming that the latter is present in your GTF and you are using a tss-level read_group_tracking file.
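To make the lookup and retrieval attributes concrete, the following sketch builds a tracking_id-to-symbol map from the attribute column of a GTF. It assumes the usual `key "value";` attribute syntax in the ninth tab-separated field. The sample lines, the `GENE_A` and `REF_1` values, and the `build_symbol_map` helper are made up for illustration and do not reflect the module's actual parser.

```python
import re

# Matches attribute pairs of the form: key "value"
ATTRIBUTE_PATTERN = re.compile(r'(\S+)\s+"([^"]*)"')


def build_symbol_map(gtf_lines, lookup_attr="gene_id", retrieval_attr="gene_name"):
    """Map each lookup_attr value to the set of retrieval_attr values seen."""
    mapping = {}
    for line in gtf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comments and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # not a complete GTF record
        attrs = dict(ATTRIBUTE_PATTERN.findall(fields[8]))
        key = attrs.get(lookup_attr)
        if key is None:
            continue
        symbols = mapping.setdefault(key, set())
        value = attrs.get(retrieval_attr)
        if value:
            symbols.add(value)
    return mapping


# Made-up GTF records for illustration.
SAMPLE_GTF = [
    'chr1\tCufflinks\texon\t100\t200\t.\t+\t.\tgene_id "XLOC_00001"; '
    'transcript_id "TCONS_00000001"; gene_name "GENE_A"; tss_id "TSS1"; nearest_ref "REF_1";',
    'chr1\tCufflinks\texon\t300\t400\t.\t+\t.\tgene_id "XLOC_00001"; '
    'transcript_id "TCONS_00000002"; gene_name "GENE_A"; tss_id "TSS2";',
]

gene_map = build_symbol_map(SAMPLE_GTF)
tss_map = build_symbol_map(SAMPLE_GTF, "tss_id", "nearest_ref")
```

In this sketch, an ID that maps to an empty set or to more than one value corresponds to the no-symbol and multiple-symbol cases described earlier.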
Note that all parameters described in this section are ignored if no GTF file is supplied.
References
Cufflinks website and manual, particularly this section.
Parameters
| Name | Description |
|---|---|
| input file * | The read_group_tracking file to be converted into GCT format. |
| output file name * | The name to be given to the GCT output file. |
| expression value column * | The column to use for extracting expression values. |
| screen out errors * | Set to 'yes' to exclude from the GCT file any feature having at least one non-OK quantification status across all of the sequenced samples. |
| output value for errors * | This parameter controls what expression value to write to a GCT file in those cases where a feature's expression estimate carries a non-OK quantification status. The parameter is only relevant to your final GCT when 'screen out errors' is set to 'no'. We strongly recommend treating these as missing values (select 'NA' or 'blank'). |
| gtf file | An optional GTF file. If provided, this should be the same file that you provided to Cuffdiff. You can use this to map tracking IDs in the read_group_tracking file to gene symbols for output in the GCT. Make sure that you set the feature level to match your read_group_tracking file. |
| feature level for symbol lookup * | Select the feature level of the read_group_tracking file for mapping gene symbols from a GTF. This is ignored if no GTF is provided. |
| choose gct column for mapping * | Use this to explicitly set the GCT column to use when writing symbols retrieved from the GTF. This is automatic by default, meaning that retrieved symbols go into the Name column for gene-level files and into the Description column for other feature levels. |
| report naming conflict details * | Include the naming conflict details in the summary report, rather than just giving counts of the issues. This is ignored if no GTF is provided. |
| override attribute name for lookup | Use this to override the name of the attribute to search against to look up IDs when mapping with a GTF (gene_id, for example). Type in the attribute name exactly as it appears in the GTF. The retrieval override must also be provided if this parameter is set. |
| override attribute name for retrieval | Use this to override the name of the attribute to retrieve on match when mapping with a GTF (gene_name, for example). Type in the attribute name exactly as it appears in the GTF. The lookup override must also be provided if this parameter is set. |
* - required
Input Files
- <input.file> (required)
The file in read_group_tracking format to be converted to GCT.
Output Files
- <output.file.name>
The GCT resulting from the extraction of the expression values in the source file.
- <output.file.name_basename>.cls
A companion CLS file created using the Cufflinks conditions as classes, with replicates treated as samples within the class.
- <output.file.name_basename>.summary.txt
A report giving summary statistics on the values with non-OK quantification status found during processing, both by feature (tracking_id) and by column (sample or condition/replicate).
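For readers unfamiliar with the output formats, the sketch below lays out a GCT (version 1.2) table and a companion categorical CLS file as text. It illustrates the general file layouts only. The helper names and example values are hypothetical, and the module's exact output (for instance its sample names or number formatting) may differ.

```python
def format_gct(rows, sample_names):
    """Format rows of (name, description, values) as GCT 1.2 text."""
    lines = [
        "#1.2",                                   # version line
        f"{len(rows)}\t{len(sample_names)}",      # row count, sample count
        "\t".join(["Name", "Description"] + list(sample_names)),
    ]
    for name, description, values in rows:
        lines.append("\t".join([name, description] + [str(v) for v in values]))
    return "\n".join(lines) + "\n"


def format_cls(sample_conditions):
    """Format a categorical CLS file from one condition name per sample.

    sample_conditions must be in the same order as the GCT sample columns.
    """
    classes = list(dict.fromkeys(sample_conditions))  # unique, order kept
    header = f"{len(sample_conditions)} {len(classes)} 1"
    names = "# " + " ".join(classes)
    labels = " ".join(str(classes.index(c)) for c in sample_conditions)
    return "\n".join([header, names, labels]) + "\n"


# Made-up example: two replicates of condition q1, one of q2.
gct_text = format_gct([("GENE_A", "", [1.25, 1.5, 3.75])], ["q1_0", "q1_1", "q2_0"])
cls_text = format_cls(["q1", "q1", "q2"])
```

Each condition becomes one class, and each replicate becomes one sample labeled with the index of its condition.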
Example Data
A short example input file is available from our FTP site. This is a truncated example to illustrate the input format and is intended to be used for test purposes only.
Platform Dependencies
Task Type:
RNA-seq, Data Format Conversion, Preprocess & Utilities
CPU Type:
any
Operating System:
any
Language:
Java
Version Comments
| Version | Release Date | Description |
|---|---|---|
| 0.15 | 2014-10-01 | Beta Release |
