This module is currently in beta release. The module and/or documentation may be incomplete.
Creates a GCT file from a set of CEL files from Affymetrix ST arrays.
Author: David Eby, Broad Institute
Contact:
gp-help@broadinstitute.org
Algorithm Version:
Summary
Please note that version 0.14 is currently available only in beta on servers hosted by the GenePattern team. We are working to release updates that will be available for use on all platforms. Feel free to contact us with any questions.
This module creates a gene expression dataset from a set of CEL files for Affymetrix ST arrays. It is similar to ExpressionFileCreator, which operates on CEL files from the older 3' biased IVT-based Affymetrix arrays. The conversion is done using the Robust Multi-array Average (RMA) algorithm as provided by the 'oligo' package in Bioconductor. The result is a matrix containing one intensity value per probe set per sample in the GCT file format.
Note that the RMA algorithm will log-transform the data during processing. This may affect downstream processing by other modules, some of which will produce erroneous results with log-transformed data unless adjustments are made. For example, the ComparativeMarkerSelection module has a parameter that must be set for it to accept and adjust for log-transformed data.
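To illustrate the kind of adjustment involved, here is a minimal Python sketch that converts log-scale values back to the linear scale. It assumes the values are base-2 logarithms, which is the scale RMA conventionally reports. The helper name `unlog2` is hypothetical and is not part of GenePattern or 'oligo'.

```python
def unlog2(values):
    """Convert log2-scale expression values back to the linear scale."""
    return [2.0 ** v for v in values]


# Hypothetical log2 intensities for one probe set across three samples.
log_values = [3.0, 10.0, 12.5]
linear_values = unlog2(log_values)
```

Whether to undo the transform or to tell the downstream module that the data is already log-transformed depends on that module; check its documentation before choosing.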
Multiple CEL files can be uploaded directly to the input file parameter for processing. The parameter also accepts CELs packaged as a ZIP or TAR bundle or supplied as a directory input if your GenePattern server is configured to allow it. You can provide multiple ZIPs, TARs, or directory inputs as well, or mix all of these forms. The CEL files can be compressed in GZ format and the TAR bundles can be in GZ, XZ, or BZ2 format. Any directory inputs will be recursively searched for CEL files (uncompressed or in GZ format) to include in the dataset; ZIPs and TARs in these inputs will not be included, however.
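The directory search described above can be approximated with the following Python sketch. It is not the module's own code: the function name `find_cel_files` is hypothetical, and matching file extensions case-insensitively is an assumption.

```python
from pathlib import Path

# Extensions treated as CEL files: plain or GZ-compressed (assumed case-insensitive).
CEL_SUFFIXES = (".cel", ".cel.gz")


def find_cel_files(directory):
    """Recursively collect CEL files (plain or gzipped) under a directory.

    ZIP and TAR bundles found inside the directory are ignored.
    """
    matches = []
    for path in Path(directory).rglob("*"):
        if path.is_file() and path.name.lower().endswith(CEL_SUFFIXES):
            matches.append(path)
    return sorted(matches)
```

Running a check like this before submitting a job can help confirm which files in a directory input are likely to be picked up.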
You can supply an optional CLM file listing the CEL files to be included in the dataset, their order, their phenotypic categories, and their alternate sample names. Note that if there are any files submitted for a job but not listed in the included CLM file, those files will not be included in the dataset. The column order of the dataset will match the order of the CLM listing. If no CLM file is provided, the CEL file names will be used as sample names and the order will match the module's processing order. This can be somewhat unpredictable, so if order is important then the use of a CLM is recommended.
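The filtering and ordering effect of a CLM file can be sketched in Python as follows. This is only an illustration: the function names are hypothetical, the CLM is assumed to hold three tab-separated columns (scan, sample, class), and the scan column is assumed to match the CEL file name exactly.

```python
import csv
import io


def read_clm(text):
    """Parse CLM text into (scan, sample, class) tuples, preserving file order."""
    rows = []
    for fields in csv.reader(io.StringIO(text), delimiter="\t"):
        if len(fields) >= 3:  # skip blank or incomplete lines
            rows.append((fields[0].strip(), fields[1].strip(), fields[2].strip()))
    return rows


def select_samples(cel_names, clm_rows):
    """Keep only CEL files listed in the CLM, in CLM order."""
    available = set(cel_names)
    return [row for row in clm_rows if row[0] in available]


# Hypothetical CLM content: b.CEL is listed first, c.CEL is not listed at all.
clm_text = "b.CEL\tTumor_1\ttumor\na.CEL\tNormal_1\tnormal\n"
selected = select_samples(["a.CEL", "b.CEL", "c.CEL"], read_clm(clm_text))
```

Here `c.CEL` is dropped because the CLM does not list it, and the remaining samples follow the CLM order rather than the submission order.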
References
Carvalho BS and Irizarry RA (2010). "A Framework for Oligonucleotide Microarray Preprocessing." Bioinformatics. ISSN 1367-4803.
Carvalho BS and Irizarry RA (2014). "Package 'oligo'" documentation from Bioconductor 2.14.
Parameters
Name | Description |
---|---|
input file * | One or more Affymetrix ST CEL files either uploaded directly, packaged into a ZIP or TAR bundle, or supplied through a directory input. The CEL files can be in GZ format and the TAR can be in GZ, XZ, or BZ2 format. The parameter will accept multiple inputs in any of these forms. |
normalize * | Whether to normalize data using quantile normalization. |
background correct * | Whether to perform background correction. |
clm file | A tab-delimited text file containing one scan, sample, and class per line. |
annotate probes * | Whether to annotate probes with the gene symbol and description. |
output file base * | The base name of the output file(s). File extensions will be added automatically. |
* - required
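For readers unfamiliar with the quantile normalization controlled by the normalize parameter, the following Python sketch shows the general idea: every sample is forced to share the same distribution of values. It is a simplified illustration, not the 'oligo' implementation. The function name is hypothetical, and tied values are not given any special treatment.

```python
def quantile_normalize(columns):
    """Quantile-normalize a list of equal-length columns (one list per sample)."""
    n = len(columns[0])
    sorted_cols = [sorted(col) for col in columns]
    # Mean of the k-th smallest value across all samples.
    rank_means = [sum(col[k] for col in sorted_cols) / len(columns) for k in range(n)]

    normalized = []
    for col in columns:
        # Indices of this column's values, from smallest to largest.
        order = sorted(range(n), key=lambda i: col[i])
        out = [0.0] * n
        for rank, i in enumerate(order):
            out[i] = rank_means[rank]
        normalized.append(out)
    return normalized


# Two hypothetical samples with three probe values each.
result = quantile_normalize([[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]])
```

After normalization both samples contain the same set of values (the per-rank means), arranged according to each sample's original ranking.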
Input Files
- input.file
One or more Affymetrix ST CEL files. These can be supplied as individual CEL files, in a ZIP or TAR bundle, or in a directory. The CEL files can be in GZ format and the TAR can be in GZ, XZ, or BZ2 format. Note that the CEL file names must be unique, ignoring any compression format extensions. Also note that all CEL files must be of the same array type.
- clm.file
An optional CLM file listing the CEL files to be included in the dataset, their order, their phenotypic categories, and their alternate sample names. Note that if there are any files submitted for a job but not listed in the included CLM file, those files will not be included in the dataset. The column order of the dataset will match the order of the CLM listing. If no CLM file is provided, the CEL file names will be used as sample names and the order will match the module's processing order. This can be somewhat unpredictable, so if order is important then the use of a CLM is recommended.
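The requirement that CEL file names be unique once compression extensions are ignored can be checked before submitting a job. The Python sketch below is one way to do so. The function names are hypothetical, and it assumes that `.gz` is the only compression extension on individual CEL files and that names are compared case-sensitively.

```python
from collections import Counter


def base_cel_name(file_name):
    """Strip a trailing .gz so 'x.CEL.gz' and 'x.CEL' compare as equal."""
    return file_name[:-3] if file_name.lower().endswith(".gz") else file_name


def duplicate_cel_names(file_names):
    """Return the base names that occur more than once, sorted alphabetically."""
    counts = Counter(base_cel_name(name) for name in file_names)
    return sorted(name for name, count in counts.items() if count > 1)


# 'a.CEL' and 'a.CEL.gz' collide once the compression extension is ignored.
clashes = duplicate_cel_names(["a.CEL", "a.CEL.gz", "b.CEL"])
```

An empty result means the submitted set has no conflicting names under these assumptions.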
Output Files
- <output.file.base>.gct
The expression dataset in GCT format.
- <output.file.base>.cls
A categorical label CLS file, listing the categories of all the samples in the dataset as determined by the input CLM file.
- <output.file.base>.QC.Density_histogram.pdf (or .png or .svg)
A histogram plot of the density estimates for each sample. This may be useful for QC purposes.
- <output.file.base>.QC.Boxplot.pdf (or .png or .svg)
A boxplot of the observed intensities for each sample. This may be useful for QC purposes.
- <output.file.base>.QC.[sample name]_MAplot.pdf (or .png or .svg)
A plot of log ratio vs. average intensity (M vs. A, or MA) for each sample versus a reference array. The Wikipedia entry on MA plots gives some background.
- <output.file.base>.QC.[sample name]_Cel_image.pdf (or .png or .svg)
A pseudo-image of the array for each sample, based on the observed intensities. This may be useful for QC purposes.
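To work with the GCT output outside GenePattern, a small reader such as the Python sketch below may be enough. It assumes the standard GCT version 1.2 layout: a `#1.2` line, a line with the row and column counts, a header line beginning with Name and Description, then one tab-delimited row per probe set. The function name `parse_gct` is hypothetical.

```python
def parse_gct(text):
    """Parse GCT v1.2 text into (sample_names, {probe_set_id: [values]})."""
    lines = text.splitlines()
    if lines[0].strip() != "#1.2":
        raise ValueError("not a GCT v1.2 file")
    # Second line: number of data rows and number of samples.
    n_rows, n_cols = (int(x) for x in lines[1].split("\t")[:2])
    header = lines[2].split("\t")
    samples = header[2:2 + n_cols]  # skip the Name and Description columns
    data = {}
    for line in lines[3:3 + n_rows]:
        fields = line.split("\t")
        data[fields[0]] = [float(v) for v in fields[2:2 + n_cols]]
    return samples, data


# A tiny hypothetical dataset: two probe sets, two samples.
gct_text = (
    "#1.2\n"
    "2\t2\n"
    "Name\tDescription\tS1\tS2\n"
    "P1\tgene one\t7.5\t8.0\n"
    "P2\tgene two\t3.25\t4.0\n"
)
samples, data = parse_gct(gct_text)
```

Keep in mind that the values in this module's output are log-transformed by RMA.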
Example Data
[Yet to be posted]
Requirements
Requires R 3.1.3 and a set of R package dependencies from CRAN and Bioconductor. R 3.1.3 must be installed and configured by the GenePattern administrator before this module can be installed. [Instructions yet to be posted. Will link to an updated version of our Admin Guide on the subject.] The package dependencies are installed automatically when the module is installed.
Platform Dependencies
Task Type:
Preprocess & Utilities
CPU Type:
Operating System:
any
Language:
R
Version Comments
Version | Release Date | Description |
---|---|---|
0.14 | 2015-10-22 | Updated to make use of the R package installer. |