ComBat (v3)

Performs batch correction on a dataset containing multiple batches. Not intended for use with single-cell RNA-seq data.

Author: W. Evan Johnson (Boston University), Marc-Danie Nazaire (Broad Institute)

Contact:

Algorithm Version: 2.0

Introduction

ComBat runs the R script "Combatting batch effects when combining batches of microarray data", which uses an empirical Bayes method to adjust for potential batch effects. Practical considerations limit the number of samples that can be run at one time, so replicate samples are often generated in ways that introduce non-biological differences, or systematic "batch effects". For example, batch effects occur when combining replicates from different labs, array types, or platforms. In some cases, different lots of amplification reagent or the time of day of the assay have been shown to cause batch effects. ComBat's empirical Bayes approach assumes that the phenomena causing batch effects affect many genes in similar ways, and it adjusts for these systematic batch biases common across genes.

Note: This module is not intended for use with single-cell RNA-seq data.

Algorithm

  • ComBat differs from previous methods in its ability to adjust data with small batch sizes: fewer than 10 samples per batch, versus the more than 25 that earlier methods need.
  • ComBat offers two methods for estimating the priors, parametric and non-parametric. Which one gives the more accurate adjustment depends on the dataset.
    • The parametric method computes prior probability distributions, shown in the prior plots, that are used in the adjustment.
      • In the plots, if the black line (kernel density estimate of batch effects) and the red line (parametric estimate of batch effects) do not overlap, for example because the plots show bimodality, then the non-parametric method should be used.
    • The non-parametric method makes no prior assumptions and therefore takes significantly longer to run.
  • Additional considerations
    • Use ComBat on data that has already been preprocessed and normalized gene-wise, so that genes have similar overall mean and variance.
    • Include covariates in the analysis when possible.
    • Note that a batch with missing or unbalanced proportions of treatments and controls risks removal of biological signal.
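
To make the idea of a per-batch adjustment concrete, here is a minimal Python sketch of a plain location-and-scale correction for a single gene. It is not the ComBat algorithm: it omits the empirical Bayes step that pools information across genes, and it ignores covariates. The function name `adjust_gene` and the example values are hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from math import sqrt
from statistics import mean, pstdev


def adjust_gene(values, batches):
    """Center and rescale one gene's values batch by batch.

    Simplified illustration only: no empirical Bayes shrinkage,
    no covariates. `values` and `batches` are parallel lists.
    """
    by_batch = defaultdict(list)
    for value, batch in zip(values, batches):
        by_batch[batch].append(value)

    grand_mean = mean(values)
    batch_mean = {b: mean(vs) for b, vs in by_batch.items()}
    batch_sd = {b: pstdev(vs) for b, vs in by_batch.items()}

    # Pooled within-batch standard deviation: spread left over once
    # each batch's own mean has been removed.
    residuals_sq = [(v - batch_mean[b]) ** 2 for v, b in zip(values, batches)]
    pooled_sd = sqrt(mean(residuals_sq))

    adjusted = []
    for value, batch in zip(values, batches):
        # A batch with zero spread cannot be rescaled; only center it.
        scale = pooled_sd / batch_sd[batch] if batch_sd[batch] > 0 else 1.0
        adjusted.append(grand_mean + (value - batch_mean[batch]) * scale)
    return adjusted


# Two batches of three samples with different means and spreads.
adjusted = adjust_gene([1, 2, 3, 11, 14, 17], ["a", "a", "a", "b", "b", "b"])
```

After the call, both batches share the same mean and the same spread for this gene. With only a few samples per batch, per-gene estimates like these are noisy, which is the situation the empirical Bayes approach is meant to address by borrowing information across genes.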

References

Johnson WE, Rabinovic A, and Li C. Adjusting batch effects in microarray expression data using Empirical Bayes methods. Biostatistics. 2007;8(1):118-127. doi:10.1093/biostatistics/kxj037.

Luo J, Schumacher M, Scherer A, et al. A comparison of batch effect removal methods for enhancement of prediction performance using MAQC-II microarray gene expression data. Pharmacogenomics J. 2010;10(4):278-91. doi:10.1038/tpj.2010.57.

Chen C, Grennan K, Badner J, et al. Removing batch effects in analysis of expression microarray data: An evaluation of six batch adjustment methods. PLoS One. 2011;6(2):e17238. doi:10.1371/journal.pone.0017238.

Parameters

Name Description
input file *
  • GCT or RES format file. (Additional quick guide to file conversion)
  • ComBat requires unique row identifiers. GenePattern module UniquifyLabels renames duplicate row names.
  • NOTE: The data must be preprocessed first. (See additional considerations above)
sample info file *

TXT plain text file matching batch and covariate information to sample identifier. First three column labels in first row must be exactly "Array", "Sample", and "Batch" without spaces.

  • Enter the sample identifiers from the data file under "Array".
  • "Sample" column rows can be blank or contain labels.
  • Indicate batches under "Batch", with a minimum of two samples per batch. Batch labels can be alphanumeric, e.g. label four batches (1, 2, 6, 8) or (lab, collaborator-1, collaborator-2, 3).
  • Indicate covariates in columns four and greater, or leave blank.
covariate columns * Subset of covariate columns to use in the analysis. Set this to all, none, or a list of one or more covariate columns from the sample info file, e.g. (4, 5, 7).
absent calls filter  Filter applied to RES file input: a gene is removed if it has absent calls in at least 1 - (absent calls filter) of the samples. Use a value between 0 and 1, or leave blank. For example, 0.8 removes a feature if at least 20% of samples have absent calls.
create prior plots  Whether to generate prior probability distribution plots. Select "yes" for the parametric method or "no" for the non-parametric method.
prior method  Method used to estimate the empirical Bayes prior distributions, either parametric or non-parametric.
output files *
  • Batch adjusted output file <output file>.<res, gct> in same format as input.
    • For the parametric method, an additional prior plot file <output file>.plot.<jpg, pdf>, in JPEG format if supported, otherwise as PDF.

* - required
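
The arithmetic behind the absent calls filter can be sketched as follows. This is an illustration of the rule as described in the table above, not the module's own code. The function name `keep_feature` is hypothetical, and the sketch assumes that an absent call appears as "A" in the RES file and that a blank parameter (here `None`) means no filtering.

```python
def keep_feature(calls, absent_calls_filter):
    """Return True if a feature passes the absent calls filter.

    calls: per-sample call strings for one feature, e.g. "P", "M", "A".
    absent_calls_filter: value between 0 and 1, or None for no filtering
    (treating blank as "no filtering" is an assumption of this sketch).
    """
    if absent_calls_filter is None:
        return True
    if not 0 <= absent_calls_filter <= 1:
        raise ValueError("absent calls filter must be between 0 and 1")

    absent_fraction = sum(call == "A" for call in calls) / len(calls)
    threshold = 1 - absent_calls_filter
    # Remove the feature when the absent fraction reaches the threshold;
    # the small tolerance guards against floating-point rounding.
    return absent_fraction < threshold - 1e-9


# With a filter of 0.8, the threshold is 20% absent calls.
mostly_present = ["P"] * 9 + ["A"] * 1   # 10% absent
often_absent = ["P"] * 7 + ["A"] * 3     # 30% absent
kept = keep_feature(mostly_present, 0.8)
removed = not keep_feature(often_absent, 0.8)
```

Here the first feature is kept and the second is removed. A feature with exactly 20% absent calls is also removed, following the "at least 20%" wording above.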

Input Files

Sample info file

  • The algorithm is sensitive to extra spaces, which in certain cases cause errors. Use the Find>Replace function in TextEdit or Excel to remove spaces.
  • Alternatively, follow the steps outlined in Creating a Sample Information File to copy sample identifiers exactly from the Excel data. Label the first three cells of Row 1 "Array", "Sample", and "Batch". Indicate batches and covariates, and save as tab-delimited text (.txt).
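
Before running the module, the sample info file can be checked against the rules above with a short script. The following Python sketch is not part of ComBat or GenePattern. The function name `check_sample_info` and the example file contents are hypothetical. It checks only the first three column labels, stray leading or trailing spaces, and the minimum of two samples per batch.

```python
import csv
import io
from collections import Counter

REQUIRED_LABELS = ["Array", "Sample", "Batch"]


def check_sample_info(text):
    """Return a list of problems found in tab-delimited sample info text."""
    rows = list(csv.reader(io.StringIO(text), delimiter="\t"))
    if not rows:
        return ["file is empty"]

    problems = []
    header, body = rows[0], rows[1:]
    if header[:3] != REQUIRED_LABELS:
        problems.append(
            f"first three column labels must be {REQUIRED_LABELS}, found {header[:3]}"
        )

    # Extra spaces around a value are a common cause of errors.
    for line_no, row in enumerate(body, start=2):
        for cell in row:
            if cell != cell.strip():
                problems.append(f"line {line_no}: extra space in {cell!r}")

    # Each batch needs at least two samples.
    batch_sizes = Counter(row[2].strip() for row in body if len(row) > 2)
    for batch, size in sorted(batch_sizes.items()):
        if size < 2:
            problems.append(f"batch {batch!r} has only {size} sample")
    return problems


example = (
    "Array\tSample\tBatch\tTreatment\n"
    "s1\t\t1\tcontrol\n"
    "s2\t\t1\ttreated\n"
    "s3\t\t2\tcontrol\n"
    "s4\t\t2\ttreated\n"
)
problems = check_sample_info(example)
```

For the example text, `problems` is an empty list. To check a real file, read its contents into a string first and pass that string to the function.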

Example Data

Platform Dependencies

Task Type:
Preprocess & Utilities

CPU Type:
any

Operating System:
any

Language:
R (v. 2.5.0)

Version Comments

Version  Release Date  Description
3.0      2014-06-03    Updated doc to html
2.0      2014-03-26    Updated to run on any OS
1.0      2008-08-18    Windows only version