GenePattern - Trimmomatic (v2) BETA

This module is currently in beta release. The module and/or documentation may be incomplete.

Provides a variety of options for trimming Illumina FASTQ files of adapter sequences and low-quality reads.

Author: Anthony Bolger et al., Usadel Lab, Rheinisch-Westfälische Technische Hochschule Aachen

Contact:

gp-help@broadinstitute.org

Algorithm Version: 0.32

Introduction

The GenePattern Trimmomatic module conducts quality-based trimming and filtering of FASTQ-formatted short read data produced by Illumina sequencers. The module can also be used to remove adapters and other Illumina technical sequences from the read sequences. The module operates on either paired-end or single-end data. With paired-end data the tool will maintain correspondence of read pairs and also use the additional information contained in paired reads to better find adapter sequences contaminating the read data. The module wraps the Trimmomatic command line tool [Bolger, et al., 2014]. Using the command line tool, a user specifies which trimming/filtering operations to employ and how the selected operations are to be ordered. The GenePattern Trimmomatic module directly exposes through its GUI six of the most frequently used trimming/filtering operations and enforces a particular relative ordering, supporting the most common usage scenarios. Through the extra.steps parameter, GenePattern users may directly specify Trimmomatic command line options and thus gain access to the underlying tool's full range of functionality.

Usage

The goal of FASTQ trimming and filtering is to remove low-quality base calls from reads, and to remove detrimental artifacts introduced into the reads by the sequencing process. The removal of low quality reads and contaminating sequences will improve processing by downstream tools such as aligners. The tool provides operations to detect and remove known adapter fragments (adapter.clip), remove low-quality regions from the start and end of the reads (trim.leading and trim.trailing), drop short reads (min.read.length), as well as operations with different quality-filtering strategies for removing low-quality bases within the reads (max.info and sliding.window).

Trimmomatic works with Illumina FASTQ files using phred33 or phred64 quality scores. The appropriate setting depends on the Illumina pipeline used. The default is phred33, which matches modern Illumina pipelines. Correct specification of the phred encoding is critical to successful trimming. The tool will incorrectly interpret the quality values in the FASTQ if the wrong encoding is specified.
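If you are unsure which encoding a FASTQ file uses, the range of quality characters often gives it away. The sketch below is a heuristic only and is not part of Trimmomatic or the GenePattern module; the function name guess_phred_encoding is hypothetical. It assumes the usual convention that phred33 stores quality Q as the ASCII character with code Q+33 and phred64 as the character with code Q+64.

```python
def guess_phred_encoding(quality_lines):
    """Guess the phred encoding from FASTQ quality strings (heuristic only)."""
    codes = [ord(ch) for line in quality_lines for ch in line]
    if not codes:
        raise ValueError("no quality characters supplied")
    # Characters below ASCII 59 cannot occur in phred64-style data,
    # so seeing one means the file must be phred33.
    if min(codes) < 59:
        return "phred33"
    # Only high characters were seen. This is consistent with phred64, but
    # uniformly high-quality phred33 data would look the same, so this is a guess.
    return "probably phred64"
```

Passing the quality lines of a few thousand reads is usually enough for the low-quality characters to show up if the data is phred33.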

The following operations are available directly from the module parameters. They will be executed in the following order, though all operations are optional:

1. adapter.clip
2. trim.leading
3. trim.trailing
4. max.info
5. sliding.window
6. min.read.length

In order to simplify the workflow in GenePattern, these operations always execute in the above order when specified through the module parameters. This order allows for the most common example workflows and also matches the general recommendations of the Trimmomatic documentation. The underlying trimming engine is much more flexible; if you have the need for that increased flexibility, it can be accessed through the extra.steps parameter.
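The fixed ordering can be pictured as a function that turns module parameters into Trimmomatic step arguments. This is a minimal sketch under assumptions: build_steps is a hypothetical helper, the module's real command-line construction is not documented here, and adapter.clip.min.length and adapter.clip.keep.both.reads are left out for brevity. The step names are the Trimmomatic operations named in this document.

```python
def build_steps(params):
    """Return Trimmomatic step arguments in the module's fixed order (sketch)."""
    steps = []
    clip_keys = (
        "adapter.clip.sequence.file",
        "adapter.clip.seed.mismatches",
        "adapter.clip.palindrome.clip.threshold",
        "adapter.clip.simple.clip.threshold",
    )
    # Adapter clipping is enabled only when all four values are present.
    if all(k in params for k in clip_keys):
        steps.append("ILLUMINACLIP:" + ":".join(str(params[k]) for k in clip_keys))
    if "trim.leading.quality.threshold" in params:
        steps.append(f"LEADING:{params['trim.leading.quality.threshold']}")
    if "trim.trailing.quality.threshold" in params:
        steps.append(f"TRAILING:{params['trim.trailing.quality.threshold']}")
    # max.info and sliding.window each need both of their parameters.
    if "max.info.target.length" in params and "max.info.strictness" in params:
        steps.append(
            f"MAXINFO:{params['max.info.target.length']}:{params['max.info.strictness']}"
        )
    if "sliding.window.size" in params and "sliding.window.quality.threshold" in params:
        steps.append(
            f"SLIDINGWINDOW:{params['sliding.window.size']}:"
            f"{params['sliding.window.quality.threshold']}"
        )
    if "min.read.length" in params:
        steps.append(f"MINLEN:{params['min.read.length']}")
    # Anything in extra.steps runs after the predefined operations.
    steps.extend(params.get("extra.steps", "").split())
    return steps
```

For example, supplying only min.read.length and extra.steps yields MINLEN followed by the extra steps, which is why a MINLEN placed in extra.steps is the one that runs last.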

A typical usage scenario involves operations 1, 2, 3, and 6 along with either the max.info or sliding.window operation (using both of these together is not recommended). The adapter.clip step is done first as the known adapter sequences are more likely to be recognized within the original read than in one that has been modified by another trimming step. The trim.leading and trim.trailing steps happen next and are often used with a very low phred threshold to quickly remove the special Illumina 'low-quality regions' at the start and end of the reads as a precursor to the subsequent, more sophisticated max.info and sliding.window quality-filtering operations. Finally, the min.read.length step is used to drop any read shorter than a desired length.

FastQC, used for quality assessment of the raw reads, includes an analysis of overrepresented sequences. When conducting this analysis, FastQC also checks to see whether any overrepresented sequences correspond to known Illumina adapter and primer sequences. If the resulting Overrepresented Sequences report flags matches with known adapter or primer sequences, these can be removed by using the adapter.clip step.

Of the two quality-filtering operations, max.info is newer and more sophisticated, and is recommended over the older sliding.window strategy by the Trimmomatic authors. One important feature of max.info is that it can be tuned to be more strict or tolerant based on the expected downstream use, where 'strict' applications favor stronger alignment accuracy (e.g., are more sensitive to base mismatches) and 'tolerant' applications favor longer reads (where downstream tools or analysis can tolerate or correct for larger numbers of mismatches or indels). Reference-based RNA-Seq would tend to be in the former category while assembly or variant finding would be in the latter. sliding.window, however, remains an effective method for quality-based trimming of RNA-Seq short reads; its input parameters are more easily interpreted than max.info's and there are established guidelines for their settings.

The module can also be used to convert into a specific phred encoding through the convert.phred.scores parameter. At least one processing step must be chosen, either from operations 1-6, convert.phred.scores or extra.steps.

For single-ended data, a single input file is specified and the module will create a single output file of trimmed/filtered reads. For paired-end data, two input files, one for each mate of the paired-end reads, are specified and the module will create four output files, two for the ‘paired’ output where both reads survived the processing, and two for the corresponding ‘unpaired’ output containing reads where only one of two paired reads survived trimming/filtering.
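The paired-end output rule can be sketched as a small routing function. This is an illustration only: route_pair is a hypothetical name, and the keys 1P, 2P, 1U and 2U are the output file suffixes described in the Output Files section.

```python
def route_pair(forward, reverse):
    """Decide which output file(s) a processed read pair goes to.

    `forward` and `reverse` are the trimmed reads, or None when a read
    was dropped during trimming/filtering.
    """
    if forward is not None and reverse is not None:
        return {"1P": forward, "2P": reverse}  # both mates survived
    if forward is not None:
        return {"1U": forward}                 # only the forward mate survived
    if reverse is not None:
        return {"2U": reverse}                 # only the reverse mate survived
    return {}                                  # both mates were dropped
```

Because a pair is written to the paired files only when both mates survive, the two paired outputs always contain the same number of reads in the same order.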

Details of each of the available steps are explained below in turn. For reference, the underlying operations are listed as well; these are further described in the Trimmomatic manual.

adapter.clip
This step cuts adapter and other specified sequences; it corresponds to the ILLUMINACLIP operation. The Trimmomatic manual contains a detailed discussion of how this mode works and how to specify your own adapter sequences; we encourage you to read that document for more details. This is the most complex of all the available trimming steps, so please check that documentation if there is any confusion. The illustrations are particularly informative. Bolger(2014) contains an even more detailed explanation.

While adapter contamination could result in the appearance of technical sequences anywhere in an RNA-seq read, the most common cause of Illumina adapter contamination is the sequencing of a cDNA read fragment that is shorter than the read length. In this scenario, the initial bases in the read will contain valid data, but when the end of the fragment is reached, the sequencer continues to read through the adapter, leading to a partial or full adapter sequence at the 3’ end of the read. This is known as “adapter read-through”. Adapter read-through is more likely to occur when employing longer read lengths, such as is possible with the Illumina MiSeq sequencing system.

Trimmomatic combines two approaches to detect technical sequences. The first, referred to as ‘simple mode’, conducts a local alignment of technical sequences (adapter sequences are specified in an adapter sequence file provided to the module - see adapter.clip.sequence.file in the Parameters section) against a read. If the alignment score exceeds a user-defined threshold (adapter.clip.simple.clip.threshold parameter), the portion of the read that aligns to the technical sequence plus the remainder of the read after the alignment (towards the 3’ direction) are trimmed from the read. Simple mode can detect any technical sequence at any location within the read; however, the user-defined threshold must be set sufficiently high to prevent false positives. Thus, ‘simple mode’ cannot detect the short overlaps between a read and technical sequence which often arise in cases of adapter read-through.

Trimmomatic’s second approach to technical sequence detection, referred to as “palindrome mode”, is specifically designed to detect the common adapter read-through scenario. Palindrome mode can only be used with paired-end data. When a read-through occurs, both reads in a pair will contain an equal number of valid bases (i.e., not from adapter sequences) followed by contaminating sequence from opposite adapters. The valid sequence in each of the pair’s reads will be reverse complements. Trimmomatic’s palindrome mode uses these characteristics to identify contaminating technical sequences arising from adapter read-through with high sensitivity and specificity. Operating in the palindrome mode, Trimmomatic prepends the Illumina adapter sequences to their respective reads in the paired-end data. The resulting sequences are then globally aligned against one another. A high scoring alignment (greater than adapter.clip.palindrome.clip.threshold) indicates that the first parts of each read are reverse complements of one another and the remaining parts of the reads match their respective adapters. Read bases matching the adapters are removed.
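The read-through property that palindrome mode exploits can be demonstrated with a toy simulation. This sketch does not reproduce Trimmomatic's alignment; the function names and the adapter strings used in the example are made-up placeholders, not real Illumina sequences.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]

def simulate_read_through(fragment, adapter_fwd, adapter_rev, read_length):
    """Simulate a read pair from a fragment shorter than the read length.

    Each read starts with valid fragment bases and then runs into the
    adapter attached to the far end of the fragment.
    """
    forward = (fragment + adapter_fwd)[:read_length]
    reverse = (reverse_complement(fragment) + adapter_rev)[:read_length]
    return forward, reverse

# Placeholder inputs: a 10-base fragment sequenced with 15-base reads.
fragment = "ACGTTGCAAC"
forward, reverse = simulate_read_through(fragment, "CCCCCCCC", "GGGGGGGG", 15)
valid = len(fragment)
# Both reads carry the same number of valid bases, and those valid parts
# are reverse complements of one another; the rest is adapter.
print(forward[:valid], reverse_complement(reverse[:valid]))
```

The two printed strings are identical, which is the signal a palindrome-style alignment looks for.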

Trimmomatic uses a “seed and extend” method for alignment detection and scoring in both the simple and palindrome modes. Initial sequence comparisons are done using 16-base fragments (seeds) from each sequence. If the number of mismatches between seeds from the two sequences is less than or equal to a specified threshold (see adapter.clip.seed.mismatches in the Parameters section), the full alignment scoring algorithm is run.
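The seed test amounts to counting mismatches between two 16-base fragments. The sketch below shows only that comparison, with a hypothetical function name; how Trimmomatic selects seeds and scores the extended alignment is not covered.

```python
SEED_LENGTH = 16  # seed size stated in this documentation

def seed_passes(read_seed, adapter_seed, max_mismatches):
    """Return True if two 16-base seeds differ at no more than max_mismatches positions."""
    if len(read_seed) != SEED_LENGTH or len(adapter_seed) != SEED_LENGTH:
        raise ValueError("seeds must be exactly 16 bases long")
    mismatches = sum(a != b for a, b in zip(read_seed, adapter_seed))
    return mismatches <= max_mismatches
```

With the recommended adapter.clip.seed.mismatches value of 2, a seed pair differing at two positions would go on to full scoring, while one differing at three would be skipped.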

Parameters relevant to this operation are: adapter.clip.sequence.file, adapter.clip.seed.mismatches, adapter.clip.palindrome.clip.threshold, adapter.clip.simple.clip.threshold, adapter.clip.min.length and adapter.clip.keep.both.reads. The adapter.clip.sequence.file, adapter.clip.seed.mismatches, adapter.clip.palindrome.clip.threshold, and adapter.clip.simple.clip.threshold parameters must all be specified in order to enable this step (though note that adapter.clip.palindrome.clip.threshold will not be used for single-ended data).
trim.leading
This step will remove low quality bases from the beginning of the reads, controlled by the trim.leading.quality.threshold parameter. As long as a base has a quality value below this threshold the base is removed and the next base will be investigated. This step corresponds to the LEADING operation and the threshold parameter represents a phred score. A low value of 3 can be used to remove only special Illumina 'low quality regions' (marked with a score of 2), while a value in the range of 10-15 can be used for a deeper quality-based trimming (15 being more conservative in terms of required quality). The table below translates phred quality scores (ranging from 10 to 15) to base call error probabilities.

Phred Quality score (standard Sanger variant)	base call error probability
10	0.1
11	0.08
12	0.06
13	0.05
14	0.04
15	0.03
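The table values follow from the standard phred definition, in which a quality score Q corresponds to an error probability of 10^(-Q/10). A short sketch (function names are illustrative) reproduces the table and shows how a phred33 quality character maps to a score:

```python
def phred_to_error_probability(q):
    """Convert a phred quality score to a base call error probability."""
    return 10 ** (-q / 10)

def phred33_char_to_score(ch):
    """Decode one phred33 quality character (ASCII code minus 33)."""
    return ord(ch) - 33

for q in range(10, 16):
    print(q, round(phred_to_error_probability(q), 2))
```

The special Illumina low-quality score of 2 corresponds to the phred33 character '#'.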
trim.trailing
This step will remove low quality bases from the end of the reads, controlled by the trim.trailing.quality.threshold parameter. As long as a base has a quality score below this threshold the base is removed and the next base will be investigated (moving in the 3' to 5' direction). This step corresponds to the TRAILING operation and the threshold parameter represents a phred score. As with trim.leading, this approach can also be used to remove the special Illumina 'low quality segment' regions or for a quality-based trimming using the same values described above.
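The leading and trailing rules described above can be sketched on a list of per-base phred scores. The function name is hypothetical and the code illustrates the rule as stated in this document rather than Trimmomatic's own implementation.

```python
def trim_leading_trailing(quals, leading_threshold, trailing_threshold):
    """Return (start, end) so that quals[start:end] is the retained region."""
    start = 0
    # Drop bases from the 5' end while they fall below the leading threshold.
    while start < len(quals) and quals[start] < leading_threshold:
        start += 1
    end = len(quals)
    # Drop bases from the 3' end while they fall below the trailing threshold.
    while end > start and quals[end - 1] < trailing_threshold:
        end -= 1
    return start, end
```

With a threshold of 3 on both ends, a read with scores [2, 2, 30, 30, 10, 2] keeps positions 2 to 4; the base with score 10 survives because it is not below 3.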
max.info
This step performs an "adaptive quality" trimming, balancing the benefits of retaining longer reads against the costs of retaining bases with errors; it corresponds to the MAXINFO operation. The discussion in Bolger(2014) provides a detailed description of max.info trimming and contrasts it with the sliding.window approach and suggests that max.info quality trimming should outperform (increased number of uniquely aligned reads) sliding.window. However, sliding.window is simpler, with established recommendations for input parameter values. The simultaneous use of sliding.window and max.info is not recommended.

Maximum Information Quality Trimming is an adaptive approach to quality-based trimming where the criterion for retaining the remaining bases in a read becomes increasingly more strict as one progresses through that read. The motivation for the adaptive approach is that, in many scenarios, the incremental value of retaining a read's additional bases is related to the read length. Very short reads are of little value since they are likely to align to multiple locations in a reference sequence; thus, it is beneficial to retain lower-quality bases early in a read so that the trimmed read is long enough to be informative. However, beyond a certain length, retaining additional bases is less beneficial and could even be detrimental if the retention of low-quality bases leads to the read becoming unmappable. Parameters relevant to this operation are max.info.target.length and max.info.strictness; both must be specified to enable max.info quality trimming.
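The trade-off can be illustrated with a toy scoring function. This is a loose sketch of the idea as we understand it from Bolger et al. (2014), not Trimmomatic's actual formula: it assumes a score built from a logistic length factor around the target length, a coverage term that grows with read length, and the accumulated probability that all retained bases are correct, with strictness weighting the last two. The function name and the 0.75 cap on error probability are assumptions made for this example.

```python
import math

def max_info_trim_length(quals, target_length, strictness):
    """Return the prefix length with the highest toy 'information' score."""
    best_len, best_score = 0, float("-inf")
    log_p_correct = 0.0  # log-probability that every retained base is correct
    for length, q in enumerate(quals, start=1):
        # Cap the error probability so that log(1 - p_err) is always defined.
        p_err = min(10 ** (-q / 10), 0.75)
        log_p_correct += math.log(1.0 - p_err)
        # Logistic factor: close to 0 for reads far below the target length.
        log_length_factor = -math.log1p(math.exp(target_length - length))
        # Low strictness rewards length; high strictness rewards correctness.
        log_score = (
            log_length_factor
            + (1.0 - strictness) * math.log(length)
            + strictness * log_p_correct
        )
        if log_score > best_score:
            best_len, best_score = length, log_score
    return best_len
```

In this toy model, a read with 50 high-quality bases followed by 50 very poor ones is cut at 50 when strictness is 0.8, because each additional poor base costs more in correctness than it gains in length.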
sliding.window
This step performs a "sliding window" trimming, cutting once the average quality within the window falls below a threshold; it corresponds to the SLIDINGWINDOW operation. By considering multiple bases, a single poor quality base will not cause removal of high quality data later in the read. The sliding.window.size parameter controls the size of this window (in bases) while the sliding.window.quality.threshold specifies the required average quality (as a phred value). Both parameters must be specified to enable this step; typical examples use a 4-base wide window, cutting when average quality drops below 15. The use of sliding.window at the same time as max.info is not recommended.
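A simplified version of the sliding-window rule is sketched below. It assumes the read is cut at the start of the first window whose average falls below the threshold and that reads shorter than the window are left untouched; Trimmomatic's handling of these edge cases may differ, and the function name is hypothetical.

```python
def sliding_window_cut(quals, window_size, quality_threshold):
    """Return the number of leading bases to keep (simplified sketch)."""
    for start in range(len(quals) - window_size + 1):
        window = quals[start:start + window_size]
        if sum(window) / window_size < quality_threshold:
            return start  # cut where the first failing window begins
    return len(quals)     # no window failed: keep the whole read
```

With a 4-base window and a threshold of 15, the read [30, 30, 2, 30, 30, 30] is kept whole, since every window containing the single poor base still averages 23.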
min.read.length
This step removes reads that fall below the minimum length specified by the min.read.length parameter. It should normally be used after all other processing steps, which is why it is presented last in this predefined list. This step corresponds to the MINLEN operation.

Use of the extra.steps parameter

Finally, any trimming operations specified in the extra.steps parameter will be performed after those in the above predefined list. Such operations must be specified using the exact syntax found in the Trimmomatic manual; use spaces to separate multiple operations. This allows you to perform operations in a different order than the list above, or to access other operations not presented here. Even when using extra.steps, it is still recommended that adapter.clip (ILLUMINACLIP) be performed first and that MINLEN be performed last. Note that because ILLUMINACLIP requires a file parameter it is highly inconvenient to use through extra.steps due to the need to specify a server-side file path. For this reason, it is best to use the adapter.clip parameters rather than specifying ILLUMINACLIP through extra.steps.

Trimmomatic supports three other trimming operations not presented in the predefined list above:

CROP: removes bases from the end of the read regardless of quality, leaving the read with (maximally) the specified length after cropping. Later steps might of course further shorten the read. Note that the parameter governs the total length rather than the number of bases to remove; this is in contrast to HEADCROP.
HEADCROP: removes the specified number of bases from the start of the read, regardless of quality. Note that the parameter governs the number of bases to remove rather than the total length; this is in contrast to CROP.
AVGQUAL: Drop the read if the average quality is below the specified level.

CROP and HEADCROP were left out of the above predefined list as their use is somewhat at odds with the other quality-based and adaptive approaches. Certain trimming strategies simply want to cut a certain number of bases from the start and/or end of every read and nothing more. Use these operations through extra.steps if that is your goal.

AVGQUAL was left out because its use is not well documented in the Trimmomatic manual, making it unclear where to place it in the overall order and leaving its use harder to explain. Our understanding is that it is similar to the sliding.window approach but always applied at the level of the entire read. As such, it is probably best to use only one of AVGQUAL or sliding.window.

Examples

The parameter setting recommendations are largely based on the Trimmomatic manual and the example included near the end. For paired-end data, the corresponding settings for this example would be:

input.file.1 and input.file.2: your paired-end FASTQs
adapter.clip.sequence.file: TruSeq3-PE.fa (Note: you need to choose this according to your platform)
adapter.clip.seed.mismatches: 2
adapter.clip.palindrome.clip.threshold: 30
adapter.clip.simple.clip.threshold: 10
trim.leading.quality.threshold: 3 (for trimming the special Illumina "low quality segment" regions)
trim.trailing.quality.threshold: 3 (as above)
sliding.window.size: 4
sliding.window.quality.threshold: 15
min.read.length: 36

For single-ended data you would provide only input.file.1 and leave input.file.2 blank, and use TruSeq3-SE.fa as the adapter.clip.sequence.file (again, adjusted according to your platform). The adapter.clip.palindrome.clip.threshold value of 30 should still be specified, though it will be ignored for this usage.

Bolger(2014) provides several examples of the use of a Maximum Information approach. It used the following settings for a 'strict' alignment application:

input.file.1 and input.file.2: your paired-end FASTQs
adapter.clip.sequence.file: TruSeq3-PE.fa (Note: you need to choose this according to your platform)
adapter.clip.seed.mismatches: 2
adapter.clip.palindrome.clip.threshold: 30
adapter.clip.simple.clip.threshold: 12
trim.leading.quality.threshold: 3
trim.trailing.quality.threshold: 3
max.info.target.length: 40
max.info.strictness: 0.999
min.read.length: 36

Finally, here is an example using extra.steps to perform a simple trimming of reads past the 45th base, followed by removal of the first 5 bases, and then dropping any reads with length under 36:

input.file.1 and input.file.2: your paired-end FASTQs
extra.steps: CROP:45 HEADCROP:5 MINLEN:36

Note that we are not recommending the last example as an ideal trimming approach. It is simply illustrative of the use of extra.steps and some of the additional Trimmomatic operations.
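The effect of CROP:45 HEADCROP:5 MINLEN:36 on read length can be checked with a few lines of string slicing. This sketch (with a hypothetical function name) only mirrors the semantics described above, where CROP sets a maximum total length and HEADCROP removes a fixed number of leading bases.

```python
def apply_crop_headcrop_minlen(read, crop, headcrop, minlen):
    """Apply CROP, then HEADCROP, then MINLEN; return None if the read is dropped."""
    read = read[:crop]      # CROP: keep at most `crop` bases from the start
    read = read[headcrop:]  # HEADCROP: remove the first `headcrop` bases
    return read if len(read) >= minlen else None

# A 100-base read ends up 40 bases long (45 kept, then 5 removed).
print(len(apply_crop_headcrop_minlen("A" * 100, 45, 5, 36)))
# A 38-base read is unaffected by CROP, shrinks to 33 bases, and is dropped.
print(apply_crop_headcrop_minlen("A" * 38, 45, 5, 36))
```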

References

Trimmomatic website
Trimmomatic manual. This documentation was largely adapted from the manual.
Bolger, A. M., Lohse, M., & Usadel, B. (2014). Trimmomatic: A flexible trimmer for Illumina Sequence Data. Bioinformatics, btu170.
Lohse M, Bolger AM, Nagel A, Fernie AR, Lunn JE, Stitt M, Usadel B. RobiNA: a user-friendly, integrated software solution for RNA-Seq-based transcriptomics. Nucleic Acids Res. 2012 Jul;40(Web Server issue):W622-7.

Parameters

Name	Description
input file 1 *	The input FASTQ to be trimmed. For paired-end data, this should be the forward ("*_1" or "left") input file.
input file 2	The reverse ("*_2" or "right") input FASTQ of paired-end data to be trimmed.
output filename base *	A base name to be used for the output files.
adapter clip sequence file	A FASTA file containing the adapter sequences, PCR sequences, etc. to be clipped. This parameter is required to enable adapter clipping. Files are provided for several Illumina pipelines but you can also provide your own; see the manual for details. Be sure to choose a PE file for paired-end data and an SE file for single-end data. See the manual for details on creating your own adapter sequence file.
adapter clip seed mismatches	Specifies the maximum mismatch count which will still allow a full match to be performed. A value of 2 is recommended. This parameter is required to enable adapter clipping.
adapter clip palindrome clip threshold	Specifies how accurate the match between the two 'adapter ligated' reads must be for PE palindrome read alignment. This is the log10 probability against getting a match by random chance; values around 30 or more are recommended. This parameter is required to enable adapter clipping.
adapter clip simple clip threshold	Specifies how accurate the match between any adapter etc. sequence must be against a read as a log10 probability against getting a match by random chance; values between 7-15 are recommended. This parameter is required to enable adapter clipping.
adapter clip min length	In addition to the alignment score, palindrome mode can verify that a minimum length of adapter has been detected. If unspecified, this defaults to 8 bases, for historical reasons. However, since palindrome mode has a very low false positive rate, this can be safely reduced, even down to 1, to allow shorter adapter fragments to be removed.
adapter clip keep both reads *	Controls whether to keep both forward and reverse reads when trimming in palindrome mode. The reverse read is the same as the forward but in reverse complement and so carries no additional information. The default is "yes" (retain the reverse read) which is useful when downstream tools cannot handle a combination of paired and unpaired reads.
trim leading quality threshold	Remove low quality bases from the beginning. As long as a base has a value below this threshold the base is removed and the next base will be investigated. See the Usage section above for recommendations.
trim trailing quality threshold	Remove low quality bases from the end. As long as a base has a value below this threshold the base is removed and the next trailing base will be investigated. See the Usage section above for recommendations.
max info target length	This parameter specifies the read length which is likely to allow the location of the read within the target sequence to be determined. A typical value for target length is 40.
max info strictness	This value, which should be set between 0 and 1, specifies the balance between preserving as much read length as possible vs. removal of incorrect bases. A low value of this parameter (<0.2) favors longer reads, while a high value (>0.8) favors read correctness. Both max.info.target.length and max.info.strictness are required for the Max Info quality trim. Examples presented in [Bolger, 2014] employ a value of 0.4 for "tolerant" applications and values from 0.9 all the way up to 0.999 for "strict" applications.
sliding window size	Perform a sliding window trimming, cutting once the average quality within the window falls below a threshold. By considering multiple bases, a single poor quality base will not cause the removal of high quality data later in the read. This parameter specifies the number of bases to average across. See the Usage section above for recommendations.
sliding window quality threshold	Specifies the average quality required for the sliding window trimming. Both sliding.window.size and sliding.window.quality.threshold are required to enable the sliding window trimming. See the Usage section above for recommendations.
min read length	Remove reads that fall below the specified minimal length.
extra steps	Extra steps to be performed after any other processing. These must be specified in exactly the format described in the Trimmomatic manual; see the documentation for details. This is recommended for advanced users only.
phred encoding *	Allows you to specify the phred quality encoding. The default is phred33, which matches modern Illumina pipelines.
convert phred scores	Convert phred scores into a particular encoding. Leave this blank for no conversion.
create trimlog *	Create a log of the trimming process. This gives details on what operations were performed, etc. but can be quite lengthy.

* - required

Input Files

<input.file.1>
The input FASTQ to be trimmed. For paired-end data, this should be the forward ("*_1") input file. Data compressed with gzip or bzip2 is also accepted and will be automatically detected based on the .bz2 or .gz extension.
<input.file.2>
The reverse ("*_2" or "right") input FASTQ to be trimmed. Data compressed with gzip or bzip2 is also accepted and will be automatically detected based on the .bz2 or .gz extension.
<adapter.clip.sequence.file>
A FASTA file containing the adapter sequences, PCR sequences, etc. to be clipped. This parameter is required to enable adapter clipping. Files are provided for several Illumina pipelines but you can also provide your own; see the manual for details. Be sure to choose a PE file for paired-end data and an SE file for single-ended data.
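Extension-based detection of compressed input can be mimicked when inspecting FASTQ files outside the module. The helper below is a sketch using Python's standard gzip and bz2 modules; it is not how the module itself reads files, and the name open_fastq is hypothetical.

```python
import bz2
import gzip

def open_fastq(path):
    """Open a FASTQ file as text, choosing the reader from the file extension."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt")
    if path.endswith(".bz2"):
        return bz2.open(path, "rt")
    return open(path, "r")
```

Because detection relies on the extension alone, a compressed file that lacks the .gz or .bz2 suffix would be read as plain text.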

Output Files

Use the output.filename.base to specify a base to be used in naming the output files that will be created. By default, this will be the name of input.file.1 with both the FASTQ (.fq or .fastq) and compression (.gz or .bz2) extensions removed. Also, if this name (minus extensions) ends in "_1", then this will also be removed to avoid producing output files with confusing names. For example, if input.file.1 is "my_reads_1.fastq.bz2" (presumably paired with "my_reads_2.fastq.bz2") then the module will use "my_reads" as the output.filename.base when creating output files. The names in the list below reflect this naming scheme.

Output FASTQ files will normally use the .fq extension, though if the original input.file.1 used the .fastq extension then this will be used instead. The names in the list below will use .fq with no compression extension, for the sake of uniformity.

FASTQ files compressed using either gzip or bzip2 are supported and are automatically identified by use of the .gz or .bz2 file extensions. Note: we have seen severe issues with Trimmomatic hanging indefinitely when asked to bz2-compress output and so this feature has been disabled; there are no issues with .bz2 input. If the input file is compressed (as either .gz or .bz2) then the output will be as well (though always using gzip).

<output.filename.base>_1P.fq and <output.filename.base>_2P.fq
FASTQ files holding the paired forward and reverse reads (respectively) where both reads in the pair survived all specified trimming steps. These files are only produced with paired-end input.
<output.filename.base>_1U.fq and <output.filename.base>_2U.fq
FASTQ files holding the "unpaired" forward and reverse reads (respectively), where one read in the pair survived all specified trimming steps but its partner did not; the surviving read is placed in the corresponding unpaired output FASTQ. These files are only produced with paired-end input.
<output.filename.base>-trimmed.fq
A FASTQ file holding the single-ended reads which survived all specified trimming steps. This file is only produced with single-ended input.
<output.filename.base>.trimlog.txt
A log of the trimming process giving details on what operations were performed and how they applied to each of the reads, etc. This can be quite lengthy and is not produced by default. To create this file, set the create.trimlog parameter to "yes".
cmdline.log
Shows the equivalent command-line call of Trimmomatic that was performed by the module.
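The default output.filename.base rule described at the start of this section can be expressed as a short function. This is a sketch inferred from that description, not the module's own code, and default_output_base is a hypothetical name.

```python
import os

def default_output_base(input_file_1):
    """Derive the default output base name from the name of input.file.1."""
    name = os.path.basename(input_file_1)
    # Remove a compression extension first, then a FASTQ extension.
    for ext in (".gz", ".bz2"):
        if name.endswith(ext):
            name = name[: -len(ext)]
            break
    for ext in (".fastq", ".fq"):
        if name.endswith(ext):
            name = name[: -len(ext)]
            break
    # Remove a trailing "_1" so paired outputs do not get confusing names.
    if name.endswith("_1"):
        name = name[:-2]
    return name
```

For the example above, "my_reads_1.fastq.bz2" gives the base "my_reads", from which names such as my_reads_1P and my_reads_2U are formed.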

Platform Dependencies

Task Type:
RNA-seq

CPU Type:
any

Operating System:
Linux, Mac

Language:
Java

Version Comments

Version	Release Date	Description
1.1	2016-11-01	configured convert.phred.scores parameter to expose blank option in pipeline designer
1	2014-09-23