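Before library construction, nucleic acids are randomly fragmented, and some (mRNA, for example) vary in length to begin with, so the inserts between the adapters vary in length as well. Second-generation sequencing generally uses a fixed read length, so some inserts shorter than the read length will inevitably be sequenced, and the resulting reads contain part or all of the adapter sequence. Adapter sequences therefore have to be detected, and the affected reads either filtered out or trimmed.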
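As a rough illustration of the point above, the number of read bases that run past the insert follows directly from the read length and the insert length. The function name and the example lengths are illustrative assumptions, not values taken from any dataset.

```python
def adapter_bases_in_read(insert_len: int, read_len: int) -> int:
    """Count read bases that lie beyond the insert, i.e. in adapter territory."""
    if insert_len < 0 or read_len <= 0:
        raise ValueError("insert_len must be >= 0 and read_len must be > 0")
    return max(0, read_len - insert_len)


if __name__ == "__main__":
    # Hypothetical insert lengths sequenced with 150 bp reads.
    for insert_len in (300, 150, 120, 40):
        n = adapter_bases_in_read(insert_len, read_len=150)
        print(f"insert {insert_len} bp, read 150 bp: {n} bases past the insert")
```

Only inserts shorter than the read length produce a non-zero value, which is exactly the case adapter trimming has to handle.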
Paired End:
You often don't need leading and trailing clipping. Also, in general keepBothReads can be useful when working with paired-end data: you will keep even redundant information, but this likely makes your pipelines more manageable. Note the additional :2 in front of keepBothReads; this is the minimum adapter length in palindrome mode, and you can even set it to 1. (The default is a very conservative 8.)
java -jar trimmomatic-0.39.jar PE input_forward.fq.gz input_reverse.fq.gz output_forward_paired.fq.gz output_forward_unpaired.fq.gz output_reverse_paired.fq.gz output_reverse_unpaired.fq.gz ILLUMINACLIP:TruSeq3-PE.fa:2:30:10:2:keepBothReads LEADING:3 TRAILING:3 MINLEN:36
For reference only (less sensitive for adapters):
java -jar trimmomatic-0.35.jar PE -phred33 input_forward.fq.gz input_reverse.fq.gz output_forward_paired.fq.gz output_forward_unpaired.fq.gz output_reverse_paired.fq.gz output_reverse_unpaired.fq.gz ILLUMINACLIP:TruSeq3-PE.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36
•Remove adapters (ILLUMINACLIP:TruSeq3-PE.fa:2:30:10)
•Remove leading low quality or N bases (below quality 3) (LEADING:3)
•Remove trailing low quality or N bases (below quality 3) (TRAILING:3)
•Scan the read with a 4-base wide sliding window, cutting when the average quality per base drops below 15 (SLIDINGWINDOW:4:15)
•Drop reads shorter than 36 bases (MINLEN:36)
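The sliding-window step in the list above can be sketched in a few lines of Python. This is a simplified model written for illustration, not Trimmomatic's own code: the function name is made up, the quality values are a toy example, and returning a read shorter than the window unchanged is an assumption of this sketch.

```python
def sliding_window_keep(quals, window_size=4, required_quality=15):
    """Return how many bases to keep from the 5' end of a read.

    Scans left to right and cuts at the start of the first window whose
    average quality falls below required_quality.
    """
    n = len(quals)
    if n < window_size:
        # Assumption of this sketch: too short to scan, keep as is.
        return n
    for start in range(n - window_size + 1):
        window = quals[start:start + window_size]
        if sum(window) / window_size < required_quality:
            return start
    return n


if __name__ == "__main__":
    toy_quals = [30, 30, 30, 30, 30, 10, 8, 5, 30, 30]
    keep = sliding_window_keep(toy_quals)  # SLIDINGWINDOW:4:15
    print(f"keep the first {keep} of {len(toy_quals)} bases")
```

In the toy read, the window starting at the fifth base (30, 10, 8, 5) is the first whose mean drops below 15, so everything from that base onward is cut.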
Single End:
java -jar trimmomatic-0.35.jar SE -phred33 input.fq.gz output.fq.gz ILLUMINACLIP:TruSeq3-SE.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36
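When these commands are driven from a script, it can help to assemble the argument list programmatically so that the four paired-end output files always appear in the right order. The helpers below are a minimal sketch: the function names and the out_prefix naming scheme are assumptions for illustration, and only options that appear in the commands above are used. Nothing is executed; the list can be handed to subprocess.run if desired.

```python
import shlex


def pe_command(jar, forward, reverse, out_prefix, steps):
    """Build the argument list for a paired-end run (not executed here)."""
    outputs = [
        f"{out_prefix}_{name}.fq.gz"
        for name in ("forward_paired", "forward_unpaired",
                     "reverse_paired", "reverse_unpaired")
    ]
    return ["java", "-jar", jar, "PE", "-phred33",
            forward, reverse, *outputs, *steps]


def se_command(jar, reads, output, steps):
    """Build the argument list for a single-end run (not executed here)."""
    return ["java", "-jar", jar, "SE", "-phred33", reads, output, *steps]


if __name__ == "__main__":
    cmd = pe_command(
        "trimmomatic-0.39.jar",
        "input_forward.fq.gz",
        "input_reverse.fq.gz",
        "output",
        ["ILLUMINACLIP:TruSeq3-PE.fa:2:30:10:2:keepBothReads",
         "LEADING:3", "TRAILING:3", "MINLEN:36"],
    )
    print(shlex.join(cmd))
```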
Description
•ILLUMINACLIP: Cut adapter and other Illumina-specific sequences from the read. This step filters Illumina adapter and primer sequences out of the reads and determines whether the R2 of a reverse-complementary R1/R2 pair is dropped.
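In PE sequencing, if the library insert is shorter than the read length, the non-adapter portions of read1 and read2 are exact reverse complements of each other. Trimmomatic has a 'palindrome' mode that exploits this property to remove adapter sequences.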
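The reverse-complement property is easy to check on a toy fragment. In the sketch below the insert, the 13-base adapter start and the read length are made-up illustration values, and simulate_pair is a hypothetical helper that models error-free reads only.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Reverse complement of a DNA string (A/C/G/T only)."""
    return seq.translate(COMPLEMENT)[::-1]


def simulate_pair(insert, adapter_after_r1, adapter_after_r2, read_len):
    """Error-free R1/R2 for a fragment whose insert is shorter than read_len.

    adapter_after_r1 / adapter_after_r2 are the adapter bases each read runs
    into once it has passed the end of the insert.
    """
    r1 = (insert + adapter_after_r1)[:read_len]
    r2 = (revcomp(insert) + adapter_after_r2)[:read_len]
    return r1, r2


if __name__ == "__main__":
    insert = "ACGTTGCAAG"            # toy 10 bp insert
    adapter = "AGATCGGAAGAGC"        # toy adapter start, used for both ends
    r1, r2 = simulate_pair(insert, adapter, adapter, read_len=15)
    print("R1:", r1)
    print("R2:", r2)
    # The first len(insert) bases of the two reads are reverse complements.
    print(r1[:len(insert)] == revcomp(r2[:len(insert)]))
```

Everything after the first len(insert) bases of either read is adapter, which is what palindrome mode goes on to clip.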
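The four modes in which Trimmomatic removes adapters and primers:
Red bars: sequence that is cut off
Green bars: the useful read sequence that is kept
Dark blue bars: adapter sequence
Light blue bars: primer sequence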
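Mode A: the read contains the complete adapter sequence right from its first base. By the principle of Illumina sequencing, such a read cannot contain any useful sequence, so the whole read is discarded.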
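Mode B: this case is relatively common. Because the library insert is shorter than the read length, the read contains part of the adapter sequence at its end. If that adapter portion is long enough, it can be recognized and removed; if it is too short, shorter than the minimum length implied by the adapter-matching parameters, it cannot be removed. With PE sequencing, very short adapter sequences at the read ends can be removed as in mode D.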
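Mode C: this case can occur in PE sequencing. Parts of the forward and reverse reads are exact reverse complements, but the library molecule is empty: the two adapters are ligated directly to each other. Such reads contain no useful sequence, and both the forward and the reverse read are discarded.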
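Mode D: this is the showcase of how Trimmomatic uses PE sequencing to remove short adapter sequences. If the library insert is shorter than the read length, a stretch of bases in the forward and reverse reads is exactly reverse-complementary. Trimmomatic exploits this by aligning the two adapter sequences against the reads while also aligning the two reads against each other, so that even 1 bp of adapter at the 3' end can be removed accurately. This removes adapter contamination more thoroughly than mode B.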
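Why mode B misses short adapter remnants can be seen with a small sketch of simple-mode matching at the 3' end. It uses exact matching only and the per-base score of 0.6 quoted elsewhere in this text; the adapter string, the threshold of 10 and the function name are illustrative assumptions, and this is not Trimmomatic's actual implementation.

```python
MATCH_SCORE = 0.6  # per matching base, as quoted in the text

# Toy adapter sequence used for illustration only.
ADAPTER = "AGATCGGAAGAGCACACGTCTGAACTCCAGTCA"


def find_3prime_adapter(read, adapter, threshold=10.0):
    """Return the index where the adapter starts in the read, or None.

    Simplification: the read suffix must match the adapter prefix exactly,
    and the match is accepted only if its score reaches the threshold.
    """
    for start in range(len(read)):
        overlap = min(len(read) - start, len(adapter))
        if read[start:start + overlap] != adapter[:overlap]:
            continue
        if overlap * MATCH_SCORE >= threshold:
            return start
    return None


if __name__ == "__main__":
    insert = "ACGTTGCAAGCTTGACCTGA"
    long_tail = insert + ADAPTER[:20]   # 20 adapter bases, score 12.0
    short_tail = insert + ADAPTER[:5]   # 5 adapter bases, score 3.0
    print(find_3prime_adapter(long_tail, ADAPTER))   # adapter found
    print(find_3prime_adapter(short_tail, ADAPTER))  # too short to call
    print(find_3prime_adapter(ADAPTER, ADAPTER))     # mode A: starts at 0
```

A 5-base remnant can score at most 3.0, well below any threshold in the recommended 7 to 15 range, which is why only the paired-end palindrome approach (mode D) can remove it.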
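Trimmomatic uses a two-step strategy, similar to that of sequence aligners (for example Isaac aligner, an ultra-fast alignment tool), to search for potential adapter sequences. First, a seed taken from the adapter sequence (seed length at most 16 bp) is aligned against the read. If the seed aligns well enough (as determined by the seedMismatch parameter), the second step is triggered: the full-length adapter is aligned against the read. The first-step seed search is fast and filters out reads without adapter contamination, which makes the two-step adapter search very efficient.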
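The seed-then-extend idea can be sketched as follows. This is a deliberately simplified model: it takes the first 16 bases of the adapter as the only seed, compares by plain Hamming distance, and returns raw match and mismatch counts instead of a score. The function names and the toy sequences are assumptions, and Trimmomatic's real seed handling is more elaborate.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def two_step_search(read, adapter, seed_len=16, seed_mismatches=2):
    """Return (position, matches, mismatches) of the first hit, or None."""
    seed = adapter[:seed_len]
    for pos in range(len(read) - len(seed) + 1):
        # Step 1: cheap seed comparison; most positions are rejected here.
        if hamming(read[pos:pos + len(seed)], seed) > seed_mismatches:
            continue
        # Step 2: compare the whole overlap between adapter and read.
        overlap = min(len(read) - pos, len(adapter))
        mismatches = hamming(read[pos:pos + overlap], adapter[:overlap])
        return pos, overlap - mismatches, mismatches
    return None


if __name__ == "__main__":
    adapter = "AGATCGGAAGAGCACACGTCTGAACTCCAGTCA"  # toy adapter
    insert = "ACGTTGCAAGCTTGACCTGA"
    # 20 adapter bases with one sequencing error inside the seed region.
    read = insert + "AGATCGGTAGAGCACACGTC"
    print(two_step_search(read, adapter))
    print(two_step_search(insert, adapter))
```

Because this sketch seeds only from the start of the adapter, adapter remnants shorter than the seed at the very end of a read are invisible to it.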
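In the second step, when the full-length alignment between the adapter and the read is scored, the penalty scheme takes the base quality Q into account: each matching base adds 0.6, and each mismatching base subtracts Q/10. Taking base quality into account reduces the impact that mismatches at low-quality bases (which have a high sequencing error rate) have on the overall alignment score. Under this rule, a 12 bp adapter sequence aligned perfectly to a read scores 7.2, and a 25 bp adapter sequence scores 15. The recommended value for the simple clip threshold in ILLUMINACLIP is therefore between 7 and 15 (this is the alignment score threshold for modes A/B in the figure above).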
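The scoring rule above is small enough to write out directly. The sketch below uses the rounded per-base value 0.6 from the text and made-up quality values; alignment_score is a hypothetical helper, not a Trimmomatic function.

```python
def alignment_score(read_part, adapter_part, quals, match_score=0.6):
    """Score an ungapped alignment: +0.6 per match, -Q/10 per mismatch."""
    score = 0.0
    for read_base, adapter_base, q in zip(read_part, adapter_part, quals):
        if read_base == adapter_base:
            score += match_score
        else:
            score -= q / 10
    return score


if __name__ == "__main__":
    adapter = "AGATCGGAAGAGCACACGTCTGAAC"  # toy 25-base adapter start
    print(round(alignment_score(adapter[:12], adapter[:12], [30] * 12), 1))
    print(round(alignment_score(adapter, adapter, [30] * 25), 1))

    # One mismatch at a confident base (Q30) costs far more than the same
    # mismatch at a doubtful base (Q5).
    read = "T" + adapter[1:]
    print(round(alignment_score(read, adapter, [30] * 25), 1))
    print(round(alignment_score(read, adapter, [5] + [30] * 24), 1))
```

The last two lines show the quality weighting at work: with 24 matches, a Q30 mismatch pulls the score to 11.4, while a Q5 mismatch leaves it at 13.9.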
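For palindromic alignments (mode D in the figure above), the alignable sequence is longer, so the score threshold is set higher to keep adapter recognition accurate. For example, if 50 bp of R1 and R2 match as reverse complements, the score is 30. In this mode Trimmomatic can recognize and remove very short adapter sequences from the reads.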
•SLIDINGWINDOW: Perform a sliding window trimming, cutting once the average quality within the window falls below a threshold.
•LEADING: Cut bases off the start of a read, if below a threshold quality
•TRAILING: Cut bases off the end of a read, if below a threshold quality
•CROP: Cut the read to a specified length
•HEADCROP: Cut the specified number of bases from the start of the read
•MINLEN: Drop the read if it is below a specified length
•TOPHRED33: Convert quality scores to Phred-33
•TOPHRED64: Convert quality scores to Phred-64
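The two quality-encoding conversions at the end of this list amount to shifting each quality character by a fixed ASCII offset. The sketch below shows the arithmetic on a toy quality string; the function names are made up for illustration.

```python
def convert_quality(qual, from_offset, to_offset):
    """Re-encode a FASTQ quality string from one ASCII offset to another."""
    scores = [ord(ch) - from_offset for ch in qual]
    if min(scores, default=0) < 0:
        raise ValueError("quality string is not valid for from_offset")
    return "".join(chr(score + to_offset) for score in scores)


def to_phred64(qual33):
    return convert_quality(qual33, 33, 64)


def to_phred33(qual64):
    return convert_quality(qual64, 64, 33)


if __name__ == "__main__":
    qual33 = "IIII#"                 # Q40 x 4, then Q2, in Phred-33
    qual64 = to_phred64(qual33)
    print(qual64)
    print(to_phred33(qual64) == qual33)
```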
Step options:
ILLUMINACLIP:[fastaWithAdaptersEtc]:[seed mismatches]:[palindrome clip threshold]:[simple clip threshold]:[minAdapterLength]:[keepBothReads]
- fastaWithAdaptersEtc: specifies the path to a fasta file containing all the adapters, PCR sequences etc. The naming of the various sequences within this file determines how they are used. See below.
- seedMismatches: specifies the maximum mismatch count which will still allow a full match to be performed, that is, the number of mismatched bases allowed in the first-step seed search, for example 2.
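- palindromeClipThreshold: specifies how accurate the match between the two ‘adapter ligated’ reads must be for PE palindrome read alignment, that is, the minimum alignment score between R1 and R2 (mode D in the figure above) required before adapters are clipped, for example 30.
- simpleClipThreshold: specifies how accurate the match between any adapter etc. sequence must be against a read, that is, the minimum alignment score for clipping an adapter sequence (modes A/B in the figure above), usually between 7 and 15.
- minAdapterLength: only applies to the palindrome clip mode of PE sequencing. It specifies the shortest adapter length that palindrome mode will clip. For historical reasons the default is 8, but palindrome mode can in fact remove adapter contamination as short as 1 bp, so this can be set to 1.
- keepBothReads: only applies to the palindrome clip mode of PE sequencing. This parameter is important. In mode D of the figure above, what remains of R1 and R2 after adapter removal is exactly reverse-complementary. The default, false, means that the R2 that is fully reverse-complementary to R1 is removed entirely as redundant. In some cases, however, such as a bowtie2 workflow that requires paired reads, this parameter has to be set to true, otherwise part of the paired reads is lost.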
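To make the field order and the defaults concrete, here is a small parser for an ILLUMINACLIP step string. It is a sketch under stated assumptions: the class and function names are invented, thresholds are treated as integers as in the examples in this text, paths containing a colon are not handled, and it accepts both spellings of the last field that appear in the commands in this document (true and the literal keepBothReads). How a particular Trimmomatic version interprets the literal keepBothReads is not verified here.

```python
from dataclasses import dataclass


@dataclass
class IlluminaClip:
    fasta: str
    seed_mismatches: int
    palindrome_clip_threshold: int
    simple_clip_threshold: int
    min_adapter_length: int = 8      # default stated in the text
    keep_both_reads: bool = False    # default stated in the text


def parse_illuminaclip(step):
    """Split an ILLUMINACLIP step into named fields (illustrative only)."""
    name, *fields = step.split(":")
    if name != "ILLUMINACLIP" or not 4 <= len(fields) <= 6:
        raise ValueError(f"not a valid ILLUMINACLIP step: {step!r}")
    clip = IlluminaClip(fields[0], int(fields[1]), int(fields[2]), int(fields[3]))
    if len(fields) >= 5:
        clip.min_adapter_length = int(fields[4])
    if len(fields) == 6:
        # Assumption of this sketch: both spellings mean "keep both reads".
        clip.keep_both_reads = fields[5].lower() in ("true", "keepbothreads")
    return clip


if __name__ == "__main__":
    print(parse_illuminaclip("ILLUMINACLIP:TruSeq3-PE.fa:2:30:10"))
    print(parse_illuminaclip("ILLUMINACLIP:TruSeq3-PE.fa:2:30:10:8:true"))
```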
$ java -jar trimmomatic-0.36.jar PE -phred33 F-2-test_R1.fastq.gz F-2-test_R2.fastq.gz -baseout F-2.fastq.gz ILLUMINACLIP:TruSeq3-PE.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:51
ILLUMINACLIP: Using 1 prefix pairs, 0 forward/reverse sequences, 0 forward only sequences, 0 reverse only sequences
Input Read Pairs: 2500 Both Surviving: 1633 (65.32%) Forward Only Surviving: 828 (33.12%) Reverse Only Surviving: 12 (0.48%) Dropped: 27 (1.08%)
TrimmomaticPE: Completed successfully
# With the default value (false) of ILLUMINACLIP's sixth parameter, only 65.32% of the read pairs survive as pairs
$ java -jar trimmomatic-0.36.jar PE -phred33 F-2-test_R1.fastq.gz F-2-test_R2.fastq.gz -baseout F-2.fastq.gz ILLUMINACLIP:TruSeq3-PE.fa:2:30:10:8:true LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:51
ILLUMINACLIP: Using 1 prefix pairs, 0 forward/reverse sequences, 0 forward only sequences, 0 reverse only sequences
Input Read Pairs: 2500 Both Surviving: 2439 (97.56%) Forward Only Surviving: 22 (0.88%) Reverse Only Surviving: 16 (0.64%) Dropped: 23 (0.92%)
TrimmomaticPE: Completed successfully
# With ILLUMINACLIP's sixth parameter set to true and all other parameters unchanged, 97.56% of the read pairs survive as pairs
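The two summary lines above can be compared programmatically. The sketch below parses them with a regular expression written against the exact format shown in this text; the pattern and function name are assumptions and may need adjusting for other Trimmomatic versions.

```python
import re

SUMMARY = re.compile(
    r"Input Read Pairs: (?P<total>\d+) "
    r"Both Surviving: (?P<both>\d+) \([\d.]+%\) "
    r"Forward Only Surviving: (?P<forward_only>\d+) \([\d.]+%\) "
    r"Reverse Only Surviving: (?P<reverse_only>\d+) \([\d.]+%\) "
    r"Dropped: (?P<dropped>\d+) \([\d.]+%\)"
)


def parse_summary(line):
    """Extract the read-pair counts from a TrimmomaticPE summary line."""
    match = SUMMARY.search(line)
    if match is None:
        raise ValueError("line does not look like a PE summary")
    return {key: int(value) for key, value in match.groupdict().items()}


if __name__ == "__main__":
    default_run = parse_summary(
        "Input Read Pairs: 2500 Both Surviving: 1633 (65.32%) "
        "Forward Only Surviving: 828 (33.12%) "
        "Reverse Only Surviving: 12 (0.48%) Dropped: 27 (1.08%)"
    )
    keep_both_run = parse_summary(
        "Input Read Pairs: 2500 Both Surviving: 2439 (97.56%) "
        "Forward Only Surviving: 22 (0.88%) "
        "Reverse Only Surviving: 16 (0.64%) Dropped: 23 (0.92%)"
    )
    rescued = keep_both_run["both"] - default_run["both"]
    print(f"pairs rescued by keepBothReads: {rescued}")
```

For these two runs the gain in surviving pairs equals the drop in forward-only survivors (828 to 22), which is consistent with the default behaviour of discarding the redundant R2.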
•SLIDINGWINDOW:
- windowSize: specifies the number of bases to average across.
- requiredQuality: specifies the average quality required.
•LEADING:
- quality: specifies the minimum quality required to keep a base.
•TRAILING:
- quality: specifies the minimum quality required to keep a base.
•CROP:
- length: the number of bases to keep, from the start of the read. Bases beyond this length are all removed, so every read is cut down to at most the same length.
•HEADCROP:
- length: the number of bases to remove from the start of the read.
•MINLEN:
- length: specifies the minimum length of reads to be kept.
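The remaining step options are simple enough to model with list slicing. The functions below are an illustrative sketch operating on a (sequence, qualities) pair; the names and the toy read are assumptions, and edge-case behaviour in Trimmomatic itself may differ.

```python
def leading(seq, quals, quality):
    """Drop bases from the start while their quality is below the threshold."""
    start = 0
    while start < len(quals) and quals[start] < quality:
        start += 1
    return seq[start:], quals[start:]


def trailing(seq, quals, quality):
    """Drop bases from the end while their quality is below the threshold."""
    end = len(quals)
    while end > 0 and quals[end - 1] < quality:
        end -= 1
    return seq[:end], quals[:end]


def crop(seq, quals, length):
    """Keep only the first `length` bases."""
    return seq[:length], quals[:length]


def headcrop(seq, quals, length):
    """Remove the first `length` bases."""
    return seq[length:], quals[length:]


def minlen(seq, quals, length):
    """Return the read unchanged, or None if it is shorter than `length`."""
    return (seq, quals) if len(seq) >= length else None


if __name__ == "__main__":
    read = ("NACGTACGTN", [2, 30, 30, 30, 30, 30, 30, 30, 30, 2])
    read = leading(*read, quality=3)    # LEADING:3
    read = trailing(*read, quality=3)   # TRAILING:3
    print(read[0])
    print(headcrop(*read, length=2)[0])  # HEADCROP:2
    print(crop(*read, length=4)[0])      # CROP:4
    print(minlen(*read, length=36))      # MINLEN:36 drops this short read
```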
The Adapter Fasta
Illumina adapter and other technical sequences are copyrighted by Illumina, but we have been granted permission to distribute them with Trimmomatic. Suggested adapter sequences are provided for TruSeq2 (as used in GAII machines) and TruSeq3 (as used by HiSeq and MiSeq machines), for both single-end and paired-end mode. These sequences have not been extensively tested, and depending on specific issues which may occur in library preparation, other sequences may work better for a given dataset.
To make a custom version of fasta, you must first understand how it will be used. Trimmomatic uses two strategies for adapter trimming: Palindrome and Simple.
With 'simple' trimming, each adapter sequence is tested against the reads, and if a sufficiently accurate match is detected, the read is clipped appropriately.
'Palindrome' trimming is specifically designed for the case of ‘reading through’ a short fragment into the adapter sequence on the other end. In this approach, the appropriate adapter sequences are ‘in silico ligated’ onto the start of the reads, and the combined adapter+read sequences, forward and reverse, are aligned. If they align in a manner which indicates ‘read-through’, the forward read is clipped and the reverse read dropped (since it contains no new data).
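The ‘in silico ligation’ idea can be sketched in Python. This is a simplified model and not Trimmomatic's algorithm: it tries every candidate insert length, prepends the matching tail of each prefix sequence to its read, and scores the forward construct against the reverse complement of the reverse construct. The prefix sequences are random toy strings, every base is assumed to have quality 30, and all names are hypothetical.

```python
import random

COMPLEMENT = str.maketrans("ACGT", "TGCA")
MATCH_SCORE = 0.6      # per matching base, rounded value quoted in the text
ASSUMED_QUALITY = 30   # assumption: every base has Q30


def revcomp(seq):
    return seq.translate(COMPLEMENT)[::-1]


def score(a, b):
    """+0.6 per match, -Q/10 per mismatch, ungapped."""
    return sum(MATCH_SCORE if x == y else -ASSUMED_QUALITY / 10
               for x, y in zip(a, b))


def palindrome_insert_length(r1, r2, prefix1, prefix2, threshold=30.0):
    """Return the inferred insert length on read-through, else None."""
    read_len = min(len(r1), len(r2))
    best = None
    for insert_len in range(read_len):
        # m = number of adapter bases each read would contain.
        m = min(read_len - insert_len, len(prefix1), len(prefix2))
        forward = prefix1[-m:] + r1[:insert_len + m]
        reverse = revcomp(prefix2[-m:] + r2[:insert_len + m])
        s = score(forward, reverse)
        if s >= threshold and (best is None or s > best[1]):
            best = (insert_len, s)
    return None if best is None else best[0]


def simulate_pair(insert, prefix1, prefix2, read_len):
    """Error-free reads from the molecule prefix1 + insert + revcomp(prefix2)."""
    r1 = (insert + revcomp(prefix2))[:read_len]
    r2 = (revcomp(insert) + revcomp(prefix1))[:read_len]
    return r1, r2


if __name__ == "__main__":
    rng = random.Random(0)
    rand_seq = lambda n: "".join(rng.choice("ACGT") for _ in range(n))
    p1, p2 = rand_seq(20), rand_seq(20)          # toy prefix sequences
    r1, r2 = simulate_pair(rand_seq(59), p1, p2, read_len=60)
    k = palindrome_insert_length(r1, r2, p1, p2)
    print("inferred insert length:", k)          # one adapter base to clip
    print("clipped forward read length:", len(r1[:k]))
```

Because the insert itself contributes to the alignment score, a single adapter base at the end of a 60-base read is still detectable. After detection the forward read is cut to the insert length, and the reverse read is either dropped or cut the same way, depending on keepBothReads.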
Naming of the sequences indicates how they should be used. For 'Palindrome' clipping, the sequence names should both start with ‘Prefix’, and end in ‘/1’ for the forward adapter and ‘/2’ for the reverse adapter. All other sequences are checked using 'simple' mode. Sequences with names ending in ‘/1’ or ‘/2’ will be checked only against the forward or reverse read. Sequences not ending in ‘/1’ or ‘/2’ will be checked against both the forward and reverse read.
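The naming rules can be expressed as a small classifier over FASTA record names. This is a sketch of the rules as described above, not of Trimmomatic's parser: the function name and the example names are illustrative, and treating an unpaired ‘Prefix’ name as an ordinary forward-only or reverse-only sequence is an assumption.

```python
def classify_adapter_names(names):
    """Group FASTA record names by how the text says they are used."""
    groups = {"prefix_pairs": [], "forward_only": [],
              "reverse_only": [], "both": []}
    name_set = set(names)
    for name in names:
        base, _, suffix = name.rpartition("/")
        paired_prefix = (
            name.startswith("Prefix")
            and suffix in ("1", "2")
            and f"{base}/1" in name_set
            and f"{base}/2" in name_set
        )
        if paired_prefix:
            # Record each palindrome pair once, when its /1 member is seen.
            if suffix == "1":
                groups["prefix_pairs"].append((f"{base}/1", f"{base}/2"))
        elif name.endswith("/1"):
            groups["forward_only"].append(name)
        elif name.endswith("/2"):
            groups["reverse_only"].append(name)
        else:
            groups["both"].append(name)
    return groups


if __name__ == "__main__":
    # Hypothetical record names for illustration.
    names = ["PrefixPE/1", "PrefixPE/2", "AdapterA/1", "AdapterB/2", "Universal"]
    for group, members in classify_adapter_names(names).items():
        print(f"{group}: {members}")
```

A file containing only one ‘Prefix’ pair would be reported by this sketch as one prefix pair and nothing else, matching the shape of the "Using 1 prefix pairs, 0 forward/reverse sequences, ..." log line shown earlier.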
The thresholds used are a simplified log-likelihood approach. Each matching base adds just over 0.6, while each mismatch reduces the alignment score by Q/10. Therefore, a perfect match of a 12 base sequence will score just over 7, while 25 bases are needed to score 15. As such we recommend values between 7 and 15 for this parameter. For palindromic matches, a longer alignment is possible, therefore this threshold can be higher, in the range of 30. The ‘seed mismatch’ parameter is used to make alignments more efficient, specifying the maximum base mismatch count in the ‘seed’ (16 bases). Typical values here are 1 or 2.