Understanding Trimmomatic parameters

Parameters and what they do:

ILLUMINACLIP:TruSeq3-PE.fa:2:30:10

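Removes adapters. TruSeq3-PE.fa is the fastaWithAdaptersEtc file; different situations call for different files. 2 is seedMismatches, the maximum number of mismatches that still allows a full match to be performed. 30 is palindromeClipThreshold, which specifies how accurate the match between the two 'adapter ligated' reads must be for PE palindrome read alignment. 10 is simpleClipThreshold, which specifies how accurate the match between any adapter etc. sequence must be against a read.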

  • Remove adapters (ILLUMINACLIP:TruSeq3-PE.fa:2:30:10)
  • ILLUMINACLIP:<fastaWithAdaptersEtc>:<seed mismatches>:<palindrome clip threshold>:<simple clip threshold>

    • fastaWithAdaptersEtc: specifies the path to a fasta file containing all the adapters, PCR sequences etc. The naming of the various sequences within this file determines how they are used. See below.
    • seedMismatches: specifies the maximum mismatch count which will still allow a full match to be performed
    • palindromeClipThreshold: specifies how accurate the match between the two 'adapter ligated' reads must be for PE palindrome read alignment.
    • simpleClipThreshold: specifies how accurate the match between any adapter etc. sequence must be against a read.
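
To make the field order concrete, here is a minimal Python sketch that splits an ILLUMINACLIP step string into named fields. The class and function names are made up for illustration and are not part of Trimmomatic. The two optional trailing fields (minimum adapter length and keepBothReads) are the ones discussed later in this post. The sketch assumes the adapter path itself contains no colon.

```python
from typing import NamedTuple, Optional


class IlluminaClip(NamedTuple):
    """Named view of one ILLUMINACLIP step (hypothetical helper type)."""
    fasta_with_adapters: str
    seed_mismatches: int
    palindrome_clip_threshold: int
    simple_clip_threshold: int
    min_adapter_length: Optional[int] = None  # optional 5th field
    keep_both_reads: Optional[bool] = None    # optional 6th field


def parse_illuminaclip(step: str) -> IlluminaClip:
    """Split an ILLUMINACLIP step string into its colon-separated fields."""
    name, *fields = step.split(":")
    if name != "ILLUMINACLIP" or not 4 <= len(fields) <= 6:
        raise ValueError(f"not a valid ILLUMINACLIP step: {step!r}")
    min_len = int(fields[4]) if len(fields) > 4 else None
    keep_both = (fields[5].lower() == "true") if len(fields) > 5 else None
    return IlluminaClip(
        fields[0], int(fields[1]), int(fields[2]), int(fields[3]),
        min_len, keep_both,
    )


if __name__ == "__main__":
    print(parse_illuminaclip("ILLUMINACLIP:TruSeq3-PE.fa:2:30:10"))
```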

LEADING:3

Removes low-quality bases (quality below 3) and N bases from the leading end (the start) of the read.

  • Remove leading low quality or N bases (below quality 3) (LEADING:3)
  • LEADING:<quality>

    • quality: Specifies the minimum quality required to keep a base.

TRAILING:3

Removes low-quality bases (quality below 3) and N bases from the trailing end (the end) of the read.

  • Remove trailing low quality or N bases (below quality 3) (TRAILING:3)
  • TRAILING:<quality>

    • quality: Specifies the minimum quality required to keep a base.
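
The LEADING and TRAILING steps can be pictured with the following Python sketch. It is an illustration of the idea, not Trimmomatic's code: the function names are hypothetical, qualities are passed as a list of integers, and only the quality value is tested, on the assumption that N calls carry a quality below the threshold.

```python
def trim_leading(seq, quals, min_quality):
    """Drop bases from the start of the read while their quality is too low."""
    start = 0
    while start < len(seq) and quals[start] < min_quality:
        start += 1
    return seq[start:], quals[start:]


def trim_trailing(seq, quals, min_quality):
    """Drop bases from the end of the read while their quality is too low."""
    end = len(seq)
    while end > 0 and quals[end - 1] < min_quality:
        end -= 1
    return seq[:end], quals[:end]


if __name__ == "__main__":
    # Made-up read: the N calls at both ends have quality 2.
    seq, quals = "NNACGTAN", [2, 2, 30, 30, 30, 30, 30, 2]
    seq, quals = trim_leading(seq, quals, 3)
    seq, quals = trim_trailing(seq, quals, 3)
    print(seq, quals)
```

Trimming stops at the first base that meets the threshold, so low-quality bases in the middle of the read are not touched by these two steps.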

SLIDINGWINDOW:4:15

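Scans the read with a 4-base sliding window and cuts the read once the average quality per base within the window drops below 15.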

  • Scan the read with a 4-base wide sliding window, cutting when the average quality per base drops below 15 (SLIDINGWINDOW:4:15)
  • SLIDINGWINDOW:<windowSize>:<requiredQuality>

    • windowSize: specifies the number of bases to average across
    • requiredQuality: specifies the average quality required.
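
A simplified Python sketch of the sliding-window idea follows. It cuts at the start of the first window whose average quality falls below the requirement; the real tool may place the cut slightly differently, and the handling of reads shorter than the window here is an assumption of this sketch. The function name is hypothetical.

```python
def sliding_window_keep_length(quals, window_size, required_quality):
    """Return how many bases to keep from the start of the read."""
    if len(quals) < window_size:
        # Assumption of this sketch: too-short reads are left unchanged.
        return len(quals)
    for start in range(len(quals) - window_size + 1):
        window = quals[start:start + window_size]
        if sum(window) / window_size < required_quality:
            return start  # cut where the first failing window begins
    return len(quals)


if __name__ == "__main__":
    quals = [30, 30, 30, 30, 30, 10, 10, 10, 10]  # made-up qualities
    print(sliding_window_keep_length(quals, 4, 15))
```

A window that averages exactly 15 is kept, since the cut happens only when the average drops below the required quality.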

MINLEN:36

Drops reads shorter than 36 bases.

  • Drop reads shorter than 36 bases (MINLEN:36)
  • MINLEN:<length>

    • length: Specifies the minimum length of reads to be kept.

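There are other parameters as well; see the Trimmomatic GitHub page for details.

The following statements are collected from the official documentation: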

Trimmomatic supports sequence quality data in both standard (phred+33) and Illumina ‘legacy’ formats (phred+64), and can also convert between these formats if required. The quality format is determined automatically if not specified by the user.

In other words, Trimmomatic accepts both encodings of quality scores, can convert one into the other, and detects the encoding on its own when the user does not specify it.
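
The difference between the two encodings is only the ASCII offset added to each quality value. The following Python sketch shows the decoding and a phred+64 to phred+33 conversion; the function names are hypothetical, and automatic format detection is not attempted here.

```python
def decode_qualities(quality_string, offset=33):
    """Turn a FASTQ quality string into integer quality values."""
    return [ord(ch) - offset for ch in quality_string]


def phred64_to_phred33(quality_string):
    """Re-encode a phred+64 quality string as phred+33."""
    return "".join(chr(ord(ch) - 64 + 33) for ch in quality_string)


if __name__ == "__main__":
    print(decode_qualities("II#"))             # phred+33 input
    print(decode_qualities("hh", offset=64))   # phred+64 input
    print(phred64_to_phred33("hh"))
```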

You often don't need leading and trailing clipping. Also, in general, setting keepBothReads to True can be useful when working with paired-end data: you will keep even redundant information, but this likely makes your pipelines more manageable. Note the additional :2 in front of the True (for keepBothReads); this is the minimum adapter length in palindrome mode, and you can even set it to 1. (The default is a very conservative 8.)

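In short, clipping the start and end of reads is often unnecessary. For paired-end data, setting keepBothReads to True is useful: redundant information is retained, but the pipeline becomes easier to manage.

The trailing True is the keepBothReads value. The 2 placed in front of it is the minimum adapter length in palindrome mode; it can be set as low as 1, and the default is a very conservative 8.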

Which adapter file to use

Suggested adapter sequences are provided for TruSeq2 (as used in GAII machines) and TruSeq3 (as used by HiSeq and MiSeq machines), for both single-end and paired-end mode. 

Adapter files come in single-end and paired-end versions. Use TruSeq2 for GAII runs and TruSeq3 for HiSeq and MiSeq runs.

These sequences have not been extensively tested, and depending on specific issues which may occur in library preparation, other sequences may work better for a given dataset.

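These adapter files have not been extensively tested, so you can swap in your own depending on your situation.

Before replacing them with your own adapter file, you need to know how the existing adapter files work.

How the adapter files work

Trimmomatic has two adapter-removal strategies: palindrome and simple.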

With 'simple' trimming, each adapter sequence is tested against the reads, and if a sufficiently accurate match is detected, the read is clipped appropriately.

With simple trimming, every adapter sequence is tested against each read, and the read is clipped wherever a sufficiently accurate match is found.

'Palindrome' trimming is specifically designed for the case of 'reading through' a short fragment into the adapter sequence on the other end. In this approach, the appropriate adapter sequences are 'in silico ligated' onto the start of the reads, and the combined adapter+read sequences, forward and reverse are aligned. If they align in a manner which indicates 'read-through', the forward read is clipped and the reverse read dropped (since it contains no new data).

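Palindrome trimming targets short fragments in which sequencing reads through the insert into the adapter at the other end. The appropriate adapter sequences are ligated in silico onto the start of the reads, and the combined adapter+read sequences, forward and reverse, are aligned. If they align in a way that indicates read-through, the forward read is clipped and the reverse read is dropped, since it contains no new data.

For palindrome clipping, the adapter sequence names must start with Prefix and end in '/1' for the forward adapter or '/2' for the reverse adapter.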

For 'Palindrome' clipping, the sequence names should both start with 'Prefix', and end in '/1' for the forward adapter and '/2' for the reverse adapter.

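Sequences with any other name are checked in simple mode. A name that ends in neither '/1' nor '/2' is checked against both the forward and the reverse read.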

All other sequences are checked using 'simple' mode. Sequences with names ending in '/1' or '/2' will be checked only against the forward or reverse read. Sequences not ending in '/1' or '/2' will be checked against both the forward and reverse read. 
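
The naming rules above can be written down as a small Python sketch. The function name and the returned labels are made up for illustration; only the rules themselves come from the quoted documentation.

```python
def adapter_mode(name):
    """Classify a FASTA sequence name according to the naming rules."""
    if name.startswith("Prefix") and name.endswith(("/1", "/2")):
        return "palindrome"
    if name.endswith("/1"):
        return "simple, forward read only"
    if name.endswith("/2"):
        return "simple, reverse read only"
    return "simple, both reads"


if __name__ == "__main__":
    # Example names chosen for illustration.
    for name in ["PrefixPE/1", "PrefixPE/2", "MyAdapter/1", "MyAdapter"]:
        print(name, "->", adapter_mode(name))
```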

(Checking the reverse complement)

If you want to check for the reverse-complement of a specific sequence, you need to specifically include the reverse-complemented form of the sequence as well, with another name.
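
One way to produce that extra entry is sketched below in Python. The helper names, the `_rc` suffix, and the example sequence are assumptions for illustration; any distinct name would do.

```python
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def fasta_with_reverse_complement(name, seq):
    """Emit the sequence plus its reverse complement under a second name."""
    return f">{name}\n{seq}\n>{name}_rc\n{reverse_complement(seq)}\n"


if __name__ == "__main__":
    # Made-up adapter name and sequence.
    print(fasta_with_reverse_complement("MyAdapter", "AACGTTGCA"), end="")
```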

The three adapter thresholds explained

ILLUMINACLIP:TruSeq3-PE.fa:2:30:10

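The thresholds are based on a log-likelihood approach.

Each matching base adds just over 0.6 to the alignment score, while each mismatch reduces the score by Q/10.

A perfect 12-base match therefore scores just over 7, and 25 perfectly matching bases are needed to reach a score of 15. This is why values between 7 and 15 are recommended for the simple clip threshold.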

The thresholds used are a simplified log-likelihood approach. Each matching base adds just over 0.6, while each mismatch reduces the alignment score by Q/10. Therefore, a perfect match of a 12 base sequence will score just over 7, while 25 bases are needed to score 15. As such we recommend values between 7 - 15 for this parameter. 
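
The arithmetic can be checked with a short Python sketch. It assumes the per-match score is log10(4) ≈ 0.602, which is consistent with "just over 0.6" in the quoted text; the function name is hypothetical.

```python
import math

# Assumption: each matching base contributes log10(4), i.e. just over 0.6.
MATCH_SCORE = math.log10(4)


def alignment_score(n_matches, mismatch_qualities=()):
    """Score = matches * log10(4) minus Q/10 for every mismatched base."""
    return n_matches * MATCH_SCORE - sum(q / 10 for q in mismatch_qualities)


if __name__ == "__main__":
    print(round(alignment_score(12), 2))        # perfect 12-base match
    print(round(alignment_score(24), 2))        # still below 15
    print(round(alignment_score(25), 2))        # first length to reach 15
    print(round(alignment_score(12, [30]), 2))  # one Q30 mismatch costs 3.0
```

Under this assumption, a single mismatch at quality 30 cancels out roughly five matching bases, which is why a higher simple clip threshold demands a longer or cleaner adapter match.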

A palindromic match can produce a longer alignment, so the palindrome threshold can be higher, at around 30.

For palindromic matches, a longer alignment is possible - therefore this threshold can be higher, in the range of 30.

The 'seed mismatch' parameter makes alignment more efficient. It sets the maximum number of mismatches allowed in the 16-base seed; typical values are 1 or 2.

The 'seed mismatch' parameter is used to make alignments more efficient, specifying the maximum base mismatch count in the 'seed' (16 bases). Typical values here are 1 or 2.

Parameter combinations that appear in official sources

Combinations from the GitHub page

 ILLUMINACLIP:TruSeq3-PE.fa:2:30:10:2:True LEADING:3 TRAILING:3 MINLEN:36

 ILLUMINACLIP:TruSeq3-PE.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36 (for reference only; less sensitive for adapters)

Single-end parameters

ILLUMINACLIP:TruSeq3-SE:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36

Combinations from commands.txt in the paper's supplementary material

Adapter removal only, no quality checks

ILLUMINACLIP:${ADAPTERS}:2:30:12:1:true MINLEN:36

Adapter removal plus quality checks at the start and end of the read

ILLUMINACLIP:${ADAPTERS}:2:30:12:1:true LEADING:3 TRAILING:3 MINLEN:36

Adapter removal plus sliding window

ILLUMINACLIP:${ADAPTERS}:2:30:12:1:true LEADING:3 TRAILING:3 SLIDINGWINDOW:4:${SW} MINLEN:36

Adapter removal plus maximum information (MAXINFO)

ILLUMINACLIP:${ADAPTERS}:2:30:12:1:true LEADING:3 TRAILING:3 MAXINFO:40:0.${S} MINLEN:36
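
The four combinations above share the same skeleton, which the following Python sketch assembles as a list of step strings. The function name, its parameters, and the example values filled in for the adapter file and the sliding-window quality are hypothetical; the fixed numbers are copied from the combinations listed above.

```python
def build_steps(adapters, end_trimming=False, extra_step=None):
    """Assemble trimming steps in the order used by the combinations above."""
    steps = [f"ILLUMINACLIP:{adapters}:2:30:12:1:true"]
    if end_trimming:
        steps += ["LEADING:3", "TRAILING:3"]
    if extra_step:
        steps.append(extra_step)  # e.g. a SLIDINGWINDOW or MAXINFO step
    steps.append("MINLEN:36")     # length filter goes last
    return steps


if __name__ == "__main__":
    # Adapter file name and window quality are placeholder values.
    print(" ".join(build_steps("TruSeq3-PE.fa")))
    print(" ".join(build_steps("TruSeq3-PE.fa", end_trimming=True)))
    print(" ".join(build_steps("TruSeq3-PE.fa", end_trimming=True,
                               extra_step="SLIDINGWINDOW:4:15")))
```

Keeping MINLEN as the final step means the length filter is applied to reads after all clipping and trimming has been done.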

References

 https://github.com/usadellab/Trimmomatic

 Trimmomatic: a flexible trimmer for Illumina sequence data | Bioinformatics | Oxford Academic
