Bowtie 2 detailed documentation

Introduction

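Bowtie 2 is an ultrafast, memory-efficient tool for aligning sequencing reads to long reference sequences. It is particularly good at aligning reads of about 50 up to hundreds of characters to relatively long (e.g. mammalian) genomes. Bowtie 2 indexes the genome with an FM index (based on the Burrows-Wheeler Transform, or BWT) to keep its memory footprint small: for the human genome, the footprint is typically around 3.2 GB of RAM. Bowtie 2 supports gapped, local, and paired-end alignment modes. Multiple processors can be used simultaneously to achieve greater alignment speed.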

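Bowtie 2 outputs alignments in SAM format, enabling interoperation with a large number of other tools that use SAM (e.g. SAMtools, GATK). Bowtie 2 is distributed under the GPLv3 license and runs on the command line under Windows, Mac OS X, Linux, and BSD.
Bowtie 2 is often the first step in pipelines for comparative genomics, including variant calling, ChIP-seq, RNA-seq, and BS-seq. Bowtie 2 and Bowtie (also called "Bowtie 1" here) are also tightly integrated into many other tools.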
Cite:
Langmead B, Wilks C, Antonescu V, Charles R. Scaling read aligners to hundreds of threads on general-purpose processors. Bioinformatics. 2018 Jul 18. doi: 10.1093/bioinformatics/bty648.

Langmead B, Salzberg SL. Fast gapped-read alignment with Bowtie 2. Nature Methods. 2012 Mar 4;9(4):357-9. doi: 10.1038/nmeth.1923.

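How is Bowtie 2 different from Bowtie 1?

Bowtie 1 was released in 2009 and was geared toward aligning the relatively short sequencing reads (up to 25-50 nucleotides) prevalent at the time. Since then, sequencing technology has improved both throughput (more nucleotides produced per sequencer per day) and read length (more nucleotides per read).

The chief differences between Bowtie 1 and Bowtie 2 are:

  • For reads longer than about 50 bp, Bowtie 2 is generally faster, more sensitive, and uses less memory than Bowtie 1. For relatively short reads (e.g. less than 50 bp), Bowtie 1 is sometimes faster and/or more sensitive.
  • Bowtie 2 supports gapped alignment with affine gap penalties. The number of gaps and the gap lengths are not restricted, except by way of the configurable scoring scheme. Bowtie 1 finds only ungapped alignments.
  • Bowtie 2 supports local alignment, which does not require reads to align end-to-end. Local alignments may be "trimmed" ("soft clipped") at one or both extremes in a way that optimizes the alignment score. Bowtie 2 also supports end-to-end alignment which, like Bowtie 1, requires that the read align entirely.
  • There is no upper limit on read length in Bowtie 2. Bowtie 1 has an upper limit of around 1000 bp.
  • Bowtie 2 allows alignments to overlap ambiguous characters (e.g. Ns) in the reference. Bowtie 1 does not.
  • Bowtie 2 does away with Bowtie 1's notion of alignment "stratum", and with its distinction between "Maq-like" and "end-to-end" modes. In Bowtie 2 all alignments lie along a continuous spectrum of alignment scores, where the scoring scheme is similar to Needleman-Wunsch and Smith-Waterman.
  • Bowtie 2's paired-end alignment is more flexible. For example, for pairs that do not align in a paired fashion, Bowtie 2 attempts to find unpaired alignments for each mate.
  • Bowtie 2 reports a spectrum of mapping qualities, whereas Bowtie 1 reports either 0 or high.
  • Bowtie 2 does not align colorspace reads.
  • Bowtie 2 is not a "drop-in" replacement for Bowtie 1. Bowtie 2's command-line arguments and genome index format are both different from Bowtie 1's.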

Obtaining Bowtie 2

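Bowtie 2 is geared toward aligning relatively short sequencing reads to large genomes. That said, it handles arbitrarily small reference sequences (e.g. amplicons) and very long reads (i.e. upwards of tens or hundreds of kilobases), though it is slower in these settings. It is optimized for the read lengths and error modes yielded by typical Illumina sequencers.
Bowtie 2 does not support alignment of colorspace reads (Bowtie 1 does).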

Installing with conda

 conda install -c bioconda bowtie2

Building from source

Bowtie 2 can be run on many threads. By default, Bowtie 2 uses the Threading Building Blocks library (TBB) for this.
TBB installation on common platforms:

Operating System        Sync Package List    Search                  Install
Ubuntu, Mint, Debian    apt-get update       apt-cache search tbb    apt-get install libtbb-dev
Fedora, CentOS          yum check-update     yum search tbb          yum install tbb-devel.x86_64
Arch                    pacman -Sy           pacman -Ss tbb          pacman -S extra/intel-tbb
Gentoo                  emerge --sync        emerge --search tbb     emerge dev-cpp/tbb
macOS                   brew update          brew search tbb         brew install tbb
FreeBSD                 pkg update           pkg search tbb          pkg install tbb-2019.1

Adding to PATH

If you build from source, add the installation directory to the PATH environment variable, typically in your ~/.bashrc file.

The bowtie2 aligner

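bowtie2 takes a Bowtie 2 index and a set of sequencing read files and outputs a set of alignments in SAM format.
"Alignment" is the process by which we discover how and where the read sequences are similar to the reference sequence. An "alignment" is a result of this process; specifically, an alignment is a way of lining up some or all of the characters in the read with some characters from the reference in a way that reveals how they are similar. For example: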

Read:      GACTGGGCGATCTCGACTTCG
           |||||  |||||||||| |||
Reference: GACTG--CGATCTCGACATCG

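Here, dash symbols represent gaps and vertical bars show where aligned characters match.
We use alignment to make an educated guess as to where a read originated with respect to the reference genome. It is not always possible to determine this with certainty. For instance, if the reference genome contains several long stretches of As (AAAAAAAA etc.) and the read sequence is a short stretch of As (AAAAAA), we cannot know exactly where within the As the read originated.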

End-to-end alignment versus local alignment

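By default, Bowtie 2 performs end-to-end read alignment. That is, it searches for alignments involving all of the read characters. This is also called an "untrimmed" or "unclipped" alignment.
When the --local option is specified, Bowtie 2 performs local read alignment. In this mode, Bowtie 2 may "trim" or "clip" some read characters from one or both ends of the alignment if doing so maximizes the alignment score.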

End-to-end alignment example

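The following is an "end-to-end" alignment because it involves all the characters in the read. Such an alignment can be produced by Bowtie 2 in either end-to-end mode or local mode.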

Read:      GACTGGGCGATCTCGACTTCG
Reference: GACTGCGATCTCGACATCG

Alignment:
  Read:      GACTGGGCGATCTCGACTTCG
             |||||  |||||||||| |||
  Reference: GACTG--CGATCTCGACATCG

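End-to-end mode maximizes the alignment score while using the full length of the original read; it is comparable to a global (Needleman-Wunsch-style) pairwise alignment of the read.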

Local alignment example

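The following is a "local" alignment because some of the characters at the ends of the read do not participate. In this case, 4 characters are omitted (or "soft trimmed" or "soft clipped") from the beginning and 3 characters are omitted from the end. This sort of alignment can be produced by Bowtie 2 only in local mode.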

Read:      ACGGTTGCGTTAATCCGCCACG
Reference: TAACTTGCGTTAAATCCGCCTGG

Alignment:
  Read:      ACGGTTGCGTTAA-TCCGCCACG
                 ||||||||| ||||||
  Reference: TAACTTGCGTTAAATCCGCCTGG

Local alignment does not require the whole read to align; it simply maximizes the score of the aligned portion.

Scores: higher = more similar

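An alignment score quantifies how similar the read sequence is to the reference sequence it is aligned to. The higher the score, the more similar they are. A score is calculated by subtracting penalties for each difference (mismatch, gap, etc.) and, in local alignment mode, adding bonuses for each match.
The scores can be configured with the --ma (match bonus), --mp (mismatch penalty), --np (penalty for having an N in either the read or the reference), --rdg (affine read gap penalty) and --rfg (affine reference gap penalty) options.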

End-to-end alignment score example

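A mismatched base at a high-quality position in the read receives a penalty of -6 by default. A length-2 read gap receives a penalty of -11 by default (-5 for the gap open, -3 for the first extension, -3 for the second extension). Thus, in end-to-end alignment mode, if the read is 50 bp long and it matches the reference exactly except for one mismatch at a high-quality position and one length-2 read gap, then the overall score is -(6 + 11) = -17.
The best possible alignment score in end-to-end mode is 0, which happens when there are no differences between the read and the reference.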

Local alignment score example

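By default, a mismatched base at a high-quality position in the read receives a penalty of -6. A length-2 read gap receives a penalty of -11 by default (-5 for the gap open, -3 for the first extension, -3 for the second extension). A base that matches receives a bonus of +2 by default. Thus, in local alignment mode, if the read is 50 bp long and it matches the reference exactly except for one mismatch at a high-quality position and one length-2 read gap, then the overall score equals the total bonus, 2 * 49, minus the total penalty, 6 + 11, which is 81. The best possible score in local mode equals the match bonus times the length of the read. This happens when there are no differences between the read and the reference.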
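The two worked scores can be reproduced with a few lines of Python. This is only a sketch of the arithmetic described above, using the default values quoted in the text; the function and constant names are made up for illustration and are not part of Bowtie 2.

```python
# Default Bowtie 2 scoring values as quoted in the text.
MATCH_BONUS = 2        # --ma, only applied in local mode
MISMATCH_PENALTY = 6   # --mp at a high-quality position
GAP_OPEN = 5           # read gap open penalty (part of --rdg)
GAP_EXTEND = 3         # read gap extension penalty (part of --rdg)


def read_gap_penalty(gap_length):
    """Affine penalty for one read gap: one open plus one extension per gap position."""
    return GAP_OPEN + GAP_EXTEND * gap_length


def end_to_end_score(mismatches, gap_lengths):
    """End-to-end mode: start from 0 and subtract every penalty."""
    penalty = MISMATCH_PENALTY * mismatches
    penalty += sum(read_gap_penalty(g) for g in gap_lengths)
    return -penalty


def local_score(matched_bases, mismatches, gap_lengths):
    """Local mode: match bonuses minus the same penalties."""
    return MATCH_BONUS * matched_bases + end_to_end_score(mismatches, gap_lengths)


# One high-quality mismatch and one length-2 read gap, as in the text.
e2e = end_to_end_score(mismatches=1, gap_lengths=[2])
# 49 matched bases follows the text's own arithmetic (2 * 49).
loc = local_score(matched_bases=49, mismatches=1, gap_lengths=[2])
print(e2e, loc)
```

The helper mirrors the affine gap model in the text: a gap of length 2 costs 5 + 3 + 3 = 11.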

Valid alignments meet or exceed the minimum score threshold

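For an alignment to be considered "valid" (i.e. good enough to report), it must have an alignment score no less than the minimum score threshold. The threshold is configurable and is expressed as a function of the read length. In end-to-end alignment mode, the default minimum score threshold is -0.6 + -0.6 * L, where L is the read length. In local alignment mode, the default minimum score threshold is 20 + 8.0 * ln(L), where L is the read length. This can be configured with the --score-min option. For details on how to set options like --score-min that correspond to functions, see the section on setting function options.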
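As a rough illustration, the default thresholds can be evaluated for a few read lengths. The function names below are hypothetical; only the two formulas come from the text.

```python
import math


def min_score_end_to_end(read_length):
    """Default end-to-end threshold: -0.6 + -0.6 * L."""
    return -0.6 + -0.6 * read_length


def min_score_local(read_length):
    """Default local threshold: 20 + 8.0 * ln(L)."""
    return 20 + 8.0 * math.log(read_length)


for length in (50, 100, 150):
    print(length,
          round(min_score_end_to_end(length), 1),
          round(min_score_local(length), 1))

# A 50 bp end-to-end alignment scoring -17 clears the default threshold.
is_valid = -17 >= min_score_end_to_end(50)
print(is_valid)
```

For a 50 bp read the end-to-end threshold works out to about -30.6, so an alignment with one mismatch and one length-2 read gap (score -17) would still count as valid under the defaults.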

Mapping quality: higher = more unique

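An aligner cannot always assign a read to its point of origin with high confidence. For instance, a read that originated inside a repeat element might align equally well to many occurrences of the element throughout the genome, leaving the aligner with no basis for preferring one over the others.

Aligners characterize their degree of confidence in the point of origin by reporting a mapping quality: a non-negative integer Q = -10 log10 p, where p is an estimate of the probability that the alignment does not correspond to the read's true point of origin. Mapping quality is sometimes abbreviated MAPQ, and is recorded in the SAM MAPQ field.

Mapping quality is related to "uniqueness". We say an alignment is unique if it has a much higher alignment score than all the other possible alignments. The bigger the gap between the best alignment's score and the second-best alignment's score, the more unique the best alignment, and the higher its mapping quality should be.

Accurate mapping qualities are useful for downstream tools such as variant callers. For instance, a variant caller might choose to ignore evidence from alignments with mapping quality less than, say, 10. A mapping quality of 10 or less indicates that there is at least a 1 in 10 chance that the read truly originated elsewhere.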
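The relationship Q = -10 log10 p is easy to invert. The sketch below converts between MAPQ and the estimated error probability; the function names are illustrative, and it assumes p is strictly greater than 0 (the formula is undefined at p = 0).

```python
import math


def mapq_to_error_probability(q):
    """Estimated probability that the alignment is NOT the true origin."""
    return 10 ** (-q / 10)


def error_probability_to_mapq(p):
    """Inverse of the above; p must be > 0."""
    return -10 * math.log10(p)


for q in (0, 10, 20, 30, 42):
    print(q, mapq_to_error_probability(q))
```

A MAPQ of 10 corresponds to p = 0.1, which is the "1 in 10" figure mentioned above.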

Aligning pairs

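A "paired-end" or "mate-pair" read consists of a pair of mates, called mate 1 and mate 2. Pairs come with a prior expectation about (a) the relative orientation of the mates, and (b) the distance separating them on the original DNA molecule. Exactly what expectations hold for a given dataset depends on the lab procedures used to generate the data. For example, a common lab procedure for producing pairs is Illumina's Paired-end Sequencing Assay, which yields pairs with a relative orientation of FR ("forward, reverse"), meaning that if mate 1 came from the Watson strand, mate 2 very likely came from the Crick strand, and vice versa. This protocol also yields pairs whose expected end-to-end genomic distance is about 200-500 base pairs.

For simplicity, this manual uses the term "paired-end" to refer to any pair of reads with some expected relative orientation and distance. Depending on the protocol, these might actually be referred to as "paired-end" or "mate-pair". We also always refer to the individual sequences making up the pair as "mates".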

Paired inputs

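Pairs are often stored in a pair of files, one file containing the mate 1s and the other containing the mate 2s. The first mate in the mate 1 file forms a pair with the first mate in the mate 2 file, the second with the second, and so on. When aligning pairs with Bowtie 2, specify the file with the mate 1s using the -1 argument and the file with the mate 2s using the -2 argument. This causes Bowtie 2 to take the paired nature of the reads into account when aligning them.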

Paired SAM output

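When Bowtie 2 prints a SAM alignment for a pair, it prints two records (i.e. two lines of output), one for each mate. The first record describes the alignment for mate 1 and the second record describes the alignment for mate 2. In both records, some of the fields describe various properties of the alignment; for instance, the 7th and 8th fields (RNEXT and PNEXT respectively) indicate the reference name and position where the other mate aligned, and the 9th field indicates the inferred length of the DNA fragment from which the two mates were sequenced. See the SAM specification for more details regarding these fields.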

Concordant pairs match pair expectations, discordant pairs don’t

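A pair that aligns with the expected relative mate orientation and within the expected range of distances between mates is said to align "concordantly". If both mates have unique alignments, but the alignments do not match paired-end expectations (i.e. the mates are not in the expected relative orientation, or are not within the expected distance range, or both), the pair is said to align "discordantly". Discordant alignments may be of particular interest, for instance, when seeking structural variants.

The expected relative orientation of the mates is set using the --ff, --fr, or --rf options. The expected range of inter-mate distances (as measured from the furthest extremes of the mates; also called "outer distance") is set with the -I and -X options. Note that setting -I and -X far apart makes Bowtie 2 slower. See the documentation for -I and -X.

To declare that a pair aligns discordantly, Bowtie 2 requires that both mates align uniquely. This is a conservative threshold, but this is often desirable when seeking structural variants.

By default, Bowtie 2 searches for both concordant and discordant alignments, though searching for discordant alignments can be disabled with the --no-discordant option.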

Mixed mode: paired where possible, unpaired otherwise

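If Bowtie 2 cannot find a paired-end alignment for a pair, by default it will go on to look for unpaired alignments for the constituent mates. This is called "mixed mode". To disable mixed mode, set the --no-mixed option.

Bowtie 2 runs a little faster in --no-mixed mode, but it will only consider the alignment status of pairs as such, not of individual mates.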

Some SAM FLAGS describe paired-end properties

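The SAM FLAGS field, the second field in a SAM record, has multiple bits that describe the paired-end nature of the read and alignment. The first (least significant) bit (1 in decimal, 0x1 in hexadecimal) is set if the read is part of a pair. The second bit (2 in decimal, 0x2 in hexadecimal) is set if the read is part of a pair that aligned in a paired-end fashion. The fourth bit (8 in decimal, 0x8 in hexadecimal) is set if the read is part of a pair and the other mate in the pair has no reported alignment (in the SAM specification, "mate unmapped"). The sixth bit (32 in decimal, 0x20 in hexadecimal) is set if the read is part of a pair and the other mate in the pair aligned to the Crick strand (or, equivalently, if the reverse complement of the other mate aligned to the Watson strand). The seventh bit (64 in decimal, 0x40 in hexadecimal) is set if the read is mate 1 in a pair. The eighth bit (128 in decimal, 0x80 in hexadecimal) is set if the read is mate 2 in a pair. See the SAM specification for a more detailed description of the FLAGS field.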
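A small decoder makes the bit layout concrete. This is a minimal sketch: the bit meanings follow the SAM specification, while the table name, wording of the descriptions, and the function name are chosen here for illustration.

```python
# (bit, meaning) pairs for the flag bits relevant to Bowtie 2 output.
SAM_FLAG_BITS = [
    (0x1, "read is part of a pair"),
    (0x2, "pair aligned in a proper paired-end fashion"),
    (0x4, "read has no reported alignment"),
    (0x8, "mate has no reported alignment"),
    (0x10, "read aligned to the reverse strand"),
    (0x20, "mate aligned to the reverse strand"),
    (0x40, "read is mate 1"),
    (0x80, "read is mate 2"),
]


def decode_flags(flag):
    """Return the meanings of all bits set in a SAM FLAGS value."""
    return [meaning for bit, meaning in SAM_FLAG_BITS if flag & bit]


# 83 = 64 + 16 + 2 + 1: mate 1 of a properly aligned pair, on the reverse strand.
for meaning in decode_flags(83):
    print(meaning)
```

Testing bits with `&` rather than subtracting powers of two keeps the decoder correct for any combination of flags.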

Some SAM optional fields describe more paired-end properties

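The last several fields of each SAM record usually contain SAM optional fields, which are simply tab-separated strings conveying additional information about the reads and alignments. A SAM optional field is formatted like this: "XP:i:1", where "XP" is the TAG, "i" is the TYPE ("integer" in this case), and "1" is the VALUE. See the SAM specification for details regarding SAM optional fields.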
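Splitting an optional field into TAG, TYPE, and VALUE takes only a few lines. The sketch below handles the integer (i), float (f), and string (Z) types and leaves any other type as a string; the helper name is hypothetical.

```python
def parse_optional_field(field):
    """Split a SAM optional field 'TAG:TYPE:VALUE' into its three parts."""
    # maxsplit=2 keeps any ':' characters inside the value intact.
    tag, type_code, value = field.split(":", 2)
    if type_code == "i":
        value = int(value)
    elif type_code == "f":
        value = float(value)
    return tag, type_code, value


# Tail of a SAM record: optional fields are separated by tabs.
record_tail = "AS:i:-5\tXN:i:0\tXM:i:3\tNM:i:3\tMD:Z:59G13G21G26\tYT:Z:UU"
fields = {}
for raw in record_tail.split("\t"):
    tag, _type_code, value = parse_optional_field(raw)
    fields[tag] = value
print(fields)
```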

Mates can overlap, contain, or dovetail each other

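Fragment and read lengths might be such that the alignments for the two mates of a pair overlap each other. Consider this example.
(For these examples, assume we expect mate 1 to align to the left of mate 2.)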

Mate 1:    GCAGATTATATGAGTCAGCTACGATATTGTT
Mate 2:                               TGTTTGGGGTGACACATTACGCGTCTTTGAC
Reference: GCAGATTATATGAGTCAGCTACGATATTGTTTGGGGTGACACATTACGCGTCTTTGAC

It is also possible, though unusual, for one mate alignment to contain the other, as in these examples:

Mate 1:    GCAGATTATATGAGTCAGCTACGATATTGTTTGGGGTGACACATTACGC
Mate 2:                               TGTTTGGGGTGACACATTACGC
Reference: GCAGATTATATGAGTCAGCTACGATATTGTTTGGGGTGACACATTACGCGTCTTTGAC

Mate 1:                   CAGCTACGATATTGTTTGGGGTGACACATTACGC
Mate 2:                      CTACGATATTGTTTGGGGTGAC
Reference: GCAGATTATATGAGTCAGCTACGATATTGTTTGGGGTGACACATTACGCGTCTTTGAC

It is also possible, though unusual, for the mates to "dovetail", with the mates seemingly extending "past" each other, as in this example:

Mate 1:                 GTCAGCTACGATATTGTTTGGGGTGACACATTACGC
Mate 2:            TATGAGTCAGCTACGATATTGTTTGGGGTGACACAT                   
Reference: GCAGATTATATGAGTCAGCTACGATATTGTTTGGGGTGACACATTACGCGTCTTTGAC

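In some situations, it is desirable for the aligner to consider all these cases "concordant" as long as other paired-end constraints are not violated. Bowtie 2's default behavior is to consider overlapping and containing as being consistent with concordant alignment. By default, dovetailing is considered inconsistent with concordant alignment.

These defaults can be overridden. Setting --no-overlap causes Bowtie 2 to consider overlapping mates as non-concordant. Setting --no-contain causes Bowtie 2 to consider cases where one mate alignment contains the other as non-concordant. Setting --dovetail causes Bowtie 2 to consider cases where the mate alignments dovetail as concordant.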
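The three layouts can be told apart from the mates' start and end coordinates. The sketch below locates each mate from the examples above by exact string search in the example reference and then classifies the layout. It is a simplification for illustration, not Bowtie 2's actual logic: it assumes ungapped, exact matches, assumes mate 1 is expected on the left, and the order in which the cases are tested is a choice made here.

```python
REFERENCE = "GCAGATTATATGAGTCAGCTACGATATTGTTTGGGGTGACACATTACGCGTCTTTGAC"


def locate(mate):
    """Return the half-open [start, end) interval of an exact match."""
    start = REFERENCE.find(mate)
    if start < 0:
        raise ValueError("mate does not occur in the reference")
    return start, start + len(mate)


def classify(mate1, mate2):
    """Classify how two mate alignments relate (mate 1 expected on the left)."""
    s1, e1 = locate(mate1)
    s2, e2 = locate(mate2)
    if (s1 <= s2 and e2 <= e1) or (s2 <= s1 and e1 <= e2):
        return "contain"
    if s2 < s1:
        return "dovetail"   # the "wrong" mate begins upstream
    if s2 < e1:
        return "overlap"
    return "separate"


overlap = classify("GCAGATTATATGAGTCAGCTACGATATTGTT",
                   "TGTTTGGGGTGACACATTACGCGTCTTTGAC")
contain = classify("GCAGATTATATGAGTCAGCTACGATATTGTTTGGGGTGACACATTACGC",
                   "TGTTTGGGGTGACACATTACGC")
dovetail = classify("GTCAGCTACGATATTGTTTGGGGTGACACATTACGC",
                    "TATGAGTCAGCTACGATATTGTTTGGGGTGACACAT")
print(overlap, contain, dovetail)
```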

Reporting

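The reporting mode governs how many alignments Bowtie 2 looks for, and how it reports them. Bowtie 2 has three distinct reporting modes. The default reporting mode is similar to the default reporting mode of many other read alignment tools, including BWA. It is also similar to Bowtie 1's -M alignment mode.

In general, when we say that a read has an alignment, we mean that it has a valid alignment. When we say that a read has multiple alignments, we mean that it has multiple alignments that are valid and distinct from one another.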

Distinct alignments map a read to different places

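Two alignments for the same individual read are "distinct" if they map the same read to different places. Specifically, we say that two alignments are distinct if there is no alignment position where a particular read offset is aligned opposite a particular reference offset in both alignments with the same orientation. For example, if the first alignment is in the forward orientation and aligns the read character at read offset 10 to the reference character at chromosome 3, offset 3,445,245, and the second alignment is also in the forward orientation and also aligns the read character at read offset 10 to the reference character at chromosome 3, offset 3,445,245, they are not distinct alignments.

Two alignments for the same pair are distinct if either the mate 1s in the two paired-end alignments are distinct, or the mate 2s in the two alignments are distinct, or both.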
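One way to picture this definition is to treat each alignment as a set of (read offset, reference offset) cells and to check whether two alignments share any cell. The sketch below does this for ungapped alignments only; the data layout and names are assumptions made for the example and do not reflect Bowtie 2's internal representation.

```python
def alignment_cells(read_start, ref_start, length):
    """Cells of an ungapped alignment as (read offset, reference offset) pairs."""
    return {(read_start + i, ref_start + i) for i in range(length)}


def are_distinct(aln_a, aln_b):
    """Each alignment is a tuple (chromosome, strand, cells)."""
    chrom_a, strand_a, cells_a = aln_a
    chrom_b, strand_b, cells_b = aln_b
    if chrom_a != chrom_b or strand_a != strand_b:
        return True
    # Distinct only if no read offset is paired with the same reference offset.
    return cells_a.isdisjoint(cells_b)


# Both alignments place read offset 10 opposite chr3:3,445,245 on the forward strand.
first = ("chr3", "+", alignment_cells(0, 3445235, 50))
second = ("chr3", "+", alignment_cells(5, 3445240, 30))
# Shifted by one reference position: no cell is shared with `first`.
shifted = ("chr3", "+", alignment_cells(0, 3445236, 50))

print(are_distinct(first, second), are_distinct(first, shifted))
```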

Default mode: search for multiple alignments, report the best one

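By default, Bowtie 2 searches for distinct, valid alignments for each read. When it finds a valid alignment, it generally continues to look for alignments that are nearly as good or better. It will eventually stop looking, either because it exceeded a limit placed on search effort (see -D and -R) or because it already knows all it needs to know to report an alignment. Information from the best alignments is used to estimate mapping quality (the MAPQ SAM field) and to set SAM optional fields such as AS:i and XS:i. Bowtie 2 does not guarantee that the alignment reported is the best possible in terms of alignment score.

See also: -D, which puts an upper limit on the number of dynamic programming problems (i.e. seed extensions) that can "fail" in a row before Bowtie 2 stops searching. Increasing -D makes Bowtie 2 slower, but increases the likelihood that it will report the correct alignment for a read that aligns to many places.

See also: -R, which sets the maximum number of times Bowtie 2 will "re-seed" when attempting to align a read with repetitive seeds. Increasing -R makes Bowtie 2 slower, but increases the likelihood that it will report the correct alignment for a read that aligns to many places.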

-k mode: search for one or more alignments, report each

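In -k mode, Bowtie 2 searches for up to N distinct, valid alignments for each read, where N equals the integer specified with the -k parameter. That is, if -k 2 is specified, Bowtie 2 will search for at most 2 distinct alignments. It reports all alignments found, in descending order by alignment score. The alignment score for a paired-end alignment equals the sum of the alignment scores of the individual mates. Each reported read or pair alignment beyond the first has the SAM "secondary" bit (which equals 256) set in its FLAGS field. Supplementary alignments will also be assigned a MAPQ of 255. See the SAM specification for details.

Bowtie 2 does not "find" alignments in any specific order, so for reads that have more than N distinct, valid alignments, Bowtie 2 does not guarantee that the N alignments reported are the best possible in terms of alignment score. Still, this mode can be effective and fast in situations where the user cares more about whether a read aligns (or aligns a certain number of times) than about where exactly it originated.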

-a mode: search for and report all alignments

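-a mode is similar to -k mode except that there is no upper limit on the number of alignments Bowtie 2 should report. Alignments are reported in descending order by alignment score. The alignment score for a paired-end alignment equals the sum of the alignment scores of the individual mates. Each reported read or pair alignment beyond the first has the SAM "secondary" bit (which equals 256) set in its FLAGS field. Supplementary alignments will be assigned a MAPQ of 255. See the SAM specification for details.

Some tools are designed with this reporting mode in mind. Bowtie 2 is not. For very large genomes, this mode is very slow.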

Randomness in Bowtie 2

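Bowtie 2's search for alignments for a given read is "randomized". That is, when Bowtie 2 encounters a set of equally good choices, it uses a pseudo-random number to choose. For example, if Bowtie 2 discovers a set of 3 equally good alignments and wants to decide which to report, it picks a pseudo-random integer 0, 1 or 2 and reports the corresponding alignment. Arbitrary choices can crop up at various points during alignment.

The pseudo-random number generator is re-initialized for every read, and the seed used to initialize it is a function of the read name, nucleotide string, quality string, and the value specified with --seed. If you run the same version of Bowtie 2 on two reads with identical names, nucleotide strings, and quality strings, and if --seed is set the same for both runs, Bowtie 2 will produce the same output; i.e., it will align the read to the same place, even if there are multiple equally good alignments. This is intuitive and desirable in most cases. Most users expect Bowtie 2 to produce the same output when run twice on the same input.

However, when the user specifies the --non-deterministic option, Bowtie 2 uses the current time to re-initialize the pseudo-random number generator. When this option is specified, Bowtie 2 might report different alignments for identical reads. This is counter-intuitive for some users, but might be more appropriate in situations where the input consists of many identical reads.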
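The idea of a per-read seed can be sketched as follows. This is not Bowtie 2's actual hashing or random number generator; SHA-256 and Python's `random.Random` are stand-ins chosen for the example, and the function name is hypothetical. The point is only that deriving the seed from the read itself makes tie-breaking reproducible.

```python
import hashlib
import random


def pick_alignment(candidates, read_name, sequence, qualities, seed=0):
    """Choose among equally good candidates, deterministically per read."""
    key = "\t".join([read_name, sequence, qualities, str(seed)]).encode()
    digest = hashlib.sha256(key).digest()
    # Re-initialize the generator for this read from the derived seed.
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    return candidates[rng.randrange(len(candidates))]


ties = ["chr1:100", "chr1:5000", "chr2:42"]
first_run = pick_alignment(ties, "r1", "AAAAAA", "IIIIII", seed=0)
second_run = pick_alignment(ties, "r1", "AAAAAA", "IIIIII", seed=0)
print(first_run, second_run)
```

Because the seed depends only on the read and the user-supplied value, two runs over the same input make the same choice.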

Multiseed heuristic

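To rapidly narrow the number of possible alignments that must be considered, Bowtie 2 begins by extracting substrings ("seeds") from the read and its reverse complement and aligning them in an ungapped fashion with the help of the FM index. This is "multiseed alignment", and it is similar to what Bowtie 1 does, except that Bowtie 1 attempts to align the entire read this way.

This initial step makes Bowtie 2 much faster than it would be without such a filter, but at the expense of missing some valid alignments. For instance, it is possible for a read to have a valid overall alignment but no valid seed alignments, because each potential seed alignment is interrupted by too many mismatches or gaps.

The trade-off between speed and sensitivity/accuracy can be adjusted by setting the seed length (-L), the interval between extracted seeds (-i), and the number of mismatches permitted per seed (-N). For more sensitive alignment, set these parameters to (a) make the seeds closer together, (b) make the seeds shorter, and/or (c) allow more mismatches. You can adjust these options one by one, though Bowtie 2 comes with some useful combinations of options prepackaged as "preset options".

-D and -R are also options that adjust the trade-off between speed and sensitivity/accuracy.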
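Seed extraction itself is simple to sketch: take a substring of length -L every -i positions, together with its reverse complement. The code below is an illustration under stated assumptions: the interval is passed as a plain integer (in Bowtie 2 it is derived from a function of the read length), and the function names are made up.

```python
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Complement each base, then reverse the string."""
    return seq.translate(COMPLEMENT)[::-1]


def extract_seeds(read, seed_length, interval):
    """Return (offset, forward seed, reverse-complement seed) tuples."""
    seeds = []
    for offset in range(0, len(read) - seed_length + 1, interval):
        fw = read[offset:offset + seed_length]
        seeds.append((offset, fw, reverse_complement(fw)))
    return seeds


read = "GACTGGGCGATCTCGACTTCG"  # the 21-character read from the alignment example
seeds = extract_seeds(read, seed_length=10, interval=6)
for offset, fw, rc in seeds:
    print(offset, fw, rc)
```

Shorter seeds and smaller intervals produce more seeds per read, which is why those settings increase sensitivity at the cost of speed.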

FM Index memory footprint

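Bowtie 2 uses the FM index to find ungapped alignments for seeds. This step accounts for the bulk of Bowtie 2's memory footprint, as the FM index itself is typically the largest data structure used. For instance, the memory footprint of the FM index for the human genome is about 3.2 GB of RAM.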

Ambiguous characters

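Non-whitespace characters other than A, C, G or T are considered "ambiguous". N is a common ambiguous character that appears in reference sequences. Bowtie 2 treats all ambiguous characters in the reference (including IUPAC nucleotide codes) as Ns.

Bowtie 2 allows alignments to overlap ambiguous characters in the reference. An alignment position that contains an ambiguous character in the read, the reference, or both is penalized according to --np. --n-ceil sets an upper limit on the number of positions that may contain ambiguous reference characters in a valid alignment. The optional field XN:i reports the number of ambiguous reference characters overlapped by an alignment.

Note that the multiseed heuristic cannot find seed alignments that overlap ambiguous reference characters. For an alignment overlapping an ambiguous reference character to be found, it must have one or more seed alignments that do not overlap ambiguous reference characters.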

Presets: setting many settings at once

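Bowtie 2 comes with some useful combinations of parameters packaged into shorter "preset" parameters. For example, running Bowtie 2 with the --very-sensitive option is the same as running with the options -D 20 -R 3 -N 0 -L 20 -i S,1,0.50. The preset options that come with Bowtie 2 are designed to cover a wide area of the speed/sensitivity/accuracy trade-off space, with the presets ending in fast generally being faster but less sensitive and less accurate, and the presets ending in sensitive generally being slower but more sensitive and more accurate. See the documentation for the preset options for details.

As of Bowtie 2 v2.4.0, individual preset values can be overridden by providing the specific options. For example, the seed length of 20 configured in the --very-sensitive preset above can be changed to 25 by also specifying the -L 25 parameter anywhere on the command line.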

Filtering

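Some reads are skipped or "filtered out" by Bowtie 2. For example, reads may be filtered out because they are extremely short or have a high proportion of ambiguous nucleotides. Bowtie 2 will still print a SAM record for such a read, but no alignment will be reported and the YF:Z SAM optional field will be set to indicate the reason the read was filtered.

YF:Z:LN: the read was filtered because its length was less than or equal to the number of seed mismatches set with the -N option.
YF:Z:NS: the read was filtered because it contains more ambiguous characters (usually N or .) than the ceiling specified with the --n-ceil option.
YF:Z:SC: the read was filtered because the read length and the match bonus (set with --ma) are such that the read cannot possibly earn an alignment score greater than or equal to the threshold set with --score-min.
YF:Z:QC: the read was filtered because it was marked as failing quality control and the user specified the --qc-filter option. This only happens when the input is in Illumina's QSEQ format (i.e. when --qseq is specified) and the last (11th) field of the read's QSEQ record contains 1.
If a read could be filtered for more than one reason, the value of the YF:Z flag will reflect only one of those reasons.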

Alignment summary

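When Bowtie 2 finishes running, it prints messages summarizing what happened. These messages are printed to the "standard error" ("stderr") filehandle. For datasets consisting of unpaired reads, the summary might look like this: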

20000 reads; of these:
  20000 (100.00%) were unpaired; of these:
    1247 (6.24%) aligned 0 times
    18739 (93.69%) aligned exactly 1 time
    14 (0.07%) aligned >1 times
93.77% overall alignment rate

For datasets consisting of pairs, the summary might look like this:

10000 reads; of these:
  10000 (100.00%) were paired; of these:
    650 (6.50%) aligned concordantly 0 times
    8823 (88.23%) aligned concordantly exactly 1 time
    527 (5.27%) aligned concordantly >1 times
    ----
    650 pairs aligned concordantly 0 times; of these:
      34 (5.23%) aligned discordantly 1 time
    ----
    616 pairs aligned 0 times concordantly or discordantly; of these:
      1232 mates make up the pairs; of these:
        660 (53.57%) aligned 0 times
        571 (46.35%) aligned exactly 1 time
        1 (0.08%) aligned >1 times
96.70% overall alignment rate

The indentation indicates how subtotals relate to totals.
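The overall alignment rate in the paired-end summary can be reproduced from its subtotals by counting aligned mates. The sketch below is a reading of the example numbers, with a hypothetical function name: every concordant or discordant pair contributes two aligned mates, and the mates from the remaining pairs are counted individually.

```python
def overall_alignment_rate(pairs, concordant_pairs, discordant_pairs, aligned_single_mates):
    """Percentage of mates that aligned in any way."""
    aligned_mates = 2 * (concordant_pairs + discordant_pairs) + aligned_single_mates
    return 100.0 * aligned_mates / (2 * pairs)


rate = overall_alignment_rate(
    pairs=10000,
    concordant_pairs=8823 + 527,     # exactly 1 time + >1 times
    discordant_pairs=34,
    aligned_single_mates=571 + 1,    # mates from pairs that did not align as pairs
)
print(f"{rate:.2f}% overall alignment rate")
```

With the example figures this gives 19340 aligned mates out of 20000, matching the 96.70% shown in the summary.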

Wrapper scripts

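The bowtie2, bowtie2-build and bowtie2-inspect executables are actually wrapper scripts that call binary programs as appropriate. The wrappers shield users from having to distinguish between "small" and "large" index formats, discussed briefly in the following section. The bowtie2 wrapper also provides some key functionality, such as the ability to handle compressed inputs and the functionality behind --un, --al and related options.

It is recommended that you always run the bowtie2 wrappers rather than the binaries directly.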

Small and large indexes

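bowtie2-build can index reference genomes of any size. For genomes less than about 4 billion nucleotides in length, bowtie2-build builds a "small" index using 32-bit numbers in various parts of the index. When the genome is longer, bowtie2-build builds a "large" index using 64-bit numbers. Small indexes are stored in files with the .bt2 extension, and large indexes are stored in files with the .bt2l extension. The user need not worry about whether a particular index is small or large; the wrapper scripts will automatically build and use the appropriate index.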

Performance tuning

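  • If your computer has multiple processors/cores, use -p

The -p option causes Bowtie 2 to launch a specified number of parallel search threads. Each thread runs on a different processor/core and all threads find alignments in parallel, increasing alignment throughput by approximately a multiple of the number of threads (though in practice, the speedup is somewhat worse than linear).

  • If reporting many alignments per read, try reducing bowtie2-build --offrate

If you are using the -k or -a options and Bowtie 2 is reporting many alignments per read, using an index with a denser SA sample can speed things up considerably. To do this, specify a smaller-than-default -o/--offrate value when running bowtie2-build. A denser SA sample yields a larger index, but is also particularly effective at speeding up alignment when many alignments are reported per read.

  • If bowtie2 "thrashes", try increasing bowtie2-build --offrate

If bowtie2 runs very slowly on a relatively low-memory computer, try setting -o/--offrate to a larger value when building the index. This decreases the memory footprint of the index.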

Command Line

Setting function options

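Some Bowtie 2 options specify a function rather than an individual number or setting. In these cases the user specifies three parameters: a function type F, a constant term B, and a coefficient A. The available function types are constant (C), linear (L), square-root (S), and natural log (G). The parameters are specified as F,B,A; that is, the function type, the constant term, and the coefficient are separated by commas with no whitespace. The constant term and coefficient may be negative and/or floating-point numbers.
For example, if the function specification is L,-0.4,-0.6, then the function defined is: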

f(x) = -0.4 + -0.6 * x

If the function specification is G,1,5.4, then the function defined is:

f(x) = 1.0 + 5.4 * ln(x)
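A function specification of this form can be turned into a callable with a small parser. This is an illustrative sketch, not Bowtie 2's own parser: the names are hypothetical, and it assumes the constant type (C) ignores the coefficient, so that f(x) = B.

```python
import math

# Maps the function type letter to the transform applied to x.
TRANSFORMS = {
    "C": lambda x: 0.0,   # constant: the coefficient has no effect (assumption)
    "L": lambda x: x,     # linear
    "S": math.sqrt,       # square root
    "G": math.log,        # natural log
}


def parse_function_option(spec):
    """Turn 'F,B,A' into a function f(x) = B + A * transform(x)."""
    parts = spec.split(",")
    transform = TRANSFORMS[parts[0]]
    constant = float(parts[1])
    coefficient = float(parts[2]) if len(parts) > 2 else 0.0
    return lambda x: constant + coefficient * transform(x)


linear = parse_function_option("L,-0.4,-0.6")
natural_log = parse_function_option("G,1,5.4")
print(linear(10), natural_log(math.e))
```

The same helper evaluates the default end-to-end --score-min quoted earlier, L,-0.6,-0.6, at any read length.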

SAM output

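The following is a brief description of the SAM format as output by bowtie2. For more details, see the SAM format specification.

By default, bowtie2 prints a SAM header with @HD, @SQ and @PG lines. When one or more --rg arguments are specified, bowtie2 also prints an @RG line that includes all user-specified --rg tokens separated by tabs.

Each subsequent line describes an alignment or, if the read failed to align, a read. Each line is a collection of at least 12 fields separated by tabs; from left to right, the fields are: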

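  • 1 Name of the read that aligned
    Note that the SAM specification disallows whitespace in the read name. If the read name contains any whitespace characters, Bowtie 2 truncates the name at the first whitespace character. This is similar to the behavior of other tools. The standard behavior of truncating at the first whitespace can be suppressed with --sam-no-qname-trunc, at the expense of generating non-standard SAM.
  • 2 Sum of all applicable flags. Flags relevant to Bowtie 2 are:

Flag    Meaning
1       The read is one of a pair
2       The alignment is one end of a proper paired-end alignment
4       The read has no reported alignments
8       The read is one of a pair and its mate has no reported alignments
16      The alignment is to the reverse reference strand
32      The other mate in the paired-end alignment is aligned to the reverse reference strand
64      The read is mate 1 in a pair
128     The read is mate 2 in a pair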

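Thus, an unpaired read that aligns to the reverse reference strand will have flag 16. A paired-end read that aligns to the reverse reference strand and is the first mate in the pair will have flag 83 (= 64 + 16 + 2 + 1).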

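  • 3 Name of the reference sequence where the alignment occurs
  • 4 1-based offset into the forward reference strand where the leftmost character of the alignment occurs
  • 5 Mapping quality
  • 6 CIGAR string representation of the alignment
  • 7 Name of the reference sequence where the mate's alignment occurs. Set to = if the mate's reference sequence is the same as this alignment's, or * if there is no mate.
  • 8 1-based offset into the forward reference strand where the leftmost character of the mate's alignment occurs. The offset is 0 if there is no mate.
  • 9 Inferred fragment length. The size is negative if the mate's alignment occurs upstream of this alignment. The size is 0 if the mates did not align concordantly. However, the size is non-0 if the mates aligned discordantly to the same chromosome.
  • 10 Read sequence (reverse-complemented if aligned to the reverse strand)
  • 11 ASCII-encoded read qualities (reversed if the read aligned to the reverse strand). The encoded quality values are on the Phred quality scale and the encoding is ASCII-offset by 33 (ASCII char !), similarly to a FASTQ file.
  • 12 Optional fields. Fields are tab-separated. bowtie2 outputs zero or more of these optional fields for each alignment, depending on the type of the alignment:
AS:i:<N>
Alignment score. Can be negative. Can be greater than 0 in --local mode (but not in --end-to-end mode). Only present if the SAM record is for an aligned read.

XS:i:<N>
Alignment score for the best-scoring alignment found other than the alignment reported. Can be negative. Can be greater than 0 in --local mode (but not in --end-to-end mode). Only present if the SAM record is for an aligned read and more than one alignment was found for the read. Note that, when the read is part of a concordantly aligned pair, this score could be greater than AS:i.

YS:i:<N>
Alignment score for the opposite mate in the paired-end alignment. Only present if the SAM record is for a read that aligned as part of a paired-end alignment.

XN:i:<N>
The number of ambiguous bases in the reference covering this alignment. Only present if the SAM record is for an aligned read.

XM:i:<N>
The number of mismatches in the alignment. Only present if the SAM record is for an aligned read.

XO:i:<N>
The number of gap opens, for both read and reference gaps, in the alignment. Only present if the SAM record is for an aligned read.

XG:i:<N>
The number of gap extensions, for both read and reference gaps, in the alignment. Only present if the SAM record is for an aligned read.

NM:i:<N>
The edit distance; that is, the minimal number of one-nucleotide edits (substitutions, insertions and deletions) needed to transform the read string into the reference string. Only present if the SAM record is for an aligned read.

YF:Z:<S>
String indicating the reason why the read was filtered out. See also: Filtering. Only appears for reads that were filtered out.

YT:Z:<S>
A value of UU indicates the read was not part of a pair. A value of CP indicates the read was part of a pair and the pair aligned concordantly. A value of DP indicates the read was part of a pair and the pair aligned discordantly. A value of UP indicates the read was part of a pair but the pair failed to align either concordantly or discordantly.

MD:Z:<S>
A string representation of the mismatched reference bases in the alignment. See the SAM Tags format specification for details. Only present if the SAM record is for an aligned read.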

The bowtie2-build indexer

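bowtie2-build builds a Bowtie 2 index from a set of DNA sequences. bowtie2-build outputs a set of 6 files with suffixes .1.bt2, .2.bt2, .3.bt2, .4.bt2, .rev.1.bt2, and .rev.2.bt2. In the case of a large index these suffixes will have a bt2l termination. These files together constitute the index: they are all that is needed to align reads to that reference. The original sequence FASTA files are no longer used by Bowtie 2 once the index is built.

Bowtie 2's .bt2 index format is different from Bowtie 1's .ebwt format, and the two are not compatible with each other.

Use of Karkkainen's blockwise algorithm allows bowtie2-build to trade off between running time and memory usage. bowtie2-build has three options governing how it makes this trade: -p/--packed, --bmax/--bmaxdivn, and --dcv. By default, bowtie2-build automatically searches for the settings that yield the best running time without exhausting memory. This behavior can be disabled using the -a/--noauto option.

The indexer provides options pertaining to the "shape" of the index; for example, --offrate governs the fraction of Burrows-Wheeler rows that are "marked" (i.e. the density of the suffix-array sample; see the original FM index paper for details). All of these options are potentially profitable trade-offs depending on the application. They have been set to defaults that are reasonable for most cases according to our experiments. See Performance tuning for details.

bowtie2-build can generate either small or large indexes. The wrapper decides which based on the length of the input genome. If the reference does not exceed 4 billion characters but a large index is preferred, the user can specify --large-index to force bowtie2-build to build a large index instead.

The Bowtie 2 index is based on the FM index of Ferragina and Manzini, which in turn is based on the Burrows-Wheeler transform. The algorithm used to build the index is based on the blockwise algorithm of Karkkainen.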

Usage:

bowtie2-build [options]* <reference_in> <bt2_base>
Main arguments
<reference_in>
A comma-separated list of FASTA files containing the reference sequences to be aligned to, or, if -c is specified, the sequences themselves. E.g., <reference_in> might be chr1.fa,chr2.fa,chrX.fa,chrY.fa, or, if -c is specified, this might be GGTCATCCT,ACGGGTCGT,CCGTTCTATGCGGCTTA.

<bt2_base>
The basename of the index files to write. By default, bowtie2-build writes files named NAME.1.bt2, NAME.2.bt2, NAME.3.bt2, NAME.4.bt2, NAME.rev.1.bt2, and NAME.rev.2.bt2, where NAME is <bt2_base>.

Options
-f
The reference input files (specified as <reference_in>) are FASTA files (usually having extension .fa, .mfa, .fna or similar).

-c
The reference sequences are given on the command line. I.e. <reference_in> is a comma-separated list of sequences rather than a list of FASTA files.

--large-index
Force bowtie2-build to build a large index, even if the reference is less than ~ 4 billion nucleotides in length.

-a/--noauto
Disable the default behavior whereby bowtie2-build automatically selects values for the --bmax, --dcv and --packed parameters according to available memory. Instead, user may specify values for those parameters. If memory is exhausted during indexing, an error message will be printed; it is up to the user to try new parameters.

-p/--packed
Use a packed (2-bits-per-nucleotide) representation for DNA strings. This saves memory but makes indexing 2-3 times slower. Default: off. This is configured automatically by default; use -a/--noauto to configure manually.

--bmax <int>
The maximum number of suffixes allowed in a block. Allowing more suffixes per block makes indexing faster, but increases peak memory usage. Setting this option overrides any previous setting for --bmax, or --bmaxdivn. Default (in terms of the --bmaxdivn parameter) is --bmaxdivn 4 * number of threads. This is configured automatically by default; use -a/--noauto to configure manually.

--bmaxdivn <int>
The maximum number of suffixes allowed in a block, expressed as a fraction of the length of the reference. Setting this option overrides any previous setting for --bmax, or --bmaxdivn. Default: --bmaxdivn 4 * number of threads. This is configured automatically by default; use -a/--noauto to configure manually.

--dcv <int>
Use <int> as the period for the difference-cover sample. A larger period yields less memory overhead, but may make suffix sorting slower, especially if repeats are present. Must be a power of 2 no greater than 4096. Default: 1024. This is configured automatically by default; use -a/--noauto to configure manually.

--nodc
Disable use of the difference-cover sample. Suffix sorting becomes quadratic-time in the worst case (where the worst case is an extremely repetitive reference). Default: off.

-r/--noref
Do not build the NAME.3.bt2 and NAME.4.bt2 portions of the index, which contain a bitpacked version of the reference sequences and are used for paired-end alignment.

-3/--justref
Build only the NAME.3.bt2 and NAME.4.bt2 portions of the index, which contain a bitpacked version of the reference sequences and are used for paired-end alignment.

-o/--offrate <int>
To map alignments back to positions on the reference sequences, it's necessary to annotate ("mark") some or all of the Burrows-Wheeler rows with their corresponding location on the genome. -o/--offrate governs how many rows get marked: the indexer will mark every 2^<int> rows. Marking more rows makes reference-position lookups faster, but requires more memory to hold the annotations at runtime. The default is 5 (every 32nd row is marked; for human genome, annotations occupy about 340 megabytes).

-t/--ftabchars <int>
The ftab is the lookup table used to calculate an initial Burrows-Wheeler range with respect to the first <int> characters of the query. A larger <int> yields a larger lookup table but faster query times. The ftab has size 4^(<int>+1) bytes. The default setting is 10 (ftab is 4MB).

--seed <int>
Use <int> as the seed for pseudo-random number generator.

--cutoff <int>
Index only the first <int> bases of the reference sequences (cumulative across sequences) and ignore the rest.

-q/--quiet
bowtie2-build is verbose by default. With this option bowtie2-build will print only error messages.

--threads <int>
By default bowtie2-build is using only one thread. Increasing the number of threads will speed up the index building considerably in most cases.

-h/--help
Print usage information and quit.

--version
Print version information and quit.
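Two of the options above are stated as powers: -o/--offrate marks every 2^<int> rows, and the ftab occupies 4^(<int>+1) bytes. The sketch below just evaluates those two expressions for a few settings; the function names are made up, and the numbers say nothing about total index size.

```python
def marked_row_interval(offrate):
    """-o/--offrate: one Burrows-Wheeler row in every 2**offrate is marked."""
    return 2 ** offrate


def ftab_size_bytes(ftabchars):
    """-t/--ftabchars: the lookup table has 4**(ftabchars + 1) bytes."""
    return 4 ** (ftabchars + 1)


for offrate in (3, 4, 5, 6):
    print("offrate", offrate, "-> every", marked_row_interval(offrate), "rows marked")

for ftabchars in (8, 10, 12):
    size_mb = ftab_size_bytes(ftabchars) / (1024 * 1024)
    print("ftabchars", ftabchars, "->", size_mb, "MB")
```

With the defaults (offrate 5, ftabchars 10) this gives every 32nd row and a 4 MB table, matching the figures quoted in the option descriptions.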

The bowtie2-inspect index inspector

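bowtie2-inspect extracts information from a Bowtie 2 index about what kind of index it is and what reference sequences were used to build it. When run without any options, the tool outputs a FASTA file containing the sequences of the original references (with all non-A/C/G/T characters converted to Ns). It can also be used to extract just the reference sequence names using the -n/--names option, or a more verbose summary using the -s/--summary option.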

Command Line
Usage:

bowtie2-inspect [options]* <bt2_base>
Main arguments
<bt2_base>
The basename of the index to be inspected. The basename is name of any of the index files but with the .X.bt2 or .rev.X.bt2 suffix omitted. bowtie2-inspect first looks in the current directory for the index files, then in the directory specified in the BOWTIE2_INDEXES environment variable.

Options
-a/--across <int>
When printing FASTA output, output a newline character every <int> bases (default: 60).

-n/--names
Print reference sequence names, one per line, and quit.

-s/--summary
Print a summary that includes information about index settings, as well as the names and lengths of the input sequences. The summary has this format:

Colorspace  <0 or 1>
SA-Sample   1 in <sample>
FTab-Chars  <chars>
Sequence-1  <name>  <len>
Sequence-2  <name>  <len>
...
Sequence-N  <name>  <len>
Fields are separated by tabs. Colorspace is always set to 0 for Bowtie 2.

-v/--verbose
Print verbose output (for debugging).

--version
Print version information and quit.

-h/--help
Print usage information and quit.

Getting started with Bowtie 2: Lambda phage example

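Bowtie 2 comes with some example files to get you started. The example files are not scientifically significant; we use the Lambda phage reference genome simply because it is short, and the reads were generated by a computer program, not a sequencer. However, these files will let you start running Bowtie 2 and downstream tools right away.

First follow the manual instructions to obtain Bowtie 2. Set the BT2_HOME environment variable to point to the new Bowtie 2 directory containing the bowtie2, bowtie2-build and bowtie2-inspect binaries. This is important, as the BT2_HOME variable is used in the commands below to refer to that directory.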

Indexing a reference genome

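To create an index for the Lambda phage reference genome included with Bowtie 2, create a new temporary directory (it does not matter where), change into that directory, and run: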

$BT2_HOME/bowtie2-build $BT2_HOME/example/reference/lambda_virus.fa lambda_virus

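The command should print many lines of output and then quit. When the command completes, the current directory will contain six new files that all start with lambda_virus and end with .1.bt2, .2.bt2, .3.bt2, .4.bt2, .rev.1.bt2, and .rev.2.bt2. These files constitute the index, and you are done.

You can use bowtie2-build to create an index for a set of FASTA files obtained from any source, including sites such as UCSC, NCBI, and Ensembl. When indexing multiple FASTA files, specify all the files using commas to separate the file names. For more details on how to create an index with bowtie2-build, see the manual section on index building. You may also want to bypass this process by obtaining a pre-built index.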

Aligning example reads

Stay in the directory created in the previous step, which now contains the lambda_virus index files. Next, run:

$BT2_HOME/bowtie2 -x lambda_virus -U $BT2_HOME/example/reads/reads_1.fq -S eg1.sam

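This runs the Bowtie 2 aligner, which aligns a set of unpaired reads to the Lambda phage reference genome using the index generated in the previous step. The alignment results in SAM format are written to the file eg1.sam, and a short alignment summary is written to the console. (Actually, the summary is written to the "standard error" or "stderr" filehandle, which is typically printed to the console.)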

To see the first few lines of the SAM output, run:

head eg1.sam

You will see something like this:

@HD VN:1.0  SO:unsorted
@SQ SN:gi|9626243|ref|NC_001416.1|  LN:48502
@PG ID:bowtie2  PN:bowtie2  VN:2.0.1
r1  0   gi|9626243|ref|NC_001416.1| 18401   42  122M    *   0   0   TGAATGCGAACTCCGGGACGCTCAGTAATGTGACGATAGCTGAAAACTGTACGATAAACNGTACGCTGAGGGCAGAAAAAATCGTCGGGGACATTNTAAAGGCGGCGAGCGCGGCTTTTCCG  +"@6<:27(F&5)9"B):%B+A-%5A?2$HCB0B+0=D<7E/<.03#!.F77@6B==?C"7>;))%;,3-$.A06+<-1/@@?,26">=?*@'0;$:;??G+:#+(A?9+10!8!?()?7C>  AS:i:-5 XN:i:0  XM:i:3  XO:i:0  XG:i:0  NM:i:3  MD:Z:59G13G21G26    YT:Z:UU
r2  0   gi|9626243|ref|NC_001416.1| 8886    42  275M    *   0   0   NTTNTGATGCGGGCTTGTGGAGTTCAGCCGATCTGACTTATGTCATTACCTATGAAATGTGAGGACGCTATGCCTGTACCAAATCCTACAATGCCGGTGAAAGGTGCCGGGATCACCCTGTGGGTTTATAAGGGGATCGGTGACCCCTACGCGAATCCGCTTTCAGACGTTGACTGGTCGCGTCTGGCAAAAGTTAAAGACCTGACGCCCGGCGAACTGACCGCTGAGNCCTATGACGACAGCTATCTCGATGATGAAGATGCAGACTGGACTGC (#!!'+!$""%+(+)'%)%!+!(&++)''"#"#&#"!'!("%'""("+&%$%*%%#$%#%#!)*'(#")(($&$'&%+&#%*)*#*%*')(%+!%%*"$%"#+)$&&+)&)*+!"*)!*!("&&"*#+"&"'(%)*("'!$*!!%$&&&$!!&&"(*"$&"#&!$%'%"#)$#+%*+)!&*)+(""#!)!%*#"*)*')&")($+*%%)!*)!('(%""+%"$##"#+(('!*(($*'!"*('"+)&%#&$+('**$$&+*&!#%)')'(+(!%+ AS:i:-14    XN:i:0  XM:i:8  XO:i:0  XG:i:0  NM:i:8  MD:Z:0A0C0G0A108C23G9T81T46 YT:Z:UU
r3  16  gi|9626243|ref|NC_001416.1| 11599   42  338M    *   0   0   GGGCGCGTTACTGGGATGATCGTGAAAAGGCCCGTCTTGCGCTTGAAGCCGCCCGAAAGAAGGCTGAGCAGCAGACTCAAGAGGAGAAAAATGCGCAGCAGCGGAGCGATACCGAAGCGTCACGGCTGAAATATACCGAAGAGGCGCAGAAGGCTNACGAACGGCTGCAGACGCCGCTGCAGAAATATACCGCCCGTCAGGAAGAACTGANCAAGGCACNGAAAGACGGGAAAATCCTGCAGGCGGATTACAACACGCTGATGGCGGCGGCGAAAAAGGATTATGAAGCGACGCTGTAAAAGCCGAAACAGTCCAGCGTGAAGGTGTCTGCGGGCGAT  7F$%6=$:9B@/F'>=?!D?@0(:A*)7/>9C>6#1<6:C(.CC;#.;>;2'$4D:?&B!>689?(0(G7+0=@37F)GG=>?958.D2E04C<E,*AD%G0.%$+A:'H;?8<72:88?E6((CF)6DF#.)=>B>D-="C'B080E'5BH"77':"@70#4%A5=6.2/1>;9"&-H6)=$/0;5E:<8G!@::1?2DC7C*;@*#.1C0.D>H/20,!"C-#,6@%<+<D(AG-).?&#0.00'@)/F8?B!&"170,)>:?<A7#1(A@0E#&A.*DC.E")AH"+.,5,2>5"2?:G,F"D0B8D-6$65D<D!A/38860.*4;4B<*31?6  AS:i:-22    XN:i:0  XM:i:8  XO:i:0  XG:i:0  NM:i:8  MD:Z:80C4C16A52T23G30A8T76A41   YT:Z:UU
r4  0   gi|9626243|ref|NC_001416.1| 40075   42  184M    *   0   0   GGGCCAATGCGCTTACTGATGCGGAATTACGCCGTAAGGCCGCAGATGAGCTTGTCCATATGACTGCGAGAATTAACNGTGGTGAGGCGATCCCTGAACCAGTAAAACAACTTCCTGTCATGGGCGGTAGACCTCTAAATCGTGCACAGGCTCTGGCGAAGATCGCAGAAATCAAAGCTAAGT(=8B)GD04*G%&4F,1'A>.C&7=F$,+#6!))43C,5/5+)?-/0>/D3=-,2/+.1?@->;)00!'3!7BH$G)HG+ADC'#-9F)7<7"$?&.>0)@5;4,!0-#C!15CF8&HB+B==H>7,/)C5)5*+(F5A%D,EA<(>G9E0>7&/E?4%;#'92)<5+@7:A.(BG@BG86@.G AS:i:-1 XN:i:0  XM:i:1  XO:i:0  XG:i:0  NM:i:1  MD:Z:77C106 YT:Z:UU
r5  0   gi|9626243|ref|NC_001416.1| 48010   42  138M    *   0   0   GTCAGGAAAGTGGTAAAACTGCAACTCAATTACTGCAATGCCCTCGTAATTAAGTGAATTTACAATATCGTCCTGTTCGGAGGGAAGAACGCGGGATGTTCATTCTTCATCACTTTTAATTGATGTATATGCTCTCTT  9''%<D)A03E1-*7=),:F/0!6,D9:H,<9D%:0B(%'E,(8EFG$E89B$27G8F*2+4,-!,0D5()&=(FGG:5;3*@/.0F-G#5#3->('FDFEG?)5.!)"AGADB3?6(@H(:B<>6!>;>6>G,."?%  AS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:138    YT:Z:UU
r6  16  gi|9626243|ref|NC_001416.1| 41607   42  72M2D119M   *   0   0   TCGATTTGCAAATACCGGAACATCTCGGTAACTGCATATTCTGCATTAAAAAATCAACGCAAAAAATCGGACGCCTGCAAAGATGAGGAGGGATTGCAGCGTGTTTTTAATGAGGTCATCACGGGATNCCATGTGCGTGACGGNCATCGGGAAACGCCAAAGGAGATTATGTACCGAGGAAGAATGTCGCT 1H#G;H"$E*E#&"*)2%66?=9/9'=;4)4/>@%+5#@#$4A*!<D=="8#1*A9BA=:(1+#C&.#(3#H=9E)AC*5,AC#E'536*2?)H14?>9'B=7(3H/B:+A:8%1-+#(E%&$$&14"76D?>7(&20H5%*&CF8!G5B+A4F$7(:"'?0$?G+$)B-?2<0<F=D!38BH,%=8&5@+ AS:i:-13    XN:i:0  XM:i:2  XO:i:1  XG:i:2  NM:i:4  MD:Z:72^TT55C15A47  YT:Z:UU
r7  16  gi|9626243|ref|NC_001416.1| 4692    42  143M    *   0   0   TCAGCCGGACGCGGGCGCTGCAGCCGTACTCGGGGATGACCGGTTACAACGGCATTATCGCCCGTCTGCAACAGGCTGCCAGCGATCCGATGGTGGACAGCATTCTGCTCGATATGGACANGCCCGGCGGGATGGTGGCGGGG -"/@*7A0)>2,AAH@&"%B)*5*23B/,)90.B@%=FE,E063C9?,:26$-0:,.,1849'4.;F>FA;76+5&$<C":$!A*,<B,<)@<'85D%C*:)30@85;?.B$05=@95DCDH<53!8G:F:B7/A.E':434> AS:i:-6 XN:i:0  XM:i:2  XO:i:0  XG:i:0  NM:i:2  MD:Z:98G21C22   YT:Z:UU

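The first few lines (beginning with @) are SAM header lines, and the rest of the lines are SAM alignments, one line per read or mate. See the Bowtie 2 manual section on SAM output and the SAM specification for details about how to interpret the SAM file format.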

Paired-end example

To align the paired-end reads included with Bowtie 2, stay in the same directory and run:

$BT2_HOME/bowtie2 -x lambda_virus -1 $BT2_HOME/example/reads/reads_1.fq -2 $BT2_HOME/example/reads/reads_2.fq -S eg2.sam

This aligns a set of paired-end reads to the reference genome, with results written to the file eg2.sam.

Local alignment example

To use local alignment to align some longer reads included with Bowtie 2, stay in the same directory and run:

$BT2_HOME/bowtie2 --local -x lambda_virus -U $BT2_HOME/example/reads/longreads.fq -S eg3.sam

This aligns the long reads to the reference genome using local alignment, with results written to the file eg3.sam.

Using SAMtools/BCFtools downstream

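SAMtools is a collection of tools for manipulating and analyzing SAM and BAM alignment files. BCFtools is a collection of tools for calling variants and manipulating VCF and BCF files, and it is typically distributed with SAMtools. Using these tools together allows you to get from alignments in SAM format to variant calls in VCF format. This example assumes that samtools and bcftools are installed and that the directories containing these binaries are in your PATH environment variable.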
Run the paired-end example:

$BT2_HOME/bowtie2 -x $BT2_HOME/example/index/lambda_virus -1 $BT2_HOME/example/reads/reads_1.fq -2 $BT2_HOME/example/reads/reads_2.fq -S eg2.sam

Use samtools view to convert the SAM file into a BAM file. BAM is the binary format corresponding to the SAM text format. Run:

samtools view -bS eg2.sam > eg2.bam

Use samtools sort to convert the BAM file to a sorted BAM file:

samtools sort eg2.bam -o eg2.sorted.bam

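We now have a sorted BAM file called eg2.sorted.bam. Sorted BAM is a useful format because the alignments are (a) compressed, which is convenient for long-term storage, and (b) sorted, which is convenient for variant discovery. To generate variant calls in VCF format, run: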

bcftools mpileup -f $BT2_HOME/example/reference/lambda_virus.fa eg2.sorted.bam | bcftools view -Ov - > eg2.raw.bcf

Then to view the variants, run:

bcftools view eg2.raw.bcf

For more details and variations on this process, see the official SAMtools guide "Calling SNPs/INDELs with SAMtools/BCFtools".

Timing

Single-end:

[yutao@amms-sugon8401@reads]$ time bowtie2 -x ../index/lambda_virus -U longreads.fq -S ../mapping_out/long.sam --threads 50
6000 reads; of these:
  6000 (100.00%) were unpaired; of these:
    287 (4.78%) aligned 0 times
    5713 (95.22%) aligned exactly 1 time
    0 (0.00%) aligned >1 times
95.22% overall alignment rate

real    0m4.035s
user    0m8.719s
sys     0m15.860s

Paired-end:

[yutao@headnode@reads]$ time bowtie2 -x ../index/lambda_virus -1 reads_1.fq -2 reads_2.fq  -S ../test.sam --threads 10
10000 reads; of these:
  10000 (100.00%) were paired; of these:
    834 (8.34%) aligned concordantly 0 times
    9166 (91.66%) aligned concordantly exactly 1 time
    0 (0.00%) aligned concordantly >1 times
    ----
    834 pairs aligned concordantly 0 times; of these:
      42 (5.04%) aligned discordantly 1 time
    ----
    792 pairs aligned 0 times concordantly or discordantly; of these:
      1584 mates make up the pairs; of these:
        1005 (63.45%) aligned 0 times
        579 (36.55%) aligned exactly 1 time
        0 (0.00%) aligned >1 times
94.97% overall alignment rate

real    0m2.129s
user    0m4.973s
sys     0m0.387s

Full help

Usage
bowtie2 [options]* -x <bt2-idx> {-1 <m1> -2 <m2> | -U <r> | --interleaved <i> | --sra-acc <acc> | -b <bam>} [-S <sam>]
Main arguments
-x <bt2-idx>
The basename of the index for the reference genome. The basename is the name of any of the index files up to but not including the final .1.bt2 / .rev.1.bt2 / etc. bowtie2 looks for the specified index first in the current directory, then in the directory specified in the BOWTIE2_INDEXES environment variable.

-1 <m1>
Comma-separated list of files containing mate 1s (filename usually includes _1), e.g. -1 flyA_1.fq,flyB_1.fq. Sequences specified with this option must correspond file-for-file and read-for-read with those specified in <m2>. Reads may be a mix of different lengths. If - is specified, bowtie2 will read the mate 1s from the "standard in" or "stdin" filehandle.

-2 <m2>
Comma-separated list of files containing mate 2s (filename usually includes _2), e.g. -2 flyA_2.fq,flyB_2.fq. Sequences specified with this option must correspond file-for-file and read-for-read with those specified in <m1>. Reads may be a mix of different lengths. If - is specified, bowtie2 will read the mate 2s from the "standard in" or "stdin" filehandle.

-U <r>
Comma-separated list of files containing unpaired reads to be aligned, e.g. lane1.fq,lane2.fq,lane3.fq,lane4.fq. Reads may be a mix of different lengths. If - is specified, bowtie2 gets the reads from the "standard in" or "stdin" filehandle.

--interleaved
Reads interleaved FASTQ files where the first two records (8 lines) represent a mate pair.

--sra-acc
Reads are SRA accessions. If the accession provided cannot be found in local storage it will be fetched from the NCBI database. If you find that SRA alignments are long running please rerun your command with the -p/--threads parameter set to desired number of threads.

NB: this option is only available if bowtie 2 is compiled with the necessary SRA libraries. See Obtaining Bowtie 2 for details.

-b <bam>
Reads are unaligned BAM records sorted by read name. The --align-paired-reads and --preserve-tags options affect the way Bowtie 2 processes records.

-S <sam>
File to write SAM alignments to. By default, alignments are written to the "standard out" or "stdout" filehandle (i.e. the console).

Options
Input options
-q
Reads (specified with <m1>, <m2>, <s>) are FASTQ files. FASTQ files usually have extension .fq or .fastq. FASTQ is the default format. See also: --solexa-quals and --int-quals.

--tab5
Each read or pair is on a single line. An unpaired read line is [name]\t[seq]\t[qual]\n. A paired-end read line is [name]\t[seq1]\t[qual1]\t[seq2]\t[qual2]\n. An input file can be a mix of unpaired and paired-end reads and Bowtie 2 recognizes each according to the number of fields, handling each as it should.

--tab6
Similar to --tab5 except, for paired-end reads, the second end can have a different name from the first: [name1]\t[seq1]\t[qual1]\t[name2]\t[seq2]\t[qual2]\n
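
The field-count rule for --tab5 and --tab6 can be sketched as a small parser. This is only an illustration of the line layouts described above, not Bowtie 2's own reader; the function name and the returned dictionary shape are assumptions.

```python
def parse_tab_line(line):
    """Classify one --tab5/--tab6 style line by its number of tab-separated fields."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) == 3:
        # Unpaired: [name] [seq] [qual]
        name, seq, qual = fields
        return {"paired": False, "reads": [(name, seq, qual)]}
    if len(fields) == 5:
        # Paired, --tab5: both mates share one name
        name, seq1, qual1, seq2, qual2 = fields
        return {"paired": True,
                "reads": [(name, seq1, qual1), (name, seq2, qual2)]}
    if len(fields) == 6:
        # Paired, --tab6: each mate carries its own name
        name1, seq1, qual1, name2, seq2, qual2 = fields
        return {"paired": True,
                "reads": [(name1, seq1, qual1), (name2, seq2, qual2)]}
    raise ValueError(f"unexpected number of fields: {len(fields)}")


print(parse_tab_line("r1\tACGT\tIIII\n"))
print(parse_tab_line("p1\tACGT\tIIII\tTTGA\tIIII\n"))
```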

--qseq
Reads (specified with <m1>, <m2>, <s>) are QSEQ files. QSEQ files usually end in _qseq.txt. See also: --solexa-quals and --int-quals.

-f
Reads (specified with <m1>, <m2>, <s>) are FASTA files. FASTA files usually have extension .fa, .fasta, .mfa, .fna or similar. FASTA files do not have a way of specifying quality values, so when -f is set, the result is as if --ignore-quals is also set.

-r
Reads (specified with <m1>, <m2>, <s>) are files with one input sequence per line, without any other information (no read names, no qualities). When -r is set, the result is as if --ignore-quals is also set.

-F k:<int>,i:<int>
Reads are substrings (k-mers) extracted from a FASTA file <s>. Specifically, for every reference sequence in FASTA file <s>, Bowtie 2 aligns the k-mers at offsets 1, 1+i, 1+2i, ... until reaching the end of the reference. Each k-mer is aligned as a separate read. Quality values are set to all Is (40 on Phred scale). Each k-mer (read) is given a name like <sequence>_<offset>, where <sequence> is the name of the FASTA sequence it was drawn from and <offset> is its 0-based offset of origin with respect to the sequence. Only single k-mers, i.e. unpaired reads, can be aligned in this way.
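
The k-mer extraction that -F describes can be sketched as follows. The sketch assumes only full-length k-mers are emitted (the text above does not say what happens to a short tail), and the function name is hypothetical.

```python
def kmers_from_fasta_record(name, sequence, k, i):
    """Yield (read_name, kmer, qualities) for k-mers taken every i bases.

    Read names use the 0-based offset, and qualities are all 'I',
    following the description of -F k:<int>,i:<int>.
    """
    reads = []
    # Assumption: stop once a full k-mer no longer fits in the sequence.
    for offset in range(0, len(sequence) - k + 1, i):
        reads.append((f"{name}_{offset}", sequence[offset:offset + k], "I" * k))
    return reads


for read in kmers_from_fasta_record("chr_demo", "ACGTACGTAC", k=4, i=3):
    print(read)
```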

-c
The read sequences are given on the command line. I.e. <m1>, <m2> and <singles> are comma-separated lists of reads rather than lists of read files. There is no way to specify read names or qualities, so -c also implies --ignore-quals.

-s/--skip <int>
Skip (i.e. do not align) the first <int> reads or pairs in the input.

-u/--qupto <int>
Align the first <int> reads or read pairs from the input (after the -s/--skip reads or pairs have been skipped), then stop. Default: no limit.

-5/--trim5 <int>
Trim <int> bases from 5' (left) end of each read before alignment (default: 0).

-3/--trim3 <int>
Trim <int> bases from 3' (right) end of each read before alignment (default: 0).

--trim-to [3:|5:]<int>
Trim reads exceeding <int> bases. Bases will be trimmed from either the 3' (right) or 5' (left) end of the read. If the read end is not specified, Bowtie 2 will default to trimming from the 3' (right) end of the read. --trim-to and -3/-5 are mutually exclusive.
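
The effect of the trimming options on a read sequence can be sketched with plain string slicing. The helper names are illustrative; the sketch only mirrors the behavior described for -5, -3 and --trim-to and does not touch quality strings.

```python
def trim_fixed(seq, trim5=0, trim3=0):
    """Remove trim5 bases from the 5' (left) end and trim3 from the 3' (right) end."""
    end = len(seq) - trim3
    # Guard against trimming more bases than the read contains.
    return seq[trim5:max(end, trim5)]


def trim_to(seq, max_len, end="3"):
    """Shorten reads longer than max_len, trimming from the chosen end (default 3')."""
    if len(seq) <= max_len:
        return seq
    if end == "3":
        return seq[:max_len]
    if end == "5":
        return seq[len(seq) - max_len:]
    raise ValueError("end must be '3' or '5'")


print(trim_fixed("ACGTACGT", trim5=2, trim3=1))
print(trim_to("ACGTACGT", 5))
print(trim_to("ACGTACGT", 5, end="5"))
```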

--phred33
Input qualities are ASCII chars equal to the Phred quality plus 33. This is also called the "Phred+33" encoding, which is used by the very latest Illumina pipelines.

--phred64
Input qualities are ASCII chars equal to the Phred quality plus 64. This is also called the "Phred+64" encoding.

--solexa-quals
Convert input qualities from Solexa (which can be negative) to Phred (which can't). This scheme was used in older Illumina GA Pipeline versions (prior to 1.3). Default: off.

--int-quals
Quality values are represented in the read input file as space-separated ASCII integers, e.g., 40 40 30 40..., rather than ASCII characters, e.g., II?I.... Integers are treated as being on the Phred quality scale unless --solexa-quals is also specified. Default: off.
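
The quality encodings above differ only in how a Phred value is written down. A minimal sketch of decoding each form (function names are illustrative):

```python
def decode_quals(qual, offset=33):
    """Decode an ASCII quality string; offset is 33 for Phred+33, 64 for Phred+64."""
    return [ord(ch) - offset for ch in qual]


def decode_int_quals(qual):
    """Decode whitespace-separated integer qualities, as used with --int-quals."""
    return [int(token) for token in qual.split()]


print(decode_quals("II?I"))              # Phred+33
print(decode_quals("hh^h", offset=64))   # Phred+64
print(decode_int_quals("40 40 30 40"))   # integer form
```

All three calls describe the same four quality values, which matches the `40 40 30 40` versus `II?I` example in the --int-quals description.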

Preset options in --end-to-end mode
--very-fast
Same as: -D 5 -R 1 -N 0 -L 22 -i S,0,2.50

--fast
Same as: -D 10 -R 2 -N 0 -L 22 -i S,0,2.50

--sensitive
Same as: -D 15 -R 2 -N 0 -L 22 -i S,1,1.15 (default in --end-to-end mode)

--very-sensitive
Same as: -D 20 -R 3 -N 0 -L 20 -i S,1,0.50

Preset options in --local mode
--very-fast-local
Same as: -D 5 -R 1 -N 0 -L 25 -i S,1,2.00

--fast-local
Same as: -D 10 -R 2 -N 0 -L 22 -i S,1,1.75

--sensitive-local
Same as: -D 15 -R 2 -N 0 -L 20 -i S,1,0.75 (default in --local mode)

--very-sensitive-local
Same as: -D 20 -R 3 -N 0 -L 20 -i S,1,0.50

Alignment options
-N <int>
Sets the number of mismatches allowed in a seed alignment during multiseed alignment. Can be set to 0 or 1. Setting this higher makes alignment slower (often much slower) but increases sensitivity. Default: 0.

-L <int>
Sets the length of the seed substrings to align during multiseed alignment. Smaller values make alignment slower but more sensitive. Default: the --sensitive preset is used by default, which sets -L to 22 in --end-to-end mode and to 20 in --local mode.

-i <func>
Sets a function governing the interval between seed substrings to use during multiseed alignment. For instance, if the read has 30 characters, and seed length is 10, and the seed interval is 6, the seeds extracted will be:

Read:      TAGCTACGCTCTACGCTATCATGCATAAAC
Seed 1 fw: TAGCTACGCT
Seed 1 rc: AGCGTAGCTA
Seed 2 fw:       CGCTCTACGC
Seed 2 rc:       GCGTAGAGCG
Seed 3 fw:             ACGCTATCAT
Seed 3 rc:             ATGATAGCGT
Seed 4 fw:                   TCATGCATAA
Seed 4 rc:                   TTATGCATGA
Since it's best to use longer intervals for longer reads, this parameter sets the interval as a function of the read length, rather than a single one-size-fits-all number. For instance, specifying -i S,1,2.5 sets the interval function f to f(x) = 1 + 2.5 * sqrt(x), where x is the read length. See also: setting function options. If the function returns a result less than 1, it is rounded up to 1. Default: the --sensitive preset is used by default, which sets -i to S,1,1.15 in --end-to-end mode and to S,1,0.75 in --local mode.
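
The interval function and the seed layout in the example above can be sketched as follows. The sketch returns the raw value of the square-root function (how Bowtie 2 rounds it to a whole number of bases is not stated here) and takes an integer interval when extracting seeds. Function names are illustrative.

```python
import math


def seed_interval(read_len, a=1.0, b=1.15):
    """Evaluate f(x) = a + b * sqrt(x), raised to 1 if the result is smaller."""
    return max(1.0, a + b * math.sqrt(read_len))


def extract_seeds(read, seed_len, interval):
    """Return (offset, forward seed, reverse-complement seed) tuples."""
    complement = str.maketrans("ACGT", "TGCA")
    seeds = []
    for offset in range(0, len(read) - seed_len + 1, interval):
        fw = read[offset:offset + seed_len]
        rc = fw.translate(complement)[::-1]
        seeds.append((offset, fw, rc))
    return seeds


print(seed_interval(100, a=1, b=2.5))  # -i S,1,2.5 on a 100-base read

read = "TAGCTACGCTCTACGCTATCATGCATAAAC"
for offset, fw, rc in extract_seeds(read, seed_len=10, interval=6):
    print(offset, fw, rc)
```

The loop reproduces the four forward and reverse-complement seeds listed in the example.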

--n-ceil <func>
Sets a function governing the maximum number of ambiguous characters (usually Ns and/or .s) allowed in a read as a function of read length. For instance, specifying L,0,0.15 sets the N-ceiling function f to f(x) = 0 + 0.15 * x, where x is the read length. See also: setting function options. Reads exceeding this ceiling are filtered out. Default: L,0,0.15.

--dpad <int>
"Pads" dynamic programming problems by <int> columns on either side to allow gaps. Default: 15.

--gbar <int>
Disallow gaps within <int> positions of the beginning or end of the read. Default: 4.

--ignore-quals
When calculating a mismatch penalty, always consider the quality value at the mismatched position to be the highest possible, regardless of the actual value. I.e. input is treated as though all quality values are high. This is also the default behavior when the input doesn't specify quality values (e.g. in -f, -r, or -c modes).

--nofw/--norc
If --nofw is specified, bowtie2 will not attempt to align unpaired reads to the forward (Watson) reference strand. If --norc is specified, bowtie2 will not attempt to align unpaired reads against the reverse-complement (Crick) reference strand. In paired-end mode, --nofw and --norc pertain to the fragments; i.e. specifying --nofw causes bowtie2 to explore only those paired-end configurations corresponding to fragments from the reverse-complement (Crick) strand. Default: both strands enabled.

--no-1mm-upfront
By default, Bowtie 2 will attempt to find either an exact or a 1-mismatch end-to-end alignment for the read before trying the multiseed heuristic. Such alignments can be found very quickly, and many short read alignments have exact or near-exact end-to-end alignments. However, this can lead to unexpected alignments when the user also sets options governing the multiseed heuristic, like -L and -N. For instance, if the user specifies -N 0 and -L equal to the length of the read, the user will be surprised to find 1-mismatch alignments reported. This option prevents Bowtie 2 from searching for 1-mismatch end-to-end alignments before using the multiseed heuristic, which leads to the expected behavior when combined with options such as -L and -N. This comes at the expense of speed.

--end-to-end
In this mode, Bowtie 2 requires that the entire read align from one end to the other, without any trimming (or "soft clipping") of characters from either end. The match bonus --ma always equals 0 in this mode, so all alignment scores are less than or equal to 0, and the greatest possible alignment score is 0. This is mutually exclusive with --local. --end-to-end is the default mode.

--local
In this mode, Bowtie 2 does not require that the entire read align from one end to the other. Rather, some characters may be omitted ("soft clipped") from the ends in order to achieve the greatest possible alignment score. The match bonus --ma is used in this mode, and the best possible alignment score is equal to the match bonus (--ma) times the length of the read. Specifying --local and one of the presets (e.g. --local --very-fast) is equivalent to specifying the local version of the preset (--very-fast-local). This is mutually exclusive with --end-to-end. --end-to-end is the default mode.

Scoring options
--ma <int>
Sets the match bonus. In --local mode <int> is added to the alignment score for each position where a read character aligns to a reference character and the characters match. Not used in --end-to-end mode. Default: 2.

--mp MX,MN
Sets the maximum (MX) and minimum (MN) mismatch penalties, both integers. A number less than or equal to MX and greater than or equal to MN is subtracted from the alignment score for each position where a read character aligns to a reference character, the characters do not match, and neither is an N. If --ignore-quals is specified, the number subtracted equals MX. Otherwise, the number subtracted is MN + floor( (MX-MN) * (MIN(Q, 40.0)/40.0) ) where Q is the Phred quality value. Default: MX = 6, MN = 2.
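
The quality-aware mismatch penalty can be written out directly from the formula above. This is a sketch of the formula only; the function name is illustrative.

```python
import math


def mismatch_penalty(q, mx=6, mn=2, ignore_quals=False):
    """Penalty for one mismatch at a position with Phred quality q."""
    if ignore_quals:
        # With --ignore-quals every mismatch costs the maximum.
        return mx
    return mn + math.floor((mx - mn) * (min(q, 40.0) / 40.0))


for q in (0, 20, 40):
    print(q, mismatch_penalty(q))
```

With the defaults, a mismatch at a low-quality base costs as little as MN and one at a high-quality base costs MX.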

--np <int>
Sets penalty for positions where the read, reference, or both, contain an ambiguous character such as N. Default: 1.

--rdg <int1>,<int2>
Sets the read gap open (<int1>) and extend (<int2>) penalties. A read gap of length N gets a penalty of <int1> + N * <int2>. Default: 5, 3.

--rfg <int1>,<int2>
Sets the reference gap open (<int1>) and extend (<int2>) penalties. A reference gap of length N gets a penalty of <int1> + N * <int2>. Default: 5, 3.
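
The gap penalty formula, combined with mismatch penalties, gives a rough picture of how an end-to-end score is built up. The sketch below ignores ambiguous-character penalties (--np) and assumes the match bonus is 0, as stated for --end-to-end mode; the function names are illustrative.

```python
def gap_penalty(length, gap_open=5, gap_extend=3):
    """Penalty for one gap of the given length: open + length * extend."""
    return gap_open + length * gap_extend


def end_to_end_score(mismatch_penalties, read_gap_lengths=(), ref_gap_lengths=()):
    """Sum the penalties and negate them (simplified: no N penalties)."""
    total = sum(mismatch_penalties)
    total += sum(gap_penalty(n) for n in read_gap_lengths)
    total += sum(gap_penalty(n) for n in ref_gap_lengths)
    return -total


print(gap_penalty(1))
print(gap_penalty(3))
# One mismatch costing 6 plus one read gap of length 2.
print(end_to_end_score([6], read_gap_lengths=[2]))
```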

--score-min <func>
Sets a function governing the minimum alignment score needed for an alignment to be considered "valid" (i.e. good enough to report). This is a function of read length. For instance, specifying L,0,-0.6 sets the minimum-score function f to f(x) = 0 + -0.6 * x, where x is the read length. See also: setting function options. The default in --end-to-end mode is L,-0.6,-0.6 and the default in --local mode is G,20,8.
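
Options such as --score-min, --n-ceil and -i share the same `<type>,<A>,<B>` function syntax. A minimal evaluator, assuming the linear (L), square-root (S) and natural-log (G) types and leaving out the constant type, might look like this; the function name is illustrative.

```python
import math


def eval_function(spec, x):
    """Evaluate a function option such as 'L,-0.6,-0.6' at read length x."""
    kind, a, b = spec.split(",")
    a, b = float(a), float(b)
    if kind == "L":
        return a + b * x
    if kind == "S":
        return a + b * math.sqrt(x)
    if kind == "G":
        return a + b * math.log(x)
    raise ValueError(f"function type not handled in this sketch: {kind}")


read_len = 100
print(eval_function("L,-0.6,-0.6", read_len))  # --end-to-end default for --score-min
print(eval_function("G,20,8", read_len))       # --local default for --score-min
print(eval_function("L,0,0.15", read_len))     # --n-ceil default
```

For a 100-base read, the end-to-end default requires a score of at least about -60.6, so an alignment with a handful of high-quality mismatches is still valid.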

Reporting options
-k <int>
By default, bowtie2 searches for distinct, valid alignments for each read. When it finds a valid alignment, it continues looking for alignments that are nearly as good or better. The best alignment found is reported (randomly selected from among best if tied). Information about the best alignments is used to estimate mapping quality and to set SAM optional fields, such as AS:i and XS:i.

When -k is specified, however, bowtie2 behaves differently. Instead, it searches for at most <int> distinct, valid alignments for each read. The search terminates when it can't find more distinct valid alignments, or when it finds <int>, whichever happens first. All alignments found are reported in descending order by alignment score. The alignment score for a paired-end alignment equals the sum of the alignment scores of the individual mates. Each reported read or pair alignment beyond the first has the SAM 'secondary' bit (which equals 256) set in its FLAGS field. For reads that have more than <int> distinct, valid alignments, bowtie2 does not guarantee that the <int> alignments reported are the best possible in terms of alignment score. -k is mutually exclusive with -a.

Note: Bowtie 2 is not designed with large values for -k in mind, and when aligning reads to long, repetitive genomes large -k can be very, very slow.

-a
Like -k but with no upper limit on number of alignments to search for. -a is mutually exclusive with -k.

Note: Bowtie 2 is not designed with -a mode in mind, and when aligning reads to long, repetitive genomes this mode can be very, very slow.

Effort options
-D <int>
Up to <int> consecutive seed extension attempts can "fail" before Bowtie 2 moves on, using the alignments found so far. A seed extension "fails" if it does not yield a new best or a new second-best alignment. This limit is automatically adjusted up when -k or -a are specified. Default: 15.

-R <int>
<int> is the maximum number of times Bowtie 2 will "re-seed" reads with repetitive seeds. When "re-seeding," Bowtie 2 simply chooses a new set of seeds (same length, same number of mismatches allowed) at different offsets and searches for more alignments. A read is considered to have repetitive seeds if the total number of seed hits divided by the number of seeds that aligned at least once is greater than 300. Default: 2.
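
The "repetitive seeds" test is a simple ratio. A sketch of the criterion as worded above (the function name and the handling of the zero-seed case are assumptions):

```python
def has_repetitive_seeds(total_seed_hits, seeds_aligned, threshold=300):
    """True if the average number of hits per aligned seed exceeds the threshold."""
    if seeds_aligned == 0:
        # Assumption: with no aligned seeds there is nothing repetitive to report.
        return False
    return total_seed_hits / seeds_aligned > threshold


print(has_repetitive_seeds(total_seed_hits=4000, seeds_aligned=10))
print(has_repetitive_seeds(total_seed_hits=3000, seeds_aligned=10))
```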

Paired-end options
-I/--minins <int>
The minimum fragment length for valid paired-end alignments. E.g. if -I 60 is specified and a paired-end alignment consists of two 20-bp alignments in the appropriate orientation with a 20-bp gap between them, that alignment is considered valid (as long as -X is also satisfied). A 19-bp gap would not be valid in that case. If trimming options -3 or -5 are also used, the -I constraint is applied with respect to the untrimmed mates.

The larger the difference between -I and -X, the slower Bowtie 2 will run. This is because larger differences between -I and -X require that Bowtie 2 scan a larger window to determine if a concordant alignment exists. For typical fragment length ranges (200 to 400 nucleotides), Bowtie 2 is very efficient.

Default: 0 (essentially imposing no minimum)

-X/--maxins <int>
The maximum fragment length for valid paired-end alignments. E.g. if -X 100 is specified and a paired-end alignment consists of two 20-bp alignments in the proper orientation with a 60-bp gap between them, that alignment is considered valid (as long as -I is also satisfied). A 61-bp gap would not be valid in that case. If trimming options -3 or -5 are also used, the -X constraint is applied with respect to the untrimmed mates, not the trimmed mates.

The larger the difference between -I and -X, the slower Bowtie 2 will run. This is because larger differences between -I and -X require that Bowtie 2 scan a larger window to determine if a concordant alignment exists. For typical fragment length ranges (200 to 400 nucleotides), Bowtie 2 is very efficient.

Default: 500.
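
The two worked examples for -I and -X come down to comparing the fragment length (both mates plus the gap between them) against the allowed range. A sketch, with an illustrative function name:

```python
def fragment_ok(mate1_len, gap, mate2_len, minins=0, maxins=500):
    """Check a fragment against the -I/--minins and -X/--maxins limits."""
    fragment_len = mate1_len + gap + mate2_len
    return minins <= fragment_len <= maxins


# -I 60: two 20-bp mates need at least a 20-bp gap.
print(fragment_ok(20, 20, 20, minins=60))
print(fragment_ok(20, 19, 20, minins=60))
# -X 100: two 20-bp mates allow at most a 60-bp gap.
print(fragment_ok(20, 60, 20, maxins=100))
print(fragment_ok(20, 61, 20, maxins=100))
```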

--fr/--rf/--ff
The upstream/downstream mate orientations for a valid paired-end alignment against the forward reference strand. E.g., if --fr is specified and there is a candidate paired-end alignment where mate 1 appears upstream of the reverse complement of mate 2 and the fragment length constraints (-I and -X) are met, that alignment is valid. Also, if mate 2 appears upstream of the reverse complement of mate 1 and all other constraints are met, that too is valid. --rf likewise requires that an upstream mate1 be reverse-complemented and a downstream mate2 be forward-oriented. --ff requires both an upstream mate 1 and a downstream mate 2 to be forward-oriented. Default: --fr (appropriate for Illumina's Paired-end Sequencing Assay).

--no-mixed
By default, when bowtie2 cannot find a concordant or discordant alignment for a pair, it then tries to find alignments for the individual mates. This option disables that behavior.

--no-discordant
By default, bowtie2 looks for discordant alignments if it cannot find any concordant alignments. A discordant alignment is an alignment where both mates align uniquely, but that does not satisfy the paired-end constraints (--fr/--rf/--ff, -I, -X). This option disables that behavior.

--dovetail
If the mates "dovetail", that is if one mate alignment extends past the beginning of the other such that the wrong mate begins upstream, consider that to be concordant. See also: Mates can overlap, contain or dovetail each other. Default: mates cannot dovetail in a concordant alignment.

--no-contain
If one mate alignment contains the other, consider that to be non-concordant. See also: Mates can overlap, contain or dovetail each other. Default: a mate can contain the other in a concordant alignment.

--no-overlap
If one mate alignment overlaps the other at all, consider that to be non-concordant. See also: Mates can overlap, contain or dovetail each other. Default: mates can overlap in a concordant alignment.

BAM options
--align-paired-reads
Bowtie 2 will, by default, attempt to align unpaired BAM reads. Use this option to align paired-end reads instead.

--preserve-tags
Preserve tags from the original BAM record by appending them to the end of the corresponding Bowtie 2 SAM output.

Output options
-t/--time
Print the wall-clock time required to load the index files and align the reads. This is printed to the "standard error" ("stderr") filehandle. Default: off.

--un <path>
--un-gz <path>
--un-bz2 <path>
--un-lz4 <path>
Write unpaired reads that fail to align to file at <path>. These reads correspond to the SAM records with the FLAGS 0x4 bit set and neither the 0x40 nor 0x80 bits set. If --un-gz is specified, output will be gzip compressed. If --un-bz2 or --un-lz4 is specified, output will be bzip2 or lz4 compressed. Reads written in this way will appear exactly as they did in the input file, without any modification (same sequence, same name, same quality string, same quality encoding). Reads will not necessarily appear in the same order as they did in the input.

--al <path>
--al-gz <path>
--al-bz2 <path>
--al-lz4 <path>
Write unpaired reads that align at least once to file at <path>. These reads correspond to the SAM records with the FLAGS 0x4, 0x40, and 0x80 bits unset. If --al-gz is specified, output will be gzip compressed. If --al-bz2 is specified, output will be bzip2 compressed. Similarly if --al-lz4 is specified, output will be lz4 compressed. Reads written in this way will appear exactly as they did in the input file, without any modification (same sequence, same name, same quality string, same quality encoding). Reads will not necessarily appear in the same order as they did in the input.

--un-conc <path>
--un-conc-gz <path>
--un-conc-bz2 <path>
--un-conc-lz4 <path>
Write paired-end reads that fail to align concordantly to file(s) at <path>. These reads correspond to the SAM records with the FLAGS 0x4 bit set and either the 0x40 or 0x80 bit set (depending on whether it's mate #1 or #2). .1 and .2 strings are added to the filename to distinguish which file contains mate #1 and mate #2. If a percent symbol, %, is used in <path>, the percent symbol is replaced with 1 or 2 to make the per-mate filenames. Otherwise, .1 or .2 are added before the final dot in <path> to make the per-mate filenames. Reads written in this way will appear exactly as they did in the input files, without any modification (same sequence, same name, same quality string, same quality encoding). Reads will not necessarily appear in the same order as they did in the inputs.
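
The per-mate file naming rule can be sketched as below. The handling of a path with no dot, and of a path containing more than one percent symbol, is an assumption; the function name is illustrative.

```python
def per_mate_paths(path):
    """Return the (mate 1, mate 2) filenames derived from <path>."""
    if "%" in path:
        # Assumption: every '%' in the path is substituted.
        return path.replace("%", "1"), path.replace("%", "2")
    stem, dot, ext = path.rpartition(".")
    if dot:
        # Insert .1 / .2 before the final dot.
        return f"{stem}.1.{ext}", f"{stem}.2.{ext}"
    # Assumption: with no dot at all, the suffix is simply appended.
    return f"{path}.1", f"{path}.2"


print(per_mate_paths("unconc_%.fq"))
print(per_mate_paths("unconc.fq"))
```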

--al-conc <path>
--al-conc-gz <path>
--al-conc-bz2 <path>
--al-conc-lz4 <path>
Write paired-end reads that align concordantly at least once to file(s) at <path>. These reads correspond to the SAM records with the FLAGS 0x4 bit unset and either the 0x40 or 0x80 bit set (depending on whether it's mate #1 or #2). .1 and .2 strings are added to the filename to distinguish which file contains mate #1 and mate #2. If a percent symbol, %, is used in <path>, the percent symbol is replaced with 1 or 2 to make the per-mate filenames. Otherwise, .1 or .2 are added before the final dot in <path> to make the per-mate filenames. Reads written in this way will appear exactly as they did in the input files, without any modification (same sequence, same name, same quality string, same quality encoding). Reads will not necessarily appear in the same order as they did in the inputs.
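
The FLAGS descriptions for --un, --al, --un-conc and --al-conc can be read as a small decision table on three bits. The sketch below follows those descriptions literally and looks only at the 0x4, 0x40 and 0x80 bits; it does not check whether a pair is actually concordant, so treat it as a reading aid rather than a reimplementation.

```python
UNMAPPED = 0x4   # read failed to align
MATE1 = 0x40     # read is mate #1 of a pair
MATE2 = 0x80     # read is mate #2 of a pair


def output_option_for_flag(flag):
    """Map a SAM FLAGS value to the option family whose description it matches."""
    is_mate = bool(flag & (MATE1 | MATE2))
    is_unmapped = bool(flag & UNMAPPED)
    if is_mate:
        return "--un-conc" if is_unmapped else "--al-conc"
    return "--un" if is_unmapped else "--al"


for flag in (4, 0, 77, 99):
    print(flag, output_option_for_flag(flag))
```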

--quiet
Print nothing besides alignments and serious errors.

--met-file <path>
Write bowtie2 metrics to file <path>. Having alignment metrics can be useful for debugging certain problems, especially performance issues. See also: --met. Default: metrics disabled.

--met-stderr
Write bowtie2 metrics to the "standard error" ("stderr") filehandle. This is not mutually exclusive with --met-file. Having alignment metrics can be useful for debugging certain problems, especially performance issues. See also: --met. Default: metrics disabled.

--met <int>
Write a new bowtie2 metrics record every <int> seconds. Only matters if either --met-stderr or --met-file are specified. Default: 1.

SAM options
--no-unal
Suppress SAM records for reads that failed to align.

--no-hd
Suppress SAM header lines (starting with @).

--no-sq
Suppress @SQ SAM header lines.

--rg-id <text>
Set the read group ID to <text>. This causes the SAM @RG header line to be printed, with <text> as the value associated with the ID: tag. It also causes the RG:Z: extra field to be attached to each SAM output record, with value set to <text>.

--rg <text>
Add <text> (usually of the form TAG:VAL, e.g. SM:Pool1) as a field on the @RG header line. Note: in order for the @RG line to appear, --rg-id must also be specified. This is because the ID tag is required by the SAM Spec. Specify --rg multiple times to set multiple fields. See the SAM Spec for details about what fields are legal.

--omit-sec-seq
When printing secondary alignments, Bowtie 2 by default will write out the SEQ and QUAL strings. Specifying this option causes Bowtie 2 to print an asterisk in those fields instead.

--soft-clipped-unmapped-tlen
Consider soft-clipped bases unmapped when calculating TLEN. Only available in --local mode.

--sam-no-qname-trunc
Suppress the standard behavior of truncating the read name at the first whitespace, at the expense of generating non-standard SAM.

--xeq
Use '='/'X', instead of 'M', to specify matches/mismatches in the SAM record.

--sam-append-comment
Append FASTA/FASTQ comment to SAM record, where a comment is everything after the first space in the read name.

Performance options
-o/--offrate <int>
Override the offrate of the index with <int>. If <int> is greater than the offrate used to build the index, then some row markings are discarded when the index is read into memory. This reduces the memory footprint of the aligner but requires more time to calculate text offsets. <int> must be greater than the value used to build the index.

-p/--threads NTHREADS
Launch NTHREADS parallel search threads (default: 1). Threads will run on separate processors/cores and synchronize when parsing reads and outputting alignments. Searching for alignments is highly parallel, and speedup is close to linear. Increasing -p increases Bowtie 2's memory footprint. E.g. when aligning to a human genome index, increasing -p from 1 to 8 increases the memory footprint by a few hundred megabytes. This option is only available if bowtie is linked with the pthreads library (i.e. if BOWTIE_PTHREADS=0 is not specified at build time).

--reorder
Guarantees that output SAM records are printed in an order corresponding to the order of the reads in the original input file, even when -p is set greater than 1. Specifying --reorder and setting -p greater than 1 causes Bowtie 2 to run somewhat slower and use somewhat more memory than if --reorder were not specified. Has no effect if -p is set to 1, since output order will naturally correspond to input order in that case.

--mm
Use memory-mapped I/O to load the index, rather than typical file I/O. Memory-mapping allows many concurrent bowtie processes on the same computer to share the same memory image of the index (i.e. you pay the memory overhead just once). This facilitates memory-efficient parallelization of bowtie in situations where using -p is not possible or not preferable.

Other options
--qc-filter
Filter out reads for which the QSEQ filter field is non-zero. Only has an effect when read format is --qseq. Default: off.

--seed <int>
Use <int> as the seed for pseudo-random number generator. Default: 0.

--non-deterministic
Normally, Bowtie 2 re-initializes its pseudo-random generator for each read. It seeds the generator with a number derived from (a) the read name, (b) the nucleotide sequence, (c) the quality sequence, (d) the value of the --seed option. This means that if two reads are identical (same name, same nucleotides, same qualities) Bowtie 2 will find and report the same alignment(s) for both, even if there was ambiguity. When --non-deterministic is specified, Bowtie 2 re-initializes its pseudo-random generator for each read using the current time. This means that Bowtie 2 will not necessarily report the same alignment for two identical reads. This is counter-intuitive for some users, but might be more appropriate in situations where the input consists of many identical reads.

--version
Print version information and quit.

-h/--help
Print usage information and quit.