Read-to-graph alignment and error correction

最新推荐文章于 2024-09-16 07:38:24 发布

wangchuang2017

最新推荐文章于 2024-09-16 07:38:24 发布

阅读量339

点赞数

本文链接：https://blog.csdn.net/u010608296/article/details/103420116

版权

生物信息学同时被 3 个专栏收录

642 篇文章 396 订阅

订阅专栏

第三代测序技术

257 篇文章 24 订阅

订阅专栏

生信工具Bioinformatics Tools

77 篇文章 40 订阅

订阅专栏

Read-to-graph alignment and error correction

Mon 18 May 2015

By Jordan Fish

In science.

tags: kh mer wok graph align errcorr

One of the newer features in khmer that we're pretty excited about is the read-to-graph aligner, which gives us a way to align sequences to a De Bruijn graph; our nickname for it is "graphalign."

Briefly, graphalign uses a pair-HMM to align a sequence to a k-mer graph (aka De Bruijn graph) allowing both mismatches and indels, and taking into account coverage using a binary model (trusted and untrusted k-mers). The core code was written by Jordan Fish when he was a graduate student in the lab, based on ideas stemming from Jason Pell's thesis work on error correction. It was then refactored by Michael Crusoe.

Graphalign actually lets us do lots of things, including align both short and long sequences to DBG graphs, error correct, and call variants. We've got a simple Python API built into khmer, and we're working to extend it.

The core graphalign API is based around the concept of a ReadAligner object:

aligner = khmer.ReadAligner(graph, trusted_cov, bits_theta)

where 'graph' is a De Bruijn graph (implemented as a counting table in khmer), 'trusted_cov' defines what the trusted k-mer coverage is, and 'bits_theta' adjusts a scoring parameter used to extend alignments.

The 'aligner' object can be used to align short sequences to the graph:

score, graph_alignment, read_alignment, truncated = \
    aligner.align(read)

Here, 'graph_alignment' and 'read_alignment' are strings; if 'truncated' is false, then they are of the same length, and constitute a full gapped alignment of the DNA sequence in 'read' to the graph.

The approach used by 'align' is to seed an alignment at the first trusted k-mer, and then extend the alignment along the graph in both directions. Thus, it's effectively a local aligner.

Error correction

Our initial motivation for graphalign was to use it to do error correction, with specific application to short-read sequences. There was (and to some extent still is) a dearth of error correction approaches that can be used for metagenome and transcriptome data sets, and since that kind of data is what our lab works on, we needed an error correction approach for those data. We also wanted something a bit more programmable than the existing error correctors, which were primarily command-line tools; we've found a lot of value in building libraries, and wanted to use that approach here, too.

The basic idea is this: we build a graph from our short-read data, and then go back through and align each short read to the graph. A successful alignment is then the corrected read. The basic code looks like this:

graph = build_graph(dataset)

aligner = khmer.ReadAligner(graph, trusted_cov, bits_theta)

for read in dataset:
    score, graph_align, read_align, is_truncated = aligner.align(read)
    corrected_read = graph_align

In conjunction with our work on semi-streaming algorithms, we can directly convert this into a semi-streaming algorithm that works on genomes, metagenomes, and transcriptomes. This is implemented in the correct-reads script.

Some results

If we try this out on a simulated data set (random genome, randomly chosen reads - see target compare-sim.txt in Makefile), it takes the simulated data from an error rate of around 1% to about 0.1%; see compare-sim.txt.

Applying this to a ~7m read subset of mRNAseq that we tackled in the semi-streaming paper (the data itself is from the Trinity paper, Grabherr et al, 2011), we take the data from an error rate of about 1.59% to 0.98% (see target rseq-compare.txt in Makefile). There are several reasons why this misses so many errors - first, error correction depends on high coverage, and much of this RNAseq data set is low coverage; second, this data set has a lot of errors; and third, RNAseq may have a broader k-mer abundance distribution than genomic sequencing.

One important side note: we use exactly the same script for error correcting RNAseq data as we do for genomic data.

How good is the error correction?

tl; dr? It's pretty good but still worse than current methods. When we compare to Quake results on an E. coli data set (target compare-ecoli.txt in the Makefile), we see:

Data set	Error rate
Uncorrected	1.587%
Quake	0.009%
khmer	0.013%

This isn't too bad - two orders of magnitude decrease in error rate! - but we'd like to at least be able to beat Quake :).

(Note that here we do a fair comparison by looking only at errors on sequences that Quake doesn't discard; to get comparable results on your data with khmer, you'd also have to trim your reads. We are also making use of the approach developed in the streaming paper where we digitally normalize the graph in advance, in order to decrease the number of errors and the size of the graph.)

Concluding thoughts

What attracts us to this approach is that it's really simple. The basic error correction is a few lines, although it's surrounded by a bunch of machinery for doing semi-streaming analysis and keeping pairing intact. (The two-pass/offline script for error correction is much cleaner, because it omits all of this machinery.)

It's also nice that this applies to all shotgun sequencing, not just genomic; that's a trivial extension of our semi-streaming paper.

We also suspect that this approach is quite tunable, although we are just beginning to investigate the proper way to build parameters for the pair-HMM, and we haven't nailed down the right coverage/cutoff parameters for error correction either. More work to be done!

In any case, there's also more than error correction to be done with the graphalign approach -- stay tuned!

References and previous work

This is by no means novel - we're building on a lot of ideas from a lot of people. Our interest is in bridging from theory to practice, and providing a decent tunable implementation in an open-source package, so that we can explore these ideas more widely.

Here is short summary of previous work, surely incomplete --

Much of this was proximally inspired by Jordan's work on Xander, software to do HMM-guided gene assembly from metagenomic data. (An accompanying paper has been accepted for publication; will blog about that when it hits.)
More generally, my MSU colleague Yanni Sun has had several PhD students that have worked on HMMs and graph alignment, and she and her students have been great sources of ideas! (She co-advised Jordan.)
BlastGraph, like Xander, built on the idea of graph alignment. It is the earliest reference I know of to graph alignment, but I haven't looked very hard.
Yuzhen Ye and Haixu Tang at Indiana have developed very similar functionality that I became aware of when reviewing their nice paper on graph alignment for metatranscriptomics.
Jared Simpson has been doing nice work on aligning Nanopore reads to a reference sequence. My guess is that the multiple sequence alignment approach described in Jonathan Dursi's blog post is going to prove relevant to us.
The error corrector Coral (Salmela and Schroder, 2011) bears a strong philosophical resemblance to graphalign in its approach to error correction, if you think of a De Bruijn graph as a kind of multiple-sequence alignment.

If you know of more, please add references below, in the comments - much appreciated!

Appendix: Running this code

The computational results in this blog post are Rather Reproducible (TM). Please see https://github.com/dib-lab/2015-khmer-wok1-ec/blob/master/README.rst for instructions on replicating the results on a virtual machine or using a Docker container.

读图对齐和纠错

2015年5月18日星期一

通过约旦鱼

在科学上。

标签：高棉炒锅 graphalign errcorr

高棉语中令我们兴奋的新功能之一是读图对齐器，它为我们提供了一种将序列与De Bruijn图对齐的方法。我们的昵称是“ graphalign”。

简而言之，graphalign使用对-HMM将序列与允许不匹配和插入缺失的k-mer图（aka De Bruijn图）对齐，并考虑到使用二进制模型（可信和不可信的k-mers）的覆盖率。核心代码是乔丹·菲什（Jordan Fish）在实验室毕业时编写的，其依据是杰森·佩尔（Jason Pell）关于纠错的论文工作所产生的思想。然后由迈克尔·克鲁索（Michael Crusoe）对其进行了重构。

Graphalign实际上使我们能够做很多事情，包括将短序列和长序列与DBG图对齐，纠错和调用变量。我们在khmer中内置了一个简单的Python API，我们正在努力对其进行扩展。

核心graphalign API基于ReadAligner对象的概念：

<span style="color:#000305"><span style="color:#ffffff">aligner = khmer.ReadAligner（graph，Trusted_cov，bits_theta）
</span></span>

其中“图”是De Bruijn图（在高棉语中作为计数表实现），“ trusted_cov”定义了可信k-mer覆盖范围，而“ bits_theta”调整用于扩展比对的评分参数。

'aligner'对象可用于将短序列与图对齐：

<span style="color:#000305"><span style="color:#ffffff">得分，graph_alignment，read_alignment，截断= \
    aligner.align（读取）
</span></span>

这里，“ graph_alignment”和“ read_alignment”是字符串；如果“截短”为假，则它们具有相同的长度，并构成“读”到图的DNA序列的完全空位比对。

“对齐”使用的方法是在第一个受信任的k-mer处播种对齐方式，然后在两个方向上沿图扩展对齐方式。因此，它实际上是本地对齐器。

错误更正

我们使用graphalign的最初动机是使用它来进行纠错，特别是针对短读序列。过去（并且在某种程度上仍然如此）缺乏可用于元基因组和转录组数据集的纠错方法，并且由于这种数据是我们实验室处理的数据，因此我们需要针对这些数据的纠错方法。我们还需要比现有的纠错器更具可编程性的东西，后者主要是命令行工具。我们在构建库中发现了很多价值，并且也想在这里使用这种方法。

基本思想是这样的：我们从短读取的数据中构建一个图形，然后回顾并将每个短读取与该图形对齐。然后，成功的比对是校正后的读数。基本代码如下所示：

<span style="color:#000305"><span style="color:#ffffff">图= build_graph（数据集）

aligner = khmer.ReadAligner（graph，Trusted_cov，bits_theta）

用于读取数据集：
    得分，graph_align，read_align，is_truncated = aligner.align（read）
    corrected_read = graph_align
</span></span>

结合我们在半流算法上的工作，我们可以将其直接转换为适用于基因组，元基因组和转录组的半流算法。这是在正确读取脚本中实现的。

一些结果

如果我们尝试了这一点上的模拟数据组（随机基因组中，随机选择的读取-见目标比较-sim.txt在生成文件），它需要从大约1％的错误率至约0.1％的仿真数据; 参见compare-sim.txt。

将其应用于我们在半流论文中处理的约700万个mRNAseq读取子集（数据本身来自Trinity论文，Grabherr等人，2011年），我们得出的数据错误率约为1.59％至0.98 ％（参见目标 RSEQ-compare.txt在生成文件）。遗漏这么多错误的原因有很多-首先，错误校正取决于高覆盖率，而此RNAseq数据集大部分是低覆盖率；第二，这个数据集有很多错误；第三，RNAseq的k-mer丰度分布可能比基因组测序更广泛。

一个重要的旁注：我们使用与基因组数据完全相同的脚本来纠错RNAseq数据。

纠错有多好？

tl; 博士？这是相当不错的，但仍然比目前的方法差。当我们比较大肠杆菌数据集（Makefile 中的 target compare-ecoli.txt）上的Quake结果时，我们看到：

资料集	错误率
未校正	1.587％
雷神之锤	0.009％
高棉	0.013％

这还不错-错误率降低了两个数量级！-但我们希望至少能够击败Quake :)。

（请注意，这里我们仅通过查看Quake不会丢弃的序列中的错误来进行公平的比较；要使用khmer获得数据上的可比结果，您还必须修剪读取。我们还利用了在流式纸中开发的一种方法，其中我们预先对图形进行了数字标准化，以减少错误的数量和图形的大小。）

结论性思想

吸引我们使用此方法的是，它非常简单。基本的纠错是几行，尽管它被一堆机器围绕着，用于进行半流分析并保持配对完整。（用于纠错的两次通过/脱机脚本要干净得多，因为它忽略了所有这些机制。）

同样适用于所有shot弹枪测序，而不仅仅是基因组测序。这是我们的半流媒体论文的重要扩展。

我们还怀疑这种方法是否可调整，尽管我们才刚刚开始研究为HMM配对建立参数的正确方法，而且我们也没有确定正确的覆盖/截止参数以进行纠错。还有更多工作要做！

无论如何，使用graphalign方法还需要进行更多的错误校正-敬请期待！

参考资料和以前的工作

这绝不是新颖的-我们正在借鉴许多人的许多想法。我们的兴趣是从理论到实践的桥梁，并在开源软件包中提供体面的可调实施，以便我们可以更广泛地探索这些思想。

这是以前工作的简短摘要，肯定是不完整的-

这在很大程度上受到约旦在Xander上的工作的启发，Xander是一种从宏基因组学数据进行HMM指导的基因组装的软件。（随附的论文已被接受发表；点击时将在博客上发布。）
更一般而言，我的MSU同事Yanni Sun拥有数位博士生，他们从事过HMM和图对齐的工作，她和她的学生们是很好的想法来源！（她是乔丹的共同顾问。）
像Xander一样，BlastGraph建立在图对齐的概念上。这是我所知道的关于图形对齐的最早参考资料，但是我看起来并不难。
印第安纳州的Yuzhen Ye和 Tang Haixu Tang开发了非常类似的功能，在回顾他们关于元转录组学的图对齐研究的不错论文时我意识到了。
贾里德·辛普森（Jared Simpson）在将Nanopore读数与参考序列进行比对方面一直做得很好。我的猜测是，乔纳森·杜尔西（Jonathan Dursi）的博客文章中描述的多序列比对方法将证明与我们相关。
如果您将De Bruijn图视为一种多序列对齐方式，那么错误校正器Coral （Salmela and Schroder，2011）在图对齐方面具有很强的哲学相似性。