Long Read Error Correction Algorithm Based on the de Bruijn Graph for the Third-generation Sequencin

Abstract:

The PacBio single-molecule real-time sequencing platform can generate a large number of long reads, which are important for de novo genome assembly. Although these long reads have a high error rate of 15%, discarding them for that reason would be unwise. The Illumina sequencing platform produces short reads about 100 bp long, with a low error rate and low cost. However, assemblies built from such short reads contain many branches, which hinders subsequent analysis of the genome. This paper proposes LecdB, a new hybrid error correction method that uses accurate short reads to correct long reads with higher error rates. First, two de Bruijn graphs, one fixed-length and one variable-length, are constructed from the short reads of the reference sequence. Then each long read to be corrected is traversed to find the solid k-mers that are consistent with the fixed-length de Bruijn graph. Long reads without any solid k-mer are aligned to the nodes of the variable-length de Bruijn graph using the maximum exact matching algorithm. Experiments show that LecdB obtains better results than other de Bruijn graph-based long read correction algorithms.
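The first two steps of the abstract (building a fixed-length de Bruijn graph from short reads, then scanning a long read for solid k-mers) can be illustrated with a minimal Python sketch. This is not LecdB's implementation: the function names `build_de_bruijn` and `solid_positions`, the count threshold `min_count`, and the toy reads are all assumptions made for illustration, and the variable-length graph and maximum exact matching step are left out.

```python
from collections import Counter, defaultdict


def build_de_bruijn(short_reads, k, min_count=2):
    """Build a fixed-length de Bruijn graph from short reads.

    Nodes are (k-1)-mers; every k-mer seen at least `min_count` times
    (a "solid" k-mer, threshold assumed here) adds an edge prefix -> suffix.
    Returns the adjacency dict and the set of solid k-mers.
    """
    counts = Counter()
    for read in short_reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1

    solid = {kmer for kmer, c in counts.items() if c >= min_count}
    graph = defaultdict(set)
    for kmer in solid:
        graph[kmer[:-1]].add(kmer[1:])
    return dict(graph), solid


def solid_positions(long_read, solid, k):
    """Return the start positions in the long read whose k-mer is solid."""
    return [
        i
        for i in range(len(long_read) - k + 1)
        if long_read[i:i + k] in solid
    ]


# Toy data (hypothetical): three overlapping, error-free short reads
# and one long read with a single substitution (C -> G at index 5).
short_reads = ["ACGTAC", "CGTACG", "GTACGT"]
graph, solid = build_de_bruijn(short_reads, k=4, min_count=2)
anchors = solid_positions("ACGTAGGTACG", solid, k=4)
print(anchors)  # [0, 1, 6, 7]
```

The gap between the anchor positions 1 and 6 brackets the erroneous base. A correction method of this family would typically search the graph for a path connecting the solid k-mers on either side of such a gap and use that path to replace the weak region.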

