Performance difference of graph-based and alignment-based hybrid error correction methods for error-prone long reads

Jan 17th, 2020: New Publications.

Performance difference of graph-based and alignment-based hybrid error correction methods for error-prone long reads.
Wang, A., Au, K.F.
Genome Biology. 2020. [Manuscript]


Abstract

Error-prone third-generation sequencing (TGS) long reads can be corrected with high-quality second-generation sequencing (SGS) short reads, an approach referred to as hybrid error correction. Here we investigate the influence of the principal algorithmic factors of the two major types of hybrid error correction methods through mathematical modeling and analysis of both simulated and real data. Our study reveals the distribution of accuracy gain with respect to the original long read error rate. We also demonstrate that an original error rate of 19% is the limit for perfect correction, beyond which long reads are too error-prone to be corrected by these methods.

Background

Third-generation sequencing (TGS) technologies [1], including Pacific Biosciences (PacBio) and Oxford Nanopore Technologies (ONT), have proved useful in many areas of biomedical research because their unprecedented read lengths (averages for PacBio and ONT can exceed 10 kb and 20 kb, and maxima can exceed 60 kb and 800 kb, respectively) are very informative for addressing complex problems such as genome assembly and haplotyping [1,2,3,4,5,6,7,8,9,10]. However, the high error rates of TGS data (10–15% on average for the raw data) [11,12,13,14] reduce the mappability and the resolution of downstream analysis. To address this limitation, high-quality short reads have been used to correct the long reads, an approach termed hybrid error correction. The existing hybrid error correction methods fall into two categories: alignment-based methods [15,16,17,18,19,20,21] and de Bruijn graph (DBG)-based methods (referred to as “graph-based methods”) [22,23,24,25,26]. Regardless of the lower algorithmic complexity of the graph-based method compared with the alignment-based one [27] and the differences between software implementations, several principal factors have significant effects on the error correction performance of both methods: long read error rate, short read error rate, short read coverage, alignment criterion, and solid k-mer size. Although previous studies examined some of these factors separately during the corresponding software development [28,29,30], here we establish mathematical frameworks to perform a comprehensive investigation of all these factors in hybrid error correction. By studying their influence on the short read alignment rate and on solid k-mer detection in the DBG, we finally examine how these factors determine the accuracy gain in hybrid error correction.
This research not only studies the algorithmic frameworks of the two major hybrid error correction methods but, more importantly, also offers informative guidance for method selection, parameter design, and future method development for long read error correction.
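To make two of the factors above concrete, here is a minimal Python sketch. It is not the paper's mathematical framework. It only illustrates, under simplifying assumptions, what "solid k-mer detection" can look like and why the long read error rate interacts with the k-mer size. The function names, the `min_count` threshold, and the toy reads are all hypothetical, and the probability formula assumes that errors occur independently at each base.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every length-k substring across a collection of reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def solid_kmers(reads, k, min_count):
    """Return k-mers seen at least min_count times.

    Treating frequent k-mers as 'solid' (likely error-free) is an
    illustrative assumption; real tools choose the threshold differently.
    """
    return {kmer for kmer, n in count_kmers(reads, k).items() if n >= min_count}


def error_free_kmer_probability(error_rate, k):
    """Probability that k consecutive bases are all correct,
    assuming independent per-base errors (a simplification)."""
    return (1.0 - error_rate) ** k


if __name__ == "__main__":
    # Toy short reads, invented for illustration only.
    short_reads = ["ACGTAC", "CGTACG", "ACGTTC"]
    print(sorted(solid_kmers(short_reads, k=4, min_count=2)))

    # How quickly error-free k-mers become rare as the error rate grows.
    for rate in (0.05, 0.15, 0.19):
        print(rate, round(error_free_kmer_probability(rate, k=21), 4))
```

Under the independence assumption, the chance that a k-mer taken from a long read is error-free falls geometrically with k, which gives an intuition for why a larger solid k-mer size is harder to match against noisier long reads.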

