Performance difference of graph-based and alignment-based hybrid error correction methods for error-

wangchuang2017

Article link: https://blog.csdn.net/u010608296/article/details/108295660


Jan 17th, 2020: New Publications.

Performance difference of graph-based and alignment-based hybrid error correction methods for error-prone long reads.
Wang, A., Au, K.F.
Genome Biology. 2020. [Manuscript]

Performance difference of graph-based and alignment-based hybrid error correction methods for error-prone long reads

Abstract

The error-prone third-generation sequencing (TGS) long reads can be corrected by the high-quality second-generation sequencing (SGS) short reads, which is referred to as hybrid error correction. We here investigate the influences of the principal algorithmic factors of two major types of hybrid error correction methods by mathematical modeling and analysis on both simulated and real data. Our study reveals the distribution of accuracy gain with respect to the original long read error rate. We also demonstrate that the original error rate of 19% is the limit for perfect correction, beyond which long reads are too error-prone to be corrected by these methods.

Background

Third-generation sequencing (TGS) technologies [1], including Pacific Biosciences (PacBio) and Oxford Nanopore Technologies (ONT), have proven useful in many areas of biomedical research because their unprecedented read lengths (averages for PacBio and ONT can exceed 10 kb and 20 kb, with maxima over 60 kb and 800 kb, respectively) are very informative for addressing complex problems such as genome assembly and haplotyping [1,2,3,4,5,6,7,8,9,10]. However, the high error rates of TGS data (average 10–15% for the raw data) [11,12,13,14] reduce the mappability and the resolution of downstream analysis. To address this limitation, high-quality short reads have been used to correct the long reads, which is termed hybrid error correction. Existing hybrid error correction methods can be classified into two categories: alignment-based methods [15,16,17,18,19,20,21] and de Bruijn graph (DBG)-based methods (referred to as “graph-based methods”) [22,23,24,25,26]. Regardless of the lower algorithmic complexity of the graph-based methods compared with the alignment-based ones [27] and the differences among software implementations, several principal factors have significant effects on the error correction performance of both methods: long read error rate, short read error rate, short read coverage, alignment criterion, and solid k-mer size. Although previous studies examined some of these factors separately during the development of the corresponding software [28,29,30], here we establish mathematical frameworks to perform a comprehensive investigation of all these factors in hybrid error correction. By studying their influences on short read alignment rate and solid k-mer detection in DBG, we finally examine how these factors determine the accuracy gain in hybrid error correction.
This research not only studies the algorithmic frameworks of the two major hybrid error correction methods; more importantly, it also offers informative guidance for method selection, parameter design, and future method development for long read error correction.
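
To give a feel for why long read error rate and solid k-mer size interact, here is a minimal sketch. It is not the paper's model: it assumes, purely for illustration, that errors hit each base independently with the same probability, so a k-mer taken from a long read is error-free with probability (1 - e)^k. The function names and the parameter values in the loop are hypothetical.

```python
def error_free_kmer_probability(error_rate: float, k: int) -> float:
    """Probability that k consecutive bases are all correct.

    Assumption (not from the paper): per-base errors are independent
    and occur with the same probability `error_rate`.
    """
    if not 0.0 <= error_rate <= 1.0:
        raise ValueError("error_rate must be between 0 and 1")
    if k < 0:
        raise ValueError("k must be non-negative")
    return (1.0 - error_rate) ** k


def expected_error_free_kmers(read_length: int, error_rate: float, k: int) -> float:
    """Expected number of error-free k-mers on a read of the given length."""
    # A read of length L contains L - k + 1 overlapping k-mers (none if L < k).
    n_kmers = max(read_length - k + 1, 0)
    return n_kmers * error_free_kmer_probability(error_rate, k)


if __name__ == "__main__":
    # Illustrative values only; they are not taken from the paper.
    for error_rate in (0.05, 0.10, 0.15, 0.20):
        for k in (13, 21, 31):
            p = error_free_kmer_probability(error_rate, k)
            print(f"error rate {error_rate:.2f}, k = {k}: P(error-free k-mer) = {p:.4f}")
```

Under this simplified assumption the probability falls geometrically with k, which is one intuition for why a larger k-mer size leaves fewer usable anchors on a highly error-prone read.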
