Article link: https://blog.csdn.net/u010608296/article/details/102873439

Efficient Hybrid De Novo Error Correction and Assembly for Long Reads

Abstract—The new generation of long reads produced by Oxford Nanopore sequencing technology, with its new
sequencer MinIon, has transformed the next generation sequencing landscape. This sequencer produces long
reads at low production cost and high throughput. However, long reads generated by the MinIon sequencer have
a high error rate, which degrades the quality of the results obtained by analyzing them. One way to correct
long reads is to exploit the high coverage and high quality of short reads generated by second generation sequencing
technology. Here, we present MiRCA (MinIon Reads Correction Algorithm), a hybrid approach based on
sequence alignment that detects and corrects errors in MinIon long reads using preassembled Illumina short reads.
With this new error correction approach, we were able to perform an effective and quick de novo assembly. Experiments on
the Saccharomyces cerevisiae and Escherichia coli genomes show that MiRCA performs much better than the available tools.
MiRCA has been tested on Linux platforms and is freely available at https://github.com/Mkchouk/MiRCA.

I. INTRODUCTION
Next Generation Sequencing (NGS) technologies produce data at a higher throughput and a lower cost than traditional sequencing technologies such as Sanger sequencing [1]. They have contributed to the success of many genome sequencing projects, including resequencing, de novo assembly, metagenomics and gene expression analysis, leading to a better understanding of species evolution and of living organisms.

Since the first-generation sequencing technology, i.e. Sanger, the number of sequencing technologies has grown. Those collectively called NGS include Roche's 454 Life Sciences technology in 2005 [2], Illumina/Solexa sequencing technology [3][4] in 2006, which produces good-quality short reads, Applied Biosystems SOLiD [7] technology in 2007, Helicos BioSciences [5][6], Ion PGM from Ion Torrent [8] in 2010, Pacific Biosciences [9], and Oxford Nanopore sequencing technology in 2014, known for its new sequencing platform MinIon [10][11].

Despite their low cost and high throughput, NGS technologies have their own drawbacks, mainly sequencing errors.

Sequencing errors are mistakes in the reads generated by NGS platforms; they make the processing and analysis of NGS data difficult. Understanding the properties of NGS data is important for improving sequence accuracy and obtaining high-quality reads.
The purpose of error correction is to facilitate data analysis for large projects such as de novo assembly.

There are three different types of errors: substitutions, deletions and insertions.

These are the error types most commonly generated by NGS platforms, and their distribution varies from one sequencing platform to another. In this paper we propose a new hybrid algorithm for error correction of long reads that uses short reads as a reference. Our algorithm can be flexibly adapted to different types of errors.
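
As a small illustration of the three error types, the sketch below counts substitutions, insertions and deletions between a reference and a read that have already been aligned. The function name `count_error_types`, the gapped-row representation ('-' for a gap) and the toy sequences are assumptions made for this example; they are not taken from MiRCA.

```python
from collections import Counter


def count_error_types(ref_row: str, read_row: str) -> Counter:
    """Count error types between two equal-length gapped alignment rows.

    A '-' in the reference row means the read carries an extra base
    (insertion); a '-' in the read row means the read lost a base
    (deletion); two different bases form a substitution.
    """
    if len(ref_row) != len(read_row):
        raise ValueError("alignment rows must have the same length")
    counts = Counter(substitution=0, insertion=0, deletion=0)
    for ref_base, read_base in zip(ref_row, read_row):
        if ref_base == "-" and read_base == "-":
            continue  # empty column, nothing to compare
        if ref_base == "-":
            counts["insertion"] += 1
        elif read_base == "-":
            counts["deletion"] += 1
        elif ref_base != read_base:
            counts["substitution"] += 1
    return counts


# Toy alignment: one substitution (G->C), one inserted T, one deleted G.
ref_row = "ACGT-ACGTAC"
read_row = "ACCTTAC-TAC"
print(dict(count_error_types(ref_row, read_row)))
# {'substitution': 1, 'insertion': 1, 'deletion': 1}
```

In practice the gapped rows would come from an aligner; the relative frequency of the three counters is what differs between sequencing platforms.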

In the second section, we introduce the new sequencing technology MinIon. In the third, we present the state of the art in error correction approaches. In the fourth, we detail our hybrid approach to error correction. In the fifth, we present the experiments we set up to evaluate our approach. Finally, we conclude the paper.

III. RELATED WORK ON NGS ERROR CORRECTION
A. Error Correction Approaches
De novo error correction uses the coverage of reads at an erroneous position (the erroneous bases) to correct the errors at those bases.
According to [17], there are three main approaches in the literature.
a) K-spectrum based approach:
This approach decomposes the reads into a set of k-mers.
A k-mer is a sequence of characters of length k.
The idea was first used in the EULER assembler in 2001 [18].
The k-mer spectrum approach analyzes the distribution of the set of k-mers extracted from the reads.
In a dataset of k-mers, a k-mer that appears at least M times is called "solid" or "trusted"; otherwise it is called "insolid" or "untrusted".
The advantage of this approach is that it is effective for large data such as large genomes.
The drawback is that choosing the value of k is very difficult.
Popular algorithms that use this approach include Quake [19], Reptile [20] and Racer [21].
All these programs correct only substitution errors in short reads [22].
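
The solid/untrusted split described above can be sketched in a few lines. This is only an illustration: the function names `kmer_spectrum` and `split_solid`, the parameter `min_count` (playing the role of M) and the toy reads are assumptions, and tools such as Quake or Reptile use far more elaborate models.

```python
from collections import Counter


def kmer_spectrum(reads, k):
    """Count every k-mer (substring of length k) across all reads."""
    counts = Counter()
    for read in reads:
        for start in range(len(read) - k + 1):
            counts[read[start:start + k]] += 1
    return counts


def split_solid(counts, min_count):
    """Separate k-mers seen at least min_count times (solid) from the rest."""
    solid = {kmer for kmer, n in counts.items() if n >= min_count}
    untrusted = set(counts) - solid
    return solid, untrusted


# Three identical reads and one read with a substitution (T->A at position 3).
reads = ["ACGTAC", "ACGTAC", "ACGTAC", "ACGAAC"]
spectrum = kmer_spectrum(reads, k=4)
solid, untrusted = split_solid(spectrum, min_count=2)
print(sorted(solid))      # ['ACGT', 'CGTA', 'GTAC']
print(sorted(untrusted))  # ['ACGA', 'CGAA', 'GAAC']
```

The untrusted k-mers all overlap the erroneous base, which is how this family of methods localizes errors. The example also hints at why k is hard to choose: a small k makes erroneous k-mers look frequent by chance, while a large k leaves too few occurrences of correct ones.
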
b) Suffix tree/array based approach:
This approach represents the reads in a data structure designed for string matching.
Suffix trees and suffix arrays generalize the k-spectrum approach: they cover several values of k at once and associate a frequency with each k-mer.
These methods build a suffix tree or array of all suffixes of the reads and use the frequency of each k-mer in the tree/array to correct erroneous regions in the reads [23].
This approach can handle multiple values of k, but it is not memory-efficient for NGS data [24].
Popular algorithms that use this approach include HITEC [24] and HSHREC [25].
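
To show how one suffix array can answer frequency queries for several values of k, here is a minimal sketch. It builds the array naively by sorting suffixes, which is fine for a toy input but not how production tools work; the names `build_suffix_array` and `count_occurrences`, the '$' read separator and the toy reads are all assumptions for this example.

```python
from bisect import bisect_left, bisect_right


def build_suffix_array(text):
    """Return the start positions of all suffixes of text in sorted order."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def count_occurrences(text, suffix_array, pattern):
    """Count occurrences of pattern, for a pattern of any length.

    Truncating sorted suffixes to len(pattern) keeps them in
    non-decreasing order, so the matches form one contiguous block
    that two binary searches can delimit. The list is rebuilt per
    query here for clarity only.
    """
    m = len(pattern)
    prefixes = [text[i:i + m] for i in suffix_array]
    return bisect_right(prefixes, pattern) - bisect_left(prefixes, pattern)


# Reads are joined with '$' so that no k-mer spans two reads.
reads = ["ACGTAC", "ACGTAC", "ACGAAC"]
text = "$".join(reads) + "$"
suffix_array = build_suffix_array(text)

for kmer in ("ACG", "ACGT", "ACGA"):
    print(kmer, count_occurrences(text, suffix_array, kmer))
# ACG 3
# ACGT 2
# ACGA 1
```

The same index serves k = 3 and k = 4 without being rebuilt, which is the flexibility the text refers to. The price is the memory needed to hold one array entry per base of input.
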
c) Alignment based approach:
This approach uses alignment algorithms, either multiple sequence alignment (MSA) or pairwise alignment, to detect erroneous bases in noisy sequences using good-quality reads.
Its advantage is the ability to correct all three types of errors; its drawback is that, on larger data sets, some alignment tools are less accurate across the different error types and the alignment task becomes difficult.
Popular algorithms that use this approach include Coral [28] and ECHO [29] for short-read correction, and Nanocorr [27] and Proovread [26] for long-read correction.
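
The idea of correcting a noisy read from aligned high-quality reads can be sketched as a column-wise majority vote. This is a hypothetical illustration, not the procedure used by MiRCA, Nanocorr or Proovread: it assumes the alignment has already been computed, that every row has the same length, that '-' marks a gap and '.' marks a position a short read does not cover. The function name `correct_with_short_reads` and the parameter `min_cov` are invented for the example.

```python
from collections import Counter


def correct_with_short_reads(long_row, short_rows, min_cov=2):
    """Correct one aligned long read by majority vote over short reads.

    Columns covered by fewer than min_cov short reads keep the long-read
    symbol. Gap symbols are dropped from the output, so a winning '-'
    removes an inserted base and a winning base fills a deleted one.
    Ties are resolved in favour of the symbol seen first.
    """
    if any(len(row) != len(long_row) for row in short_rows):
        raise ValueError("all alignment rows must have the same length")
    corrected = []
    for col, long_base in enumerate(long_row):
        votes = Counter(row[col] for row in short_rows if row[col] != ".")
        if sum(votes.values()) >= min_cov:
            base = votes.most_common(1)[0][0]
        else:
            base = long_base  # not enough evidence, keep the long read
        if base != "-":
            corrected.append(base)
    return "".join(corrected)


# Long read with a substitution (col 2), an insertion (col 4) and a deletion (col 7).
long_row = "ACCTTAC-TAC"
short_rows = [
    "ACGT-ACG...",
    "..GT-ACGTA.",
    "....-ACGTAC",
]
print(correct_with_short_reads(long_row, short_rows))  # ACGTACGTAC
```

All three error types are repaired in one pass, which is the advantage named above. The hard part, computing the alignment rows in the first place, is exactly what becomes costly on larger data sets.
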
B. Hybrid Error Correction Algorithms
Long reads generated by sequencers such as Sanger, PacBio and Nanopore are beneficial for several kinds of projects, such as assembly; however, the high error rate of these sequencers poses a challenge that has required the development of error correction tools.
The literature contains several error correction algorithms for short reads and for long reads.
We focus our study on hybrid correction tools, which use short reads to correct long reads. Table I highlights different hybrid error correction tools.
