Ratatosk - Hybrid error correction of long reads enables accurate variant calling and assembly


Date: 15th July 2020 | Source: BioRxiv

Authors: Guillaume Holley, Doruk Beyter, Helga Ingimundardottir, Snædis Kristmundsdottir, Hannes Pétur Eggertsson, Bjarni Halldorsson.

Motivation
Long Read Sequencing (LRS) technologies are becoming essential to complement Short Read Sequencing (SRS) technologies for routine whole genome sequencing. LRS platforms produce DNA fragment reads thousands to millions of bases long, allowing the resolution of numerous uncertainties left by SRS reads in genome reconstruction and analysis. In particular, LRS characterizes long and complex structural variants that go undetected by SRS because of its short read length. Furthermore, assemblies produced with LRS reads are considerably more contiguous than those produced with SRS reads, and they span previously inaccessible telomeric and centromeric regions. However, a major obstacle to the adoption of LRS reads is their error rate of up to 15%, much higher than that of SRS reads, which complicates downstream analysis pipelines.

Results
We present Ratatosk, a new error correction method for erroneous long reads based on a compacted and colored de Bruijn graph built from accurate short reads. Short and long reads color paths in the graph, while vertices are annotated with candidate Single Nucleotide Polymorphisms (SNPs). Long reads are subsequently anchored to the graph using exact and inexact k-mer matches to find paths corresponding to corrected sequences.
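
To make the anchoring idea more concrete, here is a minimal toy sketch in Python. It is not Ratatosk's implementation: it assumes a plain (uncompacted, uncolored) de Bruijn graph stored as a set of k-mers, uses exact k-mer matches only, ignores SNP annotation and reverse complements, and picks the graph path with the smallest edit distance to the uncorrected segment. All function names, the `slack` parameter, and the example sequences are hypothetical and chosen for illustration.

```python
def build_graph(short_reads, k):
    """Collect every k-mer of the short reads (nodes of a toy de Bruijn graph)."""
    kmers = set()
    for read in short_reads:
        for i in range(len(read) - k + 1):
            kmers.add(read[i:i + k])
    return kmers


def successors(kmers, kmer):
    """K-mers of the graph that overlap `kmer` by k-1 bases."""
    return [kmer[1:] + b for b in "ACGT" if kmer[1:] + b in kmers]


def find_anchors(kmers, long_read, k):
    """Positions of the long read whose k-mer occurs exactly in the graph."""
    return [i for i in range(len(long_read) - k + 1)
            if long_read[i:i + k] in kmers]


def edit_distance(a, b):
    """Classic dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def paths_between(kmers, start, end, max_len):
    """Sequences spelled by graph paths from `start` to `end`, at most max_len bases."""
    results, stack = [], [(start, start)]
    while stack:
        node, seq = stack.pop()
        if node == end and len(seq) > len(start):
            results.append(seq)
            continue
        if len(seq) >= max_len:
            continue  # bound the search so it always terminates
        for nxt in successors(kmers, node):
            stack.append((nxt, seq + nxt[-1]))
    return results


def correct(long_read, short_reads, k=5, slack=3):
    """Replace unanchored stretches of the long read with the closest graph path."""
    kmers = build_graph(short_reads, k)
    anchors = find_anchors(kmers, long_read, k)
    if not anchors:
        return long_read  # nothing to anchor on, leave the read untouched
    out = long_read[:anchors[0] + k]
    for a, b in zip(anchors, anchors[1:]):
        if b == a + 1:
            out += long_read[b + k - 1]  # consecutive anchors: keep the base
            continue
        segment = long_read[a:b + k]
        candidates = paths_between(kmers, long_read[a:a + k],
                                   long_read[b:b + k], len(segment) + slack)
        best = min(candidates, key=lambda s: edit_distance(s, segment),
                   default=segment)  # no path found: keep the raw segment
        out += best[k:]
    return out + long_read[anchors[-1] + k:]


if __name__ == "__main__":
    shorts = ["ACGTTGCATGG", "GCATGGACCTAGT"]   # made-up accurate short reads
    noisy = "ACGTTGCAAGGACCTAGT"                # made-up long read, one substitution
    print(correct(noisy, shorts, k=5))
```

In this toy, the stretch of the long read between two anchored k-mers is the only part that gets rewritten, which mirrors the role anchors play in the description above. The exhaustive path enumeration is only workable on tiny graphs.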

We demonstrate that Ratatosk can reduce the raw error rate of Oxford Nanopore reads 6-fold on average, with a median error rate as low as 0.28%. Ratatosk-corrected data maintain nearly 99% accurate SNP calls and increase indel call accuracy by up to about 40% compared to the raw data. An assembly of the Ashkenazi individual HG002 created from Ratatosk-corrected Oxford Nanopore reads yields a contig N50 of 43.22 Mbp and outperforms high-quality LRS assemblies using PacBio HiFi reads.

In particular, the assembly of Ratatosk-corrected reads contains about 2.5 times fewer errors than an assembly created from PacBio HiFi reads.
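
For readers unfamiliar with the contig N50 metric quoted in the results, the following short Python sketch shows how it is usually computed: N50 is the length of the contig at which the cumulative length of contigs, sorted from longest to shortest, first reaches at least half of the total assembly length. The function name and the contig lengths are made up for illustration and are unrelated to the HG002 assembly.

```python
def n50(contig_lengths):
    """Length L such that contigs of length >= L cover at least half the assembly."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("contig_lengths must not be empty")


if __name__ == "__main__":
    # Hypothetical contig lengths (in bases), not taken from the paper.
    print(n50([2, 3, 4, 5, 6]))
```

A larger N50 indicates that more of the assembly sits in long contigs, which is why it is commonly used as a measure of contiguity.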

Availability: https://github.com/DecodeGenetics/Ratatosk.


