Ratatosk - Hybrid error correction of long reads enables accurate variant calling and assembly


Date: 15th July 2020 | Source: bioRxiv

Authors: Guillaume Holley, Doruk Beyter, Helga Ingimundardottir, Snædis Kristmundsdottir, Hannes Pétur Eggertsson, Bjarni Halldorsson.

Long Read Sequencing (LRS) technologies are becoming essential to complement Short Read Sequencing (SRS) technologies for routine whole genome sequencing. LRS platforms produce DNA fragment reads, thousands to millions of bases long, allowing the resolution of numerous uncertainties left by SRS reads for genome reconstruction and analysis. In particular, LRS characterizes long and complex structural variants that SRS cannot detect because of its short read length. Furthermore, assemblies produced with LRS reads are considerably more contiguous than those produced with SRS reads, while spanning previously inaccessible telomeric and centromeric regions. However, a major challenge to the adoption of LRS reads is their error rate of up to 15%, much higher than that of SRS reads, which introduces obstacles in downstream analysis pipelines.

We present Ratatosk, a new error correction method for erroneous long reads based on a compacted and colored de Bruijn graph built from accurate short reads. Short and long reads color paths in the graph while vertices are annotated with candidate Single Nucleotide Polymorphisms. Long reads are subsequently anchored to the graph using exact and inexact k-mer matches to find paths corresponding to corrected sequences.
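The general idea of anchoring a long read to a short-read graph can be illustrated with a toy sketch. This is not Ratatosk's implementation: it uses a plain (uncompacted, uncolored) de Bruijn graph, exact k-mer anchors only, no SNP annotation, and a brute-force path search. All function names (`build_graph`, `find_anchors`, `paths_between`, `correct`) and the `slack` parameter are hypothetical and chosen for illustration.

```python
from collections import defaultdict

def build_graph(short_reads, k):
    """Toy de Bruijn graph: map each (k-1)-mer to the bases that follow it."""
    graph = defaultdict(set)
    for read in short_reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[-1])
    return graph

def find_anchors(graph, long_read, k):
    """Start positions of long-read k-mers that occur exactly in the graph."""
    return [i for i in range(len(long_read) - k + 1)
            if long_read[i + k - 1] in graph.get(long_read[i:i + k - 1], ())]

def paths_between(graph, start_kmer, end_kmer, max_len):
    """Sequences spelled by graph walks from start_kmer to end_kmer (length <= max_len)."""
    k = len(start_kmer)
    results, stack = [], [start_kmer]
    while stack:
        seq = stack.pop()
        if len(seq) > k and seq.endswith(end_kmer):
            results.append(seq)
            continue
        if len(seq) >= max_len:
            continue
        for base in graph.get(seq[-(k - 1):], ()):
            stack.append(seq + base)
    return results

def edit_distance(a, b):
    """Levenshtein distance, used to pick the path closest to the raw read."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def correct(graph, long_read, k, slack=2):
    """Replace each unanchored stretch with the closest graph path between its anchors."""
    anchors = find_anchors(graph, long_read, k)
    if not anchors:
        return long_read  # nothing to anchor on, leave the read untouched
    corrected = long_read[:anchors[0] + k]
    for a, b in zip(anchors, anchors[1:]):
        region = long_read[a:b + k]
        best = region
        if b - a > 1:  # at least one k-mer between the anchors is missing from the graph
            candidates = paths_between(graph, region[:k], region[-k:], len(region) + slack)
            if candidates:
                best = min(candidates, key=lambda s: edit_distance(s, region))
        corrected += best[k:]
    return corrected + long_read[anchors[-1] + k:]

short_reads = ["ACGTTGCATG", "TGCATGCCAG", "ATGCCAGTA"]
graph = build_graph(short_reads, k=5)
noisy = "ACGTTGCAAGCCAGTA"  # one substitution relative to the short-read consensus
print(correct(graph, noisy, k=5))
```

In this sketch the exhaustive path enumeration is bounded only by `max_len`, so it is suitable for tiny examples; a real corrector needs a far more careful search over a much larger graph.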

We demonstrate that Ratatosk can reduce the raw error rate of Oxford Nanopore reads 6-fold on average, with a median error rate as low as 0.28%. Ratatosk-corrected data maintain nearly 99% accurate SNP calls and increase indel call accuracy by up to about 40% compared to the raw data. An assembly of the Ashkenazi individual HG002 created from Ratatosk-corrected Oxford Nanopore reads yields a contig N50 of 43.22 Mbp and outperforms high-quality LRS assemblies using PacBio HiFi reads.

In particular, the assembly of Ratatosk-corrected reads contains about 2.5 times fewer errors than an assembly created from PacBio HiFi reads.
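For readers unfamiliar with the contig N50 statistic quoted above, here is a minimal sketch of how it is commonly computed: the length of the contig at which the running total of contig lengths, taken from longest to shortest, first reaches half of the total assembly length. The function name `n50` and the example lengths are illustrative assumptions, not data from the paper.

```python
def n50(lengths):
    """Return the contig length at which the cumulative sum reaches half the total."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("n50 requires at least one contig with positive length")

# Hypothetical contig lengths (total 150): 50 + 40 = 90 is the first sum >= 75.
print(n50([10, 20, 30, 40, 50]))  # prints 40
```

A higher N50 means that half of the assembled bases sit in longer contigs, which is why it is used as a measure of contiguity.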

Availability: https://github.com/DecodeGenetics/Ratatosk.



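We present Ratatosk, a new error correction method for erroneous long reads, based on a compacted and colored de Bruijn graph built from accurate short reads.

An assembly of the Ashkenazi individual HG002 created from Ratatosk-corrected Oxford Nanopore reads yields a contig N50 of 43.22 Mbp and outperforms high-quality LRS assemblies that use PacBio HiFi reads.

In particular, the assembly of Ratatosk-corrected reads contains about 2.5 times fewer errors than an assembly created from PacBio HiFi reads.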
