Long Read Error Correction Algorithm Based on the de Bruijn Graph for the Third-generation Sequencin

Article link: https://blog.csdn.net/u010608296/article/details/123301052

Abstract:

The PacBio single-molecule real-time sequencing platform generates large numbers of long reads, which are important for de novo genome assembly. These long reads have a high error rate of about 15%, but discarding them for that reason alone would be unwise. The Illumina sequencing platform produces short reads about 100 bp long, with a low error rate and low cost. However, assemblies built from these short reads contain many branches, which hinders subsequent analysis of the genome. This paper proposes LecdB, a new hybrid error correction method that uses accurate short reads to correct long reads with higher error rates. First, two de Bruijn graphs, one of fixed length and one of variable length, are constructed from the short reads of the reference sequence. Then, each long read to be corrected is traversed to find the solid k-mers that are consistent with the fixed-length de Bruijn graph. Long reads without solid k-mers are aligned to the nodes of the variable-length de Bruijn graph using a maximal exact matching algorithm. Experiments show that LecdB obtains better results than other de Bruijn graph-based long read correction algorithms.
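To make the first two steps concrete, here is a minimal Python sketch of building a fixed-length de Bruijn graph from short reads and locating solid k-mers in a long read. It is an illustration under stated assumptions, not the LecdB implementation. The function names, the toy reads, and the `min_count` threshold (a k-mer is treated as solid when it occurs at least that many times in the short reads) are all hypothetical, and the abstract does not specify how solidity is defined. The variable-length graph and the maximal exact matching step are not shown.

```python
from collections import Counter, defaultdict


def count_kmers(reads, k):
    """Count every k-mer that occurs in the short reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def build_de_bruijn(counts, min_count=2):
    """Fixed-k de Bruijn graph: nodes are (k-1)-mers, each solid k-mer is an edge.

    min_count is an assumed solidity threshold, not a value from the paper.
    """
    graph = defaultdict(set)
    for kmer, n in counts.items():
        if n >= min_count:
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def solid_positions(long_read, counts, k, min_count=2):
    """Start positions in the long read whose k-mer is solid in the short reads."""
    return [
        i
        for i in range(len(long_read) - k + 1)
        if counts[long_read[i:i + k]] >= min_count
    ]


# Toy data (hypothetical): three overlapping short reads and one noisy long read.
short_reads = ["ACGTAC", "CGTACG", "GTACGT"]
long_read = "ACGTTCGTACG"
k = 4

counts = count_kmers(short_reads, k)
graph = build_de_bruijn(counts)
anchors = solid_positions(long_read, counts, k)
print(anchors)
```

The solid positions act as anchors. In a graph-based corrector, the stretches of the long read between consecutive anchors are the weak regions, which would be replaced by a path through the graph connecting the two anchoring k-mers.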
