ParLECH: Parallel Long-Read Error Correction with Hadoop
Abstract:
Long-read sequencing is emerging as a promising sequencing technology because it overcomes the read-length limitation of second-generation (short-read) sequencing, which has dominated the sequencing market in recent years. However, long reads have substantially higher error rates than short reads (e.g., 13% vs. 0.1%), and their sequencing cost per base is typically higher. To address these limitations, we present ParLECH, a distributed hybrid error correction framework that is scalable and cost-efficient for PacBio long reads. To correct errors in the long reads, ParLECH uses Illumina short reads, which offer a low error rate and high coverage at low cost. To analyze the high-throughput Illumina short reads efficiently, ParLECH is built on Hadoop and a distributed NoSQL system. To further improve accuracy, ParLECH uses the k-mer coverage information of the Illumina short reads. Specifically, we develop a distributed version of the widest path algorithm, which maximizes the minimum k-mer coverage along a path in the de Bruijn graph constructed from the Illumina short reads. We then replace each error region in a long read with its corresponding widest path. Our experimental results show that ParLECH can handle large-scale real-world datasets in a scalable and accurate manner. Using ParLECH, we can process a 312 GB human genome PacBio dataset, together with a 452 GB Illumina dataset, on 128 nodes in less than 29 hours.
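
The widest path idea described in the abstract can be illustrated with a small single-machine sketch. This is not ParLECH's distributed implementation (which runs on Hadoop with a NoSQL store); it is a minimal illustration under several assumptions: k-mers are treated as graph nodes, an edge joins two k-mers that overlap by k-1 bases, and the function names (`count_kmers`, `build_de_bruijn`, `widest_path`, `spell`) and the toy reads are hypothetical. The search is a Dijkstra-style variant that maximizes the minimum k-mer coverage along the path between two anchor k-mers.

```python
import heapq
from collections import Counter, defaultdict


def count_kmers(reads, k):
    """Count how often each k-mer occurs in the short reads (its coverage)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def build_de_bruijn(counts):
    """Assumed node-centric graph: edge u -> v when u[1:] == v[:-1]."""
    by_prefix = defaultdict(list)
    for kmer in counts:
        by_prefix[kmer[:-1]].append(kmer)
    return {kmer: by_prefix.get(kmer[1:], []) for kmer in counts}


def widest_path(graph, counts, source, target):
    """Return (path, width), where width is the largest achievable
    minimum k-mer coverage over all source -> target paths."""
    best = {source: counts[source]}
    parent = {source: None}
    heap = [(-counts[source], source)]  # max-heap via negated widths
    while heap:
        neg_width, node = heapq.heappop(heap)
        width = -neg_width
        if width < best.get(node, 0):
            continue  # stale heap entry
        if node == target:
            break
        for nxt in graph.get(node, []):
            new_width = min(width, counts[nxt])
            if new_width > best.get(nxt, 0):
                best[nxt] = new_width
                parent[nxt] = node
                heapq.heappush(heap, (-new_width, nxt))
    if target not in best:
        return None, 0
    path, node = [], target
    while node is not None:
        path.append(node)
        node = parent[node]
    path.reverse()
    return path, best[target]


def spell(path):
    """Turn a path of overlapping k-mers back into a sequence."""
    return path[0] + "".join(kmer[-1] for kmer in path[1:])


# Toy data: five copies of the true sequence and one erroneous read.
reads = ["ATGCGTA"] * 5 + ["ATGAGTA"]
counts = count_kmers(reads, k=3)
graph = build_de_bruijn(counts)
path, width = widest_path(graph, counts, "ATG", "GTA")
corrected = spell(path)
```

In this toy case the branch through the erroneous k-mers has a minimum coverage of 1, while the branch through the true k-mers has a minimum coverage of 5, so the widest path spells the well-supported sequence. In a hybrid corrector, the two anchor k-mers would come from the reliable regions flanking an error region in the long read.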