ParLECH: Parallel Long-Read Error Correction with Hadoop

ParLECH: Parallel Long-Read Error Correction with Hadoop  使用Hadoop并行的长读错误更正

Abstract:

Long-read sequencing is emerging as a promising sequencing technology because it can tackle the short length limitation of second-generation sequencing, which has dominated the sequencing market in past years. However, it has substantially higher error rates compared to short-read sequencing (e.g., 13% vs. 0.1%), and its sequencing cost per base is typically more expensive than that of short-read sequencing. To address these limitations, we present a distributed hybrid error correction framework, called ParLECH, that is scalable and cost-efficient for PacBio long reads. For correcting the errors in the long reads, ParLECH utilizes the Illumina short reads that have the low error rate with high coverage at low cost. To efficiently analyze the high-throughput Illumina short reads, ParLECH is equipped with Hadoop and a distributed NoSQL system. To further improve the accuracy, ParLECH utilizes the k-mer coverage information of the Illumina short reads. Specifically, we develop a distributed version of the widest path algorithm, which maximizes the minimum k-mer coverage in a path of the de Bruijn graph constructed from the Illumina short reads. We replace an error region in a long read with its corresponding widest path. Our experimental results show that ParLECH can handle large-scale real-world datasets in a scalable and accurate manner. Using ParLECH, we can process a 312 GB human genome PacBio dataset, with a 452 GB Illumina dataset, on 128 nodes in less than 29 hours.

摘要:

长序列测序是一种很有前途的测序技术,因为它可以解决第二代测序的短长度限制,这在过去几年一直主导着测序市场。
然而,与短读测序相比,它有更高的错误率(例如,13% vs. 0.1%),它的每个碱基的测序成本通常比短读测序更昂贵。
为了解决这些限制,我们提出了一个分布式的混合错误纠正框架,称为腔技术,它是可扩展的,并且对于PacBio长读来说是经济有效的。
为纠正长读时的错误,使用Illumina短读,错误率低,覆盖率高,成本低。
为了有效地分析高吞吐量的Illumina short reads,使用了Hadoop和分布式NoSQL系统。
为进一步提高准确性,网站利用了Illumina短篇阅读的k-mer覆盖信息。
具体地说,我们开发了一个最宽路径算法的分布式版本,该算法最大化了从Illumina短读构造的de Bruijn图路径的最小k-mer覆盖率。
我们将长读中的错误区域替换为相应的最宽路径。
我们的实验结果表明,麻将馆能够以一种可扩展和精确的方式处理大规模的真实世界数据集。
使用腔室技术,我们可以在不到29小时的时间内,在128个节点上处理312 GB的人类基因组PacBio数据集和452 GB的Illumina数据集。

  • 0
    点赞
  • 0
    收藏
    觉得还不错? 一键收藏
  • 打赏
    打赏
  • 0
    评论
检查错误原因 creating directory /data/primary/gpseg0 ... ok creating subdirectories ... ok selecting default max_connections ... 750 selecting default shared_buffers ... 125MB selecting default timezone ... Asia/Shanghai selecting dynamic shared memory implementation ... posix creating configuration files ... ok creating template1 database in /data/primary/gpseg0/base/1 ... child process was terminated by signal 9: Killed initdb: removing data directory "/data/primary/gpseg0" 2023-06-08 08:53:53.568563 GMT,,,p22007,th-604637056,,,,0,,,seg-10000,,,,,"LOG","00000","skipping missing configuration file ""/data/primary/gpseg0/postgresql.auto.conf""",,,,,,,,"ParseConfigFile","guc-file.l",563, 20230608:16:54:12:021728 gpcreateseg.sh:VM-0-5-centos:gpadmin-[INFO]:-Start Function BACKOUT_COMMAND 20230608:16:54:12:021728 gpcreateseg.sh:VM-0-5-centos:gpadmin-[INFO]:-End Function BACKOUT_COMMAND 20230608:16:54:12:021728 gpcreateseg.sh:VM-0-5-centos:gpadmin-[INFO]:-Start Function BACKOUT_COMMAND 20230608:16:54:12:021728 gpcreateseg.sh:VM-0-5-centos:gpadmin-[INFO]:-End Function BACKOUT_COMMAND 20230608:16:54:12:021728 gpcreateseg.sh:VM-0-5-centos:gpadmin-[FATAL][0]:-Failed to start segment instance database VM-0-5-centos /data/primary/gpseg0 20230608:16:54:12:019435 gpinitsystem:VM-0-5-centos:gpadmin-[INFO]:-End Function PARALLEL_WAIT 20230608:16:54:12:019435 gpinitsystem:VM-0-5-centos:gpadmin-[INFO]:-End Function PARALLEL_COUNT 20230608:16:54:12:019435 gpinitsystem:VM-0-5-centos:gpadmin-[INFO]:-Start Function PARALLEL_SUMMARY_STATUS_REPORT 20230608:16:54:12:019435 gpinitsystem:VM-0-5-centos:gpadmin-[INFO]:------------------------------------------------ 20230608:16:54:12:019435 gpinitsystem:VM-0-5-centos:gpadmin-[INFO]:-Parallel process exit status 20230608:16:54:12:019435 gpinitsystem:VM-0-5-centos:gpadmin-[INFO]:------------------------------------------------ 20230608:16:54:12:019435 gpinitsystem:VM-0-5-centos:gpadmin-[INFO]:-Total processes marked as completed = 0 20230608:16:54:12:019435 gpinitsystem:VM-0-5-centos:gpadmin-[INFO]:-Total processes marked as killed = 0 20230608:16:54:12:019435 gpinitsystem:VM-0-5-centos:gpadmin-[WARN]:-Total processes marked as failed = 1 <<<<< 20230608:16:54:12:019435 gpinitsystem:VM-0-5-centos:gpadmin-[INFO]:------------------------------------------------ 20230608:16:54:12:019435 gpinitsystem:VM-0-5-centos:gpadmin-[INFO]:-End Function PARALLEL_SUMMARY_STATUS_REPORT FAILED:VM-0-5-centos~6000~/data/primary/gpseg0~2~0
最新发布
06-09

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包

打赏作者

wangchuang2017

你的鼓励将是我创作的最大动力

¥1 ¥2 ¥4 ¥6 ¥10 ¥20
扫码支付:¥1
获取中
扫码支付

您的余额不足,请更换扫码支付或充值

打赏作者

实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值