Comprehensive evaluation of non-hybrid genome assembly tools for third-generation PacBio long-read sequence data

Abstract

Long reads obtained from third-generation sequencing platforms can help overcome the long-standing challenge of de novo sequence assembly for the genomic analysis of non-model eukaryotic organisms. Numerous long-read-aided de novo assemblies have been published recently, and the assembled genomes are of superior quality compared with those achieved using earlier second-generation sequencing technologies. Evaluating assemblies is important for guiding the appropriate choice of assembler for specific research needs. In this study, we evaluated 10 long-read assemblers using a variety of metrics on Pacific Biosciences (PacBio) data sets from different taxonomic categories with considerable differences in genome size. The results allowed us to narrow down the list to a few assemblers that can be effectively applied to eukaryotic assembly projects. Moreover, we highlight how best to use limited genomic resources for effectively evaluating the genome assemblies of non-model organisms.

Keywords: de novo assembly, third-generation sequencing, single-molecule sequencing, PacBio SMRT, assembly evaluation

Introduction

Pacific Biosciences (PacBio) single-molecule real-time (SMRT) and Oxford Nanopore sequencing are the two most widely used third-generation, single-molecule sequencing (SMS) technologies, and both can generate average read lengths of several thousand base pairs. SMRT sequencing suffers from high error rates of up to 15% [1]; however, because these errors are random, high-quality error-corrected consensus sequences can be generated with sufficient coverage. Application of SMRT sequencing to eukaryotic genomes [2–18] has already demonstrated the advantages that long reads provide in de novo assembly, such as higher contiguity, fewer gaps and fewer errors. Contigs of recently assembled plant and animal genomes routinely achieve an N50 of 1 Mb with SMS data. Hence, a significant rise in the number of genomes sequenced using SMS technologies is imminent, raising the need to evaluate the available long-read assemblers. Large-scale evaluation studies such as GAGE [19], GAGE-B [20], Assemblathon1 [21] and Assemblathon2 [22] have been attempted with short-read assemblers, and their conclusions serve as a useful guide for the de novo assembly of a given target organism. Such evaluations have also been attempted for SMS data, but these studies either focused on bacterial and smaller eukaryotic genomes [23, 24] or were not sufficiently comprehensive to cover all of the available non-hybrid long-read assemblers [25–27], while others are already outdated because of continuous improvements in the technology [28, 29]. In addition, genome size was found to correlate with contiguity in long-read assemblies [17]; hence, evaluating diverse genome sizes can help differentiate the effect of the assemblers on each data set.
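
The claim that random errors can be averaged out with sufficient coverage can be illustrated with a small calculation. The sketch below is not the consensus method of any assembler in this study. It assumes a deliberately simplified model: errors are independent between reads, every erroneous read reports the same wrong base (a pessimistic assumption), ties count as errors, and insertions and deletions are ignored. The function name `majority_error` and the coverage values are hypothetical choices for illustration.

```python
from math import comb


def majority_error(per_read_error, coverage):
    """Probability that a simple majority vote is wrong at one position.

    Simplified model (an assumption, not the method of any real tool):
    each of `coverage` reads is wrong independently with probability
    `per_read_error`, and the vote fails when at least half are wrong.
    """
    first_failing = (coverage + 1) // 2  # smallest k with 2 * k >= coverage
    return sum(
        comb(coverage, k)
        * per_read_error ** k
        * (1 - per_read_error) ** (coverage - k)
        for k in range(first_failing, coverage + 1)
    )


if __name__ == "__main__":
    # 0.15 is the upper error rate quoted in the text; coverages are illustrative.
    for cov in (1, 5, 11, 21, 31):
        print(f"coverage {cov:>2}: consensus error ~ {majority_error(0.15, cov):.3e}")
```

Under these assumptions the per-position consensus error falls rapidly as coverage grows, which is the intuition behind the statement above. Real consensus tools work on alignments and must handle indels, so the numbers here are only indicative.
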
In this study, we attempted to comprehensively evaluate three important features of long-read assemblers (Table 1), namely contiguity, completeness and correctness [1], using SMRT data of a bacterium (Escherichia coli, ∼5 Mb), a protist (Plasmodium falciparum, ∼23 Mb), a nematode (Caenorhabditis elegans, ∼105 Mb) and a plant (Ipomoea nil, ∼750 Mb). We also designed a pipeline (Figure 1) for assembling the data and evaluating the results of different assemblers, which can be applied both to model organisms and to non-model organisms with limited genomic resources.
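
Contiguity, the first of the three features, is commonly summarized by the N50 statistic mentioned in the Introduction: the length of the contig at which the running total, taken over contigs sorted from longest to shortest, first reaches half of the total assembly length. The following is a minimal sketch of that definition. The function name `n50` and the contig lengths are hypothetical and are not data from the study.

```python
def n50(contig_lengths):
    """Return the N50 (in bp) of an iterable of contig lengths."""
    lengths = sorted(contig_lengths, reverse=True)
    if not lengths:
        raise ValueError("at least one contig length is required")
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        # The first contig that pushes the running sum to at least
        # half of the assembly size defines N50.
        if running >= half_total:
            return length


if __name__ == "__main__":
    # Hypothetical toy assembly (lengths in bp), for illustration only.
    toy_contigs = [2_000_000, 1_200_000, 750_000, 400_000, 150_000]
    print(n50(toy_contigs))
```

For the toy lengths above the total is 4.5 Mb, so the running sum passes 2.25 Mb at the second-longest contig and the N50 is 1.2 Mb. N50 reflects contiguity only, which is why it is evaluated here alongside completeness and correctness.
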
