Comprehensive comparison of Pacific Biosciences and Oxford Nanopore Technologies and their applications to transcriptome analysis

Background: 

Given the demonstrated utility of Third Generation Sequencing [Pacific Biosciences (PacBio) and Oxford Nanopore Technologies (ONT)] long reads in many studies, a comprehensive analysis and comparison of their data quality and applications is in high demand.

Methods:

Based on the transcriptome sequencing data from human embryonic stem cells, we analyzed multiple data features of PacBio and ONT, including error pattern, length, mappability and technical improvements over previous platforms. We also evaluated their application to transcriptome analyses, such as isoform identification and quantification and characterization of transcriptome complexity, by comparing the performance of size-selected PacBio, non-size-selected ONT and their corresponding Hybrid-Seq strategies (PacBio+Illumina and ONT+Illumina).

Results:

PacBio shows overall better data quality, while ONT provides a higher yield. As with data quality, PacBio performs marginally better than ONT in most aspects for both long reads only and Hybrid-Seq strategies in transcriptome analysis. In addition, Hybrid-Seq shows superior performance over long reads only in most transcriptome analyses.

Conclusions: 

Both PacBio and ONT sequencing are suitable for full-length single-molecule transcriptome analysis. As this first use of ONT reads in a Hybrid-Seq analysis has shown, both PacBio and ONT can benefit from a combined Illumina strategy. The tools and analytical methods developed here provide a resource for future applications and evaluations of these rapidly changing technologies.

Keywords

Third Generation Sequencing, PacBio, Oxford Nanopore Technologies, Transcriptome

Introduction

Third Generation Sequencing (TGS) emerged more than 5 years ago when Pacific Biosciences (PacBio) commercialized Single Molecule Real Time (SMRT) sequencing technologies in 2011 [1]. Although TGS platforms have significant technical differences, they all generate very long reads (1–100kb) [2–5], which distinguishes them from Second Generation Sequencing (SGS). Even when paired-end information is counted, the main SGS platform, Illumina, provides 50–600bp of sequence from each DNA fragment; no SGS platform provides >1000bp, including 454 sequencing, which generates the longest SGS reads (~700bp) [6,7]. The short read length therefore limits the application of SGS to large or complex genomic events, such as gene isoform reconstruction. TGS overcomes these problems with long read lengths.

The most widely used TGS platforms [PacBio and Oxford Nanopore Technologies (ONT)] developed new biochemistry/biophysics methods to directly capture the very long nucleotide sequences from single DNA molecules. Other emerging TGS platforms (Moleculo [8] and 10X Genomics [9]) are based on the assembly of short reads from the same DNA molecules to generate synthetic long reads (SLR). Herein, we focus on data features of PacBio and ONT and their applications to transcriptome analysis.

PacBio adopts a sequencing-by-synthesis strategy similar to that of Illumina sequencing, except that PacBio captures signals from a single DNA molecule, whereas Illumina detects amplified signals from a clonal population of amplified DNA fragments. The error rate of raw PacBio data is 13–15%, as the signal-to-noise ratio from single DNA molecules is not high [3]. To increase accuracy, the PacBio platform uses a circular DNA template, made by ligating hairpin adaptors to both ends of the target double-stranded DNA. As the polymerase repeatedly traverses and replicates the circular molecule, the DNA template is sequenced multiple times to generate a continuous long read (CLR). The CLR can be split into multiple reads ("subreads") by removing adaptor sequences, and multiple subreads generate circular consensus sequence ("CCS") reads with higher accuracy. The average length of a CLR is >10kb and can reach 60kb, depending on the polymerase lifetime [3]. Because the CLR length is bounded in this way, a shorter fragment is traversed more times than a longer one, so the length and accuracy of CCS reads depend on the fragment sizes. PacBio sequencing has been utilized for genome (e.g., de novo assembly, detection of structural variants and haplotyping) [10] and transcriptome (e.g., gene isoform reconstruction and novel gene/isoform discovery) [11–13] studies.
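The subread and CCS idea described above can be illustrated with a toy sketch. It is not the PacBio consensus algorithm: the adaptor sequence, the function names and the example reads are all hypothetical, and it assumes that every subread has the same orientation and length and contains substitution errors only, so that a column-wise majority vote is enough. Real consensus callers have to align subreads and handle insertions and deletions.

```python
from collections import Counter

# Hypothetical adaptor sequence, chosen only for this illustration.
ADAPTOR = "ATCTCTCTC"


def split_subreads(clr, adaptor=ADAPTOR):
    """Split a continuous long read (CLR) into subreads at each adaptor."""
    return [piece for piece in clr.split(adaptor) if piece]


def majority_consensus(subreads):
    """Column-wise majority vote over equal-length, same-orientation subreads.

    Simplifying assumption: substitution errors only, so no alignment is needed.
    """
    length = len(subreads[0])
    if any(len(s) != length for s in subreads):
        raise ValueError("this toy consensus needs equal-length subreads")
    return "".join(Counter(column).most_common(1)[0][0] for column in zip(*subreads))


# Three passes over the same insert; the second pass carries one substitution.
insert = "ACGTTGCA"
passes = ["ACGTTGCA", "ACGATGCA", "ACGTTGCA"]
clr = ADAPTOR.join(passes)

subreads = split_subreads(clr)
consensus = majority_consensus(subreads)
print(subreads, consensus)
```

In this toy case the single substitution is outvoted by the two correct passes, which is the intuition behind the higher accuracy of CCS reads.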

ONT is a nanopore-based single-molecule sequencing technology, and the first prototype, MinION, was released in 2014 [14]. In contrast to other sequencing technologies, which rely on nucleotide incorporation or hybridization, ONT directly sequences a native single-stranded DNA (ssDNA) molecule by measuring characteristic current changes as the bases are threaded through the nanopore by a molecular motor protein. ONT MinION uses a hairpin library structure similar to the PacBio circular DNA template: the DNA template and its complement are joined by a hairpin adaptor. The DNA template therefore passes through the nanopore first, followed by the hairpin and finally the complement. The raw read can be split into two “1D” reads (“template” and “complement”) by removing the adaptor. The consensus sequence of the two “1D” reads is a “2D” read with higher accuracy [2]. Because its data features are similar to those of PacBio, many researchers have utilized or are testing ONT in applications where PacBio has been applied.
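As a rough illustration of the template/hairpin/complement layout described above, the following sketch splits a raw read at the hairpin and reverse-complements the “complement” read so that it can be compared base by base with the “template” read. The hairpin sequence, function names and example read are hypothetical, and the sketch stops short of calling a 2D consensus: it only reports the positions where the two 1D reads disagree, which is where a consensus step would have to decide.

```python
# Translation table for complementing DNA bases.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

# Hypothetical hairpin adaptor sequence, chosen only for this illustration.
HAIRPIN = "TTGGCCAATTGG"


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def split_1d_reads(raw_read, hairpin=HAIRPIN):
    """Split a raw read into its 'template' and 'complement' 1D reads."""
    template, found, complement = raw_read.partition(hairpin)
    if not found:
        raise ValueError("hairpin not found: only a template read is available")
    return template, complement


# Raw read: template, then hairpin, then the complement strand with one error.
raw_read = "ACGTAGGCT" + HAIRPIN + "AGCCTTCGT"

template_read, complement_read = split_1d_reads(raw_read)
oriented = reverse_complement(complement_read)  # same orientation as the template

# Positions where the two 1D reads disagree (equal lengths assumed in this toy case).
mismatches = [i for i, (a, b) in enumerate(zip(template_read, oriented)) if a != b]
print(template_read, oriented, mismatches)
```

With only two observations per base, a simple vote cannot resolve a disagreement, so an actual consensus step would need additional information beyond the two base sequences used in this sketch.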

PacBio and ONT platforms share the advantage of long read lengths, yet they also share the same drawbacks: a higher sequencing error rate and lower throughput compared to SGS [3,14–16]. High sequencing error rates pose challenges for single-nucleotide-resolution analyses, such as accurate sequencing of transcripts, identification of splice sites and SNP calling. Low throughput is an obstacle for quantitative analysis, such as gene/isoform abundance estimation. Although PacBio CCS and ONT 2D consensus strategies can reduce error rates, the corresponding read lengths become shorter and throughput becomes lower. Therefore, hybrid sequencing (“Hybrid-Seq”), which integrates TGS and SGS data, has emerged as an approach that uses SGS data to address the limitations of TGS data analysis. For example, error correction of PacBio or ONT long reads by SGS short reads improves the accuracy and mappability of long reads [17–19]. Hybrid-Seq can be applied to genome assembly and transcriptome characterization and improve the overall performance and resolution [11–13,17].
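The trade-off between consensus accuracy and read length can be made concrete with a back-of-the-envelope model. This is a sketch under strong assumptions that are not taken from the study: errors are treated as independent substitutions with the same per-base probability in every pass, erroneous passes are pessimistically assumed to agree with each other, ties count as errors, and adaptor length is ignored. The 14% error rate and 10kb CLR length are simply round values within the ranges quoted earlier; the function names are hypothetical.

```python
from math import comb


def passes_per_insert(clr_length, insert_length):
    """Approximate number of full passes over an insert within one CLR."""
    return clr_length // insert_length


def consensus_error(p, n):
    """Probability that a majority vote over n passes is wrong at one base.

    Assumes independent errors with per-pass probability p; a tie
    (half or more of the passes wrong) is counted as an error.
    """
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(n + 1)
        if 2 * k >= n
    )


raw_error = 0.14      # within the 13-15% raw error range quoted earlier
clr_length = 10_000   # a CLR of about 10kb

for insert_length in (1_000, 2_000, 5_000):
    n = passes_per_insert(clr_length, insert_length)
    print(insert_length, n, consensus_error(raw_error, n))
```

Under these assumptions, longer inserts receive fewer passes and the modeled consensus error rises toward the raw error rate, which matches the qualitative statement that consensus accuracy is gained at the cost of read length.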

The long read length of PacBio and ONT is very informative for transcriptome research, especially for gene isoform identification. In addition to human transcriptomes [20–22], the PacBio transcript sequencing protocol, Iso-Seq, has been widely used to characterize transcriptome complexity in non-model organisms and particular genes/gene families [23–31]. In contrast, ONT has no standard transcript sequencing protocol and only a few pilot studies are publicly available. Using MinION, Bolisetty et al. discovered very high isoform diversity of four genes in Drosophila, which illustrates the utility of ONT in investigating complex transcriptional events [32]. Oikonomopoulos et al. also demonstrated the stability of ONT sequencing in quantifying the transcriptome by analyzing an artificial mixture of 92 transcripts with Spike-In RNA [33]. Compared to these studies using PacBio or ONT alone, Hybrid-Seq can reduce the requirement of data size and improve the output, especially for transcriptome-wide studies. For example, a series of Hybrid-Seq methods (IDP, IDP-fusion, IDP-ASE) have been developed to extend transcriptome studies to the isoform level (e.g., gene isoform reconstruction, fusion genes and allele phasing) with higher sensitivity and accuracy, and to achieve a more accurate abundance estimation, which has been demonstrated in human embryonic stem cells (hESCs) and breast cancer [11–13].

Herein, we generated PacBio and ONT data from cDNA of hESCs. Using our tool AlignQC (http://www.healthcare.uiowa.edu/labs/au/AlignQC/), we performed a comprehensive analysis and comparison of PacBio and ONT data, including the raw data (subreads and 1D “template” reads) and their consensus (CCS and 2D reads). PacBio sequencing was performed on size-selected libraries, as size selection is the manufacturer's recommendation. ONT libraries were not size selected, because size selection was not standard practice for ONT at the time of sequencing. Since these technologies follow different library preparation protocols, it is important to consider these steps as potential sources of variability, just as the sequencing platforms themselves can introduce variability. The comparisons covered error rate and error pattern, read length, mappability and abnormal alignments, as well as technology improvements between the latest sequencing models (PacBio P6-C4 and ONT R9) and previous versions (C2 and R7). We also validated and compared the capability of PacBio and ONT alone to study a gold standard set of spike-in transcripts. Then, we applied long read only and the corresponding Hybrid-Seq approaches to human transcriptome analyses, including isoform identification, quantification and discovery of complex transcriptome events. In addition to a comprehensive evaluation of the characteristics of the two main TGS data platforms, this work serves as a guide for applications of PacBio and ONT and the corresponding Hybrid-Seq for transcriptome analysis.

