Informatics for PacBio Long Reads
- April 2019
- Advances in Experimental Medicine and Biology
- DOI: 10.1007/978-981-13-6037-4_8
- In book: Single Molecule and Single Cell Sequencing
Yuta Suzuki
Abstract In this article, we review the development of a wide variety of bioinformatics software implementing state-of-the-art algorithms since the introduction of SMRT sequencing technology into the field. We focus on three major categories of development: read mapping (aligning to reference genomes), de novo assembly, and detection of structural variants. Long SMRT reads benefit all of these applications, but the benefits are realized only by properly accounting for the nature of the long-read technology.
Advances in SMRT Biology and Challenges in Long Read Informatics
In 2011, the advent of the PacBio RS sequencer and its SMRT (single molecule real time) sequencing technology revolutionized the concept of DNA sequencing. Longer reads promised de novo assemblies of much higher contiguity, and the claim was borne out by several assembly projects (Steinberg et al. 2014; Pendleton et al. 2015; Seo et al. 2016). The lack of sequencing bias made it possible to read regions that are extremely difficult for NGS (next-generation sequencers) (Loomis et al. 2013).
None of these achievements, however, was a straightforward application of conventional informatics strategies developed for short-read sequencers; the virtue of the long reads did not come for free. As many careful skeptics claimed in the early history of PacBio sequencing, the long reads seemed too noisy. Base accuracy was around ~85% for a single raw read, that is, ~15% of bases were wrong calls, and indels constituted most of the errors. The higher error rate made it inappropriate to apply informatics tools designed for much more accurate short-read technologies.
Even if the higher error rate is properly handled by sophisticated algorithms, the length of the reads itself can pose another problem. The computational burden of many algorithms depends on the read length L. When only short reads are assumed, L may be considered constant, e.g., L = 76, 150, etc. The emergence of long-read sequencers changed the situation drastically by improving the read length by orders of magnitude, to thousands of bases, and by now to tens of thousands of bases. Besides the ongoing innovations toward longer reads, there is large variation in read length even within the same sequencing run. Therefore, the assumption that the read length is constant is no longer valid, and one must have a strategy to handle (variably) long reads within reduced time (CPU hours) and space (memory footprint) requirements.
The availability of long reads opened a door to a set of problems that exist biologically but were implicitly ignored by studies using short-read sequencing. For example, we had to realize that a non-negligible fraction of reads could cover SVs (structural variants), requiring a new robust mapping strategy beyond simply masking the known repetitive regions.
Consequently, many sophisticated algorithms had to be developed to resolve these issues: how to mitigate the higher error rate, and how to do so efficiently for long reads. The rest of this article covers some important innovations achieved and ongoing efforts in the informatics area to make the most of long-read data.
Aligning Noisy Long Reads with Reference Genome
When one aligns long reads against a reference sequence, one must be aware that the variations between reads and reference stem from two conceptually separate causes. On one hand, there are sequencing errors in the simple sense, i.e., discrepancies between an observed read and the actual sequence being sequenced. On the other hand, we expect a sequenced sample to have a slightly different sequence than the reference (otherwise there would be no point in sequencing), and those differences are usually called variants. Though sequencing errors and sequence variants are conceptually different, they both appear simply as "errors" to us unless we have some criteria to distinguish them. The next two examples illustrate why the distinction between the two classes of "error" is relevant here.
Let's consider that we have some noisy reads. Clearly, we cannot call sequence variants specific to the sample unless the frequency of sequencing errors is controlled to be sufficiently low compared to the frequency of variants. This is the reason why it is difficult to detect small nucleotide variants, such as point mutations and indels, from noisy reads.
Next, assume we have long reads. Then, there are more chances that the reads span large variations, such as structural variations (SVs), between a reference genome and the sample sequenced. This situation is problematic for aligners that consider any possible variation between reads and reference to be sequencing errors, for such aligners would fail to detect the correct alignment, as they would need to introduce too many errors to align these sequences. Some aligners try to combat the situation by employing techniques such as chaining and split alignment. Some aligners (NGMLR, Minimap2) explicitly introduce an SV-aware scoring scheme, such as a two-piece concave gap penalty, which reflects the two classes of variation between read and reference.
Sequence alignment is so fundamental in sequence analysis that it finds applications everywhere. For example, mapping sequencing reads to a reference genome is the very first step of resequencing studies. Accuracy of mapping directly translates into the overall reliability of the results. Also, mapping is often one of the most computationally intensive steps. Therefore, accurate and faster mapping software benefits the whole area of resequencing studies. In the context of a de novo assembly pipeline, alignment is used for detecting overlaps among long reads. Of note, the desired balance between sensitivity and specificity of overlap detection is controlled differently from mapping to a reference, and it can often be very subtle.
Though it is more or less subjective to distinguish standalone aligners from aligners designed as modules of assembly or SV detection pipelines, we decided to cover some aligners in other sections. MHAP will be introduced in relation to Canu in the section devoted to assembly tools. Similarly, NGMLR will be detailed together with Sniffles in the section on SV detection.
BWA-SW and BWA-MEM
Adopting the seed-and-extend approach, BWA-SW (Li & Durbin 2010) builds FM-indices for both the query and the reference sequence. Then, DP (dynamic programming) is applied to these FM-indices to find all local matches, i.e., seeds, allowing mismatches and gaps between query and reference. Detected seeds are extended by the Smith-Waterman algorithm. Some heuristics are explicitly introduced to speed up alignment of large-scale sequencing data and to mitigate the effect of repetitive sequences. BWA-MEM (Li 2013) inherits features implemented in BWA-SW, such as split alignment, but is founded on a different seeding strategy using SMEMs (supermaximal exact matches) and a reseeding technique to reduce mismapping caused by missing seed hits.
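The seed-and-extend idea can be illustrated with a toy sketch. Here a plain hash table of k-mers stands in for the FM-index (real SMEM computation is considerably more involved), and all names and parameters are illustrative:

```python
# Toy exact-match seeding: a hash table of reference k-mers replaces the
# FM-index; each shared k-mer yields a (query_pos, ref_pos) seed that a
# real aligner would then extend with Smith-Waterman.
from collections import defaultdict

def index_kmers(ref, k):
    """Map every k-mer of the reference to its start positions."""
    idx = defaultdict(list)
    for i in range(len(ref) - k + 1):
        idx[ref[i:i + k]].append(i)
    return idx

def find_seeds(query, idx, k):
    """Exact k-mer seeds as (query_pos, ref_pos) pairs."""
    seeds = []
    for j in range(len(query) - k + 1):
        for i in idx.get(query[j:j + k], []):
            seeds.append((j, i))
    return seeds

ref = "ACGTACGTTGCAACGTACGT"
query = "CGTTGCAA"
idx = index_kmers(ref, 5)
print(find_seeds(query, idx, 5))   # -> [(0, 5), (1, 6), (2, 7), (3, 8)]
```

The diagonal run of seeds (query and reference positions increasing in lockstep) is exactly what the extension stage would confirm as an alignment.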
BLASR
BLASR (Chaisson & Tesler 2012) (Basic Local Alignment with Successive Refinement) is also one of the earliest mapping tools specifically developed for SMRT reads. Like BWA-MEM, it is probably among the most widely used to date. Bundled with the official SMRT Analysis suite, it has been the default choice for the mapping (overlapping) step in all protocols, such as resequencing, de novo assembly, transcriptome analysis, and methylation analysis. In the BLASR paper, the authors explicitly stated that it was designed to combine algorithmic devices developed in two separate lines of study, namely, coarse alignment methods for whole genome alignment and sophisticated data structures for fast short read mapping. Proven effective for handling noisy long reads, this approach of successive refinement, or the seed-chain-align paradigm, has become a standard principle.
BLASR first finds short exact matches (anchors) using either a suffix array or an FM index (Ferragina & Manzini 2000). Then, regions with clustered anchors aligned colinearly are identified as candidate mapping locations by a global chaining algorithm (Abouelhoda & Ohlebusch 2003). The anchors are further chained by sparse dynamic programming (SDP) within each candidate region (Eppstein et al. 1992). Finally, BLASR produces a detailed alignment using banded DP (dynamic programming) guided by the result of the SDP. BLASR achieved tenfold faster mapping of reads to the human genome than the BWA-SW algorithm at comparable mapping accuracy and memory footprint.
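The chaining step can be sketched with a quadratic-time DP that keeps only anchors whose query and reference coordinates increase together. BLASR's actual global chaining and sparse DP are far more efficient; this toy version, with illustrative anchor tuples, only shows the idea:

```python
# Toy colinear chaining: given anchors (qpos, rpos, length), find the
# highest-scoring chain in which both coordinates strictly increase.
def chain(anchors):
    anchors = sorted(anchors)                 # sort by query position
    best = [a[2] for a in anchors]            # best chain score ending at i
    prev = [-1] * len(anchors)
    for i, (qi, ri, li) in enumerate(anchors):
        for j, (qj, rj, lj) in enumerate(anchors[:i]):
            # anchor j must end before anchor i starts, on both coordinates
            if qj + lj <= qi and rj + lj <= ri and best[j] + li > best[i]:
                best[i], prev[i] = best[j] + li, j
    # backtrack from the best chain end point
    end = max(range(len(anchors)), key=lambda t: best[t])
    out = []
    while end != -1:
        out.append(anchors[end])
        end = prev[end]
    return out[::-1]

# Two colinear anchors plus one off-diagonal (repeat-induced) hit.
print(chain([(0, 100, 10), (20, 120, 10), (25, 500, 10)]))
```

The off-diagonal anchor is dropped because no increasing chain can include it together with the other two, which is exactly how chaining filters repeat hits before the expensive banded DP.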
DALIGNER
DALIGNER (Myers 2014) is specifically designed for finding overlaps between noisy long reads, though its concept can also be adopted for a generic long read aligner, as implemented in DAMAPPER (https://github.com/thegenemyers/DAMAPPER). Like BLASR, DALIGNER performs filtering based on short exact matches. Instead of using the BWT (FM index), it explicitly processes k-mers within reads by a threadable and cache-coherent implementation of radix sort. Detected k-mers are then compared via block-wise merge sort, which reduces the memory footprint to a constant depending only on the block size. To generate local alignments, it applies the O(ND) diff algorithm between two candidate reads (Myers 1986). DALIGNER achieved a 22- to 39-fold speedup over BLASR at higher sensitivity in detecting correct overlaps (Myers 2014). DALIGNER is intended as the read-overlap component (together with DAMASKER for repeat masking, DASCRUBBER for cleaning up low-quality regions, and a core module for assembly) of the DAZZLER de novo assembler for long noisy reads, to be released in the future.
Minimap2
Minimap2 (Li 2017) is one of the latest state-of-the-art alignment programs. Minimap2 is a general-purpose aligner in that it can align short reads, noisy long reads, and reads from transcripts (cDNA) back to a reference genome. Minimap2 combines several algorithmic ideas developed in the field, such as locality-sensitive hashing as in Minimap and MHAP. To account for possible SVs between reads and genome, it employs a concave gap cost as in NGMLR, computed efficiently using the formulation proposed by Suzuki & Kasahara (2017). In addition to these features, the authors further optimized the algorithm by transforming the DP matrix from row-column coordinates to diagonal-antidiagonal coordinates for better concurrency on modern processors. According to its author, Minimap2 is intended to replace BWA-MEM, which is in turn a widely used extension of BWA-SW.
De novo Assembly
As the Lander-Waterman theory (Lander & Waterman 1988) asserts, longer input reads are quite essential for achieving a high-quality genome assembly of repetitive genomes. Therefore, developing a de novo assembler for long reads is naturally the most active area in the field of long read informatics.
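The classical Lander-Waterman prediction is easy to evaluate numerically. Note that the model ignores repeats entirely, which is precisely the effect that makes read length matter in practice; the parameters below are illustrative:

```python
# Expected number of islands (apparent contigs) under the Lander-Waterman
# model: N reads of length L over a genome of length G give coverage
# c = N*L/G; with minimum detectable overlap T and theta = T/L, the
# expected island count is N * exp(-c * (1 - theta)).
import math

def expected_contigs(G, L, N, T):
    c = N * L / G
    theta = T / L
    return N * math.exp(-c * (1 - theta))

G, L, T = 3e9, 10_000, 1_000       # 3 Gbp genome, 10 kbp reads, 1 kbp overlap
for c in (1, 2, 5, 10):
    N = int(c * G / L)             # number of reads needed for coverage c
    print(f"{c}x coverage -> ~{expected_contigs(G, L, N, T):,.0f} contigs")
```

The contig count drops exponentially with coverage; real genomes fall short of this ideal exactly where repeats longer than the reads break the model's assumptions.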
To our knowledge, almost all assemblers published for long reads take an overlap-layout-consensus (OLC) approach, where the overall task of assembly is divided into three steps. (1. Overlap) Overlaps between reads are identified as candidate pairs representing the same genomic regions, and an overlap graph is constructed to express these relations. (2. Layout) The graph is transformed to generate linear contigs. This step often starts by constructing the string graph (Myers 2005), a string-labeled graph which encodes all the information in the observed reads, and eliminates edges containing redundant information. (3. Consensus) The final assembly is polished. To eliminate errors in contigs, a consensus is taken among the reads making up the contigs.
Though we do not cover tools for the consensus step here, many have been released to date, including the official Quiver and Arrow bundled in SMRT Analysis (https://github.com/PacificBiosciences/GenomicConsensus), another official tool pbdagcon (https://github.com/PacificBiosciences/pbdagcon), Racon (Vaser et al. 2017), and MECAT (Xiao et al. 2017). Of note, the quality of a polished assembly can be much better than that of a short-read-based assembly due to the randomness of sequencing errors in long reads (Chin et al. 2013; Myers 2014).
FALCON
FALCON (Chin et al. 2016) is designed as a diploid-aware de novo assembler for long reads. It starts by carefully taking a consensus among the reads to eliminate sequencing errors while retaining heterozygous variants which can distinguish the two homologous chromosomes (FALCON-sense). For constructing a string graph, FALCON runs DALIGNER. The resulting graph contains "haplotype-fused" contigs and "bubbles" reflecting variations between the two homologous chromosomes. Finally, FALCON-unzip tries to resolve such regions by phasing the associated long reads and local re-assembly. The resulting contigs are called "haplotigs", which are supposed to be faithful representations of the individual alleles in the diploid genome.
Canu (& MHAP)
MHAP (Berlin et al. 2015) (MinHash Alignment Process) utilizes MinHash for efficient dimensionality reduction of the read space. In MinHash, H hash functions are randomly selected, each of which maps a k-mer to an integer. For a given read of length L, only the minimum value over the read is recorded for each of the H hash functions. The k-mers at which the minima are attained are called min-mers, and the resulting representation is called a sketch. The sketch serves as a locality-sensitive hashing of each read, for similar sequences are expected to share similar sketches. Because the sketch retains data only on the H min-mers, its size is fixed to H, independent of the read length L.
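A minimal MinHash sketch can be written in a few lines. The salted hash family, the toy sequences, and the parameters k=8, H=64 are all illustrative choices, not MHAP's:

```python
# MinHash sketch of a read: for each of H hash functions, keep only the
# minimum hash value over the read's k-mers. The sketch has fixed size H,
# independent of read length; the fraction of shared minima estimates the
# k-mer (Jaccard) similarity between two reads.
import hashlib

def kmers(seq, k):
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def h(kmer, salt):
    """One member of a salted family of hash functions."""
    data = f"{salt}:{kmer}".encode()
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")

def sketch(seq, k=8, H=64):
    ks = kmers(seq, k)
    return [min(h(m, salt) for m in ks) for salt in range(H)]

def similarity(s1, s2):
    return sum(a == b for a, b in zip(s1, s2)) / len(s1)

a = "ACGTACGGTACGATTACAGGCATCGATCGGATC"
b = a[:25] + "TTTTTTTT"                     # shares only a prefix with a
print(similarity(sketch(a), sketch(a)))     # identical reads -> 1.0
print(similarity(sketch(a), sketch(b)))     # partial overlap -> below 1.0
```

Comparing two sketches costs O(H) regardless of read length, which is what makes all-versus-all overlap filtering tractable.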
Built on top of MHAP, Canu (Koren et al. 2017) extends the best overlap graph (BOG) algorithm (Miller et al. 2008) for generating contigs. A new "bogart" algorithm estimates an optimal overlap error rate instead of using a predetermined one as in the original BOG algorithm. This requires multiple rounds of read and overlap error correction, but it eventually enables separating repeats that have diverged by only 3%. Though the BOG algorithm is greedy, the effect is mitigated in Canu by also inspecting non-best overlaps to avoid potential misassemblies.
HINGE
While there is no doubt that obtaining a more contiguous (i.e., higher contig N50) assembly is a major goal in genome assembly, the quest for longer N50 alone may cause misassemblies if the strategy gets too greedy. Aware of that danger, HINGE (Kamath et al. 2017) aims to perform the optimal resolution of repeats in assembly, in the sense that a repeat should be resolved if and only if its resolution is supported by the available long read data. Implementing such a strategy is rather straightforward for de Bruijn graphs. In a de Bruijn graph, nodes representing k-mers are connected by edges when they co-occur next to each other in reads. In the ideal situation, the genome assembly is realized as an Eulerian path, i.e., a trail which visits every edge exactly once, in the de Bruijn graph. However, de Bruijn graphs are not robust to noisy long reads, so overlap graphs are usually preferred for long reads. One of the key motivations of HINGE is to bring this desirable property of de Bruijn graphs to overlap graphs, which are more error-resilient. To do so, HINGE enriches the string graph with additional information called "hinges" based on the result of the read overlap step. Then, an assembly graph with optimal repeat resolution can be constructed via a hinge-aided greedy algorithm.
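The de Bruijn graph / Eulerian path idea can be made concrete on error-free toy reads; the reads and k=4 below are illustrative, and deduplicating k-mers keeps the example linear:

```python
# Tiny de Bruijn graph: nodes are (k-1)-mers, one edge per distinct k-mer
# observed in the reads. On error-free data the assembly is an Eulerian
# path; the ~15% error rate of raw SMRT reads would flood the graph with
# spurious k-mers, which is why overlap graphs are preferred for long reads.
from collections import defaultdict

def de_bruijn(reads, k):
    """Edges of the de Bruijn graph, one per distinct k-mer."""
    kmer_set = {r[i:i + k] for r in reads for i in range(len(r) - k + 1)}
    edges = defaultdict(list)
    for kmer in kmer_set:
        edges[kmer[:-1]].append(kmer[1:])   # (k-1)-mer -> (k-1)-mer
    return edges

def eulerian_path(edges):
    """Hierholzer's algorithm; assumes a unique start node exists."""
    g = {u: list(vs) for u, vs in edges.items()}
    indeg = defaultdict(int)
    for vs in g.values():
        for v in vs:
            indeg[v] += 1
    start = next(u for u in g if len(g[u]) > indeg[u])
    stack, path = [start], []
    while stack:
        u = stack[-1]
        if g.get(u):
            stack.append(g[u].pop())
        else:
            path.append(stack.pop())
    return path[::-1]

reads = ["ACGTTG", "GTTGCAA", "GCAAT"]   # error-free fragments of ACGTTGCAAT
path = eulerian_path(de_bruijn(reads, 4))
print(path[0] + "".join(p[-1] for p in path[1:]))   # -> ACGTTGCAAT
```

A single wrong base in any read would add branching edges and destroy the clean Eulerian structure, illustrating the fragility that HINGE works around.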
Miniasm (& Minimap)
Minimap (Li 2016) adopts a similar idea to MHAP: it uses minimizers to represent the reads compactly. Specifically, Minimap uses the concept of a (w,k)-minimizer, which is the smallest (in hashed value) k-mer among w consecutive k-mers. To perform mapping, Minimap searches for colinear sets of minimizers shared between sequences. Miniasm (Li 2016), an associated assembly module, generates an assembly graph without error correction. It first filters low-quality reads (chimeric or with untrimmed adapters), constructs the graph greedily, and then cleans up the graph with several heuristics, such as popping small bubbles and removing shorter overlaps.
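The (w,k)-minimizer selection itself fits in a few lines; here Python's built-in hash() stands in for Minimap's invertible integer hash, and the sequence and window parameters are illustrative:

```python
# (w,k)-minimizers: for every window of w consecutive k-mers, keep the
# k-mer with the smallest hash value. Nearby windows usually share their
# minimizer, so the selected set is much sparser than all k-mers.
def minimizers(seq, w, k):
    km = [(hash(seq[i:i + k]), i) for i in range(len(seq) - k + 1)]
    picked = set()
    for j in range(len(km) - w + 1):
        picked.add(min(km[j:j + w]))       # smallest (hash, pos) per window
    return sorted(pos for _, pos in picked)

seq = "ACGTACGGTACGATTACAGGCATCGATCGGATC"
pos = minimizers(seq, w=5, k=8)
print(len(pos), "minimizers vs", len(seq) - 8 + 1, "k-mers")
```

Because two sequences sharing a long enough exact stretch are guaranteed to pick the same minimizer inside it, matching minimizers is a sound (and compact) substitute for matching all k-mers.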
Detection of Structural Variants (SVs)
Sequence variants are called structural when they are explained by mechanisms involving double-strand breaks, and for the sake of convenience they are often defined as variants larger than a certain size (e.g., 50 bp). They are categorized into several classes such as insertions/deletions (including presence/absence of transposons), inversions, (segmental) duplications, tandem repeat expansions/contractions, etc. While some classes of SVs are notoriously difficult to detect via short reads (especially long inversions and insertions), long reads promise to detect more of them by capturing entire structural events within sequencing reads.
PBHoney
PBHoney (English, Salerno & Reid 2014) implements a combination of two methods for detecting SVs via read alignment to a reference sequence. First, PBHoney exploits the fact that the alignment of reads by BLASR should be interrupted (giving soft-clipped tails) at the breakpoints of SV events. PBHoney detects such interrupted alignments (piece-alignments) and clusters them to identify individual SV events. Second, PBHoney locates SVs by examining genomic regions with an anomalously high error rate. Such a large discordance can signal the presence of SVs because sequencing errors within PacBio reads are supposed to be distributed rather randomly.
Sniffles (& NGMLR)
NGMLR (Sedlazeck et al. 2017) is a long-read aligner designed for SV detection, which uses two distinct gap extension penalties for different size ranges of gaps (i.e., a concave gap penalty) to align entire reads over regions with SVs. Intuitively, the concave gap penalty is designed so that it can allow longer gaps in an alignment while shorter gaps are penalized as sequencing errors. Adopting such a complicated scoring scheme makes the alignment process computationally intensive (Miller et al. 1988), but NGMLR introduces heuristics to perform faster alignment. Then, Sniffles, an associated tool for detecting SVs, scans the read alignments to report putative SVs, which are then clustered to identify individual events and evaluated by various criteria. Optionally, Sniffles can infer the genotypes (homozygous or heterozygous) of detected variants, and can associate "nested SVs" which are supported by the same group of long reads.
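The two-piece concave gap penalty is simply the minimum of two affine penalties. The parameters below are illustrative, not NGMLR's or Minimap2's defaults:

```python
# Two-piece concave gap penalty: short gaps pay the usual affine cost
# (appropriate for sequencing errors), while beyond the crossover point the
# cheaper slowly-growing branch takes over, so an SV-sized gap of hundreds
# of bases remains affordable within a single alignment.
def gap_cost(length, open1=6, ext1=2, open2=24, ext2=1):
    # minimum of two affine functions => piecewise-linear and concave
    return min(open1 + ext1 * length, open2 + ext2 * length)

for g in (1, 5, 20, 100, 1000):
    print(g, gap_cost(g))
```

With a single affine penalty, a 1000 bp deletion would cost more than simply clipping the read; the concave cost keeps the whole read aligned across the event, which is what Sniffles relies on downstream.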
SMRT-SV
SMRT-SV (Huddleston et al. 2017) is an SV detection tool based on local assembly. It first maps long reads to the reference genome, against which SVs are called. Then it searches for signatures of SVs within the alignment results, and 60-kbp regions around the detected signatures are extracted. These regions are assembled locally from those reads using Canu, and SVs are then called by examining the alignment between the assembled contigs and the reference. Local assembly is performed for the other regions (without SV signatures) as well, to detect smaller variants.
Beyond DNA – Transcriptome Analysis and Methylation Analysis
SMRT sequencing has found applications outside DNA analysis as well. When applied to cDNA sequencing, long reads are expected to capture the entire structures of transcripts to elucidate expressed isoforms comprehensively.
IDP (Isoform Detection and Prediction) (Au et al. 2013) and IDP-ASE (Deonovic et al. 2017) are tools dedicated to analyzing long read transcriptome data. To detect expressed isoforms from long read transcriptome data, IDP formulates the problem in the framework of integer programming. To estimate allele-specific expression at both the gene level and the isoform level, IDP-ASE then solves a probabilistic model of observing each allele in short read RNA-seq. Both IDP and IDP-ASE effectively combine long read data, for detecting the overall structure of transcripts, with short read data, for accurate base-pair level information.
In methylation analysis, the official kineticsTools in SMRT Analysis has been widely used to detect base modification sites and to estimate sequence motifs for DNA modification (see (Flusberg et al. 2010) for the principle of detection). Detecting 5-methyl-cytosines (5mC), which is by far the dominant type of DNA modification in plants and animals, is challenging due to their subtle signal. Designed for detecting 5mC modifications in large genomes at practical sequencing depth, AgIn (Suzuki et al. 2016) exploits the observation that CpG methylation events in vertebrate genomes are correlated over neighboring CpG sites, and tries to assign binary methylation states to CpG sites based on the kinetic signals, under the constraint that a certain number of neighboring CpG sites should be in the same state. Making the most of the high mappability of long reads, AgIn has been applied to observe the diversified CpG methylation statuses of centromeric repeat regions in a fish genome (Ichikawa et al. 2017), and to observe allele-specific methylation events in human genomes.
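The flavor of such a constrained assignment can be conveyed with a small dynamic program. This is a simplified sketch, not the published AgIn algorithm: per-site scores and the minimum run length m are hypothetical, and a run of equal states must span at least m sites:

```python
# Simplified sketch of constrained binary state assignment: choose a state
# per CpG site from a per-site score (positive = evidence for methylated),
# requiring every run of equal states to contain at least m sites.
def assign_states(scores, m=3):
    """DP over (state, run-length) pairs; run length is capped at m, and a
    state switch is allowed only out of a completed run. Assumes
    len(scores) >= m."""
    NEG = float("-inf")
    dp = {(s, 1): (scores[0] if s else -scores[0]) for s in (0, 1)}
    back = [{}]
    for i in range(1, len(scores)):
        ndp, nback = {}, {}
        for s in (0, 1):
            gain = scores[i] if s else -scores[i]
            for r in range(1, m + 1):          # extend the current run
                prev = dp.get((s, r), NEG)
                nr = min(r + 1, m)
                if prev + gain > ndp.get((s, nr), NEG):
                    ndp[(s, nr)], nback[(s, nr)] = prev + gain, (s, r)
            prev = dp.get((1 - s, m), NEG)     # switch after a full run
            if prev + gain > ndp.get((s, 1), NEG):
                ndp[(s, 1)], nback[(s, 1)] = prev + gain, (1 - s, m)
        dp = ndp
        back.append(nback)
    key = max(((s, m) for s in (0, 1)), key=lambda t: dp.get(t, NEG))
    states = []
    for i in range(len(scores) - 1, 0, -1):
        states.append(key[0])
        key = back[i][key]
    states.append(key[0])
    return states[::-1]

# The weakly contradictory third site (-0.1) is absorbed into the
# surrounding methylated run instead of breaking it.
print(assign_states([2.0, 1.5, -0.1, 1.0, -2.0, -1.5, -1.0]))
```

The run-length constraint plays the role of the neighbor-correlation prior: isolated noisy kinetic signals cannot flip the state of a single site on their own.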
Concluding Remarks
We have briefly described some innovative ideas in bioinformatics for the effective use of long read data. As concluding remarks, let me mention a few prospects for future development in the field. By now, it is evident that the quest for complete genome assembly is almost done, but what remains is the most difficult part, such as extremely large repeats, centromeres, and telomeres. While many state-of-the-art assemblers take the presence of such difficult regions into account and can carefully generate high quality assemblies for the rest of the genome, it remains an open question how to tackle these difficult parts of the genome and how to resolve their sequence, rather than escaping from them.
Base modification analysis using PacBio sequencers may also have huge potential to distinguish several types of base modifications and to detect them simultaneously in the same sample (Clark et al. 2011), but only a limited number of modification types (6mA, 4mC, and 5mC) are considered for now. This is mainly due to the technical challenge of alleviating noise in the kinetics data so as to distinguish each type of modification, and unmodified bases, from one another.
That said, there is no doubt that the field will become more attractive than ever, as the use of long read sequencers becomes a daily routine in every area of biological research, or perhaps even in clinical practice.