Informatics for PacBio Long Reads
- April 2019
- Advances in Experimental Medicine and Biology
- DOI: 10.1007/978-981-13-6037-4_8
- In book: Single Molecule and Single Cell Sequencing
Yuta Suzuki
Abstract In this article, we review the development of a wide variety of bioinformatics software implementing state-of-the-art algorithms since the introduction of SMRT sequencing technology into the field. We focus on three major categories of development: read mapping (aligning to reference genomes), de novo assembly, and detection of structural variants. Long SMRT reads benefit all of these applications, but the benefits are realized only by properly accounting for the nature of the long-read technology.
Advances in SMRT Biology and Challenges in Long Read Informatics
In 2011, the advent of the PacBio RS sequencer and its SMRT (single molecule real time) sequencing technology revolutionized the concept of DNA sequencing. Longer reads promised de novo assemblies of much higher contiguity, and the claim was borne out by several assembly projects (Steinberg et al. 2014; Pendleton et al. 2015; Seo et al. 2016). The lack of sequencing bias made it possible to read regions that are extremely difficult for NGS (next-generation sequencers) (Loomis et al. 2013).
None of these achievements, however, was a straightforward application of conventional informatics strategies developed for short-read sequencers; the virtue of the long reads did not come for free. As many careful skeptics claimed in the early history of PacBio sequencing, the long reads seemed too noisy. Base accuracy was around ~85% for a single raw read, that is, ~15% of bases were wrong calls, and indels constituted most of the errors. The higher error rate made it inappropriate to apply informatics tools designed for much more accurate short-read technologies.
Even if the higher error rate is properly handled by sophisticated algorithms, the length of the reads itself can pose another problem. The computational burden of many algorithms depends on the read length L. When only short reads are assumed, L may be considered constant, e.g., L = 76, 150, etc. The emergence of long-read sequencers changed the situation drastically by improving the read length by orders of magnitude, to thousands of bases, and by now to tens of thousands of bases. Besides the ongoing innovations toward longer reads, there is large variation in read length even within the same sequencing run. Therefore, the assumption that the read length is constant is no longer valid, and one must have a strategy to handle (variably) long reads within reduced time (CPU hours) and space (memory footprint) requirements.
The availability of long reads opened a door to a set of problems that exist biologically but were implicitly ignored by studies using short-read sequencing. For example, we had to realize that a non-negligible fraction of reads could cover SVs (structural variants), requiring a new robust mapping strategy beyond simply masking the known repetitive regions.
Consequently, many sophisticated algorithms had to be developed to resolve these issues: how to mitigate the higher error rate, and how to do so efficiently for long reads. The rest of this article covers some important innovations achieved and ongoing efforts in the informatics area to make the most of long-read data.
Aligning Noisy Long Reads with Reference Genome
When one aligns long reads against a reference sequence, one must be aware that the variations between reads and reference stem from two conceptually separate causes. On one hand, there are sequencing errors in the simple sense, i.e., discrepancies between an observed read and the actual sequence being sequenced. On the other hand, we expect a sequenced sample to have a slightly different sequence than the reference (otherwise there would be no point in sequencing), and those differences are usually called variants. Though sequencing errors and sequence variants are conceptually different, they both appear simply as "errors" to us unless we have some criteria to distinguish them. The next two examples illustrate why the distinction between the two classes of "error" is relevant here.
Let's consider that we have some noisy reads. Clearly, we cannot call sequence variants specific to the sample unless the frequency of sequencing errors is controlled to be sufficiently low compared to the frequency of variants. This is the reason why it is difficult to detect small nucleotide variants, such as point mutations and indels, from noisy reads.
Next, assume we have long reads. Then, there are more chances that the reads span large variations, such as structural variations (SVs), between a reference genome and the sample sequenced. This situation is problematic for aligners that consider any possible variation between reads and reference to be sequencing errors, for such aligners would fail to detect the correct alignment, as they would need to introduce too many errors to align these sequences. Some aligners try to combat the situation by employing techniques such as chaining and split alignment. Some aligners (NGMLR, Minimap2) explicitly introduce an SV-aware scoring scheme, such as a two-piece concave gap penalty, which reflects the two classes of variation between read and reference.
Sequence alignment is so fundamental in sequence analysis that it finds applications everywhere. For example, mapping sequencing reads to a reference genome is the very first step of resequencing studies. Accuracy of mapping directly translates into the overall reliability of the results. Also, mapping is often one of the most computationally intensive steps. Therefore, accurate and faster mapping software benefits the whole area of resequencing studies. In the context of a de novo assembly pipeline, alignment is used for detecting overlaps among long reads. Of note, the desired balance between sensitivity and specificity of overlap detection is controlled differently from mapping to a reference, and it can often be very subtle.
Though it is more or less subjective to distinguish standalone aligners from aligners designed as modules of assembly or SV detection pipelines, we decided to cover some aligners in other sections. MHAP will be introduced in relation to Canu in the section devoted to assembly tools. Similarly, NGMLR will be detailed together with Sniffles in the section on SV detection.
BWA-SW and BWA-MEM
Adopting the seed-and-extend approach, BWA-SW (Li & Durbin 2010) builds FM-indices for both the query and the reference sequence. Then, DP (dynamic programming) is applied to these FM-indices to find all local matches, i.e., seeds, allowing mismatches and gaps between query and reference. Detected seeds are extended by the Smith-Waterman algorithm. Some heuristics are explicitly introduced to speed up alignment of large-scale sequencing data and to mitigate the effect of repetitive sequences. BWA-MEM (Li 2013) inherits features implemented in BWA-SW, such as split alignment, but is founded on a different seeding strategy using SMEMs (supermaximal exact matches) and a reseeding technique to reduce mismapping caused by missing seed hits.
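The seed-and-extend idea can be illustrated with a toy sketch. Here a plain hash table of k-mers stands in for the FM-index (real SMEM computation is considerably more involved), and all names and parameters are illustrative:

```python
# Toy exact-match seeding: a hash table of reference k-mers replaces the
# FM-index; each shared k-mer yields a (query_pos, ref_pos) seed that a
# real aligner would then extend with Smith-Waterman.
from collections import defaultdict

def index_kmers(ref, k):
    """Map every k-mer of the reference to its start positions."""
    idx = defaultdict(list)
    for i in range(len(ref) - k + 1):
        idx[ref[i:i + k]].append(i)
    return idx

def find_seeds(query, idx, k):
    """Exact k-mer seeds as (query_pos, ref_pos) pairs."""
    seeds = []
    for j in range(len(query) - k + 1):
        for i in idx.get(query[j:j + k], []):
            seeds.append((j, i))
    return seeds

ref = "ACGTACGTTGCAACGTACGT"
query = "CGTTGCAA"
idx = index_kmers(ref, 5)
print(find_seeds(query, idx, 5))   # -> [(0, 5), (1, 6), (2, 7), (3, 8)]
```

The diagonal run of seeds (query and reference positions increasing in lockstep) is exactly what the extension stage would confirm as an alignment.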
BLASR
BLASR (Chaisson & Tesler 2012) (Basic Local Alignment with Successive Refinement) is also one of the earliest mapping tools specifically developed for SMRT reads. Like BWA-MEM, it is probably among the most widely used to date. Bundled with the official SMRT Analysis suite, it has been the default choice for the mapping (overlapping) step in all protocols, such as resequencing, de novo assembly, transcriptome analysis, and methylation analysis. In the BLASR paper, the authors explicitly stated that it was designed to combine algorithmic devices developed in two separate lines of study, namely, coarse alignment methods for whole genome alignment and sophisticated data structures for fast short read mapping. Proven effective for handling noisy long reads, this approach of successive refinement, or the seed-chain-align paradigm, has become a standard principle.
BLASR first finds short exact matches (anchors) using either a suffix array or an FM index (Ferragina & Manzini 2000). Then, regions with clustered anchors aligned colinearly are identified as candidate mapping locations by a global chaining algorithm (Abouelhoda & Ohlebusch 2003). The anchors are further chained by sparse dynamic programming (SDP) within each candidate region (Eppstein et al. 1992). Finally, BLASR produces a detailed alignment using banded DP (dynamic programming) guided by the result of the SDP. BLASR achieved tenfold faster mapping of reads to the human genome than the BWA-SW algorithm at comparable mapping accuracy and memory footprint.
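The chaining step can be sketched with a quadratic-time DP that keeps only anchors whose query and reference coordinates increase together. BLASR's actual global chaining and sparse DP are far more efficient; this toy version, with illustrative anchor tuples, only shows the idea:

```python
# Toy colinear chaining: given anchors (qpos, rpos, length), find the
# highest-scoring chain in which both coordinates strictly increase.
def chain(anchors):
    anchors = sorted(anchors)                 # sort by query position
    best = [a[2] for a in anchors]            # best chain score ending at i
    prev = [-1] * len(anchors)
    for i, (qi, ri, li) in enumerate(anchors):
        for j, (qj, rj, lj) in enumerate(anchors[:i]):
            # anchor j must end before anchor i starts, on both coordinates
            if qj + lj <= qi and rj + lj <= ri and best[j] + li > best[i]:
                best[i], prev[i] = best[j] + li, j
    # backtrack from the best chain end point
    end = max(range(len(anchors)), key=lambda t: best[t])
    out = []
    while end != -1:
        out.append(anchors[end])
        end = prev[end]
    return out[::-1]

# Two colinear anchors plus one off-diagonal (repeat-induced) hit.
print(chain([(0, 100, 10), (20, 120, 10), (25, 500, 10)]))
```

The off-diagonal anchor is dropped because no increasing chain can include it together with the other two, which is exactly how chaining filters repeat hits before the expensive banded DP.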
DALIGNER
DALIGNER (Myers 2014) is specifically designed for finding overlaps between noisy long reads, though its concept can also be adopted for a generic long read aligner, as implemented in DAMAPPER (https://github.com/thegenemyers/DAMAPPER). Like BLASR, DALIGNER performs filtering based on short exact matches. Instead of using the BWT (FM index), it explicitly processes k-mers within reads by a threadable and cache-coherent implementation of radix sort. Detected k-mers are then compared via block-wise merge sort, which reduces the memory footprint to a constant depending only on the block size. To generate local alignments, it applies the O(ND) diff algorithm between two candidate reads (Myers 1986). DALIGNER achieved a 22- to 39-fold speedup over BLASR at higher sensitivity in detecting correct overlaps (Myers 2014). DALIGNER is intended as the read-overlap component (together with DAMASKER for repeat masking, DASCRUBBER for cleaning up low-quality regions, and a core module for assembly) of the DAZZLER de novo assembler for long noisy reads, to be released in the future.
Minimap2
Minimap2 (Li 2017) is one of the latest state-of-the-art alignment programs. Minimap2 is a general-purpose aligner in that it can align short reads, noisy long reads, and reads from transcripts (cDNA) back to a reference genome. Minimap2 combines several algorithmic ideas developed in the field, such as locality-sensitive hashing as in Minimap and MHAP. To account for possible SVs between reads and genome, it employs a concave gap cost as in NGMLR, computed efficiently using the formulation proposed by Suzuki & Kasahara (2017). In addition to these features, the authors further optimized the algorithm by transforming the DP matrix from row-column coordinates to diagonal-antidiagonal coordinates for better concurrency on modern processors. According to its author, Minimap2 is intended to replace BWA-MEM, which is in turn a widely used extension of BWA-SW.
De novo Assembly
As the Lander-Waterman theory (Lander & Waterman 1988) asserts, longer input reads are quite essential for achieving a high-quality genome assembly of repetitive genomes. Therefore, developing a de novo assembler for long reads is naturally the most active area in the field of long read informatics.
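The classical Lander-Waterman prediction is easy to evaluate numerically. Note that the model ignores repeats entirely, which is precisely the effect that makes read length matter in practice; the parameters below are illustrative:

```python
# Expected number of islands (apparent contigs) under the Lander-Waterman
# model: N reads of length L over a genome of length G give coverage
# c = N*L/G; with minimum detectable overlap T and theta = T/L, the
# expected island count is N * exp(-c * (1 - theta)).
import math

def expected_contigs(G, L, N, T):
    c = N * L / G
    theta = T / L
    return N * math.exp(-c * (1 - theta))

G, L, T = 3e9, 10_000, 1_000       # 3 Gbp genome, 10 kbp reads, 1 kbp overlap
for c in (1, 2, 5, 10):
    N = int(c * G / L)             # number of reads needed for coverage c
    print(f"{c}x coverage -> ~{expected_contigs(G, L, N, T):,.0f} contigs")
```

The contig count drops exponentially with coverage; real genomes fall short of this ideal exactly where repeats longer than the reads break the model's assumptions.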
To our knowledge, almost all assemblers published for long reads take an overlap-layout-consensus (OLC) approach, where the overall task of assembly is divided into three steps. (1. Overlap) Overlaps between reads are identified as candidate pairs representing the same genomic regions, and an overlap graph is constructed to express these relations. (2. Layout) The graph is transformed to generate linear contigs. This step often starts by constructing the string graph (Myers 2005), a string-labeled graph which encodes all the information in the observed reads, and eliminates edges containing redundant information. (3. Consensus) The final assembly is polished. To eliminate errors in contigs, a consensus is taken among the reads making up the contigs.
Though we do not cover tools for the consensus step here, many have been released to date, including the official Quiver and Arrow bundled in SMRT Analysis (https://github.com/PacificBiosciences/GenomicConsensus), another official tool pbdagcon (https://github.com/PacificBiosciences/pbdagcon), Racon (Vaser et al. 2017), and MECAT (Xiao et al. 2017). Of note, the quality of a polished assembly can be much better than that of a short-read-based assembly due to the randomness of sequencing errors in long reads (Chin et al. 2013; Myers 2014).
FALCON
FALCON (Chin et al. 2016) is designed as a diploid-aware de novo assembler for long reads. It starts by carefully taking a consensus among the reads to eliminate sequencing errors while retaining heterozygous variants which can distinguish the two homologous chromosomes (FALCON-sense). For constructing a string graph, FALCON runs DALIGNER. The resulting graph contains "haplotype-fused" contigs and "bubbles" reflecting variations between the two homologous chromosomes. Finally, FALCON-unzip tries to resolve such regions by phasing the associated long reads and local re-assembly. The resulting contigs are called "haplotigs", which are supposed to be faithful representations of the individual alleles in the diploid genome.
Canu (& MHAP)
MHAP (Berlin et al. 2015) (MinHash Alignment Process) utilizes MinHash for efficient dimensionality reduction of the read space. In MinHash, H hash functions are randomly selected, each of which maps a k-mer to an integer. For a given read of length L, only the minimum value over the read is recorded for each of the H hash functions. The k-mers at which the minima are attained are called min-mers, and the resulting representation is called a sketch. The sketch serves as a locality-sensitive hashing of each read, for similar sequences are expected to share similar sketches. Because the sketch retains data only on the H min-mers, its size is fixed to H, independent of the read length L.
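A minimal MinHash sketch can be written in a few lines. The salted hash family, the toy sequences, and the parameters k=8, H=64 are all illustrative choices, not MHAP's:

```python
# MinHash sketch of a read: for each of H hash functions, keep only the
# minimum hash value over the read's k-mers. The sketch has fixed size H,
# independent of read length; the fraction of shared minima estimates the
# k-mer (Jaccard) similarity between two reads.
import hashlib

def kmers(seq, k):
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def h(kmer, salt):
    """One member of a salted family of hash functions."""
    data = f"{salt}:{kmer}".encode()
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")

def sketch(seq, k=8, H=64):
    ks = kmers(seq, k)
    return [min(h(m, salt) for m in ks) for salt in range(H)]

def similarity(s1, s2):
    return sum(a == b for a, b in zip(s1, s2)) / len(s1)

a = "ACGTACGGTACGATTACAGGCATCGATCGGATC"
b = a[:25] + "TTTTTTTT"                     # shares only a prefix with a
print(similarity(sketch(a), sketch(a)))     # identical reads -> 1.0
print(similarity(sketch(a), sketch(b)))     # partial overlap -> below 1.0
```

Comparing two sketches costs O(H) regardless of read length, which is what makes all-versus-all overlap filtering tractable.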
Built on top of MHAP, Canu (Koren et al. 2017) extends the best overlap graph (BOG) algorithm (Miller et al. 2008) for generating contigs. A new "bogart" algorithm estimates an optimal overlap error rate instead of using a predetermined one as in the original BOG algorithm. This requires multiple rounds of read and overlap error correction, but it eventually enables separating repeats that have diverged by only 3%. Though the BOG algorithm is greedy, the effect is mitigated in Canu by also inspecting non-best overlaps to avoid potential misassemblies.
HINGE
While there is no doubt that obtaining a more contiguous (i.e., higher contig N50) assembly is a major goal in genome assembly, the quest for longer N50 alone may cause misassemblies if the strategy gets too greedy. Aware of that danger, HINGE (Kamath et al. 2017) aims to perform the optimal resolution of repeats in assembly, in the sense that a repeat should be resolved if and only if its resolution is supported by the available long read data. Implementing such a strategy is rather straightforward for de Bruijn graphs. In a de Bruijn graph, nodes representing k-mers are connected by edges when they co-occur next to each other in reads. In the ideal situation, the genome assembly is realized as an Eulerian path, i.e., a trail which visits every edge exactly once, in the de Bruijn graph. However, de Bruijn graphs are not robust to noisy long reads, so overlap graphs are usually preferred for long reads. One of the key motivations of HINGE is to bring this desirable property of de Bruijn graphs to overlap graphs, which are more error-resilient. To do so, HINGE enriches the string graph with additional information called "hinges" based on the result of the read overlap step. Then, an assembly graph with optimal repeat resolution can be constructed via a hinge-aided greedy algorithm.
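The de Bruijn graph / Eulerian path idea can be made concrete on error-free toy reads; the reads and k=4 below are illustrative, and deduplicating k-mers keeps the example linear:

```python
# Tiny de Bruijn graph: nodes are (k-1)-mers, one edge per distinct k-mer
# observed in the reads. On error-free data the assembly is an Eulerian
# path; the ~15% error rate of raw SMRT reads would flood the graph with
# spurious k-mers, which is why overlap graphs are preferred for long reads.
from collections import defaultdict

def de_bruijn(reads, k):
    """Edges of the de Bruijn graph, one per distinct k-mer."""
    kmer_set = {r[i:i + k] for r in reads for i in range(len(r) - k + 1)}
    edges = defaultdict(list)
    for kmer in kmer_set:
        edges[kmer[:-1]].append(kmer[1:])   # (k-1)-mer -> (k-1)-mer
    return edges

def eulerian_path(edges):
    """Hierholzer's algorithm; assumes a unique start node exists."""
    g = {u: list(vs) for u, vs in edges.items()}
    indeg = defaultdict(int)
    for vs in g.values():
        for v in vs:
            indeg[v] += 1
    start = next(u for u in g if len(g[u]) > indeg[u])
    stack, path = [start], []
    while stack:
        u = stack[-1]
        if g.get(u):
            stack.append(g[u].pop())
        else:
            path.append(stack.pop())
    return path[::-1]

reads = ["ACGTTG", "GTTGCAA", "GCAAT"]   # error-free fragments of ACGTTGCAAT
path = eulerian_path(de_bruijn(reads, 4))
print(path[0] + "".join(p[-1] for p in path[1:]))   # -> ACGTTGCAAT
```

A single wrong base in any read would add branching edges and destroy the clean Eulerian structure, illustrating the fragility that HINGE works around.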
Miniasm (& Minimap)
Minimap (Li 2016) adopts a similar idea to MHAP: it uses minimizers to represent the reads compactly. Specifically, Minimap uses the concept of a (w,k)-minimizer, which is the smallest (in hashed value) k-mer among w consecutive k-mers. To perform mapping, Minimap searches for colinear sets of minimizers shared between sequences. Miniasm (Li 2016), an associated assembly module, generates an assembly graph without error correction. It first filters low-quality reads (chimeric or with untrimmed adapters), constructs the graph greedily, and then cleans up the graph with several heuristics, such as popping small bubbles and removing shorter overlaps.
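The (w,k)-minimizer selection itself fits in a few lines; here Python's built-in hash() stands in for Minimap's invertible integer hash, and the sequence and window parameters are illustrative:

```python
# (w,k)-minimizers: for every window of w consecutive k-mers, keep the
# k-mer with the smallest hash value. Nearby windows usually share their
# minimizer, so the selected set is much sparser than all k-mers.
def minimizers(seq, w, k):
    km = [(hash(seq[i:i + k]), i) for i in range(len(seq) - k + 1)]
    picked = set()
    for j in range(len(km) - w + 1):
        picked.add(min(km[j:j + w]))       # smallest (hash, pos) per window
    return sorted(pos for _, pos in picked)

seq = "ACGTACGGTACGATTACAGGCATCGATCGGATC"
pos = minimizers(seq, w=5, k=8)
print(len(pos), "minimizers vs", len(seq) - 8 + 1, "k-mers")
```

Because two sequences sharing a long enough exact stretch are guaranteed to pick the same minimizer inside it, matching minimizers is a sound (and compact) substitute for matching all k-mers.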
Detection of Structural Variants (SVs)
Sequence variants are called structural when they are explained by mechanisms involving double-strand breaks, and for the sake of convenience they are often defined as variants larger than a certain size (e.g., 50 bp). They are categorized into several classes such as insertions/deletions (including presence/absence of transposons), inversions, (segmental) duplications, tandem repeat expansions/contractions, etc. While some classes of SVs are notoriously difficult to detect via short reads (especially long inversions and insertions), long reads promise to detect more of them by capturing entire structural events within sequencing reads.
PBHoney
PBHoney (English, Salerno & Reid 2014) implements a combination of two methods for detecting SVs via read alignment to a reference sequence. First, PBHoney exploits the fact that the alignment of reads by BLASR should be interrupted (giving soft-clipped tails) at the breakpoints of SV events. PBHoney detects such interrupted alignments (piece-alignments) and clusters them to identify individual SV events. Second, PBHoney locates SVs by examining genomic regions with an anomalously high error rate. Such a large discordance can signal the presence of SVs because sequencing errors within PacBio reads are supposed to be distributed rather randomly.
Sniffles (& NGMLR)
NGMLR (Sedlazeck et al. 2017) is a long-read aligner designed for SV detection, which uses two distinct gap extension penalties for different size ranges of gaps (i.e., a concave gap penalty) to align entire reads over regions with SVs. Intuitively, the concave gap penalty is designed so that it can allow longer gaps in an alignment while shorter gaps are penalized as sequencing errors. Adopting such a complicated scoring scheme makes the alignment process computationally intensive (Miller et al. 1988), but NGMLR introduces heuristics to perform faster alignment. Then, Sniffles, an associated tool for detecting SVs, scans the read alignments to report putative SVs, which are then clustered to identify individual events and evaluated by various criteria. Optionally, Sniffles can infer the genotypes (homozygous or heterozygous) of detected variants, and can associate "nested SVs" which are supported by the same group of long reads.
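The two-piece concave gap penalty is simply the minimum of two affine penalties. The parameters below are illustrative, not NGMLR's or Minimap2's defaults:

```python
# Two-piece concave gap penalty: short gaps pay the usual affine cost
# (appropriate for sequencing errors), while beyond the crossover point the
# cheaper slowly-growing branch takes over, so an SV-sized gap of hundreds
# of bases remains affordable within a single alignment.
def gap_cost(length, open1=6, ext1=2, open2=24, ext2=1):
    # minimum of two affine functions => piecewise-linear and concave
    return min(open1 + ext1 * length, open2 + ext2 * length)

for g in (1, 5, 20, 100, 1000):
    print(g, gap_cost(g))
```

With a single affine penalty, a 1000 bp deletion would cost more than simply clipping the read; the concave cost keeps the whole read aligned across the event, which is what Sniffles relies on downstream.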
SMRT-SV
SMRT-SV (Huddleston et al. 2017) is an SV detection tool based on local assembly. It first maps long reads to the reference genome, against which SVs are called. Then it searches for signatures of SVs within the alignment results, and 60-kbp regions around the detected signatures are extracted. These regions are assembled locally from those reads using Canu, and SVs are then called by examining the alignment between the assembled contigs and the reference. Local assembly is performed for the other regions (without SV signatures) as well, to detect smaller variants.
Beyond DNA – Transcriptome Analysis and Methylation Analysis
SMRT sequencing has found applications outside DNA analysis as well. When applied to cDNA sequencing, long reads are expected to capture the entire structures of transcripts to elucidate expressed isoforms comprehensively.
IDP (Isoform Detection and Prediction) (Au et al. 2013) and IDP-ASE (Deonovic et al. 2017) are tools dedicated to analyzing long read transcriptome data. To detect expressed isoforms from long read transcriptome data, IDP formulates the problem in the framework of integer programming. To estimate allele-specific expression at both the gene level and the isoform level, IDP-ASE then solves a probabilistic model of observing each allele in short read RNA-seq. Both IDP and IDP-ASE effectively combine long read data, for detecting the overall structure of transcripts, with short read data, for accurate base-pair level information.
In methylation analysis, the official kineticsTools in SMRT Analysis has been widely used to detect base modification sites and to estimate sequence motifs for DNA modification (see (Flusberg et al. 2010) for the principle of detection). Detecting 5-methyl-cytosines (5mC), which is by far the dominant type of DNA modification in plants and animals, is challenging due to their subtle signal. Designed for detecting 5mC modifications in large genomes at practical sequencing depth, AgIn (Suzuki et al. 2016) exploits the observation that CpG methylation events in vertebrate genomes are correlated over neighboring CpG sites, and tries to assign binary methylation states to CpG sites based on the kinetic signals, under the constraint that a certain number of neighboring CpG sites should be in the same state. Making the most of the high mappability of long reads, AgIn has been applied to observe the diversified CpG methylation statuses of centromeric repeat regions in a fish genome (Ichikawa et al. 2017), and to observe allele-specific methylation events in human genomes.
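The flavor of such a constrained assignment can be conveyed with a small dynamic program. This is a simplified sketch, not the published AgIn algorithm: per-site scores and the minimum run length m are hypothetical, and a run of equal states must span at least m sites:

```python
# Simplified sketch of constrained binary state assignment: choose a state
# per CpG site from a per-site score (positive = evidence for methylated),
# requiring every run of equal states to contain at least m sites.
def assign_states(scores, m=3):
    """DP over (state, run-length) pairs; run length is capped at m, and a
    state switch is allowed only out of a completed run. Assumes
    len(scores) >= m."""
    NEG = float("-inf")
    dp = {(s, 1): (scores[0] if s else -scores[0]) for s in (0, 1)}
    back = [{}]
    for i in range(1, len(scores)):
        ndp, nback = {}, {}
        for s in (0, 1):
            gain = scores[i] if s else -scores[i]
            for r in range(1, m + 1):          # extend the current run
                prev = dp.get((s, r), NEG)
                nr = min(r + 1, m)
                if prev + gain > ndp.get((s, nr), NEG):
                    ndp[(s, nr)], nback[(s, nr)] = prev + gain, (s, r)
            prev = dp.get((1 - s, m), NEG)     # switch after a full run
            if prev + gain > ndp.get((s, 1), NEG):
                ndp[(s, 1)], nback[(s, 1)] = prev + gain, (1 - s, m)
        dp = ndp
        back.append(nback)
    key = max(((s, m) for s in (0, 1)), key=lambda t: dp.get(t, NEG))
    states = []
    for i in range(len(scores) - 1, 0, -1):
        states.append(key[0])
        key = back[i][key]
    states.append(key[0])
    return states[::-1]

# The weakly contradictory third site (-0.1) is absorbed into the
# surrounding methylated run instead of breaking it.
print(assign_states([2.0, 1.5, -0.1, 1.0, -2.0, -1.5, -1.0]))
```

The run-length constraint plays the role of the neighbor-correlation prior: isolated noisy kinetic signals cannot flip the state of a single site on their own.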
Concluding Remarks
We have briefly described some innovative ideas in bioinformatics for the effective use of long read data. As concluding remarks, let me mention a few prospects for future development in the field. By now, it is evident that the quest for complete genome assembly is almost done, but what remains is the most difficult part, such as extremely large repeats, centromeres, and telomeres. While many state-of-the-art assemblers take the presence of such difficult regions into account and can carefully generate high quality assemblies for the rest of the genome, it remains an open question how to tackle these difficult parts of the genome and how to resolve their sequence, rather than escaping from them.
Base modification analysis using PacBio sequencers may also have huge potential to distinguish several types of base modifications and to detect them simultaneously in the same sample (Clark et al. 2011), but only a limited number of modification types (6mA, 4mC, and 5mC) are considered for now. This is mainly due to the technical challenge of alleviating noise in the kinetics data so as to distinguish each type of modification, and unmodified bases, from one another.
That said, there is no doubt that the field will become more attractive than ever, as the use of long read sequencers becomes a daily routine in every area of biological research, or perhaps even in clinical practice.