Research on the Evaluation of Error-Correction Algorithms and the Identification of
Splice Sites Based on PacBio Sequencing Data

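Abstract
The emergence and growth of high-throughput sequencing have spawned many large-scale
genome sequencing projects, such as the international 1000 Genomes Project, the UK10K
project in the United Kingdom, and China's million-person genome sequencing program.
These projects have sequenced, or plan to sequence, millions of individuals, so the volume
of sequence data is growing exponentially. Genome sequencing supplies detailed baseline
data for studying human genetic information, explaining gene function and its links to
various diseases, and analyzing the pathogenesis of human disease. Against this background,
this work takes PacBio, currently the newest type of sequencing platform, as its subject,
concentrates on error-correction algorithms that target the high error rate of
third-generation sequencing reads, and designs and implements a high-performance
splice-site prediction method based on DNA sequencing data.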
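Read error correction and splice-site identification are two key modules in the analysis
of PacBio sequencing data. To address the inherently high base-calling error rate of
PacBio sequencing, this work first dissects the principles and underlying algorithmic
architectures of all the error-correction tools and systematically compares and evaluates
all current self-correction and hybrid-correction tools. Then, for the problem of
identifying splice sites in new eukaryotic genes, it combines multiple feature-generation
methods with machine learning to explore the macroscopic and microscopic patterns of gene
splicing, aiming at accurate splice-site prediction. The main research content and
contributions are described below: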
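First, for PacBio error-correction tools, building on an analysis of their correction
principles, this work proposes a systematic evaluation method and designs extensive
experiments that apply the existing self-correction and hybrid-correction tools uniformly
to public PacBio datasets of E. coli and yeast at different sequencing depths. The results
show that almost no single tool does well on all metrics related to performance,
efficiency, and impact on downstream analysis; each tool has its own strengths, weaknesses,
and suitable sequencing depths. Finally, an optimal selection strategy is given for each
correction tool, along with tool-selection schemes for datasets of different sequencing
depths. The metrics in this work can guide users in choosing a suitable error-correction
tool and indicate directions for developing new tools.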
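Second, for the problem of detecting splice sites in genomes, this work proposes a
machine-learning identification method based on multi-feature extraction, using the
sequence pattern information near canonical splice sites. The method first applies several
sequence feature-generation methods to capture the sequence patterns near canonical donor
and acceptor splice sites. It then models the splicing patterns with machine-learning
methods such as random forests and support vector machines to distinguish true splice
sites from false ones. The results show that multi-feature machine learning identifies
splice sites with high accuracy, with AUC reaching up to 0.904 for donor and acceptor
sites. The method can efficiently help researchers detect true splice sites and other
related functional sites in a genome, and it can support the annotation of new genes and a
clearer understanding of gene coding regions and structure.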
Keywords: PacBio; self-correction; hybrid correction; splice site; machine learning

Evaluation of Error-Correction Algorithms and Identification
of Splice Sites Based on PacBio Sequencing Data
Abstract
The emergence and development of high-throughput sequencing technology has given rise
to many large-scale genome sequencing projects, such as the international 1000 Genomes
Project, the UK10K project in the United Kingdom, and China's million-person genome
sequencing program. These projects have sequenced, or plan to sequence, millions of
individuals, leading to an exponential increase in the amount of sequencing data. Genome
sequencing provides detailed baseline data for studying human genetic information,
interpreting gene function, identifying associations with various diseases, and analyzing
the pathogenesis of human diseases. In this context, this work takes PacBio, one of the
newest sequencing platforms, as its subject. It focuses on error-correction algorithms
that address the high error rate of third-generation sequencing reads, and it designs and
implements a high-performance splice-site prediction method based on DNA sequencing data.
Read error correction and splice-site identification are two key modules in PacBio
sequencing data analysis. To address the intrinsically high base-calling error rate of
PacBio sequencing, this work first analyzes the principles and underlying algorithmic
architecture of the error-correction tools, and then systematically compares and evaluates
the currently available self-correction and hybrid error-correction tools. Next, to
identify splice sites in new eukaryotic genes, it combines multiple feature-generation
methods with machine learning to explore the macroscopic and microscopic patterns of gene
splicing, with the goal of predicting splice sites accurately. The main research contents
and contributions of this work are as follows:
First, building on an analysis of how PacBio error-correction tools work, this work
proposes a systematic method for evaluating them and designs a large number of experiments
that apply the existing self-correction and hybrid error-correction tools uniformly to
public PacBio datasets of E. coli and S. cerevisiae at different sequencing depths. The
experimental results show that almost no tool performs well on every metric across
correction performance, efficiency, and impact on downstream analysis. Each tool has its
own advantages, disadvantages, and suitable range of sequencing depth. Finally, an optimal
usage strategy is given for each error-correction tool, together with tool-selection
schemes for datasets of different sequencing depths. The metrics in this work can serve as
a basis for users selecting an appropriate error-correction tool, and they indicate
directions for the future development of new tools.
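As a rough illustration of the kind of accuracy metric such an evaluation might include, the sketch below compares a read's error rate before and after correction. It is a minimal sketch under strong assumptions: the abstract does not list the actual metrics, the true source sequence of each read is taken as known (real pipelines would align reads to a reference genome), and the names `edit_distance`, `error_rate`, `compare_tools`, `tool_A`, and `tool_B` are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between a and b, computed one DP row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca from a
                curr[j - 1] + 1,           # insert cb into a
                prev[j - 1] + (ca != cb),  # match (cost 0) or substitute (cost 1)
            ))
        prev = curr
    return prev[-1]


def error_rate(read: str, truth: str) -> float:
    """Edits per true base (assumed metric, not taken from the source)."""
    return edit_distance(read, truth) / len(truth)


def compare_tools(truth: str, raw_read: str, corrected: dict) -> dict:
    """Report each tool's residual error rate and its reduction versus the raw read."""
    raw = error_rate(raw_read, truth)
    report = {}
    for tool, read in corrected.items():
        rate = error_rate(read, truth)
        report[tool] = {"error_rate": rate, "reduction": raw - rate}
    return report


# Toy data: one true sequence, one noisy read, and two hypothetical tool outputs.
truth = "ACGTACGTAC"
raw_read = "ACGTTCGAC"
corrected = {"tool_A": "ACGTACGTAC", "tool_B": "ACGTACGAC"}
for tool, stats in compare_tools(truth, raw_read, corrected).items():
    print(tool, stats)
```

A full evaluation would aggregate such per-read values over a whole dataset and report them alongside run time, memory use, and downstream results, repeated for each sequencing depth.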
Second, for the problem of detecting splice sites in genomes, this work proposes a
machine-learning identification method based on multi-feature extraction of the sequence
pattern information near canonical splice sites. The method first uses a variety of
sequence feature-generation methods to capture the sequence patterns and physicochemical
properties in the vicinity of canonical donor and acceptor splice sites. It then models
these splicing patterns with random forests and support vector machines to distinguish
true splice sites from false ones. The experimental results show that multi-feature
machine learning identifies splice sites with high accuracy, with AUC reaching up to 0.904
for donor and acceptor sites. The method can help researchers efficiently and accurately
detect true splice sites and other related functional sites in a genome, and it can
support the annotation of new genes and a clearer understanding of the coding regions and
structures of genes.
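The feature-generation and AUC steps described above can be sketched as follows. This is only an illustrative sketch: the abstract does not name the specific features, so one-hot encoding and dinucleotide frequencies are assumed stand-ins, and `one_hot`, `kmer_freq`, `site_features`, and `auc` are hypothetical helper names. The classifier itself (a random forest or SVM in the source) is omitted; it would be trained on the vectors produced here.

```python
from itertools import product

BASES = "ACGT"


def one_hot(window: str) -> list:
    """Position-specific encoding: 4 values per base; non-ACGT bases become all zeros."""
    vec = []
    for base in window.upper():
        vec.extend(1.0 if base == b else 0.0 for b in BASES)
    return vec


def kmer_freq(window: str, k: int = 2) -> list:
    """Relative frequency of every k-mer over ACGT, in lexicographic order."""
    kmers = ["".join(p) for p in product(BASES, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    window = window.upper()
    for i in range(len(window) - k + 1):
        kmer = window[i:i + k]
        if kmer in counts:  # skip k-mers containing ambiguous bases
            counts[kmer] += 1
    total = sum(counts.values())
    return [counts[m] / total if total else 0.0 for m in kmers]


def site_features(window: str, k: int = 2) -> list:
    """Concatenate the two feature families into one vector per candidate site."""
    return one_hot(window) + kmer_freq(window, k)


def auc(scores: list, labels: list) -> float:
    """Chance that a random true site outscores a random false site (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy usage: a 9-base window around a candidate GT donor dinucleotide.
window = "AAGGTAAGT"
print(len(site_features(window)))
print(auc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0]))
```

Concatenating position-specific and composition features is one simple way to combine several feature types; the pairwise AUC shown here is quadratic in the number of examples and is meant for clarity rather than speed.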
Keywords: PacBio; Self-correction; Hybrid error-correction; Splice site; Machine
learning
