Computational methods for analysis of single molecule sequencing data
Author:
Date created: 2020-03-26
Identifier: etd20811
Keywords:
Computational biology; Single-molecule sequencing; PacBio; Oxford Nanopore; Long read mapping; Hybrid error correction; Hybrid assembly
Abstract:
Next-generation sequencing (NGS) technologies paved the way to a significant increase in the number of sequenced genomes, both prokaryotic and eukaryotic. This increase provided an opportunity for considerable advancement in genomics and precision medicine. Although NGS technologies have proven their power in many applications, such as de novo genome assembly and variation discovery, computational analysis of the data they generate is still far from perfect. The main limitation of NGS technologies is their short read length relative to the lengths of (common) genomic repeats. Today, newer sequencing technologies (known as single-molecule sequencing, or SMS), such as Pacific Biosciences and Oxford Nanopore, produce significantly longer reads, making it theoretically possible to overcome the difficulties imposed by repeat regions. For instance, for the first time, a complete human chromosome was fully assembled using ultra-long reads generated by Oxford Nanopore. Unfortunately, long reads generated by SMS technologies are characterized by a high error rate, which prevents their direct use in many standard downstream analysis pipelines and poses new computational challenges. This motivates the development of new computational tools specifically designed for SMS long reads. In this thesis, we present three computational methods tailored for SMS long reads. First, we present lordFAST, a fast and sensitive tool for mapping noisy long reads to a reference genome. Mapping sequenced reads to their potential genomic origin is the first fundamental step in many computational biology tasks. As an example, we show in this thesis that lordFAST can be successfully employed in structural variation discovery. Next, we present the second tool, CoLoRMap, which tackles the high rate of base-level errors in SMS long reads by providing a means to correct them using a complementary set of NGS short reads. This integrative use of SMS and NGS data is known as a hybrid technique. Finally, we introduce HASLR, an ultra-fast hybrid assembler that uses reads generated by both technologies to efficiently generate accurate genome assemblies. We demonstrate that, among the tested assemblers, HASLR is not only the fastest but also the one with the lowest number of misassemblies on all the samples. Furthermore, the contiguity and accuracy of the generated assemblies are on par with those of the other tools on most of the samples.
Document type:
Thesis
Rights:
This thesis may be printed or downloaded for non-commercial research and scholarly purposes. Copyright remains with the author.
Supervisor(s):
Binay Bhattacharya
S. Cenk Sahinalp; Cedric Chauve; Faraz Hach
Department:
Applied Sciences: School of Computing Science
Thesis type:
(Thesis) Ph.D.