Genome Assembly

wangchuang2017

Article link: https://blog.csdn.net/u010608296/article/details/113879829

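Genome assembly methods

De novo means "from the beginning": the assembly is built from scratch, with no reference genome to guide it.

Three classes of computational methods for de novo assembly:

1. Greedy algorithm: performs poorly on sequences that contain repeats

Shortest common superstring (SCS): the shortest sequence that contains every k-mer of the original sequence S

However, because the greedy algorithm always goes for the shortest sequence or the largest overlap, it tends to "swallow" repeated regions.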
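The repeat problem described above can be seen in a small Python sketch. This is only an illustration under simplifying assumptions: the function names `overlap` and `greedy_assemble` are made up for this example, identical reads are deduplicated first, ties are broken by scan order, and the toy genome `CAAAAAAG` is invented.

```python
def overlap(a, b, min_len=1):
    """Length of the longest suffix of a that is also a prefix of b (0 if none)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len=1):
    """Repeatedly merge the pair of reads with the largest overlap."""
    reads = list(dict.fromkeys(reads))  # drop identical reads, keep order
    while len(reads) > 1:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_len:  # first best pair wins on ties
                        best_len, best_i, best_j = n, i, j
        if best_len == 0:
            break  # nothing overlaps any more
        merged = reads[best_i] + reads[best_j][best_len:]
        reads = [r for k, r in enumerate(reads) if k not in (best_i, best_j)]
        reads.append(merged)
    return reads


genome = "CAAAAAAG"  # toy genome with a run of six A's
k = 4
kmers = [genome[i:i + k] for i in range(len(genome) - k + 1)]
contigs = greedy_assemble(kmers)
print(contigs)  # ['CAAAAG']
```

The result contains every 4-mer of the toy genome, yet it is two bases shorter than the genome itself: the run of A's has been collapsed, which is the "swallowed repeat" effect.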

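2. Overlap Layout Consensus (OLC): slow; used for Sanger sequencing

3. de Bruijn graph: fast and accurate; most NGS assemblers currently use this method

Each read is split into subsequences of k bases (k-mers), and consecutive k-mers overlap by k-1 bases

Key points: linear structure, Eulerian path, handling of special structures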
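A minimal Python sketch of the k-mer splitting and Eulerian-path idea is given here. It is not how any particular assembler is implemented. The function names are made up for this example, the reads are invented, and the sketch assumes error-free reads, ignores reverse complements and k-mer multiplicity, and assumes that an Eulerian path exists.

```python
from collections import defaultdict


def kmers(seq, k):
    """All substrings of length k, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def de_bruijn(reads, k):
    """Nodes are (k-1)-mers; each distinct k-mer adds one edge prefix -> suffix."""
    graph = defaultdict(list)
    for kmer in sorted({km for read in reads for km in kmers(read, k)}):
        graph[kmer[:-1]].append(kmer[1:])
    return graph


def eulerian_path(graph):
    """Hierholzer's algorithm (assumes an Eulerian path exists)."""
    out_deg = {u: len(vs) for u, vs in graph.items()}
    in_deg = defaultdict(int)
    for vs in graph.values():
        for v in vs:
            in_deg[v] += 1
    nodes = set(out_deg) | set(in_deg)
    # Start at the node with one more outgoing than incoming edge, if any.
    start = next(
        (n for n in sorted(nodes) if out_deg.get(n, 0) - in_deg.get(n, 0) == 1),
        min(nodes),
    )
    edges = {u: list(vs) for u, vs in graph.items()}  # work on a copy
    stack, path = [start], []
    while stack:
        u = stack[-1]
        if edges.get(u):
            stack.append(edges[u].pop())
        else:
            path.append(stack.pop())
    return path[::-1]


def spell(path):
    """Turn a path of (k-1)-mers back into a sequence."""
    return path[0] + "".join(node[-1] for node in path[1:])


reads = ["ATGCG", "TGCGT", "GCGTAC"]  # invented overlapping reads
graph = de_bruijn(reads, k=4)
print(spell(eulerian_path(graph)))  # ATGCGTAC
```

In this toy case every (k-1)-mer occurs once, so the graph is a single linear path. When a repeat is longer than k-1 bases the graph branches and more than one Eulerian path can exist, which is why the choice of k matters.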

Assemblers such as Velvet, SPAdes, and IDBA-UD all use the de Bruijn graph approach

Some structures can be simplified or removed, for example by removing tips and bubbles
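Tip removal can be illustrated with a short Python sketch. A tip is treated here as a short dead-end branch hanging off a branching node, typically caused by a sequencing error near the end of a read. The names `build_graph` and `remove_tips`, the length threshold, and the reads are assumptions made for this example. Real assemblers also take coverage into account and handle bubbles, both of which are left out.

```python
from collections import defaultdict


def build_graph(reads, k):
    """de Bruijn graph as {(k-1)-mer: set of successor (k-1)-mers}."""
    succ = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            succ[kmer[:-1]].add(kmer[1:])
            succ.setdefault(kmer[1:], set())  # keep sink nodes as keys too
    return succ


def remove_tips(succ, max_tip_len):
    """Cut dead-end chains of at most max_tip_len nodes at their branching node."""
    pred = defaultdict(set)
    for u, vs in succ.items():
        for v in vs:
            pred[v].add(u)
    dead_ends = [n for n in succ if not succ[n]]
    for end in dead_ends:
        chain = [end]  # nodes of the candidate tip, walking backwards
        while len(chain) <= max_tip_len:
            node = chain[-1]
            if len(pred[node]) != 1:
                break  # no unique predecessor: not a simple tip
            parent = next(iter(pred[node]))
            if len(succ[parent]) > 1:
                # parent branches, so the short chain is a tip: remove it
                succ[parent].discard(node)
                for n in chain:
                    del succ[n]
                break
            chain.append(parent)
    return succ


# The second read ends in an erroneous base (T instead of C).
reads = ["ATGCGTACCTGA", "GCGTAT"]
graph = remove_tips(build_graph(reads, k=4), max_tip_len=2)
print(sorted(graph["GTA"]))  # ['TAC']
```

The erroneous node `TAT` is removed, while the true end of the sequence (`TGA`) survives because the chain leading to it is longer than the threshold.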

Figure: relationship between the de Bruijn graph of the E. coli genome and the sequencing error rate

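Sequencing depth and coverage:

Sequencing depth: the ratio of the total number of sequenced bases to the size of the target genome. For example, if the E. coli genome is taken to be 4 Mbp and sequencing yields 40 Mbp of reads, the sequencing depth is 10X.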
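The depth calculation can be written as a couple of lines of Python. The function names are made up for this sketch, and the read count and read length in the second call (400,000 reads of 100 bp) are assumed values chosen only so that they add up to the 40 Mbp in the example.

```python
def sequencing_depth(total_bases, genome_size):
    """Depth = total sequenced bases / genome size."""
    return total_bases / genome_size


def depth_from_reads(n_reads, read_length, genome_size):
    """Same ratio, starting from a read count and a fixed read length."""
    return sequencing_depth(n_reads * read_length, genome_size)


print(sequencing_depth(40_000_000, 4_000_000))       # 10.0
print(depth_from_reads(400_000, 100, 4_000_000))     # 10.0
```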

Coverage > 80% is enough to produce a draft genome

A draft genome requires a sequencing depth of 30X

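Example (a sample containing abundant, moderately abundant, and rare species):

1. Number of reads obtained: abundant > moderately abundant > rare

2. Read depth or coverage: abundant > moderately abundant > rare

3. The required depth determines the sequencing throughput: to obtain a draft genome of C (here the rare species, which needs depth >= 30X),

sequencing throughput (total bases) = genome size of the rare species (in bases) * 30 / rare%, where rare% is the share of the sample made up by the rare species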
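The throughput formula can be turned into a small Python helper. The function name and the numbers (a 5 Mbp genome that makes up 1% of the sample) are hypothetical, and the sketch assumes that the share of sequenced bases coming from a species equals its abundance in the sample.

```python
def required_throughput(genome_size, target_depth, abundance_percent):
    """Total bases to sequence so that one species reaches target_depth.

    abundance_percent is the share of the sample (in percent) that
    belongs to the species of interest.
    """
    return genome_size * target_depth * 100 / abundance_percent


# Hypothetical rare species: 5 Mbp genome, 1% of the sample, 30X wanted.
total = required_throughput(5_000_000, 30, 1)
print(f"{total / 1e9:.0f} Gbp")  # 15 Gbp
```

The rarer the species, the more the whole sample has to be sequenced: at 1% abundance, a 30X draft of a 5 Mbp genome needs 100 times the 150 Mbp that would be enough for a pure culture.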

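scaffold = contigs + gaps

Scaffold assembly relies mainly on:

sequence alignment against the genomes of known species

paired-read sequencing data, which also provide a lot of information for gap filling, although many gaps still remain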

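A collection of commonly used bioinformatics tools

Filtering

SOAPnuke: a filtering tool for FASTQ files, developed in-house by BGI.

HTSeq-count: a lightweight tool for counting reads. According to its author it works with the output of many mapping tools; I use it to count reads in TopHat2 output. It seems that any output that can be converted to SAM format can be counted with htseq-count.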

RSeQC: An RNA-seq Quality Control Package

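Alignment

BWA: the most widely used aligner; it can align both second-generation and third-generation reads

SOAP: an aligner developed by BGI; full name SOAPaligner/soap2

bowtie2: commonly used for aligning RNA-seq reads

BLASR: designed specifically for aligning third-generation reads

PyNAST: a multiple sequence alignment tool, mainly used for 16S sequences

FastTree: a very fast tree-building tool that can handle on the order of 1M sequences at once; mainly used for building 16S trees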

Data processing

SAMtools: a toolkit for working with the SAM and BAM formats. SAM (Sequence Alignment/Map) format is a generic format for storing large nucleotide sequence alignments.

Picard: a set of command line tools for manipulating high-throughput sequencing (HTS) data and formats such as SAM/BAM/CRAM and VCF.

VCFtools: a program package designed for working with VCF files, such as those generated by the 1000 Genomes Project.

bcftools: utilities for variant calling and manipulating VCFs and BCFs.

bedtools: a powerful toolset for genome arithmetic, allows one to intersect, merge, count, complement, and shuffle genomic intervals from multiple files in widely-used genomic file formats such as BAM, BED, GFF/GTF, VCF.

MAKER: an easy-to-use genome annotation pipeline designed for small research groups with little bioinformatics experience.

Resequencing

Reseqtools: A Toolkit for analyzing next-generation DNA Re-Sequencing data. A set of tools compiled internally at BGI.

Assembly

BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs

SOAPdenovo

Platanus

DBG2OLC

CANU

Falcon

HGAP

Variant detection

GATK: the toolkit offers a wide variety of tools with a primary focus on variant discovery and genotyping. The most commonly used tool for calling SNPs and indels.

BreakDancer: genome-wide detection of structural variants from next generation paired-end sequencing reads. A structural variant (SV) detection tool.

CREST: (Clipping Reveals Structure), a new algorithm for detecting genomic structural variations at base-pair resolution using next-generation sequencing data.

CNVnator: a tool for CNV discovery and genotyping from depth-of-coverage by mapped reads. Used to detect CNVs in human resequencing data.

PennCNV: a free software tool for Copy Number Variation (CNV) detection from SNP genotyping arrays.

MIDAS: (Metagenomic Intra-species Diversity Analysis System), An integrated pipeline for estimating strain-level genomic variation from metagenomic data (can call variants from metagenomes)

GWAS

PLINK: whole genome association analysis toolset

SSR analysis

MISA - MIcroSAtellite identification tool

SSRHunter - Simple Sequence Repeat Search tool

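Statistical methods

DMM: (Dirichlet multinomial mixtures), probabilistic modelling of microbial metagenomics data. Input: frequency_matrix.csv, in which each row is a taxon and each column gives its frequency in one sample. Output: the community analysis results. The mixture components cluster communities into distinct ‘metacommunities’, and, hence, determine envirotypes or enterotypes, groups of communities with a similar composition. In use, the method plays a role similar to a PCA of communities: communities with similar composition end up in the same group.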

RNA

cd-hit: a very widely used program for clustering and comparing protein or nucleotide sequences. Used for removing redundancy.

CPAT: using logistic regression model based on 4 pure sequence-based, linguistic features. Predicts the coding potential of RNA.

GMAP: A Genomic Mapping and Alignment Program for mRNA and EST Sequences. Dedicated to RNA alignment.