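Genome Assembly

Genome assembly methods

De novo: "from the beginning" (assembly from scratch), i.e. with no reference genome to guide it

Three classes of computational methods for de novo genome assembly: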

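1. Greedy algorithm: performs poorly on sequences that contain repeats

Shortest common superstring (SCS): the shortest sequence that contains every k-mer of the original sequence S

However, because the greedy algorithm always goes for the shortest sequence or the largest overlap, it tends to "swallow" (collapse) repeat regions.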
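The repeat-collapsing behaviour can be reproduced with a toy example. The sketch below is only an illustration, not the implementation of any particular assembler: the function names `overlap` and `greedy_scs` and the 7-base "genome" are made up for this note. It repeatedly merges the pair of reads with the largest suffix/prefix overlap and shows that the result is shorter than the sequence the reads came from.

```python
from itertools import permutations


def overlap(a, b, min_len=1):
    """Length of the longest suffix of a that equals a prefix of b."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_scs(reads, min_len=1):
    """Greedy approximation of the shortest common superstring."""
    # Identical reads add nothing to a superstring, so drop duplicates.
    reads = list(dict.fromkeys(reads))
    while len(reads) > 1:
        best_len, best_a, best_b = 0, None, None
        for a, b in permutations(reads, 2):
            n = overlap(a, b, min_len)
            if n > best_len:
                best_len, best_a, best_b = n, a, b
        if best_len == 0:
            break  # nothing overlaps any more
        reads.remove(best_a)
        reads.remove(best_b)
        reads.append(best_a + best_b[best_len:])
    return "".join(reads)


# Toy genome with a tandem repeat (AT repeated three times), read as 3-mers.
genome = "ATATATG"
reads = [genome[i:i + 3] for i in range(len(genome) - 2)]
assembled = greedy_scs(reads)
print(genome, "->", assembled)
```

The assembled string contains every read, but one copy of the AT repeat is gone, because the shortest superstring has no reason to keep it.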

2. Overlap Layout Consensus (OLC): time-consuming; used for Sanger sequencing

 

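3. de Bruijn graph: fast and accurate; most current NGS assemblers use this approach

Each read is split into subsequences of k bases (k-mers); consecutive k-mers overlap by k-1 bases

Key points: linear structure, Eulerian path, handling of special structures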
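A minimal sketch of the idea, under simplifying assumptions: error-free reads, each distinct k-mer used exactly once, and a graph that really has an Eulerian path. The helper names (`kmers`, `build_de_bruijn`, `eulerian_path`, `assemble`) and the example reads are hypothetical. Real assemblers also track k-mer multiplicity and clean up the graph first.

```python
from collections import defaultdict


def kmers(seq, k):
    """All substrings of length k, in order; neighbours overlap by k-1."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_de_bruijn(reads, k):
    """Nodes are (k-1)-mers; each distinct k-mer adds one edge prefix -> suffix."""
    graph = defaultdict(list)
    seen = set()
    for read in reads:
        for kmer in kmers(read, k):
            if kmer not in seen:
                seen.add(kmer)
                graph[kmer[:-1]].append(kmer[1:])
    return graph


def eulerian_path(graph):
    """Hierholzer's algorithm; assumes the graph has an Eulerian path."""
    in_deg = defaultdict(int)
    for targets in graph.values():
        for v in targets:
            in_deg[v] += 1
    nodes = set(graph) | set(in_deg)
    # Start at the node with one more outgoing than incoming edge, if any.
    start = next(
        (n for n in nodes if len(graph.get(n, [])) - in_deg.get(n, 0) == 1),
        None,
    )
    if start is None:
        start = next(iter(graph))  # circular case: any node will do
    edges = {u: list(vs) for u, vs in graph.items()}  # working copy
    stack, path = [start], []
    while stack:
        u = stack[-1]
        if edges.get(u):
            stack.append(edges[u].pop())
        else:
            path.append(stack.pop())
    return path[::-1]


def assemble(reads, k):
    """Spell the sequence along the Eulerian path."""
    path = eulerian_path(build_de_bruijn(reads, k))
    return path[0] + "".join(node[-1] for node in path[1:])


reads = ["ATGGCG", "GGCGTG", "CGTGCA"]
print(assemble(reads, 4))
```

In this example every 3-mer node is unique, so the graph is a single linear chain and the path is unambiguous. With repeats longer than k-1 the graph branches, and several Eulerian paths may exist.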

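Assemblers such as Velvet, SPAdes and IDBA-UD are all based on the de Bruijn graph approach

Some structures can be simplified or removed, for example by clipping tips and popping bubbles (remove tips and bubbles)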

 

Relationship between the de Bruijn graph of the E. coli genome and the sequencing error rate

 

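Sequencing depth and coverage:

Sequencing depth: the ratio of the total number of sequenced bases to the size of the target genome. For example, taking the E. coli genome size as 4 Mbp, 40 Mbp of reads gives a sequencing depth of 10X.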
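The depth definition above is plain arithmetic. A small sketch follows; the function names are made up here, and the 100 bp read length in the second call is an assumed value chosen only so that the totals match the 40 Mbp example.

```python
def sequencing_depth(total_bases, genome_size):
    """Depth = total sequenced bases / genome size."""
    return total_bases / genome_size


def depth_from_reads(n_reads, read_length, genome_size):
    """Same ratio, starting from read count and (uniform) read length."""
    return sequencing_depth(n_reads * read_length, genome_size)


# 40 Mbp of reads over a 4 Mbp genome.
print(f"{sequencing_depth(40_000_000, 4_000_000):.0f}X")

# Equivalent: 400,000 reads of 100 bp each (assumed read length).
print(f"{depth_from_reads(400_000, 100, 4_000_000):.0f}X")
```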

Coverage (the fraction of the genome covered by reads) > 80% is enough to form a draft genome

A draft genome requires a sequencing depth of about 30X

 

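Example:

1. Number of reads obtained: abundant > moderately abundant > rare

2. Sequencing depth or coverage (read depth or coverage): abundant > moderately abundant > rare

3. Choose the sequencing throughput from the required depth: to obtain a draft genome of C (which needs depth >= 30X),

sequencing throughput (total bases) = genome size of the rare species (in bases) * 30 / rare% (rare% being the relative abundance of the rare species in the sample)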
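The throughput formula can be checked with a quick calculation. The numbers below (a 4 Mbp genome at 1% relative abundance) are invented for illustration, and `required_throughput` is a hypothetical helper. It also assumes that the abundance is expressed as the fraction of sequenced bases that come from that species.

```python
def required_throughput(genome_size, target_depth, relative_abundance):
    """Total bases to sequence so that one species reaches target_depth.

    relative_abundance is a fraction in (0, 1], e.g. 0.01 for 1%.
    """
    if not 0 < relative_abundance <= 1:
        raise ValueError("relative_abundance must be in (0, 1]")
    return genome_size * target_depth / relative_abundance


# Hypothetical rare species: 4 Mbp genome, 1% of the sample, 30X wanted.
total = required_throughput(4_000_000, 30, 0.01)
print(f"{total / 1e9:.1f} Gbp")
```

The rarer the species, the more the whole sample has to be sequenced to reach draft-genome depth for it.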

 

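scaffold = contigs + gaps

Scaffold assembly relies mainly on:

sequence alignment against the genomes of known species

Paired-read sequencing data also provide a lot of information for gap filling, yet many gaps still remain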
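The relation scaffold = contigs + gaps can be pictured as ordered contigs joined by runs of `N`, the usual placeholder for unknown bases in FASTA files. The sketch below assumes that the contig order and the gap sizes are already known (in practice they are estimated, for instance from paired reads). `build_scaffold` and the sequences are made up for illustration.

```python
def build_scaffold(contigs, gap_sizes, gap_char="N"):
    """Join ordered contigs, padding each gap with gap_char."""
    if not contigs:
        raise ValueError("need at least one contig")
    if len(gap_sizes) != len(contigs) - 1:
        raise ValueError("need exactly one gap size between each contig pair")
    parts = [contigs[0]]
    for gap, contig in zip(gap_sizes, contigs[1:]):
        parts.append(gap_char * gap)  # unknown bases between two contigs
        parts.append(contig)
    return "".join(parts)


scaffold = build_scaffold(["ACGT", "GGCC", "TTAA"], [3, 5])
print(scaffold)
```

Gap filling then amounts to replacing some of those `N` runs with real sequence once reads that span them are found.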

 

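A collection of commonly used bioinformatics tools

Filtering

SOAPnuke: filtering software for fastq files, developed in-house by BGI.

HTSeq-count: a lightweight tool for counting reads. Its author says it works with the output of many mapping programs; I use it to count reads from tophat2 output. It seems that any output that can be converted to SAM format can be counted with htseq-count.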

RSeQC: An RNA-seq Quality Control Package

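Alignment

BWA: the most widely used aligner; it can align second-generation reads as well as third-generation reads

Soap: an aligner developed by BGI, full name SOAPaligner/soap2

bowtie2: commonly used for RNA-seq alignment

BLASR: dedicated to aligning third-generation reads

pynast: multiple sequence alignment software, mainly used for 16S sequences

FastTree: very fast tree-building software that can handle on the order of 1M sequences at once; mainly used to build 16S trees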

Data processing

SAMtools: dedicated to handling the SAM/BAM formats. SAM (Sequence Alignment/Map) format is a generic format for storing large nucleotide sequence alignments.

Picard: a set of command line tools for manipulating high-throughput sequencing (HTS) data and formats such as SAM/BAM/CRAM and VCF.

VCFtools: a program package designed for working with VCF files, such as those generated by the 1000 Genomes Project.

bcftools: utilities for variant calling and manipulating VCFs and BCFs.

bedtools: a powerful toolset for genome arithmetic, allows one to intersect, merge, count, complement, and shuffle genomic intervals from multiple files in widely-used genomic file formats such as BAM, BED, GFF/GTF, VCF.

MAKER: an easy-to-use genome annotation pipeline designed for small research groups with little bioinformatics experience.

Resequencing

Reseqtools: A Toolkit for analyzing next-generation DNA Re-Sequencing data. A set of tools compiled in-house at BGI.

Assembly

BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs

SOAPdenovo

Platanus

DBG2OLC

CANU

Falcon

HGAP

Variant detection

GATK: the toolkit offers a wide variety of tools with a primary focus on variant discovery and genotyping. The most commonly used tool for calling SNPs and indels.

BreakDancer: genome-wide detection of structural variants from next generation paired-end sequencing reads. A structural variation (SV) detection tool.

CREST (Clipping Reveals Structure): a new algorithm for detecting genomic structural variations at base-pair resolution using next-generation sequencing data.

CNVnator: a tool for CNV discovery and genotyping from depth-of-coverage by mapped reads. Used to detect CNVs in human resequencing data.

PennCNV: a free software tool for Copy Number Variation (CNV) detection from SNP genotyping arrays.

MIDAS (Metagenomic Intra-species Diversity Analysis System): an integrated pipeline for estimating strain-level genomic variation from metagenomic data (it can call variants on metagenomes).

GWAS

PLINK: whole genome association analysis toolset

SSR analysis

MISA - MIcroSAtellite identification tool

SSRHunter - Simple Sequence Repeat Search tool

Statistical methods

DMM (Dirichlet multinomial mixtures): probabilistic modelling of microbial metagenomics data. Input: frequency_matrix.csv, where each row is a taxon and each column gives its frequency in one sample. Output: the community analysis results. The mixture components cluster communities into distinct 'metacommunities', and, hence, determine envirotypes or enterotypes, groups of communities with a similar composition. In effect the method plays a role similar to a PCA of communities: it groups similar communities into one class.

RNA

cd-hit: a very widely used program for clustering and comparing protein or nucleotide sequences. Used to remove redundancy.

CPAT: using logistic regression model based on 4 pure sequence-based, linguistic features. Predicts the coding potential of RNA.

GMAP: A Genomic Mapping and Alignment Program for mRNA and EST Sequences. Dedicated to RNA alignment.

 
