Genome assembly methods
De novo: "from the beginning" (assembled from scratch), with no reference genome to guide the assembly
Three classes of computational methods for de novo genome assembly:
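1. Greedy algorithm: performs poorly on sequences that contain repeat regions
Shortest common superstring (SCS): the shortest string that contains every k-mer (read) of the original sequence S as a substring
Because the greedy algorithm always chases the largest overlap, and therefore the shortest result, it "swallows" copies of repeated regions.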
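The repeat problem can be reproduced with a toy example. The following is a minimal sketch of greedy SCS assembly, not the implementation of any particular assembler; the function names `overlap` and `greedy_scs` and the toy genome are made up for illustration. The genome contains four copies of the unit ACG, and the reads are all of its 5-mers.

```python
from itertools import permutations


def overlap(a, b, min_len=1):
    """Length of the longest suffix of a that is also a prefix of b."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_scs(reads, min_len=1):
    """Repeatedly merge the pair of strings with the largest overlap."""
    strings = list(dict.fromkeys(reads))  # drop duplicate reads, keep order
    while len(strings) > 1:
        best_len, best_a, best_b = 0, None, None
        for a, b in permutations(strings, 2):
            n = overlap(a, b, min_len)
            if n > best_len:
                best_len, best_a, best_b = n, a, b
        if best_len == 0:
            break  # nothing overlaps any more; concatenate what is left
        strings.remove(best_a)
        strings.remove(best_b)
        strings.append(best_a + best_b[best_len:])
    return "".join(strings)


if __name__ == "__main__":
    genome = "TTACGACGACGACGGG"  # hypothetical genome: ACG repeated 4 times
    k = 5
    reads = [genome[i:i + k] for i in range(len(genome) - k + 1)]
    assembly = greedy_scs(reads)
    print(len(genome), genome)
    print(len(assembly), assembly)
```

Every read is still a substring of the assembly, but the assembly is shorter than the genome because one copy of the repeat unit has been lost.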
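2. Overlap-Layout-Consensus (OLC): time-consuming; used for Sanger sequencing (and, more recently, for long third-generation reads)
3. de Bruijn graph: fast and accurate; currently the approach used by most NGS assemblers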
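Each read is split into subsequences of k bases (k-mers); consecutive k-mers overlap by k-1 bases
Key points: linear structure, Eulerian path, handling of special structures
Assemblers such as Velvet, SPAdes and IDBA-UD all use the de Bruijn graph approach
Some structures can be simplified or removed, for example by clipping tips and popping bubbles (remove tips and bubbles)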
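The k-mer splitting and the Eulerian path can be sketched as follows. This is a simplified illustration under several assumptions: reads are error-free, each distinct k-mer is used once, and no tip or bubble removal is done. The function names are hypothetical and do not come from Velvet, SPAdes or IDBA-UD.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Each distinct k-mer is an edge from its (k-1)-prefix to its (k-1)-suffix."""
    kmers = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmers.add(read[i:i + k])
    graph = defaultdict(list)
    for kmer in sorted(kmers):
        graph[kmer[:-1]].append(kmer[1:])
    return graph


def eulerian_path(graph):
    """Hierholzer's algorithm: walk every edge exactly once."""
    remaining = {u: list(vs) for u, vs in graph.items()}
    in_deg = defaultdict(int)
    for targets in remaining.values():
        for v in targets:
            in_deg[v] += 1
    nodes = set(remaining) | set(in_deg)
    if not nodes:
        return []
    # Start at the node with one more outgoing than incoming edge, if any.
    start = min(nodes)
    for node in sorted(nodes):
        if len(remaining.get(node, [])) - in_deg.get(node, 0) == 1:
            start = node
            break
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if remaining.get(node):
            stack.append(remaining[node].pop())
        else:
            path.append(stack.pop())
    return path[::-1]


def spell(path):
    """Turn a path of (k-1)-mers back into a sequence."""
    if not path:
        return ""
    return path[0] + "".join(node[-1] for node in path[1:])


if __name__ == "__main__":
    reads = ["ATGGCGT", "GGCGTGC", "GTGCA"]  # toy reads, k = 3
    graph = de_bruijn_graph(reads, 3)
    print(spell(eulerian_path(graph)))
```

Note that a graph can have more than one Eulerian path, so the spelled contig is one valid reconstruction rather than the only one; real assemblers also have to cope with sequencing errors, which is where tip and bubble removal come in.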
Figure: relationship between the de Bruijn graph of the E. coli genome and the sequencing error rate
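Sequencing depth and coverage:
Sequencing depth: the ratio of the total number of sequenced bases to the size of the target genome. For example, if the E. coli genome is taken to be 4 Mbp and sequencing yields 40 Mbp of reads, the sequencing depth is 10X.
Coverage > 80% is enough to form a draft genome
A draft genome requires a sequencing depth of 30X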
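The depth definition above is a single division, and it helps to see it next to coverage. In this small sketch the function names are hypothetical, and coverage is assumed to mean breadth of coverage, i.e. the fraction of genome positions covered by at least one read; the notes do not define it explicitly.

```python
def sequencing_depth(total_bases, genome_size):
    """Average depth: total sequenced bases divided by genome size."""
    return total_bases / genome_size


def breadth_of_coverage(per_base_depth):
    """Fraction of positions covered by at least one read (assumed definition)."""
    covered = sum(1 for depth in per_base_depth if depth > 0)
    return covered / len(per_base_depth)


if __name__ == "__main__":
    # The E. coli example from the notes: 40 Mbp of reads on a 4 Mbp genome.
    print(sequencing_depth(40_000_000, 4_000_000))
    # Toy per-base depths for a 5 bp region; two positions are uncovered.
    print(breadth_of_coverage([0, 3, 5, 0, 2]))
```

A high average depth does not guarantee high coverage, because reads may pile up in some regions and leave others empty.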
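Example (a sample containing an abundant, a moderately abundant and a rare species):
1. Number of reads obtained: abundant > moderately abundant > rare
2. Sequencing depth or coverage (read depth or coverage): abundant > moderately abundant > rare
3. The required depth determines the sequencing throughput: to obtain a draft genome of the rare species C (which needs depth >= 30X),
sequencing throughput (total bases) = genome size of the rare species (in bases) * 30 / rare%, where rare% is the proportion of the rare species in the sample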
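The throughput formula can be checked with a quick calculation. This is a sketch under the assumption that rare% means the fraction of all sequenced bases that come from the rare species; the function name and the example numbers (a 4 Mbp genome at 1% abundance) are made up for illustration.

```python
def required_throughput(genome_size, target_depth, relative_abundance):
    """Total bases to sequence so that one species reaches target_depth.

    relative_abundance is a fraction between 0 and 1 (assumed to be the
    share of sequenced bases that belong to the species of interest).
    """
    if not 0 < relative_abundance <= 1:
        raise ValueError("relative_abundance must be in (0, 1]")
    return genome_size * target_depth / relative_abundance


if __name__ == "__main__":
    # Hypothetical rare species: 4 Mbp genome, 1% of the sample, 30X wanted.
    total = required_throughput(4_000_000, 30, 0.01)
    print(f"{total / 1e9:.1f} Gbp")
```

The rarer the species, the more of the total output is spent on the abundant species before the rare one reaches 30X.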
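scaffold = contigs + gaps
Scaffold assembly relies mainly on:
alignment against the genomes of known species
paired-read sequencing, which also provides a lot of information for gap filling, although many gaps still remain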
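The relation scaffold = contigs + gaps can be made concrete with a short sketch. It assumes the contig order and the gap sizes are already known (for instance estimated from paired reads) and writes each gap as a run of N, which is the usual convention in scaffold FASTA files. The function name `build_scaffold` is hypothetical.

```python
def build_scaffold(contigs, gap_sizes, gap_char="N"):
    """Join ordered contigs, filling each estimated gap with gap_char."""
    if not contigs:
        return ""
    if len(gap_sizes) != len(contigs) - 1:
        raise ValueError("need exactly one gap size between each pair of contigs")
    parts = [contigs[0]]
    for gap, contig in zip(gap_sizes, contigs[1:]):
        parts.append(gap_char * gap)  # unknown bases between two contigs
        parts.append(contig)
    return "".join(parts)


if __name__ == "__main__":
    # Toy contigs with estimated gaps of 3 bp and 5 bp between them.
    print(build_scaffold(["ACGT", "GGCC", "TTAA"], [3, 5]))
```

Gap filling then amounts to replacing those N runs with real bases wherever reads span the gap.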
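Filtering
SOAPnuke: a filtering tool for FASTQ files, developed in-house by BGI.
HTSeq-count: a lightweight tool for counting reads. According to its author it works with the output of many mapping tools; I use it to count reads from tophat2 output. It seems that any output that can be converted to SAM format can be counted with htseq-count.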
RSeQC: An RNA-seq Quality Control Package
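Alignment
BWA: the most widely used aligner; handles both second-generation and third-generation reads
Soap: an aligner developed by BGI; full name SOAPaligner/soap2
bowtie2: commonly used in RNA-seq alignment (for example as the alignment engine underneath tophat2)
BLASR: designed specifically for aligning third-generation reads
pynast: a multiple sequence alignment tool, mainly used for 16S sequences
FastTree: a very fast tree-building tool that can handle on the order of 1M sequences at once; mainly used for building 16S trees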
Data processing
SAMtools: dedicated to processing SAM and BAM files. SAM (Sequence Alignment/Map) format is a generic format for storing large nucleotide sequence alignments.
Picard:a set of command line tools for manipulating high-throughput sequencing (HTS) data and formats such as SAM/BAM/CRAM and VCF.
VCFtools:a program package designed for working with VCF files, such as those generated by the 1000 Genomes Project.
bcftools: utilities for variant calling and manipulating VCFs and BCFs.
bedtools:a powerful toolset for genome arithmetic, allows one to intersect, merge, count, complement, and shuffle genomic intervals from multiple files in widely-used genomic file formats such as BAM, BED, GFF/GTF, VCF.
MAKER:an easy-to-use genome annotation pipeline designed for small research groups with little bioinformatics experience.
Resequencing
Reseqtools: A Toolkit for analyzing next-generation DNA Re-Sequencing data. A toolkit compiled internally at BGI.
Assembly
BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs
SOAPdenovo
Platanus
DBG2OLC
CANU
Falcon
HGAP
Variant detection
GATK: the toolkit offers a wide variety of tools with a primary focus on variant discovery and genotyping. The most commonly used tool for calling SNPs and indels.
BreakDancer: genome-wide detection of structural variants from next generation paired-end sequencing reads. A structural variation (SV) detection tool.
CREST: (Clipping Reveals Structure), a new algorithm for detecting genomic structural variations at base-pair resolution using next-generation sequencing data.
CNVnator: a tool for CNV discovery and genotyping from depth-of-coverage by mapped reads. Used for CNV detection in human resequencing.
PennCNV: a free software tool for Copy Number Variation (CNV) detection from SNP genotyping arrays.
MIDAS: (Metagenomic Intra-species Diversity Analysis System), an integrated pipeline for estimating strain-level genomic variation from metagenomic data (it can call variants on metagenomes).
GWAS
PLINK:whole genome association analysis toolset
SSR analysis
MISA - MIcroSAtellite identification tool
SSRHunter - Simple Sequence Repeat Search tool
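What an SSR (microsatellite) search does can be illustrated with a regular expression. This is only a rough sketch of the idea and is not how MISA or SSRHunter are implemented; the function name `find_ssrs` and the thresholds (unit length 2 to 6 bp, at least 4 copies) are illustrative assumptions, not the defaults of either tool.

```python
import re


def find_ssrs(seq, min_unit=2, max_unit=6, min_repeats=4):
    """Return (start, unit, copies) for each perfect tandem repeat found.

    Limitation of this sketch: a homopolymer run such as AAAAAAAA is
    reported as a repeat of the unit AA, because min_unit is 2.
    """
    # Lazy unit match so the shortest repeating unit is preferred,
    # followed by at least (min_repeats - 1) further copies of it.
    pattern = re.compile(
        r"([ACGT]{%d,%d}?)\1{%d,}" % (min_unit, max_unit, min_repeats - 1)
    )
    hits = []
    for match in pattern.finditer(seq.upper()):
        unit = match.group(1)
        hits.append((match.start(), unit, len(match.group(0)) // len(unit)))
    return hits


if __name__ == "__main__":
    # Toy sequence with a single (AC)5 microsatellite starting at index 3.
    print(find_ssrs("GGGACACACACACTTT"))
```

Dedicated tools additionally handle compound and imperfect repeats and usually apply a different minimum copy number for each unit length.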
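Statistical methods
DMM: (Dirichlet multinomial mixtures), probabilistic modelling of microbial metagenomics data. Input: frequency_matrix.csv, in which each row is a taxon and each column gives its frequency in one sample. Output: community-level clustering results. The mixture components cluster communities into distinct 'metacommunities', and, hence, determine envirotypes or enterotypes, groups of communities with a similar composition. In effect the method groups similar communities together, much as one would read groups off a PCA plot, although it is a model-based clustering rather than an ordination.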
RNA
cd-hit: a very widely used program for clustering and comparing protein or nucleotide sequences. Used to remove redundant sequences.
CPAT: using logistic regression model based on 4 pure sequence-based, linguistic features. Predicts the coding potential of RNA.
GMAP: A Genomic Mapping and Alignment Program for mRNA and EST Sequences. Dedicated to RNA alignment.