A Guide to Eukaryotic Genome Annotation

Original post, 2016-08-29 09:43:48

Preface


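   The rise of second-generation sequencing, and more recently third-generation single-molecule sequencing, has made high-quality genome assemblies increasingly easy to obtain, yet genome annotation still faces many challenges. One is gene finding: training gene models and choosing gene prediction and annotation software. Another is updating and merging gene annotations produced by different approaches. There is no perfect solution yet, but the now widely available RNA-seq data can go a long way toward calibrating gene models. Genome annotation cannot be done with a few mouse clicks, but many tools now exist to help us annotate genomes better.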


Genome assemblies

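  Before annotating a genome, assess the quality of the assembly to check whether it is good enough for annotation and can therefore yield reliable results. Three metrics measure assembly quality:
* Scaffold and contig N50s
* Percent gaps
* Percent coverage
   CEGMA offers another way to assess an assembly. It uses a curated set of highly conserved single-copy genes (genes that can be assumed to be present in every eukaryote), so the completeness of an assembly can be measured by counting how many of these genes are found in the current assembly version.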
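
Two of the assembly metrics listed above, N50 and percent gaps, are simple enough to compute by hand. The following is a minimal Python sketch, not the method of any particular assembler or QC tool. It assumes that gaps inside scaffolds are represented as runs of `N`, and the function names (`n50`, `split_contigs`, `percent_gaps`) are made up for this illustration.

```python
import re


def n50(lengths):
    """Return the N50 of a list of sequence lengths.

    N50 is the length L such that sequences of length >= L together
    contain at least half of the total assembled bases.
    """
    if not lengths:
        raise ValueError("need at least one sequence length")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


def split_contigs(scaffold):
    """Split a scaffold into contigs at runs of N (assumed gap character)."""
    return [piece for piece in re.split(r"[Nn]+", scaffold) if piece]


def percent_gaps(sequences):
    """Percentage of all bases that are N (assumed gap character)."""
    total = sum(len(seq) for seq in sequences)
    gaps = sum(seq.upper().count("N") for seq in sequences)
    return 100.0 * gaps / total


# Toy assembly: two tiny scaffolds, one of them containing a gap.
scaffolds = ["ACGTNNNNAC", "ACGTACGTAC"]
scaffold_n50 = n50([len(s) for s in scaffolds])
contig_n50 = n50([len(c) for s in scaffolds for c in split_contigs(s)])
gap_pct = percent_gaps(scaffolds)
```

Scaffold N50 is computed on the scaffold lengths as they are, while contig N50 is computed after breaking each scaffold at its gaps, so the contig value can never exceed the scaffold value.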


Genome annotation

A brief aside: how gene annotation relates to gene prediction

  • Gene predictors find the single most likely coding sequence (CDS) of a gene and do not report untranslated regions (UTRs) or alternatively spliced variants. Gene prediction is therefore a somewhat misleading term. A more accurate description might be ‘canonical CDS prediction’.

  • Gene annotations, conversely, generally include UTRs, alternative splice isoforms and have attributes such as evidence trails.

The figure shows a genome annotation and its associated evidence. Terms in parentheses are the names of commonly used software tools for assembling particular types of evidence. Note that the gene annotation (shown in blue) captures both alternatively spliced forms and the 5′ and 3′ UTRs suggested by the evidence. By contrast, the gene prediction that is generated by SNAP (shown in green) is incorrect as regards the gene’s 5′ exons and start-of-translation site and, like most gene predictors, it predicts only a single transcript with no UTR.

Returning to genome annotation: the first step is to identify and mask repeats.

Repeat identification

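  Eukaryotic genomes contain large amounts of repetitive sequence; in wheat, repeats make up as much as 85% of the genome. Repeats fall into two categories: low-complexity sequences and transposable elements, such as viruses, long interspersed nuclear elements (LINEs) and short interspersed nuclear elements (SINEs). Repeats are often nested inside other repeats, and their sheer number makes annotation much harder. Left unmasked, repeats can seed millions of spurious BLAST alignments, producing false evidence for gene annotations. Moreover, many transposon open reading frames (ORFs) look like true host genes to gene predictors, causing portions of transposon ORFs to be added as additional exons to gene predictions, completely corrupting the final gene annotations. Identifying repeats first is therefore essential.

  Repeats are often poorly conserved, so accurate detection usually requires building a species-specific repeat library. Repeat-identification tools fall into two classes by approach: homology-based tools, which identify known repeat elements, and de novo tools, which can identify new ones. The most commonly used program for searching against a repeat library is RepeatMasker. Repeat sequence is usually marked either by replacing it with N (hard masking) or by converting it to lowercase acgt (soft masking).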
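
To make the two masking conventions concrete, here is a minimal Python sketch that applies hard or soft masking to a sequence, given repeat coordinates that some repeat finder has already reported. It does not reproduce RepeatMasker's behaviour or output format. The function names and the 0-based, half-open `(start, end)` interval convention are assumptions chosen for the example.

```python
def mask_repeats(sequence, repeats, soft=True):
    """Mask repeat intervals in a DNA sequence.

    repeats: iterable of (start, end) pairs, 0-based and half-open
             (an assumption of this sketch, not a tool's format).
    soft=True  -> repeat bases become lowercase (soft masking)
    soft=False -> repeat bases become N (hard masking)
    """
    # Upper-case first so that lowercase unambiguously means "masked".
    bases = list(sequence.upper())
    for start, end in repeats:
        for i in range(max(start, 0), min(end, len(bases))):
            bases[i] = bases[i].lower() if soft else "N"
    return "".join(bases)


def masked_fraction(sequence):
    """Fraction of bases that are soft-masked (lowercase) or hard-masked (N)."""
    masked = sum(1 for base in sequence if base.islower() or base == "N")
    return masked / len(sequence)


genome = "ACGTACGTAC"
repeat_hits = [(2, 5)]  # hypothetical repeat covering bases 2..4

soft_masked = mask_repeats(genome, repeat_hits, soft=True)
hard_masked = mask_repeats(genome, repeat_hits, soft=False)
```

Soft masking keeps the underlying bases, so downstream tools that understand lowercase can still extend alignments through a repeat, whereas hard masking discards that sequence information.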

Evidence alignment

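  After repeat identification, the next step is to align proteins, ESTs and RNA-seq data to the genome. Many tools are available for this step; rather than discussing each one, they are listed in the table below. Choose the tool that fits your situation.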

| Software | Description | Refs |
| --- | --- | --- |
| BLAST | Suite of rapid database search tools that uses Karlin–Altschul statistics | 31,32,33 |
| BLAT | Faster than BLAST but has fewer features | 42 |
| Splign | Splice-aware tool designed to align cDNA to genomic sequence | 44 |
| Spidey | mRNA-to-DNA alignment tool that is designed to account for possible paralogous alignments | 45 |
| Prosplign | Global alignment tool that uses BLAST hits to align in a splice-site- and paralogy-aware manner | 140 |
| sim4 | Splice-aware cDNA-to-DNA alignment tool | 46 |
| Exonerate | Splice-site-aware alignment algorithm that can align both protein and EST sequences to a genome | 43 |
| Cufflinks | Extension to TopHat. Uses TopHat outputs to create transcript models | 54 |
| Trinity | High-quality de novo transcriptome assembler | 50 |
| MapSplice | Spliced aligner that does not use a model of canonical splice junction | 141 |
| TopHat | Transcriptome aligner that aligns RNA sequencing (RNA-seq) reads to a reference genome using Bowtie to identify splice sites | 51 |
| GSNAP | A fast short-read aligner | 52 |

For the numbered references, please see the original paper.

Ab initio gene prediction and evidence-driven gene prediction

Commonly used software for this step:

| Software | Description | Refs |
| --- | --- | --- |
| Augustus | Accepts expressed sequence tag (EST)-based and protein-based evidence hints. Highly accurate | 66,67 |
| mGene | Support vector machine (SVM)-based discriminative gene predictor. Directly predicts 5′ and 3′ untranslated regions (UTRs) and poly(A) sites | 133 |
| SNAP | Accepts EST and protein-based evidence hints. Easily trained | 62 |
| FGENESH | Training files are constructed by SoftBerry and supplied to users | 72 |
| Geneid | First published in 1992 and revised in 2000. Accepts external hints from EST and protein-based evidence | 134 |
| GeneMark | A self-training gene finder | 69,70 |
| Twinscan | Extension of the popular Genscan algorithm that can use homology between two genomes to guide gene prediction | 71 |
| GAZE | Highly configurable gene predictor | 74 |
| GenomeScan | Extension of the popular Genscan algorithm that can use BLASTX searches to guide gene prediction | 135 |
| Conrad | Discriminative gene predictor that uses conditional random fields (CRFs) | 136 |
| Contrast | Discriminative gene predictor that uses both SVMs and CRFs | 137 |
| CRAIG | Discriminative gene predictor that uses CRFs | 138 |
| Gnomon | Hidden Markov model (HMM) tool based on Genscan that uses EST and protein alignments to guide gene prediction | 73 |
| GeneSeqer | A tool for identifying potential exon–intron structure in precursor mRNAs (pre-mRNAs) by splice site prediction and spliced alignment | 139 |

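After running several different predictors, their results need to be integrated: redundant models removed, alternative splice variants identified, and so on.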
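
As a rough illustration of the integration step, the Python sketch below collapses gene models that have identical exon structures across predictors, then groups the remaining models into loci by coordinate overlap. It is a toy under stated assumptions, not the algorithm of any combiner tool: the `GeneModel` record, the function names and the 1-based inclusive exon coordinates are all invented for the example. A locus that retains more than one distinct structure is only a candidate for alternative splicing; it may equally reflect disagreement between predictors.

```python
from collections import namedtuple

# Hypothetical record: exons is a tuple of (start, end), 1-based inclusive.
GeneModel = namedtuple("GeneModel", "source seqid strand exons")


def remove_redundant(models):
    """Collapse models with identical exon structure.

    Returns a list of (key, sources): key is (seqid, strand, exons) and
    sources lists the predictors that produced that exact structure.
    """
    support = {}
    for model in models:
        key = (model.seqid, model.strand, tuple(sorted(model.exons)))
        support.setdefault(key, []).append(model.source)
    return [(key, sorted(set(sources))) for key, sources in support.items()]


def group_into_loci(unique_models):
    """Group non-redundant models that overlap on the same seqid and strand."""
    ordered = sorted(
        unique_models, key=lambda item: (item[0][0], item[0][1], item[0][2][0][0])
    )
    loci = []
    for (seqid, strand, exons), sources in ordered:
        start = exons[0][0]
        end = max(exon_end for _, exon_end in exons)
        last = loci[-1] if loci else None
        if (last and last["seqid"] == seqid and last["strand"] == strand
                and start <= last["end"]):
            # Overlaps the current locus: extend it and add the model.
            last["end"] = max(last["end"], end)
            last["models"].append((exons, sources))
        else:
            loci.append({"seqid": seqid, "strand": strand, "start": start,
                         "end": end, "models": [(exons, sources)]})
    return loci


predictions = [
    GeneModel("Augustus", "chr1", "+", ((100, 200), (300, 400))),
    GeneModel("SNAP", "chr1", "+", ((100, 200), (300, 400))),
    GeneModel("GeneMark", "chr1", "+", ((100, 200), (350, 400))),
    GeneModel("SNAP", "chr1", "-", ((150, 250),)),
]
unique = remove_redundant(predictions)
loci = group_into_loci(unique)
```

Keeping the list of supporting predictors for each structure makes it easy to rank models later, for example by preferring structures that several tools agree on.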

The annotation phase

  This step usually requires manual inspection and correction, although software can also help. Three commonly used tools are JIGSAW, EVidenceModeler (EVM) and GLEAN (and its successor, Evigan). In a recent gene prediction competition, the combiners nearly always improved on the underlying gene prediction models, and JIGSAW, EVM and Evigan performed similarly.
  Other tools instead correct gene models against the evidence while predicting. This is the process used by PASA, Gnomon and MAKER.
To be continued.

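Copyright notice: this is an original article by the blog author and may not be reproduced without the author's permission. To request permission to reproduce it, contact xxxxxxx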
