Chapter 6 Functional Prediction of Variants
6.1 Problem Overview
- Where did your genetic variations come from?
- inherited from parents
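- de novo mutations (about 70-100 new mutations per person)
- somatic mutations (e.g., in cancer)
- Many congenital pediatric diseases arise because the child carries a de novo mutation that happens to fall in an important gene, which can cause severe disease.
- Tumor cells generally carry some somatic mutation that drives uncontrolled cell proliferation.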
- Types of genetic variations in a human genome
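- Chromosomal aneuploidy: the most severe type, in which the chromosome number is wrong. In Down syndrome, for example, the child has three copies of chromosome 21.
- Structural Variations (SVs): insertion, deletion, inversion, translocation
- Copy Number Variations (CNVs)
- Short insertions/deletions (indels): may occur in intergenic or intronic regions, where they usually, though not always, have little effect on phenotype. They may also occur in coding regions, where they fall into two classes: those that cause a frameshift and those that do not.
- Single Nucleotide Variations (SNVs): each human genome typically carries about 3 million single-nucleotide variants, roughly 1 per 1,000 bases. They may occur in intergenic regions, introns, or promoters.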
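The frameshift distinction for coding-region indels comes down to whether the net length change is a multiple of three. A minimal sketch, assuming the indel is given as reference and alternate allele strings (the function name `indel_effect` is hypothetical):

```python
def indel_effect(ref: str, alt: str) -> str:
    """Classify a coding-region indel by its net length change."""
    delta = len(alt) - len(ref)
    if delta == 0:
        return "not an indel"
    # A length change that is a multiple of 3 keeps the reading frame intact.
    return "in-frame" if delta % 3 == 0 else "frameshift"


print(indel_effect("A", "ATTT"))  # 3 bases inserted
print(indel_effect("ACG", "A"))   # 2 bases deleted
```

This only looks at allele lengths; it does not check where the indel sits relative to the codon boundaries or whether it is in a coding region at all.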
- Nomenclature: Mutation(突变) vs. polymorphism(多态性) vs. variation(变异) vs. variant
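- Variants present in less than 1% of the worldwide human population are generally called mutations; those present in more than 1% are called polymorphisms. A 5% cutoff is sometimes used instead.
- Variation or variant is generally the umbrella term covering both mutations and polymorphisms.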
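The frequency-based naming convention can be written as a one-line rule. This is only a sketch: the function name is hypothetical, and treating a variant at exactly the cutoff as a mutation is an assumption, since the notes do not say which side the boundary falls on.

```python
def name_by_frequency(allele_freq: float, cutoff: float = 0.01) -> str:
    """Name a variant by its population frequency (1% cutoff by default)."""
    if not 0.0 <= allele_freq <= 1.0:
        raise ValueError("allele frequency must be between 0 and 1")
    # Assumption: a frequency exactly at the cutoff is called a mutation.
    return "polymorphism" if allele_freq > cutoff else "mutation"


print(name_by_frequency(0.002))              # rare variant
print(name_by_frequency(0.03))               # common under the 1% cutoff
print(name_by_frequency(0.03, cutoff=0.05))  # rare under the 5% cutoff
```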
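- SNVs within coding regions
- stop gain (nonsense): the most severe; it introduces a stop codon, so the protein terminates prematurely and is either not produced at all or translated as a truncated version.
- stop loss: a stop codon is lost, so the resulting protein is longer than the original, though it may also fail to be translated at all.
- Non-synonymous (missense): variants that change the amino acid.
- Synonymous (silent): occur in protein-coding exons but leave the encoded amino acid unchanged.
- Affect splicing: occur at intron-exon boundaries.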
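- Missense (non-synonymous) SNVs are the most studied so far, because: 1. Missense SNVs change the amino acid; 2. Coding sequence accounts for only ~2% of the genome, yet missense SNVs account for >50% of all mutations known to be involved in human inherited disease. However, this figure may itself reflect how many researchers study nonsynonymous mutations.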
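The coding-region categories can be told apart by translating the codon before and after the change. Below is a minimal sketch using the standard genetic code; the function name `classify_coding_snv` is hypothetical, and splicing effects are not covered because they depend on position relative to the intron-exon boundary rather than on the codon.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order; '*' marks a stop codon.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def classify_coding_snv(ref_codon: str, alt_codon: str) -> str:
    """Classify a single-nucleotide change inside one codon."""
    ref_codon, alt_codon = ref_codon.upper(), alt_codon.upper()
    if len(ref_codon) != 3 or len(alt_codon) != 3:
        raise ValueError("codons must be 3 bases long")
    if sum(a != b for a, b in zip(ref_codon, alt_codon)) != 1:
        raise ValueError("codons must differ at exactly one base")
    ref_aa, alt_aa = CODON_TABLE[ref_codon], CODON_TABLE[alt_codon]
    if ref_aa == alt_aa:
        return "synonymous"
    if alt_aa == "*":
        return "stop gain"
    if ref_aa == "*":
        return "stop loss"
    return "missense"


print(classify_coding_snv("TGG", "TGA"))  # Trp -> stop
print(classify_coding_snv("GAG", "GTG"))  # Glu -> Val
print(classify_coding_snv("GAA", "GAG"))  # Glu -> Glu
```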
- What features differentiate disease-causing variants from neutral ones?
- How can we predict whether a variation is disease-causing?
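6.2 Databases That Record Variants
- 1976: a nonsynonymous mutation underlying thalassemia was discovered.
- 1993: the cause of Huntington's disease was identified: a variable number of copies of a nucleotide repeat.
- 1995: the cause of Williams syndrome was identified, a representative case of a disease caused by deletion of a chromosomal segment.
- Data on genetic variants that affect proteins are stored in the Swiss-Prot database, developed in 1986.
- 1998: NCBI established dbSNP, which catalogs single-nucleotide and other variants observed in many healthy individuals.
- 2010: NCBI established dbVar, which mainly stores larger structural variants.
- 2012: the 1000 Genomes Project was published.
- 1987: OMIM (Online Mendelian Inheritance in Man), which stores disease-related genetic variants.
- 1996: the Human Gene Mutation Database was established to store data on genetic variants.
- 2007: Locus-Specific Databases (LSDBs) were established; each one gathers the genetic variants related to a particular locus.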
- 2007 and 2012: dbGaP and ClinVar, which store genetic variants identified in GWAS and next-generation sequencing experiments.
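- 2004: the COSMIC database was established; it mainly stores somatic mutations in cancer.
- dbSNP is an important NCBI database, established in 1998, whose main purpose is to store all identified genetic variants, from both healthy individuals and patients.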
- LSDBs
- Collect all known variants of each disease-related gene in a dedicated database.
- Annotate with complete and accurate information on genetic mutations.
- Most LSDBs are built on LOVD (Leiden Open Variation Database), a database framework for storing variant information.
6.3 Conservation-Based and Rule-Based Prediction Methods: SIFT and PolyPhen
- Phenotypical/functional "effects" of human genetic variations
- Disease vs. normal
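- Deleterious vs. neutral: an evolutionary concept, namely whether the variant affects the individual's fitness.
- Personal trait differences (e.g. height): here the effect is not an extreme phenotype such as disease vs. normal, but a difference in some trait.
- Beyond assessing the individual's final phenotype, in-depth research that establishes a true causal link between genotype and phenotype requires extensive work in animal models and at the cellular level (animal model phenotypic changes and cellular phenotypic changes).
- Functional effect refers to whether the variant changes the protein's structure or function (protein function changes and protein structure changes).
- At the lowest level, the question is whether the variant changes the protein sequence (protein sequence changes).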
- Statistical and stochastic, not deterministic
- Observations, not "truth"
- Nonsense mutations are usually considered deleterious.
- Known deleterious mutations are enriched in nonsynonymous mutations.
- ~50% of the known mutations in Mendelian (single-gene) disorders are nonsynonymous mutations (ascertainment bias?)
- Synonymous, intronic, and intergenic mutations are understudied (according to GWAS, 88% of trait-associated variants of weak effect are non-coding).
- Most research so far has focused on nonsynonymous mutations.
- More successful methods
- Conservation-based(e.g., SIFT)
- Rule-based(e.g., PolyPhen)
- Classifier-based(e.g., PolyPhen2, SAPRED)
- Sort Intolerant From Tolerant substitutions (SIFT)
- Published in 2001 by Pauline C. Ng and Steven Henikoff
- The first tool for predicting deleterious amino acid substitutions
- Website: http://sift.jcvi.org
- SIFT bets on evolution: Important positions (such as active sites) tend to be conserved in the protein family across species. Mutations at well-conserved positions tend to be deleterious.
- SIFT bets on evolution: Some positions have a high degree of diversity across species. Mutations at these positions tend to be neutral.
SIFT is a multistep procedure.
Given a protein sequence:
Step 1. Search for similar sequences
- Sequence search database: SWISS-PROT
- PSI-BLAST is run for four iterations to collect a pool of sequences similar to the query.
Step 2. Choose closely related sequences that are likely to share similar function
- The PSI-BLAST results are grouped together if they are >90% identical in the aligned regions
Step 3. Obtain the multiple alignment of these chosen sequences
Step 4. Calculate normalized probabilities for all possible substitutions at each position in the alignment
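- In step 4, the distribution of amino acids observed at each position yields a probability, from which a final numeric prediction score is derived. If the score is below 0.05, the substitution is predicted to be deleterious; if it is above 0.05, it is predicted to be neutral, i.e. not expected to change function or phenotype.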
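A simplified sketch of the step 4 scoring for a single alignment column is given below. It is not the SIFT implementation: the flat pseudocount of 0.01 is an assumption standing in for SIFT's more elaborate pseudocount scheme, and the function names are hypothetical. Each amino acid's probability is divided by the probability of the most frequent amino acid in the column, and the 0.05 cutoff from the notes is applied.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def normalized_probabilities(column: str, pseudocount: float = 0.01) -> dict:
    """Score all 20 amino acids at one alignment column, scaled so the top one is 1.0."""
    counts = Counter(aa for aa in column.upper() if aa in AMINO_ACIDS)
    total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
    # Assumption: a flat pseudocount keeps unseen amino acids from scoring exactly 0.
    probs = {aa: (counts[aa] + pseudocount) / total for aa in AMINO_ACIDS}
    top = max(probs.values())
    return {aa: p / top for aa, p in probs.items()}


def predict(column: str, substitution: str, cutoff: float = 0.05) -> str:
    """Predict the effect of substituting an amino acid at this column."""
    score = normalized_probabilities(column)[substitution.upper()]
    return "deleterious" if score < cutoff else "neutral"


# A fully conserved cysteine column vs. a diverse hydrophobic column.
print(predict("CCCCCCCCCC", "W"))
print(predict("AVILAVILAG", "V"))
```

In the conserved column any substitution away from cysteine scores far below 0.05, whereas in the diverse column every amino acid that is actually observed scores well above it.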
How should accuracy be defined?
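One common way to answer this question is to compare predictions against variants with known labels and report accuracy together with sensitivity and specificity. The notes do not say which definition they intend, so this is only a sketch of the usual confusion-matrix metrics; the function name `evaluate` is hypothetical.

```python
def evaluate(predicted, actual, positive="deleterious"):
    """Compute accuracy, sensitivity, and specificity from paired labels."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("need two non-empty label lists of equal length")
    pairs = list(zip(predicted, actual))
    tp = sum(p == positive and a == positive for p, a in pairs)
    tn = sum(p != positive and a != positive for p, a in pairs)
    fp = sum(p == positive and a != positive for p, a in pairs)
    fn = sum(p != positive and a == positive for p, a in pairs)
    return {
        "accuracy": (tp + tn) / len(pairs),
        # Fraction of truly deleterious variants that were predicted deleterious.
        "sensitivity": tp / (tp + fn) if tp + fn else float("nan"),
        # Fraction of truly neutral variants that were predicted neutral.
        "specificity": tn / (tn + fp) if tn + fp else float("nan"),
    }


predicted = ["deleterious", "deleterious", "deleterious", "neutral", "neutral"]
actual = ["deleterious", "deleterious", "neutral", "neutral", "deleterious"]
print(evaluate(predicted, actual))
```

Reporting sensitivity and specificity alongside accuracy matters when deleterious and neutral variants are present in very different numbers, because accuracy alone can then look high for a predictor that always returns the majority label.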
Polymorphism Phenotyping (PolyPhen): a rule-based method
- Amino acid variants may impact folding, interaction sites, solubility or stability of the protein.
- Changes in protein structure may affect protein function, which may lead to phenotype change.
- PolyPhen predicts impact of amino acid allelic variants based on multi-sequence alignment AND protein 3D structure features.
PolyPhen
1. Multi-sequence alignment of homologous sequences
2. Get the protein 3D structure, or use homology modeling to predict it
3. Structure-based characterization of the substitution site
- DISULFIDE, THIOLEST or THIOETH bond, BINDING site, ACTIVE site, etc.
- Whether the variant is located in transmembrane regions
- Whether the variant is located in coiled coil regions
- Whether the variant is located in signal peptide regions
4. Calculate the 3D structure features of the substitution site
- Secondary structure
- Solvent accessible surface area
- Φ-Ψ dihedral angles
- Normalized B-factor (temperature factor) for the residue
- Loss of hydrogen bond
- Contacts with critical sites, ligands or other polypeptide chains
Pros:
- Improved prediction accuracy when the protein 3D structure is available
Cons:
- If the 3D structure is not available, it can only depend on the MSA.
- The rules are empirical.
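To make the rule-based idea concrete, here is a toy decision procedure over the kinds of features listed above. Everything in it is an assumption for illustration: the feature names, the thresholds (2.0 and 1.0), and the order of the rules are invented and are not PolyPhen's actual rule table.

```python
from dataclasses import dataclass


@dataclass
class SiteFeatures:
    profile_score_diff: float          # MSA-derived score difference (hypothetical scale)
    in_annotated_site: bool            # e.g. ACTIVE site, BINDING site, disulfide bond
    has_structure: bool = False        # is a 3D structure or homology model available?
    buried: bool = False               # low solvent-accessible surface area
    loses_hydrogen_bond: bool = False


def predict_impact(site: SiteFeatures) -> str:
    """Apply illustrative empirical rules in order; the first match wins."""
    if site.in_annotated_site:
        return "probably damaging"
    if site.profile_score_diff >= 2.0:  # invented threshold
        return "probably damaging"
    # Structural rules can only fire when a structure is available.
    if site.has_structure and site.buried and site.loses_hydrogen_bond:
        return "possibly damaging"
    if site.profile_score_diff >= 1.0:  # invented threshold
        return "possibly damaging"
    return "benign"


with_structure = SiteFeatures(0.4, False, has_structure=True, buried=True,
                              loses_hydrogen_bond=True)
without_structure = SiteFeatures(0.4, False)
print(predict_impact(with_structure))
print(predict_impact(without_structure))
```

The two example calls show the trade-off noted in the pros and cons: the same substitution is flagged when structural features are available, but falls back to the MSA-derived score alone when they are not.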