"Bioinformatics: Introduction and Methods" ---- Functional Prediction of Variants ---- Lecture Notes (11)

Latest recommended article published on 2024-01-17 14:11:08

盲人骑瞎马5555

Category column: Bioinformatics. Article tags: genetic variation, SIFT, PolyPhen

Article link: https://blog.csdn.net/wxw060709/article/details/100936619

Chapter 6: Functional Prediction of Variants

6.1 Problem Overview

Where did your genetic variations come from?

inherited from parents
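de novo mutations (about 70 to 100 new mutations)
somatic mutations (e.g., in cancer)

Many congenital pediatric diseases arise because the child carries a de novo mutation that happens to fall in an important gene, which can cause severe disease.
Tumor cells generally carry somatic mutations that cause the cells to proliferate uncontrollably.

Types of genetic variations in a human genome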

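Chromosomal aneuploidy: the most severe type, in which the number of copies of a chromosome is wrong. An example is Down syndrome, where the child has three copies of chromosome 21.
Structural Variations (SVs): insertion; deletion; inversion; translocation
Copy Number Variations (CNVs)
Short insertions/deletions (indels): may occur in intergenic regions or introns, where the effect on phenotype is usually, though not always, small; may also occur in coding regions, where they fall into two classes: those that cause a frameshift and those that do not.
Single Nucleotide Variations (SNVs): in general, each person's genome carries about 3 million single-nucleotide variations, roughly one per 1,000 bases. They may occur in intergenic regions, in introns, or in promoter regions.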
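
The two quantitative points in the list above can be checked with a few lines of Python. This is only a minimal sketch: the genome size of about 3 billion base pairs is an assumption added here (the notes give only the 3 million and 1/1000 figures), and the function names are hypothetical.

```python
# Assumed approximate haploid human genome size (not stated in the notes).
GENOME_SIZE_BP = 3_000_000_000
# Number of SNVs per genome quoted in the notes.
SNVS_PER_GENOME = 3_000_000


def snv_density(snvs: int = SNVS_PER_GENOME, genome_bp: int = GENOME_SIZE_BP) -> float:
    """Return the fraction of positions that carry an SNV."""
    return snvs / genome_bp


def is_frameshift(ref: str, alt: str) -> bool:
    """Return True if a coding indel shifts the reading frame.

    A net length change that is not a multiple of 3 moves every
    downstream codon out of frame.
    """
    return (len(alt) - len(ref)) % 3 != 0


if __name__ == "__main__":
    print(snv_density())                # about one variant per 1,000 bases
    print(is_frameshift("A", "AT"))     # 1-bp insertion
    print(is_frameshift("ATCG", "A"))   # 3-bp deletion
```

A 1-bp insertion shifts the frame, while a 3-bp deletion removes one whole codon and leaves the frame intact.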

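Nomenclature: mutation vs. polymorphism vs. variation vs. variant
Across the entire human population, a variant with a frequency below 1% is generally called a mutation, and one with a frequency above 1% is called a polymorphism. A 5% cutoff is sometimes used instead.
"Variation" or "variant" is generally used as an umbrella term covering both mutations and polymorphisms.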
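
The frequency-based naming convention can be written as a one-line rule. The sketch below is illustrative only: the function name is hypothetical, and treating a frequency of exactly the cutoff as a polymorphism is an assumption, since the notes do not say which side the boundary falls on.

```python
def classify_by_frequency(allele_freq: float, cutoff: float = 0.01) -> str:
    """Name a variant by its population frequency.

    cutoff defaults to the 1% convention; pass 0.05 for the 5% convention.
    Assumption: a frequency exactly equal to the cutoff counts as a polymorphism.
    """
    if not 0.0 <= allele_freq <= 1.0:
        raise ValueError("allele_freq must be between 0 and 1")
    return "polymorphism" if allele_freq >= cutoff else "mutation"


if __name__ == "__main__":
    print(classify_by_frequency(0.002))
    print(classify_by_frequency(0.03))
    print(classify_by_frequency(0.03, cutoff=0.05))
```

The same 3% variant is a polymorphism under the 1% convention and a mutation under the 5% convention, which is why the cutoff should always be stated.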

SNVs within coding regions

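Stop gain (nonsense): the most severe; it introduces a stop codon, so the protein terminates prematurely. Either the protein cannot be produced at all, or a truncated version is translated.
Stop loss: a stop codon is lost, so the resulting protein is longer than the original; it is also possible that the protein cannot be translated at all as a result.
Non-synonymous (missense): a variant that changes the amino acid.
Synonymous (silent): occurs in a protein-coding exon but does not change the encoded amino acid.
Affect splicing: occurs at the boundary between an intron and an exon.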
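
The first four categories can be told apart by translating the reference and the altered codon with the standard genetic code. The following is a minimal sketch, not a full annotation tool: the function name is hypothetical, it looks at a single codon only, and it cannot detect splicing effects, which depend on position relative to exon boundaries.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)}


def classify_codon_change(ref_codon: str, alt_codon: str) -> str:
    """Classify a single-codon change in a coding region."""
    ref_aa = CODON_TABLE[ref_codon.upper()]
    alt_aa = CODON_TABLE[alt_codon.upper()]
    if ref_aa == alt_aa:
        return "synonymous"
    if alt_aa == "*":
        return "stop gain"
    if ref_aa == "*":
        return "stop loss"
    return "missense"


if __name__ == "__main__":
    print(classify_codon_change("GAA", "GAG"))  # Glu -> Glu
    print(classify_codon_change("TAC", "TAA"))  # Tyr -> stop
    print(classify_codon_change("TAA", "CAA"))  # stop -> Gln
    print(classify_codon_change("GAG", "GTG"))  # Glu -> Val
```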

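Missense (non-synonymous) SNVs are currently the most studied, because: 1. Missense SNVs change the amino acid; 2. Coding regions make up only ~2% of the genome, yet missense SNVs account for >50% of all mutations known to be involved in human inherited disease. This figure may, however, partly reflect the fact that so many researchers study non-synonymous mutations.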
What features differentiate disease-causing variants from neutral ones?
How can we predict whether a variation is disease-causing?

6.2 Databases of Recorded Variants

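1976: a non-synonymous mutation causing thalassemia was discovered.
1993: the cause of Huntington's disease was identified as a variable number of copies of a nucleotide repeat.
1995: the cause of Williams syndrome was identified as the deletion of a chromosomal segment, a representative example of this kind of disease.
Genetic variants that affect proteins are recorded in the Swiss-Prot database, developed in 1986.
1998: NCBI established the dbSNP database, which catalogs single-nucleotide variations and other variants, many of them observed in healthy individuals.
2010: NCBI established the dbVar database, which mainly stores larger structural variations.
2012: the 1000 Genomes results were published.
1987: OMIM (Online Mendelian Inheritance in Man), which stores disease-related genetic variants.
1996: the Human Gene Mutation Database (HGMD) was established to store data on genetic variants.
2007: Locus-Specific Databases (LSDBs) were established; each one collects the genetic variants related to one particular locus.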
2007 and 2012: dbGaP and ClinVar, which store the genetic variants reported by GWAS and next-generation sequencing experiments.
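2004: the COSMIC database was established; it mainly stores somatic mutations found in cancer.
dbSNP, established in 1998, is one of NCBI's most important databases; its main purpose is to store all identified genetic variants, from both healthy individuals and patients.

LSDBs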

Collect all known variants of each disease related gene in a specific database.
Annotate with complete and accurate information on genetic mutations
Most LSDBs are built on LOVD (Leiden Open Variation Database), a database framework for storing variant information.

6.3 Conservation-Based and Rule-Based Prediction Methods: SIFT and PolyPhen

Phenotypical/functional "effects" of human genetic variations

Disease vs. normal
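Deleterious vs. neutral: an evolutionary concept, namely whether the variant affects the individual's fitness
Personal trait differences (e.g. height): here the effect is not an extreme phenotype such as disease versus normal, but a difference in some trait.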

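Beyond assessing an individual's final phenotype, in-depth research that establishes a true causal relationship between genotype and phenotype requires substantial work in animal models and at the cellular level (animal model phenotypic changes and cellular phenotypic changes).
"Functional effect" refers to whether the variant changes the structure and function of the protein (protein function changes and protein structure changes).
At the lowest level, the question is whether the variant changes the protein sequence (protein sequence changes).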
Statistical and stochastic, not deterministic
Observations, not "truth"
Nonsense mutations are usually considered deleterious.
Known deleterious mutations are enriched in nonsynonymous mutations.
~50% of the mutations currently known to cause Mendelian (single-gene) disorders are nonsynonymous mutations (ascertainment bias?)
Synonymous mutations, intronic mutations, and intergenic mutations are understudied (according to GWAS studies, 88% of trait-associated variants of weak effect are non-coding)
Most research so far has focused on nonsynonymous mutations.
More successful methods

Conservation-based(e.g., SIFT)
Rule-based(e.g., PolyPhen)
Classifier-based(e.g., PolyPhen2, SAPRED)

Sort Intolerant From Tolerant substitutions (SIFT)

Published in 2001 by Pauline C. Ng and Steven Henikoff
The first tool for predicting deleterious amino acid substitutions
Website: http://sift.jcvi.org

SIFT bets on evolution: Important positions (such as active sites) tend to be conserved in the protein family across species. Mutations at well-conserved positions tend to be deleterious.
SIFT bets on evolution: Some positions have a high degree of diversity across species. Mutations at these positions tend to be neutral.
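
One common way to quantify how conserved or diverse an alignment position is uses the Shannon entropy of the residues in that column. Note that this is a swapped-in illustration of the conservation idea, not SIFT's own scoring, and the function name and example columns are made up for the sketch.

```python
import math
from collections import Counter


def column_entropy(column: str) -> float:
    """Shannon entropy (in bits) of one alignment column.

    0 bits means a perfectly conserved position; higher values mean
    more diversity across the aligned sequences.
    """
    counts = Counter(column)
    total = len(column)
    return sum((n / total) * math.log2(total / n) for n in counts.values())


if __name__ == "__main__":
    print(column_entropy("HHHHHHHH"))  # conserved position, e.g. an active-site residue
    print(column_entropy("ACDEACDE"))  # diverse position
```

Under the reasoning above, a substitution at the first kind of position would be expected to be deleterious more often than one at the second.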

SIFT is a multistep procedure.

Given a protein sequence:

Step1. Search for similar sequences

Sequence search database: SWISS-PROT
PSI-BLAST is run for four iterations to collect a pool of sequences similar to the query.

Step2. Choose closely related sequences that are likely to share similar function

The PSI-BLAST results are grouped together if they are >90% identical in the aligned regions
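
The >90% identity criterion can be sketched as follows. This is an assumption-laden illustration rather than SIFT's actual code: the sequences are taken to be already aligned to equal length with "-" for gaps, the greedy grouping against the first member of each group is a simplification chosen here, and both function names are hypothetical.

```python
def percent_identity(a: str, b: str) -> float:
    """Percent identity over columns where neither sequence has a gap."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    aligned = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not aligned:
        return 0.0
    matches = sum(x == y for x, y in aligned)
    return 100.0 * matches / len(aligned)


def group_by_identity(seqs: list[str], threshold: float = 90.0) -> list[list[str]]:
    """Greedily group sequences that exceed the identity threshold
    against the first member of an existing group."""
    groups: list[list[str]] = []
    for seq in seqs:
        for group in groups:
            if percent_identity(group[0], seq) > threshold:
                group.append(seq)
                break
        else:
            groups.append([seq])
    return groups


if __name__ == "__main__":
    s1 = "ACDEFGHIKLMNPQRSTVWY"
    s2 = "ACDEFGHIKLMNPQRSTVWF"   # one mismatch in 20 positions
    s3 = "ACDEFGHIKLAAAAAAAAAA"   # only the first half matches
    print(group_by_identity([s1, s2, s3]))
```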

Step3. Obtain the multiple alignment of these chosen sequences

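Step4. Calculate normalized probabilities for all possible substitutions at each position in the alignment

In step 4, a probability is computed for each position from the distribution of amino acids observed there, and from this probability a final numeric prediction score is obtained. If the score is below 0.05, the substitution is predicted to be deleterious; if it is above 0.05, it is predicted to be neutral, causing no change in function or phenotype.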
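
A toy version of step 4 is sketched below. It is a deliberate simplification and not the published SIFT algorithm: it uses raw observed frequencies divided by the frequency of the most common residue, with no pseudocounts or sequence weighting, so any amino acid never seen in the column scores 0. The function names are hypothetical; the 0.05 cutoff is the one given in the notes.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def normalized_probabilities(column: str) -> dict[str, float]:
    """Score each amino acid at one alignment position.

    Each observed frequency is divided by the frequency of the most
    common residue, so the most common residue scores 1.0.
    """
    residues = [aa for aa in column if aa in AMINO_ACIDS]  # skip gaps
    if not residues:
        raise ValueError("column contains no amino acids")
    counts = Counter(residues)
    top = max(counts.values())
    return {aa: counts[aa] / top for aa in AMINO_ACIDS}


def predict(column: str, alt_aa: str, cutoff: float = 0.05) -> str:
    """Call a substitution deleterious if its score is below the cutoff."""
    score = normalized_probabilities(column)[alt_aa]
    return "deleterious" if score < cutoff else "tolerated"


if __name__ == "__main__":
    column = "LLLLLLLLLI"        # nine Leu and one Ile at this position
    print(predict(column, "I"))  # seen in the alignment
    print(predict(column, "W"))  # never seen in the alignment
```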

How should accuracy be defined?

Polymorphism Phenotyping (PolyPhen): a rule-based method

Amino acid variants may impact folding, interaction sites, solubility or stability of the protein.
Changes in protein structure may affect protein function, which may lead to phenotype change.
PolyPhen predicts impact of amino acid allelic variants based on multi-sequence alignment AND protein 3D structure features.

PolyPhen

1. Multi-sequence alignment of homologous sequences

2. Get the protein 3D structure, or use homology modeling to predict its structure

3. Structure-based characterization of the substitution site

DISULFIDE, THIOLEST or THIOETH bond, BINDING site, ACTIVE site, etc.
Whether the variant is located in transmembrane regions
Whether the variant is located in coiled coil regions
Whether the variant is located in signal peptide regions

4. Calculate the 3D structure features of the substitution site

Secondary structure
Solvent accessible surface area
Φ-Ψ dihedral angles
Normalized B-factor for the residue
Loss of hydrogen bond
Contacts with critical sites, ligands or other polypeptide chains

Pros:

Improved prediction accuracy when the protein 3D structure is available

Cons:

If the 3D structure is not available, it can only depend on the MSA.
The rules are empirical.
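
To make the idea of a rule-based predictor concrete, here is a toy sketch. Everything in it is hypothetical: the feature names, the thresholds (0.25, 0.7, 0.9) and the rules themselves are invented for illustration and are not PolyPhen's actual rules. It only shows the general shape: hand-written rules over site annotation, MSA conservation and structural features, with a fallback to the MSA alone when no 3D structure is available.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SiteFeatures:
    in_annotated_site: bool   # e.g. binding site, active site, disulfide bond
    conservation: float       # hypothetical 0..1 score from the MSA
    relative_accessibility: Optional[float] = None  # None if no 3D structure


def rule_based_prediction(site: SiteFeatures) -> str:
    """Apply a few invented empirical rules to one substitution site."""
    if site.in_annotated_site:
        return "probably damaging"
    if site.relative_accessibility is None:
        # No structure available: fall back on the MSA only.
        return "possibly damaging" if site.conservation >= 0.9 else "benign"
    buried = site.relative_accessibility < 0.25
    if buried and site.conservation >= 0.7:
        return "probably damaging"
    if buried or site.conservation >= 0.9:
        return "possibly damaging"
    return "benign"


if __name__ == "__main__":
    print(rule_based_prediction(SiteFeatures(True, 0.5)))
    print(rule_based_prediction(SiteFeatures(False, 0.8, 0.1)))
    print(rule_based_prediction(SiteFeatures(False, 0.8)))
```

The last two calls use the same conservation score; only the structural feature changes the outcome, which mirrors the pro and con listed above.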
