Basic Concepts in Sequence Alignment
- Biology
– What is the biological question or problem?
- Data
– What is the input data?
– What other supportive data can be used?
- Model
– How is the problem formulated computationally?
– Or, what’s the data model?
- Algorithm
– What is the computational algorithm?
– How about its performance/limitation?
Sequence Alignment:
The purpose of a sequence alignment is to line up all residues of the input sequence(s) so that their similarity is maximal, where similarity is understood in terms of their functional or evolutionary relationship.
Sequence alignment tools: https://www.ebi.ac.uk/Tools/psa/
Gap penalties:
- opening a gap receives a penalty of d (the first position of a gap);
- extending a gap receives a penalty of e (each gap position after the first; in the figure, the second gap, shown in green, has one extension and the fourth gap has four);
- so the total penalty for a gap of length n is:
Penalty = d + (n-1) * e
In the figure's example:
opening: 10
extending: 0.5
second gap: 10 + 0.5 = 10.5
fourth gap: 10 + 0.5 * 4 = 12
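The penalty formula can be checked with a few lines of Python. This is a minimal sketch; the function name `gap_penalty` is made up for illustration, and the default values d = 10 and e = 0.5 are simply the ones from the example above.

```python
def gap_penalty(n: int, d: float = 10.0, e: float = 0.5) -> float:
    """Penalty for one gap of length n: d for opening, e for each extension."""
    if n < 1:
        raise ValueError("gap length must be at least 1")
    return d + (n - 1) * e


print(gap_penalty(2))  # one opening + one extension  -> 10.5
print(gap_penalty(5))  # one opening + four extensions -> 12.0
```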
Global Alignment with Dynamic Programming
- Sequence Alignment: Enumerate?
The best alignment that ends at a given pair of symbols is the best alignment of the sequences up to that point, plus the best alignment for the two additional symbols.
- Dynamic programming solves problems by combining the solutions to sub-problems.
- Combining the optimal solutions to the sub-problems gives the globally optimal solution.
- Break the problem into smaller sub-problems.
- Solve these sub-problems optimally, using recursion.
- Use these optimal solutions to construct an optimal solution for the original problem.
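- Align two sequences: x and y
– F(i,j) is the score of the best alignment between x[1…i] and y[1…j]
– s(A,B) is the score for substituting A with B; d is the (linear) gap penalty
How is the traceback done? Follow the direction from which each cell was reached: a diagonal (upper-left) step aligns the two residues with each other, e.g. G with G or A with A; an upward step aligns a gap (-) with A; a leftward step aligns A with a gap (-).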
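The fill and traceback steps can be sketched in Python as follows. This is an illustrative sketch, not the course's reference implementation: the function name and the scores (match = 1, mismatch = -1, d = 2) are assumptions, and x is laid along the rows, so which sequence receives the gap on an "up" or "left" move follows from that choice. Each cell takes the maximum of the diagonal move (F(i-1,j-1) + s(x_i, y_j)), the move from above (F(i-1,j) - d), and the move from the left (F(i,j-1) - d).

```python
def needleman_wunsch(x, y, match=1, mismatch=-1, d=2):
    """Global alignment with a linear gap penalty d (illustrative scores)."""

    def s(a, b):
        # Toy substitution score; a real similarity matrix could be used instead.
        return match if a == b else mismatch

    m, n = len(x), len(y)
    # F[i][j]: best score for aligning x[:i] with y[:j]
    F = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        F[i][0] = -d * i  # prefix of x against gaps only
    for j in range(1, n + 1):
        F[0][j] = -d * j  # prefix of y against gaps only
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            F[i][j] = max(
                F[i - 1][j - 1] + s(x[i - 1], y[j - 1]),  # diagonal
                F[i - 1][j] - d,                          # from above
                F[i][j - 1] - d,                          # from the left
            )

    # Traceback from the bottom-right corner to the top-left corner.
    ax, ay = [], []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + s(x[i - 1], y[j - 1]):
            ax.append(x[i - 1])
            ay.append(y[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and F[i][j] == F[i - 1][j] - d:
            ax.append(x[i - 1])  # residue of x against a gap
            ay.append("-")
            i -= 1
        else:
            ax.append("-")       # gap against a residue of y
            ay.append(y[j - 1])
            j -= 1
    return F[m][n], "".join(reversed(ax)), "".join(reversed(ay))


score, top, bottom = needleman_wunsch("GATTACA", "GATACA")
print(score)
print(top)
print(bottom)
```

When several moves tie, this sketch prefers the diagonal, so it reports only one of the possibly many optimal alignments.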
From Global Alignment to Local Alignment
Sequence Alignment with Affine Gap Penalties, and the Time Complexity of the Needleman-Wunsch Algorithm
- M(i,j) is the score of the best alignment between x[1…i] and y[1…j], given that x[i] is aligned to y[j]
- X(i,j) is the score of the best alignment between x[1…i] and y[1…j], given that x[i] is aligned to a gap
- Y(i,j) is the score of the best alignment between x[1…i] and y[1…j], given that y[j] is aligned to a gap
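The three tables can be filled together, as in the sketch below. It computes only the optimal score, with no traceback. The function name, the match/mismatch scores, and the defaults d = 10 and e = 0.5 are assumptions for illustration. The sketch also lets a gap in one sequence be followed directly by a gap in the other (at the cost of a new opening penalty), a transition that some textbook formulations leave out.

```python
def affine_global_score(x, y, match=1, mismatch=-1, d=10.0, e=0.5):
    """Best global alignment score with gap opening d and gap extension e."""
    NEG = float("-inf")  # marks states that cannot occur
    m, n = len(x), len(y)
    M = [[NEG] * (n + 1) for _ in range(m + 1)]  # x[i] aligned to y[j]
    X = [[NEG] * (n + 1) for _ in range(m + 1)]  # x[i] aligned to a gap
    Y = [[NEG] * (n + 1) for _ in range(m + 1)]  # y[j] aligned to a gap

    M[0][0] = 0.0
    for i in range(1, m + 1):
        X[i][0] = -(d + (i - 1) * e)  # one leading gap of length i
    for j in range(1, n + 1):
        Y[0][j] = -(d + (j - 1) * e)  # one leading gap of length j

    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            M[i][j] = s + max(M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1])
            # Open a new gap (penalty d) or extend the current one (penalty e).
            X[i][j] = max(M[i - 1][j] - d, X[i - 1][j] - e, Y[i - 1][j] - d)
            Y[i][j] = max(M[i][j - 1] - d, Y[i][j - 1] - e, X[i][j - 1] - d)

    return max(M[m][n], X[m][n], Y[m][n])


# Three matches followed by one gap of length 3: 3 - (10 + 2 * 0.5) = -8.0
print(affine_global_score("AAAGGG", "AAA"))
```

Each of the three tables has (m+1)(n+1) cells and each cell is computed from a fixed number of neighbours, so the running time stays proportional to the product of the two sequence lengths.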
Time complexity of the algorithm:
Supplementary Material on Homology, Similarity, Similarity Matrices, and Dot Plots
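Counting table cells gives a rough estimate. The symbols m and n for the lengths of the two sequences are introduced here and do not appear in the notes above; the count assumes the full tables are kept so that a traceback is possible.

```latex
\begin{aligned}
\text{cells per table} &= (m+1)(n+1) \\
\text{work per cell}   &= O(1) \quad \text{(a maximum over a fixed number of neighbouring cells)} \\
\text{time}            &= O(mn), \qquad \text{space} = O(mn)
\end{aligned}
```

Using the three tables M, X and Y for affine gap penalties triples the number of cells, which changes the constant factor but not the O(mn) bound.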
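Homology
- derived from a common ancestor
Homology: two or more sequences share a common ancestor.
Homologs are divided into orthologs and paralogs:
- ortholog: derived from speciation
Orthologs: two sequences in different species that descend from the same sequence in their common ancestor, separated by a speciation event.
- paralog: derived from duplication
Paralogs: two sequences, typically in the same species, that descend from the same ancestral sequence as multiple copies produced by sequence duplication.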
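Similarity and Identity
- Similarity: the two sequences resemble each other, which suggests descent from a common ancestor; it is measured with a similarity matrix
- Identity: the residues are exactly the same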
How can a computer do this job?
• How to measure similarity?
– Similarity matrix
• How to find an alignment?
– Dot matrix
– Dynamic programming
– BLAST
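Of the methods listed, the dot matrix is the simplest to sketch: put one sequence along the rows and the other along the columns, and mark every cell where the two residues are identical. The function name `dot_matrix` and the plain-text rendering are assumptions for illustration. Practical dot-plot tools usually add a sliding window and a score threshold to suppress noise, which this sketch omits.

```python
def dot_matrix(x, y):
    """Return the rows of a dot plot: '*' where x[i] == y[j], '.' elsewhere."""
    return ["".join("*" if a == b else "." for b in y) for a in x]


# Runs of '*' along a diagonal indicate regions that align without gaps.
for row in dot_matrix("GATTACA", "GATACA"):
    print(row)
```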