Sequence alignment

Latest recommended article published on 2024-08-17 21:30:16

涂涂

Category: algorithm. Tags: algorithm

Article link: https://blog.csdn.net/u012562273/article/details/56015805

Concept

In computer science, a sequence need not be contiguous, whereas a string is contiguous; in biology, the former is called a gapped sequence and the latter is simply called a sequence.
The sequence similarity problem arises in search engines, command lines, and genome analysis.
Paradigm one: similar sequences $\Rightarrow$ similar organisms. For example, how should microorganisms be classified?
Paradigm two: similar sequences $\Rightarrow$ similar structure $\Rightarrow$ similar function, for proteins, DNA, RNA, and so on.
The basic method for sequence alignment is dynamic programming.

Example

Q: Given U and V, how do we measure their similarity?
Definition: an alignment of U and V inserts spaces (written " ") into the sequences so that both have the same length n.
Note: aligning " " with " " is forbidden.
Example:
U: cat. V: act

$\begin{pmatrix} c & a & \text{" "} & t \\ \text{" "} & a & c & t \\ \end{pmatrix}$
2 matches, 2 insertions or deletions.

$\begin{pmatrix} c & a & t \\ a & c & t \\ \end{pmatrix}$
1 match, 2 mismatches.
Q: How many alignments are there between U and V?
A: A lot!
Q: Which alignment is better?
A: It depends on the model (the scoring function)!
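
To make "a lot" concrete, here is a minimal sketch that counts alignments. It assumes two alignments are different whenever their sequences of columns differ, which the text above does not state explicitly. The function name `count_alignments` is made up for this illustration. The count follows from the fact that the last column of any alignment is (char, char), (char, space), or (space, char).

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def count_alignments(m: int, n: int) -> int:
    """Count alignments of a length-m sequence with a length-n sequence."""
    if m == 0 or n == 0:
        # Only one option left: pair every remaining character with a space.
        return 1
    return (
        count_alignments(m - 1, n - 1)  # last column is (char, char)
        + count_alignments(m - 1, n)    # last column is (char, space)
        + count_alignments(m, n - 1)    # last column is (space, char)
    )


print(count_alignments(3, 3))  # "cat" vs "act" -> 63
```

Even for two sequences of length 3 there are already 63 alignments, so enumerating them all quickly becomes impractical for longer inputs.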

Alignment score

Example one:
Given the scoring function

$w(x,y) = \begin{cases} 3, & \text{if } x = y \\[2ex] -1, & \text{if } x \neq y \text{ (neither a space)} \\[2ex] -3, & \text{if } x = \text{" "} \text{ or } y = \text{" "} \end{cases}$

$s \begin{pmatrix} c & a & \text{" "} & t \\ \text{" "} & a & c & t \\ \end{pmatrix} = 3+3-3-3=0$

$s \begin{pmatrix} c & a & t \\ a & c & t \\ \end{pmatrix} = 3-1-1=1$ better!
Example two:

$w(x,y) = \begin{cases} 3, & \text{if } x = y \\[2ex] -1, & \text{if } x \neq y \text{ (neither a space)} \\[2ex] -2, & \text{if } x = \text{" "} \text{ or } y = \text{" "} \end{cases}$

$s \begin{pmatrix} c & a & \text{" "} & t \\ \text{" "} & a & c & t \\ \end{pmatrix} = 3+3-2-2=2$ better!

$s \begin{pmatrix} c & a & t \\ a & c & t \\ \end{pmatrix} = 3-1-1=1$
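
The two examples can be checked with a short Python sketch. The helper names `make_w` and `score`, and the choice to write each alignment row as a string with a literal space for " ", are assumptions made for this illustration only.

```python
GAP = " "  # the space symbol inserted by an alignment


def make_w(match: int, mismatch: int, gap: int):
    """Build a scoring function w(x, y) from three constants."""
    def w(x: str, y: str) -> int:
        if x == GAP and y == GAP:
            raise ValueError("aligning a space with a space is forbidden")
        if x == GAP or y == GAP:
            return gap
        return match if x == y else mismatch
    return w


def score(top: str, bottom: str, w) -> int:
    """Sum the column scores of an alignment given as two equal-length rows."""
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have the same length")
    return sum(w(x, y) for x, y in zip(top, bottom))


w1 = make_w(3, -1, -3)  # Example one
w2 = make_w(3, -1, -2)  # Example two

gapped = ("ca t", " act")
ungapped = ("cat", "act")

print(score(*gapped, w1), score(*ungapped, w1))  # 0 1
print(score(*gapped, w2), score(*ungapped, w2))  # 2 1
```

Changing only the gap penalty from -3 to -2 flips which of the two alignments scores higher, which is why the answer depends on the scoring function.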

Optimal Global Alignment

Definition: given U, V, and w(), find the optimal global alignment, that is, the alignment with the maximum score.
S(U,V): the score of the optimal alignment.
s(alignment): the score of a given alignment.
S(U,V) = s(T) $\Rightarrow$ T is an optimal alignment.
Key observation: the structure of optimal solutions.
(1) T: an optimal alignment of cat and act.
s(T) = S(cat,act)
What do we know about the last column of T? Consider these candidates:

$\begin{pmatrix} a & a & t & \text{" "} & t\\ \text{" "} & t & t & t & \text{" "} \\ \end{pmatrix}$
Obviously, the first and second columns are impossible: the top entry of the last column must be either the last character of cat (t) or a space.
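If the last column of T is the third column, $(t, t)$:

$\Rightarrow T=T_1\begin{pmatrix} t\\ t\end{pmatrix}$, where $T_1$ is an alignment of ca and ac.

$s(T_1) = S(ca,ac)$? YES!
Proof: cut and paste! If some alignment of ca and ac scored higher than $T_1$, pasting it in place of $T_1$ would give an alignment of cat and act that scores higher than T, contradicting the optimality of T.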
If the last column of T is the fifth column, $(t, \text{" "})$:

$\Rightarrow T=T_2\begin{pmatrix} t \\ \text{" "}\end{pmatrix}$, where $T_2$ is an alignment of ca and act.

$s(T_2) = S(ca,act)$? YES!
Proof: cut and paste, as before.
…
In summary:

$S(cat,act)=\max \begin{cases} S(ca,ac)+w(t,t)\\[2ex] S(ca,act)+w(t,\text{" "}) \\[2ex] S(cat,ac)+w(\text{" "},t) \end{cases}$
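
The same argument works for any pair of prefixes. As a sketch, using notation that is introduced here and is not in the original: write $U_i$ for the first $i$ characters of U, $u_i$ for its $i$-th character, and likewise $V_j$ and $v_j$ for V. The recurrence then generalizes to

$S(U_i,V_j)=\max \begin{cases} S(U_{i-1},V_{j-1})+w(u_i,v_j)\\[2ex] S(U_{i-1},V_j)+w(u_i,\text{" "}) \\[2ex] S(U_i,V_{j-1})+w(\text{" "},v_j) \end{cases}$

for $i,j \ge 1$, with boundary cases $S(U_0,V_0)=0$, $S(U_i,V_0)=S(U_{i-1},V_0)+w(u_i,\text{" "})$, and $S(U_0,V_j)=S(U_0,V_{j-1})+w(\text{" "},v_j)$. In the boundary cases one sequence is empty, so every remaining character can only be paired with a space.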

$\begin{array}{c|cccc} & \text{""} & a & ac & act \\ \hline \text{""} & S(\text{""},\text{""}) & S(\text{""},a) & S(\text{""},ac) & S(\text{""},act) \\ c & S(c,\text{""}) & S(c,a) & S(c,ac) & S(c,act) \\ ca & S(ca,\text{""}) & S(ca,a) & S(ca,ac) & S(ca,act)\\ cat & S(cat,\text{""}) & S(cat,a) & S(cat,ac) & S(cat,act) \end{array}$
Here "" denotes the empty prefix.
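Note: the computation does not need a decision tree, only this table. Each value depends on its upper-left diagonal, left, and upper neighbors, so the table is filled in row by row.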
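
The table-filling procedure can be sketched in Python as follows. This is an illustration under stated assumptions rather than a definitive implementation: the function name `global_alignment` and its parameters are made up here, the default scores come from Example one, and the traceback that recovers one optimal alignment is an addition the text above does not describe.

```python
GAP = " "


def global_alignment(u: str, v: str, match: int = 3, mismatch: int = -1, gap: int = -3):
    """Return (optimal score, aligned u, aligned v) for a global alignment."""
    def w(x: str, y: str) -> int:
        if x == GAP or y == GAP:
            return gap
        return match if x == y else mismatch

    m, n = len(u), len(v)
    # S[i][j] holds the optimal score for the prefixes u[:i] and v[:j].
    S = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):  # first column: u[:i] against the empty prefix
        S[i][0] = S[i - 1][0] + w(u[i - 1], GAP)
    for j in range(1, n + 1):  # first row: the empty prefix against v[:j]
        S[0][j] = S[0][j - 1] + w(GAP, v[j - 1])

    # Fill row by row; each cell needs its diagonal, upper, and left neighbors.
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            S[i][j] = max(
                S[i - 1][j - 1] + w(u[i - 1], v[j - 1]),
                S[i - 1][j] + w(u[i - 1], GAP),
                S[i][j - 1] + w(GAP, v[j - 1]),
            )

    # Traceback (an assumption of this sketch): walk back from S[m][n],
    # picking any neighbor that explains the current value.
    top, bottom = [], []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and S[i][j] == S[i - 1][j - 1] + w(u[i - 1], v[j - 1]):
            top.append(u[i - 1])
            bottom.append(v[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and S[i][j] == S[i - 1][j] + w(u[i - 1], GAP):
            top.append(u[i - 1])
            bottom.append(GAP)
            i -= 1
        else:
            top.append(GAP)
            bottom.append(v[j - 1])
            j -= 1
    return S[m][n], "".join(reversed(top)), "".join(reversed(bottom))


print(global_alignment("cat", "act"))             # (1, 'cat', 'act')
print(global_alignment("cat", "act", gap=-2)[0])  # 2
```

The optimal scores agree with Examples one and two. When several alignments tie for the best score, the traceback returns only one of them, so the alignment it reports for the second scoring function may differ from the one shown in Example two while having the same score.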