Concept
In computer science, a sequence is not necessarily contiguous, whereas a string is contiguous; in biology, the former is called a gapped sequence and the latter simply a sequence.
The sequence similarity problem arises in search engines, command lines, and genomics.
Paradigm one: similar sequence ⇒ similar organism. How should microorganisms be classified?
Paradigm two: similar sequence ⇒ similar structure ⇒ similar function (protein, DNA, RNA, and so on).
The basic method of sequence alignment is Dynamic Programming.
Example
Q: Given U and V, how do we measure their similarity?
Definition: an alignment of U and V inserts " " into the sequences so that both have the same length n (" " denotes a space, i.e. a gap).
Note: aligning " " with " " is forbidden.
Example:
U: cat, V: act (gaps written as -)
ca-t / -act: 2 matches, 2 insertions or deletions.
cat / act: 1 match, 2 mismatches.
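The column counts in this example can be checked mechanically. The following is a minimal sketch, assuming an alignment is given as two equal-length rows with "-" marking a gap; the function name `column_counts` and the gapped alignment `ca-t` / `-act` are illustrative choices (it is one alignment consistent with "2 matches, 2 insertions or deletions").

```python
def column_counts(row_u, row_v, gap="-"):
    """Count matches, mismatches and indels, column by column."""
    if len(row_u) != len(row_v):
        raise ValueError("both rows of an alignment must have the same length")
    counts = {"match": 0, "mismatch": 0, "indel": 0}
    for a, b in zip(row_u, row_v):
        if a == gap and b == gap:
            # The definition forbids aligning a gap with a gap.
            raise ValueError("a gap aligned with a gap is forbidden")
        if a == gap or b == gap:
            counts["indel"] += 1
        elif a == b:
            counts["match"] += 1
        else:
            counts["mismatch"] += 1
    return counts


print(column_counts("ca-t", "-act"))  # {'match': 2, 'mismatch': 0, 'indel': 2}
print(column_counts("cat", "act"))    # {'match': 1, 'mismatch': 2, 'indel': 0}
```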
Q: How many alignments are there between U and V?
A: A lot!
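To put a number on "a lot", here is a small sketch that counts alignments by looking at the last column: it is either (letter, letter), (letter, gap) or (gap, letter), since (gap, gap) is forbidden. The function name `count_alignments` is an illustrative choice; the counts it produces are known as Delannoy numbers.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def count_alignments(m, n):
    """Number of alignments of two sequences of lengths m and n."""
    if m == 0 or n == 0:
        # Only one way: pair every remaining letter with a gap.
        return 1
    return (count_alignments(m - 1, n - 1)   # last column: letter over letter
            + count_alignments(m - 1, n)     # last column: letter over gap
            + count_alignments(m, n - 1))    # last column: gap over letter


print(count_alignments(3, 3))  # 63 alignments for cat and act
```

Even for two sequences of length 3 there are 63 alignments, and the count grows exponentially with the lengths, so checking every alignment is not practical for real inputs.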
Q: Which alignment is better?
A: It depends on the model (the scoring function)!
Alignment score
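Example one:
Given the scoring function match = +3, mismatch = −1, gap = −3 (gaps written as -):
s(ca-t / -act) = 3 + 3 − 3 − 3 = 0
s(cat / act) = 3 − 1 − 1 = 1 (better!)
Example two:
Given the scoring function match = +3, mismatch = −1, gap = −2:
s(ca-t / -act) = 3 + 3 − 2 − 2 = 2 (better!)
s(cat / act) = 3 − 1 − 1 = 1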
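The arithmetic in both examples can be reproduced with a few lines of code. This is a sketch under assumptions: alignments are written as two rows with "-" for gaps, the gapped alignment is taken to be `ca-t` / `-act`, and the parameter names `match`, `mismatch` and `gap_penalty` are hypothetical names for the values read off the sums (3, −1, and −3 or −2).

```python
def alignment_score(row_u, row_v, match, mismatch, gap_penalty, gap="-"):
    """Sum the per-column scores of an alignment given as two rows."""
    if len(row_u) != len(row_v):
        raise ValueError("both rows of an alignment must have the same length")
    total = 0
    for a, b in zip(row_u, row_v):
        if a == gap or b == gap:
            total += gap_penalty
        elif a == b:
            total += match
        else:
            total += mismatch
    return total


# Example one: gap = -3, so the ungapped alignment wins.
print(alignment_score("ca-t", "-act", 3, -1, -3))  # 0
print(alignment_score("cat", "act", 3, -1, -3))    # 1

# Example two: gap = -2, so the gapped alignment wins.
print(alignment_score("ca-t", "-act", 3, -1, -2))  # 2
print(alignment_score("cat", "act", 3, -1, -2))    # 1
```

Changing only the gap penalty flips which alignment is preferred, which is why the choice of scoring function matters so much.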
Optimal Global Alignment
Definition: Given U, V, and a scoring function w(), find the optimal global alignment, i.e. the alignment with the maximum score.
S(U,V): the score of the optimal alignment.
s(alignment): the score of a given alignment.
S(U,V) = s(T) ⇒ T is an optimal alignment.
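For tiny inputs, S(U,V) can be computed straight from the definition by enumerating every alignment and keeping the best one. The sketch below does that for cat and act; it is a brute-force check rather than a practical method, and the names `all_alignments`, `alignment_score` and `brute_force_optimum`, the "-" gap symbol, and the scoring values (match 3, mismatch −1, gap −3 or −2) are assumptions made for illustration.

```python
def all_alignments(u, v, gap="-"):
    """Yield every alignment of u and v as a pair of equal-length rows."""
    if not u and not v:
        yield "", ""
        return
    if u and v:  # last column: letter over letter
        for a, b in all_alignments(u[:-1], v[:-1], gap):
            yield a + u[-1], b + v[-1]
    if u:        # last column: letter over gap
        for a, b in all_alignments(u[:-1], v, gap):
            yield a + u[-1], b + gap
    if v:        # last column: gap over letter
        for a, b in all_alignments(u, v[:-1], gap):
            yield a + gap, b + v[-1]


def alignment_score(row_u, row_v, match, mismatch, gap_penalty, gap="-"):
    """Sum the per-column scores of one alignment."""
    total = 0
    for a, b in zip(row_u, row_v):
        if a == gap or b == gap:
            total += gap_penalty
        else:
            total += match if a == b else mismatch
    return total


def brute_force_optimum(u, v, match, mismatch, gap_penalty):
    """Return (S(u, v), one optimal alignment) by exhaustive search."""
    best = max(
        all_alignments(u, v),
        key=lambda rows: alignment_score(rows[0], rows[1], match, mismatch, gap_penalty),
    )
    return alignment_score(best[0], best[1], match, mismatch, gap_penalty), best


print(sum(1 for _ in all_alignments("cat", "act")))      # 63
print(brute_force_optimum("cat", "act", 3, -1, -3)[0])   # 1
print(brute_force_optimum("cat", "act", 3, -1, -2)[0])   # 2
```

When several alignments tie for the maximum score, `max` returns whichever one the generator produced first, so only the score is guaranteed to be unique.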
Key observation: the structure of the optimal solutions.
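(1) T: an optimal alignment of cat and act.
s(T) = S(cat, act)
What do we know about the last column of T?
Obviously, the first and second candidate columns are impossible!
If the third column is the last one ⇒ T = T1 followed by the column (t, t), where T1 is an alignment of ca and ac. Is s(T1) = S(ca, ac)? Yes!
Proof: cut and paste! If T1 were not optimal, swapping in a better alignment of ca and ac would raise s(T), contradicting the optimality of T.
If the fourth column is the last one ⇒ T = T2 followed by the column (t, " "), where T2 is an alignment of ca and act. Is s(T2) = S(ca, act)? Yes!
Proof: cut and paste, by the same argument.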
…
In Summary:
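Note: the computation does not need the decision tree, only this table. The logic: each value depends on its upper-left diagonal, left, and upper neighbors, and the table's values are produced row by row.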
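The table-filling logic in the note can be sketched as follows. This is a minimal illustration, not a definitive implementation: the function name `fill_table` and the default scores (match 3, mismatch −1, gap −3) are assumptions, and cell `S[i][j]` is taken to hold the optimal score for the first i letters of U and the first j letters of V.

```python
def fill_table(u, v, match=3, mismatch=-1, gap_penalty=-3):
    """Fill the score table row by row and return it."""
    m, n = len(u), len(v)
    S = [[0] * (n + 1) for _ in range(m + 1)]
    # First column and first row: a prefix aligned entirely against gaps.
    for i in range(1, m + 1):
        S[i][0] = i * gap_penalty
    for j in range(1, n + 1):
        S[0][j] = j * gap_penalty
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            diagonal = S[i - 1][j - 1] + (match if u[i - 1] == v[j - 1] else mismatch)
            upper = S[i - 1][j] + gap_penalty   # u[i-1] aligned with a gap
            left = S[i][j - 1] + gap_penalty    # v[j-1] aligned with a gap
            S[i][j] = max(diagonal, upper, left)
    return S


table = fill_table("cat", "act")
for row in table:
    print(row)
print(table[-1][-1])  # 1, the optimal score for cat and act with gap = -3
```

The bottom-right cell holds S(U,V); every other cell is the answer to a smaller subproblem that the cut-and-paste argument shows must itself be optimal.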
Algorithm
- Define the scoring function s() (60% of the workload)
- Recursive function
- Boundaries
- Dynamic Programming
- Time & Space complexity
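The steps listed above can be put together in one short sketch. Everything named here is an assumption for illustration: `w` is a hypothetical per-column scoring function (match 3, mismatch −1, gap −2, with "-" as the gap symbol), and `global_align` fills the table and then walks back from the bottom-right cell to recover one optimal alignment.

```python
def w(a, b, match=3, mismatch=-1, gap_penalty=-2):
    """Hypothetical column score; '-' marks a gap."""
    if a == "-" or b == "-":
        return gap_penalty
    return match if a == b else mismatch


def global_align(u, v, w=w):
    """Return (optimal score, aligned row for u, aligned row for v)."""
    m, n = len(u), len(v)
    S = [[0] * (n + 1) for _ in range(m + 1)]
    # Boundaries: a prefix aligned entirely against gaps.
    for i in range(1, m + 1):
        S[i][0] = S[i - 1][0] + w(u[i - 1], "-")
    for j in range(1, n + 1):
        S[0][j] = S[0][j - 1] + w("-", v[j - 1])
    # Dynamic programming: each cell depends on its diagonal, upper and left neighbors.
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            S[i][j] = max(
                S[i - 1][j - 1] + w(u[i - 1], v[j - 1]),
                S[i - 1][j] + w(u[i - 1], "-"),
                S[i][j - 1] + w("-", v[j - 1]),
            )
    # Traceback: follow any neighbor that explains the current cell's value.
    row_u, row_v = [], []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and S[i][j] == S[i - 1][j - 1] + w(u[i - 1], v[j - 1]):
            row_u.append(u[i - 1])
            row_v.append(v[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and S[i][j] == S[i - 1][j] + w(u[i - 1], "-"):
            row_u.append(u[i - 1])
            row_v.append("-")
            i -= 1
        else:
            row_u.append("-")
            row_v.append(v[j - 1])
            j -= 1
    return S[m][n], "".join(reversed(row_u)), "".join(reversed(row_v))


score, row_u, row_v = global_align("cat", "act")
print(score)  # 2
print(row_u)
print(row_v)
```

For sequences of lengths m and n, filling the table takes O(mn) time and O(mn) space, and the traceback adds at most m + n steps. If only the score is needed, keeping just two rows of the table reduces the space to O(n).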