The Problem
Background from high-school biology: double-stranded DNA is "zipped" together by complementary base-pairing. Each strand of DNA can be viewed as a string of bases, where each base is drawn from the set $\{A,C,G,T\}$. The bases $A$ and $T$ pair with each other, and the bases $C$ and $G$ pair with each other.
A single-stranded RNA molecule pairs with itself. The set of pairs (and resulting shape) formed by the RNA molecule through this process is called the secondary structure.
Single-Strand RNA Setting
Bases: $\{A,C,G,U\}$
Structure: a sequence of $n$ bases $B=b_1b_2\cdots b_n$
Rules:
- $A$ pairs only with $U$, and $G$ pairs only with $C$.
- Bases at positions $i$ and $j$ with $j>i$ cannot form a bond if $j\le i+4$; that is, every pair $(i,j)$ satisfies $i<j-4$.
- A base cannot be part of more than one pair.
- No crossing pairs are allowed: if $(i,j)$ and $(k,l)$ are two pairs, then we cannot have $i<k<j<l$.
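The four rules can be checked mechanically. The following is a minimal sketch, not part of the original notes: the function name `is_valid_structure` and the representation of a structure as a list of 1-indexed `(i, j)` tuples are assumptions made for illustration. It also assumes the sharp-turn rule in the form $i<j-4$, which is the condition the recurrence uses later.

```python
def is_valid_structure(bases, pairs):
    """Return True if `pairs` obeys all four rules on the string `bases`.

    `pairs` is a list of 1-indexed (i, j) tuples; this representation is
    an assumption made for this sketch.
    """
    allowed = {("A", "U"), ("U", "A"), ("C", "G"), ("G", "C")}
    used = set()
    for i, j in pairs:
        if not (1 <= i < j <= len(bases)):
            return False
        # Rule 1: only A-U and C-G pairs are allowed.
        if (bases[i - 1], bases[j - 1]) not in allowed:
            return False
        # Rule 2: no sharp turns, every pair needs i < j - 4.
        if j - i <= 4:
            return False
        # Rule 3: each base takes part in at most one pair.
        if i in used or j in used:
            return False
        used.update((i, j))
    # Rule 4: no two pairs (i, j), (k, l) may satisfy i < k < j < l.
    for i, j in pairs:
        for k, l in pairs:
            if i < k < j < l:
                return False
    return True
```

For example, `is_valid_structure("ACAAAAAGAAAU", [(1, 12), (2, 8)])` accepts the two nested pairs, while `is_valid_structure("ACAAAAUAAAAG", [(1, 7), (2, 12)])` rejects the two crossing pairs.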
Designing and Analyzing the Algorithm
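Idea
First, try a greedy approach: starting from the first base in the sequence, look for a base it can pair with.
This approach turns out to produce crossing pairs.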
Next, try dynamic programming with a one-dimensional table. Let $W[i]$ be the maximum number of base pairs using positions $1,\dots,i$, and let $\mathcal{O}$ be an optimal pairing. If $n$ is not in any pair of $\mathcal{O}$, then $W[n]=W[n-1]$. If $n$ is in a pair $(i,n)$ of $\mathcal{O}$, then $W[n]=1+W[i-1]+{}$(the optimal number of pairs on the strand $i+1,\dots,n-1$). That last term is not a prefix of the sequence, so it is not one of the $W[\cdot]$ values; the subproblems need two indices.
Algorithm
Let $OPT(i,j)$ denote the maximum number of base pairs in a secondary structure on $b_i,\dots,b_j$.
There are two cases:
- $j$ is not involved in a pair
- $j$ pairs with $t$ for some $t<j-4$
Base Case: initialize $OPT(i,j)=0$ for $i\geq j-1$. (The recurrence then also yields $OPT(i,j)=0$ whenever $i\geq j-4$, since no pair fits in such a short interval.)
Inductive Case:
$$OPT(i,j)=\max\left(OPT(i,j-1),\ \max_{t}\bigl(1+OPT(i,t-1)+OPT(t+1,j-1)\bigr)\right)$$
where the inner maximum is taken over all $t$ with $i\le t<j-4$ such that $b_t$ and $b_j$ are an allowable pair (A-U or C-G).
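The recurrence can be filled in bottom-up, processing intervals in order of increasing length so that every subproblem on the right-hand side is already available. The following is a minimal sketch under stated assumptions: the function name `max_base_pairs`, the table name `opt`, and the padded 1-indexed table layout are illustrative choices, not part of the original notes.

```python
def max_base_pairs(bases):
    """Maximum number of pairs in a valid secondary structure of `bases`."""
    n = len(bases)
    allowed = {("A", "U"), ("U", "A"), ("C", "G"), ("G", "C")}
    # opt[i][j] holds OPT(i, j) with 1-indexed positions. The table is padded
    # so that empty intervals such as opt[i][i - 1] exist and are 0.
    opt = [[0] * (n + 2) for _ in range(n + 2)]
    # Intervals with j - i <= 4 cannot contain a pair and stay 0.
    for length in range(5, n):              # length = j - i
        for i in range(1, n - length + 1):
            j = i + length
            best = opt[i][j - 1]            # case 1: j is not in a pair
            for t in range(i, j - 4):       # case 2: j pairs with t, i <= t < j - 4
                if (bases[t - 1], bases[j - 1]) in allowed:
                    best = max(best, 1 + opt[i][t - 1] + opt[t + 1][j - 1])
            opt[i][j] = best
    return opt[1][n]
```

For instance, `max_base_pairs("AAAAAU")` is 1 (the single pair $(1,6)$), and `max_base_pairs("AAAAU")` is 0 because the only complementary pair would be a sharp turn. The sketch returns only the count; recovering the pairs themselves would need a separate traceback through `opt`.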
Analysis
There are $O(n^2)$ subproblems to solve, and evaluating the recurrence takes time $O(n)$ for each. Thus the running time is $O(n^3)$.
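To make the bound concrete, here is a rough count, assuming the work for subproblem $(i,j)$ is proportional to the number of candidate positions $t$, which is at most $j-i$:

$$\sum_{1\le i<j\le n}(j-i)=\sum_{d=1}^{n-1}d\,(n-d)=\frac{(n-1)\,n\,(n+1)}{6}=O(n^3)$$

The middle step groups the intervals by their length $d=j-i$; there are $n-d$ intervals of each length.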
Reference
Jon Kleinberg and Éva Tardos, Algorithm Design.