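The longest common subsequence (LCS) should not be confused with the longest common substring: a substring must be contiguous, while a subsequence only has to keep its elements in the original order. We often want to know how similar two strings are, for example two short sentences or two DNA sequences. The diff tool is a representative application.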
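To make the substring/subsequence distinction concrete, here is a minimal sketch. The helper names `is_subsequence` and `is_substring` are made up for this illustration and do not come from any library.

```python
def is_subsequence(z: str, x: str) -> bool:
    """True if z can be obtained from x by deleting zero or more characters."""
    remaining = iter(x)
    # `ch in remaining` consumes the iterator up to and including the first match,
    # so each character of z has to be found after the previous one.
    return all(ch in remaining for ch in z)


def is_substring(z: str, x: str) -> bool:
    """True if z appears in x as one contiguous block."""
    return z in x


print(is_subsequence("subsn", "substring"))  # True: s, u, b, s, then n later on
print(is_substring("subsn", "substring"))    # False: the letters are not adjacent
```

Every substring is a subsequence, but not the other way round, which is why the two problems need different algorithms.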
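This similarity can be treated as a longest common subsequence problem. In dynamic programming terms, we are looking for an optimal solution to the problem. There is often more than one optimal solution, and an LCS algorithm returns just one of them. LCS fits dynamic programming well because it has both of the properties that dynamic programming requires.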
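Here are the two properties:
Optimal substructure: an optimal solution to the problem contains optimal solutions to its subproblems.
Overlapping subproblems: the same subproblems come up again and again, so each one can be solved once and its solution reused.
Let: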
X = {x[1],x[2],...,x[m]}
Y = {y[1],y[2],...,y[n]}
and let Z be an LCS of X and Y:
Z = {z[1],z[2],...,z[k]}
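Optimal substructure of LCS (X[i] denotes the prefix {x[1],...,x[i]} of X, and likewise for Y and Z):
(1) If x[m] = y[n], then z[k] = x[m] = y[n], and Z[k - 1] is an LCS of X[m - 1] and Y[n - 1].
(2) If x[m] ≠ y[n], then z[k] ≠ x[m] implies that Z is an LCS of X[m - 1] and Y.
(3) If x[m] ≠ y[n], then z[k] ≠ y[n] implies that Z is an LCS of X and Y[n - 1].
The optimal substructure of LCS gives a recurrence, and the recurrence shows that LCS also has the overlapping-subproblems property, because different branches of the recursion reach the same c[i,j]:
Let:
c[i,j] be the length of an LCS of X[i] and Y[j]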
         |
         | (1) 0                                if i = 0 or j = 0
c[i,j] = < (2) c[i - 1, j - 1] + 1              if i,j > 0 and x[i] = y[j]
         | (3) max(c[i, j - 1], c[i - 1, j])    if i,j > 0 and x[i] ≠ y[j]
         |
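The recurrence can be transcribed almost literally as a recursive function. The sketch below is one possible reading of it, not the table-based method used in the rest of this post: it assumes 0-indexed Python strings (so x[i] in the recurrence becomes `x[i - 1]`) and uses `functools.lru_cache` so that each overlapping subproblem c[i,j] is computed only once. The name `lcs_length_memo` is made up for the example.

```python
from functools import lru_cache


def lcs_length_memo(x: str, y: str) -> int:
    """Length of an LCS of x and y, computed top-down from the recurrence."""

    @lru_cache(maxsize=None)
    def c(i: int, j: int) -> int:
        # Case (1): an empty prefix has an empty LCS.
        if i == 0 or j == 0:
            return 0
        # Case (2): the last elements match, so extend the LCS of the shorter prefixes.
        if x[i - 1] == y[j - 1]:
            return c(i - 1, j - 1) + 1
        # Case (3): drop the last element of one sequence or the other.
        return max(c(i, j - 1), c(i - 1, j))

    return c(len(x), len(y))


print(lcs_length_memo("substring", "subsequence"))  # 5
```

Without the cache, case (3) would recompute the same c[i,j] along many different paths. With it, at most (m + 1)(n + 1) distinct subproblems are evaluated. Deep recursion makes this version unsuitable for very long inputs.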
Now for two short pieces of pseudocode:
/*
 * Compute the length of an LCS.
 * O(mn)
 */
LCS-LENGTH(x, y)
    m = LEN(x)
    n = LEN(y)
    c = [m + 1][n + 1]          /* table with m + 1 rows and n + 1 columns */
    for i = 0 to m
        c[i, 0] = 0
    for j = 0 to n
        c[0, j] = 0
    for i = 1 to m
        for j = 1 to n
            if x[i - 1] = y[j - 1]          /* x and y are 0-indexed here */
                c[i, j] = c[i - 1, j - 1] + 1
            else
                c[i, j] = max(c[i, j - 1], c[i - 1, j])
    return c[m, n]
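For readers who want something runnable, here is a minimal Python sketch of the same table-filling loop. It returns the whole table rather than only c[m,n], which is an assumption made here so that the table can be inspected or reused. The name `lcs_table` is made up for the example.

```python
def lcs_table(x: str, y: str) -> list:
    """Fill the (m + 1) x (n + 1) table of LCS lengths, bottom-up."""
    m, n = len(x), len(y)
    # Row 0 and column 0 stay 0: they are the base case of the recurrence.
    c = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                c[i][j] = c[i - 1][j - 1] + 1
            else:
                c[i][j] = max(c[i][j - 1], c[i - 1][j])
    return c


table = lcs_table("substring", "subsequence")
print(table[9][11])  # 5, the length of an LCS
```

The two nested loops touch each cell once, which is where the O(mn) bound comes from.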
/*
 * Recover one LCS from the table c filled in by LCS-LENGTH.
 * O(m + n)
 */
LCS(x, y, c[][])
    m = LEN(x)
    n = LEN(y)
    i = m
    j = n
    r = [c[m, n]]               /* result array of length c[m, n] */
    k = LEN(r) - 1
    while i > 0 and j > 0
        if x[i - 1] = y[j - 1]
            r[k] = x[i - 1]
            i--; j--; k--
        else if c[i - 1, j] > c[i, j - 1]
            i--
        else                    /* a tie goes left, as in the tables below */
            j--
    return r
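The backtracking step can be sketched in Python as follows. This version rebuilds the table itself so that the block is self-contained, and it moves left on ties, a choice made here to follow the `$` path drawn in the tables further down. Moving up on ties would also return a valid LCS. The name `lcs` is made up for the example.

```python
def lcs(x: str, y: str) -> str:
    """Return one longest common subsequence of x and y."""
    m, n = len(x), len(y)
    c = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                c[i][j] = c[i - 1][j - 1] + 1
            else:
                c[i][j] = max(c[i][j - 1], c[i - 1][j])

    # Walk back from the bottom-right corner, collecting matches in reverse order.
    out = []
    i, j = m, n
    while i > 0 and j > 0:
        if x[i - 1] == y[j - 1]:
            out.append(x[i - 1])
            i -= 1
            j -= 1
        elif c[i - 1][j] > c[i][j - 1]:
            i -= 1
        else:  # tie or larger value on the left: move left
            j -= 1
    return "".join(reversed(out))


print(lcs("substring", "subsequence"))  # subsn
```

Each iteration decreases i, j, or both, so the walk takes at most m + n steps once the table exists.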
Example:
X = {substring}
{ s, u, b, s, t, r, i, n, g }
{x[0], x[1], x[2], x[3], x[4], x[5], x[6], x[7], x[8]}
Y = {subsequence}
{ s, u, b, s, e, q, u, e, n, c, e }
{y[0], y[1], y[2], y[3], y[4], y[5], y[6], y[7], y[8], y[9], y[10]}
Z = {subsn}
{ s, u, b, s, n }
{z[0], z[1], z[2], z[3], z[4]}
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
| | Y | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |10 |11 |
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
| X | | | s | u | b | s | e | q | u | e | n | c | e |
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
| 0 | | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
| 1 | s | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
| 2 | u | 0 | 1 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 |
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
| 3 | b | 0 | 1 | 2 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 |
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
| 4 | s | 0 | 1 | 2 | 3 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 |
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
| 5 | t | 0 | 1 | 2 | 3 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 |
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
| 6 | r | 0 | 1 | 2 | 3 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 |
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
| 7 | i | 0 | 1 | 2 | 3 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 |
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
| 8 | n | 0 | 1 | 2 | 3 | 4 | 4 | 4 | 4 | 4 | 5 | 5 | 5 |
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
| 9 | g | 0 | 1 | 2 | 3 | 4 | 4 | 4 | 4 | 4 | 5 | 5 | 5 |
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
Notice the extra row and column with index 0 at the head of X and Y? They are there for case (1) of the recurrence.
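Next:
(↖ = \ = upper left) means x[i] = y[j]
(↑ = ^ = up) means x[i] ≠ y[j] and c[i - 1][j] > c[i][j - 1]
(← = < = left) means x[i] ≠ y[j] and c[i - 1][j] ≤ c[i][j - 1] (a tie goes left in the tables below; going up on a tie gives a different path but an LCS of the same length)
Then follow the $ (dollar sign) marks back from the bottom-right corner toward the top-left, and the picture appears: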
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
| | Y | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |10 |11 |
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
| X | | | s | u | b | s | e | q | u | e | n | c | e |
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
| 0 | | | | | | | | | | | | | |
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
| 1 | s | | \$| < | < | \ | < | < | < | < | < | < | < |
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
| 2 | u | | ^ | \$| < | < | < | < | \ | < | < | < | < |
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
| 3 | b | | ^ | ^ | \$| < | < | < | < | < | < | < | < |
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
| 4 | s | | \ | ^ | ^ | \$| < | < | < | < | < | < | < |
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
| 5 | t | | ^ | ^ | ^ | ^$| < | < | < | < | < | < | < |
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
| 6 | r | | ^ | ^ | ^ | ^$| < | < | < | < | < | < | < |
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
| 7 | i | | ^ | ^ | ^ | ^$| <$| <$| <$| <$| < | < | < |
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
| 8 | n | | ^ | ^ | ^ | ^ | < | < | < | < | \$| < | < |
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
| 9 | g | | ^ | ^ | ^ | ^ | < | < | < | < | ^$| <$| <$|
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
And one that combines both:
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
| | Y | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |10 |11 |
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
| X | | | s | u | b | s | e | q | u | e | n | c | e |
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
| 0 | | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
| 1 | s | 0 |\1$|<1 |<1 |\1 |<1 |<1 |<1 |<1 |<1 |<1 |<1 |
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
| 2 | u | 0 |^1 |\2$|<2 |<2 |<2 |<2 |\2 |<2 |<2 |<2 |<2 |
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
| 3 | b | 0 |^1 |^2 |\3$|<3 |<3 |<3 |<3 |<3 |<3 |<3 |<3 |
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
| 4 | s | 0 |\1 |^2 |^3 |\4$|<4 |<4 |<4 |<4 |<4 |<4 |<4 |
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
| 5 | t | 0 |^1 |^2 |^3 |^4$|<4 |<4 |<4 |<4 |<4 |<4 |<4 |
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
| 6 | r | 0 |^1 |^2 |^3 |^4$|<4 |<4 |<4 |<4 |<4 |<4 |<4 |
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
| 7 | i | 0 |^1 |^2 |^3 |^4$|<4$|<4$|<4$|<4$|<4 |<4 |<4 |
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
| 8 | n | 0 |^1 |^2 |^3 |^4 |<4 |<4 |<4 |<4 |\5$|<5 |<5 |
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
| 9 | g | 0 |^1 |^2 |^3 |^4 |<4 |<4 |<4 |<4 |^5$|<5$|<5$|
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
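The cells of the combined table can be reproduced with a short Python sketch. It assumes the conventions used in the tables: `\` for a match, `^` when the value above is strictly larger, `<` otherwise, and `$` on the cells visited while walking back from the bottom-right corner. The name `lcs_grid` is made up for the example, and only rows 1..m and columns 1..n are returned.

```python
def lcs_grid(x: str, y: str) -> list:
    """Return cells such as '\\1$' or '<4' for rows 1..m and columns 1..n."""
    m, n = len(x), len(y)
    c = [[0] * (n + 1) for _ in range(m + 1)]
    arrow = [[""] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                c[i][j] = c[i - 1][j - 1] + 1
                arrow[i][j] = "\\"
            elif c[i - 1][j] > c[i][j - 1]:
                c[i][j] = c[i - 1][j]
                arrow[i][j] = "^"
            else:  # tie or larger value on the left
                c[i][j] = c[i][j - 1]
                arrow[i][j] = "<"

    # Follow the arrows back from the bottom-right corner and remember the cells.
    on_path = set()
    i, j = m, n
    while i > 0 and j > 0:
        on_path.add((i, j))
        if arrow[i][j] == "\\":
            i, j = i - 1, j - 1
        elif arrow[i][j] == "^":
            i -= 1
        else:
            j -= 1

    return [
        [arrow[i][j] + str(c[i][j]) + ("$" if (i, j) in on_path else "")
         for j in range(1, n + 1)]
        for i in range(1, m + 1)
    ]


for row in lcs_grid("substring", "subsequence"):
    print(" ".join(cell.ljust(3) for cell in row))
```

The `\` cells that carry a `$` spell out the LCS when read from top to bottom: s, u, b, s, n.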