A Study of the Diff Algorithm

airekans

Categories: C++, ACM. Tags: algorithm, string, table, stdstring, function, task

Article link: https://blog.csdn.net/airekans/article/details/6022178

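In the Unix/Linux world, when we need to compare two files we reach for the diff command. But how does diff actually work?

In diff, the two files being compared are called old and new, and the comparison is usually done line by line. We can abstract this as a comparison of two strings, for example: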

old: abcdefger

new: abdefereger

Each character here stands for one line of a file. The idea diff uses for the comparison is to find the longest subsequence that old and new have in common.

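A subsequence is defined as follows: given a string (a1, a2, ..., an), where ai is the i-th character, the sequence (a[m1], a[m2], ..., a[mk]) is a subsequence if m1, m2, ..., mk are all in [1, n] and m1 < m2 < ... < mk. In other words, a subsequence keeps some of the characters in their original order, and the chosen characters do not have to be adjacent.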
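The definition can be checked mechanically. The following is a minimal sketch, not code from the original post: `isSubsequence` is a hypothetical helper name, and it simply walks the longer string once, matching the candidate's characters in order.

```cpp
#include <string>

// Hypothetical helper (not from the original post).
// Returns true if `sub` can be obtained from `s` by deleting zero or more
// characters without reordering the remaining ones.
bool isSubsequence(const std::string &sub, const std::string &s)
{
    std::string::size_type j = 0;  // index of the next character of `sub` to match
    for (std::string::size_type i = 0; i < s.size() && j < sub.size(); i++) {
        if (s[i] == sub[j]) {
            j++;
        }
    }
    return j == sub.size();  // true only if every character of `sub` was matched
}
```

For instance, `isSubsequence("bc", "fabc")` holds, while `isSubsequence("cb", "fabc")` does not, because the order of the characters matters.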

So the task of diff is to find the longest common subsequence (LCS) of old and new.

For example:

old: fabc

new: ebca

Here the LCS is bc.

We can therefore design a function lsp that takes two strings as arguments and returns their LCS.

string lsp(const string &s1, const string &s2);

Since we compare the strings from front to back, we get the following property.

lsp(s1, s2) =
    1) s1[0] + lsp(tail(s1), tail(s2)),             if s1[0] == s2[0]
    2) max{lsp(s1, tail(s2)), lsp(tail(s1), s2)},   if s1[0] != s2[0]

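Here tail(str) denotes the substring of str that remains after removing its first character. When either string is empty, the result is the empty string.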
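The recurrence can be transcribed almost literally into C++. This is only a sketch to make the formula concrete: the names `lcsRecursive` and `longer` are assumptions, `substr(1)` plays the role of tail, and `longer` uses the ordering the article gives for max (longer string wins, ties go to the lexicographically larger one). It takes exponential time, so it is suitable for short inputs only.

```cpp
#include <string>

// Hypothetical helper: the "max" of the recurrence. The longer string wins;
// for equal lengths the lexicographically larger string wins.
std::string longer(const std::string &a, const std::string &b)
{
    if (a.size() != b.size()) {
        return a.size() > b.size() ? a : b;
    }
    return a < b ? b : a;
}

// Direct transcription of the recurrence; exponential time, for illustration only.
std::string lcsRecursive(const std::string &s1, const std::string &s2)
{
    if (s1.empty() || s2.empty()) {
        return "";  // base case: nothing left to match
    }
    if (s1[0] == s2[0]) {
        // Case 1: the first characters match, so keep the character and recurse on both tails.
        return s1[0] + lcsRecursive(s1.substr(1), s2.substr(1));
    }
    // Case 2: drop the first character of one string or the other and keep the better result.
    return longer(lcsRecursive(s1, s2.substr(1)), lcsRecursive(s1.substr(1), s2));
}
```

Calling `lcsRecursive("fabc", "ebca")` yields bc, matching the example above. The same subproblems are solved many times along the way, which is what motivates a table-based approach.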

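The formula above is recursive and has optimal substructure: the answer for a pair of strings is built from the answers for shorter pairs. This is exactly the idea behind Dynamic Programming (DP), so we can solve the problem with DP.

Here I use the bottom-up [3] DP approach: start from the most basic cases and build up the entries so that later computations can reuse earlier results.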

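Suppose we allocate a 2-D array whose row index (the first index) represents the characters of old. For example, if old is fabc, then 0 is the null character '\0', 1 is 'c', 2 is 'b', 3 is 'a', and 4 is 'f'. The letters are in reverse order.

The column index (the second index) represents the characters of new. For example, if new is ebca, then 0 is the null character '\0', 1 is 'a', 2 is 'c', 3 is 'b', and 4 is 'e'.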

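We define oldChars[old.size() + 1] to hold the characters of old and newChars[new.size() + 1] to hold the characters of new, laid out as described above. We then define table[old.size() + 1][new.size() + 1] as a 2-D array of strings, where table[i][j] holds the LCS of the last i characters of old and the last j characters of new. From the lsp formula above, the LCS of two such strings is: 1) if their first characters are the same, that is, oldChars[i] == newChars[j], the LCS is oldChars[i] + table[i - 1][j - 1]; 2) if their first characters differ, that is, oldChars[i] != newChars[j], the LCS is max(table[i - 1][j], table[i][j - 1]). For max, the longer string is the larger one; if the lengths are equal, the string that comes later in lexicographic order is the larger one.

So we can compute the table like this: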

Function lsp(old, new)
    oldSize := old.size()
    newSize := new.size()
    for i := 1 to oldSize
        for j := 1 to newSize
            if oldChars[i] == newChars[j]
                then table[i][j] := oldChars[i] + table[i - 1][j - 1]
            else
                table[i][j] := Max(table[i - 1][j], table[i][j - 1])
    return table[oldSize][newSize]

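Note that table[i][0] and table[0][j] are all initialized to the empty string "" beforehand, which is why the loops start at 1.

After this computation, table[old.size()][new.size()] is the LCS we are looking for.

Once we have the LCS, we compare old against it: the characters that do not match are the ones that were deleted. Likewise, we compare new against it: the characters that do not match are the ones that were added.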
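As a small illustration of this last step, the sketch below collects the characters of a string that are not matched by the LCS. The name `unmatched` is hypothetical, and the function assumes its second argument really is a subsequence of the first.

```cpp
#include <string>

// Hypothetical helper (not from the original post).
// Returns the characters of `s` that are not matched by `common`, in order.
// Assumes `common` is a subsequence of `s`.
std::string unmatched(const std::string &s, const std::string &common)
{
    std::string rest;
    std::string::size_type j = 0;  // index of the next character of `common` to match
    for (std::string::size_type i = 0; i < s.size(); i++) {
        if (j < common.size() && s[i] == common[j]) {
            j++;           // this character belongs to the common subsequence
        } else {
            rest += s[i];  // this character exists only in `s`
        }
    }
    return rest;
}
```

With old = fabc, new = ebca and LCS = bc, `unmatched("fabc", "bc")` gives fa (the deleted characters) and `unmatched("ebca", "bc")` gives ea (the added characters).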

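This DP algorithm takes O(nm) time and O(nm) space, counted in table cells, where n and m are the lengths of old and new. Because each cell here stores a string rather than a number, copying those strings adds a further factor of up to the LCS length. The space can be reduced to O(n), since each row of the table depends only on the row before it.

Here is the complete code: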
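One common way to get the reduced space is to keep only two rows of the table. The sketch below is an assumption about what that simplification could look like, not the author's code: `lcsLength` is a hypothetical name, and it computes only the length of the LCS. Recovering the LCS string itself within the same space bound needs extra work that is not shown here.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not from the original post).
// Returns the length of the LCS of s1 and s2 using two rows of the DP table.
std::size_t lcsLength(const std::string &s1, const std::string &s2)
{
    std::vector<std::size_t> prev(s2.size() + 1, 0);  // row i - 1 of the table
    std::vector<std::size_t> curr(s2.size() + 1, 0);  // row i of the table
    for (std::size_t i = 1; i <= s1.size(); i++) {
        for (std::size_t j = 1; j <= s2.size(); j++) {
            if (s1[i - 1] == s2[j - 1]) {
                curr[j] = prev[j - 1] + 1;                 // characters match: extend the diagonal
            } else {
                curr[j] = std::max(prev[j], curr[j - 1]);  // otherwise take the better neighbour
            }
        }
        prev.swap(curr);  // the finished row becomes the "previous" row
    }
    return prev[s2.size()];
}
```

For the two examples in this article, `lcsLength("fabc", "ebca")` is 2 and `lcsLength("abcdefger", "abdefereger")` is 8.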

#include <iostream>
#include <string>
#include <vector>
using namespace std;

string lcs(const string &s1, const string &s2);
void showDiff(const string &s1, const string &s2, const string &sub);

int main(int argc, char** argv)
{
    string s1, s2;
    cin >> s1 >> s2;  // s1 is old, s2 is new

    string strRes = lcs(s1, s2);
    showDiff(s1, s2, strRes);

    return 0;
}

// Returns the longest common subsequence of s1 and s2.
string lcs(const string &s1, const string &s2)
{
    const int rowSize = s1.size() + 1;
    const int colSize = s2.size() + 1;

    // table[i][j] is the LCS of the last i characters of s1 and the last j
    // characters of s2. Row 0 and column 0 stay as empty strings.
    vector<vector<string> > table(rowSize, vector<string>(colSize));

    // rowChar and colChar hold s1 and s2 in reverse order, with '\0' at index 0.
    vector<char> rowChar(rowSize);
    vector<char> colChar(colSize);
    rowChar[0] = colChar[0] = '\0';
    for (int i = rowSize - 2, cnt = 1; i >= 0; i--, cnt++) {
        rowChar[cnt] = s1[i];
    }
    for (int i = colSize - 2, cnt = 1; i >= 0; i--, cnt++) {
        colChar[cnt] = s2[i];
    }

    for (int i = 1; i < rowSize; i++) {
        for (int j = 1; j < colSize; j++) {
            const char ch1 = rowChar[i];
            const char ch2 = colChar[j];
            if (ch1 == ch2) {
                // Same first character: put it in front of the LCS of the two tails.
                table[i][j] = ch1 + table[i - 1][j - 1];
            } else {
                // Different first characters: keep the larger of the two candidates.
                const string &str1 = table[i - 1][j];
                const string &str2 = table[i][j - 1];
                if (str1.size() == str2.size()) {
                    table[i][j] = str1 < str2 ? str2 : str1;
                } else {
                    table[i][j] = str1.size() < str2.size() ? str2 : str1;
                }
            }
        }
    }

    return table[rowSize - 1][colSize - 1];
}

// Prints the LCS, then s1 with '-' above each character that is not in the LCS,
// then s2 with '+' above each character that is not in the LCS.
void showDiff(const string &s1, const string &s2, const string &sub)
{
    cout << "LCS: " + sub << endl;
    cout << endl;

    const int subSize = sub.size();

    int tmp = s1.size();
    for (int i = 0, j = 0; i < tmp; i++) {
        if (j >= subSize || s1[i] != sub[j]) {
            cout << '-';  // deleted from old
        } else {
            cout << ' ';
            j++;
        }
    }
    cout << endl << s1 << endl;

    tmp = s2.size();
    for (int i = 0, j = 0; i < tmp; i++) {
        if (j >= subSize || s2[i] != sub[j]) {
            cout << '+';  // added in new
        } else {
            cout << ' ';
            j++;
        }
    }
    cout << endl << s2 << endl;
}