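Recently, for work, I have been studying NLP, and as a hands-on exercise I implemented an algorithm for computing edit distance: the cost of transforming one string into another. I use the Levenshtein variant in which an insertion or a deletion costs 1 and a substitution costs 2.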
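The basic logic:
For a string X of length M and a string Y of length N, let D(i,j) be the edit distance between the first i characters of X and the first j characters of Y.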
Initialization:
    D(i,0) = i    for i = 0...M
    D(0,j) = j    for j = 0...N
Recurrence Relation:
    for each i = 1...M
        for each j = 1...N
            D(i,j) = Min(D(i-1,j) + 1, D(i,j-1) + 1, X(i) == Y(j) ? D(i-1,j-1) : D(i-1,j-1) + 2)
Termination:
    D(M,N) is the distance
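To make the recurrence concrete, here is a small worked example. The strings X = "ab" and Y = "ac" (so M = N = 2) are my own choice for illustration and do not come from any particular source; the three arguments of min appear in the same order as in the recurrence above.

```latex
\begin{aligned}
D(0,0) &= 0,\quad D(1,0) = 1,\quad D(2,0) = 2,\quad D(0,1) = 1,\quad D(0,2) = 2 \\
D(1,1) &= \min\bigl(D(0,1)+1,\ D(1,0)+1,\ D(0,0)\bigr) = \min(2, 2, 0) = 0 && (\texttt{a} = \texttt{a}) \\
D(1,2) &= \min\bigl(D(0,2)+1,\ D(1,1)+1,\ D(0,1)+2\bigr) = \min(3, 1, 3) = 1 && (\texttt{a} \neq \texttt{c}) \\
D(2,1) &= \min\bigl(D(1,1)+1,\ D(2,0)+1,\ D(1,0)+2\bigr) = \min(1, 3, 3) = 1 && (\texttt{b} \neq \texttt{a}) \\
D(2,2) &= \min\bigl(D(1,2)+1,\ D(2,1)+1,\ D(1,1)+2\bigr) = \min(2, 2, 2) = 2 && (\texttt{b} \neq \texttt{c})
\end{aligned}
```

So D(2,2) = 2: turning "ab" into "ac" costs one substitution (2), which is the same as one deletion plus one insertion.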
Iterative (table-filling) implementation in C#:
public static int EditDistance(string str1, string str2)
{
    int len1 = str1.Length;
    int len2 = str2.Length;

    // table[i, j] = edit distance between the first i characters of str1
    // and the first j characters of str2.
    int[,] table = new int[len1 + 1, len2 + 1];

    // Initialization: turning a prefix into the empty string (or the reverse)
    // takes one deletion (or insertion) per character.
    for (int i = 0; i <= len1; i++)
    {
        table[i, 0] = i;
    }
    for (int j = 0; j <= len2; j++)
    {
        table[0, j] = j;
    }

    // Recurrence: every cell depends only on its left, upper and upper-left neighbours.
    for (int i = 1; i <= len1; i++)
    {
        for (int j = 1; j <= len2; j++)
        {
            // Matching characters cost nothing; a substitution costs 2.
            int temp = (str1[i - 1] == str2[j - 1])
                ? table[i - 1, j - 1]
                : table[i - 1, j - 1] + 2;
            table[i, j] = Min(table[i, j - 1] + 1, table[i - 1, j] + 1, temp);
        }
    }

    return table[len1, len2];
}

// Returns the smallest of three values.
public static int Min(int val1, int val2, int val3)
{
    int smaller = val1 < val2 ? val1 : val2;
    return smaller < val3 ? smaller : val3;
}
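Recursive version (a direct translation of the recurrence; it recomputes the same subproblems repeatedly, so its running time grows exponentially with the string lengths):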
public static int EditDistanceD(string str1, string str2, int len1, int len2)
{
    // Base case: if one prefix is empty, the distance is the length of the other
    // (that many insertions or deletions).
    if (len1 == 0 || len2 == 0)
    {
        return Max(len1, len2);
    }

    // Only the first len1 / len2 characters are ever indexed, so the strings
    // can be passed along unchanged instead of being copied with Substring.
    int substituteCost = (str1[len1 - 1] == str2[len2 - 1]) ? 0 : 2;

    return Min(
        EditDistanceD(str1, str2, len1 - 1, len2 - 1) + substituteCost,
        EditDistanceD(str1, str2, len1 - 1, len2) + 1,
        EditDistanceD(str1, str2, len1, len2 - 1) + 1);
}

// Returns the larger of two values.
public static int Max(int val1, int val2)
{
    return val1 > val2 ? val1 : val2;
}
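The plain recursion solves the same (len1, len2) subproblem many times. One common remedy, which is not part of my original code, is to cache each result (memoization). Below is a minimal sketch under that assumption; the names `EditDistanceMemo`, `Compute` and `Solve` are hypothetical and chosen only for this illustration. It keeps the same costs: 1 for an insertion or deletion, 2 for a substitution.

```csharp
using System;

public static class EditDistanceMemo
{
    // Hypothetical entry point: top-down evaluation of the same recurrence,
    // caching D(i, j) so that each subproblem is solved only once.
    public static int Compute(string x, string y)
    {
        // memo[i, j] holds D(i, j) once computed; -1 marks "not computed yet".
        var memo = new int[x.Length + 1, y.Length + 1];
        for (int i = 0; i <= x.Length; i++)
        {
            for (int j = 0; j <= y.Length; j++)
            {
                memo[i, j] = -1;
            }
        }
        return Solve(x, y, x.Length, y.Length, memo);
    }

    private static int Solve(string x, string y, int i, int j, int[,] memo)
    {
        // Base case: one prefix is empty, so only insertions or deletions remain.
        if (i == 0 || j == 0)
        {
            return Math.Max(i, j);
        }

        // Reuse a previously computed result if there is one.
        if (memo[i, j] >= 0)
        {
            return memo[i, j];
        }

        // Matching characters cost 0, a substitution costs 2.
        int substitution = Solve(x, y, i - 1, j - 1, memo)
                           + (x[i - 1] == y[j - 1] ? 0 : 2);
        int deletion = Solve(x, y, i - 1, j, memo) + 1;
        int insertion = Solve(x, y, i, j - 1, memo) + 1;

        memo[i, j] = Math.Min(substitution, Math.Min(deletion, insertion));
        return memo[i, j];
    }
}
```

With this sketch, `EditDistanceMemo.Compute("intention", "execution")` returns 8 and `EditDistanceMemo.Compute("ab", "ac")` returns 2. Since at most (M+1)(N+1) table entries are ever filled, the work is proportional to M×N, the same as the iterative table.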
For a detailed explanation, see:
http://blog.csdn.net/huaweidong2011/article/details/7727482