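In this post we look at another variant of the longest common subsequence problem: measuring string similarity by computing the minimum edit distance. It is a very practical algorithm, useful in DNA comparison, web page clustering, and similar areas.
1: Concept
Given two strings A and B, we can turn A into B (or B into A) using the basic operations of insertion, deletion, and substitution. The minimum number of steps needed for this transformation is called the "edit distance".
Take the following two strings as an example: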
dcga
edcb
After working through the possible operations, the edit distance turns out to be 3. Did you spot it?
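To make the claim of 3 concrete, here is a minimal sketch that replays one possible three-step edit script by hand. The class name EditScriptDemo and the method applyScript are hypothetical names introduced only for this illustration, and the script shown is just one of possibly several that reach a distance of 3.

```java
// Hypothetical helper: replays one possible 3-step edit script on "dcga".
class EditScriptDemo {
    static String applyScript(String source) {
        StringBuilder sb = new StringBuilder(source);
        sb.insert(0, 'e');     // step 1: insert 'e' at the front -> "edcga"
        sb.deleteCharAt(3);    // step 2: delete 'g'              -> "edca"
        sb.setCharAt(3, 'b');  // step 3: change 'a' to 'b'       -> "edcb"
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(applyScript("dcga"));
    }
}
```

This only shows that 3 steps are enough; that no shorter script exists is what the dynamic programming below establishes.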
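2: Analysis
Let A and B be two strings. We want to convert string A into string B using the fewest character operations. The character operations allowed are:
- (1) delete a character;
- (2) insert a character;
- (3) change one character into another.
Algorithm:
First fill in the first row and the first column (d[i][0] = i and d[0][j] = j). Then compute each value d[i][j] as follows, where s1[i] and s2[j] denote the i-th and j-th characters counted from 1: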
d[i][j] = min(d[i-1][j]+1, d[i][j-1]+1, d[i-1][j-1]+(s1[i]==s2[j]?0:1));
The value in the last row and last column is the minimum edit distance.
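The table-filling procedure above can be sketched as follows. This is an illustrative sketch, not the implementation given later in the post: the class name EditDistanceTable and the method buildTable are assumptions made for this example. It returns the whole table so that every intermediate d[i][j] can be inspected.

```java
// Hypothetical sketch: builds and prints the full table d for two strings.
class EditDistanceTable {
    static int[][] buildTable(String s1, String s2) {
        int[][] d = new int[s1.length() + 1][s2.length() + 1];
        // First column and first row: distance to or from the empty prefix.
        for (int i = 0; i <= s1.length(); i++) d[i][0] = i;
        for (int j = 0; j <= s2.length(); j++) d[0][j] = j;
        for (int i = 1; i <= s1.length(); i++) {
            for (int j = 1; j <= s2.length(); j++) {
                // charAt is 0-based, so the i-th character is charAt(i - 1).
                int cost = s1.charAt(i - 1) == s2.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                                   d[i - 1][j - 1] + cost);
            }
        }
        return d;
    }

    public static void main(String[] args) {
        int[][] d = buildTable("dcga", "edcb");
        for (int[] row : d) {
            StringBuilder line = new StringBuilder();
            for (int v : row) line.append(v).append(' ');
            System.out.println(line.toString().trim());
        }
    }
}
```

The bottom-right cell, d[s1.length()][s2.length()], holds the final answer.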
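Analysis
dp[i][j] is the minimum edit distance between the first i characters of a and the first j characters of b. dp[i][j] can come from three states:
1. Deletion: dp[i-1][j]+1;
2. Insertion: dp[i][j-1]+1;
3. Substitution: if(a[i]==b[j]) dp[i][j]=dp[i-1][j-1], else dp[i][j]=dp[i-1][j-1]+1;
Another way to write it:
①: when Xi = Yj, C[i,j]=C[i-1,j-1];
②: when Xi != Yj, C[i,j]=Min{C[i-1,j-1],C[i-1,j],C[i,j-1]}+1;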
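The two-case recurrence can also be read directly as a recursive function. The following is a sketch under that reading, using memoization so each C[i,j] is computed once. The class name EditDistanceMemo and its methods are hypothetical names for this example, and the recursion depth grows with the combined length of the strings, so very long inputs may be better served by an iterative table.

```java
import java.util.Arrays;

// Hypothetical sketch: top-down (memoized) version of the recurrence for C[i,j].
class EditDistanceMemo {
    private static int[][] memo;

    static int distance(String x, String y) {
        memo = new int[x.length() + 1][y.length() + 1];
        for (int[] row : memo) Arrays.fill(row, -1); // -1 marks "not computed yet"
        return solve(x, y, x.length(), y.length());
    }

    private static int solve(String x, String y, int i, int j) {
        if (i == 0) return j; // empty prefix of x: insert j characters
        if (j == 0) return i; // empty prefix of y: delete i characters
        if (memo[i][j] != -1) return memo[i][j];
        int result;
        if (x.charAt(i - 1) == y.charAt(j - 1)) {
            result = solve(x, y, i - 1, j - 1);               // case ①
        } else {
            result = 1 + Math.min(solve(x, y, i - 1, j - 1),  // case ②
                         Math.min(solve(x, y, i - 1, j), solve(x, y, i, j - 1)));
        }
        memo[i][j] = result;
        return result;
    }

    public static void main(String[] args) {
        System.out.println(distance("dcga", "edcb"));
    }
}
```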
3: Code
/**
 * Created by xuming on 2016/6/16.
 */
public class MinEditDistance {
    public static void main(String[] args) {
        String str1 = "dcga";
        String str2 = "edcb";
        int dis = getEditDistance(str1, str2);
        System.out.println("str1:" + str1 + ";str2:" + str2 + "; the distance is :" + dis);
    }

    /** Returns the minimum number of insertions, deletions and substitutions that turn str1 into str2. */
    public static int getEditDistance(String str1, String str2) {
        int[][] matrix = new int[str1.length() + 1][str2.length() + 1];
        // Boundary: converting a prefix of length i to the empty string takes i deletions.
        for (int i = 0; i <= str1.length(); i++) {
            matrix[i][0] = i;
        }
        // Boundary: converting the empty string to a prefix of length j takes j insertions.
        for (int j = 0; j <= str2.length(); j++) {
            matrix[0][j] = j;
        }
        // Rows: prefixes of str1.
        for (int i = 1; i <= str1.length(); i++) {
            // Columns: prefixes of str2.
            for (int j = 1; j <= str2.length(); j++) {
                if (str1.charAt(i - 1) == str2.charAt(j - 1)) {
                    // Characters match: no extra operation needed.
                    matrix[i][j] = matrix[i - 1][j - 1];
                } else {
                    // Smaller of the cell above (deletion) and the cell to the left (insertion).
                    int temp = Math.min(matrix[i - 1][j], matrix[i][j - 1]);
                    // Compare with the diagonal cell (substitution), then add one operation.
                    matrix[i][j] = Math.min(temp, matrix[i - 1][j - 1]) + 1;
                }
            }
        }
        return matrix[str1.length()][str2.length()];
    }
}
4: Result
Running the program prints:
str1:dcga;str2:edcb; the distance is :3
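Each cell of the table depends only on the current row and the row directly above it, so the full two-dimensional array is not strictly needed. The following sketch keeps just two rows. It is a swapped-in space optimization rather than the version used in this post, and the class name EditDistanceTwoRows is an assumption made for this example.

```java
// Hypothetical sketch: same recurrence, but storing only two rows of the table.
class EditDistanceTwoRows {
    static int distance(String s1, String s2) {
        int[] prev = new int[s2.length() + 1]; // row i - 1
        int[] curr = new int[s2.length() + 1]; // row i
        for (int j = 0; j <= s2.length(); j++) prev[j] = j;
        for (int i = 1; i <= s1.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= s2.length(); j++) {
                if (s1.charAt(i - 1) == s2.charAt(j - 1)) {
                    curr[j] = prev[j - 1];
                } else {
                    curr[j] = 1 + Math.min(prev[j - 1], Math.min(prev[j], curr[j - 1]));
                }
            }
            // Swap the rows so that the finished row becomes "prev" for the next pass.
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[s2.length()];
    }

    public static void main(String[] args) {
        System.out.println(distance("dcga", "edcb"));
    }
}
```

Memory use drops from one cell per pair of prefixes to two rows of length s2.length() + 1, at the cost of no longer being able to trace back which operations were chosen.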
5: Appendix (a similar problem)
Time Limit: 1000MS | Memory Limit: 65536K
Total Submissions: 12332 | Accepted: 4623
Description
Let x and y be two strings over some finite alphabet A. We would like to transform x into y allowing only operations given below:
- Deletion: a letter in x is missing in y at a corresponding position.
- Insertion: a letter in y is missing in x at a corresponding position.
- Change: letters at corresponding positions are distinct
Certainly, we would like to minimize the number of all possible operations.
Illustration
A G T A A G T * A G G C
| | |       |   |   | |
A G T * C * T G A C G C
Deletion: * in the bottom line
Insertion: * in the top line
Change: when the letters at the top and bottom are distinct
This tells us that to transform x = AGTCTGACGC into y = AGTAAGTAGGC we would be required to perform 5 operations (2 changes, 2 deletions and 1 insertion). If we want to minimize the number of operations, we should do it like
A G T A A G T A G G C
| | |     |   |   | |
A G T C T G * A C G C
and 4 moves would be required (3 changes and 1 deletion).
In this problem we would always consider strings x and y to be fixed, such that the number of letters in x is m and the number of letters in y is n where n ≥ m.
Assign 1 as the cost of an operation performed. Otherwise, assign 0 if there is no operation performed.
Write a program that would minimize the number of possible operations to transform any string x into a string y.
Input
The input consists of the strings x and y prefixed by their respective lengths, which are within 1000.
Output
An integer representing the minimum number of possible operations to transform any string x into a string y.
Sample Input
10 AGTCTGACGC
11 AGTAAGTAGGC
Sample Output
4
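A possible solution sketch for this problem reuses the same recurrence and adds input parsing. The class name AgtcSketch and the method names are hypothetical. The loop that keeps reading pairs until the input ends is an assumption, since the statement above does not say whether the input holds one or several test cases.

```java
import java.util.Scanner;

// Hypothetical sketch: reads "length string" pairs and prints the edit distance.
class AgtcSketch {
    static int distance(String x, String y) {
        int[][] d = new int[x.length() + 1][y.length() + 1];
        for (int i = 0; i <= x.length(); i++) d[i][0] = i;
        for (int j = 0; j <= y.length(); j++) d[0][j] = j;
        for (int i = 1; i <= x.length(); i++) {
            for (int j = 1; j <= y.length(); j++) {
                int cost = x.charAt(i - 1) == y.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                                   d[i - 1][j - 1] + cost);
            }
        }
        return d[x.length()][y.length()];
    }

    // Assumption: keep reading (length, x, length, y) groups until input runs out.
    static String solve(Scanner in) {
        StringBuilder out = new StringBuilder();
        while (in.hasNextInt()) {
            in.nextInt();          // declared length of x (not needed by the DP)
            String x = in.next();
            in.nextInt();          // declared length of y
            String y = in.next();
            out.append(distance(x, y)).append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.print(solve(new Scanner(System.in)));
    }
}
```

With strings of at most 1000 characters, the table has about a million int cells, which should fit comfortably within the stated memory limit.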
Source