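While recently working on a content-governance service, I used the Levenshtein edit distance to judge how similar two articles are, so I implemented the algorithm myself, as follows: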
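First, define a two-dimensional array whose dimensions are the lengths of the two strings plus 1, and initialize row 0 and column 0 with 0, 1, 2, and so on.
Each remaining cell of the matrix takes the minimum over its three neighbors (above, left, and upper-left): the cell above and the cell to the left each contribute their value plus 1, while the upper-left cell contributes its value plus 0 if the two characters are the same, or plus 1 if they differ.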
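For reference, the rule above can be written as a recurrence. The notation here is an assumption made for illustration and does not appear in the original: $D(i,j)$ is the distance between the first $i$ characters of $a$ and the first $j$ characters of $b$, and $[a_i \neq b_j]$ is 1 when the characters differ and 0 otherwise.

```latex
\begin{aligned}
D(i,0) &= i, \qquad D(0,j) = j, \\
D(i,j) &= \min\bigl\{\, D(i-1,j) + 1,\;\; D(i,j-1) + 1,\;\; D(i-1,j-1) + [a_i \neq b_j] \,\bigr\}.
\end{aligned}
```

The final answer is $D(|a|,|b|)$, the bottom-right cell of the matrix.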
The code is as follows:
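public class Levenshtein {
    /**
     * Computes the Levenshtein edit distance between a and b.
     *
     * The dp matrix is initialized like this (here a has 3 characters, b has 5):
     * [0,1,2,3,4,5],
     * [1,*,*,*,*,*],
     * [2,*,*,*,*,*],
     * [3,*,*,*,*,*]
     * Each remaining cell is then filled from its upper, left,
     * and upper-left neighbors.
     */
    public int distance(String a, String b) {
        int[][] dp = new int[a.length() + 1][b.length() + 1];
        // Column 0: turning the first i characters of a into "" takes i deletions.
        for (int i = 0; i <= a.length(); i++) {
            dp[i][0] = i;
        }
        // Row 0: turning "" into the first j characters of b takes j insertions.
        for (int j = 0; j <= b.length(); j++) {
            dp[0][j] = j;
        }
        for (int i = 0; i < a.length(); i++) {
            for (int j = 0; j < b.length(); j++) {
                // Substitution costs 0 when the characters match, otherwise 1.
                int cost = a.charAt(i) == b.charAt(j) ? 0 : 1;
                // Deletion and insertion always cost 1.
                dp[i + 1][j + 1] = Math.min(
                        Math.min(dp[i][j] + cost, dp[i][j + 1] + 1),
                        dp[i + 1][j] + 1);
            }
        }
        return dp[a.length()][b.length()];
    }

    /**
     * Converts a distance into a similarity score.
     * len is the normalizing length chosen by the caller
     * (for example, the length of the longer string) and must be non-zero.
     */
    public float similarity(int dist, int len) {
        return 1 - (float) dist / len;
    }

    public static void main(String[] args) {
        String a = "This is test case of Levenshtein";
        String b = "This is test a case of Levenshtein";
        System.out.println(new Levenshtein().distance(a, b)); // prints 2
    }
}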
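Articles can be long, and the full matrix above needs memory proportional to the product of the two lengths. Since each row depends only on the row before it, two rows are enough. The following is a minimal sketch of that variant, not the implementation used in the service: the class name `LevenshteinSketch` is hypothetical, and normalizing the similarity by the longer string's length is an assumption, because the original does not say which length is passed as `len`.

```java
final class LevenshteinSketch {

    // Edit distance that keeps only two rows of the dp matrix.
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // row 0: j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // column 0: i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(
                        Math.min(prev[j - 1] + cost, prev[j] + 1),
                        curr[j - 1] + 1);
            }
            // Swap the rows so that prev always holds the last completed row.
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Assumption: normalize by the longer string's length;
    // two empty strings are treated as identical.
    static float similarity(String a, String b) {
        int len = Math.max(a.length(), b.length());
        if (len == 0) {
            return 1.0f;
        }
        return 1 - (float) distance(a, b) / len;
    }

    public static void main(String[] args) {
        String a = "This is test case of Levenshtein";
        String b = "This is test a case of Levenshtein";
        System.out.println(distance(a, b)); // prints 2
        System.out.println(similarity(a, b));
    }
}
```

With this normalization the score falls between 0 and 1, and identical strings score 1. Where to set the threshold for "similar" depends on the content being compared.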