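String edit distance (Levenshtein distance) is a measure of how similar two strings are: it is the minimum number of single-unit insertions, deletions, and substitutions needed to turn one string into the other. For Chinese text, the basic unit is often a word rather than a single character.
Algorithm description (the algorithm was proposed by the Russian scientist Levenshtein):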
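To illustrate the word-as-unit idea, here is a minimal sketch that runs the same dynamic-programming recurrence over a list of tokens instead of over characters. The class name `TokenDistance` and its `distance` method are hypothetical names chosen for this example, and the input is assumed to have been segmented into words beforehand by some other tool.

```java
import java.util.List;
import java.util.Objects;

// Hypothetical helper for illustration; not part of the original article.
final class TokenDistance {

    // Levenshtein distance where each list element (for example, one
    // segmented word) counts as a single unit.
    static <T> int distance(List<T> s, List<T> t) {
        int n = s.size();
        int m = t.size();
        int[][] d = new int[n + 1][m + 1];

        // Turning a prefix of length i into an empty list takes i deletions,
        // and building a prefix of length j from nothing takes j insertions.
        for (int i = 0; i <= n; i++) {
            d[i][0] = i;
        }
        for (int j = 0; j <= m; j++) {
            d[0][j] = j;
        }

        for (int i = 1; i <= n; i++) {
            for (int j = 1; j <= m; j++) {
                int cost = Objects.equals(s.get(i - 1), t.get(j - 1)) ? 0 : 1;
                d[i][j] = Math.min(
                        Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                        d[i - 1][j - 1] + cost);
            }
        }
        return d[n][m];
    }
}
```

With the token lists `["我", "喜欢", "苹果"]` and `["我", "喜欢", "香蕉"]`, the word-level distance is 1 (one word substituted), whereas comparing the unsegmented strings character by character gives 2.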
Step | Description |
---|---|
1 | Set n to be the length of s. Set m to be the length of t. If n = 0, return m and exit. If m = 0, return n and exit. Construct a matrix d containing rows 0..n and columns 0..m. |
2 | Initialize the first row to 0..m (d[0,j] = j). Initialize the first column to 0..n (d[i,0] = i). |
3 | Examine each character of s (i from 1 to n). |
4 | Examine each character of t (j from 1 to m). |
5 | If s[i] equals t[j], the cost is 0. If s[i] doesn't equal t[j], the cost is 1. |
6 | Set cell d[i,j] of the matrix equal to the minimum of: a. The cell immediately above plus 1: d[i-1,j] + 1. b. The cell immediately to the left plus 1: d[i,j-1] + 1. c. The cell diagonally above and to the left plus the cost: d[i-1,j-1] + cost. |
7 | After the iteration steps (3, 4, 5, 6) are complete, the distance is found in cell d[n,m]. |
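The steps in the table can be condensed into a single recurrence. The following is a restatement in equation form, with $s_i$ and $t_j$ denoting the $i$-th character of $s$ and the $j$-th character of $t$:

$$
d_{i,0} = i, \qquad d_{0,j} = j, \qquad
d_{i,j} = \min\bigl(d_{i-1,j} + 1,\; d_{i,j-1} + 1,\; d_{i-1,j-1} + \mathrm{cost}_{i,j}\bigr),
\qquad
\mathrm{cost}_{i,j} =
\begin{cases}
0 & \text{if } s_i = t_j \\
1 & \text{otherwise}
\end{cases}
$$

As a small worked example (the strings are chosen here for illustration and do not come from the original text), take $s = \texttt{cat}$ and $t = \texttt{cut}$. Rows correspond to $i = 0..3$ and columns to $j = 0..3$:

$$
d =
\begin{pmatrix}
0 & 1 & 2 & 3 \\
1 & 0 & 1 & 2 \\
2 & 1 & 1 & 2 \\
3 & 2 & 2 & 1
\end{pmatrix}
$$

The distance is the bottom-right cell, $d_{3,3} = 1$, which matches the single substitution of `a` by `u`.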
Java implementation:
```java
public class Distance {

    //****************************
    // Get minimum of three values
    //****************************
    private int Minimum(int a, int b, int c) {
        int mi = a;
        if (b < mi) {
            mi = b;
        }
        if (c < mi) {
            mi = c;
        }
        return mi;
    }

    //*****************************
    // Compute Levenshtein distance
    //*****************************
    public int LD(String s, String t) {
        int[][] d; // matrix
        int n;     // length of s
        int m;     // length of t
        int i;     // iterates through s
        int j;     // iterates through t
        char s_i;  // ith character of s
        char t_j;  // jth character of t
        int cost;  // cost

        // Step 1
        n = s.length();
        m = t.length();
        if (n == 0) {
            return m;
        }
        if (m == 0) {
            return n;
        }
        d = new int[n + 1][m + 1];

        // Step 2
        for (i = 0; i <= n; i++) {
            d[i][0] = i;
        }
        for (j = 0; j <= m; j++) {
            d[0][j] = j;
        }

        // Step 3
        for (i = 1; i <= n; i++) {
            s_i = s.charAt(i - 1);

            // Step 4
            for (j = 1; j <= m; j++) {
                t_j = t.charAt(j - 1);

                // Step 5
                if (s_i == t_j) {
                    cost = 0;
                } else {
                    cost = 1;
                }

                // Step 6
                d[i][j] = Minimum(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost);
            }
        }

        // Step 7
        return d[n][m];
    }
}
```
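Since Step 6 only ever reads the row directly above and the row being filled in, the full matrix is not strictly needed. The following is a sketch of a commonly used variant that keeps just two rows, which reduces memory from (n+1)×(m+1) cells to 2×(m+1). The class name `TwoRowDistance` is hypothetical and the code is not part of the original listing.

```java
// Hypothetical variant for illustration; not part of the original listing.
final class TwoRowDistance {

    // Same recurrence as Steps 1-7, but only the previous row and the
    // current row of the matrix are kept in memory.
    static int distance(String s, String t) {
        int n = s.length();
        int m = t.length();

        int[] prev = new int[m + 1]; // row i - 1
        int[] curr = new int[m + 1]; // row i

        // Row 0 of the matrix: d[0][j] = j.
        for (int j = 0; j <= m; j++) {
            prev[j] = j;
        }

        for (int i = 1; i <= n; i++) {
            curr[0] = i; // first column: d[i][0] = i
            for (int j = 1; j <= m; j++) {
                int cost = (s.charAt(i - 1) == t.charAt(j - 1)) ? 0 : 1;
                curr[j] = Math.min(
                        Math.min(prev[j] + 1, curr[j - 1] + 1),
                        prev[j - 1] + cost);
            }
            // The row just computed becomes the "previous" row.
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }

        // After the final swap, prev holds row n.
        return prev[m];
    }
}
```

For example, `TwoRowDistance.distance("kitten", "sitting")` returns 3 (two substitutions and one insertion). The trade-off is that the full matrix is no longer available afterwards, so the actual sequence of edits cannot be traced back from it.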