The Levenshtein Distance Algorithm

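Edit distance is the minimum number of insertions, deletions, and substitutions needed to transform a source string (s) into a target string (t). It is widely used in NLP, for example in evaluation metrics such as WER and mWER, and it is also commonly used to count the changes made to an original text.
The edit distance algorithm was first proposed by the Russian scientist Levenshtein, which is why it is also called Levenshtein distance.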
Levenshtein distance (LD) is a measure of how different two strings are, which we will refer to as the source string (s) and the target string (t). The distance is the minimum number of deletions, insertions, or substitutions required to transform s into t. For example,

  • If s is "test" and t is "test", then LD(s,t) = 0, because no transformations are needed. The strings are already identical.
  • If s is "test" and t is "tent", then LD(s,t) = 1, because one substitution (change "s" to "n") is sufficient to transform s into t.

The greater the Levenshtein distance, the more different the strings are.
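
The definition can be turned into code almost word for word. The following is a minimal sketch, assuming C# 9 or later (top-level statements); the function name `Ld` is made up for illustration. It is a naive recursion that takes exponential time, so it is only suitable for short strings like the ones above.

```csharp
using System;

// Naive recursive form of the definition (hypothetical helper, illustration only).
static int Ld(string s, string t)
{
    if (s.Length == 0) return t.Length;   // insert every character of t
    if (t.Length == 0) return s.Length;   // delete every character of s

    // Compare the last characters, then recurse on the shorter prefixes.
    int cost = s[s.Length - 1] == t[t.Length - 1] ? 0 : 1;
    string sHead = s.Substring(0, s.Length - 1);
    string tHead = t.Substring(0, t.Length - 1);

    return Math.Min(
        Math.Min(Ld(sHead, t) + 1,        // delete the last character of s
                 Ld(s, tHead) + 1),       // insert the last character of t
        Ld(sHead, tHead) + cost);         // substitute, or keep a match
}

Console.WriteLine(Ld("test", "test")); // 0
Console.WriteLine(Ld("test", "tent")); // 1
```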

 

Levenshtein distance is named after the Russian scientist Vladimir Levenshtein, who devised the algorithm in 1965. If you can't spell or pronounce Levenshtein, the metric is also sometimes called edit distance.

The Levenshtein distance algorithm has been used in:

  • Spell checking
  • Speech recognition
  • DNA analysis
  • Plagiarism detection

The Algorithm

Steps

Step  Description
1     Set n to be the length of s.
      Set m to be the length of t.
      If n = 0, return m and exit.
      If m = 0, return n and exit.
      Construct a matrix d containing 0..m rows and 0..n columns; d[i,j] is the cell in column i, row j.
2     Initialize the first row to 0..n.
      Initialize the first column to 0..m.
3     Examine each character of s (i from 1 to n).
4     Examine each character of t (j from 1 to m).
5     If s[i] equals t[j], the cost is 0.
      If s[i] doesn't equal t[j], the cost is 1.
6     Set cell d[i,j] of the matrix equal to the minimum of:
      a. The cell immediately to the left plus 1: d[i-1,j] + 1.
      b. The cell immediately above plus 1: d[i,j-1] + 1.
      c. The cell diagonally above and to the left plus the cost: d[i-1,j-1] + cost.
7     After the iteration steps (3, 4, 5, 6) are complete, the distance is found in cell d[n,m].
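
Steps 2 to 7 can also be written as a recurrence. This is a sketch in the same symbols as the steps (d, s, t, n, m), where s_i and t_j are the i-th and j-th characters counted from 1; the bracket [s_i ≠ t_j] is notation assumed here for the cost, equal to 1 when the characters differ and 0 when they match.

```latex
\begin{aligned}
d_{i,0} &= i, \quad 0 \le i \le n, \\
d_{0,j} &= j, \quad 0 \le j \le m, \\
d_{i,j} &= \min\bigl( d_{i-1,j} + 1,\; d_{i,j-1} + 1,\; d_{i-1,j-1} + [s_i \ne t_j] \bigr),
          \quad 1 \le i \le n,\ 1 \le j \le m, \\
\mathrm{LD}(s,t) &= d_{n,m}.
\end{aligned}
```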

Example

This section shows how the Levenshtein distance is computed when the source string is "GUMBO" and the target string is "GAMBOL".

Steps 1 and 2

        G  U  M  B  O
     0  1  2  3  4  5
  G  1
  A  2
  M  3
  B  4
  O  5
  L  6

Steps 3 to 6, when i = 1

        G  U  M  B  O
     0  1  2  3  4  5
  G  1  0
  A  2  1
  M  3  2
  B  4  3
  O  5  4
  L  6  5

Steps 3 to 6, when i = 2

        G  U  M  B  O
     0  1  2  3  4  5
  G  1  0  1
  A  2  1  1
  M  3  2  2
  B  4  3  3
  O  5  4  4
  L  6  5  5

Steps 3 to 6, when i = 3

        G  U  M  B  O
     0  1  2  3  4  5
  G  1  0  1  2
  A  2  1  1  2
  M  3  2  2  1
  B  4  3  3  2
  O  5  4  4  3
  L  6  5  5  4

Steps 3 to 6, when i = 4

        G  U  M  B  O
     0  1  2  3  4  5
  G  1  0  1  2  3
  A  2  1  1  2  3
  M  3  2  2  1  2
  B  4  3  3  2  1
  O  5  4  4  3  2
  L  6  5  5  4  3

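Steps 3 to 6, when i = 5

        G  U  M  B  O
     0  1  2  3  4  5
  G  1  0  1  2  3  4
  A  2  1  1  2  3  4
  M  3  2  2  1  2  3
  B  4  3  3  2  1  2
  O  5  4  4  3  2  1
  L  6  5  5  4  3  2

Step 7: the distance is the value in the lower-right corner of the matrix, which is 2.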
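
The worked example can be reproduced in code. The sketch below assumes C# 9 or later (top-level statements); `BuildMatrix` is a hypothetical helper name. It lays the matrix out as in the example, with the source string across the columns and the target string down the rows, and prints every cell.

```csharp
using System;

string source = "GUMBO";   // across the top (columns)
string target = "GAMBOL";  // down the side (rows)

int[,] matrix = BuildMatrix(source, target);

// Print the matrix row by row, three characters per cell.
for (int row = 0; row <= target.Length; row++)
{
    for (int col = 0; col <= source.Length; col++)
        Console.Write(matrix[row, col].ToString().PadLeft(3));
    Console.WriteLine();
}
Console.WriteLine("LD = " + matrix[target.Length, source.Length]); // LD = 2

// Hypothetical helper: fills the full matrix, one column (one value of i) at a time.
static int[,] BuildMatrix(string s, string t)
{
    var d = new int[t.Length + 1, s.Length + 1];
    for (int col = 0; col <= s.Length; col++) d[0, col] = col;  // first row: 0..n
    for (int row = 0; row <= t.Length; row++) d[row, 0] = row;  // first column: 0..m

    for (int col = 1; col <= s.Length; col++)
    {
        for (int row = 1; row <= t.Length; row++)
        {
            int cost = s[col - 1] == t[row - 1] ? 0 : 1;
            d[row, col] = Math.Min(
                Math.Min(d[row, col - 1] + 1,       // cell to the left
                         d[row - 1, col] + 1),      // cell above
                d[row - 1, col - 1] + cost);        // diagonal cell
        }
    }
    return d;
}
```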

 

Algorithm example 1:

private int ComputeDistance(string s, string t)
{
    int n = s.Length;
    int m = t.Length;
    if (n == 0) return m;
    if (m == 0) return n;

    // distance[i, j] is the edit distance between the first i characters of s
    // and the first j characters of t.
    int[,] distance = new int[n + 1, m + 1];

    // Initialize the first column and the first row.
    for (int i = 0; i <= n; i++) distance[i, 0] = i;
    for (int j = 0; j <= m; j++) distance[0, j] = j;

    // Fill in the rest of the matrix.
    for (int i = 1; i <= n; i++)
    {
        for (int j = 1; j <= m; j++)
        {
            int cost = (s[i - 1] == t[j - 1]) ? 0 : 1;
            distance[i, j] = Math.Min(
                Math.Min(distance[i - 1, j] + 1,      // deletion
                         distance[i, j - 1] + 1),     // insertion
                distance[i - 1, j - 1] + cost);       // substitution or match
        }
    }
    return distance[n, m];
}
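Algorithm example 2:
        private int Levenshtein(string str1, string str2)
        {
            int n = str1.Length;
            int m = str2.Length;
            int i;       // index into str1
            int j;       // index into str2
            char ch1;    // current character of str1
            char ch2;    // current character of str2
            int temp;    // substitution cost: 0 if the characters match, otherwise 1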
           
            if(n == 0)
            {
                return m;
            }
            if(m == 0)
            {
                return n;
            }
            int[,] d = new int[n+1,m+1];
            for(i=0; i<=n; i++) 
            {    // initialize the first column
                d[i,0] = i;
            }
            for(j=0; j<=m; j++)
            {    // initialize the first row
                d[0,j] = j;
            }
            for(i=1; i<=n; i++) 
            {    // iterate over str1
                ch1 = str1[i-1];
                // compare against each character of str2
                for(j=1; j<=m; j++)
                {
                    ch2 = str2[j-1];
                    if(ch1 == ch2)
                    {
                        temp = 0;
                    } else
                    {
                        temp = 1;
                    }
                    // take the minimum of above + 1, left + 1, and upper-left + temp
                    d[i, j] = Min(d[i - 1, j] + 1, d[i, j - 1] + 1, d[i - 1, j - 1] + temp);
                }
            }
            return d[n,m];
        }
        private int Min(int one, int two, int three)
        {
            int min = one;
            if (two < min)
            {
                min = two;
            }
            if (three < min)
            {
                min = three;
            }
            return min;
        }
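        private double Sim(String str1, String str2)
        {
            int maxLen = Math.Max(str1.Length, str2.Length);
            if (maxLen == 0)
            {
                return 1.0;    // two empty strings are identical; avoids dividing by zero
            }
            int ld = Levenshtein(str1, str2);
            return 1 - (double)ld / maxLen;
        }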
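
A normalized similarity like this is convenient for ranking candidates, for example in a spell checker. The sketch below assumes C# 9 or later (top-level statements); the word list and the helper names `Distance` and `Similarity` are made up for illustration, and the empty-string guard is an added assumption rather than part of the original listing.

```csharp
using System;
using System.Linq;

string[] dictionary = { "tent", "text", "toast", "rest" };  // hypothetical word list
string input = "test";

// Rank the candidates from most to least similar to the input.
var ranked = dictionary
    .Select(w => new { Word = w, Score = Similarity(input, w) })
    .OrderByDescending(x => x.Score);

foreach (var x in ranked)
    Console.WriteLine($"{x.Word}: {x.Score:F2}");

// Full-matrix Levenshtein distance (hypothetical helper).
static int Distance(string a, string b)
{
    var d = new int[a.Length + 1, b.Length + 1];
    for (int i = 0; i <= a.Length; i++) d[i, 0] = i;
    for (int j = 0; j <= b.Length; j++) d[0, j] = j;

    for (int i = 1; i <= a.Length; i++)
        for (int j = 1; j <= b.Length; j++)
        {
            int cost = a[i - 1] == b[j - 1] ? 0 : 1;
            d[i, j] = Math.Min(Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1),
                               d[i - 1, j - 1] + cost);
        }
    return d[a.Length, b.Length];
}

// 1.0 means identical, 0.0 means nothing in common.
static double Similarity(string a, string b)
{
    int longest = Math.Max(a.Length, b.Length);
    if (longest == 0) return 1.0;   // both strings empty: treat as identical
    return 1.0 - (double)Distance(a, b) / longest;
}
```

Dividing by the length of the longer string keeps the score between 0 and 1, because the distance can never exceed that length.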

Algorithm example 3: space complexity reduced from O(n*m) to O(2m) by keeping only two vectors

        ///*****************************
        /// Compute Levenshtein distance
        /// Memory efficient version
        /// Returns the distance scaled to 0..100 by the length of the longer string
        ///*****************************
        public int iLD(String sRow, String sCol)
        {
            int RowLen = sRow.Length;  // length of sRow
            int ColLen = sCol.Length;  // length of sCol
            int RowIdx;                // iterates through sRow
            int ColIdx;                // iterates through sCol
            char Row_i;                // ith character of sRow
            char Col_j;                // jth character of sCol
            int cost;                   // cost
            /// Test string length
            /// (string.Length is an int, so this check can never actually fire)
            if (Math.Max(sRow.Length, sCol.Length) > Math.Pow(2, 31))
                throw (new Exception("\nMaximum string length in Levenshtein.iLD is " + Math.Pow(2, 31) + ".\nYours is " + Math.Max(sRow.Length, sCol.Length) + "."));
            // Step 1
            if (RowLen == 0)
            {
                return ColLen;
            }
            if (ColLen == 0)
            {
                return RowLen;
            }
            /// Create the two vectors
            int[] v0 = new int[RowLen + 1];
            int[] v1 = new int[RowLen + 1];
            int[] vTmp;

           
            /// Step 2
            /// Initialize the first vector
            for (RowIdx = 1; RowIdx <= RowLen; RowIdx++)
            {
                v0[RowIdx] = RowIdx;
            }
            // Step 3
            /// For each column
            for (ColIdx = 1; ColIdx <= ColLen; ColIdx++)
            {
                /// Set the 0'th element to the column number
                v1[0] = ColIdx;
                Col_j = sCol[ColIdx - 1];

                // Step 4
                /// For each row
                for (RowIdx = 1; RowIdx <= RowLen; RowIdx++)
                {
                    Row_i = sRow[RowIdx - 1];

                    // Step 5
                    if (Row_i == Col_j)
                    {
                        cost = 0;
                    }
                    else
                    {
                        cost = 1;
                    }
                    // Step 6
                    /// Find minimum
                    int m_min = v0[RowIdx] + 1;
                    int b = v1[RowIdx - 1] + 1;
                    int c = v0[RowIdx - 1] + cost;
                    if (b < m_min)
                    {
                        m_min = b;
                    }
                    if (c < m_min)
                    {
                        m_min = c;
                    }
                    v1[RowIdx] = m_min;
                }
                /// Swap the vectors
                vTmp = v0;
                v0 = v1;
                v1 = vTmp;
            }
               
            // Step 7
            /// The vectors were swapped one last time at the end of the last loop,
            /// which is why the result is now in v0 rather than in v1.
            System.Console.WriteLine("iDist=" + v0[RowLen]);
            /// Scale the distance to a value between 0 and 100:
            /// 0 == perfect match, 100 == totally different.
            int max = System.Math.Max(RowLen, ColLen);
            return ((100 * v0[RowLen]) / max);
        }
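
When the raw distance is wanted instead of the 0..100 score, the same two-vector idea can be sketched as follows. This assumes C# 9 or later (top-level statements); `TwoRowDistance` is a hypothetical name, and the block is a simplified variant of the technique, not a drop-in replacement for the listing above.

```csharp
using System;

// Two-row Levenshtein distance: only the previous and current rows are kept,
// so memory use is proportional to the length of t, not to n * m.
static int TwoRowDistance(string s, string t)
{
    if (s.Length == 0) return t.Length;
    if (t.Length == 0) return s.Length;

    int[] prev = new int[t.Length + 1];
    int[] curr = new int[t.Length + 1];
    for (int j = 0; j <= t.Length; j++) prev[j] = j;   // row for the empty prefix of s

    for (int i = 1; i <= s.Length; i++)
    {
        curr[0] = i;                                   // deleting i characters of s
        for (int j = 1; j <= t.Length; j++)
        {
            int cost = s[i - 1] == t[j - 1] ? 0 : 1;
            curr[j] = Math.Min(
                Math.Min(prev[j] + 1,                  // deletion
                         curr[j - 1] + 1),             // insertion
                prev[j - 1] + cost);                   // substitution or match
        }
        (prev, curr) = (curr, prev);                   // reuse the arrays for the next row
    }
    // After the final swap, the last completed row is in prev.
    return prev[t.Length];
}

Console.WriteLine(TwoRowDistance("GUMBO", "GAMBOL")); // 2
```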
From:
http://hi.baidu.com/xining52113339/blog/item/8a23f1388ddfc523b9998f47.html
http://hi.baidu.com/pecefull0513/blog/item/a746ca1a292b9c118618bfbd.html
http://www.codeproject.com/KB/recipes/improvestringsimilarity.aspx
http://en.wikipedia.org/wiki/Levenshtein_distance
http://www.codeproject.com/KB/recipes/Levenshtein.aspx
