Chinese String Fuzzy Matching Algorithm | C# Levenshtein Distance

Original post, 2011-01-25 13:58:00

2010-01-06 09:08:09  

C# Levenshtein Distance
by Sam Allen - Updated November 27, 2009
You want to match approximate strings with fuzzy logic, using the Levenshtein distance algorithm. Many projects need this logic, including programs that manage prescription drugs, spell-checkers, suggestion searches and plagiarism detectors. Here we see a simple but complete implementation of this algorithm using the C# programming language.

Words:                ant, aunt
Levenshtein distance: 1
Note:                 Only 1 edit is needed.
                      The 'u' must be inserted at index 1 (zero-based).

Words:                Samantha, Sam
Levenshtein distance: 5
Note:                 The final 5 letters must be removed.

Words:                Flomax, Volmax
Levenshtein distance: 3
Note:                 The first 3 letters must be changed.
                      Drug names are commonly confused.

Levenshtein algorithm
First, credit goes to Vladimir Levenshtein, a Russian scientist. Here we see the C# code I adapted and optimized. It uses a two-dimensional array (int[,]) instead of a jagged array (int[][]) because the table is rectangular: it has one fixed width and one fixed height, so a single allocation covers it.

=== Program that implements the algorithm (C#) ===

using System;

/// <summary>
/// Contains approximate string matching
/// </summary>
static class LevenshteinDistance
{
    /// <summary>
    /// Compute the distance between two strings.
    /// </summary>
    public static int Compute(string s, string t)
    {
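        int n = s.Length;
        int m = t.Length;

        // Step 1: if either string is empty, the distance is the
        // length of the other string.
        if (n == 0)
        {
            return m;
        }

        if (m == 0)
        {
            return n;
        }

        // Allocate the table only after the early returns above.
        int[,] d = new int[n + 1, m + 1];

        // Step 2: initialize the first column (0..n) and first row (0..m).
        for (int i = 0; i <= n; i++)
        {
            d[i, 0] = i;
        }

        for (int j = 0; j <= m; j++)
        {
            d[0, j] = j;
        }

        // Step 3: walk each character of s.
        for (int i = 1; i <= n; i++)
        {
            // Step 4: walk each character of t.
            for (int j = 1; j <= m; j++)
            {
                // Step 5: a substitution costs 0 if the characters match, else 1.
                int cost = (t[j - 1] == s[i - 1]) ? 0 : 1;

                // Step 6: take the cheapest of deletion, insertion, and substitution.
                d[i, j] = Math.Min(
                    Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1),
                    d[i - 1, j - 1] + cost);
            }
        }

        // Step 7: the bottom-right cell holds the distance.
        return d[n, m];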
    }
}

class Program
{
    static void Main()
    {
        Console.WriteLine(LevenshteinDistance.Compute("aunt", "ant"));
        Console.WriteLine(LevenshteinDistance.Compute("Sam", "Samantha"));
        Console.WriteLine(LevenshteinDistance.Compute("flomax", "volmax"));
    }
}

=== Output from the program ===

1
5
3

Description. The Compute method is static. It doesn't need to store state or instance data, which means you can declare it as static. This can also help performance, because calls to static methods avoid callvirt instructions. You can verify that the above implementation is the standard version of Levenshtein by comparing it with the description in an algorithms textbook.
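One way to check the implementation by hand is to print the whole table rather than only the final cell. The sketch below is an illustration, not part of the original article: LevenshteinTable, Build, and TableDemo are hypothetical names, and Build repeats the same recurrence as Compute so that the block compiles on its own.

```csharp
using System;

static class LevenshteinTable
{
    // Builds the full (s.Length + 1) x (t.Length + 1) table with the same
    // recurrence as Compute. The distance is the bottom-right cell.
    public static int[,] Build(string s, string t)
    {
        int n = s.Length;
        int m = t.Length;
        int[,] d = new int[n + 1, m + 1];

        // First column and first row: distance from or to the empty prefix.
        for (int i = 0; i <= n; i++) d[i, 0] = i;
        for (int j = 0; j <= m; j++) d[0, j] = j;

        for (int i = 1; i <= n; i++)
        {
            for (int j = 1; j <= m; j++)
            {
                int cost = (s[i - 1] == t[j - 1]) ? 0 : 1;
                d[i, j] = Math.Min(
                    Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1),
                    d[i - 1, j - 1] + cost);
            }
        }
        return d;
    }
}

class TableDemo
{
    static void Main()
    {
        // Rows correspond to prefixes of "ant", columns to prefixes of "aunt".
        int[,] d = LevenshteinTable.Build("ant", "aunt");
        for (int i = 0; i < d.GetLength(0); i++)
        {
            for (int j = 0; j < d.GetLength(1); j++)
            {
                Console.Write("{0,3}", d[i, j]);
            }
            Console.WriteLine();
        }
    }
}
```

For "ant" and "aunt" the bottom-right cell is 1, which agrees with the first example in this article.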

Performance notes. I adapted the code above from another source and optimized it so that it is three times faster. However, there are faster variants of the Levenshtein algorithm for some scenarios. [Levenshtein distance - wikipedia.org]
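As an example of such a variant, a commonly described approach keeps only two rows of the table instead of all of it. The sketch below is not the article's code: LevenshteinTwoRows and TwoRowsDemo are hypothetical names, and the sketch makes no speed claim. It only reduces memory from (n + 1) * (m + 1) integers to 2 * (m + 1); whether it runs faster depends on the inputs and was not measured here.

```csharp
using System;

static class LevenshteinTwoRows
{
    // Same recurrence as the full-table version, but only the previous
    // and current rows are kept in memory.
    public static int Compute(string s, string t)
    {
        if (s == null) throw new ArgumentNullException("s");
        if (t == null) throw new ArgumentNullException("t");

        if (s.Length == 0) return t.Length;
        if (t.Length == 0) return s.Length;

        int[] previous = new int[t.Length + 1];
        int[] current = new int[t.Length + 1];

        // Row 0: distance from the empty prefix of s to each prefix of t.
        for (int j = 0; j <= t.Length; j++)
        {
            previous[j] = j;
        }

        for (int i = 1; i <= s.Length; i++)
        {
            current[0] = i;
            for (int j = 1; j <= t.Length; j++)
            {
                int cost = (s[i - 1] == t[j - 1]) ? 0 : 1;
                current[j] = Math.Min(
                    Math.Min(previous[j] + 1, current[j - 1] + 1),
                    previous[j - 1] + cost);
            }

            // Swap the rows so the row just filled becomes "previous".
            int[] swap = previous;
            previous = current;
            current = swap;
        }

        // After the final swap, the last filled row is in "previous".
        return previous[t.Length];
    }
}

class TwoRowsDemo
{
    static void Main()
    {
        Console.WriteLine(LevenshteinTwoRows.Compute("aunt", "ant"));
        Console.WriteLine(LevenshteinTwoRows.Compute("Sam", "Samantha"));
        Console.WriteLine(LevenshteinTwoRows.Compute("flomax", "volmax"));
    }
}
```

The trade-off is that the full table is no longer available afterwards, so this form suits cases where only the distance is needed.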

Static classes. This algorithm is stateless, which means it doesn't store instance data and therefore can be put in a static class. Static classes are easier to add to new projects than separate methods.

Usage
Here we see how you can call the method in your C# programs. You will often want to compare multiple strings with the Levenshtein algorithm. The example here shows how you can compare strings in a loop. We use a List of string[] arrays.

=== Program that calls Levenshtein in loop (C#) ===

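using System;
using System.Collections.Generic;

class Program
{
    static void Main()
    {
        // Each array holds one pair of strings to compare.
        List<string[]> pairs = new List<string[]>
        {
            new string[]{"ant", "aunt"},
            new string[]{"Sam", "Samantha"},
            new string[]{"clozapine", "olanzapine"},
            new string[]{"flomax", "volmax"},
            new string[]{"toradol", "tramadol"},
            new string[]{"kitten", "sitting"}
        };

        foreach (string[] pair in pairs)
        {
            // Compute lives in the static LevenshteinDistance class.
            int cost = LevenshteinDistance.Compute(pair[0], pair[1]);
            Console.WriteLine("{0} -> {1} = {2}",
                pair[0],
                pair[1],
                cost);
        }
    }
}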

=== Output of the program ===

ant -> aunt = 1
Sam -> Samantha = 5
clozapine -> olanzapine = 3
flomax -> volmax = 3
toradol -> tramadol = 3
kitten -> sitting = 3

More resources
Michael Gilleland has an excellent page about the Levenshtein distance and many implementations of it, which is worth consulting if you need a more detailed reference. [Levenshtein Distance - merriampark.com]

Performance mistake
I found the C# version linked from merriampark.com, but I adapted that code for some big performance improvements. I changed the first statement into the second statement. The before version makes a new string copy for each single character. The after version examines characters directly, with no string copies made, taking 75% less time to run.
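A rough way to check this kind of difference on your own machine is to time both forms of the cost expression with Stopwatch. The sketch below is only an illustration: CostBenchmark, CostWithSubstring, CostWithChars, and the round count are hypothetical choices, it times the cost expression alone rather than the whole Compute method, and the numbers it prints will vary by runtime and hardware.

```csharp
using System;
using System.Diagnostics;

static class CostBenchmark
{
    // Slow form: allocates two one-character strings per comparison.
    public static int CostWithSubstring(string s, string t, int i, int j)
    {
        return (t.Substring(j - 1, 1) == s.Substring(i - 1, 1)) ? 0 : 1;
    }

    // Fast form: compares the chars directly, with no allocation.
    public static int CostWithChars(string s, string t, int i, int j)
    {
        return (t[j - 1] == s[i - 1]) ? 0 : 1;
    }

    static void Main()
    {
        const string s = "clozapine";
        const string t = "olanzapine";
        const int rounds = 100000; // arbitrary; raise for steadier numbers

        long checksum = 0; // printed below so the results are actually used

        Stopwatch slow = Stopwatch.StartNew();
        for (int r = 0; r < rounds; r++)
            for (int i = 1; i <= s.Length; i++)
                for (int j = 1; j <= t.Length; j++)
                    checksum += CostWithSubstring(s, t, i, j);
        slow.Stop();

        Stopwatch fast = Stopwatch.StartNew();
        for (int r = 0; r < rounds; r++)
            for (int i = 1; i <= s.Length; i++)
                for (int j = 1; j <= t.Length; j++)
                    checksum += CostWithChars(s, t, i, j);
        fast.Stop();

        Console.WriteLine("Substring: {0} ms", slow.ElapsedMilliseconds);
        Console.WriteLine("chars:     {0} ms", fast.ElapsedMilliseconds);
        Console.WriteLine("checksum:  {0}", checksum);
    }
}
```

Both helpers return the same cost for every pair of positions, so swapping one for the other does not change the computed distance.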

=== Slow version that uses Substring ===

// It makes new strings.
cost = (t.Substring(j - 1, 1) == s.Substring(i - 1, 1) ? 0 : 1);

=== Fast version that uses chars ===

// Doesn't make new strings with Substring.
cost = (t[j - 1] == s[i - 1]) ? 0 : 1;

Summary
Here we saw the famous Levenshtein Distance algorithm, adapted and optimized for the C# programming language. The author places the code here in the public domain, and encourages you to test it and improve it. This means you are free to use it anywhere you want. Use this code to implement approximate string matching. The brilliance of the algorithm is from Dr. Levenshtein, not the author of this article.
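As one possible use for approximate string matching, a suggestion search can pick the candidate with the smallest distance to what the user typed. The sketch below is an assumption about how that might look, not code from the original article: ClosestMatch, Find, and ClosestMatchDemo are hypothetical names, and Distance is a compact copy of the Compute logic included only so the block compiles by itself.

```csharp
using System;
using System.Collections.Generic;

static class ClosestMatch
{
    // Compact copy of the full-table Levenshtein recurrence.
    static int Distance(string s, string t)
    {
        int n = s.Length;
        int m = t.Length;
        int[,] d = new int[n + 1, m + 1];

        for (int i = 0; i <= n; i++) d[i, 0] = i;
        for (int j = 0; j <= m; j++) d[0, j] = j;

        for (int i = 1; i <= n; i++)
        {
            for (int j = 1; j <= m; j++)
            {
                int cost = (s[i - 1] == t[j - 1]) ? 0 : 1;
                d[i, j] = Math.Min(
                    Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1),
                    d[i - 1, j - 1] + cost);
            }
        }
        return d[n, m];
    }

    // Returns the candidate with the smallest distance to input, or null
    // if there are no candidates. On a tie the earliest candidate wins.
    public static string Find(string input, IEnumerable<string> candidates)
    {
        string best = null;
        int bestDistance = int.MaxValue;

        foreach (string candidate in candidates)
        {
            int distance = Distance(input, candidate);
            if (distance < bestDistance)
            {
                best = candidate;
                bestDistance = distance;
            }
        }
        return best;
    }
}

class ClosestMatchDemo
{
    static void Main()
    {
        string[] names =
        {
            "flomax", "volmax", "toradol", "tramadol", "clozapine", "olanzapine"
        };

        // "tramadoll" has one extra letter, so "tramadol" is the nearest name.
        Console.WriteLine(ClosestMatch.Find("tramadoll", names));
    }
}
```

A real suggestion feature would likely also want a maximum distance threshold, so that inputs far from every candidate return no suggestion at all.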
