# Fuzzy Matching Algorithm for Chinese Strings | C# Levenshtein Distance

2010-01-06 09:08:09

C# Levenshtein Distance
by Sam Allen - Updated November 27, 2009
You want to match strings approximately (fuzzy matching), using the Levenshtein distance algorithm. The Levenshtein distance between two strings is the minimum number of single-character insertions, deletions and substitutions needed to turn one into the other. Many projects need this logic, including programs that manage prescription drugs, spell-checkers, suggestion searches and plagiarism detectors. Here we see a simple but complete implementation of this algorithm in the C# programming language.

Words:                ant, aunt
Levenshtein distance: 1
Note:                 Only 1 edit is needed.
                      The 'u' must be inserted at index 1 (zero-based).

Words:                Samantha, Sam
Levenshtein distance: 5
Note:                 The final 5 letters must be removed.

Words:                Flomax, Volmax
Levenshtein distance: 3
Note:                 The first 3 letters must be changed.
                      Drug names are commonly confused.

Levenshtein algorithm
First, credit goes to Vladimir Levenshtein, a Russian scientist. Here we see the C# code I adapted and optimized. It uses a two-dimensional array instead of a jagged array because the table is rectangular: it has a single width and a single height, so every row has the same length.
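To make the array choice concrete, here is a minimal sketch contrasting the two shapes for a table of (n + 1) by (m + 1) cells. The class and method names (`ArrayShapes`, `Rectangular`, `Jagged`) are made up for this illustration and are not part of the article's code.

```csharp
using System;

static class ArrayShapes
{
    // Rectangular array: one allocation, every row has m + 1 columns.
    public static int[,] Rectangular(int n, int m)
    {
        return new int[n + 1, m + 1];
    }

    // Jagged array: an array of rows, each row allocated separately.
    public static int[][] Jagged(int n, int m)
    {
        int[][] rows = new int[n + 1][];
        for (int i = 0; i <= n; i++)
        {
            rows[i] = new int[m + 1];
        }
        return rows;
    }

    static void Main()
    {
        int[,] rect = Rectangular(3, 4);   // e.g. "ant" (3) against "aunt" (4)
        int[][] jagged = Jagged(3, 4);

        Console.WriteLine("{0} x {1}", rect.GetLength(0), rect.GetLength(1));
        Console.WriteLine("{0} x {1}", jagged.Length, jagged[0].Length);
    }
}
```

Both hold the same number of cells; the rectangular form simply states the fixed width and height in a single declaration.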

=== Program that implements the algorithm (C#) ===

using System;

/// <summary>
/// Contains approximate string matching
/// </summary>
static class LevenshteinDistance
{
    /// <summary>
    /// Compute the distance between two strings.
    /// </summary>
    public static int Compute(string s, string t)
    {
        int n = s.Length;
        int m = t.Length;

        // Step 1: if either string is empty, the distance is the other's length.
        if (n == 0)
        {
            return m;
        }

        if (m == 0)
        {
            return n;
        }

        // d[i, j] holds the distance between the first i chars of s
        // and the first j chars of t.
        int[,] d = new int[n + 1, m + 1];

        // Step 2: fill the first column and the first row.
        for (int i = 0; i <= n; i++)
        {
            d[i, 0] = i;
        }

        for (int j = 0; j <= m; j++)
        {
            d[0, j] = j;
        }

        // Step 3: walk each character of s.
        for (int i = 1; i <= n; i++)
        {
            // Step 4: walk each character of t.
            for (int j = 1; j <= m; j++)
            {
                // Step 5: substitution costs nothing when the chars match.
                int cost = (t[j - 1] == s[i - 1]) ? 0 : 1;

                // Step 6: take the cheapest of deletion, insertion, substitution.
                d[i, j] = Math.Min(
                    Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1),
                    d[i - 1, j - 1] + cost);
            }
        }

        // Step 7: the bottom-right cell is the distance between s and t.
        return d[n, m];
    }
}

class Program
{
    static void Main()
    {
        Console.WriteLine(LevenshteinDistance.Compute("aunt", "ant"));
        Console.WriteLine(LevenshteinDistance.Compute("Sam", "Samantha"));
        Console.WriteLine(LevenshteinDistance.Compute("flomax", "volmax"));
    }
}

=== Output from the program ===

1
5
3

Description. The Levenshtein method is static. This Compute method doesn't need to store state or instance data, which means you can declare it as static. This can also improve performance, avoiding callvirt instructions. You can verify that the above implementation is the standard version of Levenshtein by comparing it against the description in an algorithms textbook.

Performance notes. I adapted the code shown above from another source and optimized it so that it is three times faster. However, there are faster variants of Levenshtein algorithms for some scenarios. [Levenshtein distance - wikipedia.org]
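One commonly used variant keeps only two rows of the table instead of the full matrix, since each cell depends only on the current and previous rows. The sketch below is an illustration of that idea, not the author's code; the class name `LevenshteinTwoRow` is an assumption made for this example, and any speed benefit should be measured rather than assumed.

```csharp
using System;

static class LevenshteinTwoRow
{
    // Same recurrence as the full-matrix version, but memory use is
    // proportional to the length of t rather than to n * m.
    public static int Compute(string s, string t)
    {
        if (s == null) throw new ArgumentNullException("s");
        if (t == null) throw new ArgumentNullException("t");
        if (s.Length == 0) return t.Length;
        if (t.Length == 0) return s.Length;

        int[] previous = new int[t.Length + 1];
        int[] current = new int[t.Length + 1];

        // Row 0: distance from the empty prefix of s to each prefix of t.
        for (int j = 0; j <= t.Length; j++)
        {
            previous[j] = j;
        }

        for (int i = 1; i <= s.Length; i++)
        {
            // Column 0: distance from the first i chars of s to the empty string.
            current[0] = i;

            for (int j = 1; j <= t.Length; j++)
            {
                int cost = (s[i - 1] == t[j - 1]) ? 0 : 1;
                current[j] = Math.Min(
                    Math.Min(previous[j] + 1, current[j - 1] + 1),
                    previous[j - 1] + cost);
            }

            // The row just computed becomes the previous row for the next pass.
            int[] swap = previous;
            previous = current;
            current = swap;
        }

        // After the final swap, the last computed row is in 'previous'.
        return previous[t.Length];
    }

    static void Main()
    {
        Console.WriteLine(Compute("kitten", "sitting")); // 3
    }
}
```

The trade-off is that the full table is no longer available afterwards, so this form suits cases where only the distance is needed.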

Static classes. This algorithm is stateless, which means it doesn't store instance data and therefore can be put in a static class. Static classes are easier to add to new projects than separate methods.

Usage
Here we see how you can call the method in your C# programs. You will often want to compare multiple strings with the Levenshtein algorithm. The example here shows how you can compare strings in a loop. We use a List of string[] arrays.

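=== Program that calls Levenshtein in loop (C#) ===

using System;
using System.Collections.Generic;

// Assumes the LevenshteinDistance class shown earlier is in the same project.
class Program
{
    static void Main()
    {
        // Each entry is a pair of strings to compare.
        List<string[]> l = new List<string[]>
        {
            new string[]{"ant", "aunt"},
            new string[]{"Sam", "Samantha"},
            new string[]{"clozapine", "olanzapine"},
            new string[]{"flomax", "volmax"},
            new string[]{"kitten", "sitting"}
        };

        foreach (string[] a in l)
        {
            int cost = LevenshteinDistance.Compute(a[0], a[1]);
            Console.WriteLine("{0} -> {1} = {2}",
                a[0],
                a[1],
                cost);
        }
    }
}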

=== Output of the program ===

ant -> aunt = 1
Sam -> Samantha = 5
clozapine -> olanzapine = 3
flomax -> volmax = 3
kitten -> sitting = 3

More resources
Michael Gilleland has an excellent page about the Levenshtein distance and many implementations of it, and that resource is important if you need more detailed reference. [Levenshtein Distance - merriampark.com]

Performance mistake
I found the C# version linked from merriampark.com, and I adapted that code for some big performance improvements. I changed the first statement below into the second. The original version creates a new string for every single-character comparison. The revised version compares characters directly, with no string copies made, and takes 75% less time to run.
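If you want to check the effect on your own machine, a rough timing sketch follows. It measures only the character comparison in isolation, not the whole algorithm, so the ratio it reports will not match the figure above. The class and method names (`SubstringVsChar`, `MismatchesWithSubstring`, `MismatchesWithChars`) are invented for this example.

```csharp
using System;
using System.Diagnostics;

static class SubstringVsChar
{
    // Compares position by position using one-character substrings,
    // which allocates two new strings per comparison.
    public static int MismatchesWithSubstring(string s, string t)
    {
        int count = 0;
        int length = Math.Min(s.Length, t.Length);
        for (int i = 0; i < length; i++)
        {
            if (s.Substring(i, 1) != t.Substring(i, 1))
            {
                count++;
            }
        }
        return count;
    }

    // Compares position by position using the char indexer, with no allocation.
    public static int MismatchesWithChars(string s, string t)
    {
        int count = 0;
        int length = Math.Min(s.Length, t.Length);
        for (int i = 0; i < length; i++)
        {
            if (s[i] != t[i])
            {
                count++;
            }
        }
        return count;
    }

    static void Main()
    {
        string s = new string('a', 1000) + "b";
        string t = new string('a', 1000) + "c";
        const int rounds = 10000;

        Stopwatch watch = Stopwatch.StartNew();
        int slowTotal = 0;
        for (int r = 0; r < rounds; r++)
        {
            slowTotal += MismatchesWithSubstring(s, t);
        }
        watch.Stop();
        Console.WriteLine("Substring: {0} ms (checksum {1})",
            watch.ElapsedMilliseconds, slowTotal);

        watch = Stopwatch.StartNew();
        int fastTotal = 0;
        for (int r = 0; r < rounds; r++)
        {
            fastTotal += MismatchesWithChars(s, t);
        }
        watch.Stop();
        Console.WriteLine("Chars:     {0} ms (checksum {1})",
            watch.ElapsedMilliseconds, fastTotal);
    }
}
```

Timings vary with the runtime, build configuration and hardware, so treat the numbers as indicative only.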

=== Slow version that uses Substring ===

// It makes new strings.
cost = (t.Substring(j - 1, 1) == s.Substring(i - 1, 1) ? 0 : 1);

=== Fast version that uses chars ===

// Doesn't make new strings with Substring.
cost = (t[j - 1] == s[i - 1]) ? 0 : 1;

Summary
Here we saw the famous Levenshtein Distance algorithm, adapted and optimized for the C# programming language. The author places the code here in the public domain, and encourages you to test it and improve it. This means you are free to use it anywhere you want. Use this code to implement approximate string matching. The brilliance of the algorithm is from Dr. Levenshtein, not the author of this article.
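As one possible way to apply the distance to approximate matching, the sketch below picks the closest candidate from a list, as a suggestion search might. `ClosestMatch`, `FindClosest` and the `maxDistance` threshold are hypothetical names chosen for this example; the `Distance` method repeats the same algorithm as above only so that the block runs on its own.

```csharp
using System;
using System.Collections.Generic;

static class ClosestMatch
{
    // Same dynamic-programming distance as in the article, repeated here
    // so that this sketch is self-contained.
    public static int Distance(string s, string t)
    {
        int n = s.Length;
        int m = t.Length;
        if (n == 0) return m;
        if (m == 0) return n;

        int[,] d = new int[n + 1, m + 1];
        for (int i = 0; i <= n; i++) d[i, 0] = i;
        for (int j = 0; j <= m; j++) d[0, j] = j;

        for (int i = 1; i <= n; i++)
        {
            for (int j = 1; j <= m; j++)
            {
                int cost = (t[j - 1] == s[i - 1]) ? 0 : 1;
                d[i, j] = Math.Min(
                    Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1),
                    d[i - 1, j - 1] + cost);
            }
        }
        return d[n, m];
    }

    // Hypothetical helper: returns the candidate with the smallest distance
    // to the input, or null when no candidate is within maxDistance edits.
    // On a tie, the earliest candidate wins.
    public static string FindClosest(
        string input, IEnumerable<string> candidates, int maxDistance)
    {
        string best = null;
        int bestDistance = maxDistance + 1;
        foreach (string candidate in candidates)
        {
            int distance = Distance(input, candidate);
            if (distance < bestDistance)
            {
                best = candidate;
                bestDistance = distance;
            }
        }
        return best;
    }

    static void Main()
    {
        string[] drugs = { "flomax", "volmax", "clozapine", "olanzapine" };
        Console.WriteLine(FindClosest("flowmax", drugs, 2)); // flomax
    }
}
```

The threshold is a design choice: too small and real typos are missed, too large and unrelated names are suggested, so it is worth tuning against your own data.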
