A Guide to Approximate String Matching

cunchi8090

Published 2020-07-18 16:14:23

Tags: string algorithms, python, java, machine learning

Original article: https://www.experts-exchange.com/articles/2661/A-Guide-to-Approximate-String-Matching.html

Okay. So what exactly is the problem here?

How often have we come across situations where we need to know whether two strings are 'similar' but not necessarily the same? I have, plenty of times. Until recently, I thought functionality like that would be impossible to get my head around. Fortunately, there was already a vast, well-developed field of research to come to my rescue!

Whoa, hang on! I'm a beginner here. What kind of field is this?

This measure of 'similarity' between strings is known as 'Approximate String Matching'. The basic steps followed in comparing any two strings are as follows:

a) Analyze the two strings.
b) Compute a 'metric' for these two strings.
c) Check whether the value of the metric is above or below a certain threshold.
d) Decide whether the strings 'match'.

Metrics? Approximate String matching? Steps? What??

Relax. The above 4 steps will become a little clearer the moment I elaborate on one of these metrics.

But wait. What metrics are these?

Put simply, a metric is a quantity used for measurement. Here we need to choose an appropriate metric to represent how similar or different two strings are.

Oh okay. So if we're measuring the same thing, why do we need different metrics?

A person's weight can be measured in kilograms or pounds. Your height can be in feet or inches. Similarly, there are different approaches to measure how similar two strings are.

So there are more metrics available?

There are plenty of metrics to choose from. Some are:

Levenshtein edit distance
Hamming distance
Longest common substring
Longest common subsequence

Can you tell me a little more about each?

Of course! We'll cover the Levenshtein edit distance in more detail later, but here's a little information about the other three:

Hamming Distance: The Hamming distance, to put it simply, counts the positions at which two strings of equal length have different characters. For example, the distance between "John" and "Joan" is 1. However, this metric treats the position of each character as the major aspect of the comparison. This means "John" and "Jonh" have a Hamming distance of 2. While we can tell at first glance that the second string is probably just a typographical error of the first, this metric concludes that half of their characters differ.

Longest Common Substring: As the name suggests, this method finds the length of the longest substring present in all of the strings being compared. For example, in the strings "BON", "RIBBON" and "BONAFIDE", the longest common substring is "BON". Clearly, the longer the common substring is, the more similar the strings being compared are. Again, as is the case with the Hamming distance, the position of each character matters: the shared characters must be adjacent and in the same order in every string.

Longest Common Subsequence: It's important to note that while it is easy to confuse this method with the previous one, they are in fact very different. A substring is a run of consecutive characters as they appear in the original string. That is to say, a substring has to map directly onto a contiguous part of the original string and contain its characters in exactly the same order. A subsequence, however, has no such restriction on adjacency. Characters in a subsequence only need to appear in the same relative order as in the parent string.

Let's take an example. We have the string "TURTLE". Some valid substrings of this string are "URTLE", "TUR" and "RTL". Some valid subsequences of this string are "TE", "TUE" and "RLE". Notice that in a subsequence, the absolute position of the characters is irrelevant, but their relative order must be the same as in the parent string. "RLE" is valid because even though its characters don't appear consecutively in "TURTLE", the character 'R' still appears before 'L', which appears before 'E'.

We can now understand what the longest common subsequence searches for. If we compare "PENNY", "ENTITY" and "EPIPHANY", we will observe that the longest common subsequence is "ENY". While position is not as critical in this method as in the other two methods mentioned above, its reliance on relative order still reduces its effectiveness when confronted with typos.
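To make the three metrics above a little more concrete, here is a minimal C# sketch of each. It is only an illustration, not code from the original article: the class and method names are made up for this example, each method compares two strings at a time (not three, as in the examples above), and throwing on unequal lengths in the Hamming distance is a choice made for this sketch.

```csharp
using System;

public static class StringMetricsSketch
{
    // Counts the positions at which two equal-length strings differ.
    // Assumption of this sketch: unequal lengths are rejected with an exception.
    public static int HammingDistance(string a, string b)
    {
        if (a.Length != b.Length)
        {
            throw new ArgumentException("Strings must have the same length.");
        }

        int distance = 0;
        for (int i = 0; i < a.Length; i++)
        {
            if (a[i] != b[i])
            {
                distance++;
            }
        }
        return distance;
    }

    // Length of the longest run of consecutive characters shared by a and b.
    public static int LongestCommonSubstringLength(string a, string b)
    {
        // run[i, j] = length of the common run ending at a[i - 1] and b[j - 1].
        int[,] run = new int[a.Length + 1, b.Length + 1];
        int best = 0;
        for (int i = 1; i <= a.Length; i++)
        {
            for (int j = 1; j <= b.Length; j++)
            {
                if (a[i - 1] == b[j - 1])
                {
                    run[i, j] = run[i - 1, j - 1] + 1;
                    best = Math.Max(best, run[i, j]);
                }
            }
        }
        return best;
    }

    // Length of the longest sequence of characters that appears in both
    // strings in the same relative order, not necessarily adjacent.
    public static int LongestCommonSubsequenceLength(string a, string b)
    {
        // lcs[i, j] = answer for the first i characters of a and first j of b.
        int[,] lcs = new int[a.Length + 1, b.Length + 1];
        for (int i = 1; i <= a.Length; i++)
        {
            for (int j = 1; j <= b.Length; j++)
            {
                lcs[i, j] = (a[i - 1] == b[j - 1])
                    ? lcs[i - 1, j - 1] + 1
                    : Math.Max(lcs[i - 1, j], lcs[i, j - 1]);
            }
        }
        return lcs[a.Length, b.Length];
    }
}
```

With these definitions, StringMetricsSketch.HammingDistance("John", "Jonh") gives 2, LongestCommonSubstringLength("RIBBON", "BONAFIDE") gives 3 (for "BON"), and LongestCommonSubsequenceLength("PENNY", "ENTITY") gives 3 (for "ENY").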

So which one can I use?

My preference, and indeed the one I always use, is the Levenshtein distance. This is mainly because the Levenshtein distance accounts for typos such as missing, extra or wrong characters, and does not tie the comparison to the absolute positions of characters as strictly as some of the other metrics do. You can find more information on each of these metrics, and on others, on their individual Wikipedia pages.

Okay great. So now I know that I'm going to calculate the Levenshtein distance for my strings. Can you tell me something more about this?

The basic idea of the Levenshtein distance is to find the minimum number of single-character changes (insertions, deletions or substitutions) you need to make to one string to get the other. For example, if I have "John" and "Johnny", I need to either add two characters to the first to get the second, or remove two characters from the second to get the first. That gives me a distance of 2. Similarly, if I have "John" and "Jones", I need to remove the character 'h' from "John" and add the two characters 'e' and 's' to make it "Jones". That gives me a distance of 3.

Ah okay, that makes sense.

See, I told you!

Writing code for that is a whole other ball game.

Yes, I know. Fortunately, code to compute the Levenshtein distance is available on the internet for unrestricted personal and commercial use. Below is one such function in C#:

http://dotnetperls.com/levenshtein

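public static int GetLevenshteinDistance(string s, string t)
{
    int n = s.Length;
    int m = t.Length;

    // If one string is empty, the distance is the length of the other
    // (every character has to be inserted or deleted).
    if (n == 0)
    {
        return m;
    }

    if (m == 0)
    {
        return n;
    }

    // d[i, j] holds the distance between the first i characters of s
    // and the first j characters of t.
    int[,] d = new int[n + 1, m + 1];

    // Turning a prefix of s into the empty string takes i deletions.
    for (int i = 0; i <= n; i++)
    {
        d[i, 0] = i;
    }

    // Turning the empty string into a prefix of t takes j insertions.
    for (int j = 0; j <= m; j++)
    {
        d[0, j] = j;
    }

    for (int i = 1; i <= n; i++)
    {
        for (int j = 1; j <= m; j++)
        {
            // A substitution costs nothing when the characters already match.
            int cost = (t[j - 1] == s[i - 1]) ? 0 : 1;

            // Take the cheapest of: deletion, insertion, substitution.
            d[i, j] = Math.Min(
                Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1),
                d[i - 1, j - 1] + cost);
        }
    }

    return d[n, m];
}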

So what do I do with this now?

This code returns the same kind of number we arrived at earlier. It returns 2 for "John" and "Johnny", and 3 for "John" and "Jones". Now we can try to derive, as a percentage, how much one string has to be altered to make it equal to the other.

Consider A = "John" and B = "Johnny".

Since we're checking two strings, we can either change A to B or B to A. Either one will need the same number of operations.

The ideal thing to do here is to find out which string would have to be altered the least, relative to its length, to obtain the other string.

I didn't understand that.

Well, here we need to compare A and B. We now know that the distance between them is 2. How large is that change relative to each of these strings? Changing A would involve a net change of 2 out of 4 characters. Changing B would involve a net change of 2 out of 6 characters.

Clearly, 2/4 (50%) is more than 2/6 (33.33%).

Okay. What do I do with this information?

The purpose here is to work out a 'score' which we can then use to compare any two strings.

So this score is basically the lowest percentage of change in either string?

Exactly. In the case of A and B, the score will be 2/6, since it is less than 2/4. Similarly, for any two strings A and B, the score can be obtained by using:

// Cast to double first; dividing one int by another in C# would truncate the score to 0.
double score = (double)GetLevenshteinDistance(A, B) / Math.Max(A.Length, B.Length);

This is exactly what we did earlier. We just divide the distance by the larger of the two lengths to get a score which, when multiplied by 100, gives you the percentage of change.
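The score calculation can be wrapped in a small helper. The following is only a sketch under a few assumptions: the class name and GetChangeScore are hypothetical names chosen for this example, the distance function is a compact copy of the one shown earlier so that the block stands on its own, and returning 0.0 when both strings are empty is a choice made here to avoid dividing by zero.

```csharp
using System;

public static class ChangeScoreSketch
{
    // Compact copy of the dynamic-programming Levenshtein distance.
    public static int GetLevenshteinDistance(string s, string t)
    {
        int n = s.Length;
        int m = t.Length;
        int[,] d = new int[n + 1, m + 1];

        for (int i = 0; i <= n; i++) d[i, 0] = i;
        for (int j = 0; j <= m; j++) d[0, j] = j;

        for (int i = 1; i <= n; i++)
        {
            for (int j = 1; j <= m; j++)
            {
                int cost = (s[i - 1] == t[j - 1]) ? 0 : 1;
                d[i, j] = Math.Min(
                    Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1),
                    d[i - 1, j - 1] + cost);
            }
        }
        return d[n, m];
    }

    // Distance divided by the longer length: 0.0 means identical strings,
    // 1.0 means every character of the longer string has to change.
    public static double GetChangeScore(string a, string b)
    {
        int longest = Math.Max(a.Length, b.Length);
        if (longest == 0)
        {
            return 0.0; // Assumption: two empty strings count as identical.
        }
        return (double)GetLevenshteinDistance(a, b) / longest;
    }
}
```

For "John" and "Johnny" this gives 2/6, roughly 0.33, and for "John" and "Jones" it gives 3/5, which is 0.6.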

And how does getting this score help?

This means you can now programmatically specify a condition like 'Only consider two strings equal if the longer one would have to be changed by less than 30% of its length.'

How exactly?

Using this code:

String A = "John";
String B = "Johnny";
int distance = GetLevenshteinDistance(A, B);
double score = (double)distance / Math.Max(A.Length, B.Length);

if (score < 0.3) //0.3 corresponds to 30%
{
    Console.WriteLine("Bingo. These are almost the same.");
}
else
{
    Console.WriteLine("Nope. Not close enough.");
}

Cool. What if I want them to be considered as equal only if the change percentage is, say, less than or equal to 22%?

You just change the '0.3' in the code to '0.22', and the '<' to '<=' if you want exactly 22% to count as a match. This works for any percentage you want.

Can you show me some place I can actually use this?

Let's see. Suppose you need to write a program that fetches a student's scores from a database, based on a name the teacher types in.

Teachers are human, last time I checked. Suppose the teacher enters 'Jonh' instead of 'John'. You obviously will not find a record for 'Jonh' in the database. But if you use this method of approximate matching and compare 'John' with the input 'Jonh', it can register a match and return John's scores anyway. One caveat: the Levenshtein distance counts two swapped characters as two changes, so 'John' and 'Jonh' have a distance of 2 and a score of 2/4 (50%). The threshold has to be at least that lenient for this pair to match.
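As a rough illustration of the lookup described above, the sketch below picks the closest name from a list, provided its score does not exceed a chosen threshold. Everything here is hypothetical: the class name, FindClosestName and its parameters are invented for this example, the list of names stands in for whatever the database would return, and the comparison is case-sensitive.

```csharp
using System;
using System.Collections.Generic;

public static class NameLookupSketch
{
    // Compact copy of the dynamic-programming Levenshtein distance.
    public static int GetLevenshteinDistance(string s, string t)
    {
        int n = s.Length;
        int m = t.Length;
        int[,] d = new int[n + 1, m + 1];

        for (int i = 0; i <= n; i++) d[i, 0] = i;
        for (int j = 0; j <= m; j++) d[0, j] = j;

        for (int i = 1; i <= n; i++)
        {
            for (int j = 1; j <= m; j++)
            {
                int cost = (s[i - 1] == t[j - 1]) ? 0 : 1;
                d[i, j] = Math.Min(
                    Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1),
                    d[i - 1, j - 1] + cost);
            }
        }
        return d[n, m];
    }

    // Returns the known name with the lowest score, as long as that score
    // is at most maxScore; returns null when nothing is close enough.
    public static string FindClosestName(
        string input, IEnumerable<string> knownNames, double maxScore)
    {
        string best = null;
        double bestScore = double.MaxValue;

        foreach (string name in knownNames)
        {
            int longest = Math.Max(input.Length, name.Length);
            double score = (longest == 0)
                ? 0.0
                : (double)GetLevenshteinDistance(input, name) / longest;

            if (score <= maxScore && score < bestScore)
            {
                best = name;
                bestScore = score;
            }
        }
        return best;
    }
}
```

With the names "John", "Mary" and "Peter", calling FindClosestName("Jonh", names, 0.5) returns "John", while a stricter threshold of 0.3 returns null because the score for "Jonh" against "John" is 0.5.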

Is there anything else I should know before using this?

Yes. There are always other metrics to choose from and different ones are suitable for different uses. Use this article as a starting point to research other ways of approximately matching strings.

Great. Anything else before I leave?

Yes! You can mark this article as helpful if you thought it was :-)

Edit:

As DanRollins pointed out in his comment, Soundex is another technique used for approximate string matching. It is used to determine whether two strings 'sound' the same. Soundex belongs to a family of algorithms called phonetic algorithms. You can read more about it on its Wikipedia page.
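For the curious, here is a small sketch of Soundex as it is commonly described (the American variant): keep the first letter, turn the following consonants into digits, skip vowels, collapse repeated codes, and pad the result to four characters. This is not code from the original article or from DanRollins' comment; the class and method names are made up for this example, and it assumes plain English letters (anything else is treated like a vowel).

```csharp
using System;
using System.Text;

public static class SoundexSketch
{
    // Digit for a consonant group, or '0' for vowels, H, W, Y and non-letters.
    private static char CodeFor(char c)
    {
        switch (c)
        {
            case 'B': case 'F': case 'P': case 'V':
                return '1';
            case 'C': case 'G': case 'J': case 'K':
            case 'Q': case 'S': case 'X': case 'Z':
                return '2';
            case 'D': case 'T':
                return '3';
            case 'L':
                return '4';
            case 'M': case 'N':
                return '5';
            case 'R':
                return '6';
            default:
                return '0';
        }
    }

    // Returns a four-character code: the first letter followed by three digits.
    public static string Encode(string word)
    {
        if (string.IsNullOrEmpty(word))
        {
            return "";
        }

        var result = new StringBuilder();
        char first = char.ToUpperInvariant(word[0]);
        result.Append(first);
        char lastCode = CodeFor(first);

        for (int i = 1; i < word.Length && result.Length < 4; i++)
        {
            char c = char.ToUpperInvariant(word[i]);

            // H and W are skipped without separating equal codes around them.
            if (c == 'H' || c == 'W')
            {
                continue;
            }

            char code = CodeFor(c);
            if (code != '0' && code != lastCode)
            {
                result.Append(code);
            }

            // A vowel resets lastCode, so the same code may repeat after it.
            lastCode = code;
        }

        return result.ToString().PadRight(4, '0');
    }
}
```

With this sketch, both "John" and the typo "Jonh" encode to "J500", and "Robert" and "Rupert" both encode to "R163", so comparing the codes for equality treats each pair as sounding the same.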

Translated from: https://www.experts-exchange.com/articles/2661/A-Guide-to-Approximate-String-Matching.html
