如何在Postgresql中使用模糊字符串匹配

最新推荐文章于 2024-05-10 11:32:25 发布

cumi6497

最新推荐文章于 2024-05-10 11:32:25 发布

阅读量1.9k

点赞数

文章标签：字符串算法 python java mysql

原文链接：https://www.freecodecamp.org/news/fuzzy-string-matching-with-postgresql/

版权

It's a fact - people make typos or simply use alternate spellings on a frequent basis.

这是事实-人们经常打错字或只是使用替代拼写。

Whatever the cause, from a practical point of view, different variants of similar strings can pose challenges for software developers. Your application needs to be capable of handling these inevitable edge-cases.

从实际的角度来看，无论原因是什么，相似字符串的不同变体都会给软件开发人员带来挑战。您的应用程序需要能够处理这些不可避免的情况。

Take names, for example. I go by Peter in some places, Pete in others. Amongst other variants, my name can be represented by:

以名字为例。我在某些地方经过彼得，在其他地方经过彼得。在其他变体中，我的名字可以用以下方式表示：

"Pete Gleeson"
“皮特格里森”
"Peter J Gleeson"
“彼得·格里森(Peter J Gleeson)”
"Mr P Gleeson"
“格里森先生”
"Gleeson, Peter"
“格里森，彼得”

And that's not to mention alternative spellings of my surname, such as "Gleason". All these different variations for just one string - matching them against each other programmatically might not seem obvious.

更不用说我姓氏的其他拼写形式，例如“格里森”。对于一个字符串，所有这些不同的变体-以编程方式将它们彼此匹配可能并不明显。

Luckily, there are solutions out there.

幸运的是，那里有解决方案。

The generic name for these solutions is 'fuzzy string matching'. The 'fuzzy' refers to the fact that the solution does not look for a perfect, position-by-position match when comparing two strings. Instead, they allow some degree of mismatch (or 'fuzziness').

这些解决方案的通用名称是“模糊字符串匹配”。 “模糊”是指这样的事实，即在比较两个字符串时，解决方案并不寻求完美的逐位匹配。相反，它们允许一定程度的不匹配(或“模糊性”)。

There are solutions available in many different programming languages. Today, we'll explore some options available in Postgresql (or 'Postgres') - a widely used open source SQL dialect with some seriously useful add-on features.

有许多不同编程语言提供的解决方案。今天，我们将探讨Postgresql(或“ Postgres”)中可用的一些选项-一种广泛使用的开放源SQL方言，具有一些非常有用的附加功能。

配置 (Setting up)

First, make sure you have Postgres installed on your machine.

首先，确保在计算机上安装了Postgres 。

Then, create a new database in its own directory (you can call it anything you like, here, I called it 'fuzz-demo'). From the command line:

然后，在其自己的目录中创建一个新数据库(您可以随意命名，在这里，我将其称为“ fuzz-demo”)。从命令行：

$ mkdir fuzz-demo && cd fuzz-demo
$ initdb .
$ pg_ctl -D . start
$ createdb fuzz-demo

For this demo, I used a table with details about artists in the Museum of Modern Art. You can download the artists.csv file from Kaggle.

对于此演示，我使用了一张桌子，上面有现代艺术博物馆中有关艺术家的详细信息。您可以从Kaggle下载artist.csv文件。

Next, you can start psql (a terminal-based front end for Postgresql):

接下来，您可以启动psql(Postgresql的基于终端的前端)：

$ psql fuzz-demo

Now, create a table called artists:

现在，创建一个名为artists的表：

CREATE TABLE artists (
	artist_id INT,
    name VARCHAR,
    nationality VARCHAR,
    gender VARCHAR,
    birth_year INT,
    death_year INT);

Finally, you can use Postgresql's COPY function to copy the contents of artists.csv into the table:

最后，您可以使用Postgresql的COPY函数将artist.csv的内容复制到表中：

COPY artists FROM '~/Downloads/artists.csv' DELIMTER ',' CSV HEADER;

If everything has worked so far, you should be able to start querying the artists table.

如果到目前为止一切正常，则应该可以开始查询Artists表。

SELECT * FROM artists LIMIT 10;

通配符过滤器 (Wildcard filters)

Say you remember the first name of an artist called Barbara, but cannot quite remember her second name. It began with 'Hep...', but you're not sure how it ended.

假设您记得一位叫Barbara的艺术家的名字，但是却不太记得她的名字。它以“ Hep ...”开始，但是您不确定它是如何结束的。

Here, you can use a filter and SQL's wildcard operator %. This symbol stands in for any number of unspecified characters.

在这里，您可以使用过滤器和SQL的通配符% 。该符号代表任意数量的未指定字符。

SELECT
	* 
FROM artists
WHERE name LIKE 'Barbara%'
AND name LIKE '%Hep%';

The first part of the filter finds artists whose name begins with 'Barbara', and ends in any combination of characters.

过滤器的第一部分查找名称以'Barbara'开头且以任何字符组合结尾的艺术家。

The second part of the filter finds artists whose name can begin and end with any combination of characters, but must contain the letters 'Hep' in that order.

过滤器的第二部分查找艺术家的名字可以以任何字符组合开头和结尾，但必须按顺序包含字母“ Hep”。

But what if you are unsure of the spelling of either name? Filters and wildcards will only get you so far.

但是，如果您不确定两个名字的拼写怎么办？过滤器和通配符只会帮助您解决问题。

使用三字组 (Using trigrams)

Luckily, Postgres has a helpful extension with the catchy name pg_trgm. You can enable it from psql using the command below:

幸运的是，Postgres的扩展名为pg_trgm，很有帮助。您可以使用以下命令从psql启用它：

CREATE EXTENSION pg_trgm;

This extension brings with it some helpful functions for fuzzy string matching. The underlying principle is the use of trigrams (which sound like something out of Harry Potter).

此扩展带有一些有用的函数，用于模糊字符串匹配。基本原理是使用三字组合(听起来像哈利·波特那样)。

Trigrams are formed by breaking a string into groups of three consecutive letters. For example, the string "hello" would be represented by the following set of trigrams:

通过将字符串分成三个连续字母的组来形成三元组。例如，字符串“ hello”将由以下三字母组表示：

" h", " he", "hel", "ell", "llo", "lo "
“ h”，“ he”，“ hel”，“ ell”，“ llo”，“ lo”

By comparing how similar the set of trigrams are between two strings, it is possible to estimate how similar they are on a scale between 0 and 1. This allows for fuzzy matching, by setting a similarity threshold above which strings are considered to match.

通过比较两个字符串之间的字母组合的相似程度，可以估计它们在0到1之间的尺度上的相似程度。这可以通过设置相似性阈值来进行模糊匹配，在该阈值之上可以认为字符串匹配。

SELECT
	*
FROM artists
WHERE SIMILARITY(name,'Claud Monay') > 0.4 ;

Perhaps you want to see the top five matches?

也许您想看到前五场比赛？

SELECT 
	*
FROM artists
ORDER BY SIMILARITY(name,'Lee Casner') DESC
LIMIT 5;

The default threshold is 0.3. You can use the % operator in this case as shorthand for fuzzy matching names against a potential match:

默认阈值为0.3。在这种情况下，可以使用%运算符作为针对潜在匹配的模糊匹配名称的简写：

SELECT
	*
FROM artists
WHERE name % 'Andrey Deran';

Perhaps you only have an idea of one part of the name. The % operator lets you compare against elements of an array, so you can match against any part of the name. The next query uses Postgres' STRING_TO_ARRAY function to split the artists' full names into arrays of separate names.

也许您只知道名称的一部分。使用%运算符可以与数组的元素进行比较，因此可以与名称的任何部分进行匹配。下一个查询使用Postgres的STRING_TO_ARRAY函数将艺术家的全名拆分为单独名称的数组。

SELECT
	*
FROM artists
WHERE 'Cadinsky' % ANY(STRING_TO_ARRAY(name,' '));

语音算法 (Phonetic algorithms)

Another approach to fuzzy string matching comes from a group of algorithms called phonetic algorithms.

模糊字符串匹配的另一种方法来自一组称为语音算法的算法。

These are algorithms which use sets of rules to represent a string using a short code. The code contains the key information about how the string should sound if read aloud. By comparing these shortened codes, it is possible to fuzzy match strings which are spelled differently but sound alike.

这些是使用规则集来使用短代码表示字符串的算法。该代码包含有关大声读取字符串的声音的关键信息。通过比较这些缩短的代码，可以对拼写不同但听起来相似的匹配字符串进行模糊处理。

Postgres comes with an extension that lets you make use of some of these algorithms. You can enable it with the following command:

Postgres带有一个扩展，可让您使用其中的一些算法。您可以使用以下命令启用它：

CREATE EXTENSION fuzzystrmatch;

One example is an algorithm called Soundex. Its origins go back over 100 years - it was first patented in 1918 and was used in the 20th century for analysing US census data.

一个示例是称为Soundex的算法。它的起源可以追溯到100年前-它于1918年首次获得专利，并在20世纪用于分析美国人口普查数据。

Soundex works by converting strings into four letter codes which describe how they sound. For example, the Soundex representations of 'flower' and 'flour' are both F460.

Soundex的工作原理是将字符串转换成四个字母代码，以描述它们的发音。例如，“花”和“面粉”的Soundex表示形式均为F460。

The query below finds the record which sounds like the name 'Damian Hurst'.

下面的查询查找听起来像名称“ Damian Hurst”的记录。

SELECT
	*
FROM artists
WHERE nationality IN ('American', 'British')
AND SOUNDEX(name) = SOUNDEX('Damian Hurst');

Another algorithm is one called metaphone. This works on a similar basis to Soundex, in that it converts strings into a code representation using a set of rules.

另一种算法是称为元音。这与Soundex相似，其工作原理是使用一组规则将字符串转换为代码表示形式。

The metaphone algorithm will return codes of different lengths (unlike Soundex, which always returns four characters). You can pass an argument to the METAPHONE function indicating the maximum length code you want it to return.

变音位算法将返回不同长度的代码(与Soundex不同，后者始终返回四个字符)。您可以将一个参数传递给METAPHONE函数，该参数指示您希望其返回的最大长度代码。

SELECT
	artist_id,
    name,
    METAPHONE(name,10)
FROM artists
WHERE nationality = 'American'
LIMIT 5;

Because both metaphone and Soundex return strings as outputs, you can use them in other fuzzy string matching functions. This combined approach can yield powerful results. The example below finds the five closest matches for the name Si Tomlee.

因为metaphone和Soundex都将字符串作为输出返回，所以您可以在其他模糊字符串匹配函数中使用它们。这种组合方法可以产生有力的结果。下面的示例查找名称Si Tomlee的五个最接近的匹配项。

SELECT
	*
FROM artists
WHERE nationality = 'American'
ORDER BY SIMILARITY(
	METAPHONE(name,10),
    METAPHONE('Si Tomlee',10)
    ) DESC
LIMIT 5;

Here, a trigram-only approach would not have helped much, as there is little overlap between 'Cy Twombly' and 'Si Tomlee'. In fact, these only have a SIMILARITY score of 0.05, even though they sound similar when read aloud.

在这里，仅使用三元组的方法不会有太大帮助，因为“ Cy Twombly”和“ Si Tomlee”之间几乎没有重叠。实际上，尽管它们朗读时听起来相似，但它们的SIMILARITY仅为0.05。

Due to their historical origins, neither of these algorithms works well with names or words of non-English language origin. However, there are more internationally-focused versions.

由于其历史渊源，这些算法都不能很好地用于非英语来源的名称或单词。但是，还有更多面向国际的版本。

One example is the double metaphone algorithm. This uses a more sophisticated set of rules for producing metaphones. It can provide alternative encodings for English and non-English origin strings.

一个例子是双变音位算法。这使用了一套更复杂的规则来生产对讲机。它可以为英语和非英语来源字符串提供替代编码。

As an example, see the query below. It compares the double metaphone outputs for different spellings of Spanish artist Joan Miró:

例如，请参见下面的查询。它比较了西班牙艺术家JoanMiró的不同拼写的双音位输出：

SELECT
	'Joan Miró' AS name, 
    DMETAPHONE('Joan Miró'),
    DMETAPHONE_ALT('Joan Miró')
UNION SELECT
	'Juan Mero' AS name,
    DMETAPHONE('Juan Mero'),
    DMETAPHONE_ALT('Juan Mero');

走远 (Going the distance)

Finally, another approach to fuzzy string matching in Postgres is to calculate the 'distance' between strings. There are several ways to do this. Postgres provides functionality to calculate the Levenshtein distance.

最后，Postgres中模糊字符串匹配的另一种方法是计算字符串之间的“距离”。有几种方法可以做到这一点。 Postgres提供了计算Levenshtein距离的功能。

At a high level, the Levenshtein distance between two strings is the minimum number of edits required to transform one string into the other. Edits are considered at the character level, and can include:

在较高的水平上，两个字符串之间的Levenshtein距离是将一个字符串转换为另一个字符串所需的最小编辑次数。编辑是在字符级别考虑的，可以包括：

substitutions,
替换，
deletions, and
删除，以及
insertions
插入

For example, the Levenshtein distance between the words 'bigger' and 'better' is 3, because you can transform 'bigger' into 'better' by substituting 'igg' for 'ett'.

例如，单词“ bigger”和“ better”之间的Levenshtein距离为3，因为您可以通过用“ igg”代替“ ett”将“ bigger”变为“ better”。

Meanwhile, the Levenshtein distance between 'biggest' and 'best' is also 3, because you can transform 'biggest' into 'best' by deleting the letters 'igg'.

同时，“最大”和“最佳”之间的Levenshtein距离也是3，因为您可以通过删除字母“ igg”将“最大”转换为“最佳”。

See below for a query which finds the artists with the smallest Levenshtein distances from the name 'Freda Kallo'.

请参阅以下查询，查找距离名称“ Freda Kallo”最小的Levenshtein距离的艺术家。

SELECT
	*,
    LEVENSHTEIN(name, 'Freda Kallo')
FROM artists
ORDER BY LEVENSHTEIN(name, 'Freda Kallo') ASC
LIMIT 5

谢谢阅读！ (Thanks for reading!)

Hopefully this overview of fuzzy string matching in Postgresql has given you some new insights and ideas for your next project.

希望Postgresql中的模糊字符串匹配概述为您的下一个项目提供一些新的见解和想法。

There are of course other methods for fuzzy string matching not covered here, and in other programming languages.

当然，这里还有其他编程语言中未提及的其他用于模糊字符串匹配的方法。

For example, if you use Python, take a look at the fuzzywuzzy package. Or if you prefer R, you can use the inbuilt agrep() function, or try out the stringdist package.

例如，如果您使用Python，请查看Fuzzywuzzy软件包。或者，如果您更喜欢R，则可以使用内置的agrep()函数，或尝试使用stringdist包。

翻译自: https://www.freecodecamp.org/news/fuzzy-string-matching-with-postgresql/

cumi6497

关注

0
点赞
踩
2

收藏

觉得还不错? 一键收藏
0
评论
如何在Postgresql中使用模糊字符串匹配

It's a fact - people make typos or simply use alternate spellings on a frequent basis. 这是事实-人们经常打错字或只是使用替代拼写。 Whatever the cause, from a practical point of view, different variants of similar strings...
复制链接

扫一扫