python字符串模糊匹配，并计算匹配分数

格物致理，

已于 2023-04-24 23:28:16 修改

阅读量2k

点赞数

分类专栏： python 文章标签： python

于 2023-04-20 10:48:09 首次发布

本文链接：https://blog.csdn.net/qiuqiuit/article/details/130233863

版权

python 专栏收录该内容

28 篇文章 3 订阅

订阅专栏

thefuzz是一个Python库，用于字符串的模糊匹配。它主要通过SequenceMatcher类计算两个字符串的相似度，该类可以从difflib或python-Levenshtein包中获取。thefuzz提供了ratio方法和process模块，用于计算相似度和从列表中找出最匹配的字符串。process.extractOne函数结合了WRatio或自定义评分函数来确定最佳匹配。

摘要由CSDN通过智能技术生成

一、thefuzz

thefuzz包以前叫fuzzywuzzy，0.19版本开始改名为thefuzz，github地址：

GitHub - seatgeek/thefuzz: Fuzzy String Matching in Python

可以通过命令pip install thefuzz安装此包。用法还是比较简单的:

from thefuzz import fuzz

fuzz.ratio("test", "test!")

>>89

上面两个字符串的相似度为89%。

二、相似度ratio的计算

我们先看看这个包下面的源码，来查看thefuzz是怎么实现模糊匹配的。thefuzz源码包的结构如下：

先看看ratio方法源码：

def ratio(s1, s2):
    s1, s2 = utils.make_type_consistent(s1, s2)

    m = SequenceMatcher(None, s1, s2)
    return utils.intr(100 * m.ratio())

可以看到，ratio方法用到了一个比较关键的类SequenceMatcher，但是这个类却有可能来自两个不同的地方。

2.1 SequenceMatcher的来源

看看fuzzy.py的头部代码：

import platform
import warnings

try:
    #从当前文件夹的StringMatcher中导入StringMatcher
    from .StringMatcher import StringMatcher as SequenceMatcher
except ImportError:
    if platform.python_implementation() != "PyPy":
        warnings.warn('Using slow pure-python SequenceMatcher. Install python-Levenshtein to remove this warning')
    from difflib import SequenceMatcher

#导入当前文件夹的utils包，.代表当前目录
from . import utils

上面代码涉及了一个导入问题，即先从当前文件StringMatcher中导入StringMatcher，如果导入出现异常，就去difflib中导入SequenceMatcher。

正如上面第一张图中看到的，当然文件夹下面确实有一个叫StringMatcher.py的文件，也看看它前面的代码：

from Levenshtein import *
from warnings import warn

class StringMatcher:
.............
.............

可以看出，这个StringMatcher类引用了Levenshtein包，这个包也是用来计算字符串模糊匹配的，效率上来说，有可能比difflib中的SequenceMatcher快4-10倍。

Levenshtein包是用C语言写的，比较复杂，最初的项目地址：

GitHub - miohtama/python-Levenshtein: The Levenshtein Python C extension module contains functions for fast computation of Levenshtein distance and string similarity

后来这个作者没有维护了，然后由另一个在维护，项目的地址在这里：

GitHub - ztane/python-Levenshtein: The Levenshtein Python C extension module contains functions for fast computation of Levenshtein distance and string similarity

但是，这个页面上，作者也说了，他也7年不维护了，现在没找到新的维护者。

不管怎么说，如果要使用Levenshtein，还是可以安装的：

pip install python-Levenshtein

总结就是，thefuzz有两种实现方式，一种是依赖difflib，另一种依靠 python-Levenshtein。先看简单的difflib。

2.2 difflib包中的SequenceMatcher

首先导入：

from difflib import SequenceMatcher

这个类的主要作用是计算两个匹配字符串的相似度，如下：

s = SequenceMatcher(None, "abcde", "bcde")
s.ratio()

输出值为0.888888。这个是怎么计算的呢？可以查看difflib.py的源代码（我的电脑在D:\ProgramData\Miniconda3\Lib目录下），如下：

def ratio(self):
    matches = sum(triple[-1] for triple in self.get_matching_blocks())
    return _calculate_ratio(matches, len(self.a) + len(self.b))

这个方法涉及到两个比较重要的方法，一个是get_matching_blocks()，这个方法用于获取匹配的字符块。另一个方法_calculate_ratio，用于计算相似度，先看_calculate_ratio，代码如下：

def _calculate_ratio(matches, length):
    if length:
        return 2.0 * matches / length
    return 1.0

上面代码的第三行是关键，matches表示的字符个数，length是两个字符串加起来的总长度。如上面的"abcde"和 "bcde"，ratio的计算方法就是2*4/9，即8/9=0.888888。

再看看get_matching_blocks方法。这个方法比较复杂，我们先来看下，这个方法的用法：

s = SequenceMatcher(None, "abchde", "bcde")
print(s.get_matching_blocks())

输出如下：

[Match(a=1, b=0, size=2), Match(a=4, b=2, size=2), Match(a=6, b=4, size=0)]

什么意思？从方法的名字大概就能看出来，就是获得匹配的所有字符块。上面的代码输出了3个Match对象，Match(a=1, b=0, size=2)的意思是"abchde"从索引1(a=1)开始，"bcde"从索引0(b=0)开始，匹配到2（size=2）个相等字符，即“bc”。

最后一个Match(a=6, b=4, size=0)是固定的，a、b代表两个字符串的长度，size=0固定不变。用代码描述如下：

(len(a), len(b), 0)

但是如果前面字串符已经匹配过，就不会再进行匹配了，如下：

s = SequenceMatcher(None, "bc", "abchdebc")
print(s.get_matching_blocks())

输出：

[Match(a=0, b=1, size=2), Match(a=2, b=8, size=0)]

即“bc”只匹配了第一次的位置，后面就算出现和它一样的字符串，也不再进行匹配。

三、process模块

从第一张图中可以看到，除了fuzz.py这个文件，还有一个叫process.py的文件，process模块常用的是从候选列表中，返回与目标字符串最相似的一个结果。来看一个简单的例子：

from thefuzz import fuzz,process

choices = ["hello world", "hello china", "hello beijing"]
print(process.extractOne("china",choices))

#输出内容
>>('hello china', 90)

正如上面代码所示，process最常用的用法是从众多字符串中，找到最佳匹配的字符串。

process.extractOne的格式如下：

extractOne(query, choices, processor=default_processor, scorer=default_scorer, score_cutoff=0):

"""
Args:
    query: A string to match against
    choices: A list or dictionary of choices, suitable for use with extract().
    processor: Optional function for transforming choices before matching.
    scorer: Scoring function for extract().
    score_cutoff: Optional argument for score threshold. If the best
            match is found, but it is not greater than this number, then
            return None anyway ("not a good enough match").  Defaults to 0.

Returns:
    A tuple containing a single match and its score, if a match
    was found that was above score_cutoff. Otherwise, returns None.
"""

query：查询的字符串；

choices: 待匹配的字符串列表或者字典；

processor:可选参数，转换器，在匹配前先对choices进行转换处理；

scorer:可选参数，分数器，用于计算分数；

score_cutoff：可选参数，这个参数的作用是设置一个分数门槛（默认为0），如果小于这个分数，就不返回匹配的字符串，而是返回一个None。

extractOne返回的结果是一个tuple元组（最佳匹配结果，分数）。

我们比较关心的一个问题是，这个分数是怎么计算的？看看下面例子：

from thefuzz import fuzz,process

print(fuzz.ratio("china","hello china"))

choices = ["hello world", "hello china", "hello beijing"]
print(process.extractOne("china",choices))

#输出内容
>>62
>>('hello china', 90)

可以看出，fuzz.ratio与process.extractOne分数的计算方式不一样(一个是62分，一个90分)。fuzz.ratio的计分方式，上面已经讲了，下面来看看extractOne的计分方式。

extractOne的源码如下：

def extractOne(query, choices, processor=default_processor, scorer=default_scorer, score_cutoff=0):
    best_list = extractWithoutOrder(query, choices, processor, scorer, score_cutoff)
    try:
        return max(best_list, key=lambda i: i[1])
    except ValueError:
        return None

我们刚才说了，第三个参数scorer是用于计分的，它的默认值为default_scorer，那我们先找到这个default_scorer的值：

default_scorer = fuzz.WRatio

即默认的计分方式为fuzz.WRatio，那么我们回到fuzz.py中，看看WRatio是做什么的？

from thefuzz import fuzz

default_scorer = fuzz.WRatio
default_scorer("china", "hello china")

#输出
>> 90

可以看出，WRatio的计分方式确实和上面的extractOne相同，都是90分。WRatio的计分方式比较复杂，涉及到一个权重(weight)的概念，它是基于fuzz.ratio()的基础上，做了进一步的校正。

如果我们不想采用WRatio的计分方式，或者想采用fuzz.ratio()的计分方式来提取最佳匹配结果，可以这样：

from thefuzz import fuzz,process

print(fuzz.ratio("china","hello china"))

choices = ["hello world", "hello china", "hello beijing"]
print(process.extractOne("china",choices,scorer=fuzz.QRatio))

#输出
>>62
>>('hello china', 62)

上面代码的计分结果都是62分，因为fuzz.QRatio的内部，除了对参数进行了一些简单的处理以外，直接调用fuzz.ratio()方法返回了结果。所以fuzz.QRatio和fuzz.ratio()的计分方式完全相同。

fuzz.QRatio源代码：

# q is for quick
def QRatio(s1, s2, force_ascii=True, full_process=True):

    if full_process:
        p1 = utils.full_process(s1, force_ascii=force_ascii)
        p2 = utils.full_process(s2, force_ascii=force_ascii)
    else:
        p1 = s1
        p2 = s2

    if not utils.validate_string(p1):
        return 0
    if not utils.validate_string(p2):
        return 0

    return ratio(p1, p2)

通过上面的例子可以看出，如果我们对QRatio、WRatio这些计分方式不满意的话，完全可以自己实现了一个Ratio，将它做为extractOne的参数，实现定制的返回结果。