python 字符串模糊配对_Python中的模糊字符串匹配

最新推荐文章于 2024-08-18 03:31:57 发布

weixin_39526546

最新推荐文章于 2024-08-18 03:31:57 发布

阅读量1.2k

点赞数

文章标签： python 字符串模糊配对

小编典典

这里有几种优化级别可以将这个问题从O(n ^ 2)转变为较小的时间复杂度。

预处理：在第一遍中对您的列表进行排序，为每个字符串创建一个输出映射，它们对于映射的键可以是规范化的字符串。规范化可能包括：

小写转换，

没有空格，去除特殊字符，

如果可能，将unicode转换为ascii等效项，请使用unicodedata.normalize或unidecode模块)

这将导致"Andrew H Smith"，"andrew h. smith"，"ándréw h.

smith"产生相同的密钥"andrewhsmith"，和你的万余名组将减少到一个较小的一套独特的/类似的分组名。

您可以使用此utlity方法来规范化您的字符串(尽管不包括unicode部分)：

def process_str_for_similarity_cmp(input_str, normalized=False, ignore_list=[]):

""" Processes string for similarity comparisons , cleans special characters and extra whitespaces

if normalized is True and removes the substrings which are in ignore_list)

Args:

input_str (str) : input string to be processed

normalized (bool) : if True , method removes special characters and extra whitespace from string,

and converts to lowercase

ignore_list (list) : the substrings which need to be removed from the input string

Returns:

str : returns processed string

"""

for ignore_str in ignore_list:

input_str = re.sub(r'{0}'.format(ignore_str), "", input_str, flags=re.IGNORECASE)

if normalized is True:

input_str = input_str.strip().lower()

#clean special chars and extra whitespace

input_str = re.sub("\W", "", input_str).strip()

return input_str

现在，如果相似的字符串的规范化密钥相同，则它们将已经位于同一存储桶中。

为了进行进一步的比较，您将只需要比较键，而不是名称。例如 andrewhsmith和andrewhsmeeth，因为除了上面进行的标准化比较之外，名称的这种相似性还需要模糊字符串匹配。

Bucketing ：您是否真的需要将5个字符的密钥与9个字符的密钥进行比较，看看是否匹配95％？你不可以。因此，您可以创建匹配字符串的存储桶。例如，将5个字符名称与4-6个字符名称匹配，将6个字符名称与5-7个字符匹配，等等。对于大多数实际匹配而言，字符键的n + 1，n-1个字符限制是一个不错的选择。

开始比赛：名字的大部分变化都会有相同的第一个字符的标准化格式(例如Andrew H Smith，ándréw h. smith和Andrew H. Smeeth生成密钥andrewhsmith，andrewhsmith和andrewhsmeeth分别，他们通常不会在第一个字符不同，所以你可以开始键跑匹配。a其他键开头为a，并且位于长度段之内，这将大大减少您的匹配时间，因此无需匹配关键字andrewhsmith，因此bndrewhsmith，几乎不会出现带有首字母的名称变化。

然后，您可以使用此方法(或FuzzyWuzzy模块)的内容来查找字符串相似度百分比，可以排除jaro_winkler或difflib中的一个来优化速度和结果质量：

def find_string_similarity(first_str, second_str, normalized=False, ignore_list=[]):

""" Calculates matching ratio between two strings

Args:

first_str (str) : First String

second_str (str) : Second String

normalized (bool) : if True ,method removes special characters and extra whitespace

from strings then calculates matching ratio

ignore_list (list) : list has some characters which has to be substituted with "" in string

Returns:

Float Value : Returns a matching ratio between 1.0 ( most matching ) and 0.0 ( not matching )

using difflib's SequenceMatcher and and jellyfish's jaro_winkler algorithms with

equal weightage to each

Examples:

>>> find_string_similarity("hello world","Hello,World!",normalized=True)

1.0

>>> find_string_similarity("entrepreneurship","entreprenaurship")

0.95625

>>> find_string_similarity("Taj-Mahal","The Taj Mahal",normalized= True,ignore_list=["the","of"])

1.0

"""

first_str = process_str_for_similarity_cmp(first_str, normalized=normalized, ignore_list=ignore_list)

second_str = process_str_for_similarity_cmp(second_str, normalized=normalized, ignore_list=ignore_list)

match_ratio = (difflib.SequenceMatcher(None, first_str, second_str).ratio() + jellyfish.jaro_winkler(unicode(first_str), unicode(second_str)))/2.0

return match_ratio

2020-07-28

weixin_39526546

关注

0
点赞
踩
0

收藏

觉得还不错? 一键收藏
0
评论
复制链接

分享到 QQ

分享到新浪微博

扫一扫