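When comparing data, you sometimes need to compare articles or paragraphs, but punctuation then gets counted as part of the comparison. This post shows how to strip the punctuation first and keep only the remaining characters.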
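import re

def remove_punctuation(sentence):
    # Match every character that is neither a word character (\w) nor
    # whitespace (\s), and delete it (replace it with an empty string).
    sentence = re.sub(r'[^\w\s]', '', sentence)
    return sentence

ss = "hello~ world!"
print(remove_punctuation(ss))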
Result: hello world
With the punctuation gone, two strings can be compared on their text alone.
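Two details of the pattern above are worth a quick check before comparing real text. In Python 3, \w is Unicode-aware for str patterns, so full-width Chinese punctuation is removed while the Chinese characters are kept; however, \w also matches the underscore, so "_" survives. The sketch below is only an illustration: normalize_for_compare is a hypothetical helper name, not part of the original post, and it assumes you also want underscores dropped and runs of whitespace collapsed.

```python
import re

def normalize_for_compare(sentence):
    # Hypothetical helper (an assumption, not from the original post).
    # Remove everything that is not a word character or whitespace,
    # and additionally remove "_" because \w would otherwise keep it.
    cleaned = re.sub(r'[^\w\s]|_', '', sentence)
    # Collapse the extra spaces that deleted punctuation can leave behind.
    return " ".join(cleaned.split())

print(normalize_for_compare("你好,世界!"))       # 你好世界
print(normalize_for_compare("snake_case: ok?"))   # snakecase ok
print(normalize_for_compare("hello ~ world!"))    # hello world
```

Whether underscores and repeated spaces should count as differences depends on the data being compared, so treat both steps as optional.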
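import re

import Levenshtein  # third-party package, must be installed separately

def remove_punctuation(sentence):
    # Match every character that is neither a word character (\w) nor
    # whitespace (\s), and delete it (replace it with an empty string).
    sentence = re.sub(r'[^\w\s]', '', sentence)
    return sentence

def levenshtein_similarity(text1, text2):
    # Similarity = 1 - (edit distance / length of the longer string).
    distance = Levenshtein.distance(text1, text2)
    max_length = max(len(text1), len(text2))
    if max_length == 0:
        # Two empty strings are identical; avoid dividing by zero.
        return 1.0
    similarity = 1 - distance / max_length
    return similarity

ss = "hello~ world!"
ss1 = "hello world~~~"
ss2 = "hello word!"

print(levenshtein_similarity(ss, ss1))
print(levenshtein_similarity(remove_punctuation(ss), remove_punctuation(ss1)))
print(levenshtein_similarity(ss, ss2))
print(levenshtein_similarity(remove_punctuation(ss), remove_punctuation(ss2)))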
Results:
0.7142857142857143
1.0
0.8461538461538461
0.9090909090909091
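Looking at these results: after removing punctuation, ss becomes "hello world", ss1 also becomes "hello world", and ss2 becomes "hello word". That is why the first pair scores 1.0 once cleaned, while the second pair still differs by one character out of eleven (1 - 1/11 ≈ 0.909).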
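To check the four numbers by hand without the third-party package, the edit distance can be computed with the standard dynamic-programming recurrence. This is a minimal pure-Python sketch for verification only; edit_distance and similarity are illustrative names, not the Levenshtein library's API, and the library is the better choice for long texts.

```python
def edit_distance(a, b):
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]

def similarity(a, b):
    # Same formula as in the post: 1 - distance / length of the longer string.
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1 - edit_distance(a, b) / longest

print(edit_distance("hello~ world!", "hello world~~~"))  # 4, so 1 - 4/14
print(edit_distance("hello~ world!", "hello word!"))     # 2, so 1 - 2/13
print(edit_distance("hello world", "hello word"))        # 1, so 1 - 1/11
```

The distances 4, 2, and 1 reproduce the scores 0.714..., 0.846..., and 0.909... listed above.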