You can use difflib.SequenceMatcher directly. Its ratio() method is defined in the standard library as follows:
def ratio(self):
    """Return a measure of the sequences' similarity (float in [0,1]).

    Where T is the total number of elements in both sequences, and
    M is the number of matches, this is 2.0*M / T.
    Note that this is 1 if the sequences are identical, and 0 if
    they have nothing in common.

    .ratio() is expensive to compute if you haven't already computed
    .get_matching_blocks() or .get_opcodes(), in which case you may
    want to try .quick_ratio() or .real_quick_ratio() first to get an
    upper bound.

    >>> s = SequenceMatcher(None, "abcd", "bcde")
    >>> s.ratio()
    0.75
    >>> s.quick_ratio()
    0.75
    >>> s.real_quick_ratio()
    1.0
    """
    matches = sum(triple[-1] for triple in self.get_matching_blocks())
    return _calculate_ratio(matches, len(self.a) + len(self.b))
The complete code:
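import difflib


# Compute the similarity of two strings with the difflib library.
# quick_ratio() returns an upper bound on ratio() and is cheaper to compute.
def get_similar(str1, str2):
    return difflib.SequenceMatcher(None, str1, str2).quick_ratio()


# Run the function as a quick check.
if __name__ == '__main__':
    a = '阿里巴巴集团创始人'
    b = '云南大学副教授'
    print(get_similar(a, b))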
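Because quick_ratio() is only an upper bound on ratio(), the two can disagree when the strings share characters in a different order. A minimal sketch of that difference follows; the helper name compare_ratios and the sample strings are made up for this illustration and are not part of difflib:

```python
import difflib


def compare_ratios(str1, str2):
    """Return (ratio, quick_ratio, real_quick_ratio) for two strings.

    compare_ratios is a hypothetical helper used only for illustration.
    """
    matcher = difflib.SequenceMatcher(None, str1, str2)
    return matcher.ratio(), matcher.quick_ratio(), matcher.real_quick_ratio()


if __name__ == '__main__':
    # Same characters in reverse order: quick_ratio() ignores ordering,
    # so it reports a higher score than ratio() does.
    exact, quick, real_quick = compare_ratios('abcd', 'dcba')
    print(exact, quick, real_quick)
```

If ordering matters for your data, ratio() is probably the safer choice; quick_ratio() is better suited as a cheap pre-filter.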
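The similarity calculation used by this method is very simple:
similar = 2M / T
M: the number of characters the two strings have in common (ratio() counts only characters inside matching blocks; quick_ratio() counts them regardless of order)
T: the total number of characters in both strings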
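To make the formula concrete, here is a small sketch that recomputes 2M / T by hand from get_matching_blocks() for the "abcd" / "bcde" pair used in the docstring. The function name manual_ratio is an assumption made for this example, not a difflib API:

```python
import difflib


def manual_ratio(str1, str2):
    """Recompute 2*M/T by hand (manual_ratio is a hypothetical helper)."""
    matcher = difflib.SequenceMatcher(None, str1, str2)
    # Each block is (a, b, size); the last block is a size-0 sentinel.
    m = sum(block.size for block in matcher.get_matching_blocks())
    t = len(str1) + len(str2)
    # Two empty sequences are treated as identical, as difflib does.
    return 2.0 * m / t if t else 1.0


if __name__ == '__main__':
    # "abcd" vs "bcde": the matching block is "bcd", so M = 3 and T = 8.
    print(manual_ratio('abcd', 'bcde'))  # 0.75
```

Here M = 3 and T = 4 + 4 = 8, giving 2 * 3 / 8 = 0.75, which agrees with the value shown in the docstring.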