If you're interested in a quick visual comparison of Levenshtein and Difflib similarity, I calculated both for about 2.3 million book titles:
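
    import codecs, difflib, Levenshtein, distance

    with codecs.open("titles.tsv", "r", "utf-8") as f:
        # One record per line; drop the empty string after the final newline
        title_list = f.read().split("\n")[:-1]

    for row in title_list:
        sr = row.lower().split("\t")
        # Fields 3 and 4 (0-based) are the two strings being compared
        diffl = difflib.SequenceMatcher(None, sr[3], sr[4]).ratio()
        lev = Levenshtein.ratio(sr[3], sr[4])
        # distance.sorensen and distance.jaccard return distances, so invert them
        sor = 1 - distance.sorensen(sr[3], sr[4])
        jac = 1 - distance.jaccard(sr[3], sr[4])
        print(diffl, lev, sor, jac)
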
I then plotted the results with R:
Strictly for the curious, I also compared the Difflib, Levenshtein, Sørensen, and Jaccard similarity values:

    library(ggplot2)
    require(GGally)
    difflib

Result:
The Difflib / Levenshtein similarity really is quite interesting.
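If you want to see what the four scores measure on a single pair of strings without installing anything, here is a minimal standard-library sketch. It is not the code I ran above: `levenshtein_distance` and `similarities` are hypothetical helper names, the edit-distance score is normalized by the longer string's length (an assumption, and not necessarily the same formula as `Levenshtein.ratio`), and the Sørensen and Jaccard scores are computed over sets of characters, which is my understanding of how the `distance` package treats strings.

```python
import difflib


def levenshtein_distance(a, b):
    """Classic edit distance: insert, delete, and substitute each cost 1."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarities(a, b):
    """Return (difflib, edit-distance, Sorensen, Jaccard) scores in [0, 1]."""
    diffl = difflib.SequenceMatcher(None, a, b).ratio()

    # Assumption: normalize the edit distance by the longer string's length.
    longest = max(len(a), len(b))
    lev = 1.0 if longest == 0 else 1 - levenshtein_distance(a, b) / longest

    # Assumption: set-based scores use the sets of characters in each string.
    set_a, set_b = set(a), set(b)
    total = len(set_a) + len(set_b)
    sor = 1.0 if total == 0 else 2 * len(set_a & set_b) / total
    union = len(set_a | set_b)
    jac = 1.0 if union == 0 else len(set_a & set_b) / union

    return diffl, lev, sor, jac


print(similarities("the hobbit", "hobbit, the"))
```

Because the set-based scores ignore character order, `similarities("ab", "ba")` gives 1.0 for both Sørensen and Jaccard in this sketch, while the sequence-based scores come out lower.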