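Recently, while building a knowledge graph, I needed a method for entity alignment. It turns out that minimum edit distance and the Jaccard coefficient can be combined into a simple entity alignment algorithm. The original code is in the reference at the end, but it was written a bit roughly, so I have tidied it up here.
Minimum edit distance code: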
import numpy as np

def edit_distance(word1, word2):
    len1 = len(word1)
    len2 = len(word2)
    # dp[i][j] = edit distance between word1[:i] and word2[:j]
    dp = np.zeros((len1 + 1, len2 + 1))
    for i in range(len1 + 1):
        dp[i][0] = i
    for j in range(len2 + 1):
        dp[0][j] = j
    for i in range(1, len1 + 1):
        for j in range(1, len2 + 1):
            # substitution costs 0 when the characters match, 1 otherwise
            delta = 0 if word1[i - 1] == word2[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j - 1] + delta, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len1][len2]
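The function above returns a float, because `np.zeros` creates a float array by default. As a minimal sketch of the same recurrence without NumPy, the version below keeps only two rows of plain Python integers. The name `edit_distance_rows` is made up for this illustration and is not part of the original code.

```python
def edit_distance_rows(word1, word2):
    """Levenshtein distance using two rolling rows of Python ints."""
    # prev[j] = distance between the empty prefix of word1 and word2[:j]
    prev = list(range(len(word2) + 1))
    for i, ch1 in enumerate(word1, start=1):
        curr = [i]  # distance between word1[:i] and the empty string
        for j, ch2 in enumerate(word2, start=1):
            delta = 0 if ch1 == ch2 else 1
            curr.append(min(prev[j - 1] + delta,  # match or substitute
                            prev[j] + 1,          # delete from word1
                            curr[j - 1] + 1))     # insert into word1
        prev = curr
    return prev[-1]


print(edit_distance_rows("vipki", "vipkid"))  # 1
print(edit_distance_rows("vip", "vipkid"))    # 3
```

Memory drops from O(len1 * len2) to O(len2), which only matters for long strings; for short entity names either version is fine.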
Jaccard code:
def Jaccrad(terms_model, reference):
    # compare the two strings as sets of characters
    grams_reference = set(reference)
    grams_model = set(terms_model)
    temp = 0  # size of the intersection
    for i in grams_reference:
        if i in grams_model:
            temp = temp + 1
    # fenmu is the denominator: the size of the union
    fenmu = len(grams_model) + len(grams_reference) - temp
    jaccard_coefficient = float(temp / fenmu)
    return jaccard_coefficient
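The same coefficient can be written more compactly with Python's set operators. The sketch below is an alternative, not the original author's code: the name `jaccard_chars` is hypothetical, and returning 1.0 for two empty strings is an assumption made here to avoid the division by zero that the loop version would hit in that case.

```python
def jaccard_chars(a, b):
    """Jaccard coefficient of two strings, treated as sets of characters."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        # Assumption: two empty strings count as identical.
        return 1.0
    return len(set_a & set_b) / len(union)


print(jaccard_chars("vipki", "vipkid"))  # 0.8
print(jaccard_chars("vip", "vipkid"))    # 0.6
```

Note that working on character sets ignores both order and repetition, so "abc" and "cba" get a coefficient of 1.0; this is one reason the post averages it with an edit-distance score.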
Test code:
blists = ["vipkid", "vipki", 'vip', '福建省委']
for i in range(len(blists)):
    for j in range(0, i):
        a = blists[i]
        b = blists[j]
        print(blists[i], blists[j])
        td = Jaccrad(a, b)  # Jaccard similarity
        # print(td)
        std = edit_distance(a, b) / max(len(a), len(b))  # normalized edit distance
        fy = 1 - std  # edit-distance similarity
        # print(fy)
        huizon = (td + fy) / 2  # average of the two similarities
        print('avg_sim: ', huizon)
Output:
vipki vipkid
avg_sim: 0.8166666666666667
vip vipkid
avg_sim: 0.55
vip vipki
avg_sim: 0.675
福建省委 vipkid
avg_sim: 0.0
福建省委 vipki
avg_sim: 0.0
福建省委 vip
avg_sim: 0.0
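Each `avg_sim` value is the mean of the Jaccard coefficient and one minus the length-normalized edit distance. Written out, with notation introduced here only for illustration (J for the character-set Jaccard coefficient, d for the edit distance, |a| for string length), and worked through for the pair vipki / vipkid:

```latex
\[
\mathrm{sim}(a, b) = \frac{1}{2}\left( J(a, b) + 1 - \frac{d(a, b)}{\max(|a|, |b|)} \right)
\]
% Worked example: a = vipki, b = vipkid
% J = 4/5 (4 shared characters, 5 in the union), d = 1, max length = 6
\[
\mathrm{sim} = \frac{1}{2}\left( \frac{4}{5} + 1 - \frac{1}{6} \right)
             = \frac{1}{2} \cdot \frac{49}{30}
             = \frac{49}{60} \approx 0.8167
\]
```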
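The results are decent. Counterexamples certainly exist, so the next step is to pick a suitable threshold for deciding when two entities should be aligned. The threshold is yours to choose, and the downstream logic is left for you to write.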
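As one possible shape for that downstream step, here is a minimal, self-contained sketch that groups names whose averaged similarity to a group's first member reaches a threshold. Everything in it is an assumption for illustration: the names `avg_sim` and `group_entities`, the greedy first-match strategy, and the threshold of 0.8, which should be tuned on your own data.

```python
def edit_distance(a, b):
    """Levenshtein distance with two rolling rows."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j - 1] + cost, prev[j] + 1, curr[j - 1] + 1))
        prev = curr
    return prev[-1]


def avg_sim(a, b):
    """Mean of character-set Jaccard and normalized edit-distance similarity."""
    if not a and not b:
        return 1.0  # assumption: two empty strings count as identical
    set_a, set_b = set(a), set(b)
    jaccard = len(set_a & set_b) / len(set_a | set_b)
    edit_sim = 1 - edit_distance(a, b) / max(len(a), len(b))
    return (jaccard + edit_sim) / 2


def group_entities(names, threshold=0.8):
    """Greedy grouping: a name joins the first group whose first member is
    at least `threshold` similar to it, otherwise it starts a new group."""
    groups = []
    for name in names:
        for group in groups:
            if avg_sim(name, group[0]) >= threshold:
                group.append(name)
                break
        else:
            groups.append([name])
    return groups


print(group_entities(["vipkid", "vipki", "vip", "福建省委"]))
# [['vipkid', 'vipki'], ['vip'], ['福建省委']]
```

The greedy pass depends on input order and compares only against each group's first member, so it is a starting point rather than a proper clustering; a lower threshold such as 0.5 would also merge "vip" into the first group.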
References
[1] Association alignment (entity alignment) for knowledge graphs based on the Neo4j graph database, Part 1. https://blog.csdn.net/for_yayun/article/details/100292617