Entity Alignment Basics: Aligning Knowledge Graph Entities in a Neo4j Graph Database with Minimum Edit Distance and the Jaccard Coefficient

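While working on a knowledge graph recently, I needed an entity alignment method, and found that minimum edit distance and the Jaccard coefficient can be combined into a simple one. The original code is in the reference [1], but it is a bit rough, so I have tidied it up here. The minimum edit distance code: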

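import numpy as np

def edit_distance(word1, word2):
    """Return the minimum edit (Levenshtein) distance between two strings."""
    len1 = len(word1)
    len2 = len(word2)
    # dp[i][j] is the distance between word1[:i] and word2[:j]
    dp = np.zeros((len1 + 1, len2 + 1), dtype=int)
    for i in range(len1 + 1):
        dp[i][0] = i  # delete all i characters
    for j in range(len2 + 1):
        dp[0][j] = j  # insert all j characters
    for i in range(1, len1 + 1):
        for j in range(1, len2 + 1):
            delta = 0 if word1[i - 1] == word2[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j - 1] + delta, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return int(dp[len1][len2])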
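
The recurrence only ever reads the previous row of the table, so the same distance can be computed without NumPy and without the full table. The following is a minimal sketch of that idea, not the code from the reference; the name `edit_distance_rows` is made up for this illustration.

```python
def edit_distance_rows(word1, word2):
    """Levenshtein distance using two rows instead of a full table.

    `edit_distance_rows` is a hypothetical name used only for this sketch.
    """
    # prev[j] holds the distance between word1[:i-1] and word2[:j]
    prev = list(range(len(word2) + 1))
    for i, ch1 in enumerate(word1, start=1):
        curr = [i]  # distance between word1[:i] and the empty string
        for j, ch2 in enumerate(word2, start=1):
            delta = 0 if ch1 == ch2 else 1
            curr.append(min(prev[j - 1] + delta,  # substitute or match
                            prev[j] + 1,          # delete from word1
                            curr[j - 1] + 1))     # insert into word1
        prev = curr
    return prev[-1]


print(edit_distance_rows("vipki", "vipkid"))    # 1
print(edit_distance_rows("kitten", "sitting"))  # 3
```

Memory use drops from one cell per character pair to two rows the length of the second string, which may matter if entity names are long.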

The Jaccard code:

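def Jaccrad(terms_model, reference):
    """Return the Jaccard coefficient of the character sets of two strings."""
    grams_reference = set(reference)
    grams_model = set(terms_model)
    # temp counts the characters shared by both sets (the intersection size)
    temp = 0
    for i in grams_reference:
        if i in grams_model:
            temp = temp + 1
    # fenmu (the denominator) is the union size: |A| + |B| - |A ∩ B|
    fenmu = len(grams_model) + len(grams_reference) - temp
    jaccard_coefficient = float(temp / fenmu)
    return jaccard_coefficient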
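
The same coefficient can be written with Python's set operators. This is a minimal sketch rather than the code from the reference: the name `jaccard_chars` is made up, and returning 0.0 when both strings are empty is an assumption (the function above would raise a division-by-zero error in that case).

```python
def jaccard_chars(a, b):
    """Jaccard coefficient over the sets of characters in two strings.

    `jaccard_chars` is a hypothetical name; returning 0.0 for two empty
    strings is an assumption, not something the original code defines.
    """
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        return 0.0
    return len(set_a & set_b) / len(union)


print(jaccard_chars("vipki", "vipkid"))  # 0.8
print(jaccard_chars("abc", "cba"))       # 1.0
```

Because it compares character sets, this measure ignores both order and repetition, which is why "abc" and "cba" score 1.0. That is one reason to pair it with edit distance.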

Test code:

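blists = ["vipkid", "vipki", "vip", "福建省委"]
for i in range(len(blists)):
    for j in range(0, i):
        a = blists[i]
        b = blists[j]
        print(a, b)
        td = Jaccrad(a, b)  # Jaccard similarity on character sets
        # normalize the edit distance by the length of the longer string
        std = edit_distance(a, b) / max(len(a), len(b))
        fy = 1 - std  # turn the normalized distance into a similarity
        huizon = (td + fy) / 2  # average of the two similarities
        print('avg_sim: ', huizon)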

Output:

vipki vipkid
avg_sim:  0.8166666666666667
vip vipkid
avg_sim:  0.55
vip vipki
avg_sim:  0.675
福建省委 vipkid
avg_sim:  0.0
福建省委 vipki
avg_sim:  0.0
福建省委 vip
avg_sim:  0.0
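
The score printed by the test loop can be written as one formula. The symbols are introduced only for this note: A and B are the character sets of strings a and b, and d(a, b) is their edit distance.

```latex
\mathrm{avg\_sim}(a, b) = \frac{1}{2}\left( \frac{|A \cap B|}{|A \cup B|} + 1 - \frac{d(a, b)}{\max(|a|, |b|)} \right)
```

For "vipki" and "vipkid", the sets share 4 of 5 distinct characters and the edit distance is 1, so the score is (4/5 + 1 - 1/6) / 2 ≈ 0.8167, matching the first line of the output.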

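The results are reasonable. Counterexamples are easy to construct, of course, so the next step is to choose a suitable threshold on the score and use it to decide which entities align. The threshold and the downstream logic are up to you.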
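
As a rough illustration of the thresholding step, the sketch below scores every pair of names and keeps those at or above a cutoff. It is self-contained, so it re-implements both measures. The names `avg_sim` and `align_candidates` and the threshold of 0.6 are assumptions made for this example, not values from the reference.

```python
from itertools import combinations


def edit_distance(a, b):
    """Levenshtein distance, kept to two rows so the block needs no NumPy."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j - 1] + (ca != cb), prev[j] + 1, curr[j - 1] + 1))
        prev = curr
    return prev[-1]


def avg_sim(a, b):
    """Mean of character-set Jaccard and normalized edit similarity."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    jaccard = len(set_a & set_b) / len(union) if union else 0.0
    longest = max(len(a), len(b))
    edit_sim = 1 - edit_distance(a, b) / longest if longest else 0.0
    return (jaccard + edit_sim) / 2


def align_candidates(names, threshold):
    """Return (a, b, score) for every pair scoring at or above threshold."""
    pairs = []
    for a, b in combinations(names, 2):
        score = avg_sim(a, b)
        if score >= threshold:
            pairs.append((a, b, score))
    return pairs


names = ["vipkid", "vipki", "vip", "福建省委"]
THRESHOLD = 0.6  # assumed value for illustration only; tune on your own data
for a, b, score in align_candidates(names, THRESHOLD):
    print(a, b, round(score, 4))
```

With this cutoff the pairs (vipkid, vipki) and (vipki, vip) are kept, while (vipkid, vip) at 0.55 is dropped. As one counterexample, the anagrams "abc" and "cab" score about 0.667 and would pass, so the threshold is worth checking against real entity names.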

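References

[1] 基于Neo4j 图数据库的知识图谱的关联对齐(实体对齐)——上篇 (Association Alignment (Entity Alignment) of Knowledge Graphs Based on the Neo4j Graph Database, Part 1). https://blog.csdn.net/for_yayun/article/details/100292617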

 

