Knowledge Graph Completion with the TransH Algorithm

1. Objective

Carry out a knowledge graph completion experiment using the TransH algorithm.

2. Dataset

This experiment uses FB15k, a subset of the Freebase dataset, which consists of five files: entity2id.txt, relation2id.txt, test.txt, train.txt, and valid.txt. Training mainly uses entity2id.txt, relation2id.txt, and train.txt; testing uses test.txt. The line format the loader expects is illustrated below.
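
For reference, data_loader in transH_torch.py below expects tab-separated lines of the following shape (the identifiers shown are only illustrative, with <TAB> marking a tab character):

entity2id.txt:    /m/027rn<TAB>0
relation2id.txt:  /location/country/form_of_government<TAB>0
train.txt:        /m/027rn<TAB>/m/06cx9<TAB>/location/country/form_of_government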

3. Method

This experiment uses the TransH model for knowledge graph completion, with PyTorch used to implement the algorithm.

  1. TransH algorithm principle

TransH extends TransE by learning one extra projection vector for each relation. The idea is to model the relation of a triple as a hyperplane in the embedding space: the head node or tail node is first projected onto this hyperplane, and the difference between head and tail is then computed via a translation vector lying on the hyperplane.

To some extent this alleviates TransE's difficulty in handling one-to-many, many-to-one, and other multi-mapping relation properties.

  2. Algorithm details

     1. Project the head node h and the tail node t onto the relation hyperplane and compute the difference for the triple:

        f_r(h, t) = || (h - Wr·h·Wr) + dr - (t - Wr·t·Wr) ||₂

        where Wr is the (unit) normal vector of the hyperplane and dr is the translation vector lying on the hyperplane; h - Wr·h·Wr and t - Wr·t·Wr are the projections of h and t onto the hyperplane.

     2. Compute the loss function over positive triples and their corrupted counterparts:

        L = Σ [ γ + f_r(h, t) - f_r(h', t') ]₊ + C · (soft constraints on entity norms)

        where [x]₊ is taken as max(0, x), γ is the margin value used to separate positive from negative examples, and C weighs the soft constraints.

     3. The loss function is minimised with stochastic gradient descent, as in the sketch below.
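
A minimal PyTorch sketch of this score and margin loss, independent of the full implementation listed at the end; the tensor names and random initial values here are illustrative assumptions:

import torch
import torch.nn.functional as F

def transh_score(h, t, w_r, d_r):
    # project h and t onto the hyperplane whose unit normal is w_r,
    # then measure how far the projected head, translated by d_r, lands from the projected tail
    h_perp = h - torch.dot(w_r, h) * w_r
    t_perp = t - torch.dot(w_r, t) * w_r
    return torch.norm(h_perp + d_r - t_perp)

dim = 50
h, t, h_neg = torch.randn(dim), torch.randn(dim), torch.randn(dim)
d_r = torch.randn(dim)
w_r = F.normalize(torch.randn(dim), dim=0)   # unit normal vector of the hyperplane

margin = 1.0
pos = transh_score(h, t, w_r, d_r)           # f_r(h, t) for the positive triple
neg = transh_score(h_neg, t, w_r, d_r)       # f_r(h', t) for a corrupted triple
loss = F.relu(margin + pos - neg)            # [gamma + f_pos - f_neg]+
print(float(loss))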

  3. Code implementation process (see the full listings at the end):

Training on the dataset:

  1. Load the dataset to obtain the entity set, the relation set, and the triple set.

  2. Preprocess the data: initialise the entity set, relation set, and triple set as vectors, and compute, for each relation, the average number of tail nodes per head node (tph) and the average number of head nodes per tail node (hpt).

  3. Initialise the parameters TransH needs, including the embedding dimension and the parameters required by the loss function.

  4. Initialise the entity vectors, the relation translation vectors dr, and the relation hyperplane normal vectors Wr with torch.Tensor().

  5. Start mini-batch training, splitting the data into 100 batches.

  6. Compute tph / (tph + hpt) to decide whether a negative example is produced by randomly replacing the head node or the tail node, and build the negative example set this way (see the sampling sketch after this list).

  7. Compute the loss function and minimise it with stochastic gradient descent, adjusting the vectors.

  8. Update the vectors in the entity set and the relation set.

  9. Repeat from step 5 for 100 epochs, steadily decreasing the loss.

  10. Obtain the entity set and relation set with normalised vectors.
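
For step 6, the corruption decision can be sketched as follows. This is a minimal illustration assuming the relation_tph / relation_hpt dictionaries built by data_loader in the code below and triples stored as [head_id, tail_id, relation_id]; the function name corrupt is only a placeholder.

import random

def corrupt(triple, entity_ids, relation_tph, relation_hpt):
    # Bernoulli trick from the TransH paper: replace the head with probability tph / (tph + hpt)
    h, t, r = triple
    tph, hpt = relation_tph[r], relation_hpt[r]
    p_replace_head = tph / (tph + hpt)
    corrupted = list(triple)
    if random.random() < p_replace_head:
        corrupted[0] = random.choice(entity_ids)  # corrupt the head (safer for 1-to-N relations)
    else:
        corrupted[1] = random.choice(entity_ids)  # corrupt the tail (safer for N-to-1 relations)
    return corrupted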

4. Metrics

Test evaluation:

  1. Computing the Mean Rank

For each testing triple, replace the tail node (and, symmetrically, the head node) with every entity in the entity set, compute the score function f for each candidate, and sort the results in ascending order. Averaging the rank of the correct entity over all testing triples gives the Mean Rank.

  2. Computing Hits@10

Rank the f values as above and count the testing triples whose correct answer appears in the top ten of the sorted list; dividing by the total number of ranked lists gives Hits@10. A minimal sketch of both metrics follows.
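
A minimal sketch of both metrics for tail prediction; score(h, t, r) is an assumed placeholder for the TransH distance, and the same loop is repeated with the head replaced in the full test script below.

def evaluate(test_triples, entity_ids, score):
    rank_sum, hits = 0, 0
    for h, t, r in test_triples:
        # rank every candidate tail by its score, ascending (smaller distance = better)
        candidates = sorted(entity_ids, key=lambda e: score(h, e, r))
        rank = candidates.index(t) + 1
        rank_sum += rank
        hits += 1 if rank <= 10 else 0
    mean_rank = rank_sum / len(test_triples)
    hits_at_10 = hits / len(test_triples)
    return mean_rank, hits_at_10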

5. Conclusion

The Trans family is a classic line of knowledge graph completion algorithms. TransE is the most classic model, but it cannot handle one-to-many and many-to-one relations well. TransH is similar to TransE, but it adds a relation-specific projection hyperplane, which to some extent alleviates the problem of modelling relations with multi-mapping properties.

Code:

transH_torch.py

import torch
import torch.optim as optim
import torch.nn.functional as F

import codecs
import numpy as np
import copy
import time
import random

entity2id = {}
relation2id = {}
relation_tph = {}   # for each relation, the average number of tail nodes per head node
relation_hpt = {}   # for each relation, the average number of head nodes per tail node

'''
Data loading
entity2id: {entity1:id1,entity2:id2}
relation2id: {relation1:id1,relation2:id2}
'''

def data_loader(file):
    print("load file...")
    file1 = file + "train.txt"
    file2 = file + "entity2id.txt"
    file3 = file + "relation2id.txt"

    with open(file2, 'r') as f1, open(file3, 'r') as f2:
        lines1 = f1.readlines()
        lines2 = f2.readlines()
        for line in lines1:
            line = line.strip().split('\t')
            if len(line) != 2:
                continue
            entity2id[line[0]] = line[1]

        for line in lines2:
            line = line.strip().split('\t')
            if len(line) != 2:
                continue
            relation2id[line[0]] = line[1]

    entity_set = set()      # all entities in the training set
    relation_set = set()    # all relations in the training set
    triple_list = []        # all triples in the training set
    relation_head = {}      # heads and their counts for each relation, format: {r_:{head1:count1,head2:count2}}
    relation_tail = {}      # tails and their counts for each relation, format: {r_:{tail1:count1,tail2:count2}}

    with codecs.open(file1, 'r') as f:
        content = f.readlines()
        for line in content:
            triple = line.strip().split("\t")
            if len(triple) != 3:
                continue

            h_ = entity2id[triple[0]]
            t_ = entity2id[triple[1]]
            r_ = relation2id[triple[2]]

            triple_list.append([h_, t_, r_])

            entity_set.add(h_)
            entity_set.add(t_)

            relation_set.add(r_)
            if r_ in relation_head:
                if h_ in relation_head[r_]:
                    relation_head[r_][h_] += 1
                else:
                    relation_head[r_][h_] = 1
            else:
                relation_head[r_] = {}
                relation_head[r_][h_] = 1

            if r_ in relation_tail:
                if t_ in relation_tail[r_]:
                    relation_tail[r_][t_] += 1
                else:
                    relation_tail[r_][t_] = 1
            else:
                relation_tail[r_] = {}
                relation_tail[r_][t_] = 1
    # for each relation, compute the average number of tail nodes per head node (tph)
    for r_ in relation_head:
        sum1, sum2 = 0, 0
        for head in relation_head[r_]:
            sum1 += 1
            sum2 += relation_head[r_][head]
        tph = sum2/sum1
        relation_tph[r_] = tph
    # for each relation, compute the average number of head nodes per tail node (hpt)
    for r_ in relation_tail:
        sum1, sum2 = 0, 0
        for tail in relation_tail[r_]:
            sum1 += 1
            sum2 += relation_tail[r_][tail]
        hpt = sum2/sum1
        relation_hpt[r_] = hpt

    print("Complete load. entity : %d , relation : %d , triple : %d" % (
        len(entity_set), len(relation_set), len(triple_list)))

    return entity_set, relation_set, triple_list


class TransH:
    def __init__(self, entity_set, relation_set, triple_list, embedding_dim=50, lr=0.01, margin=1.0, norm=1, C=1.0, epsilon = 1e-5):
        self.entities = entity_set  # entity set
        self.relations = relation_set   # relation set
        self.triples = triple_list  # training triples
        self.dimension = embedding_dim  # embedding dimension
        self.learning_rate = lr
        self.margin = margin
        self.norm = norm
        self.loss = 0.0
        self.norm_relations = {}    # hyperplane normal vectors Wr
        self.hyper_relations = {}   # translation vectors dr
        self.C = C  # weight of the soft constraints
        self.epsilon = epsilon  # epsilon used by the orthogonality constraint

    def data_initialise(self):
        entityVectorList = {}   # entity vector dict
        relationNormVectorList = {}
        relationHyperVectorList = {}
        device = "cpu"
        # map each entity and relation to a vector of size self.dimension (default 50), uniform in [-6/sqrt(d), 6/sqrt(d)]
        for entity in self.entities:
            entity_vector = torch.Tensor(self.dimension).uniform_(-6.0 / np.sqrt(self.dimension), 6.0 / np.sqrt(self.dimension))
            entityVectorList[entity] = entity_vector.requires_grad_(True)

        for relation in self.relations:
            relation_norm_vector = torch.Tensor(self.dimension).uniform_(-6.0 / np.sqrt(self.dimension), 6.0 / np.sqrt(self.dimension))
            relation_hyper_vector = torch.Tensor(self.dimension).uniform_(-6.0 / np.sqrt(self.dimension), 6.0 / np.sqrt(self.dimension))

            relationNormVectorList[relation] = relation_norm_vector.requires_grad_(True)
            relationHyperVectorList[relation] = relation_hyper_vector.requires_grad_(True)

        self.entities = entityVectorList    #{id:vector,id:vector}
        self.norm_relations = relationNormVectorList    #{id:vector,id:vector}
        self.hyper_relations = relationHyperVectorList  #{id:vector,id:vector}


    def training_run(self, epochs=100, nbatches=100):
        # number of triples in each batch
        batch_size = int(len(self.triples) / nbatches)
        print("batch size: ", batch_size)

        for epoch in range(epochs):
            # start timing this epoch
            start = time.time()
            # running loss for this epoch
            self.loss = 0.0

            # Normalise the embedding of the entities to 1
            # for entity in self.entities:
            #     self.entities[entity] = self.normalization(self.entities[entity])

            # process the triples batch by batch
            for batch in range(nbatches):
                # randomly draw batch_size training triples
                batch_samples = random.sample(self.triples, batch_size)

                Tbatch = []
                for sample in batch_samples:
                    # deep-copy the sample before corrupting it
                    corrupted_sample = copy.deepcopy(sample)
                    pr = np.random.random(1)[0]  # random float in [0, 1)
                    # p is the probability of replacing the head node
                    # (Bernoulli trick from the TransH paper: tph / (tph + hpt))
                    p = relation_tph[corrupted_sample[2]] / (
                                relation_tph[corrupted_sample[2]] + relation_hpt[corrupted_sample[2]])
                    # with probability p replace the head, otherwise replace the tail
                    if pr < p:
                        # replace the head node corrupted_sample[0] with a random entity
                        corrupted_sample[0] = random.choice(list(self.entities.keys()))
                        while corrupted_sample[0] == sample[0]:
                            corrupted_sample[0] = random.choice(list(self.entities.keys()))
                    else:
                        # replace the tail node corrupted_sample[1] with a random entity
                        corrupted_sample[1] = random.choice(list(self.entities.keys()))
                        while corrupted_sample[1] == sample[1]:
                            corrupted_sample[1] = random.choice(list(self.entities.keys()))
                    # add the (positive, negative) pair to Tbatch
                    if (sample, corrupted_sample) not in Tbatch:
                        Tbatch.append((sample, corrupted_sample))
                # update the embeddings with this batch
                self.update_triple_embedding(Tbatch)
            # stop timing this epoch
            end = time.time()
            print("epoch: ", epoch, "cost time: %s" % (round((end - start), 3)))
            print("running loss: ", self.loss)

        with codecs.open("entity_" + str(self.dimension) + "dim_batch" + str(batch_size), "w") as f1:

            for e in self.entities:
                f1.write(e + "\t")
                f1.write(str(list(self.entities[e])))
                f1.write("\n")

        with codecs.open("relation_norm_" + str(self.dimension) + "dim_batch" + str(batch_size), "w") as f2:
            for r in self.norm_relations:
                f2.write(r + "\t")
                f2.write(str(list(self.norm_relations[r])))
                f2.write("\n")

        with codecs.open("relation_hyper_" + str(self.dimension) + "dim_batch" + str(batch_size), "w") as f3:
            for r in self.hyper_relations:
                f3.write(r + "\t")
                f3.write(str(list(self.hyper_relations[r])))
                f3.write("\n")


    def normalization(self, vector):
        # L2-normalise the vector (not called in the main loop; kept for reference)
        v = vector / torch.norm(vector)
        return v.requires_grad_(True)

    # L2 norm: torch.norm() is the square root of the sum of squared elements
    # f = ||(h - Wr·h·Wr) + dr - (t - Wr·t·Wr)||_2
    def norm_l2(self, h, r_norm, r_hyper, t):
        return torch.norm(h - r_norm.dot(h)*r_norm + r_hyper - (t - r_norm.dot(t)*r_norm))


    # soft constraint: penalise entity embeddings whose squared L2 norm exceeds 1
    def scale_entity(self, vector):
        return torch.relu(torch.sum(vector**2) - 1)

    # # orthogonality soft constraint between Wr and dr (not used here)
    # def orthogonality(self, norm, hyper):
    #     return np.dot(norm, hyper)**2/np.linalg.norm(hyper)**2 - self.epsilon**2

    # update the embeddings for a batch of (positive, negative) triple pairs
    def update_triple_embedding(self, Tbatch):

        for correct_sample, corrupted_sample in Tbatch:
            correct_head = self.entities[correct_sample[0]]
            correct_tail  = self.entities[correct_sample[1]]

            # hyperplane normal vector Wr
            relation_norm = self.norm_relations[correct_sample[2]]

            # translation vector dr
            relation_hyper = self.hyper_relations[correct_sample[2]]

            corrupted_head = self.entities[corrupted_sample[0]]
            corrupted_tail = self.entities[corrupted_sample[1]]

            # # calculate the distance of the triples
            # correct_distance = self.norm_l2(correct_head, relation_norm, relation_hyper, correct_tail)
            # corrupted_distance = self.norm_l2(corrupted_head, relation_norm, relation_hyper, corrupted_tail)

            # SGD: adjust each involved vector to minimise the loss
            opt1 = optim.SGD([correct_head], lr=0.01)
            opt2 = optim.SGD([correct_tail], lr=0.01)
            opt3 = optim.SGD([relation_norm], lr=0.01)
            opt4 = optim.SGD([relation_hyper], lr=0.01)

            if correct_sample[0] == corrupted_sample[0]:
                # the tail node was corrupted
                opt5 = optim.SGD([corrupted_tail], lr=0.01)
                # distance of the positive triple
                correct_distance = self.norm_l2(correct_head, relation_norm, relation_hyper, correct_tail)
                # distance of the negative triple
                corrupted_distance = self.norm_l2(correct_head, relation_norm, relation_hyper, corrupted_tail)
                # soft norm constraints on the entities involved
                scale = self.scale_entity(correct_head) + self.scale_entity(correct_tail) + self.scale_entity(corrupted_tail)

            else:
                # the head node was corrupted
                opt5 = optim.SGD([corrupted_head], lr=0.01)
                correct_distance = self.norm_l2(correct_head, relation_norm, relation_hyper, correct_tail)
                corrupted_distance = self.norm_l2(corrupted_head, relation_norm, relation_hyper, correct_tail)
                scale = self.scale_entity(correct_head) + self.scale_entity(correct_tail) + self.scale_entity(corrupted_head)

            opt1.zero_grad()
            opt2.zero_grad()
            opt3.zero_grad()
            opt4.zero_grad()
            opt5.zero_grad()

            loss = F.relu(self.margin + correct_distance - corrupted_distance) + self.C * scale
            loss.backward()
            self.loss += loss.item()
            opt1.step()
            opt2.step()
            opt3.step()
            opt4.step()
            opt5.step()


            # write back only the vectors updated by this pair instead of re-normalising everything
            self.entities[correct_sample[0]] = correct_head
            self.entities[correct_sample[1]] = correct_tail
            if correct_sample[0] == corrupted_sample[0]:
                # the negative example replaced the tail entity, so write back the corrupted tail
                self.entities[corrupted_sample[1]] = corrupted_tail
            elif correct_sample[1] == corrupted_sample[1]:
                # the negative example replaced the head entity, so write back the corrupted head
                self.entities[corrupted_sample[0]] = corrupted_head
            # the paper notes that the relation embeddings need not be normalised
            self.norm_relations[correct_sample[2]] = relation_norm
            self.hyper_relations[correct_sample[2]] = relation_hyper

if __name__ == '__main__':
    file1 = "D:/Pycharmprojects/bigDataAnalysis/FB15k/"
    entity_set, relation_set, triple_list = data_loader(file1)

    transH = TransH(entity_set, relation_set, triple_list, embedding_dim=50, lr=0.01, margin=1.0, norm=1)
    transH.data_initialise()
    transH.training_run()

test1.py

import json
import operator
import time

import numpy as np
import codecs

from homework_8.transH_torch import data_loader,entity2id,relation2id


def test_data_loader(entity_embedding_file, norm_relation_embedding_file, hyper_relation_embedding_file, test_data_file):

    file1 = entity_embedding_file
    file2 = norm_relation_embedding_file
    file3 = hyper_relation_embedding_file
    file4 = test_data_file

    entity_dic = {}
    norm_relation = {}
    hyper_relation = {}
    triple_list = []

    with codecs.open(file1, 'r') as f1, codecs.open(file2, 'r') as f2, codecs.open(file3, 'r') as f3:
        lines1 = f1.readlines()
        lines2 = f2.readlines()
        lines3 = f3.readlines()
        for line in lines1:
            line = line.strip().split('\t')
            if len(line) != 2:
                continue
            entity_dic[line[0]] = json.loads(line[1])

        for line in lines2:
            line = line.strip().split('\t')
            if len(line) != 2:
                continue
            norm_relation[line[0]] = json.loads(line[1])

        for line in lines3:
            line = line.strip().split('\t')
            if len(line) != 2:
                continue
            hyper_relation[line[0]] = json.loads(line[1])

    with codecs.open(file4, 'r') as f4:
        content = f4.readlines()
        for line in content:
            triple = line.strip().split("\t")
            if len(triple) != 3:
                continue

            head = entity2id[triple[0]]
            tail = entity2id[triple[1]]
            relation = relation2id[triple[2]]

            triple_list.append([head, tail, relation])

    print("Complete load. entity : %d , relation : %d , triple : %d" % (
        len(entity_dic), len(norm_relation), len(triple_list)))

    return entity_dic, norm_relation, hyper_relation, triple_list

class testTransH:
    def __init__(self, entities_dict, norm_relation, hyper_relation, test_triple_list, train_triple_list, filter_triple=False, n=2500 ,norm=1):
        self.entities = entities_dict
        self.norm_relation = norm_relation
        self.hyper_relation = hyper_relation
        self.test_triples = test_triple_list
        self.train_triples = train_triple_list
        self.filter = filter_triple
        self.norm = norm
        self.n = n
        self.mean_rank = 0
        self.hit_10 = 0



    def test_run(self):
        hits = 0
        rank_sum = 0
        num = 0

        for triple in self.test_triples:
            start = time.time()
            num += 1

            print(num)
            rank_head_dict = {}
            rank_tail_dict = {}
            #
            for entity in self.entities.keys():

                head_triple = [entity, triple[1], triple[2]]
                if self.filter:
                    if head_triple in self.train_triples:
                        continue
                head_embedding = self.entities[head_triple[0]]
                tail_embedding = self.entities[head_triple[1]]
                norm_relation = self.norm_relation[head_triple[2]]
                hyper_relation = self.hyper_relation[head_triple[2]]
                distance = self.distance(head_embedding, norm_relation,hyper_relation, tail_embedding)
                rank_head_dict[tuple(head_triple)] = distance


            for tail in self.entities.keys():
                tail_triple = [triple[0], tail, triple[2]]
                if self.filter:
                    if tail_triple in self.train_triples:
                        continue
                head_embedding = self.entities[tail_triple[0]]
                tail_embedding = self.entities[tail_triple[1]]
                norm_relation = self.norm_relation[tail_triple[2]]
                hyper_relation = self.hyper_relation[tail_triple[2]]
                distance = self.distance(head_embedding, norm_relation, hyper_relation, tail_embedding)
                rank_tail_dict[tuple(tail_triple)] = distance

            # sort the candidates by distance in ascending order
            rank_head_sorted = sorted(rank_head_dict.items(), key=operator.itemgetter(1), reverse=False)
            rank_tail_sorted = sorted(rank_tail_dict.items(), key=operator.itemgetter(1), reverse=False)

            # compute mean rank and hit@10
            # ranking when the head node is replaced
            for i in range(len(rank_head_sorted)):
                if triple[0] == rank_head_sorted[i][0][0]:
                    if i < 10:
                        hits += 1
                    rank_sum = rank_sum + i + 1
                    break

            # ranking when the tail node is replaced
            for i in range(len(rank_tail_sorted)):
                if triple[1] == rank_tail_sorted[i][0][1]:
                    if i < 10:
                        hits += 1
                    rank_sum = rank_sum + i + 1
                    break
            end = time.time()

        # each test triple contributes two ranks: one for head replacement, one for tail replacement
        self.hit_10 = hits / (2 * num)
        self.mean_rank = rank_sum / (2 * num)

        return self.hit_10, self.mean_rank

    # TransH distance: project h and t onto the hyperplane, then return ||h_perp + dr - t_perp||^2
    def distance(self, h, r_norm, r_hyper, t):
        head = np.array(h)
        norm = np.array(r_norm)
        hyper = np.array(r_hyper)
        tail = np.array(t)
        h_hyper = head - np.dot(norm, head) * norm
        t_hyper = tail - np.dot(norm, tail) * norm
        d = h_hyper + hyper - t_hyper
        return np.sum(np.square(d))



if __name__ == "__main__":
    _, _, train_triple = data_loader("D:/Pycharmprojects/bigDataAnalysis/FB15k/")
    a="D:/Pycharmprojects/bigDataAnalysis/FB15k_200epoch_TransH_pytorch_entity_100dim_batch4800"
    entity, norm_relation, hyper_relation, test_triple = test_data_loader(a,
                                                               "D:/Pycharmprojects/bigDataAnalysis/FB15k_200epoch_TransH_pytorch_norm_relations_100dim_batch4800",
                                                               "D:/Pycharmprojects/bigDataAnalysis/FB15k_200epoch_TransH_pytorch_hyper_relations_100dim_batch4800",
                                                               "D:/Pycharmprojects/bigDataAnalysis/FB15k/test.txt")

    test = testTransH(entity, norm_relation, hyper_relation, test_triple, train_triple, filter_triple=False, n=2500, norm=2)
    hit10, mean_rank = test.test_run()
    print("raw entity hits@10: ", hit10)
    print("raw entity meanrank: ",mean_rank)

