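I previously implemented this in plain Python, but training and testing with that version took a very long time, so I rewrote it in PyTorch, which also served as a way to learn PyTorch. As before, I walk through the complete model pipeline. First, the algorithm as given in the paper:
Data processing
PyTorch ships with two data-processing classes, Dataset and DataLoader. Dataset is used to build a dataset; reading the raw data, for example, can be done inside it. DataLoader is the next step after Dataset: during training we usually split the dataset into mini-batches of a given batch size, and DataLoader does exactly that. Of the two, Dataset is where the core work lies.
The Config class
For convenience, I first wrote a Config.py file that stores the file paths used throughout the project.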
class Config(object):
    def __init__(self):
        super().__init__()
        self.train_fb15k = "./datasets/fb15k/train.txt"  # training set path
        self.test_fb15k = "./datasets/fb15k/test.txt"  # test set path
        self.valid_fb15k = "./datasets/fb15k/valid.txt"  # validation set path
        self.entity2id_train_file = "./datasets/fb15k/entity2id_train.txt"  # entity-to-index mapping, training set
        self.relation2id_train_file = "./datasets/fb15k/relation2id_train.txt"  # relation-to-index mapping, training set
        self.entity2id_test_file = "./datasets/fb15k/entity2id_test.txt"  # entity-to-index mapping, test set
        self.relation2id_test_file = "./datasets/fb15k/relation2id_test.txt"  # relation-to-index mapping, test set
        self.entity2id_valid_file = "./datasets/fb15k/entity2id_valid.txt"  # entity-to-index mapping, validation set
        self.relation2id_valid_file = "./datasets/fb15k/relation2id_valid.txt"  # relation-to-index mapping, validation set
        self.entity_50dim_batch400 = "./datasets/fb15k/entity_50dim_batch400"  # training result: 50-dim entity embeddings, batch size 400
        self.relation_50dim_batch400 = "./datasets/fb15k/relation_50dim_batch400"  # training result: 50-dim relation embeddings, batch size 400
        self.saved_TransE = "./saved_models/TransE.pkl"  # saved TransE model
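The torch.utils.data.Dataset class
Dataset lives in torch.utils.data. To build our own dataset with it, we subclass Dataset and override the __len__ and __getitem__ methods. Taking the training set as an example, I created a TrainSet.py file dedicated to processing the training set; the dataset is still FB15k.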
TrainSet.py
import random
from collections import Counter

import numpy as np
import pandas as pd
from torch.utils.data import Dataset, DataLoader

from config import Config


class TrainSet(Dataset):
    '''
    Builds the training set.
    '''

    def __init__(self, config):
        super(TrainSet, self).__init__()
        # configuration (file paths)
        self.config = config
        # load the data
        self.entity_dic, self.relation_dic, self.pos_triples = self.load_data()
        # number of samples
        self.sample_num = len(self.pos_triples)
        # number of entities
        self.entity_num = len(self.entity_dic)
        # number of relation types
        self.relation_num = len(self.relation_dic)
        # generate the negative samples
        self.neg_triples = self.generate_neg()
        print(f"TrainSet: {self.entity_num} entities, {self.relation_num} relations, {self.sample_num} triples.")

    # override __len__
    def __len__(self):
        return self.sample_num

    # override __getitem__
    def __getitem__(self, item):
        return [self.pos_triples[item], self.neg_triples[item]]

    def load_data(self):
        '''
        author: Chengyu Lin 2020/9/20 10:38
        description: load the data and build the entity-to-index and relation-to-index mappings
        return: entity_dic (dict), relation_dic (dict), triples (numpy array of positive triples)
        '''
        # read raw data
        raw_data = pd.read_csv(self.config.train_fb15k, sep='\t', header=None,
                               names=['head', 'relation', 'tail'],
                               keep_default_na=False, encoding='utf-8')
        # strip leading/trailing whitespace from every cell
        # (on pandas >= 2.1, DataFrame.map is the non-deprecated name for applymap)
        raw_data = raw_data.applymap(lambda x: x.strip())
        # get head, relation, tail
        head_count = Counter(raw_data['head'])
        relation_count = Counter(raw_data['relation'])
        tail_count = Counter(raw_data['tail'])
        entity_list = list((head_count + tail_count).keys())
        relation_list = list(relation_count.keys())
        # convert data to dic
        entity_dic = dict([(entity, idx) for idx, entity in enumerate(entity_list)])
        relation_dic = dict([(relation, idx) for idx, relation in enumerate(relation_list)])
        # convert triples to index
        triples = self.convert_triple_to_index(raw_data.values, relation_dic, entity_dic)
        return entity_dic, relation_dic, triples

    def convert_triple_to_index(self, triples, relation_dic, entity_dic):
        '''
        author: Chengyu Lin 2020/9/20 10:38
        description: convert each triple to index form, e.g. [head, relation, tail] => [0,0,0]
        param: triples: triples from the raw data, type: list, e.g.
               [/m/01qscs,/award/award_nominee/award_nominations./award/award_nomination/award,/m/02x8n1n]
        param: relation_dic: relation to index, type: dict, e.g. {relation: 0}
        param: entity_dic: entity to index, type: dict, e.g. {entity: 0}
        return: triple_set, numpy array, e.g. [head, relation, tail] => [0,0,0]
        '''
        triple_set = np.array([[entity_dic[triple[0]], relation_dic[triple[1]], entity_dic[triple[2]]]
                               for triple in triples])
        return triple_set

    def generate_neg(self):
        '''
        author: Chengyu Lin 2020/9/20 10:38
        description: generate one negative (corrupted) triple per positive triple
        param: None
        return: numpy array of corrupted triples in index form, e.g. [0,0,0]
        '''
        neg_data = []
        for idx, v in enumerate(self.pos_triples):
            seed = random.random()  # random draw that decides whether to replace the head or the tail
            if seed > 0.5:
                # replace head; entity indices are 0..entity_num-1, so an index is drawn directly
                # (random.sample on dict keys raises TypeError on Python 3.11+)
                rand_head = v[0]
                while rand_head == v[0]:
                    rand_head = random.randrange(self.entity_num)
                neg_data.append([rand_head, v[1], v[2]])
            else:
                # replace tail
                rand_tail = v[2]
                while rand_tail == v[2]:
                    rand_tail = random.randrange(self.entity_num)
                neg_data.append([v[0], v[1], rand_tail])
        return np.array(neg_data)
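TrainSet.py contains a single class, TrainSet, which inherits from Dataset in torch.utils.data. For DataLoader to be able to use it, the class must provide the __len__ and __getitem__ methods mentioned earlier: the first returns the size of the dataset and the second returns one sample. To back these two methods with real data, the class also has to load the data into its attributes, which is what the three methods load_data, convert_triple_to_index, and generate_neg are for.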
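The contract described above can be illustrated without PyTorch. The sketch below is a simplified stand-in, not DataLoader's actual implementation: ToyTripleSet and iter_batches are hypothetical names, and the triples are made-up index values. It is only meant to show why __len__ and __getitem__ are enough for a loader to shuffle indices and cut them into batches.

```python
import random


class ToyTripleSet:
    """Hypothetical map-style dataset: only __len__ and __getitem__ are needed."""

    def __init__(self, pos_triples, neg_triples):
        assert len(pos_triples) == len(neg_triples)
        self.pos_triples = pos_triples
        self.neg_triples = neg_triples

    def __len__(self):
        # Number of samples a loader may index into.
        return len(self.pos_triples)

    def __getitem__(self, item):
        # One sample = a positive triple paired with its corrupted triple.
        return self.pos_triples[item], self.neg_triples[item]


def iter_batches(dataset, batch_size, shuffle=False, seed=0):
    """Simplified stand-in for a loader: yields lists of samples."""
    indices = list(range(len(dataset)))          # relies on __len__
    if shuffle:
        random.Random(seed).shuffle(indices)
    for start in range(0, len(indices), batch_size):
        chunk = indices[start:start + batch_size]
        yield [dataset[i] for i in chunk]        # relies on __getitem__


# Made-up index triples in [head, relation, tail] order.
pos = [[0, 0, 1], [1, 0, 2], [2, 1, 3], [3, 1, 0], [0, 1, 2]]
neg = [[4, 0, 1], [1, 0, 4], [2, 1, 0], [4, 1, 0], [0, 1, 4]]
toy_set = ToyTripleSet(pos, neg)
batches = list(iter_batches(toy_set, batch_size=2))
```

With five samples and a batch size of 2, the last batch holds a single sample. The real DataLoader additionally collates each batch into tensors, so the positive and negative triples arrive as two tensors of shape [batch_size, 3].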
The load_data method
    def load_data(self):
        '''
        author: Chengyu Lin 2020/9/20 10:38
        description: load the data and build the entity-to-index and relation-to-index mappings
        return: entity_dic (dict), relation_dic (dict), triples (numpy array of positive triples)
        '''
        # read raw data
        raw_data = pd.read_csv(self.config.train_fb15k, sep='\t', header=None,
                               names=['head', 'relation', 'tail'],
                               keep_default_na=False, encoding='utf-8')
        # strip leading/trailing whitespace from every cell
        # (on pandas >= 2.1, DataFrame.map is the non-deprecated name for applymap)
        raw_data = raw_data.applymap(lambda x: x.strip())
        # get head, relation, tail
        head_count = Counter(raw_data['head'])
        relation_count = Counter(raw_data['relation'])
        tail_count = Counter(raw_data['tail'])
        entity_list = list((head_count + tail_count).keys())
        relation_list = list(relation_count.keys())
        # convert data to dic
        entity_dic = dict([(entity, idx) for idx, entity in enumerate(entity_list)])
        relation_dic = dict([(relation, idx) for idx, relation in enumerate(relation_list)])
        # convert triples to index
        triples = self.convert_triple_to_index(raw_data.values, relation_dic, entity_dic)
        return entity_dic, relation_dic, triples
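This method relies mainly on two packages, pandas and collections. It first reads the training file with pandas' read_csv, which conveniently handles details such as the separator. It then uses applymap to apply a lambda to every cell of raw_data, stripping leading and trailing whitespace (including characters such as \n). The data in FB15k consists of RDF-style triples, which typically look like this:
Each record uses \t as the separator (sep), and its basic format is head entity, relation, tail entity. So raw_data, once read, looks like this:
Next, the Counter class from the collections library is used to count the data. You can print the three most common entries of head_count with the following line: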
print(f"head_count: {head_count.most_common(3)}")
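The result is:
As you can see, Counter counts the values passed to it and keeps each element as a unique key, which gives us the set of distinct entities. Assigning each of them an integer index plays the role that one-hot encoding played in the Python version. Because every entity needs an index, the entities in head_count and tail_count are merged into a single list, entity_list, and enumerate is then used to build the entity-to-index dictionary; relations are handled the same way. Next, every triple has to be converted to its encoded form, which is what convert_triple_to_index does.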
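The mapping step can be tried on a handful of made-up triples. This is a minimal sketch using only the standard library; the entity and relation ids below are invented placeholders, not real FB15k records.

```python
from collections import Counter

# Made-up triples in (head, relation, tail) order.
raw_triples = [
    ("/m/a", "/rel/x", "/m/b"),
    ("/m/b", "/rel/x", "/m/c"),
    ("/m/a", "/rel/y", "/m/c"),
]

head_count = Counter(h for h, _, _ in raw_triples)
relation_count = Counter(r for _, r, _ in raw_triples)
tail_count = Counter(t for _, _, t in raw_triples)

# Adding two Counters merges their keys, so every entity appears exactly once.
entity_list = list((head_count + tail_count).keys())
relation_list = list(relation_count.keys())

# enumerate assigns consecutive integer indices starting at 0.
entity_dic = {entity: idx for idx, entity in enumerate(entity_list)}
relation_dic = {relation: idx for idx, relation in enumerate(relation_list)}
```

The counts themselves are discarded here; Counter is used only as a convenient way to deduplicate while keeping first-seen order.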
The convert_triple_to_index method
    def convert_triple_to_index(self, triples, relation_dic, entity_dic):
        '''
        author: Chengyu Lin 2020/9/20 10:38
        description: convert each triple to index form, e.g. [head, relation, tail] => [0,0,0]
        param: triples: triples from the raw data, type: list, e.g.
               [/m/01qscs,/award/award_nominee/award_nominations./award/award_nomination/award,/m/02x8n1n]
        param: relation_dic: relation to index, type: dict, e.g. {relation: 0}
        param: entity_dic: entity to index, type: dict, e.g. {entity: 0}
        return: triple_set, numpy array, e.g. [head, relation, tail] => [0,0,0]
        '''
        triple_set = np.array([[entity_dic[triple[0]], relation_dic[triple[1]], entity_dic[triple[2]]]
                               for triple in triples])
        return triple_set
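The docstring already covers part of it: this method uses the entity-to-index and relation-to-index dictionaries built earlier to turn each triple into a list of indices. To make later computation easier, the samples are also converted to a numpy matrix. At this point the conversion from samples to indices is complete.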
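As a small worked instance of this conversion, the sketch below uses plain lists instead of numpy. The dictionaries and triples are made-up placeholders, and convert_triples is a hypothetical free-function version of the method.

```python
# Made-up mappings of the kind load_data would build.
entity_dic = {"/m/a": 0, "/m/b": 1, "/m/c": 2}
relation_dic = {"/rel/x": 0, "/rel/y": 1}

raw_triples = [
    ("/m/a", "/rel/x", "/m/b"),
    ("/m/b", "/rel/x", "/m/c"),
    ("/m/a", "/rel/y", "/m/c"),
]


def convert_triples(triples, entity_dic, relation_dic):
    """Look up head and tail in entity_dic and the relation in relation_dic."""
    return [[entity_dic[h], relation_dic[r], entity_dic[t]] for h, r, t in triples]


indexed = convert_triples(raw_triples, entity_dic, relation_dic)
```

Because the lookup is a plain dictionary access, a triple that mentions an entity or relation missing from the mappings raises a KeyError. That matters when the same mappings are reused for another split, as TestSet does with the training dictionaries.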
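The generate_neg method: generating corrupted triples
Unlike the earlier Python version, the negative samples are generated here during data preprocessing. Each positive triple therefore keeps the same negative for the whole training run instead of getting a freshly sampled one per mini-batch; the rest of the algorithm is unchanged.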
    def generate_neg(self):
        '''
        author: Chengyu Lin 2020/9/20 10:38
        description: generate one negative (corrupted) triple per positive triple
        param: None
        return: numpy array of corrupted triples in index form, e.g. [0,0,0]
        '''
        neg_data = []
        for idx, v in enumerate(self.pos_triples):
            seed = random.random()  # random draw that decides whether to replace the head or the tail
            if seed > 0.5:
                # replace head; entity indices are 0..entity_num-1, so an index is drawn directly
                # (random.sample on dict keys raises TypeError on Python 3.11+)
                rand_head = v[0]
                while rand_head == v[0]:
                    rand_head = random.randrange(self.entity_num)
                neg_data.append([rand_head, v[1], v[2]])
            else:
                # replace tail
                rand_tail = v[2]
                while rand_tail == v[2]:
                    rand_tail = random.randrange(self.entity_num)
                neg_data.append([v[0], v[1], rand_tail])
        return np.array(neg_data)
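The approach is the same as in the Python version: for every sample in the training set, one entity is drawn from the set of all entities and randomly replaces either the head or the tail. Next come the __len__ and __getitem__ methods.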
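The corruption rule can be isolated into a small standalone function. This is a sketch under stated assumptions: corrupt_triple is a hypothetical helper, entities are assumed to be indexed 0..entity_num-1, and a seeded random.Random is passed in so the result is reproducible.

```python
import random


def corrupt_triple(triple, entity_num, rng):
    """Return a copy of triple with either the head or the tail replaced.

    Assumes entity_num >= 2; with a single entity no replacement exists
    and the loops below would never terminate.
    """
    head, relation, tail = triple
    if rng.random() > 0.5:
        # Replace the head with a different entity index.
        new_head = head
        while new_head == head:
            new_head = rng.randrange(entity_num)
        return [new_head, relation, tail]
    # Otherwise replace the tail with a different entity index.
    new_tail = tail
    while new_tail == tail:
        new_tail = rng.randrange(entity_num)
    return [head, relation, new_tail]


rng = random.Random(42)
pos_triples = [[0, 0, 1], [1, 0, 2], [0, 1, 2]]
neg_triples = [corrupt_triple(t, entity_num=5, rng=rng) for t in pos_triples]
```

Like generate_neg, this sketch does not check whether the corrupted triple happens to be a true triple elsewhere in the training set.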
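The __len__ method
After the steps above, the TrainSet instance holds the loaded data in its attributes. The job of __len__ is to return the length of the dataset.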
    # override __len__
    def __len__(self):
        return self.sample_num
The __getitem__ method
This method returns a single data sample.
    # override __getitem__
    def __getitem__(self, item):
        return [self.pos_triples[item], self.neg_triples[item]]
Because training needs the negative samples as well, each training item returns the positive triple together with its negative triple.
TestSet.py
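TestSet.py handles the test data. It works much like TrainSet.py, except that it reuses the entity and relation mappings built from the training set, so I won't go into detail here.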
import numpy as np
import pandas as pd
import torch
from torch.utils.data import Dataset, DataLoader

from config import Config
from TrainSet import TrainSet


class TestSet(Dataset):
    '''
    Builds the test set.
    '''

    def __init__(self, config, train_set):
        super().__init__()
        self.config = config
        self.train_set = train_set
        self.raw_data = self.load_data()
        # reuse the index mappings built from the training set
        self.triple_list = self.convert_word_to_index(self.raw_data, self.train_set.entity_dic,
                                                      self.train_set.relation_dic)
        self.triples_num = len(self.triple_list)

    # override __len__
    def __len__(self):
        return self.triples_num

    # override __getitem__
    def __getitem__(self, item):
        return self.triple_list[item]

    def load_data(self):
        raw_data = pd.read_csv(self.config.test_fb15k, sep='\t', header=None,
                               names=['head', 'relation', 'tail'],
                               keep_default_na=False, encoding='utf-8')
        return raw_data.values

    def convert_word_to_index(self, triples, entity_dic, relation_dic):
        triple_list = [[entity_dic[triple[0]], relation_dic[triple[1]], entity_dic[triple[2]]]
                       for triple in triples]
        return np.array(triple_list)
Loading the dataset with DataLoader
With the dataset classes above in place, DataLoader makes loading the data straightforward.
config = Config()
datasets = TrainSet(config)
train_loader = DataLoader(datasets, batch_size=32, shuffle=True)
for batch_idx, data in enumerate(train_loader):
    pos, neg = data
    print(pos)
    print("-" * 20)
    print(pos[0])
    break
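Three arguments are passed to DataLoader here: dataset, the Dataset to load from; batch_size, the number of samples in each batch; and shuffle, whether to shuffle the data. Iterating with enumerate then yields one batch at a time. This concludes the data processing part.
Model construction
Building a model in PyTorch means subclassing nn.Module and defining a forward method, which specifies the model's forward computation (here, the loss for one batch). Everything else is ordinary Python.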
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as func


class TransE(nn.Module):
    def __init__(self, entity_num, relation_num, embedding_dim=50, margin=1, norm=2):
        super(TransE, self).__init__()
        # hyperparameters
        # use the GPU when one is available, otherwise fall back to the CPU
        self.device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
        self.entity_num = entity_num  # number of entities
        self.relation_num = relation_num  # number of relations
        self.dim = embedding_dim  # embedding dimension
        self.margin = torch.FloatTensor([margin]).to(self.device)  # margin, γ
        self.norm = norm  # p of the p-norm distance

        # initialize the embeddings
        # initial values for the entity embeddings
        tmp_entity_embeddings = torch.empty(entity_num, self.dim).uniform_(-6 / np.sqrt(self.dim),
                                                                          6 / np.sqrt(self.dim))
        # build entity_embeddings with from_pretrained; freeze=False keeps it trainable
        self.entity_embeddings = nn.Embedding.from_pretrained(tmp_entity_embeddings, freeze=False)
        # initial values for the relation embeddings
        tmp_relation_embeddings = torch.empty(relation_num, self.dim).uniform_(-6 / np.sqrt(self.dim),
                                                                              6 / np.sqrt(self.dim))
        self.relation_embeddings = nn.Embedding.from_pretrained(tmp_relation_embeddings, freeze=False)
        # normalize the relation embeddings: l <= l / ||l||
        relation_norm = torch.norm(self.relation_embeddings.weight.data, dim=1, keepdim=True)
        self.relation_embeddings.weight.data = self.relation_embeddings.weight.data / relation_norm

    def forward(self, pos_head, pos_relation, pos_tail, neg_head, neg_relation, neg_tail):
        '''
        Each argument is an index tensor of shape [batch size].
        forward computes the loss function of the algorithm.
        '''
        dis_pos = self.entity_embeddings(pos_head) + self.relation_embeddings(pos_relation) \
            - self.entity_embeddings(pos_tail)
        dis_neg = self.entity_embeddings(neg_head) + self.relation_embeddings(neg_relation) \
            - self.entity_embeddings(neg_tail)
        return self.calculate_loss(dis_pos, dis_neg).requires_grad_()

    def calculate_loss(self, dis_pos, dis_neg):
        '''
        Loss for one step, following the formula in the paper:
        [γ + distance(h + l, t) - distance(h' + l, t')]+
        The + means a positive value is kept and a negative value contributes 0,
        which is exactly what relu does.
        :param dis_pos: h + l - t for the correct triples
        :param dis_neg: h' + l - t' for the corrupted triples
        :return: loss for one step, summed over the batch
        '''
        dis_diff = self.margin + torch.norm(dis_pos, p=self.norm, dim=1) - torch.norm(dis_neg, p=self.norm, dim=1)
        return torch.sum(func.relu(dis_diff))

    def tail_predict(self, x, k=10):
        h = x[0]
        r = x[1]
        t = x[2]
        # hr: [batch_size, embed_size] => [batch_size, 1, embed_size] => [batch_size, N, embed_size]
        hr = self.entity_embeddings(h) + self.relation_embeddings(r)
        hr = torch.unsqueeze(hr, dim=1)
        hr = hr.expand(hr.shape[0], self.entity_num, self.dim)
        # embed_tail: [batch_size, N, embed_size]
        embed_tail = self.entity_embeddings.weight.data.expand(hr.shape[0], self.entity_num, self.dim)
        # distance to every candidate tail (smaller means more similar): [batch_size, N]
        similarity = torch.norm(hr - embed_tail, p=self.norm, dim=2)
        # indices of the k closest entities: [batch_size, k]
        values, indices = torch.topk(similarity, k, dim=1, largest=False)
        # all entities sorted by distance: [batch_size, N]
        mean_values, mean_indices = torch.topk(similarity, self.entity_num, dim=1, largest=False)
        # tail: [batch_size] => [batch_size, 1]
        tail = t.view(-1, 1)
        # number of samples whose true tail is among the k closest
        hits10 = torch.sum(torch.eq(indices, tail)).item()
        # sum over the batch of the 0-based positions of the true tails in the sorted lists
        mean_rank = torch.sum(torch.eq(mean_indices, tail).nonzero(), dim=0)[1].item()
        return hits10, mean_rank
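The model consists of four parts: initialization, the forward computation, the loss calculation, and link prediction.
The __init__ method
__init__ is the model's constructor. It initializes the hyperparameters and the embedding vectors; both steps follow the paper.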
    def __init__(self, entity_num, relation_num, embedding_dim=50, margin=1, norm=2):
        super(TransE, self).__init__()
        # hyperparameters
        # use the GPU when one is available, otherwise fall back to the CPU
        self.device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
        self.entity_num = entity_num  # number of entities
        self.relation_num = relation_num  # number of relations
        self.dim = embedding_dim  # embedding dimension
        self.margin = torch.FloatTensor([margin]).to(self.device)  # margin, γ
        self.norm = norm  # p of the p-norm distance

        # initialize the embeddings
        # initial values for the entity embeddings
        tmp_entity_embeddings = torch.empty(entity_num, self.dim).uniform_(-6 / np.sqrt(self.dim),
                                                                          6 / np.sqrt(self.dim))
        # build entity_embeddings with from_pretrained; freeze=False keeps it trainable
        self.entity_embeddings = nn.Embedding.from_pretrained(tmp_entity_embeddings, freeze=False)
        # initial values for the relation embeddings
        tmp_relation_embeddings = torch.empty(relation_num, self.dim).uniform_(-6 / np.sqrt(self.dim),
                                                                              6 / np.sqrt(self.dim))
        self.relation_embeddings = nn.Embedding.from_pretrained(tmp_relation_embeddings, freeze=False)
        # normalize the relation embeddings: l <= l / ||l||
        relation_norm = torch.norm(self.relation_embeddings.weight.data, dim=1, keepdim=True)
        self.relation_embeddings.weight.data = self.relation_embeddings.weight.data / relation_norm
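The two numeric steps in this constructor, uniform initialization in [-6/sqrt(dim), 6/sqrt(dim)] and L2 normalization of each relation vector, can be checked with a torch-free sketch. init_embeddings and l2_normalize are hypothetical helper names, and plain lists stand in for tensors.

```python
import math
import random


def init_embeddings(num, dim, rng):
    """Draw num vectors of length dim uniformly from [-6/sqrt(dim), 6/sqrt(dim)]."""
    bound = 6 / math.sqrt(dim)
    return [[rng.uniform(-bound, bound) for _ in range(dim)] for _ in range(num)]


def l2_normalize(vector):
    """Scale a vector to unit L2 norm: l <= l / ||l||."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


rng = random.Random(0)
dim = 50
raw_embeddings = init_embeddings(3, dim, rng)
relation_embeddings = [l2_normalize(v) for v in raw_embeddings]
```

After normalization every relation vector has length 1, which is what dividing by relation_norm achieves row by row in the PyTorch code.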
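The forward and calculate_loss methods
forward is the name PyTorch uses for a module's forward computation; calling the model instance, model(...), invokes it.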
    def forward(self, pos_head, pos_relation, pos_tail, neg_head, neg_relation, neg_tail):
        '''
        Each argument is an index tensor of shape [batch size].
        forward computes the loss function of the algorithm.
        '''
        dis_pos = self.entity_embeddings(pos_head) + self.relation_embeddings(pos_relation) \
            - self.entity_embeddings(pos_tail)
        dis_neg = self.entity_embeddings(neg_head) + self.relation_embeddings(neg_relation) \
            - self.entity_embeddings(neg_tail)
        return self.calculate_loss(dis_pos, dis_neg).requires_grad_()

    def calculate_loss(self, dis_pos, dis_neg):
        '''
        Loss for one step, following the formula in the paper:
        [γ + distance(h + l, t) - distance(h' + l, t')]+
        The + means a positive value is kept and a negative value contributes 0,
        which is exactly what relu does.
        :param dis_pos: h + l - t for the correct triples
        :param dis_neg: h' + l - t' for the corrupted triples
        :return: loss for one step, summed over the batch
        '''
        dis_diff = self.margin + torch.norm(dis_pos, p=self.norm, dim=1) - torch.norm(dis_neg, p=self.norm, dim=1)
        return torch.sum(func.relu(dis_diff))
The computation follows the formula in the algorithm directly, so I won't elaborate on it.
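For readers who want to see the formula [γ + d(h + l, t) - d(h' + l, t')]+ on concrete numbers, here is a minimal torch-free sketch. The two-dimensional vectors are made up, and distance and margin_loss are hypothetical helpers for a single pair of triples rather than a batch.

```python
def distance(h, r, t, p=2):
    """p-norm of h + r - t."""
    return sum(abs(hi + ri - ti) ** p for hi, ri, ti in zip(h, r, t)) ** (1 / p)


def margin_loss(pos, neg, margin=1.0, p=2):
    """Hinge on margin + d(pos) - d(neg); max(0, x) is what relu computes."""
    return max(0.0, margin + distance(*pos, p=p) - distance(*neg, p=p))


h, r, t = [0.0, 0.0], [1.0, 0.0], [1.0, 0.0]   # h + r equals t, so the distance is 0
t_far = [4.0, 4.0]                              # h + r - t_far = (-3, -4), distance 5
t_near = [1.0, 0.5]                             # h + r - t_near = (0, -0.5), distance 0.5

# The far negative already violates the triple by more than the margin: loss 0.
loss_far = margin_loss((h, r, t), (h, r, t_far))
# The near negative is within the margin of the positive: loss 1 + 0 - 0.5 = 0.5.
loss_near = margin_loss((h, r, t), (h, r, t_near))
```

Only negatives that score within the margin of the positive contribute to the loss, so a corrupted triple that is already far away produces no gradient.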
The tail_predict method
This method is used to test the model: it computes the link-prediction metrics hits10 and mean_rank.
    def tail_predict(self, x, k=10):
        h = x[0]
        r = x[1]
        t = x[2]
        # hr: [batch_size, embed_size] => [batch_size, 1, embed_size] => [batch_size, N, embed_size]
        hr = self.entity_embeddings(h) + self.relation_embeddings(r)
        hr = torch.unsqueeze(hr, dim=1)
        hr = hr.expand(hr.shape[0], self.entity_num, self.dim)
        # embed_tail: [batch_size, N, embed_size]
        embed_tail = self.entity_embeddings.weight.data.expand(hr.shape[0], self.entity_num, self.dim)
        # distance to every candidate tail (smaller means more similar): [batch_size, N]
        similarity = torch.norm(hr - embed_tail, p=self.norm, dim=2)
        # indices of the k closest entities: [batch_size, k]
        values, indices = torch.topk(similarity, k, dim=1, largest=False)
        # all entities sorted by distance: [batch_size, N]
        mean_values, mean_indices = torch.topk(similarity, self.entity_num, dim=1, largest=False)
        # tail: [batch_size] => [batch_size, 1]
        tail = t.view(-1, 1)
        # number of samples whose true tail is among the k closest
        hits10 = torch.sum(torch.eq(indices, tail)).item()
        # sum over the batch of the 0-based positions of the true tails in the sorted lists
        mean_rank = torch.sum(torch.eq(mean_indices, tail).nonzero(), dim=0)[1].item()
        return hits10, mean_rank
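The idea is as follows. For each input batch, since we want h + r ≈ t, we first compute h + r to get hr and then expand it to shape [batch_size, N, embed_size], so that the distances to all N entities can be computed in one matrix operation, which is more efficient. torch.topk with largest=False then returns the k smallest values. What is computed here is really a dissimilarity: the smaller similarity is, the more similar the two vectors are. The result is then compared with t, which is reshaped to [batch size, 1]; the positions where the true tail appears give hits10 and mean_rank. This completes the model.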
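The ranking logic can be followed on a tiny example without tensors. This is a sketch under stated assumptions: the four 2-dimensional entity vectors and the single relation vector are made up, rank_of_true_tail is a hypothetical helper, Euclidean distance is used, and ranks are 1-based here whereas the tensor code above sums 0-based positions.

```python
import math


def rank_of_true_tail(h, r, true_tail, entity_embeddings):
    """1-based rank of the true tail when all entities are sorted by distance to h + r."""
    target = [hi + ri for hi, ri in zip(h, r)]
    distances = [math.dist(target, e) for e in entity_embeddings]
    order = sorted(range(len(distances)), key=distances.__getitem__)
    return order.index(true_tail) + 1


# Made-up embeddings for four entities and one relation.
entity_embeddings = [[0.0, 0.0], [1.0, 0.0], [0.9, 0.1], [5.0, 5.0]]
relation = [1.0, 0.0]

# (head index, true tail index) pairs that share the relation above.
test_pairs = [(0, 1), (0, 3)]
ranks = [rank_of_true_tail(entity_embeddings[h], relation, t, entity_embeddings)
         for h, t in test_pairs]

k = 2
hits_at_k = sum(rank <= k for rank in ranks) / len(ranks)
mean_rank = sum(ranks) / len(ranks)
```

For head 0 the target h + r is (1, 0), so entity 1 is the closest candidate and entity 3 the farthest. The first test pair is a hit at k = 2 and the second is not.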
Model training and testing
Next comes the training module. I wrote a run.py file dedicated to training and testing the model.
import torch
from torch import optim, nn
from torch.utils.data import Dataset, DataLoader

from TrainSet import TrainSet
from TestSet import TestSet
from config import Config
from models.TransE import TransE


class ModelRunning(object):
    def __init__(self, config, epochs, batch_size=32, learning_rate=0.01, dim=50, norm=2, margin=1):
        self.config = config
        self.epochs = epochs
        self.batch_size = batch_size
        self.learning_rate = learning_rate
        self.dim = dim
        self.norm = norm
        self.margin = margin
        # choose cpu or gpu
        self.device = torch.device('cpu')
        if torch.cuda.is_available():
            self.device = torch.device('cuda')

    def train(self, train_set):
        train_loader = DataLoader(train_set, batch_size=self.batch_size, shuffle=True)
        # create model
        model = TransE(train_set.entity_num, train_set.relation_num, self.dim, self.margin, self.norm).to(self.device)
        # create optimizer
        optimizer = optim.SGD(model.parameters(), lr=self.learning_rate, momentum=0)
        # begin training
        print("Training...")
        for epoch in range(self.epochs):
            # e / ||e||
            entity_norm = torch.norm(model.entity_embeddings.weight.data, dim=1, keepdim=True)
            model.entity_embeddings.weight.data = model.entity_embeddings.weight.data / entity_norm
            # total loss
            total_loss = 0.0
            for idx, (pos, neg) in enumerate(train_loader):
                pos, neg = pos.to(self.device), neg.to(self.device)
                # pos: [batch_size, 3] => [3, batch_size]
                pos = torch.transpose(pos, 0, 1).long()
                # neg: [batch_size, 3] => [3, batch_size]
                neg = torch.transpose(neg, 0, 1).long()
                pos_head, pos_relation, pos_tail = pos[0], pos[1], pos[2]
                neg_head, neg_relation, neg_tail = neg[0], neg[1], neg[2]
                loss = model(pos_head, pos_relation, pos_tail, neg_head, neg_relation, neg_tail)
                total_loss += loss.item()
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
            print(f"epoch {epoch+1}, loss = {total_loss/len(train_loader)}")
        print("Training ended.")
        # save model (use the instance's config rather than a global)
        torch.save(model, self.config.saved_TransE)
        print("saved model.")

    def test(self, train_set):
        print("load test datasets...")
        test_datasets = TestSet(self.config, train_set)
        test_loader = DataLoader(test_datasets, batch_size=256, shuffle=True)
        print("loaded test datasets.")
        print("load model...")
        # the whole model object was pickled, so weights_only must be False on recent
        # PyTorch releases; drop the argument on old releases that do not accept it
        model = torch.load(self.config.saved_TransE, weights_only=False)
        print("model loaded.")
        print("Testing...")
        hits10, mean_rank = 0, 0
        with torch.no_grad():  # no gradients are needed for evaluation
            for idx, d in enumerate(test_loader):
                d = d.to(self.device)
                d = torch.transpose(d, 0, 1).long()
                # hits10 & mean rank
                tmp_hits10, tmp_mean_rank = model.tail_predict(d, k=10)
                hits10 += tmp_hits10
                mean_rank += tmp_mean_rank
        print(f"length of test data: {len(test_datasets)}")
        print(f"hits10: {hits10}")
        print(f"mean rank: {mean_rank}")
        print(f"hits10: {hits10 / len(test_datasets)}, mean rank: {mean_rank / len(test_datasets)}")
        print("Test ended.")


if __name__ == "__main__":
    config = Config()
    # load train data
    print("load train data...")
    train_set = TrainSet(config)
    print("loaded train data.")
    app = ModelRunning(config, epochs=100)
    app.train(train_set)
    app.test(train_set)
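The ModelRunning class in this module handles both training and testing through its two main methods, train and test. Its __init__ method only stores the settings, so I won't describe it further.
The train method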
    def train(self, train_set):
        train_loader = DataLoader(train_set, batch_size=self.batch_size, shuffle=True)
        # create model
        model = TransE(train_set.entity_num, train_set.relation_num, self.dim, self.margin, self.norm).to(self.device)
        # create optimizer
        optimizer = optim.SGD(model.parameters(), lr=self.learning_rate, momentum=0)
        # begin training
        print("Training...")
        for epoch in range(self.epochs):
            # e / ||e||
            entity_norm = torch.norm(model.entity_embeddings.weight.data, dim=1, keepdim=True)
            model.entity_embeddings.weight.data = model.entity_embeddings.weight.data / entity_norm
            # total loss
            total_loss = 0.0
            for idx, (pos, neg) in enumerate(train_loader):
                pos, neg = pos.to(self.device), neg.to(self.device)
                # pos: [batch_size, 3] => [3, batch_size]
                pos = torch.transpose(pos, 0, 1).long()
                # neg: [batch_size, 3] => [3, batch_size]
                neg = torch.transpose(neg, 0, 1).long()
                pos_head, pos_relation, pos_tail = pos[0], pos[1], pos[2]
                neg_head, neg_relation, neg_tail = neg[0], neg[1], neg[2]
                # forward is invoked implicitly by model(...)
                loss = model(pos_head, pos_relation, pos_tail, neg_head, neg_relation, neg_tail)
                total_loss += loss.item()
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
            print(f"epoch {epoch+1}, loss = {total_loss/len(train_loader)}")
        print("Training ended.")
        # save model (use the instance's config rather than a global)
        torch.save(model, self.config.saved_TransE)
        print("saved model.")
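The flow is quite mechanical. After DataLoader wraps the data, the model and the optimizer are created (SGD here, as in the paper). The model is then trained for the configured number of epochs, and each epoch iterates over mini-batches of batch_size samples. As in the algorithm, the entity embeddings are L2-normalized before the updates; in this code that happens once at the start of every epoch. The model is then called on each batch; note that forward does not need to be called explicitly, since model(...) dispatches to it. For every batch the loss is added to the epoch total, the old gradients are cleared, the loss is backpropagated, and the optimizer applies the update. Once training is finished, the model is saved.
The test method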
    def test(self, train_set):
        print("load test datasets...")
        test_datasets = TestSet(self.config, train_set)
        test_loader = DataLoader(test_datasets, batch_size=256, shuffle=True)
        print("loaded test datasets.")
        print("load model...")
        # the whole model object was pickled, so weights_only must be False on recent
        # PyTorch releases; drop the argument on old releases that do not accept it
        model = torch.load(self.config.saved_TransE, weights_only=False)
        print("model loaded.")
        print("Testing...")
        hits10, mean_rank = 0, 0
        with torch.no_grad():  # no gradients are needed for evaluation
            for idx, d in enumerate(test_loader):
                d = d.to(self.device)
                d = torch.transpose(d, 0, 1).long()
                # hits10 & mean rank
                tmp_hits10, tmp_mean_rank = model.tail_predict(d, k=10)
                hits10 += tmp_hits10
                mean_rank += tmp_mean_rank
        print(f"length of test data: {len(test_datasets)}")
        print(f"hits10: {hits10}")
        print(f"mean rank: {mean_rank}")
        print(f"hits10: {hits10 / len(test_datasets)}, mean rank: {mean_rank / len(test_datasets)}")
        print("Test ended.")
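The test method differs from train in that tail_predict is called explicitly: unlike forward, it is not a method that nn.Module invokes for us. Testing also needs only a single pass over the data, with no epoch loop.
Experimental results
With epoch set to 50, margin=1, learning_rate=0.01, and dim=50, the model reaches 34.48% on hits10 and 322 on mean_rank, which is fairly close to the results reported in the paper.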
ours:
TransE:
Overall, this reproduction was fairly successful.