PyTorch Dataset data-loading pipeline: implementing the TransE model in PyTorch

I previously implemented TransE in plain Python, but that version took a long time to train and test, so I rewrote it in PyTorch, which also served as a way to learn PyTorch. As before, I will walk through the complete model pipeline. First, the algorithm as given in the paper:

[Figure: the TransE training algorithm (pseudocode) from the paper]

Data processing

PyTorch ships with two data-handling modules, Dataset and DataLoader, which make this part very convenient. Dataset is used to build the dataset itself, for example reading the data from disk, while DataLoader is the next stage on top of a Dataset: during training we usually split the data into batches, and that is exactly what DataLoader does. Of the two, Dataset is where most of the work happens.
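Before diving into the project code, here is a minimal sketch of the Dataset/DataLoader contract (the names are illustrative and not part of this project): a custom dataset only needs __len__ and __getitem__, and DataLoader then handles batching and shuffling.

from torch.utils.data import Dataset, DataLoader

class ToyDataset(Dataset):
    '''A hypothetical dataset wrapping any indexable container.'''
    def __init__(self, samples):
        self.samples = samples

    def __len__(self):
        return len(self.samples)      # total number of samples

    def __getitem__(self, idx):
        return self.samples[idx]      # return one sample by index

loader = DataLoader(ToyDataset(list(range(10))), batch_size=4, shuffle=True)
for batch in loader:
    print(batch)                      # LongTensors with up to 4 elements each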

The Config class

For convenience, I first wrote a Config.py file that holds the file paths used throughout the project.

class Config(object):
    def __init__(self):
        super().__init__()
        self.train_fb15k = "./datasets/fb15k/train.txt"                            # training set path
        self.test_fb15k = "./datasets/fb15k/test.txt"                              # test set path
        self.valid_fb15k = "./datasets/fb15k/valid.txt"                            # validation set path
        self.entity2id_train_file = "./datasets/fb15k/entity2id_train.txt"         # training-set entity-to-index mapping
        self.relation2id_train_file = "./datasets/fb15k/relation2id_train.txt"     # training-set relation-to-index mapping
        self.entity2id_test_file = "./datasets/fb15k/entity2id_test.txt"           # test-set entity-to-index mapping
        self.relation2id_test_file = "./datasets/fb15k/relation2id_test.txt"       # test-set relation-to-index mapping
        self.entity2id_valid_file = "./datasets/fb15k/entity2id_valid.txt"         # validation-set entity-to-index mapping
        self.relation2id_valid_file = "./datasets/fb15k/relation2id_valid.txt"     # validation-set relation-to-index mapping
        self.entity_50dim_batch400 = "./datasets/fb15k/entity_50dim_batch400"      # 50-dim entity embeddings trained with batch 400
        self.relation_50dim_batch400 = "./datasets/fb15k/relation_50dim_batch400"  # 50-dim relation embeddings trained with batch 400
        self.saved_TransE = "./saved_models/TransE.pkl"                            # path for the saved TransE model

The torch.utils.data.Dataset module

The Dataset class lives in torch.utils.data. To build our own dataset we subclass Dataset and override the __len__ and __getitem__ methods. Taking the training set as an example, I created a TrainSet.py file dedicated to processing the training data; the dataset used here is still FB15k.

TrainSet.py
import numpy as np
import pandas as pd
from torch.utils.data import Dataset, DataLoader
from config import Config
import random
from collections import Counter


class TrainSet(Dataset):
    '''
    Training-set construction class.
    '''
    def __init__(self, config):
        super(TrainSet, self).__init__()
        # configuration
        self.config = config
        # load the data
        self.entity_dic, self.relation_dic, self.pos_triples = self.load_data()
        # total number of samples
        self.sample_num = len(self.pos_triples)
        # number of entities
        self.entity_num = len(self.entity_dic)
        # number of relation types
        self.relation_num = len(self.relation_dic)
        # build the negative samples
        self.neg_triples = self.generate_neg()
        print(f"TrainSet: {self.entity_num} entities, {self.relation_num} relations, {self.sample_num} triples.")

    # override __len__
    def __len__(self):
        return self.sample_num

    # override __getitem__
    def __getitem__(self, item):
        return [self.pos_triples[item], self.neg_triples[item]]

    def load_data(self):
        '''
        author: Chengyu Lin 2020/9/20 10:38
        description: load the data; return entity-to-index, relation-to-index and the positive triples
        return: entity_dic, relation_dic (dict), triples (np.ndarray)
        '''
        # read raw data
        raw_data = pd.read_csv(self.config.train_fb15k, sep='\t', header=None, names=['head', 'relation', 'tail'],
                               keep_default_na=False, encoding='utf-8')
        raw_data = raw_data.applymap(lambda x: x.strip())
        # count heads, relations and tails
        head_count = Counter(raw_data['head'])
        relation_count = Counter(raw_data['relation'])
        tail_count = Counter(raw_data['tail'])
        entity_list = list((head_count + tail_count).keys())
        relation_list = list(relation_count.keys())
        # build the index dictionaries
        entity_dic = dict([(entity, idx) for idx, entity in enumerate(entity_list)])
        relation_dic = dict([(relation, idx) for idx, relation in enumerate(relation_list)])
        # convert the triples to index form
        triples = self.convert_triple_to_index(raw_data.values, relation_dic, entity_dic)
        return entity_dic, relation_dic, triples

    def convert_triple_to_index(self, triples, relation_dic, entity_dic):
        '''
        author: Chengyu Lin 2020/9/20 10:38
        description: convert each triple to index form, e.g. [head, relation, tail] => [0, 0, 0]
        param triples: raw triples such as [/m/01qscs, /award/award_nominee/award_nominations./award/award_nomination/award, /m/02x8n1n]
        param relation_dic: relation-to-index dict, e.g. {relation: 0}
        param entity_dic: entity-to-index dict, e.g. {entity: 0}
        return: triple_set, np.ndarray of index triples
        '''
        triple_set = np.array([[entity_dic[triple[0]], relation_dic[triple[1]], entity_dic[triple[2]]]
                               for triple in triples])
        return triple_set

    def generate_neg(self):
        '''
        author: Chengyu Lin 2020/9/20 10:38
        description: generate the negative (corrupted) samples, e.g. [head, relation, tail] => [0, 0, 0]
        return: np.ndarray of corrupted index triples
        '''
        neg_data = []
        for idx, v in enumerate(self.pos_triples):
            seed = random.random()  # decides whether to replace the head entity or the tail entity
            if seed > 0.5:
                # replace head
                rand_head = v[0]
                while rand_head == v[0]:
                    head_ = random.sample(list(self.entity_dic.keys()), 1)[0]  # head name, selected by random.sample
                    rand_head = self.entity_dic[head_]
                neg_data.append([rand_head, v[1], v[2]])
            else:
                # replace tail
                rand_tail = v[2]
                while rand_tail == v[2]:
                    tail_ = random.sample(list(self.entity_dic.keys()), 1)[0]  # tail name, selected by random.sample
                    rand_tail = self.entity_dic[tail_]
                neg_data.append([v[0], v[1], rand_tail])
        return np.array(neg_data)

TrainSet.py contains a single TrainSet class, which inherits from torch.utils.data.Dataset. For the class to be usable by DataLoader it must implement the __len__ and __getitem__ methods mentioned above, which return the size of the dataset and fetch a single sample. To give these two methods the attributes they need, the class also has to load the data, so I defined three further methods: load_data, convert_triple_to_index and generate_neg.

The load_data method

    def load_data(self):
        '''
        author: Chengyu Lin 2020/9/20 10:38
        description: load the data; return entity-to-index, relation-to-index and the positive triples
        return: entity_dic, relation_dic (dict), triples (np.ndarray)
        '''
        # read raw data
        raw_data = pd.read_csv(self.config.train_fb15k, sep='\t', header=None, names=['head', 'relation', 'tail'],
                               keep_default_na=False, encoding='utf-8')
        raw_data = raw_data.applymap(lambda x: x.strip())
        # count heads, relations and tails
        head_count = Counter(raw_data['head'])
        relation_count = Counter(raw_data['relation'])
        tail_count = Counter(raw_data['tail'])
        entity_list = list((head_count + tail_count).keys())
        relation_list = list(relation_count.keys())
        # build the index dictionaries
        entity_dic = dict([(entity, idx) for idx, entity in enumerate(entity_list)])
        relation_dic = dict([(relation, idx) for idx, relation in enumerate(relation_list)])
        # convert the triples to index form
        triples = self.convert_triple_to_index(raw_data.values, relation_dic, entity_dic)
        return entity_dic, relation_dic, triples

This method mainly relies on the pandas and collections packages. First, pandas.read_csv reads the training file; pandas provides a very convenient interface here and handles the separator for us. Then applymap applies a lambda to every cell of raw_data to strip leading and trailing whitespace and escape characters. The FB15k data is in an RDF-style format and typically looks like this:

[Figure: a sample line from the FB15k training file]

Each line is tab-separated (sep='\t') and follows the format head entity, relation, tail entity, so raw_data looks like this after loading:

[Figure: the first rows of raw_data as loaded by pandas]

Next, the Counter class from the collections library is used to count the values. You can print the three most frequent entries of head_count with the following line:

print(f"head_count: {head_count.most_common(3)}")

The output is:

[Figure: output of head_count.most_common(3)]

As you can see, Counter counts every value passed to it while keeping each element unique, which effectively plays the role of the one-hot/index encoding that the plain-Python version built by hand. Since all entities need to be encoded, the entities from head_count and tail_count are merged into a single list, entity_list, and enumerate is then used to build the entity-to-index dictionary; the same is done for relations. Finally, every triple has to be converted to index form, which is what convert_triple_to_index does.
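Here is a tiny, self-contained illustration of this step, using made-up identifiers rather than real FB15k ones:

from collections import Counter

head_count = Counter(['/m/a', '/m/b', '/m/a'])
tail_count = Counter(['/m/b', '/m/c'])
# adding two Counters merges their keys and sums the counts
entity_list = list((head_count + tail_count).keys())
print(entity_list)                                   # ['/m/a', '/m/b', '/m/c']
entity_dic = {entity: idx for idx, entity in enumerate(entity_list)}
print(entity_dic)                                    # {'/m/a': 0, '/m/b': 1, '/m/c': 2}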

The convert_triple_to_index method

    def convert_triple_to_index(self, triples, relation_dic, entity_dic):
        '''
        author: Chengyu Lin 2020/9/20 10:38
        description: convert each triple to index form, e.g. [head, relation, tail] => [0, 0, 0]
        param triples: raw triples such as [/m/01qscs, /award/award_nominee/award_nominations./award/award_nomination/award, /m/02x8n1n]
        param relation_dic: relation-to-index dict, e.g. {relation: 0}
        param entity_dic: entity-to-index dict, e.g. {entity: 0}
        return: triple_set, np.ndarray of index triples
        '''
        triple_set = np.array([[entity_dic[triple[0]], relation_dic[triple[1]], entity_dic[triple[2]]]
                               for triple in triples])
        return triple_set

The method is partly documented in its docstring: it uses the entity-to-index and relation-to-index dictionaries built earlier to turn each triple into a triple of indices, and, for convenience in the later computations, the samples are converted to a NumPy array. At this point the index encoding of the samples is complete.

The generate_neg method: generating corrupted triples

Unlike the earlier plain-Python version, I generate the negative samples during data preprocessing; this does not affect the algorithm itself.

    def generate_neg(self):
        '''
        author: Chengyu Lin 2020/9/20 10:38
        description: generate the negative (corrupted) samples, e.g. [head, relation, tail] => [0, 0, 0]
        return: np.ndarray of corrupted index triples
        '''
        neg_data = []
        for idx, v in enumerate(self.pos_triples):
            seed = random.random()  # decides whether to replace the head entity or the tail entity
            if seed > 0.5:
                # replace head
                rand_head = v[0]
                while rand_head == v[0]:
                    head_ = random.sample(list(self.entity_dic.keys()), 1)[0]  # head name, selected by random.sample
                    rand_head = self.entity_dic[head_]
                neg_data.append([rand_head, v[1], v[2]])
            else:
                # replace tail
                rand_tail = v[2]
                while rand_tail == v[2]:
                    tail_ = random.sample(list(self.entity_dic.keys()), 1)[0]  # tail name, selected by random.sample
                    rand_tail = self.entity_dic[tail_]
                neg_data.append([v[0], v[1], rand_tail])
        return np.array(neg_data)

The approach is the same as in the plain-Python version: for each training triple, one entity is sampled from the full entity set and randomly replaces either the head or the tail. A vectorized alternative is sketched below; after that we move on to __len__ and __getitem__.
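For reference, here is a vectorized sketch of the same corruption idea (my own variant, not the project's code; note that, unlike the loop above, it does not re-draw when the random entity happens to equal the original one):

import numpy as np

def generate_neg_vectorized(pos_triples, entity_num):
    # pos_triples: np.ndarray of shape [n, 3] with columns (head, relation, tail)
    neg = pos_triples.copy()
    n = len(neg)
    corrupt_head = np.random.rand(n) > 0.5                     # per triple: corrupt head or tail?
    rand_entities = np.random.randint(0, entity_num, size=n)   # random replacement entity indices
    neg[corrupt_head, 0] = rand_entities[corrupt_head]         # replace heads
    neg[~corrupt_head, 2] = rand_entities[~corrupt_head]       # replace tails
    return neg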

The __len__ method

After the steps above, the TrainSet instance holds all the loaded data as attributes; __len__ simply returns the size of the dataset.

    # override __len__
    def __len__(self):
        return self.sample_num

The __getitem__ method

This method returns a single data sample.

    # override __getitem__
    def __getitem__(self, item):
        return [self.pos_triples[item], self.neg_triples[item]]

Because training needs the negative samples as well, the training set returns the positive and negative triples together as a pair.

TestSet.py

TestSet.py handles the test data. It closely mirrors TrainSet.py, so I will not go into detail here.

import torch
from torch.utils.data import Dataset, DataLoader
import pandas as pd
from config import Config
import numpy as np
from TrainSet import TrainSet


class TestSet(Dataset):
    '''
    Test-set construction class.
    '''
    def __init__(self, config, train_set):
        super(TestSet, self).__init__()
        self.config = config
        self.train_set = train_set
        self.raw_data = self.load_data()
        self.triple_list = self.convert_word_to_index(self.raw_data, self.train_set.entity_dic,
                                                      self.train_set.relation_dic)
        self.triples_num = len(self.triple_list)

    # override __len__
    def __len__(self):
        return self.triples_num

    # override __getitem__
    def __getitem__(self, item):
        return self.triple_list[item]

    def load_data(self):
        raw_data = pd.read_csv(self.config.test_fb15k, sep='\t', header=None,
                               names=['head', 'relation', 'tail'], keep_default_na=False,
                               encoding='utf-8')
        return raw_data.values

    def convert_word_to_index(self, triples, entity_dic, relation_dic):
        triple_list = [[entity_dic[triple[0]], relation_dic[triple[1]], entity_dic[triple[2]]]
                       for triple in triples]
        return np.array(triple_list)

Loading the dataset with DataLoader

With the dataset classes built above, DataLoader can load the data for us very conveniently.

    config = Config()
    datasets = TrainSet(config)
    train_loader = DataLoader(datasets, batch_size=32, shuffle=True)
    for batch_idx, data in enumerate(train_loader):
        pos, neg = data
        print(pos)
        print("-" * 20)
        print(pos[0])
        break

DataLoader takes three arguments here: the Dataset to wrap, batch_size for the size of each batch, and shuffle for whether to shuffle the data. After that, enumerate hands us one batch at a time. This completes the data-processing part.
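For what it's worth, the default collate function turns each [pos_triple, neg_triple] pair returned by __getitem__ into two tensors of shape [batch_size, 3], which can be verified with something like:

    for pos, neg in train_loader:
        # torch.Size([32, 3]) torch.Size([32, 3]); the last batch may be smaller
        print(pos.shape, neg.shape)
        break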

Building the model

A PyTorch model inherits from nn.Module and must implement a forward method, which defines the computation run when the model is called during training; everything else is ordinary Python.

import torch
import torch.nn as nn
import torch.nn.functional as func
import numpy as np


class TransE(nn.Module):
    def __init__(self, entity_num, relation_num, embedding_dim=50, margin=1, norm=2):
        super(TransE, self).__init__()
        # parameter initialization
        self.device = torch.device('cuda')                           # use GPU acceleration
        self.entity_num = entity_num                                 # number of entities
        self.relation_num = relation_num                             # number of relations
        self.dim = embedding_dim                                     # embedding dimension
        self.margin = torch.FloatTensor([margin]).to(self.device)    # margin γ
        self.norm = norm                                             # p-norm used as the distance
        # embedding initialization
        # initial values for the entity embeddings
        tmp_entity_embeddings = torch.empty(entity_num, self.dim).uniform_(-6 / np.sqrt(self.dim), 6 / np.sqrt(self.dim))
        # build entity_embeddings with from_pretrained
        self.entity_embeddings = nn.Embedding.from_pretrained(tmp_entity_embeddings, freeze=False)
        # initial values for the relation embeddings
        tmp_relation_embeddings = torch.empty(relation_num, self.dim).uniform_(-6 / np.sqrt(self.dim), 6 / np.sqrt(self.dim))
        self.relation_embeddings = nn.Embedding.from_pretrained(tmp_relation_embeddings, freeze=False)
        # normalize the relation embeddings: l <- l / ||l||
        relation_norm = torch.norm(self.relation_embeddings.weight.data, dim=1, keepdim=True)
        self.relation_embeddings.weight.data = self.relation_embeddings.weight.data / relation_norm

    def forward(self, pos_head, pos_relation, pos_tail, neg_head, neg_relation, neg_tail):
        '''
        pos = [3, batch size], neg = [3, batch size]
        forward computes the loss defined in the algorithm
        '''
        dis_pos = self.entity_embeddings(pos_head) + self.relation_embeddings(pos_relation) - self.entity_embeddings(pos_tail)
        dis_neg = self.entity_embeddings(neg_head) + self.relation_embeddings(neg_relation) - self.entity_embeddings(neg_tail)
        return self.calculate_loss(dis_pos, dis_neg).requires_grad_()

    def calculate_loss(self, dis_pos, dis_neg):
        '''
        One-step loss, the paper's formula [γ + distance(h + l, t) - distance(h' + l, t')]+
        where + means the term only counts when it is positive, which is exactly the relu function
        :param dis_pos: difference vector h + l - t of the correct triple
        :param dis_neg: difference vector h' + l - t' of the corrupted triple
        :return: the loss for this batch
        '''
        dis_diff = self.margin + torch.norm(dis_pos, p=self.norm, dim=1) - torch.norm(dis_neg, p=self.norm, dim=1)
        return torch.sum(func.relu(dis_diff))

    def tail_predict(self, x, k=10):
        h = x[0]
        r = x[1]
        t = x[2]
        # hr: [batch_size, embed_size] => [batch_size, 1, embed_size] => [batch_size, N, embed_size]
        hr = self.entity_embeddings(h) + self.relation_embeddings(r)
        hr = torch.unsqueeze(hr, dim=1)
        hr = hr.expand(hr.shape[0], self.entity_num, self.dim)
        # embed_tail: [batch_size, N, embed_size]
        embed_tail = self.entity_embeddings.weight.data.expand(hr.shape[0], self.entity_num, self.dim)
        # compute similarity: [batch_size, N]
        similarity = torch.norm(hr - embed_tail, dim=2)
        # indices: [batch_size, k]
        values, indices = torch.topk(similarity, k, dim=1, largest=False)
        # mean_indices: [batch_size, N]
        mean_values, mean_indices = torch.topk(similarity, self.entity_num, dim=1, largest=False)
        # tail: [batch_size] => [batch_size, 1]
        tail = t.view(-1, 1)
        # result of hits10
        hits10 = torch.sum(torch.eq(indices, tail)).item()
        # result of mean rank
        mean_rank = torch.sum(torch.eq(mean_indices, tail).nonzero(), dim=0)[1]
        return hits10, mean_rank

The model consists of an initialization stage, the forward pass, the loss computation, and link prediction.

The __init__ method

__init__ is the model constructor. It initializes the model parameters and the embedding vectors, following the paper: both entity and relation embeddings are drawn uniformly from [-6/√k, 6/√k], where k is the embedding dimension, and the relation embeddings are then L2-normalized.

    def __init__(self, entity_num, relation_num, embedding_dim=50, margin=1, norm=2):
        super(TransE, self).__init__()
        # parameter initialization
        self.device = torch.device('cuda')
        self.entity_num = entity_num                                  # number of entities
        self.relation_num = relation_num                              # number of relations
        self.dim = embedding_dim                                      # embedding dimension
        self.margin = torch.FloatTensor([margin]).to(self.device)     # margin γ
        self.norm = norm                                              # p-norm used as the distance
        # embedding initialization
        # initial values for the entity embeddings
        tmp_entity_embeddings = torch.empty(entity_num, self.dim).uniform_(-6 / np.sqrt(self.dim), 6 / np.sqrt(self.dim))
        # build entity_embeddings with from_pretrained
        self.entity_embeddings = nn.Embedding.from_pretrained(tmp_entity_embeddings, freeze=False)
        # initial values for the relation embeddings
        tmp_relation_embeddings = torch.empty(relation_num, self.dim).uniform_(-6 / np.sqrt(self.dim), 6 / np.sqrt(self.dim))
        self.relation_embeddings = nn.Embedding.from_pretrained(tmp_relation_embeddings, freeze=False)
        # normalize the relation embeddings: l <- l / ||l||
        relation_norm = torch.norm(self.relation_embeddings.weight.data, dim=1, keepdim=True)
        self.relation_embeddings.weight.data = self.relation_embeddings.weight.data / relation_norm
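As a quick sanity check (a sketch with toy sizes, not the real FB15k counts, and it assumes a CUDA device is available because the constructor hard-codes torch.device('cuda')), the relation embeddings should already have unit L2 norm right after construction, while the entity embeddings are only normalized later, at the start of each training epoch:

model = TransE(entity_num=100, relation_num=10, embedding_dim=50)       # toy sizes
relation_norms = torch.norm(model.relation_embeddings.weight.data, dim=1)
print(relation_norms)   # all values should be (numerically) 1.0
entity_norms = torch.norm(model.entity_embeddings.weight.data, dim=1)
print(entity_norms)     # generally not 1.0 yet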

The forward and calculate_loss methods

forward is the name PyTorch uses for a module's forward computation, which is what gets executed when the model is called during training.

    def forward(self, pos_head, pos_relation, pos_tail, neg_head, neg_relation, neg_tail):
        '''
        pos = [3, batch size], neg = [3, batch size]
        forward computes the loss defined in the algorithm
        '''
        dis_pos = self.entity_embeddings(pos_head) + self.relation_embeddings(pos_relation) - self.entity_embeddings(pos_tail)
        dis_neg = self.entity_embeddings(neg_head) + self.relation_embeddings(neg_relation) - self.entity_embeddings(neg_tail)
        return self.calculate_loss(dis_pos, dis_neg).requires_grad_()

    def calculate_loss(self, dis_pos, dis_neg):
        '''
        One-step loss, the paper's formula [γ + distance(h + l, t) - distance(h' + l, t')]+
        where + means the term only counts when it is positive, which is exactly the relu function
        :param dis_pos: difference vector h + l - t of the correct triple
        :param dis_neg: difference vector h' + l - t' of the corrupted triple
        :return: the loss for this batch
        '''
        dis_diff = self.margin + torch.norm(dis_pos, p=self.norm, dim=1) - torch.norm(dis_neg, p=self.norm, dim=1)
        return torch.sum(func.relu(dis_diff))

The computation follows the formula in the algorithm, so I will not repeat the derivation here.
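For completeness, the formula being implemented is the margin-based ranking loss from the TransE paper:

    L = Σ_{(h, l, t) ∈ S} Σ_{(h', l, t') ∈ S'} [ γ + d(h + l, t) − d(h' + l, t') ]+

where S is the set of correct triples, S' the corresponding corrupted triples, d the p-norm distance, γ the margin, and [x]+ = max(x, 0), which is exactly what the margin + torch.norm(...) − torch.norm(...) term followed by relu and sum computes.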

The tail_predict method

This method is used to evaluate the model: it computes hits@10 and mean rank for the link-prediction task.

    def tail_predict(self, x, k=10):
        h = x[0]
        r = x[1]
        t = x[2]
        # hr: [batch_size, embed_size] => [batch_size, 1, embed_size] => [batch_size, N, embed_size]
        hr = self.entity_embeddings(h) + self.relation_embeddings(r)
        hr = torch.unsqueeze(hr, dim=1)
        hr = hr.expand(hr.shape[0], self.entity_num, self.dim)
        # embed_tail: [batch_size, N, embed_size]
        embed_tail = self.entity_embeddings.weight.data.expand(hr.shape[0], self.entity_num, self.dim)
        # compute similarity: [batch_size, N]
        similarity = torch.norm(hr - embed_tail, dim=2)
        # indices: [batch_size, k]
        values, indices = torch.topk(similarity, k, dim=1, largest=False)
        # mean_indices: [batch_size, N]
        mean_values, mean_indices = torch.topk(similarity, self.entity_num, dim=1, largest=False)
        # tail: [batch_size] => [batch_size, 1]
        tail = t.view(-1, 1)
        # result of hits10
        hits10 = torch.sum(torch.eq(indices, tail)).item()
        # result of mean rank
        mean_rank = torch.sum(torch.eq(mean_indices, tail).nonzero(), dim=0)[1]
        return hits10, mean_rank

The idea is as follows. For each input batch we want to test h + r ≈ t, so we first compute h + r to get hr and then expand it to [batch_size, N, embed_size] so that the comparison against every candidate entity becomes a single matrix operation, which is much more efficient. torch.topk with largest=False then returns the k smallest values; what is computed here is really a dissimilarity, so the smaller similarity is, the more similar the pair. Finally the result is compared against t: tail is reshaped to [batch_size, 1], and from the index positions of the true tails we can compute both hits@10 and mean rank. That completes the model.
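Before moving on, here is a toy run of the topk/eq logic above (hypothetical numbers, with N = 4 candidate entities and k = 2):

import torch

# distances from 2 test queries (h + r) to all 4 candidate tails; smaller means more similar
similarity = torch.tensor([[0.9, 0.1, 0.5, 0.3],
                           [0.2, 0.8, 0.4, 0.6]])
values, indices = torch.topk(similarity, k=2, dim=1, largest=False)
print(indices)                                    # tensor([[1, 3], [0, 2]])
tail = torch.tensor([3, 2]).view(-1, 1)           # the true tails of the two queries
hits = torch.sum(torch.eq(indices, tail)).item()
print(hits)                                       # 2: both true tails land inside the top-2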

Model training and testing

Next comes the training module. I wrote a run.py file dedicated to model training and testing.

import torch
from torch import optim, nn
from torch.utils.data import Dataset, DataLoader
from TrainSet import TrainSet
from TestSet import TestSet
from config import Config
from models.TransE import TransE


class ModelRunning(object):
    def __init__(self, config, epochs, batch_size=32, learning_rate=0.01, dim=50, norm=2, margin=1):
        super(ModelRunning, self).__init__()
        self.config = config
        self.epochs = epochs
        self.batch_size = batch_size
        self.learning_rate = learning_rate
        self.dim = dim
        self.norm = norm
        self.margin = margin
        # run torch on the GPU if available, otherwise on the CPU
        self.device = torch.device('cpu')
        if torch.cuda.is_available():
            self.device = torch.device('cuda')

    def train(self, train_set):
        train_loader = DataLoader(train_set, batch_size=self.batch_size, shuffle=True)
        # create model
        model = TransE(train_set.entity_num, train_set.relation_num, self.dim, self.margin, self.norm).to(self.device)
        # create optimizer
        optimizer = optim.SGD(model.parameters(), lr=self.learning_rate, momentum=0)
        # begin training
        print("Training...")
        for epoch in range(self.epochs):
            # e <- e / ||e||
            entity_norm = torch.norm(model.entity_embeddings.weight.data, dim=1, keepdim=True)
            model.entity_embeddings.weight.data = model.entity_embeddings.weight.data / entity_norm

            # total loss
            total_loss = 0.0
            for idx, (pos, neg) in enumerate(train_loader):
                pos, neg = pos.to(self.device), neg.to(self.device)
                # pos: [batch_size, 3] => [3, batch_size]
                pos = torch.transpose(pos, 0, 1).long()
                # neg: [batch_size, 3] => [3, batch_size]
                neg = torch.transpose(neg, 0, 1).long()
                pos_head, pos_relation, pos_tail = pos[0], pos[1], pos[2]
                neg_head, neg_relation, neg_tail = neg[0], neg[1], neg[2]

                # no need to call forward explicitly
                loss = model(pos_head, pos_relation, pos_tail, neg_head, neg_relation, neg_tail)
                total_loss += loss.item()

                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
            print(f"epoch {epoch+1}, loss = {total_loss/train_loader.__len__()}")
        print("Training ended.")
        # save model
        torch.save(model, self.config.saved_TransE)
        print("saved model.")

    def test(self, train_set):
        print("load test datasets...")
        test_datasets = TestSet(self.config, train_set)
        test_loader = DataLoader(test_datasets, batch_size=256, shuffle=True)
        print("loaded test datasets.")
        print("load model...")
        model = torch.load(self.config.saved_TransE)
        print("model loaded.")
        print("Testing...")
        hits10, mean_rank = 0, 0
        for idx, d in enumerate(test_loader):
            d = d.to(self.device)
            d = torch.transpose(d, 0, 1).long()
            # hits10 & mean rank
            tmp_hits10, tmp_mean_rank = model.tail_predict(d, k=10)
            hits10 += tmp_hits10
            mean_rank += tmp_mean_rank
        print(f"length of test data: {test_datasets.__len__()}")
        print(f"hits10: {hits10}")
        print(f"mean rank: {mean_rank}")
        print(f"hits10: {hits10 / test_datasets.__len__()}, mean rank: {mean_rank / test_datasets.__len__()}")
        print("Test ended.")


if __name__ == "__main__":
    config = Config()
    # load train data
    print("load train data...")
    train_set = TrainSet(config)
    print("loaded train data.")
    app = ModelRunning(config, epochs=100)
    app.train(train_set)
    app.test(train_set)

The ModelRunning class in this module handles training and testing. It has two main methods, train and test; its __init__ only stores the settings, so I will not describe it further.

The train method

    def train(self, train_set):
        train_loader = DataLoader(train_set, batch_size=self.batch_size, shuffle=True)
        # create model
        model = TransE(train_set.entity_num, train_set.relation_num, self.dim, self.margin, self.norm).to(self.device)
        # create optimizer
        optimizer = optim.SGD(model.parameters(), lr=self.learning_rate, momentum=0)
        # begin training
        print("Training...")
        for epoch in range(self.epochs):
            # e <- e / ||e||
            entity_norm = torch.norm(model.entity_embeddings.weight.data, dim=1, keepdim=True)
            model.entity_embeddings.weight.data = model.entity_embeddings.weight.data / entity_norm

            # total loss
            total_loss = 0.0
            for idx, (pos, neg) in enumerate(train_loader):
                pos, neg = pos.to(self.device), neg.to(self.device)
                # pos: [batch_size, 3] => [3, batch_size]
                pos = torch.transpose(pos, 0, 1).long()
                # neg: [batch_size, 3] => [3, batch_size]
                neg = torch.transpose(neg, 0, 1).long()
                pos_head, pos_relation, pos_tail = pos[0], pos[1], pos[2]
                neg_head, neg_relation, neg_tail = neg[0], neg[1], neg[2]

                # no need to call forward explicitly
                loss = model(pos_head, pos_relation, pos_tail, neg_head, neg_relation, neg_tail)
                total_loss += loss.item()

                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
            print(f"epoch {epoch+1}, loss = {total_loss/train_loader.__len__()}")
        print("Training ended.")
        # save model
        torch.save(model, self.config.saved_TransE)
        print("saved model.")

The flow is fairly mechanical. After DataLoader provides the data, we first create the model and the optimizer (SGD, as in the paper), then iterate for the configured number of epochs, training on one batch at a time. As in the algorithm, the entity embeddings are normalized at the start of every epoch. The model is then called on each batch; note that forward does not need to be called explicitly. For every batch we accumulate the loss, zero the gradients, backpropagate, and let the optimizer update the parameters. Once training finishes, the model is saved.
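As a side note on the loss design: to my understanding, the hand-written calculate_loss above matches PyTorch's built-in nn.MarginRankingLoss with reduction='sum'; here is a small sketch of that equivalence with made-up distances:

import torch
import torch.nn as nn

criterion = nn.MarginRankingLoss(margin=1.0, reduction='sum')
dist_pos = torch.tensor([0.5, 2.0])     # d(h + l, t) for two correct triples
dist_neg = torch.tensor([1.0, 0.8])     # d(h' + l, t') for their corrupted versions
target = torch.ones_like(dist_pos)      # target = 1: the corrupted distance should be the larger one
loss = criterion(dist_neg, dist_pos, target)
print(loss)                             # max(0, 1 + 0.5 - 1.0) + max(0, 1 + 2.0 - 0.8) = 2.7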

The test method

    def test(self, train_set):
        print("load test datasets...")
        test_datasets = TestSet(self.config, train_set)
        test_loader = DataLoader(test_datasets, batch_size=256, shuffle=True)
        print("loaded test datasets.")
        print("load model...")
        model = torch.load(self.config.saved_TransE)
        print("model loaded.")
        print("Testing...")
        hits10, mean_rank = 0, 0
        for idx, d in enumerate(test_loader):
            d = d.to(self.device)
            d = torch.transpose(d, 0, 1).long()
            # hits10 & mean rank
            tmp_hits10, tmp_mean_rank = model.tail_predict(d, k=10)
            hits10 += tmp_hits10
            mean_rank += tmp_mean_rank
        print(f"length of test data: {test_datasets.__len__()}")
        print(f"hits10: {hits10}")
        print(f"mean rank: {mean_rank}")
        print(f"hits10: {hits10 / test_datasets.__len__()}, mean rank: {mean_rank / test_datasets.__len__()}")
        print("Test ended.")

The difference from train is that here tail_predict is called explicitly, since it is not one of the methods the torch framework invokes for us, and the evaluation does not have to be repeated every epoch.
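One practical suggestion of my own (not part of the original code): since no parameters are updated during testing, the evaluation loop can be wrapped in torch.no_grad() so that PyTorch does not build the autograd graph, which saves memory:

        with torch.no_grad():
            for idx, d in enumerate(test_loader):
                d = torch.transpose(d.to(self.device), 0, 1).long()
                tmp_hits10, tmp_mean_rank = model.tail_predict(d, k=10)
                hits10 += tmp_hits10
                mean_rank += tmp_mean_rank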

Results

With epochs = 50, margin = 1, learning rate = 0.01 and dim = 50, the model reaches 34.48% on hits@10 and a mean rank of 322, which is reasonably close to the numbers reported in the paper.

ours:

[Figure: our evaluation output (hits@10 and mean rank)]

TransE:

[Figure: link-prediction results reported in the TransE paper]

Overall, the reproduction was fairly successful.
