CMeKG Code Walkthrough (Learning Knowledge Graphs from Scratch, Project by Project) (Part 2)

chen_nnn

Last modified 2022-06-20 15:24:31


Category: Notes. Tags: knowledge graph, artificial intelligence, python, github

First published 2022-02-07 22:35:27

Article link: https://blog.csdn.net/chen_nnn/article/details/122814434


Continuing from the previous post:

https://blog.csdn.net/chen_nnn/article/details/122795768

The Model4po class

class Model4po(nn.Module):
    def __init__(self, num_p=config.num_p, hidden_size=768):
        super(Model4po, self).__init__()
        self.dropout = nn.Dropout(p=0.4)
        self.linear = nn.Linear(in_features=hidden_size, out_features=num_p * 2, bias=True)
        self.sigmoid = nn.Sigmoid()

    def forward(self, hidden_states, batch_subject_ids, input_mask):
        all_s = torch.zeros((hidden_states.shape[0], hidden_states.shape[1], hidden_states.shape[2]),
                            dtype=torch.float32)

        for b in range(hidden_states.shape[0]):
            s_start = batch_subject_ids[b][0]
            s_end = batch_subject_ids[b][1]
            s = hidden_states[b][s_start] + hidden_states[b][s_end]
            cue_len = torch.sum(input_mask[b])
            all_s[b, :cue_len, :] = s
        hidden_states += all_s

        output = self.sigmoid(self.linear(self.dropout(hidden_states))).pow(4)

        return output  # (batch_size, max_seq_len, num_p*2)

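The name presumably reads as "model for predicate and object" (po), with the "4" standing in for "for", just as Model4s would be the model for subjects.

init():

Much the same as Model4s earlier: it sets up a dropout layer, a linear layer and a sigmoid. The linear layer maps each hidden vector to num_p * 2 outputs, which the later code treats as one start score and one end score per predicate.

forward():

First, an all-zero float32 tensor all_s with the same shape as hidden_states is created.

batch_subject_ids appears to be the batch_subject_ids built in the IterableDataset class: it has exactly two columns, the first recording where the subject starts and the second where it ends. These two positions index into hidden_states, and the start and end vectors are added together to form a subject vector s.

hidden_states is most likely the tensor of the same name returned by the previous class (Model4s), i.e. a product of the model's computation. For each sample, the number of real tokens cue_len is read from the mask, and s is written into the first cue_len positions of all_s. all_s is then added to hidden_states element-wise (not appended to it), and the result passes through dropout, the linear layer and sigmoid. The sigmoid output is raised to the 4th power to give the final prediction.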
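The subject-feature step described above can be sketched in plain Python for a single sequence. This is only an illustration of the arithmetic under simplifying assumptions: the helper name add_subject_feature and the toy vectors are made up for this example, and the real code works on batched torch tensors.

```python
def add_subject_feature(hidden, s_start, s_end, mask):
    """Add the subject vector to every real (unmasked) token vector.

    hidden: list of token vectors (lists of floats) for one sequence.
    mask:   list of 0/1 flags, 1 for real tokens and 0 for padding.
    """
    # Subject vector = start-token vector + end-token vector.
    s = [a + b for a, b in zip(hidden[s_start], hidden[s_end])]
    cue_len = sum(mask)  # number of real tokens
    out = []
    for pos, vec in enumerate(hidden):
        if pos < cue_len:
            out.append([v + sv for v, sv in zip(vec, s)])
        else:
            out.append(list(vec))  # padding positions are left unchanged
    return out


# Toy input: 4 positions, hidden size 2, last position is padding.
hidden = [[1.0, 0.0], [0.0, 2.0], [3.0, 3.0], [0.0, 0.0]]
mask = [1, 1, 1, 0]
result = add_subject_feature(hidden, 0, 1, mask)
```

Unlike the original forward(), this sketch returns a new list instead of modifying hidden in place, which keeps the input reusable across calls.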

load_schema():

def load_schema(path):
    with open(path, 'r', encoding='utf-8', errors='replace') as f:
        data = json.load(f)
        predicate = list(data.keys())
        prediction2id = {}
        id2predicate = {}
        for i in range(len(predicate)):
            prediction2id[predicate[i]] = i
            id2predicate[i] = predicate[i]
    num_p = len(predicate)
    config.prediction2id = prediction2id
    config.id2predicate = id2predicate
    config.num_p = num_p

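The file at the given path is opened with UTF-8 encoding (undecodable bytes are replaced rather than raising an error) and parsed with json.load. The keys of the resulting dict become the predicate list, and two dictionaries are built from it: prediction2id maps each predicate to its index, and id2predicate maps each index back to its predicate (prediction2id is the source's spelling; it holds predicates). Both mappings and the predicate count num_p are stored on config. My guess, to be verified later, is that these mappings are used when extracting relations from large amounts of text; id2predicate is indeed used in extract_spoes further down.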
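As a small illustration of the two mappings, here is a sketch that builds them from a JSON string and returns them instead of writing to config. The function name build_predicate_maps and the two sample predicates are assumptions made for this example; the real schema file's contents are not shown in this post.

```python
import json


def build_predicate_maps(schema_text):
    """Build predicate<->id mappings from a JSON object whose keys are predicates."""
    data = json.loads(schema_text)
    predicates = list(data.keys())  # dict key order is preserved
    predicate2id = {p: i for i, p in enumerate(predicates)}
    id2predicate = {i: p for i, p in enumerate(predicates)}
    return predicate2id, id2predicate, len(predicates)


# Hypothetical schema content, for illustration only.
schema_text = '{"临床表现": {}, "药物治疗": {}}'
predicate2id, id2predicate, num_p = build_predicate_maps(schema_text)
```

Returning the mappings rather than mutating a global config makes the function easier to test in isolation; the original stores them on config so that the model classes can read num_p as a default argument.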

load_data():

def load_data(path):
    text_spos = []
    with open(path, 'r', encoding='utf-8', errors='replace') as f:
        data = json.load(f)
        for item in data:
            text = item['text']
            spo_list = item['spo_list']
            text_spos.append({
                'text': text,
                'spo_list': spo_list
            })
    return text_spos

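The file at the given path is opened with UTF-8 encoding and parsed with json.load. Two fields are taken from each item: the text itself, and the list of (subject, predicate, object) triples annotated for it. The two are stored as a dict, which is appended to the text_spos list; the list is the return value.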

loss_fn():

def loss_fn(pred, target):
    loss_fct = nn.BCELoss(reduction='none')
    return loss_fct(pred, target)

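nn.BCELoss computes the binary cross entropy between predictions and targets. With reduction='none', as used here, it returns one loss value per element instead of reducing over the batch. The PyTorch documentation describes it as follows.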

Creates a criterion that measures the Binary Cross Entropy between the target and the input probabilities:

The unreduced (i.e. with :attr:`reduction` set to ``'none'``) loss can be described as:

$\ell(x, y) = L = \{l_1,\dots,l_N\}^\top, \\ l_n = - w_n \left[ y_n \cdot \log x_n + (1 - y_n) \cdot \log (1 - x_n) \right],$

where $N$ is the batch size. If :attr:`reduction` is not ``'none'`` (default ``'mean'``), then:

$\ell(x, y) = \begin{cases} \operatorname{mean}(L), & \text{if reduction} = \text{`mean';}\\ \operatorname{sum}(L), & \text{if reduction} = \text{`sum'.} \end{cases}$

This is used for measuring the error of a reconstruction in for example an auto-encoder. Note that the targets $y$ should be numbers between 0 and 1.

Notice that if $x_n$ is either 0 or 1, one of the log terms would be mathematically undefined in the above loss equation. PyTorch chooses to set $\log(0) = -\infty$, since $\lim_{x \to 0} \log(x) = -\infty$. However, an infinite term in the loss equation is not desirable for several reasons.

For one, if either $y_n = 0$ or $(1 - y_n) = 0$, then we would be multiplying 0 with infinity. Secondly, if we have an infinite loss value, then we would also have an infinite term in our gradient, since $\lim_{x \to 0} \frac{d}{dx} \log(x) = \infty$. This would make BCELoss's backward method nonlinear with respect to $x_n$, and using it for things like linear regression would not be straight-forward.

Our solution is that BCELoss clamps its log function outputs to be greater than or equal to -100. This way, we can always have a finite loss value and a linear backward method.
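To make the formula and the clamping concrete, here is a hand-written version of the per-element loss in plain Python. It is a sketch that mirrors the documented behaviour (log outputs clamped to at least -100), not PyTorch's implementation; the names clamped_log and bce are chosen for this example.

```python
import math


def clamped_log(v, floor=-100.0):
    """log(v), clamped from below at `floor`; log(0) is treated as `floor`."""
    if v <= 0.0:
        return floor
    return max(math.log(v), floor)


def bce(x, y):
    """Binary cross entropy for one prediction x in [0, 1] and target y in [0, 1]."""
    return -(y * clamped_log(x) + (1.0 - y) * clamped_log(1.0 - x))


half = bce(0.5, 1.0)       # -log(0.5)
worst = bce(0.0, 1.0)      # finite thanks to the clamp
perfect = bce(1.0, 1.0)    # no loss for a perfect prediction
```

With the clamp, a completely wrong prediction costs 100 rather than infinity, which keeps both the loss and its gradient finite.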

For an introduction to cross entropy itself, see the blog post 一文搞懂交叉熵在机器学习中的使用，透彻理解交叉熵背后的直觉 on 史丹利复合田's CSDN blog: https://blog.csdn.net/tsyccnh/article/details/79163834

train():

def train(train_data_loader, model4s, model4po, optimizer):
    for epoch in range(config.EPOCH):
        begin_time = time.time()
        model4s.train()
        model4po.train()
        train_loss = 0.
        for bi, batch in enumerate(train_data_loader):
            if bi >= len(train_data_loader) // config.batch_size:
                break
            batch_token_ids, batch_mask_ids, batch_segment_ids, batch_subject_labels, batch_subject_ids, batch_object_labels = batch
            batch_token_ids = torch.tensor(batch_token_ids, dtype=torch.long)
            batch_mask_ids = torch.tensor(batch_mask_ids, dtype=torch.long)
            batch_segment_ids = torch.tensor(batch_segment_ids, dtype=torch.long)
            batch_subject_labels = torch.tensor(batch_subject_labels, dtype=torch.float)
            batch_object_labels = torch.tensor(batch_object_labels, dtype=torch.float).view(config.batch_size,
                                                                                            config.max_seq_len,
                                                                                            config.num_p * 2)
            batch_subject_ids = torch.tensor(batch_subject_ids, dtype=torch.int)

            batch_subject_labels_pred, hidden_states = model4s(batch_token_ids, batch_mask_ids, batch_segment_ids)
            loss4s = loss_fn(batch_subject_labels_pred, batch_subject_labels.to(torch.float32))
            loss4s = torch.mean(loss4s, dim=2, keepdim=False) * batch_mask_ids
            loss4s = torch.sum(loss4s)
            loss4s = loss4s / torch.sum(batch_mask_ids)

            batch_object_labels_pred = model4po(hidden_states, batch_subject_ids, batch_mask_ids)
            loss4po = loss_fn(batch_object_labels_pred, batch_object_labels.to(torch.float32))
            loss4po = torch.mean(loss4po, dim=2, keepdim=False) * batch_mask_ids
            loss4po = torch.sum(loss4po)
            loss4po = loss4po / torch.sum(batch_mask_ids)

            loss = loss4s + loss4po
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            train_loss += float(loss.item())
            print('batch:', bi, 'loss:', float(loss.item()))

        print('final train_loss:', train_loss / len(train_data_loader) * config.batch_size, 'cost time:',
              time.time() - begin_time)

    del train_data_loader
    gc.collect();

    return {
        "model4s_state_dict": model4s.state_dict(),
        "model4po_state_dict": model4po.state_dict(),
        "optimizer_state_dict": optimizer.state_dict(),
    }

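First, the parameters. Judging from the code that follows, train_data_loader is the batch sequence produced by the IterableDataset class; model4s and model4po are the models initialized earlier, and optimizer is the optimizer whose step() is called below.

Each epoch starts by recording the time and calling train() on model4s and model4po. If a model contains BN (Batch Normalization) layers or Dropout, model.train() should be called before training. In training mode, BN layers normalize with the mean and variance of the current batch, and Dropout randomly zeroes a fraction of the activations. Below is the definition of train() in nn.Module.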

def train(self: T, mode: bool = True) -> T:
    r"""Sets the module in training mode.

    This has any effect only on certain modules. See documentations of
    particular modules for details of their behaviors in training/evaluation
    mode, if they are affected, e.g. :class:`Dropout`, :class:`BatchNorm`,
    etc.

    Args:
        mode (bool): whether to set training mode (``True``) or evaluation
                     mode (``False``). Default: ``True``.

    Returns:
        Module: self
    """
    if not isinstance(mode, bool):
        raise ValueError("training mode is expected to be boolean")
    self.training = mode
    for module in self.children():
        module.train(mode)
    return self

enumerate yields (index, item) pairs from train_data_loader, with the index starting at 0. bi receives the index and batch receives the data, which is unpacked into six variables that are then each converted to a tensor. torch.tensor is an ordinary Python function that copies the data it is given (rather than referencing it). Without a dtype argument it infers the tensor type from the input; here dtype is passed explicitly (torch.long, torch.float, torch.int). See 【PyTorch】Tensor和tensor的区别 on 玄云飘风's CSDN blog: https://blog.csdn.net/tfcy694/article/details/85338745

Tensor.view returns a tensor with the same data and the same number of elements but a different shape; the tensor must be contiguous to be viewed. Here it reshapes batch_object_labels to (batch_size, max_seq_len, num_p * 2). See torch.tensor.view(*args) on danerer's CSDN blog: https://blog.csdn.net/danerer/article/details/82908205

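With the inputs prepared, the forward pass begins. model4s returns two values: the predicted subject labels and hidden_states. The next four lines compute loss4s: loss_fn gives an element-wise binary cross entropy, which is averaged over the last dimension, multiplied by the mask so that padding positions contribute nothing, summed, and divided by the number of unmasked tokens. loss4po is computed the same way from the output of model4po, and the two are added to give the total loss.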
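The masking and normalization can be sketched in plain Python. This assumes the per-token losses have already been averaged over the label dimension; the function name masked_mean_loss and the toy numbers are made up for this example.

```python
def masked_mean_loss(token_losses, mask):
    """Average the per-token losses over real tokens only.

    token_losses[b][t]: loss of token t in sample b.
    mask[b][t]:         1 for real tokens, 0 for padding.
    """
    total = 0.0
    count = 0
    for loss_row, mask_row in zip(token_losses, mask):
        for loss, m in zip(loss_row, mask_row):
            total += loss * m  # padding positions contribute 0
            count += m
    return total / count


# The large values sit on padding positions and must not affect the result.
token_losses = [[0.2, 0.4, 9.0], [0.6, 9.0, 9.0]]
mask = [[1, 1, 0], [1, 0, 0]]
value = masked_mean_loss(token_losses, mask)
```

Dividing by the number of real tokens rather than by batch_size * max_seq_len keeps the loss scale independent of how much padding a batch happens to contain.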

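The next three calls perform one step of gradient descent: each batch yields one gradient computation and one parameter update. Because backward() accumulates gradients rather than overwriting them, and gradients from different mini-batches should not be mixed, they must be reset to zero at the start of every mini-batch, much like initializing an accumulator variable to 0.

optimizer.zero_grad(): clear the gradients from the previous step

loss.backward(): backpropagate to compute the current gradients

optimizer.step(): update the network parameters from the gradients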
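Why the gradients need zeroing is easy to see with a toy scalar parameter whose backward() accumulates, as PyTorch's does. ToyParam is a made-up class for this illustration and has nothing to do with the PyTorch API beyond imitating the three method names.

```python
class ToyParam:
    """A single scalar weight w with loss (w * x - y) ** 2."""

    def __init__(self, value):
        self.value = value
        self.grad = 0.0

    def zero_grad(self):
        self.grad = 0.0

    def backward(self, x, y):
        # d/dw (w*x - y)^2 = 2*x*(w*x - y); added to the stored gradient.
        self.grad += 2.0 * x * (self.value * x - y)

    def step(self, lr):
        self.value -= lr * self.grad


w = ToyParam(0.0)

# Two backward passes without zeroing: the gradients pile up.
w.backward(1.0, 2.0)
w.backward(1.0, 2.0)
accumulated = w.grad

# The usual pattern: zero, backward, step.
w.zero_grad()
w.backward(1.0, 2.0)
single = w.grad
w.step(lr=0.25)
```

Without the zero_grad() call, the update would have used a gradient twice as large as the one for the current batch.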

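The loss of each batch is printed, followed by the final training loss and the elapsed time for the epoch. train_data_loader is then deleted to free memory; gc.collect() reclaims objects that are no longer in use and also returns the number of objects it collected (the return value is ignored here).

Finally, the function returns the state_dict of both models (their weights and biases) together with the optimizer's state_dict, stored as key-value pairs in a dict.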

extract_spoes():

def extract_spoes(text, model4s, model4po):
    """
    return: a list of many tuple of (s, p, o)
    """
    # Process the text
    with torch.no_grad():
        tokenizer = config.tokenizer
        max_seq_len = config.max_seq_len
        token_ids = torch.tensor(
            tokenizer.encode(text, max_length=max_seq_len, pad_to_max_length=True, add_special_tokens=True)).view(1, -1)
        if len(text) > max_seq_len - 2:
            text = text[:max_seq_len - 2]
        mask_ids = torch.tensor([1] * (len(text) + 2) + [0] * (max_seq_len - len(text) - 2)).view(1, -1)
        segment_ids = torch.tensor([0] * max_seq_len).view(1, -1)
        subject_labels_pred, hidden_states = model4s(token_ids, mask_ids, segment_ids)
        subject_labels_pred = subject_labels_pred.cpu()
        subject_labels_pred[0, len(text) + 2:, :] = 0
        start = np.where(subject_labels_pred[0, :, 0] > 0.4)[0]
        end = np.where(subject_labels_pred[0, :, 1] > 0.4)[0]

        subjects = []
        for i in start:
            j = end[end >= i]
            if len(j) > 0:
                j = j[0]
                subjects.append((i, j))

        if len(subjects) == 0:
            return []
        subject_ids = torch.tensor(subjects).view(1, -1)

        spoes = []
        for s in subjects:
            object_labels_pred = model4po(hidden_states, subject_ids, mask_ids)
            object_labels_pred = object_labels_pred.view((1, max_seq_len, config.num_p, 2)).cpu()
            object_labels_pred[0, len(text) + 2:, :, :] = 0
            start = np.where(object_labels_pred[0, :, :, 0] > 0.4)
            end = np.where(object_labels_pred[0, :, :, 1] > 0.4)

            for _start, predicate1 in zip(*start):
                for _end, predicate2 in zip(*end):
                    if _start <= _end and predicate1 == predicate2:
                        spoes.append((s, predicate1, (_start, _end)))
                        break

    id_str = ['[CLS]']
    i = 1
    index = 0
    while i < token_ids.shape[1]:
        if token_ids[0][i] == 102:
            break

        word = tokenizer.decode(token_ids[0, i:i + 1])
        word = re.sub('#+', '', word)
        if word != '[UNK]':
            id_str.append(word)
            index += len(word)
            i += 1
        else:
            j = i + 1
            while j < token_ids.shape[1]:
                if token_ids[0][j] == 102:
                    break
                word_j = tokenizer.decode(token_ids[0, j:j + 1])
                if word_j != '[UNK]':
                    break
                j += 1
            if token_ids[0][j] == 102 or j == token_ids.shape[1]:
                while i < j - 1:
                    id_str.append('')
                    i += 1
                id_str.append(text[index:])
                i += 1
                break
            else:
                index_end = text[index:].find(word_j)
                word = text[index:index + index_end]
                id_str.append(word)
                index += index_end
                i += 1
    res = []
    for s, p, o in spoes:
        s_start = s[0]
        s_end = s[1]
        sub = ''.join(id_str[s_start:s_end + 1])
        o_start = o[0]
        o_end = o[1]
        obj = ''.join(id_str[o_start:o_end + 1])
        res.append((sub, config.id2predicate[p], obj))

    return res

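As its docstring says, this function processes a piece of text and returns a list of (s, p, o) triples.

with torch.no_grad() is a context manager; operations wrapped by it are not tracked for gradients. Any computation that should not be tracked can be wrapped this way: it still runs, but it is not recorded for backpropagation. For details see with torch.no_grad() 详解 on 岛's CSDN blog: https://blog.csdn.net/weixin_46559271/article/details/105658654

Combining torch.tensor(tokenizer.encode()) from the previous article with the view method described above, token_ids is a tensor with a single row and max_seq_len columns (the encoding is padded to max_seq_len, not len(text)), holding the token ids of the text. If the text is longer than max_seq_len - 2 it is truncated to that length; the two reserved positions are for the start and end markers. Two more single-row tensors of width max_seq_len are then created: mask_ids has 1 in the first len(text) + 2 positions and 0 in the padding (which implicitly assumes one token per character), and segment_ids is all zeros. These three tensors are passed to model4s, which returns two values. .cpu() moves the data from another device (e.g. a CUDA device) to the CPU without changing its type; the result is still a Tensor. Predictions at positions from len(text) + 2 onward are then set to 0, and numpy.where is used to find the positions whose score exceeds the 0.4 threshold.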
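The mask construction can be reproduced with plain lists. This sketch assumes, like the original code, that every character becomes exactly one token; build_mask is a name chosen for this example.

```python
def build_mask(text, max_seq_len):
    """1 for [CLS] + characters + [SEP], 0 for padding up to max_seq_len."""
    text = text[:max_seq_len - 2]  # leave room for the two special tokens
    n_real = len(text) + 2
    return [1] * n_real + [0] * (max_seq_len - n_real)


short_mask = build_mask("糖尿病", 8)   # 3 characters + 2 special tokens
long_mask = build_mask("a" * 20, 8)    # truncated to 6 characters
```

For text containing tokens that span several characters (digits or Latin words, for instance), the assumption no longer holds exactly and the mask length can differ from the true token count.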

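When numpy.where() is called with only a condition (no x and y), it returns the coordinates of the elements that satisfy the condition, i.e. the non-zero ones (equivalent to numpy.nonzero). The coordinates are returned as a tuple that contains as many arrays as the input has dimensions, one array of coordinates per dimension. (Source: numpy.where() 用法详解 - massquantity - 博客园. The coordinates in the output below are read column-wise.)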

>>> a = np.arange(27).reshape(3,3,3)
>>> a
array([[[ 0,  1,  2],
        [ 3,  4,  5],
        [ 6,  7,  8]],

       [[ 9, 10, 11],
        [12, 13, 14],
        [15, 16, 17]],

       [[18, 19, 20],
        [21, 22, 23],
        [24, 25, 26]]])

>>> np.where(a > 5)
(array([0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2]),
 array([2, 2, 2, 0, 0, 0, 1, 1, 1, 2, 2, 2, 0, 0, 0, 1, 1, 1, 2, 2, 2]),
 array([0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2]))

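In this code the call is more constrained. The condition is applied to subject_labels_pred[0, :, 0], i.e. the first sample in the batch, all positions, column 0 (the start score). That slice is one-dimensional, so the returned tuple holds a single array of positions, and the trailing [0] extracts that array. start therefore lists the positions whose start score exceeds 0.4, and end does the same for column 1. For each start position, the first end position that is not before it is chosen, and the (start, end) pair is saved as a subject span. If no span is found, an empty list is returned; otherwise the spans are converted to a tensor and reshaped.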
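The pairing rule for subject spans can be sketched without NumPy. The sketch assumes the end positions are sorted in ascending order, which is what numpy.where returns; pair_spans is a name chosen for this example.

```python
def pair_spans(starts, ends):
    """Pair each start with the first end position that is >= the start."""
    spans = []
    for i in starts:
        candidates = [j for j in ends if j >= i]
        if candidates:
            spans.append((i, candidates[0]))
    return spans


# Start 9 has no end at or after it, so it is dropped.
spans = pair_spans([1, 5, 9], [2, 3, 7])
```

Note that an end position can be shared by several starts, and that unmatched ends (3 in this example) are simply ignored.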

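Then, for each tuple in subjects, the loop calls model4po with the hidden_states returned by model4s, the subject_ids just computed and mask_ids. This is inference, not training. The output is reshaped into a four-dimensional tensor of shape (1, max_seq_len, config.num_p, 2), that is 1 sample, max_seq_len (256) positions, config.num_p (23) predicates and 2 columns for start and end, and everything beyond the length of the text is set to 0. numpy.where then finds the start and end candidates; each result holds two arrays, the position indices and the predicate indices, and zip(*...) pairs them up into (position, predicate) tuples. One thing to note: subject_ids holds all subjects flattened into a single row, while Model4po.forward reads only its first two entries, so as written every iteration appears to use the span of the first subject. The documentation for zip reads: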

Make an iterator that aggregates elements from each of the iterables.
Returns an iterator of tuples, where the i-th tuple contains the i-th element from each of the argument sequences or iterables. The iterator stops when the shortest input iterable is exhausted. With a single iterable argument, it returns an iterator of 1-tuples. With no arguments, it returns an empty iterator. Equivalent to:
def zip(*iterables):
    # zip('ABCD', 'xy') --> Ax By
    sentinel = object()
    iterators = [iter(it) for it in iterables]
    while iterators:
        result = []
        for it in iterators:
            elem = next(it, sentinel)
            if elem is sentinel:
                return
            result.append(elem)
        yield tuple(result)

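The check passes when the end position is not before the start position and both candidates carry the same predicate. In that case s (the subject span), predicate1 (the predicate index) and (_start, _end) (the object span) are appended to spoes, and the inner loop breaks, so each start is matched with the first valid end only.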
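The object pairing can be sketched on plain lists of (position, predicate) tuples, which is the form zip(*start) and zip(*end) produce. pair_objects and the sample indices are made up for this example.

```python
def pair_objects(starts, ends):
    """Match each (start, predicate) with the first (end, predicate) that has
    the same predicate and an end position not before the start."""
    objects = []
    for s_pos, s_pred in starts:
        for e_pos, e_pred in ends:
            if s_pos <= e_pos and s_pred == e_pred:
                objects.append((s_pred, (s_pos, e_pos)))
                break  # keep only the first valid end for this start
    return objects


starts = [(3, 0), (6, 2)]
ends = [(4, 0), (5, 2), (8, 2)]
objects = pair_objects(starts, ends)
```

The end (5, 2) has the right predicate for the second start but lies before it, so the search continues to (8, 2).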

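id_str starts with '[CLS]' so that its indices line up with the token positions. Since token_ids was reshaped to a single row, token_ids.shape[1] is its number of columns. 102 is the id of the [SEP] token in the standard BERT vocabularies, so the check against 102 is presumably the end-of-sentence test: the loop stops when it reaches [SEP]. Otherwise the token at position i is decoded into a word, and re.sub('#+', '', word) removes the '#' characters from it (they are replaced with the empty string, not a space). If the word is not '[UNK]', it is appended to id_str, and index, the current offset into text, advances by its length. If the word is '[UNK]', j is set to i + 1 and an inner while loop decodes one token at a time until it meets a token that is not '[UNK]', reaches [SEP], or runs off the end of the row. If the loop ended at [SEP] or at the end of the row, empty strings are appended for all but the last of the unknown positions, the rest of text from index onward is appended as a single element, and the outer loop ends. If a known word_j was found, its first occurrence in text after index is located, everything from index up to that occurrence is appended to id_str as the text of the unknown token, and index moves to the start of word_j.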
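The idea of recovering the text behind '[UNK]' tokens can be shown with a simplified alignment function. This is a sketch, not the original algorithm: it works on already-decoded token strings, assumes each known token appears verbatim in the text, and treats a run of consecutive unknown tokens slightly differently from the code above. align_tokens is a name chosen for this example.

```python
def align_tokens(tokens, text):
    """Return one text piece per token, filling '[UNK]' from the original text."""
    pieces = []
    index = 0  # current offset into text
    for k, tok in enumerate(tokens):
        if tok != "[UNK]":
            pieces.append(tok)
            index += len(tok)
            continue
        # Look ahead for the next known token and cut the text just before it.
        following = [t for t in tokens[k + 1:] if t != "[UNK]"]
        end = text.find(following[0], index) if following else -1
        if end == -1:
            end = len(text)  # nothing known follows: take the rest of the text
        pieces.append(text[index:end])
        index = end
    return pieces


middle = align_tokens(["糖", "[UNK]", "病"], "糖尿病")
tail = align_tokens(["a", "[UNK]"], "aXY")
```

Because every piece is cut from the original text, joining the pieces reproduces the text, which is what allows token spans to be turned back into substrings.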

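Finally, each entry saved in spoes is looked up against the id_str just built to recover the subject and object strings, and (subject, predicate, object) is appended as a tuple to res, which is the return value of the function.

The SPO class:

The purpose of this class was not obvious to me. It subclasses tuple, and since extract_spoes returns (subject, predicate, object) triples, spo[0], spo[1] and spo[2] are most likely the subject, predicate and object of a single triple (the other possibility I considered is that they are columns of subjects, predicates and objects). The subject and object are tokenized and stored as tuples in spox, and __hash__ and __eq__ are defined on spox, so two triples count as equal when their tokenized forms match. That would make it convenient to compare sets of triples, although where the class is used is not shown here.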

class SPO(tuple):
    def __init__(self, spo):
        self.spox = (
            tuple(config.tokenizer.tokenize(spo[0])),
            spo[1],
            tuple(config.tokenizer.tokenize(spo[2])),
        )

    def __hash__(self):
        return self.spox.__hash__()

    def __eq__(self, spo):
        return self.spox == spo.spox
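A short demonstration of what the custom __hash__ and __eq__ buy, using a stand-in tokenizer. fake_tokenize (lowercase and split on whitespace) is an assumption made so the example runs without the real config.tokenizer, and the sample triples are invented.

```python
def fake_tokenize(text):
    # Stand-in for config.tokenizer.tokenize: lowercase, split on whitespace.
    return text.lower().split()


class SPO(tuple):
    def __init__(self, spo):
        # Normalized form used for hashing and comparison.
        self.spox = (
            tuple(fake_tokenize(spo[0])),
            spo[1],
            tuple(fake_tokenize(spo[2])),
        )

    def __hash__(self):
        return hash(self.spox)

    def __eq__(self, other):
        return self.spox == other.spox


# The raw strings differ in case and spacing, the tokenized forms do not.
predicted = {SPO(("Type 2  Diabetes", "treated_by", "metformin"))}
gold = {SPO(("type 2 diabetes", "treated_by", "Metformin"))}
common = predicted & gold
```

Set intersection relies on __hash__ and __eq__, so the two triples are treated as the same element even though the underlying tuples are not identical.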
