Relation Extraction with PyTorch

StarLib

Category: NLP

Article link: https://blog.csdn.net/StarLib/article/details/104567187

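This article presents a PyTorch implementation of a relation extraction model based on the paper "Relation Classification via Multi-Level Attention CNNs", which uses input attention and attention pooling to learn the key information in a sentence. The model consists of data preprocessing, input representation, a convolution layer, attention pooling, and a loss function. During training, the experimental results turned out to be abnormally good because of a problem with how the dataset was split.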

title: PyTorch Relation Extraction
date: 2020-02-28 23:52:25
tags:

PyTorch Relation Extraction

Paper reproduced: Relation Classification via Multi-Level Attention CNNs

Source code: https://github.com/SStarLib/myACnn

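I. Paper Overview

Overview:

The paper "Relation Classification via Multi-Level Attention CNNs" comes from Liu Zhiyuan's group at Tsinghua University. It tackles the relation extraction task with a CNN built on two attention mechanisms.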

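Motivation:

Some words in a sentence play an important role in classifying the relation between entities. For example:

“Fizzy [drinks] and meat cause heart disease and [diabetes].”

Here the word "cause" is a strong signal for the relation between the two entities. The attention mechanisms let the model learn which words in the sentence matter most:

Input attention finds the parts of the sentence that are most relevant to entity1 and entity2.
Attention pooling finds the parts of the feature map that matter most for building the relation embedding.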

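II. Building the Model

Data preprocessing

Data preprocessing is one of the more time-consuming parts, and my own version of it is not done well. Since many papers use this dataset, though, you can find copies that others have already preprocessed.

Building the model

The modules that need to be built are roughly:

Input representation
Input attention
Convolution
Attention-based pooling
Loss function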

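1. Input representation

Vectorizing the text data

Build a vocab from the dataset. A vocab is simply a dictionary that stores the word-to-index mapping. It needs to support three operations: adding a token, looking up an index by token, and looking up a token by index. The vocab class (see the source repository for the full code):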

class Vocabulary(object):
    """Class to process text and extract vocabulary for mapping"""
    def __init__(self, token_to_idx=None):
        """
        :param token_to_idx(dict): a pre-existing map of tokens to indices
        """

        if token_to_idx is None:
            token_to_idx = {}
        self._token_to_idx = token_to_idx
        self._idx_to_token = {idx: token
                              for token, idx in self._token_to_idx.items()}

    # ...
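
The three operations listed above (add a token, token-to-index lookup, index-to-token lookup) could be filled in roughly as follows. This is a minimal sketch, not the code from the repository: the class name `SimpleVocabulary` and the method names `add_token`, `lookup_token`, and `lookup_index` are assumptions made for illustration, and the sketch assumes that any pre-existing indices are contiguous and start at 0.

```python
class SimpleVocabulary:
    """Hypothetical minimal vocabulary: a two-way token <-> index mapping."""

    def __init__(self, token_to_idx=None):
        # Copy the mapping so the caller's dict is not mutated.
        self._token_to_idx = dict(token_to_idx or {})
        self._idx_to_token = {idx: tok for tok, idx in self._token_to_idx.items()}

    def add_token(self, token):
        """Add a token if it is new and return its index."""
        if token in self._token_to_idx:
            return self._token_to_idx[token]
        # Assumption: indices are contiguous, so the next free index is the size.
        index = len(self._token_to_idx)
        self._token_to_idx[token] = index
        self._idx_to_token[index] = token
        return index

    def lookup_token(self, token):
        """Return the index stored for a token (KeyError if unknown)."""
        return self._token_to_idx[token]

    def lookup_index(self, index):
        """Return the token stored at an index (KeyError if unknown)."""
        return self._idx_to_token[index]

    def __len__(self):
        return len(self._token_to_idx)
```

Keeping both dictionaries in sync inside `add_token` makes both lookups constant-time, at the cost of storing each token twice.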

Vectorizing the input sentences: vectorizer

The vectorizer converts the text of a sentence into numbers, producing a list of indices.
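
As an illustration of this step, the sketch below maps a sentence to a fixed-length list of indices. It is not the vectorizer from the repository: the function name `vectorize`, the whitespace tokenization, the lowercasing, and the `<pad>`/`<unk>` indices are all assumptions chosen to keep the example small.

```python
def vectorize(sentence, token_to_idx, max_len, unk_index=1, pad_index=0):
    """Turn a sentence into a list of exactly max_len token indices.

    Assumptions for this sketch: tokens are split on whitespace and
    lowercased, unknown tokens map to unk_index, and short sentences are
    right-padded with pad_index.
    """
    tokens = sentence.lower().split()
    # Look up each token, then truncate to the maximum length.
    indices = [token_to_idx.get(tok, unk_index) for tok in tokens][:max_len]
    # Right-pad so that every sentence has the same length.
    indices += [pad_index] * (max_len - len(indices))
    return indices


# Hypothetical toy vocabulary for the usage example.
toy_vocab = {"<pad>": 0, "<unk>": 1, "drinks": 2, "cause": 3, "diabetes": 4}
padded = vectorize("Drinks cause diabetes", toy_vocab, max_len=5)
```

Padding to a fixed length is what allows several sentences to be stacked into one batch tensor later.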
Building the dataset generator: dataset

This class subclasses torch.utils.data.Dataset, which makes it easy to load the data with DataLoader.
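
A minimal sketch of such a class is shown below, assuming the sentences have already been turned into equal-length index lists. The class name `RelationDataset`, the dictionary keys `"x"` and `"y"`, and the toy data are hypothetical; the repository's dataset class may be organized differently.

```python
import torch
from torch.utils.data import Dataset, DataLoader


class RelationDataset(Dataset):
    """Hypothetical dataset over pre-vectorized sentences and relation labels."""

    def __init__(self, token_ids, labels):
        assert len(token_ids) == len(labels)
        self.token_ids = token_ids
        self.labels = labels

    def __len__(self):
        # DataLoader uses this to know how many samples exist.
        return len(self.labels)

    def __getitem__(self, index):
        # Return one sample; the default collate function stacks these
        # per-key into batch tensors.
        return {
            "x": torch.tensor(self.token_ids[index], dtype=torch.long),
            "y": torch.tensor(self.labels[index], dtype=torch.long),
        }


# Toy data: three sentences of length 4 and their relation labels.
dataset = RelationDataset([[2, 3, 4, 0], [5, 3, 0, 0], [6, 7, 8, 9]], [0, 1, 0])
loader = DataLoader(dataset, batch_size=2, shuffle=False)
first_batch = next(iter(loader))
```

Only `__len__` and `__getitem__` are needed for a map-style dataset; batching and shuffling are left to DataLoader.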

Representation layer of the model

The representation layer uses PyTorch's built-in embedding layer. The paper requires four embeddings: word embedding, pos1 embedding, pos2 embedding, and relation embedding.
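
The four embeddings could be set up along these lines. This is a sketch under stated assumptions, not the repository's model code: all sizes (`vocab_size`, `num_positions`, `num_relations`, `e_s`, `pe_s`, `rel_size`) are made-up values, and concatenating the word and position embeddings along the last dimension is inferred from the shape `[b_s, s_l, e_s+2*pe_s]` that appears in the `createWin` docstring.

```python
import torch
import torch.nn as nn

# Hypothetical sizes, chosen only for illustration.
vocab_size, num_positions, num_relations = 100, 40, 19
e_s, pe_s, rel_size = 50, 5, 30

word_emb = nn.Embedding(vocab_size, e_s, padding_idx=0)  # word embedding
pos1_emb = nn.Embedding(num_positions, pe_s)  # relative position to entity1
pos2_emb = nn.Embedding(num_positions, pe_s)  # relative position to entity2
rel_emb = nn.Embedding(num_relations, rel_size)  # one vector per relation class

# Random index tensors standing in for a real batch: [b_s, s_l].
b_s, s_l = 2, 7
words = torch.randint(0, vocab_size, (b_s, s_l))
pos1 = torch.randint(0, num_positions, (b_s, s_l))
pos2 = torch.randint(0, num_positions, (b_s, s_l))

# Concatenate along the embedding dimension: [b_s, s_l, e_s + 2 * pe_s].
input_vec = torch.cat([word_emb(words), pos1_emb(pos1), pos2_emb(pos2)], dim=-1)
```

The relation embedding is kept separate from `input_vec` because it represents the output classes rather than the input tokens.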

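Concatenating word vectors:

Following the formula in the original paper, every three consecutive words are concatenated along the embedding dimension so that as much sequence information as possible is preserved. Implementation: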

    def createWin(self, input_vec):
        """
        [b_s, s_l, e_s+2*pe_s], k=win_size
        :param input_vec: [b_s, s_l, e_s+2*pe_s]
        :return:shape [b_s, s_l, (e_s+2*pe_s)*k]
        """
        n = self.win_size
        result = input_vec
        input_len = input_vec.shape[1]
        for i in range(1,n):
            input_temp = input_vec.narrow(1,i, input_len-i) # the length should be input_len - i
            ten = torch.zeros((input_vec.shape[0