Relation Extraction Based on Deep Learning

hithithithithit

Last modified: 2022-08-12 22:56:41


First published: 2022-07-02 21:54:28

Article link: https://blog.csdn.net/qq_38901850/article/details/125449067


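I. Introduction to Relation Extraction

Overview

Relation extraction [1][2] aims to extract triples of the form (subject, object, relation type) from natural-language text, identifying both the entities and the type of relation between them. It supports downstream tasks such as automatic knowledge graph construction [3], search engines, and question answering.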

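Methods

Relation extraction is commonly decomposed into two subtasks, named entity recognition [4][5] and relation classification [6][7]: all entities are first identified in the given text, and relations are then classified among the identified entities. This is known as the pipeline approach [8][9][10]. It has several drawbacks. Errors made by the entity module propagate to the next module and cause the relation module to make wrong predictions, a problem known as cascading errors. The pipeline approach also ignores the interaction between the two subtasks, which hurts extraction quality. In addition, entities predicted by the entity module do not necessarily take part in any relation, which introduces redundant information and raises the error rate. The approach does have an advantage of its own, however: the entities and entity types produced by the first stage can be injected into the relation model as features, which can further improve accuracy.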
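To make the redundancy point concrete, here is a minimal sketch of the candidate entity pairs that a pipeline's relation classifier has to score. The entity list and the `candidate_pairs` helper are hypothetical illustrations and are not taken from any of the cited systems.

```python
from itertools import permutations


def candidate_pairs(entities):
    """Return every ordered (subject, object) pair of distinct entities."""
    return list(permutations(entities, 2))


# Hypothetical output of an entity model for one sentence.
entities = ["Obama", "Hawaii", "United States", "1961"]
pairs = candidate_pairs(entities)

# 4 entities give 4 * 3 = 12 ordered pairs, most of which hold no relation.
print(len(pairs))
```

The number of candidates grows quadratically with the number of predicted entities, so each spurious entity adds many pairs that the relation model must reject.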

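Recent work uses joint relation extraction [11], which extracts entities and relation triples in a single step. By encoding the text in one parameter space, joint extraction shares parameters between named entity recognition and relation classification and can capture the dependencies between the two tasks, which effectively mitigates cascading errors and redundant information.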

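Document Level and Sentence Level

By task, relation extraction can be divided into sentence-level and document-level extraction [12][13][14]. The boundary between the two is often blurry. Generally, sentence-level means that each input contains a single sentence, with sentences typically delimited by periods, while document-level usually refers to a passage containing multiple sentences. In practice, some long sentences are longer than several short sentences combined, so a model may be unable to process the input in one pass. Document-level relation extraction over such long texts remains an open difficulty. Compared with sentence-level extraction, the added difficulty of document-level extraction shows up mainly in two respects: (1) a document contains many more entities than a sentence, yet relatively few of them take part in relations, which makes the task harder for the model; (2) relation types in documents are unevenly distributed, with many long-tail types, and this imbalance makes extraction harder.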

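Distant Supervision

By type of supervision, methods can be divided into supervised and distantly supervised [15][16][17]. Most current methods are based on supervised learning, which struggles with the long-tail problem in relation extraction. A common remedy is distant supervision, which uses an external knowledge base to add training data for rare relation types and can alleviate the imbalance between relation classes.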

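Challenges

The main difficulties in relation extraction are entity pair overlap (EPO), which makes relation classification a multi-label problem, and single entity overlap (SEO), where one entity takes part in several relations. EPO means that multiple relations hold between the same two entities, whereas ordinary relation classification assumes a single relation per entity pair. SEO means that one entity is related to several other entities in the sentence. Harder still is relation classification over nested entities. For example, in the sentence “我想去北京天安门。” (“I want to go to Tiananmen in Beijing.”), 北京天安门 (Beijing Tiananmen) is located in 北京 (Beijing), and the second entity is nested inside the first.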
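The EPO and SEO definitions above can be checked mechanically for a sentence's gold triples. The following is a minimal sketch under two assumptions: entities are compared as plain strings, and an entity pair is treated as unordered. The function name `overlap_types` and the example triples are hypothetical.

```python
from collections import Counter


def overlap_types(triples):
    """Label a sentence's triples as Normal, EPO, and/or SEO.

    triples: list of (subject, object, relation) tuples.
    """
    types = set()
    # EPO: the same entity pair carries more than one relation.
    pair_counts = Counter(frozenset((s, o)) for s, o, _ in triples)
    if any(count > 1 for count in pair_counts.values()):
        types.add("EPO")
    # SEO: two different entity pairs share an entity.
    pairs = list(pair_counts)
    for i, first in enumerate(pairs):
        for second in pairs[i + 1:]:
            if first & second:
                types.add("SEO")
    return types or {"Normal"}


print(overlap_types([("Obama", "Hawaii", "born_in")]))
print(overlap_types([("Paris", "France", "capital_of"),
                     ("Paris", "France", "located_in")]))
print(overlap_types([("Obama", "Hawaii", "born_in"),
                     ("Obama", "United States", "president_of")]))
```

A sentence can be both EPO and SEO at once, which is why the function returns a set rather than a single label.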

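II. Related Work

1. Methods

The pipeline approach [8][9][10] decomposes relation extraction into two subtasks, named entity recognition [4][5] and relation classification [6][7]. [10] and [18] use two independent pretrained models [19][20][21]. The entity model performs named entity recognition by predicting an entity type for every candidate span. The relation model then takes the predicted entities and their types and predicts a relation type for every entity pair. The approach is simple yet remarkably effective. Building on [10], [18] introduces levitated markers for entity spans and fixes the way the subject entity is marked in the input to the relation module, and reports the best performance at the time.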

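Joint extraction methods are varied and can generally be grouped as follows:

> Sequence-tagging methods [22][23]. These resemble sequence tagging for named entity recognition. In [23], each entity token receives a tag such as B-CP-1. B gives the position of the word within the entity, and the other available tags are I, E, and O. CP is the relation label, here standing for Country-President. The trailing 1 means the entity plays the subject role in the triple, and 2 means it is the object. The drawback is that sequence tagging handles only simple cases and is of little use for complex relation extraction.

> Table-filling methods [24][25][26]. The main idea is to cast relation extraction as a structured prediction problem over a two-dimensional table. Both the rows and the columns of the table are the tokens of the same sentence, and the model predicts, for the cell of each token pair, the entity head and tail positions and the relation type. Predicting relations per token pair handles EPO and SEO well, but it still does not handle relations between nested entity pairs well.

> Graph-based methods [27][28][29]. [27] proposes a graph-based joint learning model that casts relation extraction as a directed graph problem and solves it with a transition-based framework. It handles overlapping relations effectively and also uses a loss function that strengthens the association between entity pairs.

> Machine reading comprehension (MRC) methods [30][31]. [30] builds on machine reading comprehension and recasts the task as multi-turn question answering. This makes good use of the strengths of MRC: extracting entities and relations from text becomes a span extraction problem.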
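For the sequence-tagging scheme in the first item above, the following minimal sketch turns one triple into per-token tags of the form position-relation-role. The helper `tag_sentence`, the span format, and the example sentence are hypothetical; the `S` tag for single-token entities is an assumption borrowed from the common BIES convention and is not stated above.

```python
def tag_sentence(tokens, triples):
    """Assign one tag per token: <position>-<relation>-<role>, or O.

    triples: list of (subject_span, object_span, relation_code), where each
    span is an inclusive (start, end) pair of token indices.
    """
    tags = ["O"] * len(tokens)
    for subj, obj, rel in triples:
        for (start, end), role in ((subj, 1), (obj, 2)):
            for i in range(start, end + 1):
                if start == end:
                    pos = "S"  # single-token entity (assumed BIES convention)
                elif i == start:
                    pos = "B"
                elif i == end:
                    pos = "E"
                else:
                    pos = "I"
                tags[i] = f"{pos}-{rel}-{role}"
    return tags


tokens = "The United States President Trump will visit".split()
# Country-President: subject "United States" (tokens 1-2), object "Trump" (token 4).
tags = tag_sentence(tokens, [((1, 2), (4, 4), "CP")])
print(list(zip(tokens, tags)))
```

Because each token holds exactly one tag, a second triple that touches the same token would overwrite the first, which is one way to see why this scheme cannot represent overlapping triples.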

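2. Tasks

Document-level relation extraction [12][13][14] is more challenging than traditional sentence-level relation extraction. It aims to extract all relations across multiple sentences at once. Most existing methods build a document graph from dependency information between sentences [12][13] and then reason over it with a graph neural network. Others use Transformer-based pretrained models, since large pretrained models can capture long-range relations.

The difficulty of document-level extraction is that the unit of processing is the whole document. Many entities are spread across different sentences, and a single entity may be mentioned several times, which makes document-level relation extraction hard. Current models tend to focus on global representations and syntactic features while neglecting the dependencies between entity pairs. [32][33] use convolutional neural network architectures to capture the connections between entity pairs, but convolutional networks are not good at capturing contextual representations. In addition, existing methods commonly use distant supervision [15][16][17], annotating documents with entity relations from a knowledge base, which enlarges the corpus and substantially improves extraction performance.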

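3. Distant Supervision

Traditional supervised relation extraction achieves good results but requires a great deal of human effort to annotate datasets, which costs considerable time and money. [15] therefore proposed constructing relation extraction datasets with distant supervision. Concretely, if two entities that carry a relation label in a knowledge base also appear together in a piece of text, the co-occurring entities in the text are given the same label. This has an obvious weakness: not every co-occurrence of the two entities actually expresses that relation. The method reduces labor and time costs, but its strong assumption introduces many labeling errors and a long-tailed data distribution.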
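The labeling rule described above can be sketched in a few lines. This is a minimal illustration assuming that entity mentions have already been found in each sentence and that the knowledge base is a plain dictionary; `distant_label`, the toy knowledge base, and the relation name `founder_of` are hypothetical.

```python
def distant_label(sentences, kb):
    """Label sentences with the distant-supervision assumption.

    sentences: list of (text, entities), where entities lists the entity
               strings already found in the text.
    kb: dict mapping (subject, object) -> relation.
    Returns a list of (text, subject, object, relation).
    """
    labeled = []
    for text, entities in sentences:
        for subj in entities:
            for obj in entities:
                rel = kb.get((subj, obj))
                if rel is not None:
                    labeled.append((text, subj, obj, rel))
    return labeled


kb = {("Steve Jobs", "Apple"): "founder_of"}
sentences = [
    ("Steve Jobs co-founded Apple in 1976.", ["Steve Jobs", "Apple"]),
    ("Steve Jobs left Apple in 1985.", ["Steve Jobs", "Apple"]),
]
labeled = distant_label(sentences, kb)
for row in labeled:
    print(row)
```

Both sentences receive the `founder_of` label, although only the first one expresses it. The second is exactly the kind of noisy label the strong assumption produces.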

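To address the labeling errors in the generated data, [34] uses a multi-instance learning algorithm that labels the data at multiple levels, reducing the impact of noisy distantly supervised data. Multi-instance learning allows an entity pair to have multiple instances and multiple labels, which alleviates the wrong-label problem to some extent. [17] adopts the multi-instance learning idea to reduce the effect of wrongly labeled instances and integrates it into a convolutional neural network to perform relation extraction. Subsequent work largely builds on improvements to multi-instance learning and to attention mechanisms.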
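As a rough illustration of the multi-instance idea, the sketch below groups distantly labeled sentences into one bag per entity pair and scores a bag by its most confident sentence. The helper names, the toy data, and the per-sentence scores are hypothetical, and taking the maximum is only one simple way to realize the "at least one sentence expresses the relation" assumption, not the exact procedure of any cited paper.

```python
from collections import defaultdict


def build_bags(labeled):
    """Group (text, subject, object, relation) rows into bags per entity pair."""
    bags = defaultdict(lambda: {"sentences": [], "relations": set()})
    for text, subj, obj, rel in labeled:
        bag = bags[(subj, obj)]
        bag["sentences"].append(text)
        bag["relations"].add(rel)
    return dict(bags)


def bag_score(instance_scores):
    """At-least-one assumption: a bag is as confident as its best instance."""
    return max(instance_scores)


labeled = [
    ("Steve Jobs co-founded Apple in 1976.", "Steve Jobs", "Apple", "founder_of"),
    ("Steve Jobs left Apple in 1985.", "Steve Jobs", "Apple", "founder_of"),
    ("Steve Jobs returned to lead Apple.", "Steve Jobs", "Apple", "ceo_of"),
]
bags = build_bags(labeled)
bag = bags[("Steve Jobs", "Apple")]
print(len(bag["sentences"]), sorted(bag["relations"]))

# Hypothetical per-sentence scores from a relation model for "founder_of".
print(bag_score([0.91, 0.07]))
```

Training on bags instead of single sentences means a noisy sentence no longer has to be explained on its own, as long as some other sentence in the bag supports the label.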

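The above reviews relation extraction from three angles: joint versus pipeline methods under supervised learning, document-level versus sentence-level relation extraction, and the role of distant supervision. It also points out current research hotspots in the field.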

III. Datasets

Sentence-level relation extraction corpora:

SemEval2010 Task8

The SemEval2010 Task8 [35] dataset focuses on the semantic relation between two nominals in a sentence. It contains 10 relation types: Cause-Effect (CE), Instrument-Agency (IA), Product-Producer (PP), Content-Container (CC), Entity-Origin (EO), Entity-Destination (ED), Component-Whole (CW), Member-Collection (MC), Message-Topic (MT), and Other. All relations except Other are directional.
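Because nine of the ten relation types are directional, a classifier for this dataset typically predicts over a larger label set than ten. The sketch below builds that label set, using the `Relation(e1,e2)` / `Relation(e2,e1)` label format of the dataset; the helper name `build_label_set` is hypothetical.

```python
RELATIONS = [
    "Cause-Effect", "Instrument-Agency", "Product-Producer",
    "Content-Container", "Entity-Origin", "Entity-Destination",
    "Component-Whole", "Member-Collection", "Message-Topic",
]


def build_label_set(relations):
    """Expand each directional relation into both directions, then add Other."""
    labels = []
    for rel in relations:
        labels.append(f"{rel}(e1,e2)")
        labels.append(f"{rel}(e2,e1)")
    labels.append("Other")
    return labels


labels = build_label_set(RELATIONS)
# 9 directional relations * 2 directions + Other = 19 labels.
print(len(labels))
```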

WebNLG

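The WebNLG [36] corpus consists of fact triples (entity pairs and the relations between them) paired with the corresponding natural-language text. It was originally built for the WebNLG natural language generation challenge and uses triples from DBPedia covering six categories (astronaut, building, monument, university, sports team, written work). Today WebNLG is mostly used to evaluate how well models handle overlapping relations. A single sentence in WebNLG can contain multiple relation types; the corpus has 246 Normal cases, 457 SEO cases, and 26 EPO cases.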

ACE05

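The ACE 2005 Multilingual Training Corpus contains the complete English, Arabic, and Chinese training data for the 2005 Automatic Content Extraction (ACE) technology evaluation. The corpus consists of various types of data annotated for entities, relations, and events by the Linguistic Data Consortium (LDC), with support from the ACE program and additional assistance from the LDC. The goal of the ACE program is to develop automatic content extraction technology that supports automatic processing of human language in text form. The corpus contains 7 undirected relations (ART, GEN-AFF, METONYMY, ORG-AFF, PART-WHOLE, PER-SOC, PHYS) and 11 directed relations (ART, GEN-AFF, AFF-GEN, METONYMY, ORG-AFF, AFF-ORG, PART-WHOLE, WHOLE-PART, PER-SOC, SOC-PER, PHYS).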

SciERC

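SciERC [37] is a collection of 500 scientific abstracts annotated with scientific entities, their relations, and coreference clusters. The abstracts are taken from the proceedings of 12 AI conferences and workshops across four AI communities in the Semantic Scholar Corpus. SciERC extends earlier datasets for scientific articles, SemEval 2017 Task 10 and SemEval 2018 Task 7, by extending the entity types, relation types, and relation coverage, and by adding cross-sentence relations through coreference links. SciERC contains the following 7 relation types: Compare, Part-of, Conjunction, Evaluate-for, Feature-of, Used-for, and Hyponym-of.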

Document-level relation extraction datasets:

DocRED

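DocRED [38] is a large-scale crowdsourced dataset built from Wikipedia and Wikidata. It contains 5,053 documents, and about 7% of entity pairs have more than one relation. A benchmark for the dataset is hosted on CodaLab. DocRED annotates both intra-sentence and inter-sentence relations and has three main features: (1) it annotates both named entities and relations and is the largest human-annotated dataset for document-level RE from plain text; (2) it requires reading multiple sentences in a document and synthesizing all of the document's information to extract entities and infer their relations; (3) alongside the human-annotated data it provides large-scale distantly supervised data, so it can be used in both supervised and weakly supervised settings. Together, the human-annotated and distantly supervised data cover 96 relation types.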

Distantly supervised datasets:

NYT

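NYT [39], short for New York Times, is a widely used dataset for distantly supervised relation extraction. It was generated by aligning relations in Freebase with the New York Times (NYT) corpus. The New York Times dataset contains 150 business articles from the New York Times. All articles on the New York Times website from November 2009 to January 2010 were crawled. NYT is often used to evaluate how well models handle SEO and EPO. A single sentence in NYT can contain multiple relation types; the corpus has 3266 Normal cases, 1297 SEO cases, and 978 EPO cases.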

IV. References

[1] Dmitry Zelenko, Chinatsu Aone, and Anthony Richardella. 2002. Kernel methods for relation extraction. In Empirical Methods in Natural Language Processing (EMNLP), pages 71–78.

[2] Razvan Bunescu and Raymond Mooney. 2005. A shortest path dependency kernel for relation extraction. In Empirical Methods in Natural Language Processing (EMNLP), pages 724–731.

[3] Takanobu, R., Zhang, T., Liu, J., & Huang, M. (2019). A Hierarchical Framework for Relation Extraction with Reinforcement Learning. Proceedings of the AAAI Conference on Artificial Intelligence, 33(01), 7072-7079.

[4] Erik Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the CoNLL-2003 shared task: Language independent named entity recognition. In Computational Natural Language Learning (CoNLL), pages 142–147.

[5] Lev Ratinov and Dan Roth. 2009. Design challenges and misconceptions in named entity recognition. In Computational Natural Language Learning (CoNLL), pages 147–155.

[6] Liu, C. Y., Sun, W. B., Chao, W. H., et al. 2013. Convolution Neural Network for Relation Extraction. In International Conference on Advanced Data Mining and Applications.

[7] Linlin Wang, Zhu Cao, Gerard de Melo, and Zhiyuan Liu. 2016. Relation Classification via Multi-Level Attention CNNs. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1298–1307.

[8] Dmitry Zelenko, Chinatsu Aone, and Anthony Richardella. 2003. Kernel methods for relation extraction. Journal of Machine Learning Research, 3, pages 1083–1106.

[9] Yee Seng Chan and Dan Roth. 2011. Exploiting Syntactico-Semantic Structures for Relation Extraction. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 551–560.

[10] Zexuan Zhong and Danqi Chen. 2021. A Frustratingly Easy Approach for Entity and Relation Extraction. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 50–61.

[11] Makoto Miwa and Mohit Bansal. 2016. End-to-End Relation Extraction using LSTMs on Sequences and Tree Structures. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1105–1116.

[12] Shuang Zeng, Runxin Xu, Baobao Chang, and Lei Li. 2020. Double Graph Based Reasoning for Document-level Relation Extraction. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1630–1640.

[13] Shuang Zeng, Yuting Wu, and Baobao Chang. 2021. SIRE: Separate Intra- and Inter-sentential Reasoning for Document-level Relation Extraction. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 524–534.

[14] Qingyu Tan, Ruidan He, Lidong Bing, and Hwee Tou Ng. 2022. Document-Level Relation Extraction with Adaptive Focal Loss and Knowledge Distillation. In Findings of the Association for Computational Linguistics: ACL 2022, pages 1672–1681.

[15] Mike Mintz, Steven Bills, Rion Snow, and Daniel Jurafsky. 2009. Distant supervision for relation extraction without labeled data. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, pages 1003–1011.

[16] Yankai Lin, Shiqi Shen, Zhiyuan Liu, Huanbo Luan, and Maosong Sun. 2016. Neural Relation Extraction with Selective Attention over Instances. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2124–2133.

[17] Daojian Zeng, Kang Liu, Yubo Chen, and Jun Zhao. 2015. Distant Supervision for Relation Extraction via Piecewise Convolutional Neural Networks. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1753–1762.

[18] Deming Ye, Yankai Lin, Peng Li, and Maosong Sun. 2022. Packed Levitated Marker for Entity and Relation Extraction. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 4904–4917.

[19] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In North American Chapter of the Association for Computational Linguistics (NAACL), pages 4171–4186.

[20] Iz Beltagy, Kyle Lo, and Arman Cohan. 2019. Scibert: A pretrained language model for scientific text. In Empirical Methods in Natural Language Processing (EMNLP), pages 3606–3611.

[21] Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. 2020. ALBERT: A lite bert for self-supervised learning of language representations. In International Conference on Learning Representations (ICLR).

[22] Arzoo Katiyar and Claire Cardie. 2017. Going out on a limb: Joint Extraction of Entity Mentions and Relations without Dependency Trees. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 917–928.

[23] Suncong Zheng, Feng Wang, Hongyun Bao, Yuexing Hao, Peng Zhou, and Bo Xu. 2017. Joint Extraction of Entities and Relations Based on a Novel Tagging Scheme. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1227–1236.

[24] Meishan Zhang, Yue Zhang, and Guohong Fu. 2017. End-to-End Neural Relation Extraction with Global Optimization. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1730–1740.

[25] Yijun Wang, Changzhi Sun, Yuanbin Wu, Hao Zhou, Lei Li, and Junchi Yan. 2021. UniRE: A Unified Label Space for Entity Relation Extraction. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 220–231.

[26] Shang, Y., Huang, H., & Mao, X. (2022). OneRel: Joint Entity and Relation Extraction with One Module in One Step. ArXiv, abs/2203.05412.

[27] Zhepei Wei, Jianlin Su, Yue Wang, Yuan Tian, and Yi Chang. 2020. A Novel Cascade Binary Tagging Framework for Relational Triple Extraction. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1476–1488.

[28] Tsu-Jui Fu, Peng-Hsuan Li, and Wei-Yun Ma. 2019. GraphRel: Modeling Text as Relational Graphs for Joint Entity and Relation Extraction. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1409–1418.

[29] Changzhi Sun, Yeyun Gong, Yuanbin Wu, Ming Gong, Daxin Jiang, Man Lan, Shiliang Sun, and Nan Duan. 2019. Joint Type Inference on Entities and Relations via Graph Convolutional Networks. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1361–1370.

[30] Xiaoya Li, Fan Yin, Zijun Sun, Xiayu Li, Arianna Yuan, Duo Chai, Mingxin Zhou, and Jiwei Li. 2019. Entity-Relation Extraction as Multi-Turn Question Answering. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1340–1350.

[31] Hermann, K.M., Kociský, T., Grefenstette, E., Espeholt, L., Kay, W., Suleyman, M., & Blunsom, P. (2015). Teaching Machines to Read and Comprehend. NIPS.

[32] Zhang, N., Chen, X., Xie, X., Deng, S., Tan, C., Chen, M., Huang, F., Si, L., & Chen, H. (2021). Document-level Relation Extraction as Semantic Segmentation. IJCAI.

[33] Jingye Li, Kang Xu, Fei Li, Hao Fei, Yafeng Ren, and Donghong Ji. 2021. MRN: A Locally and Globally Mention-Based Reasoning Network for Document-Level Relation Extraction. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 1359–1370.

[34] Raphael Hoffmann, Congle Zhang, Xiao Ling, Luke Zettlemoyer, and Daniel S. Weld. 2011. Knowledge-Based Weak Supervision for Information Extraction of Overlapping Relations. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 541–550.

[35] Iris Hendrickx, Su Nam Kim, Zornitsa Kozareva, Preslav Nakov, Diarmuid Ó Séaghdha, Sebastian Padó, Marco Pennacchiotti, Lorenza Romano, and Stan Szpakowicz. 2010. SemEval-2010 Task 8: Multi-Way Classification of Semantic Relations between Pairs of Nominals. In Proceedings of the 5th International Workshop on Semantic Evaluation, pages 33–38.

[36] Emilie Colin, Claire Gardent, Yassine M’rabet, Shashi Narayan, and Laura Perez-Beltrachini. 2016. The WebNLG Challenge: Generating Text from DBPedia Data. In Proceedings of the 9th International Natural Language Generation conference, pages 163–167.

[37] Yi Luan, Luheng He, Mari Ostendorf, and Hannaneh Hajishirzi. 2018. Multi-Task Identification of Entities, Relations, and Coreference for Scientific Knowledge Graph Construction. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3219–3232.

[38] Yuan Yao, Deming Ye, Peng Li, Xu Han, Yankai Lin, Zhenghao Liu, Zhiyuan Liu, Lixin Huang, Jie Zhou, and Maosong Sun. 2019. DocRED: A Large-Scale Document-Level Relation Extraction Dataset. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 764–777.

[39] Riedel, S., Yao, L., McCallum, A. (2010). Modeling Relations and Their Mentions without Labeled Text. In: Balcázar, J.L., Bonchi, F., Gionis, A., Sebag, M. (eds) Machine Learning and Knowledge Discovery in Databases. ECML PKDD 2010. Lecture Notes in Computer Science, vol 6323.