信息抽取数据集和相关SOTA介绍

最新推荐文章于 2024-05-09 09:34:35 发布

HxShine

最新推荐文章于 2024-05-09 09:34:35 发布

阅读量1.1k

点赞数

分类专栏： nlp_paper nlp学习 nlp 文章标签：实体抽取关系抽取自然语言处理

本文链接：https://blog.csdn.net/qq_16949707/article/details/127411267

版权

nlp 同时被 3 个专栏收录

97 篇文章 3 订阅

订阅专栏

nlp_paper

75 篇文章 7 订阅

订阅专栏

nlp学习

61 篇文章 2 订阅

订阅专栏

一、概览

模型	NYT*/NYT	WebNLG*/WebNLG	ACE	ACE05	ACE04	SciERC
TPLinker	91.9/92.0	91.9/86.7
TPLinkerPlus：https://github.com/131250208/TPlinker-joint-extraction	The best F1: 0.931/0.934 (on validation set), 0.926/0.926 (on test set)	The best F1: 0.934/0.889 (on validation set), 0.923/0.882 (on test set)
PURE				65.6	60.2	35.6
PFN:A Partition Filter Network for Joint Entity and Relation Extraction,https://arxiv.org/pdf/2108.12202v8.pdf	92.4	93.6	80.0	66.8	62.5	38.4
OneRel:Joint Entity and Relation Extraction with One Module in One Step,https://arxiv.org/pdf/2203.05412.pdf	92.8/92.9	94.3/91.0
PL-Marker:Packed Levitated Marker for Entity and Relation Extraction,https://arxiv.org/pdf/2109.06067v5.pdf				bert base:69，albxxl:73	bert base:66.7，albxxl:69.7	bert base:53.2

二、paperswithcode 所有关系抽取相关数据集

https://paperswithcode.com/datasets?task=relation-extraction
在这里插入图片描述

三、数据集介绍和SOTA

3.1 WebNLG

https://paperswithcode.com/sota/relation-extraction-on-webnlg
Introduced by Gardent et al. in Creating Training Corpora for NLG Micro-Planners
The WebNLG corpus comprises of sets of triplets describing facts (entities and relations between them) and the corresponding facts in form of natural language text. The corpus contains sets with up to 7 triplets each along with one or more reference texts for each set. The test set is split into two parts: seen, containing inputs created for entities and relations belonging to DBpedia categories that were seen in the training data, and unseen, containing inputs extracted for entities and relations belonging to 5 unseen categories.
Initially, the dataset was used for the WebNLG natural language generation challenge which consists of mapping the sets of triplets to text, including referring expression generation, aggregation, lexicalization, surface realization, and sentence segmentation. The corpus is also used for a reverse task of triplets extraction.
Versioning history of the dataset can be found here.
在这里插入图片描述

3.2 ACE 2005

https://paperswithcode.com/sota/relation-extraction-on-ace-2005
ACE 2005 Multilingual Training Corpus contains the complete set of English, Arabic and Chinese training data for the 2005 Automatic Content Extraction (ACE) technology evaluation. The corpus consists of data of various types annotated for entities, relations and events by the Linguistic Data Consortium (LDC) with support from the ACE Program and additional assistance from LDC.

在这里插入图片描述

3.3 ACE 2004

https://paperswithcode.com/sota/relation-extraction-on-ace-2004

ACE 2004 Multilingual Training Corpus contains the complete set of English, Arabic and Chinese training data for the 2004 Automatic Content Extraction (ACE) technology evaluation. The corpus consists of data of various types annotated for entities and relations and was created by Linguistic Data Consortium with support from the ACE Program, with additional assistance from the DARPA TIDES (Translingual Information Detection, Extraction and Summarization) Program. The objective of the ACE program is to develop automatic content extraction technology to support automatic processing of human language in text form. In September 2004, sites were evaluated on system performance in six areas: Entity Detection and Recognition (EDR), Entity Mention Detection (EMD), EDR Co-reference, Relation Detection and Recognition (RDR), Relation Mention Detection (RMD), and RDR given reference entities. All tasks were evaluated in three languages: English, Chinese and Arabic.
在这里插入图片描述

3.5 SciERC

https://paperswithcode.com/dataset/scierc
在这里插入图片描述

SciERC dataset is a collection of 500 scientific abstract annotated with scientific entities, their relations, and coreference clusters. The abstracts are taken from 12 AI conference/workshop proceedings in four AI communities, from the Semantic Scholar Corpus. SciERC extends previous datasets in scientific articles SemEval 2017 Task 10 and SemEval 2018 Task 7 by extending entity types, relation types, relation coverage, and adding cross-sentence relations using coreference links.
在这里插入图片描述

3.6 FewRel

http://www.zhuhao.me/fewrel/
FewRel is a Few-shot Relation classification dataset, which features 70, 000 natural language sentences expressing 100 relations annotated by crowdworkers.
Please refer to our EMNLP 2018 paper to learn more about this dataset.
在这里插入图片描述