An Overview of Information Extraction Datasets and Related SOTA Results


1. Overview

| Model | NYT*/NYT | WebNLG*/WebNLG | ACE | ACE05 | ACE04 | SciERC |
|---|---|---|---|---|---|---|
| TPLinker | 91.9/92.0 | 91.9/86.7 | - | - | - | - |
| TPLinkerPlus | Best F1: 0.931/0.934 (validation set), 0.926/0.926 (test set) | Best F1: 0.934/0.889 (validation set), 0.923/0.882 (test set) | - | - | - | - |
| PURE | - | - | - | 65.6 | 60.2 | 35.6 |
| PFN | 92.4 | 93.6 | 80.0 | 66.8 | 62.5 | 38.4 |
| OneRel | 92.8/92.9 | 94.3/91.0 | - | - | - | - |
| PL-Marker | - | - | - | BERT-base: 69; ALBERT-xxlarge: 73 | BERT-base: 66.7; ALBERT-xxlarge: 69.7 | BERT-base: 53.2 |

Sources:
TPLinkerPlus: https://github.com/131250208/TPlinker-joint-extraction
PFN: A Partition Filter Network for Joint Entity and Relation Extraction, https://arxiv.org/pdf/2108.12202v8.pdf
OneRel: Joint Entity and Relation Extraction with One Module in One Step, https://arxiv.org/pdf/2203.05412.pdf
PL-Marker: Packed Levitated Marker for Entity and Relation Extraction, https://arxiv.org/pdf/2109.06067v5.pdf
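
The scores listed above are F1 values (the TPLinkerPlus entry says so explicitly). As a rough illustration of how a triple-level micro F1 is commonly computed, here is a minimal sketch that assumes a predicted triple counts as correct only when its subject, relation, and object all match a gold triple exactly. The function name `micro_prf` and the toy triples are hypothetical; each paper's own evaluation script may use different matching rules.

```python
from typing import Iterable, Set, Tuple

# A triple is (subject, relation, object); this layout is an assumption for the sketch.
Triple = Tuple[str, str, str]


def micro_prf(gold: Iterable[Set[Triple]], pred: Iterable[Set[Triple]]):
    """Micro-averaged precision, recall and F1 over per-sentence triple sets."""
    n_correct = n_pred = n_gold = 0
    for g, p in zip(gold, pred):
        n_correct += len(g & p)  # exact-match triples found in both sets
        n_pred += len(p)
        n_gold += len(g)
    precision = n_correct / n_pred if n_pred else 0.0
    recall = n_correct / n_gold if n_gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy data, invented for illustration only.
gold = [
    {("Obama", "born_in", "Honolulu")},
    {("Paris", "capital_of", "France"), ("France", "contains", "Paris")},
]
pred = [
    {("Obama", "born_in", "Honolulu")},
    {("Paris", "capital_of", "France"), ("Paris", "located_in", "Europe")},
]
p, r, f = micro_prf(gold, pred)
print(f"precision={p:.3f} recall={r:.3f} f1={f:.3f}")
```

Counts are pooled over all sentences before dividing, so sentences with many triples weigh more than sentences with few.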

2. All relation extraction datasets on paperswithcode

https://paperswithcode.com/datasets?task=relation-extraction

3. Dataset descriptions and SOTA

3.1 WebNLG

https://paperswithcode.com/sota/relation-extraction-on-webnlg
Introduced by Gardent et al. in Creating Training Corpora for NLG Micro-Planners
The WebNLG corpus comprises sets of triplets describing facts (entities and the relations between them) and the corresponding facts expressed as natural language text. The corpus contains sets of up to 7 triplets each, along with one or more reference texts for each set. The test set is split into two parts: seen, containing inputs created for entities and relations belonging to DBpedia categories that were seen in the training data, and unseen, containing inputs extracted for entities and relations belonging to 5 unseen categories.
Initially, the dataset was used for the WebNLG natural language generation challenge which consists of mapping the sets of triplets to text, including referring expression generation, aggregation, lexicalization, surface realization, and sentence segmentation. The corpus is also used for a reverse task of triplets extraction.
Versioning history of the dataset can be found here.
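
To make the description above concrete, here is a minimal sketch of how one corpus entry (a triple set plus its reference texts) could be represented, how it maps to the reverse triple-extraction task, and how a seen/unseen split by category works. The class `Entry`, the helper names, and the sample data are hypothetical and do not reflect the actual WebNLG file format.

```python
from dataclasses import dataclass
from typing import List, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, relation, object)


@dataclass
class Entry:
    category: str          # DBpedia category of the entry
    triples: List[Triple]  # up to 7 triples per set, per the description
    texts: List[str]       # one or more reference texts


def to_extraction_pairs(entry: Entry) -> List[Tuple[str, List[Triple]]]:
    """Reverse task: each reference text becomes an input, the triples the target."""
    return [(text, entry.triples) for text in entry.texts]


def split_seen_unseen(entries: List[Entry], train_categories: Set[str]):
    """Partition test entries by whether their category occurred in training."""
    seen = [e for e in entries if e.category in train_categories]
    unseen = [e for e in entries if e.category not in train_categories]
    return seen, unseen


# Invented examples for illustration only.
entries = [
    Entry("Airport", [("Aarhus_Airport", "cityServed", "Aarhus")],
          ["Aarhus Airport serves the city of Aarhus.",
           "The city of Aarhus is served by Aarhus Airport."]),
    Entry("Athlete", [("Some_Player", "club", "Some_Club")],
          ["Some Player plays for Some Club."]),
]
seen, unseen = split_seen_unseen(entries, train_categories={"Airport"})
pairs = to_extraction_pairs(entries[0])
print(len(seen), len(unseen), len(pairs))
```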

3.2 ACE 2005

https://paperswithcode.com/sota/relation-extraction-on-ace-2005
ACE 2005 Multilingual Training Corpus contains the complete set of English, Arabic and Chinese training data for the 2005 Automatic Content Extraction (ACE) technology evaluation. The corpus consists of data of various types annotated for entities, relations and events by the Linguistic Data Consortium (LDC) with support from the ACE Program and additional assistance from LDC.


3.3 ACE 2004

https://paperswithcode.com/sota/relation-extraction-on-ace-2004

ACE 2004 Multilingual Training Corpus contains the complete set of English, Arabic and Chinese training data for the 2004 Automatic Content Extraction (ACE) technology evaluation. The corpus consists of data of various types annotated for entities and relations and was created by the Linguistic Data Consortium with support from the ACE Program, with additional assistance from the DARPA TIDES (Translingual Information Detection, Extraction and Summarization) Program. The objective of the ACE program is to develop automatic content extraction technology to support automatic processing of human language in text form. In September 2004, sites were evaluated on system performance in six areas: Entity Detection and Recognition (EDR), Entity Mention Detection (EMD), EDR Co-reference, Relation Detection and Recognition (RDR), Relation Mention Detection (RMD), and RDR given reference entities. All tasks were evaluated in three languages: English, Chinese and Arabic.

3.5 SciERC

https://paperswithcode.com/dataset/scierc

SciERC is a collection of 500 scientific abstracts annotated with scientific entities, their relations, and coreference clusters. The abstracts are taken from 12 AI conference/workshop proceedings in four AI communities, from the Semantic Scholar Corpus. SciERC extends previous scientific-article datasets (SemEval 2017 Task 10 and SemEval 2018 Task 7) with more entity types, more relation types, and broader relation coverage, and adds cross-sentence relations using coreference links.

3.6 FewRel

http://www.zhuhao.me/fewrel/
FewRel is a few-shot relation classification dataset featuring 70,000 natural language sentences that express 100 relations, annotated by crowdworkers.
See the FewRel authors' EMNLP 2018 paper to learn more about this dataset.
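
With 70,000 sentences over 100 relations, there are about 700 sentences per relation on average (70,000 / 100). Few-shot relation classification is typically evaluated in N-way K-shot episodes: sample N relations, give the model K labelled sentences per relation as a support set, and ask it to classify held-out query sentences. The sketch below illustrates that sampling step under stated assumptions: `data` is a hypothetical mapping from relation name to sentences, not FewRel's actual file format, and `sample_episode` is an illustrative helper, not part of any FewRel toolkit.

```python
import random
from typing import Dict, List, Tuple

Example = Tuple[str, str]  # (sentence, relation label)


def sample_episode(data: Dict[str, List[str]], n_way: int, k_shot: int,
                   n_query: int, rng: random.Random):
    """Draw one N-way K-shot episode: a support set and a disjoint query set."""
    relations = rng.sample(sorted(data), n_way)  # pick N distinct relations
    support: List[Example] = []
    query: List[Example] = []
    for rel in relations:
        # Sample without replacement so support and query never overlap.
        picked = rng.sample(data[rel], k_shot + n_query)
        support += [(s, rel) for s in picked[:k_shot]]
        query += [(s, rel) for s in picked[k_shot:]]
    return support, query


# Toy data, invented for illustration: 4 relations with 5 sentences each.
data = {f"rel_{i}": [f"sentence {i}-{j}" for j in range(5)] for i in range(4)}
support, query = sample_episode(data, n_way=3, k_shot=2, n_query=1,
                                rng=random.Random(0))
print(len(support), len(query))
```

Passing an explicit `random.Random` instance keeps episode sampling reproducible without touching global random state.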

4. Dataset statistics

4.1 NYT/WebNLG data analysis


4.2 ACE05 data analysis
