ACE 2005 Dataset (Introduction 2)

The following content is taken from https://catalog.ldc.upenn.edu/LDC2006T06

ACE 2005 Multilingual Training Corpus

Item Name:ACE 2005 Multilingual Training Corpus
Author(s):Christopher Walker, Stephanie Strassel, Julie Medero, Kazuaki Maeda
LDC Catalog No.:LDC2006T06
ISBN:1-58563-376-3
ISLRN:458-031-085-383-4
Release Date:February 15, 2006
Member Year(s):2006
DCMI Type(s):Text

Data Source(s):

weblogs, broadcast news, newsgroups, broadcast conversation

Project(s):ACE

Application(s):

automatic content extraction

Language(s):

Mandarin Chinese, Standard Arabic, English

Language ID(s):cmn, arb, eng
License(s):LDC User Agreement for Non-Members
Online Documentation:LDC2006T06 Documents
Licensing Instructions:Subscription & Standard Members, and Non-Members
Citation:Walker, Christopher, et al. ACE 2005 Multilingual Training Corpus LDC2006T06. Web Download. Philadelphia: Linguistic Data Consortium, 2006.

Introduction

ACE 2005 Multilingual Training Corpus contains the complete set of English, Arabic and Chinese training data for the 2005 Automatic Content Extraction (ACE) technology evaluation. The corpus consists of data of various types annotated for entities, relations and events by the Linguistic Data Consortium (LDC) with support from the ACE Program and additional assistance from LDC.

The objective of the ACE program was to develop automatic content extraction technology to support automatic processing of human language in text form.

In November 2005, sites were evaluated on system performance in five primary areas: the recognition of entities, values, temporal expressions, relations, and events. Entity, relation and event mention detection were also offered as diagnostic tasks. All tasks with the exception of event tasks were performed for three languages, English, Chinese and Arabic. Events tasks were evaluated in English and Chinese only. This release comprises the official training data for these evaluation tasks.
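
The task and language coverage described above can be written down as a small lookup. This is only an illustrative sketch: the names `PRIMARY_TASKS`, `LANGUAGES`, and `evaluated_languages` are made up for this example and are not part of any ACE tooling.

```python
# Hypothetical lookup built only from the paragraph above.
LANGUAGES = ("English", "Chinese", "Arabic")
PRIMARY_TASKS = (
    "entities",
    "values",
    "temporal expressions",
    "relations",
    "events",
)


def evaluated_languages(task):
    """Return the languages in which a primary task was evaluated in 2005."""
    if task not in PRIMARY_TASKS:
        raise ValueError(f"unknown primary task: {task!r}")
    # Event tasks were evaluated in English and Chinese only.
    if task == "events":
        return ("English", "Chinese")
    return LANGUAGES


for task in PRIMARY_TASKS:
    print(task, "->", ", ".join(evaluated_languages(task)))
```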

For more information about linguistic resources for the ACE Program, including annotation guidelines, task definitions and other documentation, see LDC's ACE website.

Data

Below is information about the amount of data in this release and its annotation status.

  • 1P: data subject to first pass (complete) annotation
  • DUAL: data also subject to dual first pass (complete) annotation
  • ADJ: data also subject to discrepancy resolution/adjudication
  • NORM: data also subject to TIMEX2 normalization

--------------------- 
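Explanation of 1P, DUAL, ADJ, and NORM (source: https://blog.csdn.net/taolusi/article/details/80812597, author: taolusi)

The adj, fp1, fp2, and timex2norm folders each correspond to a different annotation pass. For every task, the ACE data was annotated by two annotators working independently. The first annotation pass is called 1P, and the independent, duplicate first pass is called DUAL. In both 1P and DUAL, a single annotator completes all tasks for a given file. Files are assigned through the automated Annotation Work-flow System (AWS), and the assignment is double-blind. Note: 1P and DUAL are stored in the folders 'fp1' and 'fp2' respectively, so 1P corresponds to fp1 and DUAL corresponds to fp2. Discrepancies between the 1P and DUAL versions of each file are adjudicated by a senior annotator or team leader, which yields a high-quality gold standard file. These adjudicated gold standard files are called ADJ (the adj folder mentioned above). After adjudication, TIMEX2 values are normalized to produce NORM. All datasets in this corpus have been annotated through NORM.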
--------------------- 
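
The folder naming described in the quoted explanation can be captured in a small mapping. The sketch below counts the files found in each stage folder under one directory. The function name `count_files_by_stage` and the idea of passing a single per-source directory are assumptions made for illustration; the actual directory layout of the release is not described in this post.

```python
from pathlib import Path

# Stage label -> folder name, as given in the explanation above.
STAGE_FOLDER = {
    "1P": "fp1",
    "DUAL": "fp2",
    "ADJ": "adj",
    "NORM": "timex2norm",
}


def count_files_by_stage(source_dir):
    """Count regular files in each stage folder directly under source_dir.

    source_dir is a hypothetical path to one data-source directory; the
    layout above it is an assumption. Missing stage folders count as 0.
    """
    base = Path(source_dir)
    counts = {}
    for stage, folder in STAGE_FOLDER.items():
        stage_dir = base / folder
        if stage_dir.is_dir():
            counts[stage] = sum(1 for p in stage_dir.iterdir() if p.is_file())
        else:
            counts[stage] = 0
    return counts
```

These are raw file counts. A single document may be stored as several files, so the numbers need not match the document counts in the tables below.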

 

English

                    words                           files
             1P    DUAL     ADJ    NORM      1P    DUAL     ADJ    NORM
NW        60658   57807   33459   48399     128     124      81     106
BN        59239   58144   52444   55967     239     234     217     226
BC        46612   46110   33874   40415      68      67      52      60
WL        45210   43648   35529   37897     127     122     114     119
UN        45161   44473   26371   37366      58      57      37      49
CTS       47003   47003   34868   39845      46      46      34      39
Total    303833  297185  216545  259889     666     650     535     599
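
To get a feel for how much of the first-pass English data reached a later stage, the word counts can be turned into per-source shares. This is a minimal sketch: the counts are copied from the English table, while the names `ENGLISH_WORDS`, `STAGES`, and `share_of_first_pass` are invented for this example.

```python
# Word counts per source from the English table: (1P, DUAL, ADJ, NORM).
ENGLISH_WORDS = {
    "NW": (60658, 57807, 33459, 48399),
    "BN": (59239, 58144, 52444, 55967),
    "BC": (46612, 46110, 33874, 40415),
    "WL": (45210, 43648, 35529, 37897),
    "UN": (45161, 44473, 26371, 37366),
    "CTS": (47003, 47003, 34868, 39845),
}
STAGES = ("1P", "DUAL", "ADJ", "NORM")


def share_of_first_pass(rows, stage):
    """Return {source: stage_count / 1P_count} for every row."""
    idx = STAGES.index(stage)
    return {src: counts[idx] / counts[0] for src, counts in rows.items()}


for src, share in share_of_first_pass(ENGLISH_WORDS, "ADJ").items():
    print(f"{src}: {share:.1%} of 1P words also have ADJ annotation")
```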

Chinese

Note: Chinese data is expressed in terms of characters. We assume a correspondence of roughly 1.5 characters/word.

                             chars                   files
                          1P    DUAL     ADJ      1P    DUAL     ADJ
NW (newswire)         127319  124175  121797     248     242     238
BN (broadcast news)   134963  133696  120513     332     328     298
WL (weblog)            71839   68063   65681     107     101      97
Total                 334121  325834  307991     687     671     633
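
The note's ratio of roughly 1.5 characters per word gives a rough way to compare the Chinese character counts with the word counts reported for the other languages. The sketch below applies that ratio to the 1P column; the names `CHARS_PER_WORD`, `CHINESE_CHARS_1P`, and `approx_words` are hypothetical, and the results are estimates only.

```python
CHARS_PER_WORD = 1.5  # approximate ratio stated in the note above

# 1P character counts per source from the Chinese table.
CHINESE_CHARS_1P = {"NW": 127319, "BN": 134963, "WL": 71839}


def approx_words(chars, chars_per_word=CHARS_PER_WORD):
    """Estimate a word count from a character count (rounded to an integer)."""
    return round(chars / chars_per_word)


for src, chars in CHINESE_CHARS_1P.items():
    print(f"{src}: about {approx_words(chars)} words")
```

Because the ratio is itself approximate, the estimates are best read as orders of magnitude rather than exact word counts.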

Arabic

                words                   files
             1P    DUAL     ADJ      1P    DUAL     ADJ
NW        61287   56158   53026     239     226     221
BN        29259   27165   26907     134     128     127
WL        21687   20181   20181      60      55      55
Total    112233  103504  100114     433     409     403

Samples

For examples of the data in this publication, please review the following samples:
