ACE 2005 File Formats


taolusi

Category: NLP. Tags: ACE 2005, ACE05, relation extraction, corpus

Article link: https://blog.csdn.net/taolusi/article/details/80812597


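If you have not obtained the ACE corpus yet, I do not recommend reading this post: without the corpus in front of you, some parts will be hard to follow, although I have tried to write it in as friendly a way as I can.

I need the ACE 2005 corpus for relation extraction, so I am recording the relevant information here, including the contents and format of each file, so that newcomers can get to know the corpus more quickly. The content below is mainly translated from material by William Cohen's group at CMU (a group I am not familiar with) and from the documentation of ACE 2007 Spanish DevTest - Pilot Evaluation (I did not read the LDC website carefully at first, so I found this by Googling keywords, but it turned out to be quite useful). Here is also the page for the ACE 2005 Multilingual Training Corpus, which links to its Online Documentation. Finally, here is a GitHub project for preprocessing the corpus: ACE 2005 Corpus Preprocessing.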

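The ACE 2005 dataset is annotated for the following basic tasks: the recognition of entities, values, temporal expressions, relations and events. For more detail on the ACE05 evaluation, see The ACE 2005 (ACE05) Evaluation Plan.

The data come from a variety of sources and cover three languages: Arabic, Chinese and English.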

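Detailed statistics for the training portion of the ACE 2005 corpus are shown in the following figure:

The English source categories in the figure should correspond to the bn, bc, nw, wl, un and cts folders under the corpus's English folder; the Arabic sources correspond to the bn, nw and wl folders under Arabic; and the Chinese sources correspond to the bn, nw and wl folders under Chinese.

Each of these folders in turn contains the adj, fp1, fp2 and timex2norm folders and a Filelist file. (The folders under Arabic and Chinese have no timex2norm folder. Since I only use the English data, I have not looked into why; if you know, please tell me.)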

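The adj, fp1, fp2 and timex2norm folders correspond to different stages of the annotation process. Every ACE file is annotated for all tasks by two annotators working independently. The first-pass annotation is called 1P, and the independent second first-pass annotation is called DUAL. For both 1P and DUAL, a single annotator completes all tasks for a file. Files are assigned through an automated Annotation Work-flow System (AWS), and the assignment is double-blind. (I translated this paragraph rather loosely and am not sure I got every detail right.)

Note: 1P and DUAL are stored in the 'fp1' and 'fp2' folders, i.e. 1P corresponds to fp1 and DUAL corresponds to fp2.

Discrepancies between the 1P and DUAL versions of each file are adjudicated by a senior annotator or team leader, which yields a high-quality gold standard file. The adjudicated gold standard file is called ADJ (the adj folder mentioned above). After adjudication, the TIMEX2 values are normalized, which yields NORM. All data in this corpus have gone through NORM annotation.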

The whole annotation process can be pictured as follows:

1P: entities         DUAL: entities
    TIMEX2 extents         TIMEX2 extents

        |                    |
        |                    |
        |____________________|
                  |
                  |
                  |
                  V
             ADJ: entities
                  TIMEX2 extents
                  |
                  |
                  |
                  V
             NORM: TIMEX2 normalization

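In the fp1, fp2, adj and timex2norm folders, for a given document, we find the .sgm source file along with the .ag.xml and .apf.xml annotation files.

In other words, for each news story, each of these folders contains an identical copy of the source text (.sgm file) plus a different version of the associated annotations (.ag.xml, .apf.xml and .tab files). Note that in many cases, for a given source text, two annotation stages produce identical output because the later stage made no changes.

The Filelist file contains word counts and annotation status for each file.

The paths of the fully annotated files and their corresponding source files are: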

    */timex2norm/*sgm
    */timex2norm/*apf.xml
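
The two path patterns above can be turned into a small file-pairing helper. This is only a sketch: `find_annotated_pairs` is a name made up for this post, and it assumes that each source file `X.sgm` sits next to its annotation `X.apf.xml` in the same timex2norm folder, so check this against your own copy of the corpus.

```python
from pathlib import Path


def find_annotated_pairs(lang_dir):
    """Return (sgm_path, apf_path) pairs found under <lang_dir>/*/timex2norm/.

    Assumption: X.sgm and X.apf.xml share the same base name and folder.
    """
    pairs = []
    for sgm in sorted(Path(lang_dir).glob("*/timex2norm/*.sgm")):
        # Path.stem drops only the last suffix, so dots inside the
        # document ID are preserved.
        apf = sgm.with_name(sgm.stem + ".apf.xml")
        if apf.exists():
            pairs.append((sgm, apf))
    return pairs
```

For example, `find_annotated_pairs("English")` would walk bn, bc, nw and the other source folders and skip any .sgm file that has no matching .apf.xml.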

Next comes the content format of each file type. For most users, the most important files are the .sgm files and the .apf.xml files.

Source Text (.sgm) Files

      - These files contain the source text data in an SGML format; they
        use UTF-8 encoding and UNIX-style line termination.

   AG (.ag.xml) Files

      - These are annotation files created with the LDC's annotation
        toolkit.  These files have been converted to the corresponding
        .apf.xml files.
        
   ACE Program Format (APF) (.apf.xml) Files

      - These files are in the official ACE annotation file format. ACE 
        format is derived by means of a routine format conversion process,
        so that the underlying annotation content of the two files is 
        equivalent.  See section 8 for more details.

   ID table (.tab) Files

      - These files store mapping tables between the IDs used in the
        ag.xml files and their corresponding apf.xml files.

Some notes on APF (quoted in the original English from the documentation):

 - Offsets

     APF uses the offset counting method traditionally used in previous
     ACE evaluation programs: 

       1) Each (UTF-8) character, not byte, is counted as one.  

       2) Each newline character is counted as one.  (The .sgm files
          use the UNIX-style end of line characters.)

       3) SGML tags are *not* counted towards offsets.  (Please note
          that the AG files included in this release do count SGML tags in
          offsets.)

       4) SGML entities are counted in terms of each character in the
          entities.  For example, "&amp;" is counted as five
          characters, not as one character.
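
The four offset rules above can be sketched in a few lines of Python. The helper names `apf_text` and `charseq_text` are made up for illustration. The sketch assumes that a simple `<...>` regular expression is enough to remove the SGML tags, and that END is inclusive, which is consistent with the "United States" example later in these notes (4254 - 4242 + 1 = 13 characters).

```python
import re

# Assumption: every SGML tag matches <...> and no bare "<" occurs in the text.
TAG_RE = re.compile(r"<[^>]+>")


def apf_text(sgm_source):
    """Strip SGML tags only (rule 3).

    Newlines are kept (rule 2) and entities such as &amp; are left
    unexpanded, so each of their characters is counted (rule 4).
    Python strings index characters, not bytes (rule 1), provided the
    file was read with encoding="utf-8".
    """
    return TAG_RE.sub("", sgm_source)


def charseq_text(sgm_source, start, end):
    """Return the text covered by a charseq, treating END as inclusive."""
    return apf_text(sgm_source)[start:end + 1]
```

A quick sanity check on real data is to compare `charseq_text(source, START, END)` with the text stored inside the corresponding charseq element of the .apf.xml file; if they differ, one of the assumptions above does not hold for that file.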

   - TIMEX2

     The timex2 element represents TIMEX2 time expression annotations.
     Its optional attributes, such as "VAL" and "MOD", represent the
     TIMEX2 normalization values. 

   - TYPE, LDCTYPE and LDCATR in entity_mention

     The TYPE attribute in entity_mention stores the official ACE entity
     mention type, and the LDCTYPE and LDCATR attributes store the
     attributes used in the LDC's annotation process.
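
As a rough illustration of reading these attributes, the sketch below parses a tiny hand-written APF-like fragment with Python's standard xml.etree.ElementTree. The fragment, its IDs and the function name `read_entity_mentions` are invented for this example, and the element nesting (entity > entity_mention > head > charseq) is an assumption that should be checked against the DTD.

```python
import xml.etree.ElementTree as ET

# Hand-written fragment for illustration only; not copied from the corpus.
SAMPLE_APF = """<source_file URI="sample.sgm">
  <document DOCID="sample">
    <entity ID="sample-E1" TYPE="GPE">
      <entity_mention ID="sample-E1-1" TYPE="NAM" LDCTYPE="NAM">
        <extent><charseq START="4242" END="4254">United States</charseq></extent>
        <head><charseq START="4242" END="4254">United States</charseq></head>
      </entity_mention>
    </entity>
  </document>
</source_file>"""


def read_entity_mentions(root):
    """Collect TYPE, LDCTYPE, LDCATR and head offsets for each entity_mention."""
    rows = []
    for entity in root.iter("entity"):
        for mention in entity.iter("entity_mention"):
            head = mention.find("head/charseq")
            if head is None:
                continue  # skip mentions without a head span
            rows.append({
                "entity_id": entity.get("ID"),
                "mention_type": mention.get("TYPE"),
                "ldc_type": mention.get("LDCTYPE"),
                "ldc_atr": mention.get("LDCATR"),  # None when absent
                "start": int(head.get("START")),
                "end": int(head.get("END")),
                "text": head.text,
            })
    return rows


mentions = read_entity_mentions(ET.fromstring(SAMPLE_APF))
```

For a file on disk, `ET.parse(path).getroot()` can be passed in place of `ET.fromstring(SAMPLE_APF)`.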

   - Name in entity_attributes

     The "name" element in entity_attributes stores the heads of
     "NAM"-type mentions as in the previous years.  In response to
     George Doddington's request, we have added the NAME attribute to
     the "name" element.  The NAME attribute stores slightly normalized
     versions of the names where:

     - \n is replaced with a space
     - multiple spaces are reduced to one space
     - " (double quote) is removed

     - Example:

     <entity_attributes>
        <name NAME="United States">
           <charseq START="4242" END="4254">United
     States</charseq>
        </name>
     </entity_attributes>
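
The three normalization rules for the NAME attribute can be reproduced with a short function. This is a sketch of the rules as listed, applied in the listed order; the function name `normalize_name` is made up here, and the corpus tooling may differ in details such as the handling of tabs or leading spaces.

```python
import re


def normalize_name(raw):
    """Apply the three listed rules, in the listed order."""
    name = raw.replace("\n", " ")       # \n is replaced with a space
    name = re.sub(r" {2,}", " ", name)  # multiple spaces become one space
    name = name.replace('"', "")        # double quotes are removed
    return name
```

Applied to the charseq text in the example above, where "United" and "States" are separated by a line break plus indentation, this gives the NAME value "United States".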

   - Nickname metonymy

     Nickname metonyms are indicated with METONYMY_MENTION="TRUE" in
     entity_mentions.  "NAM"-type entity mentions marked as nickname
     metonymy do not give rise to name elements.

   - Cross-type metonymy

     "Cross-type" metonyms are represented with relations of the type
     METONYMY.  The METONYMY type relations do not have
     relation_mentions.  The METONYMY type relations are automatically
     generated after the annotation process, and are the only kind of
     relation annotations that appear in this corpus.

   - For more details, please refer to the APF V5.1.2 DTD.
