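What Is Entity Linking for Knowledge Graphs? A Survey from the Ground Up


Author | 尼古拉·瓦砾

Source | Paperweekly (ID: paperweekly)

[Introduction] The world is full of structured data (wikis) and unstructured data (the web), yet integrating the two effectively remains a very hard problem.

This article introduces Entity Linking (EL). It starts with the basic concepts and then gives an overview of the three main modules of EL. On that basis, it picks three representative papers and explains their core methods and ideas in detail.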

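Getting Started with EL


1. Task Definition

Entity linking is the task of linking mentions in text to entities in a KG, as shown in the figure below [1]:

Illustration of Entity Linking

Some readers may not be familiar with knowledge graphs, so here are a few concepts commonly used in this area.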

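  • Knowledge Graph: a semantic network that describes the concepts and entities of the real world and the relations between them; sometimes also called a Knowledge Base (KB).

    • A graph is made of triples: <entity 1, relation, entity 2> or <entity, attribute, attribute value>;

    • For example: <Yao Ming, plays-in, NBA>, <Yao Ming, height, 2.29m>;

    • Common KBs include Wikidata, DBpedia and YAGO.

  • Entity: the basic unit of a knowledge graph, and also an important linguistic unit that carries information in text.

  • Mention: a span of natural-language text that refers to an entity.

Looking back at the figure above, the blue spans such as "Jordan", "United States" and "NBA" are mentions, and the boxes their arrows point to are the corresponding entities in the graph.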
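
As a small illustration of the triple structure described above, the two example triples can be held in plain Python tuples and indexed by subject. This is only a sketch: the `Triple` type and the `index_by_subject` helper are hypothetical names chosen here, not part of any KB toolkit.

```python
from collections import defaultdict
from typing import NamedTuple


class Triple(NamedTuple):
    """One <subject, predicate, object> fact (hypothetical helper type)."""
    subject: str
    predicate: str
    obj: str


def index_by_subject(triples):
    """Group (predicate, object) pairs under their subject entity."""
    index = defaultdict(list)
    for t in triples:
        index[t.subject].append((t.predicate, t.obj))
    return dict(index)


# The two example facts from the text: one relation triple, one attribute triple.
TRIPLES = [
    Triple("Yao Ming", "plays-in", "NBA"),
    Triple("Yao Ming", "height", "2.29m"),
]
KG = index_by_subject(TRIPLES)
```

Looking up `KG["Yao Ming"]` then returns both the relation and the attribute attached to that entity.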


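2. Applications

What is EL good for? Wherever there is a KB, EL is usually needed. Here are a few applications [2]:

  1. Question Answering: EL is a hard requirement for KBQA, because the graph database can only be queried after the mention has been linked to an entity;

  2. Content Analysis: public-opinion analysis, content recommendation, reading enhancement;

  3. Information Retrieval: search engines built on semantic entities. When you search for certain entities on Google, a Wikipedia panel appears on the right;

  4. Knowledge Base Population: expanding the knowledge base and updating entities and relations.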


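3. Taxonomy

Taxonomy

Broadly, EL work falls into two categories [3]:

  • End-to-End: first extract entity mentions from the text (i.e. NER) and map them to candidate entities, then disambiguate the candidates and link them to the given KB.

  • Linking-Only: compared with the first approach, this skips the first step. It takes the text and the mentions directly as input, finds candidate entities, disambiguates them and links them to the given KB.

Because there is relatively little end-to-end work, and NER itself leaves little to discuss here, this article focuses on Linking-Only techniques and papers.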

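The Three Modules of EL

EL is very challenging, mainly for two reasons:

  1. Mention Variations: the same entity has many different mentions. (<Kobe Bryant>: 小飞侠, 黑曼巴 (Black Mamba), 科铁, 蜗壳, 老科.)

  2. Entity Ambiguity: the same mention can refer to different entities. ("苹果" (apple): in "中关村苹果不错" it means the Apple products sold in Zhongguancun, while in "山西苹果不错" it means the fruit grown in Shanxi.)

To address these two problems, EL systems generally use two modules, Candidate Entity Generation (CEG) and Entity Disambiguation (ED) [2]:

  1. Candidate Entity Generation: starting from the mention, find all possible entities in the KB and collect them into the candidate entity set;

  2. Entity Disambiguation: from the candidate entities, choose the most likely one as the predicted entity.

The following sections look at what these two modules contain. CEG methods are fairly simple, so the focus here is on ED.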

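1. Candidate Entity Generation (CEG)

  • The most important method: a Name Dictionary ({mention: entity})

  • Which aliases: acronyms, fuzzy matches, nicknames, misspellings, etc.

  • How to build it:

    • Wikipedia (redirect pages, disambiguation pages, hyperlinks);

    • Search engines: call the Google API and search for the mention. If a wiki entity appears in the top m results, add a mapping;

    • Heuristic methods;

    • Manual annotation and user logs.

For CEG, the most widely used and most effective method is the Name Dictionary, which in plain terms means configuring aliases. Simple as it is, CEG is the first gate of the EL pipeline, so it matters a great deal. Only when every entity has a compact yet sufficient set of aliases can the generated candidate entities avoid missing the ground-truth entity.

Which aliases to configure and which construction method to use usually depend on where EL is applied. Encyclopedia question answering or reading enhancement for general text relies heavily on Wikipedia and search engines. A specific industry domain instead requires heuristics, user logs, web crawling or even manual annotation to build the Name Dictionary.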
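
The Name Dictionary idea above can be sketched in a few lines. This is a minimal illustration, assuming aliases have already been collected as (alias, entity id) pairs; the entity ids such as `Q_KOBE` and the `normalize` rule (lowercasing and whitespace collapsing) are hypothetical choices, and a production system would add fuzzy matching and acronym handling.

```python
from collections import defaultdict


def normalize(mention: str) -> str:
    """Lowercase and collapse whitespace so trivial variants share one key."""
    return " ".join(mention.lower().split())


def build_name_dictionary(alias_pairs):
    """Map each normalized alias to the set of entity ids it may refer to."""
    name_dict = defaultdict(set)
    for alias, entity_id in alias_pairs:
        name_dict[normalize(alias)].add(entity_id)
    return name_dict


def generate_candidates(name_dict, mention):
    """Return the candidate entity ids for a mention (empty list if unknown)."""
    return sorted(name_dict.get(normalize(mention), set()))


# Hypothetical alias pairs, e.g. harvested from redirects or anchor texts.
ALIAS_PAIRS = [
    ("Kobe Bryant", "Q_KOBE"),
    ("Black Mamba", "Q_KOBE"),
    ("Black Mamba", "Q_SNAKE"),
    ("Apple", "Q_APPLE_INC"),
    ("Apple", "Q_APPLE_FRUIT"),
]
NAME_DICT = build_name_dictionary(ALIAS_PAIRS)
```

Here `generate_candidates(NAME_DICT, "black mamba")` yields both the player and the snake, which is exactly the ambiguity the ED module then has to resolve.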

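2. Entity Disambiguation (ED) (the key part)

  • Features

    • Context-Independent Features:

      • LinkCount: #(m->e), the number of times mention m links to entity e in the knowledge base;

      • Entity Attributes: Popularity, Type;

    • Context-Dependent Features:

      • Textual Context: BOW, Concept Vector;

      • Coherence Between Entities: WLM, PMI, Jaccard Distance.

Choosing the right features for the scenario is very important in entity disambiguation. Overall, the features divide into context-independent and context-dependent ones.

The context-independent features are the LinkCount from mention to entity and attributes of the entity itself (such as popularity and type). LinkCount acts as prior knowledge and is often very useful for disambiguation. For example, when someone asks "How tall is Yao Ming?", they are most likely asking about <Yao Ming the basketball player> rather than some little-known person of the same name. Although the context contains no information about basketball at all, choosing <Yao Ming the basketball player> because it has the highest LinkCount for "Yao Ming" gives a good answer in most cases.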
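
The LinkCount prior described above can be sketched as a normalized count table. The counts below are made-up illustrative numbers, and `prior` and `most_linked_entity` are hypothetical helper names.

```python
from collections import Counter

# Hypothetical LinkCount table: how often each mention string links to each entity.
LINK_COUNTS = {
    "Yao Ming": Counter({
        "Yao Ming (basketball player)": 980,
        "Yao Ming (other person)": 20,
    }),
}


def prior(mention, link_counts):
    """Normalize LinkCount into a prior P(e | m); empty dict for unseen mentions."""
    counts = link_counts.get(mention, {})
    total = sum(counts.values())
    if total == 0:
        return {}
    return {entity: count / total for entity, count in counts.items()}


def most_linked_entity(mention, link_counts):
    """Context-free baseline: pick the entity with the highest prior, or None."""
    probabilities = prior(mention, link_counts)
    if not probabilities:
        return None
    return max(probabilities, key=probabilities.get)
```

This baseline ignores the context entirely, which is why it answers the "How tall is Yao Ming?" example well but fails whenever the less popular entity is the intended one.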

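The context-dependent features are the textual context and the coherence between entities. There is a lot to explore here. For textual context, deep learning methods can model the semantics of the text and use it for disambiguation. Coherence between entities is more interesting: because none of the mentions in a text has been resolved yet, jointly disambiguating all entities is in fact an NP-hard problem. How to exploit coherence features quickly and effectively is therefore a very interesting direction.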
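
Two of the coherence measures listed above can be computed directly from the sets of pages that link to each entity. The sketch below assumes such inlink sets are available; it implements Jaccard similarity (the Jaccard distance is one minus this value) and the Wikipedia Link-based Measure (WLM) in its usual Milne and Witten form. The function names and the clamping to zero are choices made for this example.

```python
import math


def jaccard_similarity(links_a, links_b):
    """|A ∩ B| / |A ∪ B| over two inlink sets; 0.0 if both are empty."""
    if not links_a and not links_b:
        return 0.0
    return len(links_a & links_b) / len(links_a | links_b)


def wlm_relatedness(links_a, links_b, num_pages):
    """Wikipedia Link-based Measure between two entities.

    links_a, links_b: sets of page ids linking to each entity.
    num_pages: total number of pages in the KB (must exceed both set sizes).
    """
    common = len(links_a & links_b)
    if common == 0:
        return 0.0  # no shared inlinks: treat as unrelated
    larger = max(len(links_a), len(links_b))
    smaller = min(len(links_a), len(links_b))
    if num_pages <= smaller:
        raise ValueError("num_pages must be larger than the inlink sets")
    distance = (math.log(larger) - math.log(common)) / (
        math.log(num_pages) - math.log(smaller)
    )
    return max(0.0, 1.0 - distance)  # clamp so the score stays in [0, 1]
```

Both scores are symmetric and cheap to compute per pair; the expensive part of global disambiguation is the number of entity combinations they have to be evaluated on.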

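Based on these common features, disambiguation methods fall roughly into the following groups:

  • Learning to Rank Methods: point-wise, pair-wise, list-wise. Because the ground truth in ED is a single entity, point-wise ranking is the usual choice. The input is the text context, the mention and some attributes of a candidate entity; the output is the confidence that the mention refers to that entity. Candidates are ranked by this confidence and the most credible entity is selected;

  • Probabilistic Methods: incorporate heterogeneous knowledge into a probabilistic model. Different sources of information are combined into the conditional probability $P(e \mid m, c)$, where c is the input text, e is the entity and m is the mention. For example, the normalized LinkCount can serve as the prior $P(e \mid m)$;

  • Graph-Based Approaches: maximize coherence between entities. These use graph features (entity embeddings, relations) so that disambiguation takes the global coherence of the resulting entities into account.

In practice, current ED work usually combines these approaches. The recent papers presented later can be read against these three groups.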

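3. One More Issue: Unlinkable Mention Prediction

Besides the two main modules there is one more small issue: how to reject unknown entities, since no KB can enumerate everything in the world. This is Unlinkable Mention Prediction. It is not complicated and is usually done in one of three ways:

  • NIL Threshold: filter with a confidence threshold;

  • Binary Classification: train a binary classifier that decides whether the top-ranked entity really is the entity the mention refers to;

  • Rank with NIL: add a NIL entity to the candidate entities during ranking.

A threshold is usually enough, and this is rarely a big problem. But if the concrete scenario is KB Population and the KB's entity coverage is still poor, it deserves special attention.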
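
The NIL Threshold option above amounts to a single comparison after ranking. A minimal sketch, assuming the ED model outputs one confidence per candidate; the function name and the example threshold are illustrative and would be tuned on held-out data.

```python
NIL = None  # sentinel meaning "no matching entity in the KB"


def link_with_nil_threshold(candidate_scores, threshold):
    """Return the top-ranked entity, or NIL if its confidence is below threshold.

    candidate_scores: dict mapping entity id -> confidence from the ED model.
    """
    if not candidate_scores:
        return NIL  # no candidates were generated for this mention
    best = max(candidate_scores, key=candidate_scores.get)
    return best if candidate_scores[best] >= threshold else NIL
```

Raising the threshold trades recall for precision, which is the lever to watch in the KB Population scenario mentioned above.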


Recent Work on EL

To give readers a clearer picture of EL, here are three representative papers published in 2017 and 2018 [4] [5] [6]:

  1. Deep Joint Entity Disambiguation with Local Neural Attention. (Ganea and Hofmann, 2017, EMNLP)

  2. Improving entity linking by modeling latent relations between mentions. (Le et al., 2018, ACL)

  3. DeepType: multilingual entity linking by neural type system evolution. (Raiman et al., 2018, AAAI)


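1. Deep Joint Entity Disambiguation with Local Neural Attention

Early EL work relied heavily on manually designed features. This paper is one of the first in EL to drop feature engineering and learn the basic features with deep learning. It has three main contributions and key components:

  • Entity Embeddings: embeddings of the entities in the knowledge base;

  • Context Attention: an attention mechanism that produces the representation of the context;

  • Collective Disambiguation: joint disambiguation that takes the coherence between entities into account.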

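Given a document $D$ containing mentions $m_1, \dots, m_n$, let mention $m_i$ have entity $e_i$, context $c_i$ and candidate entity set $\Gamma(m_i)$. The paper proposes two models, a local model and a global model. The local model only considers the context of each mention; the global model also considers the coherence between entities and disambiguates them jointly.

  • Local model: let $\Psi(e_i, c_i)$ be the local score function. The local model solves:

    $$e_i^* = \arg\max_{e \in \Gamma(m_i)} \Psi(e, c_i)$$

  • Global model: besides the context, it also considers the coherence between entities (restricted to pairwise coherence for simplicity). Let $\Phi(e_i, e_j)$ be the pairwise coherence score function between entities and $\mathbf{e} = (e_1, \dots, e_n)$. The global search is:

    $$\mathbf{e}^* = \arg\max_{\mathbf{e} \in \Gamma(m_1) \times \dots \times \Gamma(m_n)} \sum_{i=1}^{n} \Psi(e_i, c_i) + \sum_{i \neq j} \Phi(e_i, e_j)$$

The local and pairwise score functions are computed as follows:

$$\Psi(e_i, c_i) = \mathbf{x}_{e_i}^\top \mathbf{B}\, f(c_i), \qquad \Phi(e_i, e_j) = \frac{2}{n-1}\, \mathbf{x}_{e_i}^\top \mathbf{C}\, \mathbf{x}_{e_j}$$

Here $\mathbf{x}_e$ is the embedding of entity $e$, and $\mathbf{B}$ and $\mathbf{C}$ are diagonal matrices. $f(c)$ is the representation obtained by applying attention over the context $c$. The attention computation is shown in the figure below: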

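The attention mechanism in the local model

Here the candidate entity embeddings act as keys and the context word embeddings as values. Once the score matrix is computed, the maximum is taken column-wise, that is, for each word over all candidate entities. A high score for a word means it is strongly related to at least one candidate entity. To remove the influence of stop words, the authors keep only the top R scores and set the rest to negative infinity.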
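
The attention step described above can be sketched in plain Python. This is a simplified illustration, not the authors' implementation: it assumes small hand-written embedding vectors, diagonal matrices stored as vectors (`diag_a` for the attention scores, `diag_b` for the final local score), and a softmax over the kept words. All names are hypothetical.

```python
import math


def diag_bilinear(x, diag, y):
    """Compute x^T diag(d) y for a diagonal matrix stored as a vector."""
    return sum(xi * di * yi for xi, di, yi in zip(x, diag, y))


def local_attention_scores(entity_vecs, word_vecs, diag_a, diag_b, top_r):
    """Return (attention weights per word, local score per candidate entity).

    Requires at least one candidate, one context word and top_r >= 1.
    """
    # Support of each word: its best match over all candidate entities.
    support = [
        max(diag_bilinear(e, diag_a, w) for e in entity_vecs) for w in word_vecs
    ]
    # Hard pruning: keep the top_r words; the rest act as -inf (weight 0).
    kept = sorted(range(len(word_vecs)), key=lambda i: support[i], reverse=True)[:top_r]
    peak = max(support[i] for i in kept)
    exps = {i: math.exp(support[i] - peak) for i in kept}
    total = sum(exps.values())
    weights = [exps.get(i, 0.0) / total for i in range(len(word_vecs))]
    # Local score: attention-weighted bilinear match between entity and context.
    scores = [
        sum(weights[i] * diag_bilinear(e, diag_b, word_vecs[i]) for i in kept)
        for e in entity_vecs
    ]
    return weights, scores


# Toy 2-d embeddings: axis 0 ~ "basketball", axis 1 ~ "geography".
ENTITY_VECS = [[1.0, 0.0], [0.0, 1.0]]               # player, country
WORD_VECS = [[1.0, 0.0], [0.1, 0.1], [0.9, 0.1]]     # "dunk", "the", "NBA"
WEIGHTS, SCORES = local_attention_scores(
    ENTITY_VECS, WORD_VECS, [1.0, 1.0], [1.0, 1.0], top_r=2
)
```

With `top_r=2` the uninformative word receives zero weight, and the basketball-like candidate ends up with the higher local score.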

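Once the scores are obtained, they are combined with the LinkCount prior of m linking to e to compute the final probability of each entity. This work achieved the state of the art at the time on the AIDA dataset (local: 88.8, global: 92.22).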
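
To make the cost of the global model concrete, the sketch below enumerates every joint assignment and keeps the one with the best sum of local and pairwise scores. It is only an illustration of the objective: the search space is the product of all candidate sets, so it grows exponentially with the number of mentions, and the paper itself relies on approximate inference (loopy belief propagation) rather than enumeration. The scores and names are made up.

```python
from itertools import product


def exhaustive_global_search(candidates, local_score, pair_score):
    """Try every joint assignment; return (best assignment, best total score)."""
    best_assignment, best_total = None, float("-inf")
    for assignment in product(*candidates):
        # Sum of local scores, one per mention.
        total = sum(local_score(i, e) for i, e in enumerate(assignment))
        # Sum of pairwise coherence over all ordered pairs i != j.
        total += sum(
            pair_score(a, b)
            for i, a in enumerate(assignment)
            for j, b in enumerate(assignment)
            if i != j
        )
        if total > best_total:
            best_assignment, best_total = assignment, total
    return best_assignment, best_total


# Two mentions: "Jordan" (ambiguous) and "NBA" (unambiguous).
CANDIDATES = [["Jordan (player)", "Jordan (country)"], ["NBA"]]
LOCAL = [{"Jordan (player)": 0.5, "Jordan (country)": 0.6}, {"NBA": 0.9}]
COHERENCE = {frozenset({"Jordan (player)", "NBA"}): 1.0}


def local_score(i, entity):
    return LOCAL[i][entity]


def pair_score(a, b):
    return COHERENCE.get(frozenset({a, b}), 0.0)


BEST, TOTAL = exhaustive_global_search(CANDIDATES, local_score, pair_score)
```

In this toy case the local scores alone slightly prefer the country, but coherence with "NBA" flips the joint decision to the player.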


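2. Improving entity linking by modeling latent relations between mentions

The previous paper pioneered the use of entity embeddings in EL. A natural follow-up question is whether the KB holds other information that can be exploited. In the figure at the start of this article, entities such as "Jordan", "United States" and "Nike" are also connected by relations such as "citizen of" and "sponsor". Exploiting this relational information should clearly help.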

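Le et al. therefore extend the work of Ganea and Hofmann with latent relation information. Assume there are K relations in the graph, and let $\alpha_{ijk}$ be the confidence that relation k holds between mentions $m_i$ and $m_j$. The pairwise coherence score function above can then be written as:

$$\Phi(e_i, e_j) = \sum_{k=1}^{K} \alpha_{ijk}\, \mathbf{x}_{e_i}^\top \mathbf{R}_k\, \mathbf{x}_{e_j}, \qquad \alpha_{ijk} = \frac{1}{Z_{ijk}} \exp\left( \frac{f(m_i, c_i)^\top \mathbf{D}_k\, f(m_j, c_j)}{\sqrt{d}} \right)$$

Here $\mathbf{R}_k$ and $\mathbf{D}_k$ are diagonal matrices that represent relation k (similar to an embedding of relation k), $Z_{ijk}$ is the normalization factor, and $f(m_i, c_i)$ is a function that maps $(m_i, c_i)$ to $\mathbb{R}^d$. In this way relation k is added implicitly, which enriches the information used when computing the global coherence between entities.

This looks fancy, but one question remains: how should the normalization factor be computed? The authors offer two options:

  1. Rel-norm: relation-wise normalization, i.e. normalize over the relation index k, so that $\sum_{k} \alpha_{ijk} = 1$;

  2. Ment-norm: mention-wise normalization, i.e. normalize over the mention index j, so that $\sum_{j \neq i} \alpha_{ijk} = 1$.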

Illustration of the two normalization schemes

The figure above makes the difference clear. This work set a new SOTA on the AIDA dataset (global: 93.07).
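
The difference between the two normalization schemes is only the axis along which the softmax runs. The sketch below assumes the unnormalized scores are already given as a nested list `raw[i][j][k]` (mention i, mention j, relation k); the function names are hypothetical, and the diagonal entries i = j are left at zero because a mention is not paired with itself.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def rel_norm(raw):
    """Relation-wise normalization: for each pair (i, j), weights sum to 1 over k."""
    n, k = len(raw), len(raw[0][0])
    return [
        [softmax(raw[i][j]) if i != j else [0.0] * k for j in range(n)]
        for i in range(n)
    ]


def ment_norm(raw):
    """Mention-wise normalization: for each (i, k), weights sum to 1 over j != i.

    Requires at least two mentions.
    """
    n, k = len(raw), len(raw[0][0])
    alpha = [[[0.0] * k for _ in range(n)] for _ in range(n)]
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for r in range(k):
            probs = softmax([raw[i][j][r] for j in others])
            for j, p in zip(others, probs):
                alpha[i][j][r] = p
    return alpha


# Toy scores for 3 mentions and 2 latent relations.
RAW = [
    [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]],
    [[1.0, 0.0], [0.0, 0.0], [0.5, 0.5]],
    [[0.0, 2.0], [0.5, 0.5], [0.0, 0.0]],
]
REL = rel_norm(RAW)
MENT = ment_norm(RAW)
```

Under rel-norm every mention pair must distribute its weight across relations, whereas under ment-norm each relation picks, for a given mention, which other mentions it attends to.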

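3. DeepType: multilingual entity linking by neural type system evolution

The first two papers both approach the problem from the angle of collective disambiguation. DeepType takes a different route and optimizes the type system of the knowledge base. A key observation of the paper is that once the type of the entity behind a mention can be predicted, disambiguation is mostly done. The EL system consists of three modules:

  1. Type System: a set of orthogonal type axes and a type labeling function;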

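    1. Type axis: a set of mutually exclusive types (e.g. $\text{IsHuman} \cap \text{IsPlant} = \emptyset$);

    2. Type labeling function: assigns each entity its type on every axis;

    3. For example, in a type system with the two axes {IsA, Topic}, <追一科技> is labeled {Company, Artificial Intelligence}.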

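  2. Type Classifier: given the mention and the text, output the type of the entity the mention refers to;

  3. Entity Prediction Model: given the mention, the text and the candidate entities, predict the entity with the highest probability. (The paper uses LinkCount directly.)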

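Clearly, the core of these three modules is the construction of the Type System. Since the Entity Prediction Model uses LinkCount directly, the whole EL system has only two groups of parameters: the discrete parameters $\mathcal{A}$ of the Type System and the continuous parameters $\theta$ of the Type Classifier. Given a text $D$ and the mentions it contains, $M = \{(m_i, e_i^{GT}, \mathcal{E}_{m_i})\}$, where $e^{GT}$ is the ground-truth entity and $\mathcal{E}_m$ is the candidate entity set, let $S_{model}(\mathcal{A}, \theta)$ be the disambiguation accuracy of the EL system. The problem can then be defined as:

$$\max_{\mathcal{A}} \max_{\theta} S_{model}(\mathcal{A}, \theta) = \frac{1}{|M|} \sum_{(m, e^{GT}, \mathcal{E}_m) \in M} \mathbb{1}\left[ e^* = e^{GT} \right]$$

where $e^* = \arg\max_{e \in \mathcal{E}_m} \text{EntityScore}(e, m, D, \mathcal{A}, \theta)$. The entity score can be seen as the confidence the EL system assigns given m; the concrete formula is given later.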

Optimizing both groups of parameters jointly is very time-consuming, so the paper optimizes the Type System and the Type Classifier separately.

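a. Discrete Optimization of Type System:

To avoid training the Type System and the Type Classifier at the same time, the classifier is fixed first and a proxy objective $J(\mathcal{A})$ is optimized. Consider two extremes of the classifier model:

  1. Oracle (extremely good): assume the Type Classifier is all-seeing and predicts the type of every mention correctly no matter how $\mathcal{A}$ changes, then picks the entity with the highest LinkCount from the subset of candidates that have that type. Call the resulting accuracy $S_{oracle}$.

  2. Greedy (extremely naive): predict no types and pick the entity with the highest LinkCount from the candidate set. Call the resulting accuracy $S_{greedy}$.

The simplest idea is to solve $\max_{\mathcal{A}} S_{oracle}$ directly, but a real classifier is not all-seeing. If the learnability of the classifier is $l(\mathcal{A})$, a more reasonable objective is:

$$J(\mathcal{A}) = (S_{oracle} - S_{greedy}) \cdot l(\mathcal{A}) + S_{greedy}$$

But how can learnability be computed without retraining for every $\mathcal{A}$? With a softmax multi-class classifier this would be painful, because every change to $\mathcal{A}$ would require training a new classifier. The authors therefore replace the multi-class classifier with binary classifiers and set $l(\mathcal{A}) = \frac{1}{|\mathcal{A}|} \sum_{t \in \mathcal{A}} \text{AUC}(t)$, so the computation only has to be done once at the start, as shown in the figure below:

a is the classifier used while training the type system; b is the classifier actually used afterwards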

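b. Type Classifier

It is simply optimized as in panel b of the figure above; there is nothing more to add.

c. Inference

Once the Type System and the Type Classifier have been trained, the EntityScore mentioned above can be computed: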

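$$\text{EntityScore}(e, m, D, \mathcal{A}, \theta) = P_{Link}(e \mid m) \cdot \prod_{i=1}^{k} \left( 1 - \beta + \beta \cdot P(t_i \mid m, D) \right)$$

where $t_1, \dots, t_k$ are the types of $e$ on the k type axes and $\beta$ is the smoothing coefficient. DeepType set a new SOTA on the AIDA dataset (94.88).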
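
One way to read the EntityScore described above is as the LinkCount prior multiplied by one smoothed type probability per axis. The sketch below assumes exactly that form, with a single smoothing coefficient `beta`; the probabilities are made-up numbers for an ambiguous mention such as "Jaguar", and the function name is hypothetical.

```python
def entity_score(link_prob, type_probs, beta):
    """Prior times a smoothed type probability for each type axis.

    link_prob: LinkCount-based prior of the entity given the mention.
    type_probs: classifier probability of the entity's type on each axis.
    beta: smoothing in [0, 1]; 0 ignores the classifier, 1 trusts it fully.
    """
    score = link_prob
    for p in type_probs:
        score *= (1.0 - beta) + beta * p
    return score


# Assumed toy setting: the prior favours the car brand, but the type
# classifier is confident that the context is about an animal.
CAR = entity_score(link_prob=0.7, type_probs=[0.1], beta=0.9)
ANIMAL = entity_score(link_prob=0.3, type_probs=[0.9], beta=0.9)
```

With `beta=0.9` the animal reading wins despite its lower prior; with `beta=0.0` the score falls back to the LinkCount prior alone.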

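4. Comparing the Three Papers

The first two papers use collective disambiguation, whereas DeepType is in fact a local model. This suggests that adding collective disambiguation to it could bring further gains.


Summary

Entity Linking is a rather complex technical area. During optimization one has to consider the textual information, the information in the KB and the coherence after disambiguation, choose a solution that fits the concrete business scenario, and at the same time avoid brute-force search over an NP-hard discrete optimization problem just to squeeze out accuracy. It is impossible to cover everything in practice, so some trade-off is required. In short, there are four main kinds of features: LinkCount, Context, Attributes and Coherence. Methods abound; use them flexibly.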

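Appendices

  • Datasets:

    • Knowledge Base: Wikipedia, YAGO, DBpedia, Freebase;

    • Supervised EL data:

      • Chinese: CCKS

      • English: TAC KBP 2010 EL, AIDA CoNLL-YAGO

  • A good summary of multi-factor disambiguation: cloud.tencent.com/devel

  • First-place solution for the CCKS entity linking task: github.com/panchunguang

  • The DeepType authors' blog post (with some interactive demos): openai.com/blog/discove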


References

  1. 韩先培, 实体链接:从文本到概念 (Entity Linking: From Text to Concepts). docs.huihoo.com/infoq/b

  2. Wei Shen, Jianyong Wang, Jiawei Han: Entity Linking with a Knowledge Base: Issues, Techniques, and Solutions. dbgroup.cs.tsinghua.edu.cn

  3. Sebastian Ruder: NLP Progress of Entity Linking. github.com/sebastianrud

  4. Ganea and Hofmann, 2017, EMNLP: Deep Joint Entity Disambiguation with Local Neural Attention. arxiv.org/abs/1704.0492

  5. Le et al., 2018, ACL: Improving entity linking by modeling latent relations between mentions. arxiv.org/abs/1804.1063

  6. Raiman et al., 2018, AAAI: DeepType: multilingual entity linking by neural type system evolution. arxiv.org/abs/1802.0102

Original link:

https://zhuanlan.zhihu.com/p/100248426

(*This article is reposted by AI科技大本营; to repost it, please contact the original author.)


TAC KBP Chinese Entity Linking Comprehensive Training and Evaluation Data 2011-2014 LDC2015E17 March 20, 2015 Linguistic Data Consortium 1. Overview Text Analysis Conference (TAC) is a series of workshops organized by the National Institute of Standards and Technology (NIST). TAC was developed to encourage research in natural language processing (NLP) and related applications by providing a large test collection, common evaluation procedures, and a forum for researchers to share their results. Through its various evaluations, the Knowledge Base Population (KBP) track of TAC encourages the development of systems that can match entities mentioned in natural texts with those appearing in a knowledge base and extract novel information about entities from a document collection and add it to a new or existing knowledge base. The goal of Entity Linking is to determine whether or not the entity referred to in each query has a matching entity node in the reference Knowledge Base (KB) (LDC2014T16). If there is a matching node for a query, annotators create a link between the two. If there is not a matching node for a query, the entity is marked as 'NIL' and then clustered with other NIL entities into equivalence classes. For more information, please refer to the Entity Linking section of NIST's 2014 TAC KBP website (2014 was the last year in which the Chinese Entity Linking evaluation was conducted as of the time this package was created) at http://nlp.cs.rpi.edu/kbp/2014/ This package contains all evaluation and training data developed in support of TAC KBP Chinese Entity Linking during the four years since the task's inception in 2011. This includes queries, KB links, equivalence class clusters for NIL entities (those that could not be linked to an entity in the knowledge base), and entity type information for each of the queries. 
The data included in this package were originally released by LDC to TAC KBP coordinators and performers under the following ecorpora catalog IDs and titles: LDC2011E46: TAC 2011 KBP Cross-lingual Sample Entity Linking Queries V1.1 LDC2011E55: TAC 2011 KBP Cross-lingual Training Entity Linking V1.1 LDC2012E34: TAC 2011 KBP Cross-Lingual Evaluation Entity Linking Annotation LDC2012E66: TAC 2012 KBP Chinese Entity Linking Web Training Queries and Annotations LDC2012E103: TAC 2012 KBP Chinese Entity Linking Evaluation Annotations V1.2 LDC2013E96: TAC 2013 KBP Chinese Entity Linking Evaluation Queries and Knowledge Base Links V1.2 LDC2014E47: TAC 2014 KBP Chinese Entity Linking Discussion Forum Training Data LDC2014E83: TAC 2014 KBP Chinese Entity Linking Evaluation Queries and Knowledge Base Links V2.0 2. Contents ./README.txt This file ./data/2011/eval/tac_kbp_2011_chinese_entity_linking_evaluation_queries.xml This file contains 2176 queries. Each query entry consists of the following fields: <query id> - A query ID formatted as the letters "EL_CLCMN_" (if a Chinese language query) or "EL_CLENG_" (if an English language query) plus a five-digit zero-padded, sequentially assigned integer (e.g. "EL_CLCMN_00001"). <name> - The full namestring of the query entity. <docid> - An ID for a document in ./data/2011/eval/source_documents/ from which the namestring was extracted. The queries are distributed by language and type as follows: KB-Link GPE ORG PER Total ---------------------------------------- CMN NW NIL: 120 291 420 831 CMN NW Non-NIL: 279 150 221 650 ENG NW NIL: 90 129 20 239 ENG NW Non-NIL: 93 72 104 269 ENG WB NIL: 16 0 5 21 ENG WB Non-NIL: 44 68 54 166 ---------------------------------------- Total: 624 710 824 2176 ./data/2011/eval/tac_kbp_2011_chinese_entity_linking_evaluation_KB_links.tab This file contains the responses for each query as identified by human annotators at LDC. This file is tab delimited, with 4 fields total. 
The column descriptions are as follows: 1. query ID - The ID for the query detailed in tac_kbp_2011_chinese_entity_linking_evaluation_queries.xml to which the subsequent information pertains 2. entity ID - A unique entity node ID or NIL ID, correspondent to entity linking annotation and NIL-coreference (clustering) annotation respectively. If the entity node ID begins with "E", the text refers to an entity in the Knowledge Base (TAC KBP Reference Knowledge Base - LDC2014T16). If the given query is not linked to an entity in the Knowledge Base (KB), then it is given a NIL-ID, which consists of "NIL" plus a four-digit zero-padded sequentially assigned integer (e.g. NIL-0001, NIL-0002). Both the entities with an entity node ID of "E" type and "NIL" type are assumed to be co-referenced (clustered), with the same "E" type ID or the same "NIL" ID if they refer to the same entity. Each "E" type ID and NIL ID is distinct from one another. 3. entity-type - GPE, ORG, or PER type indicator for the entity 4. genre - WB/NW/DF indicating the source genre of the document for the query (WB for web data, NW for newswire data, or DF for discussion forum data). ./data/2011/eval/source_documents/* This directory contains all of the source documents listed in the <docid> attribute for each query in tac_kbp_2011_chinese_entity_linking_evaluation_queries.xml. See section 5 for more information about source documents. ./data/2011/training/tac_kbp_2011_chinese_entity_linking_sample_and_training_queries.xml This file is a concatenation of the queries files originally released in LDC2011E46 (sample) and LDC2011E55 (training). This file contains 2171 queries. Each query entry consists of the following fields: <query id> - A query ID formatted as the letters "EL_CLCMN_" (if a Chinese language query) or "EL_CLENG_" (if an English language query) plus a five-digit zero-padded, sequentially assigned integer (e.g. "EL_CLCMN_00001"). <name> - The full namestring of the query entity. 
<docid> - An ID for a document in ./data/2011/training/source_documents/ from which the namestring was extracted. The queries are distributed by language and type as follows: KB-Link GPE ORG PER Total ---------------------------------------- CMN NW NIL: 124 293 426 843 CMN NW Non-NIL: 284 149 227 660 ENG NW NIL: 143 116 63 322 ENG NW Non-NIL: 122 100 100 322 ENG WB NIL: 0 1 0 1 ENG WB Non-NIL: 14 3 6 23 ---------------------------------------- Total: 687 662 822 2171 ./data/2011/training/tac_kbp_2011_chinese_entity_linking_sample_and_training_KB_links.tab This file is a concatenation of the KB_links files originally released in LDC2011E46 (sample) and LDC2011E55 (training). This file contains the responses for each query as identified by human annotators at LDC. This file is tab delimited, with 4 fields total. The column descriptions are as follows: 1. query ID - The ID for the query detailed in tac_kbp_2011_chinese_entity_linking_sample_and_training_queries.xml to which the subsequent information pertains 2. entity ID - A unique entity node ID or NIL ID, correspondent to entity linking annotation and NIL-coreference (clustering) annotation respectively. If the entity node ID begins with "E", the text refers to an entity in the Knowledge Base (TAC KBP Reference Knowledge Base - LDC2014T16). If the given query is not linked to an entity in the Knowledge Base (KB), then it is given a NIL-ID, which consists of "NIL" plus a four-digit zero-padded sequentially assigned integer (e.g. NIL-0001, NIL-0002). Both the entities with an entity node ID of "E" type and "NIL" type are assumed to be co-referenced (clustered), with the same "E" type ID or the same "NIL" ID if they refer to the same entity. Each "E" type ID and NIL ID is distinct from one another. 3. entity-type - GPE, ORG, or PER type indicator for the entity 4. genre - WB/NW/DF indicating the source genre of the document for the query (WB for web data, NW for newswire data, or DF for discussion forum data). 
./data/2011/training/source_documents/* This directory contains all of the source documents listed in the <docid> of tac_kbp_2011_chinese_entity_linking_sample_and_training_queries.xml See section 5 for more information about source documents. ./data/2012/eval/tac_kbp_2012_chinese_entity_linking_evaluation_queries.xml This file contains 2122 queries. Each query entry consists of the following fields: <query id> - A query ID formatted as the letters "EL_CMN_" plus a five-digit zero-padded, sequentially assigned integer (e.g., "EL_CMN_00001"). <name> - The full namestring of the query entity. <docid> - An ID for a document in ./data/2012/eval/source_documents/ from which the namestring was extracted. <beg> - The starting offset for the namestring. <end> - The ending offset for the namestring. The queries are distributed by language and type as follows: KB-Link GPE ORG PER Total ---------------------------------------- CMN NW NIL: 99 89 167 355 CMN NW Non-NIL: 164 167 148 479 CMN WB NIL: 88 86 68 242 CMN WB Non-NIL: 131 112 110 353 ENG NW NIL: 90 79 68 237 ENG NW Non-NIL: 101 107 83 291 ENG WB NIL: 6 26 16 48 ENG WB Non-NIL: 26 52 39 117 ---------------------------------------- Total: 705 718 699 2122 ./data/2012/eval/tac_kbp_2012_chinese_entity_linking_evaluation_KB_links.tab This file contains the responses for each query as identified by human annotators at LDC. This file is tab delimited, with 5 fields total. The column descriptions are as follows: 1. query ID - The ID for the query detailed in tac_kbp_2012_chinese_entity_linking_evaluation_queries.xml to which the subsequent information pertains 2. entity ID - A unique entity node ID or NIL ID, correspondent to entity linking annotation and NIL-coreference (clustering) annotation respectively. If the entity node ID begins with "E", the text refers to an entity in the Knowledge Base (TAC KBP Reference Knowledge Base - LDC2014T16). 
If the given query is not linked to an entity in the Knowledge Base (KB), then it is given a NIL-ID, which consists of "NIL" plus a three-digit zero-padded sequentially assigned integer (e.g. NIL001, NIL002). Both the entities with an entity node ID of "E" type and "NIL" type are assumed to be co-referenced (clustered), with the same "E" type ID or the same "NIL" ID if they refer to the same entity. Each "E" type ID and NIL ID is distinct from one another. 3. entity-type - GPE, ORG, or PER type indicator for the entity 4. genre - WB/NW/DF indicating the source genre of the document for the query (WB for web data, NW for newswire data, or DF for discussion forum data). 5. web-search - (Y/N) indicating whether the annotator made use of web searches in order to make the linking judgment. ./data/2012/eval/source_documents/* This directory contains all of the source documents listed in the <docid> of tac_kbp_2012_chinese_entity_linking_evaluation_queries.xml See section 5 for more information about source documents. ./data/2012/training/tac_kbp_2012_chinese_entity_linking_training_queries.xml This file contains 158 queries. Each query entry consists of the following fields: <query id> - A query ID formatted as the letters "EL_CMN_" plus a five-digit zero-padded, sequentially assigned integer (e.g., "EL_CMN_00001"). <name> - The full namestring of the query entity. <docid> - An ID for a document in ./data/2012/training/source_documents/ from which the namestring was extracted. <beg> - The starting offset for the namestring. <end> - The ending offset for the namestring. 
The queries are distributed by language and type as follows: KB-Link GPE ORG PER Total ---------------------------------------- CMN NW NIL: 2 2 2 6 CMN NW Non-NIL: 0 2 0 2 CMN WB NIL: 16 16 17 49 CMN WB Non-NIL: 24 25 24 73 ENG WB NIL: 3 4 0 7 ENG WB Non-NIL: 7 5 9 21 ---------------------------------------- Total: 52 54 52 158 ./data/2012/training/tac_kbp_2012_chinese_entity_linking_training_KB_links.tab This file contains the responses for each query as identified by human annotators at LDC. This file is tab delimited, with 5 fields total. The column descriptions are as follows: 1. query ID - The ID for the query detailed in tac_kbp_2012_chinese_entity_linking_training_queries.xml to which the subsequent information pertains 2. entity ID - A unique entity node ID or NIL ID, correspondent to entity linking annotation and NIL-coreference (clustering) annotation respectively. If the entity node ID begins with "E", the text refers to an entity in the Knowledge Base (TAC KBP Reference Knowledge Base - LDC2014T16). If the given query is not linked to an entity in the Knowledge Base (KB), then it is given a NIL-ID, which consists of "NIL" plus a three-digit zero-padded sequentially assigned integer (e.g. NIL001, NIL002). Both the entities with an entity node ID of "E" type and "NIL" type are assumed to be co-referenced (clustered), with the same "E" type ID or the same "NIL" ID if they refer to the same entity. Each "E" type ID and NIL ID is distinct from one another. 3. entity-type - GPE, ORG, or PER type indicator for the entity 4. genre - WB/NW/DF indicating the source genre of the document for the query (WB for web data, NW for newswire data, or DF for discussion forum data). 5. web-search - (Y/N) indicating whether the annotator made use of web searches in order to make the linking judgment. 
./data/2012/training/source_documents/* This directory contains all of the source documents listed in the <docid> of tac_kbp_2012_chinese_entity_linking_training_queries.xml See section 5 for more information about source documents. ./data/2013/eval/tac_kbp_2013_chinese_entity_linking_evaluation_queries.xml This file contains 2155 queries. Each query entry consists of the following fields: <query id> - A query ID formatted as the letters "EL13_CMN" plus a four-digit zero-padded, sequentially assigned integer (e.g., "EL13_CMN_0001"). <name> - The full namestring of the query entity. <docid> - An ID for a document in ./data/2013/eval/source_documents/ from which the namestring was extracted. <beg> - The starting offset for the namestring. <end> - The ending offset for the namestring. The queries are distributed by language and type as follows: KB-Link PER ORG GPE Total ----------------------------------------- CMN NW NIL: 123 197 125 445 CMN NW Non-NIL: 124 119 163 406 CMN WB NIL: 112 105 87 304 CMN WB Non-NIL: 173 150 162 485 ENG NW NIL: 52 16 68 136 ENG NW Non-NIL: 83 87 64 234 ENG WB NIL: 11 19 7 37 ENG WB Non-NIL: 28 42 38 108 ----------------------------------------- Total: 706 735 714 2155 ./data/2013/eval/tac_kbp_2013_chinese_entity_linking_evaluation_KB_links.tab This file contains the responses for each query as identified by human annotators at LDC. This file is tab delimited, with 6 fields total. The column descriptions are as follows: 1. query ID - The ID for the query detailed in tac_kbp_2013_chinese_entity_linking_evaluation_queries.xml to which the subsequent information pertains 2. entity ID - A unique entity node ID or NIL ID, correspondent to entity linking annotation and NIL-coreference (clustering) annotation respectively. If the entity node ID begins with "E", the text refers to an entity in the Knowledge Base (TAC KBP Reference Knowledge Base - LDC2014T16). 
If the given query is not linked to an entity in the Knowledge Base (KB), then it is given a NIL-ID, which consists of "NIL" plus a three-digit zero-padded sequentially assigned integer (e.g. NIL001, NIL002). Both the entities with an entity node ID of "E" type and "NIL" type are assumed to be co-referenced (clustered), with the same "E" type ID or the same "NIL" ID if they refer to the same entity. Each "E" type ID and NIL ID is distinct from one another. 3. entity-type - GPE, ORG, or PER type indicator for the entity 4. genre - WB/NW/DF indicating the source genre of the document for the query (WB for web data, NW for newswire data, or DF for discussion forum data). 5. web-search - (Y/N) indicating whether the annotator made use of web searches in order to make the linking judgment. 6. wiki text - (Y/N) indicating whether the annotator made use of the wiki text in the knowledge base (as opposed to just the infobox information) in order to make the linking judgment. ./data/2013/eval/source_documents/* This directory contains all of the source documents listed in the <docid> of tac_kbp_2013_chinese_entity_linking_evaluation_queries.xml See section 5 for more information about source documents. ./data/2014/eval/tac_kbp_2014_chinese_entity_linking_evaluation_queries.xml This file contains 2739 queries. Each query entry consists of the following fields: <query id> - A query ID formatted as the letters "EL14_CMN_" plus a four-digit zero-padded, sequentially assigned integer (e.g., "EL14_CMN_0001"). <name> - The full namestring of the query entity. <docid> - An ID for a document in ./data/2014/eval/source_documents/ from which the namestring was extracted. <beg> - The starting offset for the namestring. <end> - The ending offset for the namestring. 
The queries are distributed by language and type as follows: KB-Link PER ORG GPE Total --------------------------------------------- CMN DF NIL: 118 40 16 174 CMN DF Non-NIL: 426 61 66 553 CMN NW NIL: 179 413 300 892 CMN NW Non-NIL: 349 139 184 672 ENG DF NIL: 1 4 5 10 ENG DF Non-NIL: 5 26 25 56 ENG NW NIL: 10 65 32 107 ENG NW Non-NIL: 87 66 119 272 ENG WB Non-NIL: 1 0 2 3 --------------------------------------------- Total: 1176 814 749 2739 ./data/2014/eval/tac_kbp_2014_chinese_entity_linking_evaluation_KB_links.tab This file contains the responses for each query as identified by human annotators at LDC. This file is tab delimited, with 6 fields total. The column descriptions are as follows: 1. query ID - The ID for the query detailed in tac_kbp_2014_chinese_entity_linking_evaluation_queries.xml to which the subsequent information pertains 2. entity ID - A unique entity node ID or NIL ID, correspondent to entity linking annotation and NIL-coreference (clustering) annotation respectively. If the entity node ID begins with "E", the text refers to an entity in the Knowledge Base (TAC KBP Reference Knowledge Base - LDC2014T16). If the given query is not linked to an entity in the Knowledge Base (KB), then it is given a NIL-ID, which consists of "NIL" plus a three-digit zero-padded sequentially assigned integer (e.g. NIL001, NIL002). Both the entities with an entity node ID of "E" type and "NIL" type are assumed to be co-referenced (clustered), with the same "E" type ID or the same "NIL" ID if they refer to the same entity. Each "E" type ID and NIL ID is distinct from one another. 3. entity-type - GPE, ORG, or PER type indicator for the entity 4. genre - WB/NW/DF indicating the source genre of the document for the query (WB for web data, NW for newswire data, or DF for discussion forum data). 5. web-search - (Y/N) indicating whether the annotator made use of web searches in order to make the linking judgment. 6. 
wiki text - (Y/N) indicating whether the annotator made use of the wiki text in the knowledge base (as opposed to just the infobox information) in order to make the linking judgment. ./data/2014/eval/source_documents/* This directory contains all of the source documents listed in the <docid> of tac_kbp_2014_chinese_entity_linking_evaluation_queries.xml See section 5 for more information about source documents. ./data/2014/training/tac_kbp_2014_chinese_entity_linking_training_queries.xml This file contains 514 queries. Each query entry consists of the following fields: <query id> - A query ID formatted as the letters "EL14_CMN_TRAINING" plus a four-digit zero-padded, sequentially assigned integer (e.g., "EL14_CMN_TRAINING_0001"). <name> - The full namestring of the query entity. <docid> - An ID for a document in ./data/2014/training/source_documents/ from which the namestring was extracted. <beg> - The starting offset for the namestring. <end> - The ending offset for the namestring. The queries are distributed by language and type as follows: KB-Link PER ORG GPE Total ----------------------------------------- ENG DF NIL: 1 6 3 10 ENG DF Non-NIL: 33 37 41 111 CMN DF NIL: 28 46 6 80 CMN DF Non-NIL: 109 83 121 313 ----------------------------------------- Total: 171 172 171 514 ./data/2014/training/tac_kbp_2014_chinese_entity_linking_training_KB_links.tab This file contains the responses for each query as identified by human annotators at LDC. This file is tab delimited, with 6 fields total. The column descriptions are as follows: 1. query ID - The ID for the query detailed in tac_kbp_2014_chinese_entity_linking_training_queries.xml to which the subsequent information pertains 2. entity ID - A unique entity node ID or NIL ID, correspondent to entity linking annotation and NIL-coreference (clustering) annotation respectively. If the entity node ID begins with "E", the text refers to an entity in the Knowledge Base (TAC KBP Reference Knowledge Base - LDC2014T16). 
If the given query is not linked to an entity in the Knowledge Base (KB), then it is given a NIL-ID, which consists of "NIL" plus a three-digit zero-padded sequentially assigned integer (e.g. NIL001, NIL002). Both the entities with an entity node ID of "E" type and "NIL" type are assumed to be co-referenced (clustered), with the same "E" type ID or the same "NIL" ID if they refer to the same entity. Each "E" type ID and NIL ID is distinct from one another. 3. entity-type - GPE, ORG, or PER type indicator for the entity 4. genre - WB/NW/DF indicating the source genre of the document for the query (all DF or discussion forum threads in these data). 5. web-search - (Y/N) indicating whether the annotator made use of web searches in order to make the linking judgment. 6. wiki text - (Y/N) indicating whether the annotator made use of the wiki text in the knowledge base (as opposed to just the infobox information) in order to make the linking judgment. ./data/2014/training/source_documents/* This directory contains all of the source documents listed in the <docid> of tac_kbp_2014_chinese_entity_linking_training_queries.xml See section 5 for more information about source documents. ./dtd/2011_kbpentlink.dtd DTD for: tac_kbp_2011_chinese_entity_linking_evaluation_queries.xml tac_kbp_2011_chinese_entity_linking_sample_and_training_queries.xml ./dtd/2012_2013_2014_kbpentlink.dtd DTD for: tac_kbp_2012_chinese_entity_linking_evaluation_queries.xml tac_kbp_2012_chinese_entity_linking_training_queries.xml tac_kbp_2013_chinese_entity_linking_evaluation_queries.xml tac_kbp_2014_chinese_entity_linking_evaluation_queries.xml tac_kbp_2014_chinese_entity_linking_training_queries.xml 3. Annotation Given a name string and using information from the query's source document, bilingual Chinese/English-speaking annotators used a specialized search engine to look in the Knowledge Base for a page in which the entity referred to by the query was the central topic. 
If such a page was found, a link was created between the query and the matching KB node ID. If no matching page was found, the query was marked as NIL and later coreferenced with other NIL entities. Annotators were allowed to use online searching to assist in determining the KB link/NIL status. Queries for which a human annotator could not confidently determine the KB link status were removed from the final data sets. 4. Text Normalization Name string matches are case and punctuation sensitive. The only text normalization performed was: 1. conversion of newlines to spaces, except where preceding characters were hyphens ("-"), in which case newlines were removed 2. conversion of multiple spaces to a single space 5. Source Documents All the text data in the source files have been taken directly from previous LDC corpus releases, and are being provided here essentially "as-is", with little or no additional quality control. An overall scan of character content in the source collections indicates some relatively small quantities of various problems, especially in the web and discussion forum data, including language mismatch (characters from Chinese, Korean, Japanese, Arabic, Russian, etc.), and encoding errors (some documents have apparently undergone "double encoding" into UTF-8, and others may have been "noisy" to begin with, or may have gone through an improper encoding conversion, yielding occurrences of the Unicode "replacement character" (U+FFFD) throughout the corpus); the web collection also has characters whose Unicode code points lie outside the "Basic Multilanguage Plane" (BMP), i.e. above U+FFFF. All documents that have filenames beginning with "cmn-NG" and "eng-NG" are Web Document data (WB) and some of these fail XML parsing (see below for details). All files that start with "bolt-" are Discussion Forum threads (DF) and have the XML structure described below. All other files are Newswire data (NW) and have the newswire markup pattern detailed below. 
Note as well that some source documents are duplicated across a few of the separated source_documents directories, indicating that some queries from different data sets originated from the same source documents. As it is acceptable for sources to be reused for Entity Linking queries, this duplication is intentional and expected.

The subsections below go into more detail regarding the markup and other properties of the three source data types:

5.1 Newswire Data

Newswire data use the following markup framework:

  <DOC id="{doc_id_string}" type="{doc_type_label}">
  <HEADLINE>
  ...
  </HEADLINE>
  <DATELINE>
  ...
  </DATELINE>
  <TEXT>
  <P>
  ...
  </P>
  ...
  </TEXT>
  </DOC>

where the HEADLINE and DATELINE tags are optional (not always present), and the TEXT content may or may not include "<P> ... </P>" tags (depending on whether or not the "doc_type_label" is "story").

All the newswire files are parseable as XML.

5.2 Discussion Forum Data

Discussion forum files use the following markup framework:

  <doc id="{doc_id_string}">
  <headline>
  ...
  </headline>
  <post ...>
  ...
  <quote ...>
  ...
  </quote>
  ...
  </post>
  ...
  </doc>

where there may be arbitrarily deep nesting of quote elements, and other elements may be present (e.g. "<a...>...</a>" anchor tags). As mentioned in section 2 above, each <doc> unit contains at least five post elements.

All the discussion forum files are parseable as XML.

5.3 Web Document Data

"Web" files use the following markup framework:

  <DOC>
  <DOCID> {doc_id_string} </DOCID>
  <DOCTYPE> ... </DOCTYPE>
  <DATETIME> ... </DATETIME>
  <BODY>
  <HEADLINE>
  ...
  </HEADLINE>
  <TEXT>
  <POST>
  <POSTER> ... </POSTER>
  <POSTDATE> ... </POSTDATE>
  ...
  </POST>
  </TEXT>
  </BODY>
  </DOC>

Other kinds of tags may be present ("<QUOTE ...>", "<A >", etc). Some of the web source documents contain material that interferes with XML parsing (e.g. unescaped "&", or "<QUOTE>" tags that lack a corresponding "</QUOTE>").

6.
Using the Data

6.1 Offset calculation

The values of the beg and end XML elements in the later queries.xml files indicate character offsets to identify text extents in the source. Offset counting starts from the initial character (character 0) of the source document and includes newlines and all markup characters - that is, the offsets are based on treating the source document file as "raw text", with all its markup included.

6.2 Proper ingesting of XML queries

While the character offsets are calculated based on treating the source document as "raw text", the "name" strings being referenced by the queries sometimes contain XML metacharacters, and these had to be "re-escaped" for proper inclusion in the queries.xml file. For example, an actual name like "AT&T" may show up in a source document file as "AT&amp;T" (because the source document was originally formatted as XML data). But since the source doc is being treated here as raw text, this name string is treated in queries.xml as having 8 characters (i.e., the character offsets, when provided, will point to a string of length 8). However, the "name" element itself, as presented in the queries.xml file, will be even longer - "AT&amp;amp;T" - because the queries.xml file is intended to be handled by an XML parser, which will return "AT&amp;T" when this "name" element is extracted. Using the queries.xml data without XML parsing would yield a mismatch between the "name" value and the corresponding string in the source data.

7. Copyright Information

(c) 2015 Trustees of the University of Pennsylvania

8.
Contact Information

For further information about this data release, contact the following project staff at LDC:

  Joseph Ellis, Project Manager <joellis@ldc.upenn.edu>
  Jeremy Getman, Lead Annotator <jgetman@ldc.upenn.edu>
  Stephanie Strassel, PI <strassel@ldc.upenn.edu>

--------------------------------------------------------------------------

README created by Jeremy Getman on February 4, 2015
       updated by Joe Ellis on February 16, 2015
       updated by Jeremy Getman on February 17, 2015
       updated by Joe Ellis on March 18, 2015
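
The offset and escaping rules in section 6, together with the normalization rules in section 4, can be checked with a short script. This is a minimal sketch, not an official tool: the root element name kbpentlink is inferred from the DTD filenames, the query ID and document are invented, and the end offset is assumed to be inclusive (this README does not say). When reading real files, open them as UTF-8 with newline="" so that newline translation does not shift the offsets.

```python
import re
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def normalize(text):
    """Apply the two normalization rules listed in section 4."""
    text = re.sub(r"-\n", "-", text)     # a newline right after a hyphen is removed
    text = text.replace("\n", " ")       # every other newline becomes a space
    return re.sub(r" {2,}", " ", text)   # runs of spaces collapse to one space


def check_queries(queries_xml, sources):
    """Return (query id, raw source extent, extent matches name) per query.

    `sources` maps a docid to the raw, unparsed text of that document.
    """
    results = []
    for query in ET.fromstring(queries_xml).iter("query"):
        name = query.findtext("name")    # the parser undoes one level of escaping
        raw = sources[query.findtext("docid")]
        beg = int(query.findtext("beg"))
        end = int(query.findtext("end"))
        extent = raw[beg:end + 1]        # assumption: end offset is inclusive
        results.append((query.get("id"), extent, normalize(extent) == name))
    return results


# Hypothetical demo data: a tiny newswire-style document and one query.
mention_raw = "AT&amp;T"                 # escaped form as stored in the raw source
source_raw = (
    '<DOC id="demo_doc" type="story">\n<TEXT>\n'
    f"Shares of {mention_raw} rose.\n</TEXT>\n</DOC>\n"
)
beg = source_raw.index(mention_raw)
end = beg + len(mention_raw) - 1
queries_xml = (
    "<kbpentlink>"
    '<query id="DEMO_0001">'
    f"<name>{escape(mention_raw)}</name>"  # re-escaped for the queries file
    "<docid>demo_doc</docid>"
    f"<beg>{beg}</beg><end>{end}</end>"
    "</query></kbpentlink>"
)

if __name__ == "__main__":
    print(check_queries(queries_xml, {"demo_doc": source_raw}))
```

Comparing the parsed name against the raw slice, rather than against parsed document text, mirrors the "raw text" treatment of offsets described in section 6.1.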