Entity Linking (1): A Survey of Entity Linking

What is Entity Linking?

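Entity linking (EL) is the task of detecting mentions in text, determining what each mention refers to and resolving any ambiguity, and then linking the mention to an entity in a knowledge base (KB). In this way it connects unstructured data to structured data.

By drawing on the rich information a KB holds about a large number of entities, entity linking enables a range of semantic applications, such as information extraction (IE), question answering, and knowledge graph construction. It is an important component of natural language understanding (NLU) pipelines.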


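This process is usually split into two subtasks: NER/MD (mention detection) and entity disambiguation. Depending on whether the two subtasks are handled separately, there are currently two mainstream approaches to entity linking:

  • End-to-End
  • Disambiguation-Only

End-to-End means the subtasks are not split: the model detects the mentions in a query and links them to the correct KB entities in one pass. Disambiguation-Only requires the mentions and their boundaries to be given. In other words, it assumes NER has already been done and the gold mentions are supplied, so the model only has to link each mention to a KB entity.

Next, we use a 2020 survey, "Neural entity linking: A survey of models based on deep learning", to get a rough picture of how the EL task has been defined, structured, modeled, and applied since 2015.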

Formal Definition

  1. Knowledge Graph (KG)
  2. Entity Recognition (ER)
    $ER: C \rightarrow M^n$
  3. Entity Disambiguation (ED)
    $ED: (M, C)^n \rightarrow E^n$

Entity recognition can be defined as a function ER from text to mentions. Here $c_i \in C$, where $C$ is the set of contexts (a context can be a document, a question, or a query), and the output is $n$ mentions $m_i \in M$, where $M$ is the set of all possible text spans.

Entity disambiguation can be defined as a function ED that maps the input context, together with the mentions in it, to entities in the KB.

To learn this mapping, EL models need a supervision signal, that is, enough annotated data consisting of queries and their mention-entity pairs.

Architecture

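The authors summarize a general EL architecture with two main steps: NER and entity disambiguation. Entity disambiguation is further split into two steps: candidate generation and ranking. Candidate generation produces the list of candidate KB entities for a mention and acts as a coarse recall stage. The second step then reranks that candidate set to obtain the final gold link. It usually computes a similarity between the context/mention representation and each entity representation and returns the highest-scoring mention-entity pair.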


This process involves several modules: candidate generation, mention/entity representation (encoding), ranking, and unlinkable mention prediction:

  1. Candidate Generation
    $CG: M \rightarrow (e_1, e_2, ..., e_k)$
  2. Context-mention Encoding
    $mENC: (C, M)^n \rightarrow (y_{m_1}, y_{m_2}, ..., y_{m_n})$
  3. Entity Encoding
    $eENC: E^k \rightarrow (y_{e_1}, y_{e_2}, ..., y_{e_k})$
  4. Entity Ranking
    $RNK: ((e_1, e_2, ..., e_k), C, M)^n \rightarrow \mathbb{R}^{n \times k}$
  5. Unlinkable Mention Prediction
    $NIL_p: (C, M)^n \rightarrow \{0, 1\}^n$

Candidate Generation

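First, candidate generation. It can be defined as a mapping from a mention to a list of possibly relevant KB entities. Its job is to find a set that is as small as possible while still being as likely as possible to contain the target entity. Common methods are listed in Table 1 of the survey and fall roughly into three groups: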


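  • The first group is shallow surface-form matching, for example based on edit distance or n-grams. This does not work well in every setting: for mention = "Big Blue" it can hardly retrieve entity = "IBM" (see the example in the table).
  • The second group relies on an alias or synonym dictionary. This usually requires alias mining, is costly to build, cannot handle out-of-vocabulary terms, and has recall bounded by the completeness of the dictionary.
  • The third group relies on prior probabilities $p(e|m)$, word embeddings, or semantic retrieval.

These methods can of course be combined in the candidate generation stage.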
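As a rough illustration of how the three groups might be combined, here is a minimal sketch. The toy `KB`, `ALIASES`, and `PRIORS` tables, the entity ids, and the 0.6 similarity threshold are all made up for this example and do not come from the survey; `difflib.SequenceMatcher` stands in for an edit-distance style matcher.

```python
from difflib import SequenceMatcher

# Hypothetical toy KB: entity id -> canonical name.
KB = {"Q1": "IBM", "Q2": "Big Blue River", "Q3": "Apple Inc."}

# Hypothetical alias dictionary: lowercased alias -> entity ids.
ALIASES = {"big blue": ["Q1"], "apple": ["Q3"]}

# Hypothetical mention-entity priors p(e|m), e.g. estimated from anchor-text counts.
PRIORS = {"big blue": {"Q1": 0.7, "Q2": 0.3}}


def surface_candidates(mention, threshold=0.6):
    """Group 1: surface-form matching via a character-level similarity ratio."""
    m = mention.lower()
    return [eid for eid, name in KB.items()
            if SequenceMatcher(None, m, name.lower()).ratio() >= threshold]


def generate_candidates(mention, k=5):
    """Union of the three sources, ordered by prior (unknown priors count as 0)."""
    m = mention.lower()
    prior = PRIORS.get(m, {})
    pool = set(surface_candidates(mention))   # group 1: surface match
    pool.update(ALIASES.get(m, []))           # group 2: alias dictionary
    pool.update(prior)                        # group 3: entities with a prior
    return sorted(pool, key=lambda e: (-prior.get(e, 0.0), e))[:k]
```

In this toy setup, surface matching alone does not retrieve IBM for "Big Blue"; the alias dictionary and the prior table are what bring it into the candidate list.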

Context-mention Encoding

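The second module produces the context-mention representation vector. The mainstream approach is to use an encoder network to build a dense contextualized vector representation. Earlier models used convolutional encoders or attention between the candidates and the mention, but two approaches dominate recent models: recurrent networks (for example, bidirectional LSTMs) and self-attention (for example, pretrained BERT or other pretrained Transformer-based language models).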

Entity Encoding

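The third module is the entity representation. It can be built from the words of the entity's document in the KB, in the style of word2vec word vectors. It can also be built from the relations between entities in the knowledge base, that is, from the knowledge graph. Finally, as on the context-mention side, an encoder network can be trained on features such as the entity description, entity page title, entity type, entity popularity, and link counts (prior knowledge).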

Entity Ranking

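The final ranking stage computes the similarity between the context-mention representation and each candidate entity representation, scores and sorts all candidates by that similarity, and takes the highest-scoring candidate as the gold entity.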

The similarity $s(m, e_i)$ can be computed with a dot product or cosine similarity, among others:

  • dot product: $s(m, e_i) = y_m \cdot y_{e_i}$
  • cosine similarity: $s(m, e_i) = \cos(y_m, y_{e_i}) = \frac{y_m \cdot y_{e_i}}{\|y_m\| \cdot \|y_{e_i}\|}$

At inference time, the decision is based on the probability distribution $P(e_i|m)$, which is usually approximated by a softmax over the candidate set: $P(e_i|m) = \frac{\exp(s(m, e_i))}{\sum_{j=1}^{k} \exp(s(m, e_j))}$. It can also be combined with other prior features $f(e_i, m)$ (for example, the mention-entity priors obtained during candidate generation): $\Phi(e_i, m) = \phi(P(e_i|m), f(e_i, m))$.
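The scoring step can be sketched in a few lines of plain Python. This is only an illustration of dot-product or cosine scoring followed by a softmax over the candidate set; the function names and the list-of-floats vector format are assumptions made for the example, and a real system would use encoder outputs rather than hand-written vectors.

```python
import math


def dot(u, v):
    """Dot product s(m, e_i) = y_m . y_{e_i}."""
    return sum(a * b for a, b in zip(u, v))


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    norm = math.sqrt(dot(u, u)) * math.sqrt(dot(v, v))
    return dot(u, v) / norm if norm > 0 else 0.0


def softmax(scores):
    """Numerically stable softmax over the candidate scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def rank_candidates(y_m, candidate_vecs, score_fn=dot):
    """Return (index of the best candidate, P(e_i|m) for every candidate)."""
    scores = [score_fn(y_m, y_e) for y_e in candidate_vecs]
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs
```

Subtracting the maximum score before exponentiating does not change the softmax result; it only avoids overflow for large scores.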

Training can be framed as a classification task, using the standard negative log likelihood objective, or it can use one of various ranking losses:

  • standard negative log likelihood objective
    $L(m) = -s(m, e_*) + \log \sum_{i=1}^{k} \exp(s(m, e_i))$
    where $e_*$ denotes the true entity.
  • ranking loss
    $L(m) = \sum_i l(e_i, m)$
    where $l(e_i, m) = [\gamma - \Phi(e_*, m) + \Phi(e_i, m)]_+$
    or $l(e_i, m) = \begin{cases} [\gamma - \Phi(e_i, m)]_+ & \text{if } e_i = e_* \\ [\Phi(e_i, m)]_+ & \text{otherwise} \end{cases}$
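The three objectives above can be written out directly for a single mention. This is a minimal sketch, assuming the scores for the $k$ candidates are already available as a Python list and `gold` is the index of $e_*$; the function names and the default margin of 0.1 are made up for the example. One deliberate simplification: the pairwise version skips the gold candidate itself, whose term would be the constant $\gamma$.

```python
import math


def nll_loss(scores, gold):
    """L(m) = -s(m, e_*) + log sum_i exp(s(m, e_i)), using a stable log-sum-exp."""
    top = max(scores)
    lse = top + math.log(sum(math.exp(s - top) for s in scores))
    return -scores[gold] + lse


def pairwise_margin_loss(phi, gold, gamma=0.1):
    """Sum over negative candidates of [gamma - Phi(e_*, m) + Phi(e_i, m)]_+."""
    return sum(max(0.0, gamma - phi[gold] + p)
               for i, p in enumerate(phi) if i != gold)


def pointwise_margin_loss(phi, gold, gamma=0.1):
    """[gamma - Phi]_+ for the gold entity, [Phi]_+ for every other candidate."""
    return sum(max(0.0, gamma - p) if i == gold else max(0.0, p)
               for i, p in enumerate(phi))
```

The negative log likelihood is the cross-entropy of the softmax over the candidates, which is why it matches the classification view of ranking.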

Unlinkable Mention Prediction

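Finally, unlinkable mentions ([NIL]) deserve special attention. These are mentions that cannot be linked to any entity in the KB, either because NER detected them by mistake or because the corresponding entity has not yet been added to the KB; building a knowledge base is, after all, an incremental process. There are four common ways to handle unlinkable mentions:
(1) The candidate generation stage returns no candidates.
(2) Apply a threshold to the linking score.
(3) Let an empty entity [NIL] take part in the ranking directly.
(4) Add a binary classifier after ranking. Its input is the mention-entity pairs, optionally with extra features such as the best linking score, and it makes the final decision on whether the mention can be linked.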
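The first three options are simple enough to sketch without a trained model. The helper names, the `"NIL"` marker string, and the 0.5 threshold below are assumptions for illustration only; option (4) needs a trained classifier and is not shown.

```python
NIL = "NIL"  # hypothetical marker for an unlinkable mention


def link_or_nil(candidates, scores, threshold=0.5):
    """Options (1) and (2): no candidates, or best score below a threshold."""
    if not candidates:
        return NIL
    best = max(range(len(scores)), key=scores.__getitem__)
    return candidates[best] if scores[best] >= threshold else NIL


def link_with_nil_candidate(candidates, scores, nil_score):
    """Option (3): NIL competes in the ranking with its own score."""
    ranked = list(zip(candidates, scores)) + [(NIL, nil_score)]
    return max(ranked, key=lambda pair: pair[1])[0]
```

In practice the threshold or the NIL score would be tuned on held-out data; the fixed values here are placeholders.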

Modifications of the General Architecture

On top of the general architecture, four variants of the EL task have gradually emerged:


The first variant performs NER and ED (Entity Disambiguation) jointly. This is the End2End approach mentioned earlier; a later article will cover it in detail, so it is not discussed further here.
$EL: C \rightarrow (M, E)^n$

The second variant is global entity disambiguation, which disambiguates several mentions at once on the assumption that the entities are related to one another: the consistency score between correct entity candidates is expected to be higher than between incorrect ones.
$LED: (M, C) \rightarrow E$
$GED: ((m_1, m_2, ..., m_q), C) \rightarrow E^q$
In addition, GED (Global Entity Disambiguation) often considers a longer context, or even the whole document. This improves disambiguation accuracy but also increases computational complexity.
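To make the difference between local and global disambiguation concrete, here is a small brute-force sketch. It is not the method of any particular paper: the additive objective (local scores plus weighted pairwise coherence), the `coherence` callback, and the entity names in the usage note are all assumptions for illustration. Enumerating every joint assignment is exponential in the number of mentions, which is one way to see where the extra computational cost comes from.

```python
from itertools import product


def global_disambiguate(local_scores, coherence, weight=1.0):
    """Pick one entity per mention, maximizing local scores plus pairwise coherence.

    local_scores: one dict per mention, mapping entity id -> local score.
    coherence: hypothetical symmetric function (entity, entity) -> float.
    """
    best, best_total = None, float("-inf")
    # Enumerate every joint assignment of one candidate per mention.
    for assignment in product(*(list(s) for s in local_scores)):
        total = sum(s[e] for s, e in zip(local_scores, assignment))
        total += weight * sum(coherence(a, b)
                              for i, a in enumerate(assignment)
                              for b in assignment[i + 1:])
        if total > best_total:
            best, best_total = assignment, total
    return list(best) if best is not None else []
```

For example, with two mentions where the first has candidates `Michael_Jordan` (local score 0.4) and `Jordan_country` (0.6) and the second has only `Chicago_Bulls`, a coherence function that rewards the pair (`Michael_Jordan`, `Chicago_Bulls`) can flip the first decision, whereas `weight=0` reduces to independent local disambiguation.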

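The third variant is domain-independent EL. A challenge shared by most tasks is the lack of annotated data. Entity linking currently has relatively high-quality annotated data in only a handful of domains, so how to make full use of that data, or even perform entity linking without any annotated data, is an important open challenge. Early methods were based on unsupervised and semi-supervised models. Recent solutions fall mainly into two groups: distant learning and zero-shot learning. Distant learning follows the same idea as distant supervision in relation extraction: it uses surface-matching heuristics to generate a noisy, distantly supervised dataset and then performs weakly supervised learning on it. Zero-shot learning, in turn, learns general-purpose features in a domain with plenty of annotated data and then transfers to a new domain using as little information from that domain as possible.

Finally, another challenge is cross-lingual entity linking. For some languages, corpus data is very scarce and so is prior information about entity-mention pairs, which makes every stage from candidate generation to entity ranking difficult. Cross-lingual methods try to exploit the wiki links between the same entity in different languages to link across languages as accurately as possible. Most current cross-lingual methods rely heavily on pretrained cross-lingual language models, which try to constrain the representations of different languages to a shared vector space so that the same ranking method can be applied.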

Evaluation Metrics

  • ED (Disambiguation-Only):
    $F_1 = P = R = Acc = \frac{\#\ \text{of correctly disamb. mentions}}{\#\ \text{of total mentions}}$

  • ER+ED (End-to-End):
    $P = \frac{\#\ \text{of correctly detected and disamb. mentions}}{\#\ \text{of predicted mentions by model}}$
    $R = \frac{\#\ \text{of correctly detected and disamb. mentions}}{\#\ \text{of mentions in ground truth}}$
    $F_1 = \frac{2 \cdot P \cdot R}{P + R}$
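These metrics are straightforward to compute. The sketch below assumes strict matching, where an end-to-end prediction counts as correct only if both the span boundaries and the entity id match the ground truth; the `(start, end, entity_id)` triple format and the function names are choices made for this example.

```python
def ed_accuracy(gold, pred):
    """Disambiguation-only: gold mentions are given, so P = R = F1 = accuracy.

    gold and pred are parallel lists of entity ids, one per gold mention.
    """
    correct = sum(1 for g, p in zip(gold, pred) if g == p)
    return correct / len(gold) if gold else 0.0


def end_to_end_prf(gold, pred):
    """End-to-end: gold and pred are sets of (start, end, entity_id) triples."""
    correct = len(gold & pred)  # correctly detected and disambiguated
    p = correct / len(pred) if pred else 0.0
    r = correct / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r > 0 else 0.0
    return p, r, f1
```

In the disambiguation-only setting every gold mention receives exactly one prediction, so the number of predicted mentions equals the number of gold mentions and precision, recall, and accuracy coincide.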

Applications of EL

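Beyond conventional applications such as text mining, knowledge graph construction, information retrieval, and question answering, EL has also opened up newer applications in which an entity linking system is integrated into a larger neural model. One example is joint learning with an EL task inside a language model (LM): because the EL extension brings in the rich information of the KG, this joint training usually gives the LM better semantic representations.
$L_{JOINT} = L_{BERT} + L_{EL\text{-}related}$
$L_{ERNIE} = L_{NSP} + L_{MLM} + L_{dEL}$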

TODO

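Follow-up posts will cover several representative, state-of-the-art papers for both the Disambiguation-Only and End2End approaches, to look in detail at how the EL task is actually implemented.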

[1] Sevgili, Özge, et al. "Neural entity linking: A survey of models based on deep learning." Semantic Web Preprint (2022): 1-44.
[2] "[KG笔记]九、实体链接(Entity Linking)" ([KG Notes] Part 9: Entity Linking)


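For more NLP-related content, see my GitHub: https://github.com/qingyujean/Magic-NLPer. Likes, stars, and encouragement are welcome.

Finally, if you find any errors in this post, please point them out. Thank you.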

Contact Information For further information about this data release, contact the following project staff at LDC: Joseph Ellis, Project Manager <[email protected]> Jeremy Getman, Lead Annotator <[email protected]> Stephanie Strassel, PI <[email protected]> -------------------------------------------------------------------------- README created by Jeremy Getman on February 4, 2015 updated by Joe Ellis on February 16, 2015 updated by Jeremy Getman on February 17, 2015 updated by Joe Ellis on March 18, 2015
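The normalization rules in section 4 and the offset and escaping behaviour in section 6 of the README above can be sketched in Python roughly as follows. This is only a minimal illustration, not LDC tooling: the source text, the query ID, the `kbpentlink` root element, and the helper names `normalize_name` and `mention_from_offsets` are all made up for the demo, and the `end` offset is assumed to be inclusive, which the README excerpt does not state explicitly.

```python
import re
import xml.etree.ElementTree as ET


def normalize_name(text: str) -> str:
    """Apply the two normalization rules listed in section 4."""
    text = re.sub(r"(?<=-)\n", "", text)  # newline right after a hyphen: removed
    text = text.replace("\n", " ")         # any other newline: one space
    return re.sub(r" {2,}", " ", text)     # runs of spaces: a single space


def mention_from_offsets(raw: str, beg: int, end: int) -> str:
    """Slice the raw file text (markup included); `end` is ASSUMED inclusive."""
    return raw[beg:end + 1]


# Hypothetical source file, read as raw text: markup and XML escapes are kept.
SOURCE_RAW = '<doc id="demo-001">\n<post>AT&amp;T raised prices.</post>\n</doc>\n'

# Offsets count from character 0 of the raw file, including tags and newlines.
beg = SOURCE_RAW.index("AT&amp;T")
end = beg + len("AT&amp;T") - 1

# Hypothetical queries.xml fragment. The name is escaped once more than in the
# source file, so an XML parser hands back the raw-text form of the mention.
QUERIES_XML = f"""<kbpentlink>
  <query id="EL_DEMO_0001">
    <name>AT&amp;amp;T</name>
    <docid>demo-001</docid>
    <beg>{beg}</beg>
    <end>{end}</end>
  </query>
</kbpentlink>"""

query = ET.fromstring(QUERIES_XML).find("query")
parsed_name = query.findtext("name")
span = mention_from_offsets(
    SOURCE_RAW, int(query.findtext("beg")), int(query.findtext("end"))
)

# The parsed name and the raw-text span agree only because the query file
# was read with an XML parser rather than as plain text.
print(parsed_name == span)
print(len(span))
```

Reading the `name` element with a regular expression or plain string search instead of a parser would give the doubly escaped form, which no longer matches the span the offsets point at.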
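At IJCNLP 2011 (International Joint Conference on Natural Language Processing), research on cross-lingual entity linking was presented and discussed. Cross-lingual entity linking connects mentions of the same entity across different languages.

Entity linking associates named entities in natural-language text with entities in a knowledge base. Cross-lingual entity linking does this in a multilingual setting: entities are recognized and matched in text written in several languages, and correspondences are established between those languages.

Researchers at IJCNLP 2011 proposed several methods for the task. One approach uses word alignment and translation to map terms from one language into another and then performs entity linking on the result. Another relies on a cross-lingual knowledge base and builds links through the relations between multilingual entities.

The work has practical value. It supports entity-level analysis over text in different languages, which makes cross-lingual information easier to understand and process. In marketing at a multinational company, for example, product names, brands, and features appear in many languages; cross-lingual entity linking can tie them together to inform decisions.

At the time of IJCNLP 2011, however, the field was still at an early stage and faced several challenges. Chief among them are the differences between languages and the diversity of meaning: vocabulary and grammatical structure vary widely across languages, and the same entity may go by different names in different contexts, which makes linking harder.

Future research therefore needs to explore more effective cross-lingual entity linking methods that improve linking accuracy and robustness. This would help overcome language barriers in real applications and advance multilingual information processing.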
