[NLP] Entity linking paper notes: Entity Linking for Chinese Short Texts Based on BERT and Entity Name Embeddings

Entity Linking for Chinese Short Texts Based on BERT and Entity Name Embeddings

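A note up front:
I have been reading entity linking papers lately. Entity disambiguation is a required step of entity linking, and the step before it is entity recognition. If you only want to know which method this paper uses for disambiguation and linking, skip straight to the description of the model architecture in Section 3. This time I wrote the post in raw markdown. I tried the online editor hackmd and the Pages app that ships with the Mac, but both ran into trouble with images: it turns out imgur cannot be hotlinked as an image host, and with Pages I would have had to redo the layout for the blog and convert between Traditional and Simplified Chinese, which is a real hassle, so raw markdown it is. The original paper was not written with much care. Many important details are just a formula dropped in with no explanation of the variables, and even the quoted formulas contain errors, including the position encoding quoted from the Transformer. Later on, even the name of the BERT-ENE model is misspelled. Well, never mind, we only want to understand the architecture, right?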

Abs.

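Traditional entity linking mainly targets long texts, where complete context helps both entity recognition and entity disambiguation. Entity linking for Chinese short texts remains challenging because of colloquial language, dialects, and the limited context a short text provides.
The whole entity linking process consists of two tasks: entity recognition and entity disambiguation.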

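Entity recognition

  1. The paper proposes the BERT-EntityNameEmbedding (BERT-ENE) model, which uses the text descriptions in the knowledge base to improve entity recognition.
  2. In particular, the entity embeddings are mined from the description texts of the entities in the knowledge base.
  3. Candidate entities in the short text are obtained by matching against a name dictionary.
  4. The matches are then filtered by the BERT-ENE model, which completes the entity recognition task.

The paper also proposes combining the BERT-ENE model with a BERT-CRF model, which significantly improves recognition compared with traditional methods.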

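Entity disambiguation

Disambiguation is treated as binary classification: the predicted probabilities are ranked, and the entity with the highest probability is taken as the correct one. With the proposed method, the authors won first place in the CCKS2019 Chinese short text entity linking task.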

1 Introduction

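The entity linking process

Data on the web contains a large number of named entities, but their meaning is often ambiguous, especially for named entities that appear frequently. One entity may have many names, and a single name may refer to several different named entities. In addition, with the rise of knowledge-sharing communities such as Wikipedia, rapid progress in information extraction has driven the automatic construction of large-scale knowledge bases, which contain entities, information about each entity, and information about the relations between entities. Automatic construction involves extracting relations from web text and adding them to the knowledge base. At that stage the extracted entities have to be disambiguated, and this is what is called entity linking. The entity linking task maps recognized entity mentions to the correct entity objects in an existing knowledge base.

The proposed method targets CCKS2019 Task 2: entity linking for Chinese short texts. The knowledge base data is provided by Baidu; texts are at most 50 characters long, with an average of 27 characters. A data sample is shown in the text/mention example below.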

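In this example the sentence is only about 30 characters long yet contains 7 mentions, including the single-character mention '诗'. This shows how hard the task is: compared with longer texts and with English texts, entity linking on Chinese short texts is considerably more difficult.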

text:
求一些亦正亦邪的人物的性格描写《有所思》 萧衍的诗 南北朝诗人

mention:
人物, 性格, 有所思, 萧衍, 诗, 南北朝, 诗人

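A BERT-CRF model can only use the information in the short text and cannot draw on the knowledge base, so problems such as wrong entity boundaries and incomplete entity recognition remain. To make up for this and take full advantage of the knowledge base, the authors propose the BERT-ENE model. The entity disambiguation subtask is treated as a binary classification problem, and a BERT-based binary classifier disambiguates the candidate entities.

The main contributions are:

  1. Pre-trained models are applied to both entity recognition and disambiguation on short texts, so that the semantic information of the short text is fully extracted.
  2. Entity name embeddings are introduced into entity recognition, making full use of the text descriptions in the knowledge base and compensating for how little information a short text carries.
  3. A new model combining BERT-ENE and BERT-CRF is proposed, which greatly improves the effectiveness of entity recognition.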

2 Related Work

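The approaches the paper takes for the two subtasks:

  • Entity recognition subtask: mainly named entity recognition combined with name-dictionary matching.
  • Entity disambiguation subtask: disambiguation cast as binary classification.

Because named entity recognition alone cannot find every candidate entity in a text, researchers use matching against a name dictionary, extracted from the knowledge base, to improve performance. Each name in the dictionary is a keyword, and candidate entities can be obtained in different ways, many of them based on exact matching. To increase recall, (Zheng et al.) use loose matching based on string-matching rules instead of exact matching; to increase precision, some studies use empirical probabilities to select candidate entities. Most existing methods for selecting among matching results are based on rules or probabilities and do not benefit from deep learning models.

There are currently three main families of entity disambiguation methods: ranking-based, binary-classification-based, and graph-model-based. The rest of this section focuses on the binary classification methods, which relate directly to this work.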

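Binary classification methods: features relating the mention and a candidate entity are typically used to train a binary classifier that decides whether the candidate is a positive example, for example:

  • (Pan et al.) extract features including word features, word categories, and named entity categories, then classify with an SVM.

Traditional machine learning methods depend on many hand-crafted features, and the quality of those features strongly affects the classifier. Deep learning methods avoid this dependence, for example:

  • (Sun et al.) use deep learning to obtain semantic representations of the mention, the context, and the entity.
  • (Huang et al.) propose a deep semantic relatedness model to measure the semantic relatedness between entities.
  • (Ganea et al.) perform disambiguation with entity embeddings and an attention mechanism over a local context window.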

3 Model

3.1 Data Preprocessing

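Training data:

  • text: the text
  • ment_data: contains the mention and kb_id fields

Knowledge base:

  • subject_id: the subject ID
  • subject: the subject
  • alias: aliases
  • data: contains multiple predicate and object fields
  • … and so on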

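Introducing new aliases: statistical analysis of the dataset shows that a small number of entity names in the training set do not match anything in the knowledge base, for example:

  1. 安妮 '海瑟薇: the text contains a special character
  2. 新浪微薄: the entity name is misspelled in the input text
  3. 国家质检总局: the knowledge base lacks this alias

To handle cases like these, new aliases are added to the corresponding knowledge base entities as follows:

  1. For error 1, normalize special characters and add the processed name to the aliases of the corresponding entity. For example, all Chinese punctuation marks are converted to English punctuation marks.
  2. For errors 2 and 3, count the number of times $E_{num}$ that entity E fails to match, collect all the strings $M_1, M_2, ..., M_i$ in the training set that failed to match E, and count the occurrences $M_{i_{num}}$ of each $M_i$. If $E_{num} > 4$ and $M_{i_{num}} > 3$, add the string $M_i$ to the aliases of entity E.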
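The counting rule for errors 2 and 3 can be sketched as below. This is only an illustration: the function name `mine_aliases`, the input format (a list of entity ID and unmatched string pairs), and the sample entity IDs are assumptions of mine, not something the paper specifies.

```python
from collections import Counter, defaultdict

def mine_aliases(unmatched, e_min=4, m_min=3):
    """unmatched: (entity_id, mention_string) pairs from the training set
    whose mention string was not a known name of the entity (assumed format)."""
    per_entity = defaultdict(Counter)
    for entity_id, mention in unmatched:
        per_entity[entity_id][mention] += 1

    new_aliases = defaultdict(set)
    for entity_id, counts in per_entity.items():
        e_num = sum(counts.values())            # E_num: total mismatches of entity E
        if e_num <= e_min:
            continue
        for mention, m_num in counts.items():   # M_i_num: occurrences of string M_i
            if m_num > m_min:
                new_aliases[entity_id].add(mention)
    return dict(new_aliases)

# Hypothetical sample: E1 fails to match 5 times, E2 only 4 times.
pairs = [("E1", "新浪微薄")] * 4 + [("E1", "微薄")] + [("E2", "质检总局")] * 4
print(mine_aliases(pairs))
```

With these thresholds only the string that is both frequent and attached to a frequently mismatched entity becomes an alias, which keeps one-off typos out of the dictionary.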

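Construction of entity description text: concatenate the predicate and object fields in data to obtain the entity description text. To simplify later processing, the text is truncated according to this rule: when the combined length of a predicate and its object exceeds 30, they are truncated proportionally; otherwise they are left as is.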

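Name dictionary construction: the dictionary is built from entity names, entity aliases, lowercase forms of entity names, and the newly introduced aliases above. Once it is built, each entity name maps to one or more entity IDs, e.g. 'victory': ['10001', '19044', '37234', '38870', '40008', '85426', '86532', '140750'].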
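A minimal sketch of such a name dictionary, assuming each knowledge base entry is a dict with the `subject_id`, `subject`, and `alias` fields listed earlier. The function name, the `extra_aliases` argument, and the two sample entries are made up for illustration; lowercasing the aliases as well as the names is my own simplification.

```python
from collections import defaultdict

def build_name_dict(kb_entries, extra_aliases=None):
    """Map every surface name to the list of entity IDs it can refer to."""
    extra_aliases = extra_aliases or {}
    name_dict = defaultdict(list)
    for entry in kb_entries:
        sid = entry["subject_id"]
        names = [entry["subject"], *entry.get("alias", [])]
        names += list(extra_aliases.get(sid, []))   # newly mined aliases
        names += [n.lower() for n in names]         # lowercase variants
        for name in dict.fromkeys(names):           # de-duplicate, keep order
            if sid not in name_dict[name]:
                name_dict[name].append(sid)
    return dict(name_dict)

# Hypothetical knowledge base entries.
kb = [
    {"subject_id": "10001", "subject": "Victory", "alias": ["胜利"]},
    {"subject_id": "19044", "subject": "胜利", "alias": []},
]
name_dict = build_name_dict(kb)
print(name_dict["胜利"])
```

A name that appears under several entities ends up with several IDs, which is exactly the ambiguity the disambiguation step has to resolve later.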

3.2 Entity recognition

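Entity recognition uses a BERT-CRF architecture, as shown in the figure:

[Figure: BERT-CRF model]

The BERT-CRF model uses the BIO tagging scheme; the positions of BERT's [CLS] and [SEP] tokens are labeled with the tag TAG.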
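To make the tagging scheme concrete, here is a small sketch that turns mention spans into a BIO label sequence with TAG at the [CLS] and [SEP] positions. It assumes character-level tokens, and the helper name `bio_tags` and the span format are my own.

```python
def bio_tags(text, mentions):
    """mentions: list of (start, end) character spans, end exclusive."""
    tags = ["O"] * len(text)
    for start, end in mentions:
        tags[start] = "B"                 # first character of a mention
        for i in range(start + 1, end):
            tags[i] = "I"                 # remaining characters of the mention
    # [CLS] is prepended and [SEP] appended; both positions get the label TAG
    return ["TAG"] + tags + ["TAG"]

print(bio_tags("萧衍的诗", [(0, 2), (3, 4)]))
```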

Input layer

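The input to BERT is the sum of the word embedding, position embedding, and type embedding.

  • The word embedding corresponds to each word.
  • The type embedding is 0 or 1 and indicates whether a token belongs to the first or the second sentence: 0 for the first sentence, 1 for the second.

In named entity recognition there is only one sentence, so the type embedding is always 0. The model still has to learn sequence-order features, and BERT uses position embeddings to carry that information. The paper gives the following formulas for the position embedding: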

$$PE(pos, 2i) = \sin\left(pos / 10000^{2i/d_{model}}\right) \tag{1}$$

$$PE(pos, 2i+1) = \cos\left(pos / 10000^{2i/d_{model}}\right) \tag{2}$$

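Sine and cosine are used here simply because the Transformer authors, as a rule of thumb, chose bounded periodic functions: this keeps the difference between adjacent positions from being lost and keeps the values of the position embedding within a fixed range. The Transformer paper itself does not explain this in detail, so I recommend reading the explanations by the experts on Zhihu. In short, it is a small trick.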
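The sinusoidal encoding is easy to compute directly. The sketch below is plain Python written for these notes (the function name is mine): even dimensions get the sine and odd dimensions the cosine of the same angle.

```python
import math

def positional_encoding(max_len, d_model):
    """Return a max_len x d_model table of sinusoidal position encodings."""
    pe = [[0.0] * d_model for _ in range(max_len)]
    for pos in range(max_len):
        for k in range(0, d_model, 2):                 # k plays the role of 2i
            angle = pos / (10000 ** (k / d_model))
            pe[pos][k] = math.sin(angle)               # even dimension 2i
            if k + 1 < d_model:
                pe[pos][k + 1] = math.cos(angle)       # odd dimension 2i+1
    return pe

pe = positional_encoding(max_len=4, d_model=8)
print(pe[1][:2])
```

Every value stays in [-1, 1] no matter how long the sequence is, which is the bounded-range property mentioned above.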

BERT layer

The BERT layer consists of 12 layers of Transformer encoders. The most important module in an encoder unit is self-attention, given by:

$$Attention(Q, K, V) = softmax\left(\frac{QK^T}{\sqrt{d_k}}\right)V \tag{3}$$

$Q$, $K$, $V$ are matrices of input word vectors and $d_k$ is the dimension of the input vectors. Self-attention is the core of the model.
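For reference, scaled dot-product attention as defined in the Transformer, softmax(QK^T / sqrt(d_k)) applied to V, can be written in a few lines of plain Python. Matrices are lists of rows here, and the helper names are my own.

```python
import math

def softmax(row):
    m = max(row)                                   # subtract max for stability
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Q: n x d_k, K: m x d_k, V: m x d_v, all as lists of rows."""
    d_k = len(K[0])
    out = []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)                  # one weight per key/value row
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

# With identical keys the weights are uniform, so the output is the mean of V.
print(attention([[1.0, 0.0]], [[0.0, 0.0], [0.0, 0.0]], [[1.0, 2.0], [3.0, 4.0]]))
```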

So that the model can attend to different positions, and to give the attention unit more "representation subspaces", the Transformer uses multi-head attention:

$$MultiHead(Q, K, V) = Concat(head_1, ..., head_h)W^O \tag{4}$$

$$head_i = Attention(QW_i^Q, KW_i^K, VW_i^V) \tag{5}$$

Multi-head simply means splitting q, k, and v into several parts, q_1, q_2, …, k_1, k_2, …, and v_1, v_2, …, with one set per head.

To address the degradation problem in deep networks, residual connections and layer normalization are added to the Transformer encoder unit:

Layer norm:

$$LN(x_i) = \alpha \times \frac{x_i - \mu_L}{\sqrt{\sigma_L^2 + \varepsilon}} \tag{6}$$

Fully connected network:

$$FFN = \max(0, xW_1 + b_1)W_2 + b_2 \tag{7}$$
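Both sublayer formulas are short enough to check by hand. The sketch below applies layer normalization and the position-wise feed-forward network to a single vector; it follows the two formulas as given (scale only, no shift term), and the function names and toy weights are mine.

```python
import math

def layer_norm(x, alpha=1.0, eps=1e-6):
    mu = sum(x) / len(x)                                # mean over the layer
    var = sum((v - mu) ** 2 for v in x) / len(x)        # variance over the layer
    return [alpha * (v - mu) / math.sqrt(var + eps) for v in x]

def ffn(x, W1, b1, W2, b2):
    """W1: d_in x d_hidden, W2: d_hidden x d_out, both as lists of rows."""
    hidden = [max(0.0, sum(xi * w for xi, w in zip(x, col)) + b)   # ReLU(xW1 + b1)
              for col, b in zip(zip(*W1), b1)]
    return [sum(h * w for h, w in zip(hidden, col)) + b            # hidden W2 + b2
            for col, b in zip(zip(*W2), b2)]

print(layer_norm([1.0, 2.0, 3.0]))
print(ffn([1.0, -1.0], [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0], [[1.0], [1.0]], [0.5]))
```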

CRF layer

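The BERT layer captures long-range context but does not model dependencies between labels. A CRF is therefore used to model the label sequence; by taking the adjacency between labels into account, it yields the globally optimal label sequence.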
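Finding the globally best label sequence in a linear-chain CRF is usually done with Viterbi decoding. The sketch below is a generic illustration, not the paper's code: the emission scores stand in for the per-token scores from the BERT layer, and the transition scores and toy numbers are invented.

```python
def viterbi(emissions, transitions, labels):
    """emissions[t][y]: score of label y at position t;
    transitions[p][y]: score of moving from label p to label y."""
    n = len(labels)
    score = list(emissions[0])
    back = []
    for t in range(1, len(emissions)):
        new_score, pointers = [], []
        for y in range(n):
            best_prev = max(range(n), key=lambda p: score[p] + transitions[p][y])
            pointers.append(best_prev)
            new_score.append(score[best_prev] + transitions[best_prev][y]
                             + emissions[t][y])
        score = new_score
        back.append(pointers)
    y = max(range(n), key=lambda k: score[k])     # best final label
    path = [y]
    for pointers in reversed(back):               # follow back-pointers
        y = pointers[y]
        path.append(y)
    return [labels[k] for k in reversed(path)]

labels = ["B", "I", "O"]
emissions = [[1.0, 0.0, 1.2], [0.0, 2.0, 0.0]]
transitions = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, -5.0, 0.0]]  # O -> I is penalized
print(viterbi(emissions, transitions, labels))
```

Picking the best label at each position independently would give O followed by I, which is not a valid BIO sequence; the transition penalty makes the decoder prefer B followed by I.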

3.3 BERT-EntityNameEmbedding (BERT-ENE) model

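The model is shown in the figure below. Its main points:

  1. Build an entity name dictionary from the entity names and aliases in the knowledge base.
  2. Feed each entity description text into the pre-trained BERT model and take the output vector at the [CLS] position as the entity name embedding.
  3. Obtain candidate entities in the short text by dictionary matching.
  4. Filter the matching results with the BERT-ENE model.

[Figure: BERT-ENE model]

Dictionary matching: using the entity name dictionary built earlier, entities are matched in the text following the idea of maximum exact matching. To improve matching precision, single-character entity names such as '诗' are left out of the name dictionary used for matching.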
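One simple way to realize this kind of matching is a greedy longest-match scan that never accepts single-character names. The paper does not spell out its matching procedure, so treat the greedy, non-overlapping strategy, the `max_len` limit, and the function name below as assumptions.

```python
def match_mentions(text, name_dict, max_len=10):
    """Greedy longest-match scan; returns (start, end, name), end exclusive."""
    results, i = [], 0
    while i < len(text):
        # try the longest window first; stop at length 2 to skip single characters
        for length in range(min(max_len, len(text) - i), 1, -1):
            name = text[i:i + length]
            if name in name_dict:
                results.append((i, i + length, name))
                i += length
                break
        else:
            i += 1                     # nothing matched at position i
    return results

# Hypothetical dictionary; the IDs are placeholders.
names = {"萧衍": ["1"], "诗": ["2"], "南北朝": ["3"], "诗人": ["4"], "南北": ["5"]}
print(match_mentions("萧衍的诗 南北朝诗人", names))
```

Here '南北朝' wins over the shorter '南北', and the single character '诗' is never returned even though it is in the dictionary.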

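Entity name embedding: entity name embeddings are obtained from the BERT model as follows:

  1. Each entity's description text is fed into the BERT model, and the output vector at the [CLS] position is extracted to represent the meaning of that entity. This gives a vector representation for every entity.
  2. When an entity name maps to exactly one entity (one-to-one), that entity's vector is used directly as the embedding of the entity name, e.g. '无尽武道': ['10007'].
  3. When an entity name maps to several entities (one-to-many), the vectors are averaged, e.g. '胜利': ['10001', '19044', '37234', '38870', '40008', '85426', '86532', '140750'].

With this method, every entity name in the paper's experiments gets a 768-dimensional embedding.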
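The one-to-one and one-to-many cases reduce to the same operation, an average over the vectors of all entities sharing the name. A minimal sketch, assuming the per-entity [CLS] vectors have already been computed; the toy vectors are 2-dimensional instead of 768-dimensional, and their values are invented.

```python
def name_embeddings(name_dict, entity_vectors):
    """entity_vectors[entity_id]: the [CLS] vector of that entity's description."""
    result = {}
    for name, ids in name_dict.items():
        vectors = [entity_vectors[i] for i in ids]
        dim = len(vectors[0])
        # a single entity yields its own vector; several entities are averaged
        result[name] = [sum(v[d] for v in vectors) / len(vectors)
                        for d in range(dim)]
    return result

toy_dict = {"无尽武道": ["10007"], "胜利": ["10001", "19044"]}
toy_vectors = {"10007": [1.0, 2.0], "10001": [0.0, 4.0], "19044": [2.0, 2.0]}
print(name_embeddings(toy_dict, toy_vectors))
```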

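BERT-ENE model: the input has two parts, the short text and the entity name embedding. As in BERT, the short-text input has three components: word embedding, position embedding, and type embedding. The short text is passed through a BERT layer followed by a GRU layer.

The GRU is a variant of the RNN, computed as follows: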

$$z_t = \sigma(W_i \ast [h_{t-1}, x_t]) \tag{8}$$

$$r_t = \sigma(W_r \ast [h_{t-1}, x_t]) \tag{9}$$

$$\tilde{h_t} = \tanh(W_c \ast [r_t \cdot h_{t-1}, x_t]) \tag{10}$$

$$h_t = (1 - z_t) \cdot h_{t-1} + z_t \cdot \tilde{h_t} \tag{11}$$
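A single GRU step can be traced in plain Python. The sketch uses the standard GRU update, in which the new state interpolates between the previous hidden state h_{t-1} and the candidate state, and it omits bias terms just as the formulas do. The weight names mirror the formulas; everything else is illustrative.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def gru_step(h_prev, x, W_i, W_r, W_c):
    """Each weight matrix has shape hidden x (hidden + input)."""
    hx = h_prev + x                                       # concatenation [h_{t-1}, x_t]
    z = [sigmoid(v) for v in matvec(W_i, hx)]             # update gate z_t
    r = [sigmoid(v) for v in matvec(W_r, hx)]             # reset gate r_t
    rh = [ri * hi for ri, hi in zip(r, h_prev)]           # r_t * h_{t-1}
    h_tilde = [math.tanh(v) for v in matvec(W_c, rh + x)] # candidate state
    return [(1 - zi) * hi + zi * ci
            for zi, hi, ci in zip(z, h_prev, h_tilde)]

# With all-zero weights the gates are 0.5 and the candidate is 0,
# so the new state is half the previous one.
print(gru_step([2.0], [1.0], [[0.0, 0.0]], [[0.0, 0.0]], [[0.0, 0.0]]))
```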

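To take full advantage of the context, the GRU is run in both directions: the BERT output is fed into a forward GRU network and a backward GRU network.

  1. From the forward GRU, take the vector $V_{end}$ at the end position of the matched entity name.
  2. From the backward GRU, take the vector $V_{begin}$ at the start position of the matched entity name.

These two vectors are concatenated into $V_{con}$, the semantic representation of the entity name. To capture information from the whole text, max pooling is applied to the outputs of the forward and backward GRUs, giving the vector $V_{max}$ that represents the semantics of the entire text. Finally, $V_{max}$, $V_{con}$, and the corresponding entity name embedding are concatenated, passed through a CNN layer and a fully connected layer, and then through a sigmoid activation to obtain the predicted probability.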

BERT-ENE is essentially a binary classifier whose purpose is to filter the matched entities. The loss function is:

$$loss = -\sum_{i=1}^{n} \left[ \hat{y_i} \log y_i + (1 - \hat{y_i}) \log(1 - y_i) \right] \tag{12}$$

Here $\hat{y_i}$ is read as the label of sample $i$ and $y_i$ as the predicted probability.
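This is the standard binary cross-entropy summed over the samples. A small sketch for checking values by hand, with the label written as `y_hat` and the predicted probability as `y`; the clamping constant `eps` is my own addition to avoid taking the logarithm of zero.

```python
import math

def bce_loss(labels, probs, eps=1e-12):
    total = 0.0
    for y_hat, y in zip(labels, probs):
        y = min(max(y, eps), 1 - eps)      # clamp to keep log() finite
        total += y_hat * math.log(y) + (1 - y_hat) * math.log(1 - y)
    return -total

print(bce_loss([1, 0], [0.9, 0.2]))
```

A confident correct prediction contributes almost nothing to the loss, while a confident wrong one is penalized heavily.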

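3.4 Result fusion

As described above, entity recognition uses two models, BERT-CRF and BERT-ENE. Entities recognized by BERT-CRF may have wrong boundaries and then fail to match any candidate entity. BERT-ENE relies on dictionary matching, so every BERT-ENE result can be found among the candidate entities in the knowledge base, which avoids boundary errors. However, BERT-ENE drops single-character entities during dictionary matching. The two sets of results can therefore be combined to get a better outcome. The fusion rule is: when both models produce a result at the same position, keep the BERT-ENE result; a BERT-CRF result is used only when the entity name is a single character.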
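One reading of the fusion rule, sketched in Python: keep every BERT-ENE result and add a BERT-CRF result only if it is a single character that BERT-ENE did not already produce at that span. The span format and the function name are assumptions, and the paper may handle edge cases differently.

```python
def fuse(ene_results, crf_results):
    """Both inputs: lists of (start, end, name), end exclusive."""
    fused = list(ene_results)                        # BERT-ENE results always win
    taken = {(s, e) for s, e, _ in ene_results}
    for start, end, name in crf_results:
        # BERT-CRF contributes only single-character entity names
        if len(name) == 1 and (start, end) not in taken:
            fused.append((start, end, name))
    return sorted(fused)

ene = [(0, 2, "萧衍"), (5, 8, "南北朝")]
crf = [(0, 3, "萧衍的"), (3, 4, "诗")]
print(fuse(ene, crf))
```

The BERT-CRF span with the wrong boundary ('萧衍的') is discarded, while the single character '诗', which dictionary matching cannot produce, is recovered.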

3.5 Entity Disambiguation

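  • Based on binary classification.
  • During training, the matched entity is the positive example, and two negative examples are drawn from the remaining candidate entities.
  • The short text containing the entity to be disambiguated is concatenated with the candidate's description text and used as the input to the BERT model.
  • The output vector at the [CLS] position is concatenated with the feature vectors at the start and end positions of the candidate entity, then passed through a fully connected layer and a sigmoid activation to obtain the candidate's probability. The candidate with the highest probability is chosen as the correct entity.
  • The BERT binary classification model is shown in the figure below:

[Figure: BERT binary classification model]

As in BERT, the input consists of word, position, and type embeddings. Binary classification over two sentences needs two type-embedding values: 0 for the first sentence and 1 for the second.

Loss function for the binary classification task: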

$$loss = -\sum_{i=1}^{n} \left[ \hat{y_i} \log y_i + (1 - \hat{y_i}) \log(1 - y_i) \right] \tag{13}$$
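The training-pair construction and the final ranking from this section can be sketched as follows. The BERT classifier itself is replaced by a toy `score_fn` based on character overlap, purely so the example runs; the function names, the fixed random seed, and the sample descriptions are all hypothetical.

```python
import random

def training_pairs(text, gold_id, candidate_ids, descriptions, n_neg=2, seed=0):
    """One positive pair plus up to n_neg sampled negative pairs."""
    rng = random.Random(seed)
    negatives = [c for c in candidate_ids if c != gold_id]
    sampled = rng.sample(negatives, min(n_neg, len(negatives)))
    pairs = [(text, descriptions[gold_id], 1)]
    pairs += [(text, descriptions[c], 0) for c in sampled]
    return pairs

def link(text, candidate_ids, descriptions, score_fn):
    """Score every candidate and return the ID with the highest probability."""
    scored = [(score_fn(text, descriptions[c]), c) for c in candidate_ids]
    return max(scored, key=lambda pair: pair[0])[1]

def overlap_score(text, description):
    # toy stand-in for the BERT binary classifier
    chars = set(description)
    return len(set(text) & chars) / len(chars)

descriptions = {
    "10001": "胜利 歌曲 演唱者",
    "19044": "胜利 电影 导演",
    "37234": "胜利 小说 作者",
}
text = "胜利这部电影的导演是谁"
print(training_pairs(text, "19044", list(descriptions), descriptions))
print(link(text, list(descriptions), descriptions, overlap_score))
```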

4 Experiments

Omitted.

  • 2
    点赞
  • 14
    收藏
    觉得还不错? 一键收藏
  • 3
    评论
TAC KBP Chinese Entity Linking Comprehensive Training and Evaluation Data 2011-2014 LDC2015E17 March 20, 2015 Linguistic Data Consortium 1. Overview Text Analysis Conference (TAC) is a series of workshops organized by the National Institute of Standards and Technology (NIST). TAC was developed to encourage research in natural language processing (NLP) and related applications by providing a large test collection, common evaluation procedures, and a forum for researchers to share their results. Through its various evaluations, the Knowledge Base Population (KBP) track of TAC encourages the development of systems that can match entities mentioned in natural texts with those appearing in a knowledge base and extract novel information about entities from a document collection and add it to a new or existing knowledge base. The goal of Entity Linking is to determine whether or not the entity referred to in each query has a matching entity node in the reference Knowledge Base (KB) (LDC2014T16). If there is a matching node for a query, annotators create a link between the two. If there is not a matching node for a query, the entity is marked as 'NIL' and then clustered with other NIL entities into equivalence classes. For more information, please refer to the Entity Linking section of NIST's 2014 TAC KBP website (2014 was the last year in which the Chinese Entity Linking evaluation was conducted as of the time this package was created) at http://nlp.cs.rpi.edu/kbp/2014/ This package contains all evaluation and training data developed in support of TAC KBP Chinese Entity Linking during the four years since the task's inception in 2011. This includes queries, KB links, equivalence class clusters for NIL entities (those that could not be linked to an entity in the knowledge base), and entity type information for each of the queries. 
The data included in this package were originally released by LDC to TAC KBP coordinators and performers under the following ecorpora catalog IDs and titles: LDC2011E46: TAC 2011 KBP Cross-lingual Sample Entity Linking Queries V1.1 LDC2011E55: TAC 2011 KBP Cross-lingual Training Entity Linking V1.1 LDC2012E34: TAC 2011 KBP Cross-Lingual Evaluation Entity Linking Annotation LDC2012E66: TAC 2012 KBP Chinese Entity Linking Web Training Queries and Annotations LDC2012E103: TAC 2012 KBP Chinese Entity Linking Evaluation Annotations V1.2 LDC2013E96: TAC 2013 KBP Chinese Entity Linking Evaluation Queries and Knowledge Base Links V1.2 LDC2014E47: TAC 2014 KBP Chinese Entity Linking Discussion Forum Training Data LDC2014E83: TAC 2014 KBP Chinese Entity Linking Evaluation Queries and Knowledge Base Links V2.0 2. Contents ./README.txt This file ./data/2011/eval/tac_kbp_2011_chinese_entity_linking_evaluation_queries.xml This file contains 2176 queries. Each query entry consists of the following fields: <query id> - A query ID formatted as the letters "EL_CLCMN_" (if a Chinese language query) or "EL_CLENG_" (if an English language query) plus a five-digit zero-padded, sequentially assigned integer (e.g. "EL_CLCMN_00001"). <name> - The full namestring of the query entity. <docid> - An ID for a document in ./data/2011/eval/source_documents/ from which the namestring was extracted. The queries are distributed by language and type as follows: KB-Link GPE ORG PER Total ---------------------------------------- CMN NW NIL: 120 291 420 831 CMN NW Non-NIL: 279 150 221 650 ENG NW NIL: 90 129 20 239 ENG NW Non-NIL: 93 72 104 269 ENG WB NIL: 16 0 5 21 ENG WB Non-NIL: 44 68 54 166 ---------------------------------------- Total: 624 710 824 2176 ./data/2011/eval/tac_kbp_2011_chinese_entity_linking_evaluation_KB_links.tab This file contains the responses for each query as identified by human annotators at LDC. This file is tab delimited, with 4 fields total. 
The column descriptions are as follows: 1. query ID - The ID for the query detailed in tac_kbp_2011_chinese_entity_linking_evaluation_queries.xml to which the subsequent information pertains 2. entity ID - A unique entity node ID or NIL ID, correspondent to entity linking annotation and NIL-coreference (clustering) annotation respectively. If the entity node ID begins with "E", the text refers to an entity in the Knowledge Base (TAC KBP Reference Knowledge Base - LDC2014T16). If the given query is not linked to an entity in the Knowledge Base (KB), then it is given a NIL-ID, which consists of "NIL" plus a four-digit zero-padded sequentially assigned integer (e.g. NIL-0001, NIL-0002). Both the entities with an entity node ID of "E" type and "NIL" type are assumed to be co-referenced (clustered), with the same "E" type ID or the same "NIL" ID if they refer to the same entity. Each "E" type ID and NIL ID is distinct from one another. 3. entity-type - GPE, ORG, or PER type indicator for the entity 4. genre - WB/NW/DF indicating the source genre of the document for the query (WB for web data, NW for newswire data, or DF for discussion forum data). ./data/2011/eval/source_documents/* This directory contains all of the source documents listed in the <docid> attribute for each query in tac_kbp_2011_chinese_entity_linking_evaluation_queries.xml. See section 5 for more information about source documents. ./data/2011/training/tac_kbp_2011_chinese_entity_linking_sample_and_training_queries.xml This file is a concatenation of the queries files originally released in LDC2011E46 (sample) and LDC2011E55 (training). This file contains 2171 queries. Each query entry consists of the following fields: <query id> - A query ID formatted as the letters "EL_CLCMN_" (if a Chinese language query) or "EL_CLENG_" (if an English language query) plus a five-digit zero-padded, sequentially assigned integer (e.g. "EL_CLCMN_00001"). <name> - The full namestring of the query entity. 
<docid> - An ID for a document in ./data/2011/training/source_documents/ from which the namestring was extracted. The queries are distributed by language and type as follows: KB-Link GPE ORG PER Total ---------------------------------------- CMN NW NIL: 124 293 426 843 CMN NW Non-NIL: 284 149 227 660 ENG NW NIL: 143 116 63 322 ENG NW Non-NIL: 122 100 100 322 ENG WB NIL: 0 1 0 1 ENG WB Non-NIL: 14 3 6 23 ---------------------------------------- Total: 687 662 822 2171 ./data/2011/training/tac_kbp_2011_chinese_entity_linking_sample_and_training_KB_links.tab This file is a concatenation of the KB_links files originally released in LDC2011E46 (sample) and LDC2011E55 (training). This file contains the responses for each query as identified by human annotators at LDC. This file is tab delimited, with 4 fields total. The column descriptions are as follows: 1. query ID - The ID for the query detailed in tac_kbp_2011_chinese_entity_linking_sample_and_training_queries.xml to which the subsequent information pertains 2. entity ID - A unique entity node ID or NIL ID, correspondent to entity linking annotation and NIL-coreference (clustering) annotation respectively. If the entity node ID begins with "E", the text refers to an entity in the Knowledge Base (TAC KBP Reference Knowledge Base - LDC2014T16). If the given query is not linked to an entity in the Knowledge Base (KB), then it is given a NIL-ID, which consists of "NIL" plus a four-digit zero-padded sequentially assigned integer (e.g. NIL-0001, NIL-0002). Both the entities with an entity node ID of "E" type and "NIL" type are assumed to be co-referenced (clustered), with the same "E" type ID or the same "NIL" ID if they refer to the same entity. Each "E" type ID and NIL ID is distinct from one another. 3. entity-type - GPE, ORG, or PER type indicator for the entity 4. genre - WB/NW/DF indicating the source genre of the document for the query (WB for web data, NW for newswire data, or DF for discussion forum data). 
./data/2011/training/source_documents/* This directory contains all of the source documents listed in the <docid> of tac_kbp_2011_chinese_entity_linking_sample_and_training_queries.xml See section 5 for more information about source documents. ./data/2012/eval/tac_kbp_2012_chinese_entity_linking_evaluation_queries.xml This file contains 2122 queries. Each query entry consists of the following fields: <query id> - A query ID formatted as the letters "EL_CMN_" plus a five-digit zero-padded, sequentially assigned integer (e.g., "EL_CMN_00001"). <name> - The full namestring of the query entity. <docid> - An ID for a document in ./data/2012/eval/source_documents/ from which the namestring was extracted. <beg> - The starting offset for the namestring. <end> - The ending offset for the namestring. The queries are distributed by language and type as follows: KB-Link GPE ORG PER Total ---------------------------------------- CMN NW NIL: 99 89 167 355 CMN NW Non-NIL: 164 167 148 479 CMN WB NIL: 88 86 68 242 CMN WB Non-NIL: 131 112 110 353 ENG NW NIL: 90 79 68 237 ENG NW Non-NIL: 101 107 83 291 ENG WB NIL: 6 26 16 48 ENG WB Non-NIL: 26 52 39 117 ---------------------------------------- Total: 705 718 699 2122 ./data/2012/eval/tac_kbp_2012_chinese_entity_linking_evaluation_KB_links.tab This file contains the responses for each query as identified by human annotators at LDC. This file is tab delimited, with 5 fields total. The column descriptions are as follows: 1. query ID - The ID for the query detailed in tac_kbp_2012_chinese_entity_linking_evaluation_queries.xml to which the subsequent information pertains 2. entity ID - A unique entity node ID or NIL ID, correspondent to entity linking annotation and NIL-coreference (clustering) annotation respectively. If the entity node ID begins with "E", the text refers to an entity in the Knowledge Base (TAC KBP Reference Knowledge Base - LDC2014T16). 
If the given query is not linked to an entity in the Knowledge Base (KB), then it is given a NIL-ID, which consists of "NIL" plus a three-digit zero-padded sequentially assigned integer (e.g. NIL001, NIL002). Both the entities with an entity node ID of "E" type and "NIL" type are assumed to be co-referenced (clustered), with the same "E" type ID or the same "NIL" ID if they refer to the same entity. Each "E" type ID and NIL ID is distinct from one another. 3. entity-type - GPE, ORG, or PER type indicator for the entity 4. genre - WB/NW/DF indicating the source genre of the document for the query (WB for web data, NW for newswire data, or DF for discussion forum data). 5. web-search - (Y/N) indicating whether the annotator made use of web searches in order to make the linking judgment. ./data/2012/eval/source_documents/* This directory contains all of the source documents listed in the <docid> of tac_kbp_2012_chinese_entity_linking_evaluation_queries.xml See section 5 for more information about source documents. ./data/2012/training/tac_kbp_2012_chinese_entity_linking_training_queries.xml This file contains 158 queries. Each query entry consists of the following fields: <query id> - A query ID formatted as the letters "EL_CMN_" plus a five-digit zero-padded, sequentially assigned integer (e.g., "EL_CMN_00001"). <name> - The full namestring of the query entity. <docid> - An ID for a document in ./data/2012/training/source_documents/ from which the namestring was extracted. <beg> - The starting offset for the namestring. <end> - The ending offset for the namestring. 
The queries are distributed by language and type as follows: KB-Link GPE ORG PER Total ---------------------------------------- CMN NW NIL: 2 2 2 6 CMN NW Non-NIL: 0 2 0 2 CMN WB NIL: 16 16 17 49 CMN WB Non-NIL: 24 25 24 73 ENG WB NIL: 3 4 0 7 ENG WB Non-NIL: 7 5 9 21 ---------------------------------------- Total: 52 54 52 158 ./data/2012/training/tac_kbp_2012_chinese_entity_linking_training_KB_links.tab This file contains the responses for each query as identified by human annotators at LDC. This file is tab delimited, with 5 fields total. The column descriptions are as follows: 1. query ID - The ID for the query detailed in tac_kbp_2012_chinese_entity_linking_training_queries.xml to which the subsequent information pertains 2. entity ID - A unique entity node ID or NIL ID, correspondent to entity linking annotation and NIL-coreference (clustering) annotation respectively. If the entity node ID begins with "E", the text refers to an entity in the Knowledge Base (TAC KBP Reference Knowledge Base - LDC2014T16). If the given query is not linked to an entity in the Knowledge Base (KB), then it is given a NIL-ID, which consists of "NIL" plus a three-digit zero-padded sequentially assigned integer (e.g. NIL001, NIL002). Both the entities with an entity node ID of "E" type and "NIL" type are assumed to be co-referenced (clustered), with the same "E" type ID or the same "NIL" ID if they refer to the same entity. Each "E" type ID and NIL ID is distinct from one another. 3. entity-type - GPE, ORG, or PER type indicator for the entity 4. genre - WB/NW/DF indicating the source genre of the document for the query (WB for web data, NW for newswire data, or DF for discussion forum data). 5. web-search - (Y/N) indicating whether the annotator made use of web searches in order to make the linking judgment. 
./data/2012/training/source_documents/* This directory contains all of the source documents listed in the <docid> of tac_kbp_2012_chinese_entity_linking_training_queries.xml See section 5 for more information about source documents. ./data/2013/eval/tac_kbp_2013_chinese_entity_linking_evaluation_queries.xml This file contains 2155 queries. Each query entry consists of the following fields: <query id> - A query ID formatted as the letters "EL13_CMN" plus a four-digit zero-padded, sequentially assigned integer (e.g., "EL13_CMN_0001"). <name> - The full namestring of the query entity. <docid> - An ID for a document in ./data/2013/eval/source_documents/ from which the namestring was extracted. <beg> - The starting offset for the namestring. <end> - The ending offset for the namestring. The queries are distributed by language and type as follows: KB-Link PER ORG GPE Total ----------------------------------------- CMN NW NIL: 123 197 125 445 CMN NW Non-NIL: 124 119 163 406 CMN WB NIL: 112 105 87 304 CMN WB Non-NIL: 173 150 162 485 ENG NW NIL: 52 16 68 136 ENG NW Non-NIL: 83 87 64 234 ENG WB NIL: 11 19 7 37 ENG WB Non-NIL: 28 42 38 108 ----------------------------------------- Total: 706 735 714 2155 ./data/2013/eval/tac_kbp_2013_chinese_entity_linking_evaluation_KB_links.tab This file contains the responses for each query as identified by human annotators at LDC. This file is tab delimited, with 6 fields total. The column descriptions are as follows: 1. query ID - The ID for the query detailed in tac_kbp_2013_chinese_entity_linking_evaluation_queries.xml to which the subsequent information pertains 2. entity ID - A unique entity node ID or NIL ID, correspondent to entity linking annotation and NIL-coreference (clustering) annotation respectively. If the entity node ID begins with "E", the text refers to an entity in the Knowledge Base (TAC KBP Reference Knowledge Base - LDC2014T16). 
If the given query is not linked to an entity in the Knowledge Base (KB), then it is given a NIL-ID, which consists of "NIL" plus a three-digit zero-padded sequentially assigned integer (e.g. NIL001, NIL002). Both the entities with an entity node ID of "E" type and "NIL" type are assumed to be co-referenced (clustered), with the same "E" type ID or the same "NIL" ID if they refer to the same entity. Each "E" type ID and NIL ID is distinct from one another. 3. entity-type - GPE, ORG, or PER type indicator for the entity 4. genre - WB/NW/DF indicating the source genre of the document for the query (WB for web data, NW for newswire data, or DF for discussion forum data). 5. web-search - (Y/N) indicating whether the annotator made use of web searches in order to make the linking judgment. 6. wiki text - (Y/N) indicating whether the annotator made use of the wiki text in the knowledge base (as opposed to just the infobox information) in order to make the linking judgment. ./data/2013/eval/source_documents/* This directory contains all of the source documents listed in the <docid> of tac_kbp_2013_chinese_entity_linking_evaluation_queries.xml See section 5 for more information about source documents. ./data/2014/eval/tac_kbp_2014_chinese_entity_linking_evaluation_queries.xml This file contains 2739 queries. Each query entry consists of the following fields: <query id> - A query ID formatted as the letters "EL14_CMN_" plus a four-digit zero-padded, sequentially assigned integer (e.g., "EL14_CMN_0001"). <name> - The full namestring of the query entity. <docid> - An ID for a document in ./data/2014/eval/source_documents/ from which the namestring was extracted. <beg> - The starting offset for the namestring. <end> - The ending offset for the namestring. 
The queries are distributed by language and type as follows: KB-Link PER ORG GPE Total --------------------------------------------- CMN DF NIL: 118 40 16 174 CMN DF Non-NIL: 426 61 66 553 CMN NW NIL: 179 413 300 892 CMN NW Non-NIL: 349 139 184 672 ENG DF NIL: 1 4 5 10 ENG DF Non-NIL: 5 26 25 56 ENG NW NIL: 10 65 32 107 ENG NW Non-NIL: 87 66 119 272 ENG WB Non-NIL: 1 0 2 3 --------------------------------------------- Total: 1176 814 749 2739 ./data/2014/eval/tac_kbp_2014_chinese_entity_linking_evaluation_KB_links.tab This file contains the responses for each query as identified by human annotators at LDC. This file is tab delimited, with 6 fields total. The column descriptions are as follows: 1. query ID - The ID for the query detailed in tac_kbp_2014_chinese_entity_linking_evaluation_queries.xml to which the subsequent information pertains 2. entity ID - A unique entity node ID or NIL ID, correspondent to entity linking annotation and NIL-coreference (clustering) annotation respectively. If the entity node ID begins with "E", the text refers to an entity in the Knowledge Base (TAC KBP Reference Knowledge Base - LDC2014T16). If the given query is not linked to an entity in the Knowledge Base (KB), then it is given a NIL-ID, which consists of "NIL" plus a three-digit zero-padded sequentially assigned integer (e.g. NIL001, NIL002). Both the entities with an entity node ID of "E" type and "NIL" type are assumed to be co-referenced (clustered), with the same "E" type ID or the same "NIL" ID if they refer to the same entity. Each "E" type ID and NIL ID is distinct from one another. 3. entity-type - GPE, ORG, or PER type indicator for the entity 4. genre - WB/NW/DF indicating the source genre of the document for the query (WB for web data, NW for newswire data, or DF for discussion forum data). 5. web-search - (Y/N) indicating whether the annotator made use of web searches in order to make the linking judgment. 6. 
wiki text - (Y/N) indicating whether the annotator made use of the wiki text in the knowledge base (as opposed to just the infobox information) in order to make the linking judgment. ./data/2014/eval/source_documents/* This directory contains all of the source documents listed in the <docid> of tac_kbp_2014_chinese_entity_linking_evaluation_queries.xml See section 5 for more information about source documents. ./data/2014/training/tac_kbp_2014_chinese_entity_linking_training_queries.xml This file contains 514 queries. Each query entry consists of the following fields: <query id> - A query ID formatted as the letters "EL14_CMN_TRAINING" plus a four-digit zero-padded, sequentially assigned integer (e.g., "EL14_CMN_TRAINING_0001"). <name> - The full namestring of the query entity. <docid> - An ID for a document in ./data/2014/training/source_documents/ from which the namestring was extracted. <beg> - The starting offset for the namestring. <end> - The ending offset for the namestring. The queries are distributed by language and type as follows: KB-Link PER ORG GPE Total ----------------------------------------- ENG DF NIL: 1 6 3 10 ENG DF Non-NIL: 33 37 41 111 CMN DF NIL: 28 46 6 80 CMN DF Non-NIL: 109 83 121 313 ----------------------------------------- Total: 171 172 171 514 ./data/2014/training/tac_kbp_2014_chinese_entity_linking_training_KB_links.tab This file contains the responses for each query as identified by human annotators at LDC. This file is tab delimited, with 6 fields total. The column descriptions are as follows: 1. query ID - The ID for the query detailed in tac_kbp_2014_chinese_entity_linking_training_queries.xml to which the subsequent information pertains 2. entity ID - A unique entity node ID or NIL ID, correspondent to entity linking annotation and NIL-coreference (clustering) annotation respectively. If the entity node ID begins with "E", the text refers to an entity in the Knowledge Base (TAC KBP Reference Knowledge Base - LDC2014T16). 
If the given query is not linked to an entity in the Knowledge Base (KB), then it is given a NIL-ID, which consists of "NIL" plus a three-digit zero-padded sequentially assigned integer (e.g. NIL001, NIL002). Both the entities with an entity node ID of "E" type and "NIL" type are assumed to be co-referenced (clustered), with the same "E" type ID or the same "NIL" ID if they refer to the same entity. Each "E" type ID and NIL ID is distinct from one another. 3. entity-type - GPE, ORG, or PER type indicator for the entity 4. genre - WB/NW/DF indicating the source genre of the document for the query (all DF or discussion forum threads in these data). 5. web-search - (Y/N) indicating whether the annotator made use of web searches in order to make the linking judgment. 6. wiki text - (Y/N) indicating whether the annotator made use of the wiki text in the knowledge base (as opposed to just the infobox information) in order to make the linking judgment. ./data/2014/training/source_documents/* This directory contains all of the source documents listed in the <docid> of tac_kbp_2014_chinese_entity_linking_training_queries.xml See section 5 for more information about source documents. ./dtd/2011_kbpentlink.dtd DTD for: tac_kbp_2011_chinese_entity_linking_evaluation_queries.xml tac_kbp_2011_chinese_entity_linking_sample_and_training_queries.xml ./dtd/2012_2013_2014_kbpentlink.dtd DTD for: tac_kbp_2012_chinese_entity_linking_evaluation_queries.xml tac_kbp_2012_chinese_entity_linking_training_queries.xml tac_kbp_2013_chinese_entity_linking_evaluation_queries.xml tac_kbp_2014_chinese_entity_linking_evaluation_queries.xml tac_kbp_2014_chinese_entity_linking_training_queries.xml 3. Annotation Given a name string and using information from the query's source document, bilingual Chinese/English-speaking annotators used a specialized search engine to look in the Knowledge Base for a page in which the entity referred to by the query was the central topic. 
If such a page was found, a link was created between the query and the matching KB node ID. If no matching page was found, the query was marked as NIL and later coreferenced with other NIL entities. Annotators were allowed to use online searching to assist in determining the KB link/NIL status. Queries for which a human annotator could not confidently determine the KB link status were removed from the final data sets. 4. Text Normalization Name string matches are case and punctuation sensitive. The only text normalization performed was: 1. conversion of newlines to spaces, except where preceding characters were hyphens ("-"), in which case newlines were removed 2. conversion of multiple spaces to a single space 5. Source Documents All the text data in the source files have been taken directly from previous LDC corpus releases, and are being provided here essentially "as-is", with little or no additional quality control. An overall scan of character content in the source collections indicates some relatively small quantities of various problems, especially in the web and discussion forum data, including language mismatch (characters from Chinese, Korean, Japanese, Arabic, Russian, etc.), and encoding errors (some documents have apparently undergone "double encoding" into UTF-8, and others may have been "noisy" to begin with, or may have gone through an improper encoding conversion, yielding occurrences of the Unicode "replacement character" (U+FFFD) throughout the corpus); the web collection also has characters whose Unicode code points lie outside the "Basic Multilanguage Plane" (BMP), i.e. above U+FFFF. All documents that have filenames beginning with "cmn-NG" and "eng-NG" are Web Document data (WB) and some of these fail XML parsing (see below for details). All files that start with "bolt-" are Discussion Forum threads (DF) and have the XML structure described below. All other files are Newswire data (NW) and have the newswire markup pattern detailed below. 
Note as well that some source documents are duplicated across a few of the separated source_documents directories, indicating that some queries from different data sets originated from the same source documents. As it is acceptable for sources to be reused for Entity Linking queries, this duplication is intentional and expected.

The subsections below go into more detail regarding the markup and other properties of the three source data types:

5.1 Newswire Data

Newswire data use the following markup framework:

    <DOC id="{doc_id_string}" type="{doc_type_label}">
    <HEADLINE>
    ...
    </HEADLINE>
    <DATELINE>
    ...
    </DATELINE>
    <TEXT>
    <P>
    ...
    </P>
    ...
    </TEXT>
    </DOC>

where the HEADLINE and DATELINE tags are optional (not always present), and the TEXT content may or may not include "<P> ... </P>" tags (depending on whether or not the "doc_type_label" is "story").

All the newswire files are parseable as XML.

5.2 Discussion Forum Data

Discussion forum files use the following markup framework:

    <doc id="{doc_id_string}">
    <headline>
    ...
    </headline>
    <post ...>
    ...
    <quote ...>
    ...
    </quote>
    ...
    </post>
    ...
    </doc>

where there may be arbitrarily deep nesting of quote elements, and other elements may be present (e.g. "<a...>...</a>" anchor tags). As mentioned in section 2 above, each <doc> unit contains at least five post elements.

All the discussion forum files are parseable as XML.

5.3 Web Document Data

"Web" files use the following markup framework:

    <DOC>
    <DOCID> {doc_id_string} </DOCID>
    <DOCTYPE> ... </DOCTYPE>
    <DATETIME> ... </DATETIME>
    <BODY>
    <HEADLINE>
    ...
    </HEADLINE>
    <TEXT>
    <POST>
    <POSTER> ... </POSTER>
    <POSTDATE> ... </POSTDATE>
    ...
    </POST>
    </TEXT>
    </BODY>
    </DOC>

Other kinds of tags may be present ("<QUOTE ...>", "<A >", etc). Some of the web source documents contain material that interferes with XML parsing (e.g. unescaped "&", or "<QUOTE>" tags that lack a corresponding "</QUOTE>").

6. Using the Data

6.1 Offset calculation

The values of the beg and end XML elements in the later queries.xml files indicate character offsets to identify text extents in the source. Offset counting starts from the initial character (character 0) of the source document and includes newlines and all markup characters - that is, the offsets are based on treating the source document file as "raw text", with all its markup included.

6.2 Proper ingesting of XML queries

While the character offsets are calculated based on treating the source document as "raw text", the "name" strings being referenced by the queries sometimes contain XML metacharacters, and these had to be "re-escaped" for proper inclusion in the queries.xml file. For example, an actual name like "AT&T" may show up in a source document file as "AT&amp;T" (because the source document was originally formatted as XML data). But since the source doc is being treated here as raw text, this name string is treated in queries.xml as having 8 characters (i.e., the character offsets, when provided, will point to a string of length 8).

However, the "name" element itself, as presented in the queries.xml file, will be even longer - "AT&amp;amp;T" - because the queries.xml file is intended to be handled by an XML parser, which will return "AT&amp;T" when this "name" element is extracted. Using the queries.xml data without XML parsing would yield a mismatch between the "name" value and the corresponding string in the source data.

7. Copyright Information

(c) 2015 Trustees of the University of Pennsylvania

8. Contact Information

For further information about this data release, contact the following project staff at LDC:

  Joseph Ellis, Project Manager
  Jeremy Getman, Lead Annotator
  Stephanie Strassel, PI

--------------------------------------------------------------------------
README created by Jeremy Getman on February 4, 2015
       updated by Joe Ellis on February 16, 2015
       updated by Jeremy Getman on February 17, 2015
       updated by Joe Ellis on March 18, 2015
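The normalization rules (section 4) and the offset and escaping rules (section 6) of the README above can be tied together in a short Python sketch. This is only an illustration under stated assumptions, not LDC tooling: the miniature source document, the `kbpentlink` root element, the query ID `DEMO_0001`, and the helper names `normalize`, `extent` and `load_queries` are all hypothetical, and the `end` offset is assumed to be inclusive, which the excerpt does not state.

```python
import re
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Hypothetical miniature source document, kept as raw text with its markup.
SOURCE_DOC = (
    '<DOC id="demo_doc_001" type="story">\n'
    "<TEXT>\n"
    "<P>Shares of AT&amp;T rose.</P>\n"
    "</TEXT>\n"
    "</DOC>\n"
)

# The mention exactly as it appears in the raw source file.
RAW_NAME = "AT&amp;T"
BEG = SOURCE_DOC.index(RAW_NAME)   # offsets count from character 0, markup included
END = BEG + len(RAW_NAME) - 1      # assumption: the end offset is inclusive

# Hypothetical queries.xml fragment; the name is escaped once more for XML.
QUERIES_XML = f"""<kbpentlink>
  <query id="DEMO_0001">
    <name>{escape(RAW_NAME)}</name>
    <docid>demo_doc_001</docid>
    <beg>{BEG}</beg>
    <end>{END}</end>
  </query>
</kbpentlink>"""


def normalize(text: str) -> str:
    """Apply the two normalization rules listed in section 4."""
    text = re.sub(r"-\n", "-", text)     # newline right after a hyphen is removed
    text = text.replace("\n", " ")       # every other newline becomes a space
    return re.sub(r" {2,}", " ", text)   # runs of spaces collapse to one space


def extent(raw: str, beg: int, end: int) -> str:
    """Slice the raw document text, treating `end` as inclusive (assumed)."""
    return raw[beg:end + 1]


def load_queries(xml_text: str):
    """Parse queries with a real XML parser so that names are unescaped once."""
    root = ET.fromstring(xml_text)
    for query in root.iter("query"):
        yield {
            "id": query.get("id"),
            "name": query.findtext("name"),
            "docid": query.findtext("docid"),
            "beg": int(query.findtext("beg")),
            "end": int(query.findtext("end")),
        }


q = next(load_queries(QUERIES_XML))
mention = normalize(extent(SOURCE_DOC, q["beg"], q["end"]))
matches = mention == q["name"]   # case- and punctuation-sensitive comparison
print(q["id"], repr(q["name"]), repr(mention), matches)
```

The key design choice is to slice the source file without parsing it, while always reading queries.xml through an XML parser. Reading the `name` element with a regular expression instead would return the doubly escaped string and the comparison would fail.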
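Research on cross-lingual entity linking was presented and discussed at IJCNLP 2011 (International Joint Conference on Natural Language Processing). Cross-lingual entity linking is a technique that links entities in different languages by exploiting the relationships between those languages.

Entity linking is the process of associating named entities in natural-language text with entities in a knowledge base. Cross-lingual entity linking is entity linking in a multilingual setting: entities in multilingual text are recognized and matched, and correspondences are established between languages.

At IJCNLP 2011, researchers proposed several methods and techniques for cross-lingual entity linking. One approach uses word alignment and translation to align and translate words across languages before linking the entities. Another uses a cross-lingual knowledge base and builds links from the relations between multilingual entities.

Cross-lingual entity linking has significant practical value. It supports entity association analysis over text in different languages, which makes cross-lingual text easier to understand and process. For example, in a multinational company's marketing, product names, brands and features appear in different languages; cross-lingual entity linking can relate this information and provide a basis for decisions.

However, at the time of IJCNLP 2011, research in this area was still at an early stage and faced several challenges. Differences between languages and the diversity of meaning are among the main ones. Vocabulary and grammatical structure differ considerably across languages, and the same entity may go by different names in different contexts, which makes entity linking more complex.

Future research therefore needs to explore more effective cross-lingual entity linking methods to improve linking accuracy and robustness. This would help overcome language barriers in practical applications and advance multilingual information processing.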
