Entity Linking

A survey of the entity linking task.

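1. Introduction:

Entity linking is the task of associating each entity mention that appears in natural-language text with the corresponding entity in a knowledge graph, that is, linking the mention to its entry in a resource such as a standard database, a knowledge base, a gazetteer, or a Wikipedia page.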

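2. Main approach, three modules:

  • Candidate entity generation: detects the set of entity mentions M in the input text (every entity the text refers to) and, for each mention, retrieves from the given knowledge graph the set of candidate entities it may refer to. Common candidate generation methods include dictionary matching, surface form expansion, and statistical models;
  • Entity disambiguation: scores and ranks the candidate entities of each mention m and outputs the highest-scoring candidate as the linking result for m. Common candidate ranking methods are either supervised or unsupervised;
  • Unlinkable mention prediction: predicts which entity mentions in the input text cannot be linked to the knowledge graph. This usually results from the incompleteness of the knowledge graph itself: the entity mentioned in the text is not yet covered by the existing knowledge graph, so no corresponding entity can be found.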

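3. Pipeline:

  • Named entity recognition
  • Candidate entity generation
  • Entity disambiguation
  • Clustering of entities not found in the knowledge graph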

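4. Candidate entity generation

4.1. Dictionary matching:

  • Dictionary extraction. A dictionary of <entity mention, knowledge graph entity> pairs must be built first. The most common approach extracts these pairs from Wikipedia by exploiting the relationships among entity titles, redirect pages, disambiguation pages, bold phrases, and hyperlinks.
  • The table below gives the concrete dictionary-construction method for each type of data. Because Wikipedia aligns well with many other knowledge graphs, including Freebase, a dictionary obtained this way also works well for entity linking against those knowledge graphs.

  • Once the dictionary is built, there are two main ways to use it to detect entity mentions in the input text:

1) Exact match: every entity mention in the text must appear verbatim in the dictionary. Exact matching is easy to implement, but it requires the dictionary to cover the set of mentions well. As soon as a mention appears in a variant form that the dictionary lacks, matching fails.
2) Fuzzy match: a mention in the text is allowed to differ somewhat on the surface from the corresponding mention in the dictionary. Common fuzzy matching mechanisms include:

  1. If a mention in the text is fully contained in a dictionary mention, or fully contains one, the two mentions match;
  2. If a mention in the text shares a certain number of words with a dictionary mention, the two mentions match;
  3. If a mention in the text is highly similar to a dictionary mention according to a string similarity measure (for example character Dice score, skip-bigram Dice score, Hamming distance, or edit distance), the two mentions match.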
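The two matching modes can be sketched roughly as follows. This is a minimal illustration and not a reference implementation: the `DICTIONARY` contents, the entity identifiers, and the thresholds `min_overlap` and `max_edits` are all assumptions made for the example, and edit distance stands in for whichever string similarity measure a real system would use.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed with a rolling one-row table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def fuzzy_match(mention: str, entry: str, min_overlap: int = 1, max_edits: int = 2) -> bool:
    """Apply the three fuzzy mechanisms in turn (thresholds are assumed values)."""
    m, d = mention.lower(), entry.lower()
    if not m or not d:
        return False
    # Mechanism 1: one string fully contains the other.
    if m in d or d in m:
        return True
    # Mechanism 2: the two strings share enough words.
    if len(set(m.split()) & set(d.split())) >= min_overlap:
        return True
    # Mechanism 3: the strings are close under a similarity measure.
    return edit_distance(m, d) <= max_edits


def candidates(mention: str, dictionary: dict) -> set:
    """Try an exact match first, then fall back to fuzzy matching."""
    exact = dictionary.get(mention)
    if exact:
        return set(exact)
    found = set()
    for entry, entities in dictionary.items():
        if fuzzy_match(mention, entry):
            found |= entities
    return found


# Hypothetical <mention, entity> dictionary; a real one would be mined from Wikipedia.
DICTIONARY = {
    "Michael Jordan": {"Michael_Jordan", "Michael_I._Jordan"},
    "New York": {"New_York_City", "New_York_(state)"},
    "IBM": {"IBM"},
}

print(candidates("IBM", DICTIONARY))             # exact match
print(candidates("Micheal Jordan", DICTIONARY))  # misspelling, recovered by fuzzy match
```

With `min_overlap=1` the word-overlap rule is deliberately loose (for instance "New Jersey" would match "New York"), which trades precision for recall; the later ranking stage is expected to sort out the extra candidates.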

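4.2. Statistical learning (i.e., named entity recognition)

  • Dictionary matching detects mentions in the input text using a set of mentions extracted in advance. If a mention never appeared in the corpus from which the dictionary was extracted, this kind of method cannot handle it.
  • A statistical model learned from features extracted from annotated data can detect mentions it has never seen before, so it generalizes better. This is the named entity recognition task.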

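5. Entity disambiguation (candidate entity ranking)

5.1. Supervised learning:

The features used by supervised methods fall into two broad classes: context-independent and context-dependent.

  1. Context-independent features score and rank candidates using only the entity mention and the candidate entity themselves. Common context-independent features include:
  • whether the names of the mention and the candidate entity match exactly;
  • whether the mention (or the candidate entity) has the candidate entity (or the mention) as a prefix or suffix;
  • whether the mention (or the candidate entity) fully contains the candidate entity (or the mention);
  • whether the sequence of initial letters of the words in the mention equals that of the candidate entity;
  • the number of words shared by the mention and the candidate entity;
  • the candidate popularity feature, i.e. the prior probability that mention m links to candidate entity e;
  • the type-match feature between the mention and the candidate entity, which checks whether the NER type of the mention (for example People, Location, Organization) agrees with the type of the candidate entity in the knowledge graph.
  2. Context-dependent features score and rank candidates by the relatedness between the contexts of the mention and of the candidate entity. Common context-dependent features include:
  • Bag-of-words feature: the mention and the candidate entity are each represented as a vector, and the similarity between the two is computed. The mention vector is the bag-of-words vector of the context in which the mention appears. The candidate vector is built differently depending on where the entity comes from: for a candidate from Wikipedia, it is the bag-of-words vector of the entity's Wikipedia page; for a candidate from a knowledge graph, it is the bag-of-words vector of the knowledge graph entities and predicates directly connected to that entity.
  • Concept-vector feature, specific to Wikipedia-based entity linking: for each candidate entity, a concept vector is built from the redirects, anchor text, keywords, InfoBox, and similar information on the entity's Wikipedia page, and its similarity to the bag-of-words vector of the mention's context is computed.
  • On top of these features, a variety of machine learning algorithms can be used to train a candidate ranking model. Naive Bayes, maximum entropy, or support vector machines can be used to train a binary classifier that decides whether a link exists between mention m and candidate entity e.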
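A few of the features listed above can be computed along the following lines. This is only a sketch under stated assumptions: `LINK_COUNTS` is a hypothetical table of how often each mention string links to each entity (for example counted from anchor text), the entity identifiers are illustrative, whitespace tokenization is a simplification, and the type-match and concept-vector features are left out. The resulting feature dictionary could then be fed to any of the binary classifiers mentioned above.

```python
import math
from collections import Counter


def initials(name: str) -> str:
    """First letter of every whitespace-separated word, lower-cased."""
    return "".join(word[0].lower() for word in name.split())


def context_independent_features(mention, candidate_name, candidate_id, link_counts):
    """Name-based features plus the popularity prior P(e | m)."""
    m, c = mention.lower(), candidate_name.lower()
    counts = link_counts.get(mention, {})
    total = sum(counts.values())
    return {
        "exact_match": float(m == c),
        "prefix_or_suffix": float(
            m.startswith(c) or m.endswith(c) or c.startswith(m) or c.endswith(m)
        ),
        "containment": float(m in c or c in m),
        "same_initials": float(initials(m) == initials(c)),
        "shared_words": float(len(set(m.split()) & set(c.split()))),
        # Prior estimated as count(m -> e) / count(m -> any entity).
        "prior": counts.get(candidate_id, 0) / total if total else 0.0,
    }


def bag_of_words_cosine(mention_context: str, entity_text: str) -> float:
    """Context-dependent feature: cosine similarity of two bag-of-words vectors."""
    a = Counter(mention_context.lower().split())
    b = Counter(entity_text.lower().split())
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


# Hypothetical link statistics, invented for the example.
LINK_COUNTS = {"Jordan": {"Michael_Jordan": 70, "Jordan_(country)": 30}}

features = context_independent_features("Jordan", "Michael Jordan", "Michael_Jordan", LINK_COUNTS)
features["bow_cosine"] = bag_of_words_cosine(
    "Jordan scored for the Bulls", "basketball player for the Chicago Bulls"
)
print(features)
```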

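5.2. Unsupervised learning:

To reduce an entity linking system's dependence on annotated data, unsupervised methods can be applied to candidate ranking. Common approaches are based on vector space models or on information retrieval.

  • Vector-space methods first convert the mention m and each of its candidate entities into vector representations, then rank the candidates by the distance between the mention vector and each candidate vector. Individual systems differ in how they build these vectors.
  • Information-retrieval methods use the Wikipedia document of each knowledge graph entity as that entity's representation and build an index over all knowledge graph entities from these documents. Given a mention m in the input text, the method first collects all sentences of the input text that contain m and turns them into a query by filtering steps such as stop-word removal. It then runs the query against the entity index and returns the most relevant knowledge graph entity as the linking result for m.
  • Unsupervised methods are generally suited to linking entities in long texts, because short texts do not provide enough material to build a good vector representation or query for the mention.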
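The information-retrieval variant described above might look roughly like this. It is a toy sketch: the `ENTITY_DOCS` texts stand in for Wikipedia pages, the stop-word list is a small assumed sample, sentences are split naively on periods, and the TF-IDF style score is one simple choice of relevance function rather than the scoring any particular system uses.

```python
import math
from collections import Counter, defaultdict

# Small assumed stop-word list; real systems use a fuller one.
STOPWORDS = {"the", "a", "an", "of", "in", "for", "and", "to", "is", "was", "on", "with"}


def tokenize(text: str) -> list:
    """Lower-case, strip surrounding punctuation, and drop stop words."""
    tokens = (w.strip(".,;:!?\"'()").lower() for w in text.split())
    return [t for t in tokens if t and t not in STOPWORDS]


def build_index(entity_docs: dict) -> dict:
    """Inverted index: term -> {entity_id: term frequency}."""
    index = defaultdict(dict)
    for entity_id, doc in entity_docs.items():
        for term, tf in Counter(tokenize(doc)).items():
            index[term][entity_id] = tf
    return index


def build_query(text: str, mention: str) -> list:
    """Collect the sentences that contain the mention and filter stop words."""
    sentences = [s for s in text.split(".") if mention in s]
    return tokenize(" ".join(sentences))


def link(text: str, mention: str, index: dict, n_docs: int):
    """Return the entity whose document best matches the query, or None."""
    scores = Counter()
    for term in build_query(text, mention):
        postings = index.get(term, {})
        if not postings:
            continue
        idf = math.log(1 + n_docs / len(postings))
        for entity_id, tf in postings.items():
            scores[entity_id] += tf * idf
    return scores.most_common(1)[0][0] if scores else None


# Hypothetical entity documents, standing in for Wikipedia pages.
ENTITY_DOCS = {
    "Michael_Jordan": "Michael Jordan is an American former basketball player "
                      "who won six championships with the Chicago Bulls.",
    "Michael_I._Jordan": "Michael I. Jordan is an American researcher in machine "
                         "learning, statistics and artificial intelligence.",
}
INDEX = build_index(ENTITY_DOCS)

print(link("Michael Jordan joined the Bulls in 1984.", "Michael Jordan", INDEX, len(ENTITY_DOCS)))
print(link("Michael Jordan published a paper on machine learning.", "Michael Jordan", INDEX, len(ENTITY_DOCS)))
```

The example also hints at why short inputs are hard: with only one or two content words around the mention, the query carries little evidence to separate the candidates.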

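6. Unlinkable mention prediction:

Because knowledge graphs are incomplete, not every entity mention has a corresponding entity in the knowledge graph. An entity linking system usually links such mentions to a special "empty entity" (denoted NIL). This task is unlinkable mention prediction.
Three strategies are commonly used for it:

  • If candidate generation returns an empty set for a mention, the linking result for that mention is NIL;
  • If the score of a mention's top-ranked candidate is below a preset threshold, the linking result for that mention is NIL. The threshold is usually set according to the system's performance on annotated data;
  • Given a mention and its top-ranked candidate, classify the pair with a binary classifier. If the output is 1, return the candidate as the linking result; otherwise the linking result is NIL. Alternatively, NIL can be added as a special entity to every mention's candidate set and scored and ranked along with the others.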
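The first two strategies, together with choosing the threshold on annotated data, can be sketched as below. The function names, the default threshold of 0.5, the candidate scores, and the use of plain accuracy as the tuning criterion are all assumptions made for illustration; the classifier-based third strategy is not shown.

```python
NIL = "NIL"


def predict_link(scored_candidates, threshold=0.5):
    """scored_candidates: list of (entity_id, score) pairs for one mention."""
    if not scored_candidates:
        return NIL  # strategy 1: candidate generation returned nothing
    best_entity, best_score = max(scored_candidates, key=lambda pair: pair[1])
    if best_score < threshold:
        return NIL  # strategy 2: the top candidate scores below the threshold
    return best_entity


def tune_threshold(dev_set, grid):
    """Pick the threshold from `grid` with the best accuracy on labeled data.

    dev_set: list of (scored_candidates, gold_label), where gold_label is an
    entity id or NIL.
    """
    def accuracy(threshold):
        hits = sum(predict_link(cands, threshold) == gold for cands, gold in dev_set)
        return hits / len(dev_set)

    return max(grid, key=accuracy)


# Hypothetical development set with invented scores.
DEV_SET = [
    ([("A", 0.9)], "A"),
    ([("B", 0.4)], NIL),
    ([("C", 0.6)], "C"),
    ([], NIL),
]

best = tune_threshold(DEV_SET, grid=[0.1, 0.5, 0.8])
print(best, predict_link([("D", 0.3)], threshold=best))
```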

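7. Summary:

Entity linking is very important for intelligent question answering systems. Correctly identifying the knowledge graph entities mentioned in a question not only helps the system understand the question and determine the question and answer types, but also lets the entity serve as a bridge to further related information in the knowledge graph that can support answer ranking or answer generation.

In question answering scenarios, real questions are usually short, knowledge graphs are incomplete, and annotated datasets for entity linking are limited, so the task still faces many open problems and challenges. Future research needs annotated data of larger scale and higher coverage in order to train more robust entity linking systems. In addition, entity linking needs to be integrated with the question answering system and trained end to end, which helps avoid the error propagation that separate submodules can cause.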



 

