[论文阅读笔记42]BioSyn

39 篇文章 14 订阅

题目

Biomedical Entity Representations with Synonym Marginalization

具有同义词边缘化的生物医学实体表示

Korea University (韩国)高丽大学

代码:https://github.com/dmis-lab/BioSyn

Sung M , Jeon H , Lee J , et al. Biomedical Entity Representations with Synonym Marginalization[J]. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020.

摘要

专注于学习仅基于实体的同义词的生物医学实体的表示。
提出BIOSYN.

背景

不同字符有相同的含义,相同的含义有不同的表面字符;
一般的解决方法是采用二元分类来处理,如果相同就是正样本,否则为负样本。

论文提出:BIOSYN – 使用同义词边缘化技术,它最大限度地将top候选对象中所有同义词表示的概率最大化。

采用稀疏与稠密的方法分别去捕捉“morphological”与“semantic”的信息;

方法

贡献:提出BIOSYN模型,它是基于同义词边缘化的思想;

以前的工作都是pair-wise训练模型的,且明确要求负样本对的;负样本的方法,负样本的采样对结果的影响是十分大的。本论文的工作是基于边缘化正样本的方法。

基于检索的思想去研究相关 – maximum inner product search (MIPS) 【计算最大内积搜索】

  1. 问题定义

    image-20210618114256977

    对于输入的m, CUI(·)返回同义词n的CUI,其中θ 为模型参数。N是所有同义词休,n为其中的元素。

  2. 模型

image-20210618114002585

Mention与Dictionary采用同等的Encoder来编码,它们是共享的,接着就是内积;

在训练阶段迭代更新top候选与基于表达计算marginal同义词概率;

在预测阶段,使用MIPS来计算最相近的同义词;

稀疏实体表示

e_s_me_s_n分别表示输入与同义词的tf-idf稀疏表示,稀疏相似定义为:

image-20210618140354316

f(··)表示相似函数,通过两向量的内积来计算。

密集实体表示

稀疏表示实现了形态学的编码表示,密集表示则是是语义信息编码表示;

学习有效的密集表示是实体标准化的一个关键挑战;

这里使用BioBERT来编码。【Biobert: a pre-trained biomedical language representation model for biomedical text mining – 2019】

预训练BioBERT,fifine-tune是使用 synonym marginalization algorithm;

image-20210618141428860

m = {**m1*, …, ml}*,是subword序列,由Word-Piece tokenizer分隔开的子词集合;[CLS] 表示输入的输出向量,即是表示这个m的向量。

image-20210618142030159

这个f也是内积的相似函数。

Similarity Function(相似函数)

image-20210618142211744

其中,λ是sparse分类,它是可训练标量权重。

  1. 训练

    基于模型侯选检索与最大化同义词正向边缘概率的方法。在这个框架中,使用实体编码器来迭代地更新顶级候选者。

    Iterative Candidate Retrieval

    这步就像是召回。从大量的候选集中选择小部分来训练。

    k: 表示对于训练集检索出来top候选的总数;

    a: 表示来自dense候选的比率。(0 α 1)

    [ak]个S_dense候选, k - [ak]个S_sparse候选.

    Synonym Marginalization

    image-20210618144140753

    分母是前k个候选之和。

    对于m的同义正向边缘概率定义为:

    image-20210618145231255

    EQUAL(m, n)为1时, CUI(m)等价于CUI(n).

    损失函数

    image-20210618145557131

    M表示mentions的总数;

  2. 预测

    预测时只是计算S(m, n) 就可以了,然后选择最近似的一个就OK了。

实验

预处理:大小写,标点符号,拼写错误,缩写(Ab3P),组合概念词(启发式规则);

https://github.com/ncbi-nlp/Ab3P

对于稀疏: tf-idf方法,使用uni-, bi-grams.

k = 20 – 候选数

a = 0.5 – dense的占比

学习率 = 1e-5

weight decay = 1e-2

mini-batch size = 16

λ = 2~4

数据集

image-20210618150855857

NCBI Disease Corpus:https://www.ncbi.nlm.nih.gov/CBBresearch/Dogan/DISEASE

Biocreative V CDR:https://biocreative.bioinformatics.udel.edu/tasks/biocreative-v/track-3-cdr/

TAC2017ADR:https://bionlp.nlm.nih.gov/tac2017adversereactions

结果

image-20210618151533160

​ BIOSYN(S-SCORE): 只使用sparse scores来推理预测;

​ BIOSYN(D-SCORE): 只使用dense scores来推理预测;

​ BIOSYN (α = 0*.*0): 只使用sparse candidates来训练;

​ BIOSYN (α = 1*.*0): 只使用dense candidates来训练;

迭代候选检索过程

image-20210618153340234

候选集数量效果进行研究

更高的候选数不会提高更高的准确率

image-20210618151900402

同义词边缘化研究

marginal maximum likelihood (MML)与其它损失函数对比:hard EM, standard pair-wise;

Memory augmented policy optimization for program synthesis and semantic parsing – 2018

Dnorm: disease name normalization with pairwise learning to rank – 2013

image-20210618152335266

分析

Iterative Candidate Samples

image-20210618152534864

Error Analysis

相关工作

  1. 生物医学实体表达依赖于生物医学词表达:

Word2vec

Distributed representations of words and phrases and their compositionality – 2013

Distributional semantics resources for biomedical text processing – 2013, PubMed语料

生物医版的word2Vec广泛应用于其它任务上,标准化任务也不例外:《Medical entity linking using triplet network》 - 2019

BioBERT

Biobert: a pre-trained biomedical language representation model for biomedical text mining – 2019 – 基于bert模型使用生物语料进行训练的模型

  1. 任务问题陈述

对生物医学实体表示质量评价通常是通过生物医学实体标准化任务来验证;

目标:将生物医学文本Mention映射到字典中相关的CUI(概念唯一ID);

任务相关:entity linking,entity grounding;

挑战:生物医学领域有大量的同义词;

相关论文:

Dnorm: disease name normaliza tion with pairwise learning to rank – 2013

Robust representation learning of biomedical names – 2019

Sieve-based entity linking for the biomedical domain – 2015

Taggerone: joint named entity recognition and normalization with semi-markov models – 2016

  1. 传统的标准化方法 – 基于手工规则进行

    DNorm, CNN-based ranking method,NSEEN(与论文相似),BNE(与论文相似)

    Nseen: Neural semantic embedding for entity normalization – 2019

    Robust representation learning of biomedical names – 2019 – BNE

    使用LSTM模型将字典中提到的概念名称映射到潜在空间,并使用负采样技术改进了嵌入。

总结

不知道这个方法用到中文标准化会怎么样?

参考

代码:https://github.com/dmis-lab/BioSyn

论文:https://arxiv.org/abs/2005.00239

  • 0
    点赞
  • 3
    收藏
    觉得还不错? 一键收藏
  • 0
    评论
TAC KBP Chinese Entity Linking Comprehensive Training and Evaluation Data 2011-2014 LDC2015E17 March 20, 2015 Linguistic Data Consortium 1. Overview Text Analysis Conference (TAC) is a series of workshops organized by the National Institute of Standards and Technology (NIST). TAC was developed to encourage research in natural language processing (NLP) and related applications by providing a large test collection, common evaluation procedures, and a forum for researchers to share their results. Through its various evaluations, the Knowledge Base Population (KBP) track of TAC encourages the development of systems that can match entities mentioned in natural texts with those appearing in a knowledge base and extract novel information about entities from a document collection and add it to a new or existing knowledge base. The goal of Entity Linking is to determine whether or not the entity referred to in each query has a matching entity node in the reference Knowledge Base (KB) (LDC2014T16). If there is a matching node for a query, annotators create a link between the two. If there is not a matching node for a query, the entity is marked as 'NIL' and then clustered with other NIL entities into equivalence classes. For more information, please refer to the Entity Linking section of NIST's 2014 TAC KBP website (2014 was the last year in which the Chinese Entity Linking evaluation was conducted as of the time this package was created) at http://nlp.cs.rpi.edu/kbp/2014/ This package contains all evaluation and training data developed in support of TAC KBP Chinese Entity Linking during the four years since the task's inception in 2011. This includes queries, KB links, equivalence class clusters for NIL entities (those that could not be linked to an entity in the knowledge base), and entity type information for each of the queries. The data included in this package were originally released by LDC to TAC KBP coordinators and performers under the following ecorpora catalog IDs and titles: LDC2011E46: TAC 2011 KBP Cross-lingual Sample Entity Linking Queries V1.1 LDC2011E55: TAC 2011 KBP Cross-lingual Training Entity Linking V1.1 LDC2012E34: TAC 2011 KBP Cross-Lingual Evaluation Entity Linking Annotation LDC2012E66: TAC 2012 KBP Chinese Entity Linking Web Training Queries and Annotations LDC2012E103: TAC 2012 KBP Chinese Entity Linking Evaluation Annotations V1.2 LDC2013E96: TAC 2013 KBP Chinese Entity Linking Evaluation Queries and Knowledge Base Links V1.2 LDC2014E47: TAC 2014 KBP Chinese Entity Linking Discussion Forum Training Data LDC2014E83: TAC 2014 KBP Chinese Entity Linking Evaluation Queries and Knowledge Base Links V2.0 2. Contents ./README.txt This file ./data/2011/eval/tac_kbp_2011_chinese_entity_linking_evaluation_queries.xml This file contains 2176 queries. Each query entry consists of the following fields: <query id> - A query ID formatted as the letters "EL_CLCMN_" (if a Chinese language query) or "EL_CLENG_" (if an English language query) plus a five-digit zero-padded, sequentially assigned integer (e.g. "EL_CLCMN_00001"). <name> - The full namestring of the query entity. <docid> - An ID for a document in ./data/2011/eval/source_documents/ from which the namestring was extracted. The queries are distributed by language and type as follows: KB-Link GPE ORG PER Total ---------------------------------------- CMN NW NIL: 120 291 420 831 CMN NW Non-NIL: 279 150 221 650 ENG NW NIL: 90 129 20 239 ENG NW Non-NIL: 93 72 104 269 ENG WB NIL: 16 0 5 21 ENG WB Non-NIL: 44 68 54 166 ---------------------------------------- Total: 624 710 824 2176 ./data/2011/eval/tac_kbp_2011_chinese_entity_linking_evaluation_KB_links.tab This file contains the responses for each query as identified by human annotators at LDC. This file is tab delimited, with 4 fields total. The column descriptions are as follows: 1. query ID - The ID for the query detailed in tac_kbp_2011_chinese_entity_linking_evaluation_queries.xml to which the subsequent information pertains 2. entity ID - A unique entity node ID or NIL ID, correspondent to entity linking annotation and NIL-coreference (clustering) annotation respectively. If the entity node ID begins with "E", the text refers to an entity in the Knowledge Base (TAC KBP Reference Knowledge Base - LDC2014T16). If the given query is not linked to an entity in the Knowledge Base (KB), then it is given a NIL-ID, which consists of "NIL" plus a four-digit zero-padded sequentially assigned integer (e.g. NIL-0001, NIL-0002). Both the entities with an entity node ID of "E" type and "NIL" type are assumed to be co-referenced (clustered), with the same "E" type ID or the same "NIL" ID if they refer to the same entity. Each "E" type ID and NIL ID is distinct from one another. 3. entity-type - GPE, ORG, or PER type indicator for the entity 4. genre - WB/NW/DF indicating the source genre of the document for the query (WB for web data, NW for newswire data, or DF for discussion forum data). ./data/2011/eval/source_documents/* This directory contains all of the source documents listed in the <docid> attribute for each query in tac_kbp_2011_chinese_entity_linking_evaluation_queries.xml. See section 5 for more information about source documents. ./data/2011/training/tac_kbp_2011_chinese_entity_linking_sample_and_training_queries.xml This file is a concatenation of the queries files originally released in LDC2011E46 (sample) and LDC2011E55 (training). This file contains 2171 queries. Each query entry consists of the following fields: <query id> - A query ID formatted as the letters "EL_CLCMN_" (if a Chinese language query) or "EL_CLENG_" (if an English language query) plus a five-digit zero-padded, sequentially assigned integer (e.g. "EL_CLCMN_00001"). <name> - The full namestring of the query entity. <docid> - An ID for a document in ./data/2011/training/source_documents/ from which the namestring was extracted. The queries are distributed by language and type as follows: KB-Link GPE ORG PER Total ---------------------------------------- CMN NW NIL: 124 293 426 843 CMN NW Non-NIL: 284 149 227 660 ENG NW NIL: 143 116 63 322 ENG NW Non-NIL: 122 100 100 322 ENG WB NIL: 0 1 0 1 ENG WB Non-NIL: 14 3 6 23 ---------------------------------------- Total: 687 662 822 2171 ./data/2011/training/tac_kbp_2011_chinese_entity_linking_sample_and_training_KB_links.tab This file is a concatenation of the KB_links files originally released in LDC2011E46 (sample) and LDC2011E55 (training). This file contains the responses for each query as identified by human annotators at LDC. This file is tab delimited, with 4 fields total. The column descriptions are as follows: 1. query ID - The ID for the query detailed in tac_kbp_2011_chinese_entity_linking_sample_and_training_queries.xml to which the subsequent information pertains 2. entity ID - A unique entity node ID or NIL ID, correspondent to entity linking annotation and NIL-coreference (clustering) annotation respectively. If the entity node ID begins with "E", the text refers to an entity in the Knowledge Base (TAC KBP Reference Knowledge Base - LDC2014T16). If the given query is not linked to an entity in the Knowledge Base (KB), then it is given a NIL-ID, which consists of "NIL" plus a four-digit zero-padded sequentially assigned integer (e.g. NIL-0001, NIL-0002). Both the entities with an entity node ID of "E" type and "NIL" type are assumed to be co-referenced (clustered), with the same "E" type ID or the same "NIL" ID if they refer to the same entity. Each "E" type ID and NIL ID is distinct from one another. 3. entity-type - GPE, ORG, or PER type indicator for the entity 4. genre - WB/NW/DF indicating the source genre of the document for the query (WB for web data, NW for newswire data, or DF for discussion forum data). ./data/2011/training/source_documents/* This directory contains all of the source documents listed in the <docid> of tac_kbp_2011_chinese_entity_linking_sample_and_training_queries.xml See section 5 for more information about source documents. ./data/2012/eval/tac_kbp_2012_chinese_entity_linking_evaluation_queries.xml This file contains 2122 queries. Each query entry consists of the following fields: <query id> - A query ID formatted as the letters "EL_CMN_" plus a five-digit zero-padded, sequentially assigned integer (e.g., "EL_CMN_00001"). <name> - The full namestring of the query entity. <docid> - An ID for a document in ./data/2012/eval/source_documents/ from which the namestring was extracted. <beg> - The starting offset for the namestring. <end> - The ending offset for the namestring. The queries are distributed by language and type as follows: KB-Link GPE ORG PER Total ---------------------------------------- CMN NW NIL: 99 89 167 355 CMN NW Non-NIL: 164 167 148 479 CMN WB NIL: 88 86 68 242 CMN WB Non-NIL: 131 112 110 353 ENG NW NIL: 90 79 68 237 ENG NW Non-NIL: 101 107 83 291 ENG WB NIL: 6 26 16 48 ENG WB Non-NIL: 26 52 39 117 ---------------------------------------- Total: 705 718 699 2122 ./data/2012/eval/tac_kbp_2012_chinese_entity_linking_evaluation_KB_links.tab This file contains the responses for each query as identified by human annotators at LDC. This file is tab delimited, with 5 fields total. The column descriptions are as follows: 1. query ID - The ID for the query detailed in tac_kbp_2012_chinese_entity_linking_evaluation_queries.xml to which the subsequent information pertains 2. entity ID - A unique entity node ID or NIL ID, correspondent to entity linking annotation and NIL-coreference (clustering) annotation respectively. If the entity node ID begins with "E", the text refers to an entity in the Knowledge Base (TAC KBP Reference Knowledge Base - LDC2014T16). If the given query is not linked to an entity in the Knowledge Base (KB), then it is given a NIL-ID, which consists of "NIL" plus a three-digit zero-padded sequentially assigned integer (e.g. NIL001, NIL002). Both the entities with an entity node ID of "E" type and "NIL" type are assumed to be co-referenced (clustered), with the same "E" type ID or the same "NIL" ID if they refer to the same entity. Each "E" type ID and NIL ID is distinct from one another. 3. entity-type - GPE, ORG, or PER type indicator for the entity 4. genre - WB/NW/DF indicating the source genre of the document for the query (WB for web data, NW for newswire data, or DF for discussion forum data). 5. web-search - (Y/N) indicating whether the annotator made use of web searches in order to make the linking judgment. ./data/2012/eval/source_documents/* This directory contains all of the source documents listed in the <docid> of tac_kbp_2012_chinese_entity_linking_evaluation_queries.xml See section 5 for more information about source documents. ./data/2012/training/tac_kbp_2012_chinese_entity_linking_training_queries.xml This file contains 158 queries. Each query entry consists of the following fields: <query id> - A query ID formatted as the letters "EL_CMN_" plus a five-digit zero-padded, sequentially assigned integer (e.g., "EL_CMN_00001"). <name> - The full namestring of the query entity. <docid> - An ID for a document in ./data/2012/training/source_documents/ from which the namestring was extracted. <beg> - The starting offset for the namestring. <end> - The ending offset for the namestring. The queries are distributed by language and type as follows: KB-Link GPE ORG PER Total ---------------------------------------- CMN NW NIL: 2 2 2 6 CMN NW Non-NIL: 0 2 0 2 CMN WB NIL: 16 16 17 49 CMN WB Non-NIL: 24 25 24 73 ENG WB NIL: 3 4 0 7 ENG WB Non-NIL: 7 5 9 21 ---------------------------------------- Total: 52 54 52 158 ./data/2012/training/tac_kbp_2012_chinese_entity_linking_training_KB_links.tab This file contains the responses for each query as identified by human annotators at LDC. This file is tab delimited, with 5 fields total. The column descriptions are as follows: 1. query ID - The ID for the query detailed in tac_kbp_2012_chinese_entity_linking_training_queries.xml to which the subsequent information pertains 2. entity ID - A unique entity node ID or NIL ID, correspondent to entity linking annotation and NIL-coreference (clustering) annotation respectively. If the entity node ID begins with "E", the text refers to an entity in the Knowledge Base (TAC KBP Reference Knowledge Base - LDC2014T16). If the given query is not linked to an entity in the Knowledge Base (KB), then it is given a NIL-ID, which consists of "NIL" plus a three-digit zero-padded sequentially assigned integer (e.g. NIL001, NIL002). Both the entities with an entity node ID of "E" type and "NIL" type are assumed to be co-referenced (clustered), with the same "E" type ID or the same "NIL" ID if they refer to the same entity. Each "E" type ID and NIL ID is distinct from one another. 3. entity-type - GPE, ORG, or PER type indicator for the entity 4. genre - WB/NW/DF indicating the source genre of the document for the query (WB for web data, NW for newswire data, or DF for discussion forum data). 5. web-search - (Y/N) indicating whether the annotator made use of web searches in order to make the linking judgment. ./data/2012/training/source_documents/* This directory contains all of the source documents listed in the <docid> of tac_kbp_2012_chinese_entity_linking_training_queries.xml See section 5 for more information about source documents. ./data/2013/eval/tac_kbp_2013_chinese_entity_linking_evaluation_queries.xml This file contains 2155 queries. Each query entry consists of the following fields: <query id> - A query ID formatted as the letters "EL13_CMN" plus a four-digit zero-padded, sequentially assigned integer (e.g., "EL13_CMN_0001"). <name> - The full namestring of the query entity. <docid> - An ID for a document in ./data/2013/eval/source_documents/ from which the namestring was extracted. <beg> - The starting offset for the namestring. <end> - The ending offset for the namestring. The queries are distributed by language and type as follows: KB-Link PER ORG GPE Total ----------------------------------------- CMN NW NIL: 123 197 125 445 CMN NW Non-NIL: 124 119 163 406 CMN WB NIL: 112 105 87 304 CMN WB Non-NIL: 173 150 162 485 ENG NW NIL: 52 16 68 136 ENG NW Non-NIL: 83 87 64 234 ENG WB NIL: 11 19 7 37 ENG WB Non-NIL: 28 42 38 108 ----------------------------------------- Total: 706 735 714 2155 ./data/2013/eval/tac_kbp_2013_chinese_entity_linking_evaluation_KB_links.tab This file contains the responses for each query as identified by human annotators at LDC. This file is tab delimited, with 6 fields total. The column descriptions are as follows: 1. query ID - The ID for the query detailed in tac_kbp_2013_chinese_entity_linking_evaluation_queries.xml to which the subsequent information pertains 2. entity ID - A unique entity node ID or NIL ID, correspondent to entity linking annotation and NIL-coreference (clustering) annotation respectively. If the entity node ID begins with "E", the text refers to an entity in the Knowledge Base (TAC KBP Reference Knowledge Base - LDC2014T16). If the given query is not linked to an entity in the Knowledge Base (KB), then it is given a NIL-ID, which consists of "NIL" plus a three-digit zero-padded sequentially assigned integer (e.g. NIL001, NIL002). Both the entities with an entity node ID of "E" type and "NIL" type are assumed to be co-referenced (clustered), with the same "E" type ID or the same "NIL" ID if they refer to the same entity. Each "E" type ID and NIL ID is distinct from one another. 3. entity-type - GPE, ORG, or PER type indicator for the entity 4. genre - WB/NW/DF indicating the source genre of the document for the query (WB for web data, NW for newswire data, or DF for discussion forum data). 5. web-search - (Y/N) indicating whether the annotator made use of web searches in order to make the linking judgment. 6. wiki text - (Y/N) indicating whether the annotator made use of the wiki text in the knowledge base (as opposed to just the infobox information) in order to make the linking judgment. ./data/2013/eval/source_documents/* This directory contains all of the source documents listed in the <docid> of tac_kbp_2013_chinese_entity_linking_evaluation_queries.xml See section 5 for more information about source documents. ./data/2014/eval/tac_kbp_2014_chinese_entity_linking_evaluation_queries.xml This file contains 2739 queries. Each query entry consists of the following fields: <query id> - A query ID formatted as the letters "EL14_CMN_" plus a four-digit zero-padded, sequentially assigned integer (e.g., "EL14_CMN_0001"). <name> - The full namestring of the query entity. <docid> - An ID for a document in ./data/2014/eval/source_documents/ from which the namestring was extracted. <beg> - The starting offset for the namestring. <end> - The ending offset for the namestring. The queries are distributed by language and type as follows: KB-Link PER ORG GPE Total --------------------------------------------- CMN DF NIL: 118 40 16 174 CMN DF Non-NIL: 426 61 66 553 CMN NW NIL: 179 413 300 892 CMN NW Non-NIL: 349 139 184 672 ENG DF NIL: 1 4 5 10 ENG DF Non-NIL: 5 26 25 56 ENG NW NIL: 10 65 32 107 ENG NW Non-NIL: 87 66 119 272 ENG WB Non-NIL: 1 0 2 3 --------------------------------------------- Total: 1176 814 749 2739 ./data/2014/eval/tac_kbp_2014_chinese_entity_linking_evaluation_KB_links.tab This file contains the responses for each query as identified by human annotators at LDC. This file is tab delimited, with 6 fields total. The column descriptions are as follows: 1. query ID - The ID for the query detailed in tac_kbp_2014_chinese_entity_linking_evaluation_queries.xml to which the subsequent information pertains 2. entity ID - A unique entity node ID or NIL ID, correspondent to entity linking annotation and NIL-coreference (clustering) annotation respectively. If the entity node ID begins with "E", the text refers to an entity in the Knowledge Base (TAC KBP Reference Knowledge Base - LDC2014T16). If the given query is not linked to an entity in the Knowledge Base (KB), then it is given a NIL-ID, which consists of "NIL" plus a three-digit zero-padded sequentially assigned integer (e.g. NIL001, NIL002). Both the entities with an entity node ID of "E" type and "NIL" type are assumed to be co-referenced (clustered), with the same "E" type ID or the same "NIL" ID if they refer to the same entity. Each "E" type ID and NIL ID is distinct from one another. 3. entity-type - GPE, ORG, or PER type indicator for the entity 4. genre - WB/NW/DF indicating the source genre of the document for the query (WB for web data, NW for newswire data, or DF for discussion forum data). 5. web-search - (Y/N) indicating whether the annotator made use of web searches in order to make the linking judgment. 6. wiki text - (Y/N) indicating whether the annotator made use of the wiki text in the knowledge base (as opposed to just the infobox information) in order to make the linking judgment. ./data/2014/eval/source_documents/* This directory contains all of the source documents listed in the <docid> of tac_kbp_2014_chinese_entity_linking_evaluation_queries.xml See section 5 for more information about source documents. ./data/2014/training/tac_kbp_2014_chinese_entity_linking_training_queries.xml This file contains 514 queries. Each query entry consists of the following fields: <query id> - A query ID formatted as the letters "EL14_CMN_TRAINING" plus a four-digit zero-padded, sequentially assigned integer (e.g., "EL14_CMN_TRAINING_0001"). <name> - The full namestring of the query entity. <docid> - An ID for a document in ./data/2014/training/source_documents/ from which the namestring was extracted. <beg> - The starting offset for the namestring. <end> - The ending offset for the namestring. The queries are distributed by language and type as follows: KB-Link PER ORG GPE Total ----------------------------------------- ENG DF NIL: 1 6 3 10 ENG DF Non-NIL: 33 37 41 111 CMN DF NIL: 28 46 6 80 CMN DF Non-NIL: 109 83 121 313 ----------------------------------------- Total: 171 172 171 514 ./data/2014/training/tac_kbp_2014_chinese_entity_linking_training_KB_links.tab This file contains the responses for each query as identified by human annotators at LDC. This file is tab delimited, with 6 fields total. The column descriptions are as follows: 1. query ID - The ID for the query detailed in tac_kbp_2014_chinese_entity_linking_training_queries.xml to which the subsequent information pertains 2. entity ID - A unique entity node ID or NIL ID, correspondent to entity linking annotation and NIL-coreference (clustering) annotation respectively. If the entity node ID begins with "E", the text refers to an entity in the Knowledge Base (TAC KBP Reference Knowledge Base - LDC2014T16). If the given query is not linked to an entity in the Knowledge Base (KB), then it is given a NIL-ID, which consists of "NIL" plus a three-digit zero-padded sequentially assigned integer (e.g. NIL001, NIL002). Both the entities with an entity node ID of "E" type and "NIL" type are assumed to be co-referenced (clustered), with the same "E" type ID or the same "NIL" ID if they refer to the same entity. Each "E" type ID and NIL ID is distinct from one another. 3. entity-type - GPE, ORG, or PER type indicator for the entity 4. genre - WB/NW/DF indicating the source genre of the document for the query (all DF or discussion forum threads in these data). 5. web-search - (Y/N) indicating whether the annotator made use of web searches in order to make the linking judgment. 6. wiki text - (Y/N) indicating whether the annotator made use of the wiki text in the knowledge base (as opposed to just the infobox information) in order to make the linking judgment. ./data/2014/training/source_documents/* This directory contains all of the source documents listed in the <docid> of tac_kbp_2014_chinese_entity_linking_training_queries.xml See section 5 for more information about source documents. ./dtd/2011_kbpentlink.dtd DTD for: tac_kbp_2011_chinese_entity_linking_evaluation_queries.xml tac_kbp_2011_chinese_entity_linking_sample_and_training_queries.xml ./dtd/2012_2013_2014_kbpentlink.dtd DTD for: tac_kbp_2012_chinese_entity_linking_evaluation_queries.xml tac_kbp_2012_chinese_entity_linking_training_queries.xml tac_kbp_2013_chinese_entity_linking_evaluation_queries.xml tac_kbp_2014_chinese_entity_linking_evaluation_queries.xml tac_kbp_2014_chinese_entity_linking_training_queries.xml 3. Annotation Given a name string and using information from the query's source document, bilingual Chinese/English-speaking annotators used a specialized search engine to look in the Knowledge Base for a page in which the entity referred to by the query was the central topic. If such a page was found, a link was created between the query and the matching KB node ID. If no matching page was found, the query was marked as NIL and later coreferenced with other NIL entities. Annotators were allowed to use online searching to assist in determining the KB link/NIL status. Queries for which a human annotator could not confidently determine the KB link status were removed from the final data sets. 4. Text Normalization Name string matches are case and punctuation sensitive. The only text normalization performed was: 1. conversion of newlines to spaces, except where preceding characters were hyphens ("-"), in which case newlines were removed 2. conversion of multiple spaces to a single space 5. Source Documents All the text data in the source files have been taken directly from previous LDC corpus releases, and are being provided here essentially "as-is", with little or no additional quality control. An overall scan of character content in the source collections indicates some relatively small quantities of various problems, especially in the web and discussion forum data, including language mismatch (characters from Chinese, Korean, Japanese, Arabic, Russian, etc.), and encoding errors (some documents have apparently undergone "double encoding" into UTF-8, and others may have been "noisy" to begin with, or may have gone through an improper encoding conversion, yielding occurrences of the Unicode "replacement character" (U+FFFD) throughout the corpus); the web collection also has characters whose Unicode code points lie outside the "Basic Multilanguage Plane" (BMP), i.e. above U+FFFF. All documents that have filenames beginning with "cmn-NG" and "eng-NG" are Web Document data (WB) and some of these fail XML parsing (see below for details). All files that start with "bolt-" are Discussion Forum threads (DF) and have the XML structure described below. All other files are Newswire data (NW) and have the newswire markup pattern detailed below. Note as well that some source documents are duplicated across a few of the separated source_documents directories, indicating that some queries from different data sets originated from the same source documents. As it is acceptable for sources to be reused for Entity Linking queries, this duplication is intentional and expected. The subsections below go into more detail regarding the markup and other properties of the three source data types: 5.1 Newswire Data Newswire data use the following markup framework: <DOC id="{doc_id_string}" type="{doc_type_label}"> <HEADLINE> ... </HEADLINE> <DATELINE> ... </DATELINE> <TEXT> <P> ... </P> ... </TEXT> </DOC> where the HEADLINE and DATELINE tags are optional (not always present), and the TEXT content may or may not include "<P> ... </P>" tags (depending on whether or not the "doc_type_label" is "story"). All the newswire files are parseable as XML. 5.2 Discussion Forum Data Discussion forum files use the following markup framework: <doc id="{doc_id_string}"> <headline> ... </headline> <post ...> ... <quote ...> ... </quote> ... </post> ... </doc> where there may be arbitrarily deep nesting of quote elements, and other elements may be present (e.g. "<a...>...</a>" anchor tags). As mentioned in section 2 above, each <doc> unit contains at least five post elements. All the discussion forum files are parseable as XML. 5.3 Web Document Data "Web" files use the following markup framework: <DOC> <DOCID> {doc_id_string} </DOCID> <DOCTYPE> ... </DOCTYPE> <DATETIME> ... </DATETIME> <BODY> <HEADLINE> ... </HEADLINE> <TEXT> <POST> <POSTER> ... </POSTER> <POSTDATE> ... </POSTDATE> ... </POST> </TEXT> </BODY> </DOC> Other kinds of tags may be present ("<QUOTE ...>", "<A >", etc). Some of the web source documents contain material that interferes with XML parsing (e.g. unescaped "&", or "<QUOTE>" tags that lack a corresponding "</QUOTE>"). 6. Using the Data 6.1 Offset calculation The values of the beg and end XML elements in the later queries.xml files indicate character offsets to identify text extents in the source. Offset counting starts from the initial character (character 0) of the source document and includes newlines and all markup characters - that is, the offsets are based on treating the source document file as "raw text", with all its markup included. 6.2 Proper ingesting of XML queries While the character offsets are calculated based on treating the source document as "raw text", the "name" strings being referenced by the queries sometimes contain XML metacharacters, and these had to be "re-escaped" for proper inclusion in the queries.xml file. For example, an actual name like "AT&T" may show up a source document file as "AT&T" (because the source document was originally formatted as XML data). But since the source doc is being treated here as raw text, this name string is treated in queries.xml as having 7 characters (i.e., the character offsets, when provided, will point to a string of length 7). However, the "name" element itself, as presented in the queries.xml file, will be even longer - "AT&amp;T" - because the queries.xml file is intended to be handled by an XML parser, which will return "AT&T" when this "name" element is extracted. Using the queries.xml data without XML parsing would yield a mismatch between the "name" value and the corresponding string in the source data. 7. Copyright Information (c) 2015 Trustees of the University of Pennsylvania 8. Contact Information For further information about this data release, contact the following project staff at LDC: Joseph Ellis, Project Manager <joellis@ldc.upenn.edu> Jeremy Getman, Lead Annotator <jgetman@ldc.upenn.edu> Stephanie Strassel, PI <strassel@ldc.upenn.edu> -------------------------------------------------------------------------- README created by Jeremy Getman on February 4, 2015 updated by Joe Ellis on February 16, 2015 updated by Jeremy Getman on February 17, 2015 updated by Joe Ellis on March 18, 2015
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值