Complex Named Entity Recognition

-Sussurro-

Last modified: 2022-03-21 11:12:58

Category: Complex Named Entity Recognition. Tags: knowledge graph, natural language processing, deep learning

First published: 2022-03-11 16:22:01

Article link: https://blog.csdn.net/weixin_44315848/article/details/122288493

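1 Common Terminology

Mention:

See: How should the term "Mention", which appears frequently in named entity recognition papers, be understood?

Span:

See:

Token:

See: What do "Token" and "Tokenization" mean in NLP?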
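As a rough illustration of how the three terms relate, the sketch below assumes the usage common in NER papers: a token is one unit produced by tokenization, a span is a contiguous range of tokens, and a mention is a labeled span that refers to an entity. The sentence, the labels, and the helper names `span_text` and `is_nested` are hypothetical and chosen only for this example.

```python
# Hypothetical example sentence; token indices are 0-based and span ends are exclusive.
sentence = "The University of Tokyo hospital treats patients"
tokens = sentence.split()  # naive whitespace tokenization, for illustration only


def span_text(tokens, start, end):
    """Return the text covered by the token span [start, end)."""
    return " ".join(tokens[start:end])


# Each mention is (start, end, label). The second mention lies inside the first,
# which is the "nested" case that complex NER has to handle.
mentions = [
    (0, 5, "ORG"),  # "The University of Tokyo hospital"
    (3, 4, "GPE"),  # "Tokyo"
]


def is_nested(inner, outer):
    """True if span `inner` lies inside span `outer` and the two spans differ."""
    return (
        outer[0] <= inner[0]
        and inner[1] <= outer[1]
        and inner[:2] != outer[:2]
    )


for start, end, label in mentions:
    print(label, "->", span_text(tokens, start, end))
print("nested:", is_nested(mentions[1], mentions[0]))
```

A flat sequence tagger assigns one label per token, so it cannot represent both mentions at once; this is why nested datasets such as ACE and GENIA are usually handled with span-based or hypergraph-based models.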

2 Complex Named Entity Recognition

2.1 Datasets

ACE
Links:
https://www.ldc.upenn.edu/collaborations/past-projects/ace
ACE2004: https://catalog.ldc.upenn.edu/LDC2005T09
ACE2005: https://catalog.ldc.upenn.edu/LDC2006T06
ACE corpus:
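The ACE 2004 Multilingual Training Corpus was developed by the Linguistic Data Consortium (LDC). It contains text of various genres in English (158,000 words), Chinese (307,000 characters, 154,000 words), and Arabic (151,000 words), annotated for entities and relations. The corpus represents the complete set of English, Arabic, and Chinese training data for the 2004 Automatic Content Extraction (ACE) technology evaluation. It was created by LDC with support from the ACE program and additional assistance from the DARPA TIDES (Translingual Information Detection, Extraction and Summarization) program. The data was previously distributed to participants in the 2004 ACE evaluation as an e-corpus (LDC2004E17). The goal of the ACE program is to develop automatic content extraction technology that supports automatic processing of human language in text form. In September 2004, system performance was evaluated in six areas: Entity Detection and Recognition (EDR), Entity Mention Detection (EMD), EDR Co-reference, Relation Detection and Recognition (RDR), Relation Mention Detection (RMD), and RDR given reference entities. All tasks were evaluated in three languages: English, Chinese, and Arabic.
The ACE 2005 Multilingual Training Corpus was also developed by the LDC. It contains approximately 1,800 files of mixed-genre text in English, Arabic, and Chinese, annotated for entities, relations, and events. It represents the complete set of English, Arabic, and Chinese training data for the 2005 ACE technology evaluation. The genres include newswire, broadcast news, broadcast conversation, weblogs, discussion forums, and conversational telephone speech.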
About ACE:
(Doddington et al., 2004): The Automatic Content Extraction (ACE) Program Tasks, Data, and Evaluation.
Papers using this dataset:
》》》Wei Lu and Dan Roth. 2015. Joint Mention Extraction and Classification with Mention Hypergraphs:
Our primary experiments were conducted based on the English portion of the ACE2004 dataset and the ACE2005 dataset. Following previous work, for ACE2004, we considered all documents from arabic treebank, bnews, chinese treebank, and nwire, and for ACE2005, we considered all documents from bc, bn, nw, and wl. We randomly split the documents for each dataset into three portions: 80% for training, 10% for development, and the remaining 10% for evaluations.
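The random 80/10/10 split described in the quote is made at the document level, so that sentences from one document never end up in both training and evaluation data. The sketch below shows one way to do this. The function name `split_documents`, the seed, and the document IDs are assumptions for illustration; the paper does not state its seed or exact procedure, so this will not reproduce the authors' split.

```python
import random


def split_documents(doc_ids, train_frac=0.8, dev_frac=0.1, seed=0):
    """Randomly split document IDs into train/dev/test portions.

    Splitting by document (not by sentence) keeps all sentences of a
    document in the same portion.
    """
    docs = sorted(doc_ids)      # sort first so the result depends only on the seed
    rng = random.Random(seed)   # local RNG, leaves the global random state untouched
    rng.shuffle(docs)
    n_train = int(len(docs) * train_frac)
    n_dev = int(len(docs) * dev_frac)
    train = docs[:n_train]
    dev = docs[n_train:n_train + n_dev]
    test = docs[n_train + n_dev:]  # everything left over goes to the test portion
    return train, dev, test


# Hypothetical document IDs standing in for the ACE file names.
doc_ids = [f"doc_{i:03d}" for i in range(100)]
train, dev, test = split_documents(doc_ids)
print(len(train), len(dev), len(test))
```

Fixing the seed and saving the resulting ID lists makes the split reusable across experiments, which matters because results on ACE are only comparable when the same split is used.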

GENIA
Link:
http://www.geniaproject.org/home
GENIA corpus:
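The corpus contains 1,999 Medline abstracts, selected using a PubMed query for the three MeSH terms "human", "blood cells", and "transcription factors". The corpus has been annotated with linguistic and semantic information at several levels.
The main categories of annotation in the GENIA corpus and its corresponding subcorpora are: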
- Part-of-Speech annotation: http://www.geniaproject.org/genia-corpus/pos-annotation
- Constituency (phrase structure) syntactic annotation: http://www.geniaproject.org/genia-corpus/treebank
- Term annotation: http://www.geniaproject.org/genia-corpus/term-corpus
- Event annotation: http://www.geniaproject.org/genia-corpus/event-corpus
- Relation annotation: http://www.geniaproject.org/genia-corpus/relation-corpus
- Coreference annotation: http://www.geniaproject.org/genia-corpus/coreference
Shared NLP Tasks:
BioNLP / JNLPBA Shared Task 2004: http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA/ERtask/report.html
BioNLP Shared Task 2009: http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA/SharedTask/
BioNLP Shared Task 2011: http://2011.bionlp-st.org/
About GENIA:
(Ohta et al., 2002): The GENIA Corpus: an Annotated Research Abstract Corpus in Molecular Biology Domain.
(Kim et al., 2003): GENIA corpus - a semantically annotated corpus for bio-textmining.
(Kim et al., 2004): Introduction to the bio-entity recognition task at JNLPBA.
Papers using this dataset:
》》》Jenny Rose Finkel and Christopher D. Manning. 2009. Nested Named Entity Recognition:
We performed experiments on the GENIA v.3.02 corpus (Ohta et al., 2002). This corpus contains 2000 Medline abstracts (≈500k words), annotated with 36 different kinds of biological entities, and with parts of speech.

ShARe/CLEF eHealth Evaluation Lab (SHEL) 2013
Link:
https://healthnlp.hms.harvard.edu/share/wiki/index.php?title=Main_Page
SHEL corpus:
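Much of the clinical information required for accurate clinical research, active decision support, and broad surveillance is stored in text files in electronic medical records (EMR). The only feasible way to leverage this information for translational science is to extract and encode it using natural language processing. Over the past two decades, several research groups have developed NLP tools for clinical notes, but the major bottleneck holding back progress in clinical NLP is the lack of standard, annotated datasets for training and evaluating NLP applications. Without these standards, individual NLP applications abound, with no way to train different algorithms on standard annotations, to share and integrate NLP modules, or to compare performance. We therefore propose to develop standard datasets that enable technology to extract scientific information from textual medical records.
To achieve this goal, we will address three specific aims, each with a set of sub-aims:
Aim 1: Extend existing standards and develop a new consensus annotation schema for annotating clinical text in a way that is interoperable, extensible, and usable
- Develop annotation schemas for linguistic and clinical annotations;
- Determine dependencies on clinical terminologies and ontological knowledge;
- Develop annotation guidelines for linguistic and clinical annotations.
Aim 2: Develop and evaluate an efficient and accurate manual annotation methodology, then apply it to annotate a set of publicly available clinical texts
- Build infrastructure for collecting annotations of clinical text;
- Develop an efficient methodology for acquiring accurate annotations;
- Annotate and evaluate the final annotation set.
Aim 3: Develop a publicly available toolkit for automatically annotating clinical text, and evaluate it using multi-dimensional, flexible evaluation metrics
- Use the Mayo NLP system to merge modules into Apache cTAKES;
- Design evaluation metrics for comparing automatic annotations against the annotated corpus. Apply standard evaluation methods and develop new evaluation metrics to address the complexity of evaluating textual judgments;
- Organize a multi-track shared evaluation of clinical NLP systems;
- Dissemination plan.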
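Aim 3 mentions comparing automatic annotations against the annotated corpus with standard evaluation methods. For entity recognition, a common choice is precision, recall, and F1 over exactly matching spans. The sketch below shows that metric under the assumption that each annotation is a `(start, end, label)` tuple and that only exact matches count. It is not the official ShARe/CLEF scorer, and the gold and predicted annotations are made up for illustration.

```python
def span_prf(gold, pred):
    """Strict span-level precision, recall, and F1.

    A predicted annotation counts as correct only if its start, end,
    and label all match a gold annotation exactly.
    """
    gold_set, pred_set = set(gold), set(pred)
    true_positives = len(gold_set & pred_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical annotations as (start_token, end_token, label), end exclusive.
gold = [(0, 2, "Disorder"), (5, 6, "Disorder"), (8, 10, "Disorder")]
pred = [(0, 2, "Disorder"), (5, 7, "Disorder")]  # second span has a wrong boundary

precision, recall, f1 = span_prf(gold, pred)
print(f"P={precision:.3f} R={recall:.3f} F1={f1:.3f}")
```

Strict matching gives no credit for the prediction with the wrong boundary. A relaxed variant that accepts overlapping spans would score it as correct, which is one reason the choice of metric should be reported alongside the numbers.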
Shared NLP Tasks:
CLEF/ShARe 2013: http://sites.google.com/site/shareclefehealth/
CLEF/ShARe 2014 (in collaboration with the THYME project): http://clefehealth2014.dcu.ie/task-2
SemEval 2014 Analysis of Clinical Text Task 7 (in collaboration with the THYME project): http://alt.qcri.org/semeval2014/task7/
SemEval 2015 Analysis of Clinical Text Task 14 (in collaboration with the THYME project): http://alt.qcri.org/semeval2015/task14/
About ShARe:
(Suominen et al., 2013): Overview of the ShARe/CLEF eHealth evaluation lab 2013. Springer LNCS.
(Pradhan et al., 2014): SemEval-2014 Task 7: Analysis of Clinical Text.
Papers using this dataset:
》》》Aldrian Obaja Muis and Wei Lu. 2016. Learning to recognize discontiguous entities:
We found the largest of such corpus to be the dataset from the task to recognize disorder mentions in clinical text, initially organized by ShARe/CLEF eHealth Evaluation Lab (SHEL) in 2013 (Suominen et al., 2013) and continued in SemEval-2014 (Pradhan et al., 2014).