An attention-based BiLSTM-CRF approach to document-level chemical named entity recognition

Latest recommended article published on 2021-03-27 12:27:50

ccluqh


Article link: https://blog.csdn.net/qq_28468707/article/details/103874217


Abstract

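The performance of traditional machine learning approaches depends heavily on feature engineering. Moreover, these approaches are sentence-level methods, so they suffer from the tagging inconsistency problem.

The authors propose a neural network approach, an attention-based BiLSTM-CRF (Att-BiLSTM-CRF), for document-level NER. The approach uses document-level global information obtained through an attention mechanism to enforce tagging consistency across multiple instances of the same token in a document.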

1 Introduction

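In practice, both traditional machine learning methods and deep learning methods treat NER as a sentence-level task: each sentence is treated as a separate document, so multiple instances of the same token in different sentences of a document are handled as completely independent tagging problems. As a result, identical mentions in a document can be tagged with different labels (for example "VK2", the abbreviation of the chemical entity "vitamin K2"). Reasonably, these mentions should all receive the same label. However, the model fails to recognize the "VK2" mention that the paper's example marks in bold. This is the so-called tagging inconsistency problem.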
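To make the tagging inconsistency problem concrete, here is a minimal sketch that scans the predictions for one document and reports every token that received more than one distinct label. The function name `find_inconsistent_tokens`, the simplified B/I/O labels, and the toy predictions are all hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def find_inconsistent_tokens(document):
    """Return tokens that receive more than one distinct label in a document.

    `document` is a list of sentences; each sentence is a list of
    (token, label) pairs, as a sentence-level tagger might produce them.
    """
    labels_by_token = defaultdict(set)
    for sentence in document:
        for token, label in sentence:
            labels_by_token[token].add(label)
    # Keep only the tokens whose label set has more than one element.
    return {tok: sorted(labs) for tok, labs in labels_by_token.items() if len(labs) > 1}


# Hypothetical sentence-level predictions for a two-sentence document.
predicted = [
    [("Vitamin", "B"), ("K2", "I"), ("(", "O"), ("VK2", "B"), (")", "O")],
    [("VK2", "O"), ("reduced", "O"), ("bone", "O"), ("loss", "O")],
]

inconsistent = find_inconsistent_tokens(predicted)
print(inconsistent)  # {'VK2': ['B', 'O']}
```

Here "VK2" is tagged as an entity in the first sentence but missed in the second, which is exactly the kind of disagreement a document-level model tries to avoid.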

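The work focuses on the two problems above: the dependence of performance on feature engineering in neural network architecture approaches, and the tagging inconsistency of sentence-level NER.

Attention over similar entities is captured at the document level, so that related tokens in different sentences of a document are no longer treated as independent tagging problems.

Domain features are introduced (for example part-of-speech (POS), chunking and dictionary features).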

2 Materials and methods

2.1 Features

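Word embeddings are pre-trained with the skip-gram model of the word2vec tool.
BiLSTM character embeddings: they not only learn the internal representation of entity names but also alleviate the out-of-vocabulary (OOV) problem (Rei et al., 2016).
The POS and chunking information for each word is generated by the GENIA tagger (http://www.nactem.ac.uk/GENIA/tagger/). Two lookup tables then output a 25-dimensional POS embedding and a 10-dimensional chunking embedding, respectively.
Chemical dictionaries are often added to the feature set as a form of domain knowledge.
The Jochem dictionary is used to generate the dictionary feature. First, the longest match between the normalized token sequence and the dictionary entries is found. Then, each token in the match is encoded with the BIO tagging scheme. Finally, a lookup table outputs a 5-dimensional dictionary embedding.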
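The dictionary feature described above (longest match, then BIO encoding) can be sketched roughly as follows. This is only an illustration: the function `dictionary_bio_features`, the `max_len` cap, the lowercasing used as normalization, and the toy dictionary are assumptions, not the paper's implementation or the actual Jochem data.

```python
def dictionary_bio_features(tokens, dictionary, max_len=5):
    """Tag tokens with B/I/O according to the longest dictionary match.

    `dictionary` is a set of tuples of normalized tokens. Normalization is
    plain lowercasing here, which is an assumption of this sketch.
    """
    normalized = [t.lower() for t in tokens]
    tags = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        matched = 0
        # Try the longest candidate span first so the longest match wins.
        for length in range(min(max_len, len(tokens) - i), 0, -1):
            if tuple(normalized[i:i + length]) in dictionary:
                matched = length
                break
        if matched:
            tags[i] = "B"
            for j in range(i + 1, i + matched):
                tags[j] = "I"
            i += matched  # continue after the matched span
        else:
            i += 1
    return tags


# Hypothetical stand-in for a chemical dictionary.
toy_dictionary = {("vitamin", "k2"), ("vitamin",), ("vk2",)}
tokens = ["Vitamin", "K2", "(", "VK2", ")", "was", "given"]
tags = dictionary_bio_features(tokens, toy_dictionary)
print(list(zip(tokens, tags)))
```

Each resulting tag (B, I or O) would then be used as an index into a lookup table that returns the 5-dimensional dictionary embedding for that token.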

2.2 BiLSTM-CRF model

2.3 Att-BiLSTM-CRF model

For the NER task, Bharadwaj et al. (2016) and Rei et al. (2016) introduced the attention mechanism to enhance the performance of their models. Although both methods use attention to focus on character-level representations, they are still sentence-level NER methods. More recently, Pandey et al. (2017) presented a model similar to our Att-BiLSTM-CRF to extract knowledge of Adverse Drug Reactions (ADRs).

Bharadwaj, A. et al. (2016) Phonologically aware neural model for named entity recognition in low resource transfer settings. EMNLP, 2016, 1462–1472.

Rei, M. et al. (2016) Attending to characters in neural sequence labeling models. arXiv preprint arXiv:1611.04361.

However, their attention mechanism focuses on which encoded elements contribute to the generation of the current unit or to the prediction of an ADR at the sentence level. In contrast, we apply the attention mechanism to focus on the related tokens in different sentences of a document, in order to address the tagging inconsistency problem.

Pandey, C. et al. (2017) Improving RNN with attention and embedding for adverse drug reactions. In: Proceedings of the 2017 International Conference on Digital Health. ACM, pp. 67–71.

We define N as the number of words in the document.
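As a rough illustration of attention over all N words of a document, the sketch below computes, for each token, a softmax-weighted sum of every token state in the document. These notes do not record the paper's score function, so the dot-product score used here is an assumption, as are the function names and the toy 2-dimensional states; in the real model the states would come from the BiLSTM layer.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def document_attention(hidden_states):
    """For each of the N token states, return (weights, global_vector).

    Scores are dot products between the current state and every state in
    the document (an assumed score function for this sketch).
    """
    n = len(hidden_states)
    dim = len(hidden_states[0])
    outputs = []
    for i in range(n):
        scores = [dot(hidden_states[i], hidden_states[j]) for j in range(n)]
        weights = softmax(scores)
        # Document-level context: weighted sum of all N token states.
        global_vec = [
            sum(weights[j] * hidden_states[j][k] for j in range(n))
            for k in range(dim)
        ]
        outputs.append((weights, global_vec))
    return outputs


# Toy states for a document with N = 4 tokens; tokens 0 and 3 stand for
# the same word appearing in two different sentences.
states = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [1.0, 0.0]]
results = document_attention(states)
weights_0, global_0 = results[0]
print(weights_0)
```

In this toy case token 0 attends most strongly to itself and to token 3, the other occurrence of the same word, which is the intuition behind using document-level attention to keep their tags consistent.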

3 Results

Moreover, owing to the document-level attention mechanism, our Att-BiLSTM-CRF model without additional features achieves better performance than other sentence-level neural network-based models, and our Att-BiLSTM-CRF model with additional features achieves the best performance so far on the BioCreative CHEMDNER and CDR corpora (F-scores of 91.14% and 92.57%, respectively).