Paper reading: "An Effective Transition-based Model for Discontinuous NER"

Latest recommended article published on 2023-12-03 16:29:27

说文科技


Article link: https://blog.csdn.net/liu16659/article/details/109559436


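0. Summary

The paper uses a transition-based model. This kind of model builds on a traditional data structure (a stack) and defines its own set of stack operations, which together perform entity recognition.
If you need the slides (PPT) that go with this post, you can get them from my GitHub.
Source: CSDN@LawsonAbs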
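The stack-based idea in the summary can be sketched as follows. This is a minimal illustration, not the paper's implementation: the four action names (`SHIFT`, `OUT`, `REDUCE`, `COMPLETE`) and the function `run_transitions` are simplified assumptions made for this sketch, and the action sequence is supplied by hand, whereas a real model would predict each action from the current stack and buffer.

```python
from typing import List


def run_transitions(tokens: List[str], actions: List[str]) -> List[str]:
    """Replay a hand-written action sequence over a stack and a buffer.

    Each stack item is a list of token indices, so a span may be discontinuous.
    The action set here is a simplified assumption, not the paper's full inventory.
    """
    stack: List[List[int]] = []
    buffer = list(range(len(tokens)))
    entities: List[List[int]] = []

    for action in actions:
        if action == "SHIFT":
            # Move the next buffer token onto the stack as a new one-token span.
            stack.append([buffer.pop(0)])
        elif action == "OUT":
            # Drop the next buffer token: it belongs to no entity.
            buffer.pop(0)
        elif action == "REDUCE":
            # Merge the top two spans on the stack into a single span.
            right = stack.pop()
            left = stack.pop()
            stack.append(left + right)
        elif action == "COMPLETE":
            # Pop the top span and emit it as a finished entity.
            entities.append(stack.pop())
        else:
            raise ValueError(f"unknown action: {action}")

    return [" ".join(tokens[i] for i in span) for span in entities]


# Made-up sentence: "arm ... hurts" is a discontinuous mention that skips "really".
tokens = ["my", "arm", "really", "hurts"]
actions = ["OUT", "SHIFT", "OUT", "SHIFT", "REDUCE", "COMPLETE"]
print(run_transitions(tokens, actions))
```

Keeping token indices rather than strings on the stack is what lets a merged span skip over tokens that were discarded in between.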

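This post has three parts:

Part 1 is the paper notes, which summarize the main content of the paper;
Part 2 uses a question-and-answer format to address points in the paper that may raise questions;
Part 3 discusses the details of how to implement the paper.

PART1 Paper notes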

[Figure omitted]
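The idea here is that, by adding fine-grained entities, a discontinuous NER problem can be converted into a nested NER problem. The example from the paper illustrates this:
Adding two fine-grained entities, Body Location and General Feeling, alongside the original Adverse drug event entity turns what was a purely discontinuous NER problem into a nested NER problem. This suggests that the two problems are essentially interconvertible (which is presumably the point the authors want to make).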
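One possible way to picture this conversion in code is shown below. It is only a sketch under assumptions: the function `to_nested`, the generic inner label `COMPONENT`, and the choice to label the covering span with the original entity type are all invented for illustration. The paper's example uses specific fine-grained types (Body Location, General Feeling) rather than a generic label.

```python
from typing import List, Tuple

Span = Tuple[int, int, str]  # (start index, end index inclusive, label)


def to_nested(indices: List[int], outer_label: str,
              inner_label: str = "COMPONENT") -> List[Span]:
    """Turn one discontinuous mention into nested contiguous spans.

    Returns the covering span first, followed by one span per contiguous run
    of token indices. Labels are hypothetical placeholders.
    """
    if not indices:
        raise ValueError("a mention needs at least one token index")
    idx = sorted(indices)

    # Split the sorted indices into maximal contiguous runs.
    runs: List[Tuple[int, int]] = []
    start = prev = idx[0]
    for i in idx[1:]:
        if i != prev + 1:
            runs.append((start, prev))
            start = i
        prev = i
    runs.append((start, prev))

    outer: Span = (idx[0], idx[-1], outer_label)
    inner: List[Span] = [(s, e, inner_label) for s, e in runs]
    return [outer] + inner


# Tokens 1 and 3 form one discontinuous mention; token 2 is a gap.
print(to_nested([1, 3], "ADE"))
```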

[Figure omitted]

For now, assume that the vector $c_i$ has dimension n.

[Figure omitted]

The meaning of each variable in the formula $s_i^a = softmax(s_i^T W_i^a B)B$ is explained in detail below:

$s_i$ is the vector representation of a span, so $s_i \in R^d$.
$B$ is a sequence of vectors, each of which represents a token in the buffer, so $B \in R^{d \times l}$ (here $l$ is the number of tokens in the buffer).
$W_i^a$ has shape $R^{d \times d}$, so the product $s_i^T W_i^a B$ gives the attention scores (one per buffer token, normalized by the softmax), with shape $R^{1 \times l}$.
The final $s_i^a$ has shape $R^{1 \times d}$. Note that with $B \in R^{d \times l}$ the last multiplication is strictly by $B^T$, i.e. a weighted sum of the columns of $B$.
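The shapes listed above can be checked with a small pure-Python sketch. The function name `span_attention` and the toy numbers are assumptions made for illustration; the sketch stores $B$ as $d$ rows by $l$ columns, so the final step is a weighted sum over the columns of $B$.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]


def softmax(xs: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def span_attention(s_i: List[float], W: Matrix, B: Matrix) -> Tuple[List[float], List[float]]:
    """s_i has length d, W is d x d, B is d x l (column j is buffer token j).

    Returns (weights of length l, attended vector of length d).
    """
    d, l = len(B), len(B[0])
    # s_i^T W: a row vector of length d.
    sW = [sum(s_i[r] * W[r][c] for r in range(d)) for c in range(d)]
    # (s_i^T W) B: one score per buffer token, length l.
    scores = [sum(sW[r] * B[r][j] for r in range(d)) for j in range(l)]
    weights = softmax(scores)
    # Weighted sum of the columns of B, i.e. softmax(...) times B^T.
    s_a = [sum(weights[j] * B[r][j] for j in range(l)) for r in range(d)]
    return weights, s_a


# Toy example with d = 2 and l = 3 (values are arbitrary).
s_i = [1.0, 0.0]
W = [[1.0, 0.0], [0.0, 1.0]]
B = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0]]
weights, s_a = span_attention(s_i, W, B)
print(len(weights), len(s_a))  # l and d
```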

The author also described the formula above from the perspective of attention, as follows:

An alternative explanation if you are familiar with attention:
you can think of $s_i$ as the query vector, while $B$ plays the role of both the key vectors and the value vectors.
$s_i^a$ is then the weighted sum of $B$ (the value vectors), with weights computed from the correlation between the query ($s_i$) and the keys ($B$).
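Written out term by term, the attention reading above amounts to the following, where $b_j$ denotes the $j$-th column of $B$ and $\alpha_j$ is a name introduced here for the attention weight on buffer token $j$ (neither symbol appears in the original notes):

$$
\alpha_j = \frac{\exp\left(s_i^T W_i^a b_j\right)}{\sum_{k=1}^{l} \exp\left(s_i^T W_i^a b_k\right)}, \qquad s_i^a = \sum_{j=1}^{l} \alpha_j \, b_j
$$

The weights $\alpha_j$ are non-negative and sum to 1, so $s_i^a$ is a convex combination of the buffer token vectors.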

[Figure omitted]

PART2 Questions

1. Which dataset should be used, and which part of it?

After downloading CADEC and unpacking it, the resulting folders look like this:
[Figure omitted]
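So should v1 or v2 be used?

Use the data in the CADEC.v2 folder.
Even the data in that folder is fairly large, however, so the authors used only the ADE portion.

It is unclear whether this is a typo, but the original authors wrote ADR.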
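Keeping only one annotation type might look like the sketch below. It assumes the annotations are in brat standoff format (tab-separated `ID`, `LABEL start end[;start end]`, `text`), which should be verified against the downloaded files. The function `parse_adr_lines` and the sample annotation text are invented for illustration and are not taken from CADEC or from the authors' preprocessing code.

```python
from typing import List, Tuple

Entity = Tuple[List[Tuple[int, int]], str]  # (character offset pairs, mention text)


def parse_adr_lines(ann_text: str, keep_label: str = "ADR") -> List[Entity]:
    """Keep only text-bound annotations whose label equals keep_label.

    Assumes brat-style lines; a discontinuous mention has several
    offset pairs separated by ';'.
    """
    entities: List[Entity] = []
    for line in ann_text.splitlines():
        if not line.startswith("T"):
            continue  # skip annotator notes and other non-entity lines
        parts = line.split("\t")
        if len(parts) != 3:
            continue
        _ann_id, header, text = parts
        label, offsets = header.split(" ", 1)
        if label != keep_label:
            continue
        spans = []
        for chunk in offsets.split(";"):
            start, end = (int(x) for x in chunk.split())
            spans.append((start, end))
        entities.append((spans, text))
    return entities


# Made-up annotation text, for illustration only.
sample = (
    "T1\tADR 9 19\tbit drowsy\n"
    "T2\tDrug 0 7\tLipitor\n"
    "T3\tADR 30 34;45 49\tpain neck\n"
    "#1\tAnnotatorNotes T1\tsome note\n"
)
for spans, text in parse_adr_lines(sample):
    print(spans, text)
```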

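2. What is the purpose of the data preprocessing, and what does it produce?

To be updated.

3. Other preparation

This is mainly a matter of downloading the corpora.

Download the ELMo model and its trained weights from https://allennlp.org/elmo. The two files needed are the weights file and the options file.
Place the downloaded files under the directory /data/dai031/Corpora/ELMo/elmo_2x4096_512_2048cnn_2xhighway_5.5B.
Download the embeddings pretrained with the GloVe algorithm, i.e. the file glove.6B.100d.txt, which is available at https://www.kaggle.com/danielwillgeorge/glove6b100dtxt.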
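As a quick sanity check on the downloaded embeddings, the GloVe text format (one word per line followed by its space-separated vector components) can be parsed with a few lines of Python. The helper `load_glove` is a hypothetical name for this sketch and is not part of the paper's code; the demo uses in-memory lines with `dim=3` so that it runs without the real file.

```python
from typing import Dict, Iterable, List


def load_glove(lines: Iterable[str], dim: int = 100) -> Dict[str, List[float]]:
    """Parse GloVe-style text lines into a word -> vector dictionary.

    Lines that do not contain exactly dim values after the word are skipped.
    """
    vectors: Dict[str, List[float]] = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


# In-memory demo with 3-dimensional toy vectors.
demo = ["the 0.1 0.2 0.3", "cat 0.4 0.5 0.6\n", "bad 1.0"]
print(load_glove(demo, dim=3))

# With the real file, something like this should work:
# with open("glove.6B.100d.txt", encoding="utf-8") as f:
#     glove = load_glove(f, dim=100)
```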