[Paper Summary] Frustratingly Short Attn Spans in Neural LM [Daniluk 2017]

This is an old ICLR 2017 paper, and its main contribution is using different representations for the key, the value, and the next-word distribution. The terminology was different back then: their "key" is today's query, their "value" is today's key, and their "next-word distribution" is today's value. While reading, I briefly thought the now-standard projection to Q, K, V originated with this paper, but its citation count is not high and it seems fairly obscure. Also, the observation that "attn only utilizes memory of a short span" was an incidental finding from the experiments; the authors presumably considered it important enough to use as the title, but that obscures their actual contribution of modifying the attention mechanism.

Attention was first proposed in [Bahdanau 2015, Neural Machine Translation by Jointly Learning to Align and Translate], but there the same vector had to fulfill three purposes at the same time (i.e., the separate purposes now served by Q, K, and V). At that point attention meant the decoder attending to the encoder, in an RNN-based seq2seq model. Self-attention later made the Q, K, V separation explicit (so does everyone assume Q, K, V were introduced by Attention Is All You Need, even though this paper appears to be earlier?).
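To make the "same vector, three purposes" point concrete, here is a minimal sketch in plain PyTorch (an assumption; Bahdanau uses an additive scorer and trained projections, replaced here by dot-product scoring and random tensors) of an RNN LM with attention where the same hidden states serve as keys for scoring, as values for the context vector, and as the input to prediction:

```python
# Minimal sketch: "overloaded" attention in an RNN LM.
# Assumptions: dot-product scoring instead of Bahdanau's additive scorer,
# random tensors in place of trained weights and real LSTM states.
import torch

d, L, vocab = 64, 10, 1000
H = torch.randn(L, d)             # previous hidden states (the memory)
h_t = torch.randn(d)              # current hidden state
W_out = torch.randn(vocab, 2 * d)

scores = H @ h_t                  # H used as keys (scoring)
alpha = torch.softmax(scores, dim=0)
context = alpha @ H               # the same H reused as values (context)
logits = W_out @ torch.cat([context, h_t])  # and h_t reused again for prediction
```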


Key points
  • We introduce two methods for separating the overloaded usage of output vectors: (a) using a dedicated key and value, and (b) further separating the value into a memory value and a representation that encodes the next-word distribution.
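A minimal sketch of the key-value-predict split for a single prediction step, assuming plain PyTorch with random tensors standing in for trained LSTM outputs and weights, and a simplified combination layer (the paper's exact parameterization may differ):

```python
# Minimal sketch: Key-Value-Predict attention for one step.
# Assumptions: random tensors replace LSTM outputs and trained weights;
# combining the context with the predict part is reduced to one tanh layer.
import torch

d, L, vocab = 64, 10, 1000
outputs = torch.randn(L + 1, 3 * d)        # RNN outputs up to step t
K, V, P = outputs.chunk(3, dim=-1)         # key / value / predict parts

k_t, v_t, p_t = K[-1], V[-1], P[-1]        # current step's three parts
scores = K[:-1] @ k_t                      # score current key against past keys
alpha = torch.softmax(scores, dim=0)
r_t = alpha @ V[:-1]                       # context = weighted sum of past values
                                           # (v_t is only attended to by later steps)

W_r, W_p = torch.randn(d, d), torch.randn(d, d)
W_out = torch.randn(vocab, d)
h_star = torch.tanh(W_r @ r_t + W_p @ p_t) # combine context with predict part
logits = W_out @ h_star                    # next-word distribution
```

Note that the attention weights are computed only over past steps, so the current step never attends to itself here.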
N-gram RNN by concatenating representations from previous steps
  • Neural LMs often work best in combination with traditional N-gram models, since the former excel at generalization while the latter ensure memorization.
  • Instead of an attn mechanism, we experiment with concatenating output representations from the previous N-1 steps to calculate next-word probabilities.
  • Specifically, we split the LSTM output into N-1 vectors $[h^1_t, \ldots, h^{N-1}_t]$. At time step $t$, the first part of the output vector, $h^1_t$, will contribute to predicting the next word, the second part, $h^2_t$, will contribute to predicting the second word thereafter, and so on (see the sketch after this list).
  • This is related to higher-order RNNs, with the difference that we do not incorporate output vectors from the previous steps into the hidden state but only use them for predicting the next word. Put another way, the prediction draws on the output vectors of the previous N-1 steps, while the hidden-state computation still uses only the immediately preceding one.
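A minimal sketch of this N-gram-style readout, assuming plain PyTorch, N = 4, and random tensors standing in for real LSTM outputs; the indexing is only meant to show which part of which step feeds the prediction:

```python
# Minimal sketch: N-gram RNN readout by concatenation.
# Assumptions: N = 4, random tensors in place of real LSTM outputs and weights.
import torch

N, d, T, vocab = 4, 64, 20, 1000
outputs = torch.randn(T, (N - 1) * d)      # LSTM outputs, one row per time step
parts = outputs.view(T, N - 1, d)          # parts[t, j] corresponds to h^{j+1}_t

t = T - 1                                  # predict the word following step t
pieces = [parts[t - j, j] for j in range(N - 1)]   # h^1_t, h^2_{t-1}, h^3_{t-2}
features = torch.cat(pieces)               # concatenated (N-1)*d readout

W_out = torch.randn(vocab, (N - 1) * d)
logits = W_out @ features                  # next-word distribution
```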
Findings
  • The performance of the Key-Value-Predict model does not improve significantly when increasing the attn window size. This leads to the conclusion that none of the attentive models investigated in this paper can utilize a large memory of previous token representations.
  • A much simpler model based only on the concatenation of recent output representations from previous time steps is on par with more sophisticated memory-augmented neural language models.
  • Further work can investigate ways to encourage attending over a long history, for example by forcing the model to ignore the local context and only allowing attn over output representations further behind in the history.
The conclusion ties back to the title with the field's perennial concern
  • Training neural LMs that take long-range dependencies into account seems notoriously hard and needs further investigation.