[Paper Summary] Frustratingly Short Attn Spans in Neural LM [Daniluk 2017]

This is an old ICLR 2017 paper, and its main contribution is using different representations for the key, the value, and the next-word distribution. The terminology was different back then: their "key" is today's query, their "value" is today's key, and their "next-word distribution" is today's value. While reading, I briefly thought the now-standard projection to Q, K, V originated with this paper, but its citation count is not high and it seems fairly obscure. Also, the observation that "attn only utilizes memory of a short span" was an incidental finding from the experiments; the authors presumably considered it important enough to use as the title, but that obscures their actual contribution of modifying the attention mechanism.

Attention was first proposed in [Bahdanau 2015, Neural Machine Translation by Jointly Learning to Align and Translate], but there the same vector had to fulfill three purposes at the same time (i.e., the separate purposes now served by Q, K, and V). At that point attention meant the decoder attending to the encoder, in an RNN-based seq2seq model. Self-attention later made the Q, K, V separation explicit (so does everyone assume Q, K, V were introduced by Attention Is All You Need, even though this paper appears to be earlier?).
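To make the "same vector, three purposes" point concrete, here is a minimal sketch in plain PyTorch (an assumption; Bahdanau uses an additive scorer and trained projections, replaced here by dot-product scoring and random tensors) of an RNN LM with attention where the same hidden states serve as keys for scoring, as values for the context vector, and as the input to prediction:

```python
# Minimal sketch: "overloaded" attention in an RNN LM.
# Assumptions: dot-product scoring instead of Bahdanau's additive scorer,
# random tensors in place of trained weights and real LSTM states.
import torch

d, L, vocab = 64, 10, 1000
H = torch.randn(L, d)             # previous hidden states (the memory)
h_t = torch.randn(d)              # current hidden state
W_out = torch.randn(vocab, 2 * d)

scores = H @ h_t                  # H used as keys (scoring)
alpha = torch.softmax(scores, dim=0)
context = alpha @ H               # the same H reused as values (context)
logits = W_out @ torch.cat([context, h_t])  # and h_t reused again for prediction
```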


Key points
  • We introduce two methods for separating the overloaded usage of output vectors: (a) using a dedicated key and value, and (b) further separating the value into a memory value and a representation that encodes the next-word distribution.
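A minimal sketch of the key-value-predict split for a single prediction step, assuming plain PyTorch with random tensors standing in for trained LSTM outputs and weights, and a simplified combination layer (the paper's exact parameterization may differ):

```python
# Minimal sketch: Key-Value-Predict attention for one step.
# Assumptions: random tensors replace LSTM outputs and trained weights;
# combining the context with the predict part is reduced to one tanh layer.
import torch

d, L, vocab = 64, 10, 1000
outputs = torch.randn(L + 1, 3 * d)        # RNN outputs up to step t
K, V, P = outputs.chunk(3, dim=-1)         # key / value / predict parts

k_t, v_t, p_t = K[-1], V[-1], P[-1]        # current step's three parts
scores = K[:-1] @ k_t                      # score current key against past keys
alpha = torch.softmax(scores, dim=0)
r_t = alpha @ V[:-1]                       # context = weighted sum of past values
                                           # (v_t is only attended to by later steps)

W_r, W_p = torch.randn(d, d), torch.randn(d, d)
W_out = torch.randn(vocab, d)
h_star = torch.tanh(W_r @ r_t + W_p @ p_t) # combine context with predict part
logits = W_out @ h_star                    # next-word distribution
```

Note that the attention weights are computed only over past steps, so the current step never attends to itself here.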
N-gram RNN by concatenating representations from previous steps
  • Neural LMs often work best in combination with traditional N-gram models, since the former excel at generalization while the latter ensure memorization.
  • Instead of an attn mechanism, we experiment with concatenating output representations from the previous N-1 steps to calculate next-word probabilities.
  • Specifically, we split the LSTM output into N-1 vectors $[h^1_t, \ldots, h^{N-1}_t]$. At time step $t$, the first part of the output vector, $h^1_t$, will contribute to predicting the next word, the second part, $h^2_t$, will contribute to predicting the second word thereafter, and so on (see the sketch after this list).
  • This is related to higher-order RNNs, with the difference that we do not incorporate output vectors from the previous steps into the hidden state but only use them for predicting the next word. Put another way, the prediction draws on the output vectors of the previous N-1 steps, while the hidden-state computation still uses only the immediately preceding one.
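A minimal sketch of this N-gram-style readout, assuming plain PyTorch, N = 4, and random tensors standing in for real LSTM outputs; the indexing is only meant to show which part of which step feeds the prediction:

```python
# Minimal sketch: N-gram RNN readout by concatenation.
# Assumptions: N = 4, random tensors in place of real LSTM outputs and weights.
import torch

N, d, T, vocab = 4, 64, 20, 1000
outputs = torch.randn(T, (N - 1) * d)      # LSTM outputs, one row per time step
parts = outputs.view(T, N - 1, d)          # parts[t, j] corresponds to h^{j+1}_t

t = T - 1                                  # predict the word following step t
pieces = [parts[t - j, j] for j in range(N - 1)]   # h^1_t, h^2_{t-1}, h^3_{t-2}
features = torch.cat(pieces)               # concatenated (N-1)*d readout

W_out = torch.randn(vocab, (N - 1) * d)
logits = W_out @ features                  # next-word distribution
```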
Findings
  • The performance of the Key-Value-Predict model does not improve significantly when increasing the attn window size. This leads to the conclusion that none of the attentive models investigated in this paper can utilize a large memory of previous token representations.
  • A much simpler model based only on the concatenation of recent output representations from previous time steps is on par with more sophisticated memory-augmented neural language models.
  • Further work can investigate ways to encourage attending over a long history, for example by forcing the model to ignore the local context and only allowing attn over output representations further behind in the history.
The conclusion ties back to the title with the field's perennial concern
  • Training neural LMs that take long-range dependencies into account seems notoriously hard and needs further investigation.