Attention Mechanisms in NLP: A Summary of Attention Modules (State-of-the-Art Methods up to 2019, Updating)

  • First proposed for machine translation in 'Neural Machine Translation by Jointly Learning to Align and Translate' (Bahdanau et al., 2014, published at ICLR 2015); the idea appeared even earlier in the computer vision field.
  • Computes the weights of a linear combination of the encoder hidden states; the resulting context vector becomes part of the input used to form each decoder hidden state

The reason we need attention model:

    • The original NMT models failed to capture information in long sentences
    • RNN-based encoder-decoder structures were proposed to deal with the long-term memory problem
    • But the encoder-decoder still has trouble with very long inputs, because a fixed-length hidden state cannot store unbounded information while the test input can be very long (you can input 5,000 words into Google Translate). Bahdanau et al. pointed out that compressing the source into a single context vector can be the bottleneck of generation performance.
    • Besides, information in nearby sentences is not necessarily more important than information farther away, so we need to take the relevance between words and sentences into account rather than only considering distance
    • Attention was first proposed in image processing, then introduced to NMT in Bahdanau's 2014 paper:
    • [1409.0473] Neural Machine Translation by Jointly Learning to Align and Translate https://arxiv.org/abs/1409.0473

Calculation of attention:

  • Alignment model (proposed by Bahdanau et al.)

    • e_ij = a(s_{i-1}, h_j)
    • alpha_ij = exp(e_ij) / sum_k exp(e_ik)
    • c_i = sum_j alpha_ij * h_j   (context vector for decoder step i)
  • where e_ij is an alignment (scoring) function judging how well the output word y_i matches the input word x_j; s_{i-1} is the previous decoder state and h_j is the encoder annotation of x_j

  • the authors obtain the alignment scores from a small feedforward network whose weights are learned jointly with the rest of the model

    • We parametrize the alignment model a as a feedforward neural network which is jointly trained with all the other components of the proposed system
  • Therefore, one component we can vary is the scoring function e; several classic choices are listed below (see the NumPy sketch after this list):

  • Bahdanau's additive attention: score(s_t, h_i) = v_a^T tanh(W_a s_t + U_a h_i)

  • Multiplicative attention (Luong's general / dot-product forms): score(s_t, h_i) = s_t^T W_a h_i, or simply s_t^T h_i

  • Additive (concat) attention: score(s_t, h_i) = v_a^T tanh(W_a [s_t; h_i])

  • Location-based function (Luong et al.): alpha_t = softmax(W_a s_t), i.e., the weights depend only on the target position

  • Scaled dot-product function (Vaswani et al.): score(s_t, h_i) = s_t^T h_i / sqrt(d_k)

    • scaling by 1/sqrt(d_k) deals with a backpropagation problem: when the dot products grow large (as they do for large dimensions), the softmax saturates and its gradients become extremely small, making the model hard to train
    • [1706.03762] Attention Is All You Need https://arxiv.org/abs/1706.03762
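To make these scoring functions concrete, here is a minimal NumPy sketch (not code from any of the cited papers; the variable names, sizes, and random weights are illustrative assumptions) that scores one decoder state against a set of encoder states and builds the soft-attention context vector:

```python
import numpy as np

def softmax(x):
    x = x - x.max()                      # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum()

d = 8                                    # hidden size (illustrative)
rng = np.random.default_rng(0)
encoder_states = rng.normal(size=(5, d)) # h_1 ... h_5
decoder_state = rng.normal(size=(d,))    # s_t

# Additive (Bahdanau-style): v_a^T tanh(W_a s_t + U_a h_i)
W_a = rng.normal(size=(d, d))
U_a = rng.normal(size=(d, d))
v_a = rng.normal(size=(d,))
additive_scores = np.array(
    [v_a @ np.tanh(W_a @ decoder_state + U_a @ h) for h in encoder_states])

# Multiplicative (Luong "general"): s_t^T W_a h_i
W_m = rng.normal(size=(d, d))
multiplicative_scores = np.array(
    [decoder_state @ W_m @ h for h in encoder_states])

# Scaled dot-product (Vaswani et al.): s_t^T h_i / sqrt(d)
scaled_scores = encoder_states @ decoder_state / np.sqrt(d)

# Whichever scoring function is used, soft attention turns the scores into
# weights and takes a weighted sum of the encoder states as the context vector.
alpha = softmax(additive_scores)
context = alpha @ encoder_states         # c_t = sum_i alpha_i * h_i
print(alpha.round(3), context.shape)
```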
  • Differences in how source states are selected:

    • soft attention (considers all source states with a parameterized, differentiable weighting) vs. hard attention (selects only the most relevant state via Monte Carlo stochastic sampling); see the sketch after this list

      • drawback of hard attention: non-differentiable, so it cannot be trained with plain backpropagation
    • global attention (aligns over all source states) vs. local attention (focuses only on a window of the source)

    • local attention is somewhat similar to hard attention in that it restricts the focus, but it remains differentiable
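A rough sketch of the soft vs. hard distinction, under the simplifying assumption that both start from the same softmax distribution over source states; real hard-attention models need REINFORCE-style training because the sampling step is non-differentiable:

```python
import numpy as np

rng = np.random.default_rng(0)
encoder_states = rng.normal(size=(5, 8))    # 5 source states, hidden size 8
scores = rng.normal(size=5)                 # alignment scores for the current target step

alpha = np.exp(scores - scores.max())
alpha = alpha / alpha.sum()                 # softmax over all source positions

# Soft attention: differentiable weighted average over ALL states
soft_context = alpha @ encoder_states

# Hard attention: stochastically pick ONE state (this sampling step is
# non-differentiable, hence the need for Monte Carlo / REINFORCE training)
idx = rng.choice(len(alpha), p=alpha)
hard_context = encoder_states[idx]

print(soft_context.shape, idx, hard_context.shape)
```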

  • self attention: attention applied within a single sequence, relating each position to every other position to build its representation; this is the core building block of the Transformer (a minimal sketch follows)
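A minimal single-head scaled dot-product self-attention sketch in NumPy; the projection matrices W_q, W_k, W_v are random stand-ins for learned parameters, and the sequence length and sizes are arbitrary:

```python
import numpy as np

def self_attention(X, W_q, W_k, W_v):
    """Single-head scaled dot-product self-attention over one sequence X of shape (n, d_model)."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)          # (n, n): every position attends to every position
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                        # (n, d_k)

rng = np.random.default_rng(0)
n, d_model, d_k = 4, 16, 8
X = rng.normal(size=(n, d_model))            # token representations of one sentence
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))
out = self_attention(X, W_q, W_k, W_v)
print(out.shape)                             # (4, 8)
```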

  • Hierarchical attention model: apply attention first over words within a sentence and then over sentences within a document (e.g., for document classification)

  • Difference between Memory mechanism and Attention mechanism?

  • SOTA models:

  • ref:

Reposted from: https://www.cnblogs.com/joezou/p/11247974.html
