How did I select papers?
First, I searched for “attention” in CVPR 2014-2016, ICCV 2009-2015 and ACM MM 2012-2015. However, only a few papers contained this keyword.
Then I searched for “attention model” on Google and found blogs that discuss it and list some papers.
Attention
[1] explains what “attention” is and why we need it: when people see a picture, they usually move their eyes around over time and gather information about the scene. They do not see every pixel of the image at once; they attend to certain aspects of the picture one time step at a time and aggregate the information. That is exactly the kind of capability we want to give our neural network models. A standard convolutional network can, in principle, recognize cluttered images, but finding the exact set of weights that does this well is difficult. By providing the network with an architecture-level feature that allows it to attend to different parts of the image sequentially and aggregate information over time, we make that job easier, because the network can now simply learn to ignore the clutter (or so is the hope).
In natural language processing, a typical task is natural language generation: given some context, generate a relevant target sentence. Machine translation is one instance. When deep learning is applied to this task, a common approach is the encoder-decoder framework.
Given a sequence of words, [3] uses an RNN encoder to produce a context vector (the last hidden state of the RNN); an RNN decoder then uses this vector as its initial state and generates the output words one by one.
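This encoder-decoder idea can be sketched in a few lines of NumPy. This is only an illustration of the data flow in [3], not their actual model: the dimensions are toy values, the cell is a vanilla RNN rather than an LSTM, and the weights are random rather than trained.

```python
import numpy as np

rng = np.random.default_rng(0)
embed_dim, hidden_dim = 4, 8

# Random (untrained) parameters for a vanilla RNN cell.
W_xh = rng.normal(size=(hidden_dim, embed_dim))
W_hh = rng.normal(size=(hidden_dim, hidden_dim))

def rnn_step(x, h):
    """One vanilla-RNN step: h' = tanh(W_xh x + W_hh h)."""
    return np.tanh(W_xh @ x + W_hh @ h)

def encode(inputs):
    """Run the encoder over the whole source sequence and return
    only the final hidden state -- the fixed-size context vector."""
    h = np.zeros(hidden_dim)
    for x in inputs:
        h = rnn_step(x, h)
    return h

def decode(context, steps):
    """Use the context vector as the decoder's initial state and
    unroll it; a zero input is fed at each step for brevity."""
    h, states = context, []
    for _ in range(steps):
        h = rnn_step(np.zeros(embed_dim), h)
        states.append(h)
    return states

source = [rng.normal(size=embed_dim) for _ in range(5)]
context = encode(source)            # one vector, regardless of source length
outputs = decode(context, steps=3)
print(context.shape, len(outputs))  # (8,) 3
```

Note that `context` has the same fixed shape whether the source has 5 words or 500, which is exactly the bottleneck discussed next.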
Fig 1
However, no matter how long the input sequence is, the encoder's output is a single vector of only a few hundred dimensions, so the longer the input sequence, the more information this state vector loses.
In fact, the decoder can use all the information in the input sequence rather than just the last state.
Fig 2
In paper [2], when generating the hypothesis states (h7, h8, h9), all the input vectors (h1, …, h5) are fed in, not just the last state (h6). Moreover, not all input vectors should influence the generation of the next state equally. For example, consider translating “私は猫が好きです。” (“I like cats.”) into “I like cats”: to generate the word “like”, we should focus on the input word “好き” (“like”) rather than the other words. “Attention” means selecting the proper input vectors and using them to generate the next target state.
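The selection step above can be sketched as a soft weighting: score every encoder state against the current decoder state, normalize the scores with a softmax, and take the weighted average as the context for the next target state. This is a minimal sketch of the general idea, not the exact scoring function of [2]; a plain dot product is assumed for the score, and the vectors are random stand-ins for h1, …, h5.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())  # subtract max for numerical stability
    return e / e.sum()

def attend(decoder_state, encoder_states):
    """Soft attention: dot-product scores between the decoder state and
    every encoder state, softmax-normalized into alignment weights; the
    context is the weighted average of the encoder states."""
    scores = np.array([decoder_state @ h for h in encoder_states])
    weights = softmax(scores)
    context = sum(w * h for w, h in zip(weights, encoder_states))
    return weights, context

rng = np.random.default_rng(1)
enc = [rng.normal(size=8) for _ in range(5)]  # stand-ins for h1..h5
dec = rng.normal(size=8)                      # current target state
weights, context = attend(dec, enc)
print(weights.shape, context.shape)  # (5,) (8,)
```

The weights sum to one, so input words with higher scores dominate the context, which is how the model can learn to focus on “好き” when producing “like”.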
Soft-attention and Hard-attention
Papers
Translation
Effective approaches to attention-based neural machine translation[4]
The attention-based models of [4] are classified into two broad categories, global and local. These classes differ in terms of whether the “attention” is placed on all source positions or on only a few source positions.
Common to these two types of models is that, at each time step t in the decoding phase, both approaches first take as input the hidden state h_t at the top layer of the decoder, from which a context vector is derived.
Fig 3. Neural machine translation – a stacking recurrent architecture for translating a source sequence A B C D into a target sequence X Y Z. Here, <eos> marks the end of a sentence.
Fig 4. Global attentional model – at each time step t, the model infers a variable-length alignment weight vector a_t from the current target state h_t and all source states h_s; a context vector c_t is then computed as the weighted average, according to a_t, of all the source states.
Fig 5. Local attention model – the model first predicts a single aligned position p_t for the current target word. A window centered around the source position p_t is then used to compute a context vector c_t, a weighted average of the source hidden states in the window. The weights a_t are inferred from the current target state h_t and those source states h_s in the window.
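The local attention of Fig 5 can be sketched as follows. This is a simplified illustration in the spirit of [4], with assumptions spelled out: the aligned position p_t is passed in directly rather than predicted by the network, the score is a plain dot product, and the softmax weights are damped by a Gaussian centered on p_t with sigma = D / 2, as the paper suggests.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def local_attention(target_state, source_states, p_t, D=2):
    """Local attention sketch: keep only source positions within a
    window of half-width D around the aligned position p_t, score them
    against the target state, and damp the softmax weights with a
    Gaussian centered on p_t (sigma = D / 2)."""
    S = len(source_states)
    lo, hi = max(0, int(p_t) - D), min(S, int(p_t) + D + 1)
    window = source_states[lo:hi]
    scores = np.array([target_state @ h for h in window])
    a = softmax(scores)
    sigma = D / 2
    positions = np.arange(lo, hi)
    a = a * np.exp(-((positions - p_t) ** 2) / (2 * sigma ** 2))
    c_t = sum(w * h for w, h in zip(a, window))
    return a, c_t

rng = np.random.default_rng(2)
src = [rng.normal(size=8) for _ in range(6)]  # stand-ins for h_1..h_6
tgt = rng.normal(size=8)                      # current target state h_t
a, c_t = local_attention(tgt, src, p_t=2.5, D=2)
print(len(a), c_t.shape)
```

Compared with the global model of Fig 4, only 2D + 1 source states are scored per target word, so the cost per step no longer grows with the full source length.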
Neural machine translation by jointly learning to align and translate[5]
Reference
[1] http://stackoverflow.com/questions/35549588/soft-attention-vs-hard-attention
[2] Rocktäschel, T., Grefenstette, E., Hermann, K. M., Kočiský, T., & Blunsom, P. (2015). Reasoning about entailment with neural attention. arXiv preprint arXiv:1509.06664.
[3] Sutskever, I., Vinyals, O., & Le, Q. V. (2014). Sequence to sequence learning with neural networks. In Advances in neural information processing systems (pp. 3104-3112).
[4] Luong, M. T., Pham, H., & Manning, C. D. (2015). Effective approaches to attention-based neural machine translation. arXiv preprint arXiv:1508.04025.
[5] Bahdanau, D., Cho, K., & Bengio, Y. (2014). Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.
[*] https://www.zhihu.com/question/36591394
Yao, L., Torabi, A., Cho, K., Ballas, N., Pal, C., Larochelle, H., & Courville, A. (2015). Video description generation incorporating spatio-temporal features and a soft-attention mechanism. arXiv preprint arXiv:1502.08029.