Attention in NLP
Advantages:
- integrates information over time
- handles variable-length sequences
- can be parallelized
Seq2seq
Encoder–Decoder framework:
Encoder:
$h_t = f(x_t, h_{t-1})$
$c = q(\{h_1, \dots, h_{T_x}\})$
Sutskever et al. (2014) used an LSTM as $f$ and $q(\{h_1, \dots, h_{T_x}\}) = h_{T_x}$, i.e., the final hidden state serves as the context vector $c$.
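A minimal sketch of such an encoder in PyTorch, assuming an LSTM as $f$ and the final hidden state as $q$; the class name `Encoder` and all hyperparameter names are illustrative, not from the paper:

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, vocab_size, embed_size, hidden_size):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_size)
        # f: h_t = LSTM(x_t, h_{t-1})
        self.rnn = nn.LSTM(embed_size, hidden_size, batch_first=True)

    def forward(self, x):
        # x: (batch, T_x) source token ids
        outputs, (h_n, c_n) = self.rnn(self.embed(x))
        # q({h_1, ..., h_{T_x}}) = h_{T_x}: take the last hidden state as c
        return outputs, h_n  # outputs: all h_t; h_n: context vector c
```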
Decoder:
$p(y) = \prod_{t=1}^{T} p(y_t \mid \{y_1, \dots, y_{t-1}\}, c)$
The joint probability of the output sequence factorizes into per-step conditionals, each conditioned on the previously generated words and the context vector $c$ (equivalently, a sum of log-probabilities).
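A matching decoder sketch under the same assumptions, reusing the `Encoder` above; the product over time steps is computed as a sum of log-probabilities, and `sequence_log_prob` is a hypothetical helper, not a library function:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Decoder(nn.Module):
    def __init__(self, vocab_size, embed_size, hidden_size):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_size)
        self.rnn = nn.LSTM(embed_size, hidden_size, batch_first=True)
        self.out = nn.Linear(hidden_size, vocab_size)

    def forward(self, y_prev, c):
        # y_prev: (batch, T-1) shifted targets {y_1, ..., y_{t-1}}
        # c: (1, batch, hidden) context vector from the encoder
        states, _ = self.rnn(self.embed(y_prev), (c, torch.zeros_like(c)))
        return self.out(states)  # unnormalized p(y_t | y_<t, c) per step

def sequence_log_prob(logits, y):
    # log p(y) = sum_t log p(y_t | y_<t, c), the product in log space
    log_probs = F.log_softmax(logits, dim=-1)
    return log_probs.gather(-1, y.unsqueeze(-1)).squeeze(-1).sum(dim=1)

# Usage: encode a source batch, then score a target batch
enc = Encoder(vocab_size=1000, embed_size=32, hidden_size=64)
dec = Decoder(vocab_size=1000, embed_size=32, hidden_size=64)
src = torch.randint(0, 1000, (2, 7))   # (batch=2, T_x=7)
tgt = torch.randint(0, 1000, (2, 5))   # (batch=2, T=5)
_, c = enc(src)
logits = dec(tgt[:, :-1], c)           # predict y_2..y_T from y_1..y_{T-1}
print(sequence_log_prob(logits, tgt[:, 1:]))  # log p(y) per sequence
```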