Encoder Decoder with Attention model is a general end-to-end approach to sequence learning that makes minimal assumptions on the sequence structure. It uses a multilayered Gated Recurrent Unit (GRU) to map the input sequence to a vector of a fixed dimensionality, and then another deep GRU to decode the target sequence from the vector.
kaggle
A sequence to sequence model has two parts – an encoder and a decoder. Both the parts are practically two different neural network models combined into one giant network. the task of an encoder network is to understand the input sequence, and create a smaller dimensional representation of it. This representation is then forwarded to a decoder network which generates a sequence of its own that represents the output. The input is put through an encoder model which gives us the encoder output. Here, each input words is assigned a weight by the attention mechanism which is then used by the decoder to predict the next word in the sentence. We use Bahdanau attention for the encoder.
One type of network built with attention is called a transformer (explained below). If you understand the transformer, you understand attention. And the best way to understand the transformer is to contrast it with the neural networks that came before. They differ in the way they process input (which in turn contains assumptions about the structure of the data to be processed, assumptions about the world) and automatically recombine that input into relevant features.
Recurrent Networks and LSTMs
How do words work? Wel