Notation: Question $Q = \{q_1, q_2, \dots, q_m\}$, Context $C = \{c_1, c_2, \dots, c_n\}$, answer span $S = \{c_i, c_{i+1}, \dots, c_{i+j}\}$. Each symbol $x$ (for any $x \in C, Q$) represents both the original word and its embedding.
Like most other Reading Comprehension models, it consists of five modules: an Embedding layer, an Embedding encoder layer, a Context-query attention layer, a Model encoder layer, and an Output layer.
1. Embedding Layer
Word:
- 300-dim GloVe pre-trained word vectors
- fixed during training
- OOV words are mapped to <UNK>, whose vector is randomly initialized and trained
Char:
- each character is embedded as a 200-dim vector; every word is truncated or padded to a maximum length of 16 characters
- concatenate the character vectors of a word into a matrix and take the maximum of each row (max-pooling over the character positions) to obtain a single fixed-size vector
- trained from scratch
The final vector of a word is the concatenation $[x_w; x_c] \in \mathbb{R}^{p_1 + p_2}$ (word dim $p_1 = 300$, char dim $p_2 = 200$), which is then passed through a two-layer highway network, as sketched below.
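A minimal PyTorch sketch of this layer, assuming the dimensions above; the names (`WordCharEmbedding`, `Highway`, `glove_vectors`) are illustrative, not from the original implementation:

```python
import torch
import torch.nn as nn

class Highway(nn.Module):
    """Two-layer highway network: y = g * H(x) + (1 - g) * x."""
    def __init__(self, dim, num_layers=2):
        super().__init__()
        self.transforms = nn.ModuleList(nn.Linear(dim, dim) for _ in range(num_layers))
        self.gates = nn.ModuleList(nn.Linear(dim, dim) for _ in range(num_layers))

    def forward(self, x):
        for transform, gate in zip(self.transforms, self.gates):
            g = torch.sigmoid(gate(x))
            x = g * torch.relu(transform(x)) + (1 - g) * x
        return x

class WordCharEmbedding(nn.Module):
    def __init__(self, glove_vectors, num_chars, char_dim=200):
        super().__init__()
        # GloVe vectors are frozen; a separate trainable <UNK> row is omitted here.
        self.word_emb = nn.Embedding.from_pretrained(glove_vectors, freeze=True)
        self.char_emb = nn.Embedding(num_chars, char_dim)  # trained from scratch
        self.highway = Highway(glove_vectors.size(1) + char_dim)

    def forward(self, word_ids, char_ids):
        # word_ids: (batch, seq); char_ids: (batch, seq, 16)
        xw = self.word_emb(word_ids)        # (batch, seq, 300)
        xc = self.char_emb(char_ids)        # (batch, seq, 16, 200)
        xc, _ = xc.max(dim=2)               # max over the 16 char positions
        x = torch.cat([xw, xc], dim=-1)     # (batch, seq, 500)
        return self.highway(x)
```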
2. Embedding Encoder Layer
A stack of building blocks: [conv-layer x # + self-attention-layer + feed-forward-layer]
- depthwise separable convolutions, which are memory-efficient and generalize better
- kernel size is 7, number of filters is d = 128, number of conv layers within a block is 4
- self-attention uses multi-head attention with 8 heads
- each of these basic operations (conv/self-attention/ffn) is placed inside a residual block: for an input $x$ and a given operation $f$, the output is $f(\mathrm{layernorm}(x)) + x$
- total number of encoder blocks is 1
- input dim is $p_1 + p_2 = 500$, output dim is $d = 128$
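The block structure could be sketched in PyTorch as follows, with the stated hyperparameters (kernel size 7, $d = 128$, 8 heads, 4 convs per block); the class names are illustrative, and `nn.MultiheadAttention` stands in for the self-attention described above:

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """Depthwise conv along the sequence, then a pointwise (1x1) conv."""
    def __init__(self, dim=128, kernel_size=7):
        super().__init__()
        self.depthwise = nn.Conv1d(dim, dim, kernel_size,
                                   padding=kernel_size // 2, groups=dim)
        self.pointwise = nn.Conv1d(dim, dim, kernel_size=1)

    def forward(self, x):  # x: (batch, seq_len, dim)
        y = self.pointwise(self.depthwise(x.transpose(1, 2)))
        return y.transpose(1, 2)

class SelfAttention(nn.Module):
    def __init__(self, dim=128, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):
        out, _ = self.attn(x, x, x)
        return out

class Residual(nn.Module):
    """The residual pattern from above: output = f(layernorm(x)) + x."""
    def __init__(self, dim, op):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.op = op

    def forward(self, x):
        return self.op(self.norm(x)) + x

class EncoderBlock(nn.Module):
    """[conv-layer x num_convs + self-attention-layer + feed-forward-layer]."""
    def __init__(self, dim=128, num_convs=4):
        super().__init__()
        layers = [Residual(dim, DepthwiseSeparableConv(dim)) for _ in range(num_convs)]
        layers.append(Residual(dim, SelfAttention(dim)))
        layers.append(Residual(dim, nn.Sequential(
            nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))))
        self.layers = nn.Sequential(*layers)

    def forward(self, x):
        return self.layers(x)
```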
3. Context-Query Attention Layer
- similarity matrix $S \in \mathbb{R}^{n \times m}$, where $S_{ij} = f(q_j, c_i)$ and $f(q, c) = W_0[q; c; q \odot c]$ is the trilinear similarity function
- apply softmax to each row of $S$ to get $\bar{S}$; the context-to-query attention is $A = \bar{S} \cdot Q^T \in \mathbb{R}^{n \times d}$
- the query-to-context attention (as in DCN) adds a small benefit on top of context-to-query attention: apply softmax to each column of $S$ to get $\bar{\bar{S}}$, then compute $B = \bar{S} \cdot \bar{\bar{S}}^T \cdot C^T$
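A PyTorch sketch of this layer, assuming batch-first encoded inputs; materializing the full $(n, m, 3d)$ tensor for the trilinear function is simple but memory-hungry, and a real implementation would likely decompose it:

```python
import torch
import torch.nn as nn

class ContextQueryAttention(nn.Module):
    def __init__(self, dim=128):
        super().__init__()
        # Trilinear f(q, c) = W0 [q; c; q * c] as a single linear map.
        self.w0 = nn.Linear(3 * dim, 1, bias=False)

    def forward(self, C, Q):
        # C: (batch, n, d) context; Q: (batch, m, d) query
        n, m = C.size(1), Q.size(1)
        c = C.unsqueeze(2).expand(-1, -1, m, -1)   # (batch, n, m, d)
        q = Q.unsqueeze(1).expand(-1, n, -1, -1)   # (batch, n, m, d)
        S = self.w0(torch.cat([q, c, q * c], dim=-1)).squeeze(-1)  # (batch, n, m)
        S_row = torch.softmax(S, dim=2)   # row softmax  -> S_bar
        S_col = torch.softmax(S, dim=1)   # column softmax -> S_bar_bar
        A = torch.bmm(S_row, Q)           # context-to-query: (batch, n, d)
        B = torch.bmm(torch.bmm(S_row, S_col.transpose(1, 2)), C)  # query-to-context
        return A, B
```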
4. Model Encoder Layer
- input at each position is $[c; a; c \odot a; c \odot b]$, where $a$ and $b$ are the corresponding rows of the attention matrices $A$ and $B$, respectively
- parameters are the same as in the embedding encoder layer, except that:
- number of blocks is 7
- number of conv layers within a block is 2
- the model encoder is applied three times in sequence with shared weights, producing outputs $M_0$, $M_1$, $M_2$ for the output layer (see the sketch below)
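A sketch of how this layer's input is assembled and the three weight-shared passes are run, assuming the tensors from the previous sketches; note that the $4d$-dim input would need a projection back to $d = 128$ before the first block (public implementations typically use a 1×1 convolution; omitted here as an assumption):

```python
import torch

def model_encoder_input(C, A, B):
    """Assemble [c; a; c * a; c * b] per position: (batch, n, d) -> (batch, n, 4d)."""
    return torch.cat([C, A, C * A, C * B], dim=-1)

def run_model_encoders(encoder, x):
    """Apply the same 7-block encoder stack three times with shared weights."""
    M0 = encoder(x)
    M1 = encoder(M0)
    M2 = encoder(M1)
    return M0, M1, M2
```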
5. Output Layer
Predict the probability of each context position being the start or end of the answer span: $p^1 = \mathrm{softmax}(W_1[M_0; M_1])$ and $p^2 = \mathrm{softmax}(W_2[M_0; M_2])$, where $M_0$, $M_1$, $M_2$ are the outputs of the three model encoders, from bottom to top. Training minimizes the negative log-likelihood of the true start and end positions; at inference, the span $(s, e)$ with $s \le e$ maximizing $p^1_s p^2_e$ is selected.
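A PyTorch sketch of this layer; `SpanOutput` and the bias-free linear maps standing in for $W_1$ and $W_2$ are illustrative assumptions:

```python
import torch
import torch.nn as nn

class SpanOutput(nn.Module):
    def __init__(self, dim=128):
        super().__init__()
        self.w1 = nn.Linear(2 * dim, 1, bias=False)
        self.w2 = nn.Linear(2 * dim, 1, bias=False)

    def forward(self, M0, M1, M2):
        # M0, M1, M2: (batch, n, d) outputs of the three model encoders
        p1 = torch.softmax(self.w1(torch.cat([M0, M1], dim=-1)).squeeze(-1), dim=-1)
        p2 = torch.softmax(self.w2(torch.cat([M0, M2], dim=-1)).squeeze(-1), dim=-1)
        return p1, p2  # start / end distributions over the n context positions
```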