QANet encodes both the embedding and modeling stages with CNNs and self-attention instead of RNNs, so it is fast and can process the input tokens in parallel.
Key point: the CNN captures local features of the text, while self-attention learns the global interaction between each pair of words.
The key motivation behind the design of our model is the following: convolution captures the local
structure of the text, while the self-attention learns the global interaction between each pair of words.
Network Architecture
1. Input Embedding Layer
Concatenates the word embedding (300d) with the char embedding (200d).
All OOV words are mapped to a randomly initialized word vector that is trained with the model.
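A minimal sketch of this layer, assuming PyTorch (the original gives no code, and module/argument names here are illustrative). Per-word char vectors are reduced by max-pooling over the characters of each word, as in the paper:

```python
import torch
import torch.nn as nn

class InputEmbedding(nn.Module):
    """Concatenate a 300d word embedding with a 200d char-derived embedding."""
    def __init__(self, word_vocab, char_vocab, word_dim=300, char_dim=200):
        super().__init__()
        # Pre-trained word vectors would normally be loaded here; OOV words
        # share a randomly initialized, trainable vector.
        self.word_emb = nn.Embedding(word_vocab, word_dim)
        self.char_emb = nn.Embedding(char_vocab, char_dim)

    def forward(self, word_ids, char_ids):
        # word_ids: (batch, seq_len); char_ids: (batch, seq_len, word_len)
        w = self.word_emb(word_ids)                     # (batch, seq_len, 300)
        c = self.char_emb(char_ids).max(dim=2).values   # max over chars -> (batch, seq_len, 200)
        return torch.cat([w, c], dim=-1)                # (batch, seq_len, 500)
```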
2. Embedding Encoder Layer
Consists of a stack of encoder blocks [convolution-layer × # + self-attention-layer + feed-forward-layer]; the input is 500d and the output is 128d (a sketch follows the details below).
- Convolution
Uses the method from Deep learning with depthwise separable convolutions.
kernel-size=7
the number of filters is d = 128
the number of conv layers within a block is 4
- self-attention
Uses the multi-head attention mechanism.
The number of heads is 8 throughout all the layers.
Each of these basic operations (conv/self-attention/ffn) is placed inside a residual block.
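A minimal sketch of one encoder block, assuming PyTorch (`DepthwiseSeparableConv` and `EncoderBlock` are illustrative names; positional encoding and the 500d→128d input projection are omitted):

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """Depthwise conv over the sequence followed by a pointwise (1x1) conv."""
    def __init__(self, d=128, k=7):
        super().__init__()
        self.depthwise = nn.Conv1d(d, d, k, padding=k // 2, groups=d)
        self.pointwise = nn.Conv1d(d, d, 1)

    def forward(self, x):  # x: (batch, d, seq_len)
        return self.pointwise(self.depthwise(x))

class EncoderBlock(nn.Module):
    """[conv x num_convs + self-attention + feed-forward], each in a residual block."""
    def __init__(self, d=128, k=7, num_convs=4, heads=8):
        super().__init__()
        self.convs = nn.ModuleList(DepthwiseSeparableConv(d, k) for _ in range(num_convs))
        self.conv_norms = nn.ModuleList(nn.LayerNorm(d) for _ in range(num_convs))
        self.attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.attn_norm = nn.LayerNorm(d)
        self.ffn = nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, d))
        self.ffn_norm = nn.LayerNorm(d)

    def forward(self, x):  # x: (batch, seq_len, d)
        for norm, conv in zip(self.conv_norms, self.convs):
            y = conv(norm(x).transpose(1, 2)).transpose(1, 2)
            x = x + y                           # residual around each conv
        h = self.attn_norm(x)
        y, _ = self.attn(h, h, h)
        x = x + y                               # residual around self-attention
        return x + self.ffn(self.ffn_norm(x))   # residual around the FFN
```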
3. Context-Query Attention Layer
- Let $C$ and $Q$ be the encoded context and query matrices, respectively. Compute the similarity matrix $S \in \mathbb{R}^{n \times m}$, where $S_{i,j} = f(q, c) = W_0 \cdot [q, c, q \circ c]$.
- Context-to-query attention: $A = \mathrm{softmax}(S, \mathrm{axis{=}row}) \cdot Q^T \in \mathbb{R}^{n \times d}$
- Query-to-context attention: $B = \mathrm{softmax}(S, \mathrm{axis{=}row}) \cdot \mathrm{softmax}(S, \mathrm{axis{=}column})^T \cdot C^T$
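A minimal sketch of this layer, assuming PyTorch and storing $C$ as $(n, d)$ and $Q$ as $(m, d)$ so the transposes above disappear; `context_query_attention` and `W0` are illustrative names:

```python
import torch
import torch.nn.functional as F

def context_query_attention(C, Q, W0):
    """C: (n, d) encoded context; Q: (m, d) encoded query; W0: (3*d,) trilinear weights."""
    n, d = C.shape
    m = Q.shape[0]
    # S[i, j] = W0 . [q_j, c_i, q_j * c_i]  (trilinear similarity)
    c = C.unsqueeze(1).expand(n, m, d)
    q = Q.unsqueeze(0).expand(n, m, d)
    S = torch.cat([q, c, q * c], dim=-1) @ W0             # (n, m)
    A = F.softmax(S, dim=1) @ Q                           # context-to-query: (n, d)
    B = F.softmax(S, dim=1) @ F.softmax(S, dim=0).T @ C   # query-to-context: (n, d)
    return A, B
```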
4. Model Encoder Layer
The input at each position is $[c, a, c \circ a, c \circ b]$, where $a$ and $b$ are rows of the attention matrices $A$ and $B$, respectively.
The three stacked model encoder blocks share parameters.
the number of convolution layers within a block is 2;
the total number of blocks is 7
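A minimal sketch of this layer, reusing `EncoderBlock` from the sketch above (the `proj` linear map and the random inputs are illustrative; the conv kernel size is left at the earlier default):

```python
import torch
import torch.nn as nn

d = 128
proj = nn.Linear(4 * d, d)  # maps the 4d concatenation back to d
# One stack of 7 encoder blocks (2 convs each); the same stack is applied
# three times, so its weights are shared across the three repetitions.
stack = nn.Sequential(*[EncoderBlock(d=d, num_convs=2) for _ in range(7)])

C = torch.randn(1, 30, d)   # encoded context, batch of 1
A = torch.randn(1, 30, d)   # context-to-query attention rows
B = torch.randn(1, 30, d)   # query-to-context attention rows
x = torch.cat([C, A, C * A, C * B], dim=-1)  # [c, a, c∘a, c∘b] per position
M0 = stack(proj(x))
M1 = stack(M0)
M2 = stack(M1)
```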
5. Output Layer
- start point
A fully connected layer followed by softmax:
$p^1 = \mathrm{softmax}(W_{p^1}^T [M_0; M_1])$
- end point
$p^2 = \mathrm{softmax}(W_{p^2}^T [M_0; M_2])$
6. Loss Function
$L(\theta) = -\frac{1}{N}\sum_i^N \left[ \log\left(p^1_{y_i^1}\right) + \log\left(p^2_{y_i^2}\right) \right]$
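A minimal sketch of the output layer and loss, assuming PyTorch (`W1`, `W2`, and `span_loss` are illustrative names); `F.cross_entropy` combines the softmax with the negative log-likelihood and averages over the batch:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d = 128
W1 = nn.Linear(2 * d, 1)  # plays the role of W_{p^1}
W2 = nn.Linear(2 * d, 1)  # plays the role of W_{p^2}

def span_loss(M0, M1, M2, y1, y2):
    """M0, M1, M2: (batch, n, d) model-encoder outputs; y1, y2: gold start/end indices."""
    logits1 = W1(torch.cat([M0, M1], dim=-1)).squeeze(-1)  # (batch, n)
    logits2 = W2(torch.cat([M0, M2], dim=-1)).squeeze(-1)  # (batch, n)
    # -1/N sum_i [ log p^1_{y_i^1} + log p^2_{y_i^2} ]
    return F.cross_entropy(logits1, y1) + F.cross_entropy(logits2, y2)
```

At inference, the span (i, j) with i ≤ j maximizing $p^1_i p^2_j$ is selected as the answer.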