Attention Mechanism
Attention definition:
Attention computes a "similarity" between inputs. It is usually described as follows: map a query ($Q$) and a set of key-value pairs to an output, where the output is a weighted sum of all the values in $V$, and the weight assigned to each value is computed from the $Query$ and the corresponding $Key$.
- Compute the similarity between $Q$ and each key: $S_i = f(Q, K_i)$
- Normalize the similarities with softmax: $\alpha_i = \frac{e^{S_i}}{\sum_{i=1}^{m}e^{S_i}}$
- Take the weighted sum over all values in $V$ to obtain the attention vector: $\sum_{i=1}^{m}\alpha_i V_i$
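The three steps above can be sketched in NumPy. This is a minimal illustration using dot-product scoring; the function name `attention` and the array shapes are assumptions for the example, not part of the original text:

```python
import numpy as np

def attention(Q, K, V):
    """Single-query attention for one query vector.
    Q: (d,), K: (m, d), V: (m, d_v)."""
    scores = K @ Q                        # step 1: S_i = f(Q, K_i) = Q^T K_i
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()              # step 2: alpha_i = softmax(S_i)
    return weights @ V                    # step 3: sum_i alpha_i * V_i

rng = np.random.default_rng(0)
Q = rng.standard_normal(4)
K = rng.standard_normal((3, 4))   # m = 3 key-value pairs
V = rng.standard_normal((3, 2))
out = attention(Q, K, V)
print(out.shape)  # output lives in value space: (2,)
```

Subtracting `scores.max()` before exponentiating is the standard numerically stable softmax; it does not change the resulting weights.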
Ways to compute the similarity (score functions):
- Dot product: $f(Q, K_i) = Q^T K_i$
- General (weighted): $f(Q, K_i) = Q^T W K_i$
- Concat: $f(Q, K_i) = W[Q^T; K_i]$
- Perceptron (additive): $f(Q, K_i) = V^T \tanh(WQ + UK_i)$
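The four score functions can be sketched in NumPy as follows. The dimensions `d` and `h` and all weight-matrix names are illustrative assumptions; in practice the weights would be learned parameters:

```python
import numpy as np

rng = np.random.default_rng(1)
d, h = 4, 5                              # key/query dim and hidden dim (assumed)
Q = rng.standard_normal(d)               # query vector
K_i = rng.standard_normal(d)             # one key vector
W_gen = rng.standard_normal((d, d))      # weight for the "general" score
W_cat = rng.standard_normal(2 * d)       # weight for the "concat" score
W_p = rng.standard_normal((h, d))        # perceptron weight on Q
U_p = rng.standard_normal((h, d))        # perceptron weight on K_i
v = rng.standard_normal(h)               # perceptron output projection

s_dot = Q @ K_i                           # dot product: Q^T K_i
s_gen = Q @ W_gen @ K_i                   # general:     Q^T W K_i
s_cat = W_cat @ np.concatenate([Q, K_i])  # concat:      W [Q; K_i]
s_per = v @ np.tanh(W_p @ Q + U_p @ K_i)  # perceptron:  V^T tanh(WQ + U K_i)
```

Each variant produces a scalar score for the pair $(Q, K_i)$; any of them can be plugged in as $f(Q, K_i)$ in the three-step procedure above.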