Bold symbols denote vectors.
$$
\operatorname{softmax}(\boldsymbol{x})=\frac{e^{\boldsymbol{x}}}{\boldsymbol{1}^T e^{\boldsymbol{x}}}
$$
where $\boldsymbol{1}$ is the all-ones vector and $e^{\boldsymbol{x}}$ is taken elementwise.
$$
\operatorname{diag}(\boldsymbol{x})=\operatorname{diag}(x_1,x_2,\cdots,x_n)
$$
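The two definitions above can be sketched in plain Python. This is a minimal illustration, not a reference implementation: the function names `softmax` and `diag` are chosen here for convenience, and subtracting `max(x)` before exponentiating is a common stability trick that is not part of the text above (it leaves the result unchanged because the common factor cancels in the ratio).

```python
import math

def softmax(x):
    """Return softmax(x) = e^x / (1^T e^x) for a list of floats."""
    # Assumption: shift by max(x) to avoid overflow in exp; the ratio is unchanged.
    m = max(x)
    exps = [math.exp(xi - m) for xi in x]
    total = sum(exps)  # plays the role of 1^T e^x (up to the common factor e^{-m})
    return [e / total for e in exps]

def diag(x):
    """Return diag(x_1, ..., x_n) as a nested list (n x n matrix)."""
    n = len(x)
    return [[x[i] if i == j else 0.0 for j in range(n)] for i in range(n)]

y = softmax([1.0, 2.0, 3.0])
print(y)       # three positive entries, increasing with the input
print(sum(y))  # the entries sum to 1 up to rounding
```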
Rules used (denominator layout)
If $a=a(\boldsymbol{x})$ is a scalar function and $\boldsymbol{u}=\boldsymbol{u}(\boldsymbol{x})$ is a vector function, then
$$
\frac{\partial a\boldsymbol{u}}{\partial \boldsymbol{x}}=a\frac{\partial \boldsymbol{u}}{\partial \boldsymbol{x}}+\frac{\partial a}{\partial \boldsymbol{x}}\boldsymbol{u}^T
$$
If $\boldsymbol{u} = \boldsymbol{u}\left(\boldsymbol{x}\right)$ and $\boldsymbol{v} = \boldsymbol{v}\left(\boldsymbol{x}\right)$ are both vector functions, then
$$
\frac{\partial \boldsymbol{u}^T\boldsymbol{v}}{\partial \boldsymbol{x}} = \frac{\partial \boldsymbol{u}}{\partial \boldsymbol{x}}\boldsymbol{v} + \frac{\partial \boldsymbol{v}}{\partial \boldsymbol{x}}\boldsymbol{u}
$$
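Both rules can be checked numerically with central differences. The sketch below is only a sanity check under stated assumptions: the test functions `a`, `u`, `v`, the point `x0`, and the helpers `grad` and `jac` are hypothetical choices made for illustration. In the denominator layout, entry `[i][j]` of a Jacobian holds the derivative of output `j` with respect to input `i`.

```python
import math

def grad(f, x, h=1e-6):
    """Central-difference gradient of a scalar f; entry i is d f / d x_i."""
    g = []
    for i in range(len(x)):
        xp, xm = list(x), list(x)
        xp[i] += h
        xm[i] -= h
        g.append((f(xp) - f(xm)) / (2 * h))
    return g

def jac(F, x, h=1e-6):
    """Denominator-layout Jacobian of a vector F; entry [i][j] is d F_j / d x_i."""
    rows = []
    for i in range(len(x)):
        xp, xm = list(x), list(x)
        xp[i] += h
        xm[i] -= h
        rows.append([(p - m) / (2 * h) for p, m in zip(F(xp), F(xm))])
    return rows

# Hypothetical test functions of a 2-vector x.
def a(x): return x[0] * x[1]
def u(x): return [math.sin(x[0]), x[0] + x[1] ** 2]
def v(x): return [math.exp(x[1]), x[0] * x[1]]

x0 = [0.3, -0.7]
a0, u0, v0 = a(x0), u(x0), v(x0)
ga, Ju, Jv = grad(a, x0), jac(u, x0), jac(v, x0)

# Rule 1: d(a u)/dx = a du/dx + (da/dx) u^T
lhs1 = jac(lambda x: [a(x) * ui for ui in u(x)], x0)
rhs1 = [[a0 * Ju[i][j] + ga[i] * u0[j] for j in range(2)] for i in range(2)]

# Rule 2: d(u^T v)/dx = (du/dx) v + (dv/dx) u
lhs2 = grad(lambda x: sum(p * q for p, q in zip(u(x), v(x))), x0)
rhs2 = [sum(Ju[i][j] * v0[j] + Jv[i][j] * u0[j] for j in range(2)) for i in range(2)]

err1 = max(abs(lhs1[i][j] - rhs1[i][j]) for i in range(2) for j in range(2))
err2 = max(abs(lhs2[i] - rhs2[i]) for i in range(2))
print(err1, err2)  # both should be tiny (finite-difference noise only)
```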
Using the denominator layout, let $\boldsymbol{y}=\operatorname{softmax}(\boldsymbol{x})$. Then
$$
\begin{aligned}
\frac{\partial \operatorname{softmax}(\boldsymbol{x})}{\partial \boldsymbol{x}}
&=\frac{1}{\boldsymbol{1}^T e^{\boldsymbol{x}}}\frac{\partial e^{\boldsymbol{x}}}{\partial \boldsymbol{x}}+\frac{\partial \left(\frac{1}{\boldsymbol{1}^T e^{\boldsymbol{x}}}\right)}{\partial \boldsymbol{x}}(e^{\boldsymbol{x}})^T\\
&=\frac{1}{\boldsymbol{1}^T e^{\boldsymbol{x}}}\operatorname{diag}(e^{\boldsymbol{x}})-\frac{1}{(\boldsymbol{1}^T e^{\boldsymbol{x}})^2}\frac{\partial (\boldsymbol{1}^T e^{\boldsymbol{x}})}{\partial \boldsymbol{x}}(e^{\boldsymbol{x}})^T\\
&=\frac{1}{\boldsymbol{1}^T e^{\boldsymbol{x}}}\operatorname{diag}(e^{\boldsymbol{x}})-\frac{1}{(\boldsymbol{1}^T e^{\boldsymbol{x}})^2}\operatorname{diag}(e^{\boldsymbol{x}})\boldsymbol{1}(e^{\boldsymbol{x}})^T\\
&=\frac{1}{\boldsymbol{1}^T e^{\boldsymbol{x}}}\operatorname{diag}(e^{\boldsymbol{x}})-\frac{1}{(\boldsymbol{1}^T e^{\boldsymbol{x}})^2}e^{\boldsymbol{x}}(e^{\boldsymbol{x}})^T\\
&=\operatorname{diag}(\operatorname{softmax}(\boldsymbol{x}))-\operatorname{softmax}(\boldsymbol{x})(\operatorname{softmax}(\boldsymbol{x}))^T\\
&=\operatorname{diag}(\boldsymbol{y})-\boldsymbol{y}\boldsymbol{y}^T
\end{aligned}
$$

The first line applies the scalar-times-vector rule with $a=\frac{1}{\boldsymbol{1}^T e^{\boldsymbol{x}}}$ and $\boldsymbol{u}=e^{\boldsymbol{x}}$. The second uses $\frac{\partial e^{\boldsymbol{x}}}{\partial \boldsymbol{x}}=\operatorname{diag}(e^{\boldsymbol{x}})$, and the third applies the inner-product rule with the constant vector $\boldsymbol{1}$, which gives $\frac{\partial (\boldsymbol{1}^T e^{\boldsymbol{x}})}{\partial \boldsymbol{x}}=\operatorname{diag}(e^{\boldsymbol{x}})\boldsymbol{1}=e^{\boldsymbol{x}}$.
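The closed form $\operatorname{diag}(\boldsymbol{y})-\boldsymbol{y}\boldsymbol{y}^T$ can be compared against a finite-difference Jacobian. The following is a minimal sketch in plain Python; the function names, the test point `x0`, and the step size `h` are assumptions made for illustration. Elementwise, the formula reads $\partial y_j/\partial x_i = y_j(\delta_{ij}-y_i)$. Because the matrix is symmetric, the result is the same in either layout.

```python
import math

def softmax(x):
    """softmax(x) = e^x / (1^T e^x), with a max-shift for numerical stability."""
    m = max(x)
    exps = [math.exp(xi - m) for xi in x]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_jacobian(x):
    """Closed form diag(y) - y y^T, where y = softmax(x)."""
    y = softmax(x)
    n = len(y)
    return [[(y[i] if i == j else 0.0) - y[i] * y[j] for j in range(n)]
            for i in range(n)]

def numeric_jacobian(x, h=1e-6):
    """Central differences; entry [i][j] approximates d y_j / d x_i."""
    rows = []
    for i in range(len(x)):
        xp, xm = list(x), list(x)
        xp[i] += h
        xm[i] -= h
        rows.append([(p - m) / (2 * h) for p, m in zip(softmax(xp), softmax(xm))])
    return rows

x0 = [0.5, -1.0, 2.0]  # hypothetical test point
J = softmax_jacobian(x0)
J_num = numeric_jacobian(x0)
n = len(x0)
max_err = max(abs(J[i][j] - J_num[i][j]) for i in range(n) for j in range(n))
print(max_err)  # should be tiny (finite-difference noise only)
```

Each row of `J` sums to zero up to rounding, which reflects the fact that the softmax outputs always sum to 1.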