Problem Description
Consider the recurrent network model:
$$
x(k) = f[Wx(k-1)] \tag{1}
$$
where $x(k) \in R^N$ is the vector of network node states, $W \in R^{N\times N}$ is the matrix of connection weights between the nodes, and the output nodes are $\{x_i(k) \mid i\in O\}$, with $O$ the index set of all output (or "observed") units.
The goal of training is to reduce the error between the observed states and their desired values, i.e., to minimize the loss function:
$$
E = \frac{1}{2}\sum_{k=1}^K \sum_{i\in O} [x_i(k) - d_i(k)]^2 \tag{2}
$$
where $d_i(k)$ is the desired value of node $i$ at time $k$. The weight matrix $W$ is updated by gradient descent:
$$
W_+ = W - \eta \frac{dE}{dW}
$$
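As a concrete reference point, here is a minimal numpy sketch of the forward recurrence (1) and the loss (2). The activation $f=\tanh$, the sizes $N$, $K$, the output set $O$, and all helper names are illustrative assumptions; the gradient $dE/dW$ needed by the update is exactly what the rest of this note derives.

```python
import numpy as np

rng = np.random.default_rng(0)
N, K = 4, 10                 # network size and trajectory length (arbitrary choices)
O = [0, 1]                   # indices of the output/observed units (assumed)
f = np.tanh                  # one common choice; the text leaves f generic

W = rng.normal(scale=0.5, size=(N, N))   # connection weights
x0 = rng.normal(size=N)                  # initial state x(0)
d = rng.normal(size=(K, N))              # desired values d_i(k) (dummy data)

def simulate(W, x0, K):
    """Run x(k) = f[W x(k-1)] for k = 1..K; row k-1 of the result holds x(k)."""
    xs, x = np.empty((K, len(x0))), x0
    for k in range(K):
        x = f(W @ x)
        xs[k] = x
    return xs

def loss(xs, d):
    """E = (1/2) sum_k sum_{i in O} [x_i(k) - d_i(k)]^2, i.e. Eq. (2)."""
    err = xs[:, O] - d[:, O]
    return 0.5 * np.sum(err**2)

xs = simulate(W, x0, K)
print("E =", loss(xs, d))    # the update W <- W - eta * dE/dW still needs dE/dW, Eq. (5)
```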
Notation
$$
W \equiv \begin{bmatrix} \text{-----}\,w_1^T\,\text{-----} \\ \vdots \\ \text{-----}\,w_N^T\,\text{-----} \end{bmatrix}_{N\times N}
$$
Flatten the matrix $W$ into a column vector, denoted $w$:
$$
w = [w_1^T, \cdots, w_N^T]^T \in R^{N^2}
$$
Stack the states at all time steps into a single column vector, denoted $x$:
$$
x = [x^T(1), \cdots, x^T(K)]^T \in R^{NK}
$$
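In code, stacking the rows $w_i^T$ is exactly row-major (C-order) flattening; a small sketch of both conventions, with illustrative names:

```python
import numpy as np

N, K = 3, 4
W = np.arange(float(N * N)).reshape(N, N)

# w = [w_1^T, ..., w_N^T]^T stacks the ROWS of W, which is exactly
# row-major (C-order) flattening:
w = W.reshape(-1)
assert np.array_equal(w[:N], W[0])           # the first N entries are w_1^T

# Likewise x = [x^T(1), ..., x^T(K)]^T stacks the per-step states:
xs = np.arange(float(K * N)).reshape(K, N)   # row k-1 holds x(k)
x = xs.reshape(-1)                           # x in R^{NK}
```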
Treating RNN training as a constrained optimization problem, Eq. (1) turns into the constraint:
$$
g(k) \equiv f[Wx(k-1)] - x(k) = 0, \quad k=1,\ldots,K \tag{3}
$$
and denote
$$
g = [g^T(1), \ldots, g^T(K)]^T \in R^{NK}
$$
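A sketch of the stacked residual $g$, assuming the same $f=\tanh$ and array layout as above (the function name is hypothetical); on a trajectory generated by the recurrence itself, $g$ vanishes by construction:

```python
import numpy as np

def residual(W, x0, xs, f=np.tanh):
    """g = [g^T(1), ..., g^T(K)]^T with g(k) = f[W x(k-1)] - x(k), Eq. (3).
    Row k-1 of xs holds x(k); the result lives in R^{NK}."""
    prev = np.vstack([x0, xs[:-1]])      # x(0), ..., x(K-1)
    return (f(prev @ W.T) - xs).reshape(-1)

# residual(W, x0, simulate(W, x0, K)) with the earlier sketch is all zeros
# (up to floating-point error), so the constraint set is consistent.
```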
Main Derivation
Since $x$ and $w$ are coupled by constraint (3), $x$ can be regarded as a function of $w$, written $x(w)$. Accordingly, $E(x) \to E(x(w))$ and $g(x,w) \to g(x(w),w)$.
Since $g \equiv 0$ identically in $w$,
$$
0 = \frac{dg(x(w),w)}{dw} = \frac{\partial g(x(w),w)}{\partial x}\frac{\partial x(w)}{\partial w} + \frac{\partial g(x(w),w)}{\partial w} \tag{4}
$$
Solving (4) for $\partial x(w)/\partial w$ and substituting into the chain rule for $E$:
$$
\begin{aligned} \frac{dE(x(w))}{dw} &= \frac{\partial E(x(w))}{\partial x}\frac{\partial x(w)}{\partial w} \\ &= -\frac{\partial E(x(w))}{\partial x}\left(\frac{\partial g(x(w),w)}{\partial x}\right)^{-1} \frac{\partial g(x(w),w)}{\partial w} \end{aligned}
$$
which we abbreviate as
$$
\frac{dE}{dw} = -\frac{\partial E}{\partial x}\left(\frac{\partial g}{\partial x}\right)^{-1} \frac{\partial g}{\partial w} \tag{5}
$$
Most gradient-based training algorithms for recurrent networks are organized around Eq. (5).
First, the dimensions of each term must be clear:
$$
\begin{aligned} E &\in R \\ g &\in R^{NK}\\ x &\in R^{NK}\\ w &\in R^{N^2}\\ \frac{\partial E}{\partial x} &\in R^{1\times NK} \\ \frac{\partial g}{\partial x} &\in R^{NK\times NK} \\ \frac{\partial g}{\partial w} &\in R^{NK \times N^2} \end{aligned}
$$
Then consider how to compute each factor:
1.
$$
\begin{aligned} \frac{\partial E}{\partial x} &= [e(1), \ldots, e(K)], \\ e_i(k) &= \begin{cases} x_i(k) - d_i(k), &\text{if } i\in O, \\ 0, &\text{otherwise,} \end{cases} \qquad k = 1,\ldots,K, \end{aligned}
$$
where each block $e(k) \in R^{1\times N}$.
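A sketch of this term under the assumed layout (the mask over $O$ zeroes the unobserved entries; the function name is illustrative):

```python
import numpy as np

def dE_dx(xs, d, O):
    """Row vector in R^{1 x NK}: e_i(k) = x_i(k) - d_i(k) for i in O, else 0."""
    e = np.zeros_like(xs)
    e[:, O] = xs[:, O] - d[:, O]
    return e.reshape(1, -1)      # blocks [e(1), ..., e(K)] laid out in one row
```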
2.
$$
\frac{\partial g}{\partial x} = \begin{bmatrix} \frac{\partial g(1)}{\partial x}\\ \vdots \\ \frac{\partial g(K)}{\partial x} \end{bmatrix} = \begin{bmatrix} \frac{\partial g(1)}{\partial x(1)} & \ldots & \frac{\partial g(1)}{\partial x(K)}\\ \vdots & \ddots & \vdots\\ \frac{\partial g(K)}{\partial x(1)} & \ldots & \frac{\partial g(K)}{\partial x(K)} \end{bmatrix}
$$
From Eq. (3),
$$
\frac{\partial g(i)}{\partial x(j)} = \begin{cases} -I, &\text{if } i=j, \\ \frac{\partial f[Wx(j)]}{\partial x(j)}, &\text{if } i=j+1,\\ 0, &\text{otherwise,} \end{cases}
$$
where
$$
\begin{aligned} \frac{\partial f[Wx(j)]}{\partial x(j)} & = \begin{bmatrix} \frac{\partial f(w_1^Tx(j))}{\partial x_1(j)} & \ldots & \frac{\partial f(w_1^Tx(j))}{\partial x_N(j)}\\ \vdots & \ddots & \vdots\\ \frac{\partial f(w_N^Tx(j))}{\partial x_1(j)}& \ldots & \frac{\partial f(w_N^Tx(j))}{\partial x_N(j)} \end{bmatrix}\\ & = \begin{bmatrix} f'(w_1^Tx(j))w_{11} & \ldots & f'(w_1^Tx(j))w_{1N}\\ \vdots & \ddots & \vdots\\ f'(w_N^Tx(j))w_{N1} & \ldots & f'(w_N^Tx(j))w_{NN} \end{bmatrix}\\ &= \begin{bmatrix} f'(w_1^Tx(j)) & &0\\ & \ddots & \\ 0& & f'(w_N^Tx(j)) \end{bmatrix}W \\ &\triangleq D(j)W \end{aligned}
$$
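A sketch of this Jacobian, again assuming $f=\tanh$ (so $f'(u) = 1-\tanh^2 u$); the function names are illustrative, and the commented finite-difference check is one way to confirm the formula:

```python
import numpy as np

def fprime(u):
    """Derivative of the assumed activation f = tanh."""
    return 1.0 - np.tanh(u)**2

def jac_f_x(W, x_j):
    """∂f[Wx(j)]/∂x(j) = D(j) W, with D(j) = diag(f'(w_i^T x(j)))."""
    return np.diag(fprime(W @ x_j)) @ W

# Check against a finite difference in coordinate direction e_m:
#   (np.tanh(W @ (x + eps*e_m)) - np.tanh(W @ x)) / eps  ≈  jac_f_x(W, x)[:, m]
```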
Combining the cases above with this expression for the subdiagonal blocks,
$$
\frac{\partial g}{\partial x} = \begin{bmatrix} -I & 0& 0 &\ldots & 0\\ D(1)W & -I & 0 &\ldots & 0 \\ 0 & D(2)W & -I & \ldots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & D(K-1)W& -I \end{bmatrix}_{NK\times NK}
$$
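The block structure translates directly into code; a sketch that assembles the full $NK\times NK$ matrix densely, for clarity only (for large $N$, $K$ one would exploit the bidiagonal sparsity):

```python
import numpy as np

def dg_dx(W, xs, fprime=lambda u: 1.0 - np.tanh(u)**2):
    """Block lower-bidiagonal NK x NK matrix: -I on the diagonal blocks,
    D(k)W on the first subdiagonal. Row k-1 of xs holds x(k)."""
    K, N = xs.shape
    J = -np.eye(N * K)
    for k in range(1, K):
        D = np.diag(fprime(W @ xs[k - 1]))      # D(k), built from x(k)
        J[k*N:(k+1)*N, (k-1)*N:k*N] = D @ W     # block ∂g(k+1)/∂x(k)
    return J
```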
$$
\frac{\partial g}{\partial w} = \begin{bmatrix} \frac{\partial g(1)}{\partial w}\\ \vdots \\ \frac{\partial g(K)}{\partial w} \end{bmatrix}= \begin{bmatrix} \frac{\partial f[Wx(0)]}{\partial w}\\ \vdots \\ \frac{\partial f[Wx(K-1)]}{\partial w} \end{bmatrix}
$$
where
$$
\begin{aligned} \frac{\partial f[Wx(k)]}{\partial w} &= \begin{bmatrix} \frac{\partial f[w_1^Tx(k)]}{\partial w}\\ \vdots \\ \frac{\partial f[w_N^Tx(k)]}{\partial w} \end{bmatrix}_{N\times N^2} \\ &= \begin{bmatrix} f'[w_1^Tx(k)]\,x^T(k) & & & \\ & f'[w_2^Tx(k)]\,x^T(k) & & \\ & & \ddots & \\ & & & f'[w_N^Tx(k)]\,x^T(k) \end{bmatrix} \\ &= \begin{bmatrix} f'[w_1^Tx(k)] & & & \\ & f'[w_2^Tx(k)] & & \\ & & \ddots & \\ & & & f'[w_N^Tx(k)] \end{bmatrix}_{N\times N} \begin{bmatrix} x^T(k) & & & \\ & x^T(k) & & \\ & & \ddots & \\ & & & x^T(k) \end{bmatrix}_{N\times N^2} \\ &\triangleq D(k) X(k) \end{aligned}
$$
where
$$
X(k) \triangleq \begin{bmatrix} x^T(k) & & & \\ & x^T(k) & & \\ & & \ddots & \\ & & & x^T(k) \end{bmatrix}_{N\times N^2}
$$
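Note that $X(k)$ is exactly the Kronecker product $I_N \otimes x^T(k)$, which makes it a one-liner in numpy; the assertion below double-checks the flattening convention, $X(k)\,w = Wx(k)$ (all values are stand-ins):

```python
import numpy as np

N = 3
x_k = np.arange(1.0, N + 1)                    # stand-in for x(k)
W = np.arange(1.0, N*N + 1).reshape(N, N)

X_k = np.kron(np.eye(N), x_k.reshape(1, -1))   # N x N^2, block-diagonal x^T(k)
assert np.allclose(X_k @ W.reshape(-1), W @ x_k)   # X(k) w == W x(k)
```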
Hence
$$
\frac{\partial g}{\partial w} = \begin{bmatrix} D(0)X(0)\\ D(1)X(1) \\ \vdots \\ D(K-1)X(K-1) \end{bmatrix}
$$
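Putting all the pieces together, the sketch below assembles $\partial E/\partial x$, $\partial g/\partial x$, and $\partial g/\partial w$ exactly as derived and checks Eq. (5) against a central finite difference of $E$ in each entry of $W$. The sizes, output set, and $f=\tanh$ are assumptions, and inverting $\partial g/\partial x$ densely is for verification only:

```python
import numpy as np

rng = np.random.default_rng(1)
N, K, O = 3, 5, [0]                       # small sizes and output set (assumed)

def f(u):  return np.tanh(u)
def fp(u): return 1.0 - np.tanh(u)**2

W  = rng.normal(scale=0.5, size=(N, N))
x0 = rng.normal(size=N)
d  = rng.normal(size=(K, N))

def simulate(W):
    xs, x = np.empty((K, N)), x0
    for k in range(K):
        x = f(W @ x)
        xs[k] = x
    return xs

def E(W):
    xs = simulate(W)
    return 0.5 * np.sum((xs[:, O] - d[:, O])**2)

xs = simulate(W)
prev = np.vstack([x0, xs[:-1]])           # x(0), ..., x(K-1)

e = np.zeros((K, N)); e[:, O] = xs[:, O] - d[:, O]
dEdx = e.reshape(1, -1)                   # ∂E/∂x in R^{1 x NK}

dgdx = -np.eye(N * K)                     # ∂g/∂x: -I diagonal, D(k)W subdiagonal
for k in range(1, K):
    dgdx[k*N:(k+1)*N, (k-1)*N:k*N] = np.diag(fp(W @ xs[k-1])) @ W

dgdw = np.vstack([np.diag(fp(W @ p)) @ np.kron(np.eye(N), p.reshape(1, -1))
                  for p in prev])         # rows D(k)X(k), k = 0..K-1

dEdw = -dEdx @ np.linalg.inv(dgdx) @ dgdw # Eq. (5)

num, eps = np.zeros(N * N), 1e-6          # central finite differences
for m in range(N * N):
    dW = np.zeros(N * N); dW[m] = eps
    num[m] = (E(W + dW.reshape(N, N)) - E(W - dW.reshape(N, N))) / (2 * eps)

print("max |analytic - numeric| =", np.max(np.abs(dEdw.ravel() - num)))
```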
References
- A. F. Atiya and A. G. Parlos, "New Results on Recurrent Network Training: Unifying the Algorithms and Accelerating Convergence," IEEE Transactions on Neural Networks, vol. 11, no. 3, 2000.