Problem Statement

Consider the recurrent network model:
$$x(k+1) = (1-\Delta t)\,x(k) + \Delta t\, W f[x(k)] \tag{1}$$
where $x(k) \in R^N$ is the state of the network nodes before activation, $W \in R^{N\times N}$ is the matrix of connection weights between the nodes, the output nodes of the network are $\{x_i(k) \mid i\in O\}$, and $O$ is the index set of all output (or "observed") units.
The goal of training is to reduce the error between the observed states and their desired values, i.e., to minimize the loss function

$$E = \frac{1}{2}\sum_{k=1}^K \sum_{i\in O} \bigl[x_i(k) - d_i(k)\bigr]^2 \tag{2}$$
where $d_i(k)$ is the desired value of node $i$ at time step $k$.
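To make the setup concrete, the following is a minimal NumPy sketch (not from the referenced papers; the activation `np.tanh`, the function names, and `out_idx` are illustrative choices) that simulates the recurrence (1) and evaluates the loss (2):

```python
import numpy as np

def simulate(W, x0, dt, K, f=np.tanh):
    """Iterate x(k+1) = (1 - dt) x(k) + dt * W f[x(k)] for K steps."""
    xs = [x0]
    for _ in range(K):
        xs.append((1 - dt) * xs[-1] + dt * W @ f(xs[-1]))
    return np.stack(xs[1:])                  # rows are x(1), ..., x(K)

def loss(xs, d, out_idx):
    """E = 1/2 * sum_k sum_{i in O} [x_i(k) - d_i(k)]^2."""
    err = xs[:, out_idx] - d                 # d holds d_i(k) for the output units only
    return 0.5 * np.sum(err ** 2)
```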
Notation
$$W \equiv \begin{bmatrix} \text{-----}\, w_1^T \,\text{-----} \\ \vdots \\ \text{-----}\, w_N^T \,\text{-----} \end{bmatrix}_{N\times N}$$
Flatten the matrix $W$ into a column vector, denoted $w$:

$$w = [w_1^T, \cdots, w_N^T]^T \in R^{N^2}$$
Stack the states at all time steps into a column vector, denoted $x$:

$$x = [x^T(1), \cdots, x^T(K)]^T \in R^{NK}$$
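As a convention check (illustrative code only, not part of the algorithm), the flattening of $W$ into $w$ corresponds to a row-major reshape, and $x$ is simply the per-step states concatenated:

```python
import numpy as np

N, K = 3, 5
W = np.arange(N * N, dtype=float).reshape(N, N)

w = W.reshape(-1)                  # w = [w_1^T, ..., w_N^T]^T  (row-major flatten)
assert np.allclose(w[:N], W[0])    # the first block of w is the first row w_1^T

xs = np.random.randn(K, N)         # rows are x(1), ..., x(K)
x = xs.reshape(-1)                 # x = [x^T(1), ..., x^T(K)]^T, length N*K
```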
Viewing RNN training as a constrained optimization problem, Eq. (1) becomes the constraint

$$g(k+1) \equiv -x(k+1) + (1-\Delta t)\,x(k) + \Delta t\, W f[x(k)], \quad k=0,\ldots,K-1 \tag{3}$$
Write

$$g = [g^T(1), \ldots, g^T(K)]^T \in R^{NK}$$
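As a sanity check, the stacked constraint vector can be evaluated directly; on a trajectory generated by the recurrence itself it is identically zero (a sketch under the same illustrative setup as above):

```python
import numpy as np

def constraint_residual(W, xs, dt, f=np.tanh):
    """g(k+1) = -x(k+1) + (1-dt) x(k) + dt * W f[x(k)], stacked for k = 0, ..., K-1.

    xs has shape (K+1, N) and holds x(0), x(1), ..., x(K); returns a vector of length N*K.
    """
    g = -xs[1:] + (1 - dt) * xs[:-1] + dt * (f(xs[:-1]) @ W.T)
    return g.reshape(-1)
```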
Review of the Atiya-Parlos Algorithm
The above is the classical gradient-descent way of thinking, but Atiya and Parlos proposed a different optimization idea: the update does not follow the gradient direction with respect to the parameters, yet it still decreases the cost function.

The idea of the algorithm is to swap the roles of the network states $x(k)$ and the weight matrix $W$: the states are treated as the control variables, and the change in the weights is determined from the change in $x(k)$. In other words, we compute the gradient of $E$ with respect to the states $x(k)$ and assume the states change by a small step along the negative gradient direction, $\Delta x_i(k) = -\eta\,\dfrac{\partial E}{\partial x_i(k)}$.

Next, we determine the weight change $\Delta w$ such that the state change induced by the weight change is as close as possible to the target change $\Delta x$.

The details of the algorithm are as follows:
$$\begin{aligned} \Delta x &= -\eta \left(\frac{\partial E}{\partial x} \right)^T = -\eta\, e = -\eta\, \bigl[e^T(1), \ldots, e^T(K)\bigr]^T, \\ e_i(k)&= \begin{cases} x_i(k) - d_i(k), &\text{if } i\in O, \\ 0, &\text{otherwise,} \end{cases} \qquad k = 1,\ldots,K. \end{aligned}$$
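A small sketch of this "virtual teacher" step, assuming the states `xs`, targets `d`, and output index set `out_idx` from the earlier snippets:

```python
import numpy as np

def virtual_target_step(xs, d, out_idx, eta):
    """Delta x = -eta * e, where e_i(k) = x_i(k) - d_i(k) on output units and 0 elsewhere."""
    e = np.zeros_like(xs)                  # shape (K, N)
    e[:, out_idx] = xs[:, out_idx] - d     # error is nonzero only on the observed units
    return -eta * e, e                     # Delta x and the raw masked error
```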
From the constraint (to first order, the perturbations must keep $g = 0$) we obtain:

$$\frac{\partial g}{\partial x}\, \Delta x = - \frac{\partial g}{\partial w}\, \Delta w$$
Hence, once $\Delta x$ is known, solving for $\Delta w$ in the least-squares sense gives:

$$\Delta w = -\left[\left(\frac{\partial g}{\partial w}\right)^T \left(\frac{\partial g}{\partial w}\right)\right]^{-1} \left(\frac{\partial g}{\partial w}\right)^T \frac{\partial g}{\partial x}\, \Delta x$$
Note that this inverse does not necessarily exist, so a regularization term is added:

$$\Delta w = -\left[\left(\frac{\partial g}{\partial w}\right)^T \left(\frac{\partial g}{\partial w}\right) + \epsilon I \right]^{-1} \left(\frac{\partial g}{\partial w}\right)^T \frac{\partial g}{\partial x}\, \Delta x$$
This is the update rule for the weights $W$.
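Written generically, the regularized solve is an ordinary ridge-regularized least-squares step (a sketch only; in practice the structured form derived below is used instead of materializing these large Jacobians):

```python
import numpy as np

def weight_update(dg_dw, dg_dx, delta_x, eps):
    """Delta w = -[(dg/dw)^T (dg/dw) + eps*I]^{-1} (dg/dw)^T (dg/dx) Delta x."""
    JtJ = dg_dw.T @ dg_dw + eps * np.eye(dg_dw.shape[1])
    rhs = dg_dw.T @ (dg_dx @ delta_x)
    return -np.linalg.solve(JtJ, rhs)
```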
Computational Details
Computing $\frac{\partial g}{\partial w}$

$$\frac{\partial g}{\partial w} = \begin{bmatrix} \frac{\partial g(1)}{\partial w}\\ \vdots \\ \frac{\partial g(K)}{\partial w} \end{bmatrix} = \Delta t \begin{bmatrix} \frac{\partial\, Wf[x(0)] }{\partial w}\\ \vdots \\ \frac{\partial\, Wf[x(K-1)] }{\partial w} \end{bmatrix}$$
where, writing $f_k \equiv [f(x_1(k)), \ldots, f(x_N(k))]^T$,

$$\frac{\partial\, Wf[x(k)]}{\partial w} = \begin{bmatrix} \frac{\partial\, w_1^T f[x(k)]}{\partial w}\\ \vdots \\ \frac{\partial\, w_N^T f[x(k)]}{\partial w} \end{bmatrix} = \begin{bmatrix} f_k^T &&& \\ & f_k^T&& \\ && \ddots & \\ &&& f_k^T \end{bmatrix}_{N\times N^2} \triangleq F(k)$$
Hence

$$\frac{\partial g}{\partial w} = \Delta t \begin{bmatrix} F(0)\\ \vdots \\ F(K-1) \end{bmatrix}_{NK \times N^2}$$
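In NumPy the block structure of $F(k)$ is simply a Kronecker product (illustrative only; the algorithm never actually materializes these $N \times N^2$ blocks):

```python
import numpy as np

def F(fk):
    """F(k) = blockdiag(f_k^T, ..., f_k^T) = kron(I_N, f_k^T), shape (N, N^2)."""
    N = fk.shape[0]
    return np.kron(np.eye(N), fk[None, :])

fk = np.tanh(np.random.randn(4))
assert F(fk).shape == (4, 16)
```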
$$\begin{aligned} \frac{1}{\Delta t^2}\left(\frac{\partial g}{\partial w}\right)^T \left(\frac{\partial g}{\partial w}\right) &= \begin{bmatrix} F^T(0) & \cdots & F^T(K-1) \end{bmatrix} \begin{bmatrix} F(0)\\ \vdots \\ F(K-1) \end{bmatrix} = \sum_{k=0}^{K-1} F^T(k)F(k) \\[1ex] &=\begin{bmatrix} \sum_{k=0}^{K-1} f_k f_k^T &&& \\ & \sum_{k=0}^{K-1} f_k f_k^T && \\ && \ddots & \\ &&& \sum_{k=0}^{K-1} f_k f_k^T \end{bmatrix}_{N^2 \times N^2} \triangleq \mathrm{diag}\{C_{K-1}\} \end{aligned}$$
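The correlation matrix $C_{K-1}$ is therefore just an accumulated outer product of the activation vectors (a sketch; following the text below, `eps` folds the regularizer directly into $C$):

```python
import numpy as np

def correlation_matrix(xs_prev, eps, f=np.tanh):
    """C_{K-1} = eps*I + sum_{k=0}^{K-1} f_k f_k^T, with f_k = f[x(k)]."""
    fs = f(xs_prev)                        # rows are f_0, ..., f_{K-1}
    return eps * np.eye(fs.shape[1]) + fs.T @ fs
```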
Let

$$\gamma = \begin{bmatrix} \gamma(1)\\ \gamma(2) \\ \vdots \\ \gamma(K) \end{bmatrix}_{NK} = \frac{\partial g}{\partial x}\, \Delta x$$

$\gamma$ carries the error information supplied by $\Delta x$; its computation is deferred to the end of this post, so for now assume it is already available.
Then

$$\begin{aligned} \left(\frac{\partial g}{\partial w}\right)^T \frac{\partial g}{\partial x}\, \Delta x &= \Delta t \begin{bmatrix} F^T(0) & \cdots & F^T(K-1) \end{bmatrix}_{N^2 \times NK} \begin{bmatrix} \gamma(1)\\ \gamma(2) \\ \vdots \\ \gamma(K) \end{bmatrix}_{NK} = \Delta t\sum_{k=1}^K F^T(k-1)\,\gamma(k) \\[1ex] &=\Delta t \sum_{k=1}^K \begin{bmatrix} f_{k-1} &&& \\ & f_{k-1}&& \\ && \ddots & \\ &&& f_{k-1} \end{bmatrix}_{N^2 \times N} \begin{bmatrix} \gamma_1(k)\\ \gamma_2(k) \\ \vdots \\ \gamma_N(k) \end{bmatrix}_{N} =\Delta t \begin{bmatrix} \sum_{k=1}^K f_{k-1}\, \gamma_1(k)\\ \sum_{k=1}^K f_{k-1}\, \gamma_2(k) \\ \vdots \\ \sum_{k=1}^K f_{k-1}\, \gamma_N(k) \end{bmatrix}_{N^2} \end{aligned}$$
Therefore

$$\begin{aligned} \Delta w &= -\left[\left(\frac{\partial g}{\partial w}\right)^T \left(\frac{\partial g}{\partial w}\right) + \epsilon I\right]^{-1} \left(\frac{\partial g}{\partial w}\right)^T \frac{\partial g}{\partial x}\, \Delta x = - \frac{1}{\Delta t} \begin{bmatrix} C_{K-1}^{-1} \sum_{k=1}^K f_{k-1}\, \gamma_1(k)\\ C_{K-1}^{-1} \sum_{k=1}^K f_{k-1}\, \gamma_2(k) \\ \vdots \\ C_{K-1}^{-1} \sum_{k=1}^K f_{k-1}\, \gamma_N(k) \end{bmatrix}_{N^2}, \\[1ex] \Delta W &= - \frac{1}{\Delta t} \begin{bmatrix} \sum_{k=1}^K f_{k-1}^T C_{K-1}^{-1}\, \gamma_1(k)\\ \sum_{k=1}^K f_{k-1}^T C_{K-1}^{-1}\, \gamma_2(k) \\ \vdots \\ \sum_{k=1}^K f_{k-1}^T C_{K-1}^{-1}\, \gamma_N(k) \end{bmatrix}_{N\times N} = - \frac{1}{\Delta t} \sum_{k=1}^K\begin{bmatrix} f_{k-1}^T\, \gamma_1(k)\\ f_{k-1}^T\, \gamma_2(k) \\ \vdots \\ f_{k-1}^T\, \gamma_N(k) \end{bmatrix}_{N\times N} C_{K-1}^{-1}, \end{aligned}$$

where

$$C_{K-1} = \epsilon I + \sum_{r=0}^{K-1} f_r f_r^T$$
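Putting the pieces together, the batch update can be computed without ever forming the $NK \times N^2$ Jacobian (a sketch under the same illustrative setup; `gammas` is assumed to hold $\gamma(1), \ldots, \gamma(K)$ row by row):

```python
import numpy as np

def delta_W_batch(xs_prev, gammas, dt, eps, f=np.tanh):
    """Batch Atiya-Parlos update: Delta W = -(1/dt) * (sum_k gamma(k) f_{k-1}^T) C_{K-1}^{-1}.

    xs_prev: states x(0), ..., x(K-1), shape (K, N)
    gammas:  gamma(1), ..., gamma(K),  shape (K, N)
    """
    fs = f(xs_prev)                                # f_0, ..., f_{K-1}
    C = eps * np.eye(fs.shape[1]) + fs.T @ fs      # C_{K-1}
    A = gammas.T @ fs                              # sum_k gamma(k) f_{k-1}^T, shape (N, N)
    return -(1.0 / dt) * np.linalg.solve(C, A.T).T # A @ C^{-1}  (C is symmetric)
```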
Note: the $\Delta W$ above is an update computed over the entire time span $1, 2, \ldots, K$; call it $\Delta W^{batch}$.
Next, this update formula is decomposed into an online-updating form:

$$\Delta W^{batch}(K)= \Delta W(1) + \cdots + \Delta W(K)$$
Each term on the right-hand side is the update contributed at a single time step. The update to the input weights of neuron $i$ at time step $K$ is:
$$\begin{aligned} \Delta w^T_{i}(K) &= - \frac{1}{\Delta t} \sum_{k=1}^{K} f_{k-1}^T C_{K-1}^{-1}\, \gamma_i(k) + \frac{1}{\Delta t} \sum_{k=1}^{K-1} f_{k-1}^T C_{K-2}^{-1}\, \gamma_i(k)\\[1ex] &= - \frac{1}{\Delta t}\, f_{K-1}^T C_{K-1}^{-1}\, \gamma_i(K) - \frac{1}{\Delta t} \sum_{k=1}^{K-1} f_{k-1}^T \left(C_{K-1}^{-1} - C_{K-2}^{-1}\right) \gamma_i(k) \\[1ex] &=- \frac{1}{\Delta t}\, f_{K-1}^T C_{K-1}^{-1}\,\gamma_i(K) - \frac{1}{\Delta t} \sum_{k=1}^{K-1} f_{k-1}^T C_{K-2}^{-1}\,\gamma_i(k)\left(C_{K-2}C_{K-1}^{-1} - I\right) \\[1ex] &= - \frac{1}{\Delta t}\, f_{K-1}^T C_{K-1}^{-1}\, \gamma_i(K) + \Delta w_i^{batch,T}(K-1)\left(C_{K-2}C_{K-1}^{-1}- I\right) \\[1ex] &= - \frac{1}{\Delta t}\, f_{K-1}^T C_{K-1}^{-1}\, \gamma_i(K) + \sum_{k=1}^{K-1} \Delta w^T_i(k) \left(C_{K-2}C_{K-1}^{-1}- I\right) \end{aligned}$$
As can be seen, the APRL update rule consists of an error-driven term at the current time step plus a term built from the accumulated update of $w$ (a momentum-like term).

As $K \to \infty$, it is easy to see that $\sum_{k=1}^{K-1} \Delta w^T_i(k) \to \mathrm{const}$ and $C_{K-2}C_{K-1}^{-1} \to I$, so the second term tends to zero.
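For completeness, a direct and unoptimized sketch of one step of this online APRL recursion for a single neuron $i$ (variable names are illustrative):

```python
import numpy as np

def aprl_online_step(acc_dw_row, C_prev, C_curr, f_prev, gamma_iK, dt):
    """Per-step APRL update Delta w_i^T(K) following the recursion above.

    acc_dw_row: sum of the previous per-step updates of row i, shape (N,)
    C_prev, C_curr: C_{K-2} and C_{K-1}
    f_prev: f_{K-1};  gamma_iK: scalar gamma_i(K)
    """
    N = f_prev.shape[0]
    term1 = -(1.0 / dt) * gamma_iK * np.linalg.solve(C_curr, f_prev)    # f^T C^{-1} (C symmetric)
    term2 = acc_dw_row @ (C_prev @ np.linalg.inv(C_curr) - np.eye(N))   # momentum-like correction
    return term1 + term2
```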
BPDC Update Rule
BPDC makes a deliberately crude approximation to the online APRL algorithm: it does not try to accumulate the full correlation matrix $C_k$, it discards the accumulation of previous errors, and it only uses the instantaneous correlation $C(k)$:

$$\begin{aligned} \Delta w^T_{i}(k+1) &= - \frac{1}{\Delta t}\, f_{k}^T\, C(k)^{-1}\, \gamma_i(k+1), \\ C(k) &= \epsilon I + f_k f_k^T \end{aligned}$$
Using the matrix inversion lemma (Sherman-Morrison):

$$C(k)^{-1} = \left(\epsilon I + f_k f_k^T\right)^{-1} = \frac{1}{\epsilon}I - \frac{1}{\epsilon}\, \frac{f_k f_k^T}{\epsilon + f_k^T f_k}$$
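A quick numerical check of this rank-one inversion formula (illustrative values):

```python
import numpy as np

eps = 0.1
f = np.tanh(np.random.randn(5))

C_inv_direct = np.linalg.inv(eps * np.eye(5) + np.outer(f, f))
C_inv_lemma = np.eye(5) / eps - np.outer(f, f) / (eps * (eps + f @ f))
assert np.allclose(C_inv_direct, C_inv_lemma)
```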
Therefore

$$\Delta w^T_{i}(k+1) = - \frac{1}{\Delta t}\, f_{k}^T\left( \frac{1}{\epsilon}I - \frac{1}{\epsilon}\, \frac{f_k f_k^T}{\epsilon + f_k^T f_k}\right) \gamma_i(k+1) = - \frac{1}{\Delta t}\, \frac{f_k^T}{\epsilon + f_k^T f_k}\, \gamma_i(k+1)$$
Computing $\gamma$
$$\gamma = \begin{bmatrix} \gamma(1)\\ \gamma(2) \\ \vdots \\ \gamma(K) \end{bmatrix}_{NK} = \frac{\partial g}{\partial x}\, \Delta x = -\eta\, \frac{\partial g}{\partial x}\, \bigl[e^T(1), \ldots, e^T(K)\bigr]^T$$

The key is to compute $\frac{\partial g}{\partial x}$:
$$\begin{aligned} \frac{\partial g}{\partial x} &= \begin{bmatrix} \frac{\partial g(1)}{\partial x(1)} & \ldots & \frac{ \partial g(1)}{\partial x(K)}\\ \vdots & \ddots & \vdots\\ \frac{\partial g(K)}{\partial x(1)} & \ldots & \frac{ \partial g(K)}{\partial x(K)} \end{bmatrix} = \begin{bmatrix} \frac{\partial g(1)}{\partial x(1)} & 0 &\ldots & 0\\ \frac{\partial g(2)}{\partial x(1)} & \frac{\partial g(2)}{\partial x(2)} &\ldots & 0 \\ \vdots & \ddots & \ddots & \vdots \\ 0 & \ldots & \frac{\partial g(K)}{\partial x(K-1)}& \frac{\partial g(K)}{\partial x(K)} \end{bmatrix} \\[1ex] &= \begin{bmatrix} -I & 0 &\ldots & 0\\ (1-\Delta t )I + \Delta t\, W D(1) & -I &\ldots & 0 \\ \vdots & \ddots & \ddots & \vdots \\ 0 & \ldots &(1-\Delta t )I + \Delta t\, W D(K-1)& -I \end{bmatrix} \end{aligned}$$
where

$$D(k) = \begin{bmatrix} f'(x_1(k)) & \cdots & 0\\ \vdots & \ddots & \vdots\\ 0 & \cdots & f'(x_N(k)) \end{bmatrix}_{N \times N}$$
Therefore

$$\begin{aligned} \gamma &= -\eta\, \frac{\partial g}{\partial x}\, \bigl[e^T(1), \ldots, e^T(K)\bigr]^T \\ &= -\eta \begin{bmatrix} -I & 0 &\ldots & 0\\ (1-\Delta t )I + \Delta t\, W D(1) & -I &\ldots & 0 \\ \vdots & \ddots & \ddots & \vdots \\ 0 & \ldots &(1-\Delta t )I + \Delta t\, W D(K-1)& -I \end{bmatrix} \begin{bmatrix} e(1) \\ e(2) \\ \vdots \\ e(K) \end{bmatrix} \\[1ex] &= -\eta \begin{bmatrix} -e(1) \\ [(1-\Delta t )I + \Delta t\, W D(1)]\,e(1) - e(2) \\ [(1-\Delta t )I + \Delta t\, W D(2)]\,e(2) - e(3) \\ \vdots \\ [(1-\Delta t )I + \Delta t\, W D(K-1)]\,e(K-1) - e(K) \end{bmatrix}_{NK \times 1} \end{aligned}$$
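In code, $\gamma(k+1)$ therefore only needs the two neighbouring error vectors (a sketch; `fprime` is the derivative of the activation, here assumed to be `tanh`):

```python
import numpy as np

def gamma_next(W, x_k, e_k, e_k1, dt, eta, fprime=lambda x: 1.0 - np.tanh(x) ** 2):
    """gamma(k+1) = -eta * { [(1-dt)I + dt * W D(k)] e(k) - e(k+1) }, D(k) = diag(f'(x(k)))."""
    return -eta * ((1 - dt) * e_k + dt * W @ (fprime(x_k) * e_k) - e_k1)
```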
Substituting into the BPDC update rule:

$$\begin{aligned} \Delta w^T_{i}(k+1) &= - \frac{1}{\Delta t}\, \frac{f_k^T}{\epsilon + f_k^T f_k}\, \gamma_i(k+1) \\ &= \frac{\eta}{\Delta t}\, \frac{f_k^T}{\epsilon + f_k^T f_k} \left\{ (1-\Delta t )\,e_i(k) + \Delta t \sum_{s\in O}w_{is}\, f'(x_s(k))\,e_s(k) - e_i(k+1) \right\} \end{aligned}$$
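Pulling everything together, here is a minimal online BPDC step in NumPy (a sketch, not the authors' reference implementation; `tanh` is used as the activation and all names are illustrative):

```python
import numpy as np

def bpdc_step(W, x_k, x_k1, d_k, d_k1, out_idx, dt, eta, eps):
    """One BPDC update of W, given consecutive states x(k), x(k+1) and targets d(k), d(k+1)."""
    f = np.tanh
    fprime = lambda x: 1.0 - np.tanh(x) ** 2

    # Masked errors e(k), e(k+1): nonzero only on the output units.
    e_k, e_k1 = np.zeros_like(x_k), np.zeros_like(x_k1)
    e_k[out_idx] = x_k[out_idx] - d_k
    e_k1[out_idx] = x_k1[out_idx] - d_k1

    fk = f(x_k)
    # gamma(k+1) = -eta * { [(1-dt)I + dt * W D(k)] e(k) - e(k+1) }
    gamma = -eta * ((1 - dt) * e_k + dt * W @ (fprime(x_k) * e_k) - e_k1)

    # Row i of Delta W is -(1/dt) * gamma_i(k+1) * f_k^T / (eps + f_k^T f_k).
    dW = -(1.0 / dt) * np.outer(gamma, fk) / (eps + fk @ fk)
    return W + dW
```

If only the rows $i \in O$ are actually updated while the rest of $W$ stays fixed (which, as I understand it, is how the O(N)-complexity claim in the references is obtained), the per-step cost of this rule is linear in $N$; the sketch updates all rows for simplicity.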
参考文献
- J.J. Steil, Backpropagation-decorrelation: online recurrent learning with O(N) complexity, in: Proceedings of the International Joint Conference on Neural Networks (IJCNN), vol. 1, 2004, pp. 843–848.
- J.J. Steil, Online stability of backpropagation-decorrelation recurrent learning, Neurocomputing 69 (2006) 642–650.
- J.J. Steil, Online reservoir adaptation by intrinsic plasticity for backpropagation-decorrelation and echo state learning, Neural Networks 20 (3) (2007) 353–364.
I'm fed up with this German guy: the derivation of $\gamma$ is wrong in all three of his papers, and in the end I only found the correct formula in the doctoral thesis below.