An RNN Training Algorithm: BPDC (Backpropagation-Decorrelation)

This article details how the Atiya-Parlos recurrent learning algorithm (APRL) and its simplified variant, BPDC, work and how they are computed. The derivations show how the training of recurrent neural networks (RNNs) is optimized, in particular how the weight-update rules are obtained.


Problem Setup

Consider the following recurrent network model:
$$x(k+1) = (1-\Delta t)\,x(k) + \Delta t\, W f[x(k)] \tag{1}$$
where $x(k) \in R^N$ is the state of the network nodes before activation, $W \in R^{N\times N}$ holds the connection weights between the nodes, and the output nodes are $\{x_i(k) \mid i \in O\}$, with $O$ the index set of all output (or "observed") units.
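As a concrete reference, here is a minimal numpy sketch of one step of Eq. (1); the network size `N`, the step `dt`, and the `tanh` nonlinearity are illustrative assumptions, not from the source.

```python
import numpy as np

def step(x, W, dt, f=np.tanh):
    """One step of Eq. (1): x(k+1) = (1 - dt) * x(k) + dt * W @ f(x(k))."""
    return (1.0 - dt) * x + dt * W @ f(x)

rng = np.random.default_rng(0)
N, dt = 10, 0.5                          # illustrative values
W = rng.normal(scale=0.3, size=(N, N))   # recurrent weights
x = rng.normal(size=N)                   # pre-activation state x(k)
x_next = step(x, W, dt)                  # x(k+1)
```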

The goal of training is to reduce the error between the observed states and their target values, i.e., to minimize the loss function:
$$E = \frac{1}{2}\sum_{k=1}^K \sum_{i\in O} \big[x_i(k) - d_i(k)\big]^2 \tag{2}$$
where $d_i(k)$ denotes the target value of node $i$ at time $k$.
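The loss of Eq. (2) in code, again as a hedged sketch (the trajectory arrays `xs`, `ds` and the output index set `O_idx` are assumed placeholders):

```python
import numpy as np

def loss(xs, ds, O_idx):
    """Eq. (2): E = 1/2 * sum_k sum_{i in O} (x_i(k) - d_i(k))^2.

    xs: (K, N) array of states x(k); ds: (K, N) array of targets d(k).
    """
    diff = xs[:, O_idx] - ds[:, O_idx]
    return 0.5 * float(np.sum(diff ** 2))
```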

Notation

$$W \equiv \begin{bmatrix} \text{-----}\; w_1^T \;\text{-----} \\ \vdots \\ \text{-----}\; w_N^T \;\text{-----} \end{bmatrix}_{N\times N}$$
Flatten the matrix $W$ into a column vector, denoted $w$:
$$w = [w_1^T, \cdots, w_N^T]^T \in R^{N^2}$$
Stack the states over all time steps into a column vector, denoted $x$:
$$x = [x^T(1), \cdots, x^T(K)]^T \in R^{NK}$$
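In numpy terms, these stacking conventions amount to the following (a sketch with illustrative shapes):

```python
import numpy as np

N, K = 3, 5
rng = np.random.default_rng(0)
W = rng.normal(size=(N, N))
states = [rng.normal(size=N) for _ in range(K)]   # x(1), ..., x(K)

w = W.reshape(-1)             # row-major flattening: [w_1^T, ..., w_N^T]^T
x = np.concatenate(states)    # [x^T(1), ..., x^T(K)]^T
assert w.shape == (N * N,) and x.shape == (N * K,)
```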
Viewing RNN training as a constrained optimization problem, Eq. (1) becomes the constraint conditions $g(k+1) = 0$, where
$$g(k+1) \equiv -x(k+1) + (1-\Delta t)\,x(k) + \Delta t\, W f[x(k)], \quad k = 0, \ldots, K-1 \tag{3}$$

$$g = [g^T(1), \ldots, g^T(K)]^T \in R^{NK}$$


Review of the Atiya-Parlos Algorithm

The setup above follows the classical gradient-descent line of thought, but Atiya and Parlos proposed a different optimization idea: do not update along the gradient direction of the parameters, yet still make the cost function decrease.

The idea of the algorithm is to interchange the roles of the network states $x(k)$ and the weight matrix $W$: treat the states as the control variables, and determine the weight change from the change in $x(k)$. In other words, we compute the gradient of $E$ with respect to the states $x(k)$ and assume the states undergo a small change along the negative gradient, $\Delta x_i(k) = -\eta\,\dfrac{\partial E}{\partial x_i(k)}$.

Next, we **determine the weight change $\Delta w$ so that the state change induced by the weight change is as close as possible to the target change $\Delta x$**.

The details of the algorithm are as follows:
$$\begin{aligned} \Delta x &= -\eta \left(\frac{\partial E}{\partial x} \right)^T = -\eta\, e = -\eta\, [e^T(1), \ldots, e^T(K)]^T, \\ e_i(k) &= \begin{cases} x_i(k) - d_i(k), & \text{if } i\in O, \\ 0, & \text{otherwise,} \end{cases} \qquad k = 1,\ldots,K. \end{aligned}$$
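A sketch of the masked error and the virtual target $\Delta x$; the learning rate `eta`, the index set `O_idx`, and the toy data are placeholders:

```python
import numpy as np

def masked_error(xs, ds, O_idx):
    """e_i(k) = x_i(k) - d_i(k) for i in O, and 0 otherwise."""
    e = np.zeros_like(xs)                      # xs, ds: (K, N)
    e[:, O_idx] = xs[:, O_idx] - ds[:, O_idx]
    return e

eta = 0.05
rng = np.random.default_rng(0)
xs, ds = rng.normal(size=(4, 3)), rng.normal(size=(4, 3))
delta_x = -eta * masked_error(xs, ds, O_idx=[0]).reshape(-1)   # -η e, length N*K
```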

Since the constraint $g = 0$ must continue to hold after the update, to first order:
$$\frac{\partial g}{\partial x}\, \Delta x = - \frac{\partial g}{\partial w}\, \Delta w$$
Hence, once $\Delta x$ is known, solving in the least-squares sense gives:
$$\Delta w = -\left[\left(\frac{\partial g}{\partial w}\right)^T \frac{\partial g}{\partial w}\right]^{-1} \left(\frac{\partial g}{\partial w}\right)^T \frac{\partial g}{\partial x}\, \Delta x$$
Note that this inverse need not exist, so a small regularizer $\epsilon I$ is added:
$$\Delta w = -\left[\left(\frac{\partial g}{\partial w}\right)^T \frac{\partial g}{\partial w} + \epsilon I \right]^{-1} \left(\frac{\partial g}{\partial w}\right)^T \frac{\partial g}{\partial x}\, \Delta x$$
This is the update rule for the weights $W$.
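Generically, this regularized least-squares step looks as follows (a sketch; `J_w` stands for $\partial g/\partial w$ and `rhs` for $(\partial g/\partial x)\Delta x$, with toy shapes):

```python
import numpy as np

def ridge_step(J_w, rhs, eps=1e-2):
    """Delta w = -(J_w^T J_w + eps*I)^{-1} J_w^T rhs, via a linear solve."""
    A = J_w.T @ J_w + eps * np.eye(J_w.shape[1])
    return -np.linalg.solve(A, J_w.T @ rhs)

rng = np.random.default_rng(0)
dw = ridge_step(rng.normal(size=(20, 9)), rng.normal(size=20))  # toy shapes
```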

Computational Details

First, compute $\dfrac{\partial g}{\partial w}$:
$$\frac{\partial g}{\partial w} = \begin{bmatrix} \frac{\partial g(1)}{\partial w}\\ \vdots \\ \frac{\partial g(K)}{\partial w} \end{bmatrix} = \Delta t \begin{bmatrix} \frac{\partial\, Wf[x(0)]}{\partial w}\\ \vdots \\ \frac{\partial\, Wf[x(K-1)]}{\partial w} \end{bmatrix}$$
where, writing $f_k \equiv [f(x_1(k)), \ldots, f(x_N(k))]^T$,

$$\frac{\partial\, Wf[x(k)]}{\partial w} = \begin{bmatrix} \frac{\partial\, w_1^T f[x(k)]}{\partial w}\\ \vdots \\ \frac{\partial\, w_N^T f[x(k)]}{\partial w} \end{bmatrix} = \begin{bmatrix} f_k^T &&& \\ & f_k^T && \\ && \ddots & \\ &&& f_k^T \end{bmatrix}_{N\times N^2} \triangleq F(k)$$

$$\frac{\partial g}{\partial w} = \Delta t \begin{bmatrix} F(0)\\ \vdots \\ F(K-1) \end{bmatrix}_{NK \times N^2}$$
$$\begin{aligned} \frac{1}{\Delta t^2}\left(\frac{\partial g}{\partial w}\right)^T \frac{\partial g}{\partial w} &= \begin{bmatrix} F^T(0) & \cdots & F^T(K-1) \end{bmatrix} \begin{bmatrix} F(0)\\ \vdots \\ F(K-1) \end{bmatrix} = \sum_{k=0}^{K-1} F^T(k)\,F(k) \\ &= \begin{bmatrix} \sum_{k=0}^{K-1} f_k f_k^T &&& \\ & \sum_{k=0}^{K-1} f_k f_k^T && \\ && \ddots & \\ &&& \sum_{k=0}^{K-1} f_k f_k^T \end{bmatrix}_{N^2 \times N^2} \triangleq \mathrm{diag}\{C_{K-1}\} \end{aligned}$$


$$\gamma = \begin{bmatrix} \gamma(1)\\ \gamma(2) \\ \vdots \\ \gamma(K) \end{bmatrix}_{NK} = \frac{\partial g}{\partial x}\, \Delta x$$
Here $\gamma$ carries the error information supplied by $\Delta x$; its computation is deferred to the end of this article, so for now assume it has already been obtained.


$$\begin{aligned} \left(\frac{\partial g}{\partial w}\right)^T \frac{\partial g}{\partial x}\, \Delta x &= \Delta t \begin{bmatrix} F^T(0) & \cdots & F^T(K-1) \end{bmatrix}_{N^2 \times NK} \begin{bmatrix} \gamma(1)\\ \gamma(2) \\ \vdots \\ \gamma(K) \end{bmatrix}_{NK} = \Delta t \sum_{k=1}^K F^T(k-1)\,\gamma(k) \\ &= \Delta t \sum_{k=1}^K \begin{bmatrix} f_{k-1} &&& \\ & f_{k-1} && \\ && \ddots & \\ &&& f_{k-1} \end{bmatrix}_{N^2 \times N} \begin{bmatrix} \gamma_1(k)\\ \gamma_2(k) \\ \vdots \\ \gamma_N(k) \end{bmatrix}_{N} = \Delta t \begin{bmatrix} \sum_{k=1}^K f_{k-1}\, \gamma_1(k)\\ \sum_{k=1}^K f_{k-1}\, \gamma_2(k) \\ \vdots \\ \sum_{k=1}^K f_{k-1}\, \gamma_N(k) \end{bmatrix}_{N^2} \end{aligned}$$

Therefore,
$$\begin{aligned} \Delta w &= -\left[\left(\frac{\partial g}{\partial w}\right)^T \frac{\partial g}{\partial w} + \epsilon I\right]^{-1} \left(\frac{\partial g}{\partial w}\right)^T \frac{\partial g}{\partial x}\, \Delta x = - \frac{1}{\Delta t} \begin{bmatrix} C_{K-1}^{-1} \sum_{k=1}^K f_{k-1}\, \gamma_1(k)\\ \vdots \\ C_{K-1}^{-1} \sum_{k=1}^K f_{k-1}\, \gamma_N(k) \end{bmatrix}_{N^2} \\ \Delta W &= - \frac{1}{\Delta t} \begin{bmatrix} \sum_{k=1}^K f_{k-1}^T\, C_{K-1}^{-1}\, \gamma_1(k)\\ \vdots \\ \sum_{k=1}^K f_{k-1}^T\, C_{K-1}^{-1}\, \gamma_N(k) \end{bmatrix}_{N\times N} = - \frac{1}{\Delta t} \sum_{k=1}^K \begin{bmatrix} f_{k-1}^T\, \gamma_1(k)\\ \vdots \\ f_{k-1}^T\, \gamma_N(k) \end{bmatrix}_{N\times N} C_{K-1}^{-1} \end{aligned}$$

where
$$C_{K-1} = \epsilon I + \sum_{r=0}^{K-1} f_r f_r^T$$
**Note: the $\Delta W$ above is a batch update over the whole time span $1, 2, \ldots, K$; call it $\Delta W^{batch}$.**
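In code, the batch update can be written compactly, since the stacked rows $f_{k-1}^T\gamma_i(k)$ form the outer product $\gamma(k)\, f_{k-1}^T$ (a sketch; `fs` and `gammas` are assumed arrays of activations and error signals):

```python
import numpy as np

def delta_W_batch(fs, gammas, dt, eps=1e-2):
    """Batch APRL update: ΔW = -(1/Δt) (Σ_k γ(k) f_{k-1}^T) C_{K-1}^{-1}.

    fs:     (K, N), row j is f_j for j = 0..K-1
    gammas: (K, N), row j is γ(j+1) for j = 0..K-1
    """
    C = eps * np.eye(fs.shape[1]) + fs.T @ fs       # C_{K-1} = εI + Σ_r f_r f_r^T
    S = gammas.T @ fs                               # Σ_k γ(k) f_{k-1}^T, shape (N, N)
    return -(1.0 / dt) * np.linalg.solve(C, S.T).T  # S @ C^{-1}, C symmetric
```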

Next, we decompose the update formula into an online-updating form:
$$\Delta W^{batch}(K) = \Delta W(1) + \cdots + \Delta W(K)$$

where each term on the right-hand side is the update contributed at a single time step.

The update to the input weights of the $i$-th neuron at time step $K$ is:
$$\begin{aligned} \Delta w^T_{i}(K) &= - \frac{1}{\Delta t} \sum_{k=1}^{K} f_{k-1}^T C_{K-1}^{-1}\, \gamma_i(k) + \frac{1}{\Delta t} \sum_{k=1}^{K-1} f_{k-1}^T C_{K-2}^{-1}\, \gamma_i(k)\\ &= - \frac{1}{\Delta t}\, f_{K-1}^T C_{K-1}^{-1}\, \gamma_i(K) - \frac{1}{\Delta t} \sum_{k=1}^{K-1} f_{k-1}^T \left(C_{K-1}^{-1} - C_{K-2}^{-1}\right) \gamma_i(k) \\ &= - \frac{1}{\Delta t}\, f_{K-1}^T C_{K-1}^{-1}\,\gamma_i(K) - \frac{1}{\Delta t} \sum_{k=1}^{K-1} f_{k-1}^T C_{K-2}^{-1}\,\gamma_i(k)\left(C_{K-2}C_{K-1}^{-1} - I\right) \\ &= - \frac{1}{\Delta t}\, f_{K-1}^T C_{K-1}^{-1}\, \gamma_i(K) + \Delta w_i^{batch,T}(K-1)\left(C_{K-2}C_{K-1}^{-1} - I\right) \\ &= - \frac{1}{\Delta t}\, f_{K-1}^T C_{K-1}^{-1}\, \gamma_i(K) + \sum_{k=1}^{K-1} \Delta w^T_i(k) \left(C_{K-2}C_{K-1}^{-1} - I\right) \end{aligned}$$
It can be seen that the APRL update rule consists of an error term at the current time step plus the accumulated past updates of $w$ (a momentum-like term).

As $K \to \infty$, clearly $\sum_{k=1}^{K-1} \Delta w^T_i(k) \to \text{const}$ and $C_{K-2}C_{K-1}^{-1} \to I$, so the second term tends to zero.

The BPDC Update Rule

BPDC applies a crude but simple approximation to the online APRL algorithm.

The approximation does not try to accumulate the full correlation matrix $C_k$, discards the accumulation of past errors, and computes only the instantaneous correlation $C(k)$:
$$\begin{aligned} \Delta w^T_{i}(k+1) &= - \frac{1}{\Delta t}\, f_{k}^T\, C(k)^{-1}\, \gamma_i(k+1) \\ C(k) &= \epsilon I + f_k f_k^T \end{aligned}$$
Applying the matrix inversion lemma (Sherman-Morrison):
$$C(k)^{-1} = \left(\epsilon I + f_k f_k^T\right)^{-1} = \frac{1}{\epsilon} I - \frac{1}{\epsilon}\, \frac{f_k f_k^T}{\epsilon + f_k^T f_k}$$
Hence,
$$\begin{aligned} \Delta w^T_{i}(k+1) &= - \frac{1}{\Delta t}\, f_{k}^T \left( \frac{1}{\epsilon} I - \frac{1}{\epsilon}\, \frac{f_k f_k^T}{\epsilon + f_k^T f_k}\right) \gamma_i(k+1) \\ &= - \frac{1}{\Delta t}\, \frac{f_k^T}{\epsilon + f_k^T f_k}\, \gamma_i(k+1) \end{aligned}$$
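In code, the full inverse never needs to be formed. A sketch of the update (with arbitrary illustrative values) that also checks the rank-one simplification against an explicit inverse:

```python
import numpy as np

def bpdc_delta_W(f_k, gamma_next, dt, eps=1e-2):
    """Rows of ΔW: Δw_i^T(k+1) = -(1/Δt) γ_i(k+1) f_k^T / (ε + f_k^T f_k)."""
    return -np.outer(gamma_next, f_k) / (dt * (eps + f_k @ f_k))

rng = np.random.default_rng(0)
f_k, gam = rng.normal(size=4), rng.normal(size=4)
dt, eps = 0.5, 1e-2
C_inv = np.linalg.inv(eps * np.eye(4) + np.outer(f_k, f_k))  # explicit C(k)^{-1}
slow = -np.outer(gam, f_k @ C_inv) / dt                      # -(1/Δt) γ f^T C^{-1}
assert np.allclose(bpdc_delta_W(f_k, gam, dt, eps), slow)
```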

Computing $\gamma$

$$\gamma = \begin{bmatrix} \gamma(1)\\ \gamma(2) \\ \vdots \\ \gamma(K) \end{bmatrix}_{NK} = \frac{\partial g}{\partial x}\, \Delta x = -\eta\, \frac{\partial g}{\partial x}\, [e^T(1), \ldots, e^T(K)]^T$$
The key lies in computing $\dfrac{\partial g}{\partial x}$:
$$\begin{aligned} \frac{\partial g}{\partial x} &= \begin{bmatrix} \frac{\partial g(1)}{\partial x(1)} & \ldots & \frac{\partial g(1)}{\partial x(K)}\\ \vdots & \ddots & \vdots\\ \frac{\partial g(K)}{\partial x(1)} & \ldots & \frac{\partial g(K)}{\partial x(K)} \end{bmatrix} \\ &= \begin{bmatrix} -I & 0 & 0 & \ldots & 0\\ (1-\Delta t)I + \Delta t\, W D(1) & -I & 0 & \ldots & 0 \\ 0 & (1-\Delta t)I + \Delta t\, W D(2) & -I & \ldots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & (1-\Delta t)I + \Delta t\, W D(K-1) & -I \end{bmatrix} \end{aligned}$$
where

$$D(k) = \begin{bmatrix} f'(x_1(k)) & \cdots & 0\\ \vdots & \ddots & \vdots\\ 0 & \cdots & f'(x_N(k)) \end{bmatrix}_{N \times N}$$
Therefore,
$$\gamma = -\eta\, \frac{\partial g}{\partial x} \begin{bmatrix} e(1) \\ e(2) \\ \vdots \\ e(K) \end{bmatrix} = -\eta \begin{bmatrix} -e(1) \\ [(1-\Delta t)I + \Delta t\, W D(1)]\, e(1) - e(2) \\ [(1-\Delta t)I + \Delta t\, W D(2)]\, e(2) - e(3) \\ \vdots \\ [(1-\Delta t)I + \Delta t\, W D(K-1)]\, e(K-1) - e(K) \end{bmatrix}_{NK \times 1}$$
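One block of this vector, $\gamma(k+1)$, in code (a sketch; tanh and its derivative are assumed for $f$, and `e_k`, `e_next` are the masked errors $e(k)$, $e(k+1)$):

```python
import numpy as np

def gamma_next(W, x_k, e_k, e_next, dt, eta):
    """γ(k+1) = -η( [(1-Δt)I + Δt·W·D(k)] e(k) - e(k+1) ), D(k) = diag f'(x(k))."""
    fprime = 1.0 - np.tanh(x_k) ** 2          # assumed f = tanh
    return -eta * ((1 - dt) * e_k + dt * W @ (fprime * e_k) - e_next)
```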
Substituting into the BPDC update rule:
$$\begin{aligned} \Delta w^T_{i}(k+1) &= - \frac{1}{\Delta t}\, \frac{f_k^T}{\epsilon + f_k^T f_k}\, \gamma_i(k+1) \\ &= \frac{\eta}{\Delta t}\, \frac{f_k^T}{\epsilon + f_k^T f_k} \left\{ (1-\Delta t)\, e_i(k) + \Delta t \sum_{s\in O} w_{is}\, f'(x_s(k))\, e_s(k) - e_i(k+1) \right\} \end{aligned}$$
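Putting the pieces together, one complete online BPDC step might look like the following; the teacher-signal handling and the tanh network are assumptions layered on top of the source's equations:

```python
import numpy as np

def bpdc_train_step(W, x, d_k, d_next, O_idx, dt=0.5, eta=0.05, eps=1e-2):
    """One online BPDC step: propagate Eq. (1), form γ(k+1), update W in O(N)."""
    f, fprime = np.tanh(x), 1.0 - np.tanh(x) ** 2
    x_next = (1.0 - dt) * x + dt * W @ f                  # Eq. (1)

    e_k, e_next = np.zeros_like(x), np.zeros_like(x)      # masked errors
    e_k[O_idx] = x[O_idx] - d_k
    e_next[O_idx] = x_next[O_idx] - d_next

    gamma = -eta * ((1 - dt) * e_k + dt * W @ (fprime * e_k) - e_next)
    W = W - np.outer(gamma, f) / (dt * (eps + f @ f))     # ΔW from the rule above
    return W, x_next
```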

References

  • J.J. Steil, Backpropagation-decorrelation: online recurrent learning with O(N) complexity, in: Proceedings of the International Joint Conference on Neural Networks (IJCNN), vol. 1, 2004, pp. 843–848.
  • J.J. Steil, Online stability of backpropagation-decorrelation recurrent learning, Neurocomputing 69 (2006) 642–650.
  • J.J. Steil, Online reservoir adaptation by intrinsic plasticity for backpropagation-decorrelation and echo state learning, Neural Networks 20 (3) (2007) 353–364.


A closing gripe: the derivation of $\gamma$ is wrong in all three of this German author's papers; I finally found the correct formula in another researcher's dissertation.
