Watermelon Book Notes: Neural Networks



Neural Networks

Lei_ZM
2019-10-07



1. The Perceptron

1.1. Definition

Suppose the input space is $\mathcal{X}\subseteq \mathbb{R}^{n}$ and the output space is $\mathcal{Y}=\{1, 0\}$. The input $\boldsymbol{x}\in\mathcal{X}$ is the feature vector of an instance, corresponding to a point in the input space; the output $y\in\mathcal{Y}$ is the class of the instance. The following function from the input space to the output space:

$$
f(\boldsymbol{x})=\operatorname{sgn}\left(\boldsymbol{w}^{T} \boldsymbol{x}+b\right)
$$

is called the perceptron, where $\boldsymbol{w}$ and $b$ are the perceptron model parameters and $\operatorname{sgn}$ is the step function:

$$
\operatorname{sgn}(z)=\left\{ \begin{array}{ll} {1,} & {z \geqslant 0} \\ {0,} & {z<0}\end{array} \right.
$$
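As a minimal NumPy sketch of the model (the AND-gate parameters below are illustrative, not from the book):

```python
import numpy as np

def sgn(z):
    """Step function: 1 if z >= 0, else 0."""
    return np.where(z >= 0, 1, 0)

def perceptron(x, w, b):
    """Perceptron model f(x) = sgn(w^T x + b)."""
    return sgn(w @ x + b)

# Illustrative parameters: with w = [1, 1], b = -1.5 the perceptron computes logical AND.
w, b = np.array([1.0, 1.0]), -1.5
```

For example, `perceptron(np.array([1.0, 1.0]), w, b)` fires (output 1) while `perceptron(np.array([0.0, 1.0]), w, b)` does not.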



1.2. Geometric Interpretation of the Perceptron

The linear equation $\boldsymbol{w}^{T} \boldsymbol{x}+b=0$ corresponds to a hyperplane $S$ in the feature space (input space) $\mathbb{R}^{n}$, where $\boldsymbol{w}$ is the normal vector of the hyperplane and $b$ is its intercept. The hyperplane divides the feature space into two parts; the points (feature vectors) on either side are classified as positive and negative, respectively. For this reason, $S$ is called the separating hyperplane, as shown below:

(Figure: the separating hyperplane $S$; image unavailable)



1.3. Learning Strategy

Assume the training data are linearly separable. The goal of perceptron learning is to find a hyperplane that separates all positive and negative training instances correctly. To find such a hyperplane $S$, i.e., to determine the model parameters $\boldsymbol{w}$ and $b$, we need a learning strategy: define a loss function and minimize it. A natural choice of loss function is the total number of misclassified points, but that loss is not a continuous, differentiable function of $\boldsymbol{w}$ and $b$ and is hard to optimize. The perceptron therefore uses the total distance from the misclassified points to the hyperplane as its loss.

The distance from a point $\boldsymbol{x}_{0}$ in the input space $\mathbb{R}^{n}$ to the hyperplane $S$ is:

$$
\frac{\left|\boldsymbol{w}^{T} \boldsymbol{x}_{0}+b\right|}{\|\boldsymbol{w}\|}
$$

where $\|\boldsymbol{w}\|$ is the $L_{2}$ norm of $\boldsymbol{w}$, i.e., its length.

If $b$ is treated as the weight of a dummy node, i.e., absorbed into $\boldsymbol{w}$, the distance becomes:

$$
\frac{\left|\hat{\boldsymbol{w}}^{T} \hat{\boldsymbol{x}}_{0}\right|}{\|\hat{\boldsymbol{w}}\|}
$$
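A quick numeric sketch of the augmented ("dummy node") rewrite, with made-up numbers. Note that after augmentation the denominator is $\|\hat{\boldsymbol{w}}\|$ rather than $\|\boldsymbol{w}\|$, so the value changes by a constant scale; this is harmless here, since the derivation later discards the norm factor anyway.

```python
import numpy as np

# Hyperplane w^T x + b = 0 with w = [3, 4], b = -5, and a point x0 = [4, 3].
# All numbers are made up for illustration.
w = np.array([3.0, 4.0])
b = -5.0
x0 = np.array([4.0, 3.0])

# Plain form: |w^T x0 + b| / ||w||
d_plain = abs(w @ x0 + b) / np.linalg.norm(w)   # |19| / 5 = 3.8

# Augmented form: fold b into w_hat, append a constant 1 to x_hat.
w_hat = np.append(w, b)      # [3, 4, -5]
x_hat = np.append(x0, 1.0)   # [4, 3, 1]
d_aug = abs(w_hat @ x_hat) / np.linalg.norm(w_hat)
```

The numerators agree exactly: $\hat{\boldsymbol{w}}^{T}\hat{\boldsymbol{x}}_{0} = \boldsymbol{w}^{T}\boldsymbol{x}_{0}+b$.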

Let $M$ be the set of misclassified points. The total distance from all misclassified points to the hyperplane $S$ is:

$$
\sum_{\hat{\boldsymbol{x}}_{i}\in M} \frac{\left|\hat{\boldsymbol{w}}^{T} \hat{\boldsymbol{x}}_{i}\right|}{\|\hat{\boldsymbol{w}}\|}
$$

Moreover, for any misclassified point $\hat{\boldsymbol{x}}_{i}\in M$ we have:

$$
\left(\hat{y}_{i}-y_{i}\right) \hat{\boldsymbol{w}}^{T} \hat{\boldsymbol{x}}_{i} > 0
$$

where $\hat{y}_{i}$ is the current perceptron output. Hence the total distance from all misclassified points to $S$ can be written as:

$$
\sum_{\hat{\boldsymbol{x}}_{i} \in M} \frac{\left(\hat{y}_{i}-y_{i}\right) \hat{\boldsymbol{w}}^{T} \hat{\boldsymbol{x}}_{i}}{\|\hat{\boldsymbol{w}}\|}
$$

Once training succeeds there are no misclassified points, so the loss is $0$ regardless of the denominator $\|\hat{\boldsymbol{w}}\|$; the factor $\frac{1}{\|\hat{\boldsymbol{w}}\|}$ can therefore be dropped. This yields the perceptron loss function:

$$
L\left(\hat{\boldsymbol{w}}\right)= \sum_{\hat{\boldsymbol{x}}_{i} \in M} \left(\hat{y}_{i}-y_{i}\right) \hat{\boldsymbol{w}}^{T} \hat{\boldsymbol{x}}_{i}
$$

Clearly $L(\hat{\boldsymbol{w}})$ is non-negative: it is $0$ when there are no misclassified points, and it decreases as the misclassified points become fewer and lie closer to the hyperplane. It is a linear function of the parameters $\hat{\boldsymbol{w}}$ on misclassified points and $0$ on correctly classified ones, so for a given training set $L(\hat{\boldsymbol{w}})$ is a continuous, differentiable function of $\hat{\boldsymbol{w}}$.
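This loss is easy to compute directly; a sketch with hypothetical toy data (note that correctly classified points contribute $\hat{y}_{i}-y_{i}=0$, so summing over all points is equivalent to summing over $M$):

```python
import numpy as np

def sgn(z):
    """Step function used by the perceptron."""
    return np.where(z >= 0, 1, 0)

def perceptron_loss(w_hat, X_hat, y):
    """L(w_hat) = sum over misclassified points of (y_hat_i - y_i) * w_hat^T x_hat_i.
    Correctly classified points contribute 0, so we may sum over all points."""
    z = X_hat @ w_hat
    y_hat = sgn(z)
    return float(np.sum((y_hat - y) * z))

# Hypothetical toy data: augmented inputs [x1, x2, 1] and weights [1, 1, -1.5].
X_hat = np.array([[0.0, 0.0, 1.0],
                  [1.0, 1.0, 1.0]])
w_hat = np.array([1.0, 1.0, -1.5])
loss_val = perceptron_loss(w_hat, X_hat, np.array([0, 0]))  # second point misclassified
```

With labels `[0, 0]` only the second point is misclassified and the loss is its (positive) margin term; with labels `[0, 1]` both points are correct and the loss is 0.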



1.4. Algorithm

The perceptron learning algorithm solves the following optimization problem. Given a training set:

$$
T=\left\{\left(\hat{\boldsymbol{x}}_{1}, y_{1}\right),\left(\hat{\boldsymbol{x}}_{2}, y_{2}\right), \cdots,\left(\hat{\boldsymbol{x}}_{N}, y_{N}\right)\right\}
$$

where $\hat{\boldsymbol{x}}_{i}\in \mathbb{R}^{n+1}$ and $y_{i}\in \{0, 1\}$, find the parameters $\hat{\boldsymbol{w}}$ that minimize the loss function:

$$
L\left(\hat{\boldsymbol{w}}\right)= \sum_{\hat{\boldsymbol{x}}_{i} \in M} \left(\hat{y}_{i}-y_{i}\right) \hat{\boldsymbol{w}}^{T} \hat{\boldsymbol{x}}_{i}
$$

where $M$ is the set of misclassified points.

Perceptron learning is driven by misclassification and uses stochastic gradient descent. Starting from an arbitrary hyperplane $\hat{\boldsymbol{w}}_{0}^{T} \hat{\boldsymbol{x}}=0$, the loss $L(\hat{\boldsymbol{w}})$ is minimized by gradient descent; rather than descending along the gradient over all misclassified points in $M$ at once, one misclassified point is chosen at random at each step. The gradient of the loss is:

$$
\begin{aligned} \nabla L(\hat{\boldsymbol{w}}) =\frac{\partial L(\hat{\boldsymbol{w}})}{\partial \hat{\boldsymbol{w}}} &=\frac{\partial}{\partial \hat{\boldsymbol{w}}}\left[\sum_{\hat{\boldsymbol{x}}_{i} \in M}\left(\hat{y}_{i}-y_{i}\right) \hat{\boldsymbol{w}}^{T} \hat{\boldsymbol{x}}_{i}\right] \\ &=\sum_{\hat{\boldsymbol{x}}_{i} \in M}\left[\left(\hat{y}_{i}-y_{i}\right) \frac{\partial}{\partial \hat{\boldsymbol{w}}}\left(\hat{\boldsymbol{w}}^{T} \hat{\boldsymbol{x}}_{i} \right)\right] \\ &\qquad \text{matrix differentiation: } \frac{\partial \boldsymbol{x}^{T} \boldsymbol{a}}{\partial \boldsymbol{x}}=\boldsymbol{a} \\ &=\sum_{\hat{\boldsymbol{x}}_{i} \in M}\left(\hat{y}_{i}-y_{i}\right) \hat{\boldsymbol{x}}_{i} \end{aligned}
$$

Picking one misclassified point $\hat{\boldsymbol{x}}_{i}$ at random and descending along its gradient gives the update rule for $\hat{\boldsymbol{w}}$:

$$
\begin{aligned} \hat{\boldsymbol{w}} &\leftarrow \hat{\boldsymbol{w}}+\Delta\hat{\boldsymbol{w}} \\ &\qquad \Delta\hat{\boldsymbol{w}}=-\eta \nabla L(\hat{\boldsymbol{w}}) \\ &\leftarrow \hat{\boldsymbol{w}}-\eta \nabla L(\hat{\boldsymbol{w}}) \\ &\qquad \text{for a single misclassified point: } \nabla L(\hat{\boldsymbol{w}}) = \left(\hat{y}_{i}-y_{i}\right) \hat{\boldsymbol{x}}_{i} \\ &\leftarrow \hat{\boldsymbol{w}}-\eta\left(\hat{y}_{i}-y_{i}\right) \hat{\boldsymbol{x}}_{i}=\hat{\boldsymbol{w}}+\eta\left(y_{i}-\hat{y}_{i}\right) \hat{\boldsymbol{x}}_{i} \end{aligned}
$$

That is:

$$
\hat{\boldsymbol{w}}\leftarrow \hat{\boldsymbol{w}}+\Delta\hat{\boldsymbol{w}}=\hat{\boldsymbol{w}}+\eta\left(y_{i}-\hat{y}_{i}\right)\hat{\boldsymbol{x}}_{i}
\quad\Longrightarrow\quad
\Delta\hat{\boldsymbol{w}} = \eta\left(y_{i}-\hat{y}_{i}\right)\hat{\boldsymbol{x}}_{i} \tag{5.2}
$$
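The update rule can be turned into a small training loop. A sketch in NumPy, assuming a tiny hand-made dataset (logical AND, which is linearly separable):

```python
import numpy as np

def train_perceptron(X, y, eta=1.0, epochs=100):
    """Perceptron SGD on augmented inputs: w_hat += eta * (y_i - y_hat_i) * x_hat_i."""
    X_hat = np.hstack([X, np.ones((X.shape[0], 1))])  # append dummy input 1 for b
    w_hat = np.zeros(X_hat.shape[1])
    for _ in range(epochs):
        errors = 0
        for x_i, y_i in zip(X_hat, y):
            y_pred = 1 if x_i @ w_hat >= 0 else 0
            if y_pred != y_i:
                w_hat += eta * (y_i - y_pred) * x_i   # Eq. (5.2)
                errors += 1
        if errors == 0:          # all points classified correctly
            break
    return w_hat

# Learn logical AND.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 0, 0, 1])
w_hat = train_perceptron(X, y)
preds = (np.hstack([X, np.ones((4, 1))]) @ w_hat >= 0).astype(int)
```

Since the data are linearly separable, the perceptron convergence theorem guarantees the loop terminates with zero errors.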




2. Neural Networks

2.1. Model Structure

The structure of a single-hidden-layer feedforward network is as follows:

(Figure: single-hidden-layer feedforward network; image unavailable)

where:

  • $D=\{(\boldsymbol{x}_{1}, \boldsymbol{y}_{1}), (\boldsymbol{x}_{2}, \boldsymbol{y}_{2}), \cdots, (\boldsymbol{x}_{m}, \boldsymbol{y}_{m})\}$, $\boldsymbol{x}_{i}\in \mathbb{R}^{d}$, $\boldsymbol{y}_{i}\in \mathbb{R}^{l}$: the training set

  • $d$: number of input neurons, i.e., the number of attributes describing an input instance

  • $l$: number of output neurons, i.e., the dimension of the output vector

  • $q$: number of hidden neurons

  • $\theta_{j}$: threshold of the $j$-th output neuron

  • $\gamma_{h}$: threshold of the $h$-th hidden neuron

  • $v_{ih}$: connection weight between the $i$-th input neuron and the $h$-th hidden neuron

  • $w_{hj}$: connection weight between the $h$-th hidden neuron and the $j$-th output neuron

  • $\alpha_{h}=\sum_{i=1}^{d} v_{ih} x_{i}$: input received by the $h$-th hidden neuron

  • $\beta_{j}=\sum_{h=1}^{q} w_{hj} b_{h}$: input received by the $j$-th output neuron

  • $b_{h}$: output of the $h$-th hidden neuron
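The notation above amounts to a vectorized forward pass. A sketch (shapes are illustrative; `V`, `W`, `gamma`, `theta` are assumed parameter arrays):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, V, gamma, W, theta):
    """Forward pass in the notation above: V[i, h] = v_ih, W[h, j] = w_hj;
    gamma and theta are the hidden/output thresholds (subtracted from the input)."""
    alpha = x @ V                    # alpha_h = sum_i v_ih * x_i
    b = sigmoid(alpha - gamma)       # hidden outputs b_h
    beta = b @ W                     # beta_j = sum_h w_hj * b_h
    y_hat = sigmoid(beta - theta)    # outputs y_hat_j
    return b, y_hat

# Hypothetical shapes: d = 2 inputs, q = 3 hidden units, l = 2 outputs.
x = np.array([1.0, 2.0])
V, gamma = np.zeros((2, 3)), np.zeros(3)
W, theta = np.zeros((3, 2)), np.zeros(2)
b_h, y_hat = forward(x, V, gamma, W, theta)   # with all-zero parameters, sigmoid(0) = 0.5
```

With all parameters zero, every pre-activation is 0 and every activation is exactly 0.5, which makes a convenient sanity check.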



2.2. The Standard BP Algorithm

Given a training sample $(\boldsymbol{x}_{k}, \boldsymbol{y}_{k})$, suppose the network output is $\hat{\boldsymbol{y}}_{k}=(\hat{y}_{1}^{k}, \hat{y}_{2}^{k}, \cdots, \hat{y}_{l}^{k})$, where:

$$
\hat{y}_{j}^{k}=f \left(\beta_{j}-\theta_{j}\right) \tag{5.3}
$$

With $f$ the Sigmoid function, the mean squared error of the network on the training sample $(\boldsymbol{x}_{k}, \boldsymbol{y}_{k})$ is:

$$
E_{k}=\frac{1}{2} \sum_{j=1}^{l}\left(\hat{y}_{j}^{k}-y_{j}^{k}\right)^{2} \tag{5.4}
$$
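As a quick numeric check of Eq. (5.4), a small sketch (the output and target values are made up):

```python
import numpy as np

def squared_error(y_hat, y):
    """E_k = (1/2) * sum_j (y_hat_j^k - y_j^k)^2, Eq. (5.4)."""
    return 0.5 * float(np.sum((np.asarray(y_hat) - np.asarray(y)) ** 2))

# Hypothetical outputs vs. targets for one sample (l = 2):
e_k = squared_error([0.9, 0.2], [1.0, 0.0])  # 0.5 * (0.01 + 0.04) = 0.025
```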

Updating the model parameters by gradient descent gives:

$$
\begin{aligned} w_{h j} \leftarrow w_{h j}+\Delta w_{h j} &=w_{h j}-\eta \frac{\partial E_{k}}{\partial w_{h j}} \\ \theta_{j} \leftarrow \theta_{j}+\Delta \theta_{j} &=\theta_{j}-\eta \frac{\partial E_{k}}{\partial \theta_{j}} \\ v_{i h} \leftarrow v_{i h}+\Delta v_{i h} &=v_{i h}-\eta \frac{\partial E_{k}}{\partial v_{i h}} \\ \gamma_{h} \leftarrow \gamma_{h}+\Delta \gamma_{h} &=\gamma_{h}-\eta \frac{\partial E_{k}}{\partial \gamma_{h}} \end{aligned}
$$



2.2.1. Updating the parameter $w_{hj}$

The chain of dependence from $E_{k}$ to $w_{hj}$ is:

$$
E_{k}=\frac{1}{2} \sum_{j=1}^{l}\left(\hat{y}_{j}^{k}-y_{j}^{k}\right)^{2}
\;\longrightarrow\;
\hat{y}_{j}^{k}=f\left(\beta_{j}-\theta_{j}\right)
\;\longrightarrow\;
\beta_{j}=\sum_{h=1}^{q} w_{h j} b_{h}
$$

Therefore:

$$
\frac{\partial E_{k}}{\partial w_{h j}}=\frac{\partial E_{k}}{\partial \hat{y}_{j}^{k}} \cdot \frac{\partial \hat{y}_{j}^{k}}{\partial \beta_{j}} \cdot \frac{\partial \beta_{j}}{\partial w_{h j}}
$$

where:

$$
\begin{aligned}
\frac{\partial E_{k}}{\partial \hat{y}_{j}^{k}}
&=\frac{\partial\left[\frac{1}{2} \sum_{j=1}^{l}\left(\hat{y}_{j}^{k}-y_{j}^{k}\right)^{2}\right]}{\partial \hat{y}_{j}^{k}}
=\hat{y}_{j}^{k}-y_{j}^{k} \\
\frac{\partial \hat{y}_{j}^{k}}{\partial \beta_{j}}
&=f^{\prime}\left(\beta_{j}-\theta_{j}\right)
=f\left(\beta_{j}-\theta_{j}\right)\left[1-f\left(\beta_{j}-\theta_{j}\right)\right]
=\hat{y}_{j}^{k}\left(1-\hat{y}_{j}^{k}\right) \\
&\qquad \text{using } f^{\prime}(x)=f(x)(1-f(x)) \text{ and Eq. (5.3)} \\
\frac{\partial \beta_{j}}{\partial w_{h j}}
&=\frac{\partial\left(\sum_{h=1}^{q} w_{h j} b_{h}\right)}{\partial w_{h j}}
=b_{h}
\end{aligned}
$$

Define $g_{j}$ as:

$$
\begin{aligned} g_{j} &=-\frac{\partial E_{k}}{\partial \hat{y}_{j}^{k}} \cdot \frac{\partial \hat{y}_{j}^{k}}{\partial \beta_{j}} \\ &=-\left(\hat{y}_{j}^{k}-y_{j}^{k}\right) f^{\prime}\left(\beta_{j}-\theta_{j}\right) \\ &=\hat{y}_{j}^{k}\left(1-\hat{y}_{j}^{k}\right)\left(y_{j}^{k}-\hat{y}_{j}^{k}\right) \end{aligned} \tag{5.10}
$$

Hence:

$$
\begin{aligned} \Delta w_{h j} &=-\eta \frac{\partial E_{k}}{\partial w_{h j}} \\ &=-\eta \frac{\partial E_{k}}{\partial \hat{y}_{j}^{k}} \cdot \frac{\partial \hat{y}_{j}^{k}}{\partial \beta_{j}} \cdot \frac{\partial \beta_{j}}{\partial w_{h j}} \\ &=\eta g_{j} b_{h} \end{aligned} \tag{5.11}
$$
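Eqs. (5.10) and (5.11) can be sketched with hypothetical numbers (all values below are illustrative):

```python
import numpy as np

# Hypothetical values for one training sample (l = 2 outputs, q = 2 hidden units).
y_hat = np.array([0.8, 0.3])   # network outputs y_hat_j^k
y = np.array([1.0, 0.0])       # targets y_j^k
b_h = np.array([0.5, 0.9])     # hidden outputs b_h
eta = 0.1                      # learning rate

g = y_hat * (1 - y_hat) * (y - y_hat)   # Eq. (5.10)
delta_W = eta * np.outer(b_h, g)        # Eq. (5.11): delta_w_hj = eta * g_j * b_h
delta_theta = -eta * g                  # Eq. (5.12)
```

`np.outer(b_h, g)` produces the full $q \times l$ matrix of updates in one call, with entry $(h, j)$ equal to $b_{h} g_{j}$.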



2.2.2. Updating the parameter $\theta_{j}$

The chain of dependence from $E_{k}$ to $\theta_{j}$ is:

$$
E_{k}=\frac{1}{2} \sum_{j=1}^{l}\left(\hat{y}_{j}^{k}-y_{j}^{k}\right)^{2}
\;\longrightarrow\;
\hat{y}_{j}^{k}=f\left(\beta_{j}-\theta_{j}\right)
$$

Therefore:

$$
\frac{\partial E_{k}}{\partial \theta_{j}}=\frac{\partial E_{k}}{\partial \hat{y}_{j}^{k}} \cdot \frac{\partial \hat{y}_{j}^{k}}{\partial \theta_{j}}
$$

where:

$$
\begin{aligned}
\frac{\partial E_{k}}{\partial \hat{y}_{j}^{k}}
&=\frac{\partial\left[\frac{1}{2} \sum_{j=1}^{l}\left(\hat{y}_{j}^{k}-y_{j}^{k}\right)^{2}\right]}{\partial \hat{y}_{j}^{k}}
=\hat{y}_{j}^{k}-y_{j}^{k} \\
\frac{\partial \hat{y}_{j}^{k}}{\partial \theta_{j}}
&=f^{\prime}\left(\beta_{j}-\theta_{j}\right) \times(-1)
=-\hat{y}_{j}^{k}\left(1-\hat{y}_{j}^{k}\right) \\
&\qquad \text{using } f^{\prime}(x)=f(x)(1-f(x)) \text{ and Eq. (5.3)}
\end{aligned}
$$

Hence:

$$
\begin{aligned} \Delta \theta_{j} &=-\eta \frac{\partial E_{k}}{\partial \theta_{j}}
=-\eta \frac{\partial E_{k}}{\partial \hat{y}_{j}^{k}} \cdot \frac{\partial \hat{y}_{j}^{k}}{\partial \theta_{j}} \\
&=-\eta\left(\hat{y}_{j}^{k}-y_{j}^{k}\right) \cdot \left[-\hat{y}_{j}^{k}\left(1-\hat{y}_{j}^{k}\right)\right] \\
&=-\eta\, \hat{y}_{j}^{k}\left(1-\hat{y}_{j}^{k}\right)\left(y_{j}^{k}-\hat{y}_{j}^{k}\right) \\
&\qquad \text{by Eq. (5.10): } g_{j}=\hat{y}_{j}^{k}\left(1-\hat{y}_{j}^{k}\right)\left(y_{j}^{k}-\hat{y}_{j}^{k}\right) \\
&=-\eta g_{j} \end{aligned} \tag{5.12}
$$

2.2.3. Updating the parameter $v_{ih}$

The chain of dependence from $E_{k}$ to $v_{ih}$ is:

$$
E_{k}=\frac{1}{2} \sum_{j=1}^{l}\left(\hat{y}_{j}^{k}-y_{j}^{k}\right)^{2}
\;\longrightarrow\;
\hat{y}_{j}^{k}=f\left(\beta_{j}-\theta_{j}\right)
\;\longrightarrow\;
\beta_{j}=\sum_{h=1}^{q} w_{h j} b_{h}
\;\longrightarrow\;
b_{h}=f\left(\alpha_{h}-\gamma_{h}\right)
\;\longrightarrow\;
\alpha_{h}=\sum_{i=1}^{d} v_{ih} x_{i}
$$

Therefore:

$$
\frac{\partial E_{k}}{\partial v_{i h}}=\sum_{j=1}^{l} \frac{\partial E_{k}}{\partial \hat{y}_{j}^{k}} \cdot \frac{\partial \hat{y}_{j}^{k}}{\partial \beta_{j}} \cdot \frac{\partial \beta_{j}}{\partial b_{h}} \cdot \frac{\partial b_{h}}{\partial \alpha_{h}} \cdot \frac{\partial \alpha_{h}}{\partial v_{i h}}
$$

Here $v_{ih}$ appears in every output $\hat{y}_{j}$, so there are $l$ chains, summed over $j$.

where:

$$
\begin{aligned}
\frac{\partial E_{k}}{\partial \hat{y}_{j}^{k}}
&=\frac{\partial\left[\frac{1}{2} \sum_{j=1}^{l}\left(\hat{y}_{j}^{k}-y_{j}^{k}\right)^{2}\right]}{\partial \hat{y}_{j}^{k}}
=\hat{y}_{j}^{k}-y_{j}^{k} \\
\frac{\partial \hat{y}_{j}^{k}}{\partial \beta_{j}}
&=f^{\prime}\left(\beta_{j}-\theta_{j}\right)
=\hat{y}_{j}^{k}\left(1-\hat{y}_{j}^{k}\right) \\
\frac{\partial \beta_{j}}{\partial b_{h}}
&=\frac{\partial\left(\sum_{h=1}^{q} w_{h j} b_{h}\right)}{\partial b_{h}}
=w_{h j} \\
\frac{\partial b_{h}}{\partial \alpha_{h}}
&=f^{\prime}\left(\alpha_{h}-\gamma_{h}\right)
=f\left(\alpha_{h}-\gamma_{h}\right)\left[1-f\left(\alpha_{h}-\gamma_{h}\right)\right]
=b_{h}\left(1-b_{h}\right) \\
\frac{\partial \alpha_{h}}{\partial v_{i h}}
&=\frac{\partial\left(\sum_{i=1}^{d} v_{i h} x_{i}\right)}{\partial v_{i h}}
=x_{i} \\
&\qquad \text{using } f^{\prime}(x)=f(x)(1-f(x))
\end{aligned}
$$

Define $e_{h}$ as:

$$
\begin{aligned} e_{h} &=-\frac{\partial E_{k}}{\partial \alpha_{h}}
=-\sum_{j=1}^{l} \frac{\partial E_{k}}{\partial \hat{y}_{j}^{k}} \cdot \frac{\partial \hat{y}_{j}^{k}}{\partial \beta_{j}} \cdot \frac{\partial \beta_{j}}{\partial b_{h}} \cdot \frac{\partial b_{h}}{\partial \alpha_{h}} \\
&=\sum_{j=1}^{l} \left(-\frac{\partial E_{k}}{\partial \hat{y}_{j}^{k}} \cdot \frac{\partial \hat{y}_{j}^{k}}{\partial \beta_{j}}\right) \cdot \frac{\partial \beta_{j}}{\partial b_{h}} \cdot \frac{\partial b_{h}}{\partial \alpha_{h}} \\
&=\sum_{j=1}^{l} g_{j} \cdot w_{h j} \cdot b_{h}\left(1-b_{h}\right) \\
&=b_{h}\left(1-b_{h}\right) \sum_{j=1}^{l} w_{h j} g_{j} \end{aligned} \tag{5.15}
$$

Therefore:

$$
\begin{aligned} \Delta v_{i h} &=-\eta \frac{\partial E_{k}}{\partial v_{i h}}
=-\eta \sum_{j=1}^{l} \frac{\partial E_{k}}{\partial \hat{y}_{j}^{k}} \cdot \frac{\partial \hat{y}_{j}^{k}}{\partial \beta_{j}} \cdot \frac{\partial \beta_{j}}{\partial b_{h}} \cdot \frac{\partial b_{h}}{\partial \alpha_{h}} \cdot \frac{\partial \alpha_{h}}{\partial v_{i h}} \\
&=\eta \left(-\sum_{j=1}^{l} \frac{\partial E_{k}}{\partial \hat{y}_{j}^{k}} \cdot \frac{\partial \hat{y}_{j}^{k}}{\partial \beta_{j}} \cdot \frac{\partial \beta_{j}}{\partial b_{h}} \cdot \frac{\partial b_{h}}{\partial \alpha_{h}}\right) \cdot \frac{\partial \alpha_{h}}{\partial v_{i h}} \\
&=\eta\, e_{h} x_{i} \end{aligned} \tag{5.13}
$$

2.2.4. Updating the parameter $\gamma_{h}$

The chain of dependence from $E_{k}$ to $\gamma_{h}$ is:

$$
E_{k}=\frac{1}{2} \sum_{j=1}^{l}\left(\hat{y}_{j}^{k}-y_{j}^{k}\right)^{2}
\;\longrightarrow\;
\hat{y}_{j}^{k}=f\left(\beta_{j}-\theta_{j}\right)
\;\longrightarrow\;
\beta_{j}=\sum_{h=1}^{q} w_{h j} b_{h}
\;\longrightarrow\;
b_{h}=f\left(\alpha_{h}-\gamma_{h}\right)
$$

Therefore:

$$
\frac{\partial E_{k}}{\partial \gamma_{h}}=\sum_{j=1}^{l} \frac{\partial E_{k}}{\partial \hat{y}_{j}^{k}} \cdot \frac{\partial \hat{y}_{j}^{k}}{\partial \beta_{j}} \cdot \frac{\partial \beta_{j}}{\partial b_{h}} \cdot \frac{\partial b_{h}}{\partial \gamma_{h}}
$$

Here $\gamma_{h}$ appears in every output $\hat{y}_{j}$, so there are $l$ chains, summed over $j$.

where:

$$
\begin{aligned}
\frac{\partial E_{k}}{\partial \hat{y}_{j}^{k}}
&=\frac{\partial\left[\frac{1}{2} \sum_{j=1}^{l}\left(\hat{y}_{j}^{k}-y_{j}^{k}\right)^{2}\right]}{\partial \hat{y}_{j}^{k}}
=\hat{y}_{j}^{k}-y_{j}^{k} \\
\frac{\partial \hat{y}_{j}^{k}}{\partial \beta_{j}}
&=f^{\prime}\left(\beta_{j}-\theta_{j}\right)
=\hat{y}_{j}^{k}\left(1-\hat{y}_{j}^{k}\right) \\
\frac{\partial \beta_{j}}{\partial b_{h}}
&=\frac{\partial\left(\sum_{h=1}^{q} w_{h j} b_{h}\right)}{\partial b_{h}}
=w_{h j} \\
\frac{\partial b_{h}}{\partial \gamma_{h}}
&=f^{\prime}\left(\alpha_{h}-\gamma_{h}\right) \times(-1)
=-b_{h}\left(1-b_{h}\right) \\
&\qquad \text{using } f^{\prime}(x)=f(x)(1-f(x))
\end{aligned}
$$

Hence:

$$
\begin{aligned} \Delta \gamma_{h} &=-\eta \frac{\partial E_{k}}{\partial \gamma_{h}}
=-\eta \sum_{j=1}^{l} \frac{\partial E_{k}}{\partial \hat{y}_{j}^{k}} \cdot \frac{\partial \hat{y}_{j}^{k}}{\partial \beta_{j}} \cdot \frac{\partial \beta_{j}}{\partial b_{h}} \cdot \frac{\partial b_{h}}{\partial \gamma_{h}} \\
&=\eta \sum_{j=1}^{l} \left(-\frac{\partial E_{k}}{\partial \hat{y}_{j}^{k}} \cdot \frac{\partial \hat{y}_{j}^{k}}{\partial \beta_{j}}\right) \cdot \frac{\partial \beta_{j}}{\partial b_{h}} \cdot \frac{\partial b_{h}}{\partial \gamma_{h}} \\
&=\eta \sum_{j=1}^{l} g_{j} \cdot w_{hj} \cdot \left[-b_{h}\left(1-b_{h}\right)\right] \\
&=-\eta\, b_{h}\left(1-b_{h}\right) \sum_{j=1}^{l} w_{h j} g_{j} \\
&\qquad \text{by Eq. (5.15): } e_{h}=b_{h}\left(1-b_{h}\right) \sum_{j=1}^{l} w_{h j} g_{j} \\
&=-\eta e_{h} \end{aligned} \tag{5.14}
$$
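Putting Eqs. (5.10) through (5.15) together, one standard-BP step on a single sample can be sketched as follows (the sample, the parameter values, and the shapes are random illustrations, not from the book):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bp_update(x, y, V, gamma, W, theta, eta=0.1):
    """One standard-BP step on a single sample, following Eqs. (5.10)-(5.15).
    Shapes: V is (d, q), W is (q, l); gamma, theta are the threshold vectors."""
    # Forward pass
    b = sigmoid(x @ V - gamma)       # hidden outputs b_h
    y_hat = sigmoid(b @ W - theta)   # network outputs y_hat_j
    # Gradient terms
    g = y_hat * (1 - y_hat) * (y - y_hat)   # Eq. (5.10)
    e = b * (1 - b) * (W @ g)               # Eq. (5.15)
    # Parameter updates, Eqs. (5.11)-(5.14)
    return (V + eta * np.outer(x, e),    # delta_v_ih  = eta * e_h * x_i
            gamma - eta * e,             # delta_gamma = -eta * e_h
            W + eta * np.outer(b, g),    # delta_w_hj  = eta * g_j * b_h
            theta - eta * g)             # delta_theta = -eta * g_j

def loss(x, y, V, gamma, W, theta):
    b = sigmoid(x @ V - gamma)
    return 0.5 * np.sum((sigmoid(b @ W - theta) - y) ** 2)   # Eq. (5.4)

# Random sample and parameters, for illustration only (d = 2, q = 3, l = 2).
rng = np.random.default_rng(0)
x, y = np.array([1.0, -1.0]), np.array([1.0, 0.0])
V, gamma = rng.normal(size=(2, 3)), rng.normal(size=3)
W, theta = rng.normal(size=(3, 2)), rng.normal(size=2)
before = loss(x, y, V, gamma, W, theta)
after = loss(x, y, *bp_update(x, y, V, gamma, W, theta))   # one step lowers E_k
```

A single small gradient step should reduce $E_{k}$ on this sample, which gives a cheap end-to-end sanity check of the four update formulas.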
