Neural Networks and Deep Learning: Multilayer Feedforward Networks and the Error Backpropagation Algorithm
1. Multilayer Perceptrons
1.1. The XOR Problem
Linearly non-separable problems cannot be solved by a linear classifier. Minsky posed the XOR problem in 1969.
1.2. The Multilayer Perceptron
Solution: use a multilayer perceptron.
Adding one or more layers of hidden units between the input and output layers yields a multilayer perceptron (a multilayer feedforward neural network).
With one layer of hidden nodes (units), the resulting three-layer network can solve the XOR problem, as shown in the figure:
From the input $\pmb{u}=(u_1,u_2)$, the outputs of the two hidden nodes and the single output node are:

$$\begin{array}{l} y_{1}^{1}=f\left[w_{11}^{1} u_{1}+w_{12}^{1} u_{2}-\theta_{1}^{1}\right] \\ y_{2}^{1}=f\left[w_{21}^{1} u_{1}+w_{22}^{1} u_{2}-\theta_{2}^{1}\right] \\ y=f\left[w_{1}^{2} y_{1}^{1}+w_{2}^{2} y_{2}^{1}-\theta\right] \\ f[\bullet]=\left\{\begin{array}{ll} 1, & \bullet \geq 0 \\ 0, & \bullet<0 \end{array}\right. \end{array}$$
which gives:

$$\begin{aligned} y_{1}^{1} & =\left\{\begin{array}{ll} 1, & w_{11}^{1} u_{1}+w_{12}^{1} u_{2} \geq \theta_{1}^{1} \\ 0, & w_{11}^{1} u_{1}+w_{12}^{1} u_{2}<\theta_{1}^{1} \end{array}\right. \\ y_{2}^{1} & =\left\{\begin{array}{ll} 1, & w_{21}^{1} u_{1}+w_{22}^{1} u_{2} \geq \theta_{2}^{1} \\ 0, & w_{21}^{1} u_{1}+w_{22}^{1} u_{2}<\theta_{2}^{1} \end{array}\right. \\ y & =\left\{\begin{array}{ll} 1, & w_{1}^{2} y_{1}^{1}+w_{2}^{2} y_{2}^{1} \geq \theta \\ 0, & w_{1}^{2} y_{1}^{1}+w_{2}^{2} y_{2}^{1}<\theta \end{array}\right. \end{aligned}$$
Suppose the network has the following set of weights and thresholds; the node outputs become:

$$\begin{array}{l} y_{1}^{1}=f\left[1 \cdot u_{1}+1 \cdot u_{2}-0.5\right] \\ y_{2}^{1}=f\left[(-1) \cdot u_{1}+(-1) \cdot u_{2}-(-1.5)\right] \\ y=f\left[1 \cdot y_{1}^{1}+1 \cdot y_{2}^{1}-1.2\right] \end{array}$$
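This network can be checked directly in code. A minimal sketch (the weights and thresholds are the ones given above; the function names are mine):

```python
def f(s):
    """Threshold activation: 1 if s >= 0, else 0."""
    return 1 if s >= 0 else 0

def xor_net(u1, u2):
    y1 = f(1 * u1 + 1 * u2 - 0.5)        # hidden node 1: fires when u1 + u2 >= 0.5
    y2 = f(-1 * u1 - 1 * u2 - (-1.5))    # hidden node 2: fires when u1 + u2 <= 1.5
    return f(1 * y1 + 1 * y2 - 1.2)      # output: fires only when both hidden nodes fire

for u1 in (0, 1):
    for u2 in (0, 1):
        print(u1, u2, xor_net(u1, u2))   # prints the XOR truth table
```

Running it confirms the outputs 0, 1, 1, 0 for the four input patterns, i.e. exactly XOR.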
A three-layer perceptron can recognize any convex polygon or any unbounded convex region.
Perceptron networks with more layers can recognize still more complex shapes.
For multilayer perceptron networks, the following theorems hold:
Theorem 1: If the hidden nodes (units) can be set arbitrarily, a three-layer network of threshold nodes can realize any binary logic function.
Theorem 2: If the hidden nodes (units) can be set arbitrarily, a three-layer network of sigmoid-type nonlinear nodes can uniformly approximate any continuous function on a compact set, or approximate any square-integrable function on a compact set in the $L^2$ norm.
2. Multilayer Feedforward Networks and an Overview of the BP Algorithm
2.1. Multilayer Feedforward Networks
The backpropagation (BP) learning algorithm for multilayer feedforward networks, BP algorithm for short, is supervised learning; it is the application of gradient descent to multilayer feedforward networks.
Network structure: see the figure. $u$ and $y$ are the network's input and output vectors; neurons are drawn as nodes, and the network consists of input-layer, hidden-layer, and output-layer nodes. There may be one hidden layer or several (the figure shows a single hidden layer), and each layer's nodes connect to the next layer's through weights. Because the BP learning algorithm is used, such a network is commonly called a BP neural network.
2.2. Overview of the BP Algorithm
- The network's input/output samples (teacher signals) are known.
- The BP learning algorithm consists of a forward pass and a backward pass:
  - Forward pass: the input signal propagates from the input layer through the hidden layer(s) to the output layer. If the output layer produces the desired output, learning stops; otherwise the algorithm proceeds to the backward pass.
  - Backward pass: the error (the difference between the sample output and the network output) is propagated backward along the original connection paths, and gradient descent adjusts the weights and thresholds of each layer so as to reduce the error.
3. The BP Algorithm in Detail
3.1. Basic Idea of the BP Algorithm
Notation (kept as close as possible to Andrew Ng's "Deep Learning" course):
(bold denotes a vector or matrix; plain italic denotes a scalar variable)
Layers: indexed by a superscript $[l]$, with $L$ layers in total; the input is layer $0$ and the output is layer $L$.
Network output: $\pmb{\hat{y}}=\pmb{a}^{[L]}=\pmb{a}$
Network input: $\pmb{x}=\pmb{a}^{[0]}$
Output of layer $l$: $\pmb{a}^{[l]}=f(\pmb{z}^{[l]})$; choosing the sigmoid as the activation function, this can be written $\pmb{a}^{[l]}=\sigma(\pmb{z}^{[l]})$.
Weights: $w_{ij}^{[l]}$ denotes the weight connecting node $i$ of layer $l$ to node $j$ of layer $l-1$.
Let the algorithm's input/output samples (teacher signals) be

$$\{x^{(1)},y^{(1)}\},\{x^{(2)},y^{(2)}\},\cdots,\{x^{(N)},y^{(N)}\}$$

i.e. $N$ samples in total, also written $\{x^{(i)},y^{(i)}\},\ i=1,\cdots,N$.
The goal of training is to adjust the network parameters so that, for each input sample, the mean squared output error is minimized. This is an optimization problem.
Choose: $J(x^{(i)};w)=\frac{1}{2}\left(y^{(i)}-\hat{y}^{(i)}(x;w)\right)^2=\frac{1}{2}\left(y^{(i)}-a^{(i)}(x;w)\right)^2$
Consider an iterative algorithm. Let the initial weights be $w_0$ and the weights at step $k$ be $w_k$. A Taylor expansion gives:

$$J(w_{k+1})=J(w_k)+\left[\frac{dJ}{dw}\right]^T\varDelta w_k+\cdots$$
Question: how should $\varDelta w_k$ be chosen so that $J$ is minimized?
The most direct choice is $\varDelta w_k=-\alpha\frac{dJ}{dw}$ with $0<\alpha\le1$.
Then every step guarantees $J(w_{k+1})\le J(w_k)$, so $J$ eventually converges to a minimum.
This is the gradient descent algorithm, and it is the basic idea of the BP learning algorithm.
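The update $\varDelta w_k=-\alpha\,dJ/dw$ can be seen in isolation on a one-dimensional example. A minimal sketch, assuming an illustrative quadratic objective $J(w)=(w-3)^2$ and learning rate $\alpha=0.1$ (both my choices, not from the text):

```python
def dJ(w):
    # Gradient of the illustrative objective J(w) = (w - 3)^2
    return 2.0 * (w - 3.0)

alpha = 0.1
w = 0.0                      # initial weight w_0
for k in range(100):
    w = w - alpha * dJ(w)    # w_{k+1} = w_k + Δw_k = w_k - α dJ/dw

print(round(w, 4))           # prints 3.0: converges to the minimizer w* = 3
```

Each step strictly reduces $J$ here because $\alpha$ is small enough; too large an $\alpha$ would make the iteration diverge, which previews the learning-rate difficulty noted in the assessment section.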
1. Initialize the weights $\pmb{w_0}$ to small random nonzero values.
2. Given an input/output sample pair, compute the network output, completing the forward pass.
3. Compute the objective function $J$. If $J<\varepsilon$, training has succeeded: stop. Otherwise go to step 4.
4. Backward pass: starting from the output layer, propagate the error backward by gradient descent, adjusting the weights layer by layer.
3.2. Derivation of the BP Algorithm
3.2.1. Forward Propagation
Consider a three-layer neural network and a single current sample.
Hidden-layer output: for the $i$-th neuron of layer $l$,

$$a_i^{[l]}=f\left(z_i^{[l]}\right)=f\left(\mathbf{w}_i^{[l]} \cdot \mathbf{a}^{[l-1]}\right)=f\left(\sum_{j=0}^n w_{i j}^{[l]} \cdot a_j^{[l-1]}\right)$$
$f$ can be chosen as the Log-Sigmoid function:

$$f(x)=\frac{1}{1+e^{-x}}$$
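The derivation below relies on the Log-Sigmoid identity $f'(x)=f(x)\bigl(1-f(x)\bigr)$. A quick numerical check of that property, using central differences (step size $h$ is an illustrative choice):

```python
import math

def f(x):
    """Log-Sigmoid function f(x) = 1 / (1 + e^{-x})."""
    return 1.0 / (1.0 + math.exp(-x))

h = 1e-6
for x in (-2.0, 0.0, 1.5):
    numeric = (f(x + h) - f(x - h)) / (2 * h)   # finite-difference derivative
    analytic = f(x) * (1.0 - f(x))              # the identity f' = f(1 - f)
    assert abs(numeric - analytic) < 1e-8
```

The identity follows by differentiating $f$ directly: $f'(x)=e^{-x}/(1+e^{-x})^2=f(x)(1-f(x))$.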
Assuming a single hidden layer, the network output is:

$$y_i=a_i^{[2]}=f\left(z_i^{[2]}\right)=f\left(\sum_{j=0}^{n} w_{ij}^{[2]}\, a_j^{[1]}\right)$$

Compute the error:

$$\mathbf{e}=\mathbf{y}-\mathbf{a}$$

whose $i$-th component is $e_i=y_i-a_i^{[2]}$.
We want to compute $\varDelta w_k=-\alpha\frac{dJ}{dw}$, so we need $\left.\frac{dJ}{dw}\right|_{w=w_k}$.
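The forward pass for a single hidden layer can be sketched as follows. A minimal sketch: the weight values and the helper names (`forward`, `sigmoid`) are illustrative and not from the text, and thresholds are omitted for brevity.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(x, W1, W2):
    # hidden layer: a^{[1]}_i = f(sum_j w^{[1]}_{ij} x_j)
    a1 = [sigmoid(sum(w * xj for w, xj in zip(row, x))) for row in W1]
    # output layer: a^{[2]}_i = f(sum_j w^{[2]}_{ij} a^{[1]}_j)
    a2 = [sigmoid(sum(w * aj for w, aj in zip(row, a1))) for row in W2]
    return a1, a2

W1 = [[0.5, -0.2], [0.3, 0.8]]   # w^{[1]}_{ij}: 2 hidden units, 2 inputs
W2 = [[1.0, -1.0]]               # w^{[2]}_{ij}: 1 output unit
a1, a2 = forward([1.0, 0.0], W1, W2)
e = [1.0 - a2[0]]                # error e = y - a for an assumed target y = 1
```

The error `e` computed here is exactly what the backward pass below propagates.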
3.2.2. Error Backpropagation
3.2.2.1. Output Layer
First consider the output-layer weights $w^{[2]}$. By the chain rule:

$$\frac{\partial J}{\partial w_{ij}^{[2]}}=\left[\frac{\partial J}{\partial \mathbf{e}}\right]^T\frac{\partial \mathbf{e}}{\partial w_{ij}^{[2]}},\qquad \frac{\partial J}{\partial \mathbf{e}}=\mathbf{e}$$
Note that $w_{ij}^{[2]}$ affects only $y_i$ (see the network diagram), so:

$$\frac{\partial \mathbf{e}}{\partial w_{i j}^{[2]}}=\left[\frac{\partial e_1}{\partial w_{i j}^{[2]}}, \cdots, \frac{\partial e_i}{\partial w_{i j}^{[2]}}, \cdots, \frac{\partial e_m}{\partial w_{i j}^{[2]}}\right]^{\mathrm{T}}=\left[0, \cdots, \frac{\partial e_i}{\partial w_{i j}^{[2]}}, \cdots, 0\right]^{\mathrm{T}}$$
Further, by the Log-Sigmoid property:

$$\begin{aligned} \frac{\partial J}{\partial w_{i j}^{[2]}} & =-e_i \frac{\partial a_i}{\partial w_{i j}^{[2]}}=-e_i a_i\left(1-a_i\right) a_j^{[1]} \\ \Delta w_{i j}^{[2]}(k) & =-\alpha \frac{\partial J}{\partial w_{i j}^{[2]}}=\alpha \cdot a_i\left(1-a_i\right) e_i \cdot a_j^{[1]} \end{aligned}$$

Setting $\delta_i^{[2]}=a_i(1-a_i)e_i$ and drawing an analogy with the Hebb rule:

$$\varDelta w_{ij}^{[2]}(k)=\alpha\,\delta_i^{[2]}\cdot a_j^{[1]}$$
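The output-layer update $\varDelta w_{ij}^{[2]}=\alpha\,\delta_i^{[2]}a_j^{[1]}$ with $\delta_i^{[2]}=a_i(1-a_i)e_i$ is only a few lines of code. A minimal sketch with illustrative numeric values (none of them are from the text):

```python
alpha = 0.5
a_out = [0.7]          # output activations a_i (illustrative)
e = [1.0 - 0.7]        # errors e_i = y_i - a_i for an assumed target y = 1
a1 = [0.2, 0.9]        # hidden activations a^{[1]}_j (illustrative)

# delta^{[2]}_i = a_i (1 - a_i) e_i
delta2 = [a * (1 - a) * ei for a, ei in zip(a_out, e)]
# Δw^{[2]}_{ij} = α · δ^{[2]}_i · a^{[1]}_j
dW2 = [[alpha * d * aj for aj in a1] for d in delta2]
```

Note how the Hebb-like structure shows up directly: each weight change is (learning rate) × (local error signal at the destination node) × (activation of the source node).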
3.2.2.2. Hidden Layer
Note that $w^{[1]}_{ij}$ affects the outputs only through $a_i^{[1]}$, so:

$$\begin{gathered} \frac{\partial J}{\partial w_{i j}^{[1]}}=\left[\left[\frac{\partial J}{\partial \mathbf{e}}\right]^{\mathrm{T}} \frac{\partial \mathbf{e}}{\partial a_i^{[1]}}\right] \frac{\partial a_i^{[1]}}{\partial w_{i j}^{[1]}} \\ \frac{\partial \mathbf{e}}{\partial a_i^{[1]}}=-\frac{\partial \mathbf{y}}{\partial a_i^{[1]}}=\left[\frac{\partial y_1}{\partial a_i^{[1]}}, \cdots, \frac{\partial y_m}{\partial a_i^{[1]}}\right]^{\mathrm{T}} \end{gathered}$$

(in this derivation $y_j$ denotes the network output $\hat{y}_j=a_j^{[2]}$)
Take $y_m$ as an example. From the expression for $y_m$ (see forward propagation):

$$\frac{\partial y_m}{\partial a_i^{[1]}}=f^{\prime}\left(\sum_{j=1}^n w_{m j}^{[2]} a_j^{[1]}\right) \frac{\partial\left(\sum_{j=1}^n w_{m j}^{[2]} a_j^{[1]}\right)}{\partial a_i^{[1]}}=f^{\prime}\left(z_m^{[2]}\right) w_{m i}^{[2]}$$
By the sigmoid property, and using $a_m=f(z_m^{[2]})$:

$$\begin{gathered} f^{\prime}\left(z_m^{[2]}\right)=a_m\left(1-a_m\right) \\ \left[\frac{\partial J}{\partial \mathbf{e}}\right]^{\mathrm{T}} \frac{\partial \mathbf{e}}{\partial a_i^{[1]}}=\sum_{j=1}^m \frac{\partial J}{\partial e_j} \cdot \frac{\partial e_j}{\partial y_j} \cdot \frac{\partial y_j}{\partial a_i^{[1]}}=-\sum_{j=1}^m a_j\left(1-a_j\right) w_{j i}^{[2]} e_j \end{gathered}$$
That is, the error propagates backward through the weights. Meanwhile:

$$\frac{\partial a_i^{[1]}}{\partial w_{ij}^{[1]}}=a_i^{[1]}\left(1-a_i^{[1]}\right)x_j$$
Combining the results above (with the summation index written as $p$ to distinguish it from the input index $j$):

$$\Delta w_{i j}^{[1]}(k)=\alpha\left[\sum_{p=1}^m w_{p i}^{[2]} a_p\left(1-a_p\right) e_p\right] a_i^{[1]}\left(1-a_i^{[1]}\right) x_j$$
Let:

$$\delta_i^{[1]}=\left[\sum_{j=1}^m w_{j i}^{[2]} \delta_j^{[2]}\right] a_i^{[1]}\left(1-a_i^{[1]}\right)$$
Then, by analogy with the Hebb rule:

$$\Delta w_{i j}^{[1]}(k)=\alpha\, \delta_i^{[1]} \cdot x_j$$
3.2.2.3. Summary
If the current layer is the output layer: $\delta_i^{[L]}=a_i(1-a_i)e_i$.
For hidden layers (updated from back to front): $\delta_i^{[l]}=\left[\sum\limits_{j=1}^m w_{ji}^{[l+1]}\delta_j^{[l+1]}\right]a_i^{[l]}\left(1-a_i^{[l]}\right)$.
Then update: $\varDelta w_{ij}^{[l]}(k)=\alpha\cdot\delta_i^{[l]}\cdot a_j^{[l-1]}$, with $a_j^{[0]}=x_j$.
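The summarized rules can be put together into a complete training loop. A minimal sketch on the XOR task: the network size (3 hidden sigmoid units), learning rate, random seed, and iteration count are all illustrative choices of mine, and thresholds are handled as biases.

```python
import math
import random

random.seed(0)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

n_in, n_hid = 2, 3
W1 = [[random.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_hid)]
W2 = [[random.uniform(-1, 1) for _ in range(n_hid)]]
b1 = [0.0] * n_hid   # hidden-layer thresholds (as biases)
b2 = [0.0]           # output threshold
alpha = 0.5
data = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 0)]

def forward(x):
    a1 = [sigmoid(sum(w * xj for w, xj in zip(W1[i], x)) + b1[i])
          for i in range(n_hid)]
    a2 = sigmoid(sum(w * a for w, a in zip(W2[0], a1)) + b2[0])
    return a1, a2

def total_error():
    return sum((y - forward(x)[1]) ** 2 for x, y in data)

err_before = total_error()
for step in range(20000):          # per-sample (stochastic) updates
    x, y = data[step % len(data)]
    a1, a2 = forward(x)
    d2 = a2 * (1 - a2) * (y - a2)                 # output-layer delta
    d1 = [a1[i] * (1 - a1[i]) * W2[0][i] * d2     # back-propagated hidden deltas
          for i in range(n_hid)]
    for i in range(n_hid):                        # Δw = α · δ · a_prev
        W2[0][i] += alpha * d2 * a1[i]
        for j in range(n_in):
            W1[i][j] += alpha * d1[i] * x[j]
        b1[i] += alpha * d1[i]
    b2[0] += alpha * d2
err_after = total_error()
```

With these settings the squared error over the four patterns decreases substantially; whether XOR is learned exactly depends on the random initialization, which illustrates the non-global-convergence issue noted in the assessment section.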
4. Extensions of the Algorithm
4.1. Neural-Network Classification
Consider a linear output node (single output):

$$J(w)=\frac{1}{2N}\sum\limits^N_{i=1}\left(a^{(i)}-y^{(i)}\right)^2$$
It is easy to compute:

$$\frac{\partial}{\partial w_j^{[2]}}J(w)=\frac{1}{N}\sum\limits^N_{i=1}\left[a^{(i)}-y^{(i)}\right]a_j^{[1](i)}$$
For a binary classification problem (single output), use the cross-entropy loss:

$$J(\mathbf{w})=\frac{1}{N} \sum_{i=1}^{N} L\left(a^{(i)}, y^{(i)}\right)=-\frac{1}{N} \sum_{i=1}^{N}\left[y^{(i)} \log a^{(i)}+\left(1-y^{(i)}\right) \log \left(1-a^{(i)}\right)\right]$$
Note that:

$$\begin{aligned} & y^{(i)} \log a^{(i)}+\left(1-y^{(i)}\right) \log \left(1-a^{(i)}\right) \\ =\; & y^{(i)} \log \left(\frac{1}{1+e^{-\left(\mathbf{w}^{[2]}\right)^{\mathrm{T}} \mathbf{a}^{(i)}}}\right)+\left(1-y^{(i)}\right) \log \left(1-\frac{1}{1+e^{-\left(\mathbf{w}^{[2]}\right)^{\mathrm{T}} \mathbf{a}^{(i)}}}\right) \end{aligned}$$

where $\mathbf{a}^{(i)}$ denotes the hidden-layer activation vector for sample $i$.
from which one can compute:

$$\frac{\partial}{\partial w_j^{[2]}}J(w)=\frac{1}{N}\sum\limits^N_{i=1}\left[a^{(i)}-y^{(i)}\right]a_j^{[1](i)}$$

the same form as in the squared-error case: the sigmoid's derivative cancels against the log terms.
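That cancellation is easy to verify numerically: compute the analytic gradient $(1/N)\sum_i (a^{(i)}-y^{(i)})\,a_j^{[1](i)}$ and compare it against central differences of $J$. The fixed hidden activations, targets, and weights below are illustrative values of my own:

```python
import math

sigmoid = lambda s: 1.0 / (1.0 + math.exp(-s))

A = [[0.2, 0.7], [0.9, 0.1], [0.5, 0.5]]   # hidden activations a^{[1](i)}
Y = [1, 0, 1]                              # targets y^{(i)}
w = [0.3, -0.4]                            # output weights w^{[2]}
N = len(A)

def J(w):
    # cross-entropy loss averaged over the N samples
    total = 0.0
    for a1, y in zip(A, Y):
        a = sigmoid(sum(wj * aj for wj, aj in zip(w, a1)))
        total += -(y * math.log(a) + (1 - y) * math.log(1 - a))
    return total / N

# analytic gradient: (1/N) sum_i (a^{(i)} - y^{(i)}) a_j^{[1](i)}
grad = [sum((sigmoid(sum(wj * aj for wj, aj in zip(w, a1))) - y) * a1[j]
            for a1, y in zip(A, Y)) / N
        for j in range(2)]

h = 1e-6
for j in range(2):
    wp = list(w); wp[j] += h
    wm = list(w); wm[j] -= h
    numeric = (J(wp) - J(wm)) / (2 * h)    # central-difference estimate
    assert abs(numeric - grad[j]) < 1e-8
```

This kind of finite-difference gradient check is a standard way to validate a hand-derived backpropagation formula.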
4.2. Weight Regularization
Adding a regularization term:

$$J(w)=\frac{1}{N}\sum\limits^N_{i=1}L\left(a^{(i)},y^{(i)}\right)+\frac{\lambda}{2P}\|w\|^2$$

one can compute:

$$\frac{\partial}{\partial w_j^{[2]}}J(w)=\frac{1}{N}\sum\limits^N_{i=1}\left[a^{(i)}-y^{(i)}\right]a_j^{[1](i)}+\frac{\lambda}{P}w_j^{[2]}$$
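In code, the penalty $(\lambda/2P)\|w\|^2$ simply contributes an extra $(\lambda/P)\,w_j$ to each weight's gradient. A minimal sketch with illustrative values (the base gradient here is a made-up placeholder, not computed from data):

```python
lam, P = 0.1, 4
w = [0.3, -0.4]
base_grad = [0.05, -0.02]          # unregularized ∂J/∂w_j (illustrative placeholder)

# regularized gradient: base term plus (λ/P) w_j from the penalty
grad = [g + (lam / P) * wj for g, wj in zip(base_grad, w)]
```

The penalty shrinks large weights toward zero on every update, which helps control overfitting.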
5. Assessment of the Algorithm
Strengths:
- Learning is fully autonomous
- Can approximate arbitrary nonlinear functions
Weaknesses:
- The algorithm is not globally convergent
- Convergence is slow
- The learning rate $\alpha$ must be chosen carefully
- How should the network be designed (how many layers? how many nodes per layer?)
6. Summary
Object localization: common approaches are bounding-box localization and landmark (feature-point) localization; localization differs from classification only in that the network's output layer emits a few extra numbers to convey the additional information.
Object detection: common approaches are the sliding-window algorithm and the YOLO algorithm. YOLO improves on the sliding window's imprecise bounding-box localization and is computationally efficient, even usable for real-time detection.