Exercise Solutions for 《神经网络与深度学习》 (Neural Networks and Deep Learning), up to Chapter 7

These are my own solutions to part of the exercises, written with reference to solution write-ups shared on GitHub and answers on Zhihu. I worked through the problems myself, so some solutions may contain mistakes; corrections are welcome. The original post lives on my personal blog.

Chapter 2

2-1

Intuitively, for a given classification problem the squared loss is bounded above (even if every label is predicted wrong, the loss is a finite value), whereas cross-entropy can use the whole non-negative range to reflect how far the optimization still has to go.

More fundamentally, the squared loss and the cross-entropy mean different things. Probabilistically, the squared loss assumes the model output follows a Gaussian distribution centered at the prediction, and the loss is the likelihood of the true value under that predictive distribution; the softmax/cross-entropy loss is the likelihood of the true label.

Labels in a classification problem carry no notion of continuity. One-hot vectors are just one way to encode labels, and the distances between label vectors have no real meaning, so the squared difference between the prediction vector and the label vector does not reflect how well the classification problem is being optimized.

Most classification problems do not follow a Gaussian distribution.

Following Andrew Ng's machine learning lectures: $J(\theta)=\frac{1}{m}\sum^m_{i=1}\frac{1}{2}(h_{\theta}(x^{(i)})-y^{(i)})^2$, where $h$ is the prediction and $y$ the corresponding label, so $J$ measures the gap between prediction and label with a squared (2-norm) penalty, and training adjusts the weights so that $J$ is approximately minimized. In theory this loss works, but in practice there is a problem: $h$ is the output of an activation function, usually a nonlinear one such as the sigmoid, which makes the curve of $J$ complicated and non-convex, hard to optimize, and prone to getting stuck in local minima. As Ng notes, with a sigmoid activation the curve of $J$ looks like the figure below: finding the global minimum is difficult, and a careless step lands in a local one.

(Figure: the non-convex curve of $J$ under a sigmoid activation.)

The cross-entropy loss is: $Cost(h_{\theta}(x),y)=-y\cdot \log(h_{\theta}(x))+(y-1)\cdot \log(1-h_{\theta}(x))$

With cross-entropy, the loss curve becomes:

(Figures: the convex loss curves obtained with cross-entropy.)
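
As a quick numerical illustration of the convexity argument above (a minimal sketch I added; the toy data and the weight grid are my own and not part of the exercise), the snippet evaluates both losses for a one-dimensional logistic model over a grid of weights and checks the sign of their second differences:

```python
import numpy as np

# Toy 1-D logistic-regression data (hypothetical values, for illustration only).
x = np.array([-2.0, -1.0, 1.5, 3.0])
y = np.array([0.0, 0.0, 1.0, 1.0])

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

ws = np.linspace(-6, 6, 241)                     # grid of weights (no bias)
y_hat = sigmoid(np.outer(ws, x))                 # predictions for every w
mse = 0.5 * ((y_hat - y) ** 2).sum(axis=1)       # squared loss J(w)
ce = -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat)).sum(axis=1)  # cross-entropy

# For a convex function, all second differences along the grid are >= 0.
d2_mse = np.diff(mse, 2)
d2_ce = np.diff(ce, 2)
print("squared loss has negative curvature somewhere:", bool((d2_mse < -1e-12).any()))
print("cross-entropy has negative curvature somewhere:", bool((d2_ce < -1e-12).any()))
```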

2-2

The optimal parameter

Let $P=\operatorname{diag}(r^{(1)},\cdots,r^{(N)})$, so that the weighted risk can be written as:
$$\begin{aligned} R(w) &=\frac{1}{2}\sum^N_{n=1}r^{(n)}\left(y^{(n)}-w^\top x^{(n)}\right)^2 \\ &=\frac{1}{2}(y-X^\top w)^\top P\,(y-X^\top w) \end{aligned}$$
Taking the derivative of the risk with respect to $w$ and setting it to zero:
$$\frac{\partial R(w)}{\partial w}= -XP(y-X^\top w) = 0$$
which gives: $w^* =(XPX^\top)^{-1}XPy$

The role of the weight $r^{(n)}$

This parameter weights the individual samples, so that different samples influence the solution to different degrees. If a sample is important and we want the model to pay more attention to it, we give it a larger weight; otherwise a smaller one.
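
To sanity-check the closed form above, here is a minimal numpy sketch (random toy data of my own): it builds $P=\operatorname{diag}(r^{(n)})$, computes $w^*=(XPX^\top)^{-1}XPy$, and verifies that the gradient of the weighted risk vanishes there:

```python
import numpy as np

rng = np.random.default_rng(0)
d, N = 3, 20                                   # feature dim (incl. bias row) and sample count
X = rng.normal(size=(d, N))                    # columns are the samples x^(n), as in the book
y = rng.normal(size=N)
r = rng.uniform(0.5, 2.0, size=N)              # per-sample weights r^(n)
P = np.diag(r)

w_star = np.linalg.solve(X @ P @ X.T, X @ P @ y)   # w* = (X P X^T)^{-1} X P y

# The gradient of R(w) = 1/2 * sum_n r_n (y_n - w^T x_n)^2 should vanish at w*.
grad = -X @ P @ (y - X.T @ w_star)
print("gradient norm at w*:", np.linalg.norm(grad))   # ~1e-15
```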

2-3

Known theorem: if $A$ and $B$ are $n \times m$ and $m\times s$ matrices, then $rank(AB)\leq \min\{rank(A),rank(B)\}$.

$X\in \mathbb{R}^{(d+1)\times N},\ X^\top \in \mathbb{R}^{N\times (d+1)}$

$rank(X)=rank(X^\top)\leq\min(d+1,N)$. Since $N<d+1$, we have $rank(X)\leq N$.

Therefore $rank(XX^\top)\leq \min\{rank(X),rank(X^\top)\}\leq N<d+1$. But $XX^\top$ is a $(d+1)\times(d+1)$ matrix, so it is rank-deficient and hence not invertible, and the closed-form solution $(XX^\top)^{-1}Xy$ does not exist.

2-4

Given $R(w)=\frac{1}{2}\|y-X^\top w\|^2+\frac{1}{2}\lambda\|w\|^2$, we want to show that $w^* = (XX^\top+\lambda I)^{-1}Xy$.

Taking the derivative:
$$\begin{aligned} \frac{\partial R(w)}{\partial w} &=\frac{1}{2}\frac{\partial \left(\|y-X^\top w\|^2+\lambda\|w\|^2\right)}{\partial w} \\ &=-X(y-X^\top w)+\lambda w = 0 \end{aligned}$$
Therefore:
$$-Xy+XX^\top w+\lambda w = 0 \;\Rightarrow\; (XX^\top+\lambda I)w =Xy \;\Rightarrow\; w^* = (XX^\top+\lambda I)^{-1}Xy$$
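
A short numpy check of the ridge solution (toy data of my own): even with fewer samples than features, so that $XX^\top$ is singular as in 2-3, adding $\lambda I$ makes the system solvable, and the gradient of the regularized risk vanishes at $w^*$:

```python
import numpy as np

rng = np.random.default_rng(1)
d_plus_1, N, lam = 6, 4, 0.1        # fewer samples than features: X X^T is singular
X = rng.normal(size=(d_plus_1, N))
y = rng.normal(size=N)

A = X @ X.T
print("rank of X X^T:", np.linalg.matrix_rank(A))        # N < d+1, so rank-deficient
w_star = np.linalg.solve(A + lam * np.eye(d_plus_1), X @ y)

grad = -X @ (y - X.T @ w_star) + lam * w_star            # gradient of the regularized risk
print("gradient norm at w*:", np.linalg.norm(grad))      # ~1e-15
```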

2-5

From the problem statement: $\log p(y|X;w,\sigma) =\sum^N_{n=1}\log \mathcal{N}(y^{(n)};w^\top x^{(n)},\sigma^2)$

Setting $\frac{\partial \log p(y|X;w,\sigma)}{\partial w} = 0$:
$$\frac{\partial\left(\sum^N_{n=1}-\frac{(y^{(n)}-w^\top x^{(n)})^2}{2\sigma^2}\right)}{\partial w}=0 \;\Rightarrow\; \frac{\partial\, \frac{1}{2}\|y-X^\top w\|^2}{\partial w} =0 \;\Rightarrow\; -X(y-X^\top w) = 0$$
Therefore: $w^{ML}=(XX^\top)^{-1}Xy$

2-6

Sample mean

The likelihood of $\mu$ on the samples is: $p(x|\mu,\sigma^2)=\prod^N_{n=1}\mathcal{N}(x^{(n)};\mu,\sigma^2)$

The log-likelihood is: $\log p(x;\mu,\sigma^2)=\sum^N_{n=1}\log p(x^{(n)};\mu,\sigma^2)=\sum^N_{n=1}\left(\log \frac{1}{\sqrt{2\pi}\sigma}-\frac{(x^{(n)}-\mu)^2}{2\sigma^2}\right)$

Our goal is an estimate of $\mu$ that maximizes the likelihood, which is equivalent to maximizing the log-likelihood.

Setting $\frac{\partial \log p(x;\mu,\sigma^2)}{\partial \mu}=\frac{1}{\sigma^2}\sum^N_{n=1}(x^{(n)}-\mu)=0$ gives $\mu^{ML}=\frac{1}{N}\sum^N_{n=1}x^{(n)}$, i.e., the sample mean.

MAP derivation

The posterior of $\mu$ is: $p(\mu|x;\mu_0,\sigma_0^2)\propto p(x|\mu;\sigma^2)\,p(\mu;\mu_0,\sigma_0^2)$

Taking the likelihood $p(x|\mu;\sigma^2)$ to be a Gaussian density, the log posterior is:
$$\begin{aligned} \log p(\mu|x;\mu_0,\sigma_0^2)&\propto\log p(x|\mu;\sigma^2)+\log p(\mu;\mu_0,\sigma_0^2) \\ &\propto -\frac{1}{2\sigma^2}\sum^N_{n=1}(x^{(n)}-\mu)^2-\frac{1}{2\sigma_0^2}(\mu-\mu_0)^2 \end{aligned}$$
Setting $\frac{\partial \log p(\mu|x;\mu_0,\sigma_0^2)}{\partial \mu}=0$ gives: $\mu^{MAP}=\left(\frac{1}{\sigma^2}\sum^N_{n=1}x^{(n)}+\frac{\mu_0}{\sigma_0^2}\right)\Big/\left(\frac{N}{\sigma^2}+\frac{1}{\sigma_0^2}\right)$

This completes the proof.

2-7

As $\sigma_0\rightarrow \infty$, $\mu^{MAP}\rightarrow\frac{1}{N}\sum^N_{n=1}x^{(n)}$, i.e., the MAP estimate tends to the maximum likelihood estimate (the sample mean).

2-8

Since $f(x)$ is a fixed (measurable, bounded) function of $x$, conditioning on $x$ gives: $\mathbb{E}_y[f^2(x)|x]=f^2(x),\ \mathbb{E}_y[yf(x)|x]=f(x)\cdot \mathbb{E}_y[y|x]$

$$\begin{aligned} R(f)&=\mathbb{E}_x\big[\mathbb{E}_y[(y-f(x))^2|x]\big]=\mathbb{E}_x\big[\mathbb{E}_y[y^2|x]+\mathbb{E}_y[f^2(x)|x]-2\mathbb{E}_y[yf(x)|x]\big] \\ &=\mathbb{E}_x\big[\mathbb{E}_y[y^2|x]+f^2(x)-2f(x)\mathbb{E}_y[y|x]\big] \\ &=\mathbb{E}_x\big[\mathbb{E}_y[y^2|x]-(\mathbb{E}_y[y|x])^2\big]+\mathbb{E}_x\big[(f(x)-\mathbb{E}_y[y|x])^2\big] \end{aligned}$$

The first term does not depend on $f$ and, by Jensen's inequality, $\mathbb{E}_y[y^2|x]\geq (\mathbb{E}_y[y|x])^2$, so it is non-negative. The second term is minimized (it becomes zero) exactly when $f(x)=\mathbb{E}_y[y|x]$ for every $x$.

Hence: $f^*(x)=\mathbb{E}_{y\sim p_r(y|x)}[y]$

2-9

Causes of high bias: too few input features; model capacity too low; regularization coefficient $\lambda$ too large.

Causes of high variance: too few training samples; model capacity too high; regularization coefficient $\lambda$ too small; no cross-validation.

2-10

For a single sample, $f^*(x)$ is a constant with respect to the training set $D$, so $E_D[f^*(x)]=f^*(x)$.

$$\begin{aligned} E_D\big[(f_D(x)-f^*(x))^2\big] &= E_D\big[(f_D(x)-E_D[f_D(x)]+E_D[f_D(x)]-f^*(x))^2\big] \\ &=E_D\big[(f_D(x)-E_D[f_D(x)])^2\big]+E_D\big[(E_D[f_D(x)]-f^*(x))^2\big] \\ &\quad +2\,E_D\big[(f_D(x)-E_D[f_D(x)])(E_D[f_D(x)]-f^*(x))\big] \\ &=E_D\big[(f_D(x)-E_D[f_D(x)])^2\big]+(E_D[f_D(x)]-f^*(x))^2 \\ &\quad +2\,(E_D[f_D(x)]-f^*(x))\big(E_D[f_D(x)]-E_D[f_D(x)]\big) \\ &=E_D\big[(f_D(x)-E_D[f_D(x)])^2\big]+(E_D[f_D(x)]-f^*(x))^2 \end{aligned}$$

2-11

(Figure: worked calculation, omitted.)

2-12

Working it out by hand, the answer is 9.

Chapter 3

3-1

Let $\alpha_1,\alpha_2$ be any two points on the hyperplane; then:
$$\begin{cases} \omega^\top\alpha_1+b=0 \\ \omega^\top\alpha_2+b=0 \end{cases}$$
Subtracting the two equations gives $\omega^\top(\alpha_1-\alpha_2)=0$. Since $\alpha_1-\alpha_2$ is an arbitrary vector parallel to the hyperplane, $\omega$ is orthogonal to all such vectors, i.e., $\omega$ is perpendicular to the decision boundary.

3-2

x x x投影到平面 f ( x , ω ) = ω ⊤ x + b = 0 f(x,\omega)=\omega^\top x+b=0 f(x,ω)=ωx+b=0的点为 x ′ x' x,则:可知 x − x ′ x-x' xx垂直于 f ( x , ω ) f(x,\omega) f(x,ω),由3-1有 x − x ′ x-x' xx平行于 ω \omega ω

于是有: δ = ∣ ∣ x − x ′ ∣ ∣ = k ∣ ∣ ω ∣ ∣ \delta=||x-x'||=k||\omega|| δ=xx=kω,又:
{ f ( x , ω ) = ω ⊤ x + b ω ⊤ x 2 + b = 0 \begin{cases} f(x,\omega)=\omega^\top x+b \\\\ \omega^\top x_2+b=0 \end{cases} f(x,ω)=ωx+bωx2+b=0
故有: w T ( x − x ′ ) = f ( x , ω ) w^T(x-x')=f(x,\omega) wT(xx)=f(x,ω),带入 x − x ′ = k ω x-x'=k\omega xx=kω有: k ∣ ∣ ω ∣ ∣ 2 = f ( x , ω ) k||\omega||^2=f(x,\omega) kω2=f(x,ω)

故: δ = ∣ f ( x , ω ) ∣ ∣ ∣ ω ∣ ∣ \delta=\frac{|f(x,\omega)|}{||\omega||} δ=ωf(x,ω)
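
A quick numerical check of the distance formula (the hyperplane and point below are toy values of my own): project $x$ onto the hyperplane along $\omega$ and compare $\|x-x'\|$ with $|f(x,\omega)|/\|\omega\|$:

```python
import numpy as np

w = np.array([2.0, -1.0, 0.5])
b = 0.7
x = np.array([1.0, 3.0, -2.0])

f = w @ x + b
delta = abs(f) / np.linalg.norm(w)        # distance from the formula

# Explicit projection: x' = x - k*w with k = f / ||w||^2, so that w^T x' + b = 0.
k = f / (w @ w)
x_proj = x - k * w
print("on the plane:", bool(np.isclose(w @ x_proj + b, 0.0)))
print("formula distance:", delta, " projection distance:", np.linalg.norm(x - x_proj))
```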

3-3

By the definition of multiclass linear separability:

$\omega_c^\top x_1>\omega_{\tilde{c}}^\top x_1$ and $\omega_c^\top x_2>\omega_{\tilde{c}}^\top x_2$ for every $\tilde{c}\neq c$. Since $\rho \in[0,1]$, both $\rho\geq 0$ and $1-\rho\geq 0$.

Taking the convex combination: $\rho\,\omega_c^\top x_1+(1-\rho)\,\omega_c^\top x_2>\rho\,\omega_{\tilde{c}}^\top x_1+(1-\rho)\,\omega_{\tilde{c}}^\top x_2$, i.e., $\omega_c^\top\big(\rho x_1+(1-\rho)x_2\big)>\omega_{\tilde{c}}^\top\big(\rho x_1+(1-\rho)x_2\big)$, so the point $\rho x_1+(1-\rho)x_2$ is still assigned to class $c$.

3-4

For each class $c$, the discriminant function is $f_c(x;\omega_c)=\omega^\top_c x+b_c,\ c\in \{1,\cdots,C\}$.

Because each class is linearly separable from the samples of all the other classes (one-vs-the-rest), for every sample $x^{(n)}$ of class $c$ we have $f_c(x^{(n)})>0$ while $f_{\tilde{c}}(x^{(n)})<0$ for every $\tilde{c}\neq c$.

Therefore $f_c(x^{(n)})>f_{\tilde{c}}(x^{(n)})$ for all $\tilde{c}\neq c$, i.e., every sample is assigned to its own class by the $\arg\max$ rule, and the whole data set is linearly separable in the multiclass sense.

3-5

In principle the squared loss can also be used for classification, but it is not a good fit. First, minimizing the squared loss is essentially maximum likelihood estimation under the assumption of Gaussian errors, yet the errors of most classification problems are not Gaussian. Moreover, in practice the cross-entropy loss combined with a softmax output makes the gradient large when the loss is large and small when the loss is small, which speeds up learning; with the squared loss the gradient can shrink as the loss grows, so learning becomes slow.

The squared loss of Logistic regression is non-convex: $L=\frac{1}{2}\sum_i(\hat{y}_i-y_i)^2$, $\hat{y}_i=\sigma(\omega^\top x_i)$

$\frac{\partial L}{\partial \omega} =\sum_i(\hat{y}_i-y_i)\frac{\partial \hat{y}_i}{\partial \omega}=\sum_i(\hat{y}_i-y_i)\,\hat{y}_i(1-\hat{y}_i)\,x_i=\sum_i(-\hat{y}_i^3+(y_i+1)\hat{y}_i^2-y_i\hat{y}_i)\,x_i$

Going one step further: $\frac{\partial^2 L}{\partial \omega^2}=\sum_i\big(-3\hat{y}_i^2+2(y_i+1)\hat{y}_i-y_i\big)\,\hat{y}_i(1-\hat{y}_i)\,x_i^2$

Since $\hat{y}_i\in[0,1]$ and $y_i\in\{0,1\}$, the second derivative is not guaranteed to be non-negative, so the loss is not convex in $\omega$.

3-6

If no regularization term is added to limit the size of the weight vector, the weights can become very large (for Softmax regression the risk keeps decreasing as the weights are scaled up), which can cause numerical overflow.


3-7

In the expression to prove, $\overline{\omega}=\frac{1}{T}\sum^K_{k=1}c_k\omega_k$ with $c_k=t_{k+1}-t_k$,

i.e., $\overline{\omega}=\frac{1}{T}\sum^K_{k=1}(t_{k+1}-t_k)\omega_k$; it suffices to show that Algorithm 3.2 computes the same quantity.

By lines 8–9 of the algorithm, the update at the $k$-th mistake is $\Delta_k=y^{(n)}x^{(n)}$, so $\omega_k=\sum_{j\leq k}\Delta_j$, and the auxiliary vector accumulates $u=\sum^K_{k=1}t_{k}\Delta_k$. The algorithm outputs $\overline{\omega}=\omega_K-\frac{1}{T}u$.

Hence Algorithm 3.2 gives: $\overline{\omega}=\sum^K_{k=1}\Delta_k-\frac{1}{T}\sum^K_{k=1}t_k\Delta_k=\frac{1}{T}\sum^K_{k=1}(T-t_k)\Delta_k$

On the other hand, taking $t_{K+1}=T$ (line 12 of the algorithm) and substituting:

$\frac{1}{T}\sum^K_{k=1}(t_{k+1}-t_k)\omega_k=\frac{1}{T}\sum^K_{k=1}(t_{k+1}-t_k)\sum_{j\leq k}\Delta_j=\frac{1}{T}\sum^K_{j=1}(t_{K+1}-t_j)\Delta_j=\frac{1}{T}\sum^K_{k=1}(T-t_k)\Delta_k$,

which is the same quantity. This completes the proof.

3-8

At the $k$-th update: $\omega_k=\omega_{k-1}+\phi(x^{(k)},y^{(k)})-\phi(x^{(k)},z)$, where $z$ is the predicted (incorrect) candidate.

Therefore: $\|\omega_K\|^2=\|\omega_{K-1}+\phi(x^{(K)},y^{(K)})-\phi(x^{(K)},z)\|^2$

$\|\omega_K\|^2=\|\omega_{K-1}\|^2+\|\phi(x^{(K)},y^{(K)})-\phi(x^{(K)},z)\|^2+2\,\omega_{K-1}\cdot\big(\phi(x^{(K)},y^{(K)})-\phi(x^{(K)},z)\big)$

Because $z$ is the candidate that $\omega_{K-1}$ scores highest, the last term $2\,\omega_{K-1}\cdot\big(\phi(x^{(K)},y^{(K)})-\phi(x^{(K)},z)\big)$ is at most 0.

Hence: $\|\omega_K\|^2\leq\|\omega_{K-1}\|^2+R^2$

Iterating down to $\omega_0=0$ gives $\|\omega_K\|^2\leq KR^2$, which is the upper bound.

For the lower bound: since $\|\omega^*\|=1$, the Cauchy–Schwarz inequality gives $\|\omega_K\|^2=\|\omega^*\|^2\cdot\|\omega_K\|^2\geq\langle\omega^{*},\omega_K\rangle^2$.

Substituting $\omega_K=\sum^K_{k=1}\big(\phi(x^{(k)},y^{(k)})-\phi(x^{(k)},z)\big)$:

$\|\omega_K\|^2\geq \Big[\sum^K_{k=1}\big\langle\omega^*,\phi(x^{(k)},y^{(k)})-\phi(x^{(k)},z)\big\rangle\Big]^2$

By the definition of generalized linear separability: $\langle\omega^*,\phi(x^{(k)},y^{(k)})\rangle-\langle\omega^*,\phi(x^{(k)},z)\rangle\geq\gamma$

Therefore: $\|\omega_K\|^2\geq K^2\gamma^2$

Combining the two bounds: $K^2\gamma^2\leq KR^2$, i.e., $K\leq\frac{R^2}{\gamma^2}$, which completes the proof.

3-9

Existence:

Since the data set is linearly separable, the optimization problem has a feasible solution, and by the definition of linear separability the objective function is bounded below, so an optimal solution exists; denote it $(\omega^*,b^*)$.

Because $y\in \{1,-1\}$ (both labels appear), $(\omega^*,b^*)\not=(0,b^*)$: with $\omega=0$ the margin constraints cannot hold for samples of both classes. Thus $\omega^*\not=0$, and a separating hyperplane exists.

Uniqueness (by contradiction):

Suppose there are two optimal hyperplanes, $\omega_1^{*\top}x+b_1^*=0$ and $\omega_2^{*\top}x+b_2^*=0$.

By optimality, $\|\omega_1^*\|=\|\omega_2^*\|=C$ for some constant $C$.

Let $\omega=\frac{\omega_1^*+\omega_2^*}{2}$ and $b=\frac{b_1^*+b_2^*}{2}$; being the average of two feasible solutions, this is also feasible.

Then $C\leq\|\omega\|$, while the triangle inequality for norms gives $\|\omega\|\leq\frac{\|\omega_1^*\|}{2}+\frac{\|\omega_2^*\|}{2}=C$.

Therefore $2\|\omega\|=\|\omega_1^*\|+\|\omega^*_2\|$.

The equality condition of the triangle inequality then gives $\omega_1^*=\lambda\omega_2^*$.

Since $\|\omega_1^*\|=\|\omega_2^*\|$, $\lambda=1$ (the solution $\lambda=-1$ is rejected because it would make $\omega=0$).

So two distinct optimal hyperplanes cannot exist, and the optimal separating hyperplane is unique.

This completes the proof.

3-10

$\phi(x)=[1,\sqrt{2}x_1,\sqrt{2}x_2,\sqrt{2}x_1x_2,x_1^2,x_2^2]^{\top}$, $\phi(z)=[1,\sqrt{2}z_1,\sqrt{2}z_2,\sqrt{2}z_1z_2,z_1^2,z_2^2]^{\top}$

Hence: $\phi(x)^\top\phi(z)=1+2x_1z_1+2x_2z_2+2x_1x_2z_1z_2+x_1^2z_1^2+x_2^2z_2^2=\big(1+(x_1\ x_2)(z_1\ z_2)^\top\big)^2$

i.e., $\phi(x)^\top\phi(z)=(1+x^\top z)^2=k(x,z)$, which completes the proof.
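
A small numerical check of the identity (random test points of my own):

```python
import numpy as np

def phi(v):
    # feature map from 3-10: [1, sqrt(2)v1, sqrt(2)v2, sqrt(2)v1v2, v1^2, v2^2]
    v1, v2 = v
    return np.array([1.0, np.sqrt(2)*v1, np.sqrt(2)*v2, np.sqrt(2)*v1*v2, v1**2, v2**2])

rng = np.random.default_rng(2)
for _ in range(5):
    x, z = rng.normal(size=2), rng.normal(size=2)
    lhs = phi(x) @ phi(z)
    rhs = (1.0 + x @ z) ** 2
    assert np.isclose(lhs, rhs)
print("phi(x)^T phi(z) == (1 + x^T z)^2 for all random pairs tested")
```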

3-11

The primal problem:
$$\begin{array}{c} \min\limits_{w,b,\xi}\ \frac{1}{2}\|w\|^2+C\sum^N_{n=1}\xi_n \\ s.t.\ 1-y_n(w^\top x_n+b)-\xi_n\leq 0,\ \forall n\in\{1,\cdots,N\} \\ \xi_n\geq0,\ \forall n\in\{1,\cdots,N\} \end{array}$$
Introducing Lagrange multipliers:
$$L(w,b,\xi,\lambda,\mu)=\frac{1}{2}\|w\|^2+C\sum^N_{i=1}\xi_i+\sum^N_{i=1}\lambda_i\big(1-y_i(w^\top x_i+b)-\xi_i\big)-\sum^N_{i=1}\mu_i\xi_i$$
The primal can be written as a min–max problem:
$$\begin{array}{c} \min\limits_{w,b,\xi}\ \max\limits_{\lambda,\mu}\ L(w,b,\xi,\lambda,\mu) \\ s.t.\ \lambda_i\geq0,\ \mu_i\geq0,\ \forall i\in\{1,\cdots,N\} \end{array}$$
whose dual is the max–min problem:
$$\begin{array}{c} \max\limits_{\lambda,\mu}\ \min\limits_{w,b,\xi}\ L(w,b,\xi,\lambda,\mu) \\ s.t.\ \lambda_i\geq0,\ \mu_i\geq0,\ \forall i\in\{1,\cdots,N\} \end{array}$$
Solve the inner minimization $\min\limits_{w,b,\xi}L(w,b,\xi,\lambda,\mu)$:

Setting $\frac{\partial L}{\partial b}=0$ gives $\sum^N_{i=1}\lambda_iy_i=0$; substituting into $L$:

$$L(w,b,\xi,\lambda,\mu)=\frac{1}{2}\|w\|^2+C\sum^N_{i=1}\xi_i+\sum^N_{i=1}\lambda_i-\sum^N_{i=1}\lambda_iy_iw^\top x_i-\sum^N_{i=1}\lambda_i\xi_i-\sum^N_{i=1}\mu_i\xi_i$$

Setting $\frac{\partial L}{\partial w}=0$ gives $w-\sum^N_{i=1}\lambda_iy_ix_i=0$, i.e., $w=\sum^N_{i=1}\lambda_iy_ix_i$; substituting back:
$$\begin{aligned} L(w,b,\xi,\lambda,\mu) &=\frac{1}{2}\sum^N_{i=1}\sum^N_{j=1}\lambda_i\lambda_jy_iy_jx^\top_ix_j+C\sum^N_{i=1}\xi_i+\sum^N_{i=1}\lambda_i-\sum^N_{i=1}\sum^N_{j=1}\lambda_i\lambda_jy_iy_jx_i^\top x_j-\sum^N_{i=1}\lambda_i\xi_i-\sum^N_{i=1}\mu_i\xi_i \\ &=-\frac{1}{2}\sum^N_{i=1}\sum^N_{j=1}\lambda_i\lambda_jy_iy_jx^\top_ix_j+\sum^N_{i=1}(C-\lambda_i-\mu_i)\xi_i+\sum^N_{i=1}\lambda_i \end{aligned}$$
Setting $\frac{\partial L}{\partial \xi_i}=0$ gives $C-\lambda_i-\mu_i=0$; substituting once more:

$$L(w,b,\xi,\lambda,\mu)=-\frac{1}{2}\sum^N_{i=1}\sum^N_{j=1}\lambda_i\lambda_jy_iy_jx^\top_ix_j+\sum^N_{i=1}\lambda_i$$

The dual problem is therefore:
$$\begin{array}{c} \max\limits_{\lambda}\ -\frac{1}{2}\sum^N_{i=1}\sum^N_{j=1}\lambda_i\lambda_jy_iy_jx^\top_ix_j+\sum^N_{i=1}\lambda_i \\ s.t.\ \sum^N_{i=1}\lambda_iy_i=0 \\ C-\lambda_i-\mu_i=0,\ \lambda_i\geq 0,\ \mu_i\geq 0,\ \forall i \in\{1,\cdots,N\} \end{array}$$
which simplifies (eliminating $\mu_i$) to:
$$\begin{array}{c} \max\limits_{\lambda}\ -\frac{1}{2}\sum^N_{i=1}\sum^N_{j=1}\lambda_i\lambda_jy_iy_jx^\top_ix_j+\sum^N_{i=1}\lambda_i \\ s.t.\ \sum^N_{i=1}\lambda_iy_i=0,\quad 0\leq \lambda_i \leq C,\ \forall i\in\{1,\cdots,N\} \end{array}$$
The KKT conditions are:
$$\begin{cases} \nabla_w L=w-\sum^N_{i=1}\lambda_iy_ix_i=0 \\ \nabla_b L=-\sum^N_{i=1}\lambda_iy_i=0 \\ \nabla_{\xi_i}L=C-\lambda_i-\mu_i=0 \\ \lambda_i\big(1-y_i(w^\top x_i+b)-\xi_i\big)=0 \\ \mu_i\xi_i=0 \\ 1-y_i(w^\top x_i+b)-\xi_i\leq 0 \\ \xi_i\geq 0,\ \lambda_i\geq 0,\ \mu_i \geq 0 \end{cases}$$

Chapter 4

4-1

With zero-mean (zero-centered) inputs, the neuron operates around 0, where the sigmoid's derivative is largest, so convergence is fastest. With inputs that are not zero-centered, the gradients of the weights $\omega$ are all positive or all negative, which makes the weight updates zigzag and slows gradient descent. Intuitively, zero-centered inputs let the optimization walk a fairly straight path, while non-zero-centered inputs force it to wind back and forth before reaching the minimum.

4-2

The problem asks for a network with two hidden neurons and one output neuron (with ReLU activations), so the network has two sets of weights, $W^{(1)}$ and $w^{(2)}$. Take:
$$W^{(1)}=\begin{bmatrix}1 & 1\\ 1& 1\end{bmatrix},\ b^{(1)}=\begin{bmatrix}0 \\ -1\end{bmatrix},\qquad w^{(2)}=\begin{bmatrix}1 \\ -2\end{bmatrix},\ b^{(2)}=0$$
Feeding in the four inputs
$$X=\begin{bmatrix}0 & 0 & 1 & 1 \\ 0 & 1 & 0 & 1\end{bmatrix}$$
and computing $h=\mathrm{ReLU}(W^{(1)}x+b^{(1)})$, $y=w^{(2)\top}h+b^{(2)}$ reproduces the XOR truth table:

| $x_1$ | $x_2$ | $y$ |
| --- | --- | --- |
| 0 | 0 | 0 |
| 1 | 0 | 1 |
| 0 | 1 | 1 |
| 1 | 1 | 0 |

Experiment code (requires TensorFlow 2.3, run in v1-compatibility mode):

```python
import numpy as np
import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()
# input and output
X=np.array([[0,0],[0,1],[1,0],[1,1]])
Y=np.array([[0],[1],[1],[0]])
x=tf.placeholder(dtype=tf.float32,shape=[None,2])
y=tf.placeholder(dtype=tf.float32,shape=[None,1])
# weights
w1=tf.Variable(tf.random_normal([2,2]))
w2=tf.Variable(tf.random_normal([2,1]))
# biases
b1=tf.Variable([0.1,0.1])
b2=tf.Variable(0.1)
# ReLU activation for the hidden layer, linear output
h=tf.nn.relu(tf.matmul(x,w1)+b1)
output=tf.matmul(h,w2)+b2
# squared loss and Adam optimizer
loss=tf.reduce_mean(tf.square(output-y))
train=tf.train.AdamOptimizer(0.05).minimize(loss)

with tf.Session() as session:
    session.run(tf.global_variables_initializer())

    for i in range(2000):
        session.run(train,feed_dict={x:X,y:Y})
        loss_=session.run(loss,feed_dict={x:X,y:Y})
        if i%50 == 0:
            print("step:%d,loss:%.3f"%(i,loss_))

    print("X:%r"%X)
    print("Pred:%r"%session.run(output,feed_dict={x:X}))
```

4-3

A binary-classification example

The binary cross-entropy loss is: $L(y,\hat{y})=-(y\log\hat{y}+(1-y)\log(1-\hat{y}))$

Its value for the different combinations of $y$ and $\hat{y}$ is:

| $y$ | $\hat{y}$ | $L(y,\hat{y})$ |
| --- | --- | --- |
| 0 | 0 | 0 |
| 1 | 0 | $+\infty$ |
| 0 | 1 | $+\infty$ |
| 1 | 1 | 0 |

To make the loss small, $\hat{y}$ should be as large as possible when $y=1$ and as small as possible when $y=0$. The latter pushes the pre-activation to be very negative; if the weights are updated too aggressively, the unit's pre-activation can become negative for every sample, its gradient then becomes 0, the weights can no longer be updated, and the unit becomes a dead neuron.

Remedies

Use Leaky ReLU, PReLU, ELU, or SoftPlus as the activation function.

A mathematical account of the dying-ReLU problem

Forward pass: $\begin{cases}z=\omega\cdot x\\ a=\mathrm{ReLU}(z)\end{cases}$

With loss $L$, the backward pass is: $\begin{cases}\frac{\partial L}{\partial z}=\frac{\partial L}{\partial a}\cdot\frac{\partial a}{\partial z}\\ \frac{\partial L}{\partial W}=\frac{\partial L}{\partial z}\cdot x^\top\\ \frac{\partial L}{\partial x}=\omega^\top \cdot \frac{\partial L}{\partial z}\end{cases}$

For a fixed learning rate $lr$, the larger the gradient $\frac{\partial L}{\partial W}$, the larger the weight update; the update rule is $W=W-lr\cdot \frac{\partial L}{\partial W}$.

If the gradient is very large and the learning rate also happens to be set too large, the weights get changed by far too much in a single step.

It can then happen that for every training sample $x_i$ the unit's pre-activation is negative: $z_i=W\cdot x_i<0,\ \forall x_i\in D_{train}$

In that case $a_i=\max\{z_i,0\}=0$, so the parameters connected to this neuron receive no gradient any more and the neuron stays "dormant".
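
A tiny numpy sketch of the effect just described (the data, the constant upstream gradient, and the oversized step are illustrative assumptions of mine): after one far-too-large update the unit's pre-activation is negative for every sample, so its weight gradient is exactly zero from then on:

```python
import numpy as np

rng = np.random.default_rng(3)
X = np.abs(rng.normal(size=(8, 2)))       # all-positive toy inputs
w = np.array([0.5, 0.3])                  # one ReLU unit, no bias

def grad(w):
    z = X @ w                             # pre-activations of the unit
    # assume the upstream gradient dL/da is 1 for every sample;
    # then dL/dw = X^T (1 * indicator[z > 0])
    return X.T @ (z > 0).astype(float)

print("gradient before the bad step:", grad(w))
w = w - 100.0 * grad(w)                   # one far-too-large gradient step
print("pre-activations now all negative:", bool(np.all(X @ w < 0)))
print("gradient after the bad step:", grad(w))   # all zeros: the unit is dead
```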

4-4

Derivative of Swish: $\frac{\mathrm{d}\,x\sigma(\beta x)}{\mathrm{d}x}=\sigma(\beta x)+x\sigma(\beta x)(1-\sigma(\beta x))\beta=\frac{1+(\beta x+1)\cdot \exp(-\beta x)}{(1+\exp(-\beta x))^2}$

Derivative of GELU: $\frac{\mathrm{d}\,\mathrm{GELU}(x)}{\mathrm{d}x}=P(X\leq x)+x\frac{\mathrm{d}P(X\leq x)}{\mathrm{d}x}=\Phi(x)+\frac{x\cdot \exp(-\frac{x^2}{2})}{\sqrt{2\pi}}$
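
A finite-difference check of both derivatives (a sketch I added; the test points are arbitrary):

```python
import math
import numpy as np

def swish(x, beta=1.0):
    return x / (1.0 + math.exp(-beta * x))

def gelu(x):
    # GELU(x) = x * Phi(x), with Phi the standard normal CDF
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def swish_grad(x, beta=1.0):
    s = 1.0 / (1.0 + math.exp(-beta * x))
    return s + x * beta * s * (1.0 - s)

def gelu_grad(x):
    phi = math.exp(-x * x / 2.0) / math.sqrt(2.0 * math.pi)   # standard normal pdf
    Phi = 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))          # standard normal CDF
    return Phi + x * phi

h = 1e-6
for x in [-2.0, -0.5, 0.0, 1.3, 3.0]:
    num_s = (swish(x + h) - swish(x - h)) / (2 * h)
    num_g = (gelu(x + h) - gelu(x - h)) / (2 * h)
    assert np.isclose(num_s, swish_grad(x), atol=1e-5)
    assert np.isclose(num_g, gelu_grad(x), atol=1e-5)
print("analytic derivatives match central finite differences")
```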

4-5

Input layer to the first hidden layer: $M_0\cdot\frac{N}{L}$ weights and $\frac{N}{L}$ biases.

Between the $L$ hidden layers: $(L-1)\cdot(\frac{N}{L})^2$ weights and $(L-1)\cdot\frac{N}{L}$ biases.

Last hidden layer to the output layer: $\frac{N}{L}$ weights and $1$ bias.

In total: $(L-1)(\frac{N}{L})^2+\frac{M_0N}{L}+\frac{N}{L}+N+1$

4-6

The universal approximation theorem also applies to feed-forward networks with a linear output layer and at least one hidden layer using the ReLU activation function.

Short on time (thesis deadline); for an in-depth derivation see: https://arxiv.org/pdf/1505.03654.pdf

4-7

The purpose of the regularization term is to prevent overfitting, i.e., to keep the model from being overly sensitive to small changes in the input. The bias produces the same effect for every input, so regularizing it does not help prevent overfitting.

Take linear regression as an example: the solution is $\omega=(XX^\top+\lambda I)^{-1}Xy$. The bias does not enter $\omega$, and the quality of the fit is governed by $\omega$, so there is no need to regularize the bias.

4-8

If $\omega$ and $b$ are both initialized to 0, the initial net input is $z=\omega^\top x+b=0$, so in the first forward pass all hidden neurons have the same activation. During backpropagation all their weights then receive identical updates, and the hidden neurons never become distinguishable from one another. This is known as the symmetric-weights phenomenon.

4-9

It can help, but it does not necessarily give the desired result: a larger learning rate may overshoot the optimum and can even cause the gradients to explode.

Chapter 5

5-1

1) Proof:
$$\begin{aligned} x(t+1)-x(t)-\big(x(t)-x(t-1)\big) &\approx \frac{x(t+\Delta t)-x(t)}{\Delta t}-\frac{x(t)-x(t-\Delta t)}{\Delta t}\Big|_{\Delta t=1} \\ &\approx x'(t)-x'(t-1)\approx x''(t) \end{aligned}$$
That is, the finite difference $x(t+1)-x(t)-(x(t)-x(t-1))=x(t+1)+x(t-1)-2x(t)$ approximates the second derivative $x''(t)$.

2) It follows directly that in two dimensions:
$$\frac{\partial^2 f}{\partial x^2}\approx f(x+1,y)+f(x-1,y)-2f(x,y),\qquad \frac{\partial^2 f}{\partial y^2}\approx f(x,y+1)+f(x,y-1)-2f(x,y)$$
so the two-dimensional second-order difference is:
$$\frac{\partial^2 f}{\partial x^2}+\frac{\partial^2 f}{\partial y^2}\approx f(x+1,y)+f(x-1,y)+f(x,y+1)+f(x,y-1)-4f(x,y)$$
which corresponds to the convolution kernel:
$$\begin{bmatrix} 0 & 1 & 0 \\ 1 & -4 & 1 \\ 0 & 1 & 0 \end{bmatrix}$$
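
To see the kernel in action, the following sketch (arbitrary image values of my own) applies the $3\times 3$ Laplacian kernel with a plain valid cross-correlation and compares one output against the finite-difference formula above:

```python
import numpy as np

lap = np.array([[0, 1, 0],
                [1, -4, 1],
                [0, 1, 0]], dtype=float)

rng = np.random.default_rng(4)
img = rng.normal(size=(6, 6))

def correlate2d_valid(x, k):
    kh, kw = k.shape
    H, W = x.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

out = correlate2d_valid(img, lap)

# Compare with f(x+1,y)+f(x-1,y)+f(x,y+1)+f(x,y-1)-4f(x,y) at an interior point.
i, j = 2, 3
direct = img[i+1, j] + img[i-1, j] + img[i, j+1] + img[i, j-1] - 4 * img[i, j]
print(bool(np.isclose(out[i-1, j-1], direct)))   # True
```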

5-2

By the definition of two-dimensional convolution: $y_{ij}=\sum^U_{u=1}\sum^V_{v=1}w_{uv}\cdot x_{i-u+1,\,j-v+1}$

Change variables: $k=i-u+1,\ d=j-v+1$, i.e., $u=i-k+1,\ v=j-d+1$. The sum becomes:
$$y_{ij}=\sum_{k=i-U+1}^{i}\sum_{d=j-V+1}^{j}x_{kd}\cdot w_{i-k+1,\,j-d+1}$$
which is the same convolution with the roles of $W$ and $X$ exchanged.

Wide convolution only zero-pads $X$ before convolving, and the identity above still holds on the padded input, so wide convolution also satisfies commutativity.

5-3

  1. Reduce the depth (number of channels) of the feature maps, as the $1\times 1$ convolutions do in the Inception v1 module.
  2. Remove redundant information across feature maps: a $1\times 1$ convolution extracts features across the channel dimension.
  3. Reduce the time complexity; see 5-4.

5-4

1) Time complexity: $O(M^2\cdot K^2\cdot C_{in}\cdot C_{out})=5.898\times 10^9$, with $M=100,\ K=3,\ C_{in}=256,\ C_{out}=256$.

2) Time complexity: $O(M^2\cdot K^2\cdot C_{in}\cdot C_{out})+O(M^2\cdot K'^2\cdot C'_{in}\cdot C'_{out})=1.6384\times 10^9$, with $M=100$, $K=1$, $C_{in}=256$, $C_{out}=64$ for the $1\times1$ convolution and $K'=3$, $C'_{in}=64$, $C'_{out}=256$ for the $3\times3$ convolution.

5-5

X X X其根据列优先展开为九维向量:

X = [ a 11 a 12 a 13 a 21 a 22 a 23 a 31 a 32 a 33 ] = [ α 1 ⋅ α 2 ⋅ α 3 ⋅ ] 展 开 之 后 有 : X ′ = [ α 1 ⋅ α 2 ⋅ α 3 ⋅ ] X=\left[ \begin{array}{c} a_{11} & a_{12} & a_{13} \\\\ a_{21} & a_{22} & a_{23} \\\\ a_{31} & a_{32} & a_{33} \end{array}\right] =\left[\begin{array}{c}\alpha_{1\cdot} & \alpha_{2\cdot} &\alpha_{3\cdot}\end{array}\right] \\展开之后有:X'=\left[\begin{array}{c}\alpha_{1\cdot} \\ \alpha_{2\cdot} \\ \alpha_{3\cdot}\end{array}\right] X=a11a21a31a12a22a32a13a23a33=[α1α2α3]X=α1α2α3

因此我们的仿射变换的 W W W为:

W = [ w 11 w 12 w 21 w 22 ] W ′ = [ w 11 w 21 0 w 12 w 22 0 0 0 0 0 0 0 w 11 w 21 0 w 12 w 22 0 0 w 11 w 21 0 w 12 w 22 0 0 0 0 0 0 0 w 11 w 21 0 w 12 w 22 ] W=\left[ \begin{array}{c} w_{11} & w_{12} \\ w_{21} & w_{22} \end{array}\right]\\ W'=\left[ \begin{array}{c} w_{11} & w_{21} & 0 & w_{12} & w_{22} & 0 & 0 & 0 & 0 \\\\ 0 & 0 & 0 & w_{11} & w_{21} & 0 & w_{12} & w_{22} & 0 \\\\ 0 & w_{11} & w_{21} & 0 & w_{12} & w_{22} & 0 & 0 & 0 \\\\ 0 & 0 & 0 & 0 & w_{11} & w_{21} & 0 & w_{12} & w_{22} \\\\ \end{array}\right] W=[w11w21w12w22]W=w11000w210w11000w210w12w1100w22w21w12w1100w22w210w12000w220w12000w22
因此放射变换为: A f f i n e ( W , X ) = W ′ X ′ Affine(W,X)=W'X' Affine(W,X)=WX
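
A numpy sketch (random values of my own) that builds $W'$ exactly as above, flattens $X$ column by column, and checks that $W'X'$ matches the direct $2\times2$ sliding-window computation (cross-correlation, as used in CNNs):

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(3, 3))
W = rng.normal(size=(2, 2))
w11, w12 = W[0]
w21, w22 = W[1]

Wp = np.array([
    [w11, w21, 0, w12, w22, 0, 0, 0, 0],
    [0, 0, 0, w11, w21, 0, w12, w22, 0],
    [0, w11, w21, 0, w12, w22, 0, 0, 0],
    [0, 0, 0, 0, w11, w21, 0, w12, w22],
])
Xp = X.flatten(order="F")          # column-major flattening, as in the text

affine = Wp @ Xp                   # [y11, y12, y21, y22]

# Direct sliding-window computation (stride 1, no padding).
direct = np.array([np.sum(W * X[i:i+2, j:j+2]) for i in range(2) for j in range(2)])
print(bool(np.allclose(affine, direct)))  # True
```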

5-6

For $y=\max(x_1,\cdots,x_D)$, when all the $x_i$ are distinct the function is differentiable and $\frac{\partial y}{\partial x_i}=\begin{cases}1 & x_i=\max(x_1,\cdots,x_D)\\ 0 & \text{otherwise}\end{cases}$ (at ties the derivative is not defined).

$y=\arg\max(x_1,\cdots,x_D)$ is piecewise constant, so its derivative is 0 wherever it exists and does not exist where the maximizing index changes; $\arg\max$ is therefore not (usefully) differentiable.

5-7

At layer $l$ (ignoring the activation function), the forward computation is $z^{(l+1)}=W^{(l+1)}z^{(l)}$, while the error term propagates backwards as $\delta^{(l)}=(W^{(l+1)})^\top \delta^{(l+1)}$.

The forward pass multiplies by $W^{(l+1)}$ and the backward pass multiplies by $(W^{(l+1)})^\top$; ignoring the activation function, forward computation and backpropagation are therefore related by a transpose.

5-8

With dilation $D$, the effective kernel size is $K'=K+(K-1)\times(D-1)$. For an equal-width convolution with stride 1, the zero padding $P$ on each side must satisfy $2P+1=K'$, i.e., $P=\frac{K+(K-1)\times(D-1)-1}{2}$.

Chapter 6

6-1

Similarity: both share (tie) their weights.

Differences: in a time-delay neural network, the activity of a neuron in the current layer depends only on the activities of the previous layer at the most recent $K$ time steps, whereas in a recurrent neural network the activity at the current time step depends on all previous time steps. Recurrent networks and time-delay networks share weights along the time dimension, while convolutional networks share weights in space.

6-2

The loss at time $t$ is $L_t=L(y_t,g(h_t))$, so the total loss is $L=\sum^T_{t=1}L_t$.

The gradient with respect to $W$ is $\frac{\partial L}{\partial W}=\sum^T_{t=1}\frac{\partial L_t}{\partial W}$, and with respect to $b$ it is $\frac{\partial L}{\partial b}=\sum^T_{t=1}\frac{\partial L_t}{\partial b}$.

For an element $w_{ij}$ of $W$ and an element $b_i$ of $b$, introduce the net input at each time step $k$: $z_k=Uh_{k-1}+Wx_k+b$. Then:
$$\frac{\partial L_t}{\partial w_{ij}}=\sum^t_{k=1}\frac{\partial^+ z_k}{\partial w_{ij}}\,\frac{\partial L_t}{\partial z_{k}},\qquad \frac{\partial L_t}{\partial b_{i}}=\sum^t_{k=1}\frac{\partial^+ z_k}{\partial b_{i}}\,\frac{\partial L_t}{\partial z_{k}}$$
where $\frac{\partial^+z_k}{\partial w_{ij}}= [0,\cdots,[x_k]_j,\cdots,0] \triangleq\mathbb{I}_i([x_k]_j)$ and $\frac{\partial^+z_k}{\partial b_{i}}=[0,\cdots,1,\cdots,0]\triangleq\mathbb{I}_i(1)$, with the non-zero entry in position $i$.

Writing $\delta_{t,k}=\frac{\partial L_t}{\partial z_k}$, which by Eq. (6.36) satisfies $\delta_{t,k}=\mathrm{diag}(f'(z_k))\,U^\top\delta_{t,k+1}$, and combining the equations above:

$$\frac{\partial L_t}{\partial w_{ij}}=\sum_{k=1}^t[\delta_{t,k}]_i\,[x_{k}]_j \;\Rightarrow\; \frac{\partial L_t}{\partial W}=\sum_{k=1}^t\delta_{t,k}x^\top_k,\qquad \frac{\partial L_t}{\partial b_{i}}=\sum_{k=1}^t[\delta_{t,k}]_i \;\Rightarrow\; \frac{\partial L_t}{\partial b}=\sum_{k=1}^t\delta_{t,k}$$

Therefore:
$$\frac{\partial L}{\partial W}=\sum^T_{t=1}\sum^t_{k=1}\delta_{t,k}x^\top_k,\qquad \frac{\partial L}{\partial b}=\sum^T_{t=1}\sum^t_{k=1}\delta_{t,k}$$

6-3

Cause

With the update $h_t=h_{t-1}+g(x_t,h_{t-1};\theta)$ it is still possible that $\|\mathrm{diag}(f'(z_k))U^\top\|> 1$. Let $z_k=Uh_{k-1}+Wx_k+b$ be the input of $g(\cdot)$ at time step $k$ and $\delta_{t,k}=\frac{\partial L_t}{\partial z_k}$; then:

$$\delta_{t,k}=\mathrm{diag}(f'(z_k))U^\top \delta_{t,k+1}=\Big(\prod^{t-1}_{i=k}\mathrm{diag}(f'(z_i))U^\top\Big)\delta_{t,t}$$

So when $t-k\rightarrow +\infty$ and the norm of each factor exceeds 1, $\|\delta_{t,k}\|\rightarrow +\infty$: the gradient can still explode.

Remedies
  1. Use a gated recurrent network (e.g., LSTM or GRU);
  2. Use a smaller $U$ and a suitable $f(\cdot)$ so that $\|\mathrm{diag}(f'(z_t))U^\top\|\approx 1$.

6-4

Input vector at time $t$: $x_t$

Weights: $W_i,W_c,W_f,W_o,U_i,U_c,U_f,U_o$

Biases: $b_i,b_c,b_f,b_o$

$\sigma$: the sigmoid function

$V$: the output-layer weights, $V=\frac{\partial o_t}{\partial h_t}$, where $o_t$ here denotes the network output at time $t$ (not the output gate)

Forward pass (a numpy sketch of one step is given right after this list)
  1. Forget gate: $f_t = \sigma(W_fx_t+U_fh_{t-1}+b_f)$
  2. Input gate: $i_t = \sigma(W_ix_t+U_ih_{t-1}+b_i)$, $\tilde{c_t} = \tanh(W_cx_t+U_ch_{t-1}+b_c)$
  3. Cell state: $c_t = f_t\odot c_{t-1}+i_t\odot\tilde{c_t}$
  4. Output gate: $o_t = \sigma(W_ox_t+U_oh_{t-1}+b_o)$, $h_t = o_t\odot \tanh(c_t)$
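
Here is the promised minimal numpy sketch of one forward step using the equations above (random weights and dimensions are my own choices, not part of the exercise):

```python
import numpy as np

rng = np.random.default_rng(6)
d_in, d_h = 4, 3                                    # input and hidden sizes (arbitrary)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Weights W_*, U_* and biases b_* for the three gates and the candidate cell.
W = {g: rng.normal(scale=0.5, size=(d_h, d_in)) for g in "fico"}
U = {g: rng.normal(scale=0.5, size=(d_h, d_h)) for g in "fico"}
b = {g: np.zeros(d_h) for g in "fico"}

def lstm_step(x_t, h_prev, c_prev):
    f_t = sigmoid(W["f"] @ x_t + U["f"] @ h_prev + b["f"])      # forget gate
    i_t = sigmoid(W["i"] @ x_t + U["i"] @ h_prev + b["i"])      # input gate
    c_tilde = np.tanh(W["c"] @ x_t + U["c"] @ h_prev + b["c"])  # candidate cell
    c_t = f_t * c_prev + i_t * c_tilde                          # new cell state
    o_t = sigmoid(W["o"] @ x_t + U["o"] @ h_prev + b["o"])      # output gate
    h_t = o_t * np.tanh(c_t)                                    # new hidden state
    return h_t, c_t

h, c = np.zeros(d_h), np.zeros(d_h)
for t in range(5):                                              # run a short random sequence
    h, c = lstm_step(rng.normal(size=d_in), h, c)
print("h_5 =", h)
```
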
Backward pass

The gradients of the hidden state $h_t$ and the cell state $c_t$ are $\delta h_t=\frac{\partial L}{\partial h_t}$ and $\delta c_t =\frac{\partial L}{\partial c_t}$.

Loss: $L(t)=l(t)+L(t+1) = l(t)+\sum\limits^{\tau}_{i=t+1}l(i)$

At the final time step $\tau$:

$\delta h_\tau=(\frac{\partial o_\tau}{\partial h_\tau})^\top\frac{\partial L(\tau)}{\partial o_\tau}=V^\top(\hat{y_\tau}-y_\tau)$

$\delta c_\tau = (\frac{\partial h_\tau}{\partial c_\tau})^\top\frac{\partial L(\tau)}{\partial h_\tau}=o_\tau \odot (1-\tanh^2(c_\tau))\odot\delta h_\tau$

Propagating from time $t+1$ back to time $t$:

$\delta h_t = \frac{\partial L}{\partial h_t}=\frac{\partial l(t)}{\partial h_t}+(\frac{\partial h_{t+1}}{\partial h_t})^\top\frac{\partial L(t+1)}{\partial h_{t+1}}=V^\top(\hat{y_t}-y_t)+(\frac{\partial h_{t+1}}{\partial h_t})^\top \delta h_{t+1}$

With $\triangle c_t = o_{t+1}\odot [1-\tanh^2(c_{t+1})]$:

$$\frac{\partial h_{t+1}}{\partial h_t}=\mathrm{diag}[o_{t+1}\odot (1-o_{t+1})\odot \tanh(c_{t+1})]\,U_o+\mathrm{diag}[\triangle c_t\odot f_{t+1}\odot (1-f_{t+1})\odot c_t]\,U_f \\ +\mathrm{diag}\{\triangle c_t\odot i_{t+1}\odot[1-(\tilde{c}_{t+1})^2]\}\,U_c+\mathrm{diag}[\triangle c_t\odot \tilde{c}_{t+1}\odot i_{t+1}\odot (1-i_{t+1})]\,U_i$$

$\delta c_t = (\frac{\partial c_{t+1}}{\partial c_t})^\top\frac{\partial L}{\partial c_{t+1}}+(\frac{\partial h_t}{\partial c_t})^\top\frac{\partial L}{\partial h_t}=(\frac{\partial c_{t+1}}{\partial c_t})^\top\delta c_{t+1}+\delta h_t \odot o_t \odot (1-\tanh^2(c_t))$

Hence: $\delta c_t=\delta c_{t+1}\odot f_{t+1}+\delta h_t\odot o_t \odot (1-\tanh^2(c_t))$

Finally, for example: $\frac{\partial L}{\partial U_f}=\sum\limits ^\tau_{t=1}[\delta c_t \odot c_{t-1}\odot f_t \odot (1-f_t)]\,h_{t-1}^\top$, and analogously for the other parameter matrices (with $x_t^\top$ in place of $h_{t-1}^\top$ for the $W$'s).

6-5

The GRU reduces the three gates to two: an update gate $z_t$ and a reset gate $r_t$; the cell state $c$ and the output $o$ are merged into a single state $h$.

Forward pass (a numpy sketch of one step follows below)

$r_t=\sigma(W_r\cdot [h_{t-1},x_t])$

$z_t=\sigma(W_z\cdot [h_{t-1},x_t])$

$\hat{h_t}=\tanh(W_{\hat{h}}\cdot [r_t\odot h_{t-1}, x_t])$

$h_t = (1-z_t)\odot h_{t-1} + z_t \odot \hat{h_t}$

$y_t = \sigma(W_o\cdot h_t)$
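
The promised numpy sketch of one GRU forward step (random weights and dimensions of my own):

```python
import numpy as np

rng = np.random.default_rng(7)
d_in, d_h = 4, 3

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One weight matrix per gate, acting on the concatenation [h_{t-1}, x_t].
W_r = rng.normal(scale=0.5, size=(d_h, d_h + d_in))
W_z = rng.normal(scale=0.5, size=(d_h, d_h + d_in))
W_h = rng.normal(scale=0.5, size=(d_h, d_h + d_in))

def gru_step(x_t, h_prev):
    hx = np.concatenate([h_prev, x_t])
    r_t = sigmoid(W_r @ hx)                                     # reset gate
    z_t = sigmoid(W_z @ hx)                                     # update gate
    h_hat = np.tanh(W_h @ np.concatenate([r_t * h_prev, x_t]))  # candidate state
    return (1 - z_t) * h_prev + z_t * h_hat

h = np.zeros(d_h)
for t in range(5):
    h = gru_step(rng.normal(size=d_in), h)
print("h_5 =", h)
```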

Backward pass

There is a very detailed derivation on Zhihu, so it is not repeated here (it is rather long).

6-6

GRU, LSTM, bidirectional LSTM, recursive neural networks, graph neural networks.

6-7

Comparing with the figure in the book:

Recursive neural network

(Figure: structure of a recursive neural network, omitted.)

When the recursive neural network degenerates to the structure in Figure 6.11, each node $h_i$ receives only the input $x_i$ and the output is produced only at the very end, which is exactly the sequence-to-class mode of a recurrent neural network.

Chapter 7

7-1

In mini-batch gradient descent:
$$g_t(\theta) = \frac{1}{K}\sum_{(x,y)\in S_t}\frac{\partial L(y,f(x;\theta))}{\partial \theta},\qquad \theta_t = \theta_{t-1}-\alpha g_t$$
Write $g_t = \frac{1}{K}\delta$, where $\delta$ is the gradient summed over the batch; then $\theta_t = \theta_{t-1}-\frac{\alpha}{K}\delta$.

To keep the effective step size $\frac{\alpha}{K}$ per summed gradient roughly constant when the batch size changes, the learning rate should be set proportionally to the batch size.

7-2

In the Adam algorithm:
$$M_t = \beta_1 M_{t-1}+(1-\beta_1)g_t,\qquad G_t = \beta_2 G_{t-1} + (1-\beta_2)\,g_t\odot g_t$$
When $\beta_1\rightarrow 1$ and $\beta_2 \rightarrow 1$:
$$\lim_{\beta_1\rightarrow 1} M_t = M_{t-1},\qquad \lim_{\beta_2\rightarrow 1} G_t = G_{t-1}$$
Since $M_0=G_0=0$, during the first iterations $M_t$ and $G_t$ therefore stay close to 0 even when the gradients are not small, i.e., they are biased toward 0, which is why the bias correction is needed.
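
A small numerical illustration of why the correction matters (a sketch with a constant gradient of 1, my own choice): without correction the first-step estimates are far smaller than the true gradient statistics, and dividing by $1-\beta^t$ fixes this:

```python
import numpy as np

beta1, beta2 = 0.9, 0.999
g = 1.0                          # pretend the gradient is constantly 1
M, G = 0.0, 0.0                  # initialized to zero, as in Adam

for t in range(1, 6):
    M = beta1 * M + (1 - beta1) * g
    G = beta2 * G + (1 - beta2) * g * g
    M_hat = M / (1 - beta1 ** t)     # bias-corrected estimates
    G_hat = G / (1 - beta2 ** t)
    print(f"t={t}: M={M:.4f} G={G:.6f}  ->  M_hat={M_hat:.4f} G_hat={G_hat:.4f}")
# Without correction M_1 = 0.1 and G_1 = 0.001, even though every gradient equals 1.
```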

7-3

  1. Mini-batch gradient descent: $G_t = 1-\epsilon$, $M_t = g_t(\theta)=\frac{1}{K}\sum_{(x,y)\in S_t}\frac{\partial L(y,f(x;\theta))}{\partial \theta}$
  2. AdaGrad: $G_t=\sum_{\tau =1}^{t} g_{\tau}\odot g_{\tau}$, $M_t = g_t(\theta)$
  3. RMSprop: $G_t = (1-\beta)\sum_{\tau=1}^t\beta^{t-\tau}g_{\tau}\odot g_{\tau}$, $M_t = g_t(\theta)$
  4. AdaDelta: $G_t = (1-\beta_1)\sum_{\tau=1}^t\beta_1^{t-\tau}g_{\tau}\odot g_{\tau}$, with the learning rate $\alpha$ additionally replaced by $\sqrt{\Delta X^2_{t-1}+\epsilon}$, $M_t = g_t(\theta)$
  5. Momentum: $G_t = 1 - \epsilon$, $M_t = \sum\limits^t_{\tau=1}\rho^{t-\tau}g_{\tau}$
  6. Nesterov accelerated gradient: $G_t = 1 - \epsilon$, $M_t = \sum_{\tau=1}^t \rho^{t-\tau}\hat{g}_{\tau}$, where $\hat{g}_{\tau}$ is the gradient evaluated at $\theta_{\tau-1}+\rho \triangle\theta_{\tau-1}$
  7. Adam: $G_t =\beta_2G_{t-1}+(1-\beta_2)g_t\odot g_t$, $M_t = \beta_1M_{t-1}+(1-\beta_1)g_t$

7-4

**Note:** this derivation has some problems; I will fix it later.

By Eq. (7.41): $var(a^{(l)})=M_{l-1} \cdot var(w_i^{(l)})\,var(a_i^{(l-1)})$

In backpropagation, by symmetry the signal flows back through the same weights: $a^{(l-1)} = \sum\limits^{M_{l}}_{i=1}w_i^{(l)}a^{(l)}_i$

To keep the scale unchanged in the backward direction we need $var(a^{(l)})=var(a^{(l-1)})$.

Since $a^{(l)}$ and $w^{(l)}$ are mutually independent and $E(a^{(l-1)})=\sum\limits^{M_l}_{i=1}E(w_i^{(l)})E(a_i^{(l)})=0$,

$var(a^{(l-1)})=var\Big(\sum\limits^{M_{l}}_{i=1}w_i^{(l)}a_i^{(l)}\Big)=M_l\, var(w_i^{(l)})\,var(a_i^{(l)})$

so we need: $var(w_i^{(l)})=\frac{1}{M_l}$

Similarly, Eq. (7.44) is obtained by requiring that neither the forward pass nor the backward pass amplifies or attenuates the signal. Combining Eq. (7.41) with the backward relation above:

$var(a^{(l-1)})+var(a^{(l)})=M_{l-1}var(w_i^{(l)})var(a_i^{(l-1)})+M_lvar(w_i^{(l)})var(a_i^{(l)})$

With $var(a^{(l-1)})=var(a^{(l)})$ this gives $(M_{l-1}+M_{l})\,var(w_i^{(l)})=2$,

hence: $var(w_i^{(l)})=\frac{2}{M_l+M_{l-1}}$

7-5

With the ReLU activation function the forward pass gives: $a^{(l)}=\sum^{M_{l-1}}_{i=1}w_i^{(l)}x_i^{(l)}$, where $x_i^{(l)}=\mathrm{ReLU}(a_i^{(l-1)})$.

By the variance formula (zero-mean, independent weights): $var(a^{(l)})=M_{l-1}\,var(w_i^{(l)})\,E[(x_i^{(l)})^2]$

Integrating over the symmetric, zero-mean distribution of $a^{(l-1)}$: $E[(x_i^{(l)})^2]=\frac{1}{2}var(a^{(l-1)})$

Therefore $var(a^{(l)})=\frac{1}{2}M_{l-1}\,var(w_i^{(l)})\,var(a^{(l-1)})$, and keeping the variance constant requires: $var(w_i^{(l)})=\frac{2}{M_{l-1}}$
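
A Monte-Carlo check of this result (layer width and sample count are my own assumptions): with $var(w)=2/M_{l-1}$ and ReLU inputs, the pre-activation variance stays roughly constant across layers:

```python
import numpy as np

rng = np.random.default_rng(8)
M = 512                                   # width of every layer (assumption)
a = rng.normal(size=(M, 2000))            # a^(0): zero-mean pre-activations, var 1

for l in range(5):
    x = np.maximum(a, 0.0)                # ReLU
    W = rng.normal(scale=np.sqrt(2.0 / M), size=(M, M))   # He initialization
    a = W @ x                             # next pre-activation
    print(f"layer {l+1}: var(a) = {a.var():.3f}")
# The printed variances stay close to 1 instead of exploding or vanishing.
```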

7-6

Let $a^{(l)}=f(z^{(l)})=f(Wa^{(l-1)}+b)$, and let the net inputs of layer $l$ over a mini-batch be $z^{(1,l)},z^{(2,l)},\cdots,z^{(K,l)}$. Normalizing the net input computes:
$$\begin{cases} \mu_B = \frac{1}{K}\sum_{k=1}^Kz^{(k,l)} = \frac{1}{K}\sum_{k=1}^K\big(W a^{(k,l-1)}+b\big) \\ \sigma_B^2=\frac{1}{K}\sum_{k=1}^K (z^{(k,l)}-\mu_B)\odot (z^{(k,l)}-\mu_B) \\ BN(z^{(l)})=\frac{z^{(l)}-\mu_B}{\sqrt{\sigma_B^2+\epsilon}}\odot \gamma +\beta \end{cases}$$
Both $f(BN(Wa^{(l-1)}+b))$ and $f(W\,BN(a^{(l-1)})+b)$ place the BN layer before the activation function; the structural difference is that the former normalizes the net input of the current layer, while the latter normalizes the output of the previous layer:
$$\begin{cases} \mu_A = \frac{1}{K}\sum_{k=1}^Ka^{(k,l-1)} \\ \sigma_A^2=\frac{1}{K}\sum_{k=1}^K (a^{(k,l-1)}-\mu_A)\odot (a^{(k,l-1)}-\mu_A) \\ BN(a^{(l-1)})=\frac{a^{(l-1)}-\mu_A}{\sqrt{\sigma_A^2+\epsilon}}\odot \gamma +\beta \end{cases}$$
In the first form the bias $b$ is removed by the normalization (it is absorbed into $\mu_B$) and the quantity fed into $f$ is standardized; in the second form,
$$W\,BN(a^{(l-1)})+b = W\Big(\frac{a^{(l-1)}-\mu_A}{\sqrt{\sigma_A^2+\epsilon}}\odot \gamma +\beta\Big) + b,$$
the bias is added back after the normalization, so the net input that enters $f$ is no longer guaranteed to be standardized. Normalizing the net input therefore makes fuller use of the normalization and yields a smoother, better-conditioned distribution.

7-7

The scaling and shifting parameters ($\gamma$ and $\beta$) in a BN layer preserve the model's capacity: if necessary, the original distribution can be recovered. They also let BN adapt to different activation functions by automatically adjusting the input distribution, which helps prevent problems such as dying ReLU units.

7-8

Batch normalization normalizes each neuron of an intermediate layer over a mini-batch and assumes that the distribution of the net input is static. In a recurrent neural network the distribution of a neuron's net input changes from time step to time step (it is dynamic), so batch normalization cannot be applied directly.

7-9

They are not equivalent; this is explained in the paper.

(Figure: excerpt from the paper, omitted.)

7-10

Applying dropout directly to a recurrent neural network drops part of the hidden state at every time step, which damages the network's memory along the time dimension.

7-11

$l(\tilde{y},f(x;\theta))=-\tilde{y}\log(f(x;\theta))$
