Adding regularization will often help to prevent overfitting (the high-variance problem).
1. Logistic regression
Recall the optimization objective used during training:
$$\min_{w,b} J(w,b), \quad w \in \mathbb{R}^{n_x},\ b \in \mathbb{R} \tag{1-1}$$
where
$$J(w,b) = \frac{1}{m}\sum_{i=1}^{m} \mathcal{L}(\hat{y}^{(i)}, y^{(i)}) \tag{1-2}$$
L2 regularization (most commonly used):
$$J(w,b) = \frac{1}{m}\sum_{i=1}^{m} \mathcal{L}(\hat{y}^{(i)}, y^{(i)}) + \frac{\lambda}{2m}\|w\|_2^2 \tag{1-3}$$
where
$$\|w\|_2^2 = \sum_{j=1}^{n_x} w_j^2 = w^T w \tag{1-4}$$
Why do we regularize only the parameter w? Because w is usually a high-dimensional parameter vector while b is a scalar; almost all of the parameters are in w rather than in b.
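As a minimal NumPy sketch of equation (1-3) (the function name, argument layout, and the choice of cross-entropy loss are illustrative assumptions, not taken from the original notes):

```python
import numpy as np

def l2_regularized_cost(w, b, X, Y, lambd):
    """Cross-entropy cost for logistic regression plus an L2 penalty on w (eq. 1-3).

    X: inputs of shape (n_x, m); Y: labels of shape (1, m);
    w: weights of shape (n_x, 1); b: scalar bias; lambd: regularization strength.
    """
    m = X.shape[1]
    A = 1.0 / (1.0 + np.exp(-(np.dot(w.T, X) + b)))        # sigmoid activations, y_hat
    cross_entropy = -np.sum(Y * np.log(A) + (1 - Y) * np.log(1 - A)) / m
    l2_penalty = (lambd / (2 * m)) * np.sum(np.square(w))  # (lambda / 2m) * ||w||_2^2
    return cross_entropy + l2_penalty
```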
L1 regularization:
$$J(w,b) = \frac{1}{m}\sum_{i=1}^{m} \mathcal{L}(\hat{y}^{(i)}, y^{(i)}) + \frac{\lambda}{m}\|w\|_1 \tag{1-5}$$
where
$$\|w\|_1 = \sum_{j=1}^{n_x} |w_j| \tag{1-6}$$
With L1 regularization, w will end up being sparse; in other words, the w vector will have a lot of zeros in it. This can help compress the model a little. A sketch of the penalty term follows below.
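A corresponding sketch of the L1 penalty term from equations (1-5) and (1-6) (helper name and signature are illustrative):

```python
import numpy as np

def l1_penalty(w, lambd, m):
    """L1 penalty term (lambda / m) * ||w||_1 added to the cost, as in eq. (1-5)."""
    return (lambd / m) * np.sum(np.abs(w))
```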
2. Neural network: “Frobenius norm”
$$J(w^{[1]}, b^{[1]}, \cdots, w^{[L]}, b^{[L]}) = \frac{1}{m}\sum_{i=1}^{m} \mathcal{L}(\hat{y}^{(i)}, y^{(i)}) + \frac{\lambda}{2m}\sum_{l=1}^{L} \left\|w^{[l]}\right\|_F^2 \tag{2-1}$$
where
$$\left\|w^{[l]}\right\|_F^2 = \sum_{i=1}^{n^{[l-1]}} \sum_{j=1}^{n^{[l]}} \left(w_{ij}^{[l]}\right)^2 \tag{2-2}$$
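A minimal sketch of the summed Frobenius penalty in equations (2-1) and (2-2), assuming the weight matrices are stored in a dict keyed "W1", ..., "WL" (that storage convention and the function name are assumptions for illustration):

```python
import numpy as np

def frobenius_penalty(parameters, lambd, m, L):
    """Sum of squared Frobenius norms over all layers, scaled by lambda / (2m)."""
    total = 0.0
    for l in range(1, L + 1):
        total += np.sum(np.square(parameters["W" + str(l)]))  # ||W^[l]||_F^2
    return (lambd / (2 * m)) * total
```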
L2 regularization is also called weight decay:
$$
\begin{aligned}
dw^{[l]} &= (\text{from backprop}) + \frac{\lambda}{m} w^{[l]} \\
w^{[l]} &:= w^{[l]} - \alpha\, dw^{[l]} = \left(1 - \frac{\alpha\lambda}{m}\right) w^{[l]} - \alpha\,(\text{from backprop})
\end{aligned} \tag{2-3}
$$
This keeps the weights w from growing too large, which helps avoid overfitting. A sketch of this update step follows below.
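A minimal sketch of the weight-decay update in equation (2-3) (the function name and argument names are illustrative):

```python
def update_with_weight_decay(W, dW_backprop, alpha, lambd, m):
    """Gradient-descent step with L2 regularization, matching eq. (2-3).

    dW_backprop is the gradient of the unregularized cost ("from backprop");
    the L2 term adds (lambda / m) * W, so each step shrinks W by a factor
    (1 - alpha * lambda / m) before applying the usual gradient step.
    """
    dW = dW_backprop + (lambd / m) * W
    return W - alpha * dW   # == (1 - alpha*lambd/m) * W - alpha * dW_backprop
```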
3. Inverted dropout
For each training example, a random subset of nodes can be eliminated.
Inverted dropout (dropout must be applied in both the forward and backward passes):
By dividing by keep_prob, the inverted dropout technique ensures that the expected value of the activations (e.g. a3) remains the same. This makes test time easier because there is less of a scaling problem.
Dropout is not used at test time.
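A minimal NumPy sketch of the training-time forward step (the function name, shapes, and the example a3 matrix are illustrative; at test time this step is simply skipped):

```python
import numpy as np

def inverted_dropout_forward(a, keep_prob=0.8):
    """Apply inverted dropout to an activation matrix a during training.

    Each unit is kept with probability keep_prob; dividing by keep_prob keeps
    the expected value of the activations unchanged, so no extra rescaling is
    needed at test time. Returns the dropped-out activations and the mask
    (the same mask and scaling are reused on da in the backward pass).
    """
    d = np.random.rand(*a.shape) < keep_prob   # boolean mask, True with prob keep_prob
    a = (a * d) / keep_prob                    # zero out dropped units, then scale up
    return a, d

# Illustrative usage with a made-up layer-3 activation matrix a3 of shape (n3, m):
a3 = np.random.randn(5, 10)
a3, d3 = inverted_dropout_forward(a3, keep_prob=0.8)
```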