Deep Learning Notes: Regularization / Optimizers / Linear Regression / Logistic Regression
Regularization
Generalization Error
- An important tool for explaining the generalization performance of a learning algorithm; it decomposes into three parts: bias, variance, and noise.
- Bias measures the gap between the learning algorithm's expected prediction and the true values; variance measures how performance fluctuates when the same model is trained on differently perturbed versions of the data; noise captures the intrinsic difficulty of the learning problem itself.
- In practice these quantities show up as two failure modes: a small training error paired with a much larger validation/test error is called overfitting (dominated by variance), while large and similar errors on both training and test sets are called underfitting (dominated by bias).
L1 and L2 Regularization
- The most common remedy for overfitting during training is regularization, which trades an increase in training error for a decrease in test error. This post focuses on parameter-norm penalties.
- $J(\theta,b,X,y)=L(\theta,b,X,y)+\lambda \cdot \Omega(\theta)$, where $\Omega(\theta)$ is the norm penalty on the model parameters.

|  | $L1$ norm | $L2$ norm |
| --- | --- | --- |
| Loss function | $J(\theta,b,X,y)=L(\theta,b,X,y)+\lambda \cdot \|\theta\|_1$ | $J(\theta,b,X,y)=L(\theta,b,X,y)+\frac{1}{2}\cdot \lambda \cdot \|\theta\|_2^2$ |
| Notes | Drives many parameters to exactly $0$, so the resulting model is sparse | The more commonly used choice |

- L1/L2 regularization fights overfitting as follows: as $\lambda$ grows, the parameters $\theta$ shrink, and many of them approach $0$ (i.e. the model becomes sparse), lowering model complexity. Smaller weights also shrink the pre-activation outputs, so by the shape of saturating activation functions the outputs stay in the near-linear region; that region also has relatively large gradients, which helps prevent vanishing gradients.
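The shrinking effect of the L2 penalty can be seen in a small sketch. The synthetic data, the $\lambda$ values, and the closed-form ridge solver below are all assumptions made for illustration, not something specified in this post:

```python
import numpy as np

# Illustrative sketch: effect of the penalty (1/2) * lam * ||theta||_2^2
# on a least-squares fit. Data and lam values are made up.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=50)

def fit(X, y, lam):
    # Closed-form ridge solution: (X^T X + lam * I)^{-1} X^T y;
    # lam = 0 recovers ordinary least squares.
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

theta_ols = fit(X, y, lam=0.0)
theta_ridge = fit(X, y, lam=10.0)

# Increasing lam shrinks the parameter norm, trading a bit of training
# error for lower model complexity.
```

Comparing `np.linalg.norm(theta_ridge)` against `np.linalg.norm(theta_ols)` shows the penalized solution has the strictly smaller norm.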
Linear Regression and Logistic Regression
Linear Regression
- Prediction model: $f(X)=\theta \cdot X + b$
- Loss function: $L(\theta,b,X,y)=\frac{1}{2}\cdot (f(X)-y)^2$
- For the derivation below, let $\theta$ and $b$ be the model parameters, $X$ the input features, $y$ the target value (label), and $\eta$ the learning rate.
$\frac{\partial L(\theta,b,X,y)}{\partial \theta}=\frac{\partial}{\partial \theta}\left(\frac{1}{2}(f(X)-y)^2\right)=\frac{\partial}{\partial \theta}\left(\frac{1}{2}(\theta \cdot X+b-y)^2\right)=(\theta \cdot X + b -y)\cdot X$
$\frac{\partial L(\theta,b,X,y)}{\partial b}=\frac{\partial}{\partial b}\left(\frac{1}{2}(f(X)-y)^2\right)=\frac{\partial}{\partial b}\left(\frac{1}{2}(\theta \cdot X+b-y)^2\right)=\theta \cdot X + b -y$
$\theta_{t+1} \leftarrow \theta_{t}-\eta \cdot \frac{\partial L(\theta_t,b_t,X,y)}{\partial \theta_t}=\theta_t-\eta \cdot (\theta_t \cdot X + b_t -y)\cdot X$
$b_{t+1}\leftarrow b_t-\eta \cdot \frac{\partial L(\theta_t,b_t,X,y)}{\partial b_t}=b_t-\eta\cdot(\theta_t \cdot X+b_t-y)$
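The two update rules above can be sketched in a few lines of NumPy. The synthetic data (true $\theta=3$, $b=1$), the learning rate, and the iteration count are illustrative assumptions:

```python
import numpy as np

# Gradient-descent updates for single-feature linear regression,
# averaging the per-sample gradients over a small synthetic dataset.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=100)
y = 3.0 * X + 1.0  # noiseless data with true theta = 3, b = 1

theta, b, eta = 0.0, 0.0, 0.1
for _ in range(500):
    err = theta * X + b - y          # f(X) - y
    theta -= eta * np.mean(err * X)  # theta <- theta - eta * dL/dtheta
    b -= eta * np.mean(err)          # b <- b - eta * dL/db

# theta and b should approach the true values 3.0 and 1.0
```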
Logistic Regression
- Prediction model: $f(X)=\frac{1}{1+e^{\theta \cdot X + b}}$ (note the positive exponent; this differs in sign from the more common $e^{-(\theta \cdot X + b)}$, and the derivation below consistently follows this post's convention)
- Loss function (cross-entropy): $L(\theta,b,X,y)= -\left[y\cdot \log f(X)+(1-y)\cdot \log(1-f(X))\right]$
- At its core it is a nonlinear regression model.
- For the derivation below, let $\theta$ and $b$ be the model parameters, $X$ the input features, $y$ the target value (label), and $\eta$ the learning rate.
$\frac{\partial L(\theta,b,X,y)}{\partial \theta}=-\frac{\partial}{\partial \theta}\left(y\cdot \log f(X)+(1-y)\cdot \log(1-f(X))\right)=-\left[y\cdot\frac{1}{f(X)}\cdot f(X)(1-f(X))\cdot (-X)+(1-y)\cdot \frac{1}{1-f(X)}\cdot f(X)(1-f(X))\cdot X\right]=-\left[-X\cdot y(1-f(X))+(1-y)\cdot f(X)\cdot X\right]=X\cdot (y-f(X))$
$\frac{\partial L(\theta,b,X,y)}{\partial b}=-\frac{\partial}{\partial b}\left(y\cdot \log f(X)+(1-y)\cdot \log(1-f(X))\right)=-\left[y\cdot\frac{1}{f(X)}\cdot f(X)(1-f(X))\cdot (-1)+(1-y)\cdot \frac{1}{1-f(X)}\cdot f(X)(1-f(X))\right]=-\left[-y(1-f(X))+(1-y)\cdot f(X)\right]=y-f(X)$
$\theta_{t+1} \leftarrow \theta_{t}-\eta \cdot \frac{\partial L(\theta_t,b_t,X,y)}{\partial \theta_t}=\theta_t-\eta \cdot X \cdot (y-f(X))$
$b_{t+1}\leftarrow b_t-\eta \cdot \frac{\partial L(\theta_t,b_t,X,y)}{\partial b_t}=b_t-\eta\cdot (y-f(X))$
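These updates can also be sketched in NumPy. Note the code keeps this post's sign convention $f(X)=\frac{1}{1+e^{\theta \cdot X + b}}$ (positive exponent, so $f$ decreases as $\theta \cdot X + b$ grows, and $\theta$ trains toward negative values here); the synthetic data, learning rate, and iteration count are assumptions for illustration:

```python
import numpy as np

# Gradient descent for logistic regression under the convention
# f(X) = 1 / (1 + exp(theta*X + b)); gradients averaged over the batch.
rng = np.random.default_rng(0)
X = rng.normal(size=200)
y = (X > 0).astype(float)  # label 1 exactly when X > 0

def f(theta, b, X):
    return 1.0 / (1.0 + np.exp(theta * X + b))

theta, b, eta = 0.0, 0.0, 0.5
for _ in range(1000):
    grad = y - f(theta, b, X)         # dL/db = y - f(X) per sample
    theta -= eta * np.mean(X * grad)  # theta <- theta - eta * X*(y - f(X))
    b -= eta * np.mean(grad)          # b <- b - eta * (y - f(X))

pred = (f(theta, b, X) > 0.5).astype(float)
acc = np.mean(pred == y)  # training accuracy should be near 1.0
```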
Optimizers
Stochastic Gradient Descent (SGD)
The most common optimizer in neural-network training, and typically the first optimization algorithm newcomers learn.
- $\theta_{t+1} \leftarrow \theta_t-\eta \cdot \frac{\partial L(\theta_t,b_t,X,y)}{\partial \theta_t}$
- $b_{t+1} \leftarrow b_t -\eta \cdot \frac{\partial L(\theta_t,b_t,X,y)}{\partial b_t}$
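A minimal runnable sketch of the rule, on a toy one-parameter loss $L(\theta)=\frac{1}{2}(\theta-5)^2$; the loss, the target value $5$, the learning rate, and the step count are all assumptions for illustration:

```python
# Plain gradient descent on L(theta) = (theta - 5)^2 / 2,
# whose gradient is simply theta - 5. All constants are illustrative.
theta, eta = 0.0, 0.1
for _ in range(200):
    grad = theta - 5.0   # dL/dtheta
    theta -= eta * grad  # theta <- theta - eta * grad

# theta should approach the minimum at 5
```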
Momentum SGD (SGDM)
If you watch the trajectory of stochastic gradient descent, its direction jitters violently from step to step. In statistics, an exponentially weighted average is a standard way to smooth sequential data, making the trajectory (curve) much flatter.
- Introducing the exponentially weighted average into SGD accumulates past gradients with decaying weights, reducing the oscillations on the path to the minimum and accelerating convergence (when successive gradients point in a consistent direction, momentum speeds up learning).
- Let $m_{\partial \theta}$ denote the weighted accumulation of the partial derivative of the loss with respect to $\theta$ (and analogously for $b$), and let $\beta$ be the accumulation weight.
- $m_{\partial \theta_{t+1}}\leftarrow \beta \cdot m_{\partial \theta_{t}}+(1-\beta)\cdot \frac{\partial L}{\partial \theta_{t}}$
- $\theta_{t+1} \leftarrow \theta_{t}-\eta \cdot m_{\partial \theta_{t+1}}$
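The momentum update can be sketched on the same kind of toy quadratic as before; $\beta=0.9$ is the conventional default, and the other constants are illustrative assumptions:

```python
# Momentum (SGDM) update on the toy loss L(theta) = (theta - 5)^2 / 2.
theta, m, eta, beta = 0.0, 0.0, 0.1, 0.9
for _ in range(300):
    grad = theta - 5.0                # dL/dtheta
    m = beta * m + (1 - beta) * grad  # exponentially weighted gradient average
    theta -= eta * m                  # theta <- theta - eta * m

# theta should settle near the minimum at 5
```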
Per-Parameter Adaptive Learning Rates
Adagrad
- Scales each update by the inverse square root of the accumulated sum of squared gradients, dynamically adapting the learning rate.
- Because the accumulated sum only grows, the effective step size keeps shrinking, so optimization becomes slow in the later stages.
- $s_{t+1}\leftarrow s_{t}+\left(\frac{\partial L}{\partial \theta_t}\right)^2$
- $\theta_{t+1}\leftarrow \theta_t- \frac{\eta}{\sqrt{s_{t+1}}}\cdot \frac{\partial L}{\partial \theta_t}$
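A runnable sketch of Adagrad on the toy quadratic $L(\theta)=\frac{1}{2}(\theta-5)^2$; the small `eps` is standard practice (not in the formulas above) to avoid dividing by zero, and the other constants are illustrative:

```python
# Adagrad on L(theta) = (theta - 5)^2 / 2.
theta, s, eta, eps = 0.0, 0.0, 1.0, 1e-8
for _ in range(500):
    grad = theta - 5.0
    s += grad ** 2                          # ever-growing sum of squared gradients
    theta -= eta / (s ** 0.5 + eps) * grad  # step shrinks as s accumulates
```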
RMSProp
- Introduces the exponentially weighted average into Adagrad, replacing the ever-growing sum with a decaying average of squared gradients.
- RMSProp helps damp the oscillations on the way to the minimum and permits a larger learning rate, speeding up training.
- $s_{t+1}\leftarrow \beta \cdot s_{t}+(1-\beta)\cdot\left(\frac{\partial L}{\partial \theta_t}\right)^2$
- $\theta_{t+1}\leftarrow \theta_t- \frac{\eta}{\sqrt{s_{t+1}}}\cdot \frac{\partial L}{\partial \theta_t}$
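The same toy quadratic with RMSProp; `eps` is again standard practice to avoid division by zero, and $\beta=0.9$ plus the other constants are illustrative assumptions:

```python
# RMSProp on L(theta) = (theta - 5)^2 / 2.
theta, s, eta, beta, eps = 0.0, 0.0, 0.01, 0.9, 1e-8
for _ in range(2000):
    grad = theta - 5.0
    s = beta * s + (1 - beta) * grad ** 2   # decaying average of squared gradients
    theta -= eta / (s ** 0.5 + eps) * grad  # near-constant step of roughly eta
```

Because the step normalizes the gradient by its own recent magnitude, the parameter cruises toward the minimum at roughly `eta` per step rather than slowing down the way Adagrad does.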
Adam
- Combines SGDM (a first-moment estimate of the gradient) and RMSProp (a second-moment estimate).
- $s_{t+1}\leftarrow \beta_1 \cdot s_{t}+(1-\beta_1)\cdot \frac{\partial L}{\partial \theta_t}\qquad r_{t+1}\leftarrow \beta_2 \cdot r_{t}+(1-\beta_2)\cdot\left(\frac{\partial L}{\partial \theta_t}\right)^2$
- $\widehat{s_{t+1}}=\frac{s_{t+1}}{1-\beta_1^{\,t+1}}\qquad \widehat{r_{t+1}}=\frac{r_{t+1}}{1-\beta_2^{\,t+1}}$ (the bias correction divides by $1-\beta^{t+1}$, with the $\beta$'s raised to the step count, so the correction fades as $t$ grows)
- $\theta_{t+1}\leftarrow \theta_t- \eta \cdot \frac{\widehat{s_{t+1}}}{\sqrt{\widehat{r_{t+1}}}}$ (the corrected first moment $\widehat{s_{t+1}}$ already contains the gradient, so no extra gradient factor is multiplied in)
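Putting the three steps together on the toy quadratic $L(\theta)=\frac{1}{2}(\theta-5)^2$; $\beta_1=0.9$, $\beta_2=0.999$, and `eps` are the common defaults, while the learning rate and step count are illustrative assumptions:

```python
# Adam on L(theta) = (theta - 5)^2 / 2, with bias-corrected moments.
theta, s, r = 0.0, 0.0, 0.0
eta, beta1, beta2, eps = 0.01, 0.9, 0.999, 1e-8
for t in range(1, 3001):
    grad = theta - 5.0
    s = beta1 * s + (1 - beta1) * grad       # first-moment estimate
    r = beta2 * r + (1 - beta2) * grad ** 2  # second-moment estimate
    s_hat = s / (1 - beta1 ** t)             # bias correction (powers of t)
    r_hat = r / (1 - beta2 ** t)
    theta -= eta * s_hat / (r_hat ** 0.5 + eps)
```

On a deterministic toy problem like this, Adam approaches the minimum at roughly `eta` per step and then hovers in a small neighborhood of it rather than converging exactly, which is expected behavior for this optimizer.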