This article is mainly based on An overview of gradient descent optimization algorithms and https://ruder.io/optimizing-gradient-descent/index.html. It implements the optimization algorithms from those references in Python; for algorithmic details please refer to those documents, since the focus here is on the code.
Gradient descent is the most common way to optimize neural networks, and many deep learning frameworks (e.g. TensorFlow, PyTorch) implement a variety of such optimizers. However, these algorithms are usually used as black boxes, because practical explanations of their strengths and weaknesses are hard to come by. This article first looks at the variants of gradient descent, then describes the difficulties that arise during model training, and finally introduces the most widely used optimization algorithms and the role they play in overcoming those difficulties.
The goal of gradient descent is to minimize an objective function $J(\theta)$ by updating the parameters $\theta$ in the direction opposite to the gradient $\nabla_{\theta}J(\theta)$ of the objective. The learning rate $\eta$ determines the size of each update step. For more background on gradient descent, see the references above.
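As a minimal illustration of this update rule (my own toy example, not from the article): descending $J(\theta)=\theta^2$, whose gradient is $2\theta$, with a fixed learning rate.

```python
# Toy gradient descent on J(theta) = theta^2 (illustrative; the objective
# and learning rate are assumptions, not from the referenced article)
theta = 5.0   # initial parameter value
eta = 0.1     # learning rate
for _ in range(100):
    grad = 2 * theta            # gradient of J at the current theta
    theta = theta - eta * grad  # step against the gradient
print(theta)  # close to the minimum at theta = 0
```

Each step shrinks $\theta$ by the factor $(1 - 2\eta)$, so the iterate converges geometrically to the minimizer at 0.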
1. Gradient descent variants
There are three variants of gradient descent, which differ only in how much data is used to compute the gradient of the objective function. The amount of data trades off the accuracy of the gradient against the speed of each parameter update: more data gives a more accurate gradient, but the computation grows heavier and parameter updates become slower.
1.1 Batch gradient descent
This is the original form of gradient descent, computing the gradient over the entire training set: $\theta = \theta - \eta \cdot \nabla_{\theta}J(\theta)$. Because each update uses all the data, a single step is very slow and the model cannot be trained online, but on convex surfaces it reaches the global minimum, and on non-convex surfaces a local minimum. The code looks roughly like:
for i in range(n_epochs):
    params_grad = compute_grads(objective_function, data, params)
    params = params - lr * params_grad
1.2 Stochastic gradient descent
Stochastic gradient descent computes the gradient and updates the parameters from a single example $(x^{(i)}, y^{(i)})$: $\theta = \theta - \eta \cdot \nabla_{\theta}J(\theta;x^{(i)};y^{(i)})$. Single-example updates are fast but fluctuate heavily, so even after reaching a minimum the iterates can easily jump back out of it. The code looks roughly like:
for i in range(nb_epochs):
    random.shuffle(data)  # shuffle the data at the start of every epoch
    for example in data:
        params_grad = compute_grads(objective_function, example, params)
        params = params - learning_rate * params_grad
1.3 Mini-batch gradient descent
Mini-batch gradient descent sits between batch gradient descent and stochastic gradient descent: it computes the gradient on a small portion of the training data, $\theta = \theta - \eta \cdot \nabla_{\theta}J(\theta;x^{(i:i+n)};y^{(i:i+n)})$, which makes it both fast and stable. The code looks roughly like:
for i in range(nb_epochs):
    random.shuffle(data)
    for batch in get_batches(data, batch_size=50):
        params_grad = compute_grads(objective_function, batch, params)
        params = params - learning_rate * params_grad
2. Challenges
- Choosing a proper learning rate is hard: too small and training is painfully slow, wasting time; too large and the loss fails to converge to the optimum.
- Learning rate schedules change the learning rate according to a pre-defined timetable or a loss threshold, but such schedules are fixed in advance and cannot adapt to the characteristics of the dataset.
- Applying the same learning rate to every parameter is also inappropriate.
- Training can get trapped in local minima or saddle points of non-convex functions.
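To make the schedule point above concrete, a pre-set step-decay schedule (my own sketch; the decay factor and interval are illustrative assumptions) could look like:

```python
def step_decay(initial_lr, epoch, drop=0.5, epochs_per_drop=10):
    """Halve the learning rate every `epochs_per_drop` epochs."""
    return initial_lr * (drop ** (epoch // epochs_per_drop))

# The timetable is fixed before training starts, which is exactly the
# limitation noted above: it cannot react to the dataset during training.
print(step_decay(0.1, 0), step_decay(0.1, 10), step_decay(0.1, 25))
```

The adaptive algorithms in the next section sidestep this by deriving per-parameter step sizes from the observed gradients instead of from a clock.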
3. Common gradient optimization algorithms
3.1 Momentum
Momentum adds a velocity term that retains a fraction of the previous update direction: $v_t = \gamma v_{t-1} + \eta \nabla_{\theta}J(\theta)$, $\theta = \theta - v_t$. $\gamma$ is usually set to around 0.9. The Python implementation below uses the equivalent exponential-moving-average form, with the learning rate applied outside the velocity:
import numpy as np

class StochasticGradientDescent():
    def __init__(self, learning_rate=0.01, momentum=0):
        self.learning_rate = learning_rate
        self.momentum = momentum
        self.w_updt = None

    def update(self, w, grad_wrt_w):
        # If not initialized
        if self.w_updt is None:
            self.w_updt = np.zeros(np.shape(w))
        # Use momentum if set (exponential moving average of gradients)
        self.w_updt = self.momentum * self.w_updt + (1 - self.momentum) * grad_wrt_w
        # Move against the gradient to minimize loss
        return w - self.learning_rate * self.w_updt
3.2 Nesterov accelerated gradient
Nesterov accelerated gradient corrects the gradient by evaluating it at the approximate future position of the parameters, giving the update a look-ahead that adapts better to the error surface: $v_t = \gamma v_{t-1} + \eta \nabla_{\theta}J(\theta - \gamma v_{t-1})$, $\theta = \theta - v_t$
class NesterovAcceleratedGradient():
    def __init__(self, learning_rate=0.001, momentum=0.4):
        self.learning_rate = learning_rate
        self.momentum = momentum
        self.w_updt = None

    def update(self, w, grad_func):
        # Initialize on first update (must happen before the look-ahead step)
        if self.w_updt is None:
            self.w_updt = np.zeros(np.shape(w))
        # Calculate the gradient of the loss a bit further down the slope from w
        approx_future_grad = np.clip(grad_func(w - self.momentum * self.w_updt), -1, 1)
        self.w_updt = self.momentum * self.w_updt + self.learning_rate * approx_future_grad
        # Move against the gradient to minimize loss
        return w - self.w_updt
3.3 Adagrad
A key advantage of Adagrad is that it removes the need to tune the learning rate manually. It adapts the learning rate to each parameter individually: features that occur rarely receive larger updates, while features that occur frequently receive smaller ones. This makes Adagrad well suited to sparse data. Its main weakness is the accumulation of squared gradients in the denominator: since every added term is positive, the accumulated sum keeps growing throughout training. This shrinks the learning rate until it becomes infinitesimally small, at which point Adagrad can no longer acquire additional knowledge. The algorithms that follow aim to fix this flaw.
For a single parameter: $g_{t,i} = \nabla_{\theta}J(\theta_i)$, $\theta_{t+1,i} = \theta_{t,i} - \eta \, g_{t,i}$
With the per-parameter corrected learning rate: $\theta_{t+1,i} = \theta_{t,i} - \frac{\eta}{\sqrt{G_{t,ii}}} \, g_{t,i}$
In vectorized form: $\theta_{t+1} = \theta_{t} - \frac{\eta}{\sqrt{G_{t}+\epsilon}} \odot g_{t}$
class Adagrad():
    def __init__(self, learning_rate=0.01):
        self.learning_rate = learning_rate
        self.G = None  # Sum of squares of the gradients
        self.eps = 1e-8

    def update(self, w, grad_wrt_w):
        # If not initialized
        if self.G is None:
            self.G = np.zeros(np.shape(w))
        # Add the square of the gradient of the loss function at w
        self.G += np.power(grad_wrt_w, 2)
        # Adaptive gradient with higher learning rate for sparse data
        return w - self.learning_rate * grad_wrt_w / np.sqrt(self.G + self.eps)
3.4 Adadelta
Adadelta is an extension of Adagrad that addresses its monotonically decreasing learning rate. Instead of accumulating all past squared gradients, Adadelta restricts the window of accumulated past gradients to a fixed size $w$. Rather than storing the $w$ previous squared gradients, it recursively defines a decaying average of all past squared gradients: the running average $E[g^2]_t$ at step $t$ depends only on the previous average and the current gradient (the factor $\gamma$ plays a role similar to the momentum term):
$E[g^2]_{t} = \gamma \, E[g^2]_{t-1} + (1-\gamma) \, g_t^2$
A plain SGD update is $\Delta\theta_t = -\eta \, g_t$ with $\theta_{t+1} = \theta_t + \Delta\theta_t$. The Adagrad update was $\Delta\theta_{t} = -\frac{\eta}{\sqrt{G_{t}+\epsilon}} \odot g_{t}$; replacing $G_t$ with the decaying average gives $\Delta\theta_{t} = -\frac{\eta}{\sqrt{E[g^2]_t+\epsilon}} \, g_{t}$. The denominator is just a root mean square (RMS) of the gradient: $\Delta\theta_{t} = -\frac{\eta}{RMS[g]_t} \, g_{t}$. Adadelta goes one step further and also replaces the learning rate $\eta$ with an RMS of the previous parameter updates, $\Delta\theta_{t} = -\frac{RMS[\Delta\theta]_{t-1}}{RMS[g]_t} \, g_{t}$, which is what the code below implements.
class Adadelta():
    def __init__(self, rho=0.95, eps=1e-6):
        self.E_w_updt = None  # Running average of squared parameter updates
        self.E_grad = None    # Running average of the squared gradient of w
        self.w_updt = None    # Parameter update
        self.eps = eps
        self.rho = rho

    def update(self, w, grad_wrt_w):
        # If not initialized
        if self.w_updt is None:
            self.w_updt = np.zeros(np.shape(w))
            self.E_w_updt = np.zeros(np.shape(w))
            self.E_grad = np.zeros(np.shape(grad_wrt_w))
        # Update average of gradients at w
        self.E_grad = self.rho * self.E_grad + (1 - self.rho) * np.power(grad_wrt_w, 2)
        RMS_delta_w = np.sqrt(self.E_w_updt + self.eps)
        RMS_grad = np.sqrt(self.E_grad + self.eps)
        # Adaptive learning rate: RMS of past updates over RMS of past gradients
        adaptive_lr = RMS_delta_w / RMS_grad
        # Calculate the update
        self.w_updt = adaptive_lr * grad_wrt_w
        # Update the running average of w updates
        self.E_w_updt = self.rho * self.E_w_updt + (1 - self.rho) * np.power(self.w_updt, 2)
        return w - self.w_updt
3.5 RMSprop
RMSprop was developed independently and is identical to Adadelta's first update rule: it divides the learning rate by a decaying average of squared gradients, $E[g^2]_{t} = \rho \, E[g^2]_{t-1} + (1-\rho) \, g_t^2$, $\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{E[g^2]_t+\epsilon}} \, g_t$
class RMSprop():
    def __init__(self, learning_rate=0.01, rho=0.9):
        self.learning_rate = learning_rate
        self.Eg = None  # Running average of the square gradients at w
        self.eps = 1e-8
        self.rho = rho

    def update(self, w, grad_wrt_w):
        # If not initialized
        if self.Eg is None:
            self.Eg = np.zeros(np.shape(grad_wrt_w))
        self.Eg = self.rho * self.Eg + (1 - self.rho) * np.power(grad_wrt_w, 2)
        # Divide the learning rate for a weight by a running average of the
        # magnitudes of recent gradients for that weight
        return w - self.learning_rate * grad_wrt_w / np.sqrt(self.Eg + self.eps)
3.6 Adam
Adam (adaptive moment estimation) keeps decaying averages of both past gradients (first moment) and past squared gradients (second moment), and corrects their initialization bias:
$m_t = \beta_1 m_{t-1} + (1-\beta_1) g_t$, $v_t = \beta_2 v_{t-1} + (1-\beta_2) g_t^2$, $\hat m_t = \frac{m_t}{1-\beta_1^t}$, $\hat v_t = \frac{v_t}{1-\beta_2^t}$, $\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat v_t}+\epsilon} \hat m_t$
class Adam():
    def __init__(self, learning_rate=0.001, b1=0.9, b2=0.999):
        self.learning_rate = learning_rate
        self.eps = 1e-8
        self.m = None
        self.v = None
        self.t = 0  # Timestep, needed for bias correction
        # Decay rates
        self.b1 = b1
        self.b2 = b2

    def update(self, w, grad_wrt_w):
        # If not initialized
        if self.m is None:
            self.m = np.zeros(np.shape(grad_wrt_w))
            self.v = np.zeros(np.shape(grad_wrt_w))
        self.t += 1
        self.m = self.b1 * self.m + (1 - self.b1) * grad_wrt_w
        self.v = self.b2 * self.v + (1 - self.b2) * np.power(grad_wrt_w, 2)
        # Bias-corrected first and second moment estimates
        m_hat = self.m / (1 - self.b1 ** self.t)
        v_hat = self.v / (1 - self.b2 ** self.t)
        self.w_updt = self.learning_rate * m_hat / (np.sqrt(v_hat) + self.eps)
        return w - self.w_updt
3.7 AdaMax
Compared with Adam, AdaMax changes only one formula: the $\ell_2$ norm of the gradients in $v_t$ is replaced by the $\ell_\infty$ norm (and can in principle be generalized to any $\ell_p$ norm): $u_t = \max(\beta_2 u_{t-1}, |g_t|)$, $\theta_{t+1} = \theta_t - \frac{\eta}{u_t} \hat m_t$
class AdaMax():
    def __init__(self, learning_rate=0.002, b1=0.9, b2=0.999):
        self.learning_rate = learning_rate
        self.eps = 1e-8
        self.m = None
        self.u = None  # Exponentially weighted infinity norm
        self.t = 0
        # Decay rates
        self.b1 = b1
        self.b2 = b2

    def update(self, w, grad_wrt_w):
        # If not initialized
        if self.m is None:
            self.m = np.zeros(np.shape(grad_wrt_w))
            self.u = np.zeros(np.shape(grad_wrt_w))
        self.t += 1
        self.m = self.b1 * self.m + (1 - self.b1) * grad_wrt_w
        # Element-wise max keeps the infinity-norm estimate
        self.u = np.maximum(self.b2 * self.u, np.abs(grad_wrt_w))
        # Bias-correct the first moment (u needs no bias correction)
        m_hat = self.m / (1 - self.b1 ** self.t)
        self.w_updt = self.learning_rate * m_hat / (self.u + self.eps)
        return w - self.w_updt
3.8 Nadam
Adam can be seen as a combination of momentum and RMSprop: momentum uses the first moment of the gradient, RMSprop the second. Nadam combines Adam with the NAG algorithm.
Momentum: $g_t = \nabla_{\theta_t} J(\theta_t)$, $m_t = \gamma m_{t-1} + \eta g_t$, $\theta_{t+1} = \theta_t - m_t$
Expanding the third equation: $\theta_{t+1} = \theta_t - (\gamma m_{t-1} + \eta g_t)$, which shows the update moves along the previous momentum vector plus the current gradient.
NAG: $g_t = \nabla_{\theta_t} J(\theta_t - \gamma m_{t-1})$, $m_t = \gamma m_{t-1} + \eta g_t$, $\theta_{t+1} = \theta_t - m_t$
Adam: $m_t = \beta_1 m_{t-1} + (1-\beta_1) g_t$, $\hat m_t = \frac{m_t}{1-\beta_1^t}$, $\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat v_t}+\epsilon} \hat m_t$
Expanding the third equation: $\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat v_t}+\epsilon} \left( \frac{\beta_1 m_{t-1}}{1-\beta_1^t} + \frac{(1-\beta_1) g_t}{1-\beta_1^t} \right)$
The first term in the brackets is approximately $\beta_1 \hat m_{t-1}$ (its denominator differs from $1-\beta_1^{t-1}$ only by one power of $\beta_1$).
Substituting this in and then, in Nesterov fashion, replacing the previous bias-corrected momentum $\hat m_{t-1}$ with the current one $\hat m_t$ gives the final Nadam update: $\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat v_t}+\epsilon} \left( \beta_1 \hat m_t + \frac{(1-\beta_1) g_t}{1-\beta_1^t} \right)$
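Following the style of the classes above, a sketch of the Nadam update (my own implementation of the final formula, not code from the references):

```python
import numpy as np

class Nadam():
    def __init__(self, learning_rate=0.002, b1=0.9, b2=0.999):
        self.learning_rate = learning_rate
        self.eps = 1e-8
        self.m = None
        self.v = None
        self.t = 0  # Timestep, needed for bias correction
        # Decay rates
        self.b1 = b1
        self.b2 = b2

    def update(self, w, grad_wrt_w):
        # If not initialized
        if self.m is None:
            self.m = np.zeros(np.shape(grad_wrt_w))
            self.v = np.zeros(np.shape(grad_wrt_w))
        self.t += 1
        self.m = self.b1 * self.m + (1 - self.b1) * grad_wrt_w
        self.v = self.b2 * self.v + (1 - self.b2) * np.power(grad_wrt_w, 2)
        m_hat = self.m / (1 - self.b1 ** self.t)
        v_hat = self.v / (1 - self.b2 ** self.t)
        # Nesterov-style look-ahead: mix the current bias-corrected momentum
        # with the bias-corrected current gradient
        m_nesterov = self.b1 * m_hat + (1 - self.b1) * grad_wrt_w / (1 - self.b1 ** self.t)
        return w - self.learning_rate * m_nesterov / (np.sqrt(v_hat) + self.eps)
```

Its `update` signature matches the other optimizer classes here, so it can be dropped into the same training loop.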
4. Choosing an optimizer
For an adaptive method, use Adam; otherwise, SGD with momentum combined with a learning rate annealing schedule works well.
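As a toy sanity check of the SGD-with-momentum recommendation (my own example; the quadratic objective and hyperparameters are illustrative assumptions, not from the article), momentum visibly speeds up descent along the shallow direction of an elongated "ravine":

```python
import numpy as np

def loss(w):
    # Elongated quadratic: J(w) = 0.5*w0^2 + 10*w1^2
    return 0.5 * w[0] ** 2 + 10.0 * w[1] ** 2

def grad(w):
    # Gradient of the quadratic above
    return np.array([1.0 * w[0], 20.0 * w[1]])

def run(momentum, steps=100, lr=0.02):
    w = np.array([5.0, 5.0])
    v = np.zeros_like(w)
    for _ in range(steps):
        v = momentum * v + lr * grad(w)  # v_t = gamma * v_{t-1} + eta * grad
        w = w - v                        # theta = theta - v_t
    return loss(w)

print(run(momentum=0.0), run(momentum=0.9))  # momentum reaches a lower loss
```

With the same learning rate and step budget, the momentum run ends at a much lower loss because the accumulated velocity accelerates progress along the low-curvature $w_0$ axis.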