Deep Learning Fundamentals: A Summary of Optimizers

m0_49089298

Original link: https://mp.weixin.qq.com/s?__biz=MzU1NzE0MDk1MA==&mid=2247484022&idx=1&sn=88ccf1e682496786896207dee033141e&chksm=fc3b1ccfcb4c95d904ced1f7a3a3d029e745e4b24c8a61854b3df76b4e5727cc6ef3cdd65446#rd

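1. Training methods:
(1) Batch gradient descent, GD (gradient descent)
Gradient descent is a commonly used method for solving unconstrained optimization problems. It is well suited to optimization and control processes that involve many control variables and a complex controlled system for which no accurate mathematical model can be built. It is an iterative algorithm, and every step requires computing the gradient vector of the objective function.
In machine learning, GD is mainly used to reduce the loss (error) between the model output and the true output by iteratively updating the model parameters. Stochastic gradient descent replaces the full sample set with randomly drawn samples; its main purpose is to speed up each iteration and avoid getting bogged down in a huge amount of computation. Running GD over the entire sample set is also called batch gradient descent (BGD).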
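
For a linear model such as the one used in the code examples in this article, the batch update can be written out explicitly. The notation here ($h_\theta$, $m$ for the number of samples, a superscript $(i)$ for the $i$-th sample) is assumed for illustration and does not come from the original text:

```latex
\begin{align}
J(\theta) &= \frac{1}{2}\sum_{i=1}^{m}\bigl(y^{(i)} - h_\theta(x^{(i)})\bigr)^2,
\qquad h_\theta(x) = \theta_0 x_0 + \theta_1 x_1 + \theta_2 x_2 \\
\theta_j &\leftarrow \theta_j - \alpha\,\frac{\partial J}{\partial \theta_j}
 = \theta_j + \alpha \sum_{i=1}^{m}\bigl(y^{(i)} - h_\theta(x^{(i)})\bigr)\,x_j^{(i)}
\end{align}
```

The sum over all $m$ samples in the second line is what makes every batch update expensive on a large data set.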

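(2) Stochastic gradient descent, SGD (stochastic gradient descent)
SGD picks one sample at random and takes a gradient step on it, instead of updating the parameters only after going through all samples. The cost function in plain gradient descent has to be evaluated over all samples, and this happens in every iteration until a local optimum is reached. With a very large sample set this converges slowly and the amount of computation is enormous.
Stochastic gradient descent minimizes with respect to the current sample only. It usually cannot reach the true local optimum, but it can get fairly close. It is a compromise that keeps the computational cost manageable on large sample sets.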
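
As a rough sketch of what "minimizing with respect to the current sample only" means for a linear model, the per-sample update can be written as follows. The symbols ($h_\theta(x) = \sum_j \theta_j x_j$, learning rate $\alpha$, sample index $i$) are assumed notation for illustration:

```latex
\begin{equation}
\theta_j \leftarrow \theta_j + \alpha\,\bigl(y^{(i)} - h_\theta(x^{(i)})\bigr)\,x_j^{(i)}
\qquad \text{for one sample } i \text{ at a time}
\end{equation}
```

Each update touches a single sample, so its cost does not grow with the size of the data set.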

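    **An analogy to aid understanding**

Suppose I am in the mountains and want to find the lowest valley. I walk downhill following gravity (gradient descent): first I pick a direction at random, say west, then south, and so on. By the convergence property I am bound to reach the bottom of some valley, which may be a local minimum or the global minimum. Either way, this is certainly much faster than computing the exact gradient at every step.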

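The steps of batch gradient descent can be summarized as follows:
1. First decide the size of each downhill step, which we call the learning rate;
2. Pick an arbitrary initial value for the θ vector, usually the zero vector;
3. Determine a downhill direction, take a step of the predetermined size in that direction, and update the θ vector;
4. Stop when the decrease in height falls below a predefined threshold.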
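
The four steps above can be sketched on a one-dimensional toy problem. This is a minimal illustration, not code from the original article: the function name `gradient_descent`, the parameters `tol` and `max_iter`, and the test function f(θ) = (θ - 3)^2 are all assumptions made for the example.

```python
def gradient_descent(f, grad, theta=0.0, alpha=0.1, tol=1e-8, max_iter=10_000):
    """Minimise a 1-D function f, given its derivative grad.

    alpha is the learning rate (step 1) and theta the initial value (step 2).
    """
    for _ in range(max_iter):
        # Step 3: move against the gradient by a step scaled with alpha.
        new_theta = theta - alpha * grad(theta)
        # Step 4: stop once the decrease in f falls below the threshold.
        decrease = f(theta) - f(new_theta)
        theta = new_theta
        if decrease < tol:
            break
    return theta


# Toy objective (assumed for illustration): minimum at theta = 3.
def f(theta):
    return (theta - 3) ** 2


def f_grad(theta):
    return 2 * (theta - 3)


theta_min = gradient_descent(f, f_grad)
print("theta_min = %f" % theta_min)
```

The `max_iter` cap is a safety net so the loop ends even if the learning rate is chosen badly.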

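Stochastic gradient descent:
Each iteration only tries to drive J(θ) for the current sample toward its minimum and ignores all other samples. This makes the algorithm fast, but the path to convergence is rather erratic. Overall, most of the time it only gets close to a local optimum without actually reaching it. It is therefore suited to large training sets.
A Python implementation of stochastic gradient descent: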

# Training data set
# each element in x represents (x0, x1, x2); x0 = 1 is the bias term
x = [(1, 0., 3), (1, 1., 3), (1, 2., 3), (1, 3., 2), (1, 4., 4)]
# y[i] is the output of y = theta0 * x[0] + theta1 * x[1] + theta2 * x[2]
y = [95.364, 97.217205, 75.195834, 60.105519, 49.342380]

# stop when the cost changes by less than epsilon between passes
epsilon = 0.0001
# learning rate
alpha = 0.01
error0 = 0
m = len(x)

# init the parameters to zero
theta0 = 0
theta1 = 0
theta2 = 0

while True:
    # update the parameters after every single sample
    # (samples are visited in order here rather than drawn at random)
    for i in range(m):
        diff = y[i] - (theta0 + theta1 * x[i][1] + theta2 * x[i][2])

        theta0 = theta0 + alpha * diff * x[i][0]
        theta1 = theta1 + alpha * diff * x[i][1]
        theta2 = theta2 + alpha * diff * x[i][2]

    # calculate the cost function over all samples
    error1 = 0
    for lp in range(m):
        error1 += (y[lp] - (theta0 + theta1 * x[lp][1] + theta2 * x[lp][2])) ** 2 / 2

    if abs(error1 - error0) < epsilon:
        break
    else:
        error0 = error1

    print(' theta0 : %f, theta1 : %f, theta2 : %f, error1 : %f ' % (theta0, theta1, theta2, error1))

print('Done: theta0 : %f, theta1 : %f, theta2 : %f' % (theta0, theta1, theta2))

Batch gradient descent:

# Training data set
# each element in x represents (x0, x1, x2); x0 = 1 is the bias term
x = [(1, 0., 3), (1, 1., 3), (1, 2., 3), (1, 3., 2), (1, 4., 4)]
# y[i] is the output of y = theta0 * x[0] + theta1 * x[1] + theta2 * x[2]
y = [95.364, 97.217205, 75.195834, 60.105519, 49.342380]

# stop when the cost changes by less than epsilon between passes
epsilon = 0.000001
# learning rate
alpha = 0.001
error0 = 0
m = len(x)

# init the parameters to zero
theta0 = 0
theta1 = 0
theta2 = 0
# running totals of every update applied so far; they are never reset,
# so after each pass sum0..sum2 hold the new values of theta0..theta2
sum0 = 0
sum1 = 0
sum2 = 0
while True:
    # accumulate the update over all samples while theta stays fixed
    for i in range(m):
        # begin batch gradient descent
        diff = y[i] - (theta0 + theta1 * x[i][1] + theta2 * x[i][2])
        sum0 = sum0 + alpha * diff * x[i][0]
        sum1 = sum1 + alpha * diff * x[i][1]
        sum2 = sum2 + alpha * diff * x[i][2]
        # end batch gradient descent
    # update the parameters once per pass over the data
    theta0 = sum0
    theta1 = sum1
    theta2 = sum2
    # calculate the cost function over all samples
    error1 = 0
    for lp in range(m):
        error1 += (y[lp] - (theta0 + theta1 * x[lp][1] + theta2 * x[lp][2])) ** 2 / 2

    if abs(error1 - error0) < epsilon:
        break
    else:
        error0 = error1

    print(' theta0 : %f, theta1 : %f, theta2 : %f, error1 : %f' % (theta0, theta1, theta2, error1))

print('Done: theta0 : %f, theta1 : %f, theta2 : %f' % (theta0, theta1, theta2))

The difference between the two:

In stochastic gradient descent, all theta parameters are updated once for every new sample that is processed.

for i in range(m):
    diff = y[i] - (theta0 + theta1 * x[i][1] + theta2 * x[i][2])

    theta0 = theta0 + alpha * diff * x[i][0]
    theta1 = theta1 + alpha * diff * x[i][1]
    theta2 = theta2 + alpha * diff * x[i][2]

In batch gradient descent, the theta parameters are updated only once, after all samples have been processed.

# calculate the parameters
for i in range(m):
    # begin batch gradient descent
    diff = y[i] - (theta0 + theta1 * x[i][1] + theta2 * x[i][2])
    sum0 = sum0 + alpha * diff * x[i][0]
    sum1 = sum1 + alpha * diff * x[i][1]
    sum2 = sum2 + alpha * diff * x[i][2]
    # end batch gradient descent

theta0 = sum0
theta1 = sum1
theta2 = sum2

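Therefore, when the number of samples is very large, batch gradient descent has to finish the computation for all samples before it can update theta even once, so it takes far longer than stochastic gradient descent. On the other hand, each stochastic update is based on a single sample, so the value it ends up with is only close to a local optimum, whereas batch gradient descent actually converges to one.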
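
Between these two extremes sits mini-batch gradient descent (MBGD, named in the list of optimizers in this article), which updates theta after every small group of samples. The sketch below reuses the toy data from the examples above. The function name `minibatch_gd`, the parameters `batch_size`, `epochs` and `seed`, and the choice to average the gradient over each batch are assumptions made for illustration, not part of the original article.

```python
import random

# Same toy data as the examples above: each row is (x0, x1, x2) with x0 = 1.
X = [(1, 0., 3), (1, 1., 3), (1, 2., 3), (1, 3., 2), (1, 4., 4)]
Y = [95.364, 97.217205, 75.195834, 60.105519, 49.342380]


def predict(theta, row):
    """Linear model: theta0 * x0 + theta1 * x1 + theta2 * x2."""
    return sum(t * v for t, v in zip(theta, row))


def cost(theta):
    """Half the sum of squared errors over the whole data set."""
    return sum((y - predict(theta, row)) ** 2 for row, y in zip(X, Y)) / 2


def minibatch_gd(batch_size=2, alpha=0.01, epochs=200, seed=0):
    rng = random.Random(seed)        # fixed seed keeps the run repeatable
    theta = [0.0, 0.0, 0.0]
    indices = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(indices)         # visit the samples in a new order each epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            grad = [0.0, 0.0, 0.0]
            for i in batch:
                diff = Y[i] - predict(theta, X[i])
                for j in range(3):
                    grad[j] += diff * X[i][j]
            # One update per mini-batch, averaged over the batch (an assumed choice).
            theta = [t + alpha * g / len(batch) for t, g in zip(theta, grad)]
    return theta


theta = minibatch_gd()
print("theta:", theta, "cost:", cost(theta))
```

With `batch_size=1` this behaves like stochastic gradient descent, and with `batch_size=len(X)` it makes one update per pass like batch gradient descent.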

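Key points

(1) α is the learning rate and determines the size of each downhill step. If it is too small, finding the minimum of the function is very slow; if it is too large, the algorithm may overshoot the minimum;
(2) When there are several local optima, different starting points lead to different minima, so gradient descent only finds a local minimum;
(3) The closer to the minimum, the slower the descent, because the gradient itself becomes smaller;
(4) In batch gradient descent, every update of θ requires going through all samples, which is time-consuming when the data set is large.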
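
Point (1) can be illustrated on the toy function f(θ) = θ^2, whose gradient is 2θ. This is a minimal sketch; the function name `run` and the three learning rates are assumptions chosen for the example.

```python
def run(alpha, steps=20, theta=1.0):
    """Take a fixed number of gradient steps on f(theta) = theta^2."""
    for _ in range(steps):
        theta -= alpha * 2 * theta   # the gradient of theta^2 is 2 * theta
    return theta


too_small = run(0.01)   # each step shrinks theta by only 2 percent
moderate = run(0.4)     # each step shrinks theta by 80 percent
too_large = run(1.1)    # each step flips the sign and grows |theta|

print("alpha=0.01 -> %g" % too_small)
print("alpha=0.4  -> %g" % moderate)
print("alpha=1.1  -> %g" % too_large)
```

After the same number of steps, the small learning rate is still far from the minimum at 0, the moderate one is very close to it, and the large one has moved away from it by overshooting on every step.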
Reference:
https://mp.weixin.qq.com/s?__biz=MzU1NzE0MDk1MA==&mid=2247484022&idx=1&sn=88ccf1e682496786896207dee033141e&chksm=fc3b1ccfcb4c95d904ced1f7a3a3d029e745e4b24c8a61854b3df76b4e5727cc6ef3cdd65446#rd

Common optimizers:
1. BGD (Batch Gradient Descent), SGD (Stochastic Gradient Descent), MBGD (Mini-Batch Gradient Descent)
2. Momentum & Nesterov Momentum
3. AdaGrad & Adadelta
4. RMSProp
5. Adam
6. First-order optimization & second-order optimization & L-BFGS
For a detailed analysis, see:
https://zhuanlan.zhihu.com/p/40415008
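
To give a feel for how two of the listed optimizers differ from plain gradient descent, here is a minimal sketch of a single Momentum step and a single Adam step in their commonly stated textbook forms. The function names, the hyperparameter defaults, and the toy objective f(θ) = (θ - 3)^2 are assumptions for illustration and are not taken from the linked article.

```python
import math


def momentum_step(theta, velocity, grad, alpha=0.01, mu=0.9):
    """Momentum: keep a decaying sum of past gradients and move along it."""
    velocity = mu * velocity - alpha * grad
    return theta + velocity, velocity


def adam_step(theta, m, v, grad, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam: scale the step using running averages of the gradient and its square."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    m_hat = m / (1 - beta1 ** t)     # bias correction; t counts steps from 1
    v_hat = v / (1 - beta2 ** t)
    theta = theta - alpha * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v


def grad(theta):
    """Gradient of the toy objective f(theta) = (theta - 3)^2."""
    return 2 * (theta - 3)


def run_momentum(steps=500):
    theta, velocity = 0.0, 0.0
    for _ in range(steps):
        theta, velocity = momentum_step(theta, velocity, grad(theta))
    return theta


def run_adam(steps=1000, alpha=0.1):
    theta, m, v = 0.0, 0.0, 0.0
    for t in range(1, steps + 1):
        theta, m, v = adam_step(theta, m, v, grad(theta), t, alpha=alpha)
    return theta


print("momentum: %f" % run_momentum())
print("adam:     %f" % run_adam())
```

Both runs start at θ = 0 and should end near the minimum at θ = 3. In practice these updates are applied element-wise to parameter vectors, usually through a library optimizer rather than hand-written code.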

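Below are a few GIFs that give a direct feel for the convergence speed and behavior of these algorithms:

[Figure: GIF animations comparing optimizer convergence; images not preserved]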