Gradient descent is an optimization algorithm, often also called steepest descent, and it is the most basic optimization algorithm in machine learning. I couldn't find out who invented it; if anyone knows, please tell me.
In machine learning, gradient descent is applied to problems with at least several variables, but to keep things easy to follow I'll explain it with the simplest case: a function of one variable. The gradient is then just the derivative, and for a one-variable function the derivative at a point is the slope of the tangent line at that point. As an example, take the convex function f(x) = x²/2 - 2x (the same function used in the code below); its graph on the interval [-12, 12] is shown in the figure below:
Differentiating f gives the derivative f'(x) = x - 2, so the derivative at A(-10, 70) is -12 and the derivative at B(10, 30) is 8. Let C be the minimum point on the interval [-12, 12]; obviously the derivative at C must be 0. Setting the derivative to 0 gives x = 2, and substituting back into f gives the minimum point C = (2, -2). That is the algebraic solution, so why use gradient descent at all? Isn't it more convenient to just set the derivative to 0 and solve? The reality is that not every function allows you to solve for the points where the derivative equals 0: often you can evaluate the derivative at any given point, but the equation itself cannot be solved in closed form. I'll come back to that some other time; this post focuses on understanding how the method is used.
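As a quick numeric check of the algebraic solution, here is a minimal sketch using the same function f and derivative g that appear in the code later in this post:

def f(x):  # the example function f(x) = x²/2 - 2x
    return x ** 2 / 2 - 2 * x

def g(x):  # its derivative f'(x) = x - 2
    return x - 2

print(g(-10), g(10))  # -12 and 8: the slopes at A(-10, 70) and B(10, 30)
print(g(2), f(2))     # 0 and -2: the stationary point C = (2, -2)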
So how does gradient descent find a minimum? It uses the following update rule, which says: from the current point, move in the direction the gradient decreases to get the next point, and keep iterating until a minimum point is reached:

x_{k+1} = x_k - p·d

Here x_k is the value at the current point, d is the gradient (derivative) at the current point, p is the step size of the search, and x_{k+1} is the next point to move to. When d gets close to 0, x_k is taken to be the minimum we were looking for. The step size p controls how fast we descend along the gradient, and a great many extensions of the method revolve around choosing it; more on that later.
Back to the one-variable example above. Gradient descent proceeds as follows:
(1) Choose a step size p and a threshold ε close to 0;
(2) randomly generate an initial value and assign it to x_k;
(3) compute the derivative d at x_k;
(4) if |d| < ε does not hold, compute x_{k+1} = x_k - p·d, assign x_{k+1} to x_k (i.e. x_k = x_{k+1}), and go back to (3);
(5) if |d| < ε holds, exit successfully.
The code:
def f(x):  # original function
    return x ** 2 / 2 - 2 * x

def g(x):  # derivative of f
    return x - 2

def gd(x_start, p, e, n):  # x_start: initial value, p: step size, e: threshold, n: max iterations
    x = x_start
    for i in range(n):
        grad = g(x)  # gradient (derivative) at x
        print('times = {0}, grad = {1}, x = {2}, y = {3}'.format(i, grad, x, f(x)))
        if abs(grad) < e:
            break  # convergence condition met, stop iterating
        x -= grad * p
    return x
Output of gd(-10, 0.5, 0.1, 30):
times = 0, grad = -12, x = -10, y = 70.0
times = 1, grad = -6.0, x = -4.0, y = 16.0
times = 2, grad = -3.0, x = -1.0, y = 2.5
times = 3, grad = -1.5, x = 0.5, y = -0.875
times = 4, grad = -0.75, x = 1.25, y = -1.71875
times = 5, grad = -0.375, x = 1.625, y = -1.9296875
times = 6, grad = -0.1875, x = 1.8125, y = -1.982421875
times = 7, grad = -0.09375, x = 1.90625, y = -1.99560546875
After only 7 update steps the result already meets the requirement: at times = 7 the gradient satisfies |grad| < ε. The figure below shows how x moves from x0 to x7 during these iterations.
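As a small usage check (reusing the gd and f defined above; note that calling gd prints the iteration table again), the returned point is within the tolerance of the true minimizer x = 2:

x_min = gd(-10, 0.5, 0.1, 30)
print(abs(x_min - 2) < 0.1)  # True: the stop condition |grad| < 0.1 means |x - 2| < 0.1 here
print(f(x_min))              # approximately -2, the minimum value found algebraically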
A few issues with gradient descent are worth noting.
(1) Choosing the step size p
Choosing the step size p is the central issue. If the step size is small, the iteration is sure to converge, but it takes more iterations and correspondingly more computation time. In the example above, the step size p = 0.5 needed 7 iterations, while p = 0.1 needs 46. What happens when p is chosen too large, say p = 3? Here is what gd(-10, 3, 0.1, 30) does:
times = 0, grad = -12, x = -10, y = 70.0
times = 1, grad = 24, x = 26, y = 286.0
times = 2, grad = -48, x = -46, y = 1150.0
times = 3, grad = 96, x = 98, y = 4606.0
times = 4, grad = -192, x = -190, y = 18430.0
times = 5, grad = 384, x = 386, y = 73726.0
times = 6, grad = -768, x = -766, y = 294910.0
times = 7, grad = 1536, x = 1538, y = 1179646.0
times = 8, grad = -3072, x = -3070, y = 4718590.0
times = 9, grad = 6144, x = 6146, y = 18874366.0
times = 10, grad = -12288, x = -12286, y = 75497470.0
times = 11, grad = 24576, x = 24578, y = 301989886.0
times = 12, grad = -49152, x = -49150, y = 1207959550.0
times = 13, grad = 98304, x = 98306, y = 4831838206.0
times = 14, grad = -196608, x = -196606, y = 19327352830.0
times = 15, grad = 393216, x = 393218, y = 77309411326.0
times = 16, grad = -786432, x = -786430, y = 309237645310.0
times = 17, grad = 1572864, x = 1572866, y = 1236950581246.0
times = 18, grad = -3145728, x = -3145726, y = 4947802324990.0
times = 19, grad = 6291456, x = 6291458, y = 19791209299966.0
times = 20, grad = -12582912, x = -12582910, y = 79164837199870.0
times = 21, grad = 25165824, x = 25165826, y = 316659348799486.0
times = 22, grad = -50331648, x = -50331646, y = 1266637395197950.0
times = 23, grad = 100663296, x = 100663298, y = 5066549580791806.0
times = 24, grad = -201326592, x = -201326590, y = 2.0266198323167228e+16
times = 25, grad = 402653184, x = 402653186, y = 8.106479329266893e+16
times = 26, grad = -805306368, x = -805306366, y = 3.242591731706757e+17
times = 27, grad = 1610612736, x = 1610612738, y = 1.2970366926827028e+18
times = 28, grad = -3221225472, x = -3221225470, y = 5.188146770730811e+18
times = 29, grad = 6442450944, x = 6442450946, y = 2.0752587082923246e+19
The absolute value of the gradient grad keeps growing: the iteration diverges. The reason is that when p is large, the product p·d is also large, so x_{k+1} ends up far away from x_k, possibly at a point where the gradient is even larger in magnitude than at x_k. Then, following x_{k+1} = x_k - p·d, the distance between successive points keeps growing, the gradients keep growing with it, and the iterates head off toward infinity.
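The doubling visible in the log can be checked directly. For this quadratic, one update with step size p maps the distance to the minimizer, x_k - 2, to (1 - p)(x_k - 2), so with p = 3 each step multiplies both that distance and the gradient by -2. A minimal sketch, reusing g() from above:

def step(x, p):  # one gradient-descent update
    return x - p * g(x)

x = -10
for i in range(5):
    print(i, x - 2, g(x))  # distance to the minimizer and gradient double in magnitude
    x = step(x, 3)         # p = 3: multiplier 1 - p = -2, so the iteration diverges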
Now look at the step size p = 2, i.e. gd(-10, 2, 0.1, 30):
times = 0, grad = -12, x = -10, y = 70.0
times = 1, grad = 12, x = 14, y = 70.0
times = 2, grad = -12, x = -10, y = 70.0
times = 3, grad = 12, x = 14, y = 70.0
times = 4, grad = -12, x = -10, y = 70.0
times = 5, grad = 12, x = 14, y = 70.0
times = 6, grad = -12, x = -10, y = 70.0
times = 7, grad = 12, x = 14, y = 70.0
times = 8, grad = -12, x = -10, y = 70.0
times = 9, grad = 12, x = 14, y = 70.0
times = 10, grad = -12, x = -10, y = 70.0
times = 11, grad = 12, x = 14, y = 70.0
times = 12, grad = -12, x = -10, y = 70.0
times = 13, grad = 12, x = 14, y = 70.0
times = 14, grad = -12, x = -10, y = 70.0
times = 15, grad = 12, x = 14, y = 70.0
times = 16, grad = -12, x = -10, y = 70.0
times = 17, grad = 12, x = 14, y = 70.0
times = 18, grad = -12, x = -10, y = 70.0
times = 19, grad = 12, x = 14, y = 70.0
times = 20, grad = -12, x = -10, y = 70.0
times = 21, grad = 12, x = 14, y = 70.0
times = 22, grad = -12, x = -10, y = 70.0
times = 23, grad = 12, x = 14, y = 70.0
times = 24, grad = -12, x = -10, y = 70.0
times = 25, grad = 12, x = 14, y = 70.0
times = 26, grad = -12, x = -10, y = 70.0
times = 27, grad = 12, x = 14, y = 70.0
times = 28, grad = -12, x = -10, y = 70.0
times = 29, grad = 12, x = 14, y = 70.0
Every iteration looks exactly the same: the search marks time in place. See the graph:
I tried other starting points and it behaves the same no matter where the initial value is: with p = 2 the update becomes x_{k+1} = x_k - 2(x_k - 2) = 4 - x_k, so each step simply reflects the point about x = 2 and it bounces between two values forever. For example, gd(-5, 2, 0.1, 30) oscillates between x = -5 and x = 9 in the same way.
Clearly there is a critical value s: when the step size p < s the iteration converges, when p > s it diverges, and when p = s it marks time in place. So what is s? A standard result says that if the function is differentiable and its gradient is Lipschitz continuous with constant L, then iterating with a step size no larger than 1/L guarantees that the function value never increases from one iteration to the next, and the iterates converge to a point where the gradient is 0. For the quadratic used here the derivative x - 2 is Lipschitz with L = 1, and the critical value observed in the experiments is s = 2, matching the sharper bound p < 2/L for convergence on such functions.
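A small sweep over step sizes makes the threshold visible. This is only a sketch for the quadratic above (reusing g() from the code), and "converged" simply means |grad| dropped below ε within the iteration budget:

def gd_quiet(x_start, p, e, n):
    # same iteration as gd(), just without the printing; returns (converged, iterations used)
    x = x_start
    for i in range(n):
        grad = g(x)
        if abs(grad) < e:
            return True, i
        x -= grad * p
    return False, n

for p in (0.1, 0.5, 1.0, 1.9, 2.0, 2.1, 3.0):
    converged, steps = gd_quiet(-10, p, 0.1, 1000)
    print(p, converged, steps)
# p < 2 converges (more slowly the closer p gets to 0 or to 2),
# p = 2 bounces forever, p > 2 diverges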
(2) Convergence slows down near the minimum
This is not really a flaw; every optimization algorithm shows the same behaviour. A rough analogy: in an exam it is easy to go from 10 points to 20, but very hard to go from 89 to 99; with the same amount of study, the closer you are to full marks, the less room there is to improve. Looking at the update rule x_{k+1} = x_k - p·d: as x_k gets closer to the minimum, d gets smaller and smaller and tends to 0, so p·d shrinks as well, and the change from x_k to x_{k+1} in a single iteration becomes tiny. That is what the slowdown in convergence is, and it is easy to understand.
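For the quadratic in this post the effect is easy to quantify: each iteration multiplies the gradient (and the remaining distance to the minimizer) by the constant factor |1 - p|, so the absolute progress p·d per step keeps shrinking as the minimum is approached. A minimal sketch that prints this ratio, reusing g() from above:

x, p = -10.0, 0.5
prev = abs(g(x))
for i in range(8):
    x -= p * g(x)
    print(i, abs(g(x)), abs(g(x)) / prev)  # the ratio stays at |1 - p| = 0.5
    prev = abs(g(x))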
(3) Getting trapped in a local minimum
Consider the function shown below, and find its minima on the interval [-2, 2]. Its graph:
Clearly this function has 4 local minima on [-2, 2], and the one at roughly (1.5, -4.5) is the global minimum. Let's look at what the computation does for the following 4 initial values:
A. initial value -1, step size 0.01;
B. initial value -0.2, step size 0.01;
C. initial value 0.8, step size 0.01;
D. initial value 1.5, step size 0.01.
In the figure, A0, B0, C0 and D0 are the four starting positions, and A, B, C, D are the minima found in each of the four cases. As you can see, different initial values can lead to different minimum points: gradient descent has fallen into a local optimum. For better global behaviour, it is best to run the iteration from several initial values and keep the best result, as sketched below. Getting trapped in local optima is a problem common to most optimization algorithms; there are many techniques for mitigating it, but none of them can avoid it completely.
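Here is a minimal multi-start sketch of that idea. Note that the function h below is a stand-in multi-modal example made up for illustration, not the function from the figure above:

import math

def h(x):   # stand-in multi-modal function (not the one from the figure)
    return x ** 2 / 2 + 2 * math.sin(3 * x)

def dh(x):  # its derivative
    return x + 6 * math.cos(3 * x)

def gd_h(x_start, p=0.01, e=1e-4, n=10000):
    # plain gradient descent on h, same iteration as gd() above
    x = x_start
    for _ in range(n):
        grad = dh(x)
        if abs(grad) < e:
            break
        x -= grad * p
    return x

# multi-start: run gradient descent from several initial values and keep the lowest result
starts = [-2.0, -1.0, 0.0, 1.0, 2.0]
candidates = [gd_h(s) for s in starts]
best = min(candidates, key=h)
print(best, h(best))  # different starts land in different local minima; we keep the best one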