The Gradient Descent Algorithm

Gradient descent is an optimization algorithm, often also called steepest descent, and it is one of the most basic optimization methods in machine learning. I couldn't find out who invented it; if any reader knows, please tell me.

In machine learning, gradient descent is applied to problems with at least several variables, but for ease of understanding we will use the simplest case: a function of one variable. For a one-variable function the gradient is just the derivative, and the derivative at a point is the slope of the tangent line there. As an example, take the convex function f(x) = x^2/2 - 2x; its graph on the interval [-12, 12] is shown below:

[Figure: graph of f(x) = x^2/2 - 2x on [-12, 12]]

Differentiating f gives the derivative f'(x) = x - 2, so the derivative at A(-10, 70) is -12 and the derivative at B(10, 30) is 8. Let C be the minimum point on [-12, 12]; clearly the derivative at C must be 0. Solving f'(x) = 0 gives x = 2, and substituting back gives the minimum point C = (2, -2). That is the algebraic solution, so why use gradient descent at all? Isn't setting the derivative to 0 and solving directly more convenient? In practice, not every function lets you solve for the zero of its derivative in closed form: often you can evaluate the derivative at any given point, but the equation itself cannot be solved directly. More on that another time; this article focuses on understanding how the method is used.
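For completeness, here is a minimal sketch of that algebraic route, assuming the SymPy library is available (the library and these exact calls are my assumption, not something the original post uses):

import sympy as sp

x = sp.symbols('x')
f = x ** 2 / 2 - 2 * x                 # the same function as above
critical = sp.solve(sp.diff(f, x), x)  # solve f'(x) = 0
print(critical)                        # [2]
print(f.subs(x, critical[0]))          # -2, i.e. the minimum point C = (2, -2)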

So how does gradient descent actually find the minimum? The update rule is

x_{k+1} = x_k - p*d

The formula says: from the current point, move in the direction in which the gradient decreases to get the next point, and keep iterating until the minimum is reached.

Here x_k is the current point, d is the gradient (derivative) at the current point, p is the step size of the search, and x_{k+1} is the next point. When d approaches 0, x_{k+1} is taken to be the minimum found. The step size p controls how fast gradient descent moves, and a great many extensions of the method revolve around it; more on those later.
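As a quick check of the formula, using the numbers from the run shown further below: starting at x_0 = -10 with p = 0.5, the derivative is d = f'(-10) = -12, so x_1 = -10 - 0.5 * (-12) = -4, which is exactly the second line of the output.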

Returning to the one-variable example above, the gradient descent procedure is:

(1) Choose a step size p and a threshold ε close to 0;

(2) Randomly generate an initial value and assign it to x_k;

(3) Compute the derivative d at x_k;

(4) If |d| < ε does not hold, compute x_{k+1} = x_k - p*d, assign x_{k+1} to x_k (i.e. set x_k = x_{k+1}), and go back to step (3);

(5) If |d| < ε holds, exit successfully: x_k is the minimum found.

The code is as follows:

def f(x):  # original function
    return x ** 2 / 2 - 2 * x

def g(x):  # derivative
    return x - 2

def gd(x_start, p, e, n):  # x_start: initial value, p: step size, e: threshold, n: max iterations
    x = x_start
    for i in range(n):
        grad = g(x)  # gradient (derivative) at x
        print('times = {0}, grad = {1}, x = {2}, y = {3}'.format(i, grad, x, f(x)))
        if abs(grad) < e:
            break  # convergence criterion met, stop iterating
        x -= grad * p
    return x

Output of gd(-10, 0.5, 0.1, 30):

times = 0, grad = -12, x = -10, y = 70.0
times = 1, grad = -6.0, x = -4.0, y = 16.0
times = 2, grad = -3.0, x = -1.0, y = 2.5
times = 3, grad = -1.5, x = 0.5, y = -0.875
times = 4, grad = -0.75, x = 1.25, y = -1.71875
times = 5, grad = -0.375, x = 1.625, y = -1.9296875
times = 6, grad = -0.1875, x = 1.8125, y = -1.982421875
times = 7, grad = -0.09375, x = 1.90625, y = -1.99560546875

A result meeting the tolerance is reached after only 7 update steps. The figure below shows how x moves from x0 to x7 during these iterations.
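The original figure is not reproduced here; the following is a minimal sketch of how such a plot could be drawn, assuming numpy and matplotlib are available (both libraries are my assumption, not part of the original code):

import numpy as np
import matplotlib.pyplot as plt

def f(x):
    return x ** 2 / 2 - 2 * x

# Recompute the iterates x0..x7 with step size p = 0.5.
xs = [-10.0]
for _ in range(7):
    xs.append(xs[-1] - 0.5 * (xs[-1] - 2))

grid = np.linspace(-12, 12, 200)
plt.plot(grid, f(grid), label='f(x) = x^2/2 - 2x')
plt.plot(xs, [f(x) for x in xs], 'ro-', label='iterates x0..x7')
plt.legend()
plt.show()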

Gradient descent has a few issues worth noting.

(1) Choosing the step size p

Choosing the step size p is a core issue. If p is chosen small enough the method will converge, but the number of iterations grows and so does the computation time. In the example above, step size p = 0.5 needed 7 iterations, while p = 0.1 needs 46. But when p is chosen too large, say p = 3, look at what gd(-10, 3, 0.1, 30) does:

times = 0, grad = -12, x = -10, y = 70.0
times = 1, grad = 24, x = 26, y = 286.0
times = 2, grad = -48, x = -46, y = 1150.0
times = 3, grad = 96, x = 98, y = 4606.0
times = 4, grad = -192, x = -190, y = 18430.0
times = 5, grad = 384, x = 386, y = 73726.0
times = 6, grad = -768, x = -766, y = 294910.0
times = 7, grad = 1536, x = 1538, y = 1179646.0
times = 8, grad = -3072, x = -3070, y = 4718590.0
times = 9, grad = 6144, x = 6146, y = 18874366.0
times = 10, grad = -12288, x = -12286, y = 75497470.0
times = 11, grad = 24576, x = 24578, y = 301989886.0
times = 12, grad = -49152, x = -49150, y = 1207959550.0
times = 13, grad = 98304, x = 98306, y = 4831838206.0
times = 14, grad = -196608, x = -196606, y = 19327352830.0
times = 15, grad = 393216, x = 393218, y = 77309411326.0
times = 16, grad = -786432, x = -786430, y = 309237645310.0
times = 17, grad = 1572864, x = 1572866, y = 1236950581246.0
times = 18, grad = -3145728, x = -3145726, y = 4947802324990.0
times = 19, grad = 6291456, x = 6291458, y = 19791209299966.0
times = 20, grad = -12582912, x = -12582910, y = 79164837199870.0
times = 21, grad = 25165824, x = 25165826, y = 316659348799486.0
times = 22, grad = -50331648, x = -50331646, y = 1266637395197950.0
times = 23, grad = 100663296, x = 100663298, y = 5066549580791806.0
times = 24, grad = -201326592, x = -201326590, y = 2.0266198323167228e+16
times = 25, grad = 402653184, x = 402653186, y = 8.106479329266893e+16
times = 26, grad = -805306368, x = -805306366, y = 3.242591731706757e+17
times = 27, grad = 1610612736, x = 1610612738, y = 1.2970366926827028e+18
times = 28, grad = -3221225472, x = -3221225470, y = 5.188146770730811e+18
times = 29, grad = 6442450944, x = 6442450946, y = 2.0752587082923246e+19

The absolute value of the gradient keeps growing: the iteration diverges. The reason is that p is too large, so the step p*d is large and x_{k+1} lands far from x_k, possibly at a point whose gradient is even larger in magnitude than the gradient at x_k. The distance between successive iterates then keeps growing, the gradient grows with it, and the iterates run off to infinity.
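For this particular function the divergence can be computed exactly: with p = 3 the update is x_{k+1} = x_k - 3*(x_k - 2) = -2*x_k + 6, so x_{k+1} - 2 = -2*(x_k - 2). The distance to the minimum at x = 2 doubles (and flips sign) on every step, which is exactly what the output shows: grad = x - 2 goes -12, 24, -48, 96, and so on.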

Now consider step size p = 2, and look at what gd(-10, 2, 0.1, 30) does:

times = 0, grad = -12, x = -10, y = 70.0
times = 1, grad = 12, x = 14, y = 70.0
times = 2, grad = -12, x = -10, y = 70.0
times = 3, grad = 12, x = 14, y = 70.0
times = 4, grad = -12, x = -10, y = 70.0
times = 5, grad = 12, x = 14, y = 70.0
times = 6, grad = -12, x = -10, y = 70.0
times = 7, grad = 12, x = 14, y = 70.0
times = 8, grad = -12, x = -10, y = 70.0
times = 9, grad = 12, x = 14, y = 70.0
times = 10, grad = -12, x = -10, y = 70.0
times = 11, grad = 12, x = 14, y = 70.0
times = 12, grad = -12, x = -10, y = 70.0
times = 13, grad = 12, x = 14, y = 70.0
times = 14, grad = -12, x = -10, y = 70.0
times = 15, grad = 12, x = 14, y = 70.0
times = 16, grad = -12, x = -10, y = 70.0
times = 17, grad = 12, x = 14, y = 70.0
times = 18, grad = -12, x = -10, y = 70.0
times = 19, grad = 12, x = 14, y = 70.0
times = 20, grad = -12, x = -10, y = 70.0
times = 21, grad = 12, x = 14, y = 70.0
times = 22, grad = -12, x = -10, y = 70.0
times = 23, grad = 12, x = 14, y = 70.0
times = 24, grad = -12, x = -10, y = 70.0
times = 25, grad = 12, x = 14, y = 70.0
times = 26, grad = -12, x = -10, y = 70.0
times = 27, grad = 12, x = 14, y = 70.0
times = 28, grad = -12, x = -10, y = 70.0
times = 29, grad = 12, x = 14, y = 70.0

Every step looks the same: the iterate just bounces back and forth between x = -10 and x = 14 and makes no progress. See the graph:

I tried a few starting points; no matter where the initial value is, the same thing happens. See, for example, gd(-5, 2, 0.1, 30).
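This can also be seen directly from the update rule: with p = 2 we get x_{k+1} = x_k - 2*(x_k - 2) = -x_k + 4, so x_{k+1} - 2 = -(x_k - 2). The distance to the minimum never changes, only its sign flips, so any starting point just oscillates between two mirror points around x = 2 (for x_0 = -10 these are -10 and 14; for x_0 = -5 they are -5 and 9).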

Clearly there is a critical value s: for step sizes p < s the iteration converges, for p > s it diverges, and at p = s it oscillates in place. So what is this critical value? If the function is differentiable and its gradient is Lipschitz continuous with constant L, then iterating with a step size smaller than 1/L guarantees that the function value never increases from one iteration to the next, and the iterates eventually converge to a point where the gradient is 0. For this example f'(x) = x - 2, so L = 1; in fact for this quadratic any step size below 2/L converges, which matches the observed critical value s = 2.
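As an illustration, here is a minimal sketch that repeats the iteration of gd above (with printing removed so only the iteration count is reported) to scan a few step sizes around the critical value s = 2; the helper name count_iters is mine, not from the original post:

def count_iters(x_start, p, e, n):
    # Same loop as gd above, but returns how many updates were needed
    # (or n if the threshold was never reached).
    x = x_start
    for i in range(n):
        grad = x - 2  # derivative of f(x) = x^2/2 - 2x
        if abs(grad) < e:
            return i
        x -= grad * p
    return n

for p in (0.1, 0.5, 1.0, 1.5, 1.9, 2.0, 2.1):
    print('p = {0}: {1} iterations (cap 100)'.format(p, count_iters(-10, p, 0.1, 100)))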

(2) Convergence slows down near the minimum

This is not really a flaw; every optimization algorithm shows this behaviour. A rough analogy: in an exam, going from 10 points to 20 is easy, but going from 89 to 99 is very hard; given the same study time, the closer you are to a perfect score, the less room there is to improve. Looking at the update rule x_{k+1} = x_k - p*d: as x_k approaches the minimum, d gets smaller and approaches 0, so p*d also shrinks, and the change from x_k to x_{k+1} becomes smaller and smaller with each iteration. Convergence therefore appears to slow down, which is easy to understand.
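For this quadratic the slowdown can be quantified: x_{k+1} - 2 = (1 - p)*(x_k - 2), so with p = 0.5 the distance to the minimum is halved on every step. That is exactly what the first run shows, with grad going -12, -6, -3, -1.5, ...: each step covers only half of what remains, so the last small fraction of the way takes as many steps as everything before it.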

(3) Getting trapped in a local minimum

Now look at a function with several dips and the problem of finding its minimum on the interval [-2, 2]. Its graph is shown below:

[Figure: graph of the function on [-2, 2], showing four local minima, the lowest near (1.5, -4.5)]

Clearly this function has 4 local minimum points on [-2, 2], and the one at roughly (1.5, -4.5) is the global minimum. Look at the runs from the following 4 initial values:

A. initial value -1, step size 0.01;

B. initial value -0.2, step size 0.01;

C. initial value 0.8, step size 0.01;

D. initial value 1.5, step size 0.01.

[Figure: the four starting points A0, B0, C0, D0 and the local minima A, B, C, D that each run converges to]

In the figure, A0, B0, C0 and D0 are the 4 starting positions, and A, B, C, D are the minima found in each case. Different initial values can lead to different minima: gradient descent gets stuck in a local optimum. For better global behaviour it is best to run the iteration from several initial values and keep the best result. Getting trapped in local optima is a common problem for optimization algorithms in general; different algorithms offer many ways of mitigating it, but none can avoid it completely.
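The formula of the function used above is not given in the text, so the sketch below uses a stand-in non-convex function, f2(x) = x^2 + 2*sin(5x), chosen by me purely for illustration (the function, its derivative g2, and the helper names gd_quiet and starts are all my assumptions). It shows the multi-start idea: run the same gradient descent loop from several initial values and keep the lowest result.

import math

def f2(x):
    # Stand-in non-convex function with several local minima on [-2, 2]
    # (this formula is my own choice, not the one from the original post).
    return x ** 2 + 2 * math.sin(5 * x)

def g2(x):
    # Its derivative.
    return 2 * x + 10 * math.cos(5 * x)

def gd_quiet(x_start, p, e, n):
    # Same iteration as gd above, just without printing.
    x = x_start
    for _ in range(n):
        grad = g2(x)
        if abs(grad) < e:
            break
        x -= grad * p
    return x

# Multi-start: run from several initial values and keep the lowest f2 value.
starts = [-1.5, -0.5, 0.5, 1.5]
results = [(gd_quiet(s, 0.01, 1e-4, 10000), s) for s in starts]
best_x, best_start = min(results, key=lambda r: f2(r[0]))
for x, s in results:
    print('start = {0:5.2f} -> x = {1:.4f}, f2(x) = {2:.4f}'.format(s, x, f2(x)))
print('best start: {0}, best x: {1:.4f}'.format(best_start, best_x))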



