【PyTorch】4.3 Optimizer (Part 2)


Task overview:

Learn the most commonly used optimizer: torch.optim.SGD

Details:

Take a closer look at how the learning rate and momentum work in gradient descent, analyze how these two parameters (LR and momentum) affect the optimization process, and finish with optim.SGD and a brief overview of PyTorch's ten optimizers.

I. Learning rate

1. A problem with gradient descent

Gradient descent is supposed to find the minimum of a function, but here the value of y not only fails to decrease, it keeps growing.
[Figure: gradient descent steps where y keeps increasing]

2. Demonstrating gradient descent in code

Plot the function:

import torch
import numpy as np
import matplotlib.pyplot as plt
torch.manual_seed(1)


def func(x_t):
    """
    y = (2x)^2 = 4*x^2      dy/dx = 8x
    """
    return torch.pow(2*x_t, 2)


# init
x = torch.tensor([2.], requires_grad=True)

# ------------------------------ plot data ------------------------------
# flag = 0
flag = 1
if flag:

    x_t = torch.linspace(-3, 3, 100)
    y = func(x_t)
    plt.plot(x_t.numpy(), y.numpy(), label="y = 4*x^2")
    plt.grid()
    plt.xlabel("x")
    plt.ylabel("y")
    plt.legend()
    plt.show()

Output:
[Figure: plot of y = 4*x^2]
Test the gradient descent process:

# ------------------------------ gradient descent ------------------------------
# flag = 0
flag = 1
if flag:
    iter_rec, loss_rec, x_rec = list(), list(), list()

    lr = 1    # learning rates tried: 1. / .5 / .2 / .1 / .125
    max_iteration = 4   # 4 iterations for lr = 1. and lr = .5; 20 or 200 for lr = .2

    for i in range(max_iteration):

        y = func(x)
        y.backward()

        print("Iter:{}, X:{:8}, X.grad:{:8}, loss:{:10}".format(
            i, x.detach().numpy()[0], x.grad.detach().numpy()[0], y.item()))

        x_rec.append(x.item())

        x.data.sub_(lr * x.grad)    # in-place update: x = x - lr * x.grad (lr values 0.5 / 0.2 / 0.1 / 0.125 were also tried)
        x.grad.zero_()

        iter_rec.append(i)
        loss_rec.append(y.item())    # store a plain float so matplotlib can plot it

    plt.subplot(121).plot(iter_rec, loss_rec, '-ro')
    plt.xlabel("Iteration")
    plt.ylabel("Loss value")

    x_t = torch.linspace(-3, 3, 100)
    y = func(x_t)
    plt.subplot(122).plot(x_t.numpy(), y.numpy(), label="y = 4*x^2")
    plt.grid()
    y_rec = [func(torch.tensor(i)).item() for i in x_rec]
    plt.subplot(122).plot(x_rec, y_rec, '-ro')
    plt.legend()
    plt.show()

Output:

Iter:0, X:     2.0, X.grad:    16.0, loss:      16.0
Iter:1, X:   -14.0, X.grad:  -112.0, loss:     784.0
Iter:2, X:    98.0, X.grad:   784.0, loss:   38416.0
Iter:3, X:  -686.0, X.grad: -5488.0, loss: 1882384.0

The gradient explodes.
[Figure: left, the loss curve; right, the gradient descent trajectory on y = 4*x^2]

Why does the gradient explode?

Answer: the step $g(w_i)$ subtracted from $w_i$ is too large in scale, so the parameter $w_{i+1}$ grows larger and larger in magnitude and the function value cannot decrease.
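
As a worked check (my own derivation from the toy function above, not part of the original post): for $y = 4x^2$ the gradient is $g(x) = 8x$, so one gradient descent step gives

$x_{i+1} = x_i - \mathrm{lr} \cdot 8x_i = (1 - 8\,\mathrm{lr})\,x_i$

With lr = 1 this becomes $x_{i+1} = -7x_i$, so starting from $x_0 = 2$ the iterates are $2, -14, 98, -686, \dots$, exactly the diverging sequence printed above.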

3. Learning rate

[Figure: the learning rate scales the update step: w_{i+1} = w_i - lr * g(w_i)]
Set the learning rate:

lr = 0.2

Number of iterations:

max_iteration = 20

Output:

Iter:0, X:     2.0, X.grad:    16.0, loss:      16.0
Iter:1, X:-1.2000000476837158, X.grad:-9.600000381469727, loss:5.760000228881836
Iter:2, X:0.7200000286102295, X.grad:5.760000228881836, loss:2.0736000537872314
Iter:3, X:-0.4320000410079956, X.grad:-3.456000328063965, loss:0.7464961409568787
Iter:4, X:0.2592000365257263, X.grad:2.0736002922058105, loss:0.26873862743377686
Iter:5, X:-0.1555200219154358, X.grad:-1.2441601753234863, loss:0.09674590826034546
Iter:6, X:0.09331201016902924, X.grad:0.7464960813522339, loss:0.03482852503657341
Iter:7, X:-0.05598720908164978, X.grad:-0.44789767265319824, loss:0.012538270093500614
Iter:8, X:0.03359232842922211, X.grad:0.26873862743377686, loss:0.004513778258115053
Iter:9, X:-0.020155396312475204, X.grad:-0.16124317049980164, loss:0.0016249599866569042
Iter:10, X:0.012093238532543182, X.grad:0.09674590826034546, loss:0.0005849856534041464
Iter:11, X:-0.007255943492054939, X.grad:-0.058047547936439514, loss:0.000210594866075553
Iter:12, X:0.0043535660952329636, X.grad:0.03482852876186371, loss:7.581415411550552e-05
Iter:13, X:-0.0026121395640075207, X.grad:-0.020897116512060165, loss:2.729309198912233e-05
Iter:14, X:0.001567283645272255, X.grad:0.01253826916217804, loss:9.825512279348914e-06
Iter:15, X:-0.0009403701405972242, X.grad:-0.007522961124777794, loss:3.537184056767728e-06
Iter:16, X:0.0005642221076413989, X.grad:0.004513776861131191, loss:1.2733863741232199e-06
Iter:17, X:-0.00033853325294330716, X.grad:-0.0027082660235464573, loss:4.584190662626497e-07
Iter:18, X:0.00020311994012445211, X.grad:0.001624959520995617, loss:1.6503084054875217e-07
Iter:19, X:-0.00012187196989543736, X.grad:-0.0009749757591634989, loss:5.941110714502429e-08

[Figure: loss curve and descent trajectory with lr = 0.2]
Set the learning rate:

lr = 0.01

Output:

Iter:0, X:     2.0, X.grad:    16.0, loss:      16.0
Iter:1, X:1.840000033378601, X.grad:14.720000267028809, loss:13.542400360107422
Iter:2, X:1.6928000450134277, X.grad:13.542400360107422, loss:11.462287902832031
Iter:3, X:1.5573760271072388, X.grad:12.45900821685791, loss:9.701680183410645
Iter:4, X:1.432785987854004, X.grad:11.462287902832031, loss:8.211503028869629
Iter:5, X:1.3181631565093994, X.grad:10.545305252075195, loss:6.950216293334961
Iter:6, X:1.2127101421356201, X.grad:9.701681137084961, loss:5.882663726806641
Iter:7, X:1.1156933307647705, X.grad:8.925546646118164, loss:4.979086399078369
Iter:8, X:1.0264378786087036, X.grad:8.211503028869629, loss:4.214298725128174
Iter:9, X:0.9443228244781494, X.grad:7.554582595825195, loss:3.5669822692871094
Iter:10, X:0.8687769770622253, X.grad:6.950215816497803, loss:3.0190937519073486
Iter:11, X:0.7992748022079468, X.grad:6.394198417663574, loss:2.555360794067383
Iter:12, X:0.7353328466415405, X.grad:5.882662773132324, loss:2.1628575325012207
Iter:13, X:0.6765062212944031, X.grad:5.412049770355225, loss:1.8306427001953125
Iter:14, X:0.6223857402801514, X.grad:4.979085922241211, loss:1.549456000328064
Iter:15, X:0.5725948810577393, X.grad:4.580759048461914, loss:1.3114595413208008
Iter:16, X:0.526787281036377, X.grad:4.214298248291016, loss:1.110019326210022
Iter:17, X:0.4846442937850952, X.grad:3.8771543502807617, loss:0.9395203590393066
Iter:18, X:0.4458727538585663, X.grad:3.5669820308685303, loss:0.795210063457489
Iter:19, X:0.41020292043685913, X.grad:3.281623363494873, loss:0.673065721988678

[Figure: loss curve and descent trajectory with lr = 0.01]
Test multiple learning rates:

# ------------------------------ multi learning rate ------------------------------

# flag = 0
flag = 1
if flag:
    iteration = 100
    num_lr = 10
    lr_min, lr_max = 0.01, 0.5  # .5 .3 .2

    lr_list = np.linspace(lr_min, lr_max, num=num_lr).tolist()
    loss_rec = [[] for l in range(len(lr_list))]
    iter_rec = list()

    for i, lr in enumerate(lr_list):
        x = torch.tensor([2.], requires_grad=True)
        for iter in range(iteration):

            y = func(x)
            y.backward()
            x.data.sub_(lr * x.grad)  # x.data -= lr * x.grad
            x.grad.zero_()

            loss_rec[i].append(y.item())

    for i, loss_r in enumerate(loss_rec):
        plt.plot(range(len(loss_r)), loss_r, label="LR: {}".format(lr_list[i]))
    plt.legend()
    plt.xlabel('Iterations')
    plt.ylabel('Loss value')
    plt.show()

Output:
[Figure: loss curves for 10 learning rates between 0.01 and 0.5]
For the larger learning rates the loss spikes instead of decreasing.

Reduce the maximum learning rate:

lr_min, lr_max = 0.01, 0.2

[Figure: loss curves for 10 learning rates between 0.01 and 0.2]
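
A quick sanity check (derived from the same toy function, not stated in the original post): each step multiplies $x$ by $(1 - 8\,\mathrm{lr})$, so the iteration converges exactly when $|1 - 8\,\mathrm{lr}| < 1$, i.e. $0 < \mathrm{lr} < 0.25$. That is why sampling learning rates up to 0.5 produces diverging runs, while every value in $[0.01, 0.2]$ converges; $\mathrm{lr} = 0.125$ would even reach the minimum in a single step because $1 - 8 \times 0.125 = 0$.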

II. Momentum

[Figures: exponentially weighted average and momentum update]
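For reference, the standard exponentially weighted average (my wording, summarizing the usual formula rather than quoting the slides) is

$v_t = \beta\, v_{t-1} + (1-\beta)\,\theta_t$

Unrolling the recursion gives $v_t = \sum_{i=0}^{t} (1-\beta)\,\beta^{\,t-i}\,\theta_i$, so the weight attached to a value that is $n$ steps in the past is $(1-\beta)\,\beta^{\,n}$, which is exactly what exp_w_func below computes and plots.
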
Test code:

import torch
import numpy as np
import torch.optim as optim
import matplotlib.pyplot as plt
torch.manual_seed(1)


def exp_w_func(beta, time_list):
    # weight given to a value that is t steps in the past: (1 - beta) * beta^t
    return [(1 - beta) * np.power(beta, exp) for exp in time_list]


beta = 0.9
num_point = 100
time_list = np.arange(num_point).tolist()

# ------------------------------ exponential weight ------------------------------
# flag = 0
flag = 1
if flag:
    weights = exp_w_func(beta, time_list)

    plt.plot(time_list, weights, '-ro', label="Beta: {}\ny = B^t * (1-B)".format(beta))
    plt.xlabel("time")
    plt.ylabel("weight")
    plt.legend()
    plt.title("exponentially weighted average")
    plt.show()

    print(np.sum(weights))

Output:

0.9999734386011124

[Figure: exponentially weighted average weights for beta = 0.9]
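
A quick check of the printed sum (my own arithmetic, not in the original post): the weights form a geometric series, so

$\sum_{t=0}^{99} (1-\beta)\,\beta^{t} = 1 - \beta^{100} = 1 - 0.9^{100} \approx 0.999973$

which matches the value printed above; with infinitely many terms the weights would sum to exactly 1.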
Test the weight curves for different values of beta:

# ------------------------------ multi weights ------------------------------
# flag = 0
flag = 1
if flag:
    beta_list = [0.98, 0.95, 0.9, 0.8]
    w_list = [exp_w_func(beta, time_list) for beta in beta_list]
    for i, w in enumerate(w_list):
        plt.plot(time_list, w, label="Beta: {}".format(beta_list[i]))
        plt.xlabel("time")
        plt.ylabel("weight")
    plt.legend()
    plt.show()

Output:

The hyperparameter $\beta$ controls the memory horizon of the average: the larger $\beta$ is, the further back in time it remembers.
[Figures: weight curves for beta = 0.98, 0.95, 0.9 and 0.8]
For more background on momentum, see Deep Learning from Scratch (《深度学习入门》), p. 168.
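
For context, PyTorch's documented update rule for SGD with momentum (summarized here in my own words, not quoted from the post) is

$v_{t+1} = m \cdot v_t + g(w_t), \qquad w_{t+1} = w_t - \mathrm{lr} \cdot v_{t+1}$

so past gradients keep contributing to the current step. This is what lets the momentum run below converge faster, and also what makes it oscillate when $m$ is too large.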

Compare plain SGD with SGD plus momentum:

# ------------------------------ SGD momentum ------------------------------
# flag = 0
flag = 1
if flag:
    def func(x):
        return torch.pow(2*x, 2)    # y = (2x)^2 = 4*x^2        dy/dx = 8x

    iteration = 100
    m = 0.     # .9 .63

    lr_list = [0.01, 0.03]

    momentum_list = list()
    loss_rec = [[] for l in range(len(lr_list))]
    iter_rec = list()

    for i, lr in enumerate(lr_list):
        x = torch.tensor([2.], requires_grad=True)

        momentum = 0. if lr == 0.03 else m  # the lr=0.03 run uses plain SGD; the lr=0.01 run uses momentum m
        momentum_list.append(momentum)

        optimizer = optim.SGD([x], lr=lr, momentum=momentum)

        for iter in range(iteration):

            y = func(x)
            y.backward()

            optimizer.step()
            optimizer.zero_grad()

            loss_rec[i].append(y.item())

    for i, loss_r in enumerate(loss_rec):
        plt.plot(range(len(loss_r)), loss_r, label="LR: {} M:{}".format(lr_list[i], momentum_list[i]))
    plt.legend()
    plt.xlabel('Iterations')
    plt.ylabel('Loss value')
    plt.show()

Output:
[Figure: loss curves for lr = 0.01 and lr = 0.03, both with momentum 0]
Change the momentum:

m = 0.9

Output:
[Figure: loss curves, lr = 0.01 with momentum 0.9 versus lr = 0.03 without momentum]
With momentum the loss reaches a small value faster than plain SGD, but because the momentum coefficient is too large, each step is dominated by the previous steps and the loss oscillates.

Try a more suitable value:

m = 0.63

Output:
[Figure: loss curves, lr = 0.01 with momentum 0.63 versus lr = 0.03 without momentum]

III. torch.optim.SGD

[Figure: torch.optim.SGD and its main parameters]
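
A minimal usage sketch (the argument list is optim.SGD's public signature; the model and data here are made up purely for illustration):

import torch
import torch.nn as nn
import torch.optim as optim

# a tiny made-up model, just to have parameters to optimize
model = nn.Linear(10, 1)

# main arguments of optim.SGD:
#   params       - iterable of parameters (or parameter-group dicts) to optimize
#   lr           - learning rate
#   momentum     - momentum coefficient (default 0)
#   dampening    - dampening applied to the gradient inside the momentum buffer (default 0)
#   weight_decay - L2 penalty coefficient (default 0)
#   nesterov     - whether to use Nesterov momentum (default False)
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9, weight_decay=1e-4)

# one optimization step on made-up data
x, y = torch.randn(4, 10), torch.randn(4, 1)
loss = nn.functional.mse_loss(model(x), y)

optimizer.zero_grad()   # clear old gradients
loss.backward()         # compute new gradients
optimizer.step()        # update the parameters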

IV. PyTorch's ten optimizers

[Figure: overview of the ten optimizers provided by torch.optim]
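
The slide itself is not reproduced here. As a rough substitute, the snippet below simply lists the optimizer classes exposed by torch.optim in the installed version; assuming the slide covers the classic set (SGD, ASGD, Rprop, Adagrad, Adadelta, RMSprop, Adam, Adamax, SparseAdam, LBFGS), these names should all appear in the output:

import torch.optim as optim

# list every Optimizer subclass that this PyTorch version exposes in torch.optim
optimizer_classes = [name for name in dir(optim)
                     if isinstance(getattr(optim, name), type)
                     and issubclass(getattr(optim, name), optim.Optimizer)
                     and name != "Optimizer"]
print(optimizer_classes)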
Reference papers:
[Figure: original papers for the optimizers]
