The Mathematics of Gradient Descent and a Python Implementation

一只程序猿林

Published 2023-01-14 14:25:09

Tags: algorithms, python, pytorch, artificial intelligence

Article link: https://blog.csdn.net/qq_53383206/article/details/128683077

0 Introduction

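Many problems in artificial intelligence come down to solving an optimization problem, which in practice is a search process. Search techniques divide into global search and local search; gradient descent (GD) is a local search algorithm based on first-order derivatives. Because it is simple and easy to use, gradient descent has become one of the core algorithms of deep learning.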

1 Mathematical derivation of gradient descent (GD)

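Gradient descent can be derived from the Taylor expansion.
Suppose $f(x)$ can be expanded in a Taylor series at every point of the real line. Then its Taylor expansion at $x_{t}$ is:
$\begin{aligned} f(x)=\sum\limits_{k=0}^{\infty}\frac{f^{(k)}(x_{t})}{k!}{(x-x_{t})^k}\tag{1} \end{aligned}$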
Dropping the terms of order $k \ge 2$ gives the first-order approximation:
$\begin{aligned} f(x)\approx f(x_{t})+f^{'}(x_{t})(x-x_{t})\tag{2} \end{aligned}$
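As a quick numerical illustration of Eq. $(2)$, the sketch below compares $\sin(x)$ with its first-order approximation around $x_{t}=1$. The helper name `taylor_first_order` and the chosen step sizes are assumptions made for this example only; the point is that the error shrinks rapidly as $x$ approaches $x_{t}$.

```python
import math


def taylor_first_order(f, df, x_t, x):
    """First-order Taylor approximation of f around x_t, evaluated at x (Eq. 2)."""
    return f(x_t) + df(x_t) * (x - x_t)


x_t = 1.0
for step in (0.5, 0.1, 0.01):
    x = x_t + step
    exact = math.sin(x)
    # The derivative of sin is cos, so df = math.cos
    approx = taylor_first_order(math.sin, math.cos, x_t, x)
    print(f"step={step:<5} exact={exact:.6f} approx={approx:.6f} "
          f"error={abs(exact - approx):.2e}")
```

The error falls roughly with the square of the step, which is why the derivation only trusts the approximation for points close to $x_{t}$.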
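Moving from the continuous to the discrete setting, let $x=x_{t+1}$ be a point close to $x_{t}$, substitute it into Eq. $(2)$, and treat the approximation as an equality: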
$\begin{aligned} f(x_{t+1})=f(x_{t})+f^{'}(x_{t})(x_{t+1}-x_{t})\tag{3} \end{aligned}$
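Generalizing Eq. $(3)$ to a vector argument $X$ gives:
$\begin{aligned} f(X_{t+1})=f(X_{t})+ (X_{t+1}-X_{t})^{\top}\nabla f(X_{t})\tag{4} \end{aligned}$
Gradient descent requires each step to decrease the objective, so $f(X_{t+1})<f(X_{t})$. By Eq. $(4)$ this means $(X_{t+1}-X_{t})^{\top}\nabla f(X_{t})<0$, so the angle between the vectors $X_{t+1}-X_{t}$ and $\nabla f(X_{t})$ is obtuse. For a step of fixed length, $f(X_{t+1})-f(X_{t})$ is most negative when that angle is a straight angle, $180^{\circ}$, that is, when the step points in the direction opposite to the gradient.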
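The direction argument can be checked numerically. The sketch below takes a hypothetical 2-D gradient $(3, 4)$, scans unit-length steps in 1-degree increments measured from the gradient direction, and looks for the step whose inner product with the gradient is smallest. The gradient value and the helper name `directional_change` are illustrative assumptions, not part of the original article.

```python
import math


def directional_change(grad, angle_deg):
    """Inner product of a unit step at the given absolute angle with grad."""
    theta = math.radians(angle_deg)
    step = (math.cos(theta), math.sin(theta))
    return step[0] * grad[0] + step[1] * grad[1]


grad = (3.0, 4.0)  # hypothetical gradient, norm 5
grad_angle = math.degrees(math.atan2(grad[1], grad[0]))

# Offsets are measured relative to the gradient direction
changes = {offset: directional_change(grad, grad_angle + offset)
           for offset in range(360)}
best_offset = min(changes, key=changes.get)
print(best_offset, changes[best_offset])
```

The smallest value occurs at an offset of 180 degrees and is close to $-5$, the negative of the gradient norm, which matches the claim that the steepest decrease is opposite to the gradient.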
Two vectors that point in opposite directions differ by a negative scalar factor, so:
$\begin{aligned} X_{t+1}-X_{t}=-\gamma\nabla f(X_{t})\tag{5} \end{aligned}$
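where $\gamma$ is a positive number, called the learning rate in machine learning, that controls the size of each descent step. Eq. $(5)$, rewritten as $X_{t+1}=X_{t}-\gamma\nabla f(X_{t})$, updates the parameters $X$; iterating this update is the gradient descent algorithm.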
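The update rule of Eq. $(5)$ can be sketched in a few lines of plain Python. The quadratic test function, its minimum at $(1, -0.5)$, the starting point, and the learning rate are all assumptions chosen for illustration; the function name `gradient_descent` is likewise hypothetical.

```python
def gradient_descent(grad_f, x0, learning_rate, num_steps):
    """Repeatedly apply X_{t+1} = X_t - gamma * grad f(X_t) (Eq. 5)."""
    x = list(x0)
    for _ in range(num_steps):
        g = grad_f(x)
        x = [xi - learning_rate * gi for xi, gi in zip(x, g)]
    return x


def f(p):
    # Illustrative convex objective with its minimum at (1, -0.5)
    return (p[0] - 1.0) ** 2 + 2.0 * (p[1] + 0.5) ** 2


def grad_f(p):
    # Analytic gradient of f
    return [2.0 * (p[0] - 1.0), 4.0 * (p[1] + 0.5)]


x_final = gradient_descent(grad_f, x0=[5.0, 5.0], learning_rate=0.1, num_steps=200)
print(x_final, f(x_final))
```

With this learning rate each coordinate's distance to the minimum shrinks by a constant factor per step, so the iterate ends up very close to $(1, -0.5)$. A learning rate that is too large would make the iterates diverge instead.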

2 Fitting a function with gradient descent

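Problem setup: use gradient descent to fit the function $g(x)=\sin(x)$ with the third-degree polynomial $f(x)=a+bx+cx^{2}+dx^{3}$ (coefficients ordered as in the code below). Take $N=2000$ equally spaced sample points $x_{n}$ on $[-\pi,\pi]$ and evaluate $g$ at them. The loss function is: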
$\begin{aligned} L(a,b,c,d)=\frac{1}{N}\sum\limits_{n}[f(x_{n})-g(x_{n})]^2\tag{6} \end{aligned}$
The fitting problem then becomes the minimization problem:
$\begin{aligned} \min_{a,b,c,d}\ L(a,b,c,d)\tag{7} \end{aligned}$
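Applying gradient descent gives the vectorized update rule for $a, b, c, d$, with the partial derivatives evaluated at step $t$. The implementations below minimize the plain sum of squared errors, i.e. they drop the constant factor $\frac{1}{N}$ from Eq. $(6)$; this only rescales the gradient and is absorbed into the learning rate: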
$\begin{aligned} \left( \begin{matrix} a \\b \\c \\ d \end{matrix} \right)_{t+1}=\left( \begin{matrix} a \\b \\c \\ d \end{matrix} \right)_{t}-\gamma \left( \begin{matrix} \frac{\partial L}{\partial a} \\ \\\frac{\partial L}{\partial b} \\ \\\frac{\partial L}{\partial c} \\ \\\frac{\partial L}{\partial d} \end{matrix} \right)\tag{8} \end{aligned}$
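For reference, the partial derivatives in Eq. $(8)$ can be written out explicitly. The sketch below assumes the coefficient ordering used in the code, $f(x)=a+bx+cx^{2}+dx^{3}$, and the mean-squared loss of Eq. $(6)$; each line follows from the chain rule applied to $[f(x_{n})-g(x_{n})]^{2}$:
$\begin{aligned} \frac{\partial L}{\partial a}&=\frac{2}{N}\sum\limits_{n}[f(x_{n})-g(x_{n})] \\ \frac{\partial L}{\partial b}&=\frac{2}{N}\sum\limits_{n}[f(x_{n})-g(x_{n})]\,x_{n} \\ \frac{\partial L}{\partial c}&=\frac{2}{N}\sum\limits_{n}[f(x_{n})-g(x_{n})]\,x_{n}^{2} \\ \frac{\partial L}{\partial d}&=\frac{2}{N}\sum\limits_{n}[f(x_{n})-g(x_{n})]\,x_{n}^{3} \end{aligned}$
Up to the factor $\frac{1}{N}$, these are the quantities the NumPy implementation computes as `grad_a`, `grad_b`, `grad_c` and `grad_d`.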

2.1 Implementing gradient descent with NumPy

import math
import time

import matplotlib.pyplot as plt
import numpy as np


def numpy_GD():
    # Sample sin(x) at 2000 evenly spaced points on [-pi, pi]
    x = np.linspace(-math.pi, math.pi, 2000)
    y = np.sin(x)

    # Randomly initialize weights
    a = np.random.randn()
    b = np.random.randn()
    c = np.random.randn()
    d = np.random.randn()

    learning_rate = 1e-6
    t0 = time.time()
    for t in range(2000):
        # Forward pass: compute predicted y
        # y = a + b x + c x^2 + d x^3
        y_pred = a + b * x + c * x ** 2 + d * x ** 3

        # Compute and print loss
        loss = np.square(y_pred - y).sum()
        if t % 100 == 99:
            print(t, loss)

        # Gradients of the loss with respect to a, b, c, d (chain rule through y_pred)
        grad_y_pred = 2.0 * (y_pred - y)
        grad_a = grad_y_pred.sum()
        grad_b = (grad_y_pred * x).sum()
        grad_c = (grad_y_pred * x ** 2).sum()
        grad_d = (grad_y_pred * x ** 3).sum()

        # Update weights
        a -= learning_rate * grad_a
        b -= learning_rate * grad_b
        c -= learning_rate * grad_c
        d -= learning_rate * grad_d
    t1 = time.time()
    print("Elapsed time: %.2fs" % (t1 - t0))
    print(f'Result: y = {a} + {b} x + {c} x^2 + {d} x^3')
    plt.plot(x, y, label='real')
    plt.plot(x, y_pred, label='pred')
    plt.legend(loc='best')
    plt.grid(True)
    plt.show()

Run on the CPU, this took 0.33 s.
Result: (figure not preserved; it showed the 'real' and 'pred' curves plotted by the code above)
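A common sanity check for hand-written gradients like the ones above is to compare them with central finite differences. The sketch below does this in plain Python for the same sum-of-squares loss and cubic model; the function names, the 50 sample points, and the test parameter values are assumptions made for the example.

```python
import math


def loss(params, xs, ys):
    """Sum of squared errors for y = a + b x + c x^2 + d x^3."""
    a, b, c, d = params
    return sum((a + b * x + c * x ** 2 + d * x ** 3 - y) ** 2
               for x, y in zip(xs, ys))


def analytic_grad(params, xs, ys):
    """Gradient of the loss w.r.t. (a, b, c, d), as in the NumPy code."""
    a, b, c, d = params
    grad = [0.0, 0.0, 0.0, 0.0]
    for x, y in zip(xs, ys):
        residual = 2.0 * (a + b * x + c * x ** 2 + d * x ** 3 - y)
        for k in range(4):
            grad[k] += residual * x ** k
    return grad


def numeric_grad(params, xs, ys, eps=1e-6):
    """Central finite-difference approximation of the same gradient."""
    grad = []
    for k in range(4):
        up, down = list(params), list(params)
        up[k] += eps
        down[k] -= eps
        grad.append((loss(up, xs, ys) - loss(down, xs, ys)) / (2 * eps))
    return grad


n = 50
xs = [-math.pi + 2 * math.pi * i / (n - 1) for i in range(n)]
ys = [math.sin(x) for x in xs]
params = [0.3, -0.2, 0.1, 0.05]  # arbitrary test point
ga = analytic_grad(params, xs, ys)
gn = numeric_grad(params, xs, ys)
for name, g1, g2 in zip("abcd", ga, gn):
    print(f"d loss / d {name}: analytic={g1:.6f} numeric={g2:.6f}")
```

If the two columns disagree noticeably, the analytic gradient is the first place to look for a bug.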

2.2 Implementing gradient descent with PyTorch

import math
import time

import matplotlib.pyplot as plt
import torch


def torch_GD():
    dtype = torch.float
    device = "cuda:0" if torch.cuda.is_available() else "cpu"
    print(device)
    x = torch.linspace(-math.pi, math.pi, 2000, device=device, dtype=dtype)
    y = torch.sin(x)

    a = torch.randn((), device=device, dtype=dtype, requires_grad=True)
    b = torch.randn((), device=device, dtype=dtype, requires_grad=True)
    c = torch.randn((), device=device, dtype=dtype, requires_grad=True)
    d = torch.randn((), device=device, dtype=dtype, requires_grad=True)

    learning_rate = 1e-6
    t0 = time.time()
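    for t in range(2000):
        # Forward pass: predicted y = a + b x + c x^2 + d x^3
        y_pred = a + b * x + c * x ** 2 + d * x ** 3

        # Sum of squared errors (a scalar tensor)
        loss = (y_pred - y).pow(2).sum()
        if t % 100 == 99:
            print(t, loss.item())

        # Backward pass: autograd fills a.grad, b.grad, c.grad, d.grad
        loss.backward()

        # Update the weights without recording the update in the autograd graph
        with torch.no_grad():
            a -= learning_rate * a.grad
            b -= learning_rate * b.grad
            c -= learning_rate * c.grad
            d -= learning_rate * d.grad

            # Reset the gradients so they do not accumulate across iterations
            a.grad = None
            b.grad = None
            c.grad = None
            d.grad = None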
    t1 = time.time()
    print("Elapsed time: %.2fs" % (t1 - t0))
    print(f'Result: y = {a.item()} + {b.item()} x + {c.item()} x^2 + {d.item()} x^3')

    # Move tensors to the CPU for plotting; y_pred is detached from the autograd graph first
    plt.plot(x.cpu().numpy(), y.cpu().numpy(), label='real')
    plt.plot(x.cpu().numpy(), y_pred.detach().cpu().numpy(), label='pred')
    plt.grid(True)
    plt.legend(loc='best')
    plt.show()