梯度下降法详解-CSDN博客

本文链接：https://blog.csdn.net/devinwood/article/details/77464117

线性回归中参数解析式的求解涉及到矩阵的求逆，当特征矩阵数据量过大，求逆是一个很耗时的过程，我们可以使用先对特征矩阵做奇异值分解（SVD）后再求广义逆 $^①$ ，但更多情况下，是使用梯度下降法绕过求逆的过程。

根据梯度反方向 $-\frac{\partial J(\theta)}{\partial \theta}$ 是函数值下降最快的方向，随机初始化参数 $\theta$ ，让 $\theta$ 沿负梯度方向迭代，更新后的 $\theta$ 使损失函数 $J(\theta)$ 更小：

θ = θ - α \partial J ( θ ) \partial θ

$\theta=\theta-\alpha\frac{\partial J(\theta)}{\partial \theta}$
其中

α $\alpha$ 是学习率，步长，表示每次向着

J(θ) $J(\theta)$ 最陡峭的方向迈步的大小。这是梯度下降的基本思想。

以平方损失为例，参数 $\theta_j$ 的梯度方向 $^②$ 为：

\partial J ( θ ) \partial θ j = (h θ (x) - y) x j

$\frac{\partial J(\theta)}{\partial \theta_j}=(h_\theta(x)-y)x_j$

批量梯度下降法（BGD）

批量梯度下降法（Batch Gradient Descent，简称BGD）是梯度下降法最原始的形式，具体思路是在更新每一参数时使用所有的样本来进行更新。BGD一定可以得到一个局部极值，在线性回归中可以得到全局最优解。

样本预测值为 $h_\theta(x^i)$ 的样本 $(x^i,y^i)$ 的参数 $\theta_j$ 更新：

θ j = θ j + α m \sum i = 1 m (y i - h θ (x i)) x i j

$\theta_j = \theta_j+\frac{\alpha}{m} \sum_{i=1}^{m}(y^i-h_\theta(x^i))x_j^i$
代码实现

③ $^③$ ：

for j in range(n):
        for i in range(m):
            diff += (y[i]- np.dot(theta,x[i])) * x[i][j]
        theta[j] = theta[j] + alpha *1/m* diff

这里写图片描述

随机梯度下降法（SGD）

批量梯度下降法更新一个参数的时候需要用到所有样本，所以训练速度会随着样本数量增加而变得非常缓慢，随机梯度（Stochastic Gradient Decent ，简称SGD）的解决办法就是通过单个样本的损失函数对参数求偏导来更新梯度。

参数更新：

θ j = θ j + α (y i - h θ (x i)) x i j

$\theta_j = \theta_j+\alpha(y^i-h_\theta(x^i))x_j^i$
代码实现：

i = random.randint(0, m - 1)
diff = y[i] - np.dot(theta, x[i])
for j in range(n):
    theta[j] = theta[j] + alpha * diff * x[i][j]

这里写图片描述

比较图发现，SGD震荡下降，在极值点附近震荡，因为不需要用到所有样本进行更新参数，所以下降更快，但可能迭代次数会更多。
由于SGD是震荡下降，所以有可能在震荡下降过程中可能跳出次局部极值，找到全局极小值。SGD因为不需要用到所有的样本，所以更适合于做在线学习。

小批量梯度下降法（MBGD）

SGD的缺点也很明显，噪音较BGD要多，虽然整体往最优值方向移动，但每次迭代并非都向整体最优方向。

为了保证算法的训练速度，同时也保证参数训练的准确率，于是就有了小批量梯度下降法（Mini-batch Gradient Decent ,简称MBGD），抽取部分样本进行参数更新，视训练集规模人为给定抽取的样本个数，代码可参考BGD与SGD。

附录

①：若 $A$ 为非奇异矩阵，则线性方程组 $Ax=b$ 的解为

x = (A T A) - 1 A T b

$x = (A^TA)^{-1}A^Tb$
从方程的直观意义上可定义广义逆矩阵：

A + = (A T A) - 1 A T

$A^+= (A^TA)^{-1}A^T$
②：梯度方向
这里写图片描述

③：

# coding=utf-8
# !/usr/bin/python

import matplotlib.pyplot as plt
import matplotlib as mpl
import numpy as np
import warnings
import random

np.set_printoptions(suppress=True)
np.set_printoptions(precision=1)
warnings.filterwarnings("ignore")

# 构造数据 样本x 目标值y
N = 100
xNew = []
x = np.linspace(0, 6, N) + np.random.randn(N)
x = np.sort(x)
print(x)
z = [1 for i in range(len(x))]
y = 4 * x - 3 + np.random.randn(N)  # 通过一次函数加入随机扰动构造值
x1 = x
for i in range(len(x)):
    tempMat = [z[i], round(x[i], 2)]
    xNew.append(tempMat)
x = xNew

epsilon = 0.01
alpha = 0.05  # 学习率
error1 = 0  # 当前权重损失值
error0 = 0  # 上一次权重损失值
time = 0  # 迭代次数
errors = []  # 损失值数组
times = []

n = len(x[0])  # 变量个数 考虑截距项
m = len(x)  # 样本数
theta = [0 for i in range(n)]


while True:
    diff = 0
    # 随机梯度下降 begin
    # i = random.randint(0, m - 1)
    # diff = y[i] - np.dot(theta, x[i])
    # for j in range(n):
    #     theta[j] = theta[j] + alpha * diff * x[i][j]
        # 随机梯度下降 end
        #批量梯度下降 begin
    for j in range(n):
        for i in range(m):
            diff += (y[i]- np.dot(theta,x[i])) * x[i][j]
        theta[j] = theta[j] + alpha *1/m* diff
    #批量梯度下降 end
    error1 = 0
    for i in range(m):
        error1 += (y[i] - (np.dot(theta, x[i]))) ** 2 / 2  # 计算损失值

    errors.append(error1)
    times.append(time)
    time += 1

    if abs(error1 - error0) < epsilon:
        break
    else:
        error0 = error1

    if time == 100:  # 迭代到达100次停止
        break

# 画图
mpl.rcParams['font.sans-serif'] = [u'simHei']
mpl.rcParams['axes.unicode_minus'] = False
plt.figure(figsize=(12, 9))
plt.subplot(1, 2, 1)
plt.plot(times, errors, 'g-', linewidth=2)
plt.title(u'损失曲线', fontsize=18)
plt.xlabel(u'次数', fontsize=15)
plt.ylabel(u'损失值', fontsize=15)
plt.legend(loc='upper right')
plt.grid()
plt.subplot(1, 2, 2)
plt.plot(x1, y, 'bo', linewidth=2, label=u'原始数据')
x = np.arange(-2, 8)
y = theta[0] + theta[1] * x
plt.plot(x, y, 'r-', linewidth=2, label=u'回归模型')
plt.xlabel(u'X', fontsize=15)
plt.ylabel(u'Y', fontsize=15)
plt.title(u'回归拟合', fontsize=18)
plt.legend(loc='upper left')
plt.grid()
plt.show()
print('Done: theta0 : %f, theta1 : %f' % (theta[0], theta[1]))