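《PyTorch深度学习实践》 (PyTorch Deep Learning Practice) P3 Gradient Descent: Notes + Code + Plots for Gradient Descent, Stochastic Gradient Descent, and Mini-batch Stochastic Gradient Descent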

大臣不想在月亮上上热搜

Last modified: 2024-10-11 22:01:13

Tags: deep learning, pytorch, notes, python, machine learning

First published: 2024-10-11 21:54:38

Article link: https://blog.csdn.net/weixin_46046293/article/details/142862629

Gradient Descent (Batch Gradient Descent)

Stochastic Gradient Descent (SGD)

Mini-batch Stochastic Gradient Descent (Mini-batch Gradient Descent)

Gradient Descent (Batch Gradient Descent)

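Introduction: The gradient is computed over all training samples, and the weights are updated once per iteration.

Principle: Suppose we have a loss function $J(w)$ that depends on the parameter $w$. We look for the optimal parameter $w^*$ by minimizing the loss: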

$w^* = \arg\min \limits_w J(w)$

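The gradient $\nabla J(w)$ of the loss function describes how fast the loss changes at a given point $w$. It is a vector that points in the direction in which the loss increases fastest, and it is computed as: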

$\nabla J(w) = \left[ \frac{\partial J}{\partial w_1}, \frac{\partial J}{\partial w_2}, \ldots, \frac{\partial J}{\partial w_n} \right]$

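where $w_1, w_2, \ldots, w_n$ are the components of the parameter vector.

The core idea of gradient descent is to update the parameters in the direction opposite to the gradient. The update rule is: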

$w := w - \alpha \nabla J(w)$

where $\alpha$ is the learning rate, which controls the step size of each update, and $w$ is the current parameter value.
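As a concrete instance of the update rule, here is a worked derivation for the one-parameter linear model $\hat{y} = x w$ with mean squared error, which is the model the code in this post fits. The numbers use the post's data $(1, 2), (2, 4), (3, 6)$, the initial value $w = 1$, and $\alpha = 0.01$; the symbol $N$ for the number of samples is introduced here for illustration.

$J(w) = \frac{1}{N} \sum_{i=1}^{N} (x_i w - y_i)^2$

$\frac{\partial J}{\partial w} = \frac{1}{N} \sum_{i=1}^{N} 2 x_i (x_i w - y_i)$

$\left. \frac{\partial J}{\partial w} \right|_{w=1} = \frac{1}{3} \left[ 2 \cdot 1 \cdot (1 - 2) + 2 \cdot 2 \cdot (2 - 4) + 2 \cdot 3 \cdot (3 - 6) \right] = \frac{-2 - 8 - 18}{3} = -\frac{28}{3} \approx -9.33$

$w := 1 - 0.01 \times \left( -\frac{28}{3} \right) \approx 1.093$

The gradient is negative, so the update moves $w$ upward, toward the value $w = 2$ that fits the data exactly.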

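Iteration procedure:

1. Initialize the parameter $w$ (randomly or to zero).
2. Compute the loss $J(w)$ at the current parameter value.
3. Compute the gradient $\nabla J(w)$.
4. Update the parameter $w$ according to the update rule.
5. Repeat steps 2-4 until the loss converges to an acceptable value or a preset number of iterations is reached.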
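The steps above can be sketched as a loop with both stopping conditions from the last step. This is a minimal illustration rather than the post's own implementation: the function name `batch_gd` and the parameters `tol` and `max_epochs` are assumptions made for this example, while the data, initial weight, and learning rate match the post's code.

```python
x_data = [1.0, 2.0, 3.0]
y_data = [2.0, 4.0, 6.0]

def cost(w):
    # Mean squared error of the model y_pred = x * w on the full dataset
    return sum((x * w - y) ** 2 for x, y in zip(x_data, y_data)) / len(x_data)

def gradient(w):
    # Derivative of the mean squared error with respect to w
    return sum(2 * x * (x * w - y) for x, y in zip(x_data, y_data)) / len(x_data)

def batch_gd(w=1.0, lr=0.01, tol=1e-10, max_epochs=10_000):
    # tol and max_epochs are hypothetical stopping parameters for this sketch
    for epoch in range(max_epochs):
        loss = cost(w)             # step 2: loss at the current w
        if loss < tol:             # step 5: stop once the loss is small enough
            break
        w -= lr * gradient(w)      # steps 3-4: compute gradient and update
    return w, epoch

w_fit, n_epochs = batch_gd()
print(f'w={w_fit:.6f} after {n_epochs} epochs')
```

With a loss threshold the loop ends as soon as the fit is good enough, and `max_epochs` only acts as a safety limit.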

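Advantages:

Convergence is stable, because the gradient is computed on the full dataset.
Well suited to small datasets.

Disadvantages:

High computational cost, especially on large datasets.
Only one update per pass over the data, which can make convergence slow.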

# Training data: the true relationship is y = 2x
x_data = [1.0, 2.0, 3.0]
y_data = [2.0, 4.0, 6.0]

w = 1.0  # initial guess for the weight

def forward(x):
    # Linear model without bias: y_pred = x * w
    return x * w

def cost(xs, ys):
    # Mean squared error over the whole dataset
    total = 0
    for x, y in zip(xs, ys):
        y_pred = forward(x)
        total += (y_pred - y) ** 2
    return total / len(xs)

def gradient(xs, ys):
    # Derivative of the cost with respect to w, averaged over the whole dataset
    grad = 0
    for x, y in zip(xs, ys):
        grad += 2 * x * (x * w - y)
    return grad / len(xs)

print('Predict (before training)', 4, forward(4))
cost_all = []
for epoch in range(100):
    cost_val = cost(x_data, y_data)
    cost_all.append(cost_val)
    grad_val = gradient(x_data, y_data)
    w -= 0.01 * grad_val  # one update per epoch, learning rate 0.01
    print(f'Epoch:{epoch}, w={round(w,2)}, loss={cost_val}')
print('Predict (after training)', 4, forward(4))

import matplotlib.pyplot as plt
import numpy as np
epoch = np.arange(1,101,1)
plt.plot(epoch, cost_all)
plt.ylabel('Loss')
plt.xlabel('epoch')
plt.show()

Stochastic Gradient Descent (SGD)

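Introduction: Each update uses a single training sample to compute the gradient, and the weight is updated immediately.

Advantages:

Updates are frequent, so the loss starts to decrease sooner.
The noise in the per-sample gradients may improve generalization and reduce the risk of overfitting.
Friendly to large datasets, since the data can be processed one sample at a time.

Disadvantages:

Convergence can be unstable, and the loss curve may fluctuate considerably.
The parameters may end up oscillating around the optimum instead of converging to it.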

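# Training data: the true relationship is y = 2x
x_data = [1.0, 2.0, 3.0]
y_data = [2.0, 4.0, 6.0]

w = 1.0  # initial guess for the weight

def forward(x):
    # Linear model without bias: y_pred = x * w
    return x * w

def loss(x, y):
    # Squared error of a single sample
    y_pred = forward(x)
    return (y_pred - y) ** 2

def gradient(x, y):
    # Derivative of the single-sample loss with respect to w
    return 2 * x * (x * w - y)

print('Predict (before training)', 4, forward(4))
cost_all = []
for epoch in range(100):
    loss_sum = 0
    for x, y in zip(x_data, y_data):
        grad = gradient(x, y)
        w = w - 0.01 * grad  # update immediately after each sample
        print(f'x = {x}, y = {y}, grad = {grad}')
        loss_sum += loss(x, y)  # loss of this sample after the update
    epoch_loss = loss_sum / len(x_data)  # mean loss, the value that is plotted
    print(f'Epoch:{epoch}, w={round(w,2)}, loss={epoch_loss}')
    cost_all.append(epoch_loss)

print('Predict (after training)', 4, forward(4))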

import matplotlib.pyplot as plt
import numpy as np
epoch = np.arange(1,101,1)
plt.plot(epoch, cost_all)
plt.ylabel('Loss')
plt.xlabel('epoch')
plt.show()
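The loop above visits the samples in the same order in every epoch. A common variant shuffles the order each epoch so that the choice of sample is actually random. The sketch below is one way to do that with the standard library; the function name `sgd` and the `seed` parameter are assumptions made for this example and do not appear in the original code.

```python
import random

x_data = [1.0, 2.0, 3.0]
y_data = [2.0, 4.0, 6.0]

def sgd(w=1.0, lr=0.01, epochs=100, seed=0):
    # seed is a hypothetical parameter that makes the shuffling reproducible
    rng = random.Random(seed)
    samples = list(zip(x_data, y_data))
    for _ in range(epochs):
        rng.shuffle(samples)                  # new random order every epoch
        for x, y in samples:
            grad = 2 * x * (x * w - y)        # gradient of one sample's squared error
            w -= lr * grad                    # update right after each sample
    return w

w_sgd = sgd()
print('w =', w_sgd, 'prediction for x=4:', 4 * w_sgd)
```

Because this toy dataset is noise-free (every point lies exactly on y = 2x), each per-sample gradient is zero at w = 2, so the oscillation listed among the disadvantages does not show up here. It appears when the samples disagree about the best w, as with noisy data.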

Mini-batch Stochastic Gradient Descent (Mini-batch Gradient Descent)

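Introduction: Each update uses only a small subset (a mini-batch) of the training samples to compute the gradient and update the weights.

Advantages:

Combines the strengths of batch and stochastic gradient descent, giving good convergence and stability.
Updates more often than batch gradient descent, so it converges faster.
The batch size can be tuned to trade off computational efficiency against convergence speed.

Disadvantages:

Choosing the batch size takes some experience, and the choice can affect model performance.
Some fluctuation may remain, although it is usually more stable than SGD.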
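To make the update-frequency comparison concrete, the sketch below counts how many weight updates each method performs in one epoch. The helper `updates_per_epoch` is a name introduced for this illustration; the sample count of 10 and the batch size of 4 are the values used in this post's mini-batch code.

```python
import math

def updates_per_epoch(n_samples, batch_size):
    # One update per batch; the last batch may be smaller than batch_size
    return math.ceil(n_samples / batch_size)

n = 10
for name, batch_size in [('batch GD', n), ('mini-batch, size 4', 4), ('SGD', 1)]:
    print(f'{name}: {updates_per_epoch(n, batch_size)} update(s) per epoch')
# batch GD: 1, mini-batch with size 4: 3, SGD: 10
```

Batch gradient descent and SGD are the two extremes of the same scheme, with the batch size equal to the dataset size and to 1 respectively.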

import numpy as np
import matplotlib.pyplot as plt

x_data = np.arange(1,11,1)
y_data = np.arange(2,22,2)

w = 1.0
batch_size = 4  

def forward(x):
    return x * w

def cost(xs, ys):
    total_cost = 0
    for x, y in zip(xs, ys):
        y_pred = forward(x)
        total_cost += (y_pred - y) ** 2
    return total_cost / len(xs)

def gradient(xs, ys):
    grad = 0
    for x, y in zip(xs, ys):
        grad += 2 * x * (x * w - y)
    return grad / len(xs)

print('Predict (before training)', 4, forward(4))

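# Number of mini-batches per epoch (3 for 10 samples with batch_size = 4)
num_batches = int(np.ceil(len(x_data) / batch_size))

cost_all = []
for epoch in range(100):
    # Shuffle the sample order at the start of every epoch
    indices = np.random.permutation(len(x_data))

    l = 0
    for i in range(0, len(x_data), batch_size):
        # Slice out the current mini-batch (the last one may be smaller)
        batch_indices = indices[i:i + batch_size]
        batch_x = x_data[batch_indices]
        batch_y = y_data[batch_indices]
        cost_val = cost(batch_x, batch_y)
        l += cost_val
        grad_val = gradient(batch_x, batch_y)
        w -= 0.002 * grad_val

    # Average the per-batch costs over the actual number of batches
    average_loss = l / num_batches
    cost_all.append(average_loss)
    print(f'Epoch:{epoch}, w={round(w, 2)}, loss={average_loss}')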

print('Predict (after training)', 4, forward(4))

# Plot the loss curve
epoch = np.arange(1, 101, 1)
plt.plot(epoch, cost_all)
plt.ylabel('Loss')
plt.xlabel('epoch')
plt.show()
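The per-sample Python loops in `cost` and `gradient` can also be written with NumPy array operations. The following is a sketch of the same mini-batch procedure in vectorized form, not a replacement for the code above. The function name `minibatch_gd` and the `seed` parameter are assumptions for this example, and it uses `np.random.default_rng` instead of `np.random.permutation` so that runs are reproducible.

```python
import numpy as np

def minibatch_gd(x, y, w=1.0, lr=0.002, batch_size=4, epochs=100, seed=0):
    # seed is a hypothetical parameter that fixes the shuffling order
    rng = np.random.default_rng(seed)
    n = len(x)
    for _ in range(epochs):
        idx = rng.permutation(n)                     # shuffle once per epoch
        for start in range(0, n, batch_size):
            batch = idx[start:start + batch_size]    # indices of this mini-batch
            residual = x[batch] * w - y[batch]       # prediction error per sample
            grad = np.mean(2 * x[batch] * residual)  # batch-averaged gradient
            w -= lr * grad
    return w

x = np.arange(1, 11, dtype=float)
y = 2 * x
w_mb = minibatch_gd(x, y)
print('w =', w_mb, 'prediction for x=4:', 4 * w_mb)
```

The array expression `np.mean(2 * x[batch] * residual)` computes the same quantity as the loop in `gradient`, namely the sum of `2 * x * (x * w - y)` over the batch divided by the batch length.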