PyTorch Automatic Differentiation and Custom torch.autograd.Function Tutorial

Latest recommended article published on 2024-12-06 10:15:41

kebijuelun

Tags: pytorch, artificial intelligence, python, deep learning

Article link: https://blog.csdn.net/kebijuelun/article/details/141031069

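Introduction

This article covers PyTorch automatic differentiation and how to write a custom torch.autograd.Function. Together, torch.autograd and custom autograd.Function subclasses let you implement arbitrary forward and backward propagation logic.

1. Introduction to torch automatic differentiation and gradient computation

Let's look at how automatic differentiation (autograd) collects gradients. We create two tensors a and b with requires_grad=True, which tells autograd to track every operation performed on them.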

import torch

a = torch.tensor([2., 3.], requires_grad=True)
b = torch.tensor([6., 4.], requires_grad=True)

Next, we create another tensor Q from a and b.

$Q = 3a^3 - b^2$

Q = 3*a**3 - b**2

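Suppose a and b are the parameters of a neural network and Q is the error. During training we want the gradients of the error with respect to the parameters, namely: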

$\frac{\partial Q}{\partial a} = 9a^2$
$\frac{\partial Q}{\partial b} = -2b$

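When we call .backward() on Q, autograd computes these gradients and stores them in the .grad attribute of the corresponding tensors.

Because Q is a vector rather than a scalar, we must pass a gradient argument explicitly to Q.backward(). gradient is a tensor with the same shape as Q that represents the gradient of Q with respect to itself, i.e.:

$\frac{dQ}{dQ} = 1$

Equivalently, we can aggregate Q into a scalar and call backward() without arguments, for example Q.sum().backward().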

external_grad = torch.tensor([1., 1.])
Q.backward(gradient=external_grad)

The gradients are now stored in a.grad and b.grad.

# Check that the collected gradients are correct
print(9*a**2 == a.grad)
print(-2*b == b.grad)

Output:

tensor([True, True])
tensor([True, True])
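
As a quick check of the two ways to seed the backward pass described above, the following minimal sketch compares Q.backward(gradient=...) with Q.sum().backward() on the same values of a and b. The helper name `grads_via` is hypothetical and exists only for this illustration.

```python
import torch

def grads_via(use_sum):
    # Fresh leaf tensors per run so gradients from the two runs do not accumulate
    a = torch.tensor([2., 3.], requires_grad=True)
    b = torch.tensor([6., 4.], requires_grad=True)
    Q = 3 * a**3 - b**2
    if use_sum:
        # Scalar loss: backward() needs no explicit gradient argument
        Q.sum().backward()
    else:
        # Vector output: seed the backward pass with dQ/dQ = 1
        Q.backward(gradient=torch.ones_like(Q))
    return a.grad, b.grad

grad_a_explicit, grad_b_explicit = grads_via(use_sum=False)
grad_a_sum, grad_b_sum = grads_via(use_sum=True)

print(grad_a_explicit)  # equals 9 * a**2, i.e. [36, 81]
print(grad_b_explicit)  # equals -2 * b, i.e. [-12, -8]
print(torch.equal(grad_a_explicit, grad_a_sum))
print(torch.equal(grad_b_explicit, grad_b_sum))
```

Both calls produce the same gradients because summing Q is equivalent to weighting every element of Q by 1 in the vector-Jacobian product.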

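Complete verification code

import torch

# Create two tensors a and b with requires_grad=True so that autograd tracks operations on them
a = torch.tensor([2., 3.], requires_grad=True)
b = torch.tensor([6., 4.], requires_grad=True)

# Define the tensor Q, computed from a and b as Q = 3a^3 - b^2
Q = 3*a**3 - b**2

# Define an external gradient, external_grad, that seeds the backward pass (dQ/dQ = 1)
external_grad = torch.tensor([1., 1.])

# Call backward to compute the gradients of Q with respect to a and b, passing the external gradient
Q.backward(gradient=external_grad)

# Check that the collected gradients are correct
# The gradient with respect to a should be 9a^2
print(9*a**2 == a.grad)
# The gradient with respect to b should be -2b
print(-2*b == b.grad)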

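2. How do you define a custom backward function by extending torch.autograd.Function?

What is torch.autograd.Function?

torch.autograd.Function is the base class for creating custom autograd.Function subclasses.
- To create a custom autograd.Function, subclass this class and implement the forward() and backward() static methods. Then, to use your custom operation in the forward pass, call the class method apply; do not call forward() directly.
- To ensure correctness and best performance, make sure you call the right methods on ctx and validate your backward function with torch.autograd.gradcheck().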

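Usage workflow

Follow these steps:

- Subclass Function and implement the forward(), (optional) setup_context(), and backward() methods.
- Call the appropriate methods on the ctx argument.
- Declare whether your function supports double backward.
- Validate that your gradients are correct using gradcheck.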

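Step 1: After subclassing `Function`, you need to define 3 methods:

- forward() is the code that performs the forward operation. It can take any number of arguments, some of which can be optional if you give them default values. All kinds of Python objects are accepted. Tensor arguments that track history (i.e., with requires_grad=True) are converted to tensors that do not track history before the call, and their use is recorded in the computation graph. Note that this logic does not traverse lists, dicts, or other data structures; it only considers tensors passed as direct arguments. You can return either a single tensor or a tuple of tensors if there are multiple outputs. Also see the Function documentation for descriptions of useful methods that can only be called from forward().
- setup_context() (optional). You can either write a "combined" forward() that accepts a ctx object, or (as of PyTorch 2.0) write a separate forward() that does not accept ctx together with a setup_context() method in which ctx is modified. forward() should contain the computation, and setup_context() should only be responsible for modifying ctx (without performing any computation). In general, a separate forward() and setup_context() is closer to how PyTorch native operations work and therefore composes better with various PyTorch subsystems. For more details, see the section on combining forward() and setup_context().
- backward() (or vjp()) defines the gradient formula. It receives as many tensor arguments as there were outputs, each representing the gradient with respect to that output. Never modify these arguments in place. **It should return as many tensors as there were inputs, each containing the gradient with respect to its corresponding input.** If an input does not need a gradient (needs_input_grad is a tuple of booleans indicating whether each input needs gradient computation) or is a non-tensor object, you can return None for it. Also, if forward() has optional arguments, you can return more gradients than there were inputs, as long as the extra ones are all None.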

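Step 2: It is your responsibility to use the functions in `ctx` properly so that the new `Function` works correctly with the `autograd` engine.

- save_for_backward() must be used to save any tensors to be used in the backward pass. Non-tensor objects should be stored directly on ctx. If tensors that are neither inputs nor outputs are saved for backward, your Function may not support double backward (see Step 3).
- mark_dirty() must be used to mark any input that is modified in place by the forward function.
- mark_non_differentiable() must be used to tell the engine that an output is not differentiable. By default, all output tensors of a differentiable type are set to require gradient. Tensors of a non-differentiable type (i.e., integer types) are never marked as requiring gradients.
- set_materialize_grads() can be used to tell the autograd engine to optimize gradient computation in cases where the output does not depend on the input, by not materializing the gradient tensors passed to the backward function. That is, if set to False, a None object in Python or an "undefined tensor" in C++ (a tensor x for which x.defined() is False) is not converted to a zero-filled tensor before backward is called, so your code needs to handle such objects as if they were zero-filled tensors. The default value of this setting is True.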
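
To make the ctx methods above more concrete, here is a minimal sketch of mark_non_differentiable() combined with save_for_backward(). The class `ReluWithMask` is a hypothetical example written for this illustration, not part of PyTorch, and the separate forward()/setup_context() style assumes PyTorch 2.0 or later.

```python
import torch
from torch.autograd import Function

class ReluWithMask(Function):
    """Hypothetical example: returns relu(x) together with the 0/1 mask it applied."""

    @staticmethod
    def forward(x):
        mask = (x > 0).to(x.dtype)  # floating-point mask, same dtype as x
        return x * mask, mask

    @staticmethod
    def setup_context(ctx, inputs, output):
        _, mask = output
        # mask is an output of forward, so it may be saved for backward
        ctx.save_for_backward(mask)
        # mask has a differentiable dtype, so without this call it would
        # be flagged as requiring gradients
        ctx.mark_non_differentiable(mask)

    @staticmethod
    def backward(ctx, grad_out, grad_mask):
        # backward receives one gradient per output; grad_mask is unused
        (mask,) = ctx.saved_tensors
        return grad_out * mask

x = torch.tensor([-1.0, 2.0, 3.0], requires_grad=True)
out, mask = ReluWithMask.apply(x)
out.sum().backward()

print(out.requires_grad, mask.requires_grad)
print(x.grad)
```

Only `out` takes part in differentiation; `mask` comes back as a plain tensor that does not require gradients.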

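Step 3: If your `Function` does not support double backward, you should declare this explicitly by decorating `backward` with `once_differentiable()`. With this decorator, attempts to perform double backward through your function produce an error. For more information on double backward, see the double backward tutorial.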
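
The following sketch illustrates Step 3 under simple assumptions. `SquareOnce` is a hypothetical Function made up for this example; its backward is decorated with once_differentiable, so the first derivative works while a second differentiation attempt fails with a RuntimeError.

```python
import torch
from torch.autograd import Function
from torch.autograd.function import once_differentiable

class SquareOnce(Function):
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return x ** 2

    @staticmethod
    @once_differentiable  # declares that backward itself is not differentiable
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        return grad_output * 2 * x

x = torch.tensor(3.0, requires_grad=True)
y = SquareOnce.apply(x)

# First-order gradient works as usual: d(x^2)/dx = 2x
(first,) = torch.autograd.grad(y, x, create_graph=True)
print(first)

# Differentiating the gradient again is rejected instead of silently
# returning a wrong second derivative
try:
    torch.autograd.grad(first, x)
    double_backward_failed = False
except RuntimeError as err:
    double_backward_failed = True
    print("double backward rejected:", err)
```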

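Step 4: It is recommended to use `torch.autograd.gradcheck()` to check that your `backward` function correctly computes the gradients of `forward`. It does this by computing the Jacobian with your `backward` function and comparing it element-wise with a Jacobian computed numerically using finite differences.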

Example

The code below shows a Linear function with additional comments:

import torch
from torch.autograd import Function

# Inherit from Function
class LinearFunction(Function):

    # Note that forward, setup_context, and backward are @staticmethods
    @staticmethod
    def forward(input, weight, bias):
        output = input.mm(weight.t())
        if bias is not None:
            output += bias.unsqueeze(0).expand_as(output)
        return output

    @staticmethod
    # inputs is a tuple of all the inputs passed to forward.
    # output is the output of forward().
    def setup_context(ctx, inputs, output):
        input, weight, bias = inputs
        ctx.save_for_backward(input, weight, bias)

    # This function has only a single output, so it gets only one gradient
    @staticmethod
    def backward(ctx, grad_output):
        # It is convenient to unpack saved_tensors at the top of backward and to
        # initialize all gradients with respect to the inputs to None.
        # Because additional trailing Nones are ignored, the return statement
        # stays simple even when the function has optional inputs.
        input, weight, bias = ctx.saved_tensors
        grad_input = grad_weight = grad_bias = None

        # These needs_input_grad checks are optional and only improve efficiency.
        # If you want to keep the code simpler, you can skip them.
        # Returning gradients for inputs that don't require them is not an error.
        if ctx.needs_input_grad[0]:
            grad_input = grad_output.mm(weight)
        if ctx.needs_input_grad[1]:
            grad_weight = grad_output.t().mm(input)
        if bias is not None and ctx.needs_input_grad[2]:
            grad_bias = grad_output.sum(0)

        return grad_input, grad_weight, grad_bias

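Now, to make these custom operations easier to use, we recommend either aliasing them or wrapping them in a function. Wrapping in a function lets us support default arguments and keyword arguments:

# Option 1: alias
linear = LinearFunction.apply

# Option 2: wrap in a function to support default arguments and keyword arguments.
def linear(input, weight, bias=None):
    return LinearFunction.apply(input, weight, bias)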
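
As a sanity check on the wrapper, the sketch below compares the custom linear against torch.nn.functional.linear, both in output and in gradients. It repeats a condensed LinearFunction (without the needs_input_grad checks) so that it runs on its own, and it assumes PyTorch 2.0 or later for the separate setup_context(). The choice of F.linear as the reference is ours, not the original article's.

```python
import torch
import torch.nn.functional as F
from torch.autograd import Function

class LinearFunction(Function):
    @staticmethod
    def forward(input, weight, bias):
        output = input.mm(weight.t())
        if bias is not None:
            output += bias.unsqueeze(0).expand_as(output)
        return output

    @staticmethod
    def setup_context(ctx, inputs, output):
        input, weight, bias = inputs
        ctx.save_for_backward(input, weight, bias)

    @staticmethod
    def backward(ctx, grad_output):
        input, weight, bias = ctx.saved_tensors
        grad_input = grad_output.mm(weight)
        grad_weight = grad_output.t().mm(input)
        grad_bias = grad_output.sum(0) if bias is not None else None
        return grad_input, grad_weight, grad_bias

def linear(input, weight, bias=None):
    return LinearFunction.apply(input, weight, bias)

torch.manual_seed(0)
x = torch.randn(4, 3, dtype=torch.double, requires_grad=True)
w = torch.randn(5, 3, dtype=torch.double, requires_grad=True)
b = torch.randn(5, dtype=torch.double, requires_grad=True)

# The keyword argument works because linear is a plain Python wrapper
custom_out = linear(x, w, bias=b)
reference_out = F.linear(x, w, b)
outputs_match = torch.allclose(custom_out, reference_out)

# Use a non-trivial scalar loss so grad_output is not just ones
custom_grads = torch.autograd.grad(custom_out.pow(2).sum(), (x, w, b))
reference_grads = torch.autograd.grad(reference_out.pow(2).sum(), (x, w, b))
grads_match = all(
    torch.allclose(c, r) for c, r in zip(custom_grads, reference_grads)
)
print(outputs_match, grads_match)
```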

Here is an additional example of a function that is parameterized by a non-Tensor argument:

class MulConstant(Function):
    @staticmethod
    def forward(tensor, constant):
        return tensor * constant

    @staticmethod
    def setup_context(ctx, inputs, output):
        # ctx is a context object that can be used to stash information
        # needed for the backward computation
        tensor, constant = inputs
        ctx.constant = constant

    @staticmethod
    def backward(ctx, grad_output):
        # We return as many input gradients as there were arguments.
        # Gradients of non-Tensor arguments to forward must be None.
        return grad_output * ctx.constant, None

Here, we optimize the example above by calling set_materialize_grads(False):

class MulConstant(Function):
    @staticmethod
    def forward(tensor, constant):
        return tensor * constant

    @staticmethod
    def setup_context(ctx, inputs, output):
        tensor, constant = inputs
        ctx.set_materialize_grads(False)
        ctx.constant = constant

    @staticmethod
    def backward(ctx, grad_output):
        # Here we must handle a None grad_output tensor. In this case we
        # can skip unnecessary computation and simply return None.
        if grad_output is None:
            return None, None

        # We return as many input gradients as there were arguments.
        # Gradients of non-Tensor arguments to forward must be None.
        return grad_output * ctx.constant, None
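
To show when a None gradient actually reaches backward, here is a minimal sketch with a two-output Function where only one output is used downstream. `DoubleAndTriple` and the `received` list are hypothetical names introduced for this illustration.

```python
import torch
from torch.autograd import Function

received = []  # records which incoming gradients were None

class DoubleAndTriple(Function):
    @staticmethod
    def forward(ctx, x):
        # Do not expand missing output gradients into zero-filled tensors
        ctx.set_materialize_grads(False)
        return x * 2, x * 3

    @staticmethod
    def backward(ctx, grad_double, grad_triple):
        received.append((grad_double is None, grad_triple is None))
        grad_x = None
        # Treat a None gradient as zeros, i.e. skip its contribution
        if grad_double is not None:
            grad_x = grad_double * 2
        if grad_triple is not None:
            contribution = grad_triple * 3
            grad_x = contribution if grad_x is None else grad_x + contribution
        return grad_x

x = torch.tensor(1.0, requires_grad=True)
doubled, tripled = DoubleAndTriple.apply(x)
doubled.backward()  # tripled never contributes to the result

print(received)
print(x.grad)
```

Because `tripled` is unused, its gradient arrives as None rather than as a tensor of zeros, and backward skips the corresponding multiplication.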

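If you need any "intermediate" Tensors computed in forward() to be saved, they must either be returned as outputs, or you must combine forward and setup_context() (see the section on combining forward() and setup_context()). Note that this means that if you want gradients to flow through those intermediate values, you need to define the gradient formula for them (see also the double backward tutorial):

class MyCube(torch.autograd.Function):
    @staticmethod
    def forward(x):
        # We want to save dx for backward. To do that, it must be returned as an output.
        dx = 3 * x ** 2
        result = x ** 3
        return result, dx

    @staticmethod
    def setup_context(ctx, inputs, output):
        x, = inputs
        result, dx = output
        ctx.save_for_backward(x, dx)

    @staticmethod
    def backward(ctx, grad_output, grad_dx):
        x, dx = ctx.saved_tensors
        # For the autograd.Function to support higher-order gradients, we must
        # add the gradient contribution from `dx`, which is grad_dx * 6 * x.
        result = grad_output * dx + grad_dx * 6 * x
        return result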

Wrap MyCube in a function so that it is clearer what the output is:

def my_cube(x):
    result, dx = MyCube.apply(x)
    return result
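
The sketch below checks that the wrapped function supports both first-order and second-order gradients, which is the reason `dx` is returned as an output. It repeats MyCube so that it runs on its own and assumes PyTorch 2.0 or later; the test point x = 3 is an arbitrary choice.

```python
import torch

class MyCube(torch.autograd.Function):
    @staticmethod
    def forward(x):
        dx = 3 * x ** 2
        result = x ** 3
        return result, dx

    @staticmethod
    def setup_context(ctx, inputs, output):
        (x,) = inputs
        result, dx = output
        ctx.save_for_backward(x, dx)

    @staticmethod
    def backward(ctx, grad_output, grad_dx):
        x, dx = ctx.saved_tensors
        # Second term: gradient flowing back through the dx output
        return grad_output * dx + grad_dx * 6 * x

def my_cube(x):
    result, _ = MyCube.apply(x)
    return result

x = torch.tensor(3.0, dtype=torch.double, requires_grad=True)
y = my_cube(x)

# First derivative of x^3 is 3x^2; keep the graph to differentiate again
(first,) = torch.autograd.grad(y, x, create_graph=True)
# Second derivative of x^3 is 6x
(second,) = torch.autograd.grad(first, x)

print(first.item(), second.item())
```

The second derivative is obtained by differentiating through the saved `dx` output, which routes back into MyCube.backward via the `grad_dx` argument.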

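You will probably want to check whether the backward method you implemented actually computes the derivatives of your function. You can do this by comparing against numerical approximations computed with small finite differences:

from torch.autograd import gradcheck

# gradcheck takes a tuple of tensors as input, checks whether your gradient evaluated
# at these tensors is close enough to the numerical approximation, and returns True
# if all of these conditions are satisfied.
input = (
    torch.randn(20, 20, dtype=torch.double, requires_grad=True),
    torch.randn(30, 20, dtype=torch.double, requires_grad=True),
)
test = gradcheck(linear, input, eps=1e-6, atol=1e-4)
print(test)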
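
To see what gradcheck protects against, the following sketch runs it on a correct backward and on a deliberately broken one. `GoodSquare` and `BadSquare` are hypothetical classes written for this illustration; raise_exception=False makes gradcheck return False instead of raising on a mismatch.

```python
import torch
from torch.autograd import Function, gradcheck

class GoodSquare(Function):
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return x ** 2

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        return grad_output * 2 * x  # correct: d(x^2)/dx = 2x

class BadSquare(Function):
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return x ** 2

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        return grad_output * x  # deliberately wrong: the factor 2 is missing

# Double precision keeps the finite-difference estimate accurate
x = torch.tensor([1.0, 2.0, 3.0], dtype=torch.double, requires_grad=True)

good_ok = gradcheck(GoodSquare.apply, (x,), eps=1e-6, atol=1e-4)
bad_ok = gradcheck(BadSquare.apply, (x,), eps=1e-6, atol=1e-4,
                   raise_exception=False)
print(good_ok, bad_ok)
```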

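Combining `forward()` and `setup_context()`

There are two main ways to define a Function:

- Define a forward() that combines the forward computation logic with setup_context().
- (As of PyTorch 2.0) Define a separate forward() and setup_context().

We recommend the second option (separate forward() and setup_context()) because it is closer to how PyTorch native operations are implemented and it composes with torch.func transforms. However, we plan to keep supporting both approaches; combining forward() and setup_context() gives more flexibility, since you can save intermediates without returning them as outputs.

See the previous section for how to define a Function with separate forward() and setup_context().

Here is an example of how to combine forward() and setup_context():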

class LinearFunction(Function):
    @staticmethod
    # ctx is the first argument to forward
    def forward(ctx, input, weight, bias=None):
        # The forward pass can use ctx.
        ctx.save_for_backward(input, weight, bias)
        output = input.mm(weight.t())
        if bias is not None:
            output += bias.unsqueeze(0).expand_as(output)
        return output

    @staticmethod
    def backward(ctx, grad_output):
        input, weight, bias = ctx.saved_tensors
        grad_input = grad_weight = grad_bias = None

        if ctx.needs_input_grad[0]:
            grad_input = grad_output.mm(weight)
        if ctx.needs_input_grad[1]:
            grad_weight = grad_output.t().mm(input)
        if bias is not None and ctx.needs_input_grad[2]:
            grad_bias = grad_output.sum(0)

        return grad_input, grad_weight, grad_bias
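
The extra flexibility of the combined style can be sketched as follows: an intermediate tensor is saved for backward without being returned as an output. `CubeCombined` is a hypothetical example; because it saves a tensor that is neither an input nor an output, it is marked with once_differentiable, in line with Step 3 above.

```python
import torch
from torch.autograd import Function
from torch.autograd.function import once_differentiable

class CubeCombined(Function):
    @staticmethod
    def forward(ctx, x):
        # dx is an intermediate: saved for backward but not returned
        dx = 3 * x ** 2
        ctx.save_for_backward(dx)
        return x ** 3

    @staticmethod
    @once_differentiable  # saved intermediate, so no double backward support
    def backward(ctx, grad_output):
        (dx,) = ctx.saved_tensors
        return grad_output * dx

x = torch.tensor(2.0, requires_grad=True)
y = CubeCombined.apply(x)
y.backward()

print(y.item(), x.grad.item())
```

Compared with the MyCube example, the caller sees a single output, at the cost of giving up higher-order gradients.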

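3. Example: a custom third-order polynomial

This is a third-order polynomial trained to predict $\sin(x)$ from $-\pi$ to $\pi$ by minimizing the squared Euclidean distance. Instead of writing the polynomial as $y = a + bx + cx^2 + dx^3$ , we write it as $y = a + bP_3(c + dx)$ , where $P_3(x) = \frac{1}{2}(5x^3 - 3x)$ is the Legendre polynomial of degree three.
- This implementation uses PyTorch tensors to compute the forward pass and PyTorch autograd to compute gradients.
- In this implementation, we write a custom autograd function that evaluates $P_3(x)$ in the forward pass and uses $P_3'(x)$ in the backward pass. Mathematically, $P_3'(x) = \frac{3}{2}(5x^2 - 1)$ .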

import torch
import math

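class LegendrePolynomial3(torch.autograd.Function):
    """
    We can implement our own custom autograd Functions by subclassing
    torch.autograd.Function and implementing the forward and backward passes,
    which operate on Tensors.
    """

    @staticmethod
    def forward(ctx, input):
        """
        In the forward pass we receive a Tensor containing the input and return
        a Tensor containing the output. ctx is a context object that can be used
        to stash information for the backward computation. Use ctx.save_for_backward
        to cache tensors for the backward pass; other objects can be stored
        directly as attributes on ctx.
        """
        ctx.save_for_backward(input)
        return 0.5 * (5 * input ** 3 - 3 * input)

    @staticmethod
    def backward(ctx, grad_output):
        """
        In the backward pass we receive a Tensor containing the gradient of the loss
        with respect to the output, and we need to compute the gradient of the loss
        with respect to the input.
        """
        input, = ctx.saved_tensors
        return grad_output * 1.5 * (5 * input ** 2 - 1)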

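dtype = torch.float
device = torch.device("cpu")
# device = torch.device("cuda:0")  # Uncomment this to run on GPU

# Create Tensors to hold the input and output.
# By default, requires_grad=False, which indicates that we do not need to
# compute gradients with respect to these Tensors during the backward pass.
x = torch.linspace(-math.pi, math.pi, 2000, device=device, dtype=dtype)
y = torch.sin(x)

# Create Tensors for the weights. For this example, we need 4 weights:
# y = a + b * P3(c + d * x). These weights need to be initialized not too far
# from the correct result to ensure convergence. Setting requires_grad=True indicates
# that we want to compute gradients with respect to these Tensors during the backward pass.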
a = torch.full((), 0.0, device=device, dtype=dtype, requires_grad=True)
b = torch.full((), -1.0, device=device, dtype=dtype, requires_grad=True)
c = torch.full((), 0.0, device=device, dtype=dtype, requires_grad=True)
d = torch.full((), 0.3, device=device, dtype=dtype, requires_grad=True)

learning_rate = 5e-6
for t in range(2000):
    # To apply our Function, we use the Function.apply method. We alias this as 'P3'.
    P3 = LegendrePolynomial3.apply

    # Forward pass: compute the predicted y; P3 is computed with our custom autograd operation.
    y_pred = a + b * P3(c + d * x)

    # Compute and print the loss
    loss = (y_pred - y).pow(2).sum()
    if t % 100 == 99:
        print(t, loss.item())

    # Use autograd to compute the backward pass.
    loss.backward()

    # Update the weights using gradient descent
    with torch.no_grad():
        a -= learning_rate * a.grad
        b -= learning_rate * b.grad
        c -= learning_rate * c.grad
        d -= learning_rate * d.grad

        # Manually reset the gradients after updating the weights
        a.grad = None
        b.grad = None
        c.grad = None
        d.grad = None

print(f'Result: y = {a.item()} + {b.item()} * P3({c.item()} + {d.item()} x)')

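Running the code above shows the loss decreasing steadily, which indicates that the gradient computation and backpropagation are working: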

99 209.95834350585938
199 144.66018676757812
299 100.70249938964844
399 71.03519439697266
499 50.978511810302734
599 37.403133392333984
699 28.206867218017578
799 21.97318458557129
899 17.7457275390625
999 14.877889633178711
1099 12.93176555633545
1199 11.610918045043945
1299 10.71425724029541
1399 10.10548210144043
1499 9.692105293273926
1599 9.411375999450684
1699 9.220745086669922
1799 9.091285705566406
1899 9.003361701965332
1999 8.943641662597656
Result: y = -6.71270206087371e-10 + -2.208526849746704 * P3(-3.392665037793563e-10 + 0.2554861009120941 x)
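
A decreasing loss is only indirect evidence that the backward pass is right. As a more direct check, the sketch below validates LegendrePolynomial3 in two ways: against PyTorch's own autograd applied to the same formula, and with gradcheck. The evaluation points and the use of double precision are our assumptions for this illustration.

```python
import torch
from torch.autograd import gradcheck

class LegendrePolynomial3(torch.autograd.Function):
    @staticmethod
    def forward(ctx, input):
        ctx.save_for_backward(input)
        return 0.5 * (5 * input ** 3 - 3 * input)

    @staticmethod
    def backward(ctx, grad_output):
        (input,) = ctx.saved_tensors
        return grad_output * 1.5 * (5 * input ** 2 - 1)

# Double precision keeps the finite-difference comparison meaningful
x = torch.linspace(-1.0, 1.0, 7, dtype=torch.double, requires_grad=True)

# Gradient from the hand-written backward
(custom_grad,) = torch.autograd.grad(LegendrePolynomial3.apply(x).sum(), x)
# Gradient that autograd derives for the same expression
(plain_grad,) = torch.autograd.grad((0.5 * (5 * x ** 3 - 3 * x)).sum(), x)

matches_autograd = torch.allclose(custom_grad, plain_grad)
passes_gradcheck = gradcheck(LegendrePolynomial3.apply, (x,), eps=1e-6, atol=1e-4)
print(matches_autograd, passes_gradcheck)
```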

References

https://pytorch.org/docs/stable/autograd.html
https://pytorch.org/tutorials/beginner/blitz/autograd_tutorial.html
https://pytorch.org/docs/stable/notes/extending.html#extending-autograd
https://pytorch.org/tutorials/beginner/examples_autograd/two_layer_net_custom_function.html