PyTorch - Autograd

Autograd

Internally, autograd represents this graph as a graph of Function objects, which can be apply()'d to compute the result of evaluating the graph. The .grad_fn attribute of each torch.Tensor is the entry point into this graph.
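A minimal sketch of walking into this graph (the tensors here are just for illustration):

import torch

x = torch.ones(2, 2, requires_grad=True)
y = (x + 3) * (x + 4) * 0.5
z = y.sum()

# each non-leaf tensor points back to the Function that produced it
print(z.grad_fn)                 # <SumBackward0 object at ...>
print(z.grad_fn.next_functions)  # edges to the preceding Function nodes
print(x.grad_fn)                 # None: leaf tensors have no grad_fn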

Concurrent Training on CPU

The gradient can be non-deterministic here!

import threading

import torch


# Define a train function to be used in different threads
def train_fn():
    x = torch.ones(5, 5, requires_grad=True)
    # forward
    y = (x + 3) * (x + 4) * 0.5
    # backward
    y.sum().backward()
    # potential optimizer update


# Users write their own threading code to drive train_fn
threads = []
for _ in range(10):
    p = threading.Thread(target=train_fn, args=())
    p.start()
    threads.append(p)

for p in threads:
    p.join()

The optimizer's zero_grad() only needs to be run once.

Custom Layers and Backward

Custom operations: for details, see the official API docs for torch.autograd.Function and torch.autograd.function._ContextMethodMixin (the ctx object).

Example 1: Binarize the input and use our own defined backward.

import torch
from torch.autograd import Function


class BinarizedF(Function):
    @staticmethod
    def forward(ctx, input_):  # note the use of ctx here
        ctx.save_for_backward(input_)  # save the forward-pass data for use in the backward pass
        a = torch.ones_like(input_)
        b = -torch.ones_like(input_)
        output = torch.where(input_ >= 0, a, b)
        return output

    @staticmethod
    def backward(ctx, output_grad):
        input_, = ctx.saved_tensors
        input_abs = torch.abs(input_)
        ones = torch.ones_like(input_)
        zeros = torch.zeros_like(input_)
        # straight-through estimator: pass the upstream gradient only where |input| <= 1
        input_grad = output_grad * torch.where(input_abs <= 1, ones, zeros)
        return input_grad

class Binarized(torch.nn.Module):
    def __init__(self):
        super(Binarized, self).__init__()

    def forward(self, input_):
        return BinarizedF.apply(input_)
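
A short usage sketch of the module above (the input tensor is arbitrary):

layer = Binarized()
x = torch.randn(4, requires_grad=True)
out = layer(x)          # entries are +1 / -1
out.sum().backward()
print(x.grad)           # 1 where |x| <= 1, 0 elsewhere (straight-through estimator)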

Example 2: Custom Linear Layer

import torch
import torch.nn as nn
from torch.autograd import Function


# Inherit from Function
class LinearFunction(Function):

    # Note that both forward and backward are @staticmethods
    @staticmethod
    # bias is an optional argument
    def forward(ctx, input, weight, bias=None):
        ctx.save_for_backward(input, weight, bias)
        output = input.mm(weight.t())
        if bias is not None:
            output += bias.unsqueeze(0).expand_as(output)
        return output

    # This function has only a single output, so it gets only one gradient
    @staticmethod
    def backward(ctx, grad_output):
        # This is a pattern that is very convenient - at the top of backward
        # unpack saved_tensors and initialize all gradients w.r.t. inputs to
        # None. Thanks to the fact that additional trailing Nones are
        # ignored, the return statement is simple even when the function has
        # optional inputs.
        input, weight, bias = ctx.saved_tensors
        grad_input = grad_weight = grad_bias = None

        # These needs_input_grad checks are optional and are there only to
        # improve efficiency. If you want to make your code simpler, you can
        # skip them. Returning gradients for inputs that don't require it is
        # not an error.
        if ctx.needs_input_grad[0]:
            grad_input = grad_output.mm(weight)
        if ctx.needs_input_grad[1]:
            grad_weight = grad_output.t().mm(input)
        if bias is not None and ctx.needs_input_grad[2]:
            grad_bias = grad_output.sum(0)

        # these returned gradients are accumulated into the corresponding tensors' .grad attributes
        return grad_input, grad_weight, grad_bias

class Linear(nn.Module):
    def __init__(self, input_features, output_features, bias=True):
        super(Linear, self).__init__()
        self.input_features = input_features
        self.output_features = output_features

        # nn.Parameter is a special kind of Tensor, that will get
        # automatically registered as Module's parameter once it's assigned
        # as an attribute. Parameters and buffers need to be registered, or
        # they won't appear in .parameters() (doesn't apply to buffers), and
        # won't be converted when e.g. .cuda() is called. You can use
        # .register_buffer() to register buffers.
        # nn.Parameters require gradients by default.
        self.weight = nn.Parameter(torch.Tensor(output_features, input_features))
        if bias:
            self.bias = nn.Parameter(torch.Tensor(output_features))
        else:
            # You should always register all possible parameters, but the
            # optional ones can be None if you want.
            self.register_parameter('bias', None)

        # Not a very smart way to initialize weights
        self.weight.data.uniform_(-0.1, 0.1)
        if self.bias is not None:
            self.bias.data.uniform_(-0.1, 0.1)

    def forward(self, input):
        # See the autograd section for explanation of what happens here.
        return LinearFunction.apply(input, self.weight, self.bias)

To implement a custom layer with its own backward pass: subclass torch.autograd.Function and implement forward & backward, then wrap it in an nn.Module that implements forward, and it can be used like any ordinary neural-network layer.
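
A quick way to sanity-check a custom backward is torch.autograd.gradcheck, which compares the analytical gradients against numerical ones; a minimal sketch in double precision (shapes are arbitrary):

import torch
from torch.autograd import gradcheck

linear = LinearFunction.apply
inputs = (torch.randn(20, 20, dtype=torch.double, requires_grad=True),
          torch.randn(30, 20, dtype=torch.double, requires_grad=True))
print(gradcheck(linear, inputs, eps=1e-6, atol=1e-4))  # True if backward is consistent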

Monitoring Inputs and Outputs of Each Layer (Hooks)

def printnorm(self, input, output):
    # input is a tuple of packed tensors
    # output is a Tensor whose data we are interested in
    print('input size : ', input[0].size())  # input[0] is the input tensor

net.conv2.register_forward_hook(printnorm)  # called every time forward() runs

def printgrad(self, grad_input, grad_output):
    # grad_input and grad_output are tuples
    print('grad_input size : ', grad_input[0].size())
    print('grad_output size : ', grad_output[0].size())

net.conv2.register_backward_hook(printgrad)
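
Both register calls return a handle that can be kept to detach the hook later; in recent PyTorch versions, register_full_backward_hook is the recommended replacement for register_backward_hook. A small sketch reusing the functions above:

forward_handle = net.conv2.register_forward_hook(printnorm)
backward_handle = net.conv2.register_backward_hook(printgrad)

# ... run forward / backward passes ...

# remove the hooks once monitoring is no longer needed
forward_handle.remove()
backward_handle.remove()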

Transfer Learning & Fine-tuning

During training you sometimes train only a subset of the parameters (staged training). From the official documentation on requires_grad:

If there’s a single input to an operation that requires gradient, its output will also require gradient. Conversely, only if all inputs don’t require gradient, the output also won’t require it.
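
A minimal illustration of that propagation rule:

import torch

x = torch.randn(3)                    # requires_grad=False by default
w = torch.randn(3, requires_grad=True)
print((x * w).requires_grad)          # True: one input requires grad
print((x + x).requires_grad)          # False: no input requires grad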

Setting requires_grad only saves computation; whether a parameter actually gets updated is decided by the Optimizer.

Pay attention to how to freeze parameters inside a network: only add the parameters that should be updated to the optimizer.

import torch.nn as nn
import torch.optim as optim
import torchvision

model = torchvision.models.resnet18(pretrained=True)
for param in model.parameters():
    param.requires_grad = False
model.fc = nn.Linear(512, 100)
# optimize only the classifier
optimizer = optim.SGD(model.fc.parameters(), lr=0.02)
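
An equivalent and slightly more general pattern is to filter by requires_grad, which also works when several submodules remain trainable (a sketch, not the only way):

trainable_params = [p for p in model.parameters() if p.requires_grad]
optimizer = optim.SGD(trainable_params, lr=0.02)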

Batch Accumulation

After backward() in PyTorch, each tensor's gradient is stored in its .grad attribute. When the computation graph is built and backpropagated repeatedly, the gradients accumulate (they are summed into .grad). So you can backpropagate over several batches first and only then run the Optimizer update; a sketch of such a loop follows the Optimizer.zero_grad() source below.

def zero_grad(self):
    r"""Clears the gradients of all optimized :class:`torch.Tensor` s."""
    for group in self.param_groups:
        for p in group['params']:
            if p.grad is not None:
                p.grad.detach_()
                p.grad.zero_()
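
A minimal sketch of a gradient-accumulation loop built on these semantics (model, loss_fn, data_loader, and accumulation_steps are placeholders):

accumulation_steps = 4

optimizer.zero_grad()
for i, (inputs, targets) in enumerate(data_loader):
    loss = loss_fn(model(inputs), targets) / accumulation_steps  # keep the effective loss scale
    loss.backward()                       # gradients accumulate in .grad
    if (i + 1) % accumulation_steps == 0:
        optimizer.step()                  # update with the accumulated gradients
        optimizer.zero_grad()             # clear .grad before the next accumulation window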

Gradient Operations

Gradient Penalty

Used in StyleGAN (the R1 regularizer; see the paper). The idea is to add the norm of the discriminator's gradient with respect to the real inputs to the total loss as a penalty term:

$\frac{\gamma}{2} E_{P_D(x)}\left[ \| \nabla D_{\phi}(x) \|^2 \right]$

In PyTorch, torch.autograd.grad can be used to compute this gradient, i.e. $\nabla D_{\phi}(x)$. E.g.

# fragment from a StyleGAN-style discriminator loss; `reals` must have requires_grad=True
if r1_gamma != 0.0:
    real_loss = d_result_real.sum()
    # create_graph=True keeps the gradient differentiable so the penalty itself can be backpropagated
    real_grads = torch.autograd.grad(real_loss, reals, create_graph=True, retain_graph=True)[0]
    r1_penalty = torch.sum(real_grads.pow(2.0), dim=[1, 2, 3])
    loss = loss + r1_penalty * (r1_gamma * 0.5)
return loss.mean()

real_loss is $D_{\phi}(x)$ (summed over the batch), i.e. the quantity that is backpropagated; $x$ is the input, i.e. the real images (reals).

Gradient Clipping

Gradient clipping prevents overly large gradients from making training diverge or converge slowly. Use torch.nn.utils.clip_grad_norm_ (see the API docs); note that this is an in-place operation.

torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)

The first argument can be an iterable of Tensors (e.g. a list) or a single Tensor.
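
In a training loop, clipping is applied after backward() has populated .grad and before the optimizer step; a minimal sketch (model, optimizer, loss_fn are placeholders):

import torch

optimizer.zero_grad()
loss = loss_fn(model(inputs), targets)
loss.backward()
# rescale .grad in place so that the total norm is at most 1.0
torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
optimizer.step()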

Notes

Try to minimize direct element assignment into a tensor (for a plain constant assignment there is no gradient to compute anyway). It is better to build a constant tensor and combine it with the original via arithmetic operations, or to copy other tensors into the target tensor to achieve the same effect; PyTorch's backward pass does support such indexed assignment of tensors.
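
A small sketch of the two patterns (shapes and the mask are arbitrary):

import torch

x = torch.randn(3, 3, requires_grad=True)

# pattern 1: arithmetic with a constant instead of overwriting values
mask = (x > 0).float()
y1 = x * mask + (1 - mask) * 0.5      # positions with x <= 0 become the constant 0.5

# pattern 2: indexed assignment of tensors, which autograd also tracks
y2 = torch.zeros_like(x)
y2[:, :2] = x[:, :2] * 2              # gradient flows through the assigned slice

(y1.sum() + y2.sum()).backward()
print(x.grad)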
