Autograd
Internally, autograd represents this graph as a graph of Function objects, which can be apply()-ed to compute the result of evaluating the graph. The .grad_fn attribute of each torch.Tensor is the entry point into this graph.
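A minimal sketch of this entry point (the shapes here are arbitrary):

import torch

x = torch.ones(2, 2, requires_grad=True)
y = (x + 3) * 2
print(y.grad_fn)                 # <MulBackward0 ...>: the graph node that produced y
print(y.grad_fn.next_functions)  # upstream Function nodes of the graph
print(x.grad_fn)                 # None: user-created leaf tensors have no grad_fn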
Concurrent Training on CPU
The gradient can be non-deterministic here!
import threading
import torch

# Define a train function to be used in different threads
def train_fn():
    x = torch.ones(5, 5, requires_grad=True)
    # forward
    y = (x + 3) * (x + 4) * 0.5
    # backward
    y.sum().backward()
    # potential optimizer update

# Users write their own threading code to drive train_fn
threads = []
for _ in range(10):
    p = threading.Thread(target=train_fn, args=())
    p.start()
    threads.append(p)
for p in threads:
    p.join()
In this setting, the optimizer's zero_grad() only needs to be called once.
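A sketch of why a single zero_grad() suffices (the shared-model setup below is an assumption, not from the original post): when threads share one model, their backward() calls all accumulate into the same .grad buffers, so one step()/zero_grad() pair after join() consumes all of them.

import threading
import torch
import torch.nn as nn
import torch.optim as optim

model = nn.Linear(5, 1)                          # shared across threads
optimizer = optim.SGD(model.parameters(), lr=0.01)

def train_fn():
    x = torch.ones(4, 5)
    loss = model(x).sum()
    loss.backward()                              # accumulates into the shared .grad buffers

threads = [threading.Thread(target=train_fn) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

optimizer.step()       # uses the gradients summed over all threads
optimizer.zero_grad()  # one call clears everything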
Custom Layers and Backward
For custom operations, see the official API docs for torch.autograd.Function and torch.autograd.function._ContextMethodMixin (the ctx argument).
Example 1: binarize the input and use our own defined backward.
import torch
from torch.autograd import Function

class BinarizedF(Function):
    @staticmethod
    def forward(ctx, input_):          # note the use of ctx here
        ctx.save_for_backward(input_)  # stash the forward input for use in backward
        a = torch.ones_like(input_)
        b = -torch.ones_like(input_)
        output = torch.where(input_ >= 0, a, b)
        return output

    @staticmethod
    def backward(ctx, output_grad):
        input_, = ctx.saved_tensors
        input_abs = torch.abs(input_)
        ones = torch.ones_like(input_)
        zeros = torch.zeros_like(input_)
        # straight-through estimator: by the chain rule, multiply the incoming
        # gradient by the local (approximate) gradient, 1 where |input| <= 1
        input_grad = output_grad * torch.where(input_abs <= 1, ones, zeros)
        return input_grad
class Binarized(torch.nn.Module):
    def __init__(self):
        super(Binarized, self).__init__()

    def forward(self, input_):
        return BinarizedF.apply(input_)
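A quick usage check (a sketch; the tensor shape is arbitrary):

layer = Binarized()
x = torch.randn(4, requires_grad=True)
y = layer(x)           # values in {-1, +1}
y.sum().backward()
print(x.grad)          # 1 where |x| <= 1, else 0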
Example 2: custom linear layer.
# Inherit from Function
class LinearFunction(Function):

    # Note that both forward and backward are @staticmethods
    @staticmethod
    # bias is an optional argument
    def forward(ctx, input, weight, bias=None):
        ctx.save_for_backward(input, weight, bias)
        output = input.mm(weight.t())
        if bias is not None:
            output += bias.unsqueeze(0).expand_as(output)
        return output

    # This function has only a single output, so it gets only one gradient
    @staticmethod
    def backward(ctx, grad_output):
        # This is a pattern that is very convenient - at the top of backward
        # unpack saved_tensors and initialize all gradients w.r.t. inputs to
        # None. Thanks to the fact that additional trailing Nones are
        # ignored, the return statement is simple even when the function has
        # optional inputs.
        input, weight, bias = ctx.saved_tensors
        grad_input = grad_weight = grad_bias = None

        # These needs_input_grad checks are optional and are there only to
        # improve efficiency. If you want to make your code simpler, you can
        # skip them. Returning gradients for inputs that don't require it is
        # not an error.
        if ctx.needs_input_grad[0]:
            grad_input = grad_output.mm(weight)
        if ctx.needs_input_grad[1]:
            grad_weight = grad_output.t().mm(input)
        if bias is not None and ctx.needs_input_grad[2]:
            grad_bias = grad_output.sum(0)

        # these end up in the corresponding tensors' .grad attributes
        return grad_input, grad_weight, grad_bias
class Linear(nn.Module):
    def __init__(self, input_features, output_features, bias=True):
        super(Linear, self).__init__()
        self.input_features = input_features
        self.output_features = output_features

        # nn.Parameter is a special kind of Tensor, that will get
        # automatically registered as Module's parameter once it's assigned
        # as an attribute. Parameters and buffers need to be registered, or
        # they won't appear in .parameters() (doesn't apply to buffers), and
        # won't be converted when e.g. .cuda() is called. You can use
        # .register_buffer() to register buffers.
        # nn.Parameters require gradients by default.
        self.weight = nn.Parameter(torch.Tensor(output_features, input_features))
        if bias:
            self.bias = nn.Parameter(torch.Tensor(output_features))
        else:
            # You should always register all possible parameters, but the
            # optional ones can be None if you want.
            self.register_parameter('bias', None)

        # Not a very smart way to initialize weights
        self.weight.data.uniform_(-0.1, 0.1)
        if self.bias is not None:
            self.bias.data.uniform_(-0.1, 0.1)

    def forward(self, input):
        # See the autograd section for explanation of what happens here.
        return LinearFunction.apply(input, self.weight, self.bias)
That is the recipe for a custom layer with its own backward: inherit from Function in torch.autograd and implement forward and backward, then wrap it in an nn.Module that implements forward; after that it behaves like any ordinary network layer.
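To verify a hand-written backward, torch.autograd.gradcheck compares it against numerical finite differences (a sketch; double precision is required for the check to be reliable):

from torch.autograd import gradcheck

# gradcheck needs double-precision inputs
linear = LinearFunction.apply
input = (torch.randn(20, 20, dtype=torch.double, requires_grad=True),
         torch.randn(30, 20, dtype=torch.double, requires_grad=True))
test = gradcheck(linear, input, eps=1e-6, atol=1e-4)
print(test)  # True if the analytic gradients match the numerical ones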
Monitoring Inputs and Outputs of Each Layer (the Hook Mechanism)
def printnorm(self, input, output):
    # input is a tuple of packed tensors
    # output is a Tensor whose data we are interested in
    print('input size : ', input[0].size())  # input[0] is the input tensor

# the hook fires every time forward is called
net.conv2.register_forward_hook(printnorm)

def printgrad(self, grad_input, grad_output):
    # grad_input and grad_output are tuples
    print('grad_input size : ', grad_input[0].size())
    print('grad_output norm : ', grad_output[0].norm())

# (newer PyTorch versions prefer register_full_backward_hook)
net.conv2.register_backward_hook(printgrad)
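The snippet above presupposes a net with a conv2 layer; a self-contained sketch (the tiny model is made up for illustration):

import torch
import torch.nn as nn

class TinyNet(nn.Module):
    def __init__(self):
        super(TinyNet, self).__init__()
        self.conv1 = nn.Conv2d(1, 4, kernel_size=3)
        self.conv2 = nn.Conv2d(4, 8, kernel_size=3)

    def forward(self, x):
        return self.conv2(self.conv1(x))

net = TinyNet()
fwd_handle = net.conv2.register_forward_hook(printnorm)
bwd_handle = net.conv2.register_backward_hook(printgrad)

out = net(torch.randn(1, 1, 8, 8))
out.sum().backward()   # triggers both hooks
fwd_handle.remove()    # hooks can be detached when no longer needed
bwd_handle.remove()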
Transfer Learning & Fine-tuning
During training you sometimes want to update only part of the parameters (stage-wise training). On requires_grad, the official docs say:

If there’s a single input to an operation that requires gradient, its output will also require gradient. Conversely, only if all inputs don’t require gradient, the output also won’t require it.

Setting requires_grad = False merely saves computation; whether a parameter is actually updated is decided by the optimizer.
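A two-line illustration of that propagation rule (a sketch):

a = torch.randn(3, requires_grad=True)
b = torch.randn(3)                # requires_grad=False
print((a + b).requires_grad)      # True: one input requires grad
print((b + b).requires_grad)      # False: no input requires grad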
Pay attention to how to freeze parameters inside a network: only add the parameters to be updated to the optimizer.
import torchvision
import torch.nn as nn
import torch.optim as optim

model = torchvision.models.resnet18(pretrained=True)
for param in model.parameters():
    param.requires_grad = False

# replacing the head creates fresh parameters with requires_grad=True
model.fc = nn.Linear(512, 100)

# optimize only the classifier
optimizer = optim.SGD(model.fc.parameters(), lr=0.02)
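When the frozen and trainable parameters are scattered across the network, a common pattern (a sketch, not from the original post) is to filter on requires_grad instead of naming a submodule:

# pass only the trainable parameters to the optimizer
trainable = filter(lambda p: p.requires_grad, model.parameters())
optimizer = optim.SGD(trainable, lr=0.02)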
Batch Accumulation
After backward() in PyTorch, each Tensor's gradient is stored in its .grad attribute. When the computation graph is rebuilt and backpropagated repeatedly, the gradients accumulate, so you can backpropagate over several batches first and only then let the optimizer update the parameters. This is also why zero_grad() has to clear the gradients explicitly:
def zero_grad(self):
    r"""Clears the gradients of all optimized :class:`torch.Tensor` s."""
    for group in self.param_groups:
        for p in group['params']:
            if p.grad is not None:
                p.grad.detach_()
                p.grad.zero_()
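A sketch of gradient accumulation built on this behavior (model, loader, criterion, and accum_steps are placeholders):

accum_steps = 4  # effective batch size = accum_steps * loader batch size
optimizer.zero_grad()
for i, (x, target) in enumerate(loader):
    loss = criterion(model(x), target) / accum_steps  # keep gradient scale comparable
    loss.backward()                                   # gradients accumulate in .grad
    if (i + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()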
Gradient Operations
Gradient Penalty
Used in StyleGAN (see the paper): the discriminator's gradient is itself added to the total loss as a penalty term:
$\frac{\gamma}{2} E_{P_D(x)}\left[ \lVert \nabla D_{\phi}(x) \rVert^2 \right]$
In PyTorch this gradient, $\nabla D_{\phi}(x)$, can be computed with torch.autograd.grad. E.g.:
if r1_gamma != 0.0:
    real_loss = d_result_real.sum()
    real_grads = torch.autograd.grad(real_loss, reals,
                                     create_graph=True, retain_graph=True)[0]
    r1_penalty = torch.sum(real_grads.pow(2.0), dim=[1, 2, 3])
    loss = loss + r1_penalty * (r1_gamma * 0.5)
return loss.mean()
real_loss is $D_{\phi}(x)$ (summed over the batch), i.e. the loss used for backpropagation; $x$ is the input, i.e. the images. create_graph=True makes the computed gradient itself part of the graph, so the penalty term can in turn be backpropagated through.
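Note that for torch.autograd.grad(real_loss, reals, ...) to work, reals must require grad. A sketch of the assumed setup (D and real_images are hypothetical names):

# inputs must require grad so autograd can differentiate w.r.t. them
reals = real_images.detach().requires_grad_(True)
d_result_real = D(reals)  # D is the discriminator (assumed)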
Gradient Clipping
Gradient clipping prevents gradients that are too large from making training diverge or struggle to converge. Use torch.nn.utils.clip_grad_norm_ (see the API docs); note the trailing underscore: it is an in-place operation.
torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
The first argument can be an iterable of Tensors (such as a list) or a single Tensor.
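Clipping operates on .grad, so it must run after backward() and before step(). A sketch of where it sits in a training loop (model, loader, criterion are placeholders):

for x, target in loader:
    optimizer.zero_grad()
    loss = criterion(model(x), target)
    loss.backward()
    # rescale the gradients in place so their total norm is at most 1.0
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    optimizer.step()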
Notes
Avoid writing values directly into a tensor (there is no gradient to compute for the assignment itself); it is better to create a constant tensor and combine it with the original via arithmetic, or to compose the target tensor from several others. That said, PyTorch's backward pass does support assignment operations.
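A sketch of the recommended pattern, composing with torch.where instead of assigning into the tensor:

import torch

x = torch.randn(3, 3, requires_grad=True)

# compose the result with arithmetic / torch.where instead of writing into x
y = torch.where(x < 0, torch.zeros_like(x), x)
y.sum().backward()
print(x.grad)  # 1.0 where x >= 0, 0.0 elsewhere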