Relearning PyTorch, Day 4: Module and Optimizer

Module and Optimizer

First, a note: for building neural networks PyTorch offers both an object-oriented style and a functional style, under torch.nn and torch.nn.functional respectively. In the object-oriented style, state (the model parameters) lives inside the objects; in the functional style, the functions are stateless and the user maintains the state. For stateless layers (such as activations, normalization and pooling) the two styles differ very little. The object-oriented style is the more widely used one, and it is the focus of this post.
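
To make the two styles concrete, here is a minimal sketch (the layer sizes are arbitrary): the object-oriented nn.Linear owns its weight and bias, while the functional F.linear is stateless and expects the caller to pass the parameters in.

import torch
import torch.nn as nn
import torch.nn.functional as F

x = torch.randn(4, 8)

# Object-oriented: the layer object holds weight and bias as its own state
linear = nn.Linear(8, 2)
y1 = linear(x)

# Functional: stateless; the caller owns the parameters and passes them in
weight = torch.randn(2, 8, requires_grad=True)
bias = torch.zeros(2, requires_grad=True)
y2 = F.linear(x, weight, bias)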

What is a Module?

  • A stateful neural-network computation unit
  • Tightly integrated with PyTorch's autograd system; together with Parameter and Optimizer it forms a neural-network model that is easy to train
  • Easy to store and to move around: torch has built-in serialization and deserialization, switches seamlessly between devices such as CPU, GPU and TPU, and supports model pruning, quantization, JIT, and so on

The key points are just two: state and computation (which somehow sounds a lot like dynamic programming).

torch.nn.Module is the base class of all neural-network layers in torch.

A Module consists of:

  • submodules: child modules
  • Parameter: parameters; they track gradients and can be updated by an optimizer
  • Buffer: buffers are part of the module's state but are not parameters; for example, BatchNorm's running_mean is not a parameter, yet it is part of the model state (see the sketch below)
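
A minimal sketch of the difference (RunningScale is a made-up module and its running-statistics logic is deliberately simplified):

import torch
import torch.nn as nn

class RunningScale(nn.Module):
    def __init__(self, dim):
        super().__init__()
        # Parameter: registered automatically, yielded by parameters(), updated by the optimizer
        self.weight = nn.Parameter(torch.ones(dim))
        # Buffer: saved in state_dict(), but never touched by the optimizer
        self.register_buffer("running_mean", torch.zeros(dim))

    def forward(self, x):
        if self.training:
            self.running_mean = (0.9 * self.running_mean + 0.1 * x.mean(dim=0)).detach()
        return (x - self.running_mean) * self.weight

m = RunningScale(4)
print([name for name, _ in m.named_parameters()])  # ['weight']
print([name for name, _ in m.named_buffers()])     # ['running_mean']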

Public attributes

  • training: indicates whether the module is in train mode or eval mode; switch it with Module.train() / Module.eval() (or assign it directly)

Public methods

Note: unless stated otherwise, Module methods operate in place.

Casting

  • bfloat16()
  • half()
  • float()
  • double()
  • type(dst_type)
  • to(dtype)

Moving between devices

  • cpu()
  • cuda(device=None)
  • to(device)

Forward computation

  • forward
  • zero_grad

Adding submodules

  • add_module(name, module): attaches a child module under the name name
  • register_module(name, module): same as the above
  • register_buffer(name, tensor, persistent=True): attaches a buffer

Registering hooks

As of PyTorch 1.11 there are five of them. They are commonly used to print tensor shapes or dtypes, report progress, or push values into TensorBoard; see the sketch after the list.

  • register_backward_hook(hook)
  • register_forward_hook(hook)
  • register_forward_pre_hook(hook)
  • register_full_backward_hook(hook)
  • register_load_state_dict_post_hook(hook)
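
For instance, a forward hook can log the output shape of every submodule (a minimal sketch; the model here is an arbitrary stand-in):

import torch
import torch.nn as nn

model = nn.Sequential(nn.Conv2d(1, 8, 3), nn.ReLU(), nn.Conv2d(8, 16, 3))

def log_shape(module, inputs, output):
    # called after every forward() of the module the hook is registered on
    print(f"{module.__class__.__name__}: {tuple(output.shape)}")

handles = [m.register_forward_hook(log_shape) for m in model.children()]
model(torch.randn(1, 1, 28, 28))

for h in handles:   # hooks return handles and can be removed when no longer needed
    h.remove()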

Iteration (read-only)

  • modules(): returns an iterator that recursively walks every module in the tree; each module is yielded only once.
  • children(): returns an iterator over immediate child modules; unlike modules(), it does not recurse.
  • parameters(recurse=True): returns an iterator over module parameters, typically passed to an optimizer; with recurse=True it recurses into submodules, otherwise it yields only the parameters that are direct members of this module.
  • named_children(), named_modules(), named_parameters(): the named versions of the functions above.

Iteration with modification

  • apply(fn): recursively applies fn to self and every submodule (via .children()); commonly used for weight initialization, as in the sketch below
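
A typical pattern, as a minimal sketch (the network and the choice of initializers are arbitrary): walk the module tree with apply and initialize each layer type differently.

import torch.nn as nn

def init_weights(m):
    # fn receives every submodule, so dispatch on its type
    if isinstance(m, nn.Linear):
        nn.init.xavier_uniform_(m.weight)
        nn.init.zeros_(m.bias)
    elif isinstance(m, nn.Conv2d):
        nn.init.kaiming_normal_(m.weight, nonlinearity="relu")

net = nn.Sequential(nn.Conv2d(3, 16, 3), nn.ReLU(), nn.Flatten(), nn.Linear(16 * 26 * 26, 10))
net.apply(init_weights)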

Serialization

  • load_state_dict(state_dict, strict=True): loads a state_dict into this module's parameters and buffers
  • state_dict(): returns this module's state_dict, which can then be saved with torch.save (see the sketch below)
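
The usual save/load round trip looks like this (a minimal sketch; the file name is arbitrary):

import torch
import torch.nn as nn

model = nn.Linear(4, 2)
torch.save(model.state_dict(), "model.pt")        # saves only the state, not the class definition

restored = nn.Linear(4, 2)                        # rebuild the architecture first
restored.load_state_dict(torch.load("model.pt"))  # then load the state into it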

Memory format

  • to(memory_format): memory_format was covered in the Tensor post.

Miscellaneous

  • extra_repr
  • train
  • eval
  • get_submodule
  • set_extra_state
  • share_memory(): moves the underlying storage into shared memory; this is a no-op for CUDA tensors. Tensors whose storage lives in shared memory cannot be resized.
  • to_empty(device): moves the parameters and buffers to the specified device without copying storage.

Using Module

import torch.nn as nn
import torch.nn.functional as F

class Model(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 20, 5)
        self.conv2 = nn.Conv2d(20, 20, 5)

    def forward(self, x):
        x = F.relu(self.conv1(x))
        return F.relu(self.conv2(x))
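
Using the class above (a small sketch; the 28×28 input size is arbitrary, as long as it is large enough for the two 5×5 convolutions):

import torch

model = Model()
out = model(torch.randn(1, 1, 28, 28))    # calling the module runs forward() via __call__
print(out.shape)                          # torch.Size([1, 20, 20, 20])

for name, p in model.named_parameters():  # conv1.weight, conv1.bias, conv2.weight, conv2.bias
    print(name, tuple(p.shape))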

Commonly used Module classes

For the building blocks commonly used in neural networks, PyTorch already provides high-performance implementations.

Parameters

  • Parameter: A kind of Tensor that is to be considered a module parameter.
  • UninitializedParameter: A parameter that is not initialized.
  • UninitializedBuffer: A buffer that is not initialized.

Containers

  • Module: Base class for all neural network modules.
  • Sequential: A sequential container.
  • ModuleList: Holds submodules in a list.
  • ModuleDict: Holds submodules in a dictionary.
  • ParameterList: Holds parameters in a list.
  • ParameterDict: Holds parameters in a dictionary.

Global hooks

  • register_module_forward_pre_hook: Registers a forward pre-hook common to all modules.
  • register_module_forward_hook: Registers a global forward hook for all the modules.
  • register_module_backward_hook: Registers a backward hook common to all the modules.
  • register_module_full_backward_hook: Registers a backward hook common to all the modules.

Convolution layers

  • nn.Conv1d: Applies a 1D convolution over an input signal composed of several input planes.
  • nn.Conv2d: Applies a 2D convolution over an input signal composed of several input planes.
  • nn.Conv3d: Applies a 3D convolution over an input signal composed of several input planes.
  • nn.ConvTranspose1d: Applies a 1D transposed convolution operator over an input image composed of several input planes.
  • nn.ConvTranspose2d: Applies a 2D transposed convolution operator over an input image composed of several input planes.
  • nn.ConvTranspose3d: Applies a 3D transposed convolution operator over an input image composed of several input planes.
  • nn.LazyConv1d: A torch.nn.Conv1d module with lazy initialization of its in_channels argument, inferred from input.size(1).
  • nn.LazyConv2d: A torch.nn.Conv2d module with lazy initialization of its in_channels argument, inferred from input.size(1).
  • nn.LazyConv3d: A torch.nn.Conv3d module with lazy initialization of its in_channels argument, inferred from input.size(1).
  • nn.LazyConvTranspose1d: A torch.nn.ConvTranspose1d module with lazy initialization of its in_channels argument, inferred from input.size(1).
  • nn.LazyConvTranspose2d: A torch.nn.ConvTranspose2d module with lazy initialization of its in_channels argument, inferred from input.size(1).
  • nn.LazyConvTranspose3d: A torch.nn.ConvTranspose3d module with lazy initialization of its in_channels argument, inferred from input.size(1).
  • nn.Unfold: Extracts sliding local blocks from a batched input tensor.
  • nn.Fold: Combines an array of sliding local blocks into a large containing tensor.
Pooling layers

  • nn.MaxPool1d: Applies a 1D max pooling over an input signal composed of several input planes.
  • nn.MaxPool2d: Applies a 2D max pooling over an input signal composed of several input planes.
  • nn.MaxPool3d: Applies a 3D max pooling over an input signal composed of several input planes.
  • nn.MaxUnpool1d: Computes a partial inverse of MaxPool1d.
  • nn.MaxUnpool2d: Computes a partial inverse of MaxPool2d.
  • nn.MaxUnpool3d: Computes a partial inverse of MaxPool3d.
  • nn.AvgPool1d: Applies a 1D average pooling over an input signal composed of several input planes.
  • nn.AvgPool2d: Applies a 2D average pooling over an input signal composed of several input planes.
  • nn.AvgPool3d: Applies a 3D average pooling over an input signal composed of several input planes.
  • nn.FractionalMaxPool2d: Applies a 2D fractional max pooling over an input signal composed of several input planes.
  • nn.FractionalMaxPool3d: Applies a 3D fractional max pooling over an input signal composed of several input planes.
  • nn.LPPool1d: Applies a 1D power-average pooling over an input signal composed of several input planes.
  • nn.LPPool2d: Applies a 2D power-average pooling over an input signal composed of several input planes.
  • nn.AdaptiveMaxPool1d: Applies a 1D adaptive max pooling over an input signal composed of several input planes.
  • nn.AdaptiveMaxPool2d: Applies a 2D adaptive max pooling over an input signal composed of several input planes.
  • nn.AdaptiveMaxPool3d: Applies a 3D adaptive max pooling over an input signal composed of several input planes.
  • nn.AdaptiveAvgPool1d: Applies a 1D adaptive average pooling over an input signal composed of several input planes.
  • nn.AdaptiveAvgPool2d: Applies a 2D adaptive average pooling over an input signal composed of several input planes.
  • nn.AdaptiveAvgPool3d: Applies a 3D adaptive average pooling over an input signal composed of several input planes.

Padding layers

  • nn.ReflectionPad1d: Pads the input tensor using the reflection of the input boundary.
  • nn.ReflectionPad2d: Pads the input tensor using the reflection of the input boundary.
  • nn.ReflectionPad3d: Pads the input tensor using the reflection of the input boundary.
  • nn.ReplicationPad1d: Pads the input tensor using replication of the input boundary.
  • nn.ReplicationPad2d: Pads the input tensor using replication of the input boundary.
  • nn.ReplicationPad3d: Pads the input tensor using replication of the input boundary.
  • nn.ZeroPad2d: Pads the input tensor boundaries with zero.
  • nn.ConstantPad1d: Pads the input tensor boundaries with a constant value.
  • nn.ConstantPad2d: Pads the input tensor boundaries with a constant value.
  • nn.ConstantPad3d: Pads the input tensor boundaries with a constant value.
Non-linear activations

  • nn.ELU: Applies the Exponential Linear Unit (ELU) function element-wise, as described in the paper Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs).
  • nn.Hardshrink: Applies the Hard Shrinkage (Hardshrink) function element-wise.
  • nn.Hardsigmoid: Applies the Hardsigmoid function element-wise.
  • nn.Hardtanh: Applies the HardTanh function element-wise.
  • nn.Hardswish: Applies the Hardswish function element-wise.
  • nn.LeakyReLU: Applies the LeakyReLU function element-wise.
  • nn.LogSigmoid: Applies the LogSigmoid function element-wise.
  • nn.MultiheadAttention: Allows the model to jointly attend to information from different representation subspaces, as described in the paper Attention Is All You Need.
  • nn.PReLU: Applies the PReLU function element-wise.
  • nn.ReLU: Applies the rectified linear unit function element-wise.
  • nn.ReLU6: Applies the ReLU6 function element-wise.
  • nn.RReLU: Applies the randomized leaky rectified linear unit function element-wise.
  • nn.SELU: Applies the SELU function element-wise.
  • nn.CELU: Applies the CELU function element-wise.
  • nn.GELU: Applies the Gaussian Error Linear Units function.
  • nn.Sigmoid: Applies the Sigmoid function element-wise.
  • nn.SiLU: Applies the Sigmoid Linear Unit (SiLU) function element-wise.
  • nn.Mish: Applies the Mish function element-wise.
  • nn.Softplus: Applies the Softplus function $\text{Softplus}(x) = \frac{1}{\beta} \log(1 + \exp(\beta x))$ element-wise.
  • nn.Softshrink: Applies the soft shrinkage function element-wise.
  • nn.Softsign: Applies the Softsign function element-wise.
  • nn.Tanh: Applies the Hyperbolic Tangent (Tanh) function element-wise.
  • nn.Tanhshrink: Applies the Tanhshrink function element-wise.
  • nn.Threshold: Thresholds each element of the input Tensor.
  • nn.GLU: Applies the gated linear unit function $\text{GLU}(a, b) = a \otimes \sigma(b)$, where $a$ is the first half of the input matrices and $b$ is the second half.

Other non-linear layers

  • nn.Softmin: Applies the Softmin function to an n-dimensional input Tensor, rescaling it so that the elements of the n-dimensional output Tensor lie in the range [0, 1] and sum to 1.
  • nn.Softmax: Applies the Softmax function to an n-dimensional input Tensor, rescaling it so that the elements of the n-dimensional output Tensor lie in the range [0, 1] and sum to 1.
  • nn.Softmax2d: Applies SoftMax over features to each spatial location.
  • nn.LogSoftmax: Applies the $\log(\text{Softmax}(x))$ function to an n-dimensional input Tensor.
  • nn.AdaptiveLogSoftmaxWithLoss: Efficient softmax approximation as described in Efficient softmax approximation for GPUs by Edouard Grave, Armand Joulin, Moustapha Cissé, David Grangier, and Hervé Jégou.
Normalization layers

  • nn.BatchNorm1d: Applies Batch Normalization over a 2D or 3D input, as described in the paper Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift.
  • nn.BatchNorm2d: Applies Batch Normalization over a 4D input (a mini-batch of 2D inputs with an additional channel dimension).
  • nn.BatchNorm3d: Applies Batch Normalization over a 5D input (a mini-batch of 3D inputs with an additional channel dimension).
  • nn.LazyBatchNorm1d: A torch.nn.BatchNorm1d module with lazy initialization of its num_features argument, inferred from input.size(1).
  • nn.LazyBatchNorm2d: A torch.nn.BatchNorm2d module with lazy initialization of its num_features argument, inferred from input.size(1).
  • nn.LazyBatchNorm3d: A torch.nn.BatchNorm3d module with lazy initialization of its num_features argument, inferred from input.size(1).
  • nn.GroupNorm: Applies Group Normalization over a mini-batch of inputs, as described in the paper Group Normalization.
  • nn.SyncBatchNorm: Applies Batch Normalization over an N-dimensional input (a mini-batch of [N-2]D inputs with an additional channel dimension).
  • nn.InstanceNorm1d: Applies Instance Normalization over a 2D (unbatched) or 3D (batched) input, as described in the paper Instance Normalization: The Missing Ingredient for Fast Stylization.
  • nn.InstanceNorm2d: Applies Instance Normalization over a 4D input (a mini-batch of 2D inputs with an additional channel dimension).
  • nn.InstanceNorm3d: Applies Instance Normalization over a 5D input (a mini-batch of 3D inputs with an additional channel dimension).
  • nn.LazyInstanceNorm1d: A torch.nn.InstanceNorm1d module with lazy initialization of its num_features argument, inferred from input.size(1).
  • nn.LazyInstanceNorm2d: A torch.nn.InstanceNorm2d module with lazy initialization of its num_features argument, inferred from input.size(1).
  • nn.LazyInstanceNorm3d: A torch.nn.InstanceNorm3d module with lazy initialization of its num_features argument, inferred from input.size(1).
  • nn.LayerNorm: Applies Layer Normalization over a mini-batch of inputs, as described in the paper Layer Normalization.
  • nn.LocalResponseNorm: Applies local response normalization over an input signal composed of several input planes, where channels occupy the second dimension.
Recurrent layers

  • nn.RNNBase
  • nn.RNN: Applies a multi-layer Elman RNN with tanh or ReLU non-linearity to an input sequence.
  • nn.LSTM: Applies a multi-layer long short-term memory (LSTM) RNN to an input sequence.
  • nn.GRU: Applies a multi-layer gated recurrent unit (GRU) RNN to an input sequence.
  • nn.RNNCell: An Elman RNN cell with tanh or ReLU non-linearity.
  • nn.LSTMCell: A long short-term memory (LSTM) cell.
  • nn.GRUCell: A gated recurrent unit (GRU) cell.

Transformer layers

  • nn.Transformer: A transformer model.
  • nn.TransformerEncoder: A stack of N encoder layers.
  • nn.TransformerDecoder: A stack of N decoder layers.
  • nn.TransformerEncoderLayer: Made up of self-attention and a feedforward network.
  • nn.TransformerDecoderLayer: Made up of self-attention, multi-head attention and a feedforward network.
Linear layers

  • nn.Identity: A placeholder identity operator that is argument-insensitive.
  • nn.Linear: Applies a linear transformation to the incoming data: $y = xA^T + b$.
  • nn.Bilinear: Applies a bilinear transformation to the incoming data: $y = x_1^T A x_2 + b$.
  • nn.LazyLinear: A torch.nn.Linear module where in_features is inferred.

Dropout layers

  • nn.Dropout: During training, randomly zeroes some of the elements of the input tensor with probability p, using samples from a Bernoulli distribution.
  • nn.Dropout1d: Randomly zeroes out entire channels (a channel is a 1D feature map, e.g. the $j$-th channel of the $i$-th sample in the batched input is the 1D tensor $\text{input}[i, j]$).
  • nn.Dropout2d: Randomly zeroes out entire channels (a channel is a 2D feature map, e.g. the $j$-th channel of the $i$-th sample in the batched input is the 2D tensor $\text{input}[i, j]$).
  • nn.Dropout3d: Randomly zeroes out entire channels (a channel is a 3D feature map, e.g. the $j$-th channel of the $i$-th sample in the batched input is the 3D tensor $\text{input}[i, j]$).
  • nn.AlphaDropout: Applies Alpha Dropout over the input.
  • nn.FeatureAlphaDropout: Randomly masks out entire channels (a channel is a feature map).

Sparse layers

  • nn.Embedding: A simple lookup table that stores embeddings of a fixed dictionary and size.
  • nn.EmbeddingBag: Computes sums or means of 'bags' of embeddings, without instantiating the intermediate embeddings.

Distance functions

  • nn.CosineSimilarity: Returns the cosine similarity between $x_1$ and $x_2$, computed along dim.
  • nn.PairwiseDistance: Computes the pairwise distance between vectors $v_1$ and $v_2$ using the p-norm.
Loss functions

  • nn.L1Loss: Creates a criterion that measures the mean absolute error (MAE) between each element in the input $x$ and target $y$.
  • nn.MSELoss: Creates a criterion that measures the mean squared error (squared L2 norm) between each element in the input $x$ and target $y$.
  • nn.CrossEntropyLoss: Computes the cross entropy loss between input and target.
  • nn.CTCLoss: The Connectionist Temporal Classification loss.
  • nn.NLLLoss: The negative log likelihood loss.
  • nn.PoissonNLLLoss: Negative log likelihood loss with a Poisson distribution of the target.
  • nn.GaussianNLLLoss: Gaussian negative log likelihood loss.
  • nn.KLDivLoss: The Kullback-Leibler divergence loss.
  • nn.BCELoss: Creates a criterion that measures the binary cross entropy between the target and the input probabilities.
  • nn.BCEWithLogitsLoss: Combines a Sigmoid layer and the BCELoss in one single class.
  • nn.MarginRankingLoss: Creates a criterion that measures the loss given inputs $x_1$, $x_2$ (two 1D mini-batch or 0D Tensors) and a label 1D mini-batch or 0D Tensor $y$ (containing 1 or -1).
  • nn.HingeEmbeddingLoss: Measures the loss given an input tensor $x$ and a labels tensor $y$ (containing 1 or -1).
  • nn.MultiLabelMarginLoss: Creates a criterion that optimizes a multi-class multi-classification hinge loss (margin-based loss) between input $x$ (a 2D mini-batch Tensor) and output $y$ (a 2D Tensor of target class indices).
  • nn.HuberLoss: Creates a criterion that uses a squared term if the absolute element-wise error falls below delta and a delta-scaled L1 term otherwise.
  • nn.SmoothL1Loss: Creates a criterion that uses a squared term if the absolute element-wise error falls below beta and an L1 term otherwise.
  • nn.SoftMarginLoss: Creates a criterion that optimizes a two-class classification logistic loss between input tensor $x$ and target tensor $y$ (containing 1 or -1).
  • nn.MultiLabelSoftMarginLoss: Creates a criterion that optimizes a multi-label one-versus-all loss based on max-entropy, between input $x$ and target $y$ of size $(N, C)$.
  • nn.CosineEmbeddingLoss: Creates a criterion that measures the loss given input tensors $x_1$, $x_2$ and a Tensor label $y$ with values 1 or -1.
  • nn.MultiMarginLoss: Creates a criterion that optimizes a multi-class classification hinge loss (margin-based loss) between input $x$ (a 2D mini-batch Tensor) and output $y$ (a 1D tensor of target class indices, $0 \leq y \leq \text{x.size}(1) - 1$).
  • nn.TripletMarginLoss: Creates a criterion that measures the triplet loss given input tensors $x_1$, $x_2$, $x_3$ and a margin with a value greater than 0.
  • nn.TripletMarginWithDistanceLoss: Creates a criterion that measures the triplet loss given input tensors $a$, $p$ and $n$ (anchor, positive and negative examples, respectively) and a nonnegative, real-valued "distance function" used to compute the relationship between the anchor and positive example ("positive distance") and the anchor and negative example ("negative distance").
Vision layers

  • nn.PixelShuffle: Rearranges elements in a tensor of shape $(*, C \times r^2, H, W)$ to a tensor of shape $(*, C, H \times r, W \times r)$, where r is an upscale factor.
  • nn.PixelUnshuffle: Reverses the PixelShuffle operation by rearranging elements in a tensor of shape $(*, C, H \times r, W \times r)$ to a tensor of shape $(*, C \times r^2, H, W)$, where r is a downscale factor.
  • nn.Upsample: Upsamples given multi-channel 1D (temporal), 2D (spatial) or 3D (volumetric) data.
  • nn.UpsamplingNearest2d: Applies 2D nearest-neighbor upsampling to an input signal composed of several input channels.
  • nn.UpsamplingBilinear2d: Applies 2D bilinear upsampling to an input signal composed of several input channels.

Channel shuffle

  • nn.ChannelShuffle: Divides the channels in a tensor of shape $(*, C, H, W)$ into g groups and rearranges them as $(*, C/g, g, H, W)$, while keeping the original tensor shape.

DataParallel

  • nn.DataParallel: Implements data parallelism at the module level.
  • nn.parallel.DistributedDataParallel: Implements distributed data parallelism based on the torch.distributed package, at the module level.

RNN utilities

  • nn.utils.rnn.PackedSequence: Holds the data and list of batch_sizes of a packed sequence.
  • nn.utils.rnn.pack_padded_sequence: Packs a Tensor containing padded sequences of variable length.
  • nn.utils.rnn.pad_packed_sequence: Pads a packed batch of variable-length sequences.
  • nn.utils.rnn.pad_sequence: Pads a list of variable-length Tensors with padding_value.
  • nn.utils.rnn.pack_sequence: Packs a list of variable-length Tensors.

Flatten layers

  • nn.Flatten: Flattens a contiguous range of dims into a tensor.
  • nn.Unflatten: Unflattens a tensor dim, expanding it to a desired shape.

There are also model pruning utilities (under torch.nn.utils.prune), quantization tools and more, which are not covered here.

nn.init

Every Module comes with built-in default initialization, but we can also apply more sophisticated initialization schemes.

  • torch.nn.init.calculate_gain(nonlinearity, param=None): returns the recommended gain value for the given nonlinearity function.

Each of the following fills the given tensor in place with the listed value or distribution:

  • uniform_(tensor, a=0.0, b=1.0): $\mathcal{U}(a, b)$
  • normal_(tensor, mean=0.0, std=1.0): $\mathcal{N}(\mathrm{mean}, \mathrm{std}^2)$
  • constant_(tensor, val): the constant $\mathrm{val}$
  • ones_(tensor): 1
  • zeros_(tensor): 0
  • eye_(tensor): the identity matrix $I$ (2-D tensors only)
  • dirac_(tensor, groups=1): fills the {3, 4, 5}-dimensional input Tensor with the Dirac delta function, preserving the identity of the inputs in convolutional layers (as many input channels as possible are preserved; with groups > 1, each group of channels preserves identity)
  • xavier_uniform_(tensor, gain=1.0): $\mathcal{U}(-a, a)$, where $a = \mathrm{gain} \times \sqrt{\frac{6}{\mathrm{fan\_in} + \mathrm{fan\_out}}}$
  • xavier_normal_(tensor, gain=1.0): $\mathcal{N}(0, \mathrm{std}^2)$, where $\mathrm{std} = \mathrm{gain} \times \sqrt{\frac{2}{\mathrm{fan\_in} + \mathrm{fan\_out}}}$
  • kaiming_uniform_(tensor, a=0, mode='fan_in', nonlinearity='leaky_relu'): $\mathcal{U}(-\mathrm{bound}, \mathrm{bound})$, where $\mathrm{bound} = \mathrm{gain} \times \sqrt{\frac{3}{\mathrm{fan\_mode}}}$
  • kaiming_normal_(tensor, a=0, mode='fan_in', nonlinearity='leaky_relu'): $\mathcal{N}(0, \mathrm{std}^2)$, where $\mathrm{std} = \frac{\mathrm{gain}}{\sqrt{\mathrm{fan\_mode}}}$
  • trunc_normal_(tensor, mean=0.0, std=1.0, a=-2.0, b=2.0): values drawn from $\mathcal{N}(\mathrm{mean}, \mathrm{std}^2)$, with values outside $[a, b]$ redrawn until they fall within the bounds
  • orthogonal_(tensor, gain=1): a (semi-)orthogonal matrix
  • sparse_(tensor, sparsity, std=0.01): a sparse matrix with the given sparsity, whose non-zero elements follow $\mathcal{N}(0, \mathrm{std}^2)$
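
These initializers modify the tensor in place (note the trailing underscore) and are typically called on a layer's weight directly, or inside an apply callback as shown earlier. A minimal sketch:

import torch.nn as nn

layer = nn.Linear(128, 64)
gain = nn.init.calculate_gain("relu")             # recommended gain for a ReLU nonlinearity
nn.init.xavier_uniform_(layer.weight, gain=gain)  # in-place initialization
nn.init.constant_(layer.bias, 0.0)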

Optimizer

An Optimizer specifies how gradient updates are applied. Most optimizers also carry internal state, so remember to save it together with the model when checkpointing.

Attributes

  • params: the torch.Tensors (or dicts defining parameter groups) that this optimizer updates; see the sketch below
  • defaults (dict): a dict containing default values of optimization options, used when a parameter group does not specify them
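
Parameter groups let different parts of a model use different options; anything a group does not specify falls back to defaults. A minimal sketch (the layers and learning rates are arbitrary):

import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(10, 10), nn.Linear(10, 2))
optimizer = torch.optim.SGD(
    [
        {"params": model[0].parameters()},              # uses the default lr below
        {"params": model[1].parameters(), "lr": 1e-3},  # overrides lr for this group
    ],
    lr=1e-2, momentum=0.9,
)
print(len(optimizer.param_groups))  # 2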

Methods

  • Optimizer.add_param_group: adds a param group to the optimizer's param_groups.
  • Optimizer.load_state_dict: loads the optimizer state.
  • Optimizer.state_dict: returns the state of the optimizer as a dict.
  • Optimizer.step: performs a single optimization step, i.e. one parameter update (see the sketch after this list).
  • Optimizer.zero_grad: clears the gradients of all the torch.Tensors being optimized.
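
Putting these together: the standard training step is zero_grad → backward → step, and a checkpoint should contain both the model's and the optimizer's state_dict. A minimal sketch (the model, data and file name are placeholders):

import torch
import torch.nn as nn

model = nn.Linear(4, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
x, y = torch.randn(8, 4), torch.randn(8, 1)

optimizer.zero_grad()                        # clear old gradients
loss = nn.functional.mse_loss(model(x), y)
loss.backward()                              # populate .grad on the parameters
optimizer.step()                             # apply the update rule

torch.save({"model": model.state_dict(),
            "optimizer": optimizer.state_dict()}, "ckpt.pt")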

Implementations

  • Adadelta: Implements the Adadelta algorithm.
  • Adagrad: Implements the Adagrad algorithm.
  • Adam: Implements the Adam algorithm.
  • AdamW: Implements the AdamW algorithm.
  • SparseAdam: Implements a lazy version of the Adam algorithm suitable for sparse tensors.
  • Adamax: Implements the Adamax algorithm (a variant of Adam based on the infinity norm).
  • ASGD: Implements Averaged Stochastic Gradient Descent.
  • LBFGS: Implements the L-BFGS algorithm, heavily inspired by minFunc.
  • NAdam: Implements the NAdam algorithm.
  • RAdam: Implements the RAdam algorithm.
  • RMSprop: Implements the RMSprop algorithm.
  • Rprop: Implements the resilient backpropagation algorithm.
  • SGD: Implements stochastic gradient descent (optionally with momentum).

A few of the key optimizers are explained below [1].

Gradient Descent
# Vanilla Gradient Descent
while True:
  weights_grad = evaluate_gradient(loss_fun, data, weights)
  weights += - step_size * weights_grad # perform parameter update

This simple loop is at the core of all Neural Network libraries. There are other ways of performing the optimization (e.g. LBFGS), but Gradient Descent is currently by far the most common and established way of optimizing Neural Network loss functions.

In practice, when the full dataset cannot be loaded at once to compute the gradient, Minibatch Gradient Descent is used instead:

# Vanilla Minibatch Gradient Descent

while True:
  data_batch = sample_training_data(data, 256) # sample 256 examples
  weights_grad = evaluate_gradient(loss_fun, data_batch, weights)
  weights += - step_size * weights_grad # perform parameter update

SGD is usually understood as gradient descent performed on randomly sampled mini-batches.

From here on, let x be the current parameters and dx the gradient of x obtained on the current mini-batch.

SGD

SGD has a single hyperparameter, learning_rate, a fixed constant:

# Vanilla update
x += - learning_rate * dx

Momentum

Momentum is based on the physical notion of momentum: the accumulated gradient dx is tracked in a velocity v, and x is updated through v.

v is internal state of the update rule, initialized to 0. It introduces one new hyperparameter, mu:

# Momentum update
v = mu * v - learning_rate * dx # integrate velocity
x += v # integrate position

One way to read it: "if this direction has been good so far, going further in it is probably even better." A benefit is that momentum can carry the update out of shallow local optima.

Nesterov Momentum

Nesterov momentum has stronger theoretical backing than plain momentum and tends to work somewhat better in practice. The core idea: evaluate the gradient dx_ahead at the look-ahead point x + mu*v, use dx_ahead to update v, and then update x with v.

x_ahead = x + mu * v
# evaluate dx_ahead (the gradient at x_ahead instead of at x)
v = mu * v - learning_rate * dx_ahead
x += v

An equivalent implementation drops x_ahead and introduces v_prev instead:

v_prev = v # back this up
v = mu * v - learning_rate * dx # velocity update stays the same
x += -mu * v_prev + (1 + mu) * v # position update changes form
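
In PyTorch all three variants above are available through torch.optim.SGD; note that PyTorch writes the momentum update slightly differently from the pseudocode here (the learning rate is applied after the velocity accumulation), but the behaviour is analogous. A minimal sketch (model and hyperparameters are arbitrary):

import torch
import torch.nn as nn

model = nn.Linear(10, 2)

sgd      = torch.optim.SGD(model.parameters(), lr=0.1)                    # vanilla SGD
momentum = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)      # momentum
nesterov = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9,
                           nesterov=True)                                 # Nesterov momentum
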
Adagrad

The three methods above use a single global learning rate, whereas Adagrad adapts the effective learning rate per parameter.

Its only hyperparameter is learning_rate:

# Assume the gradient dx and parameter vector x
# cache has the same shape as x and starts at 0; eps (e.g. 1e-8) avoids division by zero
cache += dx**2
x += - learning_rate * dx / (np.sqrt(cache) + eps)

The cache variable accumulates the squared gradients element-wise and normalizes the update. The big problem: learning_rate stays constant while cache never decreases, so the effective step size can only shrink; training may stall too early and the model can end up underfitting.

RMSprop

Rprop uses only the sign of the gradient, and RMSprop is "a mini-batch version of rprop":

cache = decay_rate * cache + (1 - decay_rate) * dx**2
x += - learning_rate * dx / (np.sqrt(cache) + eps)

cache is now updated with an exponential moving average, so it is no longer monotonically non-decreasing and the effective step size no longer shrinks towards zero.

Adam

# m and v start at 0; typical values: beta1 = 0.9, beta2 = 0.999, eps = 1e-8
m = beta1*m + (1-beta1)*dx
v = beta2*v + (1-beta2)*(dx**2)
x += - learning_rate * m / (np.sqrt(v) + eps)

Looking only at the last line, the update has the same form as Adagrad and RMSprop; the difference lies in m and v. Adam can therefore be viewed as RMSprop with momentum.

The full Adam algorithm also includes a bias-correction (warm-up) phase:

# t is your iteration counter going from 1 to infinity
m = beta1*m + (1-beta1)*dx
mt = m / (1-beta1**t)
v = beta2*v + (1-beta2)*(dx**2)
vt = v / (1-beta2**t)
x += - learning_rate * mt / (np.sqrt(vt) + eps)

https://arxiv.org/pdf/1412.6980.pdf
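
In torch.optim the bias correction is built in; beta1, beta2 and eps map onto the betas and eps arguments. A minimal sketch (the model and hyperparameters are arbitrary):

import torch
import torch.nn as nn

model = nn.Linear(10, 2)
adam  = torch.optim.Adam(model.parameters(), lr=1e-3, betas=(0.9, 0.999), eps=1e-8)
# AdamW decouples weight decay from the gradient-based update
adamw = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)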

Learning-rate scheduling

torch.optim.lr_scheduler provides several methods to adjust the learning rate based on the number of epochs.

# Schematic example from the PyTorch docs: dataset and loss_fn are placeholders,
# and model is just a list holding a single Parameter.
import torch
from torch.nn import Parameter
from torch.optim import SGD
from torch.optim.lr_scheduler import ExponentialLR, MultiStepLR  # MultiStepLR is used in the next snippet

model = [Parameter(torch.randn(2, 2, requires_grad=True))]
optimizer = SGD(model, 0.1)
scheduler = ExponentialLR(optimizer, gamma=0.9)  # the optimizer is passed into the lr_scheduler

for epoch in range(20):
    for input, target in dataset:
        optimizer.zero_grad()
        output = model(input)
        loss = loss_fn(output, target)
        loss.backward()
        optimizer.step()
    scheduler.step()

Several schedulers can also be used at the same time:

model = [Parameter(torch.randn(2, 2, requires_grad=True))]
optimizer = SGD(model, 0.1)
scheduler1 = ExponentialLR(optimizer, gamma=0.9)
scheduler2 = MultiStepLR(optimizer, milestones=[30,80], gamma=0.1)

for epoch in range(20):
    for input, target in dataset:
        optimizer.zero_grad()
        output = model(input)
        loss = loss_fn(output, target)
        loss.backward()
        optimizer.step()
    scheduler1.step()
    scheduler2.step()

Remember: always call scheduler.step() after optimizer.step().

  • lr_scheduler.LambdaLR: Sets the learning rate of each parameter group to the initial lr times a given function.
  • lr_scheduler.MultiplicativeLR: Multiplies the learning rate of each parameter group by the factor given in the specified function.
  • lr_scheduler.StepLR: Decays the learning rate of each parameter group by gamma every step_size epochs.
  • lr_scheduler.MultiStepLR: Decays the learning rate of each parameter group by gamma once the number of epochs reaches one of the milestones.
  • lr_scheduler.ConstantLR: Decays the learning rate of each parameter group by a small constant factor until the number of epochs reaches a pre-defined milestone: total_iters.
  • lr_scheduler.LinearLR: Decays the learning rate of each parameter group by linearly changing a small multiplicative factor until the number of epochs reaches a pre-defined milestone: total_iters.
  • lr_scheduler.ExponentialLR: Decays the learning rate of each parameter group by gamma every epoch.
  • lr_scheduler.CosineAnnealingLR: Sets the learning rate of each parameter group using a cosine annealing schedule, where $\eta_{max}$ is set to the initial lr and $T_{cur}$ is the number of epochs since the last restart in SGDR.
  • lr_scheduler.ChainedScheduler: Chains a list of learning rate schedulers.
  • lr_scheduler.SequentialLR: Receives a list of schedulers that are expected to be called sequentially during optimization, together with milestone points that give the exact intervals at which each scheduler is supposed to be called.
  • lr_scheduler.ReduceLROnPlateau: Reduces the learning rate when a metric has stopped improving (see the sketch after this list).
  • lr_scheduler.CyclicLR: Sets the learning rate of each parameter group according to the cyclical learning rate policy (CLR).
  • lr_scheduler.OneCycleLR: Sets the learning rate of each parameter group according to the 1cycle learning rate policy.
  • lr_scheduler.CosineAnnealingWarmRestarts: Sets the learning rate of each parameter group using a cosine annealing schedule, where $\eta_{max}$ is set to the initial lr, $T_{cur}$ is the number of epochs since the last restart and $T_i$ is the number of epochs between two warm restarts in SGDR.
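
Most schedulers only need scheduler.step(); ReduceLROnPlateau is the exception, since it watches a metric that has to be passed in explicitly. A minimal sketch with a made-up validation loss (in real training, optimizer.step() would run inside the epoch loop before this call):

import torch
import torch.nn as nn

model = nn.Linear(10, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode="min",
                                                       factor=0.1, patience=10)

for epoch in range(100):
    val_loss = torch.rand(1).item()   # placeholder for a real validation loss
    scheduler.step(val_loss)          # lr is reduced when val_loss stops improving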

Stochastic Weight Averaging

See the paper: https://arxiv.org/abs/1803.05407

loader, optimizer, model, loss_fn = ...
swa_model = torch.optim.swa_utils.AveragedModel(model)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=300)
swa_start = 160
swa_scheduler = torch.optim.swa_utils.SWALR(optimizer, swa_lr=0.05)

for epoch in range(300):
    for input, target in loader:
        optimizer.zero_grad()
        loss_fn(model(input), target).backward()
        optimizer.step()
    if epoch > swa_start:
        # After swa_start, average the weights and use the constant SWA learning rate
        swa_model.update_parameters(model)
        swa_scheduler.step()
    else:
        scheduler.step()

# Update bn statistics for the swa_model at the end
torch.optim.swa_utils.update_bn(loader, swa_model)
# Use swa_model to make predictions on test data
preds = swa_model(test_input)


  1. https://cs231n.github.io/neural-networks-3/
