Common Knowledge and Handy Operations in PyTorch Code


乐清sss


Article link: https://blog.csdn.net/sunyueqinghit/article/details/102963020


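This post collects coding tips from the deep learning world, including handling **kwargs arguments, saving and loading model parameters, tqdm progress bars, handling OOV words, orthogonal initialization, tensor operations in detail, the LeakyReLU activation function, logging configuration, and distributed training strategies, with the aim of improving code quality and efficiency.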


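I have been reading other people's deep learning code lately and can't help admiring how well crafted it is. Looking back at my own code, it is honestly rough by comparison. I ran into many operations I had never seen before and therefore never used, so I am keeping notes here. As the saying goes, use others as a mirror to straighten your own clothes.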

1. **kwargs

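**kwargs collects keyword arguments; inside the function it is simply a dict. An example:

# Call site: every argument is passed by keyword (the Train class is shown below)
t = Train(train_iter=train_iter, dev_iter=dev_iter, test_iter=test_iter, model=model, config=config)

# The Train class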
class Train(object):
    """
        Train
    """
    def __init__(self, **kwargs):
        """
        :param kwargs:
        Args of data:
            train_iter : train batch data iterator
            dev_iter : dev batch data iterator
            test_iter : test batch data iterator
        Args of train:
            model : nn model
            config : config
        """
        print("Training Start......")
        # for k, v in kwargs.items():
        #     self.__setattr__(k, v)
        self.train_iter = kwargs["train_iter"]
        self.dev_iter = kwargs["dev_iter"]
        self.test_iter = kwargs["test_iter"]
        self.parser = kwargs["model"]
        self.config = kwargs["config"]

As you can see, the arguments are read out of kwargs by key, just like with any dict.
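The commented-out loop in the class above hints at a shorter variant. Below is a minimal sketch of it, assuming a hypothetical `Trainer` class and made-up argument values; it also shows that a dict can be unpacked back into keyword arguments with `**`.

```python
class Trainer:
    """Hypothetical class that stores every keyword argument as an attribute."""

    def __init__(self, **kwargs):
        # kwargs is a plain dict of the keyword arguments passed by the caller
        self.received = dict(kwargs)
        for name, value in kwargs.items():
            setattr(self, name, value)


trainer = Trainer(model="my_model", config={"lr": 0.1})
print(type(trainer.received).__name__, sorted(trainer.received))
print(trainer.model, trainer.config["lr"])

# A dict can also be unpacked into keyword arguments with **
options = {"model": "other_model", "config": {"lr": 0.01}}
other = Trainer(**options)
print(other.model)
```

The trade-off is that the explicit `kwargs["train_iter"]` style fails early with a KeyError when an argument is missing, while the loop accepts anything silently.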

2. torch.save

First build a dict that holds what you want to save, for example:

embed_dict = {"pretrain_embed": pretrain_embed}
torch.save(obj=embed_dict, f=os.path.join(config.pkl_directory, config.pkl_embed))

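To resume training from a certain stage (or to run testing), load the previously saved parameters. The key used when reading must match the key used when saving:

checkpoint = torch.load(os.path.join(config.pkl_directory, config.pkl_embed))
pretrain_embed = checkpoint["pretrain_embed"]
# If the saved object is a model.state_dict(), restore it with model.load_state_dict(...)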
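For model weights, a common pattern is to save a dict that contains `model.state_dict()` and restore it with `load_state_dict`. The following is a minimal sketch under assumptions: the tiny `nn.Linear` model, the file name `checkpoint.pt`, and the keys `model_state` and `epoch` are all made up for illustration.

```python
import os
import tempfile

import torch
import torch.nn as nn

model = nn.Linear(4, 2)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "checkpoint.pt")  # hypothetical file name
    # Save a dict; the keys are our own choice and must be reused when loading
    torch.save({"model_state": model.state_dict(), "epoch": 3}, path)

    checkpoint = torch.load(path, map_location="cpu")

# Restore the weights into a freshly constructed model of the same shape
restored = nn.Linear(4, 2)
restored.load_state_dict(checkpoint["model_state"])
print(checkpoint["epoch"])
print(torch.equal(model.weight, restored.weight))
```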

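3. The tqdm module

tqdm is a fast, extensible progress bar for Python. It adds progress information to long-running loops; you only need to wrap any iterable as tqdm(iterator).
For example, when reading a word-vector file: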

    def _read_file(path):
        """
        :param path: embed file path
        :return:
        """
        embed_dict = {}
        with open(path, encoding='utf-8') as f:
            lines = f.readlines()
            lines = tqdm.tqdm(lines)
            for line in lines:
                values = line.strip().split(' ')
                if len(values) == 1 or len(values) == 2 or len(values) == 3:
                    continue
                w, v = values[0], values[1:]
                embed_dict[w] = v
        return embed_dict

The resulting progress bar looks great.

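4. Handling OOV words

See the answers to the Zhihu question "Word Embedding 如何处理未登录词？" (how should word embeddings handle out-of-vocabulary words?).
The approach here gives OOV words the average of the word vectors that were found. The code below is nicely written: it counts exact matches, fuzzy matches, and OOV words.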

    def _avg_embed(self, embed_dict, words_dict):
        """
        :param embed_dict:
        :param words_dict:
        """
        print("loading pre_train embedding by avg for out of vocabulary.")
        embeddings = np.zeros((int(self.words_count), int(self.dim)))
        inword_list = {}
        for word in words_dict:
            if word in embed_dict:
                embeddings[words_dict[word]] = np.array([float(i) for i in embed_dict[word]], dtype='float32')
                inword_list[words_dict[word]] = 1
                # exact match
                self.exact_count += 1
            elif word.lower() in embed_dict:
                embeddings[words_dict[word]] = np.array([float(i) for i in embed_dict[word.lower()]], dtype='float32')
                inword_list[words_dict[word]] = 1
                # fuzzy match (the lower-cased form was found)
                self.fuzzy_count += 1
            else:
                # out-of-vocabulary word
                self.oov_count += 1
        # average the word vectors that were found
        sum_col = np.sum(embeddings, axis=0) / len(inword_list)  # avg
        sum_col /= np.std(sum_col)
        for i in range(len(words_dict)):
            if i not in inword_list and i != self.padID:
                embeddings[i] = sum_col
        final_embed = torch.from_numpy(embeddings).float()
        return final_embed

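Note: torch.from_numpy builds a tensor from a NumPy array. The returned tensor shares memory with the ndarray, so any in-place change to the tensor is reflected in the ndarray, and vice versa.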

>>> a = numpy.array([1, 2, 3])
>>> t = torch.from_numpy(a)
>>> t
tensor([ 1,  2,  3])
>>> t[0] = -1
>>> a
array([-1,  2,  3])

5. zip

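zip() is a Python built-in. It takes any number of iterables and packs the elements at the same position into tuples. In Python 3 it returns an iterator over those tuples (Python 2 returned a list), so wrap it in list() to see the result. If the inputs differ in length, the result is as long as the shortest one. With the * operator, a zipped list can be unzipped again.
It can also be used to transpose a 2-D matrix.
Example:

>>> a = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
>>> list(zip(*a))
[(1, 4, 7), (2, 5, 8), (3, 6, 9)]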
>>> list(map(list,zip(*a)))
[[1, 4, 7], [2, 5, 8], [3, 6, 9]]
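The truncation and unzip behaviour described above can be checked with a short sketch; the lists `names`, `scores`, and `matrix` are made-up sample data.

```python
names = ["a", "b", "c"]
scores = [90, 80]

# zip stops at the shorter input, so "c" is dropped
pairs = list(zip(names, scores))
print(pairs)

# "unzip" with the * operator
unzipped_names, unzipped_scores = zip(*pairs)
print(unzipped_names, unzipped_scores)

# transpose a 2 x 3 matrix into a 3 x 2 list of lists
matrix = [[1, 2, 3], [4, 5, 6]]
transposed = [list(row) for row in zip(*matrix)]
print(transposed)
```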

6. setattr

setattr(object, name, value) adds an attribute called name with the given value to object. It is used mostly inside classes:

for name, param in zip(param_names, layer_params):
    setattr(self, name, param)

A neat trick.
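A self-contained version of the same idea, as a sketch: the `Layer` class and the parameter names are hypothetical, and plain lists stand in for real parameter tensors.

```python
class Layer:
    """Hypothetical container whose attributes are created from a list of names."""

    def __init__(self, param_names, param_values):
        for name, value in zip(param_names, param_values):
            setattr(self, name, value)  # equivalent to self.<name> = value


layer = Layer(["weight_ih", "weight_hh"], [[0.1, 0.2], [0.3, 0.4]])
print(layer.weight_ih, layer.weight_hh)
print(getattr(layer, "weight_hh"))  # getattr is the matching read operation
```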

7. np.eye

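Function prototype: numpy.eye(N, M=None, k=0, dtype=<class 'float'>, order='C')

It returns a 2-D array of shape (N, M) with ones on the diagonal and zeros elsewhere.

Parameters:

(1) N: int, the number of rows in the output.

(2) M: int, optional, the number of columns in the output; defaults to N.

(3) k: int, optional, the index of the diagonal. The default 0 refers to the main diagonal, a negative value to a lower diagonal, and a positive value to an upper diagonal.

(4) dtype: optional, the data type of the returned array.

(5) order: {'C', 'F'}, optional, whether the output is stored in memory in row-major (C-style, 'C') or column-major (Fortran-style, 'F') order.
The similar np.identity(n, dtype=None) can only build square matrices.
For example: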

import numpy as np
a=np.eye(3)
print(a)
Output:
[[1. 0. 0.]
 [0. 1. 0.]
 [0. 0. 1.]]
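The k and M parameters are easier to see in action. The sketch below also shows one common use that is not in the original notes: indexing an identity matrix with integer labels to get one-hot rows. The label values are made up.

```python
import numpy as np

upper = np.eye(3, k=1)            # ones on the first diagonal above the main one
rect = np.eye(2, 3, dtype=int)    # non-square: 2 rows, 3 columns

# Turn integer class labels into one-hot rows by indexing the identity matrix
labels = np.array([2, 0, 1])
one_hot = np.eye(3, dtype=int)[labels]

print(upper)
print(rect)
print(one_hot)
```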

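8. Orthogonal initialization

Mainly used to mitigate vanishing and exploding gradients in deep networks; it is a parameter initialization method frequently used for RNNs. It makes the weight matrix W orthogonal after initialization.

torch.nn.init.orthogonal_(tensor, gain=1)

For more initialization methods, see: PyTorch 学习笔记（四）：权值初始化的十种方法 (PyTorch study notes 4: ten weight initialization methods)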
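A quick way to convince yourself that the result is orthogonal is to multiply the matrix by its transpose. The sketch below does that and then applies the initializer to the recurrent weights of an LSTM; the layer sizes are arbitrary, and selecting parameters whose name contains "weight_hh" is just one possible convention.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

weight = torch.empty(4, 4)
nn.init.orthogonal_(weight, gain=1)  # in-place, as the trailing underscore indicates

# For an orthogonal matrix, W @ W^T is (numerically) the identity
identity_error = (weight @ weight.t() - torch.eye(4)).abs().max().item()
print(identity_error)

# Apply it to the recurrent (hidden-to-hidden) weights of an LSTM
lstm = nn.LSTM(input_size=8, hidden_size=4)
for name, param in lstm.named_parameters():
    if "weight_hh" in name:
        nn.init.orthogonal_(param)
```

For the LSTM the hidden-to-hidden matrix is not square (its four gate blocks are stacked), so only its columns end up orthonormal.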

9. np.concatenate

Joins arrays along an existing axis.
numpy.concatenate((a1, a2, ...), axis=0, out=None)
Official example:

>>> a = np.array([[1, 2], [3, 4]])
>>> b = np.array([[5, 6]])
>>> np.concatenate((a, b), axis=0)
array([[1, 2],
       [3, 4],
       [5, 6]])
>>> np.concatenate((a, b.T), axis=1)
array([[1, 2, 5],
       [3, 4, 6]])

10. torch.bernoulli

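From the PyTorch Chinese documentation

torch.bernoulli(input, out=None) → Tensor

Draws binary random numbers (0 or 1) from a Bernoulli distribution.
The input tensor must contain the probabilities used for drawing, so all values must lie in [0, 1], i.e. 0 <= input_i <= 1.
The i-th element of the output is 1 with the probability given by the i-th element of the input.
The returned tensor has the same size as the input, and every value is 0 or 1.
Parameters:
input (Tensor) – the probabilities of the Bernoulli distribution
out (Tensor, optional) – the output tensor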

>>> a = torch.Tensor(3, 3).uniform_(0, 1) # generate a uniform random matrix with range [0, 1]
>>> a

 0.7544  0.8140  0.9842
 0.5282  0.0595  0.6445
 0.1925  0.9553  0.9732
[torch.FloatTensor of size 3x3]

>>> torch.bernoulli(a)

 1  1  1
 0  0  1
 0  1  1
[torch.FloatTensor of size 3x3]

>>> a = torch.ones(3, 3) # probability of drawing "1" is 1
>>> torch.bernoulli(a)

 1  1  1
 1  1  1
 1  1  1
[torch.FloatTensor of size 3x3]

>>> a = torch.zeros(3, 3) # probability of drawing "1" is 0
>>> torch.bernoulli(a)

 0  0  0
 0  0  0
 0  0  0
[torch.FloatTensor of size 3x3]

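11. Masks in LSTM

I never understood what the mask in an LSTM does until I read a blog post that explains it well:

When processing text with models such as LSTM, the texts have variable length, so they must first be brought to a common length. A common way is
X_data = sequence.pad_sequences(X_data, maxlen=10, value=0, padding='post')
This pads X_data to a uniform length of 10.
For example, [1,2,3,4,5] becomes [1,2,3,4,5,0,0,0,0,0].
X_data can then be fed to the Embedding layer and the rest of the model.
However, before the LSTM processes the data, the padding has to be undone, i.e. the trailing zeros must be dropped.
This is where the Mask layer comes in. After Mask(0), all zeros in X_data are ignored. Ignoring the padded zeros makes sense, but what if a zero occurs inside the sentence? In text processing, the text is usually mapped to indices, which above all saves space: large corpora of several hundred GB shrink to a few GB after index mapping. (That is a digression, covered together with the Keras Embedding layer.) In that mapping, index 0 is typically used for low-frequency words that cannot be converted to word vectors. They have no vector, and dropping them does not affect the processing of the text, so the mask can ignore them together with the padded zeros.
What does "ignore" mean here? It means "not processed".
Many people assume the mask removes the zeros outright. It does not.
You can verify this with an experiment: put an LSTM layer after the Mask layer and output the LSTM value at every time step. With the Mask layer in place, for the [1,2,3,4,5,0,0,0,0,0] data above, the first 5 positions are computed, while the values at the padded positions equal the value at position 5. In other words, the LSTM performs no computation at the padded positions.
----------------
Copyright notice: the passage above is from an original article by CSDN blogger 「by雷影」, licensed under CC 4.0 BY-SA. Reposts must include the original source link and this notice.
Original link: https://blog.csdn.net/u010976347/article/details/80618931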

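12. Tensor operations: expand, squeeze, unsqueeze, transpose, permute, stack

1. expand
Returns the tensor expanded to a larger size along dimensions of size 1. Expanding does not allocate new memory; it only creates a new view of the existing tensor.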
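The original notes leave the expand example empty. A minimal sketch of what such an example could look like, with a made-up 3 x 1 tensor:

```python
import torch

x = torch.tensor([[1.0], [2.0], [3.0]])   # shape (3, 1)
y = x.expand(3, 4)                        # the size-1 dimension is expanded to 4
z = x.expand(-1, 2)                       # -1 keeps that dimension unchanged

print(y.shape, z.shape)
print(y)

# Both tensors start at the same memory address: no data was copied
shares_memory = y.data_ptr() == x.data_ptr()
print(shares_memory)
```

Because the expanded tensor is only a view, writing into it in place is best avoided; call .clone() (or use repeat) when an independent copy is needed.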

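2. squeeze
tensor.squeeze(n) removes dimension n if its size is 1. In the first call below, b.squeeze(2).size(), b has size torch.Size([1, 3, 2]); dimension 2 has size 2 ≠ 1, so nothing is squeezed and the size stays the same. With b.squeeze(0).size(), dimension 0 has size 1, so it is squeezed and the result is a 3x2 tensor.

>>> a = torch.tensor([1., 2., 3., 4., 5., 6.])  # assumed definition; a is not shown in the original
>>> b=a.view(-1, 3, 2)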
>>> b
tensor([[[1., 2.],
         [3., 4.],
         [5., 6.]]])
>>> b.size()
torch.Size([1, 3, 2])
>>> b.squeeze(2).size()
torch.Size([1, 3, 2])
>>> b.squeeze(0).size()
torch.Size([3, 2])

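3. unsqueeze
Conversely, tensor.unsqueeze(n) inserts a dimension of size 1 at position n. Below, a new dimension is inserted at index 2 of the original b, so the size becomes 1 * 3 * 1 * 2.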

>>> b.unsqueeze(2).size()
torch.Size([1, 3, 1, 2])
>>> b.unsqueeze(2)
tensor([[[[1., 2.]],

         [[3., 4.]],

         [[5., 6.]]]])

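4. transpose
transpose swaps exactly two dimensions per call (it is not limited to 2-D matrices, as the 4-D examples below show). It can be called in two ways: as the function torch.transpose(tensor, dim0, dim1) or as the method tensor.transpose(dim0, dim1).
Chaining several transpose calls achieves the same effect as permute. In the snippets below, t is presumably an alias for torch (import torch as t).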

torch.transpose(Tensor, 1, 0)
t.rand(2,3,4,5).transpose(3,0).transpose(2,1).transpose(3,2).shape
Out[672]: torch.Size([5, 4, 2, 3])
t.rand(2,3,4,5).transpose(1,0).transpose(2,1).transpose(3,1).shape
Out[670]: torch.Size([3, 5, 2, 4])

5. permute
permute reorders all dimensions of a tensor of any rank in a single call.

t.rand(2,3,4,5).permute(3,2,0,1).shape
Out[669]: torch.Size([5, 4, 2, 3])

6. stack
stack adds a new dimension.
For example, stacking two 1*2 tensors along dimension 0 gives a 2*1*2 tensor; stacking them along dimension 1 gives a 1*2*2 tensor.

>>> a=torch.rand((1,2))
>>> b=torch.rand((1,2))

>>> c=torch.stack((a,b),0)
>>> c.size()
torch.Size([2, 1, 2])

>>> d=torch.stack((a,b),1)
>>> d.size()
torch.Size([1, 2, 2])

13. torch.nn.Linear

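From the PyTorch Chinese documentation

class torch.nn.Linear(in_features, out_features, bias=True)
Applies a linear transformation to the input: y=Ax+b
Parameters:
in_features - size of each input sample
out_features - size of each output sample
bias - if set to False, the layer does not learn a bias. Default: True
Shape:
Input: (N, in_features)
Output: (N, out_features)
Variables:
weight - the learnable weights of the module, of shape (out_features x in_features)
bias - the learnable bias of the module, of shape (out_features)
Example:

>>> m = nn.Linear(20, 30)
>>> input = torch.randn(128, 20)  # autograd.Variable is no longer needed since PyTorch 0.4
>>> output = m(input)
>>> print(output.size())
torch.Size([128, 30])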

14. torch.nn.LeakyReLU

torch.nn.LeakyReLU(negative_slope=0.01, inplace=False)
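This was the first time I came across this form of ReLU. negative_slope is the slope applied to negative inputs: the output is x for x >= 0 and negative_slope * x for x < 0.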
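In place of a plot, a few sample values show the behaviour. This sketch uses negative_slope=0.1 (not the default 0.01) only to make the numbers easier to read.

```python
import torch
import torch.nn as nn

act = nn.LeakyReLU(negative_slope=0.1)
x = torch.tensor([-2.0, -0.5, 0.0, 3.0])
y = act(x)

# Negative inputs are scaled by 0.1; non-negative inputs pass through unchanged
print(y)
```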

15. collections.namedtuple

collections.namedtuple is a factory function that dynamically creates a subclass of tuple. Unlike a plain tuple, the returned subclass lets you access elements by name.

RawResult = collections.namedtuple("RawResult",
                                   ["unique_id", "start_logits", "end_logits", "switch"])

Usage:

all_results.append(RawResult(unique_id=unique_id,
                             start_logits=start_logits,
                             end_logits=end_logits,
                             switch=switch))
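The snippet above depends on variables from its surrounding project. A standalone sketch with a hypothetical `Point` type shows the main features: access by name or index, tuple unpacking, and immutability.

```python
import collections

Point = collections.namedtuple("Point", ["x", "y"])  # hypothetical record type

p = Point(x=1, y=2)
print(p.x, p[1])              # access by name or by index
print(isinstance(p, tuple))   # it is still a tuple

x, y = p                      # unpacks like a regular tuple
moved = p._replace(x=10)      # immutable: _replace returns a new instance
print(moved, p._asdict())
```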

16. logging.basicConfig

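When training a model you often need a log file that records the training process. logging.basicConfig does the basic configuration.
Parameters of logging.basicConfig:

filename: name of the log file
filemode: mode used to open the log file, with the same meaning as in open(): 'w' or 'a' (default 'a')
format: the format and content of each record. Many useful fields are available:
 %(levelno)s: numeric log level
 %(levelname)s: name of the log level
 %(pathname)s: full path of the source file that issued the logging call
 %(filename)s: file name part of that path
 %(funcName)s: name of the function that issued the logging call
 %(lineno)d: line number of the logging call
 %(asctime)s: time of the record
 %(thread)d: thread ID
 %(threadName)s: thread name
 %(process)d: process ID
 %(message)s: the log message
datefmt: time format, same as for time.strftime()
level: log level of the root logger; the default is logging.WARNING
stream: stream used for output, e.g. sys.stderr, sys.stdout, or a file object; defaults to sys.stderr. In Python 3, stream and filename cannot be given together (a ValueError is raised); Python 2 silently ignored stream in that case.
handlers: if given, an iterable of already-created handlers to add to the root logger. Any handler without a formatter is assigned the default formatter created by this function. This argument is incompatible with filename and stream: if both are present, a ValueError is raised.

For example: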

logging.basicConfig(format='%(asctime)s - %(levelname)s - %(name)s - %(message)s',
                    datefmt='%m/%d/%Y %H:%M:%S',
                    level=logging.INFO,
                    handlers=[logging.FileHandler(os.path.join(args.output_dir, "log.txt")),
                              logging.StreamHandler()])
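The call above needs args.output_dir from its project. Here is a runnable sketch of the same configuration, assuming a temporary directory as a stand-in for the output directory and a logger named "train"; force=True (Python 3.8+) is added so the call takes effect even if logging was configured earlier.

```python
import logging
import os
import tempfile

log_dir = tempfile.mkdtemp()                 # stand-in for args.output_dir
log_path = os.path.join(log_dir, "log.txt")

logging.basicConfig(
    format="%(asctime)s - %(levelname)s - %(name)s - %(message)s",
    datefmt="%m/%d/%Y %H:%M:%S",
    level=logging.INFO,
    handlers=[logging.FileHandler(log_path), logging.StreamHandler()],
    force=True,  # Python 3.8+: replace handlers that were configured earlier
)

logger = logging.getLogger("train")
logger.info("epoch %d finished, loss=%.3f", 1, 0.25)
logger.debug("not recorded, DEBUG is below the INFO threshold")

logging.shutdown()  # flush and close the handlers before reading the file
with open(log_path, encoding="utf-8") as f:
    log_text = f.read()
print(log_text)
```

Each record goes to both the file and the console because two handlers are attached to the root logger.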

17. torch.cuda

Reference: https://github.com/apachecn/pytorch-doc-zh/blob/master/docs/1.0/cuda.md
Commonly used operations:

Check whether a GPU is available and how many there are: torch.cuda.is_available(), torch.cuda.device_count()
Index of the currently selected GPU: torch.cuda.current_device()
CUDA compute capability and name of a given GPU:
torch.cuda.get_device_capability(device), torch.cuda.get_device_name(device)
Release cached GPU memory that is no longer in use: torch.cuda.empty_cache()
Set the random seed for the GPU(s): torch.cuda.manual_seed(seed), torch.cuda.manual_seed_all(seed)

18. collections.OrderedDict

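Before Python 3.7, the built-in dict did not guarantee any order because it is stored by hash (since 3.7 it preserves insertion order). The collections module provides the dict subclass OrderedDict, which remembers the order of its entries. Two OrderedDict objects with the same entries in a different order are treated by Python as unequal.

import collections

dd = {'banana': 3, 'apple':4, 'pear': 1, 'orange': 2}
# sort by key
kd = collections.OrderedDict(sorted(dd.items(), key=lambda t: t[0]))
print (kd)
# sort by value
vd = collections.OrderedDict(sorted(dd.items(),key=lambda t:t[1]))
print (vd)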

Output:

OrderedDict([('apple', 4), ('banana', 3), ('orange', 2), ('pear', 1)])
OrderedDict([('pear', 1), ('orange', 2), ('banana', 3), ('apple', 4)])
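The order-sensitive comparison is the main behavioural difference from a plain dict. A short sketch with made-up keys:

```python
from collections import OrderedDict

a = OrderedDict([("x", 1), ("y", 2)])
b = OrderedDict([("y", 2), ("x", 1)])

ordered_equal = a == b             # OrderedDicts also compare the order
plain_equal = dict(a) == dict(b)   # plain dicts ignore order when comparing
print(ordered_equal, plain_equal)

a.move_to_end("x")                 # reorder in place: "x" becomes the last key
print(list(a))
```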

19. state_dict

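Like .cpu(), .cuda(), and .add_module(), state_dict() is a method of the model.
It returns a dict that maps each parameter name to its tensor. The learnable parameters (weights and biases) of a torch.nn.Module are contained in the model's parameters (accessible through model.parameters()).
Usage:

state_dict = torch.load(args.init_checkpoint, map_location='cpu')
if args.do_train and args.init_checkpoint.endswith('pytorch_model.bin'):
    model.bert.load_state_dict(state_dict)

Example: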

import torch
import torch.nn as nn
import torch.nn.functional as F
# Define model
class TheModelClass(nn.Module):
    def __init__(self):
        super(TheModelClass,self).__init__()
        self.conv1=nn.Conv2d(3,6,5)
        self.pool=nn.MaxPool2d(2,2)
        self.conv2=nn.Conv2d(6,16,5)
        self.fc1=nn.Linear(16*5*5,120)
        self.fc2=nn.Linear(120,84)
        self.fc3=nn.Linear(84,10)
    def forward(self,x):
        x=self.pool(F.relu(self.conv1(x)))
        x=self.pool(F.relu(self.conv2(x)))
        x=x.view(-1,16*5*5)
        x=F.relu(self.fc1(x))
        x=F.relu(self.fc2(x))
        x=self.fc3(x)
        return x
# Initialize model
model=TheModelClass()
# Initialize optimizer
optimizer=torch.optim.SGD(model.parameters(),lr=1e-4,momentum=0.9)

print("Model's state_dict:")
# Print model's state_dict
for param_tensor in model.state_dict():
    print(param_tensor,"\t",model.state_dict()[param_tensor].size())
print("optimizer's state_dict:")
# Print optimizer's state_dict
for var_name in optimizer.state_dict():
    print(var_name,"\t",optimizer.state_dict()[var_name])

Output:

Model's state_dict:
conv1.weight 	 torch.Size([6, 3, 5, 5])
conv1.bias 	 torch.Size([6])
conv2.weight 	 torch.Size([16, 6, 5, 5])
conv2.bias 	 torch.Size([16])
fc1.weight 	 torch.Size([120, 400])
fc1.bias 	 torch.Size([120])
fc2.weight 	 torch.Size([84, 120])
fc2.bias 	 torch.Size([84])
fc3.weight 	 torch.Size([10, 84])
fc3.bias 	 torch.Size([10])
optimizer's state_dict:
state 	 {}
param_groups 	 [{'lr': 0.0001, 'momentum': 0.9, 'dampening': 0, 'weight_decay': 0, 'nesterov': False, 'params': [2248503253976, 2248503254056, 2248503254136, 2248503254216, 2248503254296, 2248503254376, 2248503254456, 2248503254536, 2248503254616, 2248503254696]}]

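20. Distributed training

Multi-GPU training with DistributedDataParallel is somewhat faster than with DataParallel.
References:
https://www.jianshu.com/p/221d9298808e
https://pytorch.org/tutorials/beginner/blitz/data_parallel_tutorial.html
Pytorch中多GPU训练指北 (a guide to multi-GPU training in PyTorch)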

21. weight decay

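Define which parameters receive weight decay. 'gamma' and 'beta' belong to the LayerNorm layers and should not be decayed; they are simply trained as usual. All other parameters except the biases are trained with weight decay. Since named_parameters() yields full dotted names, the check has to be a substring match rather than list membership.

no_decay = ['bias', 'gamma', 'beta']
optimizer_parameters = [
            {'params': [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay)], 'weight_decay_rate': 0.01},
            {'params': [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)], 'weight_decay_rate': 0.0}
            ]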
optimizer = BERTAdam(optimizer_parameters,
                     lr=args.learning_rate,
                     warmup=args.warmup_proportion,
                     t_total=num_train_steps)

For customizing torch.optim optimizers in PyTorch, see:
https://www.cnblogs.com/ranjiewen/p/9240512.html
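The same idea can be sketched with a built-in optimizer. Note the assumptions: this uses torch.optim.AdamW instead of BERTAdam, the per-group key is therefore weight_decay rather than weight_decay_rate, and the toy model is made up. Because current PyTorch names LayerNorm parameters weight and bias instead of gamma and beta, the sketch separates the groups by tensor dimensionality rather than by name.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(8, 8), nn.LayerNorm(8), nn.Linear(8, 2))

# 1-D tensors (biases and LayerNorm scale/shift) get no decay;
# 2-D weight matrices do.
decay = [p for p in model.parameters() if p.ndim >= 2]
no_decay = [p for p in model.parameters() if p.ndim < 2]

optimizer = torch.optim.AdamW(
    [
        {"params": decay, "weight_decay": 0.01},
        {"params": no_decay, "weight_decay": 0.0},
    ],
    lr=1e-3,
)
print(len(decay), len(no_decay))
```

Splitting by dimensionality is only one heuristic; matching on parameter names, as in the snippet above, gives finer control when a model needs it.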

22. detach

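In short, detach creates a new tensor that is separated from the current computation graph. The new tensor shares its data with the original but carries no gradient. If b = a.detach(), modifying b in place also changes the data of a, which shows that they share the same memory.

What detach does is declare a new variable that points to the storage of the original one but has requires_grad set to False. Going a little deeper, the computation graph is cut at the detached variable, which becomes a leaf node. Even if its requires_grad is later set back to True, no gradient flows through it to the earlier part of the graph.
On the other hand, after backward is called, the gradients of non-leaf nodes are freed as soon as they have been used. This is why GPU memory usage is high before backward and drops sharply right after it. Part of that drop also comes from intermediate results held in buffers, which are released once the call finishes.
Author: nowherespyfly
Link: https://www.jianshu.com/p/f1bd4ff84926
Source: Jianshu (简书)
Copyright belongs to the author. For commercial reposts contact the author for permission; for non-commercial reposts cite the source.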
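A small sketch of the behaviour described above, with made-up values: the detached tensor shares memory with its source, has requires_grad set to False, and contributes nothing to the gradient.

```python
import torch

a = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
out = a * 2            # part of the computation graph
b = out.detach()       # same data, but cut off from the graph

print(b.requires_grad)                   # False
print(b.data_ptr() == out.data_ptr())    # True: shared memory

b[0] = 100.0           # the in-place change is visible through `out` as well
print(out[0].item())

# Only the non-detached path contributes to the gradient of `a`
loss = (a * 2).sum() + b.sum()
loss.backward()
print(a.grad)
```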


23. optimizer.step()

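optimizer.step() is normally called once per mini-batch; the model is only updated when optimizer.step() is called. Most optimizers work this way: each time the gradients have been computed with something like backward(), call this method once to update the parameters.

for input, target in dataset:
    optimizer.zero_grad()
    output = model(input)
    loss = loss_fn(output, target)
    loss.backward()
    optimizer.step()

The gradients are then cleared before the next backward pass:
optimizer.zero_grad()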
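To see the zero_grad / backward / step cycle actually reduce a loss, here is a minimal runnable sketch. The linear model, the toy data for y = x1 + x2, the learning rate, and the number of steps are all made up for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

model = nn.Linear(2, 1)
loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.05)

# Toy data for y = x1 + x2 (made up for this sketch)
inputs = torch.tensor([[1.0, 2.0], [2.0, 1.0], [0.0, 1.0], [3.0, 3.0]])
targets = inputs.sum(dim=1, keepdim=True)

losses = []
for step in range(50):
    optimizer.zero_grad()            # clear gradients left from the previous step
    loss = loss_fn(model(inputs), targets)
    loss.backward()                  # compute new gradients
    optimizer.step()                 # update the parameters using those gradients
    losses.append(loss.item())

print(losses[0], losses[-1])
```

Without zero_grad(), gradients from successive backward() calls would accumulate in each parameter's .grad, which is occasionally wanted (gradient accumulation) but usually a bug.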

Setting the random seeds:
torch.manual_seed(seed_num)
random.seed(seed_num)