PyTorch
PyTorch basics, network construction, model optimization, training, and testing
城俊BLOG
From now on, writing code properly.
mmcv NCCL error: mmcv/_ext.cpython-37m-x86_64-linux-gnu.so: undefined symbol, RuntimeError: NCCL error i…
[Code] mmcv error: mmcv/_ext.cpython-37m-x86_64-linux-gnu.so: undefined symbol.
Original · 2023-01-12 02:05:22 · 1457 views · 0 comments
PyTorch 1.7.0 / torchvision 0.8.1: DDP training hangs with torch.cuda.amp GradScaler
Error: the PyTorch DDP model hangs. The code that hangs is in the YOLOv5 training script train.py, at this line:
scaler.step(optimizer)  # optimizer.step
The program hangs once it reaches the second epoch, specifically at this call in /home/xxx/lib/python3.7/site-packages/torch/cuda/amp/grad_scaler.py:
if not sum(v.item() for v…
Original · 2022-05-08 22:10:18 · 1037 views · 0 comments
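PyTorch: assigning tensor values via broadcasting, and the scatter_ function
>>> import torch
>>> a = torch.tensor([[1,2,3],[4,5,6]])
# same shape as a, but filled with 0
>>> b = torch.full_like(a,0)
>>> c = torch.tensor([[0,0,1],[1,0,1]])
# indices used for the assignment
>>> c[:,0]
tensor([0, 1])
# assignment via broadcasting
>>> b[range(n)…
Original · 2022-03-15 09:42:09 · 1992 views · 0 comments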
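Speeding up face recognition model training
On the Turing architecture, set fp16=True (this may affect accuracy).
Use partial FC: set config.sample_rate < 1, for example 0.5 or 0.1.
MXNet: module.init_optimizer(kvstore='device')
Increase batch_size until training throughput (samples/second) stops increasing and GPU utilization is around 90%...
Original · 2021-06-01 16:31:35 · 239 views · 0 comments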
PyTorch RuntimeError: one of the variables needed for gradient computation has been modified by an i…
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [25088, 512]], which is output 0 of TBackward, is at version 4; expected version 3 instead. Hint: enable anomaly detection…
Original · 2021-05-24 14:03:25 · 711 views · 1 comment
PyTorch BN layer error: ValueError: expected 4D input (got 2D input)
Error:
Traceback (most recent call last):
  File "train_noPfc.py", line 201, in <module>
    main(args_)
  File "train_noPfc.py", line 160, in main
    f_masked, focc_masked, output, output_occ, f_diff, out = backbone(img1, img2)
  File "/home/user1/min…
Translated · 2021-05-20 14:47:37 · 8937 views · 0 comments
PyTorch newer versions (e.g. 1.7.0): RuntimeError: Legacy autograd function with non-static forward method is deprecate…
Just a single forward pass, and it still raised an error. Code:
from torchvision import models
model = models.vgg19(pretrained=True)
output = model(input.cuda())
Error: RuntimeError: Legacy autograd function with non-static forward method is deprecated.
Full error:
Traceback (most recent call last):
  File…
Translated · 2021-05-18 17:13:41 · 752 views · 0 comments
PyTorch ResNet fully connected (Linear) layer error: RuntimeError: mat1 dim 1 must match mat2 dim 0
Traceback (most recent call last):
  File "/home/user1/pjs/frvt_pytorch/batch_run/2branch_alter_1update_2pfc_MMD_ori_auto/recognition/arcface_torch/tools/visualize.py", line 276, in <module>
    mask = grad_cam(input, target_index)
  File "/home/user…
Translated · 2021-05-18 10:14:57 · 6796 views · 0 comments
PyTorch softmax dim=-1
For a 2-D tensor, dim = -1 and dim = 1 both compute softmax over each row.
Test: https://www.cnblogs.com/jeshy/p/10933882.html
Translated · 2021-05-14 17:03:46 · 2466 views · 0 comments
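A minimal sketch of the softmax claim, assuming a 2-D input (the tensor values are made up for illustration). For a 2-D tensor the last dimension is dimension 1, so both calls normalize along each row; for tensors with more dimensions the two would differ.

```python
import torch

# Hypothetical 2-D input: 2 rows, 3 columns.
x = torch.tensor([[1.0, 2.0, 3.0],
                  [1.0, 1.0, 1.0]])

by_last = torch.softmax(x, dim=-1)  # softmax along the last dimension
by_one = torch.softmax(x, dim=1)    # dimension 1 is the last one for a 2-D tensor

row_sums = by_last.sum(dim=1)           # every row should sum to 1
same = torch.allclose(by_last, by_one)  # both calls give the same result here
```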
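PyTorch: training with retain_graph=True causes a GPU memory leak and OOM (out of memory)
Backpropagating several losses during training exhausted GPU memory (even with the smallest batch_size). After removing retain_graph=True from the backward call, the problem no longer occurred. The cause in my case: the individual losses were not averaged after being computed, so a loss could come back as a tensor; use .mean() to reduce it to a scalar. Solution:
criterion = torch.nn.CrossEntropyLoss()
output = module_a(fc1Features,label)
arcLoss = criteri…
Translated · 2021-05-13 16:53:00 · 2403 views · 2 comments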
PyTorch distributed: RuntimeError: module must have its parameters and buffers on device cuda:0 (device_ids[0]…
Error: RuntimeError: module must have its parameters and buffers on device cuda:0 (device_ids[0]) but found one of them on device: cuda:1
Solution: switching from DP to DDP fixed it.
Code that raised the error:
# DP mode; module_a is a classifier
module_a = torch.nn.DataParallel(module_a)
Code after the change:
local_rank = 0
# DDP mode
mod…
Original · 2021-05-13 11:26:46 · 1485 views · 0 comments
PyTorch RuntimeError: one of the variables needed for gradient computation has been modified by an i…
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [25088, 512]], which is output 0 of TBackward, is at version 4; expected version 3 instead. Hint: enable anomaly detection…
Original · 2021-05-07 14:38:03 · 449 views · 1 comment
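PyTorch distributed: all_gather, all_reduce
all_gather: collects the tensor from every process into a list, so the list ends up with one tensor per rank.
chunk: splits a tensor into several pieces.
# Gather features.data from every rank into the list. The list elements are the chunks of total_features, so chunk i receives the features from rank i (the chunks are identical copies only when every rank holds the same features).
dist.all_gather(list(total_features.chunk(self.world_size, dim=0)), features.data)
htt…
Translated · 2021-04-29 11:14:42 · 5642 views · 0 comments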
PyTorch: fixing random seeds for stable, reproducible training
Fix Python random numbers:
rng = numpy.random.RandomState(23355)
rng.uniform(0,1,(2,3))
rng = numpy.random.RandomState(23355)
rng.uniform(0,1,(2,3))
Original · 2021-04-28 17:14:20 · 3757 views · 0 comments
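A small sketch of why re-creating the generator with the same seed reproduces the same draws. The helper name seeded_draws is hypothetical; it seeds both the standard-library random module and a NumPy RandomState with the seed 23355 used in the excerpt.

```python
import random

import numpy as np


def seeded_draws(seed):
    """Hypothetical helper: seed both generators, then draw from each."""
    random.seed(seed)                  # Python's built-in generator
    rng = np.random.RandomState(seed)  # NumPy generator, as in the excerpt
    return random.random(), rng.uniform(0, 1, (2, 3))


py_a, np_a = seeded_draws(23355)
py_b, np_b = seeded_draws(23355)  # same seed, so the draws repeat
```

In a PyTorch training script, torch.manual_seed(seed) would be the counterpart for PyTorch's own generator; it is left out here to keep the sketch dependency-light.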
PyTorch: printing layer gradients and saving the results to Excel
def save_excel(netName, dataDict,colNames=None,):
    # pf = pd.DataFrame(list(dataDict))
    # order = list[dataDict.keys()]
    # pf = pf[order]
    # pf.rename(columns=order, inplace=True)
    # file_path = pd.ExcelWriter('compdata.xlsx')
    # pf.fill…
Original · 2021-04-28 09:41:40 · 1108 views · 0 comments
PyTorch RuntimeError: Expected to have finished reduction in the prior iteration before starting a n…
Error:
Traceback (most recent call last):
  File "train.py", line 166, in <module>
    main(args_)
  File "train.py", line 118, in main
    f_clean_masked, f_occ_masked, fc, fc_occ = backbone(img1, img2)
  File "/home/user1/miniconda3/envs/py377/lib/pyt…
Original · 2021-04-23 23:48:16 · 6634 views · 0 comments
PyTorch RuntimeError: running_mean should contain 1048576 elements not 25088
sss
Original · 2021-04-23 23:08:01 · 873 views · 0 comments
PyTorch RuntimeError: Expected 4-dimensional input for 4-dimensional weight [512, 512, 3, 3], but go…
Error: RuntimeError: Expected 4-dimensional input for 4-dimensional weight [512, 512, 3, 3], but got 2-dimensional input of size [4, 512] instead
Original · 2021-04-23 22:56:37 · 5824 views · 5 comments
PyTorch: automatic gradients for a custom loss function
Use torch.autograd.grad:
class MMD(nn.Module):
    def __init__(self):
        super(MMD, self).__init__()
        self.mmd = torch.nn.MSELoss()
    def forward(self,fc1Features1,fc1Features2):
        n = len(fc1Features1)
        fc1_1 = 1/n * torch.sum(f…
Original · 2021-04-19 23:45:00 · 1309 views · 0 comments
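The excerpt's MMD module is cut off, so the following is only a sketch under assumptions: the class MeanFeatureLoss is a hypothetical stand-in that compares the batch means of two feature sets with MSELoss, and torch.autograd.grad then returns the gradient of that loss with respect to the first input. It is not the author's full implementation.

```python
import torch
from torch import nn


class MeanFeatureLoss(nn.Module):
    """Hypothetical loss: MSE between the batch-mean features of two inputs."""

    def __init__(self):
        super().__init__()
        self.mse = nn.MSELoss()

    def forward(self, feats1, feats2):
        mean1 = torch.sum(feats1, dim=0) / len(feats1)  # mean over the batch
        mean2 = torch.sum(feats2, dim=0) / len(feats2)
        return self.mse(mean1, mean2)


torch.manual_seed(0)
f1 = torch.randn(4, 8, requires_grad=True)  # gradients are requested for f1
f2 = torch.randn(4, 8)

loss = MeanFeatureLoss()(f1, f2)
# autograd.grad returns a tuple with one gradient per requested input.
(grad_f1,) = torch.autograd.grad(loss, f1)
```

Unlike loss.backward(), torch.autograd.grad hands the gradient back directly instead of accumulating it into f1.grad.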
PyTorch / MXNet ValueError: too many dimensions 'NDArray'
Error:
Traceback (most recent call last):
  File "topFAR_COX_py2_fc1_pytorch_PY3.py", line 199, in <module>
    IDimage_features_dict = getfeatures_dict(model, IDimage_list, IDimage_path, featurelen)
  File "topFAR_COX_py2_fc1_pytorch_PY3.py", line 106,…
Original · 2021-04-09 11:36:38 · 584 views · 0 comments
PyTorch distributed: dist.init_process_group error, NCCL cannot find a GPU
Full error:
Traceback (most recent call last):
  File "topFAR_COX_py2_fc1_pytorch.py", line 181, in <module>
    model = constructmodel(args)
  File "topFAR_COX_py2_fc1_pytorch.py", line 36, in constructmodel
    dist.init_process_group(backend='nccl', ini…
Original · 2021-04-01 23:00:22 · 7504 views · 2 comments
PyTorch RuntimeError: The NVIDIA driver on your system is too old (found version 10010).
Full error:
Traceback (most recent call last):
  File "topFAR_COX_py2_fc1_pytorch.py", line 181, in <module>
    model = constructmodel(args)
  File "topFAR_COX_py2_fc1_pytorch.py", line 37, in constructmodel
    torch.cuda.set_device(rank)
  File "/home/u…
Original · 2021-04-01 22:52:40 · 5822 views · 2 comments
PyTorch distributed: RuntimeError: Tensors must be CUDA and dense
Your model or its parameters were not placed on the GPU. Solution:
backbone = eval("backbones.{}".format(args.network))(False, dropout=dropout, fp16=args.fp16).to(rank)
The trailing .to(rank) takes care of this. In my code rank=0.
Original · 2021-04-01 14:16:28 · 8335 views · 0 comments
PyTorch: a model trained and saved with 1.7 cannot be loaded in the older 1.4: frame #63: <unknown function> + 0x1db3e0 (0x55ba98ddd3e0 in /data/user…
A model trained and saved with the newer PyTorch 1.7 cannot be loaded in the older 1.4. Error:
torch.load('/home/user1/model_best_b.pth.tar')
Traceback (most recent call last):
  File "/data/user1/pkgs/conda/envs/drc/lib/python3.7/site-packages/IPython/core/interactiveshell.py", line 3417, in run_code
    exec(co…
Original · 2020-12-10 16:27:19 · 11109 views · 3 comments
PyTorch: loading samples in a fixed order
Subclass torch.utils.data.Dataset
Set num_workers to 0
shuffle=False
Use SequentialSampler as the sampler
class CelebA(data.Dataset):
    xxx
from torch.utils.data.sampler import SequentialSampler
datasetC = CelebA(cDir, cImgDir[i], testAnn, transforms.Compose([transfor…
Original · 2020-12-02 15:49:10 · 3831 views · 1 comment
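A minimal sketch of the fixed-order settings listed above, using a hypothetical ToyDataset in place of the CelebA dataset from the excerpt. With shuffle=False, a SequentialSampler, and num_workers=0, the samples come back in index order.

```python
import torch
from torch.utils.data import DataLoader, Dataset
from torch.utils.data.sampler import SequentialSampler


class ToyDataset(Dataset):
    """Hypothetical dataset whose sample i is simply the number i."""

    def __init__(self, n):
        self.items = list(range(n))

    def __len__(self):
        return len(self.items)

    def __getitem__(self, idx):
        return torch.tensor(self.items[idx])


dataset = ToyDataset(10)
loader = DataLoader(
    dataset,
    batch_size=4,
    shuffle=False,                       # no reshuffling between epochs
    sampler=SequentialSampler(dataset),  # visit indices 0, 1, 2, ...
    num_workers=0,                       # load in the main process
)

# Flatten the batches back into a list of sample values.
seen = [int(v) for batch in loader for v in batch]
```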
PyTorch AttributeError: '_IncompatibleKeys' object has no attribute 'eval'
Buddy, load_state_dict does not return the model after loading the parameters, so do not capture its return value. Just call load_state_dict and that is enough. Incorrect code:
# Here a list captures the return values of load_state_dict, but those are not the loaded models; they are some _IncompatibleKeys thing
models = [m.load_state_dict(ckpts[i]['state_dict']) for i,m in enumerate(models)]
Correct…
Original · 2020-12-02 11:35:31 · 3708 views · 0 comments
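Since the corrected code in the excerpt is cut off, here is a hedged sketch of one way to avoid the error. The models and checkpoints are hypothetical stand-ins (small nn.Linear layers); the point is that load_state_dict updates each model in place and returns only a report of missing and unexpected keys.

```python
import torch
from torch import nn

# Hypothetical stand-ins for the models and checkpoints in the excerpt.
models = [nn.Linear(4, 2) for _ in range(2)]
ckpts = [{"state_dict": nn.Linear(4, 2).state_dict()} for _ in range(2)]

reports = []
for i, m in enumerate(models):
    # Loads the weights in place; the return value is a key report, not a model.
    report = m.load_state_dict(ckpts[i]["state_dict"])
    reports.append(report)

# Keep using the original model objects afterwards.
for m in models:
    m.eval()

loaded = torch.equal(models[0].weight, ckpts[0]["state_dict"]["weight"])
```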
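PyTorch primer: installation, training, testing, visualization, network structure, finetune, loss
Contents: Visualization; Viewing the model structure; Training; finetune; Loading weights for only some layers; Networks; A simple multi-task network
Visualization
Viewing the model structure
Training
Once the network layers are defined they can be used directly (they have a default initialization). Explicit initialization is not required. https://blog.csdn.net/luo3300612/article/details/97675312
finetune
Loading weights for only some layers
# only load the weights in arc face original model weights file, ignore…
Original · 2020-07-09 14:03:49 · 953 views · 0 comments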
Python / PyTorch: installing accimage
https://github.com/pytorch/accimage
Translated · 2020-10-15 14:05:25 · 6654 views · 4 comments
PyTorch: taking data from a DataLoader (taking data from an enumerate or iterator object in Python)
Errors:
TypeError: 'enumerate' object is not subscriptable
TypeError: 'DataLoader' object is not an iterator
Code:
_, (inputs2, labels2) = next(enumerate(test2))
_, (inputs3, labels3) = next(enumerate(test3))
iter should work the same way:
next(iter(test3))
enumerate can also take an index...
Original · 2020-08-31 16:08:21 · 9683 views · 5 comments
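A pure-Python sketch of the same pattern. ToyLoader is a hypothetical stand-in for a DataLoader: it is iterable (it defines __iter__) but is not itself an iterator, so next() on it fails, while next(iter(...)) and next(enumerate(...)) work.

```python
class ToyLoader:
    """Hypothetical stand-in for a DataLoader: iterable, but not an iterator."""

    def __init__(self, batches):
        self.batches = batches

    def __iter__(self):
        return iter(self.batches)


test2 = ToyLoader([([1, 2], [0, 1]), ([3, 4], [1, 0])])

# Calling next() directly on the loader raises TypeError.
try:
    next(test2)
except TypeError as err:
    not_iterator_msg = str(err)

# Wrapping it in enumerate() or iter() gives a real iterator.
_, (inputs2, labels2) = next(enumerate(test2))
inputs_again, labels_again = next(iter(test2))

# enumerate accepts a start index for the counter.
idx, (inputs_from_5, _) = next(enumerate(test2, start=5))
```

Each call to iter() or enumerate() here starts a fresh pass, so every next() above returns the first batch.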
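Reading images in Python: how PIL, matplotlib (plt.imshow), cv2.imread, and skimage.imread differ when opening and displaying images (shape, channels)
Difference:
PIL: (width, height)
cv2: (height, width)
Original image: 1920x1080
Image.open(imgPath): (1920, 1080)
cv2.imread(imgPath,0): (1080, 1920)
Original · 2020-08-17 22:48:55 · 4859 views · 0 comments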
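A small sketch of the two conventions, assuming Pillow and NumPy are installed. It creates a blank 1920x1080 grayscale image in memory rather than reading a file, and converts it to a NumPy array, which uses the same (height, width) layout that the excerpt reports for cv2.imread.

```python
import numpy as np
from PIL import Image

# Blank grayscale image; PIL takes the size as (width, height).
img = Image.new("L", (1920, 1080))

pil_size = img.size                  # PIL reports (width, height)
array_shape = np.asarray(img).shape  # NumPy arrays are (height, width)
```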
PyTorch RuntimeError: expected backend CUDA and dtype Float but got backend CPU and dtype Float
Code:
criterion = nn.BCEWithLogitsLoss(reduction='none')
loss = criterion(output, target)
loss.mul_(weights)
Error:
Traceback (most recent call last):
  File "/home/user1/main_cs_0708.py", line 391, in <module>
    main()
  File "/home/user1/main_cs_0708.py", line 301, in mai…
Original · 2020-08-14 17:46:08 · 2364 views · 0 comments
PyTorch tensor operations
1. Adding a dimension
>>> a = torch.tensor([222])
>>> a
tensor([222])
>>> a.shape
torch.Size([1])
# turn it into one column
>>> c = torch.unsqueeze(a,1) # torch.unsqueeze(a,0) turns it into one row
>>> c.shape
torch.Size([1, 1])
Translated · 2020-08-12 16:37:22 · 944 views · 0 comments
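With a single-element tensor both unsqueeze calls produce shape [1, 1], so the row/column difference is hard to see. The sketch below uses a hypothetical three-element tensor to make it visible.

```python
import torch

a = torch.tensor([1, 2, 3])   # shape [3]

col = torch.unsqueeze(a, 1)   # new axis at position 1: one column, shape [3, 1]
row = torch.unsqueeze(a, 0)   # new axis at position 0: one row, shape [1, 3]

# The single-element case from the excerpt: shape [1] becomes [1, 1].
single = torch.unsqueeze(torch.tensor([222]), 1)
```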
PyTorch network structure visualization: graphviz.backend.ExecutableNotFound: failed to execute ['dot', '-Tpdf', '-O', '…
Error:
Traceback (most recent call last):
  File "/data/user1/pkgs/conda/envs/drc/lib/python3.7/site-packages/graphviz/backend.py", line 129, in render
    subprocess.check_call(args, stderr=stderr, **POPEN_KWARGS)
  File "/data/user1/pkgs/conda/envs/drc/lib/…
Original · 2020-08-12 15:14:51 · 1465 views · 0 comments
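PyTorch: test results differ on every run
This one drove me crazy; it took ages to work out the cause. I went through every prediction, and the results really were different each time. Possible causes:
The custom Metric was written incorrectly: it depends on batch_size, computing each batch separately rather than accumulating across batches.
The network under test contains a Dropout layer, such as nn.Dropout.
The test data loader applies random processing, such as transforms.RandomCrop().
Other possible causes that I have not run into yet.
Thanks: https://blog.csdn.net/t20134297/artic…
Original · 2020-08-04 16:58:43 · 5918 views · 0 comments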
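A sketch of the Dropout cause, using a hypothetical small network. In training mode nn.Dropout zeroes random activations on every forward pass, so repeated runs on the same input generally differ; calling model.eval() disables dropout and makes the outputs repeatable. The excerpt does not say which fix the author used, so treat model.eval() as the standard remedy rather than the author's.

```python
import torch
from torch import nn

torch.manual_seed(0)
# Hypothetical network with a Dropout layer in the middle.
model = nn.Sequential(nn.Linear(8, 8), nn.Dropout(p=0.5), nn.Linear(8, 2))
x = torch.randn(4, 8)

model.train()             # dropout active: a random mask on each pass
train_out_1 = model(x)
train_out_2 = model(x)

model.eval()              # dropout disabled for testing
with torch.no_grad():
    eval_out_1 = model(x)
    eval_out_2 = model(x)

eval_is_stable = torch.equal(eval_out_1, eval_out_2)
```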
PyTorch: using a specified GPU, RuntimeError: CUDA error: invalid device ordinal
Error when using a specified GPU in PyTorch:
Traceback (most recent call last):
  File "test_bed/process_deepglint.py", line 102, in <module>
    pred_dataset(outputFile)
  File "test_bed/process_deepglint.py", line 36, in pred_dataset
    pred_loader_deepg, model, criterion,…
Original · 2020-07-31 20:15:50 · 33901 views · 9 comments
DL model visualization
model = Alexnet2fc()
x = torch.rand(8, 3, 112, 112)
y = model(x)
# need TensorFlow 2.2 or higher
# method 1
# from keras.utils import plot_model
# plot_model(model, to_file='model.png')
# method 2
# from IPython.di...
Original · 2020-07-22 16:07:27 · 391 views · 0 comments
RuntimeError: CUDA error: an illegal instruction was encountered
PyTorch training was running fine, then crashed:
Traceback (most recent call last):
  File "main_multi_model_test.py", line 147, in <module>
    main()
  File "main_multi_model_test.py", line 119, in main
    train_loss, train_acc, train_bacc = train(model, optimizer, train_load…
Original · 2020-07-19 11:28:47 · 5821 views · 6 comments
RuntimeError: Attempting to deserialize object on a CUDA device but torch.cuda.is_available() is Fal…
Error while running PyTorch model test code:
Traceback (most recent call last):
  File "../models/arc_face.py", line 44, in Arcface
    learner.load_state(conf, 'ir_se50.pth', model_only=True, from_save_folder=False)
  File "../arcface/Learner.py", line 86, in load_state
    pretr…
Original · 2020-07-19 11:10:11 · 6490 views · 2 comments
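The excerpt is cut off before any fix, so this is a sketch of the usual remedy rather than the author's: pass map_location to torch.load so that stored tensors are remapped to the CPU. The example saves a hypothetical state dict to an in-memory buffer instead of using a real checkpoint file such as ir_se50.pth.

```python
import io

import torch

# Hypothetical checkpoint written to an in-memory buffer.
buffer = io.BytesIO()
torch.save({"weight": torch.ones(2, 3)}, buffer)
buffer.seek(0)

# map_location remaps every stored tensor onto the CPU while loading.
state = torch.load(buffer, map_location=torch.device("cpu"))
device_type = state["weight"].device.type
```

For a checkpoint that was saved from a GPU, the same map_location argument is what lets it load on a machine where torch.cuda.is_available() is False.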
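PyTorch training freezes Ubuntu: memory leak
Incident: while using PyTorch for multi-task learning, the machine froze somewhere between epoch 30 and 60. Even though it was Ubuntu, it still froze.
Cause: after some aimless analysis, the likely cause is a memory leak.
Solution: when writing data to log files for TensorBoard visualization, make sure to close the SummaryWriter at the end.
writer = SummaryWriter(os.path.join(ckptDir, 'logs'))
for epoch in range(num_epochs):
    ...
Translated · 2020-07-15 10:32:17 · 1783 views · 0 comments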
PyTorch AttributeError: 'tuple' object has no attribute 'dim'
Error when training after building the model:
Traceback (most recent call last):
  File "/home/user1/alexnet_test.py", line 117, in <module>
    main()
  File "/home/user1/alexnet_test.py", line 89, in main
    train_loss, train_acc, train_bacc = train(model, optimizer, train_loader,…
Original · 2020-07-13 11:54:55 · 11363 views · 7 comments