Mistakes I Have Made

solicucu

Category: Pytorch

Article link: https://blog.csdn.net/weixin_42973678/article/details/104413210

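7. one of the variables needed for gradient computation has been modified by an inplace operation
Autograd raises this when a tensor it saved for the backward pass is modified in place before backward() runs. At the time, my guess was that a tensor needing gradients had been lost somewhere along the forward pass.
For example, this is the attention module I was writing when I hit it: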

def forward(self, x):

	spatial_res = self.spatial_atten(x)
	channel_res = self.channel_atten(x)

	res = spatial_res * channel_res
	res = torch.sigmoid(self.conv1x1(res))
	# note that res above is a probability (attention weights in (0, 1))
	# I left out the next step when first writing forward, so it returned
	# only the probability res instead of x weighted by it
	res = x * res

	return res
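
For reference, the error message itself can be reproduced in a few lines. The following is a minimal sketch, assuming only that PyTorch is installed; the tensor names are made up for illustration and are not taken from the attention module above. It relies on sigmoid keeping its output for the backward pass, so editing that output in place invalidates it.

```python
import torch

x = torch.ones(3, requires_grad=True)

# sigmoid saves its output because the backward pass needs it
y = torch.sigmoid(x)
y.add_(1)  # in-place edit of a tensor that autograd saved

try:
    y.sum().backward()
    error_message = ""
except RuntimeError as err:
    error_message = str(err)

# Out-of-place version: `+ 1` allocates a new tensor, so backward succeeds
x2 = torch.ones(3, requires_grad=True)
y2 = torch.sigmoid(x2) + 1
y2.sum().backward()
```

If the message shows up in a real model, in-place calls (methods ending in an underscore, `+=` on tensors, index assignment) are the first thing to look for.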

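6. TypeError: Object of type 'ndarray' is not JSON serializable
NumPy types such as numpy.int64 (and ndarray itself) cannot be serialized by json. Convert them to built-in Python types first, for example int for an integer scalar.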

ind = np.argmax(w[e], axis=-1)  # ind is a numpy.int64, not a Python int
ops.append((fop_names[ind], e))  # e is already a Python int
# numpy.int64 is not JSON serializable, so convert it to int first
edge_op.append((e, int(ind)))
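
The fragment above depends on variables from my own code, so here is a self-contained sketch of the same problem and two ways around it. The array `scores` and the helper `to_builtin` are hypothetical names used only for this illustration.

```python
import json

import numpy as np

scores = np.array([0.1, 0.7, 0.2])
ind = np.argmax(scores, axis=-1)  # a NumPy integer scalar, not a Python int

# Dumping the NumPy scalar directly raises TypeError
try:
    json.dumps({"op": ind})
    failed = False
except TypeError:
    failed = True

# Option 1: convert explicitly, as in the fragment above
as_int = json.dumps({"op": int(ind)})


# Option 2: let json call a hook for anything it cannot handle
def to_builtin(obj):
    if isinstance(obj, np.generic):   # NumPy scalars
        return obj.item()
    if isinstance(obj, np.ndarray):   # whole arrays
        return obj.tolist()
    raise TypeError(f"Object of type {type(obj).__name__} is not JSON serializable")


via_hook = json.dumps({"op": ind, "scores": scores}, default=to_builtin)
```

The explicit `int(...)` is fine for one or two values; the `default=` hook is less error-prone when NumPy values are scattered through a nested structure.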

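5. KeyError: 'module.conv1.op.0.weight'
state_dict = torch.load(path_to_ckpt)
Every key had unexpectedly gained a "module." prefix. The cause is that the model was wrapped in nn.DataParallel(model) when it was saved.
Saving model.state_dict() directly from the wrapped model adds the "module." prefix to every key.
Saving model.module.state_dict() instead produces the same keys as an unwrapped model.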
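
For a checkpoint that was already saved with the prefix, another option is to strip it when loading. This is only a sketch: `strip_module_prefix` is a hypothetical helper, and a plain dictionary with string values stands in for the real result of torch.load so the example runs without a checkpoint file.

```python
from collections import OrderedDict


def strip_module_prefix(state_dict, prefix="module."):
    """Return a copy of state_dict with a leading prefix removed from each key."""
    cleaned = OrderedDict()
    for key, value in state_dict.items():
        new_key = key[len(prefix):] if key.startswith(prefix) else key
        cleaned[new_key] = value
    return cleaned


# Stand-in for torch.load(path_to_ckpt); real values would be tensors
saved = OrderedDict([
    ("module.conv1.op.0.weight", "w0"),
    ("module.conv1.op.0.bias", "b0"),
])
cleaned = strip_module_prefix(saved)
```

With a real model, the cleaned dictionary would then be passed to `model.load_state_dict(cleaned)` on the unwrapped model.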

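4. nn.DataParallel

model = nn.DataParallel(model)
# print(type(model)) -> <class 'torch.nn.parallel.data_parallel.DataParallel'>

After wrapping with DataParallel, the type of model has changed, so methods implemented on your own class can no longer be called on it directly.
Call them as model.module.function_name() instead.
Without the wrapper, no .module is needed.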
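
To avoid sprinkling `.module` checks through the code, a small helper can return the underlying model either way. A sketch under stated assumptions: `TinyNet`, its `describe` method, and `unwrap` are hypothetical names, and the example only constructs the wrapper (it never runs a forward pass), so it should also work on a machine without a GPU.

```python
import torch.nn as nn


class TinyNet(nn.Module):  # hypothetical model for illustration
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(4, 2)

    def forward(self, x):
        return self.fc(x)

    def describe(self):  # a custom method, not part of nn.Module
        return "TinyNet with one linear layer"


def unwrap(model):
    """Return the underlying model whether or not it is wrapped in DataParallel."""
    return model.module if isinstance(model, nn.DataParallel) else model


plain = TinyNet()
wrapped = nn.DataParallel(TinyNet())

# Calling the custom method on the wrapper itself fails
try:
    wrapped.describe()
    direct_call_works = True
except AttributeError:
    direct_call_works = False

description = unwrap(wrapped).describe()
```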

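3. tensor.device
os.environ["CUDA_VISIBLE_DEVICES"] = "1"
I once set the GPU to 1, yet tensor.device kept printing "cuda:0",
which drove me up the wall.
The correct reading: the 0 here is only the index into the list of IDs we supplied, not the physical GPU ID.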
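
The renumbering can be pictured as a plain list lookup. The helper below is hypothetical and does not touch CUDA at all; it only mirrors the mapping from the N in "cuda:N" to the ID written in CUDA_VISIBLE_DEVICES, assuming the variable holds a comma-separated list of integer IDs.

```python
def physical_gpu_id(visible_devices, logical_index):
    """Map the N in "cuda:N" to the ID listed in CUDA_VISIBLE_DEVICES."""
    ids = [part.strip() for part in visible_devices.split(",") if part.strip()]
    return ids[logical_index]


# CUDA_VISIBLE_DEVICES = "1": the only visible GPU is reported as cuda:0
only_gpu = physical_gpu_id("1", 0)

# CUDA_VISIBLE_DEVICES = "2,3": cuda:1 refers to physical GPU 3
second_of_two = physical_gpu_id("2,3", 1)
```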

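2. os.environ["CUDA_VISIBLE_DEVICE"] = "0,1"
model = nn.DataParallel(model)
A painful lesson: I kept wondering why every GPU was in use when I had clearly restricted it to 0 and 1.
Note: the correct spelling is os.environ["CUDA_VISIBLE_DEVICES"] = "0,1", with a trailing S. Do not forget it.
Also note that nn.DataParallel(model) can be thought of as placing the model on all currently visible GPUs.
With res = model(imgs), imgs is naturally split across those GPUs as well. However,
tensors that are created inside the model and moved with tensor.cuda() are not distributed across the GPUs. By default they all end up on GPU 0.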
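
One common way around the last point is to never call `.cuda()` inside the model: either create the tensor on the device of the incoming batch, or register it as a buffer so it moves with the model. A minimal sketch, with `AddOffset` as a hypothetical layer; it is run on the CPU here, and the same code is meant to follow the input when the model is wrapped in DataParallel.

```python
import torch
import torch.nn as nn


class AddOffset(nn.Module):  # hypothetical layer for illustration
    def __init__(self, channels):
        super().__init__()
        # Buffers follow the model on .to()/.cuda() instead of staying on one device
        self.register_buffer("scale", torch.ones(channels))

    def forward(self, x):
        # Create the tensor on the same device as the input rather than calling .cuda()
        offset = torch.zeros(x.shape[-1], device=x.device, dtype=x.dtype)
        return x * self.scale + offset


layer = AddOffset(3)
x = torch.randn(2, 3)
out = layer(x)
```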

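1. copy
Suppose a class has the following two attributes:
self.layer_size = init_size (init_size is a list)
self.init_size = init_size
Both attributes then refer to the same list object, so modifying a value in self.layer_size also changes self.init_size.
So if we do not want it to change, it is best to pass copy.deepcopy(init_size) when handing over a list, so that the two are stored separately in memory.
Likewise, if self.init_size is passed out as an argument and might be modified there, it is safer to make a copy first and pass the copy.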
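
The aliasing is easy to see in a few lines. In this sketch the class `Net` and the attribute `safe_size` are hypothetical; only `layer_size`, `init_size`, and copy.deepcopy come from the description above.

```python
import copy


class Net:  # hypothetical class for illustration
    def __init__(self, init_size):
        self.layer_size = init_size                # same list object as the caller's
        self.init_size = init_size                 # same list object again
        self.safe_size = copy.deepcopy(init_size)  # independent copy


sizes = [16, 32]
net = Net(sizes)

# Changing one alias changes every name bound to that list
net.layer_size[0] = 64

shared = net.layer_size is net.init_size  # both names point at one list
```

After the assignment, `net.init_size` and the caller's `sizes` both show the new value, while `net.safe_size` still holds the original numbers.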