When reading other people's well-written code (under the PyTorch framework), you will often notice that some people habitually call `optimizer.zero_grad` while others call `model.zero_grad` (here `model` refers generically to your own custom network, named model). Is there any difference between the two? Does one of them have an advantage? When is which one the more reasonable choice, or is there no difference at all and either can be used freely? This post investigates.
The conclusions up front:
- When there is only one `model` and the `optimizer` contains only that model's parameters, `model.zero_grad` and `optimizer.zero_grad` are equivalent and either can be used.
- When there are multiple models and one `optimizer` contains the parameters of all of them, and all of them need to be trained, `optimizer.zero_grad` is the better approach: it beats calling `zero_grad` on every model both in time spent and in avoiding mistakes.
- When there are multiple models, each model (or subset of models) has its own `optimizer`, and there is additionally a `total_optimizer` containing the parameters of several models: if you only want to train one model or a subset of models, call `model.zero_grad` on the ones being trained and step their corresponding optimizers; if you want to train all of the models, `total_optimizer.zero_grad` is the better way.
Reproducibility setup
To make the experiments reproducible, fix the random seeds.
import torch
import torch.nn as nn
import random
import numpy as np
import torch.optim as optim
# set seed
seed = 0
np.random.seed(seed)
torch.manual_seed(seed)
torch.cuda.manual_seed(seed)
random.seed(seed)
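A side note: if these experiments were run on a GPU, seeding alone may not guarantee bit-exact results. The two flags below are a hedged addition (not part of the original setup) that also pin down cuDNN's algorithm choices:
# optional, stricter GPU determinism (assumption: not used in the original runs)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False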
Defining the models
To cover the experimental scenarios below, define three model classes: `SingleNet1`, `SingleNet2`, and `MultiNet`.
# single network 1: Linear(32, 16) -> ReLU -> Linear(16, 1)
class SingleNet1(nn.Module):
    def __init__(self):
        super(SingleNet1, self).__init__()
        self.fc1 = nn.Linear(32, 16)
        self.fc2 = nn.Linear(16, 1)
        self.relu = nn.ReLU()

    def forward(self, x):
        x = self.fc1(x)
        x = self.relu(x)
        x = self.fc2(x)
        return x
# single network 2: Linear(32, 28) -> ReLU -> Linear(28, 16) -> ReLU -> Linear(16, 1)
class SingleNet2(nn.Module):
    def __init__(self):
        super(SingleNet2, self).__init__()
        self.fc1 = nn.Linear(32, 28)
        self.fc2 = nn.Linear(28, 16)
        self.fc3 = nn.Linear(16, 1)
        self.relu = nn.ReLU()

    def forward(self, x):
        x = self.fc1(x)
        x = self.relu(x)
        x = self.fc2(x)
        x = self.relu(x)
        x = self.fc3(x)
        return x
# multi network: contains the two single networks as submodules
class MultiNet(nn.Module):
    def __init__(self, single_net1=None, single_net2=None):
        super(MultiNet, self).__init__()
        # avoid module instances as default arguments: they would be built once
        # at class-definition time and shared by every MultiNet instance
        self.single_net1 = single_net1 if single_net1 is not None else SingleNet1()
        self.single_net2 = single_net2 if single_net2 is not None else SingleNet2()

    def forward(self, x):
        x1 = self.single_net1(x)
        x2 = self.single_net2(x)
        return x1 + x2
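Because `MultiNet` registers the two sub-networks as submodules, `multinet_model.parameters()` yields the parameters of both. A quick illustrative check (the names `net1`, `net2`, `multi` are throwaway, not used in the experiments):
net1, net2 = SingleNet1(), SingleNet2()
multi = MultiNet(net1, net2)
# parameters() of the wrapper covers the union of both sub-networks' params
total = sum(p.numel() for p in multi.parameters())
print(total == sum(p.numel() for p in net1.parameters())
               + sum(p.numel() for p in net2.parameters()))  # expected: True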
Experiment scenario 1
In scenario 1, define one `SingleNet1` and optimize it with an Adam optimizer. Also make a copy of the model (with its own optimizer) as a reference.
import copy
# define model
singlenet1_model = SingleNet1()
optimizer = optim.Adam(singlenet1_model.parameters(), lr=0.01)
# copy singlenet1_model
singlenet1_model_copy = copy.deepcopy(singlenet1_model)
optimizer_copy = optim.Adam(singlenet1_model_copy.parameters(), lr=0.01)
Define the training data and train one copy with `model.zero_grad` and the other with `optimizer.zero_grad`.
# define train data
# [2, 32] random tensor float
train_data = torch.randn(2, 32)
print('=====================')
print('train_data:', train_data)
label = torch.randn(2, 1)
critirion = nn.MSELoss()
# train singlenet1_model by singlenet1_model.zero_grad()
singlenet1_model.zero_grad()
logits = singlenet1_model(train_data)
loss = critirion(logits, label)
loss.backward()
optimizer.step()
# train singlenet1_model_copy by optimizer_copy.zero_grad()
optimizer_copy.zero_grad()
logits_copy = singlenet1_model_copy(train_data)
loss_copy = critirion(logits_copy, label)
loss_copy.backward()
optimizer_copy.step()
# print params of singlenet1_model and singlenet1_model_copy
print('=====================')
print('singlenet1_model params:')
for name, param in singlenet1_model.named_parameters():
    print(name, param[0])
print('=====================')
print('singlenet1_model_copy params:')
for name, param in singlenet1_model_copy.named_parameters():
    print(name, param[0])
The training results:
=====================
train_data: tensor([[ 0.6690, -0.3454, 0.8949, -0.4531, 1.4003, 0.7638, 0.1912, -0.5438,
-0.8425, 1.6828, 2.3559, -0.3261, -0.7294, 1.3667, -1.7879, 0.2045,
0.1701, -1.7047, -0.6142, 0.3445, 0.4001, -0.1859, 0.4033, 0.7373,
1.4582, 1.2285, -1.2816, 0.2424, 0.8365, -1.3395, 0.0139, -1.0983],
[ 0.3514, -0.0103, -0.8283, -1.5381, 0.3386, 1.1375, 1.4879, 1.3499,
0.3458, 1.1637, -1.0770, 0.1557, 0.5532, 0.6177, 0.5466, 1.8611,
-0.0840, 0.3289, -0.3520, 1.2174, 1.1054, 0.7831, -1.1250, -1.1665,
-0.8184, 0.4727, -0.0529, -0.2585, -1.3894, 0.0219, 0.5756, -0.0342]])
=====================
singlenet1_model params:
fc1.weight tensor([-0.0106, 0.0392, 0.1443, -0.0448, -0.0795, -0.1220, 0.1203, -0.0401,
-0.0562, 0.1096, 0.1527, 0.0026, -0.1146, 0.1340, -0.0505, -0.0915,
-0.1343, -0.0750, -0.1284, -0.1120, -0.1710, -0.0256, 0.0611, -0.1442,
-0.0102, -0.0752, 0.0955, 0.1431, -0.0334, -0.0790, -0.1458, -0.0939],
grad_fn=<SelectBackward0>)
fc1.bias tensor(0.1005, grad_fn=<SelectBackward0>)
fc2.weight tensor([-0.2158, 0.1200, -0.1511, -0.0294, -0.0880, -0.0199, -0.2178, -0.2293,
0.0975, -0.0383, -0.1408, -0.0164, -0.2182, 0.0564, 0.1087, 0.2525],
grad_fn=<SelectBackward0>)
fc2.bias tensor(-0.2137, grad_fn=<SelectBackward0>)
=====================
singlenet1_model_copy params:
fc1.weight tensor([-0.0106, 0.0392, 0.1443, -0.0448, -0.0795, -0.1220, 0.1203, -0.0401,
-0.0562, 0.1096, 0.1527, 0.0026, -0.1146, 0.1340, -0.0505, -0.0915,
-0.1343, -0.0750, -0.1284, -0.1120, -0.1710, -0.0256, 0.0611, -0.1442,
-0.0102, -0.0752, 0.0955, 0.1431, -0.0334, -0.0790, -0.1458, -0.0939],
grad_fn=<SelectBackward0>)
fc1.bias tensor(0.1005, grad_fn=<SelectBackward0>)
fc2.weight tensor([-0.2158, 0.1200, -0.1511, -0.0294, -0.0880, -0.0199, -0.2178, -0.2293,
0.0975, -0.0383, -0.1408, -0.0164, -0.2182, 0.0564, 0.1087, 0.2525],
grad_fn=<SelectBackward0>)
fc2.bias tensor(-0.2137, grad_fn=<SelectBackward0>)
The two runs clearly produce identical results. (You could have guessed as much without running anything, since otherwise one of the two would simply have to be wrong; the experiment just confirms it.)
Next, compare the two approaches by runtime.
# train singlenet1_model with singlenet1_model.zero_grad() for Epoch iterations
# and measure the time consumption
import time
Epoch = 10000
start = time.time()
for epoch in range(Epoch):
    singlenet1_model.zero_grad()
    logits = singlenet1_model(train_data)
    loss = critirion(logits, label)
    loss.backward()
    optimizer.step()
end = time.time()
print('=====================')
print("singlenet1_model zero_grad time consumption: ", end-start)
start_copy = time.time()
for epoch in range(Epoch):
    # train singlenet1_model_copy with optimizer_copy.zero_grad()
    optimizer_copy.zero_grad()
    logits_copy = singlenet1_model_copy(train_data)
    loss_copy = critirion(logits_copy, label)
    loss_copy.backward()
    optimizer_copy.step()
end_copy = time.time()
print('=====================')
print("singlenet1_model_copy zero_grad time consumption: ", end_copy-start_copy)
# print params of singlenet1_model and singlenet1_model_copy
print('=====================')
print('singlenet1_model params:')
for name, param in singlenet1_model.named_parameters():
    print(name, param[0])
print('=====================')
print('singlenet1_model_copy params:')
for name, param in singlenet1_model_copy.named_parameters():
    print(name, param[0])
The results:
=====================
singlenet1_model zero_grad time consumption: 3.0909366607666016
=====================
singlenet1_model_copy zero_grad time consumption: 3.0928189754486084
=====================
singlenet1_model params:
fc1.weight tensor([-0.0437, 0.0724, 0.1112, -0.0116, -0.1126, -0.1551, 0.0871, -0.0069,
-0.0230, 0.0764, 0.1195, 0.0357, -0.0815, 0.1009, -0.0173, -0.1246,
-0.1674, -0.0419, -0.0953, -0.1452, -0.2042, 0.0075, 0.0279, -0.1774,
-0.0434, -0.1083, 0.1287, 0.1099, -0.0666, -0.0458, -0.1790, -0.0607],
grad_fn=<SelectBackward0>)
fc1.bias tensor(0.0673, grad_fn=<SelectBackward0>)
fc2.weight tensor([ 0.0054, 0.0488, -0.1511, -0.2118, -0.1202, -0.0199, -0.2178, -0.2293,
0.0474, 0.0244, 0.1491, 0.0125, -0.1704, -0.0230, -0.1324, -0.0102],
grad_fn=<SelectBackward0>)
fc2.bias tensor(-0.3130, grad_fn=<SelectBackward0>)
=====================
singlenet1_model_copy params:
fc1.weight tensor([-0.0437, 0.0724, 0.1112, -0.0116, -0.1126, -0.1551, 0.0871, -0.0069,
-0.0230, 0.0764, 0.1195, 0.0357, -0.0815, 0.1009, -0.0173, -0.1246,
-0.1674, -0.0419, -0.0953, -0.1452, -0.2042, 0.0075, 0.0279, -0.1774,
-0.0434, -0.1083, 0.1287, 0.1099, -0.0666, -0.0458, -0.1790, -0.0607],
grad_fn=<SelectBackward0>)
fc1.bias tensor(0.0673, grad_fn=<SelectBackward0>)
fc2.weight tensor([ 0.0054, 0.0488, -0.1511, -0.2118, -0.1202, -0.0199, -0.2178, -0.2293,
0.0474, 0.0244, 0.1491, 0.0125, -0.1704, -0.0230, -0.1324, -0.0102],
grad_fn=<SelectBackward0>)
fc2.bias tensor(-0.3130, grad_fn=<SelectBackward0>)
Clearly, `optimizer.zero_grad` and `model.zero_grad` show essentially no difference in training time.
Next, on to experiment scenario 2 to explore further:
Experiment scenario 2
In scenario 2, define two single networks and use one optimizer to optimize both. In this setup there are three ways to zero the gradients: `singlenet1_model.zero_grad()`, `singlenet2_model.zero_grad()`, and `optimizer.zero_grad()`, so two additional copies of everything are needed.
import copy
singlenet1_model = SingleNet1()
singlenet2_model = SingleNet2()
# using optimizer to optimize two single network
optimizer = optim.Adam(list(singlenet1_model.parameters())+list(singlenet2_model.parameters()), lr=0.01)
# copy1
singlenet1_model_copy1 = copy.deepcopy(singlenet1_model)
singlenet2_model_copy1 = copy.deepcopy(singlenet2_model)
optimizer_copy1 = optim.Adam(list(singlenet1_model_copy1.parameters())+list(singlenet2_model_copy1.parameters()), lr=0.01)
# copy2
singlenet1_model_copy2 = copy.deepcopy(singlenet1_model)
singlenet2_model_copy2 = copy.deepcopy(singlenet2_model)
optimizer_copy2 = optim.Adam(list(singlenet1_model_copy2.parameters())+list(singlenet2_model_copy2.parameters()), lr=0.01)
Now run the experiment with the three update schemes:
import time
# define train data
# [2, 32] random float tensor
train_data = torch.randn(2, 32)
print('=====================')
print('train_data:', train_data)
label = torch.randn(2, 1)
critirion = nn.MSELoss()
# three ways to update the network params
# 1. singlenet1_model.zero_grad and optimizer.step
# 2. singlenet2_model_copy1.zero_grad and optimizer_copy1.step
# 3. optimizer_copy2.zero_grad and optimizer_copy2.step
# first print the networks before training
print('=====================')
print("before train:")
print('singlenet1_model params:')
for name, param in singlenet1_model.named_parameters():
    print(name, param[0])
print('=====================')
print('singlenet2_model params:')
for name, param in singlenet2_model.named_parameters():
    print(name, param[0])
Epoch = 200
# way 1
start = time.time()
for epoch in range(Epoch):
    # zero only singlenet1_model's gradients
    singlenet1_model.zero_grad()
    logits_1 = singlenet1_model(train_data)
    logits_2 = singlenet2_model(train_data)
    loss = critirion(logits_1 + logits_2, label)
    loss.backward()
    optimizer.step()
end = time.time()
# way 2
start_copy1 = time.time()
for epoch in range(Epoch):
    singlenet2_model_copy1.zero_grad()
    logits_copy1_1 = singlenet1_model_copy1(train_data)
    logits_copy1_2 = singlenet2_model_copy1(train_data)
    loss_copy1 = critirion(logits_copy1_1 + logits_copy1_2, label)
    loss_copy1.backward()
    optimizer_copy1.step()
end_copy1 = time.time()
# way 3
start_copy2 = time.time()
for epoch in range(Epoch):
    optimizer_copy2.zero_grad()
    logits_copy2_1 = singlenet1_model_copy2(train_data)
    logits_copy2_2 = singlenet2_model_copy2(train_data)
    loss_copy2 = critirion(logits_copy2_1 + logits_copy2_2, label)
    loss_copy2.backward()
    optimizer_copy2.step()
end_copy2 = time.time()
# print time
print('=====================')
print("singlenet1_model zero_grad time consumption: ", end-start)
print("singlenet2_model_copy1 zero_grad time consumption: ", end_copy1-start_copy1)
print("singlenet2_model_copy2 zero_grad time consumption: ", end_copy2-start_copy2)
# print params of singlenet1_model and singlenet2_model
print('=====================')
print("after train:")
print('singlenet1_model params:')
for name, param in singlenet1_model.named_parameters():
    print(name, param[0])
print('=====================')
print('singlenet2_model params:')
for name, param in singlenet2_model.named_parameters():
    print(name, param[0])
# print params of singlenet1_model_copy1 and singlenet2_model_copy1
print('=====================')
print('singlenet1_model_copy1 params:')
for name, param in singlenet1_model_copy1.named_parameters():
    print(name, param[0])
print('=====================')
print('singlenet2_model_copy1 params:')
for name, param in singlenet2_model_copy1.named_parameters():
    print(name, param[0])
# print params of singlenet1_model_copy2 and singlenet2_model_copy2
print('=====================')
print('singlenet1_model_copy2 params:')
for name, param in singlenet1_model_copy2.named_parameters():
    print(name, param[0])
print('=====================')
print('singlenet2_model_copy2 params:')
for name, param in singlenet2_model_copy2.named_parameters():
    print(name, param[0])
Before analyzing the results, some theory. The `zero_grad` operation sets a network's gradients to zero, so that when the loss is backpropagated, the gradients left over from the previous update do not affect the current one (PyTorch accumulates gradients across `backward` calls). Since the `optimizer` here contains the parameters of both networks, `optimizer.step` updates the parameters of both. Only if the gradients retained from the previous update are cleared in both networks do we get the correct training behavior. There are two ways to clear them: use `model.zero_grad()` to zero the gradients of a network directly, or use `optimizer.zero_grad()` to zero the gradients of every network the optimizer contains.
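As a minimal sketch of this accumulation behavior (reusing the `SingleNet1` class from above; `demo_net`, `x`, `y`, `loss_fn` are throwaway names, not part of the experiments):
demo_net = SingleNet1()
x, y = torch.randn(2, 32), torch.randn(2, 1)
loss_fn = nn.MSELoss()
loss_fn(demo_net(x), y).backward()
g1 = demo_net.fc1.weight.grad.clone()
# a second backward WITHOUT zeroing: the new gradient is added to the old one
loss_fn(demo_net(x), y).backward()
print(torch.allclose(demo_net.fc1.weight.grad, 2 * g1))  # expected: True
# zeroing clears the leftover gradients (depending on the PyTorch version,
# zero_grad may set .grad to None instead of an all-zero tensor)
demo_net.zero_grad()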
In the experiment set up for scenario 2, note that the first two update schemes zero the gradients of only one network, while all three schemes use `optimizer.step()` to update both networks at once. So the schemes that zero only one network are bound to give wrong results, or at least not the results we expect. Scheme 1 zeros only `singlenet1_model`; scheme 2 zeros only `singlenet2_model_copy1`. Yet both networks participate in every forward pass, so both get updated at `optimizer.step()`. Theoretically, under scheme 1, `singlenet1_model`'s parameters should match `singlenet1_model_copy2`'s, since both are zeroed every iteration, while `singlenet2_model`'s parameters should grow to very large positive or negative values, because its gradients are never cleared and keep accumulating. Likewise, under scheme 2, `singlenet2_model_copy1`'s parameters should match `singlenet2_model_copy2`'s, and `singlenet1_model_copy1`'s parameters should blow up for the same reason.
Now for the results (long; feel free to skip ahead to the takeaway):
=====================
train_data: tensor([[-0.3577, 2.4431, 0.3372, 0.2288, 0.7613, 0.1274, -1.0252, 1.0250,
0.2452, -0.6526, -0.9433, -0.7394, -1.6489, 1.7330, -0.7891, 0.4625,
-0.6588, -0.0543, 0.0127, -0.0774, 0.6443, -1.2630, 0.9220, 1.1260,
1.6407, 0.0766, 0.0209, 0.0746, 1.6997, -0.1046, -0.7353, 0.6818],
[-0.4440, -0.5214, 0.5883, 1.0508, -0.2984, -2.1876, 0.6161, 0.1900,
1.5946, -0.0482, 0.5890, 0.8470, 1.2942, -0.1769, 1.0030, 0.8395,
1.8354, -1.4991, -0.7001, 0.0906, -0.0111, 0.4815, -1.3357, -0.7953,
0.7219, 2.0727, 0.5926, -0.3730, 0.8590, -1.4792, -0.6220, 0.4623]])
=====================
before train:
singlenet1_model params:
fc1.weight tensor([-0.0572, -0.0114, 0.0519, -0.1085, 0.0158, 0.1051, -0.1034, 0.0514,
-0.0060, 0.0972, 0.1175, 0.0469, 0.0053, -0.1522, 0.0704, 0.1078,
-0.0642, -0.1755, 0.1433, -0.0415, 0.0354, -0.0497, 0.1127, 0.1724,
0.1108, -0.0563, 0.1239, 0.0863, 0.0172, -0.1424, -0.0759, -0.1749],
grad_fn=<SelectBackward0>)
fc1.bias tensor(-0.0693, grad_fn=<SelectBackward0>)
fc2.weight tensor([ 0.1883, 0.1576, -0.0101, -0.2065, 0.0201, -0.1209, 0.0651, -0.0994,
0.1927, -0.2256, -0.0673, 0.0394, 0.0238, 0.2054, -0.0790, -0.1996],
grad_fn=<SelectBackward0>)
fc2.bias tensor(0.1655, grad_fn=<SelectBackward0>)
=====================
singlenet2_model params:
fc1.weight tensor([ 0.0494, 0.1165, -0.1046, -0.1038, -0.0714, 0.0126, 0.0599, -0.1016,
-0.0166, -0.0306, 0.1613, -0.0638, 0.1713, -0.1084, -0.0728, -0.1330,
0.1302, -0.1216, 0.0732, -0.0650, 0.1762, 0.0249, -0.1535, 0.0304,
0.0215, -0.1710, 0.0355, 0.1687, 0.0509, -0.0009, -0.0366, 0.0470],
grad_fn=<SelectBackward0>)
fc1.bias tensor(0.1513, grad_fn=<SelectBackward0>)
fc2.weight tensor([-0.1004, -0.1722, 0.1581, -0.1148, 0.0254, 0.1514, -0.0238, -0.1239,
-0.1828, 0.0314, -0.0948, 0.0138, 0.0604, -0.0535, 0.1281, -0.0246,
-0.0754, -0.1250, 0.0573, -0.1168, 0.0397, -0.0239, 0.1299, -0.0926,
-0.0931, -0.0244, -0.1154, 0.0720], grad_fn=<SelectBackward0>)
fc2.bias tensor(0.0034, grad_fn=<SelectBackward0>)
fc3.weight tensor([-0.1414, 0.1791, 0.1463, -0.0393, 0.0974, -0.0140, 0.1953, 0.1029,
0.1823, -0.2272, 0.2436, 0.0931, 0.1215, 0.0557, 0.1642, 0.2354],
grad_fn=<SelectBackward0>)
fc3.bias tensor(-0.0213, grad_fn=<SelectBackward0>)
=====================
singlenet1_model zero_grad time consumption: 11.717670917510986
singlenet2_model_copy1 zero_grad time consumption: 25.526930332183838
singlenet2_model_copy2 zero_grad time consumption: 30.173574209213257
=====================
after train:
singlenet1_model params:
fc1.weight tensor([ 0.0028, -0.0714, -0.0082, -0.1686, -0.0443, 0.0451, -0.0433, -0.0086,
-0.0661, 0.1573, 0.1775, 0.1070, 0.0654, -0.2122, 0.1304, 0.0477,
-0.0042, -0.1155, 0.0832, 0.0186, -0.0247, 0.0104, 0.0527, 0.1124,
0.0508, -0.1163, 0.0638, 0.0262, -0.0428, -0.0823, -0.0159, -0.2350],
grad_fn=<SelectBackward0>)
fc1.bias tensor(-0.1294, grad_fn=<SelectBackward0>)
fc2.weight tensor([ 0.1283, 0.0566, 0.0154, 0.0041, 0.0201, -0.0611, 0.0651, -0.2134,
0.0157, -0.2256, 0.0021, -0.0236, -0.0672, 0.1453, 0.0145, -0.1996],
grad_fn=<SelectBackward0>)
fc2.bias tensor(0.0872, grad_fn=<SelectBackward0>)
=====================
singlenet2_model params:
fc1.weight tensor([ 200.0430, -199.7129, -200.1006, -200.1378, -199.8998, -199.6105,
199.8880, -200.0756, -200.0676, 199.9428, 199.9849, 199.7537,
199.9934, -199.9425, 199.7460, -200.1328, 199.9292, 200.0739,
-197.2046, 199.7518, -199.6571, 199.8545, -199.9679, -199.7917,
-199.9570, -200.3628, -200.1662, -199.6084, -199.9294, 200.1179,
199.9478, -199.9335], grad_fn=<SelectBackward0>)
fc1.bias tensor(-199.8338, grad_fn=<SelectBackward0>)
fc2.weight tensor([ 1.9979e+02, 1.9980e+02, 1.5810e-01, 1.9986e+02, 1.9921e+02,
1.9103e+02, -1.9921e+02, -1.2386e-01, -1.8283e-01, -2.0015e+02,
-2.0020e+02, -1.9891e+02, 2.0005e+02, 1.9993e+02, 1.2814e-01,
-2.4592e-02, -7.5420e-02, 1.9980e+02, -2.0006e+02, 1.9985e+02,
1.9984e+02, -2.0028e+02, -2.0016e+02, 1.9988e+02, -2.0008e+02,
1.9990e+02, 1.9985e+02, 1.9999e+02], grad_fn=<SelectBackward0>)
fc2.bias tensor(-199.5621, grad_fn=<SelectBackward0>)
fc3.weight tensor([-199.9896, -199.7922, -199.8229, -198.5367, 200.1291, -200.1243,
200.1326, 200.1382, -199.8084, 199.7870, 199.8969, -199.8801,
200.0364, 184.9377, -199.7243, 200.0841], grad_fn=<SelectBackward0>)
fc3.bias tensor(0.2422, grad_fn=<SelectBackward0>)
=====================
singlenet1_model_copy1 params:
fc1.weight tensor([ 199.9120, -199.9805, -199.9175, -200.0778, -199.9536, -199.8641,
199.8659, -199.9178, -199.9754, 200.0663, 200.0868, 200.0161,
199.9745, -200.1215, 200.0396, -199.8615, 199.9051, 199.7939,
-199.8260, 199.9279, -199.9338, 199.9195, -199.8565, -199.7969,
-199.8586, -200.0256, -199.8454, -199.8830, -199.9520, 199.8269,
199.8933, -200.1440], grad_fn=<SelectBackward0>)
fc1.bias tensor(-200.0385, grad_fn=<SelectBackward0>)
fc2.weight tensor([-1.9978e+02, -1.9984e+02, -1.8655e+02, 1.9988e+02, 2.0126e-02,
-2.0010e+02, 6.5054e-02, 1.9653e+02, -2.0007e+02, -2.2556e-01,
1.9922e+02, -1.9993e+02, -1.9996e+02, -1.9976e+02, 1.9860e+02,
-1.9956e-01], grad_fn=<SelectBackward0>)
fc2.bias tensor(-0.9140, grad_fn=<SelectBackward0>)
=====================
singlenet2_model_copy1 params:
fc1.weight tensor([ 0.1291, 0.0988, -0.1787, -0.1693, -0.0654, 0.0712, 0.0382, -0.2440,
-0.0805, 0.1293, 0.1383, -0.1026, 0.1415, -0.1445, -0.1137, -0.2065,
0.0790, -0.0606, 0.1328, -0.1043, 0.1318, 0.0201, -0.1103, 0.0572,
-0.0881, -0.2319, -0.0254, 0.2238, -0.0536, 0.0609, 0.0516, -0.0476],
grad_fn=<SelectBackward0>)
fc1.bias tensor(0.0671, grad_fn=<SelectBackward0>)
fc2.weight tensor([-0.1004, -0.2032, 0.1581, -0.0398, -0.0428, 0.1412, -0.0835, -0.1239,
-0.1828, 0.0189, 0.0860, -0.0502, 0.1548, 0.0317, 0.1281, -0.0246,
-0.0754, -0.0518, 0.1004, -0.0568, 0.0397, -0.0230, 0.1907, -0.0602,
-0.0931, 0.0384, -0.0554, 0.0590], grad_fn=<SelectBackward0>)
fc2.bias tensor(-0.1083, grad_fn=<SelectBackward0>)
fc3.weight tensor([-0.0401, 0.1161, 0.1755, 0.0007, 0.0154, -0.0282, 0.1332, -0.0330,
0.0215, -0.1596, 0.0839, 0.0238, 0.3422, -0.0764, -0.0379, 0.1469],
grad_fn=<SelectBackward0>)
fc3.bias tensor(-0.1265, grad_fn=<SelectBackward0>)
=====================
singlenet1_model_copy2 params:
fc1.weight tensor([ 0.0028, -0.0714, -0.0082, -0.1686, -0.0443, 0.0451, -0.0433, -0.0086,
-0.0661, 0.1573, 0.1775, 0.1070, 0.0654, -0.2122, 0.1304, 0.0477,
-0.0042, -0.1155, 0.0832, 0.0186, -0.0247, 0.0104, 0.0527, 0.1124,
0.0508, -0.1163, 0.0638, 0.0262, -0.0428, -0.0823, -0.0159, -0.2350],
grad_fn=<SelectBackward0>)
fc1.bias tensor(-0.1294, grad_fn=<SelectBackward0>)
fc2.weight tensor([ 1.2825e-01, 5.7429e-02, 2.5881e-06, 4.1123e-05, 2.0126e-02,
-6.1129e-02, 6.5054e-02, -7.6426e-02, 1.5042e-01, -2.2556e-01,
9.8415e-06, -2.3649e-02, -6.6960e-02, 1.4533e-01, -1.3255e-01,
-1.9956e-01], grad_fn=<SelectBackward0>)
fc2.bias tensor(-0.0092, grad_fn=<SelectBackward0>)
=====================
singlenet2_model_copy2 params:
fc1.weight tensor([ 0.1115, 0.1645, -0.1662, -0.1644, -0.0179, 0.0725, 0.0042, -0.1755,
-0.0771, 0.0627, 0.1054, -0.1215, 0.1146, -0.0730, -0.1309, -0.1944,
0.0711, -0.0614, 0.1332, -0.1228, 0.1507, -0.0283, -0.0952, 0.0868,
-0.0444, -0.2311, -0.0247, 0.2282, -0.0143, 0.0593, 0.0265, -0.0169],
grad_fn=<SelectBackward0>)
fc1.bias tensor(0.0887, grad_fn=<SelectBackward0>)
fc2.weight tensor([-0.1003, -0.0925, 0.1581, -0.0398, 0.0948, 0.1208, 0.0044, -0.1239,
-0.1828, 0.1017, -0.0602, 0.0095, 0.1549, 0.0317, 0.1281, -0.0246,
-0.0754, -0.1086, 0.1022, -0.0568, 0.0397, -0.0572, 0.1700, 0.0059,
-0.0931, 0.0382, -0.0554, 0.1305], grad_fn=<SelectBackward0>)
fc2.bias tensor(-0.0009, grad_fn=<SelectBackward0>)
fc3.weight tensor([ 5.5309e-07, 1.1611e-01, 8.6269e-02, 2.3212e-03, -3.8124e-03,
2.7853e-07, 3.2120e-02, -2.2453e-02, 6.2886e-02, -1.5326e-01,
-6.2916e-06, 4.1156e-03, -3.0080e-03, -5.2712e-03, 6.7732e-02,
-3.2713e-05], grad_fn=<SelectBackward0>)
fc3.bias tensor(-0.1960, grad_fn=<SelectBackward0>)
The results show that the predicted matches did not appear: `singlenet1_model`'s parameters do not equal `singlenet1_model_copy2`'s, and `singlenet2_model_copy1`'s do not equal `singlenet2_model_copy2`'s. Why this differs from expectation is an open question for now; the way `step` applies the updates needs further investigation. One plausible explanation (my speculation, not verified here): the un-zeroed network's parameters diverge, which changes the shared loss and therefore the gradients flowing into the zeroed network, so the two runs no longer see the same training signal.
The other predicted phenomenon did show up, however. So, for any network you intend to update, always perform `zero_grad`; otherwise you will not get the expected result, and you may run into severe exploding or vanishing gradient problems.
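A simple way to catch this kind of blow-up early is to monitor gradient magnitudes during training. A hypothetical helper (illustrative only, not used in the experiments above):
def grad_norm(model):
    # overall L2 norm across all parameter gradients of the model
    total = 0.0
    for p in model.parameters():
        if p.grad is not None:
            total += p.grad.norm().item() ** 2
    return total ** 0.5
Printing, say, `grad_norm(singlenet2_model)` every few iterations of way 1 would presumably reveal the accumulation long before the parameters reach the ±200 range.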
Next, verify the theoretical expectation experimentally once more. This time, in each of the three update schemes from scenario 2, the gradients of both networks are zeroed, to confirm the theory.
import time
# training both networks
# define train data
# [2, 32] random float tensor
train_data = torch.randn(2, 32)
print('=====================')
print('train_data:', train_data)
label = torch.randn(2, 1)
critirion = nn.MSELoss()
# three ways to update the network params, now zeroing both networks each time
# 1. singlenet1_model.zero_grad + singlenet2_model.zero_grad and optimizer.step
# 2. singlenet1_model_copy1.zero_grad + singlenet2_model_copy1.zero_grad and optimizer_copy1.step
# 3. optimizer_copy2.zero_grad and optimizer_copy2.step
# first print the networks before training
print('=====================')
print("before train:")
print('singlenet1_model params:')
for name, param in singlenet1_model.named_parameters():
    print(name, param[0])
print('=====================')
print('singlenet2_model params:')
for name, param in singlenet2_model.named_parameters():
    print(name, param[0])
Epoch = 20000
# way 1
start = time.time()
for epoch in range(Epoch):
    singlenet1_model.zero_grad()
    singlenet2_model.zero_grad()
    logits_1 = singlenet1_model(train_data)
    logits_2 = singlenet2_model(train_data)
    loss = critirion(logits_1 + logits_2, label)
    loss.backward()
    optimizer.step()
end = time.time()
# way 2
start_copy1 = time.time()
for epoch in range(Epoch):
    singlenet1_model_copy1.zero_grad()
    singlenet2_model_copy1.zero_grad()
    logits_copy1_1 = singlenet1_model_copy1(train_data)
    logits_copy1_2 = singlenet2_model_copy1(train_data)
    loss_copy1 = critirion(logits_copy1_1 + logits_copy1_2, label)
    loss_copy1.backward()
    optimizer_copy1.step()
end_copy1 = time.time()
# way 3
start_copy2 = time.time()
for epoch in range(Epoch):
    optimizer_copy2.zero_grad()
    logits_copy2_1 = singlenet1_model_copy2(train_data)
    logits_copy2_2 = singlenet2_model_copy2(train_data)
    loss_copy2 = critirion(logits_copy2_1 + logits_copy2_2, label)
    loss_copy2.backward()
    optimizer_copy2.step()
end_copy2 = time.time()
# print time
print('=====================')
print("singlenet1_model zero_grad time consumption: ", end-start)
print("singlenet2_model_copy1 zero_grad time consumption: ", end_copy1-start_copy1)
print("singlenet2_model_copy2 zero_grad time consumption: ", end_copy2-start_copy2)
# print params of singlenet1_model and singlenet2_model
print('=====================')
print("after train:")
print('singlenet1_model params:')
for name, param in singlenet1_model.named_parameters():
    print(name, param[0])
print('=====================')
print('singlenet2_model params:')
for name, param in singlenet2_model.named_parameters():
    print(name, param[0])
# print params of singlenet1_model_copy1 and singlenet2_model_copy1
print('=====================')
print('singlenet1_model_copy1 params:')
for name, param in singlenet1_model_copy1.named_parameters():
    print(name, param[0])
print('=====================')
print('singlenet2_model_copy1 params:')
for name, param in singlenet2_model_copy1.named_parameters():
    print(name, param[0])
# print params of singlenet1_model_copy2 and singlenet2_model_copy2
print('=====================')
print('singlenet1_model_copy2 params:')
for name, param in singlenet1_model_copy2.named_parameters():
    print(name, param[0])
print('=====================')
print('singlenet2_model_copy2 params:')
for name, param in singlenet2_model_copy2.named_parameters():
    print(name, param[0])
The results:
=====================
train_data: tensor([[-0.3577, 2.4431, 0.3372, 0.2288, 0.7613, 0.1274, -1.0252, 1.0250,
0.2452, -0.6526, -0.9433, -0.7394, -1.6489, 1.7330, -0.7891, 0.4625,
-0.6588, -0.0543, 0.0127, -0.0774, 0.6443, -1.2630, 0.9220, 1.1260,
1.6407, 0.0766, 0.0209, 0.0746, 1.6997, -0.1046, -0.7353, 0.6818],
[-0.4440, -0.5214, 0.5883, 1.0508, -0.2984, -2.1876, 0.6161, 0.1900,
1.5946, -0.0482, 0.5890, 0.8470, 1.2942, -0.1769, 1.0030, 0.8395,
1.8354, -1.4991, -0.7001, 0.0906, -0.0111, 0.4815, -1.3357, -0.7953,
0.7219, 2.0727, 0.5926, -0.3730, 0.8590, -1.4792, -0.6220, 0.4623]])
=====================
before train:
singlenet1_model params:
fc1.weight tensor([-0.0572, -0.0114, 0.0519, -0.1085, 0.0158, 0.1051, -0.1034, 0.0514,
-0.0060, 0.0972, 0.1175, 0.0469, 0.0053, -0.1522, 0.0704, 0.1078,
-0.0642, -0.1755, 0.1433, -0.0415, 0.0354, -0.0497, 0.1127, 0.1724,
0.1108, -0.0563, 0.1239, 0.0863, 0.0172, -0.1424, -0.0759, -0.1749],
grad_fn=<SelectBackward0>)
fc1.bias tensor(-0.0693, grad_fn=<SelectBackward0>)
fc2.weight tensor([ 0.1883, 0.1576, -0.0101, -0.2065, 0.0201, -0.1209, 0.0651, -0.0994,
0.1927, -0.2256, -0.0673, 0.0394, 0.0238, 0.2054, -0.0790, -0.1996],
grad_fn=<SelectBackward0>)
fc2.bias tensor(0.1655, grad_fn=<SelectBackward0>)
=====================
singlenet2_model params:
fc1.weight tensor([ 0.0494, 0.1165, -0.1046, -0.1038, -0.0714, 0.0126, 0.0599, -0.1016,
-0.0166, -0.0306, 0.1613, -0.0638, 0.1713, -0.1084, -0.0728, -0.1330,
0.1302, -0.1216, 0.0732, -0.0650, 0.1762, 0.0249, -0.1535, 0.0304,
0.0215, -0.1710, 0.0355, 0.1687, 0.0509, -0.0009, -0.0366, 0.0470],
grad_fn=<SelectBackward0>)
fc1.bias tensor(0.1513, grad_fn=<SelectBackward0>)
fc2.weight tensor([-0.1004, -0.1722, 0.1581, -0.1148, 0.0254, 0.1514, -0.0238, -0.1239,
-0.1828, 0.0314, -0.0948, 0.0138, 0.0604, -0.0535, 0.1281, -0.0246,
-0.0754, -0.1250, 0.0573, -0.1168, 0.0397, -0.0239, 0.1299, -0.0926,
-0.0931, -0.0244, -0.1154, 0.0720], grad_fn=<SelectBackward0>)
fc2.bias tensor(0.0034, grad_fn=<SelectBackward0>)
fc3.weight tensor([-0.1414, 0.1791, 0.1463, -0.0393, 0.0974, -0.0140, 0.1953, 0.1029,
0.1823, -0.2272, 0.2436, 0.0931, 0.1215, 0.0557, 0.1642, 0.2354],
grad_fn=<SelectBackward0>)
fc3.bias tensor(-0.0213, grad_fn=<SelectBackward0>)
=====================
singlenet1_model zero_grad time consumption: 20.477731466293335
singlenet2_model_copy1 zero_grad time consumption: 18.824637413024902
singlenet2_model_copy2 zero_grad time consumption: 12.071680068969727
=====================
after train:
singlenet1_model params:
fc1.weight tensor([ 0.0028, -0.0714, -0.0082, -0.1686, -0.0443, 0.0451, -0.0433, -0.0086,
-0.0661, 0.1573, 0.1775, 0.1070, 0.0654, -0.2122, 0.1304, 0.0477,
-0.0042, -0.1155, 0.0832, 0.0186, -0.0247, 0.0104, 0.0527, 0.1124,
0.0508, -0.1163, 0.0638, 0.0262, -0.0428, -0.0823, -0.0159, -0.2350],
grad_fn=<SelectBackward0>)
fc1.bias tensor(-0.1294, grad_fn=<SelectBackward0>)
fc2.weight tensor([ 1.2825e-01, 5.7429e-02, 2.5881e-06, 4.1123e-05, 2.0126e-02,
-6.1129e-02, 6.5054e-02, -7.6426e-02, 1.5042e-01, -2.2556e-01,
9.8415e-06, -2.3649e-02, -6.6960e-02, 1.4533e-01, -1.3255e-01,
-1.9956e-01], grad_fn=<SelectBackward0>)
fc2.bias tensor(-0.0092, grad_fn=<SelectBackward0>)
=====================
singlenet2_model params:
fc1.weight tensor([ 0.1115, 0.1645, -0.1662, -0.1644, -0.0179, 0.0725, 0.0042, -0.1755,
-0.0771, 0.0627, 0.1054, -0.1215, 0.1146, -0.0730, -0.1309, -0.1944,
0.0711, -0.0614, 0.1332, -0.1228, 0.1507, -0.0283, -0.0952, 0.0868,
-0.0444, -0.2311, -0.0247, 0.2282, -0.0143, 0.0593, 0.0265, -0.0169],
grad_fn=<SelectBackward0>)
fc1.bias tensor(0.0887, grad_fn=<SelectBackward0>)
fc2.weight tensor([-0.1003, -0.0925, 0.1581, -0.0398, 0.0948, 0.1208, 0.0044, -0.1239,
-0.1828, 0.1017, -0.0602, 0.0095, 0.1549, 0.0317, 0.1281, -0.0246,
-0.0754, -0.1086, 0.1022, -0.0568, 0.0397, -0.0572, 0.1700, 0.0059,
-0.0931, 0.0382, -0.0554, 0.1305], grad_fn=<SelectBackward0>)
fc2.bias tensor(-0.0009, grad_fn=<SelectBackward0>)
fc3.weight tensor([ 5.5309e-07, 1.1611e-01, 8.6269e-02, 2.3212e-03, -3.8124e-03,
2.7853e-07, 3.2120e-02, -2.2453e-02, 6.2886e-02, -1.5326e-01,
-6.2916e-06, 4.1156e-03, -3.0080e-03, -5.2712e-03, 6.7732e-02,
-3.2713e-05], grad_fn=<SelectBackward0>)
fc3.bias tensor(-0.1960, grad_fn=<SelectBackward0>)
=====================
singlenet1_model_copy1 params:
fc1.weight tensor([ 0.0028, -0.0714, -0.0082, -0.1686, -0.0443, 0.0451, -0.0433, -0.0086,
-0.0661, 0.1573, 0.1775, 0.1070, 0.0654, -0.2122, 0.1304, 0.0477,
-0.0042, -0.1155, 0.0832, 0.0186, -0.0247, 0.0104, 0.0527, 0.1124,
0.0508, -0.1163, 0.0638, 0.0262, -0.0428, -0.0823, -0.0159, -0.2350],
grad_fn=<SelectBackward0>)
fc1.bias tensor(-0.1294, grad_fn=<SelectBackward0>)
fc2.weight tensor([ 1.2825e-01, 5.7429e-02, 2.5881e-06, 4.1123e-05, 2.0126e-02,
-6.1129e-02, 6.5054e-02, -7.6426e-02, 1.5042e-01, -2.2556e-01,
9.8415e-06, -2.3649e-02, -6.6960e-02, 1.4533e-01, -1.3255e-01,
-1.9956e-01], grad_fn=<SelectBackward0>)
fc2.bias tensor(-0.0092, grad_fn=<SelectBackward0>)
=====================
singlenet2_model_copy1 params:
fc1.weight tensor([ 0.1115, 0.1645, -0.1662, -0.1644, -0.0179, 0.0725, 0.0042, -0.1755,
-0.0771, 0.0627, 0.1054, -0.1215, 0.1146, -0.0730, -0.1309, -0.1944,
0.0711, -0.0614, 0.1332, -0.1228, 0.1507, -0.0283, -0.0952, 0.0868,
-0.0444, -0.2311, -0.0247, 0.2282, -0.0143, 0.0593, 0.0265, -0.0169],
grad_fn=<SelectBackward0>)
fc1.bias tensor(0.0887, grad_fn=<SelectBackward0>)
fc2.weight tensor([-0.1003, -0.0925, 0.1581, -0.0398, 0.0948, 0.1208, 0.0044, -0.1239,
-0.1828, 0.1017, -0.0602, 0.0095, 0.1549, 0.0317, 0.1281, -0.0246,
-0.0754, -0.1086, 0.1022, -0.0568, 0.0397, -0.0572, 0.1700, 0.0059,
-0.0931, 0.0382, -0.0554, 0.1305], grad_fn=<SelectBackward0>)
fc2.bias tensor(-0.0009, grad_fn=<SelectBackward0>)
fc3.weight tensor([ 5.5309e-07, 1.1611e-01, 8.6269e-02, 2.3212e-03, -3.8124e-03,
2.7853e-07, 3.2120e-02, -2.2453e-02, 6.2886e-02, -1.5326e-01,
-6.2916e-06, 4.1156e-03, -3.0080e-03, -5.2712e-03, 6.7732e-02,
-3.2713e-05], grad_fn=<SelectBackward0>)
fc3.bias tensor(-0.1960, grad_fn=<SelectBackward0>)
=====================
singlenet1_model_copy2 params:
fc1.weight tensor([ 0.0028, -0.0714, -0.0082, -0.1686, -0.0443, 0.0451, -0.0433, -0.0086,
-0.0661, 0.1573, 0.1775, 0.1070, 0.0654, -0.2122, 0.1304, 0.0477,
-0.0042, -0.1155, 0.0832, 0.0186, -0.0247, 0.0104, 0.0527, 0.1124,
0.0508, -0.1163, 0.0638, 0.0262, -0.0428, -0.0823, -0.0159, -0.2350],
grad_fn=<SelectBackward0>)
fc1.bias tensor(-0.1294, grad_fn=<SelectBackward0>)
fc2.weight tensor([ 1.2825e-01, 5.7429e-02, 2.5881e-06, 4.1123e-05, 2.0126e-02,
-6.1129e-02, 6.5054e-02, -7.6426e-02, 1.5042e-01, -2.2556e-01,
9.8415e-06, -2.3649e-02, -6.6960e-02, 1.4533e-01, -1.3255e-01,
-1.9956e-01], grad_fn=<SelectBackward0>)
fc2.bias tensor(-0.0092, grad_fn=<SelectBackward0>)
=====================
singlenet2_model_copy2 params:
fc1.weight tensor([ 0.1115, 0.1645, -0.1662, -0.1644, -0.0179, 0.0725, 0.0042, -0.1755,
-0.0771, 0.0627, 0.1054, -0.1215, 0.1146, -0.0730, -0.1309, -0.1944,
0.0711, -0.0614, 0.1332, -0.1228, 0.1507, -0.0283, -0.0952, 0.0868,
-0.0444, -0.2311, -0.0247, 0.2282, -0.0143, 0.0593, 0.0265, -0.0169],
grad_fn=<SelectBackward0>)
fc1.bias tensor(0.0887, grad_fn=<SelectBackward0>)
fc2.weight tensor([-0.1003, -0.0925, 0.1581, -0.0398, 0.0948, 0.1208, 0.0044, -0.1239,
-0.1828, 0.1017, -0.0602, 0.0095, 0.1549, 0.0317, 0.1281, -0.0246,
-0.0754, -0.1086, 0.1022, -0.0568, 0.0397, -0.0572, 0.1700, 0.0059,
-0.0931, 0.0382, -0.0554, 0.1305], grad_fn=<SelectBackward0>)
fc2.bias tensor(-0.0009, grad_fn=<SelectBackward0>)
fc3.weight tensor([ 5.5309e-07, 1.1611e-01, 8.6269e-02, 2.3212e-03, -3.8124e-03,
2.7853e-07, 3.2120e-02, -2.2453e-02, 6.2886e-02, -1.5326e-01,
-6.2916e-06, 4.1156e-03, -3.0080e-03, -5.2712e-03, 6.7732e-02,
-3.2713e-05], grad_fn=<SelectBackward0>)
fc3.bias tensor(-0.1960, grad_fn=<SelectBackward0>)
The results clearly show that all three schemes now produce identical network parameters. Timing-wise, the `optimizer.zero_grad` approach was a bit faster here (although across repeated runs, the timings of the three schemes showed no consistent pattern, so this conclusion needs further verification).
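One likely reason the timings are noisy is that `time.time()` wraps the whole training step, in which `zero_grad` is only a tiny fraction of the cost. A hedged alternative (my suggestion, not part of the original experiments) is to isolate just the zeroing call with `timeit`; on a GPU one would additionally need `torch.cuda.synchronize()` around the timed region:
import timeit
# time only the zeroing calls themselves, repeated many times
t_model = timeit.timeit(singlenet1_model.zero_grad, number=100000)
t_optim = timeit.timeit(optimizer.zero_grad, number=100000)
print("model.zero_grad:", t_model, "optimizer.zero_grad:", t_optim)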
Experiment scenario 3
Scenario 3 mainly explores how the different gradient-zeroing schemes affect runtime; parameter correctness is not verified again here, since scenario 2 already covered that fairly thoroughly.
In scenario 3, define three networks: `singlenet1_model`, `singlenet2_model`, and `multinet_model`, where `multinet_model` contains the parameters of `singlenet1_model` and `singlenet2_model`. Define a corresponding `optimizer` for each of the three networks.
singlenet1_model = SingleNet1()
singlenet2_model = SingleNet2()
multinet_model = MultiNet(singlenet1_model, singlenet2_model)
optimizer_singlenet1 = optim.Adam(singlenet1_model.parameters(), lr=0.01)
optimizer_singlenet2 = optim.Adam(singlenet2_model.parameters(), lr=0.01)
optimizer_multinet = optim.Adam(multinet_model.parameters(), lr=0.01)
There are three ways to train `singlenet1_model` alone:
- `singlenet1_model.zero_grad` with `optimizer_singlenet1.step`
- `singlenet1_model.zero_grad` with `optimizer_multinet.step`
- `optimizer_multinet.zero_grad` with `optimizer_multinet.step`
import time
# way 1
Epoch = 100000
start = time.time()
for epoch in range(Epoch):
    singlenet1_model.zero_grad()
    logits_1 = singlenet1_model(train_data)
    loss = critirion(logits_1, label)
    loss.backward()
    optimizer_singlenet1.step()
end = time.time()
print("way 1 time consumption", end-start)
# way 2
start = time.time()
for epoch in range(Epoch):
    singlenet1_model.zero_grad()
    logits_1 = singlenet1_model(train_data)
    loss = critirion(logits_1, label)
    loss.backward()
    optimizer_multinet.step()
end = time.time()
print("way 2 time consumption", end-start)
# way 3
start = time.time()
for epoch in range(Epoch):
    optimizer_multinet.zero_grad()
    logits_1 = singlenet1_model(train_data)
    loss = critirion(logits_1, label)
    loss.backward()
    optimizer_multinet.step()
end = time.time()
print("way 3 time consumption", end-start)
Theoretically, way 1 should take less time than the other two, since its `step` only has to walk over `singlenet1_model`'s parameters. Over many repeated runs, this held in most cases.
Training both models boils down to two schemes:
- each model calls `zero_grad` separately and is optimized by its own `optimizer`
- `optimizer_multinet.zero_grad` with `optimizer_multinet.step`
import time
# way 1
Epoch = 100000
start = time.time()
for epoch in range(Epoch):
    singlenet1_model.zero_grad()
    singlenet2_model.zero_grad()
    logits_1 = singlenet1_model(train_data)
    logits_2 = singlenet2_model(train_data)
    loss = critirion(logits_1 + logits_2, label)
    loss.backward()
    optimizer_singlenet1.step()
    optimizer_singlenet2.step()
end = time.time()
print("way 1 time consumption", end-start)
# way 2
start = time.time()
for epoch in range(Epoch):
    optimizer_multinet.zero_grad()
    logits_1 = multinet_model(train_data)
    loss = critirion(logits_1, label)
    loss.backward()
    optimizer_multinet.step()
end = time.time()
print("way 2 time consumption", end-start)
This time the results show that way 2 is clearly faster than way 1, which is plausible: one `zero_grad` call and one `step` call replace two of each.
Summary
This post explored the difference and relationship between `model.zero_grad` and `optimizer.zero_grad` in different experimental scenarios. Both methods zero a network's gradients, but depending on the application scenario one of them is the better choice. To summarize:
- When there is only one `model` and the `optimizer` contains only that model's parameters, `model.zero_grad` and `optimizer.zero_grad` are equivalent and either can be used.
- When there are multiple models and one `optimizer` contains the parameters of all of them, and all of them need to be trained, `optimizer.zero_grad` is the better approach: it beats calling `zero_grad` on every model both in time spent and in avoiding mistakes.
- When there are multiple models, each model (or subset of models) has its own `optimizer`, and there is additionally a `total_optimizer` containing the parameters of several models: if you only want to train one model or a subset of models, call `model.zero_grad` on the ones being trained and step their corresponding optimizers; if you want to train all of the models, `total_optimizer.zero_grad` is the better way.
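For reference, the single-model case reduces to the standard PyTorch training-loop pattern (a generic, self-contained sketch; the model, data, and hyperparameters are placeholders):
import torch
import torch.nn as nn
import torch.optim as optim

model = nn.Linear(32, 1)                      # placeholder model
optimizer = optim.Adam(model.parameters(), lr=0.01)
criterion = nn.MSELoss()
x, y = torch.randn(8, 32), torch.randn(8, 1)  # placeholder data

for _ in range(100):
    # equivalent to model.zero_grad() here, since the optimizer holds
    # exactly this model's parameters
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()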