An error came up while saving the network graph with tensorboardX's SummaryWriter.
Error message
TracerWarning: Trace had nondeterministic nodes. Did you forget call .eval() on your model? Nodes:
%x.3 : Float(16, 16, 500, 1, strides=[8000, 500, 1, 1], requires_grad=1, device=cpu) = aten::dropout(%263, %88, %89) # C:\Users\lenovo\.conda\envs\EEGnet\lib\site-packages\torch\nn\functional.py:1266:0
%input.17 : Float(16, 4, 16, 500, strides=[32000, 1, 2000, 4], requires_grad=1, device=cpu) = aten::dropout(%266, %133, %134) # C:\Users\lenovo\.conda\envs\EEGnet\lib\site-packages\torch\nn\functional.py:1266:0
%input.29 : Float(16, 4, 4, 125, strides=[2000, 1, 500, 4], requires_grad=1, device=cpu) = aten::dropout(%270, %186, %187) # C:\Users\lenovo\.conda\envs\EEGnet\lib\site-packages\torch\nn\functional.py:1266:0
This may cause errors in trace checking. To disable trace checking, pass check_trace=False to torch.jit.trace()
_check_trace(
C:\Users\lenovo\.conda\envs\EEGnet\lib\site-packages\torch\jit\_trace.py:1093: TracerWarning: Output nr 1. of the traced function does not match the corresponding output of the Python function. Detailed error:
Tensor-likes are not close!
Mismatched elements: 16 / 16 (100.0%)
Greatest absolute difference: 0.05825650691986084 at index (1, 0) (up to 1e-05 allowed)
Greatest relative difference: 0.1222612684738262 at index (15, 0) (up to 1e-05 allowed)
_check_trace(
The key line:
Trace had nondeterministic nodes. Did you forget call .eval() on your model?
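The warning comes from the dropout layers (the aten::dropout nodes in the log): add_graph traces the model via torch.jit.trace, as the traceback above shows, and the trace check reruns the model to compare outputs. In train mode, dropout zeroes a different random subset of activations on each run, so the two outputs disagree, hence "Tensor-likes are not close!". A minimal sketch that reproduces this with a hypothetical toy module (not the EEGNet model from this post):

import torch
import torch.nn as nn

m = nn.Sequential(nn.Linear(8, 8), nn.Dropout(0.5))
x = torch.randn(4, 8)

# Train mode: the trace-check rerun sees a different dropout mask,
# so this emits the same TracerWarning as above.
torch.jit.trace(m, x)

# Eval mode: dropout becomes the identity, the trace is deterministic,
# and the warning disappears.
m.eval()
torch.jit.trace(m, x)

Passing check_trace=False to torch.jit.trace, as the log suggests, merely silences the check; switching the model to eval mode removes the actual nondeterminism.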
Solution
- The line that triggers the error:
writer.add_graph(net, (inputs,))
- Full code:
# Assumes net, optimizer, criterion, evaluate, batch_size and the data
# splits (X_train/y_train, X_val/y_val, X_test/y_test) are defined earlier.
writer = SummaryWriter('./Result')

# Training loop
for epoch in range(200):
    print("\nEpoch ", epoch)
    running_loss = 0.0
    for i in range(len(X_train) // batch_size - 1):
        s = i * batch_size
        e = i * batch_size + batch_size
        inputs = torch.from_numpy(X_train[s:e])
        labels = torch.FloatTensor(np.array([y_train[s:e]]).T * 1.0)
        # wrap them in Variable (a no-op since PyTorch 0.4; plain tensors work too)
        inputs, labels = Variable(inputs), Variable(labels)
        # zero the parameter gradients
        optimizer.zero_grad()
        # forward + backward + optimize
        outputs = net(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        running_loss += loss.item()
        # if i == 1:
        #     with SummaryWriter(comment='Net') as w:
        #         w.add_graph(net, (inputs,))

    # Validation
    params = ["acc", "auc", "fmeasure"]
    print(params)
    print("Training Loss ", running_loss)
    train_acc, train_auc, train_fmeasure = evaluate(net, X_train, y_train, params)
    print("Train - ", train_acc, train_auc, train_fmeasure)
    val_acc, val_auc, val_fmeasure = evaluate(net, X_val, y_val, params)
    print("Validation - ", val_acc, val_auc, val_fmeasure)
    test_acc, test_auc, test_fmeasure = evaluate(net, X_test, y_test, params)
    print("Test - ", test_acc, test_auc, test_fmeasure)

    # net.eval()
    writer.add_graph(net, (inputs,))  # called every epoch while net is in train mode
    tags = ["data/train_val_acc", "data/accuracy", "data/learning_rate"]  # tags for the plots
    writer.add_scalars(tags[1], {'trainACC': train_acc, 'valACC': val_acc, 'testACC': test_acc}, epoch)  # train/val/test accuracy in one chart
    writer.add_scalar(tags[2], optimizer.param_groups[0]["lr"], epoch)  # learning-rate curve
- Analysis: the problem is that add_graph(net, (inputs,)) should be called once after training completes, not inside the epoch loop, as shown in the sketch below.
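A minimal sketch of the fix, assuming the same net, inputs, and writer as above: move the graph export out of the loop and put the model in eval mode first, so the dropout nodes trace deterministically.

# ... training loop as above, without the per-epoch add_graph call ...

net.eval()                          # dropout off -> deterministic trace
writer.add_graph(net, (inputs,))    # save the graph once, after training
writer.close()

If the graph really must be logged mid-training, the same toggle works inline: call net.eval() before add_graph and net.train() immediately after, so subsequent epochs still train with dropout active.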