Reference Case 1
import torch
from torch import nn


class Filter(nn.Module):
    def __init__(self):
        super().__init__()
        self.resample_filter = torch.rand(4, 4)

    def forward(self, x):
        x = torch.nn.functional.pad(x, [1, 1, 1, 1])  # If this line is commented out, it works.
        # Build a depthwise kernel of shape [C, 1, 4, 4] from the 4x4 filter.
        weight = self.resample_filter[None, None].repeat([x.shape[1], 1] + [1] * self.resample_filter.ndim)
        x = torch.nn.functional.conv2d(input=x, padding=1, weight=weight, groups=x.shape[1])
        return x


x = torch.rand((1, 3, 256, 256))
f = Filter()
y = f(x)
torch.onnx.export(f, x, "test-filter.onnx", opset_version=15)
Error message
File "/root/miniconda3/lib/python3.9/site-packages/torch/onnx/symbolic_opset9.py", line 2519, in _convolution
raise errors.SymbolicValueError(
torch.onnx.errors.SymbolicValueError: Unsupported: ONNX export of convolution for kernel of unknown shape. [Caused by the value '28 defined in (%28 : Float(*, *, *, *, strides=[199692, 66564, 258, 1], requires_grad=0, device=cpu) = onnx::Pad[mode="constant"](%0, %27, %3), scope: __main__.Filter:: # /mnt/f/codes/onnx_export/test.py:10:0
)' (type 'Tensor') in the TorchScript graph. The containing node has kind 'onnx::Pad'.]
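This kind of problem usually occurs when the convolution weight is not an ordinary, directly trained parameter but is computed from another branch of the graph.
Debugging:
Open torch/onnx/symbolic_opset9.py at line 2519 (the location shown in the traceback above) and add a print statement there.
The error message reads: Caused by the value '28 defined in (%28 : Float(*, *, *, *, strides=[199692, 66564, 258, 1], requires_grad=0, device=cpu)
Starting from %28, trace upward through the graph until you find the first value whose shape contains * (an unknown dimension):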
%28 : Float(*, *, *, *, strides=[199692, 66564, 258, 1], requires_grad=0, device=cpu) = onnx::Pad[mode="constant"](%0, %27, %3), scope: __main__.Filter:: # /mnt/f/codes/onnx_export/test.py:10:0
This points to line 10 of test.py, that is, the pad statement.
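When the graph dump is available as text, the upward trace can also be done mechanically. The following is only a minimal sketch: `first_unknown_shape` and its regular expression are hypothetical helpers, not part of PyTorch, and the sketch assumes every value is printed in the `%name : Dtype(dims, strides=[...], ...)` form seen in the error message. The `%0` line in the sample is made up for illustration; the `%28` line is the one from the error above.

```python
import re

# Matches "%name : Dtype(...)" as printed in a TorchScript/ONNX graph dump.
VALUE_RE = re.compile(r"%(?P<name>[\w.]+) : \w+\((?P<body>[^)]*)\)")


def first_unknown_shape(graph_text):
    """Return (value_name, line) for the first value whose dims contain '*'."""
    for line in graph_text.splitlines():
        match = VALUE_RE.search(line)
        if match is None:
            continue
        # Only look at the dims, i.e. the part before "strides=".
        dims = match.group("body").split("strides=")[0]
        if "*" in dims:
            return match.group("name"), line.strip()
    return None


# Sample dump: the %0 line is invented, the %28 line comes from the error message.
SAMPLE_GRAPH = """
%0 : Float(1, 3, 256, 256, strides=[196608, 65536, 256, 1], requires_grad=0, device=cpu) = prim::Param()
%28 : Float(*, *, *, *, strides=[199692, 66564, 258, 1], requires_grad=0, device=cpu) = onnx::Pad[mode="constant"](%0, %27, %3), scope: __main__.Filter:: # /mnt/f/codes/onnx_export/test.py:10:0
"""

if __name__ == "__main__":
    print(first_unknown_shape(SAMPLE_GRAPH))
```

The source location at the end of the returned line (here test.py:10) is the statement to inspect.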
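This is actually a bug in the underlying shape inference.
One solution is to modify PyTorch itself so that shape inference handles this case.
The other is to work around it so that the shape is already known when the tensor enters the conv. Here we add a reshape operator: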
class Filter(nn.Module):
    def __init__(self):
        super().__init__()
        self.resample_filter = torch.rand(4, 4)

    def forward(self, x):
        x = torch.nn.functional.pad(x, [1, 1, 1, 1])  # If this line is commented out, it works.
        shape = x.shape
        shape = [int(elem) for elem in shape]
        x = x.reshape(shape)
        weight = self.resample_filter[None, None].repeat([x.shape[1], 1] + [1] * self.resample_filter.ndim)
        x = torch.nn.functional.conv2d(input=x, padding=1, weight=weight, groups=x.shape[1])
        return x
Note that the change consists of three lines:
shape = x.shape
shape = [int(elem) for elem in shape]
x = x.reshape(shape)
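This makes the shape of x fully static again.
With this change the code can be exported.
If you want to export with a dynamic image size, consider fixing only the batch and channel dimensions to int and see whether the export succeeds.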
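The dynamic-size variant could look roughly like the sketch below. It is untested as an export fix: `fix_batch_and_channel` and `DynamicSizeFilter` are hypothetical names, the input is assumed to be a 4-D NCHW tensor, and whether leaving H and W untouched is enough for the exporter is exactly what still needs to be tried.

```python
import torch
import torch.nn.functional as F


def fix_batch_and_channel(x):
    """Reshape an NCHW tensor so that only N and C are forced to Python ints."""
    n, c, h, w = x.shape
    # int() freezes N and C as constants during tracing (assumption: this is
    # what the conv exporter needs); H and W are passed through unchanged.
    return x.reshape(int(n), int(c), h, w)


class DynamicSizeFilter(torch.nn.Module):
    """Same filter as above, but only batch and channel are made static."""

    def __init__(self):
        super().__init__()
        self.resample_filter = torch.rand(4, 4)

    def forward(self, x):
        x = F.pad(x, [1, 1, 1, 1])
        x = fix_batch_and_channel(x)
        channels = x.shape[1]
        # Depthwise kernel of shape [C, 1, 4, 4].
        weight = self.resample_filter[None, None].repeat([channels, 1, 1, 1])
        return F.conv2d(input=x, weight=weight, padding=1, groups=channels)


if __name__ == "__main__":
    model = DynamicSizeFilter()
    print(model(torch.rand(1, 3, 16, 20)).shape)
```

For the export itself, the height and width axes can be declared dynamic through the `dynamic_axes` argument of `torch.onnx.export` (together with `input_names`), after which the result should be checked with inputs of different sizes.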
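Sometimes the project code is too complex to tell where the pad is. In that case, insert the statements above directly into the forward of torch's own pad code. For example: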
torch.onnx.errors.SymbolicValueError: Unsupported: ONNX export of convolution for kernel of unknown shape. [Caused by the value '2004 defined in (%2004 : Float(*, *, *, *, strides=[1102464, 5742, 87, 1], requires_grad=0, device=cuda:0) = onnx::Pad[mode="constant"](%xi, %2002, %2003), # /root/miniconda3/lib/python3.9/site-packages/torch/nn/modules/padding.py:205:0
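In that case, insert the code above directly at torch/nn/modules/padding.py:205. Note that it must be applied to the result of the pad computation, not to the input of pad. You can also search the model code for pad operators and apply the same treatment to their results.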
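A similar workaround is to replace the operator with one whose underlying shape inference does not have this bug, for example replacing the pad above with a concat operator.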
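A concat-based replacement for constant zero padding might look like the sketch below. `zero_pad_with_cat` is a hypothetical helper that assumes a 4-D NCHW input and non-negative pad sizes; it reproduces the values of `torch.nn.functional.pad(x, [left, right, top, bottom])` in eager mode, but whether it avoids the export error in a given PyTorch version has to be checked.

```python
import torch


def zero_pad_with_cat(x, left, right, top, bottom):
    """Zero-pad the last two dims of an NCHW tensor using torch.cat only."""
    n, c, h, w = x.shape

    # Pad the width (last dimension) first.
    parts = []
    if left > 0:
        parts.append(x.new_zeros((n, c, h, left)))
    parts.append(x)
    if right > 0:
        parts.append(x.new_zeros((n, c, h, right)))
    x = torch.cat(parts, dim=3)

    # Then pad the height, using the already widened tensor.
    padded_w = w + left + right
    parts = []
    if top > 0:
        parts.append(x.new_zeros((n, c, top, padded_w)))
    parts.append(x)
    if bottom > 0:
        parts.append(x.new_zeros((n, c, bottom, padded_w)))
    return torch.cat(parts, dim=2)


if __name__ == "__main__":
    print(zero_pad_with_cat(torch.rand(1, 3, 256, 256), 1, 1, 1, 1).shape)
```

In the Filter example, the pad line in forward would then become `x = zero_pad_with_cat(x, 1, 1, 1, 1)`.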
It is also advisable to upgrade PyTorch to the latest version, which may have fixed some of these shape inference bugs.