Background
This exception was raised while training a BEV-based point cloud model with mmdetection3d:
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/usr/local/lib/python3.8/dist-packages/torch/cuda/amp/autocast_mode.py", line 217, in decorate_fwd
    return fwd(*_cast(args, cast_inputs), **_cast(kwargs, cast_inputs))
  File "/usr/local/lib/python3.8/dist-packages/torch/cuda/amp/autocast_mode.py", line 160, in _cast
    return type(value)(iterable)
  File "/usr/local/lib/python3.8/dist-packages/torch/cuda/amp/autocast_mode.py", line 158, in <lambda>
    iterable = map(lambda v: _cast(v, dtype), value)
  File "/usr/local/lib/python3.8/dist-packages/torch/cuda/amp/autocast_mode.py", line 150, in _cast
    return value.to(dtype) if is_eligible else value
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
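Because CUDA kernels are launched asynchronously, the Python frame shown in the traceback is not necessarily the one that triggered the fault. Below is a minimal sketch of forcing synchronous launches from inside a training script, as the error message suggests. The helper name is hypothetical, and the sketch assumes it runs before any CUDA work is done.

```python
import os


def enable_blocking_cuda_launches():
    """Ask the CUDA runtime to execute kernels synchronously.

    The variable is read when CUDA is initialised, so call this before
    any CUDA work happens (safest: before importing torch).
    """
    os.environ["CUDA_LAUNCH_BLOCKING"] = "1"


enable_blocking_cuda_launches()
# import torch  # import torch and build the model only after this point
```

This has the same effect as prefixing the launch command with CUDA_LAUNCH_BLOCKING=1. It slows training down noticeably, so it is best kept for debugging runs.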
Root cause analysis
1. The model uses spconv. In mmdet3d the relevant code is the following, from mmdet3d/models/middle_encoders/sparse_encoder.py:
@auto_fp16(apply_to=("voxel_features",))
def forward(self, voxel_features, coors, batch_size, **kwargs):
    """Forward of SparseEncoder.

    Args:
        voxel_features (torch.float32): Voxel features in shape (N, C).
        coors (torch.int32): Coordinates in shape (N, 4),
            the columns in the order of (batch_idx, z_idx, y_idx, x_idx).
        batch_size (int): Batch size.

    Returns:
        dict: Backbone features.
    """
    coors = coors.int()
    input_sp_tensor = spconv.SparseConvTensor(
        voxel_features, coors, self.sparse_shape, batch_size
    )
    # the statement that raises the error
    x = self.conv_input(input_sp_tensor)
Going one level deeper, self.conv_input is defined by SparseConvolution in mmdet3d/ops/spconv/conv.py:
def forward(self, input):
    ...
    if self.fused_bn:
        assert self.bias is not None
        out_features = ops.fused_indice_conv(
            features,
            self.weight,
            self.bias,
            indice_pairs.to(device),
            indice_pair_num,
            outids.shape[0],
            self.inverse,
            self.subm,
        )
    else:
        if self.subm:
            # location where the exception is raised
            out_features = Fsp.indice_subm_conv(
                features, self.weight, indice_pairs.to(device), indice_pair_num, outids.shape[0]
            )
        ...
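2. Digging further into the implementation of this function showed that the root cause was an incorrectly defined sparse_shape. Specifically, sparse_shape is determined by two variables, point_cloud_range and voxel_size:
Assume: point_cloud_range = [-50.0, -15.0, -5.0, 50.0, 15.0, 3.0]; voxel_size = [0.1, 0.1, 0.2]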
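Then: sparse_shape = [(50-(-50))/0.1, (15-(-15))/0.1, (3-(-5))/0.2+1] = [1000, 300, 41]
Note: the shape above is written in (x, y, z) order. The stock mmdet3d configs list sparse_shape in (z, y, x) order, matching the coors columns in the docstring above (for example [41, 1600, 1408]), so check which order your config expects.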
My config, however, set sparse_shape to [300, 600, 41], which caused the error.
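The calculation above can be sketched as a small helper so that sparse_shape is derived from the config rather than typed by hand. The function names are hypothetical and not part of mmdet3d. The +1 on the z axis follows the formula in this post, and the axis order is (x, y, z) as written here.

```python
def compute_grid_size(point_cloud_range, voxel_size):
    """Number of voxels along x, y and z for the given range and voxel size."""
    x_min, y_min, z_min, x_max, y_max, z_max = point_cloud_range
    extents = (x_max - x_min, y_max - y_min, z_max - z_min)
    # round() guards against floating-point error in the division
    return [int(round(e / v)) for e, v in zip(extents, voxel_size)]


def compute_sparse_shape(point_cloud_range, voxel_size):
    """sparse_shape in (x, y, z) order, with +1 on z as in the formula above."""
    nx, ny, nz = compute_grid_size(point_cloud_range, voxel_size)
    return [nx, ny, nz + 1]


point_cloud_range = [-50.0, -15.0, -5.0, 50.0, 15.0, 3.0]
voxel_size = [0.1, 0.1, 0.2]

sparse_shape = compute_sparse_shape(point_cloud_range, voxel_size)
print(sparse_shape)        # [1000, 300, 41]
# Reverse the list if your config expects (z, y, x) order.
print(sparse_shape[::-1])  # [41, 300, 1000]
```

Comparing the result against the value in the config file makes a mismatch such as [300, 600, 41] easy to spot before training starts.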
Solution
As described in the root cause analysis, recompute sparse_shape from point_cloud_range and voxel_size.
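As an optional sanity check, the voxel coordinates can be compared against the declared shape before they reach the sparse convolution. The sketch below assumes that the crash comes from voxel indices falling outside sparse_shape, which is one plausible reading of the analysis above rather than something confirmed in the spconv source. The helper is hypothetical and not part of mmdet3d or spconv. It takes plain Python rows, so a tensor can be passed as coors.tolist(). The example lists the shape in the same (z, y, x) order as the coors columns, which is also an assumption about the config.

```python
def find_out_of_range_voxels(coors, sparse_shape, batch_size):
    """Return the indices of rows in `coors` that fall outside the sparse grid.

    coors: iterable of (batch_idx, d0, d1, d2) integer rows, where d0..d2
        index the three spatial axes in the same order as sparse_shape.
    """
    limits = [batch_size, *sparse_shape]
    bad_rows = []
    for i, row in enumerate(coors):
        # every index must satisfy 0 <= index < limit on its axis
        if any(c < 0 or c >= lim for c, lim in zip(row, limits)):
            bad_rows.append(i)
    return bad_rows


coors = [
    [0, 10, 20, 30],
    [0, 40, 299, 999],  # last valid index on every spatial axis
    [1, 41, 0, 0],      # first spatial index equals sparse_shape[0]: out of range
]
bad = find_out_of_range_voxels(coors, sparse_shape=[41, 300, 1000], batch_size=2)
print(bad)  # [2]
```

A non-empty result on real data suggests that sparse_shape, point_cloud_range and voxel_size are inconsistent with each other.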
References
A similar issue has been reported in mmdet3d: Getting "CUDA error: an illegal memory access was encountered" on MVXNet #382