RuntimeError: CUDA error: an illegal memory access was encountered

Background

This exception occurred while training a BEV-based point cloud model with mmdetection3d.

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/usr/local/lib/python3.8/dist-packages/torch/cuda/amp/autocast_mode.py", line 217, in decorate_fwd
    return fwd(*_cast(args, cast_inputs), **_cast(kwargs, cast_inputs))
  File "/usr/local/lib/python3.8/dist-packages/torch/cuda/amp/autocast_mode.py", line 160, in _cast
    return type(value)(iterable)
  File "/usr/local/lib/python3.8/dist-packages/torch/cuda/amp/autocast_mode.py", line 158, in <lambda>
    iterable = map(lambda v: _cast(v, dtype), value)
  File "/usr/local/lib/python3.8/dist-packages/torch/cuda/amp/autocast_mode.py", line 150, in _cast
    return value.to(dtype) if is_eligible else value
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
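
The last line of the log suggests CUDA_LAUNCH_BLOCKING=1. A minimal sketch of setting it from Python follows. The helper name enable_blocking_launches is hypothetical, and the variable has to be set before the first CUDA call, so setting it before importing torch is the safe choice. Exporting it in the shell before launching training should work just as well.

```python
import os


def enable_blocking_launches() -> None:
    """Make CUDA kernel launches synchronous so the traceback points at the failing op."""
    os.environ["CUDA_LAUNCH_BLOCKING"] = "1"


enable_blocking_launches()
# import torch  # import torch (and start training) only after the variable is set
```

Synchronous launches slow training down noticeably, so this is meant for debugging runs only.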

Root cause analysis

1. The model uses spconv. In mmdet3d, the relevant code is the following forward method in mmdet3d/models/middle_encoders/sparse_encoder.py:

@auto_fp16(apply_to=("voxel_features",))
def forward(self, voxel_features, coors, batch_size, **kwargs):
    """Forward of SparseEncoder.

    Args:
        voxel_features (torch.float32): Voxel features in shape (N, C).
        coors (torch.int32): Coordinates in shape (N, 4),
            the columns in the order of (batch_idx, z_idx, y_idx, x_idx).
        batch_size (int): Batch size.

    Returns:
        dict: Backbone features.
    """
    coors = coors.int()
    input_sp_tensor = spconv.SparseConvTensor(
        voxel_features, coors, self.sparse_shape, batch_size
    )
    # The statement that raises the error
    x = self.conv_input(input_sp_tensor)

Going one level deeper, self.conv_input is a SparseConvolution, defined in mmdet3d/ops/spconv/conv.py:

def forward(self, input):
    ...
    if self.fused_bn:
        assert self.bias is not None
        out_features = ops.fused_indice_conv(
            features,
            self.weight,
            self.bias,
            indice_pairs.to(device),
            indice_pair_num,
            outids.shape[0],
            self.inverse,
            self.subm,
        )
    else:
        if self.subm:
            # Where the exception is raised
            out_features = Fsp.indice_subm_conv(
                features, self.weight, indice_pairs.to(device), indice_pair_num, outids.shape[0]
            )
    ...

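2. Tracing the implementation of that function further shows that the root cause is an incorrectly defined sparse_shape. Specifically, sparse_shape is determined by two variables, point_cloud_range and voxel_size: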

Suppose: point_cloud_range = [-50.0, -15.0, -5.0, 50.0, 15.0, 3.0]; voxel_size = [0.1, 0.1, 0.2]

Then: sparse_shape = [(50-(-50))/0.1, (15-(-15))/0.1, (3-(-5))/0.2+1] = [1000, 300, 41]

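However, my config set sparse_shape to [300, 600, 41], which caused the error: voxel coordinates can then fall outside the declared shape, so the sparse convolution kernel ends up indexing memory out of bounds.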
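
The arithmetic above can be sketched as a small config check. This is only an illustration under the assumptions of this post: the function name expected_sparse_shape is hypothetical, the result is kept in the same axis order as the formula above, and the +1 is applied to the last axis only. Stock mmdet3d configs usually list sparse_shape in the same axis order as the coors columns (z, y, x), so reorder the result to match your own config.

```python
from typing import List, Sequence


def expected_sparse_shape(point_cloud_range: Sequence[float],
                          voxel_size: Sequence[float]) -> List[int]:
    """Grid size per axis, in the axis order used in this post, with +1 on the last axis."""
    mins, maxs = point_cloud_range[:3], point_cloud_range[3:]
    # round() guards against floating-point noise in the division
    grid = [round((hi - lo) / size) for lo, hi, size in zip(mins, maxs, voxel_size)]
    grid[2] += 1
    return grid


point_cloud_range = [-50.0, -15.0, -5.0, 50.0, 15.0, 3.0]
voxel_size = [0.1, 0.1, 0.2]
configured = [300, 600, 41]  # the wrong value from this post

expected = expected_sparse_shape(point_cloud_range, voxel_size)
print(expected)  # [1000, 300, 41]
if configured != expected:
    print(f"sparse_shape mismatch: configured {configured}, expected {expected}")
```

Running a check like this when the config is loaded turns an opaque CUDA error into a readable message.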

Solution

As described in the root cause analysis, recompute sparse_shape from point_cloud_range and voxel_size.
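
To confirm the diagnosis before the data reaches the GPU kernel, the voxel coordinates can be checked against sparse_shape on the CPU. The sketch below is a debugging aid, not part of mmdet3d or spconv: find_out_of_range_coors is a hypothetical helper, the data is a toy example, and it assumes that the spatial columns of coors and the entries of sparse_shape share the same axis order.

```python
from typing import Iterable, List, Sequence, Tuple


def find_out_of_range_coors(coors: Iterable[Sequence[int]],
                            sparse_shape: Sequence[int],
                            batch_size: int) -> List[Tuple[int, ...]]:
    """Return the rows (batch_idx, i0, i1, i2) that do not fit into sparse_shape."""
    bad = []
    for row in coors:
        batch_idx, spatial = row[0], row[1:]
        in_batch = 0 <= batch_idx < batch_size
        in_grid = all(0 <= idx < dim for idx, dim in zip(spatial, sparse_shape))
        if not (in_batch and in_grid):
            bad.append(tuple(row))
    return bad


# Toy data; with a real tensor one could pass coors.tolist() instead.
coors = [(0, 0, 10, 20), (1, 40, 299, 999), (0, 5, 300, 7)]
bad_rows = find_out_of_range_coors(coors, sparse_shape=[41, 300, 1000], batch_size=2)
print(bad_rows)  # [(0, 5, 300, 7)]
```

Any row returned here would index past the declared grid, which is consistent with the illegal memory access seen in the traceback.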

References

A similar issue was reported for mmdet3d: Getting "CUDA error: an illegal memory access was encountered" on MVXNet #382
