Assertion `t >= 0 && t < n_classes` failed.

The error output is as follows:

/opt/conda/conda-bld/pytorch_1646755953518/work/aten/src/ATen/native/cuda/Loss.cu:257: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [8,0,0] Assertion `t >= 0 && t < n_classes` failed.
terminate called after throwing an instance of 'c10::CUDAError'
  what():  CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Exception raised from query at /opt/conda/conda-bld/pytorch_1646755953518/work/aten/src/ATen/cuda/CUDAEvent.h:95 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x4d (0x7ff3b4f7a1bd in /root/miniconda3/envs/pyskl/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x11a (0x7ff3f2e086ea in /root/miniconda3/envs/pyskl/lib/python3.7/site-packages/torch/lib/libtorch_cuda_cpp.so)
frame #2: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x50 (0x7ff3f2e0acd0 in /root/miniconda3/envs/pyskl/lib/python3.7/site-packages/torch/lib/libtorch_cuda_cpp.so)
frame #3: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x145 (0x7ff3f2e0bf65 in /root/miniconda3/envs/pyskl/lib/python3.7/site-packages/torch/lib/libtorch_cuda_cpp.so)
frame #4: <unknown function> + 0xc9039 (0x7ff44b0d4039 in /root/miniconda3/envs/pyskl/lib/python3.7/site-packages/torch/lib/../../../../libstdc++.so.6)
frame #5: <unknown function> + 0x8609 (0x7ff475c76609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x43 (0x7ff475b9b163 in /usr/lib/x86_64-linux-gnu/libc.so.6)

Tracking down the cause:

This is what a sample looks like in the standard .pkl file provided on the official site:

(pyskl) root@autodl-container-b0a24f8ab3-d2df5a90:~/pyskl# python show_pkl.py
{'frame_dir': 'S001C001P001R001A001', 'label': 0, 'img_shape': (1080, 1920), 'original_shape': (1080, 1920), 'total_frames': 103, 'keypoint': array([[[[1032. ,  334.8],
         [1041. ,  325.8],
         [1023.5,  325.8],
         ...,
         [1028. ,  611.5],
         [1063. ,  704. ],
         [1037. ,  695. ]],

        [[1032. ,  334. ],
         [1041. ,  325. ],
         [1023. ,  325. ],
         ...,
         [1027. ,  612.5],
         [1063. ,  707. ],
         [1036. ,  693.5]],

        [[1032. ,  334. ],
         [1041. ,  325. ],
         [1023. ,  325. ],
         ...,
         [1027. ,  612.5],
         [1063. ,  707. ],
         [1036. ,  698. ]],

        ...,

        [[1037. ,  321.8],
         [1050. ,  317.5],
         [1033. ,  313. ],
         ...,
         [1028. ,  612. ],
         [1064. ,  704. ],
         [1037. ,  695.5]],

        [[1039. ,  324. ],
         [1048. ,  315.2],
         [1035. ,  315.2],
         ...,
         [1030. ,  611. ],
         [1066. ,  703.5],
         [1039. ,  695. ]],

        [[1037. ,  322. ],
         [1050. ,  317.5],
         [1033. ,  313.2],
         ...,
         [1028. ,  613.5],
         [1064. ,  701.5],
         [1037. ,  697. ]]]], dtype=float16), 'keypoint_score': array([[[0.934 , 0.9766, 0.9736, ..., 0.876 , 0.8857, 0.892 ],
        [0.9546, 0.993 , 0.989 , ..., 0.877 , 0.9043, 0.9014],
        [0.9536, 0.9937, 0.988 , ..., 0.8867, 0.907 , 0.903 ],
        ...,
        [0.9365, 0.9043, 0.9414, ..., 0.8955, 0.888 , 0.9033],
        [0.9585, 0.9385, 0.939 , ..., 0.8984, 0.9126, 0.9146],
        [0.9395, 0.904 , 0.9453, ..., 0.898 , 0.8813, 0.886 ]]],
      dtype=float16)}

Note that 'label' starts from 0:

'frame_dir': 'S001C001P001R001A001', 'label': 0,
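
Reading the label off one printed sample works, but scanning the whole file is more reliable. The following is a minimal sketch, not the show_pkl.py script used above (that script is not shown in this post). It assumes the pickle holds either a plain list of sample dicts or a dict with an 'annotations' key holding such a list; `load_annotations` and `label_range` are hypothetical helper names.

```python
import pickle


def load_annotations(path):
    """Return the list of per-sample dicts stored in a pickle file.

    Assumption: the file holds either a list of sample dicts, or a dict
    whose 'annotations' key holds such a list.
    """
    with open(path, "rb") as f:
        data = pickle.load(f)
    if isinstance(data, dict) and "annotations" in data:
        data = data["annotations"]
    if not isinstance(data, list):
        raise TypeError("unexpected pickle layout: %r" % type(data))
    return data


def label_range(annotations):
    """Return (smallest label, largest label) over all samples."""
    labels = [int(sample["label"]) for sample in annotations]
    return min(labels), max(labels)
```

Calling `label_range(load_annotations("my_data.pkl"))` (the file name is a placeholder) gives the two numbers to compare against the valid range, 0 up to n_classes - 1.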

A sample from my own custom .pkl file, on the other hand, looks like this:

(pyskl) root@autodl-container-b0a24f8ab3-d2df5a90:~/pyskl# python show_pkl.py
{'frame_dir': 'S001C001P004R002A026', 'label': 26, 'img_shape': (540, 960), 'original_shape': (540, 960), 'total_frames': 90, 'num_person_raw': 1, 'keypoint': array([[[[501.2, 145.1],
         [504. , 140. ],
         [496.2, 137.5],
         ...,
         [475.8, 306.8],
         [496.2, 355.5],
         [470.5, 360.8]],

        [[501.5, 145.2],
         [504. , 140.1],
         [496.2, 137.5],
         ...,
         [475.8, 307. ],
         [493.8, 353.2],
         [470.8, 358.2]],

        [[501.8, 145.1],
         [504.2, 140. ],
         [496.5, 137.4],
         ...,
         [476. , 307.2],
         [494. , 353.5],
         [471. , 358.5]],

        ...,

        [[545. , 162.2],
         [547.5, 157.4],
         [540.5, 157.4],
         ...,
         [501.5, 303.2],
         [496.5, 351.8],
         [474.8, 325. ]],

        [[543.5, 159.8],
         [548.5, 157.2],
         [538.5, 154.8],
         ...,
         [501.8, 300. ],
         [499.2, 349. ],
         [474.5, 324.5]],

        [[543.5, 158.1],
         [548.5, 155.6],
         [538.5, 153.1],
         ...,
         [501.5, 299.2],
         [499. , 348.8],
         [476.8, 324. ]]]], dtype=float16), 'keypoint_score': array([[[0.9663, 0.9497, 0.9375, ..., 0.866 , 0.8975, 0.8516],
        [0.9653, 0.948 , 0.9365, ..., 0.87  , 0.904 , 0.8447],
        [0.967 , 0.9478, 0.9375, ..., 0.8667, 0.9062, 0.8486],
        ...,
        [0.937 , 0.9414, 0.9307, ..., 0.8574, 0.794 , 0.5063],
        [0.938 , 0.938 , 0.937 , ..., 0.871 , 0.7666, 0.513 ],
        [0.9336, 0.94  , 0.9365, ..., 0.851 , 0.7637, 0.5107]]],
      dtype=float16)}

 

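So the next step is to make the custom labels start from 0 instead of 1. In the official file, action A001 maps to label 0, while in my file action A026 maps to label 26, so my labels are 1-based. The loss kernel requires every target t to satisfy 0 <= t < n_classes, and a 1-based labeling presumably pushes the largest label up to n_classes, which triggers the assertion.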
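
One way to do this is sketched below. It assumes the annotations are already loaded as a list of dicts with a 'label' key, as in the samples printed above, and that every label is off by exactly one; `find_bad_labels` and `shift_labels_to_zero_based` are hypothetical helper names, not part of any library.

```python
def find_bad_labels(annotations, num_classes):
    """Return the samples whose label is outside the range [0, num_classes)."""
    return [s for s in annotations if not 0 <= int(s["label"]) < num_classes]


def shift_labels_to_zero_based(annotations):
    """Return copies of the samples with every label reduced by 1.

    Assumption: the labels are currently 1-based. A label below 1 means
    that assumption is wrong, so the function refuses to continue.
    """
    if any(int(s["label"]) < 1 for s in annotations):
        raise ValueError("found a label below 1; labels do not look 1-based")
    # Build new dicts so the input list is left unchanged.
    return [{**s, "label": int(s["label"]) - 1} for s in annotations]
```

After shifting, `find_bad_labels(fixed, num_classes)` should return an empty list before the file is written back with `pickle.dump` and used for training.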

Reference:

nll_loss_forward_reduce_cuda_kernel_2d: Assertion `t >= 0 && t < n__classes` failed._loss.cu:271: block: [0,0,0], thread: [1,0,0] asser-CSDN blog
