The error output is as follows:
/opt/conda/conda-bld/pytorch_1646755953518/work/aten/src/ATen/native/cuda/Loss.cu:257: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [8,0,0] Assertion `t >= 0 && t < n_classes` failed.
terminate called after throwing an instance of 'c10::CUDAError'
what(): CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Exception raised from query at /opt/conda/conda-bld/pytorch_1646755953518/work/aten/src/ATen/cuda/CUDAEvent.h:95 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x4d (0x7ff3b4f7a1bd in /root/miniconda3/envs/pyskl/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x11a (0x7ff3f2e086ea in /root/miniconda3/envs/pyskl/lib/python3.7/site-packages/torch/lib/libtorch_cuda_cpp.so)
frame #2: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x50 (0x7ff3f2e0acd0 in /root/miniconda3/envs/pyskl/lib/python3.7/site-packages/torch/lib/libtorch_cuda_cpp.so)
frame #3: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x145 (0x7ff3f2e0bf65 in /root/miniconda3/envs/pyskl/lib/python3.7/site-packages/torch/lib/libtorch_cuda_cpp.so)
frame #4: <unknown function> + 0xc9039 (0x7ff44b0d4039 in /root/miniconda3/envs/pyskl/lib/python3.7/site-packages/torch/lib/../../../../libstdc++.so.6)
frame #5: <unknown function> + 0x8609 (0x7ff475c76609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x43 (0x7ff475b9b163 in /usr/lib/x86_64-linux-gnu/libc.so.6)
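Troubleshooting the cause:
This is what a sample from the standard .pkl annotation file provided on the official site looks like: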
(pyskl) root@autodl-container-b0a24f8ab3-d2df5a90:~/pyskl# python show_pkl.py
{'frame_dir': 'S001C001P001R001A001', 'label': 0, 'img_shape': (1080, 1920), 'original_shape': (1080, 1920), 'total_frames': 103, 'keypoint': array([[[[1032. , 334.8],
[1041. , 325.8],
[1023.5, 325.8],
...,
[1028. , 611.5],
[1063. , 704. ],
[1037. , 695. ]],
[[1032. , 334. ],
[1041. , 325. ],
[1023. , 325. ],
...,
[1027. , 612.5],
[1063. , 707. ],
[1036. , 693.5]],
[[1032. , 334. ],
[1041. , 325. ],
[1023. , 325. ],
...,
[1027. , 612.5],
[1063. , 707. ],
[1036. , 698. ]],
...,
[[1037. , 321.8],
[1050. , 317.5],
[1033. , 313. ],
...,
[1028. , 612. ],
[1064. , 704. ],
[1037. , 695.5]],
[[1039. , 324. ],
[1048. , 315.2],
[1035. , 315.2],
...,
[1030. , 611. ],
[1066. , 703.5],
[1039. , 695. ]],
[[1037. , 322. ],
[1050. , 317.5],
[1033. , 313.2],
...,
[1028. , 613.5],
[1064. , 701.5],
[1037. , 697. ]]]], dtype=float16), 'keypoint_score': array([[[0.934 , 0.9766, 0.9736, ..., 0.876 , 0.8857, 0.892 ],
[0.9546, 0.993 , 0.989 , ..., 0.877 , 0.9043, 0.9014],
[0.9536, 0.9937, 0.988 , ..., 0.8867, 0.907 , 0.903 ],
...,
[0.9365, 0.9043, 0.9414, ..., 0.8955, 0.888 , 0.9033],
[0.9585, 0.9385, 0.939 , ..., 0.8984, 0.9126, 0.9146],
[0.9395, 0.904 , 0.9453, ..., 0.898 , 0.8813, 0.886 ]]],
dtype=float16)}
Note that 'label' starts from 0 here:
'frame_dir': 'S001C001P001R001A001', 'label': 0,
By contrast, a sample from the custom .pkl file I built myself looks like this:
(pyskl) root@autodl-container-b0a24f8ab3-d2df5a90:~/pyskl# python show_pkl.py
{'frame_dir': 'S001C001P004R002A026', 'label': 26, 'img_shape': (540, 960), 'original_shape': (540, 960), 'total_frames': 90, 'num_person_raw': 1, 'keypoint': array([[[[501.2, 145.1],
[504. , 140. ],
[496.2, 137.5],
...,
[475.8, 306.8],
[496.2, 355.5],
[470.5, 360.8]],
[[501.5, 145.2],
[504. , 140.1],
[496.2, 137.5],
...,
[475.8, 307. ],
[493.8, 353.2],
[470.8, 358.2]],
[[501.8, 145.1],
[504.2, 140. ],
[496.5, 137.4],
...,
[476. , 307.2],
[494. , 353.5],
[471. , 358.5]],
...,
[[545. , 162.2],
[547.5, 157.4],
[540.5, 157.4],
...,
[501.5, 303.2],
[496.5, 351.8],
[474.8, 325. ]],
[[543.5, 159.8],
[548.5, 157.2],
[538.5, 154.8],
...,
[501.8, 300. ],
[499.2, 349. ],
[474.5, 324.5]],
[[543.5, 158.1],
[548.5, 155.6],
[538.5, 153.1],
...,
[501.5, 299.2],
[499. , 348.8],
[476.8, 324. ]]]], dtype=float16), 'keypoint_score': array([[[0.9663, 0.9497, 0.9375, ..., 0.866 , 0.8975, 0.8516],
[0.9653, 0.948 , 0.9365, ..., 0.87 , 0.904 , 0.8447],
[0.967 , 0.9478, 0.9375, ..., 0.8667, 0.9062, 0.8486],
...,
[0.937 , 0.9414, 0.9307, ..., 0.8574, 0.794 , 0.5063],
[0.938 , 0.938 , 0.937 , ..., 0.871 , 0.7666, 0.513 ],
[0.9336, 0.94 , 0.9365, ..., 0.851 , 0.7637, 0.5107]]],
dtype=float16)}
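To check the whole file rather than a single printed sample, the label range can be summarized with something like the following. This is only a minimal sketch: it assumes the annotations are a list of dicts that each carry an integer 'label' key (as in the samples printed above), and the names `label_range_report` and `num_classes` are introduced here for illustration.

```python
def label_range_report(annotations, num_classes):
    """Summarize labels and list the ones outside [0, num_classes)."""
    labels = [int(item['label']) for item in annotations]
    # Collect each offending label once, in ascending order.
    out_of_range = sorted({t for t in labels if not 0 <= t < num_classes})
    return {
        'min': min(labels),
        'max': max(labels),
        'out_of_range': out_of_range,
    }


# Toy data standing in for a custom annotation list (hypothetical values).
samples = [{'label': 1}, {'label': 2}, {'label': 3}]
report = label_range_report(samples, num_classes=3)
print(report)
```

A minimum of 1 together with a non-empty 'out_of_range' list is the pattern one would expect from 1-based labels.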
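So the next step is to make the labels start from 0 instead of 1. The assertion `t >= 0 && t < n_classes` in the log requires every target label to lie in [0, n_classes), so with 1-based labels the largest class index falls out of range.
References: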
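One possible way to do this is sketched below. It assumes the pickle holds either a plain list of sample dicts or a dict with an 'annotations' list, and that the custom labels are exactly 1-based. The helper names `shift_labels` and `convert_file` are hypothetical and not part of pyskl.

```python
import pickle


def shift_labels(data, offset=1):
    """Subtract `offset` from every sample's label, in place."""
    # Assumption: samples live in a plain list, or under an 'annotations' key.
    samples = data['annotations'] if isinstance(data, dict) else data
    if any(int(s['label']) - offset < 0 for s in samples):
        raise ValueError('a label would become negative; check the offset')
    for s in samples:
        s['label'] = int(s['label']) - offset
    return data


def convert_file(src_path, dst_path, offset=1):
    """Load a pickle, shift its labels, and write the result to a new file."""
    with open(src_path, 'rb') as f:
        data = pickle.load(f)
    with open(dst_path, 'wb') as f:
        pickle.dump(shift_labels(data, offset), f)


# In-memory example with hypothetical 1-based labels.
demo = [{'frame_dir': 'a', 'label': 1}, {'frame_dir': 'b', 'label': 26}]
shifted = shift_labels(demo)
print([s['label'] for s in shifted])
```

Writing to a new file keeps the original annotations intact in case the offset turns out to be wrong. After converting, the number of classes in the training config still has to be larger than the largest label.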