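Symptom
While training a Swin Transformer model on 4 GPUs on the cluster, the job exited abnormally with an image-loading failure. It happened right after the first epoch finished training, during the validation pass.
Training script: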
cd '/share/home/defaultTenant/ch**11/mmdetection'
bash ./tools/dist_train.sh ./projects/ciping4_new/mask-rcnn_swin-s-p4-w7_fpn_amp-ms-crop-3x_coco.py 4
Error log:
03/26 18:46:06 - mmengine - INFO - Saving checkpoint at 1 epochs
03/26 18:49:23 - mmengine - INFO - Epoch(val) [1][500/611] eta: 0:00:43 time: 0.3182 data_time: 0.2853 memory: 1238
Traceback (most recent call last):
File "./tools/train.py", line 121, in <module>
main()
File "./tools/train.py", line 117, in main
runner.train()
File "/.conda/envs/open-mmlab/lib/python3.8/site-packages/mmengine/runner/runner.py", line 1777, in train
model = self.train_loop.run() # type: ignore
File "/.conda/envs/open-mmlab/lib/python3.8/site-packages/mmengine/runner/loops.py", line 102, in run
self.runner.val_loop.run()
File "/.conda/envs/open-mmlab/lib/python3.8/site-packages/mmengine/runner/loops.py", line 370, in run
for idx, data_batch in enumerate(self.dataloader):
File "/.conda/envs/open-mmlab/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 521, in __next__
data = self._next_data()
File "/.conda/envs/open-mmlab/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1203, in _next_data
return self._process_data(data)
File "/.conda/envs/open-mmlab/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1229, in _process_data
data.reraise()
File "/.conda/envs/open-mmlab/lib/python3.8/site-packages/torch/_utils.py", line 434, in reraise
raise exception
AssertionError: Caught AssertionError in DataLoader worker process 0.
Original Traceback (most recent call last):
File "/.conda/envs/open-mmlab/lib/python3.8/site-packages/torch/utils/data/_utils/worker.py", line 287, in _worker_loop
data = fetcher.fetch(index)
File "/.conda/envs/open-mmlab/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 49, in fetch
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/.conda/envs/open-mmlab/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 49, in <listcomp>
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/.conda/envs/open-mmlab/lib/python3.8/site-packages/mmengine/dataset/base_dataset.py", line 403, in __getitem__
data = self.prepare_data(idx)
File "/.conda/envs/open-mmlab/lib/python3.8/site-packages/mmengine/dataset/base_dataset.py", line 793, in prepare_data
return self.pipeline(data_info)
File "/.conda/envs/open-mmlab/lib/python3.8/site-packages/mmengine/dataset/base_dataset.py", line 60, in __call__
data = t(data)
File "/.conda/envs/open-mmlab/lib/python3.8/site-packages/mmcv/transforms/base.py", line 12, in __call__
return self.transform(results)
File "*/.conda/envs/open-mmlab/lib/python3.8/site-packages/mmcv/transforms/loading.py", line 110, in transform
assert img is not None, f'failed to load image: {filename}'
AssertionError: failed to load image: */datasets/ciping4_new-v2/test/PS48.jpg
Locating the error
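# Excerpt from mmcv/transforms/loading.py (LoadImageFromFile.transform),
# roughly lines 94-102, just above the failing assertion at line 110.
if self.file_client_args is not None:
    file_client = fileio.FileClient.infer_client(
        self.file_client_args, filename)
    img_bytes = file_client.get(filename)
else:
    img_bytes = fileio.get(filename, backend_args=self.backend_args)
# Decode the raw bytes; the backend is chosen by imdecode_backend.
img = mmcv.imfrombytes(
    img_bytes, flag=self.color_type, backend=self.imdecode_backend)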
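Based on my reading of the code above, I wrote a small script to test how mmcv reads the image.
The result: with backend='cv2', img is None; after switching to backend='pillow', the image array prints normally.
The comments in the source code also describe this behavior.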
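The test script itself is not reproduced in this post. A minimal sketch of such a check might look like the following. It assumes mmcv is installed and uses `mmcv.imfrombytes`, the same call as in the excerpt above; the helper names `try_backends` and `check_image` are hypothetical, and the flag is assumed to be `'color'`.

```python
from pathlib import Path


def try_backends(img_bytes, decode, backends=("cv2", "pillow")):
    """Decode the same bytes with each backend and record the outcome."""
    results = {}
    for backend in backends:
        try:
            img = decode(img_bytes, flag="color", backend=backend)
        except Exception as exc:  # a backend may raise instead of returning None
            results[backend] = f"error: {exc!r}"
            continue
        # None is the value that trips the assertion in LoadImageFromFile.
        results[backend] = None if img is None else tuple(img.shape)
    return results


def check_image(path):
    """Hypothetical helper: read a file and try both mmcv decode backends."""
    import mmcv  # imported lazily so try_backends has no hard dependency

    return try_backends(Path(path).read_bytes(), mmcv.imfrombytes)
```

Calling `check_image(...)` on the image named in the traceback and printing the result should show `None` for the backend that fails and an array shape for the one that succeeds.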
/mmcv/transforms/loading.py
LoadImageFromFile
Solution
Add imdecode_backend='pillow' to the LoadImageFromFile step in the config file.
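As a rough illustration of that change, the sketch below patches a pipeline list so that every top-level `LoadImageFromFile` step decodes with Pillow. The helper `use_pillow_decoder` and the contents of `example_pipeline` are assumptions for illustration and are not taken from the actual config; in practice, editing the `dict(type='LoadImageFromFile', ...)` entry by hand has the same effect.

```python
def use_pillow_decoder(pipeline):
    """Return a copy of the pipeline with Pillow set as the decode backend.

    Only top-level steps are handled; transforms nested inside wrappers
    are left untouched.
    """
    patched = []
    for step in pipeline:
        step = dict(step)  # shallow copy so the input list is not mutated
        if step.get("type") == "LoadImageFromFile":
            step["imdecode_backend"] = "pillow"
        patched.append(step)
    return patched


# Hypothetical pipeline; the real one lives in the project's config file.
example_pipeline = [
    dict(type="LoadImageFromFile"),
    dict(type="Resize", scale=(1333, 800), keep_ratio=True),
]

patched_pipeline = use_pillow_decoder(example_pipeline)
print(patched_pipeline[0])
```

Since the failure occurred in the validation loop, the validation and test pipelines are the ones that need the change; applying it to the training pipeline as well keeps decoding consistent.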
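Other notes
Interestingly, I am not the only one who has run into this problem. While reading the source, I found that the author explicitly points, in a comment, to the issue that explains how to solve it: https://github.com/open-mmlab/mmpretrain/issues/1427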
vi */.conda/envs/open-mmlab/lib/python3.8/site-packages/mmcv/transforms/loading.py
:110