Problem description
Traceback (most recent call last):
File "/mnt/lab/XXX/.vscode-server/extensions/ms-python.debugpy-2024.0.0-linux-x64/bundled/libs/XXX/_vendored/pydevd/pydevd.py", line 3489, in <module>
main()
File "/mnt/lab/XXX/.vscode-server/extensions/ms-python.debugpy-2024.0.0-linux-x64/bundled/libs/debugpy/_vendored/pydevd/pydevd.py", line 3482, in main
globals = debugger.run(setup['file'], None, None, is_module)
File "/mnt/lab/XXX/.vscode-server/extensions/ms-python.debugpy-2024.0.0-linux-x64/bundled/libs/XXX/_vendored/pydevd/pydevd.py", line 2510, in run
return self._exec(is_module, entry_point_fn, module_name, file, globals, locals)
File "/mnt/lab/XXX/.vscode-server/extensions/ms-python.debugpy-2024.0.0-linux-x64/bundled/libs/debugpy/_vendored/pydevd/pydevd.py", line 2517, in _exec
globals = pydevd_runpy.run_path(file, globals, '__main__')
File "/mnt/lab/XXX/.vscode-server/extensions/ms-python.debugpy-2024.0.0-linux-x64/bundled/libs/debugpy/_vendored/pydevd/_pydevd_bundle/pydevd_runpy.py", line 321, in run_path
return _run_module_code(code, init_globals, run_name,
File "/mnt/lab/XXX/.vscode-server/extensions/ms-python.debugpy-2024.0.0-linux-x64/bundled/libs/debugpy/_vendored/pydevd/_pydevd_bundle/pydevd_runpy.py", line 135, in _run_module_code
_run_code(code, mod_globals, init_globals,
File "/mnt/lab/XXX/.vscode-server/extensions/ms-python.debugpy-2024.0.0-linux-x64/bundled/libs/debugpy/_vendored/pydevd/_pydevd_bundle/pydevd_runpy.py", line 124, in _run_code
exec(code, run_globals)
File "./diff_prompter/ppo_prompter.py", line 283, in <module>
main(args)
File "./diff_prompter/ppo_prompter.py", line 268, in main
model = trlx.train(
File "/mnt/lab/XXX/code/promptist/trlx/trlx/trlx.py", line 47, in train
model: AcceleratePPOModel = get_model(config.model.model_type)(config)
File "/mnt/lab/XXX/code/promptist/trlx/trlx/model/accelerate_ppo_model.py", line 48, in __init__
self.model, self.opt, self.scheduler, rollout_loader = self.accelerator.prepare(
File "/mnt/lab/XXX/anaconda3/envs/promptist/lib/python3.9/site-packages/accelerate/accelerator.py", line 1274, in prepare
result = tuple(
File "/mnt/lab/XXX/anaconda3/envs/promptist/lib/python3.9/site-packages/accelerate/accelerator.py", line 1275, in <genexpr>
self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
File "/mnt/lab/XXX/anaconda3/envs/promptist/lib/python3.9/site-packages/accelerate/accelerator.py", line 1151, in _prepare_one
return self.prepare_model(obj, device_placement=device_placement)
File "/mnt/lab/XXX/anaconda3/envs/promptist/lib/python3.9/site-packages/accelerate/accelerator.py", line 1403, in prepare_model
model = torch.nn.parallel.DistributedDataParallel(
File "/mnt/lab/XXX/anaconda3/envs/promptist/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 676, in __init__
_sync_module_states(
File "/mnt/lab/XXX/anaconda3/envs/promptist/lib/python3.9/site-packages/torch/distributed/utils.py", line 142, in _sync_module_states
_sync_params_and_buffers(
File "/mnt/lab/XXX/anaconda3/envs/promptist/lib/python3.9/site-packages/torch/distributed/utils.py", line 160, in _sync_params_and_buffers
dist._broadcast_coalesced(
RuntimeError: Invalid scalar type
Problem analysis
The same code runs in another environment, so we suspected an environment mismatch and rebuilt the environment.
Solution
After rebuilding the environment the error was gone and the code ran, but it printed a string of warnings.
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-dev package with apt
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
[WARNING] On Ampere and higher architectures please use CUDA 11+
[WARNING] On Ampere and higher architectures please use CUDA 11+
[WARNING] On Ampere and higher architectures please use CUDA 11+
[WARNING] On Ampere and higher architectures please use CUDA 11+
[WARNING] On Ampere and higher architectures please use CUDA 11+
[WARNING] On Ampere and higher architectures please use CUDA 11+
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3
[WARNING] using untested triton version (2.3.0), only 1.0.0 is known to be compatible
[WARNING] On Ampere and higher architectures please use CUDA 11+
After running for a while, the following was reported:
UserWarning: Plan failed with a cudnnException: CUDNN_BACKEND_EXECUTION_PLAN_DESCRIPTOR: cudnnFinalize Descriptor Failed cudnn_status: CUDNN_STATUS_NOT_SUPPORTED (Triggered internally at ../aten/src/ATen/native/cudnn/Conv_v8.cpp:919.)
return F.conv2d(input, weight, bias, self.stride,
Problem analysis
According to the resources we found, this is caused by mixing CPU and GPU usage.
Solution
The gloo backend is for CPU, while nccl is for GPU.
os.environ['LOCAL_RANK'] = '0'
os.environ['MASTER_ADDR'] = 'localhost'
os.environ['MASTER_PORT'] = '12385'
torch.distributed.init_process_group(backend='gloo', rank=0, world_size=1)
Change this to:
os.environ['LOCAL_RANK'] = '0'
os.environ['MASTER_ADDR'] = 'localhost'
os.environ['MASTER_PORT'] = '12385'
torch.distributed.init_process_group(backend='gloo' if device == 'cpu' else 'nccl', rank=0, world_size=1)
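Since the backend must match where the tensors live, it can help to keep the selection in one small helper. A minimal sketch, assuming `device` is a string such as `'cpu'`, `'cuda'`, or `'cuda:0'` (the helper name `pick_backend` is ours, not part of any library):

```python
def pick_backend(device: str) -> str:
    """Pick a torch.distributed backend matching the device.

    gloo operates on CPU tensors; nccl operates on CUDA tensors.
    """
    return "gloo" if device.startswith("cpu") else "nccl"

# Example wiring (single-process setup, same as above; assumes torch is installed):
# import os, torch
# os.environ['LOCAL_RANK'] = '0'
# os.environ['MASTER_ADDR'] = 'localhost'
# os.environ['MASTER_PORT'] = '12385'
# torch.distributed.init_process_group(backend=pick_backend(device), rank=0, world_size=1)
```

Using `startswith("cpu")` also handles the comparison correctly when `device` carries an index suffix like `'cuda:0'`.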
References
RuntimeError (Invalid Scalar Type) when training using bf16-mixed precision and DDP (gloo backend) #1809
[Distributed] Invalid scalar type when dist.scatter() boolean tensor #90245
RuntimeError: Invalid scalar type #5485