[Pitfall Diary 20] RuntimeError: Invalid scalar type

Problem description

Traceback (most recent call last):
  File "/mnt/lab/XXX/.vscode-server/extensions/ms-python.debugpy-2024.0.0-linux-x64/bundled/libs/XXX/_vendored/pydevd/pydevd.py", line 3489, in <module>
    main()
  File "/mnt/lab/XXX/.vscode-server/extensions/ms-python.debugpy-2024.0.0-linux-x64/bundled/libs/debugpy/_vendored/pydevd/pydevd.py", line 3482, in main
    globals = debugger.run(setup['file'], None, None, is_module)
  File "/mnt/lab/XXX/.vscode-server/extensions/ms-python.debugpy-2024.0.0-linux-x64/bundled/libs/XXX/_vendored/pydevd/pydevd.py", line 2510, in run
    return self._exec(is_module, entry_point_fn, module_name, file, globals, locals)
  File "/mnt/lab/XXX/.vscode-server/extensions/ms-python.debugpy-2024.0.0-linux-x64/bundled/libs/debugpy/_vendored/pydevd/pydevd.py", line 2517, in _exec
    globals = pydevd_runpy.run_path(file, globals, '__main__')
  File "/mnt/lab/XXX/.vscode-server/extensions/ms-python.debugpy-2024.0.0-linux-x64/bundled/libs/debugpy/_vendored/pydevd/_pydevd_bundle/pydevd_runpy.py", line 321, in run_path
    return _run_module_code(code, init_globals, run_name,
  File "/mnt/lab/XXX/.vscode-server/extensions/ms-python.debugpy-2024.0.0-linux-x64/bundled/libs/debugpy/_vendored/pydevd/_pydevd_bundle/pydevd_runpy.py", line 135, in _run_module_code
    _run_code(code, mod_globals, init_globals,
  File "/mnt/lab/XXX/.vscode-server/extensions/ms-python.debugpy-2024.0.0-linux-x64/bundled/libs/debugpy/_vendored/pydevd/_pydevd_bundle/pydevd_runpy.py", line 124, in _run_code
    exec(code, run_globals)
  File "./diff_prompter/ppo_prompter.py", line 283, in <module>
    main(args)
  File "./diff_prompter/ppo_prompter.py", line 268, in main
    model = trlx.train(
  File "/mnt/lab/XXX/code/promptist/trlx/trlx/trlx.py", line 47, in train
    model: AcceleratePPOModel = get_model(config.model.model_type)(config)
  File "/mnt/lab/XXX/code/promptist/trlx/trlx/model/accelerate_ppo_model.py", line 48, in __init__
    self.model, self.opt, self.scheduler, rollout_loader = self.accelerator.prepare(
  File "/mnt/lab/XXX/anaconda3/envs/promptist/lib/python3.9/site-packages/accelerate/accelerator.py", line 1274, in prepare
    result = tuple(
  File "/mnt/lab/XXX/anaconda3/envs/promptist/lib/python3.9/site-packages/accelerate/accelerator.py", line 1275, in <genexpr>
    self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
  File "/mnt/lab/XXX/anaconda3/envs/promptist/lib/python3.9/site-packages/accelerate/accelerator.py", line 1151, in _prepare_one
    return self.prepare_model(obj, device_placement=device_placement)
  File "/mnt/lab/XXX/anaconda3/envs/promptist/lib/python3.9/site-packages/accelerate/accelerator.py", line 1403, in prepare_model
    model = torch.nn.parallel.DistributedDataParallel(
  File "/mnt/lab/XXX/anaconda3/envs/promptist/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 676, in __init__
    _sync_module_states(
  File "/mnt/lab/XXX/anaconda3/envs/promptist/lib/python3.9/site-packages/torch/distributed/utils.py", line 142, in _sync_module_states
    _sync_params_and_buffers(
  File "/mnt/lab/XXX/anaconda3/envs/promptist/lib/python3.9/site-packages/torch/distributed/utils.py", line 160, in _sync_params_and_buffers
    dist._broadcast_coalesced(
RuntimeError: Invalid scalar type

Problem analysis

The same code runs in another environment, so I suspected an environment mismatch and rebuilt the environment.
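
To confirm an environment mismatch before rebuilding, it can help to print the installed versions of the relevant packages in both environments and compare them. The following is a minimal sketch; the package list is an assumption based on the libraries that show up in the traceback and warnings, and `collect_versions` is a hypothetical helper name, not part of any library.

```python
from importlib import metadata


def collect_versions(packages):
    """Return {package: version or None} for each requested distribution."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this environment
    return versions


if __name__ == "__main__":
    # Assumed package list; adjust it to whatever the project depends on.
    wanted = ["torch", "accelerate", "deepspeed", "triton"]
    for name, version in collect_versions(wanted).items():
        print(f"{name}: {version or 'not installed'}")
```

Running this in the working and the failing environment and comparing the two outputs shows which packages differ.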

Solution

After rebuilding the environment, the error is gone and the code runs, but a series of warnings appears.

 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io: please install the libaio-dev package with apt
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
 [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
 [WARNING]  On Ampere and higher architectures please use CUDA 11+
 [WARNING]  On Ampere and higher architectures please use CUDA 11+
 [WARNING]  On Ampere and higher architectures please use CUDA 11+
 [WARNING]  On Ampere and higher architectures please use CUDA 11+
 [WARNING]  On Ampere and higher architectures please use CUDA 11+
 [WARNING]  On Ampere and higher architectures please use CUDA 11+
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3
 [WARNING]  using untested triton version (2.3.0), only 1.0.0 is known to be compatible
 [WARNING]  On Ampere and higher architectures please use CUDA 11+

After running for a while, the following message is reported (a UserWarning rather than an exception):

UserWarning: Plan failed with a cudnnException: CUDNN_BACKEND_EXECUTION_PLAN_DESCRIPTOR: cudnnFinalize Descriptor Failed cudnn_status: CUDNN_STATUS_NOT_SUPPORTED (Triggered internally at ../aten/src/ATen/native/cudnn/Conv_v8.cpp:919.)
  return F.conv2d(input, weight, bias, self.stride,

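Problem analysis

Looking into it, the cause is a CPU/GPU mix-up: the distributed backend has to match the device the model runs on.

Solution

The gloo backend is for CPU, while nccl is for GPU.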

os.environ['LOCAL_RANK'] = '0'
os.environ['MASTER_ADDR'] = 'localhost'
os.environ['MASTER_PORT'] = '12385'
torch.distributed.init_process_group(backend='gloo', rank=0, world_size=1)

Change to:

# `device` is assumed to be defined earlier, e.g. 'cpu' or 'cuda'
os.environ['LOCAL_RANK'] = '0'
os.environ['MASTER_ADDR'] = 'localhost'
os.environ['MASTER_PORT'] = '12385'
torch.distributed.init_process_group(backend='gloo' if device == 'cpu' else 'nccl', rank=0, world_size=1)
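
The backend choice can also be pulled into a small helper so that the device string decides it in one place. This is only a sketch: `pick_backend` and `init_single_process` are hypothetical names, and it assumes the device is given as a string such as 'cpu', 'cuda' or 'cuda:0'. "gloo" and "nccl" are the backend names PyTorch uses for CPU and NVIDIA GPU communication respectively.

```python
def pick_backend(device: str) -> str:
    """Map a device string to a torch.distributed backend name."""
    # "nccl" for CUDA devices, "gloo" for everything else (CPU).
    return "nccl" if device.startswith("cuda") else "gloo"


def init_single_process(device: str, port: str = "12385") -> None:
    """Initialise a one-process group with the same settings as the snippet above."""
    import os
    import torch.distributed as dist  # imported lazily; requires PyTorch

    # setdefault keeps values already exported by a launcher, if any.
    os.environ.setdefault("LOCAL_RANK", "0")
    os.environ.setdefault("MASTER_ADDR", "localhost")
    os.environ.setdefault("MASTER_PORT", port)
    dist.init_process_group(backend=pick_backend(device), rank=0, world_size=1)
```

A call such as `init_single_process("cuda" if torch.cuda.is_available() else "cpu")` then picks the backend from the available hardware.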

References

RuntimeError (Invalid Scalar Type) when training using bf16-mixed precision and DDP (gloo backend) #1809
[Distributed] Invalid scalar type when dist.scatter() boolean tensor #90245
RuntimeError: Invalid scalar type #5485
