Error
[2024-05-13 21:03:16,806] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-05-13 21:03:33,623] [WARNING] [runner.py:203:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
Detected CUDA_VISIBLE_DEVICES=0: setting --include=localhost:0
Traceback (most recent call last):
  File "/home/bingxing2/ailab/scxlab0069/.local/bin/deepspeed", line 6, in <module>
    main()
  File "/home/bingxing2/ailab/scxlab0069/.local/lib/python3.9/site-packages/deepspeed/launcher/runner.py", line 422, in main
    raise RuntimeError("Unable to proceed, no GPU resources available")
RuntimeError: Unable to proceed, no GPU resources available
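1. SSH into the compute node.
2. Start Python and check whether PyTorch can see the GPU:
python
import torch
print(torch.cuda.is_available())
3. If the node has a GPU but the check still prints False (or a device count of 0), torch's env.sh was probably not sourced after torch was installed.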
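
The check in step 2 can be extended into a small diagnostic that prints everything relevant in one go. This is only a sketch: the helper name cuda_report is made up for illustration, and it assumes the DeepSpeed error above means the launcher found zero CUDA devices in the current environment. It uses only standard PyTorch calls and still runs if torch is not importable.

```python
import os


def cuda_report():
    """Collect the facts that decide whether a GPU is visible to PyTorch."""
    report = {
        # Which GPUs the environment exposes; None means the variable is unset.
        "CUDA_VISIBLE_DEVICES": os.environ.get("CUDA_VISIBLE_DEVICES"),
        "torch_installed": False,
        "cuda_available": False,
        "device_count": 0,
    }
    try:
        import torch
    except ImportError:
        # torch is not importable in this environment; report the defaults.
        return report

    report["torch_installed"] = True
    report["torch_version"] = torch.__version__
    # None here means a CPU-only build of torch was installed.
    report["built_with_cuda"] = torch.version.cuda
    report["cuda_available"] = torch.cuda.is_available()
    report["device_count"] = torch.cuda.device_count()
    return report


if __name__ == "__main__":
    for key, value in cuda_report().items():
        print(f"{key}: {value}")
```

Run it on the compute node once before and once after sourcing env.sh. If cuda_available only becomes True after sourcing, the missing environment setup was the cause. If built_with_cuda is None, the installed torch is a CPU-only build and sourcing env.sh will not help.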