This error showed up while fine-tuning baichuan2-7B-chat with LoRA, as soon as offload was enabled.
RuntimeError: Error building extension 'cpu_adam'
result = self._prepare_deepspeed(*args)
File "/nfsshare/home/xxx/.conda/envs/open-instruct-env/lib/python3.10/site-packages/accelerate/accelerator.py", line 1594, in _prepare_deepspeed
optimizer = DeepSpeedCPUAdam(optimizer.param_groups, **defaults)
File "/nfsshare/home/xxx/.conda/envs/open-instruct-env/lib/python3.10/site-packages/deepspeed/ops/adam/cpu_adam.py", line 94, in __init__
self.ds_opt_adam = CPUAdamBuilder().load()
File "/nfsshare/home/xxx/.conda/envs/open-instruct-env/lib/python3.10/site-packages/deepspeed/ops/op_builder/builder.py", line 454, in load
return self.jit_load(verbose)
File "/nfsshare/home/xxx/.conda/envs/open-instruct-env/lib/python3.10/site-packages/deepspeed/ops/op_builder/builder.py", line 497, in jit_load
op_module = load(name=self.name,
File "/nfsshare/home/xxx/.conda/envs/open-instruct-env/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1284, in load
return _jit_compile(
File "/nfsshare/home/xxx/.conda/envs/open-instruct-env/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1535, in _jit_compile
return _import_module_from_library(name, build_directory, is_python_module)
File "/nfsshare/home/xxx/.conda/envs/open-instruct-env/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1929, in _import_module_from_library
module = importlib.util.module_from_spec(spec)
File "<frozen importlib._bootstrap>", line 571, in module_from_spec
File "<frozen importlib._bootstrap_external>", line 1176, in create_module
File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
ImportError: /nfsshare/home/xxx/.cache/torch_extensions/py310_cu118/cpu_adam/cpu_adam.so: cannot open shared object file: No such file or directory
The last line says "cannot open shared object file: No such file or directory", meaning the compiled cpu_adam.so file does not exist.
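To confirm what the ImportError is saying, here is a minimal sketch that checks whether the compiled library is actually present in the build directory. The directory is taken from the traceback above, written with ~ on the assumption that it sits under your home directory; find_built_extension is a hypothetical helper name, not part of torch or DeepSpeed.

```python
from pathlib import Path


def find_built_extension(build_dir, name="cpu_adam"):
    """Return the path to <name>.so inside build_dir, or None if it is missing."""
    candidate = Path(build_dir).expanduser() / f"{name}.so"
    return candidate if candidate.is_file() else None


if __name__ == "__main__":
    # Build directory as it appears in the traceback (assumed to be under ~).
    build_dir = "~/.cache/torch_extensions/py310_cu118/cpu_adam"
    so_path = find_built_extension(build_dir)
    if so_path is None:
        print(f"cpu_adam.so not found in {build_dir}; the JIT build probably did not produce it")
    else:
        print(f"found {so_path}")
```

If the file is missing, the problem is in the build step itself rather than in the import, which is why copying a .so from elsewhere only hides the cause.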
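The fix I found online was to copy a .so file over from somewhere else, but none of those posts actually provided the file.
After going through a lot of material, I began to suspect CUDA: the environment variables were not configured properly.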
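[Solution] Pin down the CUDA path. In my case it is '/nfsshare/apps/cuda-11.8/', in a shared folder on the server.
PS: If you installed CUDA yourself through conda, the CUDA-related files only exist under that conda environment's directory, and there is no dedicated 'cuda' subfolder. I first pointed the CUDA path at the conda environment, but that did not help; it had to be the CUDA installed on the shared server. (If the server's CUDA version is too old, ask the administrator to install a newer one.)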
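A quick way to tell whether a directory is usable as the CUDA path is to look for the pieces a compiler toolchain needs. The sketch below is only a heuristic: the two files it checks (the nvcc compiler and the cuda_runtime.h header) are my assumption about what a full toolkit install contains, and missing_toolkit_parts is a hypothetical helper name.

```python
from pathlib import Path

# Assumed markers of a full CUDA toolkit install (heuristic, not exhaustive).
EXPECTED_PARTS = ["bin/nvcc", "include/cuda_runtime.h"]


def missing_toolkit_parts(cuda_dir):
    """Return the expected toolkit files that are absent under cuda_dir."""
    root = Path(cuda_dir)
    return [part for part in EXPECTED_PARTS if not (root / part).is_file()]


if __name__ == "__main__":
    # Path used in this post; replace it with the candidate on your machine.
    missing = missing_toolkit_parts("/nfsshare/apps/cuda-11.8/")
    print("looks complete" if not missing else f"missing: {missing}")
```

Running it against a conda environment directory and against the shared install should show the difference described above, if the conda environment really lacks the compiler.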
Once you know the CUDA path, add these two lines to the end of your .bashrc file:
export CUDA_HOME=/nfsshare/apps/cuda-11.8/
export PATH="$CUDA_HOME/bin:$PATH"
Finally, run source ~/.bashrc to reload the environment variables, and that's it!
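Before relaunching training, it may be worth checking from Python that the new variables are visible in the shell you will launch from. This is a minimal sketch using only the standard library; cuda_env_report is a hypothetical helper, and the check that nvcc resolves to the copy under CUDA_HOME is my own sanity test, not something DeepSpeed requires explicitly.

```python
import os
import shutil


def cuda_env_report(env=None):
    """Summarize CUDA_HOME and which nvcc the given environment resolves to."""
    env = os.environ if env is None else env
    cuda_home = env.get("CUDA_HOME")
    # Look nvcc up on the PATH of the environment being inspected.
    nvcc = shutil.which("nvcc", path=env.get("PATH"))
    expected = os.path.join(cuda_home, "bin", "nvcc") if cuda_home else None
    matches = bool(
        nvcc and expected
        and os.path.realpath(nvcc) == os.path.realpath(expected)
    )
    return {"CUDA_HOME": cuda_home, "nvcc": nvcc, "nvcc_under_cuda_home": matches}


if __name__ == "__main__":
    for key, value in cuda_env_report().items():
        print(f"{key}: {value}")
```

If CUDA_HOME prints as None or nvcc resolves to a different install, the .bashrc change has not taken effect in that shell yet.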
(I pulled an all-nighter to fix this error. I'm exhausted.)