Problem Description
As the title indicates, I hit the following error while running ChatGLM:
Traceback (most recent call last):
File "/xxx/LuXun-GPT/inference.py", line 44, in <module>
model = AutoModel.from_pretrained(args.base_model, trust_remote_code=True, load_in_8bit=True, device_map='auto', revision="v0.1.0")
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/xxx/luxuntest/lib/python3.11/site-packages/transformers/models/auto/auto_factory.py", line 466, in from_pretrained
return model_class.from_pretrained(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/xxx/.conda/envs/luxuntest/lib/python3.11/site-packages/transformers/modeling_utils.py", line 2646, in from_pretrained
) = cls._load_pretrained_model(
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/xxx/.conda/envs/luxuntest/lib/python3.11/site-packages/transformers/modeling_utils.py", line 2969, in _load_pretrained_model
new_error_msgs, offload_index, state_dict_index = _load_state_dict_into_meta_model(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/xxx/.conda/envs/luxuntest/lib/python3.11/site-packages/transformers/modeling_utils.py", line 676, in _load_state_dict_into_meta_model
set_module_8bit_tensor_to_device(model, param_name, param_device, value=param)
File "/xxx/.conda/envs/luxuntest/lib/python3.11/site-packages/transformers/utils/bitsandbytes.py", line 70, in set_module_8bit_tensor_to_device
new_value = bnb.nn.Int8Params(new_value, requires_grad=False, has_fp16_weights=has_fp16_weights).to(device)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/xxx/.conda/envs/luxuntest/lib/python3.11/site-packages/bitsandbytes/nn/modules.py", line 196, in to
return self.cuda(device)
^^^^^^^^^^^^^^^^^
File "/xxx/.conda/envs/luxuntest/lib/python3.11/site-packages/bitsandbytes/nn/modules.py", line 160, in cuda
CB, CBt, SCB, SCBt, coo_tensorB = bnb.functional.double_quant(B)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/xxx/.conda/envs/luxuntest/lib/python3.11/site-packages/bitsandbytes/functional.py", line 1616, in double_quant
row_stats, col_stats, nnz_row_ptr = get_colrow_absmax(
^^^^^^^^^^^^^^^^^^
File "/xxx/.conda/envs/luxuntest/lib/python3.11/site-packages/bitsandbytes/functional.py", line 1505, in get_colrow_absmax
lib.cget_col_row_stats(ptrA, ptrRowStats, ptrColStats, ptrNnzrows, ct.c_float(threshold), rows, cols)
^^^^^^^^^^^^^^^^^^^^^^
File "/xxx/.conda/envs/luxuntest/lib/python3.11/ctypes/__init__.py", line 389, in __getattr__
func = self.__getitem__(name)
^^^^^^^^^^^^^^^^^^^^^^
File "/xxx/.conda/envs/luxuntest/lib/python3.11/ctypes/__init__.py", line 394, in __getitem__
func = self._FuncPtr((name_or_ordinal, self))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: /xxx/.conda/envs/luxuntest/lib/python3.11/site-packages/bitsandbytes/libbitsandbytes_cpu.so: undefined symbol: cget_col_row_stats
Problem Analysis
Note this line in the traceback:
File "/xxx/.conda/envs/luxuntest/lib/python3.11/site-packages/bitsandbytes/nn/modules.py", line 196, in to
return self.cuda(device)
^^^^^^^^^^^^^^^^^
This call is moving the quantized weights to the GPU, yet the final line of the error reads:
AttributeError: /xxx/.conda/envs/luxuntest/lib/python3.11/site-packages/bitsandbytes/libbitsandbytes_cpu.so: undefined symbol: cget_col_row_stats
The filename libbitsandbytes_cpu.so shows that the CPU build of the shared library was loaded instead of a CUDA build, and the CPU build does not export the CUDA kernel cget_col_row_stats, hence the AttributeError.
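The AttributeError itself is raised by Python's ctypes module when a symbol lookup on a loaded shared library fails. A minimal sketch of the same check (has_symbol is a hypothetical helper, demonstrated here against libc; you can point it at the bitsandbytes .so files to confirm which symbols each build exports):

```python
import ctypes
import ctypes.util

def has_symbol(lib_path, symbol):
    """Return True if the shared library at lib_path exports `symbol`."""
    try:
        lib = ctypes.CDLL(lib_path)
        getattr(lib, symbol)  # the same lookup the traceback shows failing
        return True
    except (OSError, AttributeError):
        return False

# Demonstrate against libc, which is present on most Unix-like systems:
libc = ctypes.util.find_library("c")  # e.g. "libc.so.6" on Linux
print(has_symbol(libc, "printf"))              # True: libc exports printf
print(has_symbol(libc, "cget_col_row_stats"))  # False: not a libc symbol
```

Running the same check against libbitsandbytes_cpu.so would return False for cget_col_row_stats, matching the traceback.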
Solution
The simplest (if blunt) fix is to overwrite libbitsandbytes_cpu.so with the matching GPU build of the library:
cd /home/xxx/.conda/envs/xxx/lib/python3.x/site-packages/bitsandbytes
cp libbitsandbytes_cuda1xx.so libbitsandbytes_cpu.so
Which _cuda1xx.so to copy depends on the CUDA version that your GPU build of torch was compiled against.
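To pick the right file, check the CUDA version that torch reports (torch.version.cuda) and map it to the library filename. A small sketch, assuming the libbitsandbytes_cuda&lt;major&gt;&lt;minor&gt;.so naming scheme seen in the bitsandbytes package directory (bnb_cuda_lib_name is a hypothetical helper):

```python
def bnb_cuda_lib_name(cuda_version: str) -> str:
    """Map a CUDA version string like '11.7' to the bitsandbytes
    library filename, e.g. 'libbitsandbytes_cuda117.so'."""
    major, minor = cuda_version.split(".")[:2]
    return f"libbitsandbytes_cuda{major}{minor}.so"

# In a real environment you would obtain the version from torch:
#   import torch; cuda_version = torch.version.cuda  # e.g. "11.7"
print(bnb_cuda_lib_name("11.7"))  # -> libbitsandbytes_cuda117.so
```

If the printed filename does not exist in the bitsandbytes directory, the installed bitsandbytes wheel was likely built without support for your CUDA version, and reinstalling a matching build is the cleaner fix.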