Installing llama-cpp-python in Docker and running GGUF inference: the full process

Latest recommended article published 2025-03-28 19:50:34

X.Cristiano


Tags: docker ai

Article link: https://blog.csdn.net/m0_37733448/article/details/142183251


1. Start a container from the image

# --gpus all is required; without it the container cannot use the GPU for inference
docker run -it --gpus all infer_llama_cpp:latest bash
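
Once inside the container, it can help to confirm that the GPU is actually visible before building anything. The following is a minimal sketch, assuming the NVIDIA runtime places `nvidia-smi` on the PATH inside the container; the helper name `gpu_visible` is made up for illustration.

```python
import shutil
import subprocess


def gpu_visible() -> bool:
    """Return True if nvidia-smi is on PATH and exits successfully."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        # Tool not found: the container was probably started without --gpus all
        return False
    try:
        # "nvidia-smi -L" lists the GPUs the driver can see
        result = subprocess.run([exe, "-L"], capture_output=True, text=True, timeout=30)
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


if __name__ == "__main__":
    print("GPU visible:", gpu_visible())
```

If this prints `False`, restarting the container with `--gpus all` is the first thing to check.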

2. Install dependencies

apt-get update
apt-get install -y build-essential cmake ninja-build
apt-get install -y libstdc++6 libgcc1
apt-get install -y g++-10
pip install cmake ninja
export GGML_CUDA=on
CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python -U --force-reinstall
# At this point the install should be working. If something fails, searching for the exact error message usually turns up a fix.
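
To check whether the build picked up GPU support, a small probe can be run before loading any model. This is a sketch under the assumption that the installed llama-cpp-python release exposes the low-level binding `llama_supports_gpu_offload`; it is looked up with `getattr` so a release without it is reported as "unknown" rather than failing. The function name `check_llama_cpp_build` is hypothetical.

```python
import importlib


def check_llama_cpp_build():
    """Return (installed, gpu_offload).

    gpu_offload is None when it cannot be determined.
    """
    try:
        llama_cpp = importlib.import_module("llama_cpp")
    except (ImportError, OSError, RuntimeError):
        # Package missing, or its shared library failed to load
        return False, None
    # Assumption: this binding exists in the installed release
    probe = getattr(llama_cpp, "llama_supports_gpu_offload", None)
    if probe is None:
        return True, None
    return True, bool(probe())


if __name__ == "__main__":
    installed, gpu_offload = check_llama_cpp_build()
    print("llama_cpp installed:", installed)
    print("GPU offload supported:", gpu_offload)
```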

3. Python example

from llama_cpp import Llama
import json
from tqdm import tqdm
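# n_gpu_layers: when the library is built with GPU support (the upstream docs mention CLBlast or cuBLAS; the build above uses CUDA), this option offloads some layers to the GPU, which usually improves performance.
# n_gpu_layers=-1 offloads all layers to the GPU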
llm = Llama(model_path="Qwen2-0.5B-Instruct-Q4_K_M.gguf",n_gpu_layers=-1, n_ctx=2048)

with open("test.json", "r", encoding="utf8") as f:
    datas = json.load(f)
# Wrap the list itself (not enumerate) so tqdm knows the total and can show progress
for idx, data in enumerate(tqdm(datas)):
    instruction = data["instruction"]
    output = llm(instruction, max_tokens=128)
    print(idx, output['choices'][0]['text'].strip())
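
The loop above expects `test.json` to be a JSON list of objects that each carry an `instruction` field. The sketch below writes a small sample file and validates that shape before inference starts; the sample prompts and the helper name `load_instructions` are invented for illustration.

```python
import json
import os
import tempfile


def load_instructions(path):
    """Load a JSON list of records and return their 'instruction' strings."""
    with open(path, "r", encoding="utf8") as f:
        records = json.load(f)
    if not isinstance(records, list):
        raise ValueError("expected a JSON list of records")
    instructions = []
    for i, record in enumerate(records):
        if not isinstance(record, dict) or "instruction" not in record:
            raise ValueError(f"record {i} has no 'instruction' field")
        instructions.append(record["instruction"])
    return instructions


# Hypothetical sample data, written to a temporary file for demonstration
sample = [
    {"instruction": "Introduce Docker in one sentence."},
    {"instruction": "What is a GGUF file?"},
]
tmp_path = os.path.join(tempfile.mkdtemp(), "test.json")
with open(tmp_path, "w", encoding="utf8") as f:
    json.dump(sample, f, ensure_ascii=False)

print(load_instructions(tmp_path))
```

Validating up front avoids a `KeyError` partway through a long run.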

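4. When there is not enough GPU memory to hold the whole GGUF model, offload only some of the layers to the GPU and let the CPU handle the rest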

from llama_cpp import Llama
import json
from tqdm import tqdm
llm = Llama(model_path="Qwen2-72B-Instruct-Q4_K_M.gguf", n_gpu_layers=20, chat_format='qwen', n_ctx=2048)

with open("test.json", "r", encoding="utf8") as f:
    datas = json.load(f)
# Wrap the list itself (not enumerate) so tqdm knows the total and can show progress
for idx, data in enumerate(tqdm(datas)):
    instruction = data["instruction"]
    output = llm(instruction, max_tokens=128)
    print(idx, output['choices'][0]['text'].strip())
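
Picking a value for `n_gpu_layers` is largely trial and error, but a rough starting point can be estimated from the model file size and the free GPU memory. The sketch below is a heuristic, not how llama.cpp allocates memory: it assumes all layers are the same size, ignores the KV cache and non-layer tensors, and uses a guessed safety reserve. All numbers in the example call are hypothetical.

```python
def estimate_gpu_layers(model_size_gib, n_layers, free_vram_gib, reserve_gib=2.0):
    """Rough count of layers that fit in VRAM, assuming equally sized layers."""
    if n_layers <= 0 or model_size_gib <= 0:
        raise ValueError("model_size_gib and n_layers must be positive")
    per_layer_gib = model_size_gib / n_layers
    # Keep some memory back for the context and scratch buffers (a guess)
    usable_gib = max(free_vram_gib - reserve_gib, 0.0)
    return min(n_layers, int(usable_gib // per_layer_gib))


# Hypothetical numbers: a 40 GiB file with 80 layers, 12 GiB of free VRAM
print(estimate_gpu_layers(40.0, 80, 12.0))  # prints 20
```

If loading still runs out of memory with the estimated value, lower `n_gpu_layers` and try again.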

5. Step 4 runs without problems on a recent CPU. On older CPUs, or on machines such as V100 servers, loading the model usually fails with: illegal instruction (core dumped)!
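
Since the fix for this error involves disabling AVX2, it is worth checking whether the CPU advertises that instruction set. The following sketch parses `/proc/cpuinfo`-style text, assuming a Linux host where the `flags` line lists CPU features; the function names are illustrative.

```python
def cpu_flags(cpuinfo_text):
    """Collect CPU feature flags from /proc/cpuinfo-style text."""
    flags = set()
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            flags.update(value.split())
    return flags


def supports_avx2(cpuinfo_text):
    """Return True if the 'avx2' flag is present."""
    return "avx2" in cpu_flags(cpuinfo_text)


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo", "r", encoding="utf8") as f:
            print("AVX2 supported:", supports_avx2(f.read()))
    except OSError:
        print("/proc/cpuinfo is not available on this system")
```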


6. Solution

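llama-cpp-python has to be recompiled with AVX2 disabled. The command below is the one used here; on releases whose CMake options use the GGML_ prefix, the switch may need to be spelled -DGGML_AVX2=OFF instead: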

CMAKE_ARGS="-DGGML_CUDA=on -DLLAMA_AVX2=OFF" pip install llama-cpp-python -U --force-reinstall --no-cache-dir

The build can take several minutes. Once it finishes, rerun the code from step 4 and inference should run on the GPU and CPU together.
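
When the same image has to be built on machines with and without AVX2, the CMake flags can be assembled programmatically. This is only a sketch that reuses the two flags from the commands in this post; the helper `build_cmake_args` is hypothetical and does not run the install itself.

```python
def build_cmake_args(cuda=True, avx2=True):
    """Compose a CMAKE_ARGS string from the flags used in this post."""
    args = []
    if cuda:
        args.append("-DGGML_CUDA=on")
    if not avx2:
        # Flag spelling copied from the command above; it may differ by release
        args.append("-DLLAMA_AVX2=OFF")
    return " ".join(args)


# For an older CPU without AVX2:
print(build_cmake_args(cuda=True, avx2=False))
```

The resulting string can be exported as `CMAKE_ARGS` before running the `pip install` command.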

7. Other issues

Fixing "nvcc not found":

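# Check whether nvcc exists in the CUDA bin directory
ls /usr/local/cuda/bin/nvcc

# If it exists, add the CUDA paths to the shell environment
vi ~/.bashrc

# Add the following two lines to ~/.bashrc to set LD_LIBRARY_PATH and PATH
# (on 64-bit Linux the CUDA libraries are usually under lib64; adjust if your install differs)
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
export PATH=$PATH:/usr/local/cuda/bin

# Reload the configuration
source ~/.bashrc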

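# Rebuild (wait for compilation to finish, about 15 min).
# The make command below applies to a llama.cpp source checkout; for llama-cpp-python, rerun the pip install command from step 2 instead.
make clean && make LLAMA_CUDA=1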
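
The same lookup can be scripted, which is handy in a build script that should fail early with a clear message. A minimal sketch, assuming the default install prefix `/usr/local/cuda` used above; the helper name `find_nvcc` is made up.

```python
import os
import shutil


def find_nvcc(cuda_home="/usr/local/cuda"):
    """Return the path to nvcc, or None if it cannot be found."""
    on_path = shutil.which("nvcc")
    if on_path:
        return on_path
    # Fall back to the conventional location under the CUDA install prefix
    candidate = os.path.join(cuda_home, "bin", "nvcc")
    if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
        return candidate
    return None


if __name__ == "__main__":
    path = find_nvcc()
    print("nvcc:", path if path else "not found; check the CUDA toolkit install and PATH")
```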

References:

"Illegal instruction" when trying to run the server using a precompiled docker image · Issue #272 · abetlen/llama-cpp-python

Illegal instruction (core dumped) when trying to load model · Issue #839 · abetlen/llama-cpp-python