Lightweight Model Deployment and Quantization with llama.cpp

Cary丿Xin

Last modified 2024-08-05 17:40:46


Tags: llama linux ops

First published 2024-08-01 19:47:34

Article link: https://blog.csdn.net/weixin_42254289/article/details/140849981


Downloading the model files

The model used here is Llama3-8B-Chinese-Chat-GGUF-8bit. For faster downloads from Hugging Face, see my other post:

Downloading models from Hugging Face faster

1. Prepare the model files

export HF_ENDPOINT=https://hf-mirror.com
huggingface-cli download shenzhi-wang/Llama3-8B-Chinese-Chat-GGUF-8bit --local-dir /home/xintk/workspace/model/Llama3-8B-Chinese-Chat-GGUF

2. Installation

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp

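3. Build

The build uses CMake. At first I did not know CMake well and could not tell CMake and Make apart.

From what I read:

CMake is a cross-platform build-system generator. From its configuration files (usually CMakeLists.txt) it generates build scripts or project files suited to the target platform.

Make is a build automation tool. It reads a Makefile and runs the compile and build steps.

Build process:

CMake generates the Makefile.
Make reads the Makefile and calls g++ to compile and link.
g++ is the compiler that actually performs the compiling and linking.

CPU version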

cmake -B build_cpu
cmake --build build_cpu --config Release

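CUDA version

cmake -B build_cuda -DLLAMA_CUDA=ON
# cmake -B: use build_cuda as the build directory (created if it does not exist); the generated build files and compiled outputs go there
# -DLLAMA_CUDA=ON: turn on the CUDA switch to build with CUDA support
# note: newer llama.cpp versions name this option GGML_CUDA; check the build docs for your checkout

cmake --build build_cuda --config Release -j 12
# --build: build the project in the given build directory
# --config: use the Release configuration
# -j: number of parallel build jobs; 12 here to use 12 cores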

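Running cmake -B build_cuda -DLLAMA_CUDA=ON failed with an error.

I checked my setup: cmake is version 3.30.0 and CUDA is 12.2. At first I suspected a cmake version mismatch, but the cause turned out to be the gcc version, so I switched gcc versions.

Before the change my gcc was 13.2, which is too new.

Install gcc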

sudo apt install gcc-9 g++-9 gcc-10 g++-10

Register gcc-10 as the preferred gcc

sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-10 100 --slave /usr/bin/g++ g++ /usr/bin/g++-10 --slave /usr/bin/gcov gcov /usr/bin/gcov-10

After that, re-running cmake -B build_cuda -DLLAMA_CUDA=ON succeeded.
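To catch this kind of mismatch before configuring, a small check like the one below can help. It is only a sketch: the limit of GCC major version 12 for CUDA 12.2 is my assumption (confirm it against the NVIDIA installation guide for your toolkit), and the function names are made up for illustration.

```python
import re
import subprocess

# Assumption: CUDA 12.2 accepts host GCC up to major version 12.
# Verify this limit in the NVIDIA CUDA installation guide for your toolkit.
ASSUMED_MAX_GCC_MAJOR = 12


def parse_gcc_major(version_output: str) -> int:
    """Return the major version from the first line of `gcc --version` output."""
    first_line = version_output.splitlines()[0]
    # The compiler version is the last dotted number on the line,
    # e.g. "gcc (Ubuntu 13.2.0-4ubuntu3) 13.2.0" ends with "13.2.0".
    versions = re.findall(r"\d+\.\d+(?:\.\d+)?", first_line)
    if not versions:
        raise ValueError(f"no version number found in: {first_line!r}")
    return int(versions[-1].split(".")[0])


def gcc_is_supported(major: int, max_major: int = ASSUMED_MAX_GCC_MAJOR) -> bool:
    """True if the gcc major version is within the assumed CUDA limit."""
    return major <= max_major


def installed_gcc_major() -> int:
    """Ask the gcc on PATH for its version (requires gcc to be installed)."""
    result = subprocess.run(
        ["gcc", "--version"], capture_output=True, text=True, check=True
    )
    return parse_gcc_major(result.stdout)
```

Calling gcc_is_supported(installed_gcc_major()) before cmake -B build_cuda gives an early warning if the default gcc is newer than the assumed limit.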

Next, run the build command below; it takes a little while.

cmake --build build_cuda --config Release -j 12

When the build finishes, the executables are in build_cuda/bin.

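4. Usage

llama.cpp ships many tools; three of them are used here.

1) Main program: main

After the build, go to build_cuda/bin, where you will find the llama-cli binary, and run the command below. See also the introduction to the main parameters.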

cd /home/xintk/workspace/llama.cpp/build_cuda/bin

./llama-cli -m /home/xintk/workspace/model/Llama3-8B-Chinese-Chat-GGUF/Llama3-8B-Chinese-Chat-q8_0-v2_1.gguf \
    -n -1 \
    -ngl 256 \
    -t 12 \
    --color \
    -r "User:" \
    --in-prefix " " \
    -i \
    -p \
'User: 你好
AI: 你好啊，我是小明，要聊聊吗?
User: 好啊!
AI: 你想聊聊什么话题呢？
User:'

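Parameters:

-m path to the model file

-n number of tokens to predict; for example, 128 stops after 128 tokens, and -1 keeps generating without a limit

-ngl number of model layers to offload to the GPU (needs a build with GPU support); this usually improves performance

-t number of CPU threads to use

--color enable colored output to visually distinguish the prompt, user input, and generated text

-r reverse prompt: generation pauses when this string appears and control returns to the user, which creates the back-and-forth of a chat; here it is User:

--in-prefix string placed before each user input; here a single space

-i interactive mode (chat)

-p the prompt; same as --prompt

Official parameter reference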
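If you launch llama-cli from a script, it can be easier to assemble the same flags as an argument list. The sketch below only builds the list from the options described above; the function name and the example model path are made up for illustration, and the defaults copy the values used in this post.

```python
import shlex


def build_llama_cli_args(
    model_path: str,
    prompt: str,
    n_predict: int = -1,      # -n: -1 means no token limit
    gpu_layers: int = 256,    # -ngl: layers to offload to the GPU
    threads: int = 12,        # -t: CPU threads
    reverse_prompt: str = "User:",
    in_prefix: str = " ",
) -> list[str]:
    """Build the argument list for the interactive llama-cli call shown above."""
    return [
        "./llama-cli",
        "-m", model_path,
        "-n", str(n_predict),
        "-ngl", str(gpu_layers),
        "-t", str(threads),
        "--color",
        "-r", reverse_prompt,
        "--in-prefix", in_prefix,
        "-i",
        "-p", prompt,
    ]


# Hypothetical model path and a short prompt, for illustration only.
command = build_llama_cli_args("model.gguf", "User: hello\nAI:")
print(shlex.join(command))
```

The list can be passed directly to subprocess.run(command) from build_cuda/bin; shlex.join is used here only to print a shell-safe version of the command.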

2) Serving: server

This part is handled by build_cuda/bin/llama-server. Run the following command:

cd /home/xintk/workspace/llama.cpp/build_cuda/bin

./llama-server \
    -m /home/xintk/workspace/model/Llama3-8B-Chinese-Chat-GGUF/Llama3-8B-Chinese-Chat-q8_0-v2_1.gguf \
    --host "127.0.0.1" \
    --port 8080 \
    -c 2048 \
    -ngl 128

Once started, the server listens on the host and port given above.

Calling the service

# OpenAI-style request
curl --request POST \
    --url http://localhost:8080/v1/chat/completions \
    --header "Content-Type: application/json" \
    --header "Authorization: Bearer echo in the moon" \
    --data '{
        "model": "gpt-3.5-turbo",
        "messages": [
            {"role": "system", "content": "你叫蜡笔小新，是一个ai助手，是由哆唻爱梦开发实现的."},
            {"role": "user", "content": "你好啊！怎么称呼你呢？"}
        ]
    }'
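The same request can be sent from Python with only the standard library. This is a sketch under a few assumptions: the server is running at the URL from the curl example, and the response follows the usual OpenAI chat layout (choices[0].message.content). The function names are made up for illustration.

```python
import json
import urllib.request


def build_chat_request(
    system_prompt: str,
    user_message: str,
    url: str = "http://localhost:8080/v1/chat/completions",
    model: str = "gpt-3.5-turbo",
    api_key: str = "echo in the moon",  # placeholder token copied from the curl example
) -> urllib.request.Request:
    """Build the POST request that mirrors the curl call above."""
    payload = {
        "model": model,
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_message},
        ],
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


def send_chat_request(request: urllib.request.Request) -> str:
    """Send the request and return the reply text.

    Assumes an OpenAI-style response body; needs a running llama-server.
    """
    with urllib.request.urlopen(request) as response:
        body = json.loads(response.read().decode("utf-8"))
    return body["choices"][0]["message"]["content"]
```

With the server from the previous step running, send_chat_request(build_chat_request("You are a helpful assistant.", "Hello")) should return the model's reply as a string.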

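The server responds quickly with a result.

3) Quantization

Quantization means converting high-precision floating-point numbers into lower-precision floating-point numbers or integers, for example:

1. fp16 → int8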

2. fp32 → fp16
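To make the float-to-int8 case concrete, here is a simplified sketch of block-wise 8-bit quantization in the spirit of llama.cpp's Q8_0 type: each block of values shares one scale, and each value is stored as a signed 8-bit integer. It is an illustration only, not the library's code; the real format stores the scale in fp16 and packs blocks of 32 values.

```python
BLOCK_SIZE = 32  # values per block, as in Q8_0


def quantize_block(values: list[float]) -> tuple[float, list[int]]:
    """Map floats to integers in [-127, 127] sharing a single scale."""
    amax = max(abs(v) for v in values)
    if amax == 0.0:
        return 0.0, [0] * len(values)
    scale = amax / 127.0
    quants = [round(v / scale) for v in values]
    return scale, quants


def dequantize_block(scale: float, quants: list[int]) -> list[float]:
    """Recover approximate floats from the stored scale and integers."""
    return [q * scale for q in quants]


# Round trip on a small example block (shorter than BLOCK_SIZE for brevity).
original = [0.5, -1.27, 0.01, 1.0]
scale, quants = quantize_block(original)
restored = dequantize_block(scale, quants)
max_error = max(abs(a - b) for a, b in zip(original, restored))
print(quants, max_error)
```

The rounding error per value is at most half the scale, which is why a block with one large outlier loses more precision on its small values.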

1) (Re-)quantizing a GGUF file

cd /home/xintk/workspace/llama.cpp/build_cuda/bin

./llama-quantize

You can first run ./llama-quantize -h to see the available options.

./llama-quantize --allow-requantize /home/xintk/workspace/model/Llama3-8B-Chinese-Chat-GGUF/Llama3-8B-Chinese-Chat-q8_0-v2_1.gguf /home/xintk/workspace/model/Llama3-8B-Chinese-Chat-GGUF/Llama3-8B-Chinese-Chat-q4_1-v1.gguf Q4_1
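A rough way to see what going from Q8_0 to Q4_1 saves is to count bits per weight. The sketch below assumes the usual ggml block layouts: a Q8_0 block holds 32 int8 values plus one fp16 scale, and a Q4_1 block holds 32 4-bit values plus an fp16 scale and an fp16 minimum. The 8e9 parameter count is a round number for illustration, and real files differ because not every tensor is stored in the same type.

```python
BLOCK_WEIGHTS = 32  # weights per block for both types

# Bytes per block under the assumed layouts:
# Q8_0: 2-byte fp16 scale + 32 one-byte values
# Q4_1: 2-byte fp16 scale + 2-byte fp16 minimum + 16 bytes of packed 4-bit values
BLOCK_BYTES = {"Q8_0": 2 + 32, "Q4_1": 2 + 2 + 16}


def bits_per_weight(qtype: str) -> float:
    """Average storage cost of one weight, in bits."""
    return BLOCK_BYTES[qtype] * 8 / BLOCK_WEIGHTS


def estimated_size_gb(n_params: float, qtype: str) -> float:
    """Approximate model size in GB if every weight used this type."""
    return n_params * bits_per_weight(qtype) / 8 / 1e9


n_params = 8e9  # hypothetical round figure for an 8B model
for qtype in ("Q8_0", "Q4_1"):
    print(qtype, bits_per_weight(qtype), round(estimated_size_gb(n_params, qtype), 2))
```

Under these assumptions Q4_1 needs roughly 59% of the space of Q8_0. Note that re-quantizing an already quantized file generally loses more quality than quantizing from the original fp16 weights, which is why llama-quantize asks for --allow-requantize.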

2) Converting safetensors to GGUF

cd /home/xintk/workspace/llama.cpp

python convert-hf-to-gguf.py /home/xintk/workspace/model/Llama3-8B-Chinese-Chat --outfile /home/xintk/workspace/model/Llama3-8B-Chinese-Chat-GGUF/Llama3-8B-Chinese-Chat-q8_0-v1.gguf --outtype q8_0