Deploying Qwen-1.8B with fastllm on Linux (runs in 2 GB of VRAM)

Latest recommended article published on 2024-06-10 00:00:14

画山

Tags: artificial intelligence, deep learning, language models

Article link: https://blog.csdn.net/weixin_49651327/article/details/135910400

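This article walks through setting up the fastllm project in a specific environment: installing the Python API, downloading the Qwen-1.8B weights and a third-party library, then building, installing, and converting the model. It also shows how to test the model's speed and memory usage on a GPU.

Summary generated automatically by CSDN.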

1. Environment requirements

A gcc version newer than 9.4 is recommended. Check yours with:

gcc -v

Python is needed to convert the model weights, and fastllm also provides a Python API. I used Python 3.10.13.
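As a quick sanity check for the gcc requirement, here is a minimal sketch that pulls the version out of the `gcc version ...` line printed by `gcc -v` and compares it with 9.4. The helper names `parse_gcc_version` and `meets_minimum` are made up for this example, the sample line is an assumed illustration rather than output from my machine, and treating 9.4.x itself as acceptable is my assumption.

```python
import re


def parse_gcc_version(text):
    """Extract the version from the 'gcc version X.Y.Z' line of `gcc -v` output.

    Note: `gcc -v` writes this line to stderr, so capture stderr when running it.
    """
    match = re.search(r"gcc version (\d+)\.(\d+)(?:\.(\d+))?", text)
    if match is None:
        raise ValueError("no 'gcc version' line found")
    # Missing patch component is treated as 0.
    return tuple(int(part) for part in match.groups(default="0"))


def meets_minimum(version, minimum=(9, 4)):
    """Compare only as many components as the minimum specifies.

    Assumption: 9.4.x itself counts as new enough.
    """
    return version[:len(minimum)] >= minimum


# Illustrative sample line (assumed, not captured from a real run).
sample = "gcc version 9.4.0 (Ubuntu 9.4.0-1ubuntu1~20.04.1)"
version = parse_gcc_version(sample)
print(version, "ok" if meets_minimum(version) else "too old")
```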

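2. Download the required files

Download the Qwen-1.8B weights and the fastllm project to your machine.

For the Qwen-1.8B weights, users in mainland China are best served by the ModelScope community hub. Install its client first: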

pip install modelscope

Then download the weights, changing cache_dir to your own path:

from modelscope import snapshot_download
model_dir = snapshot_download("Qwen/Qwen-1_8B-Chat", revision = "v1.0.0", cache_dir='/root/source/model_path')

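Once the download finishes, the model files are in the directory returned as model_dir (the original screenshot of the listing is omitted).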
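To inspect what was downloaded without a file browser, a small sketch like the following lists every file with its size. `list_model_files` is a hypothetical helper written for this post, not part of modelscope, and I make no claim about which file names will appear.

```python
from pathlib import Path


def list_model_files(model_dir):
    """Return (relative path, size in bytes) for every file under model_dir, sorted by path."""
    root = Path(model_dir)
    return sorted(
        (path.relative_to(root).as_posix(), path.stat().st_size)
        for path in root.rglob("*")
        if path.is_file()
    )
```

Pass it the `model_dir` value returned by `snapshot_download`, for example `for name, size in list_model_files(model_dir): print(name, size)`.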

Clone the fastllm project:

git clone https://github.com/ztxz16/fastllm

Alternatively, download the project's zip archive and extract it.

Download the third-party library:

cd fastllm/third_party
git clone https://github.com/pybind/pybind11
git -C pybind11 checkout 0e2c3e5db41b6b2af4038734c84ab855ccaaa5f0
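A GitHub address of the form `.../tree/<commit>` is a browser URL, not a repository URL, so `git clone` cannot fetch it directly. Pinning pybind11 to a commit takes a clone of the repository followed by a checkout. The sketch below shows the mapping. Both function names are invented for this example, and it only builds the command lists without running git.

```python
from urllib.parse import urlsplit


def split_tree_url(url):
    """Split a GitHub '/<owner>/<repo>/tree/<ref>' browser URL into (clone URL, ref)."""
    parts = urlsplit(url)
    segments = parts.path.strip("/").split("/")
    if len(segments) < 4 or segments[2] != "tree":
        raise ValueError("not a GitHub tree URL: " + url)
    owner, repo = segments[0], segments[1]
    ref = "/".join(segments[3:])
    return f"{parts.scheme}://{parts.netloc}/{owner}/{repo}", ref


def pinned_clone_commands(url, dest):
    """Return the two git invocations (as argv lists) that clone and pin the repo."""
    repo_url, ref = split_tree_url(url)
    return [
        ["git", "clone", repo_url, dest],
        ["git", "-C", dest, "checkout", ref],
    ]


tree_url = "https://github.com/pybind/pybind11/tree/0e2c3e5db41b6b2af4038734c84ab855ccaaa5f0"
for command in pinned_clone_commands(tree_url, "pybind11"):
    print(" ".join(command))
```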

3. Build and install

In the fastllm directory, edit main.cpp to time each response.

#include "model.h"
#include <chrono> // added: needed for timing

// ... other code omitted ...

int main(int argc, char **argv) {
    int round = 0;
    std::string history = "";

    RunConfig config;
    fastllm::GenerationConfig generationConfig;
    ParseArgs(argc, argv, config, generationConfig);

    fastllm::PrintInstructionInfo();
    fastllm::SetThreads(config.threads);
    fastllm::SetLowMemMode(config.lowMemMode);
    auto model = fastllm::CreateLLMModelFromFile(config.path);

    static std::string modelType = model->model_type;
    printf("欢迎使用 %s 模型. 输入内容对话，reset清空历史记录，stop退出程序.\n", model->model_type.c_str());
    while (true) {
        printf("用户: ");
        std::string input;
        std::getline(std::cin, input);
        if (input == "reset") {
            history = "";
            round = 0;
            continue;
        }
        if (input == "stop") {
            break;
        }

        auto start_time = std::chrono::high_resolution_clock::now(); // start timing

        std::string ret = model->Response(model->MakeInput(history, round, input), [](int index, const char* content) {
            if (index == 0) {
                printf("%s:%s", modelType.c_str(), content);
                fflush(stdout);
            }
            if (index > 0) {
                printf("%s", content);
                fflush(stdout);
            }
            if (index == -1) {
                printf("\n");
            }
        }, generationConfig);

        auto end_time = std::chrono::high_resolution_clock::now(); // stop timing
        auto duration = std::chrono::duration_cast<std::chrono::milliseconds>(end_time - start_time).count();

        // Cast so the argument matches %lld regardless of the platform's rep type.
        printf("Response time: %lld ms\n", static_cast<long long>(duration));

        history = model->MakeHistory(history, round, input, ret);
        round++;
    }

    return 0;
}
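The same measure-around-the-call pattern carries over to Python if you drive the model from a script instead of main.cpp. Below is a minimal sketch using only the standard library. `timed_call` is a name I made up, and `fake_response` is a hypothetical stand-in for a real model call, so nothing here touches fastllm itself.

```python
import time


def timed_call(func, *args, **kwargs):
    """Call func and return (result, elapsed time in milliseconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return result, elapsed_ms


def fake_response(prompt):
    """Hypothetical stand-in for a model call; swap in a real one."""
    time.sleep(0.01)
    return "echo: " + prompt


reply, ms = timed_call(fake_response, "hello")
print(f"{reply} (response time: {ms:.0f} ms)")
```

`time.perf_counter` is used because it is a monotonic, high-resolution clock, which plays the role that `std::chrono::high_resolution_clock` plays in the C++ version.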

Then build and install:

# Run from the fastllm directory
mkdir build
cd build
cmake .. -DUSE_CUDA=ON # for a CPU-only build, use: cmake .. -DUSE_CUDA=OFF
make -j
cd tools && python setup.py install

If the environment is set up correctly, this step usually goes through without problems.
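If you script the build, the only decision is the value of the USE_CUDA flag. Here is a small sketch that picks it automatically. Using the presence of `nvcc` on PATH as a proxy for a working CUDA toolkit is my assumption, and `cmake_configure_args` is a made-up helper name.

```python
import shutil


def cmake_configure_args(use_cuda=None):
    """Build the cmake argument list for the configure step.

    When use_cuda is None, guess from whether nvcc is on PATH (an assumption,
    not something fastllm itself does).
    """
    if use_cuda is None:
        use_cuda = shutil.which("nvcc") is not None
    return ["cmake", "..", f"-DUSE_CUDA={'ON' if use_cuda else 'OFF'}"]


print(" ".join(cmake_configure_args()))
```

The returned list can be handed to `subprocess.run(..., cwd="build", check=True)` once the build directory exists.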

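The last command of the build step installs fastllm's simple Python tool package. If you skipped it, run it after the build finishes, starting from fastllm/build:

cd tools # now in fastllm/build/tools
python setup.py install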

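4. Model conversion

Edit the qwen2flm.py file in the tools directory.

Change the three model paths in it to your own model directory, and adjust the code below them so the model is easier to load (the original screenshot of the edit is omitted).

From the build directory, run the conversion script to produce an int4 model: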

python3 tools/qwen2flm.py int4

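Other formats can be chosen instead of int4; fastllm's export tools also support float16 and int8.

When the conversion finishes, a model.flm file of roughly 1.9 GB appears in the build directory.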
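Since the conversion command passes only the data type and the output is named model.flm, the edited script presumably reads the type from its first argument and writes to a fixed path. The sketch below shows one way that argument handling could look. It is my guess at the shape of the change, not the author's actual edit: `resolve_export_args`, the default of float16, and the list of accepted types are all assumptions, and the model loading and export calls are left out.

```python
import sys

# Assumed set of data types; check your fastllm version before relying on it.
SUPPORTED_DTYPES = ("float16", "int8", "int4")


def resolve_export_args(argv, default_dtype="float16", export_path="model.flm"):
    """Return (export_path, dtype) from a command line like: qwen2flm.py int4"""
    dtype = argv[1] if len(argv) >= 2 else default_dtype
    if dtype not in SUPPORTED_DTYPES:
        raise ValueError(f"unsupported dtype {dtype!r}, expected one of {SUPPORTED_DTYPES}")
    return export_path, dtype


# Demonstration with an explicit argv instead of sys.argv.
print(resolve_export_args(["qwen2flm.py", "int4"]))
```

In a real script you would call `resolve_export_args(sys.argv)` and pass the two values on to the export step.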

5. Testing

./main -p model.flm

As long as the model path and the arguments are correct, this generally just works.

You can check VRAM usage at the same time:

nvidia-smi

The model loads with about 1 GB of VRAM.

Generation speed is acceptable as well.

VRAM usage stays under 2 GB.
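To log VRAM usage from a script instead of reading the nvidia-smi table by eye, you can ask nvidia-smi for CSV output with `nvidia-smi --query-gpu=memory.used,memory.total --format=csv,noheader,nounits` and parse the result. Here is a minimal sketch of the parsing step. `parse_gpu_memory` is a made-up helper, and the sample text is invented for illustration, not a measurement from this deployment.

```python
def parse_gpu_memory(csv_text):
    """Parse 'used, total' rows (in MiB) from nvidia-smi CSV output, one row per GPU."""
    rows = []
    for line in csv_text.strip().splitlines():
        used, total = (int(field.strip()) for field in line.split(","))
        rows.append((used, total))
    return rows


# Invented sample output for two GPUs (values are placeholders).
sample = "1000, 8192\n0, 8192\n"
for index, (used, total) in enumerate(parse_gpu_memory(sample)):
    print(f"GPU {index}: {used} / {total} MiB")
```

To feed it live data, capture the command's stdout with `subprocess.run(..., capture_output=True, text=True)` and pass the `stdout` string to the function.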
