llama.cpp编译和运行 API调用

最新推荐文章于 2025-03-09 23:32:34 发布

beyond阿亮

最新推荐文章于 2025-03-09 23:32:34 发布

阅读量1.5k

点赞数 9

分类专栏： AI 文章标签： llama ai 人工智能 c++

本文链接：https://blog.csdn.net/yinjl123456/article/details/145166653

版权

AI 专栏收录该内容

10 篇文章

订阅专栏

llama.cpp编译和运行 API调用

llama.cpp介绍

llama.cpp是一个开源项目,官方地址：https://github.com/ggerganov/llama.cpp，使用纯 C/C++推理 Meta 的LLaMA模型,专门为在本地CPU上部署量化模型而设计。
它提供了一种简单而高效的方法，将训练好的量化模型转换为可在CPU上运行的低配推理版本,可加快推理速度并减少内存使用。

llama.cpp优势

高性能：llama.cpp针对CPU进行了优化，能够在保证精度的同时提供高效的推理性能。
低资源：由于采用了量化技术，llama.cpp可以显著减少模型所需的存储空间和计算资源,可运行在端侧设备上。
易集成：llama.cpp提供了简洁的API和接口，方便开发者将其集成到自己的项目中。
跨平台支持：llama.cpp可在多种操作系统和CPU架构上运行，具有很好的可移植性。

llama.cpp编译

安装编译环境
sudo apt install cmake g++ git

下载源代码
git clone https://github.com/ggerganov/llama.cpp


cd llama.cpp/
cd build/
编译
cmake ..
make

gcc --version
g++ --version
cmake .. -DCMAKE_CXX_FLAGS="-mavx -mfma"
    
cmake --build build --config Release -march=native -mtune=native
cmake -march=native -mtune=native --build build --config Release
cmake -DLLAMA_NATIVE=OFF
cmake -B build -DGGML_LLAMAFILE=OFF

编译完成后，会生成很多可执行文件，如图
在这里插入图片描述

llama.cpp运行

llama.cpp提供了与OpenAI API兼容的API接口，使用make生成的llama-server来启动API服务

本地启动 HTTP 服务器，使用端口：8080 指定Llama-3.1-8B-Instruct推理模型
.\llama-server.exe -m E:\ai_model\Imstudio-ai\lmstudio-community\Meta-Llama-3.1-8B-Instruct-GGUF\Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf --port 8080

调用API服务

curl --request POST     --url http://localhost:8080/completion
     --header "Content-Type: application/json"
     --data '{"prompt": "介绍一下llama.cpp"}'
     ```