[LLM] High-Performance LLM Inference on AMD GPUs


shen12138


Tags: python, language model

Article link: https://blog.csdn.net/shen12138/article/details/138388470


Reference 1: https://zhuanlan.zhihu.com/p/649088095
Reference 2: https://github.com/mlc-ai/mlc-llm

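I have a gaming PC at home. Its specs are well below what training large models requires, and it has an AMD graphics card, so I decided to try inference on it.
This article mainly follows the two links above and runs the model with the open-source mlc-llm project.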

Prerequisites

Environment: miniconda
Operating system: Windows 11
GPU: AMD RX 6600, 8 GB

I. Install mlc-llm

Open miniconda and create a new Python 3.10 virtual environment


# List the Python environments that are already installed
conda info -e
# Create a separate Python 3.10 environment named test
conda create --name test python=3.10
# Activate the environment
conda activate test

Install with pip

# Install mlc
python -m pip install --pre -U -f https://mlc.ai/wheels mlc-llm-nightly mlc-ai-nightly

# I forgot to run these two steps, which may be what caused the two problems described in section II
conda install -c conda-forge clang libvulkan-loader
conda install zstd

# Test that the package can be imported
python -c "import mlc_llm; print(mlc_llm.__path__)"
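
The one-line import test above can be extended to check several modules at once without actually importing them. This is only a minimal sketch: `missing_modules` is a helper name made up for illustration, and `tvm` is assumed to be the module that the `mlc-ai-nightly` wheel provides.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that the import system cannot find."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # "mlc_llm" is the module imported in the test command above;
    # "tvm" is an assumption about what mlc-ai-nightly installs.
    missing = missing_modules(["mlc_llm", "tvm"])
    print("missing modules:", missing or "none")
```

An empty result only means the packages are discoverable in the active conda environment; it does not prove that the GPU backend works.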

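Download the model
This download is extremely slow, so I used a mirror site, downloaded the files manually with Xunlei (Thunder), and pointed to the local copy at run time.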

# Online download
mlc_llm chat HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC

# Offline download URL
https://hf-mirror.com/mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC/tree/main
Click the download button for each file and save them to D:\llm\Llama-3-8B-Instruct-q4f16_1-MLC
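
Since the files are downloaded by hand, it is easy to miss one. A rough sanity check of the local directory might look like the sketch below. The file names in `EXPECTED_FILES` are my assumption about what an MLC model repository contains; compare them against the file list on the mirror page and adjust. `missing_model_files` is a hypothetical helper, not part of mlc-llm.

```python
from pathlib import Path

# Assumed file names; check them against the repository's actual file list.
EXPECTED_FILES = ("mlc-chat-config.json", "ndarray-cache.json")


def missing_model_files(model_dir, expected=EXPECTED_FILES):
    """Return the expected file names that are not present in model_dir."""
    root = Path(model_dir)
    return [name for name in expected if not (root / name).is_file()]


if __name__ == "__main__":
    # Directory used in this article
    print(missing_model_files(r"D:\llm\Llama-3-8B-Instruct-q4f16_1-MLC"))
```

This does not verify the weight shards themselves, only that the listed files exist.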

Run
Create a test.py script with the following content

from mlc_llm import MLCEngine

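# Create engine
# The HF:// form downloads the model; the plain name is the local directory,
# resolved relative to the current working directory (here D:\llm)
#model = "HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC"
model = "Llama-3-8B-Instruct-q4f16_1-MLC"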
engine = MLCEngine(model)

# Run chat completion in OpenAI API.
for response in engine.chat.completions.create(
    messages=[{"role": "user", "content": "What is the meaning of life?"}],
    model=model,
    stream=True,
):
    for choice in response.choices:
        print(choice.delta.content, end="", flush=True)
print("\n")

engine.terminate()

Run the command

python test.py

Result: [screenshot not included]
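
The streaming loop in test.py prints each delta directly. If the full answer is needed as one string, or if some chunk happens to carry no text (in which case `print` would show `None`), the deltas can be collected first. The sketch below is an illustration under those assumptions: `collect_stream` and `fake_chunk` are hypothetical helpers, and the fake chunks only mimic the `choices[].delta.content` shape used in test.py so that the example runs without a GPU.

```python
from types import SimpleNamespace


def collect_stream(responses):
    """Join the text deltas of a streamed chat completion into one string."""
    parts = []
    for response in responses:
        for choice in response.choices:
            # Treat a None delta as empty text instead of the string "None"
            parts.append(choice.delta.content or "")
    return "".join(parts)


def fake_chunk(text):
    """Build a stand-in object with the same attribute layout as a stream chunk."""
    delta = SimpleNamespace(content=text)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


if __name__ == "__main__":
    chunks = [fake_chunk("Hello"), fake_chunk(None), fake_chunk(" world")]
    print(collect_stream(chunks))
```

With the real engine, the iterable returned by `engine.chat.completions.create(..., stream=True)` would be passed in place of `chunks`.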

Run in serve mode, which allows calls through a REST API

mlc_llm serve .\Llama-3-8B-Instruct-q4f16_1-MLC\

Call it from a script

import requests
import json

# Get a response using a prompt with streaming
payload = {
    # This value should correspond to the model the server was started with
    # (the serve command above uses the Llama-3 directory)
    "model": "../Llama-2-7b-chat-hf-q4f16_1-MLC/",
    # Prompt, in Chinese: "Why did Lu Xun hit Zhou Shuren? Answer me in Chinese."
    "messages": [{"role": "user", "content": "鲁迅为什么会打周树人？用中文回答我"}],
    "stream": True,
}
with requests.post("http://127.0.0.1:8000/v1/chat/completions", json=payload, stream=True) as r:
    for chunk in r.iter_content(chunk_size=None):
        chunk = chunk.decode("utf-8")
        # Each chunk is expected to start with the 6-character prefix "data: "
        if "[DONE]" in chunk[6:]:
            break
        response = json.loads(chunk[6:])
        if len(response["choices"]) > 0:
            content = response["choices"][0]["delta"].get("content", "")
            print(content, end="", flush=True)
print("\n")

Result: [screenshot not included]
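
The script above slices each raw chunk with `chunk[6:]`, which assumes that every chunk holds exactly one `data: ` event. A somewhat more defensive variant parses the stream line by line. This is a sketch rather than the project's own client: `iter_sse_content` is a hypothetical helper, and it assumes the server sends OpenAI-style `data: {...}` lines terminated by `data: [DONE]`.

```python
import json


def iter_sse_content(lines):
    """Yield text deltas from the 'data: ...' lines of an OpenAI-style stream."""
    for raw in lines:
        line = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and any other fields
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # end-of-stream marker
        event = json.loads(data)
        for choice in event.get("choices", []):
            text = choice.get("delta", {}).get("content")
            if text:
                yield text


if __name__ == "__main__":
    sample = [
        b'data: {"choices": [{"delta": {"content": "Hi"}}]}',
        b"",
        b"data: [DONE]",
    ]
    print("".join(iter_sse_content(sample)))
```

With `requests`, the helper could be fed from `r.iter_lines()` instead of `r.iter_content(chunk_size=None)`, printing each yielded piece with `end=""` and `flush=True` as before.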

II. Problems Encountered

Error because clang is not installed: install it manually

Download link: https://releases.llvm.org/
Extract the archive to D:\llm\clang+llvm-18.1.0-x86_64-pc-windows-msvc and configure the environment variable (add its bin directory to PATH)

Close the cmd window, open a new one, and run the following command to check that the change took effect

clang -v
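
The same check can be done from Python, which is handy for confirming that the conda environment running test.py sees the updated PATH. A minimal sketch, with `find_tools` being a helper name chosen here for illustration:

```python
import shutil


def find_tools(names):
    """Map each executable name to its full path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    for name, path in find_tools(["clang"]).items():
        print(f"{name}: {path or 'not found on PATH'}")
```

If `clang` is reported as not found even though `clang -v` works in another window, the current shell was probably opened before the environment variable was changed.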


MSVC not installed
Running python test.py reports the following error

clang: warning: unable to find a Visual Studio installation; 
try running Clang from a developer command prompt [-Wmsvc-not-found]

Download and install Visual Studio
Download link: https://visualstudio.microsoft.com/zh-hans/downloads/

Close the command line, open it again, and rerun the command.