Introduction
This article is a guide to using the Qwen-Audio-Chat audio large model offline with swift (roughly 16 GB of GPU memory required).
Environment Setup
conda create -n swift python=3.10.16
conda activate swift
pip install ms-swift==3.2.0.post2
pip install vllm==0.7.3
pip install lmdeploy==0.7.1
pip install transformers==4.49.0
conda install ffmpeg=4.3 --channel conda-forge
Running Inference from the Command Line
MODELSCOPE_CACHE=./.cache/modelscope/hub CUDA_VISIBLE_DEVICES=0 swift infer \
--model Qwen/Qwen-Audio-Chat \
--infer_backend pt
<<< 你是谁?
我是来自达摩院的大规模语言模型,我叫通义千问。
--------------------------------------------------
<<< <audio>这是首什么样的音乐
Input an audio path or URL <<< http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/music.wav
这是一首风格是Pop的音乐。
--------------------------------------------------
<<< <audio>这段语音说了什么
Input an audio path or URL <<< http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/weather.wav
这段语音中说了中文:"今天天气真好呀"。
--------------------------------------------------
<<< 这段语音是男生还是女生
根据音色判断,这段语音是男性。
--------------------------------------------------
Parameter explanations:
MODELSCOPE_CACHE: path where the model weights are stored.
CUDA_VISIBLE_DEVICES: GPU index to use.
model: name of the model to run. All supported models are listed in the "Supported Models and Datasets" document.
infer_backend: inference backend.
For more inference parameters, see here.
Deploying the Model as a Service
MODELSCOPE_CACHE=./.cache/modelscope/hub CUDA_VISIBLE_DEVICES=0 swift deploy \
--model Qwen/Qwen-Audio-Chat \
--infer_backend pt \
--served_model_name Qwen-Audio-Chat \
--port 8001
Parameter explanations:
MODELSCOPE_CACHE: path where the model weights are stored.
CUDA_VISIBLE_DEVICES: GPU index to use.
model: name of the model to deploy. All supported models are listed in the "Supported Models and Datasets" document.
infer_backend: inference backend.
served_model_name: alias under which the deployed model is served.
port: port number; defaults to 8000.
For more deployment parameters, see here.
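Before wiring up a client, it can help to confirm the service is up. A minimal sketch, assuming the deployment exposes the OpenAI-compatible `/v1/models` route (the same route the openai client below uses to list models):

```python
import json
from urllib.request import urlopen

def model_ids(payload):
    """Extract model ids from an OpenAI-style /v1/models response body."""
    return [m['id'] for m in payload.get('data', [])]

def list_served_models(base_url='http://127.0.0.1:8001'):
    """Query the running deployment for the models it serves."""
    with urlopen(f'{base_url}/v1/models') as resp:
        return model_ids(json.load(resp))
```

With the server started as above, `list_served_models()` should return `['Qwen-Audio-Chat']`.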
Calling the Model API from a Client
- curl
- Input example
curl http://localhost:8001/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen-Audio-Chat",
"messages": [{"role": "user", "content": [
{"type": "audio", "audio": "http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/weather.wav"},
{"type": "text", "text": "What does this audio say?"}
]}]
}'
- Output example
{
"model": "Qwen-Audio-Chat",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "The audio says: \"今天天气真好呀\".",
"tool_calls": null
},
"finish_reason": "stop",
"logprobs": null
}
],
"usage": {
"prompt_tokens": 83,
"completion_tokens": 12,
"total_tokens": 95
},
"id": "chatcmpl-692050fbe65b4c06bc7872816f23f410",
"object": "chat.completion",
"created": 1741684676
}
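The usage block reports token accounting for the call. Its fields are internally consistent (83 + 12 = 95), which can be checked programmatically:

```python
# Usage counters copied from the sample response above
usage = {"prompt_tokens": 83, "completion_tokens": 12, "total_tokens": 95}

# total_tokens is the sum of prompt and completion tokens
assert usage["prompt_tokens"] + usage["completion_tokens"] == usage["total_tokens"]
```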
- openai
- Code example
from openai import OpenAI
client = OpenAI(
api_key='EMPTY',
    base_url='http://127.0.0.1:8001/v1',
)
model = client.models.list().data[0].id
print(f'model: {model}')
messages = [{'role': 'user', 'content': [
{'type': 'audio', 'audio': 'http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/weather.wav'},
{'type': 'text', 'text': 'What does this audio say?'}
]}]
resp = client.chat.completions.create(model=model, messages=messages, max_tokens=512, temperature=0)
query = messages[0]['content']
response = resp.choices[0].message.content
print(f'query: {query}')
print(f'response: {response}')
- Output example
model: Qwen-Audio-Chat
query: [{'type': 'audio', 'audio': 'http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/weather.wav'}, {'type': 'text', 'text': 'What does this audio say?'}]
response: The audio says: "今天天气真好呀".
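For interactive use, the same request can also be streamed token by token. A hedged sketch, assuming the deployment honors the OpenAI-compatible `stream=True` flag (the field names follow the OpenAI Python SDK; not verified against this particular server):

```python
def stream_chat(client, model, messages):
    """Yield text fragments as the server streams a chat completion."""
    stream = client.chat.completions.create(
        model=model, messages=messages,
        max_tokens=512, temperature=0, stream=True)
    for chunk in stream:
        delta = chunk.choices[0].delta.content
        if delta:  # the final chunk may carry no content
            yield delta

# Usage with the client from the code example above:
#   for fragment in stream_chat(client, model, messages):
#       print(fragment, end='', flush=True)
```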
- swift
- Code example
from swift.llm import InferRequest, InferClient, RequestConfig
from swift.plugin import InferStats
engine = InferClient(host='127.0.0.1', port=8001)
print(f'models: {engine.models}')
metric = InferStats()
request_config = RequestConfig(max_tokens=512, temperature=0)
# Use two infer_requests to demonstrate batch inference
infer_requests = [
InferRequest(messages=[{'role': 'user', 'content': 'who are you?'}]),
InferRequest(messages=[{'role': 'user', 'content': '<audio>What does this audio say?'}],
audios=['http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/weather.wav']),
]
resp_list = engine.infer(infer_requests, request_config, metrics=[metric])
print(f'response0: {resp_list[0].choices[0].message.content}')
print(f'response1: {resp_list[1].choices[0].message.content}')
print(metric.compute())
metric.reset()
- Output example
models: ['Qwen-Audio-Chat']
100%|██████████| 2/2 [00:01<00:00, 1.10it/s]
response0: I am a large language model created by DAMO Academy. I am called QianWen.
response1: The audio says: "今天天气真好呀".
{'num_prompt_tokens': 106, 'num_generated_tokens': 33, 'num_samples': 2, 'runtime': 1.8124737851321697, 'samples/s': 1.1034642356795, 'tokens/s': 18.20715988871175}
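The throughput figures reported by InferStats follow directly from the raw counters: samples/s is num_samples / runtime and tokens/s is num_generated_tokens / runtime, which can be reproduced from the printed dict:

```python
# Counters copied from the InferStats output above
stats = {'num_prompt_tokens': 106, 'num_generated_tokens': 33,
         'num_samples': 2, 'runtime': 1.8124737851321697}

# Derived throughput matches the reported samples/s and tokens/s
samples_per_s = stats['num_samples'] / stats['runtime']
tokens_per_s = stats['num_generated_tokens'] / stats['runtime']
print(samples_per_s, tokens_per_s)  # ≈ 1.1035 samples/s, ≈ 18.207 tokens/s
```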
References
https://swift.readthedocs.io/zh-cn/latest/Instruction/推理和部署.html