A Detailed Guide to Deploying the Qwen2.5-7B Model

Last modified 2024-10-18 10:34:46

First published 2024-10-17 16:48:58

This article is based on the tutorial at: https://qwen.readthedocs.io/en/latest/getting_started/quickstart.html

Downloading the model

https://modelscope.cn/organization/qwen

Search for qwen2.5-7b.

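The search returns six models aimed at different needs. Judging by download counts, Qwen2.5-7B-Instruct is far ahead of the other five.

1. Pre-training and base models: These are the foundation language models. Using a base model directly for conversation is not recommended. Instead, treat it as the starting point for post-training such as SFT, RLHF, or continued pre-training. Base models do not carry the Instruct suffix; they are suited to in-context learning, downstream fine-tuning, and similar uses.

2. Post-training and instruction-tuned models: These are designed to understand specific instructions and carry them out in a conversational style. They are fine-tuned to interpret user commands accurately and to perform tasks such as summarization, translation, and question answering with higher accuracy and consistency. Unlike base models, which are trained on large text corpora, instruction-tuned models receive additional training on datasets of example instructions paired with their expected results, often spanning multiple turns. This makes them well suited to applications that need specific functionality, while they keep the ability to generate fluent, coherent text.

The -Instruct models correspond to what were previously called -Chat models.

3. GGUF: Models saved in the GGUF format, for use with llama.cpp.

4. AWQ and GPTQ quantization: Versions compressed with a quantization algorithm to lower the hardware bar for deployment; they are covered later.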

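VRAM requirements

As a rule of thumb, the VRAM needed to load a model, in GB, is about twice its parameter count in billions: a 7B model needs about 14 GB. The reason is that large language models normally compute in 16-bit floating point, so each parameter takes 2 bytes. Inference needs additional VRAM on top of that to hold activations.

With transformers, loading with torch_dtype="auto" is recommended, so that the model is loaded in the data type stored with its checkpoint, which is bfloat16 for Qwen2.5. Otherwise the model is loaded in float32 by default and the required VRAM doubles. You can also pass torch.bfloat16 or torch.float16 explicitly as torch_dtype.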
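The rule of thumb above is simple arithmetic, and a small sketch makes the dtype effect concrete. The function name and the round figure of 7 billion parameters are illustrative assumptions; the estimate covers weights only, not activations.

```python
# Bytes used per parameter for common dtypes.
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "float16": 2}


def weight_memory_gb(num_params: float, dtype: str = "bfloat16") -> float:
    """Return the memory needed to hold the weights, in GB (10^9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


# A nominal 7-billion-parameter model (assumed round figure).
print(weight_memory_gb(7e9, "bfloat16"))  # 14.0
print(weight_memory_gb(7e9, "float32"))   # 28.0
```

The float32 figure is exactly double the bfloat16 one, which is why forgetting torch_dtype can push a 7B model past the capacity of a 24 GB card.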

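Multi-GPU inference

transformers relies on accelerate for multi-GPU inference. The implementation is a simple form of model parallelism: different GPUs compute different layers of the model, and the assignment is set by device_map="auto" or by a custom device_map.

This approach is not efficient, however. For a single request, only one GPU is computing at any moment while the others wait. To keep every GPU busy, you would have to schedule multiple sequences like a pipeline so that each GPU always has work. That requires concurrency management and load balancing, which are outside the scope of the transformers library. Even with all of that in place, higher concurrency would raise overall throughput, but per-request latency would still be far from ideal.

For multi-GPU inference, a dedicated inference framework such as vLLM or TGI is recommended; these frameworks support tensor parallelism.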
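The layer-wise placement described above can be pictured with a small sketch that builds a custom device_map by hand. The module names (model.embed_tokens, model.layers.N, model.norm, lm_head) and the count of 28 decoder layers are assumptions about the Qwen2.5-7B checkpoint; a real map must cover every module of the loaded model, so compare it against model.named_modules() before use.

```python
def layerwise_device_map(num_layers: int, num_gpus: int) -> dict[str, int]:
    """Assign contiguous blocks of decoder layers to GPUs 0..num_gpus-1."""
    per_gpu = -(-num_layers // num_gpus)  # ceiling division
    # The embedding sits with the first block of layers (assumed module name).
    device_map = {"model.embed_tokens": 0}
    for i in range(num_layers):
        device_map[f"model.layers.{i}"] = i // per_gpu
    # The final norm and output head live with the last block of layers.
    last = (num_layers - 1) // per_gpu
    device_map["model.norm"] = last
    device_map["lm_head"] = last
    return device_map


dm = layerwise_device_map(num_layers=28, num_gpus=2)
print(dm["model.layers.13"], dm["model.layers.14"])  # 0 1
```

A request flows through GPU 0 for layers 0 to 13 and then through GPU 1 for layers 14 to 27, so only one device is active at a time. That is the inefficiency tensor parallelism avoids.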

INFERENCE

1. Hugging Face Transformers

https://qwen.readthedocs.io/zh-cn/latest/inference/chat.html

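transformers supports two ways of working: calling the model manually and using a pipeline.

Both multi-turn conversation and streaming output are supported.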
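As a rough sketch of the manual route, the helper below generates one reply and appends it to the message history, which is all that continuing a conversation requires. It follows the pattern in the Qwen documentation linked above; the helper names chat and main are illustrative, and main downloads the model, so it is defined but not called here.

```python
def chat(model, tokenizer, messages, max_new_tokens=512, streamer=None):
    """Generate one assistant reply and append it to `messages`."""
    text = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    inputs = tokenizer([text], return_tensors="pt").to(model.device)
    output_ids = model.generate(
        **inputs, max_new_tokens=max_new_tokens, streamer=streamer
    )
    # Drop the prompt tokens so only the newly generated part is decoded.
    new_ids = output_ids[0][inputs.input_ids.shape[1]:]
    reply = tokenizer.decode(new_ids, skip_special_tokens=True)
    messages.append({"role": "assistant", "content": reply})
    return reply


def main():
    # Imported lazily so the helper above can be reused without loading the library.
    from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer

    name = "Qwen/Qwen2.5-7B-Instruct"
    model = AutoModelForCausalLM.from_pretrained(
        name, torch_dtype="auto", device_map="auto"
    )
    tokenizer = AutoTokenizer.from_pretrained(name)
    # TextStreamer prints tokens to stdout as they are generated.
    streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

    messages = [{"role": "system", "content": "You are a helpful assistant."}]
    questions = [
        "Tell me something about large language models.",
        "Summarize that in one sentence.",
    ]
    for question in questions:
        messages.append({"role": "user", "content": question})
        chat(model, tokenizer, messages, streamer=streamer)
```

Calling main() runs a two-turn conversation. The second question only makes sense because the first reply is already in messages when the chat template is applied again.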

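2. ModelScope

ModelScope is used in essentially the same way as Transformers, but downloading and installing tend to be faster. To switch, change the package name from transformers to modelscope.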

https://github.com/modelscope/modelscope

https://www.modelscope.cn/docs

RUN LOCALLY

1. llama.cpp

https://qwen.readthedocs.io/zh-cn/latest/run_locally/llama.cpp.html

https://github.com/ggerganov/llama.cpp

https://www.cnblogs.com/ghj1976/p/18063411/gguf-mo-xing

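llama.cpp is a tool for running large models. Its hallmark is starting a model for inference at minimal cost, and it allows LLMs to run inference on a CPU.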

The main goal of llama.cpp is to enable LLM inference with minimal setup and state-of-the-art performance on a wide variety of hardware - locally and in the cloud.

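Pure C/C++ implementation with no external dependencies
Broad hardware support:
- AVX, AVX2, and AVX512 on x86_64 CPUs
- Apple Silicon (CPU and GPU) through Metal and Accelerate
- NVIDIA GPUs (through CUDA), AMD GPUs (through hipBLAS), Intel GPUs (through SYCL), Ascend NPUs (through CANN), and Moore Threads GPUs (through MUSA)
- A Vulkan backend for GPUs
Multiple quantization schemes to speed up inference and reduce memory use
Hybrid CPU+GPU inference to accelerate models larger than the total VRAM capacity

Users who are not fluent in C++ will mostly work through the llama-cli tool.

llama.cpp requires models to be saved in GGUF format. It also supports splitting one large model across multiple files.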
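Before handing a downloaded file to llama-cli, it can be useful to confirm that it really is GGUF. The sketch below checks only the first eight bytes, on the understanding that a GGUF file begins with the ASCII magic "GGUF" followed by a little-endian 32-bit version number; the function name is illustrative.

```python
import struct


def read_gguf_version(path: str) -> int:
    """Return the GGUF format version, or raise ValueError if the file is not GGUF."""
    with open(path, "rb") as f:
        header = f.read(8)
    if len(header) < 8 or header[:4] != b"GGUF":
        raise ValueError(f"{path} does not start with the GGUF magic bytes")
    # The 4 bytes after the magic hold the format version (little-endian uint32).
    (version,) = struct.unpack("<I", header[4:8])
    return version
```

For a model split across several files, each shard can be checked this way, since every shard is expected to be a GGUF file in its own right.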

2. Ollama

https://github.com/ollama/ollama

https://ollama.com/

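Ollama is written in Go. It is a Docker-like tool for managing and running LLMs, built on the low-level functionality of llama.cpp. Besides GGUF, it can also import PyTorch checkpoints and Safetensors weights.

It can run LLMs through both a CLI and an API.

Ollama does not host base models. Even when a model tag lacks the instruct suffix, the model is actually an instruct model.

Ollama can also run models already saved locally.

To keep the names apart: LLaMA is a pre-trained large language model open-sourced by Meta, llama.cpp loads and runs language models in GGUF format, and Ollama is a framework for running large models.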
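The API mode mentioned above can be exercised from Python with only the standard library. This is a minimal sketch that assumes Ollama is listening on its default address, http://localhost:11434, that the /api/chat endpoint returns the reply under message.content when streaming is disabled, and that a model tagged qwen2.5:7b has already been pulled. The helper names are illustrative.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/chat"  # assumed default local address


def build_chat_request(model: str, user_message: str) -> urllib.request.Request:
    """Build a POST request for a single-turn chat."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
        "stream": False,  # ask for one JSON object instead of a stream of chunks
    }
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(model: str, user_message: str) -> str:
    """Send the request to a running Ollama server and return the reply text."""
    with urllib.request.urlopen(build_chat_request(model, user_message)) as resp:
        body = json.load(resp)
    return body["message"]["content"]
```

With the server running, ask("qwen2.5:7b", "Tell me something about large language models.") returns the reply as a string.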

WEB UI

Text Generation Web UI (TGW for short)

git clone https://github.com/oobabooga/text-generation-webui
cd text-generation-webui

Run the script that matches your operating system: start_linux.sh on Linux, start_windows.bat on Windows, start_macos.sh on macOS, or start_wsl.bat on Windows Subsystem for Linux (WSL).

Then open: http://localhost:7860/?__theme=dark

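DEPLOYMENT

1. vLLM

vLLM is an open-source, high-throughput inference framework for large language models that originated at UC Berkeley. It aims to greatly improve the throughput and memory efficiency of serving language models in real-time settings. vLLM is a fast, easy-to-use library for LLM inference and serving that integrates seamlessly with Hugging Face models. It relies on the PagedAttention algorithm to manage attention keys and values efficiently. According to its authors, throughput can reach up to 24x that of Hugging Face Transformers and up to 3.5x that of Text Generation Inference (TGI), with no changes to the model architecture. On less favorable workloads the gain is smaller, on the order of 2x the inference speed of the Hugging Face Transformers library.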

Running Qwen with vLLM: https://qwen.readthedocs.io/en/latest/deployment/vllm.html

vLLM documentation: https://docs.vllm.ai/en/stable/

vLLM is the recommended way to deploy Qwen2.5.

# vLLM>=0.4.0

pip install vllm

# Start an OpenAI-compatible API server
python -m vllm.entrypoints.openai.api_server --model Qwen/Qwen2.5-7B-Instruct

With vllm>=0.5.3, the server can also be started like this:

vllm serve Qwen/Qwen2.5-7B-Instruct

By default the server starts at http://localhost:8000. Use the --host and --port arguments to change the address. To send a request with curl:

curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "Qwen/Qwen2.5-7B-Instruct",
  "messages": [
    {"role": "system", "content": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."},
    {"role": "user", "content": "Tell me something about large language models."}
  ],
  "temperature": 0.7,
  "top_p": 0.8,
  "repetition_penalty": 1.05,
  "max_tokens": 512
}'

Alternatively, use the Python client from the openai package, as shown below:

from openai import OpenAI

# Point the OpenAI client at vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

chat_response = client.chat.completions.create(
    model="Qwen/Qwen2.5-7B-Instruct",
    messages=[
        {"role": "system", "content": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."},
        {"role": "user", "content": "Tell me something about large language models."},
    ],
    temperature=0.7,
    top_p=0.8,
    max_tokens=512,
    # repetition_penalty is not an OpenAI parameter, so it is passed through extra_body.
    extra_body={
        "repetition_penalty": 1.05,
    },
)
print("Chat response:", chat_response)

You can also use the vLLM package directly to run the model.

from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

# Initialize the tokenizer
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")

# Pass the default decoding hyperparameters of Qwen2.5-7B-Instruct
# max_tokens is the maximum length for generation.
sampling_params = SamplingParams(temperature=0.7, top_p=0.8, repetition_penalty=1.05, max_tokens=512)

# Input the model name or path. Can be GPTQ or AWQ models.
llm = LLM(model="Qwen/Qwen2.5-7B-Instruct")

# Prepare your prompts
prompt = "Tell me something about large language models."
messages = [
    {"role": "system", "content": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."},
    {"role": "user", "content": prompt},
]
# Render the chat messages into the plain-text prompt format the model expects.
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)

# Generate outputs
outputs = llm.generate([text], sampling_params)

# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

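Tool (function) calling

Structured/JSON output

Multi-GPU distributed deployment

Context length extension: YaRN

Deploying quantized models: only the quantization argument needs to be set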
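Two of the items above, multi-GPU deployment and quantized models, come down to extra arguments when starting the server. The sketch below assembles such a command line; it assumes the vllm serve flags --tensor-parallel-size and --quantization, and uses the AWQ variant of the model as an example. The helper name is illustrative, so check the vLLM documentation for the options your version supports.

```python
import shlex


def vllm_serve_command(model, tensor_parallel_size=1, quantization=None, port=8000):
    """Build the argument list for starting a vLLM OpenAI-compatible server."""
    cmd = ["vllm", "serve", model, "--port", str(port)]
    if tensor_parallel_size > 1:
        # Shard each layer across this many GPUs (tensor parallelism).
        cmd += ["--tensor-parallel-size", str(tensor_parallel_size)]
    if quantization is not None:
        # Name of the quantization method the checkpoint was produced with.
        cmd += ["--quantization", quantization]
    return cmd


cmd = vllm_serve_command(
    "Qwen/Qwen2.5-7B-Instruct-AWQ", tensor_parallel_size=2, quantization="awq"
)
print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run(cmd, check=True) to launch the server.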

2. TGI

https://qwen.readthedocs.io/en/latest/deployment/tgi.html

Hugging Face's Text Generation Inference (TGI) is a production-grade framework designed for deploying large language models (LLMs). TGI offers a smooth deployment experience and stable support for the following features:

Speculative Decoding: speeds up generation.
Tensor Parallelism: efficient multi-GPU deployment.
Token Streaming: tokens are returned as they are generated.
Flexible hardware support: works with AMD, Gaudi, and [AWS Inferentia](https://aws.amazon.com/blogs/machine-learning/announcing-the-launch-of-new-hugging-face-llm-inference-containers-on-amazon-sagemaker/#:~:text=Get%20started%20with%20TGI%20on%20SageMaker%20Hosting).

Deploying Qwen2.5 with TGI through Docker:

model=Qwen/Qwen2.5-7B-Instruct
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run

docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:2.0 --model-id $model

Query the server:

curl http://localhost:8080/generate_stream -H 'Content-Type: application/json' \
        -d '{"inputs":"Tell me something about large language models.","parameters":{"max_new_tokens":512}}'
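The generate_stream endpoint answers with server-sent events rather than a single JSON body. The sketch below extracts the generated text from such a stream. It assumes each event is a line of the form data:{...} carrying a token object with text and special fields; the sample lines are hand-written for illustration, not captured from a server.

```python
import json


def iter_stream_text(lines):
    """Yield the text of each non-special token from TGI-style event lines."""
    for raw in lines:
        line = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and other SSE fields
        event = json.loads(line[len("data:"):])
        token = event.get("token") or {}
        if not token.get("special", False):
            yield token.get("text", "")


# Hand-written sample events (assumed layout).
sample = [
    b'data:{"token":{"id":1,"text":"Large","special":false},"generated_text":null}\n',
    b"\n",
    b'data:{"token":{"id":2,"text":" models","special":false},"generated_text":null}\n',
]
print("".join(iter_stream_text(sample)))  # Large models
```

The same generator can be fed the line iterator of an HTTP response opened in streaming mode.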

TGI can also be accessed through an OpenAI-style API. Note that TGI ignores the model field in the JSON body, so you can pass any value.

curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "",
  "messages": [
    {"role": "system", "content": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."},
    {"role": "user", "content": "Tell me something about large language models."}
  ],
  "temperature": 0.7,
  "top_p": 0.8,
  "repetition_penalty": 1.05,
  "max_tokens": 512
}'

For the complete API documentation, see the [TGI Swagger UI](https://huggingface.github.io/text-generation-inference/#/Text%20Generation%20Inference/completions).

You can also access it from Python:

from openai import OpenAI

# Initialize the client but point it to TGI
client = OpenAI(
    base_url="http://localhost:8080/v1/",  # replace with your endpoint url
    api_key="EMPTY",  # placeholder; the key is not used when running locally
)
chat_completion = client.chat.completions.create(
    model="",  # it is not used by TGI, you can put anything
    messages=[
        {"role": "system", "content": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."},
        {"role": "user", "content": "Tell me something about large language models."},
    ],
    stream=True,
    temperature=0.7,
    top_p=0.8,
    max_tokens=512,
)

# Iterate over the stream and print each chunk as it arrives
for message in chat_completion:
    # The content of a chunk can be None (for example in the final chunk), so fall back to "".
    print(message.choices[0].delta.content or "", end="")