Deploying deepseek-ai/DeepSeek-R1-Distill-Qwen-32B: SGLang measured ~30% faster than vLLM, LMDeploy ~50% faster

Results

In my tests SGLang reached 30.47 tokens/s versus 23.23 tokens/s for vLLM, roughly 30% faster.
LMDeploy reached 34.96 tokens/s, roughly 50% faster than vLLM.
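
The rounded percentages follow directly from the measured rates; a quick check in plain Python:

# Reproducing the headline speedups from the measured decode rates above.
vllm, sglang, lmdeploy = 23.23, 30.47, 34.96
print(f"SGLang vs vLLM:   {(sglang / vllm - 1) * 100:.1f}% faster")    # ~31.2%
print(f"LMDeploy vs vLLM: {(lmdeploy / vllm - 1) * 100:.1f}% faster")  # ~50.5%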

Installing SGLang

Prerequisites

SGLang currently supports only torch 2.5.1, so upgrade first if your version differs.
The upgrade triggers a dependency-conflict warning (vllm-flash-attn 2.6.1 requires torch==2.4.0),
but it does not affect vLLM at runtime.

pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1
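
A quick way to confirm the upgrade took effect is to check what PyTorch reports (a minimal sketch; the expected values are simply the ones from my environment below):

import torch

print(torch.__version__)          # expect 2.5.1+cu124
print(torch.version.cuda)         # expect 12.4
print(torch.cuda.is_available())  # expect True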

Below is the environment in which the install completed and ran successfully. My CUDA version is 12.4; the other components matter less.

$ python3 -m sglang.check_env
INFO 02-12 15:17:04 __init__.py:190] Automatically detected platform cuda.
Python: 3.11.7 (main, Dec 15 2023, 18:12:31) [GCC 11.2.0]
CUDA available: True
GPU 0,1,2,3,4,5,6,7,8: NVIDIA GeForce RTX 3090
GPU 0,1,2,3,4,5,6,7,8 Compute Capability: 8.6
CUDA_HOME: /usr/local/cuda-12.4
NVCC: Cuda compilation tools, release 12.4, V12.4.99
CUDA Driver Version: 550.54.14
PyTorch: 2.5.1+cu124
sglang: 0.4.2.post4
sgl_kernel: 0.0.3.post3
flashinfer: 0.2.0.post2+cu124torch2.5
triton: 3.1.0
transformers: 4.48.3
torchao: 0.8.0
numpy: 1.26.4
aiohttp: 3.8.6
fastapi: 0.110.3
hf_transfer: Module Not Found
huggingface_hub: 0.24.6
interegular: 0.3.3
modelscope: 1.17.1
orjson: 3.10.7
packaging: 23.2
psutil: 5.9.8
pydantic: 2.9.2
multipart: 0.0.9
zmq: 25.1.2
uvicorn: 0.30.6
uvloop: 0.20.0
vllm: 0.7.2
openai: 1.54.4
anthropic: 0.18.1
litellm: Module Not Found
decord: 0.6.0
NVIDIA Topology:
	GPU0	GPU1	GPU2	GPU3	GPU4	GPU5	GPU6	GPU7	GPU8	CPU Affinity	NUMA Affinity	GPU NUMA ID
GPU0	 X 	PIX	PIX	PIX	PIX	SYS	SYS	SYS	SYS	0-3,8-11	0		N/A
GPU1	PIX	 X 	PIX	PIX	PIX	SYS	SYS	SYS	SYS	0-3,8-11	0		N/A
GPU2	PIX	PIX	 X 	PIX	PIX	SYS	SYS	SYS	SYS	0-3,8-11	0		N/A
GPU3	PIX	PIX	PIX	 X 	PIX	SYS	SYS	SYS	SYS	0-3,8-11	0		N/A
GPU4	PIX	PIX	PIX	PIX	 X 	SYS	SYS	SYS	SYS	0-3,8-11	0		N/A
GPU5	SYS	SYS	SYS	SYS	SYS	 X 	PIX	PIX	PIX	0-3,8-11	0		N/A
GPU6	SYS	SYS	SYS	SYS	SYS	PIX	 X 	PIX	PIX	0-3,8-11	0		N/A
GPU7	SYS	SYS	SYS	SYS	SYS	PIX	PIX	 X 	PIX	0-3,8-11	0		N/A
GPU8	SYS	SYS	SYS	SYS	SYS	PIX	PIX	PIX	 X 	0-3,8-11	0		N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

ulimit soft: 1024

Installation

pip install --upgrade pip
pip install sgl-kernel --force-reinstall --no-deps
pip install "sglang[all]>=0.4.2.post4" --find-links https://flashinfer.ai/whl/cu124/torch2.5/flashinfer/

If flashinfer fails to install, download the wheel manually from https://flashinfer.ai/whl/cu124/torch2.5/flashinfer/ and install it,
then run pip install sglang again.
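
Either way, a quick import check before moving on catches a broken flashinfer install early (a minimal sketch; I assume both packages import cleanly once the wheels above are in place):

import flashinfer  # raises ImportError if the wheel did not install correctly
import sglang

print(sglang.__version__)  # expect 0.4.2.post4 or newer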

Downloading the model

Inside China, ModelScope is the convenient way to get the model; install it with pip install modelscope if you don't have it.
Then start a Python shell and run the following:

from modelscope import AutoModelForCausalLM, AutoTokenizer

model_name = "deepseek-ai/DeepSeek-R1-Distill-Qwen-32B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto")
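
The snippet above also loads the weights onto the GPUs via device_map="auto". If you only want to download the files, ModelScope's snapshot_download fetches them into the same cache without loading anything (a minimal sketch with the same model id):

from modelscope import snapshot_download

# Downloads (or resumes) the model files into the ModelScope cache and returns the local path.
model_dir = snapshot_download("deepseek-ai/DeepSeek-R1-Distill-Qwen-32B")
print(model_dir)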

Once the download finishes, the files live under

~/.cache/modelscope/hub/deepseek-ai/DeepSeek-R1-Distill-Qwen-32B/

Launching the model

I have a 9x RTX 3090 box, and DeepSeek-R1-Distill-Qwen-32B needs four of the cards. With vLLM you can run on any four cards of your choosing, but in my setup SGLang required four consecutive cards.
Here I run on cards 5, 6, 7 and 8.

Launching with SGLang

CUDA_VISIBLE_DEVICES=5,6,7,8 python3 -m sglang.launch_server --model ~/.cache/modelscope/hub/deepseek-ai/DeepSeek-R1-Distill-Qwen-32B/ --tp 4 --host 0.0.0.0 --port 8000

(Screenshot: SGLang server startup output)
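
Once the server reports it is ready, a quick sanity check against the OpenAI-compatible endpoint confirms it is serving the model (a minimal sketch; the vLLM and LMDeploy servers below expose the same /v1/models route on the same port, so the check works for them too):

from openai import OpenAI

client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="not used actually")
for m in client.models.list():
    print(m.id)  # the model name/path the server reports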

Launching with vLLM

vLLM is extremely simple to install: just pip install vllm.

CUDA_VISIBLE_DEVICES=5,6,7,8 vllm serve ~/.cache/modelscope/hub/deepseek-ai/DeepSeek-R1-Distill-Qwen-32B/ --tensor-parallel-size 4 --max-model-len 32768 --enforce-eager --served-model-name DeepSeek-R1-Distill-Qwen-32B --host 0.0.0.0

(Screenshot: vLLM server startup output)

Launching with LMDeploy

LMDeploy is likewise extremely simple to install: just pip install lmdeploy.
The environment above can be used as-is.

CUDA_VISIBLE_DEVICES=5,6,7,8 lmdeploy serve api_server  ~/.cache/modelscope/hub/deepseek-ai/DeepSeek-R1-Distill-Qwen-32B/ --tp 4 --server-port 8000 --model-name DeepSeek-R1-Distill-Qwen-32B

(Screenshot: LMDeploy server startup output)
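
Before running the interactive benchmark in the next section, a one-off non-streaming request makes a convenient smoke test (a minimal sketch against whichever of the three servers is listening on port 8000; the model name matches the served name configured above):

from openai import OpenAI

client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="not used actually")
resp = client.chat.completions.create(
    model="DeepSeek-R1-Distill-Qwen-32B",
    messages=[{"role": "user", "content": "Briefly introduce yourself."}],
    max_tokens=128,
)
print(resp.choices[0].message.content)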

Testing

import os
import time

from openai import OpenAI
from transformers import AutoTokenizer

# The tokenizer is only used to count generated tokens for the throughput figure.
tokenizer = AutoTokenizer.from_pretrained(
    os.path.expanduser("~/.cache/modelscope/hub/deepseek-ai/DeepSeek-R1-Distill-Qwen-32B/"),
    trust_remote_code=True,
)

client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="not used actually")


def _clear_screen():
    # Clears the terminal for the :clear / :cl command.
    os.system("cls" if os.name == "nt" else "clear")


def cli_main():
    while True:
        query = input("User> ").strip()
        if query.startswith(":"):
            command_words = query[1:].strip().split()
            command = command_words[0] if command_words else ""
            if command in ["exit", "quit", "q"]:
                break
            elif command in ["clear", "cl"]:
                _clear_screen()
                continue
            else:
                # Fall through and treat it as a normal query.
                pass
        start_time = time.time()
        accumulated_text = ""
        response = client.chat.completions.create(
            model="DeepSeek-R1-Distill-Qwen-32B",  # the server does not seem to be strict about this name
            messages=[{"role": "user", "content": query}],
            stream=True,
        )
        print("\n")
        for chunk in response:
            content = chunk.choices[0].delta.content
            if content:
                accumulated_text += content
                print(content, end="", flush=True)
        duration = time.time() - start_time
        # Re-derive the token count from the accumulated text with the local tokenizer.
        total_tokens = len(tokenizer.encode(accumulated_text))
        throughput = total_tokens / duration
        print(f"\nThroughput: {throughput:.2f} tokens/s")
        print("\n")


if __name__ == "__main__":
    cli_main()

vLLM

(Screenshot: vLLM benchmark output, 23.23 tokens/s)

SGLang

(Screenshot: SGLang benchmark output, 30.47 tokens/s)

LMDeploy

(Screenshot: LMDeploy benchmark output, 34.96 tokens/s)
