Pushing two 4090s to the limit with Qwen: can the 72B Int4 version be deployed?

IT修炼家

Article link: https://blog.csdn.net/qq_42755230/article/details/143406214

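Background:

The project needs a large language model, but only two 4090 cards with 24 GB of VRAM each are available. What is the largest Qwen version that can be deployed on them? [All models were deployed with vLLM.]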

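Hands-on:

Qwen released the new Qwen2.5 series last month, in 0.5B, 1.5B, 3B, 7B, 14B, 32B, and 72B sizes (a very complete lineup; previously there were only 0.5B, 1.5B, 7B, and 72B).

I went to the official Qwen2.5 documentation to check how much GPU memory each model size needs for deployment.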

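The documentation lists the Int4-quantized 72B model as needing only 46.86 GB of VRAM at an input length of 14336. Thrilled, I tried it right away, and it failed with an error (after adjusting some parameters I did eventually get it to run, just barely; details later):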

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 928.00 MiB. GPU 1 has a total capacty of 23.64 GiB of which 773.69 MiB is free. Process 407715 has 22.88 GiB memory in use. Of the allocated memory 22.22 GiB is allocated by PyTorch, and 18.60 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

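My guess is that the figure in the official documentation was measured on a single card, whereas I am using two. Splitting the model across cards lowers efficiency, so the total VRAM required goes up.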
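A quick back-of-the-envelope check, using only numbers quoted above, shows how thin the margin was to begin with. This is a minimal sketch: it assumes the documented 46.86 figure is in the same unit (GiB) as the 23.64 GiB per-card capacity reported in the OOM message, which the documentation excerpt here does not confirm, and `headroom_gib` is an illustrative name.

```python
def headroom_gib(required_gib: float, per_gpu_gib: float, num_gpus: int) -> float:
    """Return total usable capacity minus the required memory, in GiB."""
    return per_gpu_gib * num_gpus - required_gib


# 23.64 GiB is the per-card capacity from the OOM message above.
# 46.86 is the documented requirement (unit assumed to be GiB).
spare = headroom_gib(required_gib=46.86, per_gpu_gib=23.64, num_gpus=2)
print(f"spare across both cards: {spare:.2f} GiB")
```

Under that assumption less than half a GiB is left across both cards, before counting any per-process overhead, so an out-of-memory error is not surprising.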

So I tried the 32B Int8 version instead, but it still failed:

ValueError: No available memory for the cache blocks. Try increasing gpu_memory_utilization when initializing the engine

Following its suggestion, I raised gpu_memory_utilization to 0.98 (the default is 0.9) and ran it again, which produced:

ValueError: The model’s max seq len (32768) is larger than the maximum number of tokens that can be stored in KV cache (1008). Try increasing gpu_memory_utilization or decreasing max_model_len when initializing the engine
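To put the two numbers in this error (32768 and 1008) in perspective, the KV cache size per token can be estimated from the model's shape. The sketch below is only an estimate: the layer count, KV head count, and head dimension are my assumptions for the 32B model and should be checked against its `config.json`, and the cache is assumed to be stored in 16-bit precision.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       dtype_bytes: int = 2) -> int:
    """Bytes of KV cache one token occupies, summed over all layers."""
    # Factor 2: one key vector and one value vector per layer and KV head.
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


# Assumed shape of the 32B model (verify against config.json).
ASSUMED_32B_SHAPE = dict(num_layers=64, num_kv_heads=8, head_dim=128)

per_token = kv_bytes_per_token(**ASSUMED_32B_SHAPE)
for tokens in (1008, 32768):
    mib = per_token * tokens / 2**20
    print(f"{tokens:>6} tokens -> {mib:.0f} MiB of KV cache")
```

With these assumed values, a single full-length 32768-token sequence would need gigabytes of cache in total, while the space left after loading the weights holds only about a thousand tokens. When the model is split across two cards, that total is shared between them.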

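The KV cache error means the model weights can be loaded, but there is not enough cache left to hold the input text. That was a relief: once the model loads, the rest is manageable, since at worst I can shorten the input. So I lowered the maximum sequence length (max_model_len) to 20480 and ran it again, which produced: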

ValueError: The model’s max seq len (20480) is larger than the maximum number of tokens that can be stored in KV cache (11680). Try increasing gpu_memory_utilization or decreasing max_model_len when initializing the engine.
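The error text carries both the requested length and the capacity vLLM reports, so a small helper can pull them out and propose a value that fits. This is a sketch, not part of vLLM: `parse_kv_limit` and `suggest_max_model_len` are hypothetical names, and rounding down to a multiple of 2048 is my own convention.

```python
import re
from typing import Optional, Tuple

# Matches the two parenthesised numbers in vLLM's "max seq len" error.
_PATTERN = re.compile(r"max seq len \((\d+)\).*?KV cache \((\d+)\)", re.DOTALL)


def parse_kv_limit(message: str) -> Optional[Tuple[int, int]]:
    """Return (requested_len, kv_capacity), or None if the text does not match."""
    match = _PATTERN.search(message)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def suggest_max_model_len(kv_capacity: int, step: int = 2048) -> int:
    """Round the capacity down to a multiple of `step` (assumed convention)."""
    if step <= 0:
        raise ValueError("step must be positive")
    rounded = (kv_capacity // step) * step
    # If the capacity is smaller than one step, fall back to the capacity itself.
    return rounded if rounded > 0 else kv_capacity


message = (
    "The model's max seq len (20480) is larger than the maximum number of "
    "tokens that can be stored in KV cache (11680)."
)
requested, capacity = parse_kv_limit(message)
print(requested, capacity, suggest_max_model_len(capacity))
```

Treat the result as a starting point only: the reported capacity itself moved from 1008 to 11680 between the two runs above, so it can change again once the length is lowered.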

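The length was still too long. Following the message, I kept it under the reported limit of 11680 by setting the maximum length to 10240, and this time it finally ran.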
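For reference, the settings that worked could be passed to vLLM roughly as follows. This is a hedged sketch rather than my exact launch command: it assumes the offline `LLM` class with two-way tensor parallelism, `MODEL_ID` is an assumed Hugging Face repository name for the 32B Int8 checkpoint, and `build_engine_kwargs` and `load_engine` are illustrative helpers.

```python
from typing import Optional


def build_engine_kwargs(model: str, num_gpus: int = 2,
                        gpu_memory_utilization: float = 0.98,
                        max_model_len: Optional[int] = None) -> dict:
    """Collect the vLLM engine arguments discussed in this post."""
    if not 0.0 < gpu_memory_utilization <= 1.0:
        raise ValueError("gpu_memory_utilization must be in (0, 1]")
    kwargs = {
        "model": model,
        "tensor_parallel_size": num_gpus,  # assumed: one shard per card
        "gpu_memory_utilization": gpu_memory_utilization,
    }
    if max_model_len is not None:
        kwargs["max_model_len"] = max_model_len
    return kwargs


def load_engine(**kwargs):
    """Create the engine; needs vLLM installed and both GPUs available."""
    from vllm import LLM  # imported lazily so the helper above runs without vLLM
    return LLM(**kwargs)


MODEL_ID = "Qwen/Qwen2.5-32B-Instruct-GPTQ-Int8"  # assumed checkpoint name
engine_kwargs = build_engine_kwargs(MODEL_ID, max_model_len=10240)
print(engine_kwargs)
# engine = load_engine(**engine_kwargs)  # uncomment on the GPU machine
```

Keeping the arguments in a plain dictionary makes it easy to try a different `max_model_len` or utilization without touching the loading code.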