LMDeploy大模型量化部署实践

最新推荐文章于 2024-09-15 15:29:29 发布

Jeff2017

最新推荐文章于 2024-09-15 15:29:29 发布

阅读量459

点赞数 12

文章标签：人工智能算法 python

本文链接：https://blog.csdn.net/qq_18608609/article/details/141673140

版权

1 LMDeploy API部署InternLM2.5
在上一章节，我们直接在本地部署InternLM2.5。而在实际应用中，我们有时会将大模型封装为API接口服务，供客户端访问。

1.1 启动API服务器
首先让我们进入创建好的conda环境，并通下命令启动API服务器，部署InternLM2.5模型：

conda activate lmdeploy lmdeploy serve api_server \ /root/models/internlm2_5-7b-chat \ --model-format hf \ --quant-policy 0 \ --server-name 0.0.0.0 \ --server-port 23333 \ --tp 1
以命令行形式连接API服务器
关闭http://127.0.0.1:23333网页，但保持终端和本地窗口不动，按箭头操作新建一个终端。

运行如下命令，激活conda环境并启动命令行客户端。

conda activate lmdeploy
lmdeploy serve api_client http://localhost:23333
稍待片刻，等出现double enter to end input >>>的输入提示即启动成功，此时便可以随意与InternLM2.5对话，同样是两下回车确定，输入exit退出。

1.2 以Gradio网页形式连接API服务器
保持第一个终端不动，在新建终端中输入exit退出。

输入以下命令，使用Gradio作为前端，启动网页。

lmdeploy serve gradio http://localhost:23333 \
--server-name 0.0.0.0 \
--server-port 6006

使用的是50%A100，占用了23GB，执行以下命令，观看占用显存情况。

lmdeploy chat /root/models/internlm2_5-7b-chat --cache-max-entry-count 0.4
稍待片刻，观测显存占用情况，可以看到减少了约4GB的显存。

1.3 设置在线 kv cache int4/int8 量化
输入以下指令，启动API服务::

lmdeploy serve api_server \
/root/models/internlm2_5-7b-chat \
--model-format hf \
--quant-policy 4 \
--cache-max-entry-count 0.4\
--server-name 0.0.0.0 \
--server-port 23333 \

1.4 W4A16 模型量化和部署
在最新的版本中，LMDeploy使用的是AWQ算法，能够实现模型的4bit权重量化。

输入以下指令，执行量化工作。

lmdeploy lite auto_awq \
/root/models/internlm2_5-7b-chat \
--calib-dataset 'ptb' \
--calib-samples 128 \
--calib-seqlen 2048 \
--w-bits 4 \
--w-group-size 128 \
--batch-size 1 \
--search-scale False \
--work-dir /root/models/internlm2_5-7b-chat-w4a16-4bit

等待推理完成，便可以直接在目标文件夹看到对应的模型文件。

查看显存占用情况：

输入以下指令启动量化后的模型。

lmdeploy chat /root/models/internlm2_5-7b-chat-w4a16-4bit/ --model-format awq

输入以下指令，让我们同时启用量化后的模型、设定kv cache占用和kv cache int4量化。

lmdeploy serve api_server \
/root/models/internlm2_5-7b-chat-w4a16-4bit/ \
--model-format awq \
--quant-policy 4 \
--cache-max-entry-count 0.4\
--server-name 0.0.0.0 \
--server-port 23333 \
--tp 1