5、LMDeploy 量化部署 LLM&VLM实战（homework）

nty102

已于 2024-04-11 08:53:30 修改

阅读量671

点赞数 3

文章标签：人工智能深度学习 chatgpt langchain

于 2024-04-10 16:49:47 首次发布

本文链接：https://blog.csdn.net/nty102/article/details/137596422

版权

本文介绍了如何在InternStudio上配置LMDeploy环境，包括创建conda环境、安装lmdeploy、下载并使用InternLM2-Chat模型进行对话。还详细讲解了模型量化、KVCache管理、API服务器设置以及Python代码集成视觉模型的过程。

摘要由CSDN通过智能技术生成

基础作业（结营必做）

完成以下任务，并将实现过程记录截图：

配置lmdeploy运行环境

由于环境依赖项存在torch，下载过程可能比较缓慢。InternStudio上提供了快速创建conda环境的方法。打开命令行终端，创建一个名为lmdeploy的环境：

studio-conda -t lmdeploy -o pytorch-2.1.2

# 接下来，激活刚刚创建的虚拟环境。
conda activate lmdeploy

# 安装0.3.0版本的lmdeploy。
pip install lmdeploy[all]==0.3.0

# 等待安装结束就OK了！

下载internlm-chat-1.8b模型

## OpenXLab平台支持通过Git协议下载模型。首先安装git-lfs组件。

## 对于root用于请执行如下指令：
curl -s https://packagecloud.io/install/repositories/github/git-lfs/script.deb.sh | bash
apt update
apt install git-lfs   
git lfs install  --system

## 对于非root用户需要加sudo，请执行如下指令：
curl -s https://packagecloud.io/install/repositories/github/git-lfs/script.deb.sh | sudo bash
sudo apt update
sudo apt install git-lfs   
sudo git lfs install  --system

## 安装好git-lfs组件后，由OpenXLab平台下载InternLM2-Chat-1.8B模型：

git clone https://code.openxlab.org.cn/OpenLMLab/internlm2-chat-1.8b.git

以命令行方式与模型对话

进阶作业

完成以下任务，并将实现过程记录截图：

设置KV Cache最大占用比例为0.4，开启W4A16量化，以命令行方式与模型对话。（优秀学员必做）

## 模型W4A16量化
lmdeploy lite auto_awq /root/models/Shanghai_AI_Laboratory/internl
m2-chat-1_8b --calib-dataset 'ptb' --calib-samples 128 --calib-seqlen 1024 --w-bits 4 --w-group-size 128 --work-dir /root/models/Shanghai_AI_Laboratory/internlm2-chat-1_8b-4bit


## KV Cache最大占用比例为0.4，开启W4A16量化，以命令行方式与模型对话。
lmdeploy chat /root/internlm2-chat-1_8b-4bit --model-format awq --cache-max-entry-count 0.4

以API Server方式启动 lmdeploy，开启 W4A16量化，调整KV Cache的占用比例为0.4，分别使用命令行客户端与Gradio网页客户端与模型对话。（优秀学员）

通过以下命令启动API服务器，推理internlm2-chat-1_8b模型：

lmdeploy serve api_server \
     /root/models/Shanghai_AI_Laboratory/internlm2-chat-1_8b-4bit/  \
    --model-format  awq  \
    --quant-policy 0 \
    --server-name 0.0.0.0 \
    --server-port 23333 \
    --cache-max-entry-count 0.4 \
    --tp 1

## 运行命令行客户端：

lmdeploy serve api_client http://localhost:23333

命令行客户端：

Gradio网页端：

lmdeploy serve gradio http://localhost:23333 \
    --server-name 0.0.0.0 \
    --server-port 6006

使用W4A16量化，调整KV Cache的占用比例为0.4，使用Python代码集成的方式运行internlm2-chat-1.8b模型。（优秀学员必做）

from lmdeploy import pipeline, TurbomindEngineConfig
backend_config = TurbomindEngineConfig(cache_max_entry_count=0.2)

pipe = pipeline('/root/models/Shanghai_AI_Laboratory/internlm2-chat-1_8b',
                backend_config=backend_config)
# pipe = pipeline('/root/models/Shanghai_AI_Laboratory/internlm2-chat-1_8b')
response = pipe(['Hi, pls intro yourself', '上海是'])
print(response)

使用 LMDeploy 运行视觉多模态大模型 llava gradio demo （优秀学员必做）

from lmdeploy import pipeline
from lmdeploy.vl import load_image

# pipe = pipeline('liuhaotian/llava-v1.6-vicuna-7b') 非开发机运行此命令
pipe = pipeline('/share/new_models/liuhaotian/llava-v1.6-vicuna-7b')

image = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg')
response = pipe(('describe this image', image))
print(response)

将 LMDeploy Web Demo 部署到 OpenXLab （OpenXLab cuda 12.2 的镜像还没有 ready，可先跳过，一周之后再来做）

nty102

关注

3
点赞
踩
10

收藏

觉得还不错? 一键收藏
0
评论
5、LMDeploy 量化部署 LLM&VLM实战（homework）

由于环境依赖项存在torch，下载过程可能比较缓慢。InternStudio上提供了快速创建conda环境的方法。打开命令行终端，创建一个名为lmdeploy。
复制链接

扫一扫