第二期书生浦语大模型实战营第五次课程笔记----LMDeploy 量化部署 LLM-VLM 实践

原创已于 2024-06-25 16:01:05 修改 · 437 阅读

0 ·

CC 4.0 BY-SA版权

文章标签：

#笔记 #人工智能 #自然语言处理 #学习

于 2024-04-14 08:52:50 首次发布

大模型专栏收录该内容

21 篇文章

订阅专栏

本文详细介绍了如何使用LMDeploy进行大模型部署，包括理论背景、模型对话、安装部署、量化方法（如量化感知和W4A16量化）、服务启动以及与视觉多模态模型LLaVA的集成，提供了实用的步骤和代码示例。

课程视频：https://www.bilibili.com/video/BV1tr421x75B/
课程文档：https://github.com/InternLM/Tutorial/blob/camp2/lmdeploy/README.md
课程作业：https://github.com/InternLM/Tutorial/blob/camp2/lmdeploy/homework.md
github地址：https://github.com/InternLM/LMDeploy
https://github.com/internLM/internLM/
配置教程：https://github.com/InternLM/Tutorial/tree/camp2/tools/openxlab-deploy

理论学习

大模型部署背景：模型部署就是将训练好的深度学习模型在特定环境中运行的过程，面临巨大的挑战–计算量巨大、内存开销巨大、访存密集瓶颈、动态请求。
大模型部署方法：模型剪枝，知识蒸馏–上下文学习、思维链、指令跟随，量化–量化感知训练、量化感知微调、训练后量化
LMDeploy简介：涵盖LLM任务的全套轻量化、部署和服务解决方案，核心功能----chat、lite、serve，
在这里插入图片描述

动手实践–安装、部署、量化

操作平台：https://studio.intern-ai.org.cn/console/instance/
在这里插入图片描述

部署

安装并激活环境：

studio-conda -t lmdeploy -o pytorch-2.1.2
conda activate lmdeploy
pip install lmdeploy[all]==0.3.0

服务器或本地环境安装

conda create -n lmdeploy python=3.10
conda activate lmdeploy
conda install pytorch==2.1.2 torchvision==0.16.2 torchaudio==2.1.2 pytorch-cuda=12.1 -c pytorch -c nvidia

模型对话

huggingface在线托管模型和数据集，模型常用HF格式，国内的Mindscope、OpenXlab也可以托管模型和数据集，
TurboMind关于LLM推理的高效引擎，是LM deploy的一个子模块，

下载模型

软链接或拷贝模型

cd ~
ln -s /root/share/new_models/Shanghai_AI_Laboratory/internlm2-chat-1_8b /root/
# cp -r /root/share/new_models/Shanghai_AI_Laboratory/internlm2-chat-1_8b /root/

使用transformer库运行模型

touch /root/pipeline_transformer.py

Pipline_transformer.py

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("/root/internlm2-chat-1_8b", trust_remote_code=True)

# Set `torch_dtype=torch.float16` to load model in float16, otherwise it will be loaded as float32 and cause OOM Error.
model = AutoModelForCausalLM.from_pretrained("/root/internlm2-chat-1_8b", torch_dtype=torch.float16, trust_remote_code=True).cuda()
model = model.eval()

inp = "hello"
print("[INPUT]", inp)
response, history = model.chat(tokenizer, inp, history=[])
print("[OUTPUT]", response)

inp = "please provide three suggestions about time management"
print("[INPUT]", inp)
response, history = model.chat(tokenizer, inp, history=history)
print("[OUTPUT]", response)

conda activate lmdeploy
python /root/pipeline_transformer.py

Lmdeploy chat

conda activate lmdeploy
lmdeploy chat /root/internlm2-chat-1_8b  #lmdeploy chat [HF格式模型路径/TurboMind格式模型路径]

模型量化

KV缓存

在lmdeploy chat下设置最大KV Cache缓存大小，–cache-max-entry-count XXX

lmdeploy chat /root/internlm2-chat-1_8b --cache-max-entry-count 0.5

量化

在lmdeploy lite下使用W4A16量化，

pip install einops==0.7.0
lmdeploy lite auto_awq \
   /root/internlm2-chat-1_8b \
  --calib-dataset 'ptb' \
  --calib-samples 128 \
  --calib-seqlen 1024 \
  --w-bits 4 \
  --w-group-size 128 \
  --work-dir /root/internlm2-chat-1_8b-4bit
  #lmdeploy chat /root/internlm2-chat-1_8b-4bit --model-format awq
  lmdeploy chat /root/internlm2-chat-1_8b-4bit --model-format awq --cache-max-entry-count 0.01

服务

启动API接口

在lmdeploy serve 下启动API接口服务

lmdeploy serve api_server \
    /root/internlm2-chat-1_8b \
    --model-format hf \
    --quant-policy 0 \
    --server-name 0.0.0.0 \
    --server-port 23333 \
    --tp 1

windows powershell 映射端口：ssh -CNg -L 23333:127.0.0.1:23333 root@ssh.intern-ai.org.cn -p 你的ssh端口号
浏览器：http://127.0.0.1:23333

连接API

命令行客户端/网页客户端连接API服务器

conda activate lmdeploy
lmdeploy serve api_client http://localhost:23333  #命令行客户端
lmdeploy serve gradio http://localhost:23333 \  # 网页客户端
    --server-name 0.0.0.0 \
    --server-port 6006

windows powershell 映射端口：ssh -CNg -L 23333:127.0.0.1:23333 root@ssh.intern-ai.org.cn -p 你的ssh端口号
浏览器：http://127.0.0.1:6006

代码集成

对话集成：Pipeline.py

conda activate lmdeploy
touch /root/pipeline.py

from lmdeploy import pipeline

pipe = pipeline('/root/internlm2-chat-1_8b')
response = pipe(['Hi, pls intro yourself', '上海是'])
print(response)

python /root/pipeline.py

KV缓存和量化集成：Pipeline_kv.py

touch /root/pipeline_kv.py
from lmdeploy import pipeline, TurbomindEngineConfig

# 调低 k/v cache内存占比调整为总显存的 20%
backend_config = TurbomindEngineConfig(cache_max_entry_count=0.2)

pipe = pipeline('/root/internlm2-chat-1_8b',
                backend_config=backend_config)
response = pipe(['Hi, pls intro yourself', '上海是'])
print(response)

python /root/pipeline_kv.py

拓展部分

使用LMDeploy运行视觉多模态大模型llava，（运行本pipeline最低需要30%的InternStudio开发机）

命令行客户端

conda activate lmdeploy
pip install git+https://github.com/haotian-liu/LLaVA.git@4e2277a060da264c4f21b364c867cc622c945874
touch /root/pipeline_llava.py

from lmdeploy.vl import load_image
from lmdeploy import pipeline, TurbomindEngineConfig


backend_config = TurbomindEngineConfig(session_len=8192) # 图片分辨率较高时请调高session_len
# pipe = pipeline('liuhaotian/llava-v1.6-vicuna-7b', backend_config=backend_config) 非开发机运行此命令
pipe = pipeline('/share/new_models/liuhaotian/llava-v1.6-vicuna-7b', backend_config=backend_config)

image = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg')
response = pipe(('describe this image', image))
print(response)

Gradio运行llava模型

touch /root/gradio_llava.py

import gradio as gr
from lmdeploy import pipeline, TurbomindEngineConfig


backend_config = TurbomindEngineConfig(session_len=8192) # 图片分辨率较高时请调高session_len
# pipe = pipeline('liuhaotian/llava-v1.6-vicuna-7b', backend_config=backend_config) 非开发机运行此命令
pipe = pipeline('/share/new_models/liuhaotian/llava-v1.6-vicuna-7b', backend_config=backend_config)

def model(image, text):
    if image is None:
        return [(text, "请上传一张图片。")]
    else:
        response = pipe((text, image)).text
        return [(text, response)]

demo = gr.Interface(fn=model, inputs=[gr.Image(type="pil"), gr.Textbox()], outputs=gr.Chatbot())
demo.launch()