Course video: https://www.bilibili.com/video/BV1tr421x75B/
Course docs: https://github.com/InternLM/Tutorial/blob/camp2/lmdeploy/README.md
Course homework: https://github.com/InternLM/Tutorial/blob/camp2/lmdeploy/homework.md
LMDeploy repo: https://github.com/InternLM/LMDeploy
InternLM repo: https://github.com/internLM/internLM/
Setup guide: https://github.com/InternLM/Tutorial/tree/camp2/tools/openxlab-deploy
Theory
Background of LLM deployment: deployment means running a trained deep-learning model in a target environment. The main challenges are huge compute requirements, huge memory overhead, memory-access-bound bottlenecks, and dynamic request patterns.
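As a rough illustration of the memory-overhead point (back-of-envelope numbers assuming a Llama-7B-style configuration, not figures from the course), the KV cache alone for a single long sequence can reach about a gigabyte:
num_layers = 32        # transformer layers (assumed 7B-style config)
num_kv_heads = 32      # K/V attention heads
head_dim = 128         # dimension per head
bytes_fp16 = 2         # bytes per fp16 element
seq_len = 2048         # context length
per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_fp16  # K plus V
total = per_token * seq_len
print(f"{per_token / 1024:.0f} KiB per token, {total / 2**30:.2f} GiB for a 2048-token sequence")  # ~512 KiB, ~1 GiB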
LLM deployment methods: model pruning; knowledge distillation (in-context learning, chain-of-thought, instruction following); quantization (quantization-aware training, quantization-aware fine-tuning, post-training quantization).
About LMDeploy: a complete lightweighting, deployment, and serving solution for LLM workloads; its core subcommands are chat, lite, and serve.
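The three subcommands map onto the hands-on steps below; for example (paths as used later in these notes):
lmdeploy chat /root/internlm2-chat-1_8b        # interactive chat with a local model
lmdeploy lite auto_awq /root/internlm2-chat-1_8b --work-dir /root/internlm2-chat-1_8b-4bit        # W4A16 quantization
lmdeploy serve api_server /root/internlm2-chat-1_8b --server-port 23333        # API service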

Hands-on practice: installation, deployment, quantization
Platform: https://studio.intern-ai.org.cn/console/instance/

Deployment
Create and activate the environment on InternStudio:
studio-conda -t lmdeploy -o pytorch-2.1.2
conda activate lmdeploy
pip install lmdeploy[all]==0.3.0
Installation on your own server or local machine:
conda create -n lmdeploy python=3.10
conda activate lmdeploy
conda install pytorch==2.1.2 torchvision==0.16.2 torchaudio==2.1.2 pytorch-cuda=12.1 -c pytorch -c nvidia
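Outside InternStudio, LMDeploy itself still needs to be installed (same version as above); pip show is a quick sanity check:
pip install lmdeploy[all]==0.3.0
pip show lmdeploy        # confirm the installed version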
Chatting with the model
Hugging Face hosts models and datasets online, and models are commonly distributed in HF format; domestic platforms such as ModelScope and OpenXLab can also host models and datasets.
TurboMind is an efficient LLM inference engine and a submodule of LMDeploy.
Download the model
Symlink or copy the model:
cd ~
ln -s /root/share/new_models/Shanghai_AI_Laboratory/internlm2-chat-1_8b /root/
# cp -r /root/share/new_models/Shanghai_AI_Laboratory/internlm2-chat-1_8b /root/
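If you are not on InternStudio (so /root/share is unavailable), a minimal sketch of downloading the weights from the Hugging Face hub instead, assuming the huggingface_hub package is installed and the repo id internlm/internlm2-chat-1_8b:
from huggingface_hub import snapshot_download
# Download the model into /root/internlm2-chat-1_8b (repo id is an assumption, not from the course notes).
snapshot_download(repo_id="internlm/internlm2-chat-1_8b",
                  local_dir="/root/internlm2-chat-1_8b")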
Running the model with the Transformers library
touch /root/pipeline_transformer.py
pipeline_transformer.py:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("/root/internlm2-chat-1_8b", trust_remote_code=True)
# Set `torch_dtype=torch.float16` to load model in float16, otherwise it will be loaded as float32 and cause OOM Error.
model = AutoModelForCausalLM.from_pretrained("/root/internlm2-chat-1_8b", torch_dtype=torch.float16, trust_remote_code=True).cuda()
model = model.eval()
inp = "hello"
print("[INPUT]", inp)
response, history = model.chat(tokenizer, inp, history=[])
print("[OUTPUT]", response)
inp = "please provide three suggestions about time management"
print("[INPUT]", inp)
response, history = model.chat(tokenizer, inp, history=history)
print("[OUTPUT]", response)
conda activate lmdeploy
python /root/pipeline_transformer.py
LMDeploy chat
conda activate lmdeploy
lmdeploy chat /root/internlm2-chat-1_8b  # lmdeploy chat [path to HF-format or TurboMind-format model]
Model quantization
KV cache
With lmdeploy chat, cap the KV cache size via --cache-max-entry-count XXX (the fraction of free GPU memory reserved for the K/V cache):
lmdeploy chat /root/internlm2-chat-1_8b --cache-max-entry-count 0.5
Quantization
Use W4A16 quantization (4-bit weights, 16-bit activations) via lmdeploy lite:
pip install einops==0.7.0
lmdeploy lite auto_awq \
/root/internlm2-chat-1_8b \
--calib-dataset 'ptb' \
--calib-samples 128 \
--calib-seqlen 1024 \
--w-bits 4 \
--w-group-size 128 \
--work-dir /root/internlm2-chat-1_8b-4bit
#lmdeploy chat /root/internlm2-chat-1_8b-4bit --model-format awq
lmdeploy chat /root/internlm2-chat-1_8b-4bit --model-format awq --cache-max-entry-count 0.01
Serving
Starting the API server
Start the API service with lmdeploy serve:
lmdeploy serve api_server \
/root/internlm2-chat-1_8b \
--model-format hf \
--quant-policy 0 \
--server-name 0.0.0.0 \
--server-port 23333 \
--tp 1
Port forwarding from Windows PowerShell: ssh -CNg -L 23333:127.0.0.1:23333 root@ssh.intern-ai.org.cn -p <your SSH port>
Browser: http://127.0.0.1:23333
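The server also exposes OpenAI-style routes (listed on the Swagger page above). A minimal sketch of calling it from Python with the openai client (>=1.0), assuming a /v1/chat/completions route and that the served model name matches what GET /v1/models reports:
from openai import OpenAI

# api_key is required by the client but not checked by this local server.
client = OpenAI(base_url="http://127.0.0.1:23333/v1", api_key="none")
resp = client.chat.completions.create(
    model="internlm2-chat-1_8b",  # assumed model name; check GET /v1/models
    messages=[{"role": "user", "content": "Hi, pls intro yourself"}],
)
print(resp.choices[0].message.content)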
Connecting to the API
Connect to the API server from the command-line client or the web client:
conda activate lmdeploy
lmdeploy serve api_client http://localhost:23333  # command-line client
# Web client (the comment must not follow the line-continuation backslash):
lmdeploy serve gradio http://localhost:23333 \
--server-name 0.0.0.0 \
--server-port 6006
Port forwarding from Windows PowerShell: ssh -CNg -L 6006:127.0.0.1:6006 root@ssh.intern-ai.org.cn -p <your SSH port>
Browser: http://127.0.0.1:6006
Code integration
Chat integration: pipeline.py
conda activate lmdeploy
touch /root/pipeline.py
from lmdeploy import pipeline
pipe = pipeline('/root/internlm2-chat-1_8b')
response = pipe(['Hi, pls intro yourself', '上海是'])
print(response)
python /root/pipeline.py
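A sampling-parameter sketch on top of the same pipeline, assuming this lmdeploy version exports GenerationConfig (the parameter values are illustrative, not from the course):
from lmdeploy import pipeline, GenerationConfig

pipe = pipeline('/root/internlm2-chat-1_8b')
# Illustrative sampling settings (assumed values).
gen_config = GenerationConfig(max_new_tokens=256, top_p=0.8, temperature=0.7)
response = pipe(['Hi, pls intro yourself', '上海是'], gen_config=gen_config)
print(response)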
KV cache and quantization integration: pipeline_kv.py
touch /root/pipeline_kv.py
from lmdeploy import pipeline, TurbomindEngineConfig
# Lower the K/V cache memory ratio to 20% of GPU memory
backend_config = TurbomindEngineConfig(cache_max_entry_count=0.2)
pipe = pipeline('/root/internlm2-chat-1_8b',
                backend_config=backend_config)
response = pipe(['Hi, pls intro yourself', '上海是'])
print(response)
python /root/pipeline_kv.py
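The heading above also mentions quantization; a sketch combining the W4A16 model produced by the lite step with the reduced KV cache, assuming TurbomindEngineConfig accepts model_format='awq':
from lmdeploy import pipeline, TurbomindEngineConfig

# Load the AWQ-quantized model and cap the K/V cache at 20% of free GPU memory.
backend_config = TurbomindEngineConfig(model_format='awq',
                                       cache_max_entry_count=0.2)
pipe = pipeline('/root/internlm2-chat-1_8b-4bit', backend_config=backend_config)
response = pipe(['Hi, pls intro yourself', '上海是'])
print(response)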
Extensions
Run the vision-language model LLaVA with LMDeploy (this pipeline needs at least a 30% InternStudio dev machine).
Running from the command line
conda activate lmdeploy
pip install git+https://github.com/haotian-liu/LLaVA.git@4e2277a060da264c4f21b364c867cc622c945874
touch /root/pipeline_llava.py
from lmdeploy.vl import load_image
from lmdeploy import pipeline, TurbomindEngineConfig
backend_config = TurbomindEngineConfig(session_len=8192)  # increase session_len for high-resolution images
# pipe = pipeline('liuhaotian/llava-v1.6-vicuna-7b', backend_config=backend_config)  # use this line when not on the dev machine
pipe = pipeline('/share/new_models/liuhaotian/llava-v1.6-vicuna-7b', backend_config=backend_config)
image = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg')
response = pipe(('describe this image', image))
print(response)
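Run the script, same as the earlier pipelines:
python /root/pipeline_llava.py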
Running the LLaVA model with Gradio
touch /root/gradio_llava.py
import gradio as gr
from lmdeploy import pipeline, TurbomindEngineConfig
backend_config = TurbomindEngineConfig(session_len=8192)  # increase session_len for high-resolution images
# pipe = pipeline('liuhaotian/llava-v1.6-vicuna-7b', backend_config=backend_config)  # use this line when not on the dev machine
pipe = pipeline('/share/new_models/liuhaotian/llava-v1.6-vicuna-7b', backend_config=backend_config)
def model(image, text):
    if image is None:
        return [(text, "Please upload an image.")]
    else:
        response = pipe((text, image)).text
        return [(text, response)]
demo = gr.Interface(fn=model, inputs=[gr.Image(type="pil"), gr.Textbox()], outputs=gr.Chatbot())
demo.launch()
python /root/gradio_llava.py
Port forwarding from Windows PowerShell: ssh -CNg -L 7860:127.0.0.1:7860 root@ssh.intern-ai.org.cn -p <your SSH port>
Open http://127.0.0.1:7860 in a browser.
These notes walk through LLM deployment with LMDeploy: the theory background, environment setup, chatting with a model, quantization (KV cache tuning and W4A16), serving and connecting to an API, code integration via the pipeline interface, and running the vision-language model LLaVA, with step-by-step commands and code.