LMDeploy部署InternVL-Chat-V1-5（V100)

最新推荐文章于 2025-11-13 00:33:54 发布

原创最新推荐文章于 2025-11-13 00:33:54 发布 · 427 阅读

5 ·

CC 4.0 BY-SA版权

文章标签：

#人工智能

环境配置同时被 2 个专栏收录

5 篇文章

订阅专栏

大模型

2 篇文章

订阅专栏

本篇写于2024年7月，请注意甄别

环境准备

[LMDeploy官方文档]

LMDeploy 提供了快速安装、模型量化、离线批处理、在线推理服务等功能。每个功能只需简单的几行代码或者命令就可以完成。

环境安装：

### 启动环境
source /mnt/lustre/share/platform/env/lmdeploy

### 新建 conda 环境
conda create -n lmdeploy python=3.8 # 大于3.8即可

# CUDA12
pip install lmdeploy 
pip install timm
# 手动下载 flash_attn.whl 并安装 : https://github.com/Dao-AILab/flash-attention/releases/

# CUDA 11+
# 浏览器下载 https://github.com/InternLM/lmdeploy/releases/download/v0.5.0/lmdeploy-0.5.0+cu118-cp38-cp38-manylinux2014_x86_64.whl
pip install lmdeploy-0.5.0+cu118-cp38-cp38-manylinux2014_x86_64.whl
# 除上述包之外，可能需要手动下载 torch2.2.2-cu118-cp38、torchvision-0.17.2-cu118-cp38
# 下载地址：https://download.pytorch.org/whl/cu118

## 联网情况下
# export LMDEPLOY_VERSION=0.5.0
# export PYTHON_VERSION=38
# pip install https://github.com/InternLM/lmdeploy/releases/download/v${LMDEPLOY_VERSION}/lmdeploy-${LMDEPLOY_VERSION}+cu118-cp${PYTHON_VERSION}-cp${PYTHON_VERSION}-manylinux2014_x86_64.whl --extra-index-url https://download.pytorch.org/whl/cu118

推理引擎：可选Turbomind和Pytorch【支持的模型链接】

【注意】：

1. pipeline 默认申请一定比例显存，用来存储推理过程中产生的 k/v。v0.2.0 默认比例为 0.5，表示 GPU****总显存的 50% 被分配给 k/v cache，lmdeploy > v0.2.1分配策略改为从空闲显存中按比例为 k/v cache 开辟空间。默认比例值调整为 0.8

2. 比例由参数TurbomindEngineConfig.cache_max_entry_count 控制，OOM或缓解显存占用问题可以修改这个参数，V100的机器无法使用 FlashAttention (only supports Ampere GPUs or newer)，通过修改模型根目录下的 config.json 文件来规避这个问题

use_flash_attn： false

attn_implementation：eager

离线推理

from lmdeploy import pipeline
from lmdeploy.messages import TurbomindEngineConfig
from lmdeploy.vl import load_image

model = '/path/to/InternVL-Chat-V1-5'
image = load_image('/path/to/infering_image.jpeg')
backend_config = TurbomindEngineConfig(tp=4) # 量化模型加 model_format='awq'
pipe = pipeline(model, backend_config=backend_config, log_level='INFO')
response = pipe((prompt, image))
print(response)

API-Server

服务端口

srun -p pat_dev --gres=gpu:4 -w target_ip lmdeploy serve api_server /path/to/InternVL-Chat-V1-5/ --server-port 23333 --tp 4
# 量化模型要加 --model-format awq
# 如果现存占用过大 --cache-max-entry-count 0.5 # 默认是 0.8

Infer脚本

from lmdeploy.serve.openai.api_client import APIClient

api_client = APIClient(f'http://{target_ip}:23333')
model_name = api_client.available_models[0]
messages = [
    {'role':'user',
        'content': [
            {
                'type': 'text',
                'text': 'Describe the image please',
            },{
                'type': 'image_url',
                'image_url': {
                     'url':'/path/to/infering_image.jpeg',
                 },
            }]
    }]

res = api_client.chat_completions_v1(model=model_name,messages=messages):

for item in res:
    for k,v in item.items():
        print(k, v)
        print("---")

推理速度

环境：V100 * 4，CUDA 12，lmdeploy 0.5.0

数据：

图像：coco2014测试集（5k）

图像平均分辨率486*575，最大640*640，最小144*176

文本 prompt 统一为“describe this image”

显存占用，cache_max_entry_count=0.5 的情况下，左侧为仅加载模型，右侧为推理状态

batch推理脚本

from lmdeploy import pipeline
from lmdeploy.messages import TurbomindEngineConfig
from lmdeploy.vl import load_image

import os, time, json

def batchInfer(batch_size):
    idx = 0
    res_lists = {}
    while idx < 100-batch_size:
        prompt = 'describe this image'
        images = []
        image_names = []
        for i in range(batch_size):
            image = load_image(os.path.join(image_path, imnames[idx]))
            images.append(image)
            image_names.append(imnames[idx])
            idx += 1

        batch = [(prompt, images[i]) for i in range(batch_size)]

        start_time = time.time()
        response = pipe.batch_infer(batch)
        end_time = time.time()
        stage_time = end_time - start_time

        for idx, res in enumerate(response):
            res_lists[image_names[idx]] = {'modelout': res.text, 'time': stage_time/batch_size}
    return res_lists

if __name__ == '__main__':
    model = '/mnt/cache/wenduren/data/InternVL-Chat-V1-5'
    image_path = "/mnt/cache/wenduren/data/images_coco_2014_5k_test"
    
    backend_config = TurbomindEngineConfig(tp=4, cache_max_entry_count=0.5)
    pipe = pipeline(model, backend_config=backend_config)#, log_level='INFO')

    imnames = os.listdir(image_path)

    test_batch_size = [1, 4, 8, 16, 32]
    for batch_size in test_batch_size:
        res_lists = batchInfer(batch_size)
        with open(f"result/bs{batch_size}/result.txt", "w") as f:
            json.dump(res_lists,f,indent=4)
        print(f'Infererence on Batch Size = {batch_size} Done.')

可能遇到的问题

输入文本prompt没有问题，输入图片就报错

解决方法；

echo $LD_LiBRARY_PATH # 检查是否为空，否则
unset LD_LiBRARY_PATH # 加到 ~/.bashrc