InternLM 实战营笔记-5-LMDeploy量化部署LLM&VLM实践

zwplus

已于 2024-04-17 15:50:11 修改

阅读量828

点赞数 24

分类专栏： InternLM 实战营笔记文章标签：笔记

于 2024-04-17 15:46:17 首次发布

本文链接：https://blog.csdn.net/zwplus/article/details/137875614

版权

InternLM 实战营笔记专栏收录该内容

4 篇文章 0 订阅

订阅专栏

LMDeploy量化部署LLM&VLM实践

LMDeploy 量化部署

大模型部署的挑战

计算量巨大

请添加图片描述

这里的前向推理公式主要适用那些采用了Decoder-only并且是逐个Token生成的LLM，这里公式也是指生成一个Token的前向计算量。

$2 N$ 表示的是单个Token与模型参数之间的计算量，不包括注意力图的计算，这个结果可以通过将LLM简化成一个大的全连接层来看即 $y = w x + b$ ,这里假设 $x 大小为 (1, b), w 大小为 (a, b), b 的大小为 (a,)$ ,由于 $x$ 与 $w$ 进行的是点积运算，所以相应运算量为 $a * b$ 乘法和 $a * (b - 1)$ 加法运算，相应的和偏置的运算量为 $a$ 次加法，因此总计 $2 * a * b$ ，而模型的实际参数量应该是 $N = a * b + a$ ，但是因为 $a * b >> a$ ，所以这里将计算量近似为 $2 N$ 。

除了前面模型输入和模型本身的计算，模型另一个计算开销就是在注意力相关的运算上，这里也是和之前一样，只适合估计当前类似于GPT这中单向且Decoder-only的结构。在这种结构下，只需要计算当前Token和之前Token（包括自己）的注意力关系，之前的Token之间的关系用过KV-Cache记录了，不再需要计算。这里假设前面包括自己在内的Token总数为 $n_{ctx}$ ，那在计算注意力时，实际上是1个Query和 $n_{ctx}$ 个Key以及 $n_{ctx}$ 个Value之间的计算，相应的如果注意维度为 $d$ ,则相应的QK矩阵的计算量应该在 $n_{ctx}*d+n_{ctx}*(d-1)=2*n_{ctx}*d$ ，相应的注意力和value计算的计算量应该在 $n_{ctx}*d+(n_{ctx}-1)*d=2*n_{ctx}*d$ ，所以这里的计算量应该是 $4*n_{layer}*n_{ctx}*d$ ;不太清楚为啥PPT里给是乘2，所以后面还需要看一下这一块的计算。

内存开销大

LLM模型参数量较大，1B模型就产生2G的存储（float16格式）
KV-Cache，现有的大模型会运算过程存储已经计算过的K,V值，以提高运算速度，这一部分也会占用额外的显存，计算公式如下图，这里也是float16的格式

请添加图片描述

访存瓶颈

大模型在推理要密集访问存储在模型中参数和中间结果，因此在实际上大模型推理过程是一个访存密集性任务，这对显存带宽提出很高的要求。

动态请求

用户的请求量、请求时间以及逐个Token生成方式所带来不确定生成数量会影响模型对GPU的利用率

大模型部署方法

模型剪枝：对模型中不重要、不敏感的参数进行剪枝，来给模型瘦身
知识蒸馏：通过引导学生模型（轻量）模仿教师模型（大模型）的方式来确保模型性能的同时，降低模型的参数量
量化：通过将模型的参数或这中间计算结果转换成整数的形式来存储，以降低存储成本和访存量，在具体计算过程还需要使用浮点数来确保精度。

LMDeploy模型对话

huggingface上托管的模型采用的HF格式，因此huggingface的Transformer库是默认支持HF格式模型
TurboMind是LMDeploy采用的一个高效推理引擎，其只能支持基于TurboMind格式的模型进行高效推理，所以如果想要使用TurboMind来加速HF格式模型的推理，需要先将HF格式模型转换为TurboMind格式的模型，LMDeploy在遇到HF格式的模型时会将其自动转换。

使用Transformer库运行InternLM2-Chat-1.8B

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM  

tokenizer = AutoTokenizer.from_pretrained("/root/internlm2-chat-1_8b", trust_remote_code=True)  

# Set `torch_dtype=torch.float16` to load model in float16, otherwise it will be loaded as float32 and cause OOM Error.
model = AutoModelForCausalLM.from_pretrained("/root/internlm2-chat-1_8b", torch_dtype=torch.float16, trust_remote_code=True).cuda()   #以float16精度加载internlm2-chat-1_8b
model = model.eval()

#进行两端对话
inp = "hello"
print("[INPUT]", inp)
response, history = model.chat(tokenizer, inp, history=[])
print("[OUTPUT]", response)

inp = "please provide three suggestions about time management"
print("[INPUT]", inp)
response, history = model.chat(tokenizer, inp, history=history)
print("[OUTPUT]", response)

请添加图片描述

使用LMDeploy与模型对话

lmdeploy chat /root/internlm2-chat-1_8b   #这里nternlm2-chat-1_8b 应该是HF格式的模型，因此lmdeploy在部署时会存在一个额外的转换到TurboMind格式的过程

请添加图片描述
在这里插入图片描述

LMDeploy模型量化

设置最大KV Cache缓存大小

当前基于Decoder-only架构的LLM主要使用casual attention作为主要的attention机制，因此在推理阶段逐个生成Token时，可以通过保存之前的Token的Key和Value的值来避免重复计算，从而提高速度。然而将之前Token的Key和Value保存到显存中，可能会提高模型对显存占用，因此为了降低对显存的占用，我们会考虑将之前的Token的key和value放到内存中存储，但是这样有可能会降低模型运行速度。那这里的KV Cache指的就是我们会在显卡上分配特定大小显存专门用来存储之前Token的Key和Value，在KV Canche没有存满前，之前的Token的Key，Value都会被放在显存上，一旦KV Cache满了，就会将存不下的Key和Value放到内存中。因此大的KV Cache可以提高模型的推理速度，但是也会带来显存占用的提升，小的KV Cache可以降低模型的显存占用，但是也会导致模型推理速度的下降。

lmdeploy 使用–cache-max-entry-count来指定KV缓存占用剩余显存的最大比例，默认比例是0.8

lmdeploy chat /root/internlm2-chat-1_8b #从结果图中可以看到模型现在占据20G左右的显存,1_8b正常4G，所以（24-4）*0.8=16  16+4=20

在这里插入图片描述

lmdeploy chat /root/internlm2-chat-1_8b --cache-max-entry-count 0.5    #从结果图中可以看到模型现在占据14G左右的显存，(24-4)*0.5+4=14

在这里插入图片描述

lmdeploy chat /root/internlm2-chat-1_8b --cache-max-entry-count 0.01   #从结果图中可以看到模型现在占据14G左右的显存，(24-4)*0.01+4=4

在这里插入图片描述

使用W4A16量化

LLM是一个访存密集的任务，在推理时需要密集的访问显存（参数、之前保存的key和value) ,为了降低LLM在推理时的访存开销，我们可以在保存参数、key、value时对其进行量化操作（比如将原本float16量化int8，可以减低一倍访存量），读取时也只读取量化后值。然后在真正计算时，对其进行反量化以确保模型的精度。

KV8量化：针对我们前面提到的Key和Value进行量化，其将逐 Token（Decoding）生成过程中的上下文 K 和 V 中间结果进行 INT8 量化（计算时再反量化），以降低生成过程中的显存占用。

W4A16 量化：其针对模型参数进行量化，特别地其将 FP16 的模型权重量化为 INT4，Kernel 计算时，访存量直接降为 FP16 模型的 1/4，大幅降低了访存成本。Weight Only 是指仅量化权重，数值计算依然采用 FP16（需要将 INT4 权重反量化）。特别地其会根据模型的激活值来判断那些权重对模型的比较重要，然后再量化前对这些权重进行放大，这样原本在小数位上值就会来到整数位上，这样可以降低这些参数在量化时的误差，能够进一步保证模型推理时的精度。

lmdeploy lite auto_awq \
   /root/internlm2-chat-1_8b \
  --calib-dataset 'ptb' \   #这里应该会用这个数据集来计算一下激活值，以便确定那些权重比较重要
  --calib-samples 128 \		#这里应该指会根据128个样本的激活值的综合来缺定一个参数的重要程度
  --calib-seqlen 1024 \
  --w-bits 4 \
  --w-group-size 128 \
  --work-dir /root/internlm2-chat-1_8b-4bit

在这里插入图片描述

lmdeploy chat /root/internlm2-chat-1_8b-4bit --model-format awq  #部署经过awq量化过后的模型

在这里插入图片描述

lmdeploy chat /root/internlm2-chat-1_8b-4bit --model-format awq --cache-max-entry-count 0.01 #部署经过awq的模型，并且将kv-canche调至可以忽略不计，此时可以看到相较于之前4G的显存占用，模型的显存占用来到了2G

在这里插入图片描述

LMDeploy服务(serve)

目前大模型部署的整体架构图如下：

在这里插入图片描述

整体上可以分为如下三个部分：

模型推理/服务。主要提供模型本身的推理，一般来说可以和具体业务解耦，专注模型推理本身性能的优化。可以以模块、API等多种方式提供。
API Server。中间协议层，把后端推理/服务通过HTTP，gRPC或其他形式的接口，供前端调用。
Client。可以理解为前端，与用户交互的地方。通过通过网页端/命令行去调用API接口，获取模型推理/服务。

启动API服务器

这里api服务器就相当于前面架构中的模型推理服务部分，客户端和中间协议层可以通过调用api的方式访问到模型。

lmdeploy serve api_server \
    /root/internlm2-chat-1_8b \
    --model-format hf \
    --quant-policy 0 \
    --server-name 0.0.0.0 \
    --server-port 23333 \
    --tp 1

外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传

在这里插入图片描述

命令行客户端连接API服务器

lmdeploy serve api_client http://localhost:23333  #这使用客户端直接去访问模型的服务，这里相当于没有中间协议层

在这里插入图片描述

网页客户端连接API服务器

lmdeploy serve gradio http://localhost:23333 \
    --server-name 0.0.0.0 \
    --server-port 6006     #这里使用gradio来起到中间协议层的效果

在这里插入图片描述

Python代码集成

Python代码集成运行1.8B模型

from lmdeploy import pipeline

pipe = pipeline('/root/internlm2-chat-1_8b')
response = pipe(['Hi, pls intro yourself', '上海是'])
print(response)

在这里插入图片描述

向TurboMind后端传递参数

from lmdeploy import pipeline, TurbomindEngineConfig

# 调低 k/v cache内存占比调整为总显存的 20%
backend_config = TurbomindEngineConfig(cache_max_entry_count=0.2)

pipe = pipeline('/root/internlm2-chat-1_8b',
                backend_config=backend_config)
response = pipe(['Hi, pls intro yourself', '上海是'])
print(response)

在这里插入图片描述

使用LMDeploy运行视觉多模态大模型llava

在通过Python代码来运行llava

from lmdeploy.vl import load_image
from lmdeploy import pipeline, TurbomindEngineConfig


backend_config = TurbomindEngineConfig(session_len=8192) # 图片分辨率较高时请调高session_len
# pipe = pipeline('liuhaotian/llava-v1.6-vicuna-7b', backend_config=backend_config) 非开发机运行此命令
pipe = pipeline('/share/new_models/liuhaotian/llava-v1.6-vicuna-7b', backend_config=backend_config)

image = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg')
response = pipe(('请描述这张图片', image))  ##从结果图可以看出llava对中文理解存在问题
print(response)

在这里插入图片描述

通过使用gradio来运行llava

import gradio as gr
from lmdeploy import pipeline, TurbomindEngineConfig


backend_config = TurbomindEngineConfig(session_len=8192) # 图片分辨率较高时请调高session_len
# pipe = pipeline('liuhaotian/llava-v1.6-vicuna-7b', backend_config=backend_config) 非开发机运行此命令
pipe = pipeline('/share/new_models/liuhaotian/llava-v1.6-vicuna-7b', backend_config=backend_config)

def model(image, text):
    if image is None:
        return [(text, "请上传一张图片。")]
    else:
        response = pipe((text, image)).text
        return [(text, response)]

demo = gr.Interface(fn=model, inputs=[gr.Image(type="pil"), gr.Textbox()], outputs=gr.Chatbot())
demo.launch()

在这里插入图片描述

定量比较LMDeploy与Transformer库的推理速度差异

使用Transformer推理Internlm2-chat-1.8b

import torch
import datetime
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("/root/internlm2-chat-1_8b", trust_remote_code=True)

# Set `torch_dtype=torch.float16` to load model in float16, otherwise it will be loaded as float32 and cause OOM Error.
model = AutoModelForCausalLM.from_pretrained("/root/internlm2-chat-1_8b", torch_dtype=torch.float16, trust_remote_code=True).cuda()
model = model.eval()

# warmup 就是让模型先运行一阵子，确保模型进入稳定状态
inp = "hello"
for i in range(5):
    print("Warm up...[{}/5]".format(i+1))
    response, history = model.chat(tokenizer, inp, history=[])

# test speed
inp = "请介绍一下你自己。"
times = 10
total_words = 0
start_time = datetime.datetime.now()
for i in range(times):   #进行十次对话
    response, history = model.chat(tokenizer, inp, history=history)
    total_words += len(response)
end_time = datetime.datetime.now()

delta_time = end_time - start_time
delta_time = delta_time.seconds + delta_time.microseconds / 1000000.0
speed = total_words / delta_time
print("Speed: {:.3f} words/s".format(speed))

使用LMDeploy推理Internlm2-chat-1.8b

import datetime
from lmdeploy import pipeline

pipe = pipeline('/root/internlm2-chat-1_8b')

# warmup
inp = "hello"
for i in range(5):
    print("Warm up...[{}/5]".format(i+1))
    response = pipe([inp])

# test speed
inp = "请介绍一下你自己。"
times = 10
total_words = 0
start_time = datetime.datetime.now()
for i in range(times):
    response = pipe([inp])
    total_words += len(response[0].text)
end_time = datetime.datetime.now()

delta_time = end_time - start_time
delta_time = delta_time.seconds + delta_time.microseconds / 1000000.0
speed = total_words / delta_time
print("Speed: {:.3f} words/s".format(speed))

ps:这里漏截Transformer库推理的图了

在这里插入图片描述

zwplus

关注

24
点赞
踩
8

收藏

觉得还不错? 一键收藏
0
评论
InternLM 实战营笔记-5-LMDeploy量化部署LLM&VLM实践

这里的前向推理公式主要适用那些采用了Decoder-only并且是逐个Token生成的LLM，这里公式也是指生成一个Token的前向计算量。2N表示的是单个Token与模型参数之间的计算量，不包括注意力图的计算，这个结果可以通过将LLM简化成一个大的全连接层来看即ywxb,这里假设x大小为1bw大小为abb的大小为a,由于x与w进行的是点积运算，所以相应运算量为a∗b乘法和a∗b−1加法运算，相应的和偏置的运算量为a次加法，因此总计2∗a∗b。
复制链接

扫一扫

专栏目录