L2-LMDeploy: Advanced Quantization and Deployment Practice

RussFinn

Last modified 2024-08-27 10:43:52

Tags: artificial intelligence, python, nlp

First published 2024-08-22 00:25:03

Article link: https://blog.csdn.net/Finnwestbrook/article/details/141402534

Task link: Tutorial/docs/L2/LMDeploy/task.md at camp3 · InternLM/Tutorial · GitHub

1. LMDeploy

2. LMDeploy Lite

2.1 KV Cache

2.2 Online kv cache int4/int8 quantization

2.3 W4A16 model quantization and deployment

2.3.1 W4A16

2.4 W4A16 quantization + KV cache + KV cache quantization

2.5 Function call

2.5.1 API development

2.5.2 Function call

1. LMDeploy

mkdir /root/tmpmodels
ln -s /root/share/new_models/Shanghai_AI_Laboratory/internlm2_5-1_8b-chat /root/tmpmodels
ln -s /root/share/new_models/OpenGVLab/InternVL2-26B /root/tmpmodels

Try chatting with the model:

lmdeploy chat /root/tmpmodels/internlm2_5-1_8b-chat

Check GPU memory usage before running:

After running:

Run it as an API server:

conda activate lmdeploy
lmdeploy serve api_server /root/tmpmodels/internlm2_5-1_8b-chat --model-format hf --quant-policy 0 --server-name 0.0.0.0 --server-port 23333 --tp 1
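# Command breakdown (same command, one argument per line):
#   serve api_server         start the API service
#   /root/tmpmodels/...      model path
#   --model-format hf        model format
#   --quant-policy 0         quantization policy
#   --server-name 0.0.0.0    server name; 0.0.0.0 means all network interfaces
#   --server-port 23333      port number
#   --tp 1                   number of GPUs used in parallel
lmdeploy serve api_server \
    /root/tmpmodels/internlm2_5-1_8b-chat \
    --model-format hf \
    --quant-policy 0 \
    --server-name 0.0.0.0 \
    --server-port 23333 \
    --tp 1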

Open a new terminal window and connect to the API server:

conda activate lmdeploy
lmdeploy serve api_client http://localhost:23333

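1.1 GPU memory calculation

InternLM2.5 1.8B uses bfloat16, so each parameter is stored as a 16-bit floating-point number (2 bytes).

The model weights of InternLM2.5 1.8B therefore take roughly:

1.8 * 10^9 parameters * 2 Bytes/parameter = 3.6GB

On a 30% slice of an A100, the available GPU memory is 80 * 30% = 24GB. With LMDeploy running inference in bf16, the 1.8B model weights take 3.6GB. LMDeploy also sets cache-max-entry-count to 0.8 by default, meaning the kv cache takes 80% of the remaining GPU memory.

With 3.6GB used by the weights, 24 - 3.6 = 20.4GB remains, so the kv cache takes 20.4GB * 0.8 = 16.32GB. Adding the weights back gives a total of 3.6 + 16.32 = 19.92GB, or about 20398MB.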
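The arithmetic above can be written as a small helper. This is only a sketch of the estimate used in this post, not how LMDeploy itself accounts for memory; the function name `estimate_gpu_memory_gb` and its parameters are made up for illustration, and it ignores activations and other runtime overhead.

```python
def estimate_gpu_memory_gb(n_params, bytes_per_param, total_gb, cache_ratio=0.8):
    """Estimate weights + kv cache usage in GB (hypothetical helper).

    Assumes the kv cache takes `cache_ratio` of whatever memory is left
    after the weights are loaded, as described in the text.
    """
    weights_gb = n_params * bytes_per_param / 1e9
    kv_cache_gb = (total_gb - weights_gb) * cache_ratio
    return weights_gb, kv_cache_gb, weights_gb + kv_cache_gb


# 1.8B parameters in bf16 (2 bytes each) on a 24GB slice of an A100
weights, kv_cache, total = estimate_gpu_memory_gb(1.8e9, 2, 24)
print(f"weights={weights:.2f}GB kv_cache={kv_cache:.2f}GB total={total:.2f}GB")
```

Changing `cache_ratio` models a different `--cache-max-entry-count` value.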

2. LMDeploy Lite

2.1 KV Cache

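The kv cache is a caching technique that stores key-value pairs so that earlier computation can be reused instead of repeated, trading GPU memory for speed. In large-scale inference it significantly cuts redundant computation and so speeds up generation. Ideally the entire kv cache lives in GPU memory for fast access.

At run time, a model's GPU memory use falls roughly into three parts: the model parameters themselves, the kv cache, and intermediate computation results. LMDeploy's kv cache manager exposes the --cache-max-entry-count parameter, which caps the fraction of the remaining GPU memory that the kv cache may take. The default is 0.8.

Set the kv cache ratio to 0.4 (default 0.8):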

lmdeploy chat /root/tmpmodels/internlm2_5-1_8b-chat --cache-max-entry-count 0.4

GPU memory usage is now: 3.6GB + 20.4GB * 0.4 = 11.76GB

2.2 Online kv cache int4/int8 quantization

Since v0.4.0, LMDeploy supports online kv cache int4/int8 quantization, using per-head, per-token asymmetric quantization. Applying it is simple: just set the quant_policy and cache-max-entry-count parameters. Currently, quant_policy=4 selects kv int4 quantization and quant_policy=8 selects kv int8 quantization.

lmdeploy serve api_server /root/tmpmodels/internlm2_5-1_8b-chat --model-format hf --quant-policy 4 --cache-max-entry-count 0.4 --server-name 0.0.0.0 --server-port 23333 --tp 1

GPU memory usage is almost unchanged, because both runs use cache-max-entry-count == 0.4. The difference is that setting quant-policy to 4 quantizes the kv cache to int4 precision.
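To make "asymmetric quantization" more concrete, here is a minimal pure-Python sketch that quantizes one vector (think of one head of one token) to 4-bit codes with its own scale and zero point. The function names are hypothetical and this is not LMDeploy's kernel, only an illustration of the idea.

```python
def quantize_asymmetric(values, bits=4):
    """Map floats to integer codes in [0, 2**bits - 1] using the vector's min/max."""
    qmax = (1 << bits) - 1
    lo, hi = min(values), max(values)
    scale = (hi - lo) / qmax if hi > lo else 1.0  # avoid division by zero
    codes = [round((v - lo) / scale) for v in values]
    return codes, scale, lo


def dequantize(codes, scale, zero):
    """Recover approximate floats from the integer codes."""
    return [c * scale + zero for c in codes]


vector = [-1.0, -0.5, 0.0, 0.25, 2.0]
codes, scale, zero = quantize_asymmetric(vector)
restored = dequantize(codes, scale, zero)
max_error = max(abs(a - b) for a, b in zip(vector, restored))
print(codes, round(max_error, 4))
```

Because the range is taken from the vector's own minimum and maximum, the codes use the full 0 to 15 range even when the values are not centered on zero, and the reconstruction error stays within half a quantization step.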

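Compared with a BF16 kv cache, an int4 cache needs only 4 bits per value where BF16 needs 16. In the same amount of GPU memory, an int4 cache can therefore hold four times as many elements as a BF16 one.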
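The four-fold figure follows directly from the bit widths. The sketch below checks it and then shows how per-vector metadata would trim the ratio slightly. The head dimension of 128 and the 16-bit scale plus 16-bit zero point per head per token are assumptions for illustration, not LMDeploy's documented storage layout.

```python
def elements_in_budget(budget_bytes, bits_per_element):
    """Number of cache elements that fit in a byte budget."""
    return budget_bytes * 8 // bits_per_element


budget = 1 << 30  # 1 GiB of kv cache, an arbitrary example
bf16_elems = elements_in_budget(budget, 16)
int4_elems = elements_in_budget(budget, 4)
print(int4_elems / bf16_elems)

# Hypothetical metadata: one scale and one zero point per head per token
head_dim = 128   # assumed head dimension
meta_bits = 32   # assumed 16-bit scale + 16-bit zero point
effective_bits = 4 + meta_bits / head_dim
print(16 / effective_bits)
```

Under these assumptions the effective cost is 4.25 bits per element, so the practical gain would be a little under four-fold.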

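2.3 W4A16 model quantization and deployment

Model quantization is an optimization technique that aims to shrink a machine learning model and speed up its inference. It works by converting the model's weights and activations from high precision (such as 16-bit floating point) to low precision (such as 8-bit integers, 4-bit integers, or even binary networks).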

2.3.1 W4A16

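W4: the weights are quantized to 4-bit integers (int4). The model's weight parameters are converted from their original floating-point representation (for example FP32, BF16, or FP16; InternLM2.5 uses BF16) to a 4-bit integer representation, which significantly reduces model size.
A16: the activations (or inputs/outputs) stay in 16-bit floating point (for example FP16 or BF16). Activations are the data that flows through the network, typically produced after each layer's computation.

A W4A16 quantization configuration therefore means:

Weights are quantized to 4-bit integers.
Activations remain 16-bit floating point.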
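One way to see why int4 weights cost half a byte each: two 4-bit codes fit in a single byte. The sketch below packs and unpacks such codes in plain Python. The helper names are hypothetical, and the packing layout used by real AWQ kernels is different; this only illustrates the storage arithmetic.

```python
def pack_int4(codes):
    """Pack 4-bit codes (each assumed to be in 0..15) two per byte, low nibble first."""
    if len(codes) % 2:
        raise ValueError("need an even number of codes")
    return bytes(lo | (hi << 4) for lo, hi in zip(codes[0::2], codes[1::2]))


def unpack_int4(packed):
    """Inverse of pack_int4: split every byte back into two 4-bit codes."""
    codes = []
    for byte in packed:
        codes.append(byte & 0x0F)
        codes.append(byte >> 4)
    return codes


weights_int4 = [0, 15, 7, 8, 3, 12]
packed = pack_int4(weights_int4)
print(len(weights_int4), "weights stored in", len(packed), "bytes")
```

Unpacking returns the original codes, so no information is lost by the packing itself; the precision loss comes only from the earlier rounding to 4 bits.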

Use the AWQ algorithm to quantize the model weights to 4 bits:

lmdeploy lite auto_awq /root/tmpmodels/internlm2_5-1_8b-chat --calib-dataset 'ptb' --calib-samples 128 --calib-seqlen 2048 --w-bits 4 --w-group-size 128 --batch-size 4 --search-scale False --work-dir /root/tmpmodels/internlm2_5-1_8b-chat-w4a16-4bit

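# Command breakdown (same command, one argument per line):
#   --calib-dataset 'ptb'    calibration dataset, here 'ptb' (Penn Treebank, a common language-modeling dataset)
#   --calib-samples 128      number of calibration samples: 128
#   --calib-seqlen 2048      sequence length used during calibration: 2048
#   --w-bits 4               quantize the weights to 4 bits
#   --work-dir ...           working directory
lmdeploy lite auto_awq \
   /root/tmpmodels/internlm2_5-1_8b-chat \
  --calib-dataset 'ptb' \
  --calib-samples 128 \
  --calib-seqlen 2048 \
  --w-bits 4 \
  --w-group-size 128 \
  --batch-size 4 \
  --search-scale False \
  --work-dir /root/tmpmodels/internlm2_5-1_8b-chat-w4a16-4bit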

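When the run completes, check the size of the model files. The first two entries are symlinks, so they take up no space.

The original model is 3.6GB and the quantized one is 1.5GB. In principle, though, shouldn't the model shrink to 1/4 of its original size (bf16 -> int4, 2 bytes -> 0.5 byte)?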
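One plausible explanation for the gap, offered as an assumption rather than something verified in this post: weight-only quantization of this kind usually covers the linear layers inside the transformer blocks, while the token embedding and the output head stay in 16-bit, and every group of 128 weights also stores a scale and a zero point. The sketch below estimates the resulting size. The vocabulary size (92544), hidden size (2048), untied embedding and output matrices, and the 16-bit scale plus 4-bit zero point per group are all assumptions to check against the model's config.json and the quantized checkpoint.

```python
def estimate_w4a16_size_gb(total_params, vocab_size, hidden_size, group_size=128):
    """Rough size of a W4A16 checkpoint in GB (hypothetical helper)."""
    # Assumed to stay in 16-bit: token embedding + output head (untied)
    embed_params = 2 * vocab_size * hidden_size
    quant_params = total_params - embed_params
    embed_bytes = embed_params * 2        # 2 bytes per 16-bit parameter
    weight_bytes = quant_params * 0.5     # 0.5 byte per int4 weight
    # Assumed per-group metadata: 16-bit scale (2 bytes) + 4-bit zero point (0.5 byte)
    meta_bytes = quant_params / group_size * 2.5
    return (embed_bytes + weight_bytes + meta_bytes) / 1e9


size_gb = estimate_w4a16_size_gb(1.8e9, 92544, 2048)
print(f"{size_gb:.2f}GB")
```

Under these assumptions the estimate lands close to the observed 1.5GB, which would make the unquantized embedding and output layers the main reason the file is larger than 1/4 of the original.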

Run the quantized model and check GPU memory:

lmdeploy chat /root/tmpmodels/internlm2_5-1_8b-chat-w4a16-4bit/ --model-format awq

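GPU memory usage drops from about 20968M (20.5GB) to 20356M (19.9GB), a reduction of roughly 600M.

1. At BF16 precision, the 1.8B model weights take 3.6GB;

2. the kv cache takes 16.32GB;

3. other items.

So 20.48GB = 3.6GB of weights + 16.32GB of kv cache + other items.

After W4A16 quantization, the memory usage (19.9GB) breaks down as:

1. At int4 precision, the 1.8B model weights take 0.9GB;

2. the kv cache takes (24 - 0.9) * 0.8 = 18.48GB;

3. other items.

So 19.9GB = 0.9GB of weights + 18.48GB of kv cache + other items.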
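The two breakdowns can be checked with a few lines of Python. This sketch takes the observed totals reported in this post (converted at 1024 MB per GB) and subtracts the estimated weights and kv cache to get the "other" share. The helper name `breakdown` is hypothetical, and the 24GB total and 0.8 cache ratio are the values used earlier in the post.

```python
def breakdown(observed_mb, weights_gb, total_gb=24, cache_ratio=0.8):
    """Split an observed memory reading into weights, kv cache, and the rest."""
    observed_gb = observed_mb / 1024
    kv_cache_gb = (total_gb - weights_gb) * cache_ratio
    other_gb = observed_gb - weights_gb - kv_cache_gb
    return observed_gb, kv_cache_gb, other_gb


for label, mb, w in [("bf16", 20968, 3.6), ("w4a16", 20356, 0.9)]:
    observed, kv, other = breakdown(mb, w)
    print(f"{label}: observed={observed:.2f}GB kv_cache={kv:.2f}GB other={other:.2f}GB")
```

The "other" share comes out at roughly half a gigabyte in both cases, which suggests that most of the memory freed by the smaller weights is simply reclaimed by the larger kv cache. That is why the total drops by only about 600M.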

2.4 W4A16 quantization + KV cache + KV cache quantization

First start the API service:

lmdeploy serve api_server /root/tmpmodels/internlm2_5-1_8b-chat-w4a16-4bit/ --model-format awq --quant-policy 4 --cache-max-entry-count 0.4 --server-name 0.0.0.0 --server-port 23333 --tp 1

GPU memory usage:

2.5 Function call

2.5.1 API development

# internlm2_5.py
from openai import OpenAI

# Create an OpenAI client; it needs an API key and the API base URL.
client = OpenAI(
    # We are talking to a local API server, which needs no key, so any value works.
    api_key='YOUR_API_KEY',
    # Base URL of the API: the local address and port of the server.
    base_url="http://0.0.0.0:23333/v1",
)

# models.list() returns the available models, each with an id attribute;
# use the id of the first one.
model_name = client.models.list().data[0].id

# Send a chat completion request.
response = client.chat.completions.create(
    model=model_name,  # ID of the model to use
    messages=[
        # System message: defines the assistant's behavior.
        {"role": "system", "content": "You are a friendly assistant who helps solve problems."},
        # User message: asks for a short story.
        {"role": "user", "content": "Tell me a short story about a fox and a watermelon"},
    ],
    temperature=0.8,  # randomness of the output; higher values give more random text
    top_p=0.8,  # diversity of the output; higher values give more diverse text
)

# Print the text of the API response.
print(response.choices[0].message.content)

2.5.2 Function call

First start the API server:

lmdeploy serve api_server /root/tmpmodels/internlm2_5-1_8b-chat --model-format hf --quant-policy 0 --server-name 0.0.0.0 --server-port 23333 --tp 1

touch /root/internlm2_5_func.py

# internlm2_5_func.py
from openai import OpenAI


def add(a: int, b: int):
    return a + b


def mul(a: int, b: int):
    return a * b


tools = [{
    'type': 'function',
    'function': {
        'name': 'add',
        'description': 'Compute the sum of two numbers',
        'parameters': {
            'type': 'object',
            'properties': {
                'a': {
                    'type': 'int',
                    'description': 'A number',
                },
                'b': {
                    'type': 'int',
                    'description': 'A number',
                },
            },
            'required': ['a', 'b'],
        },
    }
}, {
    'type': 'function',
    'function': {
        'name': 'mul',
        'description': 'Calculate the product of two numbers',
        'parameters': {
            'type': 'object',
            'properties': {
                'a': {
                    'type': 'int',
                    'description': 'A number',
                },
                'b': {
                    'type': 'int',
                    'description': 'A number',
                },
            },
            'required': ['a', 'b'],
        },
    }
}]
messages = [{'role': 'user', 'content': 'Compute (3+5)*2'}]

client = OpenAI(api_key='YOUR_API_KEY', base_url='http://0.0.0.0:23333/v1')
model_name = client.models.list().data[0].id
response = client.chat.completions.create(
    model=model_name,
    messages=messages,
    temperature=0.8,
    top_p=0.8,
    stream=False,
    tools=tools)
print(response)
func1_name = response.choices[0].message.tool_calls[0].function.name
func1_args = response.choices[0].message.tool_calls[0].function.arguments
func1_out = eval(f'{func1_name}(**{func1_args})')
print(func1_out)

messages.append({
    'role': 'assistant',
    'content': response.choices[0].message.content
})
messages.append({
    'role': 'environment',
    'content': f'3+5={func1_out}',
    'name': 'plugin'
})
response = client.chat.completions.create(
    model=model_name,
    messages=messages,
    temperature=0.8,
    top_p=0.8,
    stream=False,
    tools=tools)
print(response)
func2_name = response.choices[0].message.tool_calls[0].function.name
func2_args = response.choices[0].message.tool_calls[0].function.arguments
func2_out = eval(f'{func2_name}(**{func2_args})')
print(func2_out)

It took many runs: the result went from 8 to 15 before the model finally arrived at the correct answer, 16.
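The script above runs the model's tool call through `eval`, which executes whatever text the model returns. A more defensive variant is sketched here, under the assumption that `function.arguments` is a JSON string such as `{"a": 3, "b": 5}`: it looks the function up in an explicit table and parses the arguments with `json.loads`. The names `TOOL_TABLE` and `run_tool_call` are hypothetical, and the calls at the bottom use hand-written arguments rather than a real model response.

```python
import json


def add(a: int, b: int):
    return a + b


def mul(a: int, b: int):
    return a * b


# Only functions listed here can be called by the model.
TOOL_TABLE = {"add": add, "mul": mul}


def run_tool_call(name, arguments):
    """Dispatch one tool call given its name and JSON-encoded arguments."""
    if name not in TOOL_TABLE:
        raise ValueError(f"unknown tool: {name}")
    return TOOL_TABLE[name](**json.loads(arguments))


# Simulate the two steps of (3+5)*2 with hand-written arguments
step1 = run_tool_call("add", '{"a": 3, "b": 5}')
step2 = run_tool_call("mul", json.dumps({"a": step1, "b": 2}))
print(step1, step2)
```

In the script, `run_tool_call(func1_name, func1_args)` could stand in for the `eval` line, so an unexpected function name raises an error instead of being executed.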