[InternLM Hands-On Camp] LMDeploy Quantized Deployment, Advanced Practice: int4 Quantization and Deployment of a 1.8B Model

Latest recommended article published on 2024-10-06 20:16:20

纸灰机飞啊飞


Article link: https://blog.csdn.net/weixin_45575017/article/details/142690541


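Results at a glance

1. Running the quantized 1.8B model takes only a little more than half of what the original needs; stacking all three techniques (W4A16 quantization + a smaller KV cache + KV cache quantization) uses even less GPU memory.
2. Stacking too many of these techniques may degrade model quality.
3. The quantized 1.8B model could not call tools; for that, the 7B model is the better choice.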

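Background

As models keep growing, we need compression techniques to lower deployment cost and improve inference performance. LMDeploy offers two strategies: weight quantization and k/v cache quantization.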

Environment setup

In the environment set up earlier, just update these dependencies to the versions below:

pip install timm==1.0.8 openai==1.40.3 lmdeploy[all]==0.5.3

Expose the models we need through symbolic links:

mkdir /root/models
ln -s /root/share/new_models/Shanghai_AI_Laboratory/internlm2_5-7b-chat /root/models
ln -s /root/share/new_models/Shanghai_AI_Laboratory/internlm2_5-1_8b-chat /root/models
ln -s /root/share/new_models/OpenGVLab/InternVL2-26B /root/models

Model before quantization

First verify that the unquantized model deploys and runs correctly:

lmdeploy chat /root/models/internlm2_5-1_8b-chat --model-format hf

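[Screenshot]
As the screenshot shows, the deployed model actually occupies 20.5 GB.

Estimated GPU memory usage of the unquantized model:

On a 24 GB GPU (30% of an A100), the weights take 3.35 GB and the kv cache takes 16.52 GB, 19.87 GB in total.

1.8 x 10^9 params x 2 (bytes/param) / 1024 / 1024 / 1024 = 3.35 GB
(24 GB - 3.35 GB) x 0.8 = 16.52 GB
3.35 GB + 16.52 GB = 19.87 GB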
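The arithmetic above can be wrapped in a small helper so that the later configurations are easy to recompute. This is only a sketch of the article's own formula; the function name `estimate_gpu_memory` is made up for illustration, and the estimate ignores runtime overhead such as the CUDA context and activations, which is presumably why the measured value is somewhat higher.

```python
GIB = 1024 ** 3


def estimate_gpu_memory(n_params, bytes_per_param, total_gib=24.0, cache_ratio=0.8):
    """Return (weights, kv_cache, total) in GB, following the article's formula.

    kv_cache is cache_ratio times the memory left after the weights are loaded.
    """
    weights = n_params * bytes_per_param / GIB
    kv_cache = (total_gib - weights) * cache_ratio
    return weights, kv_cache, weights + kv_cache


# 1.8B parameters in 16-bit precision with the default cache ratio of 0.8
w, kv, total = estimate_gpu_memory(1.8e9, 2)
print(f"weights={w:.2f} GB, kv cache={kv:.2f} GB, total={total:.2f} GB")
```

Passing `cache_ratio=0.4` or `bytes_per_param=0.5` to the same helper gives the estimates for the other setups discussed in this article.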

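LLM compression option 1: cap the kv cache size

By default, LMDeploy lets the kv cache take up to 0.8 of the GPU memory left after the weights are loaded; here we lower that to 0.4.

Set --cache-max-entry-count to 0.4: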

lmdeploy chat /root/models/internlm2_5-1_8b-chat --cache-max-entry-count 0.4

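[Screenshot]
As the screenshot shows, the deployed model actually occupies 12.5 GB.

In theory, the GPU memory usage is:

The weights take 3.35 GB and the kv cache takes 8.26 GB, 11.61 GB in total.

1.8 x 10^9 params x 2 (bytes/param) / 1024 / 1024 / 1024 = 3.35 GB
(24 GB - 3.35 GB) x 0.4 = 8.26 GB
3.35 GB + 8.26 GB = 11.61 GB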

LLM compression option 2: online kv cache int4/int8 quantization

Since v0.4.0, LMDeploy supports online kv cache int4/int8 quantization, using per-head, per-token asymmetric quantization. Enabling it is simple: set the quant_policy and cache-max-entry-count parameters. Currently, quant_policy=4 selects kv int4 quantization and quant_policy=8 selects kv int8 quantization.
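To make "per-head, per-token asymmetric quantization" more concrete, here is a minimal pure-Python sketch. It is not LMDeploy's implementation (which runs in GPU kernels); the function names are invented for illustration, and the only idea it is meant to show is that every (token, head) vector gets its own scale and zero point derived from its minimum and maximum.

```python
def quantize_asymmetric(values, n_bits=4):
    """Map floats to integer codes in [0, 2**n_bits - 1] with a scale and zero point."""
    q_max = 2 ** n_bits - 1
    lo, hi = min(values), max(values)
    # Fall back to 1.0 for constant vectors to avoid dividing by zero.
    scale = (hi - lo) / q_max or 1.0
    codes = [min(q_max, max(0, round((v - lo) / scale))) for v in values]
    return codes, scale, lo


def dequantize(codes, scale, zero):
    """Approximate reconstruction of the original floats."""
    return [c * scale + zero for c in codes]


def quantize_kv(kv, n_bits=4):
    """kv[token][head] is a list of floats; each such vector is quantized on its own."""
    return [[quantize_asymmetric(vec, n_bits) for vec in heads] for heads in kv]


# One token, two heads, tiny head dimension (toy numbers)
toy_kv = [[[-1.0, -0.5, 0.0, 0.5, 2.0], [0.1, 0.2, 0.3, 0.4, 0.5]]]
for codes, scale, zero in quantize_kv(toy_kv)[0]:
    print(codes, dequantize(codes, scale, zero))
```

Because the range is computed per vector, one head with large values does not force a coarse step size onto the others; the reconstruction error of each value is at most half of that vector's scale.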

lmdeploy chat /root/models/internlm2_5-1_8b-chat \
	--cache-max-entry-count 0.4 \
	--quant-policy 4

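[Screenshot]
As the screenshot shows, the deployed model occupies 12.5 GB, about the same as when only the maximum kv cache size was reduced.

So what is the difference between the 12.5 GB used with online kv cache int4/int8 quantization and the 12.5 GB used when only the maximum kv cache size is set?

Both runs use the BF16 internlm2.5 1.8B model, so the weights take 3.35 GB and the remaining GPU memory is 20.65 GB in both cases. cache-max-entry-count is also 0.4 in both, which means LMDeploy allocates 40% of the remaining memory, 8.26 GB, to the kv cache. With quant-policy set to 4, the cache is stored in int4 precision, so LMDeploy pre-allocates those 8.26 GB as an int4 kv cache.

Compared with a BF16 kv cache, an int4 cache needs only 4 bits per value instead of 16 within the same 8.26 GB. The int4 cache can therefore hold about four times as many elements as the BF16 one.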
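The "four times as many elements" point can be turned into a rough token count. The sketch below assumes the usual layout of two tensors (key and value) per layer and ignores the small extra storage for quantization scales and zero points. The architecture numbers in `ARCH` are placeholders chosen for illustration, not values taken from this article; read the real ones from the model's config.json.

```python
GIB = 1024 ** 3


def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, bits):
    """Bytes needed to cache keys and values for one token."""
    return 2 * n_layers * n_kv_heads * head_dim * bits / 8


def tokens_that_fit(budget_gib, n_layers, n_kv_heads, head_dim, bits):
    """How many tokens fit into a kv cache of budget_gib."""
    per_token = kv_bytes_per_token(n_layers, n_kv_heads, head_dim, bits)
    return int(budget_gib * GIB // per_token)


# Hypothetical architecture numbers, for illustration only.
ARCH = dict(n_layers=24, n_kv_heads=8, head_dim=128)
budget = 8.26  # kv cache size computed above for cache-max-entry-count 0.4

for name, bits in [("bf16", 16), ("int8", 8), ("int4", 4)]:
    print(name, tokens_that_fit(budget, bits=bits, **ARCH))
```

The memory reported by the GPU stays the same in both runs because the cache is pre-allocated; what changes is how many tokens, and therefore how many concurrent or long sessions, that cache can hold.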

LLM compression option 3: quantize the model weights (int4)

lmdeploy lite auto_awq \
   /root/models/internlm2_5-1_8b-chat \
  --calib-dataset 'ptb' \
  --calib-samples 128 \
  --calib-seqlen 2048 \
  --w-bits 4 \
  --w-group-size 128 \
  --batch-size 1 \
  --search-scale False \
  --work-dir /root/models/internlm2_5-1_8b-chat-w4a16-4bit

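Command explanation:

lmdeploy lite auto_awq: lite is the LMDeploy subcommand that starts the quantization process, and auto_awq selects AWQ (Activation-aware Weight Quantization).

/root/models/internlm2_5-1_8b-chat: path to the model files.

--calib-dataset 'ptb': the calibration dataset, here 'ptb' (Penn Treebank, a commonly used language-modeling dataset).

--calib-samples 128: the number of calibration samples, 128.

--calib-seqlen 2048: the sequence length used during calibration, 2048.

--w-bits 4: weights are quantized to 4 bits.

--work-dir /root/models/internlm2_5-1_8b-chat-w4a16-4bit: the working directory where the quantized model and intermediate results are stored.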

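After quantization

Quantizing the 1.8B model took a little over an hour. The quantized model is about half the size of the original (though in theory it should be only 1/4?).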
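A back-of-the-envelope estimate suggests why the checkpoint lands nearer 1/2 than 1/4. The sketch assumes that only the linear layers inside the transformer blocks are stored in 4 bits, that the input embedding and output head stay in 16-bit, and that every group of 128 weights carries one 16-bit scale and one 4-bit zero point. These are assumptions about the storage format, and the vocabulary size, hidden size and total parameter count below are illustrative values rather than figures from this article.

```python
GIB = 1024 ** 3


def w4a16_checkpoint_gib(total_params, unquantized_params, w_bits=4,
                         group_size=128, scale_bits=16, zero_bits=4):
    """Rough size of a weight-only quantized checkpoint in GB.

    unquantized_params are kept at 2 bytes each (assumed: embeddings, output head).
    Each group of group_size quantized weights stores one scale and one zero point.
    """
    quantized_params = total_params - unquantized_params
    bits_per_weight = w_bits + (scale_bits + zero_bits) / group_size
    quantized_bytes = quantized_params * bits_per_weight / 8
    unquantized_bytes = unquantized_params * 2
    return (quantized_bytes + unquantized_bytes) / GIB


# Illustrative assumptions, not measurements:
vocab, hidden = 92_544, 2_048
embed_params = 2 * vocab * hidden   # untied input embedding + output head
total_params = 1.89e9

fp16_gib = total_params * 2 / GIB
w4_gib = w4a16_checkpoint_gib(total_params, embed_params)
print(f"16-bit: {fp16_gib:.2f} GB, 4-bit: {w4_gib:.2f} GB, ratio: {w4_gib / fp16_gib:.2f}")
```

For a small model the embedding tables are a large share of all parameters, so leaving them in 16-bit moves the overall ratio well above 1/4; for a much larger model the same calculation gets closer to 1/4.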
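[Screenshot]

[Screenshot]
As the screenshot shows, the deployed model actually occupies 20.5 GB.

Estimated GPU memory usage at this point:

The weights take 0.84 GB and the kv cache takes 18.53 GB, 19.37 GB in total.

1.8 x 10^9 params x 0.5 (bytes/param) / 1024 / 1024 / 1024 = 0.84 GB
(24 GB - 0.84 GB) x 0.8 = 18.53 GB
0.84 GB + 18.53 GB = 19.37 GB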

Stacking all three compression techniques: W4A16 quantization + smaller KV cache + KV cache quantization

lmdeploy chat /root/models/internlm2_5-1_8b-chat-w4a16-4bit \
	--model-format awq  \
	--cache-max-entry-count 0.4 \
	--quant-policy 4

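[Screenshot]
As the screenshot shows, the deployed model occupies 11.2 GB, which is indeed much less GPU memory. However, compressing this aggressively seems to hurt model quality: with the test prompt "写一篇欢庆国庆的论文" ("write an essay celebrating National Day"), the output ended up repeating itself.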
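The same three settings can also be passed through LMDeploy's Python API instead of the CLI. The sketch below assumes that `pipeline` and `TurbomindEngineConfig` from the installed lmdeploy version accept `model_format`, `cache_max_entry_count` and `quant_policy` as in the official documentation; the helper names `stacked_engine_kwargs` and `build_pipeline` are made up here, and the actual call is left commented out because it needs a GPU and the quantized model.

```python
def stacked_engine_kwargs(cache_ratio=0.4, kv_quant_bits=4):
    """Engine settings mirroring --model-format, --cache-max-entry-count, --quant-policy."""
    if kv_quant_bits not in (0, 4, 8):
        raise ValueError("quant_policy must be 0 (off), 4 or 8")
    if not 0 < cache_ratio <= 1:
        raise ValueError("cache ratio must be in (0, 1]")
    return {
        "model_format": "awq",
        "cache_max_entry_count": cache_ratio,
        "quant_policy": kv_quant_bits,
    }


def build_pipeline(model_path, **engine_kwargs):
    """Create an LMDeploy pipeline; imported lazily so this file loads without lmdeploy."""
    from lmdeploy import pipeline, TurbomindEngineConfig
    return pipeline(model_path, backend_config=TurbomindEngineConfig(**engine_kwargs))


print(stacked_engine_kwargs())

# Usage on a machine with the quantized model available:
# pipe = build_pipeline("/root/models/internlm2_5-1_8b-chat-w4a16-4bit",
#                       **stacked_engine_kwargs())
# print(pipe(["写一篇欢庆国庆的论文"]))
```

Keeping the settings in one dictionary makes it easy to switch a single technique off, for example `kv_quant_bits=0`, when checking which of the three is responsible for a drop in output quality.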

Serving the triple-stacked quantized model through a local API

1. Deploy the quantized model

lmdeploy serve api_server \
    /root/models/internlm2_5-1_8b-chat-w4a16-4bit \
    --model-format awq \
    --quant-policy 4 \
    --cache-max-entry-count 0.4 \
    --server-name 0.0.0.0 \
    --server-port 23333 \
    --tp 1

2. Build a local API client with the OpenAI class

# Import the OpenAI class, which is used to talk to OpenAI-compatible APIs
from openai import OpenAI


# Create a client; it needs an API key and the base URL of the API
client = OpenAI(
    api_key='YOUR_API_KEY',
    # We are calling a local API that needs no key, so any value works here
    base_url="http://0.0.0.0:23333/v1"
    # Base URL of the API: the local address and port of the api_server
)

# List the available models and take the ID of the first one
# models.list() returns a list of models, each with an id attribute
model_name = client.models.list().data[0].id

# Send a chat completion request
# The arguments below describe the details of the request
response = client.chat.completions.create(
    model=model_name,
    # ID of the model to use
    messages=[
        # Each dict in the list is one message
        {"role": "system", "content": "你是一个友好的小助手，会遵循用户指定帮助用户得到想要的回复。"},
        # System message: defines the assistant's behavior
        {"role": "user", "content": "写一篇欢庆国庆的论文"},
        # User message: asks for an essay celebrating National Day
    ],
    temperature=0.8,
    # Sampling randomness: higher values give more random text
    top_p=0.8
    # Nucleus sampling threshold: higher values allow more diverse text
)

# Print the reply returned by the API
print(response.choices[0].message.content)

After running the script, the server terminal shows the incoming requests:

INFO:     Uvicorn running on http://0.0.0.0:23333 (Press CTRL+C to quit)
INFO:     127.0.0.1:35550 - "GET /v1/models HTTP/1.1" 200 OK
INFO:     127.0.0.1:35550 - "POST /v1/chat/completions HTTP/1.1" 200 OK

Output (the model replies in Chinese):

(internlm-demo) root@intern-studio-50006073:~/models# python invoke_w4a16_openaiApi.py 
标题：欢庆国庆：团结奋进，共筑辉煌

引言：
在中华人民共和国的壮丽画卷上，国庆节是那最为璀璨的一颗明珠，它不仅是对国家历史的一次回顾，更是对未来发展愿景的一次深情承诺。每年的十月一日，全国上下洋溢着对祖国的热爱与感激，欢庆国庆不仅是对过去成就的致敬，更是对未来发展的期许。

一、国庆节的历史与文化

国庆节，作为中国最隆重的传统节日之一，其历史可以追溯到1949年10月1日，这一天的历史意义深远。它不仅标志着中华人民共和国的正式成立，更是中华民族从苦难中崛起、在复兴中前行的重要里程碑。在这一天，全国上下通过各种形式来庆祝，包括阅兵仪式、群众游行、音乐表演等，展现了中国作为一个大国自信与繁荣的姿态。

文化方面，国庆节期间，传统与现代、东方与西方的文化交融，各种非物质文化遗产与现代艺术相辅相成，共同塑造了一个文化的多样性。在节日的庆祝中，我们不仅看到了民族自豪，也看到了文化自信。

二、国庆节的社会意义与国家形象

从社会层面来看，国庆节不仅是国家精神象征的体现，更是社会团结、和谐发展的象征。它鼓励了全国人民的团结与奋斗，激发了人民群众的自豪感和责任感，从而提升了国家的凝聚力。

在国家形象方面，国庆节通过各种形式对外展示国家的成就与尊严，提升了国际社会对中国作为一个负责任大国的认知。通过大型展览、国际交流等活动，我们不仅展现了中国在科技、经济、文化等各个领域的辉煌成就，也传递出和平发展、合作共赢的国际形象。

三、国庆节对个人与社会的启示

对个人而言，欢庆国庆节不仅是一种情感的释放，更是一种精神境界的提升。在忙碌的工作中，通过庆祝国庆节，我们能够暂时抽离喧嚣，沉淀心情，重新审视生活的意义。它提醒我们，无论面对多大的挑战，都要保持对美好生活的向往和追求。

对社会而言，国庆节是一个展示国家成就、激发民族精神的平台。它鼓励人们以更开放的心态接纳新思想、新观念，共同营造一个充满活力、充满希望的社会环境。同时，通过庆祝国庆节，我们也能感受到国家对全体公民的关怀与承诺，激发了全社会的责任感与使命感。

结语：
欢庆国庆节，不仅仅是对国家的赞美，更是对未来的期许。在这个特殊的时刻，我们应当牢记历史，传承精神，共同努力，为实现中华民族的伟大复兴而奋斗。通过欢庆国庆节，我们不仅表达了对祖国的深情，也展示了作为中国人对未来的无限憧憬。让我们一起，以欢庆为起点，以奋斗为动力，共同书写中华民族的辉煌篇章。
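The two requests in the server log above (GET /v1/models and POST /v1/chat/completions) are ordinary HTTP calls, so the OpenAI client is a convenience rather than a requirement. As a rough sketch using only the standard library, the POST request could be built like this; `build_chat_request` is a hypothetical helper, "some-model-id" is a placeholder for the ID returned by /v1/models, and the request is only constructed here, not sent.

```python
import json
import urllib.request


def build_chat_request(base_url, model, messages, temperature=0.8, top_p=0.8):
    """Build (but do not send) the POST request for /chat/completions."""
    payload = {
        "model": model,
        "messages": messages,
        "temperature": temperature,
        "top_p": top_p,
    }
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # The local server ignores the key, so any value works.
            "Authorization": "Bearer YOUR_API_KEY",
        },
        method="POST",
    )


req = build_chat_request(
    "http://0.0.0.0:23333/v1",
    "some-model-id",  # placeholder
    [{"role": "user", "content": "写一篇欢庆国庆的论文"}],
)
print(req.get_method(), req.full_url)

# To send it while the api_server is running:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```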

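Having the model call tools to carry out a calculation

This uses the code from the tutorial, but in practice the 1.8B model could not invoke the tools at all; many prompt variations were tried without success.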

from openai import OpenAI


def add(a: int, b: int):
    return a + b


def mul(a: int, b: int):
    return a * b


tools = [{
    'type': 'function',
    'function': {
        'name': 'add',
        'description': 'Compute the sum of two numbers',
        'parameters': {
            'type': 'object',
            'properties': {
                'a': {
                    'type': 'int',
                    'description': 'A number',
                },
                'b': {
                    'type': 'int',
                    'description': 'A number',
                },
            },
            'required': ['a', 'b'],
        },
    }
}, {
    'type': 'function',
    'function': {
        'name': 'mul',
        'description': 'Calculate the product of two numbers',
        'parameters': {
            'type': 'object',
            'properties': {
                'a': {
                    'type': 'int',
                    'description': 'A number',
                },
                'b': {
                    'type': 'int',
                    'description': 'A number',
                },
            },
            'required': ['a', 'b'],
        },
    }
}]
messages = [{'role': 'user', 'content': 'use the provided tools to do the calculation: (100+1231)*2'},
{'role': 'assistant', 'content': '我直接会计算 (100+1231)*2 = 1321'},
{'role': 'user', 'content': 'Try again! remember to use the tools!'}
]


client = OpenAI(api_key='YOUR_API_KEY', base_url='http://0.0.0.0:23333/v1')
model_name = client.models.list().data[0].id
response = client.chat.completions.create(
    model=model_name,
    messages=messages,
    temperature=0.8,
    top_p=0.8,
    stream=False,
    tools=tools)

print(f"***********response tool 1***********\n{response}")
print(f"\n***********response tool 1 end***********")

func1_name = response.choices[0].message.tool_calls[0].function.name
func1_args = response.choices[0].message.tool_calls[0].function.arguments
func1_out = eval(f'{func1_name}(**{func1_args})')
print(func1_out)

messages.append({
    'role': 'assistant',
    'content': response.choices[0].message.content
})
messages.append({
    'role': 'environment',
    'content': f'100+1231={func1_out}',
    'name': 'plugin'
})
response = client.chat.completions.create(
    model=model_name,
    messages=messages,
    temperature=0.8,
    top_p=0.8,
    stream=False,
    tools=tools)

print(f"***********response tool 2***********\n{response}")
print(f"\n***********response tool 2 end***********")

func2_name = response.choices[0].message.tool_calls[0].function.name
func2_args = response.choices[0].message.tool_calls[0].function.arguments
func2_out = eval(f'{func2_name}(**{func2_args})')
print(func2_out)

The runs fail:

(internlm-demo) root@intern-studio-50006073:~/models# python invoke_w4a16_openaiApi_func.py 
***********response tool 1***********
ChatCompletion(id='13', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content='好的，我现在用正确的工具帮你计算。 (100+1231)*2 = 2432', refusal=None, role='assistant', function_call=None, tool_calls=None))], created=1727936020, model='/root/models/internlm2_5-1_8b-chat-w4a16-4bit', object='chat.completion', service_tier=None, system_fingerprint=None, usage=CompletionUsage(completion_tokens=21, prompt_tokens=308, total_tokens=329))

***********response tool 1 end***********
Traceback (most recent call last):
  File "/root/models/invoke_w4a16_openaiApi_func.py", line 72, in <module>
    func1_name = response.choices[0].message.tool_calls[0].function.name
TypeError: 'NoneType' object is not subscriptable
(internlm-demo) root@intern-studio-50006073:~/models# python invoke_w4a16_openaiApi_func.py 
***********response tool 1***********
ChatCompletion(id='14', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content='好的，我现在重新计算。 (100+1231)*2 = 2331', refusal=None, role='assistant', function_call=None, tool_calls=None))], created=1727936060, model='/root/models/internlm2_5-1_8b-chat-w4a16-4bit', object='chat.completion', service_tier=None, system_fingerprint=None, usage=CompletionUsage(completion_tokens=18, prompt_tokens=310, total_tokens=328))

***********response tool 1 end***********
Traceback (most recent call last):
  File "/root/models/invoke_w4a16_openaiApi_func.py", line 72, in <module>
    func1_name = response.choices[0].message.tool_calls[0].function.name
TypeError: 'NoneType' object is not subscriptable
(internlm-demo) root@intern-studio-50006073:~/models# python invoke_w4a16_openaiApi_func.py 
***********response tool 1***********
ChatCompletion(id='15', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content='(100+1231)*2 = 2432', refusal=None, role='assistant', function_call=None, tool_calls=None))], created=1727936246, model='/root/models/internlm2_5-1_8b-chat-w4a16-4bit', object='chat.completion', service_tier=None, system_fingerprint=None, usage=CompletionUsage(completion_tokens=12, prompt_tokens=307, total_tokens=319))

***********response tool 1 end***********
Traceback (most recent call last):
  File "/root/models/invoke_w4a16_openaiApi_func.py", line 72, in <module>
    func1_name = response.choices[0].message.tool_calls[0].function.name
TypeError: 'NoneType' object is not subscriptable
(internlm-demo) root@intern-studio-50006073:~/models# python invoke_w4a16_openaiApi_func.py 
***********response tool 1***********
ChatCompletion(id=None, choices=None, created=None, model=None, object='error', service_tier=None, system_fingerprint=None, usage=None, message='Failed to parse fc related info to json format!', code=400)

***********response tool 1 end***********
Traceback (most recent call last):
  File "/root/models/invoke_w4a16_openaiApi_func.py", line 72, in <module>
    func1_name = response.choices[0].message.tool_calls[0].function.name
TypeError: 'NoneType' object is not subscriptable
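The TypeError in these runs comes from indexing `tool_calls` when the model answered in plain text (`tool_calls=None`) or when the server returned an error object (`choices=None`). A defensive variant of the dispatch step might look like the sketch below. It also looks the function up in a dictionary and parses the arguments with `json.loads` instead of calling `eval` on model output. The names `TOOL_REGISTRY` and `run_first_tool_call` are illustrative, and the fake response objects only mimic the attribute layout of the responses printed above.

```python
import json
from types import SimpleNamespace


def add(a: int, b: int):
    return a + b


def mul(a: int, b: int):
    return a * b


TOOL_REGISTRY = {"add": add, "mul": mul}


def run_first_tool_call(response):
    """Run the first requested tool, or return None if no tool was requested."""
    choices = getattr(response, "choices", None)
    if not choices:
        return None  # error object with choices=None
    tool_calls = choices[0].message.tool_calls
    if not tool_calls:
        return None  # the model answered in plain text
    call = tool_calls[0].function
    func = TOOL_REGISTRY[call.name]
    return func(**json.loads(call.arguments))


def fake_response(name, arguments):
    """Stand-in with the same attribute layout as a chat completion."""
    function = SimpleNamespace(name=name, arguments=arguments)
    message = SimpleNamespace(tool_calls=[SimpleNamespace(function=function)])
    return SimpleNamespace(choices=[SimpleNamespace(message=message)])


plain_text = SimpleNamespace(
    choices=[SimpleNamespace(message=SimpleNamespace(tool_calls=None))])

print(run_first_tool_call(fake_response("add", '{"a": 100, "b": 1231}')))
print(run_first_tool_call(plain_text))
```

With a guard like this the script can report that no tool was called instead of crashing, which makes it easier to compare how often different models or prompts actually produce a tool call.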

With the 7B model, however, it goes very smoothly:

***********response tool 1***********
ChatCompletion(id='1', choices=[Choice(finish_reason='tool_calls', index=0, logprobs=None, message=ChatCompletionMessage(content="To solve the given problem, I will break it down into two steps:\n\n1. Add the numbers 100 and 1231.\n2. Multiply the result by 2.\n\nLet's start with the first step.", refusal=None, role='assistant', function_call=None, tool_calls=[ChatCompletionMessageToolCall(id='0', function=Function(arguments='{"a": 100, "b": 1231}', name='add'), type='function')]))], created=1727936881, model='/root/models/internlm2_5-7b-chat', object='chat.completion', service_tier=None, system_fingerprint=None, usage=CompletionUsage(completion_tokens=71, prompt_tokens=272, total_tokens=343))

***********response tool 1 end***********
1331
***********response tool 2***********
ChatCompletion(id='2', choices=[Choice(finish_reason='tool_calls', index=0, logprobs=None, message=ChatCompletionMessage(content='Now that we have the sum of 100 and 1231, which is 1331, we can proceed to the second step. We need to multiply this sum by 2.', refusal=None, role='assistant', function_call=None, tool_calls=[ChatCompletionMessageToolCall(id='1', function=Function(arguments='{"a": 1331, "b": 2}', name='mul'), type='function')]))], created=1727936882, model='/root/models/internlm2_5-7b-chat', object='chat.completion', service_tier=None, system_fingerprint=None, usage=CompletionUsage(completion_tokens=64, prompt_tokens=337, total_tokens=401))

***********response tool 2 end***********
2662