Running Models with FP8 in vLLM

Introduction

vLLM supports hardware-accelerated FP8 (8-bit floating point) computation on GPUs such as the Nvidia H100 and AMD MI300x. Currently, only Hopper and Ada Lovelace GPUs are supported. Quantizing a model to FP8 reduces its memory footprint by 2x and can increase throughput by up to 1.6x, with minimal impact on accuracy.
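
The quickest way to try FP8 in vLLM is online quantization: pass quantization="fp8" when loading an ordinary FP16/BF16 checkpoint, and vLLM quantizes the weights to FP8 at load time (activations use dynamic scales). A minimal sketch of this usage; the model path is a placeholder you would replace with your own checkpoint:

from vllm import LLM, SamplingParams

# Quantize the weights of an unquantized checkpoint to FP8 while loading it.
llm = LLM(model="Qwen/Qwen2.5-7B-Instruct", quantization="fp8")

# Quick smoke test.
outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)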

The FP8 type comes in two different representations, each useful in different scenarios (the snippet after this list prints their numeric limits):

  • E4M3: consists of 1 sign bit, 4 exponent bits, and 3 mantissa bits. It can store values up to +/-448 as well as nan.
  • E5M2: consists of 1 sign bit, 5 exponent bits, and 2 mantissa bits. It can store values up to +/-57344, +/-inf, and nan. The larger dynamic range comes at the cost of lower precision for the stored values.
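
These limits can be verified directly in PyTorch; a quick sketch, assuming PyTorch 2.1 or newer (the first release that exposes the float8 dtypes):

import torch

# Print the representable range of the two FP8 formats described above.
for dtype in (torch.float8_e4m3fn, torch.float8_e5m2):
    info = torch.finfo(dtype)
    print(f"{dtype}: max={info.max}, min={info.min}, smallest normal={info.tiny}")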

Quantizing a Model

llmcompressor

qwen2.5
"""pip install llmcompressor -i https://pypi.tuna.tsinghua.edu.cn/simple"""

from llmcompressor.transformers import SparseAutoModelForCausalLM, oneshot
from transformers import AutoTokenizer
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "/data/modelscope/hub/qwen/Qwen2___5-32B-Instruct"

# load model
model = SparseAutoModelForCausalLM.from_pretrained(
  MODEL_ID, device_map="auto", torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Configure the simple PTQ quantization
recipe = QuantizationModifier(
  targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"])

# Apply the quantization algorithm.
oneshot(model=model, recipe=recipe)

# Save the model.
SAVE_DIR = MODEL_ID + "-FP8-Dynamic"
print('-------------------' + SAVE_DIR)
model.save_pretrained(SAVE_DIR)
tokenizer.save_pretrained(SAVE_DIR)
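
The saved FP8-Dynamic checkpoint can then be loaded by vLLM directly, which reads the quantization config written into the checkpoint. A minimal offline-inference sketch, reusing the SAVE_DIR path from above:

from vllm import LLM, SamplingParams

# Load the llmcompressor FP8 checkpoint; no extra quantization flags are needed.
llm = LLM(model="/data/modelscope/hub/qwen/Qwen2___5-32B-Instruct-FP8-Dynamic")

outputs = llm.generate(
    ["Give me a short introduction to large language models."],
    SamplingParams(temperature=0.7, max_tokens=128),
)
print(outputs[0].outputs[0].text)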

qwen2.5-vl
"""pip install llmcompressor -i https://mirrors.aliyun.com/pypi/simple"""

from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "/data/modelscope/hub/Qwen/Qwen2___5-VL-7B-Instruct"

# load model
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    MODEL_ID, device_map="auto", torch_dtype="auto"
)
processor = AutoProcessor.from_pretrained(MODEL_ID)

# Configure the simple PTQ quantization
recipe = QuantizationModifier(
    targets="Linear", scheme="FP8_DYNAMIC", ignore=["re:.*lm_head", "re:visual.*"]
)

# Apply the quantization algorithm.
SAVE_DIR = MODEL_ID + "-FP8-Dynamic"
print('-------------------' + SAVE_DIR)
oneshot(model=model, recipe=recipe, output_dir=SAVE_DIR)

# Save the model.
processor.save_pretrained(SAVE_DIR)

# Confirm generations of the quantized model look sane.
print("========== SAMPLE GENERATION ==============")
input_ids = processor(text="Hello my name is", return_tensors="pt").input_ids.to("cuda")
output = model.generate(input_ids, max_new_tokens=20)
print(processor.decode(output[0]))
print("==========================================")

AutoFP8 (no longer recommended upstream)

Download AutoFP8
git clone https://github.com/neuralmagic/AutoFP8.git
pip install -e AutoFP8
Quantize with AutoFP8
from auto_fp8 import AutoFP8ForCausalLM, BaseQuantizeConfig

pretrained_model_dir = "/data/modelscope/qwen/Qwen2-72B-Instruct"
quantized_model_dir = "/data/modelscope/qwen/Qwen2-72B-FP8-Instruct"

# Define quantization config with dynamic activation scales
quantize_config = BaseQuantizeConfig(quant_method="fp8", activation_scheme="dynamic")
# For dynamic activation scales, there is no need for calibration examples
examples = []

# Load the model, quantize, and save checkpoint
model = AutoFP8ForCausalLM.from_pretrained(pretrained_model_dir, quantize_config)
model.quantize(examples)
model.save_quantized(quantized_model_dir)
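
If static activation scales are preferred (activation_scheme="static"), the AutoFP8 README passes a small batch of tokenized calibration prompts instead of an empty list. A rough sketch along those lines, reusing pretrained_model_dir and quantized_model_dir from above; hedged, since the project is deprecated and its API may have drifted:

from transformers import AutoTokenizer
from auto_fp8 import AutoFP8ForCausalLM, BaseQuantizeConfig

tokenizer = AutoTokenizer.from_pretrained(pretrained_model_dir, use_fast=True)
tokenizer.pad_token = tokenizer.eos_token

# A few calibration prompts, tokenized and moved to the GPU.
examples = tokenizer(
    ["auto_fp8 is an easy-to-use model quantization library"],
    return_tensors="pt",
).to("cuda")

quantize_config = BaseQuantizeConfig(quant_method="fp8", activation_scheme="static")
model = AutoFP8ForCausalLM.from_pretrained(pretrained_model_dir, quantize_config)
model.quantize(examples)
model.save_quantized(quantized_model_dir)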