Running Models with FP8 in vLLM

Introduction

vLLM supports hardware-accelerated FP8 (8-bit floating point) computation on GPUs such as the Nvidia H100 and AMD MI300x. Currently, only Hopper and Ada Lovelace GPUs are supported. Quantizing a model to FP8 reduces its memory footprint by 2x and can increase throughput by up to 1.6x, with minimal impact on accuracy.
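
The quickest way to try FP8 in vLLM is online quantization: pass quantization="fp8" when loading an ordinary FP16/BF16 checkpoint, and vLLM quantizes the weights to FP8 at load time (activations use dynamic scales). A minimal sketch of this usage; the model path is a placeholder you would replace with your own checkpoint:

from vllm import LLM, SamplingParams

# Quantize the weights of an unquantized checkpoint to FP8 while loading it.
llm = LLM(model="Qwen/Qwen2.5-7B-Instruct", quantization="fp8")

# Quick smoke test.
outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)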

The FP8 type comes in two different representations, each useful in different scenarios (the snippet after this list prints their numeric limits):

  • E4M3: consists of 1 sign bit, 4 exponent bits, and 3 mantissa bits. It can store values up to +/-448 as well as nan.
  • E5M2: consists of 1 sign bit, 5 exponent bits, and 2 mantissa bits. It can store values up to +/-57344, +/-inf, and nan. The larger dynamic range comes at the cost of lower precision for the stored values.
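
These limits can be verified directly in PyTorch; a quick sketch, assuming PyTorch 2.1 or newer (the first release that exposes the float8 dtypes):

import torch

# Print the representable range of the two FP8 formats described above.
for dtype in (torch.float8_e4m3fn, torch.float8_e5m2):
    info = torch.finfo(dtype)
    print(f"{dtype}: max={info.max}, min={info.min}, smallest normal={info.tiny}")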

Quantizing a Model

llmcompressor

qwen2.5
"""pip install llmcompressor -i https://pypi.tuna.tsinghua.edu.cn/simple"""

from llmcompressor.transformers import SparseAutoModelForCausalLM, oneshot
from transformers import AutoTokenizer
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "/data/modelscope/hub/qwen/Qwen2___5-32B-Instruct"

# load model
model = SparseAutoModelForCausalLM.from_pretrained(
  MODEL_ID, device_map="auto", torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Configure the simple PTQ quantization
recipe = QuantizationModifier(
  targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"])

# Apply the quantization algorithm.
oneshot(model=model, recipe=recipe)

# Save the model.
SAVE_DIR = MODEL_ID + "-FP8-Dynamic"
print('-------------------' + SAVE_DIR)
model.save_pretrained(SAVE_DIR)
tokenizer.save_pretrained(SAVE_DIR)
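
The saved FP8-Dynamic checkpoint can then be loaded by vLLM directly, which reads the quantization config written into the checkpoint. A minimal offline-inference sketch, reusing the SAVE_DIR path from above:

from vllm import LLM, SamplingParams

# Load the llmcompressor FP8 checkpoint; no extra quantization flags are needed.
llm = LLM(model="/data/modelscope/hub/qwen/Qwen2___5-32B-Instruct-FP8-Dynamic")

outputs = llm.generate(
    ["Give me a short introduction to large language models."],
    SamplingParams(temperature=0.7, max_tokens=128),
)
print(outputs[0].outputs[0].text)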

qwen2.5-vl
"""pip install llmcompressor -i https://mirrors.aliyun.com/pypi/simple"""

from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "/data/modelscope/hub/Qwen/Qwen2___5-VL-7B-Instruct"

# load model
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    MODEL_ID, device_map="auto", torch_dtype="auto"
)
processor = AutoProcessor.from_pretrained(MODEL_ID)

# Configure the simple PTQ quantization
recipe = QuantizationModifier(
    targets="Linear", scheme="FP8_DYNAMIC", ignore=["re:.*lm_head", "re:visual.*"]
)

# Apply the quantization algorithm.
SAVE_DIR = MODEL_ID + "-FP8-Dynamic"
print('-------------------' + SAVE_DIR)
oneshot(model=model, recipe=recipe, output_dir=SAVE_DIR)

# Save the model.
processor.save_pretrained(SAVE_DIR)

# Confirm generations of the quantized model look sane.
print("========== SAMPLE GENERATION ==============")
input_ids = processor(text="Hello my name is", return_tensors="pt").input_ids.to("cuda")
output = model.generate(input_ids, max_new_tokens=20)
print(processor.decode(output[0]))
print("==========================================")

AutoFP8 (no longer recommended upstream)

Download AutoFP8
git clone https://github.com/neuralmagic/AutoFP8.git
pip install -e AutoFP8
Quantize with AutoFP8
from auto_fp8 import AutoFP8ForCausalLM, BaseQuantizeConfig

pretrained_model_dir = "/data/modelscope/qwen/Qwen2-72B-Instruct"
quantized_model_dir = "/data/modelscope/qwen/Qwen2-72B-FP8-Instruct"

# Define quantization config with dynamic activation scales
quantize_config = BaseQuantizeConfig(quant_method="fp8", activation_scheme="dynamic")
# For dynamic activation scales, there is no need for calibration examples
examples = []

# Load the model, quantize, and save checkpoint
model = AutoFP8ForCausalLM.from_pretrained(pretrained_model_dir, quantize_config)
model.quantize(examples)
model.save_quantized(quantized_model_dir)
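
If static activation scales are preferred (activation_scheme="static"), the AutoFP8 README passes a small batch of tokenized calibration prompts instead of an empty list. A rough sketch along those lines, reusing pretrained_model_dir and quantized_model_dir from above; hedged, since the project is deprecated and its API may have drifted:

from transformers import AutoTokenizer
from auto_fp8 import AutoFP8ForCausalLM, BaseQuantizeConfig

tokenizer = AutoTokenizer.from_pretrained(pretrained_model_dir, use_fast=True)
tokenizer.pad_token = tokenizer.eos_token

# A few calibration prompts, tokenized and moved to the GPU.
examples = tokenizer(
    ["auto_fp8 is an easy-to-use model quantization library"],
    return_tensors="pt",
).to("cuda")

quantize_config = BaseQuantizeConfig(quant_method="fp8", activation_scheme="static")
model = AutoFP8ForCausalLM.from_pretrained(pretrained_model_dir, quantize_config)
model.quantize(examples)
model.save_quantized(quantized_model_dir)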