[Model Optimization] 2. Model Quantization

Author: 斌zz

Last modified: 2025-05-19 11:59:36

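Column: Deep Learning from Scratch. Tags: deep learning, computer vision, artificial intelligence, machine learning, object detection

First published: 2025-05-19 11:59:11

Article link: https://blog.csdn.net/ayiya_Oese/article/details/148060362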

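This article covers quantization techniques for deep learning models in detail: fixed-point quantization, dynamic quantization, static quantization, quantization-aware training, and mixed-precision quantization, along with recent advances in the field.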

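1. Quantization Basics

1.1 What Quantization Is

Quantization converts high-precision floating-point values (such as FP32) into a lower-precision representation (such as INT8). Storing weights in INT8 instead of FP32 cuts their size by roughly 4x, and integer arithmetic lowers compute cost and can speed up inference on hardware with INT8 support.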
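As a rough illustration of the size reduction, the following sketch computes the weight storage of a model at two bit widths. The parameter count is a made-up figure chosen for easy arithmetic, and the calculation ignores metadata such as scales and zero points.

```python
def model_size_mb(num_params: int, bits_per_param: int) -> float:
    """Storage needed for the weights alone, in megabytes (1 MB = 1e6 bytes)."""
    return num_params * bits_per_param / 8 / 1e6

num_params = 25_000_000  # hypothetical parameter count, for illustration only
fp32_mb = model_size_mb(num_params, 32)
int8_mb = model_size_mb(num_params, 8)
print(f"FP32: {fp32_mb:.1f} MB, INT8: {int8_mb:.1f} MB, ratio: {fp32_mb / int8_mb:.1f}x")
```

In practice the ratio is slightly below 4x, because the quantization parameters and any layers left in floating point also need storage.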

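1.2 Comparison of Quantization Types

Quantization type	Accuracy loss	Speedup	Implementation complexity	Typical use case
Fixed-point quantization	Medium	High	Low	General purpose
Dynamic quantization	Fairly low	Medium	Low	Activation ranges vary widely at inference time
Static quantization	Medium	High	Medium	A calibration dataset is required
Quantization-aware training	Lowest	High	High	Accuracy-critical applications
Mixed-precision quantization	Low	Medium	High	Models with complex structure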

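2. Fixed-Point Quantization

2.1 How It Works

Fixed-point quantization maps floating-point values to fixed-point (integer) values in the following steps:

Determine the quantization range
Compute the quantization parameters (scale and zero_point)
Apply the quantization mapping
Perform the arithmetic in fixed point
Dequantize to recover the result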

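2.2 Mathematical Formulation

Quantization is given by:

$q = \mathrm{round}\left(\frac{r}{scale} + zero\_point\right)$

Dequantization is given by:

$r \approx (q - zero\_point) \times scale$

Here $r$ is the original floating-point value and $q$ is the quantized integer. The dequantized value matches $r$ only up to rounding error.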
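To make the two formulas concrete, here is a small worked example in plain Python. It assumes an unsigned 8-bit target range and a floating-point range of [-1, 3]; both choices, and the helper names, are illustrative only.

```python
def quant_params(r_min, r_max, num_bits=8):
    """Scale and integer zero point mapping [r_min, r_max] onto [0, 2^bits - 1]."""
    qmin, qmax = 0, 2 ** num_bits - 1
    scale = (r_max - r_min) / (qmax - qmin)
    zero_point = round(qmin - r_min / scale)
    return scale, zero_point

def quantize(r, scale, zero_point, num_bits=8):
    """q = round(r / scale + zero_point), clamped to the integer range."""
    q = round(r / scale + zero_point)
    return max(0, min(2 ** num_bits - 1, q))

def dequantize(q, scale, zero_point):
    """Approximate reconstruction: (q - zero_point) * scale."""
    return (q - zero_point) * scale

scale, zero_point = quant_params(-1.0, 3.0)
q = quantize(0.5, scale, zero_point)
r_hat = dequantize(q, scale, zero_point)
print(f"scale={scale:.6f}, zero_point={zero_point}, q={q}, r_hat={r_hat:.6f}")
```

With these inputs the scale is 4/255, the zero point rounds to 64, and 0.5 maps to the integer 96, which dequantizes to about 0.502. The gap between 0.5 and 0.502 is the quantization error.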

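2.3 Implementation Example

import torch

def quantize_tensor(x, num_bits=8):
    # Quantization parameters for an unsigned num_bits-wide integer range
    qmin = 0.
    qmax = 2.**num_bits - 1.
    # Guard against a zero scale when every element of x is identical
    scale = torch.clamp(x.max() - x.min(), min=1e-8) / (qmax - qmin)
    # Round the zero point so that it is an integer, as integer kernels require
    zero_point = torch.round(qmin - x.min() / scale)

    # Quantize: scale, shift, round to the nearest integer, then clamp
    q_x = torch.round(x / scale + zero_point)
    q_x = torch.clamp(q_x, qmin, qmax)

    # Dequantize
    x_hat = (q_x - zero_point) * scale
    return q_x, scale, zero_point, x_hat

# Usage example
x = torch.randn(4, 4)
q_x, scale, zp, x_hat = quantize_tensor(x)

# Mean absolute quantization error
quant_error = torch.abs(x - x_hat).mean().item()
print(f"Quantization error: {quant_error:.6f}")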

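3. Dynamic Quantization

3.1 Characteristics and Use Cases

Quantization parameters for activations are determined at runtime
Weights are quantized ahead of time; activations are quantized on the fly during inference
Suited to models whose weights are fixed but whose activation ranges vary from input to input
Adds some per-inference overhead for computing the ranges, in exchange for flexibility
Simple to apply: no retraining or calibration is required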
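The runtime behaviour described in the list above can be sketched in plain Python. This is a simplified illustration rather than what any particular framework does internally: the range is recomputed from each batch of activations, so two batches with very different magnitudes get very different scales. The function name and sample values are made up.

```python
def dynamic_quantize(activations, num_bits=8):
    """Quantize one batch using a range computed from that batch itself."""
    qmin, qmax = 0, 2 ** num_bits - 1
    lo, hi = min(activations), max(activations)
    lo, hi = min(lo, 0.0), max(hi, 0.0)  # keep 0.0 inside the representable range
    scale = (hi - lo) / (qmax - qmin) or 1.0  # fall back to 1.0 for an all-zero batch
    zero_point = round(qmin - lo / scale)
    q = [max(qmin, min(qmax, round(a / scale + zero_point))) for a in activations]
    return q, scale, zero_point

small_batch = [0.01, 0.02, 0.03]   # tiny activations
large_batch = [-50.0, 0.0, 120.0]  # wide-ranging activations
q_small, small_scale, _ = dynamic_quantize(small_batch)
q_large, large_scale, _ = dynamic_quantize(large_batch)
print(f"small batch scale: {small_scale:.6f}, large batch scale: {large_scale:.6f}")
```

A single fixed scale would either waste resolution on the small batch or clip the large one; computing it per batch avoids both, at the cost of the extra min/max pass.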

3.2 PyTorch Implementation

import torch

# Build a small model
model = torch.nn.Sequential(
    torch.nn.Linear(784, 128),
    torch.nn.ReLU(),
    torch.nn.Linear(128, 10)
)
model.eval()

# Dynamic quantization
quantized_model = torch.quantization.quantize_dynamic(
    model,  # model to quantize
    {torch.nn.Linear},  # layer types to quantize
    dtype=torch.qint8  # quantized data type
)

# Inference
input_tensor = torch.randn(1, 784)
with torch.no_grad():
    output = quantized_model(input_tensor)

# Save the quantized model
torch.jit.save(torch.jit.script(quantized_model), "dynamic_quantized_model.pt")

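3.3 Performance Evaluation

import time
import torch

# Reuses model, quantized_model, and input_tensor from section 3.2
def benchmark(m, x, runs=100, warmup=10):
    with torch.no_grad():
        for _ in range(warmup):  # warm-up runs are excluded from the timing
            m(x)
        start = time.perf_counter()
        for _ in range(runs):
            m(x)
    return time.perf_counter() - start

# Time the original model
fp32_time = benchmark(model, input_tensor)

# Time the quantized model
int8_time = benchmark(quantized_model, input_tensor)

print(f"FP32 inference time: {fp32_time:.4f} s")
print(f"INT8 inference time: {int8_time:.4f} s")
print(f"Speedup: {fp32_time/int8_time:.2f}x")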

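4. Static Quantization

4.1 How It Works

Static quantization quantizes both weights and activations before inference. Because the activation ranges are fixed ahead of time, a calibration dataset is run through the model to determine the activation quantization parameters.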
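The calibration step can be pictured with a minimal plain-Python sketch. It assumes the simplest possible strategy, tracking the running minimum and maximum over all calibration batches; the class name `RangeObserver` and the sample batches are hypothetical, and real frameworks offer more robust observers.

```python
class RangeObserver:
    """Tracks the running min/max of every value seen during calibration."""

    def __init__(self):
        self.lo = float("inf")
        self.hi = float("-inf")

    def observe(self, batch):
        self.lo = min(self.lo, min(batch))
        self.hi = max(self.hi, max(batch))

    def quant_params(self, num_bits=8):
        """Freeze the observed range into a scale and an integer zero point."""
        qmin, qmax = 0, 2 ** num_bits - 1
        scale = (self.hi - self.lo) / (qmax - qmin)
        zero_point = round(qmin - self.lo / scale)
        return scale, zero_point

calibration_batches = [[-1.0, 0.5, 0.2], [0.1, 3.0, -0.4]]  # made-up activations
observer = RangeObserver()
for batch in calibration_batches:
    observer.observe(batch)
scale, zero_point = observer.quant_params()
print(f"observed range: [{observer.lo}, {observer.hi}], scale={scale:.6f}, zero_point={zero_point}")
```

Once calibration ends, the scale and zero point stay fixed for every later input, which is why the calibration data should resemble what the model will see in production.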

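4.2 PyTorch Implementation

import torch
from torch.ao.quantization import get_default_qconfig_mapping
from torch.ao.quantization.quantize_fx import prepare_fx, convert_fx

def quantize_model(model, calibration_data, example_inputs):
    # Quantization config (fbgemm targets x86, qnnpack targets ARM)
    qconfig_mapping = get_default_qconfig_mapping('fbgemm')

    # Prepare for quantization (trace the model and insert observers).
    # Recent PyTorch releases require example_inputs, a tuple of sample inputs.
    model_prepared = prepare_fx(model, qconfig_mapping, example_inputs)

    # Calibrate (collect activation statistics)
    with torch.no_grad():
        for data in calibration_data:
            model_prepared(data)

    # Convert to a quantized model (replace float ops with integer ops)
    model_quantized = convert_fx(model_prepared)

    return model_quantized

# Usage example (YourModel and get_calibration_data are placeholders you must supply)
model = YourModel()
model.eval()  # the model must be in eval mode before quantization
calibration_data = get_calibration_data()
example_inputs = (calibration_data[0],)  # assumes an indexable collection of input tensors
quantized_model = quantize_model(model, calibration_data, example_inputs)

# Save the quantized model
torch.jit.save(torch.jit.script(quantized_model), "static_quantized_model.pt")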

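5. Quantization-Aware Training

5.1 Training Workflow

Insert fake-quantization nodes
Simulate quantization effects in the forward pass
Update model parameters through backpropagation
Export the quantized model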
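A fake-quantization node can be sketched in plain Python as a quantize-then-dequantize round trip on a single value. The gradient helper shows the common straight-through convention of passing gradients unchanged inside the representable range and blocking them outside it. Both function names are hypothetical, and this is a scalar illustration rather than a framework implementation.

```python
def fake_quantize(x, scale, zero_point, qmin=0, qmax=255):
    """Forward pass: snap x onto the integer grid, then map it back to float."""
    q = round(x / scale) + zero_point
    q = max(qmin, min(qmax, q))
    return (q - zero_point) * scale

def ste_grad(x, scale, zero_point, qmin=0, qmax=255):
    """Backward pass under the straight-through estimator: 1 in range, 0 if clipped."""
    q = round(x / scale) + zero_point
    return 1.0 if qmin <= q <= qmax else 0.0

scale, zero_point = 0.1, 0
in_range = fake_quantize(0.26, scale, zero_point)   # snaps to the nearest grid point
clipped = fake_quantize(100.0, scale, zero_point)   # saturates at qmax * scale
print(f"0.26 -> {in_range:.2f}, 100.0 -> {clipped:.2f}")
print(f"gradients: {ste_grad(0.26, scale, zero_point)}, {ste_grad(100.0, scale, zero_point)}")
```

The output stays in floating point, so the rest of the network trains as usual while still experiencing the rounding and clipping it will face after conversion.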

5.2 Implementation Example

import torch
from torch.quantization import QuantStub, DeQuantStub

class QuantAwareModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.quant = QuantStub()      # marks where tensors become quantized
        self.conv = torch.nn.Conv2d(3, 64, 3)
        self.relu = torch.nn.ReLU()
        self.dequant = DeQuantStub()  # marks where tensors return to float
        
    def forward(self, x):
        x = self.quant(x)
        x = self.conv(x)
        x = self.relu(x)
        x = self.dequant(x)
        return x

# Prepare for quantization-aware training
model = QuantAwareModel()
model.train()  # prepare_qat expects a model in training mode
model.qconfig = torch.quantization.get_default_qat_qconfig('fbgemm')
model_prepared = torch.quantization.prepare_qat(model)

# Training loop (train_loader is a placeholder DataLoader; its targets must
# match the output shape of this toy model)
optimizer = torch.optim.SGD(model_prepared.parameters(), lr=0.001)
criterion = torch.nn.CrossEntropyLoss()
num_epochs = 10

for epoch in range(num_epochs):
    for data, target in train_loader:
        optimizer.zero_grad()
        output = model_prepared(data)
        loss = criterion(output, target)
        loss.backward()
        optimizer.step()
    
    # Report the loss of the last batch in each epoch
    print(f"Epoch {epoch+1}/{num_epochs}, Loss: {loss.item():.4f}")

# Convert to a quantized model
model_prepared.eval()
quantized_model = torch.quantization.convert(model_prepared)

# Save the quantized model
torch.jit.save(torch.jit.script(quantized_model), "qat_model.pt")

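5.3 Fine-Tuning Strategy

For a pretrained model, the following fine-tuning strategy can be used:

import torch
import torchvision

# Load a pretrained model. The quantization-ready ResNet-18 from
# torchvision.models.quantization is used here because it already contains the
# QuantStub/DeQuantStub modules that eager-mode quantization needs.
pretrained_model = torchvision.models.quantization.resnet18(
    weights=torchvision.models.ResNet18_Weights.DEFAULT, quantize=False
)
pretrained_model.train()
pretrained_model.fuse_model(is_qat=True)  # fuse Conv+BN(+ReLU) before QAT

# Prepare for quantization-aware training
pretrained_model.qconfig = torch.quantization.get_default_qat_qconfig('fbgemm')
model_prepared = torch.quantization.prepare_qat(pretrained_model)

# Freeze the parameters of the early layers
for name, param in model_prepared.named_parameters():
    if "layer1" in name or "layer2" in name:
        param.requires_grad = False

# Fine-tune the remaining parameters with a small learning rate
optimizer = torch.optim.SGD(filter(lambda p: p.requires_grad, model_prepared.parameters()), lr=0.0001)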

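6. Mixed-Precision Quantization

6.1 Choosing a Strategy

Choose each layer's precision according to its sensitivity to quantization
Keep critical layers at high precision
Use low precision for non-critical layers
Can be combined with other techniques such as model pruning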
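One simple way to turn the strategy above into a decision rule is a threshold on per-layer sensitivity. The sketch below is only an illustration: the layer names, the accuracy drops, and the 1% threshold are all invented, and `assign_precision` is a hypothetical helper.

```python
def assign_precision(sensitivity, threshold):
    """Keep a layer in FP32 when quantizing it costs more accuracy than the threshold."""
    return {
        name: ("fp32" if drop > threshold else "int8")
        for name, drop in sensitivity.items()
    }

# Hypothetical accuracy drops measured by quantizing one layer at a time
sensitivity = {"conv1": 0.021, "layer1.conv": 0.002, "fc": 0.015}
plan = assign_precision(sensitivity, threshold=0.01)
for name, precision in plan.items():
    print(f"{name}: {precision}")
```

Lowering the threshold keeps more layers in FP32 and trades speed for accuracy; raising it does the opposite.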

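6.2 Implementation

import torch
from torch.quantization import QuantStub, DeQuantStub

class MixedPrecisionModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.quant = QuantStub()      # FP32 -> INT8 at the start of the INT8 region
        # Layer quantized to 8 bits
        self.conv1 = torch.nn.Conv2d(3, 64, 3)
        self.relu = torch.nn.ReLU()
        self.dequant = DeQuantStub()  # INT8 -> FP32 before the FP32 region
        # Layer kept in FP32
        self.conv2 = torch.nn.Conv2d(64, 64, 3)
        
    def forward(self, x):
        # The first layer runs in INT8
        x = self.quant(x)
        x = self.conv1(x)
        x = self.relu(x)
        x = self.dequant(x)
        # The second layer runs in FP32
        x = self.conv2(x)
        return x

# Configure a quantization strategy per layer
def configure_mixed_precision(model):
    qconfig = torch.quantization.get_default_qconfig('fbgemm')
    for module in (model.quant, model.conv1, model.dequant):
        module.qconfig = qconfig
    # conv2 gets no qconfig, so it stays in FP32
    return model

# Usage example
model = MixedPrecisionModel()
model.eval()
model = configure_mixed_precision(model)

# Prepare for quantization
model_prepared = torch.quantization.prepare(model)

# Calibrate (calibration_data is a placeholder iterable of input tensors)
with torch.no_grad():
    for data in calibration_data:
        model_prepared(data)

# Convert to a quantized model
quantized_model = torch.quantization.convert(model_prepared)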

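6.3 Sensitivity Analysis

import copy
import torch

def analyze_layer_sensitivity(model, test_loader, calibration_data):
    # evaluate_model(model, loader) -> accuracy is a helper you must supply
    model.eval()
    original_accuracy = evaluate_model(model, test_loader)
    sensitivity_dict = {}
    qconfig = torch.quantization.get_default_qconfig('fbgemm')
    
    # Quantize one layer at a time and measure the accuracy change
    for name, module in model.named_modules():
        if name and isinstance(module, (torch.nn.Conv2d, torch.nn.Linear)):
            temp_model = copy.deepcopy(model)
            # Wrap the layer so its input is quantized and its output dequantized
            wrapped = torch.quantization.QuantWrapper(temp_model.get_submodule(name))
            wrapped.qconfig = qconfig
            parent_name, _, child_name = name.rpartition('.')
            parent = temp_model.get_submodule(parent_name) if parent_name else temp_model
            setattr(parent, child_name, wrapped)
            
            # Prepare, calibrate, and convert
            prepared = torch.quantization.prepare(temp_model)
            with torch.no_grad():
                for data in calibration_data:
                    prepared(data)
            quantized = torch.quantization.convert(prepared)
            
            # A larger accuracy drop means a more sensitive layer
            quantized_accuracy = evaluate_model(quantized, test_loader)
            sensitivity_dict[name] = original_accuracy - quantized_accuracy
    
    return sensitivity_dict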

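7. Recent Quantization Techniques

7.1 Advances in Post-Training Quantization (PTQ)

AdaRound: adaptive rounding that learns whether to round each weight up or down, reducing quantization error
BRECQ: reconstruction-based quantization that optimizes quantization parameters block by block
ZeroQ: quantization that needs no real calibration data (it synthesizes inputs from batch-normalization statistics)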

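# AdaRound sketch (forward pass only)
import torch

def adaround_quantize(weight, n_bits=8, round_mode='learned'):
    # Quantization parameters
    qmin, qmax = 0, 2**n_bits - 1
    scale = torch.clamp(weight.max() - weight.min(), min=1e-8) / (qmax - qmin)
    zero_point = torch.round(qmin - weight.min() / scale)
    
    # Map the weights onto the integer grid, before rounding
    w_scaled = weight / scale + zero_point
    w_floor = torch.floor(w_scaled)
    
    if round_mode == 'learned':
        # A rectified sigmoid h(alpha) in [0, 1] decides how far each weight rounds up
        zeta, gamma = 1.1, -0.1
        rest = (w_scaled - w_floor).detach()  # fractional part in [0, 1)
        # Initialize alpha so that h(alpha) equals the fractional part
        alpha = torch.nn.Parameter(-torch.log((zeta - gamma) / (rest - gamma) - 1))
        h = torch.clamp(torch.sigmoid(alpha) * (zeta - gamma) + gamma, 0, 1)
        # A full implementation optimizes alpha against a reconstruction loss with a
        # regularizer that pushes h to 0 or 1; that loop is omitted here
        w_q = w_floor + h
    else:
        # Conventional round-to-nearest
        w_q = torch.round(w_scaled)
    
    # Clamp to the quantization range
    w_q = torch.clamp(w_q, qmin, qmax)
    
    # Dequantize
    w_dq = (w_q - zero_point) * scale
    return w_dq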

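7.2 Low-Bit Quantization

Binarized neural networks (BNN): weights and activations are represented with 1 bit
Ternary weight networks (TWN): weights take one of three values, {-1, 0, 1}
INT4/INT2 quantization: ultra-low-bit quantization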
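For the ternary case, a minimal plain-Python sketch follows. It assumes the threshold heuristic commonly associated with TWN, namely 0.7 times the mean absolute weight, with a single scaling factor equal to the mean magnitude of the weights that survive the threshold. The function name and the sample weights are made up.

```python
def ternarize(weights):
    """Map each weight to {-1, 0, +1} and return a shared scaling factor."""
    delta = 0.7 * sum(abs(w) for w in weights) / len(weights)  # threshold heuristic
    codes = [1 if w > delta else -1 if w < -delta else 0 for w in weights]
    kept = [abs(w) for w, c in zip(weights, codes) if c != 0]
    alpha = sum(kept) / len(kept) if kept else 0.0  # scale for the nonzero codes
    return codes, alpha, delta

weights = [0.9, -0.05, 0.4, -0.8, 0.1]  # illustrative values
codes, alpha, delta = ternarize(weights)
approx = [alpha * c for c in codes]
print(f"delta={delta:.3f}, alpha={alpha:.3f}, codes={codes}")
```

Each weight is then approximated by `alpha * code`, so a layer needs only 2 bits per weight plus one floating-point scale.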

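# Binarized network example
import torch

class SignSTE(torch.autograd.Function):
    # Forward: map values to {-1, +1} (zero maps to +1).
    # Backward: straight-through estimator, so the layer stays trainable
    # even though the sign function has zero gradient almost everywhere.
    @staticmethod
    def forward(ctx, t):
        return torch.where(t >= 0, torch.ones_like(t), -torch.ones_like(t))

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output

class BinaryConv2d(torch.nn.Module):
    def __init__(self, in_channels, out_channels, kernel_size, stride=1, padding=0):
        super().__init__()
        self.conv = torch.nn.Conv2d(in_channels, out_channels, kernel_size, stride, padding)
        
    def forward(self, x):
        # Binarize the weights
        binary_weight = SignSTE.apply(self.conv.weight)
        # Binarize the input
        binary_input = SignSTE.apply(x)
        # Convolve the binarized input with the binarized weights
        return torch.nn.functional.conv2d(
            binary_input, binary_weight, self.conv.bias,
            self.conv.stride, self.conv.padding
        )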

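8. Common Problems and Solutions

8.1 Accuracy Degradation

Problem	Cause	Solution
Accuracy drops sharply after quantization	Poorly chosen quantization range	Adjust the quantization range; use more representative calibration data
Specific layers quantize poorly	Unevenly distributed weights	Use higher precision for sensitive layers or keep them in floating point
Small values are quantized to zero	Quantization step is too large	Adjust the quantization parameters; consider asymmetric quantization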
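The last row of the table can be reproduced with a small plain-Python experiment. It assumes non-negative activations and an unsigned 8-bit range, and it uses clipping at the 99th percentile as one possible way to adjust the range; the data and helper names are invented for illustration.

```python
def step_for_range(upper, num_bits=8):
    """Quantization step when [0, upper] is mapped onto an unsigned integer range."""
    return upper / (2 ** num_bits - 1)

def quantize_value(x, step, num_bits=8):
    return max(0, min(2 ** num_bits - 1, round(x / step)))

# 100 small activations plus one large outlier (made-up data)
activations = [0.01 * i for i in range(100)] + [50.0]

# Option 1: cover the full observed range, outlier included
full_step = step_for_range(max(activations))

# Option 2: clip the range at the 99th percentile so the outlier saturates
ordered = sorted(activations)
clip_value = ordered[(len(ordered) - 1) * 99 // 100]
clipped_step = step_for_range(clip_value)

small = 0.05
print(f"full range:    step={full_step:.4f}, q(0.05)={quantize_value(small, full_step)}")
print(f"clipped range: step={clipped_step:.4f}, q(0.05)={quantize_value(small, clipped_step)}")
```

With the full range, the single outlier inflates the step so much that 0.05 rounds to zero; with the clipped range the same value keeps a nonzero code, while the outlier simply saturates at the maximum.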

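8.2 Deployment Issues

Problem	Solution
Quantized model compatibility	Confirm which quantization formats the target hardware supports
Inference is slower than expected	Check that every operation is actually quantized; avoid frequent conversions between data types
Reducing memory usage	Combine quantization with model pruning and knowledge distillation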

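8.3 Debugging Tips

import torch

# Compare per-layer outputs before and after quantization
def compare_layer_outputs(original_model, quantized_model, test_input):
    original_outputs = {}
    quantized_outputs = {}
    handles = []
    
    def make_hook(store, name):
        def hook(module, inputs, output):
            if isinstance(output, torch.Tensor):
                # Quantized tensors must be dequantized before comparison
                store[name] = output.dequantize() if output.is_quantized else output.detach()
        return hook
    
    # Hook Conv2d/Linear layers in the original model
    layer_names = []
    for name, module in original_model.named_modules():
        if isinstance(module, (torch.nn.Conv2d, torch.nn.Linear)):
            layer_names.append(name)
            handles.append(module.register_forward_hook(make_hook(original_outputs, name)))
    
    # Quantized layers are different classes, so match them by name instead of by type
    quantized_modules = dict(quantized_model.named_modules())
    for name in layer_names:
        if name in quantized_modules:
            handles.append(quantized_modules[name].register_forward_hook(make_hook(quantized_outputs, name)))
    
    # Run the forward passes, then remove the hooks
    with torch.no_grad():
        original_model(test_input)
        quantized_model(test_input)
    for handle in handles:
        handle.remove()
    
    # Mean absolute error per layer
    for name in original_outputs:
        if name in quantized_outputs:
            error = torch.abs(original_outputs[name] - quantized_outputs[name]).mean().item()
            print(f"Layer {name} mean error: {error:.6f}")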