MQBench Source Code Study

Ashley_huo

Last modified: 2024-08-10 20:20:21

Tags: deep learning, artificial intelligence

First published: 2024-08-10 17:02:07

Article link: https://blog.csdn.net/Ashley_huo/article/details/141093114

MQBench overview

MQBench is an open-source quantization toolkit built on PyTorch FX (torch.fx).
Documentation: https://mqbench.readthedocs.io/en/latest/
Tutorial video: https://www.bilibili.com/video/BV1G44y1g7i9/?spm_id_from=333.337.search-card.all.click
The Tudui model used throughout this article is defined in:
https://www.bilibili.com/video/BV1hE411t7RN?p=22&vd_source=6e7f46ba13babff2c1bbdefb6cd2047f

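Quantization workflow

1. Load the model together with its pretrained weights
2. Set the backend or the quantization parameters, trace the model, and build the quantized model
3. Calibrate the model
4. (Optional) Apply advanced quantization as needed
5. Evaluate the quantized model
6. Export the deployable model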

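1. Model import

There are several ways to load a model.
Approach 1:
The code from the official documentation loads a model with pretrained weights directly from torchvision.models.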

import torchvision.models as models   
model = models.__dict__["resnet18"](pretrained=True)

Approach 2:
Load an already-trained model from a third-party library.

from pytorchcv.model_provider import get_model as ptcv_get_model
model = ptcv_get_model('resnet20_cifar10', pretrained=True)

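Approach 3:
A simple model that you define yourself.

import torch
from torch import nn

class Tudui(nn.Module):
    def __init__(self, num_channels=3):
        ...
    def forward(self, x):
        ...

# After defining the model, load its parameters; you have to train it yourself first.
model = Tudui()
model.load_state_dict(torch.load('tudui_cnn.pt', map_location='cpu'))

The first three approaches work for most image classification models, but Transformer-based models and object detection models are more involved.
Approach 4: importing a ViT model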

from transformers import AutoModelForImageClassification
model = AutoModelForImageClassification.from_pretrained("./vit_model")

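With the ViT model downloaded in advance into the vit_model directory, the model loads normally, but calling prepare_by_platform fails with: torch.fx.proxy.TraceError: symbolically traced variables cannot be used as inputs to control flow.
This means that, during tracing, a symbolically traced variable was used as the condition of a control-flow statement such as if or while. Fixing it directly requires modifying the model, which is fairly involved.
However, MQBench ships a ViT example, and studying it reveals two key points:
1. Use HFTracer in place of the default tracer.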

tracer = HFTracer()
model = prepare_by_platform(model, backend, custom_tracer=tracer)

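This resolves the control-flow problem, but the code still fails with "TypeError: __bool__ should return bool, returned Tensor", which is a type mismatch.
2. Add the concrete_args field.

# Import path assumes a transformers version that provides transformers.utils.fx.
from transformers.utils.fx import HFTracer, get_concrete_args

tracer = HFTracer()
input_names = ['pixel_values']
prepare_custom_config_dict = {
    'concrete_args': get_concrete_args(model, input_names),
    # other settings are optional and omitted here
}
model = prepare_by_platform(model, backend,
                            prepare_custom_config_dict=prepare_custom_config_dict,
                            custom_tracer=tracer)

With that, the ViT model is imported successfully.
Trying the same approach on YOLO or RT-DETR models fails with: RuntimeError: Could not generate input named input for because root is not a transformers.PreTrainedModel. This problem is still unresolved.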
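
Both errors above follow from how symbolic tracing works: the tracer pushes placeholder objects through forward instead of real tensors, so an if that depends on an input has no value to branch on. The sketch below is a plain-Python toy, not torch.fx or HFTracer; Proxy, TraceError, toy_trace and the sample forward are hypothetical names made up for illustration. It shows the failure, and how pinning an argument to a concrete value (the idea behind concrete_args) takes the branch out of tracing.

```python
class TraceError(Exception):
    """Toy counterpart of the error a symbolic tracer raises."""


class Proxy:
    """Placeholder for a traced value: it has a name but carries no data."""

    def __init__(self, name):
        self.name = name

    def __bool__(self):
        # A placeholder has no truth value, so `if proxy:` cannot be resolved.
        raise TraceError(
            "symbolically traced variables cannot be used as inputs to control flow")


def forward(pixel_values, labels):
    # Branching on `labels` works with real values but not with a Proxy.
    if labels:
        return f"loss({pixel_values.name}, {labels.name})"
    return f"logits({pixel_values.name})"


def toy_trace(fn, input_names, concrete_args=None):
    """Call fn with one Proxy per traced input and fixed values for the rest."""
    kwargs = {name: Proxy(name) for name in input_names}
    kwargs.update(concrete_args or {})
    return fn(**kwargs)


# Tracing every input symbolically hits the branch and fails.
try:
    toy_trace(forward, ["pixel_values", "labels"])
    failed = False
except TraceError as exc:
    failed = True
    message = str(exc)

# Pinning `labels` to None resolves the branch before tracing reaches it.
traced = toy_trace(forward, ["pixel_values"], concrete_args={"labels": None})
print(failed, traced)
```

In the real API the pinned values come from get_concrete_args(model, input_names), which is why only pixel_values is listed as a traced input.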

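2. prepare_by_platform

In my view this is the most important step. It relies on torch.fx symbolic tracing: once the model can be traced into a static graph, all of MQBench's quantization tools can be applied to it. Tracing works for most CV models, but object detection models such as YOLO require splitting the pre- and post-processing out of the model (see the tutorial video for how to modify it).
The step is performed here:

backend = BackendType.Academic
extra_config = {
    ...
}
# trace model and add quant nodes for model on backend
model = prepare_by_platform(model, backend, extra_config)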

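2.1 Tracing the model

Key code: graph = tracer.trace(model, concrete_args)
[Figure: model code (left) and the graph produced by tracing (right)]

The model code is on the left and the traced graph on the right; the two correspond almost one to one.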
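
To make that one-to-one correspondence concrete, here is a toy recorder in plain Python. It is not torch.fx; ToyProxy, toy_trace and the sample forward are hypothetical names used only for illustration. Every operation applied to a placeholder appends one node, which is why a traced graph mirrors forward line by line.

```python
class ToyProxy:
    """Placeholder value that records every operation applied to it."""

    def __init__(self, name, graph):
        self.name = name
        self.graph = graph

    def _record(self, op, other):
        # Name the result after the op and its position in the graph.
        out = ToyProxy(f"{op}_{len(self.graph)}", self.graph)
        arg = other.name if isinstance(other, ToyProxy) else other
        self.graph.append((out.name, op, (self.name, arg)))
        return out

    def __add__(self, other):
        return self._record("add", other)

    def __mul__(self, other):
        return self._record("mul", other)


def toy_trace(fn):
    """Run fn on a placeholder input and return the recorded node list."""
    graph = [("x", "placeholder", ())]
    out = fn(ToyProxy("x", graph))
    graph.append(("output", "output", (out.name,)))
    return graph


def forward(x):
    y = x * 2
    return y + x


graph = toy_trace(forward)
for node in graph:
    print(node)
```

The recorded list has four entries (placeholder, mul, add, output), one per step of forward plus the input and output markers.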

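2.2 Building the graph_module

graph_module = GraphModule(modules, graph, name)
Result:
[Figure: the printed graph_module]

Roughly speaking, it bundles the model structure and the forward function together.

2.3 quantizer.prepare

Core code:

prepared = quantizer.prepare(graph_module, qconfig)

Here graph_module is the one shown in the figure above, and qconfig holds the quantization parameters configured earlier. A sample value: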
QConfig(activation=functools.partial(<class 'mqbench.fake_quantize.fixed.FixedFakeQuantize'>, observer=<class 'mqbench.observer.EMAMSEObserver'>, quant_min=-128, quant_max=127, dtype=torch.qint8, pot_scale=False, qscheme=torch.per_tensor_symmetric, reduce_range=False, ch_axis=-1){}, weight=functools.partial(<class 'mqbench.fake_quantize.adaround_quantizer.AdaRoundFakeQuantize'>, observer=<class 'mqbench.observer.MSEObserver'>, quant_min=-128, quant_max=127, dtype=torch.qint8, pot_scale=False, qscheme=torch.per_tensor_symmetric, reduce_range=False, ch_axis=-1){})
In effect, this step attaches the quantization operations to the model's graph; for example, the original conv1 module gains weight_fake_quant and activation_post_process.
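
What a fake-quantize node does to a single value can be sketched in a few lines. This is a plain-Python illustration of per-tensor quantize-then-dequantize using the quant_min=-128, quant_max=127 range from the QConfig above; the function name fake_quantize and the sample scale are assumptions, and MQBench's own modules operate on tensors.

```python
def fake_quantize(x, scale, zero_point=0, quant_min=-128, quant_max=127):
    """Quantize x to an integer grid, clamp it, then map it back to float."""
    q = round(x / scale) + zero_point          # nearest integer level
    q = max(quant_min, min(quant_max, q))      # clamp to the int8 range
    return (q - zero_point) * scale            # dequantize


scale = 0.5  # hypothetical scale, chosen so the arithmetic stays exact
print(fake_quantize(1.3, scale))    # 1.3 / 0.5 = 2.6 -> level 3 -> 1.5
print(fake_quantize(-0.2, scale))   # -0.4 -> level 0 -> 0.0
print(fake_quantize(100.0, scale))  # level 200 is clamped to 127 -> 63.5
```

The output stays in floating point, but it can only take values on the integer grid, which is how the quantization error becomes visible to the rest of the network.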

2.3.1 Key code

[Figure]

quantizer.prepare runs a different prepare routine depending on the backend. AcademicQuantizer is used as the example here:

class AcademicQuantizer(ModelQuantizer):
    """Academic setting mostly do not merge BN and leave the first and last layer to higher bits.
    """
    def prepare(self, model: GraphModule, qconfig):
        self._get_io_module(model)
        self._get_post_act_8bit_node_name(model)
        model = self._weight_quant(model, qconfig)
        model = self._insert_fake_quantize_for_act_quant(model, qconfig)
        return model

A. self._weight_quant

self._weight_quant(model, qconfig) adds weight quantization to modules such as conv1.

    def _weight_quant(self, model: GraphModule, qconfig):
        ...
        self._qat_swap_modules(model, self.additional_qat_module_mapping)
        return model

    def _qat_swap_modules(self, root: GraphModule, additional_qat_module_mapping: Dict[Callable, Callable]):
        all_mappings = get_combined_dict(
            get_default_qat_module_mappings(), additional_qat_module_mapping)
        # I did not fully follow all_mappings; see the screenshot below
        root = self._convert(root, all_mappings, inplace=True)
        return root

The value of all_mappings appears to be predefined; it maps Conv2d to its QAT counterpart, nn.qat.Conv2d.
[Figure: contents of all_mappings]

    def _convert(self, module, mapping=None, inplace=False, scope=''):
        ...
        for name, mod in module.named_children():
            ...
            reassign[name] = swap_module(mod, mapping, {})
        for key, value in reassign.items():
            module._modules[key] = value
        return module

def swap_module(mod, mapping, custom_module_class_mapping):
    new_mod = mod
    if hasattr(mod, 'qconfig') and mod.qconfig is not None:
        swapped = False
        ...
        if type(mod) in mapping:
            new_mod = mapping[type(mod)].from_float(mod)
            swapped = True
    return new_mod

After this runs, reassign is:
[Figure: contents of reassign]

weight_fake_quant has now been added; the module is then updated in place, which completes weight_quant.
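
The swap is a type-keyed lookup followed by from_float. The toy below reproduces that pattern in plain Python; FloatConv, QATConv, the dictionary-shaped qconfig and the simplified two-argument swap_module are hypothetical stand-ins, not the PyTorch or MQBench classes.

```python
class FloatConv:
    """Stand-in for a float layer that may carry a qconfig."""

    def __init__(self, weight, qconfig=None):
        self.weight = weight
        self.qconfig = qconfig


class QATConv:
    """Stand-in for the QAT layer: same weight plus a weight fake-quantizer."""

    def __init__(self, weight, weight_fake_quant):
        self.weight = weight
        self.weight_fake_quant = weight_fake_quant

    @classmethod
    def from_float(cls, mod):
        # Build the fake-quantizer from the factory stored in the qconfig.
        return cls(mod.weight, mod.qconfig["weight"]())


def swap_module(mod, mapping):
    """Replace mod by its mapped QAT type when it carries a qconfig."""
    if getattr(mod, "qconfig", None) is not None and type(mod) in mapping:
        return mapping[type(mod)].from_float(mod)
    return mod


mapping = {FloatConv: QATConv}
qconfig = {"weight": lambda: "AdaRoundFakeQuantize()"}
modules = {
    "conv1": FloatConv([0.1, -0.2], qconfig),
    "conv_no_qconfig": FloatConv([1.0]),  # no qconfig, so it is left untouched
}
reassign = {name: swap_module(mod, mapping) for name, mod in modules.items()}
print({name: type(mod).__name__ for name, mod in reassign.items()})
```

Only modules that both have a qconfig and appear in the mapping are replaced, which matches the two conditions in swap_module above.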

B. self._insert_fake_quantize_for_act_quant

    def _insert_fake_quantize_for_act_quant(self, model: GraphModule, qconfig):
        graph = model.graph
        nodes = list(model.graph.nodes)
        quantizer_prefix = "_post_act_fake_quantizer"
        node_to_quantize_output = self._find_act_quants(model)
        # nodes whose outputs need activation quantization: [x, max_pool2d, max_pool2d_1, view, fc1]
        ...
        for node in node_to_quantize_output:
            fake_quantizer = qconfig.activation()
            quantizer_name = node.name + quantizer_prefix
            setattr(model, quantizer_name, fake_quantizer)  # registers the fake-quantize module on the model
            # e.g. adds (x_post_act_fake_quantizer): FixedFakeQuantize(...) to the model
            logger.info("Insert act quant {}".format(quantizer_name))
        model.recompile()  # recompile so that the generated forward reflects the new graph
        model.graph.lint()
        return model
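
The effect on the node order can be sketched without torch.fx. In the toy below, the node list for the Tudui model is an assumption reconstructed from the subgraphs listed in section 4.1, and insert_act_quantizers is a hypothetical helper that only manipulates names; the real method rewires graph edges as well.

```python
QUANTIZER_SUFFIX = "_post_act_fake_quantizer"


def insert_act_quantizers(node_names, nodes_to_quantize):
    """Return the node order once a fake-quantizer follows each chosen node."""
    result = []
    for name in node_names:
        result.append(name)
        if name in nodes_to_quantize:
            result.append(name + QUANTIZER_SUFFIX)
    return result


# Assumed linear node order of the Tudui model (not taken from the source code).
nodes = ["x", "conv1", "max_pool2d", "conv2", "max_pool2d_1",
         "conv3", "max_pool2d_2", "view", "fc1", "fc2"]
to_quantize = ["x", "max_pool2d", "max_pool2d_1", "view", "fc1"]

new_nodes = insert_act_quantizers(nodes, to_quantize)
print(new_nodes)
```

Five quantizer nodes are added, one after each entry of node_to_quantize_output, so the ten original nodes become fifteen.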

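2.3.2 Result

conv1 before insertion:

 (conv1): Conv2d(3, 32, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2))

conv1 after insertion:
[Figure: conv1 with weight_fake_quant attached]

Change to fc:

(fc1): Linear(in_features=1024, out_features=64, bias=True)  # fc1 in graph_module

[Figure: fc1 after insertion]

In addition, several post_act_fake_quantizer modules were added:
[Figure: the added post_act_fake_quantizer modules]

Comparing forward before and after shows the added fake-quantization operations for weights and activations.
[Figure: forward before and after]

In short, this stage places the fake quantizers and observers at the appropriate points in the model.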

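3. Calibrating the model

Worth noting: the observer_enabled and fake_quant_enabled flags control which mode a module runs in:
1) Calibration mode (calib): the Observer collects the required statistics, while FakeQuantize does not quantize its input.
Enable the observer (enable_observer): self.observer_enabled[0] = 1 if enabled else 0
Disable fake quantization (disable_fake_quant, i.e. the same assignment with enabled=False): self.fake_quant_enabled[0] = 1 if enabled else 0
2) Quantization mode (quant): FakeQuantize runs the quantized forward pass using the quantization parameters (qparams) computed by the Observer.
[Figure]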
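
The two modes can be illustrated with a small stand-in class. This is a sketch in plain Python, not MQBench's FixedFakeQuantize: ToyFakeQuantize, its abs-max observer and the symmetric scale formula are assumptions chosen to keep the example short, and the flags are plain integers rather than one-element tensors.

```python
class ToyFakeQuantize:
    """Minimal fake-quantizer with an abs-max observer and two mode flags."""

    def __init__(self, quant_min=-128, quant_max=127):
        self.quant_min, self.quant_max = quant_min, quant_max
        self.observer_enabled = 1
        self.fake_quant_enabled = 1
        self.abs_max = 0.0
        self.scale = 1.0

    def enable_observer(self, enabled=True):
        self.observer_enabled = 1 if enabled else 0

    def enable_fake_quant(self, enabled=True):
        self.fake_quant_enabled = 1 if enabled else 0

    def forward(self, values):
        if self.observer_enabled == 1:
            # Track the largest magnitude seen and derive a symmetric scale.
            self.abs_max = max([self.abs_max] + [abs(v) for v in values])
            half_range = (self.quant_max - self.quant_min) / 2
            self.scale = max(self.abs_max / half_range, 1e-8)
        if self.fake_quant_enabled == 1:
            levels = [round(v / self.scale) for v in values]
            levels = [max(self.quant_min, min(self.quant_max, q)) for q in levels]
            values = [q * self.scale for q in levels]
        return values


fq = ToyFakeQuantize()
data = [0.3, -1.275, 0.8]

# Calibration mode: observe only, the data passes through unchanged.
fq.enable_observer(True)
fq.enable_fake_quant(False)
calib_out = fq.forward(data)

# Quantization mode: freeze the statistics and quantize with them.
fq.enable_observer(False)
fq.enable_fake_quant(True)
quant_out = fq.forward(data)
print(fq.scale, calib_out, quant_out)
```

After calibration the scale is about 0.01 (1.275 divided by 127.5), and in quantization mode every output lands on a multiple of that scale.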

Calibration code:

import torch
from tqdm import tqdm

def calibration(model, train_loader, calibrate_num=1024):
    # model.cuda()
    model.eval()
    with torch.no_grad():
        count = 0
        for (inputs, targets) in tqdm(train_loader):
            if count >= calibrate_num:
                break  # stop once calibrate_num samples have been fed through
            model(inputs)
            count += len(inputs)
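
One detail of this loop is that the check happens before each batch, so the number of samples actually used depends on the batch size. The stand-in below replays the same counting logic on a list of batch sizes; count_calibration_batches is a hypothetical helper and the batch sizes are assumed values.

```python
def count_calibration_batches(batch_sizes, calibrate_num=1024):
    """Mirror the loop above: return (forward passes, samples seen)."""
    count = 0      # samples seen so far
    forwards = 0   # forward passes performed
    for size in batch_sizes:
        if count >= calibrate_num:
            break
        forwards += 1
        count += size
    return forwards, count


print(count_calibration_batches([64] * 100))   # (16, 1024)
print(count_calibration_batches([100] * 100))  # (11, 1100)
```

With a batch size of 64 the loop performs exactly 16 forward passes; with a batch size that does not divide calibrate_num it overshoots by part of a batch.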

During calibration self.observer_enabled[0] == 1, so the quantizer updates scale and zero_point on every forward pass.

class FixedFakeQuantize(QuantizeBase):
    def forward(self, X):
        if self.observer_enabled[0] == 1:
            self.activation_post_process(X.detach())
            _scale, _zero_point = self.calculate_qparams()
            _scale, _zero_point = _scale.to(self.scale.device), _zero_point.to(self.zero_point.device)
            if self.scale.shape != _scale.shape:
                self.scale.resize_(_scale.shape)
                self.zero_point.resize_(_zero_point.shape)
            self.scale.copy_(_scale)
            self.zero_point.copy_(_zero_point)

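4. Advanced quantization

Advanced quantization is run with model = ptq_reconstruction(model, stacked_tensor, ptq_reconstruction_config). The focus here is on AdaRound.
[Figure: the AdaRound optimization formulas]

The core is the four optimization formulas; the source code shows how they are applied.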
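
For reference, the code comments in section 4.3 number these formulas 21 to 24, which appears to correspond to the AdaRound paper (Nagel et al., 2020). They are restated here from that paper rather than from the original figure, so the notation is an assumption: W is the weight, x the layer input, s the scale, V the learnable rounding variable (alpha in the code), n and p the clipping bounds (quant_min and quant_max), and zeta, gamma the stretch constants (1.1 and -0.1 in the code).

```latex
\begin{align}
\arg\min_{\mathbf{V}} \;& \left\lVert \mathbf{W}\mathbf{x} - \widetilde{\mathbf{W}}\mathbf{x} \right\rVert_F^2 + \lambda\, f_{\mathrm{reg}}(\mathbf{V}) \tag{21} \\
\widetilde{\mathbf{W}} &= s \cdot \mathrm{clip}\!\left( \left\lfloor \frac{\mathbf{W}}{s} \right\rfloor + h(\mathbf{V}),\; n,\; p \right) \tag{22} \\
h(\mathbf{V}_{i,j}) &= \mathrm{clip}\!\left( \sigma(\mathbf{V}_{i,j})(\zeta - \gamma) + \gamma,\; 0,\; 1 \right) \tag{23} \\
f_{\mathrm{reg}}(\mathbf{V}) &= \sum_{i,j} \left( 1 - \left| 2h(\mathbf{V}_{i,j}) - 1 \right|^{\beta} \right) \tag{24}
\end{align}
```

Formula 21 is the total loss (reconstruction error plus rounding regularizer), 22 is the soft-quantized weight, 23 is the rectified sigmoid, and 24 pushes every h toward 0 or 1.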

4.1 Splitting into subgraphs

[Figure]

The original model is split into 5 subgraphs:
[x_post_act_fake_quantizer, conv1, max_pool2d, max_pool2d_post_act_fake_quantizer]
[max_pool2d_post_act_fake_quantizer, conv2, max_pool2d_1, max_pool2d_1_post_act_fake_quantizer]
[max_pool2d_1_post_act_fake_quantizer, conv3, max_pool2d_2, view, view_post_act_fake_quantizer]
[view_post_act_fake_quantizer, fc1, fc1_post_act_fake_quantizer]
[fc1_post_act_fake_quantizer, fc2]
The splitting logic lives in def extract_layer(node, fp32_modules). Roughly, the walk stops when the current node's output feeds into a Conv2d or Linear, so the second operation in each subgraph above is a conv or fc layer.

_ADAROUND_SUPPORT_TYPE = (torch.nn.Conv2d, torch.nn.Linear)
def extract_layer(node, fp32_modules):
    cur_node = node  # start the walk from the given node
    while True:
        for user in cur_node.users:
            ...
            if user.op == 'call_module' and isinstance(
                    fp32_modules[user], _ADAROUND_SUPPORT_TYPE):
                stop = True
            ...
            if user.op == 'output':
                is_next_block, stop = True, True
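
The stopping rule can be replayed on the node names alone. The sketch below is a simplification that assumes a purely linear graph and decides "is a conv/fc layer" from the name; split_into_blocks and is_layer are hypothetical helpers, whereas the real extract_layer walks node.users and checks module types.

```python
def is_layer(name):
    # Stand-in for isinstance(module, _ADAROUND_SUPPORT_TYPE), based on names only.
    return name.startswith(("conv", "fc")) and not name.endswith("quantizer")


def split_into_blocks(node_names):
    """Cut a linear node sequence so that each block holds one conv/fc layer.

    A block ends just before the next supported layer, and the last node of
    one block is repeated as the first node of the next.
    """
    blocks, current = [], [node_names[0]]
    for name in node_names[1:]:
        if is_layer(name) and any(is_layer(n) for n in current):
            blocks.append(current)
            current = [current[-1]]
        current.append(name)
    blocks.append(current)
    return blocks


nodes = [
    "x_post_act_fake_quantizer", "conv1", "max_pool2d",
    "max_pool2d_post_act_fake_quantizer", "conv2", "max_pool2d_1",
    "max_pool2d_1_post_act_fake_quantizer", "conv3", "max_pool2d_2", "view",
    "view_post_act_fake_quantizer", "fc1", "fc1_post_act_fake_quantizer", "fc2",
]
blocks = split_into_blocks(nodes)
for block in blocks:
    print(block)
```

On this node order the helper yields the same five lists as above, each with its conv or fc layer in second position.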

4.2 Recording each subgraph's inputs and outputs before and after quantization

A DataSaverHook is defined to save the inp and oup of the current module.

def save_inp_oup_data(model: GraphModule, inp_module: Module, oup_module: Module, cali_data: list, store_inp=True, store_oup=True, keep_gpu: bool = True):
    cached = ([], [])
    with torch.no_grad():
        for batch in cali_data:
            if store_inp:
                if keep_gpu:
                    cached[0].append([tensor_detach(inp) for inp in inp_saver.input_store])
                else:
                    cached[0].append([to_device(tensor_detach(inp), 'cpu') for inp in inp_saver.input_store])  # tuple/list one
            if store_oup:
                ...
    return cached

_, fp32_inps = save_inp_oup_data(fp32_model, None, fp32_inp_module, cali_data, store_inp=False, store_oup=(config.prob < 1.0), keep_gpu=config.keep_gpu)
...
cached_inps = (quant_all_inps, fp32_all_inps) if config.prob < 1.0 else quant_all_inps
cached_oups = fp32_final_oups

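4.3 Optimizing inside each subgraph using cached_inps and cached_oups

            subgraph = extract_subgraph(quant_modules_by_name, layer_node_list,
                                        layer_node_list[-1], g2node)
            subgraph_reconstruction(subgraph, cached_inps, cached_oups, config)

This is the key code: subgraph is the subgraph formed in 4.1, cached_inps and cached_oups are its inputs and outputs, and subgraph_reconstruction reconstructs the subgraph from them.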

def subgraph_reconstruction(subgraph, cached_inps, cached_oups, config):
    ...
    w_para, a_para = [], []  # parameters for weights (w) and activations (a)
    w_opt, w_scheduler = None, None  # optimizer and scheduler for w
    ...
    for name, layer in subgraph.named_modules():
        if isinstance(layer, _ADAROUND_SUPPORT_TYPE):  # initialise the weight quantizer and collect its alpha into w_para
            weight_quantizer = layer.weight_fake_quant
            weight_quantizer.init(layer.weight.data, config.round_mode)
            w_para += [weight_quantizer.alpha]
        ...
    w_opt = torch.optim.Adam(w_para)  # optimizer for the weight rounding parameters
    loss_func = LossFunction(subgraph=subgraph, weight=config.weight, max_count=config.max_count, b_range=config.b_range, warm_up=config.warm_up)  # loss definition: formulas 21-24 in the paper, i.e. the optimization objective

In other words, backpropagating through loss_func optimizes the w_para and a_para parameters. The code above configures the optimizer, the scheduler, and the loss function. The loss function below matches the formulas in the paper.

def lp_loss(pred, tgt, p=2.0):
    return (pred - tgt).abs().pow(p).sum(1).mean()
rec_loss = lp_loss(pred, tgt, p=self.p)
...
        for layer in self.subgraph.modules():
            if isinstance(layer, _ADAROUND_SUPPORT_TYPE):
                round_vals = layer.weight_fake_quant.rectified_sigmoid()  # formula 23 applied to v_ij
                round_loss += self.weight * (1 - ((round_vals - .5).abs() * 2).pow(b)).sum()  # formula 24: f_reg(v)

        total_loss = rec_loss + round_loss  # formula 21

    self.gamma, self.zeta = -0.1, 1.1
    ...
    def rectified_sigmoid(self):
        return ((self.zeta - self.gamma) * torch.sigmoid(self.alpha) + self.gamma).clamp(0, 1)

    def get_hard_value(self, X):
        X = adaround_forward(X, self.scale.data, self.zero_point.data.long(), self.quant_min,
                             self.quant_max, self.ch_axis, self.alpha, self.zeta, self.gamma, hard_value=True)
        return X

# formula 22
def adaround_forward(x, scale, zero_point, quant_min, quant_max, ch_axis, alpha, zeta, gamma, hard_value=False):
    ...
    x += zero_point
    x = torch.clamp(x, quant_min, quant_max)  # quantize (formula 22)
    x = (x - zero_point) * scale  # dequantize
    return x
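
The pieces above can be tried on single numbers. The sketch below is plain Python with scalar weights and zero_point = 0; the elided part of adaround_forward is not reproduced from the source, so the floor-plus-offset step and the alpha >= 0 rule for the hard value are assumptions taken from the AdaRound formulation, and adaround_value is a hypothetical name.

```python
import math

GAMMA, ZETA = -0.1, 1.1  # stretch constants, as in the snippet above


def rectified_sigmoid(alpha):
    """Formula 23: stretched sigmoid clipped to [0, 1]."""
    h = (ZETA - GAMMA) / (1.0 + math.exp(-alpha)) + GAMMA
    return min(1.0, max(0.0, h))


def round_loss(alphas, beta):
    """Formula 24: zero only when every h(alpha) is exactly 0 or 1."""
    return sum(1.0 - abs(2.0 * rectified_sigmoid(a) - 1.0) ** beta for a in alphas)


def adaround_value(w, alpha, scale, quant_min=-128, quant_max=127, hard=False):
    """Formula 22 with zero_point = 0: floor, add the offset, clamp, dequantize."""
    if hard:
        offset = 1.0 if alpha >= 0 else 0.0   # final decision: round up or down
    else:
        offset = rectified_sigmoid(alpha)     # soft value used during training
    q = math.floor(w / scale) + offset
    q = min(quant_max, max(quant_min, q))
    return q * scale


scale = 0.5  # hypothetical scale
print(rectified_sigmoid(0.0))                        # approximately 0.5
print(round_loss([0.0], beta=2.0))                   # largest penalty for one weight
print(round_loss([10.0, -10.0], beta=2.0))           # 0.0, both values saturated
print(adaround_value(1.3, 10.0, scale, hard=True))   # 1.5, rounded up
print(adaround_value(1.3, -10.0, scale, hard=True))  # 1.0, rounded down
```

The same weight 1.3 ends up at 1.5 or 1.0 depending only on the sign of alpha, which is exactly the choice the optimization learns per weight.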

The loop runs for 20000 iterations:

    for i in range(config.max_count):  # 20000 iterations
        idx = np.random.randint(0, sz)  # random batch index in [0, sz); here sz = 16, the length of the cached_oups list
        cur_args = []
        for a in range(num_args):  # loop over the subgraph's input arguments
                ...
                cur_inp = to_device(cached_inps[a][idx], device)
            cur_args.append(cur_inp)
        if a_opt:
            a_opt.zero_grad()  # clear the gradients at the start of every iteration
        w_opt.zero_grad()
        out_quant = subgraph(*cur_args)  # forward pass through the quantized subgraph
        err = loss_func(out_quant, cur_out)  # compute the loss
        err.backward()  # backward pass to compute gradients
        w_opt.step()  # step the optimizers and schedulers
        if a_opt:
            a_opt.step()
        if w_scheduler:
            w_scheduler.step()
        if a_scheduler:
            a_scheduler.step()

After the iterations finish, layer.weight.data is updated:

    if isinstance(layer, _ADAROUND_SUPPORT_TYPE):
        weight_quantizer = layer.weight_fake_quant
        layer.weight.data = weight_quantizer.get_hard_value(layer.weight.data)  # replace the weights with their fake-quantized values
        weight_quantizer.adaround = False

5. Evaluating the quantized model

test_loader = getTestData('cifar10',batch_size=64,path='../data/')
model.eval()
print('\n****** Quant model test ******')
test(model, test_loader)

6. Exporting the deployable model

input_shape={'data': [1, 3, 32, 32]}
convert_deploy(model, backend, input_shape)
