Quantization Summary 1: Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference

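This post examines a linear quantization scheme that replaces floating-point arithmetic with integer arithmetic to make neural network inference more efficient. The core idea is to turn matrix multiplication into integer operations plus bit shifts, which lowers the computational cost. During quantization-aware training, weights and activations are handled differently: weights use their min/max values, while activation ranges are tracked with an EMA. The quantization flow is shown both without and with BN layers, and in certain cases unnecessary quantize/dequantize steps can be skipped for efficiency. The post also discusses quantizing only the weights and not the activations, and how BN folding is handled in the forward pass.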


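I first review the paper's ideas and then discuss refinements. I reproduced the method, but I did not convert the convolution multiplications to integer multiplications, because my server could not run them: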

1. This method is a linear quantization, as in the equation below, where q is the quantized value of the fp32 value r:
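For reference, the paper writes this scheme as its equation (1), where the scale $S$ is a positive real number and the zero-point $Z$ is an integer of the same type as $q$:

$$ r = S\,(q - Z) \tag{1} $$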

 

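The range minimum is mapped onto the integer grid as well, which yields the zero-point Z (the quantized value that represents real 0).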
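A minimal sketch of this mapping in NumPy, assuming 8-bit unsigned quantization and a range that is not degenerate (r_min < r_max). The helper names `choose_qparams`, `quantize`, and `dequantize` are my own and do not come from the paper or from my reproduction:

```python
import numpy as np

def choose_qparams(r_min, r_max, num_bits=8):
    """Return (scale S, zero-point Z) so that r is approximately S * (q - Z)."""
    qmin, qmax = 0, 2 ** num_bits - 1
    # Widen the range so that the real value 0.0 is exactly representable.
    r_min, r_max = min(r_min, 0.0), max(r_max, 0.0)
    scale = (r_max - r_min) / (qmax - qmin)
    zero_point = int(round(qmin - r_min / scale))
    return scale, zero_point

def quantize(r, scale, zero_point, num_bits=8):
    q = np.round(r / scale) + zero_point
    return np.clip(q, 0, 2 ** num_bits - 1).astype(np.uint8)

def dequantize(q, scale, zero_point):
    return scale * (q.astype(np.float32) - zero_point)

r = np.array([-1.0, 0.0, 0.25, 2.0], dtype=np.float32)
S, Z = choose_qparams(float(r.min()), float(r.max()))
q = quantize(r, S, Z)
r_hat = dequantize(q, S, Z)   # each entry is within S / 2 of r
```

Real 0 lands exactly on the zero-point, which matters for zero padding in convolutions.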

2. Integer arithmetic for matrix multiplication (I think this is the best part of the paper)

The derivation goes as follows, from equation (2) to (3) to (4) to (5):
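As I read the paper, for the product of two $N \times N$ matrices $r_3 = r_1 r_2$, with matrix $\alpha$ quantized by $(S_\alpha, Z_\alpha)$, these equations are:

$$ r_\alpha^{(i,j)} = S_\alpha\,\bigl(q_\alpha^{(i,j)} - Z_\alpha\bigr) \tag{2} $$

$$ S_3\,\bigl(q_3^{(i,k)} - Z_3\bigr) = \sum_{j=1}^{N} S_1\,\bigl(q_1^{(i,j)} - Z_1\bigr)\, S_2\,\bigl(q_2^{(j,k)} - Z_2\bigr) \tag{3} $$

$$ q_3^{(i,k)} = Z_3 + M \sum_{j=1}^{N} \bigl(q_1^{(i,j)} - Z_1\bigr)\bigl(q_2^{(j,k)} - Z_2\bigr) \tag{4} $$

$$ M := \frac{S_1 S_2}{S_3} \tag{5} $$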

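From the above, everything except M is an integer. We then turn the floating-point multiplication by M into a combination of an integer multiplication and a bit shift: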

M0 is kept in the range [0.5, 1), which should let it be represented accurately as an int32 fixed-point value.
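The decomposition in the paper is $M = 2^{-n} M_0$ with $M_0 \in [0.5, 1)$. Below is a rough pure-Python sketch of how M0 and n could be derived and applied, assuming 0 < M < 1 and a Q0.31 fixed-point encoding of M0. The function names are hypothetical, and the rounding is a simple round-half-up, which may differ from what a production kernel does:

```python
def quantize_multiplier(M):
    """Split 0 < M < 1 into (m0_int, n) with M ~= m0_int * 2**-31 * 2**-n."""
    assert 0.0 < M < 1.0
    n = 0
    while M < 0.5:                       # normalize M0 = M * 2**n into [0.5, 1)
        M *= 2.0
        n += 1
    m0_int = int(round(M * (1 << 31)))   # Q0.31 fixed-point value of M0
    if m0_int == (1 << 31):              # rounding pushed M0 up to 1.0
        m0_int //= 2
        n -= 1
    return m0_int, n

def multiply_by_M(acc, m0_int, n):
    """Integer-only approximation of round(acc * M) for an integer accumulator."""
    total_shift = 31 + n
    rounding = 1 << (total_shift - 1)
    return (acc * m0_int + rounding) >> total_shift

m0_int, n = quantize_multiplier(0.0003)
approx = multiply_by_M(10000, m0_int, n)   # integer stand-in for 10000 * 0.0003
```

Because M0 is at least 0.5, its int32 encoding always has its top bit set, so about 30 bits of relative precision are kept.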

Next, reduce the amount of arithmetic in equation (4) by rewriting it as equation (7):
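Expanding the product inside the sum of equation (4) gives, in the paper's notation:

$$ q_3^{(i,k)} = Z_3 + M \Bigl( N Z_1 Z_2 - Z_1 a_2^{(k)} - Z_2 \bar{a}_1^{(i)} + \sum_{j=1}^{N} q_1^{(i,j)} q_2^{(j,k)} \Bigr) \tag{7} $$

$$ a_2^{(k)} := \sum_{j=1}^{N} q_2^{(j,k)}, \qquad \bar{a}_1^{(i)} := \sum_{j=1}^{N} q_1^{(i,j)} \tag{8} $$

$$ \sum_{j=1}^{N} q_1^{(i,j)} q_2^{(j,k)} \tag{9} $$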

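Equation (8) takes only 2N^2 additions in total: each a_2^(k) and each a_1^(i) is a sum of N terms, and there are N of each. The remaining cost is concentrated in equation (9), the core integer matrix-multiplication accumulation, which takes 2N^3 arithmetic operations (N^3 multiplications and N^3 additions):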

Note, however, that the products of uint8 values here have to be accumulated in int32, i.e.:
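In the paper this is written as int32 += uint8 * uint8. The NumPy sketch below checks, on random data, that the expanded form of equation (7) gives the same integer result as subtracting the zero-points first as in equation (4). The zero-points and the matrix size are arbitrary illustrative values:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 4
q1 = rng.integers(0, 256, size=(N, N), dtype=np.uint8)   # quantized left matrix
q2 = rng.integers(0, 256, size=(N, N), dtype=np.uint8)   # quantized right matrix
Z1, Z2 = 12, 130                                          # illustrative zero-points

# Eq. (4) inner sum: subtract the zero-points first, then multiply.
direct = (q1.astype(np.int32) - Z1) @ (q2.astype(np.int32) - Z2)

# Eq. (7): uint8 x uint8 products accumulated in int32, plus O(N^2) corrections.
core = q1.astype(np.int32) @ q2.astype(np.int32)          # eq. (9)
a1_bar = q1.astype(np.int32).sum(axis=1, keepdims=True)   # row sums of q1, eq. (8)
a2 = q2.astype(np.int32).sum(axis=0, keepdims=True)       # column sums of q2, eq. (8)
expanded = N * Z1 * Z2 - Z1 * a2 - Z2 * a1_bar + core
```

The point of the rewrite is that `core` works directly on the stored uint8 values, while the zero-point corrections only need the row and column sums.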

Step three: represent the bias as int32, with the following quantization parameters:
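In the paper's notation, the bias reuses the scale of the accumulator and has a zero-point of 0:

$$ S_{\text{bias}} = S_1 S_2, \qquad Z_{\text{bias}} = 0 \tag{11} $$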

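Add r_bias = S_1 * S_2 * q_bias to the right-hand side of equation (3) and carry it through to the form of equation (7). Everything inside the parentheses of equation (7) is then int32. Multiplying by M can be done by first multiplying by the int32 fixed-point number M0 and then shifting. Finally the result is clamped to [0, 255] to produce the uint8 output; at that point a separate ReLU is no longer needed.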
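Putting the steps together, here is a hedged NumPy sketch of one quantized matrix-multiply layer. It is not the code from my reproduction: `quantized_linear` is a hypothetical name, the scales and zero-points are made up, and int64 arrays stand in for the int32 accumulators so that the fixed-point product does not overflow in NumPy:

```python
import numpy as np

def quantized_linear(q1, q2, q_bias, Z1, Z2, Z3, m0_int, n):
    """uint8 inputs -> integer accumulate -> add bias -> scale by M -> uint8 output."""
    acc = (q1.astype(np.int64) - Z1) @ (q2.astype(np.int64) - Z2)
    acc = acc + q_bias                      # bias uses S_bias = S1 * S2, Z_bias = 0
    shift = 31 + n
    # Fixed-point multiply by M = 2**-n * M0, with round-half-up.
    scaled = (acc * m0_int + (1 << (shift - 1))) >> shift
    return np.clip(scaled + Z3, 0, 255).astype(np.uint8)   # saturating cast

S1, S2, S3 = 0.02, 0.01, 0.05               # illustrative scales
M = S1 * S2 / S3                            # 0.004 = 2**-7 * 0.512
m0_int, n = int(round(0.512 * (1 << 31))), 7
q1 = np.array([[10, 200], [128, 128]], dtype=np.uint8)
q2 = np.array([[130, 90], [255, 0]], dtype=np.uint8)
q_bias = np.array([1000, -2000], dtype=np.int64)   # round(r_bias / (S1 * S2))
out = quantized_linear(q1, q2, q_bias, Z1=128, Z2=128, Z3=0, m0_int=m0_int, n=n)
```

With Z3 = 0 the clamp at 0 cuts off every negative real value, which is how the saturating cast can absorb a ReLU in this setup.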
3. Simulated quantization training

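Insert fake-quantization nodes wherever quantization is needed. If there is a BN layer, its parameters must be folded into the weights before quantization. Activations are quantized after the activation function, or after the shortcut addition in a ResNet.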

The fake-quantization procedure is as follows:
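For reference, the paper defines the fake-quantization function as follows, where $[a, b]$ is the quantization range, $n$ is the number of levels (256 for 8 bits), and $\lfloor \cdot \rceil$ rounds to the nearest integer:

$$ \mathrm{clamp}(r; a, b) := \min\bigl(\max(r, a),\, b\bigr) $$

$$ s(a, b, n) := \frac{b - a}{n - 1} $$

$$ q(r; a, b, n) := \left\lfloor \frac{\mathrm{clamp}(r; a, b) - a}{s(a, b, n)} \right\rceil s(a, b, n) + a \tag{12} $$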

The procedure treats weights and activations separately:

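  1. For weights, a and b are taken directly as the per-channel minimum and maximum.
  2. For activations, a and b are computed with an EMA (note that the w-fold procedure for BN is different).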
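The two range rules above can be sketched in NumPy like this. It is an illustration, not my reproduction code: the names `fake_quant`, `fake_quant_weights`, and `EmaRange` are hypothetical, the EMA decay of 0.99 is an assumed value, and each channel is assumed to have a non-zero range:

```python
import numpy as np

def fake_quant(r, a, b, n_levels=256):
    """Quantize then dequantize r onto n_levels points in [a, b]; output stays float."""
    s = (b - a) / (n_levels - 1)
    return np.round((np.clip(r, a, b) - a) / s) * s + a

def fake_quant_weights(w):
    """Per-output-channel min/max; w has shape (out_channels, ...)."""
    flat = w.reshape(w.shape[0], -1)
    shape = (-1,) + (1,) * (w.ndim - 1)
    a = flat.min(axis=1).reshape(shape)
    b = flat.max(axis=1).reshape(shape)
    return fake_quant(w, a, b)

class EmaRange:
    """Tracks the activation range [a, b] as an EMA of per-batch min/max."""
    def __init__(self, decay=0.99):
        self.decay, self.a, self.b = decay, None, None

    def update(self, x):
        lo, hi = float(x.min()), float(x.max())
        if self.a is None:                  # first batch initializes the range
            self.a, self.b = lo, hi
        else:
            self.a = self.decay * self.a + (1 - self.decay) * lo
            self.b = self.decay * self.b + (1 - self.decay) * hi
        return self.a, self.b
```

During training, each batch of activations would call `update` and then `fake_quant` with the returned range, so a single outlier batch moves the range only slightly.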

Both points will be analyzed further against the code I reproduced. Next, let's look at the fake-quantization flow in detail:

1. Fake quantization without a BN layer

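Note: the figure is drawn as a quantized graph, but it should really be read as a fake-quantization graph, because the key quantized convolution, int_conv, does not appear in it.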

2. Folding the BN layer

When there is a BN layer, it has to be folded into the weights before quantization:
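A small sketch of the folding itself, assuming the usual BN form gamma * (y - mean) / sqrt(var + eps) + beta and weights laid out as (out_channels, ...). The function name `fold_bn` and the default `eps` are my own choices for illustration:

```python
import numpy as np

def fold_bn(w, b, gamma, beta, mean, var, eps=1e-5):
    """Fold BN into the layer: BN(conv(x, w) + b) == conv(x, w_fold) + b_fold."""
    scale = gamma / np.sqrt(var + eps)                      # one factor per output channel
    w_fold = w * scale.reshape(-1, *([1] * (w.ndim - 1)))   # broadcast over the other axes
    b_fold = beta + (b - mean) * scale
    return w_fold, b_fold
```

After folding, the quantizer only ever sees `w_fold`, which is why the weight ranges are taken on the folded weights rather than on the raw ones.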

Diagram of a normal BN layer during training

Diagram of fold_bn during training

Diagram of fold_bn with fake quantization during training

From the figures above:

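  1. In normal training, the convolution result goes straight into the BN layer and then through relu6.
  2. Training with fold_bn is also easy to follow: part 1 runs the ordinary convolution; part 2 merges the statistics of that convolution output with the BN parameters and splits the result into a new convolution weight and bias; part 3 runs the folded convolution with them.
  3. Same principle as item 2, except that in part 2 the folded convolution weights are quantized first, and at the end of part 3 the activations are quantized.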

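Going further: if every layer had to quantize and dequantize, that would add a lot of unnecessary computation and complexity. If the next layer is also conv-bn-relu, the dequantization can be skipped, as in the figure below, where activation quantization is built into the pipeline so there is no need to quantize and dequantize repeatedly: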

Open questions:

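1. When the forward pass is run directly, the BN statistics of the current layer are not yet fixed at the moment the layer is computed, so how can BN be folded?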

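Answer: option 1 is to run the forward computation twice during training. The first pass computes the mean and variance of the activations; the second pass folds that mean and variance into w to obtain w_fold.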
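A rough sketch of this two-pass idea for a single fully connected layer, assuming batch statistics are used directly (no moving average) and leaving out the fake quantization of `w_fold` for brevity. The function name `train_forward_folded` is hypothetical:

```python
import numpy as np

def train_forward_folded(x, w, gamma, beta, eps=1e-5):
    # Pass 1: plain float layer, used only to get this batch's mean and variance.
    y = x @ w.T
    mean, var = y.mean(axis=0), y.var(axis=0)
    # Fold the batch statistics into the weights and the bias.
    scale = gamma / np.sqrt(var + eps)
    w_fold = w * scale[:, None]
    b_fold = beta - mean * scale
    # Pass 2: the layer that is actually trained (w_fold would be fake-quantized here).
    return x @ w_fold.T + b_fold
```

The price is one extra matrix multiplication per layer in training; inference uses fixed statistics, so it needs only a single pass.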

2. Details and issues in reproducing the code

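Answer: 1. Note that a BN layer computes the mean and variance of its own layer, so in the forward pass you must first compute this layer's output and only then compute the BN mean and variance.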

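2. Note: Google's INQ method does not need a threshold to be chosen for a saturating quantization mode. For weights, the quantization parameters come from the actual minimum and maximum of the channel or layer. For activation outputs, they come from a moving average of the per-batch minimum and maximum. Because we train first and quantize afterwards, the accuracy is already secured.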

3. What if only the weights are quantized and not the activations?

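Answer: the whitepaper points out that this is very effective for reducing the memory size of the model, as long as one does not mind the cost of still running the computation in floating point.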

  1. Conclusions given in the whitepaper:

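8-bit quantization of activations loses very little accuracy, because the dynamic range of the activations is small, for two reasons: (1) batch normalization without scaling; (2) relu6 fixes the range to [0, 6].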

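Quantizing the weights per layer causes a severe accuracy loss, but the whitepaper notes that activations should still be quantized per layer, with symmetric quantization.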
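To see why per-layer weight quantization can hurt, here is a toy NumPy comparison on synthetic weights. It is not an experiment from the whitepaper: the weights are random, their channels are deliberately given very different magnitudes, and `quant_error` is a hypothetical helper that uses plain min/max 8-bit quantization:

```python
import numpy as np

def quant_error(w, per_channel):
    """Mean absolute error of 8-bit min/max quantization of w, shape (out_channels, n)."""
    axis = 1 if per_channel else None
    a = w.min(axis=axis, keepdims=True)
    b = w.max(axis=axis, keepdims=True)
    s = (b - a) / 255.0
    w_q = np.round((w - a) / s) * s + a
    return float(np.abs(w_q - w).mean())

rng = np.random.default_rng(0)
# Hypothetical weights whose channels differ in magnitude by up to 100x.
w = rng.normal(size=(8, 64)) * np.logspace(-2, 0, 8)[:, None]
err_layer = quant_error(w, per_channel=False)
err_channel = quant_error(w, per_channel=True)
```

With one shared range, the step size is set by the largest channel, so the small-magnitude channels are left with only a few usable levels; per-channel ranges avoid that.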

Reference: Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference (CSDN blog)

GitHub: panchengl/pcldetection
