This article is still being updated and improved...
This article is based on modifications to the official insightface PyTorch code:
insightface/recognition/arcface_torch at master · deepinsight/insightface · GitHub
The goal here is to modify the code quickly and get it training; for more on quantization theory and technical details, see the reference links at the end.
At present this supports quantization-aware training, model saving, and forward inference only; the model cannot yet be exported to common formats such as ONNX or NCNN. Throughout, MobileFaceNet is used as the example for the code changes; other backbones can be modified analogously.
I. Training
1. qconfig and prepare_qat
# train.py
backbone.train()  # insert the qconfig code right after this line
# QAT
backbone.qconfig = torch.quantization.get_default_qat_qconfig('fbgemm')  # add
backbone = torch.quantization.prepare_qat(backbone, inplace=True)  # add
print("#################### Quantization Aware Training ####################\n", backbone.qconfig)  # add
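As a sanity check, the effect of these two calls can be inspected on a toy model (the stand-in network below is illustrative, not part of arcface_torch): prepare_qat swaps eligible modules for their QAT variants and attaches fake-quantize observers that simulate int8 rounding during training.

```python
import torch
import torch.nn as nn

# A stand-in for the real backbone, just to inspect what prepare_qat does.
net = nn.Sequential(nn.Conv2d(3, 8, 3, bias=False), nn.ReLU())
net.train()  # prepare_qat requires the model to be in training mode

net.qconfig = torch.quantization.get_default_qat_qconfig('fbgemm')
torch.quantization.prepare_qat(net, inplace=True)

# The Conv2d has been swapped for its QAT variant, which carries a
# weight fake-quantize module that simulates int8 rounding in training.
print(hasattr(net[0], 'weight_fake_quant'))  # True
```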
2. Fuse Modules
In .local/lib/python3.6/site-packages/torch/quantization/fuser_method_mappings.py:
DEFAULT_OP_LIST_TO_FUSER_METHOD : Dict[Tuple, Union[nn.Sequential, Callable]] = {
    (nn.Conv1d, nn.BatchNorm1d): fuse_conv_bn,
    (nn.Conv1d, nn.BatchNorm1d, nn.ReLU): fuse_conv_bn_relu,
    (nn.Conv2d, nn.BatchNorm2d): fuse_conv_bn,
    (nn.Conv2d, nn.BatchNorm2d, nn.ReLU): fuse_conv_bn_relu,
    (nn.Conv3d, nn.BatchNorm3d): fuse_conv_bn,
    (nn.Conv3d, nn.BatchNorm3d, nn.ReLU): fuse_conv_bn_relu,
    (nn.Conv1d, nn.ReLU): nni.ConvReLU1d,
    (nn.Conv2d, nn.ReLU): nni.ConvReLU2d,
    (nn.Conv3d, nn.ReLU): nni.ConvReLU3d,
    (nn.Linear, nn.BatchNorm1d): fuse_linear_bn,
    (nn.Linear, nn.ReLU): nni.LinearReLU,
    (nn.BatchNorm2d, nn.ReLU): nni.BNReLU2d,
    (nn.BatchNorm3d, nn.ReLU): nni.BNReLU3d,
}
As shown above, the operator combinations that support fusion are:
- Conv + BN
- Conv + BN + ReLU
- Conv + ReLU
- Linear + BN (not supported in training mode)
- Linear + ReLU
- BN + ReLU
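To see fusion in action, a minimal Conv + BN + ReLU stack can be fused directly (done here in eval mode; during QAT the fusion call sits inside the blocks below, before prepare_qat):

```python
import torch
import torch.nn as nn

m = nn.Sequential(
    nn.Conv2d(3, 8, 3, bias=False),
    nn.BatchNorm2d(8),
    nn.ReLU(inplace=True),
)
m.eval()  # fusing Conv+BN weights requires eval mode in this direct call

fused = torch.quantization.fuse_modules(m, [['0', '1', '2']])
# The three modules collapse into a single ConvReLU2d; the vacated slots
# are replaced with nn.Identity so module indices stay valid.
print(type(fused[0]).__name__)  # ConvReLU2d
print(type(fused[1]).__name__)  # Identity
```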
Because PReLU is not among the supported fusion combinations, MobileFaceNet's PReLU layers cannot be fused. Training still works without replacing PReLU, but experiments show that forward inference then fails. PReLU must therefore be replaced with ReLU to support the quantized forward pass.
# backbones/mobilefacenet.py
class ConvBlock(Module):
    def __init__(self, in_c, out_c, kernel=(1, 1), stride=(1, 1), padding=(0, 0), groups=1):
        super(ConvBlock, self).__init__()
        self.layers = nn.Sequential(
            Conv2d(in_c, out_c, kernel, groups=groups, stride=stride, padding=padding, bias=False),
            BatchNorm2d(num_features=out_c),
            # PReLU(num_parameters=out_c),  # PReLU can not be fused, but ReLU can
            ReLU(inplace=True)
        )
        # QAT fuse modules
        self.layers = torch.quantization.fuse_modules(self.layers, ['0', '1', '2'], inplace=True)  # add

    def forward(self, x):
        return self.layers(x)
# And
class LinearBlock(Module):
    def __init__(self, in_c, out_c, kernel=(1, 1), stride=(1, 1), padding=(0, 0), groups=1):
        super(LinearBlock, self).__init__()
        self.layers = nn.Sequential(
            Conv2d(in_c, out_c, kernel, stride, padding, groups=groups, bias=False),
            BatchNorm2d(num_features=out_c)
        )
        # QAT fuse modules
        self.layers = torch.quantization.fuse_modules(self.layers, ['0', '1'], inplace=True)  # add

    def forward(self, x):
        return self.layers(x)
The shortcut addition is replaced with nn.quantized.FloatFunctional():
class DepthWise(Module):
    def __init__(self, ...):
        ...
        self.skip_add = nn.quantized.FloatFunctional()  # add

    def forward(self, x):
        ...
        # output = short_cut + x
        output = self.skip_add.add(short_cut, x)  # replaces the bare "+"
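Since the DepthWise fields are elided above, here is a standalone sketch of the pattern using a hypothetical residual block: FloatFunctional records activation statistics for the add, so that convert can later replace it with a quantized addition, which a bare "+" cannot provide.

```python
import torch
import torch.nn as nn

class Residual(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(8, 8, 3, padding=1)
        # FloatFunctional makes the residual add observable and quantizable
        self.skip_add = nn.quantized.FloatFunctional()

    def forward(self, x):
        return self.skip_add.add(self.conv(x), x)

y = Residual()(torch.randn(1, 8, 4, 4))
print(y.shape)  # torch.Size([1, 8, 4, 4])
```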
For forward inference, quant and dequant stubs must also be added to the forward function of the MobileFaceNet class; they are not needed during training:
# backbones/mobilefacenet.py
class MobileFaceNet(Module):
    def __init__(self, ...):
        ...
        self.quant = torch.quantization.QuantStub()  # add
        self.dequant = torch.quantization.DeQuantStub()  # add

    def forward(self, x):
        # Quant
        x = self.quant(x)  # add
        with torch.cuda.amp.autocast(self.fp16):
            x = self.layers(x)
        x = self.conv_sep(x.float() if self.fp16 else x)
        x = self.features(x)
        # DeQuant
        x = self.dequant(x)  # add
        return x
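Putting the stubs, qconfig and convert steps together, the whole QAT round trip can be exercised on a small stand-in network (TinyNet below is illustrative, not part of arcface_torch):

```python
import torch
import torch.nn as nn

class TinyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.quant = torch.quantization.QuantStub()      # fp32 -> int8 boundary
        self.dequant = torch.quantization.DeQuantStub()  # int8 -> fp32 boundary
        self.conv = nn.Conv2d(3, 8, 3, padding=1, bias=False)
        self.relu = nn.ReLU()

    def forward(self, x):
        x = self.quant(x)
        x = self.relu(self.conv(x))
        return self.dequant(x)

net = TinyNet().train()
net.qconfig = torch.quantization.get_default_qat_qconfig('fbgemm')
torch.quantization.prepare_qat(net, inplace=True)

net(torch.randn(1, 3, 16, 16))  # one "training" pass so the observers see data
net.eval()
qnet = torch.quantization.convert(net.cpu())  # int8 kernels, CPU only

out = qnet(torch.randn(1, 3, 16, 16))
print(out.dtype)  # torch.float32 -- the DeQuantStub returns fp32
```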
3. Model Convert and Save
A QAT model can be saved in two formats: a quantized .pth model, about 1/4 the size of the original fp32 model, or a Caffe2-style ONNX model, which differs from a regular ONNX model and cannot be run with ordinary ONNX inference. Taking the .pth route as the example: since quantized operators do not support the CUDA backend, the GPU-trained model must first be deep-copied and moved to the CPU, converted with torch.quantization.convert, and then saved via state_dict().
# utils/utils_callbacks.py
class CallBackModelCheckpoint(object):
    def __init__(self, rank, output="./"):
        self.rank: int = rank
        self.output: str = output

    def __call__(self, global_step, backbone, partial_fc):
        if global_step > 100 and self.rank == 0:
            # QAT Save
            quantized_eval_model = copy.deepcopy(backbone)  # add
            quantized_eval_model.eval()  # add
            quantized_eval_model.to(torch.device('cpu'))  # add
            torch.quantization.convert(quantized_eval_model, inplace=True)  # add
            path_module = os.path.join(self.output, "backbone.pth")
            # torch.save(backbone.module.state_dict(), path_module)
            torch.save(quantized_eval_model.module.state_dict(), path_module)  # backbone ==> quantized_eval_model
            logging.info("Pytorch Model Saved in '{}'".format(path_module))
        if global_step > 100 and partial_fc is not None:
            partial_fc.save_params()
II. Int8 Forward Inference
# -*- coding: utf-8 -*-
import cv2
import torch
import argparse
import numpy as np

from backbones import get_model


class FaceRecognition:
    def __init__(self, args):
        self.network = args.network
        self.embedding_size = args.embedding_size
        self.net = get_model(self.network, num_features=self.embedding_size)
        # rebuild the quantized graph before loading the quantized state_dict
        self.net.qconfig = torch.quantization.get_default_qat_qconfig('fbgemm')
        self.net = torch.quantization.prepare_qat(self.net, inplace=True)
        self.net = torch.quantization.convert(self.net)
        self.net.load_state_dict(torch.load(args.pth_path))
        self.net.eval()

    def norm(self, A):
        return A / np.linalg.norm(A)

    def cosDist(self, A, B):
        return np.dot(A, B)

    def inference(self, imdata):
        imdata = cv2.imread(imdata)
        imdata = cv2.resize(imdata, (112, 112))
        imdata = cv2.cvtColor(imdata, cv2.COLOR_BGR2RGB)
        imdata = np.transpose(imdata, (2, 0, 1))
        imdata = torch.from_numpy(imdata).unsqueeze(0).float()  # add the batch dim once
        imdata.div_(255).sub_(0.5).div_(0.5)
        feat = self.norm(self.net(imdata).numpy()[0])
        return feat


if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='ArcFace PyTorch int8 inference')
    parser.add_argument('--sample', type=str, default="")
    parser.add_argument('--pth_path', type=str, default="")
    parser.add_argument('--network', type=str, default="mbf")
    parser.add_argument('--embedding_size', type=int, default=256)
    args = parser.parse_args()

    FaceRec = FaceRecognition(args)
    feat = FaceRec.inference(args.sample)
    print(feat)
III. Model Conversion
For converting to the PNNX format, see:
PNNX: PyTorch Neural Network Exchange - Zhihu
IV. Error Summary
1. Model convert error
RuntimeError: Could not run 'aten::quantize_per_channel' with arguments from the 'CUDA' backend. This could be because the operator doesn't exist for this backend, or was omitted during the selective/custom build process (if using custom build). If you are a Facebook employee using PyTorch on mobile, please visit https://fburl.com/ptmfixes for possible resolutions. 'aten::quantize_per_channel' is only available for these backends: [CPU, BackendSelect, Named, AutogradOther, AutogradCPU, AutogradCUDA, AutogradXLA, AutogradNestedTensor, UNKNOWN_TENSOR_TYPE_ID, AutogradPrivateUse1, AutogradPrivateUse2, AutogradPrivateUse3, Tracer, Autocast, Batched, VmapMode].
Cause: triggered by torch.quantization.convert(backbone). Training runs on the GPU, but quantization only supports the CPU backend; the model must be moved to the CPU before conversion and saving.
Add the following to CallBackModelCheckpoint in arcface_torch/utils/utils_callbacks.py:
quantized_eval_model = copy.deepcopy(backbone)
quantized_eval_model.eval()
quantized_eval_model.to(torch.device('cpu'))
torch.quantization.convert(quantized_eval_model, inplace=True)
2. Forward inference error
torch.nn.modules.module.ModuleAttributeError: 'MobileFaceNet' object has no attribute 'copy'
Cause: the model was saved during training without using state_dict().
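A minimal illustration of the correct pattern (a plain Linear stands in for the backbone): save the state_dict(), rebuild the architecture, then load_state_dict():

```python
import os
import tempfile

import torch
import torch.nn as nn

net = nn.Linear(4, 2)
path = os.path.join(tempfile.gettempdir(), 'backbone_demo.pth')
torch.save(net.state_dict(), path)  # save the parameters, not the module object

net2 = nn.Linear(4, 2)              # rebuild the same architecture first
net2.load_state_dict(torch.load(path))

same = torch.equal(net.weight, net2.weight)
print(same)  # True
```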
3. Unsupported operators
RuntimeError: Could not run 'aten::prelu' with arguments from the 'QuantizedCPU' backend. 'aten::prelu' is only available for these backends: [CPU, CUDA, Autograd, Profiler, Tracer, Autocast].
or:
RuntimeError: Could not run 'aten::native_batch_norm' with arguments from the 'QuantizedCPU' backend. 'aten::native_batch_norm' is only available for these backends: [CPU, CUDA, MkldnnCPU, Autograd, Profiler, Tracer].
Cause: standalone PReLU and BN operators cannot be fused, so the quantized forward pass fails. Replace PReLU with ReLU, and either remove the standalone BN or restructure it into one of the fusable combinations.
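One way to apply the PReLU fix without editing every block by hand is to swap the activations recursively before quantization (a sketch; note the learned PReLU slopes are discarded, so the model should be fine-tuned afterwards):

```python
import torch.nn as nn

def replace_prelu(module):
    """Recursively replace every nn.PReLU with nn.ReLU so fusion/quantization works."""
    for name, child in module.named_children():
        if isinstance(child, nn.PReLU):
            setattr(module, name, nn.ReLU(inplace=True))
        else:
            replace_prelu(child)

m = nn.Sequential(nn.Conv2d(3, 8, 3), nn.PReLU(8))
replace_prelu(m)
print(type(m[1]).__name__)  # ReLU
```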
References
Onnx export failed int8 model - #17 by jerryzh168 - quantization - PyTorch Forums