Implementing Quantization-Aware Training (QAT) for insightface in PyTorch

Continuously being updated and improved...

This article is based on modifications to the official insightface PyTorch code:

insightface/recognition/arcface_torch at master · deepinsight/insightface · GitHub

The goal here is to modify the code quickly and get it training. For more on quantization principles and technical details, see the reference links at the end of this article.

Currently only quantized training, model saving, and forward inference are supported; export to general formats such as ONNX or ncnn is not yet possible. Throughout, MobileFaceNet is used as the example for the code changes; other backbones follow the same pattern.

I. Training

1. qconfig and prepare_qat

# train.py

backbone.train()  # insert the qconfig code after this line
# QAT
backbone.qconfig = torch.quantization.get_default_qat_qconfig('fbgemm')    # add
backbone = torch.quantization.prepare_qat(backbone, inplace=True)          # add
print("#################### Quantization Aware Training ####################\n", backbone.qconfig)    # add
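
As a quick sanity check that the two added lines took effect: prepare_qat swaps eligible layers for their QAT counterparts and attaches fake-quantize observers. A minimal standalone sketch (a toy model, not insightface code; assumes PyTorch's eager-mode quantization API):

```python
import torch
import torch.nn as nn

# Toy stand-in for the backbone, just to show what prepare_qat does
model = nn.Sequential(nn.Conv2d(3, 8, 3), nn.ReLU())
model.train()  # prepare_qat expects a model in training mode

model.qconfig = torch.quantization.get_default_qat_qconfig('fbgemm')
model = torch.quantization.prepare_qat(model, inplace=False)

# The Conv2d has been replaced by its QAT counterpart, which carries
# a fake-quantizer for its weights; forward passes now simulate int8.
print(type(model[0]).__module__)               # a ...qat... conv module
print(hasattr(model[0], 'weight_fake_quant'))  # True
```

After this, forward passes run in float but round-trip activations and weights through simulated quantization, which is what lets the network adapt to int8 during training.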

2. Fuse Modules

In .local/lib/python3.6/site-packages/torch/quantization/fuser_method_mappings.py:

DEFAULT_OP_LIST_TO_FUSER_METHOD : Dict[Tuple, Union[nn.Sequential, Callable]] = {
    (nn.Conv1d, nn.BatchNorm1d): fuse_conv_bn,
    (nn.Conv1d, nn.BatchNorm1d, nn.ReLU): fuse_conv_bn_relu,
    (nn.Conv2d, nn.BatchNorm2d): fuse_conv_bn,
    (nn.Conv2d, nn.BatchNorm2d, nn.ReLU): fuse_conv_bn_relu,
    (nn.Conv3d, nn.BatchNorm3d): fuse_conv_bn,
    (nn.Conv3d, nn.BatchNorm3d, nn.ReLU): fuse_conv_bn_relu,
    (nn.Conv1d, nn.ReLU): nni.ConvReLU1d,
    (nn.Conv2d, nn.ReLU): nni.ConvReLU2d,
    (nn.Conv3d, nn.ReLU): nni.ConvReLU3d,
    (nn.Linear, nn.BatchNorm1d): fuse_linear_bn,
    (nn.Linear, nn.ReLU): nni.LinearReLU,
    (nn.BatchNorm2d, nn.ReLU): nni.BNReLU2d,
    (nn.BatchNorm3d, nn.ReLU): nni.BNReLU3d,
}

As shown above, the operator combinations that support fusion are:

  • Conv + BN
  • Conv + BN + ReLU
  • Conv + ReLU
  • Linear + BN (not supported in training)
  • Linear + ReLU
  • BN + ReLU
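
These combinations can be checked directly with torch.quantization.fuse_modules. A standalone sketch (eval-mode fusion, which folds the BN statistics into the conv weights):

```python
import torch
import torch.nn as nn
import torch.nn.intrinsic as nni

m = nn.Sequential(nn.Conv2d(3, 8, 3, bias=False), nn.BatchNorm2d(8), nn.ReLU())
m.eval()  # eval-mode fusion folds BN into the conv

x = torch.randn(1, 3, 16, 16)
ref = m(x)

fused = torch.quantization.fuse_modules(m, [['0', '1', '2']])

# Conv+BN+ReLU collapses into a single ConvReLU2d; the BN and ReLU
# slots become Identity, and the output is numerically unchanged.
print(type(fused[0]).__name__, type(fused[1]).__name__)
out = fused(x)
print(torch.allclose(ref, out, atol=1e-5))
```

The same `[['0', '1', '2']]` index-list convention is what the MobileFaceNet changes below use inside each block's nn.Sequential.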

Since PReLU is not among the supported fusable combinations, MobileFaceNet's PReLU cannot be folded. Training still runs normally if PReLU is kept, but experiments show that forward inference then fails. PReLU must therefore be replaced with ReLU to support the inference pass.

# backbones/mobilefacenet.py

class ConvBlock(Module):
    def __init__(self, in_c, out_c, kernel=(1, 1), stride=(1, 1), padding=(0, 0), groups=1):
        super(ConvBlock, self).__init__()
        self.layers = nn.Sequential(
            Conv2d(in_c, out_c, kernel, groups=groups, stride=stride, padding=padding, bias=False),
            BatchNorm2d(num_features=out_c),
            # PReLU(num_parameters=out_c),  # PReLU cannot be fused, but ReLU can
            ReLU(inplace=True)
        )

        # QAT fuse modules
        self.layers = torch.quantization.fuse_modules(self.layers, ['0', '1', '2'], inplace=True)    # add

    def forward(self, x):
        return self.layers(x)

# And
class LinearBlock(Module):
    def __init__(self, in_c, out_c, kernel=(1, 1), stride=(1, 1), padding=(0, 0), groups=1):
        super(LinearBlock, self).__init__()
        self.layers = nn.Sequential(
            Conv2d(in_c, out_c, kernel, stride, padding, groups=groups, bias=False),
            BatchNorm2d(num_features=out_c)
        )

        # QAT fuse modules
        self.layers = torch.quantization.fuse_modules(self.layers, ['0', '1'], inplace=True)    # add

    def forward(self, x):
        return self.layers(x)

The shortcut addition is replaced with nn.quantized.FloatFunctional(), since the bare "+" operator cannot be quantized:

class DepthWise(Module):
    def __init__(self, ...):
        ...
        self.skip_add = nn.quantized.FloatFunctional()    # add

    def forward(self, x):
        ...
        # output = short_cut + x
        output = self.skip_add.add(short_cut, x)          # add
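
On float tensors, FloatFunctional.add is numerically identical to "+"; the difference is that it is a module, so an observer can be attached to it and convert() can later swap it for a true quantized add. A quick standalone check:

```python
import torch
import torch.nn as nn

skip_add = nn.quantized.FloatFunctional()

a = torch.randn(2, 3)
b = torch.randn(2, 3)

# Identical to a + b in float mode, but expressed as a module so the
# quantization passes can observe and replace it.
out = skip_add.add(a, b)
print(torch.equal(out, a + b))  # True
```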

For quantized forward inference, quant and dequant stubs must also be added in the forward function of the MobileFaceNet class; they are not actually used during training:

# backbones/mobilefacenet.py

class MobileFaceNet(Module):
    def __init__(self, ...):
        ...
        self.quant = torch.quantization.QuantStub()     # add
        self.dequant = torch.quantization.DeQuantStub() # add
    
    def forward(self, x):
        # Quant
        x = self.quant(x)                               # add
    
        with torch.cuda.amp.autocast(self.fp16):
            x = self.layers(x)
        x = self.conv_sep(x.float() if self.fp16 else x)
        x = self.features(x)
        
        # DeQuant
        x = self.dequant(x)                             # add

        return x

3. Model Convert and Save

The QAT model can be saved in two formats. One is a quantized .pth model, roughly 1/4 the size of the original fp32 model. The other is a Caffe2-style ONNX model, which differs from regular ONNX and cannot be run with ordinary ONNX inference. Taking the .pth route: since the quantized backend does not support CUDA, the GPU-trained model must first be deep-copied and moved to the CPU, then converted with torch.quantization.convert, and finally saved via state_dict().

# utils/utils_callbacks.py

class CallBackModelCheckpoint(object):
    def __init__(self, rank, output="./"):
        self.rank: int = rank
        self.output: str = output

    def __call__(self, global_step, backbone, partial_fc, ):
        if global_step > 100 and self.rank == 0:
            # QAT Save
            quantized_eval_model = copy.deepcopy(backbone)                          # add
            quantized_eval_model.eval()                                             # add
            quantized_eval_model.to(torch.device('cpu'))                            # add
            torch.quantization.convert(quantized_eval_model, inplace=True)          # add

            path_module = os.path.join(self.output, "backbone.pth")
            # torch.save(backbone.module.state_dict(), path_module)
            torch.save(quantized_eval_model.module.state_dict(), path_module)     # backbone ==> quantized_eval_model
            logging.info("Pytorch Model Saved in '{}'".format(path_module))

        if global_step > 100 and partial_fc is not None:
            partial_fc.save_params()
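
The deepcopy → eval → cpu → convert → state_dict recipe can be exercised end-to-end on a toy model, which also roughly bears out the size reduction (int8 vs fp32 weights). A standalone sketch, assuming the fbgemm backend (x86) is available:

```python
import copy
import os
import tempfile

import torch
import torch.nn as nn

model = nn.Sequential(nn.Conv2d(16, 64, 3), nn.ReLU())
model.train()
model.qconfig = torch.quantization.get_default_qat_qconfig('fbgemm')
model = torch.quantization.prepare_qat(model, inplace=False)

model(torch.randn(1, 16, 32, 32))  # one fake "training" batch so the observers see data

# Same recipe as the checkpoint callback above
q = copy.deepcopy(model)
q.eval()
q.to(torch.device('cpu'))
torch.quantization.convert(q, inplace=True)

with tempfile.TemporaryDirectory() as d:
    fp32_path = os.path.join(d, 'fp32.pth')
    int8_path = os.path.join(d, 'int8.pth')
    torch.save(nn.Sequential(nn.Conv2d(16, 64, 3), nn.ReLU()).state_dict(), fp32_path)
    torch.save(q.state_dict(), int8_path)
    fp32_size = os.path.getsize(fp32_path)
    int8_size = os.path.getsize(int8_path)
    print(fp32_size, int8_size)  # the int8 checkpoint is several times smaller
```

The saved size is not exactly 1/4 because scales, zero points, and serialization overhead are stored alongside the int8 weights.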

II. int8 Forward Inference

# -*- coding: utf-8 -*-

import cv2
import torch
import argparse
import numpy as np
from backbones import get_model


class FaceRecognition:
    def __init__(self, args):
        self.network = args.network
        self.embedding_size = args.embedding_size
        
        self.net = get_model(self.network, num_features=self.embedding_size)
        self.net.qconfig = torch.quantization.get_default_qat_qconfig('fbgemm')
        self.net = torch.quantization.prepare_qat(self.net, inplace=True)
        self.net = torch.quantization.convert(self.net)
        self.net.load_state_dict(torch.load(args.pth_path))
        self.net.eval()
        
    def norm(self, A):
        return A/np.linalg.norm(A)
        
    def cosDist(self, A, B):
        return np.dot(A,B)
        
        
    def inference(self, imdata):
        imdata = cv2.imread(imdata)
        imdata = cv2.resize(imdata, (112, 112))
        imdata = cv2.cvtColor(imdata, cv2.COLOR_BGR2RGB)
        imdata = np.transpose(imdata, (2, 0, 1))
        imdata = torch.from_numpy(imdata).unsqueeze(0).float()
        imdata.div_(255).sub_(0.5).div_(0.5)
        feat = self.norm(self.net(imdata).numpy()[0])
        return feat
    
    
if __name__=='__main__':

    parser = argparse.ArgumentParser(description='ArcFace QAT int8 inference')
    parser.add_argument('--sample',         type=str,   default="")
    parser.add_argument('--pth_path',       type=str,   default="")
    parser.add_argument('--network',        type=str,   default="mbf")
    parser.add_argument('--embedding_size', type=int,   default=256)
    args = parser.parse_args()
    
    FaceRec = FaceRecognition(args)
    feat = FaceRec.inference(args.sample)
    print(feat)
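
Since inference() returns an L2-normalized embedding, cosDist can be a bare dot product. A small numpy sketch of that logic (made-up vectors standing in for network outputs):

```python
import numpy as np

def norm(a):
    # L2-normalize, mirroring FaceRecognition.norm
    return a / np.linalg.norm(a)

f1 = norm(np.array([1.0, 2.0, 3.0]))
f2 = norm(np.array([2.0, 4.0, 6.0]))   # same direction as f1
f3 = norm(np.array([-3.0, 0.0, 1.0]))  # orthogonal to f1

# On unit-length vectors the dot product IS the cosine similarity
print(np.dot(f1, f2))  # ~1.0
print(np.dot(f1, f3))  # ~0.0
```

This is why cosDist omits the usual division by the two norms: the normalization has already happened inside inference().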
    

III. Model Conversion

For conversion to the PNNX model format, see:

PNNX: PyTorch Neural Network Exchange - Zhihu

IV. Error Summary

1. Model convert error

RuntimeError: Could not run 'aten::quantize_per_channel' with arguments from the 'CUDA' backend. This could be because the operator doesn't exist for this backend, or was omitted during the selective/custom build process (if using custom build). If you are a Facebook employee using PyTorch on mobile, please visit https://fburl.com/ptmfixes for possible resolutions. 'aten::quantize_per_channel' is only available for these backends: [CPU, BackendSelect, Named, AutogradOther, AutogradCPU, AutogradCUDA, AutogradXLA, AutogradNestedTensor, UNKNOWN_TENSOR_TYPE_ID, AutogradPrivateUse1, AutogradPrivateUse2, AutogradPrivateUse3, Tracer, Autocast, Batched, VmapMode].

Cause: raised by torch.quantization.convert(backbone). Training runs on the GPU, but quantization only supports the CPU backend, so the model must be moved to the CPU before converting and saving.

Add the following to CallBackModelCheckpoint in arcface_torch/utils/utils_callbacks.py:

quantized_eval_model = copy.deepcopy(backbone)
quantized_eval_model.eval()
quantized_eval_model.to(torch.device('cpu'))
torch.quantization.convert(quantized_eval_model, inplace=True)

2. Forward inference error

torch.nn.modules.module.ModuleAttributeError: 'MobileFaceNet' object has no attribute 'copy'

Cause: the model was not saved with state_dict() during training.

3. Unsupported operators

RuntimeError: Could not run 'aten::prelu' with arguments from the 'QuantizedCPU' backend. 'aten::prelu' is only available for these backends: [CPU, CUDA, Autograd, Profiler, Tracer, Autocast].

Or:

RuntimeError: Could not run 'aten::native_batch_norm' with arguments from the 'QuantizedCPU' backend. 'aten::native_batch_norm' is only available for these backends: [CPU, CUDA, MkldnnCPU, Autograd, Profiler, Tracer].

Cause: prelu and standalone bn operators cannot be fused, so the forward pass fails. Change prelu to relu, and remove the standalone bn or rearrange it into a combination that supports fusion.

References

PyTorch Quantization - Zhihu

Implementing Quantization-Aware Training (QAT) in PyTorch (Part 1) - Zhihu

Notes on PyTorch Model Quantization Tools - Zhihu

Distributed Training, Mixed Precision, BN Fusion, QAT - Zhihu

NCNN Quantization Explained (Part 2) - Zhihu

Onnx export failed int8 model - #17 by jerryzh168 - quantization - PyTorch Forums
