This article is still being updated and improved...
This article is based on modifications to the official insightface PyTorch code:
insightface/recognition/arcface_torch at master · deepinsight/insightface · GitHub
The goal here is to modify the code quickly and get it training; for more on quantization theory and technical details, see the reference links at the end.
At present this supports quantization-aware training, model saving, and forward inference only; the model cannot yet be exported to common formats such as ONNX or NCNN. Throughout, MobileFaceNet is used as the example for the code changes; other backbones can be modified analogously.
I. Training
1. qconfig and prepare_qat
# train.py
backbone.train()  # insert the qconfig code right after this line
# QAT
backbone.qconfig = torch.quantization.get_default_qat_qconfig('fbgemm')  # add
backbone = torch.quantization.prepare_qat(backbone, inplace=True)  # add
print("#################### Quantization Aware Training ####################\n", backbone.qconfig)  # add
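As a sanity check, the effect of these two calls can be inspected on a toy model (the stand-in network below is illustrative, not part of arcface_torch): prepare_qat swaps eligible modules for their QAT variants and attaches fake-quantize observers that simulate int8 rounding during training.

```python
import torch
import torch.nn as nn

# A stand-in for the real backbone, just to inspect what prepare_qat does.
net = nn.Sequential(nn.Conv2d(3, 8, 3, bias=False), nn.ReLU())
net.train()  # prepare_qat requires the model to be in training mode

net.qconfig = torch.quantization.get_default_qat_qconfig('fbgemm')
torch.quantization.prepare_qat(net, inplace=True)

# The Conv2d has been swapped for its QAT variant, which carries a
# weight fake-quantize module that simulates int8 rounding in training.
print(hasattr(net[0], 'weight_fake_quant'))  # True
```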
2. Fuse Modules
In .local/lib/python3.6/site-packages/torch/quantization/fuser_method_mappings.py:
DEFAULT_OP_LIST_TO_FUSER_METHOD : Dict[Tuple, Union[nn.Sequential, Callable]] = {
    (nn.Conv1d, nn.BatchNorm1d): fuse_conv_bn,
    (nn.Conv1d, nn.BatchNorm1d, nn.ReLU): fuse_conv_bn_relu,
    (nn.Conv2d, nn.BatchNorm2d): fuse_conv_bn,
    (nn.Conv2d, nn.BatchNorm2d, nn.ReLU): fuse_conv_bn_relu,
    (nn.Conv3d, nn.BatchNorm3d): fuse_conv_bn,
    (nn.Conv3d, nn.BatchNorm3d, nn.ReLU): fuse_conv_bn_relu,
    (nn.Conv1d, nn.ReLU): nni.ConvReLU1d,
    (nn.Conv2d, nn.ReLU): nni.ConvReLU2d,
    (nn.Conv3d, nn.ReLU): nni.ConvReLU3d,
    (nn.Linear, nn.BatchNorm1d): fuse_linear_bn,
    (nn.Linear, nn.ReLU): nni.LinearReLU,
    (nn.BatchNorm2d, nn.ReLU): nni.BNReLU2d,
    (nn.BatchNorm3d, nn.ReLU): nni.BNReLU3d,
}
As shown above, the operator combinations that support fusion are:
- Conv + BN
- Conv + BN + ReLU
- Conv + ReLU
- Linear + BN (not supported in training mode)
- Linear + ReLU
- BN + ReLU
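To see fusion in action, a minimal Conv + BN + ReLU stack can be fused directly (done here in eval mode; during QAT the fusion call sits inside the blocks below, before prepare_qat):

```python
import torch
import torch.nn as nn

m = nn.Sequential(
    nn.Conv2d(3, 8, 3, bias=False),
    nn.BatchNorm2d(8),
    nn.ReLU(inplace=True),
)
m.eval()  # fusing Conv+BN weights requires eval mode in this direct call

fused = torch.quantization.fuse_modules(m, [['0', '1', '2']])
# The three modules collapse into a single ConvReLU2d; the vacated slots
# are replaced with nn.Identity so module indices stay valid.
print(type(fused[0]).__name__)  # ConvReLU2d
print(type(fused[1]).__name__)  # Identity
```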
Because PReLU is not among the supported fusion combinations, MobileFaceNet's PReLU layers cannot be fused. Training still works without replacing PReLU, but experiments show that forward inference then fails. PReLU must therefore be replaced with ReLU to support the quantized forward pass.
# backbones/mobilefacenet.py
class ConvBlock(Module):
    def __init__(self, in_c, out_c, kernel=(1, 1), stride=(1, 1), padding=(0, 0), groups=1):
        super(ConvBlock, self).__init__()
        self.layers = nn.Sequential(
            Conv2d(in_c, out_c, kernel, groups=groups, stride=stride, padding=padding, bias=False),
            BatchNorm2d(num_features=out_c),
            # PReLU(num_parameters=out_c),  # PReLU can not be fused, but ReLU can
            ReLU(inplace=True)
        )
        # QAT fuse modules
        self.layers = torch.quantization.fuse_modules(self.layers, ['0', '1', '2'], inplace=True)  # add

    def forward(self, x):
        return self.layers(x)
# And
class LinearBlock(Module):
    def __init__(self, in_c, out_c, kernel=(1, 1), stride=(1, 1), padding=(0, 0), groups=1):
        super(LinearBlock, self).__init__()
        self.layers = nn.Sequential(
            Conv2d(in_c, out_c, kernel, stride, padding, groups=groups, bias=False),
            BatchNorm2d(num_features=out_c)
        )
        # QAT fuse modules
        self.layers = torch.quantization.fuse_modules(self.layers, ['0', '1'], inplace=True)  # add

    def forward(self, x):
        return self.layers(x)
The shortcut addition is replaced with nn.quantized.FloatFunctional():
class DepthWise(Module):
    def __init__(self, ...):
        ...
        self.skip_add = nn.quantized.FloatFunctional()  # add

    def forward(self, x):
        ...
        # output = short_cut + x
        output = self.skip_add.add(short_cut, x)  # replaces the bare "+"
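Since the DepthWise fields are elided above, here is a standalone sketch of the pattern using a hypothetical residual block: FloatFunctional records activation statistics for the add, so that convert can later replace it with a quantized addition, which a bare "+" cannot provide.

```python
import torch
import torch.nn as nn

class Residual(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(8, 8, 3, padding=1)
        # FloatFunctional makes the residual add observable and quantizable
        self.skip_add = nn.quantized.FloatFunctional()

    def forward(self, x):
        return self.skip_add.add(self.conv(x), x)

y = Residual()(torch.randn(1, 8, 4, 4))
print(y.shape)  # torch.Size([1, 8, 4, 4])
```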
For forward inference, quant and dequant stubs must also be added to the forward function of the MobileFaceNet class; they are not needed during training:
# backbones/mobilefacenet.py
class MobileFaceNet(Module):
    def __init__(self, ...):
        ...
        self.quant = torch.quantization.QuantStub()  # add
        self.dequant = torch.quantization.DeQuantStub()  # add

    def forward(self, x):
        # Quant
        x = self.quant(x)  # add
        with torch.cuda.amp.autocast(self.fp16):
            x = self.layers(x)
        x = self.conv_sep(x.float() if self.fp16 else x)
        x = self.features(x)
        # DeQuant
        x = self.dequant(x)  # add
        return x
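Putting the stubs, qconfig and convert steps together, the whole QAT round trip can be exercised on a small stand-in network (TinyNet below is illustrative, not part of arcface_torch):

```python
import torch
import torch.nn as nn

class TinyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.quant = torch.quantization.QuantStub()      # fp32 -> int8 boundary
        self.dequant = torch.quantization.DeQuantStub()  # int8 -> fp32 boundary
        self.conv = nn.Conv2d(3, 8, 3, padding=1, bias=False)
        self.relu = nn.ReLU()

    def forward(self, x):
        x = self.quant(x)
        x = self.relu(self.conv(x))
        return self.dequant(x)

net = TinyNet().train()
net.qconfig = torch.quantization.get_default_qat_qconfig('fbgemm')
torch.quantization.prepare_qat(net, inplace=True)

net(torch.randn(1, 3, 16, 16))  # one "training" pass so the observers see data
net.eval()
qnet = torch.quantization.convert(net.cpu())  # int8 kernels, CPU only

out = qnet(torch.randn(1, 3, 16, 16))
print(out.dtype)  # torch.float32 -- the DeQuantStub returns fp32
```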
3. Model Convert and Save
A QAT model can be saved in two formats: a quantized .pth model, about 1/4 the size of the original fp32 model, or a Caffe2-style ONNX model, which differs from a regular ONNX model and cannot be run with ordinary ONNX inference. Taking the .pth route as the example: since quantized operators do not support the CUDA backend, the GPU-trained model must first be deep-copied and moved to the CPU, converted with torch.quantization.convert, and then saved via state_dict().
# utils/utils_callbacks.py
class CallBackModelCheckpoint(object):
    def __init__(self, rank, output="./"):
        self.rank: int = rank
        self.output: str = output

    def __call__(self, global_step, backbone, partial_fc):
        if global_step > 100 and self.rank == 0:
            # QAT Save
            quantized_eval_model = copy.deepcopy(backbone)  # add
            quantized_eval_model.eval()  # add
            quantized_eval_model.to(torch.device('cpu'))  # add
            torch.quantization.convert(quantized_eval_model, inplace=True)  # add
            path_module = os.path.join(self.output, "backbone.pth")
            # torch.save(backbone.module.state_dict(), path_module)
            torch.save(quantized_eval_model.module.state_dict(), path_module)  # backbone ==> quantized_eval_model
            logging.info("Pytorch Model Saved in '{}'".format(path_module))
        if global_step > 100 and partial_fc is not None:
            partial_fc.save_params()
II. Int8 Forward Inference
# -*- coding: utf-8 -*-
import cv2
import torch
import argparse
import numpy as np

from backbones import get_model


class FaceRecognition:
    def __init__(self, args):
        self.network = args.network
        self.embedding_size = args.embedding_size
        self.net = get_model(self.network, num_features=self.embedding_size)
        # rebuild the quantized graph before loading the quantized state_dict
        self.net.qconfig = torch.quantization.get_default_qat_qconfig('fbgemm')
        self.net = torch.quantization.prepare_qat(self.net, inplace=True)
        self.net = torch.quantization.convert(self.net)
        self.net.load_state_dict(torch.load(args.pth_path))
        self.net.eval()

    def norm(self, A):
        return A / np.linalg.norm(A)

    def cosDist(self, A, B):
        return np.dot(A, B)

    def inference(self, imdata):
        imdata = cv2.imread(imdata)
        imdata = cv2.resize(imdata, (112, 112))
        imdata = cv2.cvtColor(imdata, cv2.COLOR_BGR2RGB)
        imdata = np.transpose(imdata, (2, 0, 1))
        imdata = torch.from_numpy(imdata).unsqueeze(0).float()  # add the batch dim once
        imdata.div_(255).sub_(0.5).div_(0.5)
        feat = self.norm(self.net(imdata).numpy()[0])
        return feat


if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='ArcFace PyTorch int8 inference')
    parser.add_argument('--sample', type=str, default="")
    parser.add_argument('--pth_path', type=str, default="")
    parser.add_argument('--network', type=str, default="mbf")
    parser.add_argument('--embedding_size', type=int, default=256)
    args = parser.parse_args()

    FaceRec = FaceRecognition(args)
    feat = FaceRec.inference(args.sample)
    print(feat)
III. Model Conversion
For converting to the PNNX format, see:
PNNX: PyTorch Neural Network Exchange - Zhihu
IV. Error Summary
1. Model convert error
RuntimeError: Could not run 'aten::quantize_per_channel' with arguments from the 'CUDA' backend. This could be because the operator doesn't exist for this backend, or was omitted during the selective/custom build process (if using custom build). If you are a Facebook employee using PyTorch on mobile, please visit https://fburl.com/ptmfixes for possible resolutions. 'aten::quantize_per_channel' is only available for these backends: [CPU, BackendSelect, Named, AutogradOther, AutogradCPU, AutogradCUDA, AutogradXLA, AutogradNestedTensor, UNKNOWN_TENSOR_TYPE_ID, AutogradPrivateUse1, AutogradPrivateUse2, AutogradPrivateUse3, Tracer, Autocast, Batched, VmapMode].
Cause: triggered by torch.quantization.convert(backbone). Training runs on the GPU, but quantization only supports the CPU backend; the model must be moved to the CPU before conversion and saving.
Add the following to CallBackModelCheckpoint in arcface_torch/utils/utils_callbacks.py:
quantized_eval_model = copy.deepcopy(backbone)
quantized_eval_model.eval()
quantized_eval_model.to(torch.device('cpu'))
torch.quantization.convert(quantized_eval_model, inplace=True)
2. Forward inference error
torch.nn.modules.module.ModuleAttributeError: 'MobileFaceNet' object has no attribute 'copy'
Cause: the model was saved during training without using state_dict().
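A minimal illustration of the correct pattern (a plain Linear stands in for the backbone): save the state_dict(), rebuild the architecture, then load_state_dict():

```python
import os
import tempfile

import torch
import torch.nn as nn

net = nn.Linear(4, 2)
path = os.path.join(tempfile.gettempdir(), 'backbone_demo.pth')
torch.save(net.state_dict(), path)  # save the parameters, not the module object

net2 = nn.Linear(4, 2)              # rebuild the same architecture first
net2.load_state_dict(torch.load(path))

same = torch.equal(net.weight, net2.weight)
print(same)  # True
```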
3. Unsupported operators
RuntimeError: Could not run 'aten::prelu' with arguments from the 'QuantizedCPU' backend. 'aten::prelu' is only available for these backends: [CPU, CUDA, Autograd, Profiler, Tracer, Autocast].
or:
RuntimeError: Could not run 'aten::native_batch_norm' with arguments from the 'QuantizedCPU' backend. 'aten::native_batch_norm' is only available for these backends: [CPU, CUDA, MkldnnCPU, Autograd, Profiler, Tracer].
Cause: standalone PReLU and BN operators cannot be fused, so the quantized forward pass fails. Replace PReLU with ReLU, and either remove the standalone BN or restructure it into one of the fusable combinations.
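One way to apply the PReLU fix without editing every block by hand is to swap the activations recursively before quantization (a sketch; note the learned PReLU slopes are discarded, so the model should be fine-tuned afterwards):

```python
import torch.nn as nn

def replace_prelu(module):
    """Recursively replace every nn.PReLU with nn.ReLU so fusion/quantization works."""
    for name, child in module.named_children():
        if isinstance(child, nn.PReLU):
            setattr(module, name, nn.ReLU(inplace=True))
        else:
            replace_prelu(child)

m = nn.Sequential(nn.Conv2d(3, 8, 3), nn.PReLU(8))
replace_prelu(m)
print(type(m[1]).__name__)  # ReLU
```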
References
Onnx export failed int8 model - #17 by jerryzh168 - quantization - PyTorch Forums