How to quickly convert a trained neural network to fixed point?

Fixed-point quantization

Unlike fake quantization, fixed-point quantization does not convert values back to floating point at inference time. This requires the framework to provide fixed-point implementations of its operators. Mobile AI frameworks such as MNN and XNN have already added fixed-point support.
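To make the contrast concrete, here is a minimal NumPy sketch (all names, scales, and values are illustrative, not any framework's API): fake quantization rounds values and immediately returns to float, while fixed-point inference keeps the int8 tensors and does the arithmetic in integers throughout.

import numpy as np

def quantize(x, scale, zero_point=0):
    # float -> int8: q = clip(round(x / scale) + zero_point, -128, 127)
    q = np.round(x / scale) + zero_point
    return np.clip(q, -128, 127).astype(np.int8)

def dequantize(q, scale, zero_point=0):
    # int8 -> float
    return (q.astype(np.float32) - zero_point) * scale

x = np.array([-1.5, 0.0, 2.3], dtype=np.float32)
w = np.array([0.4, -0.9, 0.1], dtype=np.float32)
x_scale, w_scale = 0.02, 0.01

x_q, w_q = quantize(x, x_scale), quantize(w, w_scale)

# Fake quantization: round-trip through int8, then compute in float.
y_fake = dequantize(x_q, x_scale) @ dequantize(w_q, w_scale)

# Fixed-point inference: multiply-accumulate entirely in integers
# (int32 accumulator); the combined scale is applied only at the end,
# or folded into the next operator's requantization step.
acc = x_q.astype(np.int32) @ w_q.astype(np.int32)
y_fixed = acc * (x_scale * w_scale)

print(y_fake, y_fixed)  # the two agree up to rounding error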

 

How to quickly convert floating-point weights to fixed point for use on embedded devices

 

Fixed-point quantization of TensorFlow models

How do we quantize, and what does quantization buy us?
Network parameters are organized by layer, and within a layer the values are on the same order of magnitude, i.e. they span a fairly narrow range such as [-6.0, 4.0]. A large body of published work confirms that once each layer's minimum and maximum are known, 8-bit fixed-point quantization of that layer is accurate enough for inference.
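As a concrete illustration of this min/max scheme for the [-6.0, 4.0] range above, here is a small affine (asymmetric) 8-bit quantization sketch; the helper names are ours, not a framework API:

import numpy as np

# Per-layer range observed on the data, e.g. the [-6.0, 4.0] above.
x_min, x_max = -6.0, 4.0

# Affine (asymmetric) 8-bit quantization: map [x_min, x_max] onto [0, 255].
scale = (x_max - x_min) / 255.0          # one quantization step, in float units
zero_point = int(round(-x_min / scale))  # the uint8 code that represents 0.0

def quantize(x):
    return np.clip(np.round(x / scale) + zero_point, 0, 255).astype(np.uint8)

def dequantize(q):
    return (q.astype(np.float32) - zero_point) * scale

x = np.array([-6.0, -1.0, 0.0, 4.0], dtype=np.float32)
print(quantize(x))              # -> [  0 127 153 255]
print(dequantize(quantize(x)))  # close to x, within half a step (~0.02)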
The most direct effect of quantization is a smaller parameter footprint: since 32-bit floats become 8-bit integers, storage shrinks by roughly 3/4 in practice. Reading less data from memory also saves bandwidth, and 8-bit values can be processed with SIMD instructions for faster compute; where a DSP can run the 8-bit math, it saves power as well. Together these make inference on mobile devices substantially more capable and efficient.

How do we quantize a model?

This requires the quantization tooling that TensorFlow provides; a usage example follows......
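As a stand-in, here is a minimal sketch of full-integer post-training quantization using the TensorFlow Lite converter, the current TensorFlow tooling for this; `saved_model_dir` and `calibration_samples` are hypothetical placeholders:

import numpy as np
import tensorflow as tf

# Hypothetical inputs: a trained SavedModel and some calibration data.
saved_model_dir = 'path/to/saved_model'
calibration_samples = np.random.rand(100, 224, 224, 3).astype(np.float32)

converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_dir)
converter.optimizations = [tf.lite.Optimize.DEFAULT]

def representative_dataset():
    # Feed a few batches so the converter can observe activation ranges.
    for sample in calibration_samples:
        yield [sample[np.newaxis, ...]]

converter.representative_dataset = representative_dataset
# Force full int8 quantization of weights and activations.
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8

tflite_model = converter.convert()
with open('model_int8.tflite', 'wb') as f:
    f.write(tflite_model)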

 

 

Three ways to quantize a model with PyTorch

Dynamic quantization, post-training static quantization, and quantization-aware training (QAT). Only the latter two are detailed below, so a quick sketch of dynamic quantization follows first.
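Dynamic quantization converts weights to int8 ahead of time and quantizes activations on the fly at each forward pass, so it needs no calibration data. A minimal sketch using PyTorch's torch.quantization.quantize_dynamic; the model here is a placeholder:

import torch
import torch.nn as nn

# Placeholder float model; dynamic quantization targets Linear/LSTM layers.
model_fp32 = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))
model_fp32.eval()

# Weights are quantized to int8 now; activations are quantized
# dynamically per batch at inference time.
model_int8 = torch.quantization.quantize_dynamic(
    model_fp32, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 128)
print(model_int8(x).shape)  # torch.Size([1, 10])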

  1. Post-training static quantization

Post-training static quantization involves not just converting the weights from float to int, as in dynamic quantization, but also the additional step of first feeding batches of data through the network and computing the resulting distributions of the different activations (specifically, this is done by inserting observer modules at different points that record this data). These distributions are then used to determine how specifically the different activations should be quantized at inference time (a simple technique would be to divide the entire range of activations into 256 levels, but PyTorch supports more sophisticated methods as well). Importantly, this additional step allows us to pass quantized values between operations instead of converting these values to floats - and then back to ints - between every operation, resulting in a significant speed-up.

import torch

# load_model, evaluate, print_size_of_model, the criterion and the data
# loaders are assumed helpers from the surrounding tutorial code.
num_calibration_batches = 10

myModel = load_model(saved_model_dir + float_model_file).to('cpu')
myModel.eval()

# Fuse Conv, bn and relu
myModel.fuse_model()

# Specify quantization configuration
# Start with simple min/max range estimation and per-tensor quantization of weights
myModel.qconfig = torch.quantization.default_qconfig
print(myModel.qconfig)
torch.quantization.prepare(myModel, inplace=True)

# Calibrate first
print('Post Training Quantization Prepare: Inserting Observers')
print('\n Inverted Residual Block: After observer insertion \n\n', myModel.features[1].conv)

# Calibrate with the training set
evaluate(myModel, criterion, data_loader, neval_batches=num_calibration_batches)
print('Post Training Quantization: Calibration done')

# Convert to quantized model
torch.quantization.convert(myModel, inplace=True)
print('Post Training Quantization: Convert done')
print('\n Inverted Residual Block: After fusion and quantization, note fused modules: \n\n',myModel.features[1].conv)

print("Size of model after quantization")
print_size_of_model(myModel)

top1, top5 = evaluate(myModel, criterion, data_loader_test, neval_batches=num_eval_batches)
print('Evaluation accuracy on %d images, %2.2f'%(num_eval_batches * eval_batch_size, top1.avg))

For this quantized model, we see a significantly lower accuracy than the float baseline: just ~62% on these same 300 images. Nevertheless, we did reduce the size of our model down to just under 3.6 MB, almost a 4x decrease.

In addition, we can significantly improve on the accuracy simply by using a different
quantization configuration. We repeat the same exercise with the recommended configuration for
quantizing for x86 architectures. This configuration does the following:

  • Quantizes weights on a per-channel basis
  • Uses a histogram observer that collects a histogram of activations and then picks
    quantization parameters in an optimal manner.

per_channel_quantized_model = load_model(saved_model_dir + float_model_file)
per_channel_quantized_model.eval()
per_channel_quantized_model.fuse_model()
per_channel_quantized_model.qconfig = torch.quantization.get_default_qconfig('fbgemm')
print(per_channel_quantized_model.qconfig)

torch.quantization.prepare(per_channel_quantized_model, inplace=True)
evaluate(per_channel_quantized_model, criterion, data_loader, num_calibration_batches)
torch.quantization.convert(per_channel_quantized_model, inplace=True)
top1, top5 = evaluate(per_channel_quantized_model, criterion, data_loader_test, neval_batches=num_eval_batches)
print('Evaluation accuracy on %d images, %2.2f'%(num_eval_batches * eval_batch_size, top1.avg))
torch.jit.save(torch.jit.script(per_channel_quantized_model), saved_model_dir + scripted_quantized_model_file)
  2. Quantization-aware training

Quantization-aware training (QAT) is the quantization method that typically results in the highest accuracy. With QAT, all weights and activations are "fake quantized" during both the forward and backward passes of training: float values are rounded to mimic int8 values, but all computations are still carried out in floating point, so the network learns weights that remain accurate after the final conversion to int8.
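A minimal QAT sketch in the style of the code above, reusing the same (assumed) tutorial helpers (load_model, train_one_epoch, evaluate, and the data loaders); the learning rate and epoch count are placeholders:

qat_model = load_model(saved_model_dir + float_model_file)
qat_model.fuse_model()

# QAT uses its own qconfig, which inserts fake-quantization modules.
qat_model.qconfig = torch.quantization.get_default_qat_qconfig('fbgemm')
qat_model.train()
torch.quantization.prepare_qat(qat_model, inplace=True)

# Fine-tune with fake quantization in the loop (placeholder schedule).
optimizer = torch.optim.SGD(qat_model.parameters(), lr=0.0001)
for epoch in range(8):
    train_one_epoch(qat_model, criterion, optimizer, data_loader, 'cpu', num_train_batches)

# Convert the trained fake-quantized model into a true int8 model.
qat_model.eval()
quantized_model = torch.quantization.convert(qat_model, inplace=False)
top1, top5 = evaluate(quantized_model, criterion, data_loader_test, neval_batches=num_eval_batches)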

 

 

Simulink also provides a fixed-point conversion utility (the Fixed-Point Tool).