How to quickly convert a trained neural network to fixed point?

Fixed-point quantization

Unlike fake quantization, fixed-point quantization does not convert values back to floating point at inference time. This requires the framework to provide fixed-point implementations of its operators. Mobile AI frameworks such as MNN and XNN have already added fixed-point support.
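To make the contrast concrete, here is a minimal NumPy sketch (all names, scales, and values are illustrative, not any framework's API): fake quantization rounds values and immediately returns to float, while fixed-point inference keeps the int8 tensors and does the arithmetic in integers throughout.

import numpy as np

def quantize(x, scale, zero_point=0):
    # float -> int8: q = clip(round(x / scale) + zero_point, -128, 127)
    q = np.round(x / scale) + zero_point
    return np.clip(q, -128, 127).astype(np.int8)

def dequantize(q, scale, zero_point=0):
    # int8 -> float
    return (q.astype(np.float32) - zero_point) * scale

x = np.array([-1.5, 0.0, 2.3], dtype=np.float32)
w = np.array([0.4, -0.9, 0.1], dtype=np.float32)
x_scale, w_scale = 0.02, 0.01

x_q, w_q = quantize(x, x_scale), quantize(w, w_scale)

# Fake quantization: round-trip through int8, then compute in float.
y_fake = dequantize(x_q, x_scale) @ dequantize(w_q, w_scale)

# Fixed-point inference: multiply-accumulate entirely in integers
# (int32 accumulator); the combined scale is applied only at the end,
# or folded into the next operator's requantization step.
acc = x_q.astype(np.int32) @ w_q.astype(np.int32)
y_fixed = acc * (x_scale * w_scale)

print(y_fake, y_fixed)  # the two agree up to rounding error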

 

How to quickly convert floating-point weights to fixed point for use on embedded devices

 

Fixed-point quantization of TensorFlow models

How do we quantize, and what does quantization buy us?
Network parameters are organized by layer, and within a layer the values are on the same order of magnitude, i.e. they span a fairly narrow range such as [-6.0, 4.0]. A large body of published work confirms that once each layer's minimum and maximum are known, 8-bit fixed-point quantization of that layer is accurate enough for inference.
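As a concrete illustration of this min/max scheme for the [-6.0, 4.0] range above, here is a small affine (asymmetric) 8-bit quantization sketch; the helper names are ours, not a framework API:

import numpy as np

# Per-layer range observed on the data, e.g. the [-6.0, 4.0] above.
x_min, x_max = -6.0, 4.0

# Affine (asymmetric) 8-bit quantization: map [x_min, x_max] onto [0, 255].
scale = (x_max - x_min) / 255.0          # one quantization step, in float units
zero_point = int(round(-x_min / scale))  # the uint8 code that represents 0.0

def quantize(x):
    return np.clip(np.round(x / scale) + zero_point, 0, 255).astype(np.uint8)

def dequantize(q):
    return (q.astype(np.float32) - zero_point) * scale

x = np.array([-6.0, -1.0, 0.0, 4.0], dtype=np.float32)
print(quantize(x))              # -> [  0 127 153 255]
print(dequantize(quantize(x)))  # close to x, within half a step (~0.02)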
The most direct effect of quantization is a smaller parameter footprint: since 32-bit floats become 8-bit integers, storage shrinks by roughly 3/4 in practice. Reading less data from memory also saves bandwidth, and 8-bit values can be processed with SIMD instructions for faster compute; where a DSP can run the 8-bit math, it saves power as well. Together these make inference on mobile devices substantially more capable and efficient.

How do we quantize a model?

This requires the quantization tooling that TensorFlow provides; a usage example follows......
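As a stand-in, here is a minimal sketch of full-integer post-training quantization using the TensorFlow Lite converter, the current TensorFlow tooling for this; `saved_model_dir` and `calibration_samples` are hypothetical placeholders:

import numpy as np
import tensorflow as tf

# Hypothetical inputs: a trained SavedModel and some calibration data.
saved_model_dir = 'path/to/saved_model'
calibration_samples = np.random.rand(100, 224, 224, 3).astype(np.float32)

converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_dir)
converter.optimizations = [tf.lite.Optimize.DEFAULT]

def representative_dataset():
    # Feed a few batches so the converter can observe activation ranges.
    for sample in calibration_samples:
        yield [sample[np.newaxis, ...]]

converter.representative_dataset = representative_dataset
# Force full int8 quantization of weights and activations.
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8

tflite_model = converter.convert()
with open('model_int8.tflite', 'wb') as f:
    f.write(tflite_model)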

 

 

Three ways to quantize a model with PyTorch

Dynamic quantization, post-training static quantization, and quantization-aware training (QAT). Only the latter two are detailed below, so a quick sketch of dynamic quantization follows first.
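Dynamic quantization converts weights to int8 ahead of time and quantizes activations on the fly at each forward pass, so it needs no calibration data. A minimal sketch using PyTorch's torch.quantization.quantize_dynamic; the model here is a placeholder:

import torch
import torch.nn as nn

# Placeholder float model; dynamic quantization targets Linear/LSTM layers.
model_fp32 = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))
model_fp32.eval()

# Weights are quantized to int8 now; activations are quantized
# dynamically per batch at inference time.
model_int8 = torch.quantization.quantize_dynamic(
    model_fp32, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 128)
print(model_int8(x).shape)  # torch.Size([1, 10])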

  1. Post-training static quantization

Post-training static quantization involves not just converting the weights from float to int, as in dynamic quantization, but also the additional step of first feeding batches of data through the network and computing the resulting distributions of the different activations (specifically, this is done by inserting observer modules at different points that record this data). These distributions are then used to determine how specifically the different activations should be quantized at inference time (a simple technique would be to divide the entire range of activations into 256 levels, but PyTorch supports more sophisticated methods as well). Importantly, this additional step allows us to pass quantized values between operations instead of converting these values to floats - and then back to ints - between every operation, resulting in a significant speed-up.

import torch

# load_model, evaluate, print_size_of_model, the criterion and the data
# loaders are assumed helpers from the surrounding tutorial code.
num_calibration_batches = 10

myModel = load_model(saved_model_dir + float_model_file).to('cpu')
myModel.eval()

# Fuse Conv, bn and relu
myModel.fuse_model()

# Specify quantization configuration
# Start with simple min/max range estimation and per-tensor quantization of weights
myModel.qconfig = torch.quantization.default_qconfig
print(myModel.qconfig)
torch.quantization.prepare(myModel, inplace=True)

# Calibrate first
print('Post Training Quantization Prepare: Inserting Observers')
print('\n Inverted Residual Block: After observer insertion \n\n', myModel.features[1].conv)

# Calibrate with the training set
evaluate(myModel, criterion, data_loader, neval_batches=num_calibration_batches)
print('Post Training Quantization: Calibration done')

# Convert to quantized model
torch.quantization.convert(myModel, inplace=True)
print('Post Training Quantization: Convert done')
print('\n Inverted Residual Block: After fusion and quantization, note fused modules: \n\n',myModel.features[1].conv)

print("Size of model after quantization")
print_size_of_model(myModel)

top1, top5 = evaluate(myModel, criterion, data_loader_test, neval_batches=num_eval_batches)
print('Evaluation accuracy on %d images, %2.2f'%(num_eval_batches * eval_batch_size, top1.avg))

For this quantized model, we see a significantly lower accuracy than the float baseline: just ~62% on these same 300 images. Nevertheless, we did reduce the size of our model down to just under 3.6 MB, almost a 4x decrease.

In addition, we can significantly improve on the accuracy simply by using a different
quantization configuration. We repeat the same exercise with the recommended configuration for
quantizing for x86 architectures. This configuration does the following:

  • Quantizes weights on a per-channel basis
  • Uses a histogram observer that collects a histogram of activations and then picks
    quantization parameters in an optimal manner.

per_channel_quantized_model = load_model(saved_model_dir + float_model_file)
per_channel_quantized_model.eval()
per_channel_quantized_model.fuse_model()
per_channel_quantized_model.qconfig = torch.quantization.get_default_qconfig('fbgemm')
print(per_channel_quantized_model.qconfig)

torch.quantization.prepare(per_channel_quantized_model, inplace=True)
evaluate(per_channel_quantized_model, criterion, data_loader, num_calibration_batches)
torch.quantization.convert(per_channel_quantized_model, inplace=True)
top1, top5 = evaluate(per_channel_quantized_model, criterion, data_loader_test, neval_batches=num_eval_batches)
print('Evaluation accuracy on %d images, %2.2f'%(num_eval_batches * eval_batch_size, top1.avg))
torch.jit.save(torch.jit.script(per_channel_quantized_model), saved_model_dir + scripted_quantized_model_file)
  2. Quantization-aware training

Quantization-aware training (QAT) is the quantization method that typically results in the highest accuracy. With QAT, all weights and activations are "fake quantized" during both the forward and backward passes of training: float values are rounded to mimic int8 values, but all computations are still carried out in floating point, so the network learns weights that remain accurate after the final conversion to int8.
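A minimal QAT sketch in the style of the code above, reusing the same (assumed) tutorial helpers (load_model, train_one_epoch, evaluate, and the data loaders); the learning rate and epoch count are placeholders:

qat_model = load_model(saved_model_dir + float_model_file)
qat_model.fuse_model()

# QAT uses its own qconfig, which inserts fake-quantization modules.
qat_model.qconfig = torch.quantization.get_default_qat_qconfig('fbgemm')
qat_model.train()
torch.quantization.prepare_qat(qat_model, inplace=True)

# Fine-tune with fake quantization in the loop (placeholder schedule).
optimizer = torch.optim.SGD(qat_model.parameters(), lr=0.0001)
for epoch in range(8):
    train_one_epoch(qat_model, criterion, optimizer, data_loader, 'cpu', num_train_batches)

# Convert the trained fake-quantized model into a true int8 model.
qat_model.eval()
quantized_model = torch.quantization.convert(qat_model, inplace=False)
top1, top5 = evaluate(quantized_model, criterion, data_loader_test, neval_batches=num_eval_batches)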

 

 

Simulink also provides a fixed-point conversion utility (the Fixed-Point Tool).