Post-training quantization methods
There are two post-training quantization approaches for TensorFlow Lite models:
"hybrid" post-training quantization and post-training integer quantization.
The "hybrid" post-training quantization approach reduces model size and latency in many cases, but it has the limitation of requiring floating-point computation, which may not be available on all hardware accelerators (e.g. Edge TPUs).
post-training integer quantization enables users to take an already-trained floating-point model and fully quantize it to only use 8-bit signed integers (i.e. `int8`). By leveraging this quantization scheme, we can get reasonable quantized model accuracy across many models without resorting to retraining a model with quantization-aware training. With this new tool, models will continue to be 4x smaller, but will see even greater CPU speed-ups. Fixed point hardware accelerators, such as Edge TPUs, will also be able to run these models.
How quantization works
1] Weights are quantized per-axis (i.e. per-channel) or per-tensor to int8 fixed point with the representable range [-127, 127], and the zero-point is fixed at quantized value 0 (symmetric quantization).
2] Activations and inputs are quantized per-tensor to int8 fixed point with the representable range [-128, 127], and the zero-point is some value in [-128, 127] computed from the formula below (asymmetric quantization).
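The symmetric per-channel weight scheme above can be sketched in a few lines of numpy. This is an illustrative helper (the function name is mine, not a TensorFlow Lite API): each output channel gets its own scale, the largest absolute weight maps to 127, and real zero is represented exactly by quantized value 0.

```python
import numpy as np

def quantize_weights_per_channel(w):
    """Symmetric per-channel int8 quantization to [-127, 127], zero-point = 0.

    w: float array of shape (out_channels, ...); one scale per output channel.
    """
    # Per-channel scale: the max absolute value of each channel maps to 127.
    max_abs = np.max(np.abs(w), axis=tuple(range(1, w.ndim)), keepdims=True)
    scale = max_abs / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

# Toy weight matrix with two output channels (rows).
w = np.array([[0.5, -1.0], [2.0, 0.25]], dtype=np.float32)
q, scale = quantize_weights_per_channel(w)
w_hat = q.astype(np.float32) * scale  # dequantize: real zero stays exactly 0.0
```

Because the zero-point is 0, dequantization is a single multiply per channel, which is what makes the symmetric scheme cheap for weights.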
Quantization parameters:
S (scale) = (Rmax - Rmin) / (Qmax - Qmin): how much real value each quantized unit represents.
Z (zero-point) = Qmax - Rmax / S: which quantized value represents real zero.
Note that weights and activations are quantized differently:
1. For weights, real zero is represented by quantized value 0; for activations, the zero-point is computed as Z = Qmax - Rmax / S.
2. The fixed-point ranges also differ: weights use [-127, 127], activations use [-128, 127].
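A worked example of the formulas above, in pure Python (the helper names are illustrative, not a TensorFlow Lite API): quantize an activation range R = [-0.5, 1.5] to int8 in [-128, 127].

```python
def quant_params(r_min, r_max, q_min=-128, q_max=127):
    # S: how much real value one quantized step represents.
    scale = (r_max - r_min) / (q_max - q_min)
    # Z: the quantized value that represents real zero.
    zero_point = int(round(q_max - r_max / scale))
    return scale, zero_point

def quantize(r, scale, zero_point, q_min=-128, q_max=127):
    return max(q_min, min(q_max, int(round(r / scale + zero_point))))

def dequantize(q, scale, zero_point):
    return scale * (q - zero_point)

# Activation range [-0.5, 1.5]:
# S = 2.0 / 255, Z = round(127 - 1.5 / S) = round(-64.25) = -64
scale, zp = quant_params(-0.5, 1.5)
q_zero = quantize(0.0, scale, zp)  # real zero maps exactly to the zero-point
```

Mapping real zero exactly onto an integer value matters: zero-padding and ReLU clamping then introduce no quantization error.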
For choosing the quantization thresholds, this approach uses a simple method:
For weights, the actual min and max values determine the quantization parameters.
For activations, a moving average of the min and max values across batches determines the quantization parameters.
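The moving-average range tracking for activations might be sketched as follows (a minimal illustration in pure Python; the class name and the momentum value 0.9 are my own choices, not taken from TensorFlow Lite):

```python
class RangeTracker:
    """Tracks an exponential moving average of per-batch min/max,
    used to pick activation quantization thresholds."""

    def __init__(self, momentum=0.9):
        self.momentum = momentum
        self.r_min = None
        self.r_max = None

    def update(self, batch):
        b_min, b_max = min(batch), max(batch)
        if self.r_min is None:  # first batch initializes the range
            self.r_min, self.r_max = b_min, b_max
        else:
            m = self.momentum
            self.r_min = m * self.r_min + (1 - m) * b_min
            self.r_max = m * self.r_max + (1 - m) * b_max

tracker = RangeTracker()
for batch in [[0.0, 1.0], [-0.2, 0.8], [0.1, 1.2]]:
    tracker.update(batch)
# tracker.r_min / tracker.r_max now hold the smoothed range
```

Smoothing across batches keeps a single outlier activation from blowing up the range and wasting quantization resolution on values that rarely occur.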
Generating a float tflite model / hybrid quantization and integer quantization
The following simple example shows how to generate a float tflite model (no quantization), a hybrid post-training quantized model, and a post-training integer quantized model.
Build a simple MNIST model
The TensorFlow version used is 2.0.0.
import tensorflow as tf
import numpy as np
print (tf.__version__)
(x_train, y_train),(x_test, y_test) = tf.keras.datasets.mnist.load_data()
#x_train, x_test = (x_train / 255.0), (x_test / 255.0)
x_train, x_test = (x_train / 255.0).astype(np.float32), (x_test / 255.0).astype(np.float32)
model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(10, activation='softmax')
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
model.fit(x_train, y_train, epochs=1)
model.evaluate(x_test, y_test)
Save in SavedModel and h5 formats
model.save('saved_model')
model.save('keras_model.h5')
Interestingly, the same save function handles both; it presumably distinguishes the two formats by the path argument: a directory path produces the SavedModel format, while a filename ending in .h5 produces the HDF5 format.
How to create a tf.lite.TFLiteConverter
Before showing how to generate the quantized models, here is how to create a TFLiteConverter.
The Python API for converting TensorFlow models to TensorFlow Lite is tf.lite.TFLiteConverter. TFLiteConverter provides the following classmethods to convert a model based on the original model format:
TFLiteConverter.from_saved_model(): Converts SavedModel directories.
TFLiteConverter.from_keras_model(): Converts tf.keras models.
TFLiteConverter.from_concrete_functions(): Converts concrete functions.
The official documentation also covers example usages of the API and the API changes between TensorFlow 1 and TensorFlow 2.
A difference from 1.x is that from_keras_model_file("xxx.h5") is gone. Since 2.0 no longer has that function, to convert from an h5 file, first load the model and then create the converter:
keras_model = tf.keras.models.load_model('keras_model.h5')
converter = tf.lite.TFLiteConverter.from_keras_model(keras_model)
Generate the float model
converter = tf.lite.TFLiteConverter.from_keras_model(model)
float_model = converter.convert()
open("float_model.tflite", "wb").write(float_model)
Generate the hybrid quantized model
converter_hybrid = tf.lite.TFLiteConverter.from_saved_model('saved_model')
converter_hybrid.optimizations = [tf.lite.Optimize.DEFAULT]