DPU on PYNQ-Z2 series, 2.2 Using DNNDK: quantizing a model with the decent tool

lulugay

Category: DPU on PYNQ-Z2

Article link: https://blog.csdn.net/lulugay/article/details/103976121

Quantizing a model with the decent tool

This post uses the resnet50 model shipped with DNNDK as an example to show how to quantize a model with the decent tool.

TensorFlow

1. Freeze the model

freeze_graph \
--input_graph=./float_graph/resnet50v1.pb \
--input_checkpoint=./float_graph/resnet50v1.ckpt \
--input_binary=true \
--output_graph=./frozen_resnet50v1.pb \
--output_node_names=resnet_v1_50/predictions/Reshape_1

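The pb and ckpt files used here are provided by DNNDK. The input and output node names are fixed when the model is defined, so for a custom model you need to change them to match your own definition.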

2. Evaluate the frozen model

The contents of evaluate_frozen_graph.sh are as follows:

#!/bin/sh
set -e
# Please set your imagenet validation dataset path here,
IMAGE_DIR=/media/DATASET/imagenet2012/val/
IMAGE_LIST=/media/DATASET/imagenet2012/val.txt

EVAL_BATCHES=1000
BATCH_SIZE=50

python3 eval.py \
  --input_frozen_graph ./frozen_resnet50v1.pb \
  --input_node input \
  --output_node resnet_v1_50/predictions/Reshape_1 \
  --eval_batches $EVAL_BATCHES \
  --batch_size $BATCH_SIZE \
  --eval_image_dir $IMAGE_DIR \
  --eval_image_list $IMAGE_LIST \
  --gpu 0

The script calls eval.py, whose main part is shown below:

def eval(input_graph_def, input_node, output_node):
    """Evaluate classification network graph_def's accuracy, need evaluation dataset"""
    tf.import_graph_def(input_graph_def,name = '')

    # Get input tensors
    input_tensor = tf.get_default_graph().get_tensor_by_name(input_node+':0')
    input_labels = tf.placeholder(tf.float32,shape = [None,FLAGS.class_num])

    # Calculate accuracy
    output = tf.get_default_graph().get_tensor_by_name(output_node+':0')
    prediction = tf.reshape(output, [FLAGS.batch_size, FLAGS.class_num])
    correct_labels = tf.argmax(input_labels, 1)
    top1_prediction = tf.nn.in_top_k(prediction, correct_labels, k = 1)
    top5_prediction = tf.nn.in_top_k(prediction, correct_labels, k = 5)
    top1_accuracy = tf.reduce_mean(tf.cast(top1_prediction,'float'))
    top5_accuracy = tf.reduce_mean(tf.cast(top5_prediction,'float'))

    # Start evaluation
    print("Start Evaluation for {} Batches...".format(FLAGS.eval_batches))
    with tf.Session() as sess:
        progress = ProgressBar()
        top1_sum_acc = 0
        top5_sum_acc = 0
        for iter in progress(range(0,FLAGS.eval_batches)):
            input_data = eval_input(iter, FLAGS.eval_image_dir, FLAGS.eval_image_list, FLAGS.class_num, FLAGS.batch_size)
            images = input_data['input']
            # img = input_data['input']
            # images = np.array(img)
            labels = input_data['labels']
            feed_dict = {input_tensor: images, input_labels: labels}
            top1_acc, top5_acc = sess.run([top1_accuracy, top5_accuracy],feed_dict)
            top1_sum_acc += top1_acc
            top5_sum_acc += top5_acc
    final_top1_acc = top1_sum_acc/FLAGS.eval_batches
    final_top5_acc = top5_sum_acc/FLAGS.eval_batches
    print("Accuracy: Top1: {}, Top5: {}".format(final_top1_acc, final_top5_acc))
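To make the accuracy computation above concrete, here is a minimal plain-Python sketch of what the top-k check and the per-batch averaging do. The function names are ours, not part of eval.py, and ties are broken by class index here, which may differ from how tf.nn.in_top_k treats ties.

```python
def in_top_k(scores, label, k):
    """Return True if `label` is among the k highest-scoring class indices."""
    # Rank class indices by score, highest first.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return label in ranked[:k]


def batch_accuracy(batch_scores, labels, k):
    """Fraction of samples in one batch whose label is in the top k."""
    hits = [in_top_k(s, y, k) for s, y in zip(batch_scores, labels)]
    return sum(hits) / len(hits)


def mean_accuracy(per_batch_acc):
    """Average of per-batch accuracies, as eval.py does at the end."""
    return sum(per_batch_acc) / len(per_batch_acc)


# Two samples, three classes: the first is classified correctly, the second is not.
scores = [[0.1, 0.7, 0.2], [0.5, 0.3, 0.2]]
labels = [1, 2]
print(batch_accuracy(scores, labels, k=1))
print(batch_accuracy(scores, labels, k=3))
```

Averaging the per-batch values gives the overall accuracy only because every batch has the same size, which is the case in eval.py.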

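Most of eval.py is fixed and needs no changes. Only the line
input_data = eval_input(iter, FLAGS.eval_image_dir, FLAGS.eval_image_list, FLAGS.class_num, FLAGS.batch_size)
has to be adapted. It reads a batch of images together with their labels, runs the image preprocessing to produce the input tensor, and returns the matching labels. eval_input looks like this: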

import cv2
from sklearn import preprocessing

def eval_input(iter, eval_image_dir, eval_image_list, class_num, eval_batch_size):
    images = []
    labels = []
    # Each line of the list file is "<image_name> <label_id>"
    with open(eval_image_list) as f:
        line = f.readlines()
    for index in range(0, eval_batch_size):
        curline = line[iter * eval_batch_size + index]
        [image_name, label_id] = curline.split(' ')
        image = cv2.imread(eval_image_dir + image_name)
        image = preprocess(image)
        images.append(image)
        labels.append(int(label_id))
    # One-hot encode the labels to match the [None, class_num] placeholder
    lb = preprocessing.LabelBinarizer()
    lb.fit(range(0, class_num))
    labels = lb.transform(labels)
    return {"input": images, "labels": labels}
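Because eval_input indexes the list with iter * eval_batch_size + index, the image list must contain at least EVAL_BATCHES * BATCH_SIZE lines, otherwise the last batches fail with an IndexError. A small check, sketched here with a helper name of our own choosing, can be run before a long evaluation:

```python
def max_full_batches(image_list_path, batch_size):
    """Number of complete batches that the image list can supply."""
    with open(image_list_path) as f:
        # Count only non-empty lines; each should be "<image_name> <label_id>".
        n_images = sum(1 for line in f if line.strip())
    return n_images // batch_size
```

With the 50000 images of the ImageNet 2012 validation set and BATCH_SIZE=50, this gives 1000, which matches EVAL_BATCHES=1000 in the script above.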

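The most critical part of eval_input is preprocess(image).
When evaluating and quantizing a model, the image preprocessing used here must be identical to the preprocessing used when the model was trained.
When evaluating and quantizing a model, the image preprocessing used here must be identical to the preprocessing used when the model was trained.
When evaluating and quantizing a model, the image preprocessing used here must be identical to the preprocessing used when the model was trained.
Important things get said three times. Judging from the scripts shipped with DNNDK, the preprocessing for resnet50v1 subtracts the per-channel means 103.939, 116.779 and 123.68 from the three color channels and converts RGB to BGR. The preprocess function called in eval_input should therefore do the same: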

def preprocess(img):
    # cv2.imread returns uint8; convert to float32 before subtracting the means
    img = np.array(img, dtype=np.float32)

    # Resize so that the shorter side is 256 pixels, keeping the aspect ratio
    height, width, _ = img.shape
    new_height = height * 256 // min(img.shape[:2])
    new_width = width * 256 // min(img.shape[:2])
    img = cv2.resize(img, (new_width, new_height), interpolation=cv2.INTER_CUBIC)

    # Crop the central 224x224 region
    height, width, _ = img.shape
    startx = width//2 - (224//2)
    starty = height//2 - (224//2)
    img = img[starty:starty+224,startx:startx+224]
    assert img.shape[0] == 224 and img.shape[1] == 224, (img.shape, height, width)

    # Subtract the per-channel means
    img[:,:,0] -= 123.68
    img[:,:,1] -= 116.779
    img[:,:,2] -= 103.939

    return img

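Two points to note:

cv2.imread returns the image directly in BGR order, so no RGB2BGR conversion is needed
cv2.imread returns uint8 data, which must be converted to float32 before subtracting the means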
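The resize and crop arithmetic in preprocess is easy to get wrong by one pixel, so it can help to trace it on a concrete size. The sketch below repeats only the integer arithmetic, without touching any image data; the function name and the 375x500 example size are ours.

```python
def resize_and_crop_geometry(height, width, short_side=256, crop=224):
    """Return (new_height, new_width, startx, starty) as computed in preprocess."""
    # Scale so that the shorter side becomes `short_side`, keeping the aspect ratio.
    shorter = min(height, width)
    new_height = height * short_side // shorter
    new_width = width * short_side // shorter
    # Top-left corner of the centered crop window.
    startx = new_width // 2 - crop // 2
    starty = new_height // 2 - crop // 2
    return new_height, new_width, startx, starty


# A 375x500 (height x width) image is resized to 256x341,
# then cropped starting at x=58, y=16.
print(resize_and_crop_geometry(375, 500))  # (256, 341, 58, 16)
```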

Then run the evaluation script. The result is shown below. The accuracy is Top1: 0.7355, Top5: 0.9147, which indicates that our preprocessing is correct.

root@3f231e40c7cd:/mnt/nvidia/host_x86/models/tensorflow/resnet50# sh evaluate_frozen.sh
WARNING:tensorflow:From eval.py:55: FastGFile.__init__ (from tensorflow.python.platform.gfile) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.gfile.GFile.
Start Evaluation for 1000 Batches...
2020-03-08 17:18:32.296229: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties:
name: GeForce RTX 2080 Ti major: 7 minor: 5 memoryClockRate(GHz): 1.545
pciBusID: 0000:04:00.0
totalMemory: 10.76GiB freeMemory: 10.60GiB
2020-03-08 17:18:32.296285: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2020-03-08 17:18:32.668404: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-03-08 17:18:32.668467: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988]      0
2020-03-08 17:18:32.668478: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0:   N
2020-03-08 17:18:32.668634: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10232 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2080 Ti, pci bus id: 0000:04:00.0, compute capability: 7.5)
100% |###################################################################################################################################################################################################|
Accuracy: Top1: 0.6144000029563904, Top5: 0.8405999964475632

3. Quantize the model

decent_q quantize \
  --input_frozen_graph ./frozen_resnet50v1.pb \
  --input_nodes input \
  --input_shapes ?,224,224,3 \
  --output_nodes resnet_v1_50/predictions/Reshape_1 \
  --input_fn input_fn.calib_input \
  --method 1 \
  --gpu 0 \
  --calib_iter 27 \
  --output_dir ./quantize_results \
  --weight_bit 8 \
  --activation_bit 8
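The --input_fn input_fn.calib_input option points decent_q at a function calib_input in a module input_fn.py. As we understand the DNNDK documentation, decent_q calls it once per calibration iteration with the step index and expects a dict that maps the input node name to a batch of preprocessed images. Below is a minimal sketch of that contract. The factory make_calib_input and its parameters are hypothetical, and the image loader is passed in so the slicing logic can be tried without a dataset.

```python
def make_calib_input(image_names, batch_size, load_image, input_node="input"):
    """Build a calibration input function (hypothetical helper).

    image_names: list of image file names used for calibration
    load_image:  callable that maps a file name to a preprocessed image
    input_node:  must match the name given to --input_nodes
    """
    def calib_input(iter):
        # Pick the slice of images that belongs to calibration step `iter`.
        start = iter * batch_size
        batch_names = image_names[start:start + batch_size]
        if len(batch_names) < batch_size:
            raise IndexError("not enough calibration images for step %d" % iter)
        images = [load_image(name) for name in batch_names]
        return {input_node: images}
    return calib_input
```

In a real input_fn.py, calib_input would be a module-level name, and load_image would read the file with cv2.imread and apply the same preprocess function used for evaluation. With --calib_iter 27, the image list has to provide at least 27 times the batch size images.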

4. Compile the model

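Prepare the dcf file
In the section on integrating the DPU IP in Vivado we mentioned that the hwh file must be saved. In DNNDK, run dlet on that file to generate the dcf file.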

dlet -f pynq_dpu.hwh
[DLet]Generate DPU DCF file dpu-11111530-111530-201911111530-1530-30.dcf successfully.

Compile

dnnc --parser=tensorflow                         \
       --frozen_pb=./quantize_results/deploy_model.pb   \
       --output_dir=dnnc_output                 \
       --dcf=pynqz2.dcf                         \
       --mode=normal                        \
       --cpu_arch=arm32                     \
       --net_name=resnet50v1

After a short wait we see the following output:

[DNNC][Warning] layer [resnet_v1_50_SpatialSqueeze] (type: Squeeze) is not supported in DPU, deploy it in CPU instead.
[DNNC][Warning] layer [resnet_v1_50_predictions_Softmax] (type: Softmax) is not supported in DPU, deploy it in CPU instead.

DNNC Kernel topology "resnet50v1_kernel_graph.jpg" for network "resnet50v1"
DNNC kernel list info for network "resnet50v1"
                               Kernel ID : Name
                                       0 : resnet50v1_0
                                       1 : resnet50v1_1

                             Kernel Name : resnet50v1_0
--------------------------------------------------------------------------------
                             Kernel Type : DPUKernel
                               Code Size : 0.99MB
                              Param Size : 24.35MB
                           Workload MACs : 6964.51MOPS
                         IO Memory Space : 2.25MB
                              Mean Value : 0, 0, 0,
                              Node Count : 58
                            Tensor Count : 59
                    Input Node(s)(H*W*C)
            resnet_v1_50_conv1_Conv2D(0) : 224*224*3
                   Output Node(s)(H*W*C)
           resnet_v1_50_logits_Conv2D(0) : 1*1*1000


                             Kernel Name : resnet50v1_1
--------------------------------------------------------------------------------
                             Kernel Type : CPUKernel
                    Input Node(s)(H*W*C)
             resnet_v1_50_SpatialSqueeze : 1*1*1000
                   Output Node(s)(H*W*C)
        resnet_v1_50_predictions_Softmax : 1*1*1000

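It is worth explaining why two kernels are produced but only one elf file is generated. In the ResNet50v1 network, everything from the input node resnet_v1_50_conv1_Conv2D through resnet_v1_50_logits_Conv2D runs on the DPU. The squeeze and softmax operations that follow are not supported by the DPU, so we would have to take the data out at resnet_v1_50_logits_Conv2D and implement squeeze and softmax by hand. Here, however, we only do classification and do not need the softmax values themselves. Softmax is monotonic and does not change the ranking of the classes, so we can let the DPU compute up to resnet_v1_50_logits_Conv2D and sort its output directly to get the classification result.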
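The claim that sorting the logits is enough can be checked with a few lines of plain Python. This is only an illustration with made-up logit values. The actual DPU output is fixed-point data, but scaling by a positive factor does not change the ordering either.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    # Subtracting the maximum avoids overflow and does not change the result.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k(values, k):
    """Indices of the k largest values, largest first."""
    return sorted(range(len(values)), key=lambda i: values[i], reverse=True)[:k]


logits = [2.0, -1.0, 5.5, 0.3, 4.1]
# The ranking is the same before and after softmax.
assert top_k(logits, 3) == top_k(softmax(logits), 3)
print(top_k(logits, 3))  # [2, 4, 0]
```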