[tensorflow] Saving, Loading, and Converting Models in Detail

This walkthrough covers how model files are managed and converted when training with the TensorFlow framework.

  1. First, we need to be clear about the storage format of TensorFlow model files and which files are produced:
    model_folder:
    ------checkpoint
    ------model.meta
    ------model.data-00000-of-00001
    ------model.index
    These are all the files you will find in the model folder:
    The checkpoint file records the names of all saved model files; tf.train.latest_checkpoint() reads its contents to locate the most recent model file.
    model.meta is the basic skeleton of the graph, stored in protobuf format; it contains the variables, operations, collections, and other graph data.
    model.data-00000-of-00001 and model.index together form the actual checkpoint data: they hold the weights, biases, and other parameter values. Before TensorFlow 0.11 a single file with the .ckpt suffix was used; later versions store the model parameters in these two files.
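    If you want to peek inside these files, a minimal sketch (an assumption, not part of the original code; it presumes a checkpoint has already been written to the model\ folder above) is to read them back with tf.train.NewCheckpointReader:

    # Sketch: inspect which variables the .data/.index pair actually stores.
    import tensorflow as tf

    ckpt_path = tf.train.latest_checkpoint('model\\')   # resolved via the "checkpoint" file
    reader = tf.train.NewCheckpointReader(ckpt_path)     # backed by model.index / model.data-*
    for name, shape in reader.get_variable_to_shape_map().items():
        print(name, shape)                                # variable name and its shape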
    

    With that clear, we can start building the computation graph, i.e. the network structure and the operations inside it.

  2. Building the network. To keep things simple, the network itself is simple: input layer, conv_1, conv_2, fc1, dropout, fc2 (output layer). The code is as follows (the helpers weight_variable, bias_variable, conv2d, and max_pool_2 are defined in the complete listing in section 7):
    def network():
    
        # define the placeholder by using feed the data
        with tf.name_scope('input_placeholder'):
            x = tf.placeholder(tf.float32, [None, 784], 'x')  # 28*28=784 dim
            x_input = tf.reshape(x, [-1, 28, 28, 1], 'x_reshape')  # reshape for conv; -1 means an unfixed batch size, 1 is the channel count
            y_label = tf.placeholder(tf.float32, [None, FLAGS.classes], 'y_label')  # label - 10 dim
    
        # define convolution layer1
        with tf.name_scope('conv_layer1'):
            W_conv1 = weight_variable([5, 5, 1, 32], name='w_conv_1')  # Weight in:1  out:32
            b_conv1 = bias_variable([32], name='b_conv_1')  # bias
            h_relu1 = tf.nn.relu(conv2d(x_input, W_conv1) + b_conv1, name='relu_1')  # relu
            h_pool1 = max_pool_2(h_relu1, name='pool_1')  # pool after relu1
    
        # define convolution layer2
        with tf.name_scope('conv_layer2'):
            W_conv2 = weight_variable([5, 5, 32, 64], name='w_conv_2')  # Weight in:32  out:64
            b_conv2 = bias_variable([64], name='b_conv_2')  # bias for 64 kernel
            h_relu2 = tf.nn.relu(conv2d(h_pool1, W_conv2) + b_conv2, name='relu_2')  # relu
            h_pool2 = max_pool_2(h_relu2, name='pool_2')  # pool after relu2
    
        # define the first FC layer
        with tf.name_scope('fc1'):
            W_fc1 = weight_variable([7 * 7 * 64, 1024], name='w_fc1')  # Weight in:7*7res*64  out:1024
            b_fc1 = bias_variable([1024], name='b_fc1')  # bias for 1024
            h_pool2_flat = tf.reshape(h_pool2, [-1, 7 * 7 * 64], name='pool1')
            h_fc1 = tf.nn.relu(tf.matmul(h_pool2_flat, W_fc1) + b_fc1, name='relu1')
    
        # adding the dropout, in order to restrain overfitting
        with tf.name_scope('drop_out'):
            keep_prob = tf.placeholder(tf.float32, name='drop_out_placeholder')
            drop_fc1 = tf.nn.dropout(h_fc1, keep_prob, name='drop_out_fc')
    
        # define the second FC layer, by using softmax
        with tf.name_scope('fc2'):
            W_fc2 = weight_variable([1024, FLAGS.classes], name='w_fc2')  # Weight in:1024  out:10
            b_fc2 = bias_variable([FLAGS.classes], name='b_fc2')  # bias for the 10 output classes
            y = tf.nn.softmax(tf.matmul(drop_fc1, W_fc2) + b_fc2, name='y_out')  # network output
    
        global_step = tf.Variable(0, trainable=False)
    
        # define the loss
        with tf.name_scope('loss'):
            cross_entropy = tf.reduce_mean(-tf.reduce_sum(y_label * tf.log(y), reduction_indices=[1]), name='cross_entropy')
        with tf.name_scope('train_op'):
            train_step = tf.train.AdamOptimizer(FLAGS.lr).minimize(cross_entropy,
                                                                   global_step=global_step,
                                                                   name='train_operation')  # Adam instead of SGD
    
        # define the accuracy
        with tf.name_scope('accuracy'):
            correct_pred = tf.equal(tf.argmax(y, 1), tf.argmax(y_label, 1), name='condition')
            accuracy = tf.reduce_mean(tf.cast(correct_pred, tf.float32), name='accuracy')
    
        return x, y, keep_prob, y_label, train_step, accuracy, global_step

    One thing worth noting above: I wrapped every layer in a name_scope. The benefit is that the relationships between layers stay clear, and it lets us later fetch tensors directly by name without having to rebuild the graph definition; a short sketch of the resulting names follows.
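    As a quick illustration (a minimal sketch, not part of the training code), this is how name_scope composes the names we will rely on later:

    # Sketch: name_scope prefixes the op name, and ":0" selects the op's first output.
    import tensorflow as tf

    with tf.name_scope('input_placeholder'):
        x = tf.placeholder(tf.float32, [None, 784], name='x')
    print(x.op.name)   # "input_placeholder/x"
    print(x.name)      # "input_placeholder/x:0"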

  3. Loading the data and starting training
    # Load the data
    mnist = input_data.read_data_sets("MNIST_data/", one_hot=True)
    # All of the data is loaded into mnist for later use in training and testing
    # Model training; you can also see how the model is saved (via tf.train.Saver)
    def train():
    
        # flag that ensures the meta graph is saved only once
        a = False
        x, y, keep_prob, y_label, train_step, accuracy, global_step = network()
    
        sess.run(tf.global_variables_initializer())
        saver = tf.train.Saver(max_to_keep=3)
    
        if FLAGS.use_model:
            model_t = tf.train.latest_checkpoint(FLAGS.model_path)
            saver.restore(sess, model_t)
    
        for i in range(FLAGS.max_iter_step):
            batch = mnist.train.next_batch(FLAGS.batch_size)  # one batch of FLAGS.batch_size examples
            if i % 100 == 0:
                # evaluate training accuracy on the current batch
                train_accuracy = sess.run(accuracy, feed_dict={x: batch[0], y_label: batch[1], keep_prob: 1.0})
                print("step {step}, training accuracy {acc}".format(step=i, acc=train_accuracy))
                if (train_accuracy > 0.5):
                    if not a:
                        saver.export_meta_graph(FLAGS.model_path + FLAGS.meta_graph_name)
                        a = True
                    saver.save(sess, FLAGS.model_path + FLAGS.model_name, global_step=global_step, write_meta_graph=False)
            sess.run(train_step, feed_dict={x: batch[0], y_label: batch[1], keep_prob: FLAGS.keep_drop})

     

  4. So how do we set up model saving?
    # First, create a Saver object
    saver = tf.train.Saver(max_to_keep=3)  # keep at most three model files, i.e. the checkpoint file always lists the three most recent versions
    # Second, save the model whenever you like, at whatever iteration or epoch suits you.
    # My logic here: every 100 training steps, if train_accuracy is greater than 0.5, start saving.
    # Note: there is no need to save the meta file on every save. Save it once, the first time, because the meta file is the graph and later training does not change it; this reduces overhead.
    saver.export_meta_graph(FLAGS.model_path + FLAGS.meta_graph_name)
    # The line above runs only the first time the model is saved
    saver.save(sess, FLAGS.model_path + FLAGS.model_name, global_step=global_step, write_meta_graph=False)
    # The line above runs on every save but skips the meta data; for an ordinary save, write_meta_graph defaults to True
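    To confirm the max_to_keep=3 behaviour, a minimal sketch (an extra check, not part of the original script) can read the bookkeeping that the Saver maintains:

    # Sketch: the "checkpoint" file lists at most three model versions with max_to_keep=3.
    import tensorflow as tf

    state = tf.train.get_checkpoint_state('model\\')   # parses the "checkpoint" file
    print(state.model_checkpoint_path)                  # the most recent save
    print(state.all_model_checkpoint_paths)             # up to three retained versions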

     

  5. OK, once training is finished and the model files exist, how do we load them back and run a test?
    def test():
    
        if FLAGS.use_model:
            with tf.Session() as sess:
                saver = tf.train.import_meta_graph(FLAGS.model_path + FLAGS.meta_graph_name)
                saver.restore(sess, tf.train.latest_checkpoint(FLAGS.model_path))
    
                graph = tf.get_default_graph()
    
    
                # one operation can have several outputs, so you must specify which one, e.g. "name:0"
                x = graph.get_tensor_by_name("input_placeholder/x:0")
                y_label = graph.get_tensor_by_name("input_placeholder/y_label:0")
                keep_prob = graph.get_tensor_by_name("drop_out/drop_out_placeholder:0")
                accuracy = graph.get_tensor_by_name("accuracy/accuracy:0")
    
                feed_dict = {x: mnist.test.images,
                             y_label: mnist.test.labels,
                             keep_prob: 1.0}
    
                acc = sess.run(accuracy, feed_dict=feed_dict)
                print("test accuracy {acc:.4f}".format(acc=acc))

    Walking through the code above:

    First, check whether a saved model should be used.

    Then open a session. Note that I never build the network structure here, i.e. my computation graph does not yet exist in TensorFlow's default graph.

    Then tf.train.import_meta_graph() imports the model graph, i.e. the meta file, and returns a saver.

    Then saver.restore() loads the checkpoint files with the weights.

    Then tf.get_default_graph() fetches TensorFlow's default graph, which now contains the graph we saved.

    Because we never defined the network's variables in this script, we hold no Python references to them, so we use graph.get_tensor_by_name() to fetch the tensors from the graph.

    Finally, sess.run() evaluates whatever tensors we want.

    (Note how the name passed to get_tensor_by_name is composed: (name_scope)/(op_name):output_index. Our computation is built on tensors, but every result is produced by an operation, and an operation may have several outputs, so the trailing index says which output of that operation to take; ":0" is the first one.)
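    If you are unsure what the names are, a minimal sketch (assuming the meta graph has already been imported as above) is to list the operations in the restored graph and append the output index:

    # Sketch: discover tensor names after importing the meta graph.
    import tensorflow as tf

    graph = tf.get_default_graph()
    for op in graph.get_operations():
        print(op.name)                                             # e.g. "accuracy/accuracy"
    acc_tensor = graph.get_tensor_by_name("accuracy/accuracy:0")   # op name + ":0" = first output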

  6. Good, that covers the special saving and loading of the model files. Next we need to convert them into a pb file, because when we later deploy the TensorFlow model with TensorRT we need a pb file, which is then converted to a uff or onnx file to build the TensorRT network.
    def save_pb_file():
    
        if FLAGS.use_model:
            saver = tf.train.import_meta_graph(FLAGS.model_path + FLAGS.meta_graph_name)
    
    
            model_t = tf.train.latest_checkpoint(FLAGS.model_path)
            saver.restore(sess, model_t)
    
            graphdef = tf.get_default_graph().as_graph_def()
    
            frozen_graph = tf.graph_util.convert_variables_to_constants(sess, graphdef, ['fc2/y_out'])  # note: this is the name of the final output node (the op name, without ':0')
    
            return tf.graph_util.remove_training_nodes(frozen_graph)
        else:
            return False
    
    graph_def = save_pb_file()
    
    if graph_def is False:
        raise ValueError("The meta graph does not exist!!!")

    output_file = './graph.pb'
    with tf.gfile.GFile(name=output_file, mode='wb') as f:
        s = graph_def.SerializeToString()
        f.write(s)

    As before, we first import the graph, then restore the weight files, then export the default graph as a GraphDef. tf.graph_util.convert_variables_to_constants folds the variable values into constants (freezing the graph), and finally tf.graph_util.remove_training_nodes(frozen_graph) strips out the nodes that are only used during training, i.e. things like the gradient ops.

    The frozen GraphDef with the training-only nodes removed is returned and then written to the target file through tf.gfile.GFile. A small sketch for loading it back follows.
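    As a quick sanity check, here is a minimal sketch (assuming graph.pb was written as above and the node names match the network defined earlier) for loading the frozen graph back and running it:

    # Sketch: reload the frozen pb and run the softmax output on a dummy batch.
    import numpy as np
    import tensorflow as tf

    graph_def = tf.GraphDef()
    with tf.gfile.GFile('./graph.pb', 'rb') as f:
        graph_def.ParseFromString(f.read())

    with tf.Graph().as_default() as g:
        tf.import_graph_def(graph_def, name='')   # keep the original node names
        x = g.get_tensor_by_name('input_placeholder/x:0')
        keep_prob = g.get_tensor_by_name('drop_out/drop_out_placeholder:0')
        y = g.get_tensor_by_name('fc2/y_out:0')
        with tf.Session(graph=g) as sess:
            dummy = np.zeros((1, 784), dtype=np.float32)   # one all-zero "image", just to check the graph runs
            print(sess.run(y, feed_dict={x: dummy, keep_prob: 1.0}))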

  7. The complete code:
    #!/usr/bin/env python
    # -*- coding: utf-8 -*-
    # @Time    : 2018/4/24 20:08
    # @Author  : milittle
    # @Site    : www.weaf.top
    # @File    : model.py
    # @Software: PyCharm
    #coding=utf-8
    
    
    import tensorflow as tf
    from tensorflow.examples.tutorials.mnist import input_data
    from tensorflow.python.framework import ops
    import dataset
    
    ops.reset_default_graph()
    sess = tf.Session()
    
    FLAGS = tf.app.flags.FLAGS
    tf.app.flags.DEFINE_integer('max_iter_step', 1000, 'define iteration times')
    tf.app.flags.DEFINE_integer('batch_size', 128, 'define batch size')
    tf.app.flags.DEFINE_integer('classes', 10, 'define classes')
    tf.app.flags.DEFINE_float('keep_drop', 0.5, 'define keep dropout')
    tf.app.flags.DEFINE_float('lr', 0.001, 'define learning rate')
    tf.app.flags.DEFINE_string('model_path', 'model\\', 'define model path')
    tf.app.flags.DEFINE_string('model_name', 'model.ckpt', 'define model name')
    tf.app.flags.DEFINE_string('meta_graph_name', 'model.meta', 'define meta graph name')
    tf.app.flags.DEFINE_bool('use_model', False, 'define use_model sign')
    tf.app.flags.DEFINE_bool('is_train', True, 'define train sign')
    tf.app.flags.DEFINE_bool('is_test', False, 'define test sign')
    
    
    mnist = input_data.read_data_sets("MNIST_data/", one_hot=True)
    # mnist_train = dataset.train("MNIST_data/")
    # mnist_test = dataset.train("MNIST_data/")
    
    # define W & b
    def weight_variable(para, name):
        # truncated normal initialization with stddev=0.1
        initial = tf.truncated_normal(para, stddev=0.1)
        return tf.Variable(initial, name=name)  # pass name as a keyword; positionally it would be taken as 'trainable'

    def bias_variable(para, name):
        initial = tf.constant(0.1, shape=para)
        return tf.Variable(initial, name=name)
    
    # define conv & pooling
    def conv2d(x,W):
        return tf.nn.conv2d( x,W,strides=[1,1,1,1],padding='SAME' )
    
    def max_pool_2(x, name):
        return tf.nn.max_pool(x,ksize=[1,2,2,1],strides=[1,2,2,1],padding='SAME', name=name)
    
    def network():
    
        # define the placeholder by using feed the data
        with tf.name_scope('input_placeholder'):
            x = tf.placeholder(tf.float32, [None, 784], 'x')  # 28*28=784 dim
            x_input = tf.reshape(x, [-1, 28, 28, 1], 'x_reshape')  # reshape for conv; -1 means an unfixed batch size, 1 is the channel count
            y_label = tf.placeholder(tf.float32, [None, FLAGS.classes], 'y_label')  # label - 10 dim
    
        # define convolution layer1
        with tf.name_scope('conv_layer1'):
            W_conv1 = weight_variable([5, 5, 1, 32], name='w_conv_1')  # Weight in:1  out:32
            b_conv1 = bias_variable([32], name='b_conv_1')  # bias
            h_relu1 = tf.nn.relu(conv2d(x_input, W_conv1) + b_conv1, name='relu_1')  # relu
            h_pool1 = max_pool_2(h_relu1, name='pool_1')  # pool after relu1
    
        # define convolution layer2
        with tf.name_scope('conv_layer2'):
            W_conv2 = weight_variable([5, 5, 32, 64], name='w_conv_2')  # Weight in:32  out:64
            b_conv2 = bias_variable([64], name='b_conv_2')  # bias for 64 kernel
            h_relu2 = tf.nn.relu(conv2d(h_pool1, W_conv2) + b_conv2, name='relu_2')  # relu
            h_pool2 = max_pool_2(h_relu2, name='pool_2')  # pool after relu2
    
        # define the first FC layer
        with tf.name_scope('fc1'):
            W_fc1 = weight_variable([7 * 7 * 64, 1024], name='w_fc1')  # Weight in:7*7res*64  out:1024
            b_fc1 = bias_variable([1024], name='b_fc1')  # bias for 1024
            h_pool2_flat = tf.reshape(h_pool2, [-1, 7 * 7 * 64], name='pool1')
            h_fc1 = tf.nn.relu(tf.matmul(h_pool2_flat, W_fc1) + b_fc1, name='relu1')
    
        # adding the dropout, in order to restrain overfitting
        with tf.name_scope('drop_out'):
            keep_prob = tf.placeholder(tf.float32, name='drop_out_placeholder')
            drop_fc1 = tf.nn.dropout(h_fc1, keep_prob, name='drop_out_fc')
    
        # define the second FC layer, by using softmax
        with tf.name_scope('fc2'):
            W_fc2 = weight_variable([1024, FLAGS.classes], name='w_fc2')  # Weight in:1024  out:10
            b_fc2 = bias_variable([FLAGS.classes], name='b_fc2')  # bias for the 10 output classes
            y = tf.nn.softmax(tf.matmul(drop_fc1, W_fc2) + b_fc2, name='y_out')  # network output
    
        global_step = tf.Variable(0, trainable=False)
    
        # define the loss
        with tf.name_scope('loss'):
            cross_entropy = tf.reduce_mean(-tf.reduce_sum(y_label * tf.log(y), reduction_indices=[1]), name='cross_entropy')
        with tf.name_scope('train_op'):
            train_step = tf.train.AdamOptimizer(FLAGS.lr).minimize(cross_entropy,
                                                                   global_step=global_step,
                                                                   name='train_operation')  # Adam instead of SGD
    
        # define the accuracy
        with tf.name_scope('accuracy'):
            correct_pred = tf.equal(tf.argmax(y, 1), tf.argmax(y_label, 1), name='condition')
            accuracy = tf.reduce_mean(tf.cast(correct_pred, tf.float32), name='accuracy')
    
        return x, y, keep_prob, y_label, train_step, accuracy, global_step
    
    def train():
    
        # flag that ensures the meta graph is saved only once
        a = False
        x, y, keep_prob, y_label, train_step, accuracy, global_step = network()
    
        sess.run(tf.global_variables_initializer())
        saver = tf.train.Saver(max_to_keep=3)
    
        if FLAGS.use_model:
            model_t = tf.train.latest_checkpoint(FLAGS.model_path)
            saver.restore(sess, model_t)
    
        for i in range(FLAGS.max_iter_step):
            batch = mnist.train.next_batch(FLAGS.batch_size)  # one batch of FLAGS.batch_size examples
            if i % 100 == 0:
                # evaluate training accuracy on the current batch
                train_accuracy = sess.run(accuracy, feed_dict={x: batch[0], y_label: batch[1], keep_prob: 1.0})
                print("step {step}, training accuracy {acc}".format(step=i, acc=train_accuracy))
                if (train_accuracy > 0.5):
                    if not a:
                        saver.export_meta_graph(FLAGS.model_path + FLAGS.meta_graph_name)
                        a = True
                    saver.save(sess, FLAGS.model_path + FLAGS.model_name, global_step=global_step, write_meta_graph=False)
            sess.run(train_step, feed_dict={x: batch[0], y_label: batch[1], keep_prob: FLAGS.keep_drop})
    
    def test():
    
        if FLAGS.use_model:
            with tf.Session() as sess:
                saver = tf.train.import_meta_graph(FLAGS.model_path + FLAGS.meta_graph_name)
                saver.restore(sess, tf.train.latest_checkpoint(FLAGS.model_path))
    
                graph = tf.get_default_graph()
    
    
                # one operation can have several outputs, so you must specify which one, e.g. "name:0"
                x = graph.get_tensor_by_name("input_placeholder/x:0")
                y_label = graph.get_tensor_by_name("input_placeholder/y_label:0")
                keep_prob = graph.get_tensor_by_name("drop_out/drop_out_placeholder:0")
                accuracy = graph.get_tensor_by_name("accuracy/accuracy:0")
    
                feed_dict = {x: mnist.test.images,
                             y_label: mnist.test.labels,
                             keep_prob: 1.0}
    
                acc = sess.run(accuracy, feed_dict=feed_dict)
                print("test accuracy {acc:.4f}".format(acc=acc))
    
    def save_pb_file():
    
        if FLAGS.use_model:
            saver = tf.train.import_meta_graph(FLAGS.model_path + FLAGS.meta_graph_name)
    
    
            model_t = tf.train.latest_checkpoint(FLAGS.model_path)
            saver.restore(sess, model_t)
    
            graphdef = tf.get_default_graph().as_graph_def()
    
            frozen_graph = tf.graph_util.convert_variables_to_constants(sess, graphdef, ['fc2/y_out'])
    
            return tf.graph_util.remove_training_nodes(frozen_graph)
        else:
            return False
    
    def main():
        if FLAGS.is_train:
            train()
        elif FLAGS.is_test:
            test()
        else:
            graph_def = save_pb_file()
    
            if graph_def is False:
                raise ValueError("The meta graph does not exist!!!")
    
            output_file = './graph.pb'
            with tf.gfile.GFile(name=output_file, mode='wb') as f:
                s = graph_def.SerializeToString()
                f.write(s)
    
    if __name__ == '__main__':
        try:
            main()
        except (ValueError, IndexError) as ve:
            print(ve)

    That wraps up today's TensorFlow model saving and loading, and the merging of the three files used during training into a single pb file. This pb file can be used not only to build a TensorRT network, but also to deploy the model with TensorFlow Serving.
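    As a pointer toward that TensorRT step, here is a minimal sketch of the pb-to-uff conversion (an assumption: it relies on the uff Python package shipped with TensorRT, and on the same output node name used when freezing):

    # Sketch: convert the frozen pb into a uff file for TensorRT (requires the 'uff' package).
    import uff

    uff.from_tensorflow_frozen_model(
        './graph.pb',                   # the frozen graph written above
        output_nodes=['fc2/y_out'],     # same output node used when freezing
        output_filename='./graph.uff')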
