Creating TFRecords, with a detailed look at tf.train.shuffle_batch and Dataset

亚古兽要进化

Last modified: 2022-03-18 10:20:16


First published: 2018-04-10 15:35:44

Article link: https://blog.csdn.net/qq26983255/article/details/79880772


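Installing TensorFlow-GPU:

Installing the GPU build of TensorFlow has its pitfalls. The main one is that each TensorFlow version requires specific CUDA and cuDNN versions, and the three are tightly coupled. Every time I reinstall I still end up looking up the steps again, so here are the blog posts and documents I referred to at the time: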

Installing TensorFlow-GPU 1.13.1 on Ubuntu 18.04 with an NVIDIA 2080 (毛虫小臭臭's tech blog, 51CTO)

Deep Learning Applications Series (1): Installing the GPU build of TensorFlow 1.10 on Ubuntu 18.04 (可可心心, cnblogs)

Notes on installing TensorFlow GPU on Ubuntu 18.04 (王玉成's blog, CSDN)

Build from source | TensorFlow

Fixing the "X service error" when installing the NVIDIA graphics driver on Ubuntu (lien0906's blog, CSDN)

Installing TensorFlow (GPU build) on Ubuntu (Louiseluke's blog, CSDN)

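Installing CUDA and cuDNN on Windows is much more convenient, and there are plenty of tutorials online, including conda-based ones. For example, TensorFlow 1.14 and PyTorch can be installed directly with the commands below.

For CUDA 10.2 and cuDNN 7.x:

conda install pytorch torchvision cudatoolkit=10.2 -c pytorch
Or, if a domestic mirror channel has been added to conda:
conda install pytorch torchvision cudatoolkit=10.2
TensorFlow:
pip can also be used directly inside a conda environment, because conda installs pip along with the environment. Note that the official tensorflow-gpu 1.14 wheels were built against CUDA 10.0, so check TensorFlow's tested build configurations if the GPU is not detected.
pip install tensorflow-gpu==1.14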

1. How to create a TFRecord:

Code first:

import tensorflow as tf


def test():
    writer = tf.python_io.TFRecordWriter("/Users/szx/Desktop/train.tfrecords")
    for i in range(100):
        # Each record holds two int64 features: label = i and data_raw = i * 100.
        example = tf.train.Example(features=tf.train.Features(feature={
                "label": tf.train.Feature(int64_list=tf.train.Int64List(value=[i])),
                'data_raw': tf.train.Feature(int64_list=tf.train.Int64List(value=[i*100]))
            }))
        writer.write(example.SerializeToString())
    # Close the writer so that all records are flushed to disk.
    writer.close()

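The code is deliberately simple so that the two functions discussed later are easy to explain. Normally data_raw would be an image converted to bytes. Here it is an integer, just like label, so that the printed values are easy to inspect. The label is stored in sparse form. As an aside, here is what sparse and non-sparse label representations mean:

Take MNIST recognition, the entry-level deep learning task. If the ground truth of the two images in a batch is 2 and 5, the sparse representation is [2, 5], while the non-sparse (one-hot) representation is [[0, 0, 1, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 1, 0, 0, 0, 0]]. So whether to use tf.nn.softmax_cross_entropy_with_logits or tf.nn.sparse_softmax_cross_entropy_with_logits for the softmax loss depends on how the ground truth was stored.

For the APIs that convert between the two forms, see tf.one_hot() and tf.sparse_to_dense().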
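To make the difference concrete, here is a minimal pure-Python sketch (no TensorFlow involved). The helper names one_hot, softmax, sparse_cross_entropy and dense_cross_entropy are made up for illustration. They only mimic what the two loss functions compute for a single example, and the logits are arbitrary toy values.

```python
import math


def one_hot(indices, depth):
    """Turn sparse class indices such as [2, 5] into one-hot rows."""
    return [[1.0 if k == idx else 0.0 for k in range(depth)] for idx in indices]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sparse_cross_entropy(logits, label_index):
    """Loss computed from an integer label (the 'sparse' form)."""
    return -math.log(softmax(logits)[label_index])


def dense_cross_entropy(logits, label_row):
    """Loss computed from a one-hot row (the 'non-sparse' form)."""
    probs = softmax(logits)
    return -sum(t * math.log(p) for t, p in zip(label_row, probs))


labels = [2, 5]                        # sparse form
dense = one_hot(labels, depth=10)      # non-sparse form
logits = [0.1 * k for k in range(10)]  # arbitrary toy logits

for idx, row in zip(labels, dense):
    # Both forms describe the same label, so both losses agree.
    print(idx, row, sparse_cross_entropy(logits, idx), dense_cross_entropy(logits, row))
```

The two representations carry the same information, so the choice only decides which loss function accepts the labels without conversion.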

2. Reading a TFRecord

The code is as follows:

def test_read(filename):
    filename_queue = tf.train.string_input_producer([filename])
    reader = tf.TFRecordReader()
    _, serialized_example = reader.read(filename_queue)
    features = tf.parse_single_example(serialized_example,
                                       features={
                                           'label': tf.FixedLenFeature([], tf.int64),
                                           'data_raw' : tf.FixedLenFeature([], tf.int64),
                                       })

    data_raw = tf.cast(features['data_raw'], tf.int32)
    label = tf.cast(features['label'], tf.int32)

    return data_raw, label

Note that the features argument of tf.parse_single_example() must match, key for key and type for type, what was written when the file was created.

if __name__ == '__main__':

    data_raw, label = test_read("/Users/szx/Desktop/train.tfrecords")
    # data_raw_batch, label_batch = tf.train.shuffle_batch([data_raw, label],
    #                                                 batch_size=1, capacity=3,
    #                                                 min_after_dequeue=2
    #                                                 )
    data_raw_batch, label_batch = tf.train.batch([data_raw, label],
                                                    batch_size=5, capacity=50,
                                                    )
    init = tf.global_variables_initializer()
    with tf.Session() as sess:
        sess.run(init)
        threads = tf.train.start_queue_runners(sess=sess)
        for i in range(100):
            data, l = sess.run([data_raw_batch, label_batch])
            print((data, l))
            print(i)

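A closer look at the two functions tf.train.batch() and tf.train.shuffle_batch()

tf.train.batch() is the easier one to understand: capacity is the capacity of the queue and batch_size is the size of one batch. Batches are dequeued in order, and the data inside a batch is not shuffled. Running it prints: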

(array([  0, 100, 200, 300, 400], dtype=int32), array([0, 1, 2, 3, 4], dtype=int32))
0
(array([500, 600, 700, 800, 900], dtype=int32), array([5, 6, 7, 8, 9], dtype=int32))
1
(array([1000, 1100, 1200, 1300, 1400], dtype=int32), array([10, 11, 12, 13, 14], dtype=int32))

....... 

.......

tf.train.shuffle_batch() shuffles the data inside the queue, so each batch contains records in random order. Running it prints:

(array([   0,  300, 2600,  200, 1200], dtype=int32), array([ 0,  3, 26,  2, 12], dtype=int32))
0
(array([2400, 2200, 3300, 3800, 4400], dtype=int32), array([24, 22, 33, 38, 44], dtype=int32))
1
(array([4800, 2800, 3200, 1000, 3700], dtype=int32), array([48, 28, 32, 10, 37], dtype=int32))

......

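This function has an extra parameter, min_after_dequeue. I am not entirely sure how best to explain it, but two things are certain: its value must be smaller than capacity, and the larger it is, the more thoroughly the data is shuffled. The official description is: the minimum number of elements that must remain in the queue after a dequeue. Many sources suggest choosing the values so that capacity = min_after_dequeue + 3 * batch_size.

Reading multiple TFRecord files

The example above uses a single TFRecord file. If for whatever reason the data cannot be written to one file and is split across several, and training needs all of it, just put the file paths in a list: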
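The effect of min_after_dequeue can be illustrated with a small pure-Python simulation. This is a simplified model under stated assumptions, not TensorFlow's implementation: simulate_shuffle_queue is a hypothetical helper that assumes a slow producer which only ever keeps min_after_dequeue + 1 elements waiting, so each dequeue picks at random from exactly that many candidates. recommended_capacity simply encodes the rule of thumb quoted above.

```python
import random


def recommended_capacity(min_after_dequeue, batch_size):
    # Rule of thumb from the text: capacity = min_after_dequeue + 3 * batch_size.
    return min_after_dequeue + 3 * batch_size


def simulate_shuffle_queue(items, min_after_dequeue, seed=0):
    """Simplified worst-case model of a shuffling queue (an assumption,
    not the real queue): the producer refills only until more than
    min_after_dequeue elements are waiting, then one is dequeued at random."""
    rng = random.Random(seed)
    source = iter(items)
    queue, output = [], []
    exhausted = False
    while True:
        # Refill until more than min_after_dequeue elements are waiting.
        while not exhausted and len(queue) <= min_after_dequeue:
            try:
                queue.append(next(source))
            except StopIteration:
                exhausted = True
        if not queue:
            return output
        # Dequeue one element chosen uniformly at random.
        output.append(queue.pop(rng.randrange(len(queue))))


in_order = simulate_shuffle_queue(range(20), min_after_dequeue=0)
mixed = simulate_shuffle_queue(range(20), min_after_dequeue=10)
print(in_order)  # with nothing left to choose from, the order is unchanged
print(mixed)     # a larger pool of candidates gives a more thorough shuffle
print(recommended_capacity(min_after_dequeue=10, batch_size=5))
```

In this model an element can move forward by at most min_after_dequeue positions, which is one way to read the claim that a larger value gives more mixing.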

def test_read():
    tf_file1 = './1.tfrecord'
    tf_file2 = './2.tfrecord'
    tf_file3 = './3.tfrecord'
    filename_queue = tf.train.string_input_producer([tf_file1,tf_file2,tf_file3])
    reader = tf.TFRecordReader()
    _, serialized_example = reader.read(filename_queue)
    features = tf.parse_single_example(serialized_example,
                                       features={
                                           'label': tf.FixedLenFeature([], tf.int64),
                                           'data_raw' : tf.FixedLenFeature([], tf.int64),
                                       })

    data_raw = tf.cast(features['data_raw'], tf.int32)
    label = tf.cast(features['label'], tf.int32)

    return data_raw, label

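Here tf_file1, tf_file2 and tf_file3 are the paths of the TFRecord files.

3. Creating a TFRecord file from a NumPy matrix

There are plenty of blog posts on turning images into TFRecord files. But if you simply load an image as a tensor or matrix of shape [height, width, channels] and try to write it as is, you will get a data type error.

From the problems I ran into when writing matrix data to TFRecord files, one point deserves emphasis: when the features returned by tf.parse_single_example() are passed to tf.train.shuffle_batch, their shapes must be fully specified. Otherwise you get an error like ValueError: All shapes must be fully defined: .................. . I mark the relevant line in the program below. The consequence is that if the matrices written to the TFRecord do not share one shape (unlike images, whose height, width and channels are usually fixed), there is no single shape to reshape to before tf.train.shuffle_batch. I have not found a solution to this yet. If you have one, please leave a comment.

The full project code is as follows: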

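import csv
import os
import random

import numpy as np
import tensorflow as tf


# FLAGS, frame_iterator, _int64_feature and _bytes_feature are assumed to be
# defined elsewhere in the project.
def make_tfrecords():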
    writer = tf.python_io.TFRecordWriter(FLAGS.output_tfrecords_file)
    pca_mean = np.load(os.path.join(FLAGS.model_dir, 'mean.npy'))[:, 0]
    pca_eigenvals = np.load(os.path.join(FLAGS.model_dir, 'eigenvals.npy'))[:1024, 0]
    pca_eigenvecs = np.load(os.path.join(FLAGS.model_dir, 'eigenvecs.npy')).T[:, :1024]
    inception_proto_file = os.path.join(FLAGS.model_dir, 'classify_image_graph_def.pb')
    graph_def = tf.GraphDef.FromString(open(inception_proto_file, 'rb').read())
    min_quantized_value = -2
    max_quantized_value = 2
    max_num_frame = 20
    with tf.Graph().as_default() as g1:

        _ = tf.import_graph_def(graph_def, name='')
        sess = tf.Session()
        Frame_Features = sess.graph.get_tensor_by_name('pool_3/_reshape:0')
        Pca_Mean = tf.constant(value=pca_mean, dtype=tf.float32)
        Pca_Eigenvecs = tf.constant(value=pca_eigenvecs, dtype=tf.float32)
        Pca_Eigenvals = tf.constant(value=pca_eigenvals, dtype=tf.float32)
        Feats = Frame_Features[0] - Pca_Mean
        Feats = tf.reshape(tf.matmul(tf.reshape(Feats, [1, 2048]), Pca_Eigenvecs), [1024, ])
        features = tf.divide(Feats, tf.sqrt(Pca_Eigenvals + 1e-4), name='pca_final_feature')
        total_written = 0
        for video_file, labels in csv.reader(open(FLAGS.input_videos_csv)):
            rgb_features = []
            for rgb in frame_iterator(video_file):
                out_feature = sess.run(features, feed_dict={'DecodeJpeg:0': rgb[:, :, ::-1]})
                print(out_feature)
                rgb_features.append(out_feature)
            print(rgb_features)
            mat_features = rgb_features
            rgb_features = np.reshape(rgb_features,[-1,1024])
            num_frame = np.shape(rgb_features)[0]
            if num_frame < max_num_frame:
                for i in range(int(max_num_frame) - int(num_frame)):
                    random_num = random.sample(range(0, num_frame), 1)
                    mat_features.append(rgb_features[random_num[0]])
            else:
                # The frame reader can already cap the number of frames, so this branch is rarely taken.
                mat_features = rgb_features[0:max_num_frame]
            mat_features = np.reshape(mat_features,[-1,1024])
            print(mat_features)
            print(np.shape(mat_features))
            feature = {'labels': _int64_feature(int(labels)),
                       'feature': _bytes_feature(tf.compat.as_bytes(mat_features.tostring()))}
            # Create an Example protocol buffer
            example = tf.train.Example(features=tf.train.Features(feature=feature))
            writer.write(example.SerializeToString())
        writer.close()
def reader_tfrecord():
    feature = {
        'feature': tf.FixedLenFeature([], tf.string),
        'labels': tf.FixedLenFeature([], tf.int64)
    }
    filename_queue = tf.train.string_input_producer([FLAGS.output_tfrecords_file])
    reader = tf.TFRecordReader()
    _, serialized_example = reader.read(filename_queue)
    features = tf.parse_single_example(serialized_example, features=feature)
    feature_frame = tf.decode_raw(features['feature'], tf.float32)
    label = tf.cast(features['labels'], tf.int32)
    # Convert the sparse class index into a dense boolean vector of length 101
    label = (tf.cast(
        tf.sparse_to_dense(label, (101,), 1,
                           validate_indices=False),
        tf.bool))
    # Give the tensor a fully defined shape before it goes into shuffle_batch
    feature_frame = tf.reshape(feature_frame, [20, 1024])
    images, labels = tf.train.shuffle_batch([feature_frame, label], batch_size=2, capacity=5, min_after_dequeue=2)
    return images,labels
if __name__ == '__main__':
    feature_bn, labels_bn = reader_tfrecord()
    with tf.Session() as sess:  # start a session
        init_op = tf.global_variables_initializer()
        sess.run(init_op)
        coord = tf.train.Coordinator()
        threads = tf.train.start_queue_runners(coord=coord)
        for i in range(4):
            example, l = sess.run([feature_bn,labels_bn])  # fetch one batch of features and labels
            print(np.shape(example))
            print(np.shape(l))
        coord.request_stop()
        coord.join(threads)

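In brief, the program feeds frames extracted from a video into the Inception network, takes the 2048-dimensional feature vector from the pool_3 layer, and reduces it to 1024 dimensions with PCA. Because videos have different numbers of frames, the feature set of each video would otherwise have a different shape. Since this is only sample code, I arbitrarily cap each video at 20 frames. If a video has fewer, feature vectors are randomly re-sampled from its own feature set to pad it to 20.

_bytes_feature() and _int64_feature() are two helpers that wrap values in TFRecord feature types. You can write them yourself following this pattern: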
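The padding rule described above can be isolated into a small pure-Python sketch. fix_frame_count is a hypothetical helper name, and the 4-dimensional toy vectors stand in for the real 1024-dimensional features.

```python
import random


def fix_frame_count(frames, max_num_frame=20, seed=None):
    """Return exactly max_num_frame feature vectors.

    Short videos are padded with randomly re-sampled frames from the same
    video; long videos are truncated to the first max_num_frame frames.
    """
    if not frames:
        raise ValueError("need at least one frame")
    rng = random.Random(seed)
    fixed = list(frames[:max_num_frame])
    while len(fixed) < max_num_frame:
        fixed.append(rng.choice(frames))
    return fixed


short_video = [[float(i)] * 4 for i in range(3)]   # 3 frames of toy features
long_video = [[float(i)] * 4 for i in range(25)]   # 25 frames of toy features

padded = fix_frame_count(short_video, seed=0)
truncated = fix_frame_count(long_video)
print(len(padded), len(truncated))
```

Every video then yields the same number of vectors, which is what makes the fixed reshape to [20, 1024] in the reader possible.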

file_record = tf.train.Example(features=tf.train.Features(feature={

                    'image_raw': tf.train.Feature(bytes_list=tf.train.BytesList(value=[image_raw])),
                    'label_raw': tf.train.Feature(int64_list=tf.train.Int64List(value=[int(j)]))

                }))

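As far as I can tell, the only difference from writing images is the line tf.compat.as_bytes(mat_features.tostring()). (In newer NumPy versions, tostring() is a deprecated alias of tobytes().)

4. Dataset

Dataset can be seen as an upgrade of the TensorFlow data-handling mechanisms introduced above. Its class structure was shown in a diagram here:

[Figure: class diagram of tf.data.Dataset]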
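The reason the reader has to name a dtype and reshape after decoding is that raw bytes carry neither. Here is a minimal sketch using only the standard-library array module as a stand-in for NumPy's tostring() and tf.decode_raw. The toy 2 x 3 matrix is made up for illustration.

```python
from array import array

rows, cols = 2, 3
matrix = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]

# Flatten to float32 and serialize: the bytes carry no shape or dtype.
flat = array('f', [v for row in matrix for v in row])
raw = flat.tobytes()
print(len(raw) == rows * cols * flat.itemsize)

# Decoding needs the original dtype ('f') and the original shape.
decoded = array('f')
decoded.frombytes(raw)
restored = [list(decoded[r * cols:(r + 1) * cols]) for r in range(rows)]
print(restored == matrix)

# Decoding with the wrong dtype silently yields a different element count.
wrong = array('d')
wrong.frombytes(raw)
print(len(decoded), len(wrong))
```

The same holds for the TFRecord: the writer and the reader have to agree on float32 and on the [20, 1024] shape, because neither is stored in the bytes feature.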

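It offers four ways to create a dataset:

1. tf.data.Dataset.from_tensor_slices(): reads data directly from memory. The data can be arrays, matrices, dicts and so on.

2. tf.data.TFRecordDataset(): reads TFRecord files. Each element of the dataset is one serialized tf.train.Example.

3. tf.data.TextLineDataset(): takes a list of files and returns a dataset in which each element is one line of a file. It can be used to read CSV files.

4. tf.data.FixedLengthRecordDataset(): takes a list of files and a record_bytes value. Each element of the dataset is a chunk of exactly record_bytes bytes from a file. It is typically used for data stored in binary form, such as the CIFAR-10 dataset.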
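To show what "fixed-length records" means, here is a pure-Python sketch that cuts a byte stream into equal-sized chunks. read_fixed_length_records is a hypothetical helper, not a TensorFlow API. The record layout follows the binary version of CIFAR-10 (1 label byte followed by 32 x 32 x 3 image bytes), and the in-memory file holds made-up data.

```python
import io

LABEL_BYTES = 1
IMAGE_BYTES = 32 * 32 * 3
RECORD_BYTES = LABEL_BYTES + IMAGE_BYTES


def read_fixed_length_records(stream, record_bytes):
    """Yield consecutive chunks of exactly record_bytes bytes."""
    while True:
        chunk = stream.read(record_bytes)
        if len(chunk) < record_bytes:
            return  # end of file (an incomplete trailing chunk is dropped)
        yield chunk


# Fake file with three records: labels 3, 7, 1 and all-zero images.
fake_file = io.BytesIO(
    b"".join(bytes([label]) + bytes(IMAGE_BYTES) for label in (3, 7, 1))
)

records = list(read_fixed_length_records(fake_file, RECORD_BYTES))
labels = [record[0] for record in records]
print(labels)  # [3, 7, 1]
```

Each chunk still has to be split into label and image afterwards, which mirrors the parsing step that a map function would perform on the real dataset.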

The focus here is on the second one, tf.data.TFRecordDataset():

filenames = "file1.tfrecord"
dataset = tf.data.TFRecordDataset(filenames)

This is effectively a replacement for tf.train.batch() and tf.train.shuffle_batch(). As with those two functions, the first step is to parse each serialized record with tf.parse_single_example():

def parse_exmp(serial_exmp):
    feats = tf.parse_single_example(serial_exmp, features={
        'feature': tf.FixedLenFeature([], tf.string),
        'label': tf.FixedLenFeature([10], tf.float32)})
    image = tf.decode_raw(feats['feature'], tf.float32)
    label = feats['label']
    return image, label

Then apply the parser to every record with the dataset's map method:

dataset = dataset.map(parse_exmp)

Next, set parameters such as the number of epochs and the batch size, which is necessary when looping over the dataset repeatedly.

dataset = dataset.repeat(epochs)
dataset = dataset.shuffle(buffer_size)
dataset = dataset.batch(batch_size, drop_remainder=True)

# Or set everything in one chain
dataset = dataset.repeat(epochs).shuffle(buffer_size).batch(batch_size)
# According to what I have read, the position of repeat in the chain affects the batches that are produced:

dataset_train.repeat(epochs).shuffle(1000).batch(batch_size)
# Make sure repeat comes before batch.
# This differs from dataset.shuffle(1000).batch(batch_size).repeat(epochs):
# with the latter, every epoch ends with a batch smaller than batch_size
# whenever nDatas is not divisible by batch_size.

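shuffle controls how thoroughly the data is mixed, and serves the same purpose as tf.train.shuffle_batch earlier. repeat sets the number of epochs to loop over. With no argument, dataset = dataset.repeat() loops indefinitely, in which case the number of iterations is controlled from outside, for example by the training loop.

Next, create the iterator: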
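The effect of placing repeat before or after batch can be reproduced with plain Python lists. The batch and repeat functions below are hypothetical stand-ins that only imitate the chunking behaviour of the Dataset methods, with shuffling left out to keep the output readable. The 10 items, batch size 4 and 2 epochs are toy values.

```python
def batch(items, batch_size, drop_remainder=False):
    """Split items into consecutive chunks of batch_size."""
    items = list(items)
    out = [items[i:i + batch_size] for i in range(0, len(items), batch_size)]
    if drop_remainder and out and len(out[-1]) < batch_size:
        out.pop()
    return out


def repeat(items, epochs):
    """Concatenate epochs copies of items."""
    return [x for _ in range(epochs) for x in items]


data = list(range(10))  # 10 samples, not divisible by a batch size of 4

repeat_then_batch = batch(repeat(data, 2), 4)
batch_then_repeat = repeat(batch(data, 4), 2)

print([len(b) for b in repeat_then_batch])  # [4, 4, 4, 4, 4]
print([len(b) for b in batch_then_repeat])  # [4, 4, 2, 4, 4, 2]
print(repeat_then_batch[2])                 # [8, 9, 0, 1]
```

With repeat first, all batches are full but one of them straddles the epoch boundary. With batch first, epochs stay separate and each one ends with a short batch unless drop_remainder is set.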

iterator = dataset.make_one_shot_iterator()
batch_image, batch_label = iterator.get_next()

The data can then be read inside a session:

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for i in range(100):  # number of mini-batch (step)
        print("Step %d" % i)
        [batch_image_r, batch_label_r] = sess.run([batch_image, batch_label])
