Recently I have been working with TFRecord for storing and reading data, and along the way I fell into more pitfalls than I can count. This post records the problems I ran into while debugging, in the hope that it saves others the same trouble.
TFRecord is a binary file format recommended by Google; in principle it can store information of any format. Using TFRecord means first reading the raw data, converting it to the TFRecord format, and storing it on disk; later, the data can be decoded and read back from the TFRecord file.
A TFRecords file contains protocol buffer blocks of type tf.train.Example, and each block contains a features field (tf.train.Features). features in turn contains a number of feature entries. Each feature is a map, i.e. a key-value pair: the key is a String, and the value is a Feature message which comes in three kinds, BytesList, FloatList and Int64List, all of them lists. Compare the functions int64_feature and int64_list_feature below: the crucial difference is that the former uses value=[value] while the latter uses value=value, where [] denotes a list.
def int64_feature(value):
    return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))

def int64_list_feature(value):
    return tf.train.Feature(int64_list=tf.train.Int64List(value=value))

def bytes_feature(value):
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))
example = tf.train.Example(features=tf.train.Features(feature={
    'label': int64_list_feature(image_label),
    'image': bytes_feature(image),
    'h': int64_feature(shape[0]),
    'w': int64_feature(shape[1]),
    'c': int64_feature(shape[2])
}))
The Example message defined above holds one image's data (image), its label (label), and its shape (height, width, channel). What differs from most blog posts is the 'label' field. Typically a data label is a single integer, e.g. for cat/dog images '0' means cat and '1' means dog, and even multi-class labels can be encoded as 0 to N. Our image labels, however, are strings of Chinese or English characters of varying length: each character is looked up in a dictionary and its index appended to a list. For handling this kind of label, two solutions are given here (both of them pitfalls I stepped into):
- Solution one converts the label list into a one-hot-style form. Instead of using TensorFlow's tf.one_hot, define the function yourself, so that each label becomes a vector of vocabulary size. When reading, declare the feature as 'label': tf.FixedLenFeature([VOCUBLARY_SIZE], tf.int64); omitting the size VOCUBLARY_SIZE raises an error.
- Solution two targets the requirement of tf.nn.ctc_loss that its labels argument be a SparseTensor. The vector from solution one is only one-hot-like, not a sparse tensor, so passing it directly as labels still raises an error. To obtain a sparse tensor, store the raw index list and, when reading, declare the feature as 'label': tf.VarLenFeature(dtype=tf.int64). This variable-length read yields a SparseTensor directly.
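Solution one's hand-rolled one-hot step can be sketched in plain Python (a minimal sketch; the function name and the tiny vocabulary are illustrative, not from the original code):

```python
def label_to_multi_hot(label, vocabulary):
    """Convert a string label into a vocabulary-sized 0/1 vector.

    Each character is looked up in `vocabulary` (char -> index) and the
    corresponding slot is set to 1, giving a fixed-length vector that can
    be stored with int64_list_feature and read back with
    tf.FixedLenFeature([VOCUBLARY_SIZE], tf.int64).
    """
    vector = [0] * len(vocabulary)
    for char in label:
        vector[vocabulary[char]] = 1
    return vector

vocab = {c: i for i, c in enumerate('abcde')}
print(label_to_multi_hot('ace', vocab))  # [1, 0, 1, 0, 1]
```

Note that this encoding discards character order and repetition, which is one more reason the sparse index list of solution two is the better fit for CTC.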
Another special point is how the image shape is stored. Notice that shape is not stored as a whole but split into components, because the images differ in size, and storing shape directly as a fixed-size feature likewise raises an error. Hence the three key-value pairs h, w, c. When reading the image, these three values are used to reshape it. If the images were all the same size, shape could simply be hard-coded, e.g. shape=[224, 224, 3].
h = tf.cast(image_features['h'], tf.int32)
w = tf.cast(image_features['w'], tf.int32)
c = tf.cast(image_features['c'], tf.int32)
image = tf.decode_raw(image_features['image'], tf.uint8)
image = tf.cast(image, tf.float32)
image = tf.reshape(image, shape=[h, w, c])
Finally, on the image side: the images come in different sizes, and the goal is to resize every image to the same height while leaving the width variable. Because the data set is large, the program reads it in batches, and every image within a batch must have the same size, so leaving the images unprocessed also raises an error.
- In the first case, the images I encountered share the same height but vary in width. No resize is needed when storing; when reading, after reshaping each image, pad it with resized_image = tf.image.resize_image_with_crop_or_pad(image, target_height=32, target_width=max_width). Although this op can also crop, cropping would hurt the results, so max_width should be set large enough to cover the widest image. After a final resized_image = tf.reshape(resized_image, shape=[32, max_width, 3]) the data can be batched.
- In the second case, the images differ in both height and width, so before storing they must be resized proportionally; the rest is handled as in the first case.
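The size arithmetic behind both cases can be sketched in plain Python (the function names are illustrative; target_height=32 and max_width match the parameters used above):

```python
def scaled_width(orig_w, orig_h, target_height=32):
    """Width after the proportional resize that fixes the height at
    target_height (the same ratio = 32.0 / height used in resize_image)."""
    return int(orig_w * (target_height / float(orig_h)))

def total_padding(scaled_w, max_width=250):
    """Total width padding added by resize_image_with_crop_or_pad
    (split between the two sides); a negative value would mean cropping,
    so max_width must cover the widest scaled image."""
    return max_width - scaled_w

print(scaled_width(200, 64))   # 100
print(total_padding(100))      # 150
```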
As for the step-by-step procedure for writing and reading TFRecord files, many blogs already describe it in detail, so I will not repeat it; the complete code is pasted below. One caveat: the TFRecord file this code produces is about 10 times larger than the original image files, roughly 22G generated from about 2.2G of originals. The explanation found online: an image has h*w*c pixels, and converting it to bytes stores those pixels in sequence in a binary string, one byte per pixel, so the stored file grows (the compression of the original image file is lost). As a remedy, TensorFlow provides the tf.gfile.FastGFile class, which reads the image file's bytes directly:
tf.gfile.FastGFile(filename, 'rb').read()
'r' means reading from the file and 'b' means reading binary data. Given the complexity of our data, however, I did not pursue this approach further.
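The blow-up is easy to confirm with back-of-the-envelope arithmetic (a plain-Python sketch; the 32x250 size is the padded shape used above, and the JPEG comparison is an illustrative assumption):

```python
def raw_image_bytes(h, w, c=3):
    """Bytes needed to store an image as the uint8 pixel string produced
    by tobytes() and read back with tf.decode_raw: one byte per pixel
    per channel, with no compression."""
    return h * w * c

# A 32x250 RGB image stored raw:
print(raw_image_bytes(32, 250))   # 24000
```

A JPEG of the same image is typically only a few kilobytes, so a roughly 10x growth, 2.2G to 22G, is about what the uncompressed encoding predicts.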
import tensorflow as tf
from PIL import Image
import numpy as np
import os
import random
from config import CHAR_VECTOR
from config import NUM_EXAMPLES_PER_EPOCH_FOR_TRAIN
from config import NUM_EXAMPLES_PER_EPOCH_FOR_TEST
VOCUBLARY_SIZE = len(CHAR_VECTOR)
def resize_image(image):
    '''resize the size of image'''
    width, height = image.size
    ratio = 32.0 / float(height)
    image = image.resize((int(width * ratio), 32))
    return image
def generation_vocublary(CHAR_VECTOR):
    # map each character to its index in CHAR_VECTOR
    vocublary = {}
    for index, char in enumerate(CHAR_VECTOR):
        vocublary[char] = index
    return vocublary
def int64_feature(value):
    return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))

def int64_list_feature(value):
    return tf.train.Feature(int64_list=tf.train.Int64List(value=value))

def bytes_feature(value):
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))
def generation_TFRecord(data_dir):
    vocublary = generation_vocublary(CHAR_VECTOR)

    image_name_list = []
    for file in os.listdir(data_dir):
        if file.endswith('.jpg'):
            image_name_list.append(file)
    random.shuffle(image_name_list)
    capacity = len(image_name_list)

    # write the train tfrecord file
    train_writer = tf.python_io.TFRecordWriter('./dataset/train_dataset.tfrecords')
    train_image_name_list = image_name_list[0:int(capacity * 0.9)]
    for train_name in train_image_name_list:
        # the label is the file name without its extension; note that
        # strip('.jpg') would be wrong here, since str.strip removes any
        # of the characters '.', 'j', 'p', 'g' from both ends
        train_image_label = []
        for s in os.path.splitext(train_name)[0]:
            train_image_label.append(vocublary[s])

        train_image = Image.open(os.path.join(data_dir, train_name))
        train_image = resize_image(train_image)
        train_image_array = np.asarray(train_image, np.uint8)
        train_shape = np.array(train_image_array.shape, np.int32)
        train_image = train_image.tobytes()

        train_example = tf.train.Example(features=tf.train.Features(feature={
            'label': int64_list_feature(train_image_label),
            'image': bytes_feature(train_image),
            'h': int64_feature(train_shape[0]),
            'w': int64_feature(train_shape[1]),
            'c': int64_feature(train_shape[2])
        }))
        train_writer.write(train_example.SerializeToString())
    train_writer.close()

    # write the test tfrecord file
    test_writer = tf.python_io.TFRecordWriter('./dataset/test_dataset.tfrecords')
    test_image_name_list = image_name_list[int(capacity * 0.9):capacity]
    for test_name in test_image_name_list:
        test_image_label = []
        for s in os.path.splitext(test_name)[0]:
            test_image_label.append(vocublary[s])

        test_image = Image.open(os.path.join(data_dir, test_name))
        test_image = resize_image(test_image)
        test_image_array = np.asarray(test_image, np.uint8)
        test_shape = np.array(test_image_array.shape, np.int32)
        test_image = test_image.tobytes()

        test_example = tf.train.Example(features=tf.train.Features(feature={
            'label': int64_list_feature(test_image_label),
            'image': bytes_feature(test_image),
            'h': int64_feature(test_shape[0]),
            'w': int64_feature(test_shape[1]),
            'c': int64_feature(test_shape[2])
        }))
        test_writer.write(test_example.SerializeToString())
    test_writer.close()
def read_tfrecord(filename, max_width, batch_size, train=True):
    filename_queue = tf.train.string_input_producer([filename])
    reader = tf.TFRecordReader()
    _, serialize_example = reader.read(filename_queue)
    image_features = tf.parse_single_example(serialized=serialize_example,
                                             features={
                                                 # 'label': tf.FixedLenFeature([VOCUBLARY_SIZE], tf.int64),
                                                 'label': tf.VarLenFeature(dtype=tf.int64),
                                                 'image': tf.FixedLenFeature([], tf.string),
                                                 'h': tf.FixedLenFeature([], tf.int64),
                                                 'w': tf.FixedLenFeature([], tf.int64),
                                                 'c': tf.FixedLenFeature([], tf.int64)
                                             })

    h = tf.cast(image_features['h'], tf.int32)
    w = tf.cast(image_features['w'], tf.int32)
    c = tf.cast(image_features['c'], tf.int32)
    image = tf.decode_raw(image_features['image'], tf.uint8)
    image = tf.cast(image, tf.float32)
    image = tf.reshape(image, shape=[h, w, c])
    resized_image = tf.image.resize_image_with_crop_or_pad(image, target_height=32, target_width=max_width)
    resized_image = tf.reshape(resized_image, shape=[32, max_width, 3])
    label = tf.cast(image_features['label'], tf.int32)

    min_fraction_of_example_in_queue = 0.4
    if train is True:
        min_queue_examples = int(min_fraction_of_example_in_queue * NUM_EXAMPLES_PER_EPOCH_FOR_TRAIN)
        train_image_batch, train_label_batch = tf.train.shuffle_batch([resized_image, label],
                                                                      batch_size=batch_size,
                                                                      capacity=min_queue_examples + 3 * batch_size,
                                                                      min_after_dequeue=min_queue_examples,
                                                                      num_threads=32)
        return train_image_batch, train_label_batch
    else:
        min_queue_examples = int(min_fraction_of_example_in_queue * NUM_EXAMPLES_PER_EPOCH_FOR_TEST)
        test_image_batch, test_label_batch = tf.train.batch([resized_image, label],
                                                            batch_size=batch_size,
                                                            capacity=min_queue_examples + 3 * batch_size,
                                                            num_threads=32)
        return test_image_batch, test_label_batch
def index_to_word(result):
    return ''.join([CHAR_VECTOR[i] for i in result])
def main(argv):
    generation_TFRecord('./dataset/images')
    train_image, train_label = read_tfrecord('./dataset/train_dataset.tfrecords', 250, 32)
    # read the test set with train=False so tf.train.batch is used instead of shuffle_batch
    test_image, test_label = read_tfrecord('./dataset/test_dataset.tfrecords', 250, 32, train=False)

    with tf.Session() as session:
        session.run(tf.group(tf.global_variables_initializer(),
                             tf.local_variables_initializer()))
        coord = tf.train.Coordinator()
        threads = tf.train.start_queue_runners(coord=coord)

        image_train, label_train = session.run([train_image, train_label])
        print(image_train.shape)

        image_test, label_test = session.run([test_image, test_label])
        print(image_test.shape)
        # label_test is a SparseTensorValue; gather each example's label
        # indices by filtering on the batch dimension of its indices array
        for i, image in enumerate(image_test):
            label = label_test.values[label_test.indices[:, 0] == i]
            # convert the float array back to uint8 before building the image
            img = Image.fromarray(image.astype(np.uint8), 'RGB')
            img.save(index_to_word(label) + '.jpg')
            print(index_to_word(label))

        coord.request_stop()
        coord.join(threads=threads)

if __name__ == '__main__':
    tf.app.run()