TensorFlow Basics 3: File Reading

慢慢ss

Categories: python, deep learning

Article link: https://blog.csdn.net/adamyouyou/article/details/89764620

I. File Processing Workflow
[Figure: file processing workflow]
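II. File Processing Steps
1. Build a queue from a list of path + file name strings; this returns a file queue

tf.train.string_input_producer(string_tensor, num_epochs=None, shuffle=True)
string_tensor: a 1-D tensor containing the file names with their paths
num_epochs: how many passes to make over the data; by default it cycles through the data indefinitely
return: a file queue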

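2. Read from the file name queue
By default TensorFlow reads one sample at a time: one line for a text file, a specified number of bytes for a binary file (ideally exactly one sample), one image for an image file, and one example for a TFRecords file.

tf.TextLineReader:
Reads text files in comma-separated values (CSV) format, one line at a time by default
return: a reader instance
tf.WholeFileReader: used to read image files
tf.TFRecordReader:
Reads TFRecords files
tf.FixedLengthRecordReader: binary files
Reads binary files in which every record has a fixed number of bytes
record_bytes: integer, the number of bytes to read each time (one sample)
return: a reader instance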

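For all of the readers above:
1. They share a common method, read(file_queue): it reads the specified amount of content from the queue and returns a tuple of tensors (key: the file name; value: the content, one sample by default)

2. Because only one sample is read by default, you usually want batching. Use tf.train.batch or tf.train.shuffle_batch to collect multiple samples, so that each training step can work on a batch of several samples
3. Decode the content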

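tf.decode_csv: decodes the content of a text file
tf.decode_raw: decodes the content of a binary file
Used together with tf.FixedLengthRecordReader; the binary data is read as uint8
tf.image.decode_jpeg(contents)
Decodes a JPEG-encoded image into a uint8 tensor
return: a uint8 tensor, 3-D shape [height, width, channels]
tf.image.decode_png(contents)
Decodes a PNG-encoded image into a uint8 tensor
return: a tensor, 3-D shape [height, width, channels]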

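Before batching, the shape of every sample must be fixed. For images, for example, use:

tf.image.resize_images(images, size)
Shrinks or enlarges images
images: image data as a 4-D tensor [batch, height, width, channels] or a 3-D tensor [height, width, channels]
size: a 1-D int32 tensor, new_height, new_width: the new size of the images
Returns images in 4-D or 3-D format, matching the input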

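4. Put the samples into a sample queue and batch them

tf.train.batch(tensors, batch_size, num_threads=1, capacity=32, name=None)
Reads a batch of the specified size (number of samples)
tensors: may be a list of tensors; put everything to be batched into this list
batch_size: the number of samples pulled from the queue per batch
num_threads: the number of threads enqueuing samples
capacity: integer, the maximum number of elements in the queue
return: tensors
tf.train.shuffle_batch
Works like tf.train.batch, but returns the samples in shuffled order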
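
To make the idea of batching concrete, here is a minimal sketch in plain Python (standard library only). It is an analogy, not how TensorFlow implements its queues: the helper names `batch_samples` and `shuffle_batch_samples` are hypothetical, the input is a finite list rather than an endless queue, and an incomplete final group is simply dropped.

```python
import random


def batch_samples(samples, batch_size):
    """Group consecutive samples into lists of exactly batch_size items.

    Simplification: an incomplete final group is dropped.
    """
    buffer = []
    for sample in samples:
        buffer.append(sample)
        if len(buffer) == batch_size:
            yield buffer
            buffer = []


def shuffle_batch_samples(samples, batch_size, seed=None):
    """Shuffle the samples first, then group them into batches."""
    items = list(samples)
    random.Random(seed).shuffle(items)
    return batch_samples(items, batch_size)


ordered = list(batch_samples(range(7), 3))
shuffled = list(shuffle_batch_samples(range(6), 3, seed=0))
```

Here `ordered` holds two batches of three samples each, in reading order, while `shuffled` holds the same kind of batches drawn from a shuffled copy of the input.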

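5. Thread operations (they must be started inside a session)

tf.train.start_queue_runners(sess=None, coord=None)
Collects all queue runners in the graph and starts their threads
sess: the session to run in
coord: the thread coordinator
return: the list of all threads
tf.train.Coordinator()
The thread coordinator; it implements a simple mechanism for coordinating the termination of a set of threads
request_stop(): requests that the threads stop
should_stop(): checks whether a stop has been requested
join(threads=None, stop_grace_period_secs=120): waits for the threads to finish and reclaims them
return: a coordinator instance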
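
The division of labor between the reading threads and the main thread can be pictured with ordinary Python threads. The sketch below uses only the standard library and is an analogy, not TensorFlow's implementation: `producer`, `read_batch`, and `run_demo` are hypothetical names, a `threading.Event` stands in for the coordinator, and a bounded `queue.Queue` stands in for the sample queue.

```python
import queue
import threading


def producer(sample_queue, stop_event, samples):
    """Child thread: keep filling the queue until a stop is requested."""
    i = 0
    while not stop_event.is_set():
        try:
            # A timeout lets the thread notice the stop flag even when the queue is full.
            sample_queue.put(samples[i % len(samples)], timeout=0.1)
            i += 1
        except queue.Full:
            continue


def read_batch(sample_queue, batch_size):
    """Main thread: take batch_size samples out of the queue."""
    return [sample_queue.get() for _ in range(batch_size)]


def run_demo():
    sample_queue = queue.Queue(maxsize=4)   # plays the role of `capacity`
    stop_event = threading.Event()          # plays the role of the coordinator
    worker = threading.Thread(
        target=producer, args=(sample_queue, stop_event, ["a", "b", "c"]))
    worker.start()                          # comparable to start_queue_runners
    batch = read_batch(sample_queue, 5)     # the main thread consumes data
    stop_event.set()                        # comparable to coord.request_stop()
    worker.join()                           # comparable to coord.join(threads)
    return batch, worker.is_alive()
```

Calling `run_demo()` returns the five samples read by the main thread and a flag showing that the child thread has finished. With a single producer the samples come out in the order they were put in.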

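III. Image Basics
All the pixel values of an image together form its feature values
The three elements of an image: height, width, and number of channels
1. Number of channels:
(1) Black-and-white image: a single channel holding gray values from 0 to 255
(2) Color image: three channels, RGB (red, green, blue)
2. Tensor shapes: 3-D [height, width, channel] and 4-D [batch, height, width, channel]
Once an image has been read, how is it represented as a tensor shape? A single image is a 3-D tensor [height, width, channel], where height is the image height, width is the image width, and channel is the number of channels. Both the 3-D and the 4-D representation come up often:
A single image: [height, width, channel]
Multiple images: [batch, height, width, channel], where batch is the number of images in the batch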
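
The relationship between the two shapes can be checked with nested Python lists. This is only an illustrative sketch using the standard library; `blank_image` and `shape_of` are hypothetical helpers, and the sizes are kept tiny on purpose.

```python
def blank_image(height, width, channel):
    """Build a height x width x channel nested list filled with zeros."""
    return [[[0] * channel for _ in range(width)] for _ in range(height)]


def shape_of(nested):
    """Return the shape of a regular nested list, outermost dimension first."""
    shape = []
    while isinstance(nested, list):
        shape.append(len(nested))
        nested = nested[0]
    return shape


single_image = blank_image(4, 5, 3)                        # one color image
image_group = [blank_image(4, 5, 3) for _ in range(2)]     # two images stacked

single_shape = shape_of(single_image)   # [height, width, channel]
group_shape = shape_of(image_group)     # [batch, height, width, channel]
```

Stacking images only adds a leading batch dimension; the per-image shape stays the same, which is why every image must have an identical shape before batching.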

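3. Steps for reading images
1) Build the image file queue: tf.train.string_input_producer
2) Read the image data and decode it (the dtype changes from string to uint8): tf.WholeFileReader, tf.image.decode_jpeg
3) Fix the shape of the image data: tf.image.resize_images
4) Batch the data, which is only possible once the shape is fixed: tf.train.batch
5) Run and print the result: child threads must be started to read the data into the queue, while the main thread takes data out of the queue for training
4. Image reading code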

import tensorflow as tf
import os


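def picread(file_list):
    """
    Read image data into a tensor
    :param file_list: list of image file paths
    :return: a batch of images, shape [10, 200, 200, 3]
    """
    # 1. Build the file queue
    file_queue = tf.train.string_input_producer(file_list)

    # 2. Read the content of the file queue
    # 2.1 Create the reader
    reader = tf.WholeFileReader()
    # 2.2 Reads one whole image per call by default; the value has no shape yet
    key, value = reader.read(file_queue)
    # 2.3 Decode the image: dtype goes from string to uint8, shape from () to (?, ?, ?)
    image = tf.image.decode_jpeg(value)

    # 3.1 Fix the image size (otherwise batching fails later)
    # All images are resized to one size because training needs the same
    # number of features in every sample
    # Here the size is fixed to [200, 200]
    image_resize = tf.image.resize_images(image, [200, 200])
    # 3.2 Set the static shape (set_shape or reshape): one channel or three
    # These are color images, so use 3; use 1 for black-and-white images
    image_resize.set_shape([200, 200, 3])

    # 4. Batch: 3-D -> 4-D; batch_size=n reads n images
    image_batch = tf.train.batch([image_resize], batch_size=10, num_threads=1, capacity=10)
    return image_batch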


if __name__ == "__main__":
    file_name = os.listdir("./data/pics/")
    # file_list: path + file name
    file_list = [os.path.join("./data/pics/", file) for file in file_name]
    image_batch = picread(file_list)

    with tf.Session() as sess:
        # The child threads that read data into the queues must be started by hand;
        # sess.run(image_batch) alone would block waiting for data

        # 1. Create the coordinator used to reclaim the threads
        coord = tf.train.Coordinator()
        # 2. Start the child threads by hand
        # How many threads are returned depends on the queue runners in the graph,
        # e.g. on num_threads in the tf.train.batch(...) call
        threads = tf.train.start_queue_runners(sess=sess, coord=coord)
        print(sess.run(image_batch))
        # 3. Reclaim the threads
        # 3.1 Request a stop first, since some threads may still be running
        coord.request_stop()
        # 3.2 Wait for the threads to finish
        coord.join(threads)

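Note: training relies on matrix computation in float32 (for precision), whereas the data is read as uint8 (to save space), so it has to be converted before it is passed into a neural network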

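IV. Reading Binary Files (specify the number of bytes in one sample)
Steps:
1) Build the binary file queue: tf.train.string_input_producer
2) Read the binary data and decode it:

tf.FixedLengthRecordReader(self.all_bytes)
tf.decode_raw(value, tf.uint8)

3) Split the target value from the feature values: tf.slice
4) Change the shape and type
tf.reshape -> [channel, height, width]
tf.transpose: [channel, height, width] -> [height, width, channel]
5) Batch the data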
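
The split and the reorder in steps 3) and 4) can be traced on a tiny record in plain Python. This is a sketch using only the standard library, assuming the layout described above: the label bytes come first, followed by the pixels stored channel by channel. `parse_record` is a hypothetical helper, and a 2 x 2 image is used so the result is easy to inspect by eye.

```python
def parse_record(record, height=32, width=32, channel=3, label_bytes=1):
    """Split one fixed-length record into (label, image[height][width][channel])."""
    image_bytes = height * width * channel
    if len(record) != label_bytes + image_bytes:
        raise ValueError("unexpected record length: %d" % len(record))
    # The leading bytes hold the target value
    label = int.from_bytes(record[:label_bytes], "big")
    # The remaining bytes are the pixels in [channel, height, width] order
    pixels = record[label_bytes:]
    # Pixel (c, h, w) sits at offset c * height * width + h * width + w;
    # rebuilding the nesting as [h][w][c] is the reshape + transpose step
    image = [[[pixels[c * height * width + h * width + w] for c in range(channel)]
              for w in range(width)]
             for h in range(height)]
    return label, image


# One record: label 7, then 3 channels of 4 pixels each (values 0..11)
record = bytes([7] + list(range(12)))
label, image = parse_record(record, height=2, width=2, channel=3)
```

Each innermost list of `image` now holds the three channel values of one pixel, which is the [height, width, channel] layout expected by the rest of the pipeline.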

Code:

import os

import tensorflow as tf


class BytesRead(object):
    def __init__(self):
        # Properties of each image sample
        self.height = 32
        self.width = 32
        self.channel = 3

        # Sizes in bytes
        # for one sample
        self.label_bytes = 1
        self.image_bytes = self.height * self.width * self.channel
        # Every sample in the data holds both the feature values and the target value
        self.all_bytes = self.label_bytes + self.image_bytes

    def bytes_read(self, file_list):
        """
        Read binary files
        :param file_list: list of binary file paths
        :return: image batch and label batch
        """
        # 1. Build the file queue
        file_queue = tf.train.string_input_producer(file_list)
        # 2. Read the content of the file queue
        # 2.1 Create the reader
        reader = tf.FixedLengthRecordReader(self.all_bytes)
        key, value = reader.read(file_queue)
        # 2.2 Decode the data; decode_raw needs the output type tf.uint8,
        # which the image decoders do not
        # Shape (?,) with 3073 values = label (1,) + features (3072,):
        # a 1-D array holding both the feature values and the target value
        label_image = tf.decode_raw(value, tf.uint8)
        # To make training easier, the feature values and target value are handled separately
        # Split them with tf.slice
        label = tf.slice(label_image, [0], [self.label_bytes])
        image = tf.slice(label_image, [self.label_bytes], [self.image_bytes])
        # 2.3 Fix the type and shape of the image data
        # The target shape is [32, 32, 3], but reshaping to it directly would scramble
        # the pixels because the bytes are stored as [channel, height, width].
        # Reshape to that layout first, then transpose to [height, width, channel]
        channel_major = tf.reshape(image, [self.channel, self.height, self.width])
        # tf.transpose: [channel, height, width] -> [height, width, channel]
        image_shape = tf.transpose(channel_major, [1, 2, 0])

        # 3. Batch the data
        image_batch, label_batch = tf.train.batch([image_shape, label], batch_size=10, num_threads=1, capacity=10)
        return image_batch, label_batch

if __name__ == "__main__":
    file_name = os.listdir("./data/bytes/")
    # file_list: path + file name (only files with the .bin suffix)
    file_list = [os.path.join("./data/bytes/", file) for file in file_name if file[-3:] == "bin"]
    br = BytesRead()
    image_batch, label_batch = br.bytes_read(file_list)

    with tf.Session() as sess:
        # The child threads that read data into the queues must be started by hand;
        # sess.run(image_batch) alone would block waiting for data

        # 1. Create the coordinator used to reclaim the threads
        coord = tf.train.Coordinator()
        # 2. Start the child threads by hand
        # How many threads are returned depends on the queue runners in the graph,
        # e.g. on num_threads in the tf.train.batch(...) call
        threads = tf.train.start_queue_runners(sess=sess, coord=coord)
        print(sess.run([image_batch, label_batch]))
        # 3. Reclaim the threads
        # 3.1 Request a stop first, since some threads may still be running
        coord.request_stop()
        # 3.2 Wait for the threads to finish
        coord.join(threads)

V. TFRecords Files
Features: they make better use of memory, are easier to copy and move, and do not need a separate label file.
File format: *.tfrecords
1. Writing a TFRecords file

example = tf.train.Example(features=tf.train.Features(feature={
    "image": tf.train.Feature(bytes_list=tf.train.BytesList(value=[image])),
    "label": tf.train.Feature(int64_list=tf.train.Int64List(value=[label])),
}))
tf.train.Example(features=None)
Builds the Example that is written to the tfrecords file
features: an instance of tf.train.Features
return: an Example protocol buffer
tf.train.Features(feature=None)
Builds the key-value pairs that describe each sample
feature: a dict; each key is the name to store the data under,
each value is a tf.train.Feature instance
return: a Features instance
tf.train.Feature(options)
options: for example
bytes_list=tf.train.BytesList(value=[Bytes])
int64_list=tf.train.Int64List(value=[Value])
The types that can be stored are:
tf.train.Int64List(value=[Value])
tf.train.BytesList(value=[Bytes])
tf.train.FloatList(value=[value])

Code:

import os

import tensorflow as tf


class BytesRead(object):
    def __init__(self):
        # Properties of each image sample
        self.height = 32
        self.width = 32
        self.channel = 3

        # Sizes in bytes
        # for one sample
        self.label_bytes = 1
        self.image_bytes = self.height * self.width * self.channel
        # Every sample in the data holds both the feature values and the target value
        self.all_bytes = self.label_bytes + self.image_bytes

    def bytes_read(self, file_list):
        """
        Read binary files
        :param file_list: list of binary file paths
        :return: image batch and label batch
        """
        # 1. Build the file queue
        file_queue = tf.train.string_input_producer(file_list)
        # 2. Read the content of the file queue
        # 2.1 Create the reader
        reader = tf.FixedLengthRecordReader(self.all_bytes)
        key, value = reader.read(file_queue)
        # 2.2 Decode the data; decode_raw needs the output type tf.uint8,
        # which the image decoders do not
        # Shape (?,) with 3073 values = label (1,) + features (3072,):
        # a 1-D array holding both the feature values and the target value
        label_image = tf.decode_raw(value, tf.uint8)
        # To make training easier, the feature values and target value are handled separately
        # Split them with tf.slice
        label = tf.slice(label_image, [0], [self.label_bytes])
        image = tf.slice(label_image, [self.label_bytes], [self.image_bytes])
        # 2.3 Fix the type and shape of the image data
        # The target shape is [32, 32, 3], but reshaping to it directly would scramble
        # the pixels because the bytes are stored as [channel, height, width].
        # Reshape to that layout first, then transpose to [height, width, channel]
        channel_major = tf.reshape(image, [self.channel, self.height, self.width])
        # tf.transpose: [channel, height, width] -> [height, width, channel]
        image_shape = tf.transpose(channel_major, [1, 2, 0])

        # 3. Batch the data
        image_batch, label_batch = tf.train.batch([image_shape, label], batch_size=10, num_threads=1, capacity=10)
        return image_batch, label_batch

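    def write_to_tfrecord(self, image_batch, label_batch):
        """
        Write the data to a TFRecords file (call inside an active session)
        :param image_batch: feature values
        :param label_batch: target values
        :return:
        """
        # Create the TFRecords writer
        writer = tf.python_io.TFRecordWriter("./tmp/cifar.tfrecords")

        # Evaluate both tensors in a single run so that every image stays paired
        # with its own label; separate eval() calls would each dequeue a new batch
        images, labels = tf.get_default_session().run([image_batch, label_batch])

        # Build one example per sample, then serialize and write it
        for i in range(10):
            # Take the feature values and target value of the i-th sample
            image = images[i].tobytes()
            # labels has shape [10, 1] because label_batch is a 2-D array
            label = int(labels[i][0])
            # The example for this sample
            example = tf.train.Example(features=tf.train.Features(feature={
                "image": tf.train.Feature(bytes_list=tf.train.BytesList(value=[image])),
                "label": tf.train.Feature(int64_list=tf.train.Int64List(value=[label])),
            }))
            # Write the example of the i-th sample
            writer.write(example.SerializeToString())
        writer.close()
        return None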


if __name__ == "__main__":
    file_name = os.listdir("./data/bytes/")
    # file_list: path + file name (only files with the .bin suffix)
    file_list = [os.path.join("./data/bytes/", file) for file in file_name if file[-3:] == "bin"]
    br = BytesRead()
    image_batch, label_batch = br.bytes_read(file_list)

    with tf.Session() as sess:
        # The child threads that read data into the queues must be started by hand;
        # sess.run(image_batch) alone would block waiting for data

        # 1. Create the coordinator used to reclaim the threads
        coord = tf.train.Coordinator()
        # 2. Start the child threads by hand
        # How many threads are returned depends on the queue runners in the graph,
        # e.g. on num_threads in the tf.train.batch(...) call
        threads = tf.train.start_queue_runners(sess=sess, coord=coord)
        print(sess.run([image_batch, label_batch]))

        br.write_to_tfrecord(image_batch, label_batch)
        # 3. Reclaim the threads
        # 3.1 Request a stop first, since some threads may still be running
        coord.request_stop()
        # 3.2 Wait for the threads to finish
        coord.join(threads)

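2. Reading a TFRecords file
Reading this kind of file follows the same process as the other file types, except that there is one extra step to parse the Example. To read data from a TFRecords file, use tf.TFRecordReader together with the tf.parse_single_example parser, which parses an Example protocol buffer into tensors.

# The one extra step: parsing the example
feature = tf.parse_single_example(value, features={
    "image": tf.FixedLenFeature([], tf.string),
    "label": tf.FixedLenFeature([], tf.int64)
})
tf.parse_single_example(serialized, features=None, name=None)

Parses a single Example proto
serialized: a scalar string Tensor holding one serialized Example
features: a dict; the keys are the names to read, the values are FixedLenFeature instances
return: a dict of key-value pairs whose keys are the names that were read
tf.FixedLenFeature(shape, dtype)

shape: the shape of the input data; usually left unspecified, as an empty list
dtype: the type of the input data; it must match the type that was stored in the file
The type can only be float32, int64, or string

Steps:
1) Build the file queue with tf.train.string_input_producer
2) Read the TFRecords data with tf.TFRecordReader and parse it
Parse with tf.parse_single_example
3) Decode with tf.decode_raw
Data stored as bytes needs decoding
Other types do not
4) Fix the shape and type of the image data, then batch it and return
5) Start the session and the threads and run
Code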

    def read_tfrecord(self):
        """
        Read a TFRecords file
        :return: image batch and label batch
        """
        # 1. Build the file queue (the file names must be passed as a list)
        file_queue = tf.train.string_input_producer(["./tmp/cifar.tfrecords"])
        # 2. tf.TFRecordReader: read the TFRecords data and parse the example protocol
        reader = tf.TFRecordReader()
        # Reads one sample per call by default
        key, value = reader.read(file_queue)
        # Parse the example protocol
        feature = tf.parse_single_example(value, features={
            "image": tf.FixedLenFeature([], tf.string),
            "label": tf.FixedLenFeature([], tf.int64)
        })
        # 3. Decode: the image was stored as raw bytes, so it needs decoding
        image = tf.decode_raw(feature['image'], tf.uint8)
        label = tf.cast(feature['label'], tf.int32)
        # Fix the shape to [32, 32, 3]; the bytes were written in
        # [height, width, channel] order, so no tf.transpose is needed
        image_shape = tf.reshape(image, [self.height, self.width, self.channel])
        # 4. Batch the data
        image_batch, label_batch = tf.train.batch([image_shape, label], batch_size=10, num_threads=1, capacity=10)
        return image_batch, label_batch