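TensorFlow is a very popular deep learning framework. To improve performance and ease of use, successive releases have added many high-level APIs. Some of these are higher-level wrappers around existing APIs; others are new APIs built to improve performance and replace older ones. Among them, the Dataset API and the Estimator API were introduced in TensorFlow 1.3, and the official documentation recommends using them to build models.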
- Datasets: a new way to create input pipelines for TensorFlow models. The Dataset API has methods to load and manipulate data and feed it into your model. The Dataset API meshes well with the Estimator API.
- Estimators: a way to represent a complete TensorFlow model. The Estimator API provides methods to train the model, to judge the model's accuracy, and to generate predictions.
The figure below shows the complete architecture of the TensorFlow API:
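Before TensorFlow 1.3, there were broadly two ways to read data:
- Use placeholder and feed_dict to read data held in memory
- Use a queue pipeline to read data from disk (for an introduction to how it works, see the article "十图详解tensorflow数据读取机制")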
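As a point of comparison, the first of these older approaches (placeholder plus feed_dict) can be sketched as follows. This is only an illustrative sketch: it assumes a TensorFlow release that ships `tf.compat.v1` (1.13 or later, including 2.x), and the helper name `feed_dict_mean` and the sample array are made up for this example.

```python
import numpy as np
import tensorflow as tf


def feed_dict_mean(batch):
    """Feed an in-memory NumPy batch through a placeholder and return its mean.

    Hypothetical helper for illustration; not part of TensorFlow.
    """
    graph = tf.Graph()
    with graph.as_default():
        # The placeholder is an empty slot in the graph; data arrives at run time.
        x = tf.compat.v1.placeholder(tf.float32, shape=[None, 2], name="x")
        mean = tf.reduce_mean(x)
    with tf.compat.v1.Session(graph=graph) as sess:
        # feed_dict copies the NumPy array into the placeholder for this call only.
        return sess.run(mean, feed_dict={x: batch})


data = np.array([[1.0, 2.0], [3.0, 4.0]], dtype=np.float32)
result = feed_dict_mean(data)
```

Every call to `sess.run` has to pass the data in again from Python, which is part of what the Dataset API is meant to replace.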
The Datasets API consists of the classes shown in the figure below:
Where:
- Dataset: Base class containing methods to create and transform datasets. Also allows you to initialize a dataset from data in memory, or from a Python generator.
- TextLineDataset: Reads lines from text files (txt, csv, ...).
- TFRecordDataset: Reads records from TFRecord files.
- FixedLengthRecordDataset: Reads fixed size records from binary files.
- Iterator: Provides a way to access one dataset element at a time.
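A minimal sketch of how these classes fit together is shown below. It builds one dataset from in-memory data and one with TextLineDataset, then reads each element by element through a one-shot iterator. The assumptions: a TensorFlow release that exposes both `tf.data` and `tf.compat.v1` (1.13 or later, including 2.x); the helper `read_all` and the file `sample.csv` are hypothetical and exist only for this example.

```python
import os
import tempfile

import tensorflow as tf


def read_all(build_dataset):
    """Build a dataset in a fresh graph and drain it with a one-shot iterator.

    Hypothetical helper for illustration; not part of TensorFlow.
    """
    graph = tf.Graph()
    elements = []
    with graph.as_default():
        dataset = build_dataset()
        # The iterator yields one dataset element per evaluation of get_next().
        iterator = tf.compat.v1.data.make_one_shot_iterator(dataset)
        next_element = iterator.get_next()
    with tf.compat.v1.Session(graph=graph) as sess:
        while True:
            try:
                elements.append(sess.run(next_element))
            except tf.errors.OutOfRangeError:
                # Raised once the dataset is exhausted.
                break
    return elements


# Dataset: initialized from data already in memory.
in_memory = read_all(lambda: tf.data.Dataset.from_tensor_slices([10, 20, 30]))

# TextLineDataset: one element per line of a text file (a throwaway CSV here).
tmp_dir = tempfile.mkdtemp()
csv_path = os.path.join(tmp_dir, "sample.csv")
with open(csv_path, "w") as f:
    f.write("a,1\nb,2\n")
lines = read_all(lambda: tf.data.TextLineDataset(csv_path))
```

Note that TextLineDataset returns each line as a raw byte string; splitting it into columns is left to a later transformation step.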
1. Loading data to form a dataset
(1) Loading data from memory or from an iterator:
A single element of a Dataset contains one or more tf.Tensor objects, called components. An element may be a single tensor,
a tuple of tensors, or a nested tuple of tensors. In addition to tuples, you