A Detailed Look at the File System Interface in TensorFlow-IO

王mountain

Published 2023-08-02 16:26:14

Tags: tensorflow, artificial intelligence

Article link: https://blog.csdn.net/wwwakdf/article/details/132021852

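This article walks through how the file system interface in tensorflow_io is implemented. Starting from loading the MNIST dataset, it traces the calls down to the core shared library and the Operation interface, examines the roles that Env and RandomAccessFile play in file operations, and shows how different file systems are selected and distinguished.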

Introduction

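TensorFlow-IO provides a rich set of interfaces for connecting to different file systems. This article works from the source code to analyze those interfaces and to see how file systems implemented in C/C++ are exposed to Python.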

Starting from Usage

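We start from the simplest case, training on the MNIST handwritten-digit dataset, and follow the chain of function calls. To use tensorflow_io, first install the package with pip install tensorflow-io. After that, the dataset can be loaded through tensorflow_io.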

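import tensorflow_io as tfio
from tensorflow.keras import datasets

dataset_url = "https://storage.googleapis.com/cvdf-datasets/mnist/"
# Load MNIST through tensorflow_io from the gzip-compressed IDX files
d_train = tfio.IODataset.from_mnist(
    dataset_url + "train-images-idx3-ubyte.gz",
    dataset_url + "train-labels-idx1-ubyte.gz",
)
# For comparison: the Keras loader downloads the data automatically and
# returns ((x_train, y_train), (x_test, y_test))
(xs, ys), _ = datasets.mnist.load_data()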

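As we know, a dataset in tensorflow already consists of processed tensors. The focus here is not data preprocessing but how the data gets loaded from disk, so that this path can be improved later. We keep tracing the code: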

@classmethod
def from_mnist(cls, images=None, labels=None, **kwargs):
    with tf.name_scope(kwargs.get("name", "IOFromMNIST")):
        return mnist_dataset_ops.MNISTIODataset(
            images, labels, internal=True, **kwargs
        )


def MNISTIODataset(images=None, labels=None, internal=True):
    """MNISTIODataset"""
    assert internal, (
        "MNISTIODataset constructor is private; please use one "
        "of the factory methods instead (e.g., "
        "IODataset.from_mnist())"
    )

    assert (
        images is not None or labels is not None
    ), "images and labels could not be all None"

    images_dataset = MNISTImageIODataset(images) if images is not None else None

    labels_dataset = MNISTLabelIODataset(labels) if labels is not None else None

    if images is None:
        return labels_dataset
    if labels is None:
        return images_dataset

    return tf.data.Dataset.zip((images_dataset, labels_dataset))


class MNISTImageIODataset(tf.data.Dataset):
    def __init__(self, filename):
        _, compression = core_ops.io_file_info(filename)
        rows = tf.io.decode_raw(
            core_ops.io_file_read(filename, 8, 4, compression=compression),
            tf.int32,
            little_endian=False,
        )
        cols = tf.io.decode_raw(
            core_ops.io_file_read(filename, 12, 4, compression=compression),
            tf.int32,
            little_endian=False,
        )
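
The offsets 8 and 12 in the constructor above line up with the header layout of the MNIST IDX3 image format: four big-endian 32-bit integers holding the magic number (2051 for image files), the image count, the row count, and the column count. The following is a minimal sketch of the same header read using only the Python standard library. It builds a small in-memory file instead of downloading anything, and read_idx3_header is a hypothetical helper name used for illustration, not part of tensorflow_io.

```python
import gzip
import struct


def read_idx3_header(data: bytes):
    """Return (magic, count, rows, cols) from the first 16 bytes of IDX3 data.

    Hypothetical helper for illustration; not a tensorflow_io function.
    """
    # ">iiii" = four big-endian signed 32-bit integers
    return struct.unpack(">iiii", data[:16])


# Build a tiny gzip-compressed IDX3 file in memory: 2 blank 28x28 images
header = struct.pack(">iiii", 2051, 2, 28, 28)
compressed = gzip.compress(header + bytes(2 * 28 * 28))

# Decompress first, then read the header fields
raw = gzip.decompress(compressed)
magic, count, rows, cols = read_idx3_header(raw)

# Equivalent to reading 4 bytes at offset 8 and at offset 12
rows_at_8 = struct.unpack(">i", raw[8:12])[0]
cols_at_12 = struct.unpack(">i", raw[12:16])[0]
print(magic, count, rows, cols)
```

Here the decompression step stands in for what the compression argument of io_file_read appears to handle, and the big-endian format character mirrors little_endian=False in the tf.io.decode_raw calls.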

The file path ends up being passed to core_ops.io_file_info(), but the source of io_file_info() cannot be found on the Python side. So we first look at how core_ops is defined:

core_ops = LazyLoader("core_ops", "libtensorflow_io.so")

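Here core_ops is a Python module object whose functionality is loaded from the binary file libtensorflow_io.so (a shared library). LazyLoader is a class; on first attribute access it calls _load_library() with the library file name. Note that the string "core_ops" has no functional role; it is only the name given to the module object.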

class LazyLoader(types.ModuleType):
    def __init__(self, name, library):
        self._mod = None
        self._module_name = name
        self._library = library
        super().__init__(self._module_name)

    def _load(self):
        if self._mod is None:
            self._mod = _load_library(self._library)
        return self._mod

    def __getattr__(self, attrb):
        return getattr(self._load(), attrb)

    def __dir__(self):
        return dir(self._load())
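
To see the lazy-loading behavior in isolation, here is a minimal sketch of the same pattern that runs without tensorflow_io. The names LazyModule, fake_loader, and core_ops_demo are hypothetical, and fake_loader is only a stand-in for loading libtensorflow_io.so; it returns a plain namespace with one dummy function.

```python
import types


class LazyModule(types.ModuleType):
    """Module object that defers loading until the first attribute access."""

    def __init__(self, name, loader):
        super().__init__(name)
        self._loader = loader
        self._mod = None

    def _load(self):
        # Run the loader only once and cache the result
        if self._mod is None:
            self._mod = self._loader()
        return self._mod

    def __getattr__(self, attr):
        # Called only when normal lookup fails, i.e. for the loaded ops
        return getattr(self._load(), attr)


calls = []


def fake_loader():
    # Stand-in for loading the shared library (assumption for this demo)
    calls.append("loaded")
    return types.SimpleNamespace(io_file_info=lambda filename: (filename, "gzip"))


core_ops_demo = LazyModule("core_ops_demo", fake_loader)
loaded_before = list(calls)  # still empty: nothing has been loaded yet
result = core_ops_demo.io_file_info("train-images-idx3-ubyte.gz")
loaded_after = list(calls)  # the loader ran on first attribute access
print(loaded_before, loaded_after, result)
```

Creating the module object costs nothing, and the loader runs only when an attribute such as io_file_info is first requested, which is why the name passed to the constructor serves only as a label.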

Inside _load_library, a different loading method is used depending on the kind of library being loaded (the lib argument).

def _load_library(filename, lib="op"):
    """_load_library"""
    f = inspect.getfile(sys._getframe(1))  # pylint: disable=protected-access

    # Construct filename
    f = os.path.join(os.path.dirname(f), filename)
    filenames = [f]

    # Add datapath to load if en var is set, used for running tests where shared
    # libraries are built in a different path
    datapath = os.environ.get("TFIO_DATAPATH")
    if datapath is not None:
        # Build filename from: