The benefits of splitting a .tfrecord file into .shard files

Brief introduction

Splitting a TFRecords file into multiple shards has essentially 3 advantages:

  1. Easier to shuffle. Sharding makes it easy to shuffle the data at a coarse level, by shuffling the shard filenames before any per-example shuffle buffer is applied (see the sketch after this list).
  2. Faster to download. If the files are spread across multiple servers, downloading several files from different servers in parallel will optimize bandwidth usage (rather than downloading one file from a single server). This can improve performance significantly compared to downloading the data from a single server.
  3. Simpler to manipulate. It’s easier to deal with 10,000 files of 100MB each rather than with a single 1TB file. Huge files can be a pain to handle: in particular, transfers are much more likely to fail. It’s also harder to manipulate subsets of the data when it’s all in a single file.
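
As a minimal sketch of advantage 1 (the shard pattern data/train-*.tfrecord is an assumption, not a convention from the original answer), tf.data can shuffle the shard filenames themselves before any per-example shuffle buffer is applied:

```python
import tensorflow as tf

# Coarse-level shuffle: shuffle the shard *filenames*; list_files re-shuffles
# them on each iteration by default. The glob pattern is hypothetical.
files = tf.data.Dataset.list_files("data/train-*.tfrecord", shuffle=True)

# Fine-level shuffle: a per-example shuffle buffer on top of the file order.
dataset = tf.data.TFRecordDataset(files).shuffle(buffer_size=10_000)
```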

Detailed explanation

Splitting TFRecord files into shards helps you shuffle large datasets that won’t fit into memory.

Imagine you have millions of training examples saved on disk and you want to repeatedly run them through a training process. Furthermore, suppose that for each repetition of the training data (i.e. each epoch) you want to load the data in a completely random order.

One approach is to have one file per training example and generate a list of all filenames. Then at the beginning of each epoch you shuffle the list of filenames and load the individual files. The problem with this approach is that you are loading millions of files from random locations on your disk. This can be slow, especially on a hard disk drive. Even a RAID 0 array (the RAID level with the highest storage performance) will not help with speed if you are loading millions of small files from random locations. The problem gets even worse if you are accessing the files over a network connection.
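
A rough sketch of this one-file-per-example approach (the paths, load_example, and train_step below are all hypothetical) shows the problematic access pattern: every epoch issues millions of small reads at random disk locations.

```python
import random

num_epochs = 10
# Hypothetical layout: one small file per training example.
example_paths = [f"data/example_{i:07d}.bin" for i in range(3_000_000)]

def load_example(path):
    with open(path, "rb") as f:   # one small random-access read per example
        return f.read()

def train_step(example):
    ...                           # hypothetical training callback

for epoch in range(num_epochs):
    random.shuffle(example_paths)        # global shuffle of the filename list
    for path in example_paths:
        train_step(load_example(path))   # millions of random reads per epoch
```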

Another approach is to read the training examples in sequence from one large TFRecord file and shuffle the examples in memory using a shuffle buffer. However, the shuffle buffer typically cannot be larger than the DDR memory available to your CPU. And if the shuffle buffer is significantly smaller than your dataset, then it may not adequately shuffle the data. The data may be “locally” shuffled but not “globally” shuffled. That is, examples from the beginning of the dataset may not be shuffled with examples from the end of the dataset.
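
Assuming a single large file named train.tfrecord, a minimal sketch of this approach looks like the following; with a buffer much smaller than the dataset, an example can only move about buffer_size positions, so the shuffle stays local.

```python
import tensorflow as tf

# "train.tfrecord" is an assumed filename. buffer_size is bounded by RAM;
# if it is much smaller than the dataset, shuffling is local, not global.
dataset = tf.data.TFRecordDataset("train.tfrecord").shuffle(buffer_size=100_000)
```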

A good solution is to use a balanced combination of the above two approaches by splitting your dataset into multiple TFRecord files (called shards). During each epoch you can shuffle the shard filenames to obtain global shuffling and use a shuffle buffer to obtain local shuffling. A good balance makes the shards large enough to avoid disk-speed issues, yet small enough that a shuffle buffer can still shuffle them adequately.
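
Writing the shards might look like this sketch, where serialized_examples is a hypothetical in-memory list of already-serialized tf.train.Example protos:

```python
import random
import tensorflow as tf

num_shards = 100
random.shuffle(serialized_examples)  # serialized_examples is hypothetical

writers = [
    tf.io.TFRecordWriter(f"data/train-{i:05d}-of-{num_shards:05d}.tfrecord")
    for i in range(num_shards)
]
for i, example in enumerate(serialized_examples):
    writers[i % num_shards].write(example)   # spread examples round-robin
for writer in writers:
    writer.close()
```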

Here are the exact steps (a pipeline sketch follows the list):

  1. Randomly place all training examples into multiple TFRecord files (shards).
  2. At the beginning of each epoch, shuffle the list of shard filenames.
  3. Read training examples from the shards and pass the examples through a shuffle buffer. Typically, the shuffle buffer should be larger than the shard size to ensure good shuffling across shards.
  4. Pass the shuffled examples into your training process.
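
Putting steps 2 through 4 together, a tf.data pipeline sketch could look like this (the shard pattern, buffer sizes, and parse_example schema are all assumptions, not a canonical recipe):

```python
import tensorflow as tf

def parse_example(record):
    # Hypothetical parser; replace with tf.io.parse_single_example and
    # the feature schema used when the shards were written.
    return record

# Step 2: shuffle the shard filenames (re-shuffled on each iteration/epoch).
files = tf.data.Dataset.list_files("data/train-*.tfrecord", shuffle=True)

dataset = (
    files.interleave(                    # Step 3: read from several shards at once
        tf.data.TFRecordDataset,
        cycle_length=8,                  # shards read concurrently
        num_parallel_calls=tf.data.AUTOTUNE,
    )
    .shuffle(buffer_size=50_000)         # Step 3: local shuffle across shards
    .map(parse_example, num_parallel_calls=tf.data.AUTOTUNE)
    .batch(32)
    .prefetch(tf.data.AUTOTUNE)
)

# Step 4: pass the shuffled examples into training, e.g. model.fit(dataset, ...)
```

Interleaving several shards at once also means the shuffle buffer sees a mix of shards at any moment, which is what makes the local shuffle effective across shard boundaries.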