The benefits of splitting a .tfrecord file into .shard files

Brief introduction

Splitting a TFRecords file into multiple shards has essentially 3 advantages:

  1. Easier to shuffle. Sharding makes it easy to shuffle the data at a coarse level, by shuffling the shard filenames before any per-example shuffle buffer is applied (see the sketch after this list).
  2. Faster to download. If the files are spread across multiple servers, downloading several files from different servers in parallel will optimize bandwidth usage (rather than downloading one file from a single server). This can improve performance significantly compared to downloading the data from a single server.
  3. Simpler to manipulate. It’s easier to deal with 10,000 files of 100MB each rather than with a single 1TB file. Huge files can be a pain to handle: in particular, transfers are much more likely to fail. It’s also harder to manipulate subsets of the data when it’s all in a single file.
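
As a minimal sketch of advantage 1 (the shard pattern data/train-*.tfrecord is an assumption, not a convention from the original answer), tf.data can shuffle the shard filenames themselves before any per-example shuffle buffer is applied:

```python
import tensorflow as tf

# Coarse-level shuffle: shuffle the shard *filenames*; list_files re-shuffles
# them on each iteration by default. The glob pattern is hypothetical.
files = tf.data.Dataset.list_files("data/train-*.tfrecord", shuffle=True)

# Fine-level shuffle: a per-example shuffle buffer on top of the file order.
dataset = tf.data.TFRecordDataset(files).shuffle(buffer_size=10_000)
```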

Detailed explanation

Splitting TFRecord files into shards helps you shuffle large datasets that won’t fit into memory.

Imagine you have millions of training examples saved on disk and you want to repeatedly run them through a training process. Furthermore, suppose that for each repetition of the training data (i.e. each epoch) you want to load the data in a completely random order.

One approach is to have one file per training example and generate a list of all filenames. Then at the beginning of each epoch you shuffle the list of filenames and load the individual files. The problem with this approach is that you are loading millions of files from random locations on your disk. This can be slow, especially on a hard disk drive. Even a RAID 0 array (the RAID level with the highest storage performance) will not help with speed if you are loading millions of small files from random locations. The problem gets even worse if you are accessing the files over a network connection.
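
A rough sketch of this one-file-per-example approach (the paths, load_example, and train_step below are all hypothetical) shows the problematic access pattern: every epoch issues millions of small reads at random disk locations.

```python
import random

num_epochs = 10
# Hypothetical layout: one small file per training example.
example_paths = [f"data/example_{i:07d}.bin" for i in range(3_000_000)]

def load_example(path):
    with open(path, "rb") as f:   # one small random-access read per example
        return f.read()

def train_step(example):
    ...                           # hypothetical training callback

for epoch in range(num_epochs):
    random.shuffle(example_paths)        # global shuffle of the filename list
    for path in example_paths:
        train_step(load_example(path))   # millions of random reads per epoch
```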

Another approach is to read the training examples in sequence from one large TFRecord file and shuffle the examples in memory using a shuffle buffer. However, the shuffle buffer typically cannot be larger than the DDR memory available to your CPU. And if the shuffle buffer is significantly smaller than your dataset, then it may not adequately shuffle the data. The data may be “locally” shuffled but not “globally” shuffled. That is, examples from the beginning of the dataset may not be shuffled with examples from the end of the dataset.
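
Assuming a single large file named train.tfrecord, a minimal sketch of this approach looks like the following; with a buffer much smaller than the dataset, an example can only move about buffer_size positions, so the shuffle stays local.

```python
import tensorflow as tf

# "train.tfrecord" is an assumed filename. buffer_size is bounded by RAM;
# if it is much smaller than the dataset, shuffling is local, not global.
dataset = tf.data.TFRecordDataset("train.tfrecord").shuffle(buffer_size=100_000)
```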

A good solution is to use a balanced combination of the above two approaches by splitting your dataset into multiple TFRecord files (called shards). During each epoch you can shuffle the shard filenames to obtain global shuffling and use a shuffle buffer to obtain local shuffling. A good balance makes the shards large enough to avoid disk-speed issues, yet small enough that a shuffle buffer can still shuffle them adequately.
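
Writing the shards might look like this sketch, where serialized_examples is a hypothetical in-memory list of already-serialized tf.train.Example protos:

```python
import random
import tensorflow as tf

num_shards = 100
random.shuffle(serialized_examples)  # serialized_examples is hypothetical

writers = [
    tf.io.TFRecordWriter(f"data/train-{i:05d}-of-{num_shards:05d}.tfrecord")
    for i in range(num_shards)
]
for i, example in enumerate(serialized_examples):
    writers[i % num_shards].write(example)   # spread examples round-robin
for writer in writers:
    writer.close()
```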

Here are the exact steps (a pipeline sketch follows the list):

  1. Randomly place all training examples into multiple TFRecord files (shards).
  2. At the beginning of each epoch, shuffle the list of shard filenames.
  3. Read training examples from the shards and pass the examples through a shuffle buffer. Typically, the shuffle buffer should be larger than the shard size to ensure good shuffling across shards.
  4. Pass the shuffled examples into your training process.
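
Putting steps 2 through 4 together, a tf.data pipeline sketch could look like this (the shard pattern, buffer sizes, and parse_example schema are all assumptions, not a canonical recipe):

```python
import tensorflow as tf

def parse_example(record):
    # Hypothetical parser; replace with tf.io.parse_single_example and
    # the feature schema used when the shards were written.
    return record

# Step 2: shuffle the shard filenames (re-shuffled on each iteration/epoch).
files = tf.data.Dataset.list_files("data/train-*.tfrecord", shuffle=True)

dataset = (
    files.interleave(                    # Step 3: read from several shards at once
        tf.data.TFRecordDataset,
        cycle_length=8,                  # shards read concurrently
        num_parallel_calls=tf.data.AUTOTUNE,
    )
    .shuffle(buffer_size=50_000)         # Step 3: local shuffle across shards
    .map(parse_example, num_parallel_calls=tf.data.AUTOTUNE)
    .batch(32)
    .prefetch(tf.data.AUTOTUNE)
)

# Step 4: pass the shuffled examples into training, e.g. model.fit(dataset, ...)
```

Interleaving several shards at once also means the shuffle buffer sees a mix of shards at any moment, which is what makes the local shuffle effective across shard boundaries.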