As we know, SGD shuffles the whole dataset once per epoch (one pass over all the training samples) and then reads it sequentially, batch by batch. How does Caffe implement this shuffling?
Caffe accepts input data as leveldb, lmdb, an image file list, or hdf5. The corresponding source files are in src/caffe/layers:
data_layer.cpp, image_data_layer.cpp, window_data_layer.cpp, hdf5_data_layer.cpp
The entry point is the layer setup function:
void DataLayer<Dtype>::DataLayerSetUp()
1. For leveldb and lmdb, see data_layer.cpp. Caffe assumes the database was already shuffled when it was built, so it does not shuffle again and simply reads the records sequentially. When the read cursor reaches the end of the database, it seeks back to the first record.
2. For an image file list, see image_data_layer.cpp. Caffe reads the list line by line (each line contains a file path and a label). If the shuffle option is set on the data layer, it shuffles the list once at setup and again each time an epoch ends. The code is:
if (this->layer_param_.image_data_param().shuffle()) {
  // randomly shuffle data
  LOG(INFO) << "Shuffling data";
  const unsigned int prefetch_rng_seed = caffe_rng_rand();
  prefetch_rng_.reset(new Caffe::RNG(prefetch_rng_seed));
  ShuffleImages();
}

if (lines_id_ >= lines_size) {
  // We have reached the end. Restart from the first.
  DLOG(INFO) << "Restarting data prefetching from start.";
  lines_id_ = 0;
  if (this->layer_param_.image_data_param().shuffle()) {
    ShuffleImages();
  }
}
3. For hdf5, shuffle the data yourself when you write the hdf5 file.
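To make the difference between cases 1 and 2 concrete, here is a minimal standalone sketch of the two read patterns. It is not Caffe code: EpochReader, Sample and NextBatch are hypothetical names made up for illustration, and std::mt19937 with std::shuffle stands in for Caffe's own RNG and ShuffleImages(). With shuffle set to false it behaves like the leveldb/lmdb cursor (read in order, rewind at the end); with shuffle set to true it behaves like the image list (reshuffle every time the end is reached).

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <random>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for one line of an image list: (file path, label).
using Sample = std::pair<std::string, int>;

// Illustrative reader, not a Caffe class.
//   shuffle == false: sequential reads that rewind at the end (cases 1 and 3).
//   shuffle == true:  the list is reshuffled at every epoch boundary (case 2).
class EpochReader {
 public:
  EpochReader(std::vector<Sample> samples, bool shuffle, unsigned seed)
      : samples_(std::move(samples)), shuffle_(shuffle), rng_(seed) {
    assert(!samples_.empty());
    if (shuffle_) Shuffle();  // shuffle once before the first epoch
  }

  // Returns the next batch, wrapping around when the end of the data is hit.
  std::vector<Sample> NextBatch(std::size_t batch_size) {
    std::vector<Sample> batch;
    batch.reserve(batch_size);
    for (std::size_t i = 0; i < batch_size; ++i) {
      batch.push_back(samples_[pos_]);
      if (++pos_ >= samples_.size()) {
        // End of epoch: restart from the first sample.
        pos_ = 0;
        ++epochs_completed_;
        if (shuffle_) Shuffle();
      }
    }
    return batch;
  }

  int epochs_completed() const { return epochs_completed_; }

 private:
  void Shuffle() { std::shuffle(samples_.begin(), samples_.end(), rng_); }

  std::vector<Sample> samples_;
  bool shuffle_;
  std::mt19937 rng_;
  std::size_t pos_ = 0;
  int epochs_completed_ = 0;
};
```

Note that a batch can straddle two epochs, exactly as in the lines_id_ check above: the reader does not pad or cut the last batch, it just keeps going from the first sample. For leveldb, lmdb and hdf5, the same std::shuffle call would be applied once, offline, to the sample list before the database or file is written.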