Analysis
I'm writing this article to understand how Caffe's data-reading layer is designed when the input is video rather than still images.
Intuition
For an input layer that reads video frames instead of images, my gut feeling was that the design shouldn't differ much, since a video is ultimately just a sequence of frames; I just had no idea how to handle it. The most brute-force approach I could imagine: save each frame as a single-channel image, then merge those single-channel images into one multi-channel image, so a video essentially becomes a single image with many channels. But that doesn't seem so simple in practice, and if the network took every frame as input, the compute cost would be staggering. That is a real problem, but honestly the more important reason is: I couldn't code it up myself.
So let's lean on prior work and see how others implemented it:
1. Open caffe.proto and look at the VideoDataParameter message:
message VideoDataParameter{
// Specify the data source.
optional string source = 1;
// Specify the batch size.
optional uint32 batch_size = 4;
// The rand_skip variable is for the data layer to skip a few data points
// to avoid all asynchronous sgd clients to start at the same point. The skip
// point would be set as rand_skip * rand(0,1). Note that rand_skip should not
// be larger than the number of keys in the leveldb.
optional uint32 rand_skip = 7 [default = 0];
// Whether or not ImageLayer should shuffle the list of files at every epoch.
optional bool shuffle = 8 [default = false];
// It will also resize images if new_height or new_width are not zero.
optional uint32 new_height = 9 [default = 0];
optional uint32 new_width = 10 [default = 0];
optional uint32 new_length = 11 [default = 1];
optional uint32 num_segments = 12 [default = 1];
// DEPRECATED. See TransformationParameter. For data pre-processing, we can do
// simple scaling and subtracting the data mean, if provided. Note that the
// mean subtraction is always carried out before scaling.
optional float scale = 2 [default = 1];
optional string mean_file = 3;
// DEPRECATED. See TransformationParameter. Specify if we would like to randomly
// crop an image.
optional uint32 crop_size = 5 [default = 0];
// DEPRECATED. See TransformationParameter. Specify if we want to randomly mirror
// data.
optional bool mirror = 6 [default = false];
enum Modality {
RGB = 0;
FLOW = 1;
}
optional Modality modality = 13 [default = FLOW];
// the name pattern for frame images,
// for RGB modality it is default to "img_%04d.jpg", for FLOW "flow_x_%04d" and "flow_y_%04d"
optional string name_pattern = 14;
// The type of input
optional bool encoded = 15 [default = false];
}
Let me only cover the fields without comments. new_length is the number of stacked optical-flow frames that make up one sample. num_segments, also used for flow input, defaults to 1; it is the number of segments a video is split into, and with num_segments = 1 (which is what we actually use) one video is one sample. name_pattern is the filename format of the saved Flow and img files: the flow images computed from the RGB frames are named "flow_x_%04d.jpg" and "flow_y_%04d.jpg", while the RGB frames themselves are named "img_%04d.jpg". The data layer takes different reading paths for img and flow, as the source code will show, hence the two Modality options, RGB and FLOW.
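To make the fields concrete, here is a hypothetical layer definition using these parameters; the source path and all the values are made up for illustration:

```
layer {
  name: "data"
  type: "VideoData"
  top: "data"
  top: "label"
  video_data_param {
    source: "train_list.txt"        # each line: <video_dir> <num_frames> <label>
    batch_size: 32
    new_length: 5                   # stack 5 consecutive flow frames per sample
    num_segments: 1
    modality: FLOW
    name_pattern: "flow_%c_%04d.jpg"
    shuffle: true
  }
}
```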
2. Now let's see how the VideoDataLayer class declares its interface; open VideoDataLayer.hpp:
template <typename Dtype>
class VideoDataLayer : public BasePrefetchingDataLayer<Dtype> {
public:
explicit VideoDataLayer(const LayerParameter& param)
: BasePrefetchingDataLayer<Dtype>(param) {}
virtual ~VideoDataLayer();
virtual void DataLayerSetUp(const vector<Blob<Dtype>*>& bottom,
const vector<Blob<Dtype>*>& top);
virtual inline const char* type() const { return "VideoData"; }
virtual inline int ExactNumBottomBlobs() const { return 0; }
virtual inline int ExactNumTopBlobs() const { return 2; }
Like ImageDataLayer, VideoDataLayer inherits from BasePrefetchingDataLayer. The most important part is still the layer-specific work, done in DataLayerSetUp; I'll walk through it in the implementation. The class then declares several important members: random number generators for optical flow and for frames. Random numbers are an important concept in Caffe, used mainly for weight initialization, and they are also needed when shuffling.
protected:
shared_ptr<Caffe::RNG> prefetch_rng_;
shared_ptr<Caffe::RNG> prefetch_rng_2_;
shared_ptr<Caffe::RNG> prefetch_rng_1_;
shared_ptr<Caffe::RNG> frame_prefetch_rng_;
virtual void ShuffleVideos();
virtual void InternalThreadEntry();
prefetch_rng_1_ and prefetch_rng_2_ are the generators for flow_x and flow_y respectively, while prefetch_rng_ is there to ensure flow_x and flow_y get the same random numbers (in DataLayerSetUp below, prefetch_rng_1_ and prefetch_rng_2_ are seeded with the same seed, so they produce identical sequences). frame_prefetch_rng_ is the generator used to pick frames. The thread entry point also differs from ImageDataLayer; more on that in the implementation.
#ifdef USE_MPI
inline virtual void advance_cursor(){
lines_id_++;
if (lines_id_ >= lines_.size()) {
// We have reached the end. Restart from the first.
DLOG(INFO) << "Restarting data prefetching from start.";
lines_id_ = 0;
if (this->layer_param_.video_data_param().shuffle()) {
ShuffleVideos();
}
}
}
#endif
I'm honestly not sure about the block above either; roughly, it's a cursor (used under MPI) that advances through the video list, and at the end of every epoch the list is shuffled again, so the random order differs each time.
vector<std::pair<std::string, int> > lines_;
vector<int> lines_duration_;
int lines_id_;
string name_pattern_;
lines_ is a vector of (video path, label) pairs recording the sample-to-label mapping from the source (.txt) file; lines_.size() is the number of samples. lines_duration_ holds each video's frame count, lines_id_ is the current sample index, and name_pattern_ stores the frame filename pattern discussed above.
3. VideoDataLayer.cpp
namespace caffe{
template <typename Dtype>
VideoDataLayer<Dtype>:: ~VideoDataLayer<Dtype>(){
this->JoinPrefetchThread();
}
I haven't looked into this closely; the destructor waits for the prefetch thread to finish (JoinPrefetchThread) before the layer is torn down.
template <typename Dtype>
void VideoDataLayer<Dtype>:: DataLayerSetUp(const vector<Blob<Dtype>*>& bottom, const vector<Blob<Dtype>*>& top){
const int new_height = this->layer_param_.video_data_param().new_height();
const int new_width = this->layer_param_.video_data_param().new_width();
const int new_length = this->layer_param_.video_data_param().new_length();
const int num_segments = this->layer_param_.video_data_param().num_segments();
const string& source = this->layer_param_.video_data_param().source();
LOG(INFO) << "Opening file: " << source;
std:: ifstream infile(source.c_str());
string filename;
int label;
int length;
while (infile >> filename >> length >> label){
lines_.push_back(std::make_pair(filename,label));
lines_duration_.push_back(length);
}
new_length is the number of stacked optical-flow frames, the L in the paper. num_segments defaults to 1, meaning one snippet is sampled per video. std::ifstream reads a file from disk; infile is the stream object, constructed with the file name. Each line of the source file supplies a video directory, its frame count, and its label.
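The expected format of the source list can be read straight off the parse loop above: one video per line, as directory, frame count, and label. A minimal sketch of the same loop, tested against a made-up two-line list:

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Mirrors the parse loop in DataLayerSetUp: each line of the source
// list file is "<video_dir> <num_frames> <label>".
struct VideoList {
  std::vector<std::pair<std::string, int> > lines;  // (path, label)
  std::vector<int> durations;                       // frame count per video
};

VideoList ParseSource(std::istream& infile) {
  VideoList list;
  std::string filename;
  int length, label;
  while (infile >> filename >> length >> label) {
    list.lines.push_back(std::make_pair(filename, label));
    list.durations.push_back(length);
  }
  return list;
}
```

The video names here are invented; a real list would point at the extracted-frame directories.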
if (this->layer_param_.video_data_param().shuffle()){
const unsigned int prefectch_rng_seed = caffe_rng_rand();
prefetch_rng_1_.reset(new Caffe::RNG(prefectch_rng_seed));
prefetch_rng_2_.reset(new Caffe::RNG(prefectch_rng_seed));
ShuffleVideos();
}
LOG(INFO) << "A total of " << lines_.size() << " videos.";
Obviously, the RNG-related code here is there for shuffling.
if (this->layer_param_.video_data_param().name_pattern() == ""){
if (this->layer_param_.video_data_param().modality() == VideoDataParameter_Modality_RGB){
name_pattern_ = "image_%04d.jpg";
}else if (this->layer_param_.video_data_param().modality() == VideoDataParameter_Modality_FLOW){
name_pattern_ = "flow_%c_%04d.jpg";
}
}else{
name_pattern_ = this->layer_param_.video_data_param().name_pattern();
}
This part determines the filename pattern. The defaults in the code are image_%04d.jpg for frames and flow_%c_%04d.jpg for flow, where %c is later filled with 'x' or 'y', giving flow_x_%04d.jpg and flow_y_%04d.jpg.
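How the readers below expand these patterns with sprintf can be sketched like this (the frame numbers are arbitrary):

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Expand the RGB pattern: frame index -> e.g. "img_0012.jpg"
std::string RgbName(const char* pattern, int frame) {
  char tmp[64];
  std::snprintf(tmp, sizeof(tmp), pattern, frame);
  return std::string(tmp);
}

// Expand the flow pattern: axis ('x' or 'y') plus frame index
// -> e.g. "flow_x_0012.jpg"
std::string FlowName(const char* pattern, char axis, int frame) {
  char tmp[64];
  std::snprintf(tmp, sizeof(tmp), pattern, axis, frame);
  return std::string(tmp);
}
```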
Datum datum;
const unsigned int frame_prefectch_rng_seed = caffe_rng_rand();
frame_prefetch_rng_.reset(new Caffe::RNG(frame_prefectch_rng_seed));
int average_duration = (int) lines_duration_[lines_id_]/num_segments;
vector<int> offsets;
for (int i = 0; i < num_segments; ++i){
caffe::rng_t* frame_rng = static_cast<caffe::rng_t*>(frame_prefetch_rng_->generator());
int offset = (*frame_rng)() % (average_duration - new_length + 1);
offsets.push_back(offset+i*average_duration);
}
offsets is a vector I puzzled over for a long time; it is defined with no explanation at all. It holds one starting frame per segment. average_duration = duration / num_segments is the length of each segment; since num_segments = 1 here, average_duration is simply the earlier length, i.e. the video's frame count, and the loop body runs only once, so a single snippet is taken per video. For each segment i, a random offset is drawn in [0, average_duration - new_length], which guarantees that new_length consecutive frames fit inside the segment, and offset + i * average_duration shifts it into segment i, effectively jumping from one segment to the next, before being pushed into offsets.
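The offset arithmetic can be isolated into a small sketch; `rand_value` is a stand-in for the draw from Caffe's RNG (in the real code a fresh value is drawn per segment, which I simplify away here):

```cpp
#include <cassert>
#include <vector>

// Mirrors the offset sampling in DataLayerSetUp: the video is split into
// num_segments equal chunks; within each chunk a start offset is drawn in
// [0, average_duration - new_length], so the new_length consecutive frames
// always stay inside that chunk.
std::vector<int> SampleOffsets(int duration, int num_segments, int new_length,
                               unsigned rand_value) {
  int average_duration = duration / num_segments;
  std::vector<int> offsets;
  for (int i = 0; i < num_segments; ++i) {
    int offset = rand_value % (average_duration - new_length + 1);
    offsets.push_back(offset + i * average_duration);
  }
  return offsets;
}
```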
if (this->layer_param_.video_data_param().modality() == VideoDataParameter_Modality_FLOW)
CHECK(ReadSegmentFlowToDatum(lines_[lines_id_].first, lines_[lines_id_].second,
offsets, new_height, new_width, new_length, &datum, name_pattern_.c_str()));
else
CHECK(ReadSegmentRGBToDatum(lines_[lines_id_].first, lines_[lines_id_].second,
offsets, new_height, new_width, new_length, &datum, true, name_pattern_.c_str()));
Here are the two key readers; the details follow:
Open /src/caffe/util/io.cpp; first, ReadSegmentRGBToDatum:
bool ReadSegmentRGBToDatum(const string& filename, const int label,
const vector<int> offsets, const int height, const int width, const int length, Datum* datum, bool is_color,
const char* name_pattern ){
cv::Mat cv_img;
string* datum_string;
char tmp[30];
int cv_read_flag = (is_color ? CV_LOAD_IMAGE_COLOR :
CV_LOAD_IMAGE_GRAYSCALE);
for (int i = 0; i < offsets.size(); ++i){
int offset = offsets[i];
for (int file_id = 1; file_id < length+1; ++file_id){
sprintf(tmp, name_pattern, int(file_id+offset));
string filename_t = filename + "/" + tmp;
cv::Mat cv_img_origin = cv::imread(filename_t, cv_read_flag);
if (!cv_img_origin.data){
LOG(ERROR) << "Could not load file " << filename;
return false;
}
if (height > 0 && width > 0){
cv::resize(cv_img_origin, cv_img, cv::Size(width, height));
}else{
cv_img = cv_img_origin;
}
int num_channels = (is_color ? 3 : 1);
if (file_id==1 && i==0){
datum->set_channels(num_channels*length*offsets.size());
datum->set_height(cv_img.rows);
datum->set_width(cv_img.cols);
datum->set_label(label);
datum->clear_data();
datum->clear_float_data();
datum_string = datum->mutable_data();
}
if (is_color) {
for (int c = 0; c < num_channels; ++c) {
for (int h = 0; h < cv_img.rows; ++h) {
for (int w = 0; w < cv_img.cols; ++w) {
datum_string->push_back(
static_cast<char>(cv_img.at<cv::Vec3b>(h, w)[c]));
}
}
}
} else { // Faster than repeatedly testing is_color for each pixel w/i loop
for (int h = 0; h < cv_img.rows; ++h) {
for (int w = 0; w < cv_img.cols; ++w) {
datum_string->push_back(
static_cast<char>(cv_img.at<uchar>(h, w)));
}
}
}
}
}
return true;
}
To summarize this block: for each segment offset it reads length consecutive frames, optionally resizes them to width x height, and on the very first frame sets the datum's shape, with channels = num_channels * length * offsets.size(). It then appends the pixels to the datum's data string, one channel plane at a time.
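The resulting Datum layout is worth spelling out; here is a sketch of the shape and indexing arithmetic implied by the append order above:

```cpp
#include <cassert>

// Total channel count written into datum->set_channels(...):
// one (B,G,R) triple per frame, `length` frames per segment,
// one offset per segment.
int RgbDatumChannels(int length, int num_segments, bool is_color) {
  return (is_color ? 3 : 1) * length * num_segments;
}

// Flat index of one pixel inside the Datum's data string, following the
// append order in ReadSegmentRGBToDatum: segments outermost, then frames,
// then colour channel, then rows, then columns.
int DatumIndex(int seg, int frame, int c, int h, int w,
               int length, int channels, int rows, int cols) {
  return (((seg * length + frame) * channels + c) * rows + h) * cols + w;
}
```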
bool ReadSegmentFlowToDatum(const string& filename, const int label,
const vector<int> offsets, const int height, const int width, const int length, Datum* datum,
const char* name_pattern ){
cv::Mat cv_img_x, cv_img_y;
string* datum_string;
char tmp[30];
for (int i = 0; i < offsets.size(); ++i){
int offset = offsets[i];
for (int file_id = 1; file_id < length+1; ++file_id){
sprintf(tmp,name_pattern, 'x', int(file_id+offset));
string filename_x = filename + "/" + tmp;
cv::Mat cv_img_origin_x = cv::imread(filename_x, CV_LOAD_IMAGE_GRAYSCALE);
sprintf(tmp, name_pattern, 'y', int(file_id+offset));
string filename_y = filename + "/" + tmp;
cv::Mat cv_img_origin_y = cv::imread(filename_y, CV_LOAD_IMAGE_GRAYSCALE);
if (!cv_img_origin_x.data || !cv_img_origin_y.data){
LOG(ERROR) << "Could not load file " << filename_x << " or " << filename_y;
return false;
}
if (height > 0 && width > 0){
cv::resize(cv_img_origin_x, cv_img_x, cv::Size(width, height));
cv::resize(cv_img_origin_y, cv_img_y, cv::Size(width, height));
}else{
cv_img_x = cv_img_origin_x;
cv_img_y = cv_img_origin_y;
}
if (file_id==1 && i==0){
int num_channels = 2;
datum->set_channels(num_channels*length*offsets.size());
datum->set_height(cv_img_x.rows);
datum->set_width(cv_img_x.cols);
datum->set_label(label);
datum->clear_data();
datum->clear_float_data();
datum_string = datum->mutable_data();
}
for (int h = 0; h < cv_img_x.rows; ++h){
for (int w = 0; w < cv_img_x.cols; ++w){
datum_string->push_back(static_cast<char>(cv_img_x.at<uchar>(h,w)));
}
}
for (int h = 0; h < cv_img_y.rows; ++h){
for (int w = 0; w < cv_img_y.cols; ++w){
datum_string->push_back(static_cast<char>(cv_img_y.at<uchar>(h,w)));
}
}
}
}
return true;
}
Same idea as the RGB reader, except each frame contributes two grayscale planes, the flow_x image followed by the flow_y image, so the channel count is 2 * length * offsets.size(). Back in DataLayerSetUp, the top blobs are then shaped:
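The flow Datum's channel layout, implied by the append order above, in a small sketch:

```cpp
#include <cassert>

// ReadSegmentFlowToDatum writes, for every frame, the whole flow_x plane
// followed by the whole flow_y plane, so the channel order is
// [x_1, y_1, x_2, y_2, ...] and the total is 2 * length * num_segments.
int FlowDatumChannels(int length, int num_segments) {
  return 2 * length * num_segments;
}

// Channel index of a given frame's x or y plane.
int FlowChannelIndex(int seg, int frame, bool is_y, int length) {
  return (seg * length + frame) * 2 + (is_y ? 1 : 0);
}
```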
const int crop_size = this->layer_param_.transform_param().crop_size();
const int batch_size = this->layer_param_.video_data_param().batch_size();
if (crop_size > 0){
top[0]->Reshape(batch_size, datum.channels(), crop_size, crop_size);
this->prefetch_data_.Reshape(batch_size, datum.channels(), crop_size, crop_size);
} else {
top[0]->Reshape(batch_size, datum.channels(), datum.height(), datum.width());
this->prefetch_data_.Reshape(batch_size, datum.channels(), datum.height(), datum.width());
}
LOG(INFO) << "output data size: " << top[0]->num() << "," << top[0]->channels() << "," << top[0]->height() << "," << top[0]->width();
top[1]->Reshape(batch_size, 1, 1, 1);
this->prefetch_label_.Reshape(batch_size, 1, 1, 1);
vector<int> top_shape = this->data_transformer_->InferBlobShape(datum);
this->transformed_data_.Reshape(top_shape);
}
template <typename Dtype>
void VideoDataLayer<Dtype>::InternalThreadEntry(){
Datum datum;
CHECK(this->prefetch_data_.count());
Dtype* top_data = this->prefetch_data_.mutable_cpu_data();
Dtype* top_label = this->prefetch_label_.mutable_cpu_data();
VideoDataParameter video_data_param = this->layer_param_.video_data_param();
const int batch_size = video_data_param.batch_size();
const int new_height = video_data_param.new_height();
const int new_width = video_data_param.new_width();
const int new_length = video_data_param.new_length();
const int num_segments = video_data_param.num_segments();
const int lines_size = lines_.size();
for (int item_id = 0; item_id < batch_size; ++item_id){
CHECK_GT(lines_size, lines_id_);
vector<int> offsets;
int average_duration = (int) lines_duration_[lines_id_] / num_segments;
for (int i = 0; i < num_segments; ++i){
if (this->phase_==TRAIN){
if (average_duration >= new_length){
caffe::rng_t* frame_rng = static_cast<caffe::rng_t*>(frame_prefetch_rng_->generator());
int offset = (*frame_rng)() % (average_duration - new_length + 1);
offsets.push_back(offset+i*average_duration);
} else {
offsets.push_back(1);
}
} else{
if (average_duration >= new_length)
offsets.push_back(int((average_duration-new_length+1)/2 + i*average_duration));
else
offsets.push_back(1);
}
}
if (this->layer_param_.video_data_param().modality() == VideoDataParameter_Modality_FLOW){
if(!ReadSegmentFlowToDatum(lines_[lines_id_].first, lines_[lines_id_].second,
offsets, new_height, new_width, new_length, &datum, name_pattern_.c_str())) {
continue;
}
} else{
if(!ReadSegmentRGBToDatum(lines_[lines_id_].first, lines_[lines_id_].second,
offsets, new_height, new_width, new_length, &datum, true, name_pattern_.c_str())) {
continue;
}
}
int offset1 = this->prefetch_data_.offset(item_id);
this->transformed_data_.set_cpu_data(top_data + offset1);
this->data_transformer_->Transform(datum, &(this->transformed_data_));
top_label[item_id] = lines_[lines_id_].second;
//LOG()
//next iteration
lines_id_++;
if (lines_id_ >= lines_size) {
DLOG(INFO) << "Restarting data prefetching from start.";
lines_id_ = 0;
if(this->layer_param_.video_data_param().shuffle()){
ShuffleVideos();
}
}
}
}
INSTANTIATE_CLASS(VideoDataLayer);
REGISTER_LAYER_CLASS(VideoData);
}
This is essentially the setup logic again, run as the prefetch loop: for each item in the batch it computes offsets (random within each segment during TRAIN, the segment centre during TEST), reads the datum via the flow or RGB reader, transforms it into the prefetch buffer, records the label, and advances the cursor, reshuffling at the end of each epoch.
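The test-phase branch picks deterministic, centred offsets rather than random ones; a sketch of that logic in isolation:

```cpp
#include <cassert>
#include <vector>

// Test-phase offsets in InternalThreadEntry: the new_length-frame window is
// centred in each segment instead of drawn at random. Videos shorter than
// new_length fall back to offset 1, as in the original code.
std::vector<int> CenterOffsets(int duration, int num_segments, int new_length) {
  int average_duration = duration / num_segments;
  std::vector<int> offsets;
  for (int i = 0; i < num_segments; ++i) {
    if (average_duration >= new_length)
      offsets.push_back((average_duration - new_length + 1) / 2 +
                        i * average_duration);
    else
      offsets.push_back(1);
  }
  return offsets;
}
```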
Summary:
Honestly I'm still pretty green at this; lately it feels like I've just been coasting, with long stretches of no progress, which is a headache. I'm writing this down so it's easier to look back on later, and I hope my small bits of understanding can help someone else. That's all; I've rambled on again.