Analysis
I'm writing this article to understand how Caffe's data-reading layer is designed when the input is video rather than still images.
Intuition
For an input layer that reads video frames instead of images, my gut feeling was that the design shouldn't differ much, since a video is ultimately just a sequence of frames; I just had no idea how to handle it. The most brute-force approach I could imagine: save each frame as a single-channel image, then merge those single-channel images into one multi-channel image, so a video essentially becomes a single image with many channels. But that doesn't seem so simple in practice, and if the network took every frame as input, the compute cost would be staggering. That is a real problem, but honestly the more important reason is: I couldn't code it up myself.
So let's lean on prior work and see how others implemented it:
1. Open caffe.proto and look at the VideoDataParameter message:
message VideoDataParameter{
// Specify the data source.
optional string source = 1;
// Specify the batch size.
optional uint32 batch_size = 4;
// The rand_skip variable is for the data layer to skip a few data points
// to avoid all asynchronous sgd clients to start at the same point. The skip
// point would be set as rand_skip * rand(0,1). Note that rand_skip should not
// be larger than the number of keys in the leveldb.
optional uint32 rand_skip = 7 [default = 0];
// Whether or not ImageLayer should shuffle the list of files at every epoch.
optional bool shuffle = 8 [default = false];
// It will also resize images if new_height or new_width are not zero.
optional uint32 new_height = 9 [default = 0];
optional uint32 new_width = 10 [default = 0];
optional uint32 new_length = 11 [default = 1];
optional uint32 num_segments = 12 [default = 1];
// DEPRECATED. See TransformationParameter. For data pre-processing, we can do
// simple scaling and subtracting the data mean, if provided. Note that the
// mean subtraction is always carried out before scaling.
optional float scale = 2 [default = 1];
optional string mean_file = 3;
// DEPRECATED. See TransformationParameter. Specify if we would like to randomly
// crop an image.
optional uint32 crop_size = 5 [default = 0];
// DEPRECATED. See TransformationParameter. Specify if we want to randomly mirror
// data.
optional bool mirror = 6 [default = false];
enum Modality {
RGB = 0;
FLOW = 1;
}
optional Modality modality = 13 [default = FLOW];
// the name pattern for frame images,
// for RGB modality it is default to "img_%04d.jpg", for FLOW "flow_x_%04d" and "flow_y_%04d"
optional string name_pattern = 14;
// The type of input
optional bool encoded = 15 [default = false];
}
Let me only cover the fields without comments. new_length is the number of stacked optical-flow frames that make up one sample. num_segments, also used for flow input, defaults to 1; it is the number of segments a video is split into, and with num_segments = 1 (which is what we actually use) one video is one sample. name_pattern is the filename format of the saved Flow and img files: the flow images computed from the RGB frames are named "flow_x_%04d.jpg" and "flow_y_%04d.jpg", while the RGB frames themselves are named "img_%04d.jpg". The data layer takes different reading paths for img and flow, as the source code will show, hence the two Modality options, RGB and FLOW.
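To make the fields concrete, here is a hypothetical layer definition using these parameters; the source path and all the values are made up for illustration:

```
layer {
  name: "data"
  type: "VideoData"
  top: "data"
  top: "label"
  video_data_param {
    source: "train_list.txt"        # each line: <video_dir> <num_frames> <label>
    batch_size: 32
    new_length: 5                   # stack 5 consecutive flow frames per sample
    num_segments: 1
    modality: FLOW
    name_pattern: "flow_%c_%04d.jpg"
    shuffle: true
  }
}
```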
2. Now let's see how the VideoDataLayer class declares its interface; open VideoDataLayer.hpp:
template <typename Dtype>
class VideoDataLayer : public BasePrefetchingDataLayer<Dtype> {
public:
explicit VideoDataLayer(const LayerParameter& param)
: BasePrefetchingDataLayer<Dtype>(param) {}
virtual ~VideoDataLayer();
virtual void DataLayerSetUp(const vector<Blob<Dtype>*>& bottom,
const vector<Blob<Dtype>*>& top);
virtual inline const char* type() const { return "VideoData"; }
virtual inline int ExactNumBottomBlobs() const { return 0; }
virtual inline int ExactNumTopBlobs() const { return 2; }
Like ImageDataLayer, VideoDataLayer inherits from BasePrefetchingDataLayer. The most important part is still the layer-specific work, done in DataLayerSetUp; I'll walk through it in the implementation. The class then declares several important members: random number generators for optical flow and for frames. Random numbers are an important concept in Caffe, used mainly for weight initialization, and they are also needed when shuffling.
protected:
shared_ptr<Caffe::RNG> prefetch_rng_;
shared_ptr<Caffe::RNG> prefetch_rng_2_;
shared_ptr<Caffe::RNG> prefetch_rng_1_;
shared_ptr<Caffe::RNG> frame_prefetch_rng_;
virtual void ShuffleVideos();
virtual void InternalThreadEntry();
prefetch_rng_1_ and prefetch_rng_2_ are the generators for flow_x and flow_y respectively, while prefetch_rng_ is there to ensure flow_x and flow_y get the same random numbers (in DataLayerSetUp below, prefetch_rng_1_ and prefetch_rng_2_ are seeded with the same seed, so they produce identical sequences). frame_prefetch_rng_ is the generator used to pick frames. The thread entry point also differs from ImageDataLayer; more on that in the implementation.
#ifdef USE_MPI
inline virtual void advance_cursor(){
lines_id_++;
if (lines_id_ >= lines_.size()) {
// We have reached the end. Restart from the first.
DLOG(INFO) << "Restarting data prefetching from start.";
lines_id_ = 0;
if (this->layer_param_.video_data_param().shuffle()) {
ShuffleVideos();
}
}
}
#endif
I'm honestly not sure about the block above either; roughly, it's a cursor (used under MPI) that advances through the video list, and at the end of every epoch the list is shuffled again, so the random order differs each time.
vector<std::pair<std::string, int> > lines_;
vector<int> lines_duration_;
int lines_id_;
string name_pattern_;
lines_ is a vector of (video path, label) pairs recording the sample-to-label mapping from the source (.txt) file; lines_.size() is the number of samples. lines_duration_ holds each video's frame count, lines_id_ is the current sample index, and name_pattern_ stores the frame filename pattern discussed above.
3. VideoDataLayer.cpp
namespace caffe{
template <typename Dtype>
VideoDataLayer<Dtype>:: ~VideoDataLayer<Dtype>(){
this->JoinPrefetchThread();
}
I haven't looked into this closely; the destructor waits for the prefetch thread to finish (JoinPrefetchThread) before the layer is torn down.
template <typename Dtype>
void VideoDataLayer<Dtype>:: DataLayerSetUp(const vector<Blob<Dtype>*>& bottom, const vector<Blob<Dtype>*>& top){
const int new_height = this->layer_param_.video_data_param().new_height();
const int new_width = this->layer_param_.video_data_param().new_width();
const int new_length = this->layer_param_.video_data_param().new_length();
const int num_segments = this->layer_param_.video_data_param().num_segments();
const string& source = this->layer_param_.video_data_param().source();
LOG(INFO) << "Opening file: " << source;
std:: ifstream infile(source.c_str());
string filename;
int label;
int length;
while (infile >> filename >> length >> label){
lines_.push_back(std::make_pair(filename,label));
lines_duration_.push_back(length);
}
new_length is the number of stacked optical-flow frames, the L in the paper. num_segments defaults to 1, meaning one snippet is sampled per video. std::ifstream reads a file from disk; infile is the stream object, constructed with the file name. Each line of the source file supplies a video directory, its frame count, and its label.
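The expected format of the source list can be read straight off the parse loop above: one video per line, as directory, frame count, and label. A minimal sketch of the same loop, tested against a made-up two-line list:

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Mirrors the parse loop in DataLayerSetUp: each line of the source
// list file is "<video_dir> <num_frames> <label>".
struct VideoList {
  std::vector<std::pair<std::string, int> > lines;  // (path, label)
  std::vector<int> durations;                       // frame count per video
};

VideoList ParseSource(std::istream& infile) {
  VideoList list;
  std::string filename;
  int length, label;
  while (infile >> filename >> length >> label) {
    list.lines.push_back(std::make_pair(filename, label));
    list.durations.push_back(length);
  }
  return list;
}
```

The video names here are invented; a real list would point at the extracted-frame directories.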
if (this->layer_param_.video_data_param().shuffle()){
const unsigned int prefectch_rng_seed = caffe_rng_rand();
prefetch_rng_1_.reset(new Caffe::RNG(prefectch_rng_seed));
prefetch_rng_2_.reset(new Caffe::RNG(prefectch_rng_seed));
ShuffleVideos();
}
LOG(INFO) << "A total of " << lines_.size() << " videos.";
Obviously, the RNG-related code here is there for shuffling.
if (this->layer_param_.video_data_param().name_pattern() == ""){
if (this->layer_param_.video_data_param().modality() == VideoDataParameter_Modality_RGB){
name_pattern_ = "image_%04d.jpg";
}else if (this->layer_param_.video_data_param().modality() == VideoDataParameter_Modality_FLOW){
name_pattern_ = "flow_%c_%04d.jpg";
}
}else{
name_pattern_ = this->layer_param_.video_data_param().name_pattern();
}
This part determines the filename pattern. The defaults in the code are image_%04d.jpg for frames and flow_%c_%04d.jpg for flow, where %c is later filled with 'x' or 'y', giving flow_x_%04d.jpg and flow_y_%04d.jpg.
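How the readers below expand these patterns with sprintf can be sketched like this (the frame numbers are arbitrary):

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Expand the RGB pattern: frame index -> e.g. "img_0012.jpg"
std::string RgbName(const char* pattern, int frame) {
  char tmp[64];
  std::snprintf(tmp, sizeof(tmp), pattern, frame);
  return std::string(tmp);
}

// Expand the flow pattern: axis ('x' or 'y') plus frame index
// -> e.g. "flow_x_0012.jpg"
std::string FlowName(const char* pattern, char axis, int frame) {
  char tmp[64];
  std::snprintf(tmp, sizeof(tmp), pattern, axis, frame);
  return std::string(tmp);
}
```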
Datum datum;
const unsigned int frame_prefectch_rng_seed = caffe_rng_rand();
frame_prefetch_rng_.reset(new Caffe::RNG(frame_prefectch_rng_seed));
int average_duration = (int) lines_duration_[lines_id_]/num_segments;
vector<int> offsets;
for (int i = 0; i < num_segments; ++i){
caffe::rng_t* frame_rng = static_cast<caffe::rng_t*>(frame_prefetch_rng_->generator());
int offset = (*frame_rng)() % (average_duration - new_length + 1);
offsets.push_back(offset+i*average_duration);
}
offsets is a vector I puzzled over for a long time; it is defined with no explanation at all. It holds one starting frame per segment. average_duration = duration / num_segments is the length of each segment; since num_segments = 1 here, average_duration is simply the earlier length, i.e. the video's frame count, and the loop body runs only once, so a single snippet is taken per video. For each segment i, a random offset is drawn in [0, average_duration - new_length], which guarantees that new_length consecutive frames fit inside the segment, and offset + i * average_duration shifts it into segment i, effectively jumping from one segment to the next, before being pushed into offsets.
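The offset arithmetic can be isolated into a small sketch; `rand_value` is a stand-in for the draw from Caffe's RNG (in the real code a fresh value is drawn per segment, which I simplify away here):

```cpp
#include <cassert>
#include <vector>

// Mirrors the offset sampling in DataLayerSetUp: the video is split into
// num_segments equal chunks; within each chunk a start offset is drawn in
// [0, average_duration - new_length], so the new_length consecutive frames
// always stay inside that chunk.
std::vector<int> SampleOffsets(int duration, int num_segments, int new_length,
                               unsigned rand_value) {
  int average_duration = duration / num_segments;
  std::vector<int> offsets;
  for (int i = 0; i < num_segments; ++i) {
    int offset = rand_value % (average_duration - new_length + 1);
    offsets.push_back(offset + i * average_duration);
  }
  return offsets;
}
```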
if (this->layer_param_.video_data_param().modality() == VideoDataParameter_Modality_FLOW)
CHECK(ReadSegmentFlowToDatum(lines_[lines_id_].first, lines_[lines_id_].second,
offsets, new_height, new_width, new_length, &datum, name_pattern_.c_str()));
else
CHECK(ReadSegmentRGBToDatum(lines_[lines_id_].first, lines_[lines_id_].second,
offsets, new_height, new_width, new_length, &datum, true, name_pattern_.c_str()));
Here are the two key readers; the details follow:
Open /src/caffe/util/io.cpp; first, ReadSegmentRGBToDatum:
bool ReadSegmentRGBToDatum(const string& filename, const int label,
const vector<int> offsets, const int height, const int width, const int length, Datum* datum, bool is_color,
const char* name_pattern ){
cv::Mat cv_img;
string* datum_string;
char tmp[30];
int cv_read_flag = (is_color ? CV_LOAD_IMAGE_COLOR :
CV_LOAD_IMAGE_GRAYSCALE);
for (int i = 0; i < offsets.size(); ++i){
int offset = offsets[i];
for (int file_id = 1; file_id < length+1; ++file_id){
sprintf(tmp, name_pattern, int(file_id+offset));
string filename_t = filename + "/" + tmp;
cv::Mat cv_img_origin = cv::imread(filename_t, cv_read_flag);
if (!cv_img_origin.data){
LOG(ERROR) << "Could not load file " << filename;
return false;
}
if (height > 0 && width > 0){
cv::resize(cv_img_origin, cv_img, cv::Size(width, height));
}else{
cv_img = cv_img_origin;
}
int num_channels = (is_color ? 3 : 1);
if (file_id==1 && i==0){
datum->set_channels(num_channels*length*offsets.size());
datum->set_height(cv_img.rows);
datum->set_width(cv_img.cols);
datum->set_label(label);
datum->clear_data();
datum->clear_float_data();
datum_string = datum->mutable_data();
}
if (is_color) {
for (int c = 0; c < num_channels; ++c) {
for (int h = 0; h < cv_img.rows; ++h) {
for (int w = 0; w < cv_img.cols; ++w) {
datum_string->push_back(
static_cast<char>(cv_img.at<cv::Vec3b>(h, w)[c]));
}
}
}
} else { // Faster than repeatedly testing is_color for each pixel w/i loop
for (int h = 0; h < cv_img.rows; ++h) {
for (int w = 0; w < cv_img.cols; ++w) {
datum_string->push_back(
static_cast<char>(cv_img.at<uchar>(h, w)));
}
}
}
}
}
return true;
}
To summarize this block: for each segment offset it reads length consecutive frames, optionally resizes them to width x height, and on the very first frame sets the datum's shape, with channels = num_channels * length * offsets.size(). It then appends the pixels to the datum's data string, one channel plane at a time.
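The resulting Datum layout is worth spelling out; here is a sketch of the shape and indexing arithmetic implied by the append order above:

```cpp
#include <cassert>

// Total channel count written into datum->set_channels(...):
// one (B,G,R) triple per frame, `length` frames per segment,
// one offset per segment.
int RgbDatumChannels(int length, int num_segments, bool is_color) {
  return (is_color ? 3 : 1) * length * num_segments;
}

// Flat index of one pixel inside the Datum's data string, following the
// append order in ReadSegmentRGBToDatum: segments outermost, then frames,
// then colour channel, then rows, then columns.
int DatumIndex(int seg, int frame, int c, int h, int w,
               int length, int channels, int rows, int cols) {
  return (((seg * length + frame) * channels + c) * rows + h) * cols + w;
}
```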
bool ReadSegmentFlowToDatum(const string& filename, const int label,
const vector<int> offsets, const int height, const int width, const int length, Datum* datum,
const char* name_pattern ){
cv::Mat cv_img_x, cv_img_y;
string* datum_string;
char tmp[30];
for (int i = 0; i < offsets.size(); ++i){
int offset = offsets[i];
for (int file_id = 1; file_id < length+1; ++file_id){
sprintf(tmp,name_pattern, 'x', int(file_id+offset));
string filename_x = filename + "/" + tmp;
cv::Mat cv_img_origin_x = cv::imread(filename_x, CV_LOAD_IMAGE_GRAYSCALE);
sprintf(tmp, name_pattern, 'y', int(file_id+offset));
string filename_y = filename + "/" + tmp;
cv::Mat cv_img_origin_y = cv::imread(filename_y, CV_LOAD_IMAGE_GRAYSCALE);
if (!cv_img_origin_x.data || !cv_img_origin_y.data){
LOG(ERROR) << "Could not load file " << filename_x << " or " << filename_y;
return false;
}
if (height > 0 && width > 0){
cv::resize(cv_img_origin_x, cv_img_x, cv::Size(width, height));
cv::resize(cv_img_origin_y, cv_img_y, cv::Size(width, height));
}else{
cv_img_x = cv_img_origin_x;
cv_img_y = cv_img_origin_y;
}
if (file_id==1 && i==0){
int num_channels = 2;
datum->set_channels(num_channels*length*offsets.size());
datum->set_height(cv_img_x.rows);
datum->set_width(cv_img_x.cols);
datum->set_label(label);
datum->clear_data();
datum->clear_float_data();
datum_string = datum->mutable_data();
}
for (int h = 0; h < cv_img_x.rows; ++h){
for (int w = 0; w < cv_img_x.cols; ++w){
datum_string->push_back(static_cast<char>(cv_img_x.at<uchar>(h,w)));
}
}
for (int h = 0; h < cv_img_y.rows; ++h){
for (int w = 0; w < cv_img_y.cols; ++w){
datum_string->push_back(static_cast<char>(cv_img_y.at<uchar>(h,w)));
}
}
}
}
return true;
}
Same idea as the RGB reader, except each frame contributes two grayscale planes, the flow_x image followed by the flow_y image, so the channel count is 2 * length * offsets.size(). Back in DataLayerSetUp, the top blobs are then shaped:
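The flow Datum's channel layout, implied by the append order above, in a small sketch:

```cpp
#include <cassert>

// ReadSegmentFlowToDatum writes, for every frame, the whole flow_x plane
// followed by the whole flow_y plane, so the channel order is
// [x_1, y_1, x_2, y_2, ...] and the total is 2 * length * num_segments.
int FlowDatumChannels(int length, int num_segments) {
  return 2 * length * num_segments;
}

// Channel index of a given frame's x or y plane.
int FlowChannelIndex(int seg, int frame, bool is_y, int length) {
  return (seg * length + frame) * 2 + (is_y ? 1 : 0);
}
```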
const int crop_size = this->layer_param_.transform_param().crop_size();
const int batch_size = this->layer_param_.video_data_param().batch_size();
if (crop_size > 0){
top[0]->Reshape(batch_size, datum.channels(), crop_size, crop_size);
this->prefetch_data_.Reshape(batch_size, datum.channels(), crop_size, crop_size);
} else {
top[0]->Reshape(batch_size, datum.channels(), datum.height(), datum.width());
this->prefetch_data_.Reshape(batch_size, datum.channels(), datum.height(), datum.width());
}
LOG(INFO) << "output data size: " << top[0]->num() << "," << top[0]->channels() << "," << top[0]->height() << "," << top[0]->width();
top[1]->Reshape(batch_size, 1, 1, 1);
this->prefetch_label_.Reshape(batch_size, 1, 1, 1);
vector<int> top_shape = this->data_transformer_->InferBlobShape(datum);
this->transformed_data_.Reshape(top_shape);
}
template <typename Dtype>
void VideoDataLayer<Dtype>::InternalThreadEntry(){
Datum datum;
CHECK(this->prefetch_data_.count());
Dtype* top_data = this->prefetch_data_.mutable_cpu_data();
Dtype* top_label = this->prefetch_label_.mutable_cpu_data();
VideoDataParameter video_data_param = this->layer_param_.video_data_param();
const int batch_size = video_data_param.batch_size();
const int new_height = video_data_param.new_height();
const int new_width = video_data_param.new_width();
const int new_length = video_data_param.new_length();
const int num_segments = video_data_param.num_segments();
const int lines_size = lines_.size();
for (int item_id = 0; item_id < batch_size; ++item_id){
CHECK_GT(lines_size, lines_id_);
vector<int> offsets;
int average_duration = (int) lines_duration_[lines_id_] / num_segments;
for (int i = 0; i < num_segments; ++i){
if (this->phase_==TRAIN){
if (average_duration >= new_length){
caffe::rng_t* frame_rng = static_cast<caffe::rng_t*>(frame_prefetch_rng_->generator());
int offset = (*frame_rng)() % (average_duration - new_length + 1);
offsets.push_back(offset+i*average_duration);
} else {
offsets.push_back(1);
}
} else{
if (average_duration >= new_length)
offsets.push_back(int((average_duration-new_length+1)/2 + i*average_duration));
else
offsets.push_back(1);
}
}
if (this->layer_param_.video_data_param().modality() == VideoDataParameter_Modality_FLOW){
if(!ReadSegmentFlowToDatum(lines_[lines_id_].first, lines_[lines_id_].second,
offsets, new_height, new_width, new_length, &datum, name_pattern_.c_str())) {
continue;
}
} else{
if(!ReadSegmentRGBToDatum(lines_[lines_id_].first, lines_[lines_id_].second,
offsets, new_height, new_width, new_length, &datum, true, name_pattern_.c_str())) {
continue;
}
}
int offset1 = this->prefetch_data_.offset(item_id);
this->transformed_data_.set_cpu_data(top_data + offset1);
this->data_transformer_->Transform(datum, &(this->transformed_data_));
top_label[item_id] = lines_[lines_id_].second;
//LOG()
//next iteration
lines_id_++;
if (lines_id_ >= lines_size) {
DLOG(INFO) << "Restarting data prefetching from start.";
lines_id_ = 0;
if(this->layer_param_.video_data_param().shuffle()){
ShuffleVideos();
}
}
}
}
INSTANTIATE_CLASS(VideoDataLayer);
REGISTER_LAYER_CLASS(VideoData);
}
This is essentially the setup logic again, run as the prefetch loop: for each item in the batch it computes offsets (random within each segment during TRAIN, the segment centre during TEST), reads the datum via the flow or RGB reader, transforms it into the prefetch buffer, records the label, and advances the cursor, reshuffling at the end of each epoch.
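The test-phase branch picks deterministic, centred offsets rather than random ones; a sketch of that logic in isolation:

```cpp
#include <cassert>
#include <vector>

// Test-phase offsets in InternalThreadEntry: the new_length-frame window is
// centred in each segment instead of drawn at random. Videos shorter than
// new_length fall back to offset 1, as in the original code.
std::vector<int> CenterOffsets(int duration, int num_segments, int new_length) {
  int average_duration = duration / num_segments;
  std::vector<int> offsets;
  for (int i = 0; i < num_segments; ++i) {
    if (average_duration >= new_length)
      offsets.push_back((average_duration - new_length + 1) / 2 +
                        i * average_duration);
    else
      offsets.push_back(1);
  }
  return offsets;
}
```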
Summary:
Honestly I'm still pretty green at this; lately it feels like I've just been coasting, with long stretches of no progress, which is a headache. I'm writing this down so it's easier to look back on later, and I hope my small bits of understanding can help someone else. That's all; I've rambled on again.