CAFFE Source Code Study Notes: Filler Initialization

王里扬洛夫

Article link: https://blog.csdn.net/sinat_22336563/article/details/70755791

I. Preface
Why is initialization so important in CNNs?

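In short, it is because the network is deeper. In shallow learning, such as logistic regression or SVM, the problem ultimately reduces to convex optimization with a single global optimum, so the initial parameters can simply be set to 0 without harm: following the gradient will eventually lead to the neighborhood of the minimum.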

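CNNs are different:
1. A multi-layer neural network with nonlinear activation functions yields a non-convex objective, which means it has multiple local minima.

2. With sigmoid-type activations, the effect accumulated over many layers leads to problems such as vanishing gradients; with activations such as ReLU, the outputs are not squashed into a bounded range, so the variance of the data can grow too large or shrink too small as the number of layers increases.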

filler.hpp provides 7 weight initialization methods: constant, uniform, gaussian, positive_unitball, xavier, msra, and bilinear.

II. Constant initialization

Constant initialization is mainly used to initialize biases.

1. Parameters

  optional string type = 1 [default = 'constant'];
  optional float value = 2 [default = 0]; // the value in constant filler

2. Source code

/// Fills a blob (weights or biases) with a constant value; the default is 0.
template <typename Dtype>
class ConstantFiller : public Filler<Dtype> {
 public:
  explicit ConstantFiller(const FillerParameter& param)
      : Filler<Dtype>(param) {}
  virtual void Fill(Blob<Dtype>* blob) {
    Dtype* data = blob->mutable_cpu_data();
    const int count = blob->count();  // total number of elements in the blob
    const Dtype value = this->filler_param_.value();
    CHECK(count);
    for (int i = 0; i < count; ++i) {
      data[i] = value;
    }
    CHECK_EQ(this->filler_param_.sparse(), -1)
         << "Sparsity not supported by this Filler.";
  }
};

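III. Uniform initialization (uniform)

A random variable following the uniform distribution U(a,b) has expectation E(X)=(a+b)/2 and variance D(X)=(b-a)²/12.

Assume each weight $w_i$ follows $U(-\frac{1}{\sqrt{d}},\frac{1}{\sqrt{d}})$, where d is the number of inputs. Then

$Var(w_i) = (\frac{2}{\sqrt{d}})^2/12=\frac{1}{3d}$

If the inputs $x_i$ are independent of the weights and satisfy $E(x_i^2)=1$, then

$Var(\sum_{i=1}^{d} w_i x_i) = d*Var(w_i)= \frac{1}{3}$

In the end, the weighted sum has mean 0 and variance 1/3, and for large d it is approximately normally distributed.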
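The variance argument above can be checked numerically. The following is a minimal sketch, not part of Caffe: the helper names `SampleVariance` and `SimulateWeightedSums` are hypothetical, and the choice of standard normal inputs (so that $E(x_i^2)=1$) is an assumption made for the simulation.

```cpp
#include <cassert>
#include <cmath>
#include <random>
#include <vector>

// Sample variance of a set of values (population form, divides by N).
double SampleVariance(const std::vector<double>& v) {
  assert(!v.empty());
  double mean = 0.0;
  for (double x : v) mean += x;
  mean /= v.size();
  double var = 0.0;
  for (double x : v) var += (x - mean) * (x - mean);
  return var / v.size();
}

// Draws `trials` values of z = sum_i w_i * x_i, with
// w_i ~ U(-1/sqrt(d), 1/sqrt(d)) and x_i ~ N(0, 1).
// The unit-variance input is an assumption of this sketch.
std::vector<double> SimulateWeightedSums(int d, int trials, unsigned seed) {
  assert(d > 0 && trials > 0);
  std::mt19937 rng(seed);
  const double bound = 1.0 / std::sqrt(static_cast<double>(d));
  std::uniform_real_distribution<double> weight(-bound, bound);
  std::normal_distribution<double> input(0.0, 1.0);
  std::vector<double> z(trials, 0.0);
  for (int t = 0; t < trials; ++t) {
    for (int i = 0; i < d; ++i) z[t] += weight(rng) * input(rng);
  }
  return z;
}
```

Calling `SampleVariance(SimulateWeightedSums(64, 20000, 42u))` should give a value close to 1/3, regardless of the particular d chosen.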

1. Parameters

  optional float min = 3 [default = 0]; // the min value in uniform filler
  optional float max = 4 [default = 1]; // the max value in uniform filler

2. Source code

template <typename Dtype>
class UniformFiller : public Filler<Dtype> {
 public:
  explicit UniformFiller(const FillerParameter& param)
      : Filler<Dtype>(param) {}
  virtual void Fill(Blob<Dtype>* blob) {
    CHECK(blob->count());
    caffe_rng_uniform<Dtype>(blob->count(), Dtype(this->filler_param_.min()),
        Dtype(this->filler_param_.max()), blob->mutable_cpu_data());
    CHECK_EQ(this->filler_param_.sparse(), -1)
         << "Sparsity not supported by this Filler.";
  }
};

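The key is the caffe_rng_uniform function. The listing below is its GPU counterpart, the float specialization of caffe_gpu_rng_uniform: it draws uniform samples with cuRAND, scales them by b - a, and then shifts them by a.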

template <>
void caffe_gpu_rng_uniform<float>(const int n, const float a, const float b,float* r) {
  CURAND_CHECK(curandGenerateUniform(Caffe::curand_generator(), r, n));
  const float range = b - a;
  if (range != static_cast<float>(1)) {
    caffe_gpu_scal(n, range, r);
  }
  if (a != static_cast<float>(0)) {
    caffe_gpu_add_scalar(n, a, r);//r[index] += a;
  }
}
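The same scale-and-shift idea can be sketched on the CPU with the standard library only. This is an illustration under stated assumptions, not Caffe's implementation: `UniformScaleShift` is a hypothetical name, and `std::mt19937` stands in for the cuRAND generator.

```cpp
#include <cassert>
#include <random>
#include <vector>

// CPU-only sketch: draw u from the unit interval, then map it to
// a + (b - a) * u. The function name is hypothetical, not part of Caffe.
std::vector<float> UniformScaleShift(int n, float a, float b, unsigned seed) {
  assert(n >= 0 && a <= b);
  std::mt19937 rng(seed);
  std::uniform_real_distribution<float> unit(0.0f, 1.0f);
  std::vector<float> r(n);
  for (int i = 0; i < n; ++i) r[i] = unit(rng);  // base draw in the unit interval
  const float range = b - a;
  if (range != 1.0f) {
    for (float& v : r) v *= range;  // scale to width b - a
  }
  if (a != 0.0f) {
    for (float& v : r) v += a;  // shift so the lower end is a
  }
  return r;
}
```

As in the GPU version, the scale and the shift are skipped when they would be no-ops (range equal to 1, or a equal to 0).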

IV. Gaussian initialization

template <typename Dtype>
class GaussianFiller : public Filler<Dtype> {
 public:
  explicit GaussianFiller(const FillerParameter& param)
      : Filler<Dtype>(param) {}
  virtual void Fill(Blob<Dtype>* blob) {
    Dtype* data = blob->mutable_cpu_data();
    CHECK(blob->count());
    caffe_rng_gaussian<Dtype>(blob->count(), Dtype(this->filler_param_.mean()),  // mean
        Dtype(this->filler_param_.std()), blob->mutable_cpu_data());  // standard deviation (not variance)
    int sparse = this->filler_param_.sparse();
    CHECK_GE(sparse, -1);
    if (sparse >= 0) {  // the Gaussian filler supports sparsification
      // Sparsification applies to weight blobs.

      CHECK_GE(blob->num_axes(), 1);
      const int num_outputs = blob->shape(0);  // size of the first axis (number of outputs)
      Dtype non_zero_probability = Dtype(sparse) / Dtype(num_outputs);  // probability that an element stays non-zero
      rand_vec_.reset(new SyncedMemory(blob->count() * sizeof(int)));
      int* mask = reinterpret_cast<int*>(rand_vec_->mutable_cpu_data());
      caffe_rng_bernoulli(blob->count(), non_zero_probability, mask);  // 0/1 mask used to zero out elements
      for (int i = 0; i < blob->count(); ++i) {
        data[i] *= mask[i];
      }
    }
  }

 protected:
  shared_ptr<SyncedMemory> rand_vec_;
};
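The Gaussian fill followed by Bernoulli masking can be sketched without Caffe as follows. This is a minimal illustration: `SparseGaussianFill` and its parameters are hypothetical names, with `sparse` playing the role of `filler_param_.sparse()` and -1 meaning that no masking is applied.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Standalone sketch of Gaussian fill plus Bernoulli masking.
// All names here are hypothetical; this is not Caffe's API.
std::vector<float> SparseGaussianFill(int num_outputs, int dim, float mean,
                                      float stddev, int sparse, unsigned seed) {
  assert(num_outputs > 0 && dim > 0 && stddev > 0.0f);
  assert(sparse >= -1 && sparse <= num_outputs);
  std::mt19937 rng(seed);
  std::normal_distribution<float> gauss(mean, stddev);
  std::vector<float> data(static_cast<std::size_t>(num_outputs) * dim);
  for (float& v : data) v = gauss(rng);
  if (sparse >= 0) {
    // Each element is kept independently with probability sparse / num_outputs.
    const double keep = static_cast<double>(sparse) / num_outputs;
    std::bernoulli_distribution mask(keep);
    for (float& v : data) {
      if (!mask(rng)) v = 0.0f;
    }
  }
  return data;
}
```

With `num_outputs = 100` and `sparse = 25`, roughly a quarter of the elements should remain non-zero.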

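V. Positive unit-ball initialization

This makes the incoming weights of each unit sum to 1: if a neuron has n inputs, the n weights are first drawn from a uniform distribution over (0, 1), and then each weight is divided by their sum.

The purpose is to keep the weights from growing too large, which would push the sigmoid function into its saturation region too early.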

template <typename Dtype>
class PositiveUnitballFiller : public Filler<Dtype> {
 public:
  explicit PositiveUnitballFiller(const FillerParameter& param)
      : Filler<Dtype>(param) {}
  virtual void Fill(Blob<Dtype>* blob) {
    Dtype* data = blob->mutable_cpu_data();
    DCHECK(blob->count());
    caffe_rng_uniform<Dtype>(blob->count(), 0, 1, blob->mutable_cpu_data());  // first fill with U(0, 1) draws
    int dim = blob->count() / blob->num();
    CHECK(dim);
    for (int i = 0; i < blob->num(); ++i) {
      Dtype sum = 0;
      for (int j = 0; j < dim; ++j) {
        sum += data[i * dim + j];  // accumulate the weights of unit i
      }
      for (int j = 0; j < dim; ++j) {
        data[i * dim + j] /= sum;  // divide by the sum, i.e. normalize
      }
    }
    CHECK_EQ(this->filler_param_.sparse(), -1)
         << "Sparsity not supported by this Filler.";
  }
};
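The row-wise normalization can be tried out in isolation. The sketch below assumes a row-major num x dim matrix stored in a `std::vector`; `PositiveUnitballFill` is a hypothetical helper, not Caffe's API.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Fills a num x dim row-major matrix with U(0, 1) draws, then rescales each
// row so that its entries sum to 1. Hypothetical helper, not Caffe's API.
std::vector<double> PositiveUnitballFill(int num, int dim, unsigned seed) {
  assert(num > 0 && dim > 0);
  std::mt19937 rng(seed);
  std::uniform_real_distribution<double> unit(0.0, 1.0);
  std::vector<double> data(static_cast<std::size_t>(num) * dim);
  for (double& v : data) v = unit(rng);
  for (int i = 0; i < num; ++i) {
    double sum = 0.0;
    for (int j = 0; j < dim; ++j) sum += data[i * dim + j];  // row total
    for (int j = 0; j < dim; ++j) data[i * dim + j] /= sum;  // normalize the row
  }
  return data;
}
```

After the call, every entry is non-negative and each row adds up to 1 (up to floating-point rounding), so the weights of a unit lie on the positive part of the L1 unit ball.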

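VI. Xavier initialization

If the input dimension is n and the output dimension is m, the weights are initialized from a uniform distribution over $[-\sqrt{\frac{6}{n+m}},\sqrt{\frac{6}{n+m}}]$.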

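Assume the inputs and the weights are independent, both with mean 0, and with variances $\delta_i^2$ and $\delta_w^2$ respectively.

Because $z_i=\sum_{j=1}^{n}w_{ij}*x_j$

$z_i$ has mean 0 and variance $n*\delta_i^2*\delta_w^2$.

Put simply:
$\delta_{z_i}^2 = n*\delta_i^2*\delta_w^2$

To simplify, consider only the linear regime of the nonlinearity, so the final variance is the product of the per-layer factors $n*\delta_w^2$ of all preceding layers. If every factor is greater than 1, the final variance blows up; if every factor is less than 1, the differences between the data shrink and gradient descent slows down.

To make the variance of the output equal to the variance of the input,

let $n*\delta_w^2 = 1$,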

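The forward pass involves the number of inputs $n_i$, while the backward pass involves the number of outputs $n_{i+1}$. Because the two are usually not equal, both conditions cannot hold at once, so a compromise is taken:

The variance is finally: $\frac{2}{n_i+n_{i+1}}$

For a uniform distribution over $[a,b]$, the variance is: $\frac{(b-a)^2}{12}$

Setting $a=-b$ and solving gives: $[-\sqrt{\frac{6}{n_i+n_{i+1}}},\sqrt{\frac{6}{n_i+n_{i+1}}}]$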
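Spelled out, the last step is the following short calculation, assuming the interval is symmetric about zero and written as $[-b, b]$:

```latex
\begin{aligned}
\mathrm{Var}(w) &= \frac{\bigl(b-(-b)\bigr)^2}{12} = \frac{b^2}{3} = \frac{2}{n_i+n_{i+1}} \\
\Rightarrow\quad b &= \sqrt{\frac{6}{n_i+n_{i+1}}}
\end{aligned}
```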

template <typename Dtype>
class XavierFiller : public Filler<Dtype> {
 public:
  explicit XavierFiller(const FillerParameter& param)
      : Filler<Dtype>(param) {}
  virtual void Fill(Blob<Dtype>* blob) {
    CHECK(blob->count());
    int fan_in = blob->count() / blob->num();
    int fan_out = blob->count() / blob->channels();
    Dtype n = fan_in;  // default: use the number of inputs
    if (this->filler_param_.variance_norm() ==
        FillerParameter_VarianceNorm_AVERAGE) {
      n = (fan_in + fan_out) / Dtype(2);  // use the average of the input and output counts
    } else if (this->filler_param_.variance_norm() ==
        FillerParameter_VarianceNorm_FAN_OUT) {
      n = fan_out;  // use only the number of outputs
    }
    Dtype scale = sqrt(Dtype(3) / n);
    caffe_rng_uniform<Dtype>(blob->count(), -scale, scale,
        blob->mutable_cpu_data());
    CHECK_EQ(this->filler_param_.sparse(), -1)
         << "Sparsity not supported by this Filler.";
  }
};
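Note that the code uses scale = sqrt(3 / n), so with the default setting (n = fan_in) the interval is ±sqrt(3 / fan_in), which gives a weight variance of 1 / fan_in. Only the AVERAGE setting yields the ±sqrt(6 / (fan_in + fan_out)) interval from the derivation. The sketch below makes this concrete; the enum `VarianceNorm` and the function `XavierScale` are hypothetical local stand-ins, not Caffe's generated types.

```cpp
#include <cassert>
#include <cmath>

// Mirrors the three variance_norm choices; this enum is a local stand-in for
// FillerParameter_VarianceNorm_* and is not Caffe's generated type.
enum class VarianceNorm { kFanIn, kFanOut, kAverage };

// Half-width of the uniform interval, following scale = sqrt(3 / n).
double XavierScale(int fan_in, int fan_out, VarianceNorm norm) {
  assert(fan_in > 0 && fan_out > 0);
  double n = fan_in;  // default: fan_in
  if (norm == VarianceNorm::kAverage) {
    n = (fan_in + fan_out) / 2.0;  // average of inputs and outputs
  } else if (norm == VarianceNorm::kFanOut) {
    n = fan_out;
  }
  return std::sqrt(3.0 / n);
}
```

For example, with fan_in = 300 and fan_out = 100, the AVERAGE mode gives sqrt(3 / 200), which equals sqrt(6 / 400), while the default mode gives sqrt(3 / 300).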

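From the analysis above, this method mainly accounts for the linear part of the activation function. For sigmoid it is barely adequate, but for ReLU it is not a good fit. This is just a small observation.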

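VII. MSRA initialization

When only the number of inputs n is considered, the weights are initialized from a Gaussian distribution with mean 0 and variance $\frac{2}{n}$.
The other cases are analogous to Xavier.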
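Where the factor 2 comes from can be sketched as follows. This follows the usual argument for ReLU and assumes that the pre-activation $y$ of the previous layer is symmetric about zero, that $x=\max(0,y)$, and that weights and inputs are independent with zero-mean weights:

```latex
\begin{aligned}
E(x^2) &= E\bigl(\max(0,y)^2\bigr) = \tfrac{1}{2}\,Var(y) \\
Var(z) &= n \, Var(w) \, E(x^2) = \tfrac{1}{2}\, n \, Var(w) \, Var(y) \\
Var(z) = Var(y) \;&\Longrightarrow\; \tfrac{1}{2}\, n \, Var(w) = 1
\;\Longrightarrow\; Var(w) = \frac{2}{n}
\end{aligned}
```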

template <typename Dtype>
class MSRAFiller : public Filler<Dtype> {
 public:
  explicit MSRAFiller(const FillerParameter& param)
      : Filler<Dtype>(param) {}
  virtual void Fill(Blob<Dtype>* blob) {
    CHECK(blob->count());
    int fan_in = blob->count() / blob->num();
    int fan_out = blob->count() / blob->channels();
    Dtype n = fan_in;  // default to fan_in
    if (this->filler_param_.variance_norm() ==
        FillerParameter_VarianceNorm_AVERAGE) {
      n = (fan_in + fan_out) / Dtype(2);
    } else if (this->filler_param_.variance_norm() ==
        FillerParameter_VarianceNorm_FAN_OUT) {
      n = fan_out;
    }
    Dtype std = sqrt(Dtype(2) / n);
    caffe_rng_gaussian<Dtype>(blob->count(), Dtype(0), std,
        blob->mutable_cpu_data());
    CHECK_EQ(this->filler_param_.sparse(), -1)
         << "Sparsity not supported by this Filler.";
  }
};
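The effect of the 2/n variance can be checked with a small simulation of a single layer fed by ReLU outputs. This is a sketch under stated assumptions, not Caffe code: `ReluLayerOutputVariance` is a hypothetical helper, and the pre-activations of the previous layer are assumed to be standard normal.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <random>

// Estimates Var(z) for z = sum_j w_j * relu(y_j), with y_j ~ N(0, 1) and
// w_j ~ N(0, 2 / n). Hypothetical helper for illustration only.
double ReluLayerOutputVariance(int n, int trials, unsigned seed) {
  assert(n > 0 && trials > 0);
  std::mt19937 rng(seed);
  std::normal_distribution<double> pre_act(0.0, 1.0);
  std::normal_distribution<double> weight(0.0, std::sqrt(2.0 / n));  // MSRA std
  double sum = 0.0;
  double sum_sq = 0.0;
  for (int t = 0; t < trials; ++t) {
    double z = 0.0;
    for (int j = 0; j < n; ++j) {
      const double x = std::max(0.0, pre_act(rng));  // ReLU output of the previous layer
      z += weight(rng) * x;
    }
    sum += z;
    sum_sq += z * z;
  }
  const double mean = sum / trials;
  return sum_sq / trials - mean * mean;  // population variance of z
}
```

The estimate should stay close to 1, the variance of the assumed pre-activations, which is the sense in which this initialization keeps the signal scale stable through ReLU layers.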
