Network Acceleration: MobileNetV1

Latest recommended article published on 2024-09-18 21:45:57

Iriving_shu

Category: Model Compression and Acceleration

Article link: https://blog.csdn.net/Iriving_shu/article/details/79186970

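Introduction

MobileNet is a network architecture released by Google for mobile devices. Its goal is to reduce the number of parameters and speed up inference through the design of the architecture itself. The full network is 28 layers deep. Related work in this direction includes ShuffleNet and CondenseNet.
Paper: https://arxiv.org/pdf/1704.04861.pdf
Code: https://github.com/shicai/MobileNet-Caffe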

Key Contributions

Depthwise separable convolution
Two hyperparameters that adjust the number of channels and the feature map resolution

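Depthwise Separable Convolution

The core of MobileNet's speedup is the depthwise separable convolution. In a standard convolution, every kernel convolves over all input feature maps and sums the results. For a $D_K \times D_K$ kernel, $M$ input channels, $N$ output channels, and a $D_F \times D_F$ feature map, the computational cost is:
$$D_K \cdot D_K \cdot M \cdot N \cdot D_F \cdot D_F$$
A depthwise separable convolution factorizes the standard convolution into a depthwise convolution followed by a pointwise convolution (a 1 × 1 kernel).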

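Depthwise Convolution

The depthwise convolution is shown below: each kernel is applied only to its own input feature map.
[Figure: depthwise convolution]
With $M$ channels, a $D_K \times D_K$ kernel, and a $D_F \times D_F$ feature map, its computational cost is therefore:
$$D_K \cdot D_K \cdot M \cdot D_F \cdot D_F$$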
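To make the per-channel behaviour concrete, here is a minimal CPU sketch of a depthwise convolution. It assumes stride 1, no padding, a single image in CHW layout, and a square kernel. The function name `depthwise_conv` and its parameters are illustrative and do not come from the MobileNet code.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Naive depthwise convolution (stride 1, no padding, one image).
// input:  channels x height x width, row-major (CHW)
// weight: channels x k x k, exactly one k x k filter per channel
// Returns channels x (height - k + 1) x (width - k + 1).
std::vector<float> depthwise_conv(const std::vector<float>& input,
                                  const std::vector<float>& weight,
                                  int channels, int height, int width, int k) {
    assert(input.size() == static_cast<std::size_t>(channels) * height * width);
    assert(weight.size() == static_cast<std::size_t>(channels) * k * k);
    assert(k > 0 && k <= height && k <= width);
    const int out_h = height - k + 1;
    const int out_w = width - k + 1;
    std::vector<float> output(static_cast<std::size_t>(channels) * out_h * out_w, 0.0f);
    for (int c = 0; c < channels; ++c) {          // each channel is filtered independently
        for (int oh = 0; oh < out_h; ++oh) {
            for (int ow = 0; ow < out_w; ++ow) {
                float acc = 0.0f;
                for (int kh = 0; kh < k; ++kh) {
                    for (int kw = 0; kw < k; ++kw) {
                        acc += input[(c * height + oh + kh) * width + ow + kw] *
                               weight[(c * k + kh) * k + kw];
                    }
                }
                output[(c * out_h + oh) * out_w + ow] = acc;
            }
        }
    }
    return output;
}
```

Note that no sum runs across channels: output channel c depends only on input channel c, which is why the cost has no factor for the number of output channels.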

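Pointwise Convolution

The pointwise convolution is a 1x1 convolution whose purpose is to fuse information across channels. With $M$ input channels, $N$ output channels, and a $D_F \times D_F$ feature map, its cost is:
$$M \cdot N \cdot D_F \cdot D_F$$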
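The channel-mixing step can be sketched on the CPU as follows. This is a minimal illustration that assumes a single image in CHW layout with the spatial dimensions flattened into one index. The name `pointwise_conv` and its parameters are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Naive pointwise (1x1) convolution on one image.
// input:  in_channels x spatial   (spatial = height * width, flattened)
// weight: out_channels x in_channels
// Returns out_channels x spatial.
std::vector<float> pointwise_conv(const std::vector<float>& input,
                                  const std::vector<float>& weight,
                                  int in_channels, int out_channels, int spatial) {
    assert(input.size() == static_cast<std::size_t>(in_channels) * spatial);
    assert(weight.size() == static_cast<std::size_t>(out_channels) * in_channels);
    std::vector<float> output(static_cast<std::size_t>(out_channels) * spatial, 0.0f);
    for (int n = 0; n < out_channels; ++n) {
        for (int p = 0; p < spatial; ++p) {
            float acc = 0.0f;
            // Weighted sum over all input channels at the same pixel.
            for (int m = 0; m < in_channels; ++m) {
                acc += weight[n * in_channels + m] * input[m * spatial + p];
            }
            output[n * spatial + p] = acc;
        }
    }
    return output;
}
```

Each output value is a weighted sum over all input channels at one pixel, so this step is equivalent to a matrix product between the weight matrix and the flattened feature maps.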

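The ratio of the cost of a depthwise separable convolution to that of a standard convolution is:
$$\frac{D_K \cdot D_K \cdot M \cdot D_F \cdot D_F + M \cdot N \cdot D_F \cdot D_F}{D_K \cdot D_K \cdot M \cdot N \cdot D_F \cdot D_F} = \frac{1}{N} + \frac{1}{D_K^2}$$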
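The ratio can be checked numerically with a small sketch. Here `dk` is the kernel size, `m` and `n` are the input and output channel counts, and `df` is the feature-map size. The function names are illustrative, and the counts assume stride 1.

```cpp
#include <cassert>
#include <cstdint>

// Multiply-add counts for a single layer (stride 1 assumed).
//   dk = kernel size, m = input channels, n = output channels, df = feature-map size
std::int64_t standard_conv_cost(std::int64_t dk, std::int64_t m,
                                std::int64_t n, std::int64_t df) {
    assert(dk > 0 && m > 0 && n > 0 && df > 0);
    return dk * dk * m * n * df * df;
}

std::int64_t depthwise_cost(std::int64_t dk, std::int64_t m, std::int64_t df) {
    return dk * dk * m * df * df;
}

std::int64_t pointwise_cost(std::int64_t m, std::int64_t n, std::int64_t df) {
    return m * n * df * df;
}

// Cost of depthwise + pointwise divided by the cost of the standard convolution.
double separable_to_standard_ratio(std::int64_t dk, std::int64_t m,
                                   std::int64_t n, std::int64_t df) {
    const double separable =
        static_cast<double>(depthwise_cost(dk, m, df) + pointwise_cost(m, n, df));
    return separable / static_cast<double>(standard_conv_cost(dk, m, n, df));
}
```

For a 3x3 kernel with 512 output channels the ratio is 1/512 + 1/9, roughly 0.113, so the separable form needs between 8 and 9 times fewer multiply-adds.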

Width Multiplier: Thinner Models

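To make the model thinner, a width multiplier α is introduced to scale the number of channels in every layer: the input channels go from M to αM and the output channels from N to αN. The cost becomes:
$$D_K \cdot D_K \cdot \alpha M \cdot D_F \cdot D_F + \alpha M \cdot \alpha N \cdot D_F \cdot D_F$$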
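As a rough sketch of the effect of the width multiplier, the function below evaluates both the multiply-add count and the parameter count of one depthwise separable layer for a given α. The struct and function names are illustrative, and channel counts are kept as real numbers rather than rounded to integers.

```cpp
#include <cassert>

struct LayerBudget {
    double mult_adds;  // multiply-adds of depthwise + pointwise
    double params;     // weights of depthwise + pointwise (no bias)
};

// alpha in (0, 1]: channels shrink from m to alpha*m and from n to alpha*n.
LayerBudget separable_budget_with_width(int dk, int m, int n, int df, double alpha) {
    assert(alpha > 0.0 && alpha <= 1.0);
    const double am = alpha * m;
    const double an = alpha * n;
    LayerBudget b;
    b.mult_adds = dk * dk * am * df * df + am * an * df * df;
    b.params = dk * dk * am + am * an;
    return b;
}
```

Because the pointwise term dominates and contains both αM and αN, computation and parameters both shrink by roughly α². With dk=3, m=n=512, and α=0.5, the parameter count falls from 266752 to 67840, about a quarter.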

Resolution Multiplier: Reduced Representation

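A resolution multiplier ρ is introduced to control the spatial resolution. In practice ρ is set implicitly by choosing an input size of 224, 192, 160, or 128. This reduces computation; the parameter count does not change. Combined with the width multiplier α, the cost becomes:
$$D_K \cdot D_K \cdot \alpha M \cdot \rho D_F \cdot \rho D_F + \alpha M \cdot \alpha N \cdot \rho D_F \cdot \rho D_F$$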
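The combined effect of the two multipliers on computation can be sketched as below. The function names are illustrative. `rho_for_input` assumes 224 is the baseline input size, as in the list of resolutions above.

```cpp
#include <cassert>

// Multiply-adds of one depthwise separable layer with width multiplier alpha
// and resolution multiplier rho (both in (0, 1]).
double separable_cost(int dk, int m, int n, int df, double alpha, double rho) {
    assert(alpha > 0.0 && alpha <= 1.0);
    assert(rho > 0.0 && rho <= 1.0);
    const double am = alpha * m;
    const double an = alpha * n;
    const double rdf = rho * df;  // scaled feature-map size
    return dk * dk * am * rdf * rdf + am * an * rdf * rdf;
}

// rho implied by an input resolution, relative to the assumed 224 baseline.
double rho_for_input(int input_size) {
    assert(input_size > 0 && input_size <= 224);
    return input_size / 224.0;
}
```

Both terms contain the scaled feature-map size squared, so computation scales with ρ²: halving the resolution cuts the multiply-adds to one quarter, while the weights, which do not depend on the feature-map size, stay the same.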

Experiment

Effect of Model Size

[Image: experimental results on the effect of model size]

Comparison with Other Models

[Image: experimental comparison with other models]

Network Architecture

[Image: MobileNet network architecture]

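Implementing Depthwise Separable Convolution

The conventional implementation of depthwise convolution treats it as a grouped convolution and uses a for loop that computes one convolution per group, so it is slow. The code is as follows: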
https://github.com/BVLC/caffe/blob/master/src/caffe/layers/base_conv_layer.cpp

template <typename Dtype>
void BaseConvolutionLayer<Dtype>::forward_gpu_gemm(const Dtype* input,
    const Dtype* weights, Dtype* output, bool skip_im2col) {
  const Dtype* col_buff = input;
  if (!is_1x1_) {
    if (!skip_im2col) {
      conv_im2col_gpu(input, col_buffer_.mutable_gpu_data());
    }
    col_buff = col_buffer_.gpu_data();
  }
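  // One GEMM per group. For a depthwise convolution group_ equals the number
  // of channels, so this loop issues many small GEMMs one after another.
  for (int g = 0; g < group_; ++g) {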
    caffe_gpu_gemm<Dtype>(CblasNoTrans, CblasNoTrans, conv_out_channels_ /
        group_, conv_out_spatial_dim_, kernel_dim_,
        (Dtype)1., weights + weight_offset_ * g, col_buff + col_offset_ * g,
        (Dtype)0., output + output_offset_ * g);
  }
}

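An analysis from Zhihu:
https://www.zhihu.com/question/265434464

Author: cs sun
Link: https://www.zhihu.com/question/265434464/answer/306493409
Source: Zhihu
Copyright belongs to the author. For commercial reuse, contact the author for permission; for non-commercial reuse, cite the source.
First, Caffe's original GPU implementation of group convolution is poor: it uses a for loop that computes one convolution per iteration, which is extremely slow. Second, cuDNN 7.0 and later support group convolution directly, but in my own tests it is slower than several depthwise convolution implementations on GitHub that are written directly as CUDA kernels. For example, running a convolution with n=128, c=512, h=32, w=32, group=512 100 times takes more than 4 seconds with the group convolution in cuDNN 7.0, while yonghenglh6/DepthwiseConvolution takes only about 1 second. I analyzed the theoretical complexity of depthwise convolution versus ordinary convolution, with the following examples.

Convolution 1: ordinary convolution, input 64×64×256, output 64×64×256, kernel 3×3. Parameters: 3×3×256×256 = 590K. Computation: 64×64×256×3×3×256 = 2.42G. Working-set memory (input and output data plus parameters): 64×64×256×2 + 3×3×256×256 = 2.69M.
Convolution 2: depthwise convolution, input 64×64×256, output 64×64×256, kernel 3×3. Parameters: 3×3×256 = 2.3K. Computation: 64×64×256×3×3 = 9.44M. Working-set memory: 64×64×256×2 + 3×3×256 = 2.10M.
Convolution 3: ordinary convolution, input 64×64×16, output 64×64×16, kernel 3×3. Parameters: 3×3×16×16 = 2.3K. Computation: 64×64×16×3×3×16 = 9.44M. Working-set memory: 64×64×16×2 + 3×3×16×16 = 133K.

Convolution 2 is certainly faster than convolution 1, because its computation drops to 1/256. But convolution 2 cannot actually run 256 times faster than convolution 1 (as I recall, I measured a speedup of roughly 10 times), because the working-set memory is not significantly smaller. Convolution 2 also cannot match the speed of convolution 3: although the FLOPs are the same, the working sets differ by a large factor, so the amount of computation per unit of data is much lower and it is hard to fully utilize the compute units on the GPU.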
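The three examples above can be reproduced with a short sketch. It counts elements rather than bytes and assumes the output has the same spatial size as the input (stride 1 with padding), matching the figures quoted above. The names `ConvStats`, `conv_stats`, and `arithmetic_intensity` are illustrative.

```cpp
#include <cassert>
#include <cstdint>

struct ConvStats {
    std::int64_t params;       // number of weights
    std::int64_t mult_adds;    // number of multiply-adds
    std::int64_t working_set;  // input + output + weights, in elements
};

// groups == 1 is an ordinary convolution; groups == c_in == c_out is depthwise.
ConvStats conv_stats(std::int64_t h, std::int64_t w, std::int64_t c_in,
                     std::int64_t c_out, std::int64_t k, std::int64_t groups) {
    assert(groups > 0 && c_in % groups == 0 && c_out % groups == 0);
    const std::int64_t in_per_group = c_in / groups;
    ConvStats s;
    s.params = k * k * in_per_group * c_out;
    s.mult_adds = h * w * c_out * k * k * in_per_group;
    s.working_set = h * w * c_in + h * w * c_out + s.params;
    return s;
}

// Multiply-adds per element of working set: a rough measure of compute density.
double arithmetic_intensity(const ConvStats& s) {
    return static_cast<double>(s.mult_adds) / static_cast<double>(s.working_set);
}
```

Calling `conv_stats(64, 64, 256, 256, 3, 256)` gives the figures for convolution 2, and `conv_stats(64, 64, 16, 16, 3, 1)` gives those for convolution 3. Their intensities come out to about 4.5 and about 71 multiply-adds per element, against about 899 for convolution 1, which is the gap in compute density the analysis describes.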

A more efficient implementation can be found at:
https://github.com/yonghenglh6/DepthwiseConvolution/blob/master/caffe/src/caffe/layers/depthwise_conv_layer.cu
The convolution kernel is implemented as follows:

template <typename Dtype>
__global__ void ConvForward(const int nthreads,
		const Dtype* const bottom_data, const int num, const int channels,
		const int height, const int width,const int conved_height,
		const int conved_width,const int kernel_h, const int kernel_w,
		const int stride_h, const int stride_w, const int pad_h, const int pad_w,
		Dtype* const top_data,const Dtype* const weight,const Dtype* const bias,const bool bias_term_) {
		// nthreads: total number of elements in the top blob output (N*C*H*W)
		// Iterate over the output indices of the top blob; can be read as for (int i = 0; i < nthreads; i++)
	CUDA_KERNEL_LOOP(index, nthreads) {
	
		const int pw = index % conved_width; // column index within the output feature map
		const int ph = (index / conved_width) % conved_height; // row index within the output feature map
		const int c = (index / conved_width / conved_height) % channels; // channel index within the output feature map
		const int n = index / conved_width / conved_height / channels; // batch (N) index of the feature map
		int hstart = ph * stride_h - pad_h; // start row of the window on the input feature map
		int wstart = pw * stride_w - pad_w; // start column of the window on the input feature map
		int hend = min(hstart + kernel_h, height + pad_h); // end row (exclusive) of the window on the input feature map
		int wend = min(wstart + kernel_w, width + pad_w); // end column (exclusive) of the window on the input feature map
//		const int pool_size = (hend - hstart) * (wend - wstart);
		hstart = max(hstart, 0);
		wstart = max(wstart, 0);
		hend = min(hend, height);
		wend = min(wend, width);
		Dtype aveval = 0;
		const Dtype* const bottom_slice =
		bottom_data + (n * channels + c) * height * width;
		const Dtype* const weight_slice =
		weight + c * kernel_h * kernel_w;
//		if (index==1) {
//			printf("pw%d ph%d c%d n%d \n",pw,ph,c,n);
//			printf("hstart%d wstart%d hend%d wend%d \n",hstart,wstart,hend,wend);
//		}

		int khstart=hend<kernel_h?kernel_h-hend:0;
		int kwstart=wend<kernel_w?kernel_w-wend:0;
		for (int h = hstart; h < hend; ++h) {
			for (int w = wstart; w < wend; ++w) {

				aveval += bottom_slice[h * width + w]*weight_slice[(khstart+h-hstart) * kernel_w + (kwstart+w-wstart)];
//				if (index==1) {
//					printf("pos:h%d w%d\n",h,w);
//					printf("cal:bottom%f weight%f\n",bottom_slice[h * width + w],weight_slice[(h-hstart) * kernel_w + (w-wstart)]);
//				}
			}
		}
		if(bias_term_) {
			aveval+=bias[c];
		}
		top_data[index] = aveval;
	}
}
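The index arithmetic at the top of the kernel can be tested on the CPU in isolation. The sketch below mirrors the four expressions that decode a flat output index into (n, c, ph, pw) for an NCHW layout. `OutputCoord` and `decode_nchw_index` are illustrative names, not part of the repository.

```cpp
#include <cassert>

struct OutputCoord {
    int n;   // batch index
    int c;   // channel index
    int ph;  // output row
    int pw;  // output column
};

// Decode a flat NCHW output index the same way the kernel does.
OutputCoord decode_nchw_index(int index, int channels,
                              int conved_height, int conved_width) {
    assert(index >= 0 && channels > 0 && conved_height > 0 && conved_width > 0);
    OutputCoord o;
    o.pw = index % conved_width;
    o.ph = (index / conved_width) % conved_height;
    o.c = (index / conved_width / conved_height) % channels;
    o.n = index / conved_width / conved_height / channels;
    return o;
}
```

Since each index maps to exactly one output element, and that element reads only the input and weight slices of its own channel c, every output can be computed independently, so a single kernel launch can cover all channels without a loop over groups.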