Reading darknet: Convolution

红鲤鱼绿鲤鱼

Tags: cuda, convolution

Article link: https://blog.csdn.net/qq_35688148/article/details/117483527

convolution

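The convolution is done in two steps: im2col and gemm. A plain sliding-window convolution is awkward to parallelize directly, so im2col unrolls every ksize*ksize patch of the input into one column of a matrix, which turns the convolution into a matrix multiplication. As far as I can tell, which im2col path is taken depends on whether the channel count is a multiple of 32 (?)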
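
To make the second step concrete, here is a minimal CPU sketch of the matrix product that gemm carries out once im2col has built its matrix. `gemm_ref` is a name made up for this note, not a darknet function, and the row-major layout and the shape naming (M filters, K = channels * ksize * ksize, N = height_col * width_col) are my assumptions about how the pieces fit together.

```c
/* Naive reference: C (M x N) += A (M x K) * B (K x N), all row-major.
   Assumed mapping for a convolution layer:
     A = filter weights,  M = number of filters,
     B = im2col output,   K = channels * ksize * ksize,
     C = output feature maps, N = height_col * width_col.
   gemm_ref is a hypothetical helper, not part of darknet. */
static void gemm_ref(int M, int N, int K,
                     const float *A, const float *B, float *C)
{
    for (int i = 0; i < M; ++i) {
        for (int k = 0; k < K; ++k) {
            float a = A[i * K + k];
            for (int j = 0; j < N; ++j) {
                C[i * N + j] += a * B[k * N + j];
            }
        }
    }
}
```

Each row of C is then one output feature map, flattened in row-major order. C has to be zeroed (or hold the bias) before the call, because the sketch accumulates into it.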

im2col_gpu

The multiple-of-32 case is the simpler one to follow:

    int height_col = (height + 2 * pad - ksize) / stride + 1;
    int width_col = (width + 2 * pad - ksize) / stride + 1;
    int num_kernels = channels * height_col * width_col;
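
As a quick sanity check of the three lines above, the same arithmetic can be wrapped in two small helpers. The names `conv_out_size` and `im2col_num_kernels` are invented for this note; only the formula itself comes from the darknet code.

```c
/* Number of kernel positions along one axis, using the same integer
   arithmetic as height_col / width_col above.
   conv_out_size is a hypothetical helper name. */
static int conv_out_size(int in, int ksize, int stride, int pad)
{
    return (in + 2 * pad - ksize) / stride + 1;
}

/* Amount of work for im2col: one item per
   (channel, output row, output column) triple. */
static int im2col_num_kernels(int channels, int height, int width,
                              int ksize, int stride, int pad)
{
    int height_col = conv_out_size(height, ksize, stride, pad);
    int width_col  = conv_out_size(width,  ksize, stride, pad);
    return channels * height_col * width_col;
}
```

For a 416x416 input with ksize 3, stride 1 and pad 1 the output stays 416x416, so a 3-channel image gives 3 * 416 * 416 = 519168 work items. With stride 2 the same input shrinks to 208x208.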

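This works out how many kernel positions fit along the height and along the width (easy to follow~). num_kernels is the number of work items: one per (channel, output row, output column).
BLOCK is 512.
What comes next is a very ordinary grid-stride parallel loop: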

    // n is num_kernels; each thread strides over the indices by the total thread count.
    int index = blockIdx.x*blockDim.x+threadIdx.x;
    for(; index < n; index += blockDim.x*gridDim.x){
        // Decode the flat index into (channel_in, h_out, w_out).
        int w_out = index % width_col;
        int h_index = index / width_col;
        int h_out = h_index % height_col;
        int channel_in = h_index / height_col;
        // First output row that belongs to this input channel.
        int channel_out = channel_in * ksize * ksize;
        // Top-left corner of the input window; negative inside the padding.
        int h_in = h_out * stride - pad;
        int w_in = w_out * stride - pad;
        float* data_col_ptr = data_col;
        data_col_ptr += (channel_out * height_col + h_out) * width_col + w_out;
        const float* data_im_ptr = data_im;
        data_im_ptr += (channel_in * height + h_in) * width + w_in;
        for (int i = 0; i < ksize; ++i) {
            for (int j = 0; j < ksize; ++j) {
                int h = h_in + i;
                int w = w_in + j;

                // Copy the tap, or write 0 when it falls outside the image.
                *data_col_ptr = (h >= 0 && w >= 0 && h < height && w < width) ?
                    data_im_ptr[i * width + j] : 0;

                //data_im[(channel_in * height + h_in) * width + w_in + i * width + j];
                //(*data_col_ptr) = data_im_ptr[ii * width + jj];

                // Move down one row of the output matrix for the next tap.
                data_col_ptr += height_col * width_col;
            }
        }
    }

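Because the work items are laid out in CHW order, the index is decoded from the fastest axis outwards:

first % width_col and / width_col, then % height_col and / height_col. The two remainders are w_out and h_out, and the final quotient is channel_in.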
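
The decoding step can be pulled out on its own and tried on the CPU. This is only a sketch: `col_pos` and `decode_index` are names I made up, and the body simply repeats the four lines from the kernel.

```c
/* Position decoded from a flat work-item index laid out in CHW order.
   col_pos and decode_index are hypothetical names for this note. */
typedef struct {
    int channel_in;
    int h_out;
    int w_out;
} col_pos;

static col_pos decode_index(int index, int height_col, int width_col)
{
    col_pos p;
    p.w_out = index % width_col;          /* fastest-varying axis */
    int h_index = index / width_col;      /* strip the column part */
    p.h_out = h_index % height_col;
    p.channel_in = h_index / height_col;  /* slowest-varying axis */
    return p;
}
```

With height_col = 3 and width_col = 4, index 17 decodes to channel_in = 1, h_out = 1, w_out = 1, and going the other way (1 * 3 + 1) * 4 + 1 gives 17 again.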

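The input window for an output pixel starts at h_in and w_in:

h_in = h_out * stride - pad, w_in = w_out * stride - pad

Both can be negative when pad > 0, which is why the bounds check inside the loop is needed.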

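The output channel (row) and the input channel are related by
channel_out = channel_in * ksize * ksize
because every input channel contributes ksize * ksize rows, one per tap of the window.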
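Within the ksize*ksize window each thread works serially and checks whether the tap is outside the image (I suspect this causes warp divergence; would zero-padding the input up front be better?)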
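After each tap the output pointer advances by height_col * width_col, so the result is a matrix with C * ksize * ksize rows and height_col * width_col columns (C * 9 rows for a 3x3 kernel).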
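
To check my reading of the kernel, here is a CPU version that I believe writes the same matrix. It is a sketch, not darknet code: `im2col_ref` is a made-up name, the grid-stride loop is replaced by a plain loop, and the input is indexed directly instead of through an offset pointer so nothing outside the image is ever touched.

```c
/* CPU reference for the kernel body: one outer iteration per GPU work item.
   im2col_ref is a hypothetical name; data_col must hold
   channels * ksize * ksize * height_col * width_col floats. */
static void im2col_ref(const float *data_im, int channels, int height, int width,
                       int ksize, int stride, int pad, float *data_col)
{
    int height_col = (height + 2 * pad - ksize) / stride + 1;
    int width_col  = (width + 2 * pad - ksize) / stride + 1;
    int n = channels * height_col * width_col;

    for (int index = 0; index < n; ++index) {
        int w_out = index % width_col;
        int h_index = index / width_col;
        int h_out = h_index % height_col;
        int channel_in = h_index / height_col;
        int h_in = h_out * stride - pad;
        int w_in = w_out * stride - pad;
        /* Row of the first tap for this channel, column of this output pixel. */
        int row = channel_in * ksize * ksize;
        int col = h_out * width_col + w_out;

        for (int i = 0; i < ksize; ++i) {
            for (int j = 0; j < ksize; ++j) {
                int h = h_in + i;
                int w = w_in + j;
                int inside = (h >= 0 && w >= 0 && h < height && w < width);
                data_col[row * height_col * width_col + col] =
                    inside ? data_im[(channel_in * height + h) * width + w] : 0.0f;
                ++row;  /* the next tap lands one full row further down */
            }
        }
    }
}
```

On a single-channel 3x3 image holding 1..9 with ksize 2, stride 1 and pad 0, data_col comes out as four rows of four values: 1 2 4 5, then 2 3 5 6, then 4 5 7 8, then 5 6 8 9. That is one row per tap and one column per output pixel.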

transpose_uint32_gpu

Transposes the convolution kernel (to be written up later).

get_cuda_stream

    int i = cuda_get_device();
    if (!streamInit[i]) {
        printf("Create CUDA-stream - %d \n", i);
#ifdef CUDNN
        cudaError_t status = cudaStreamCreateWithFlags(&streamsArray[i], cudaStreamNonBlocking);
#else
        cudaError_t status = cudaStreamCreate(&streamsArray[i]);
#endif
        if (status != cudaSuccess) {
            printf(" cudaStreamCreate error: %d \n", status);
            const char *s = cudaGetErrorString(status);
            printf("CUDA Error: %s\n", s);
            status = cudaStreamCreateWithFlags(&streamsArray[i], cudaStreamNonBlocking);    // cudaStreamDefault
            CHECK_CUDA(status);
        }
        streamInit[i] = 1;
    }
    return streamsArray[i];

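CUDA streams are used for speed-up: the function takes the id of the GPU currently in use and returns the stream stored at that index, creating it on the first call. When CUDNN is defined the stream is created with cudaStreamNonBlocking. If creation fails, the error is printed and creation is retried with cudaStreamNonBlocking before CHECK_CUDA.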
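
The create-once-per-device pattern is easier to see without the CUDA calls. Below is a plain C analogue, under loud assumptions: `stream_handle`, `create_stream`, `get_stream` and `MAX_DEVICES` are all stand-ins I invented, and an int takes the place of cudaStream_t.

```c
#define MAX_DEVICES 16                /* hypothetical upper bound on device ids */

typedef int stream_handle;            /* hypothetical stand-in for cudaStream_t */

static stream_handle streams[MAX_DEVICES];
static int stream_ready[MAX_DEVICES]; /* 0 until the slot has been created */
static int create_calls = 0;          /* counts how often creation really runs */

/* Hypothetical stand-in for cudaStreamCreate: hands out a fresh handle. */
static stream_handle create_stream(void)
{
    return 100 + create_calls++;
}

/* Return the cached handle for `device`, creating it on first request. */
static stream_handle get_stream(int device)
{
    if (!stream_ready[device]) {
        streams[device] = create_stream();
        stream_ready[device] = 1;
    }
    return streams[device];
}
```

Asking for device 0 twice returns the same handle and only creates it once; asking for device 1 creates a second, different one.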

gemm_nn_custom_bin_mean_transposed_tensor_kernel

When the CUDA version is 10.0 or later, the operands are converted to tensor form and multiplied with wmma.

gemm_nn_custom_bin_mean_transposed_gpu

I could not quite follow this one on a first read.
To be continued…
