※ The code corresponding to this post is here.
This post continues the previous one: here I will show how to optimize the CUDA code of separable convolution for a specified size. Since the size is known in advance, the procedures can be customized to fit it, which improves performance.
There are three metrics we should pay attention to in this optimization: memory coalescing, branching condition, and memory throughput. These three metrics help us judge how good the optimization is.
1. Memory coalescing: if the threads in a block sequentially access consecutive addresses, the accesses are said to be coalesced. High coalescing brings high transaction efficiency.
2. Branching condition: branch control requires additional overhead, on both CPU and GPU. Since the branch control unit in a GPU is much simpler than that in a CPU, branching judgements are costly.
3. Memory throughput: the more memory each thread accesses, the higher the memory throughput. We can observe this metric to evaluate whether there is room to optimize the accesses.
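To make metric 1 concrete, here is a minimal host-side sketch (plain C++, not part of any kernel in this post) that counts how many 32-byte memory segments a 32-thread warp touches for a given access stride. The warp size of 32, 4-byte floats, and 32-byte segment granularity are standard CUDA figures; the function itself is only an illustration.

```cpp
#include <cstddef>
#include <set>

// Count how many 32-byte memory segments a warp of 32 threads touches when
// thread t reads the 4-byte float at index base + t * stride.
// Fewer segments per request means better coalescing.
int segments_touched(std::size_t base, std::size_t stride) {
    std::set<std::size_t> segments;
    for (std::size_t t = 0; t < 32; ++t) {
        std::size_t byte_addr = (base + t * stride) * 4; // 4 bytes per float
        segments.insert(byte_addr / 32);                 // 32-byte segment id
    }
    return static_cast<int>(segments.size());
}
```

With a unit stride (consecutive threads, consecutive addresses) the warp's 128 bytes fit in 4 segments; with a stride of 32 floats the same request scatters across 32 segments.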
Essentially, a loop statement is a branching statement, and branching judgements are expensive. Hence, we should reduce if-statements and loop-statements as much as possible. If the iteration count can be determined statically, we can unroll the loop to remove the loop branching.
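As an illustration of what unrolling buys us, this hedged C++ sketch contrasts a rolled dot product (one branch test per iteration) with a version manually unrolled for a length fixed at 13; this mimics what `#pragma unroll` asks the compiler to generate, and the function names are made up for the example.

```cpp
// Rolled version: the loop condition is a branch evaluated n + 1 times.
float dot_rolled(const float *k, const float *x, int n) {
    float sum = 0.0f;
    for (int i = 0; i < n; ++i)
        sum += k[i] * x[i];
    return sum;
}

// Unrolled for a kernel length known at compile time (13 here): no loop
// branch remains, which is what "#pragma unroll KERNEL_LENGTH" produces.
float dot_unrolled_13(const float *k, const float *x) {
    return k[0]*x[0] + k[1]*x[1] + k[2]*x[2] + k[3]*x[3] + k[4]*x[4]
         + k[5]*x[5] + k[6]*x[6] + k[7]*x[7] + k[8]*x[8] + k[9]*x[9]
         + k[10]*x[10] + k[11]*x[11] + k[12]*x[12];
}
```

Both functions compute the same sum; only the branching disappears.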
In most cases, high memory coalescing means high throughput, making the memory accesses efficient, i.e. reducing the access overhead. One way to achieve optimal memory coalescing is to expand the procedures. This is also called loop nest optimization.
In Section 一, we will unroll the loops and reduce the number of branching judgements; in Section 二, we will expand the procedures. Both optimizations bring noteworthy improvements.
This optimization targets kernel size = 13 and output size = 1400x1400; my graphics card is a GTX 1060 3GB.
零. Review the original code.
Row function:
j = blockDim.y*blockIdx.y + threadIdx.y;
for (; j < height; j += blockDim.y * gridDim.y) {
    i = blockDim.x*blockIdx.x + threadIdx.x;
    for (; i < width; i += blockDim.x * gridDim.x) {

        int ii;
        float sum;
        sum = 0;
        ii = 0;
        do {
            p_input_in_block[threadIdx.y*shared_mem_pitch
                + ii*blockDim.x + threadIdx.x] =
                p_column_done_extended_input_dev[j*extended_width
                    + ii*blockDim.x + i];
            ii++;
        } while (threadIdx.x + ii * blockDim.x < block_width);

        __syncthreads();

        for (ii = 0; ii < kernel_length; ii++) {
            sum += kernel_const_mem[ii] * p_input_in_block[
                threadIdx.y*shared_mem_pitch + ii + threadIdx.x];
        }/*for kernel_length*/

        p_output_dev[j*width + i] = sum;
        __syncthreads();
    }/*for width*/
}/*for j*/
We can see there are 4 loops: width, height, do-while, and kernel_length. All of these loops can be unrolled.
Besides, the memory traffic handled by each thread in a block is light; we can increase the work per thread to improve memory coalescing.
一. Unroll the loops
The most obvious candidate for this optimization is the kernel_length loop. Since our purpose is to optimize the code for a specified kernel length, the variable kernel_length can simply be replaced by a macro constant. With the #pragma unroll directive, the unrolling is done automatically at compile time.
If the grid and block size exactly match the input size, i.e. grid_size x block_size equals the input size, the height and width loops can be removed.
The do-while loop in the middle can be rewritten as a for-loop containing an if-statement, and the iteration count can be pre-calculated as a macro constant. Also note that the if-statement, which prevents out-of-bounds accesses, can only be false in the last iteration. Therefore, we can peel that last iteration out of the loop to reduce the branching judgements.
The code after applying those unrollings is:
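The peeling described above can be sketched in plain C++ (the function and parameter names here are illustrative, not from the repository): every iteration except the last is provably in bounds, so only the peeled final step keeps the guard.

```cpp
#include <vector>

// steps = ceil(block_width / block_dim) is a compile-time constant in the
// real kernel. All iterations before the last are guaranteed in bounds,
// so the bounds check survives only in the peeled final step.
void copy_peeled(std::vector<int> &dst, const std::vector<int> &src,
                 int tid, int block_dim, int block_width) {
    const int steps = (block_width + block_dim - 1) / block_dim;
    for (int ii = 0; ii < steps - 1; ++ii)       // no bounds check needed
        dst[tid + ii * block_dim] = src[tid + ii * block_dim];
    const int ii = steps - 1;                    // peeled last iteration
    if (tid + ii * block_dim < block_width)      // the only guard left
        dst[tid + ii * block_dim] = src[tid + ii * block_dim];
}
```

Running this for every tid in a block copies the whole range exactly once, the same job the do-while performed with a check on every step.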
#define COLUMN_CP_STEPS \
    (KERNEL_LENGTH + Y_NUM_THREADS - 1 + (Y_NUM_THREADS - 1))/(Y_NUM_THREADS)
j = blockDim.y*blockIdx.y + threadIdx.y;
i = blockDim.x*blockIdx.x + threadIdx.x;

int jj;
float sum;

#pragma unroll (COLUMN_CP_STEPS - 1)
for (int jj = 0; jj < (COLUMN_CP_STEPS - 1); jj++) {
    p_input_in_block[(threadIdx.y + jj*blockDim.y)*shared_mem_pitch
        + threadIdx.x]
        = p_extended_input_dev[(j + jj*blockDim.y)*extended_width
            + kernel_radius + i];
}/*for COLUMN_CP_STEPS - 1*/

jj = (COLUMN_CP_STEPS - 1);
if (threadIdx.y + jj * blockDim.y < block_height) {
    p_input_in_block[(threadIdx.y + jj*blockDim.y)*shared_mem_pitch
        + threadIdx.x]
        = p_extended_input_dev[(j + jj*blockDim.y)*extended_width
            + kernel_radius + i];
}/*if last copy step*/

__syncthreads();

sum = 0;

#pragma unroll KERNEL_LENGTH
for (jj = 0; jj < KERNEL_LENGTH; jj++) {
    sum += kernel_const_mem[jj] * p_input_in_block[
        (threadIdx.y + jj)*shared_mem_pitch + threadIdx.x];
}/*for kernel*/

p_column_done_extended_output_dev[j*extended_width + kernel_radius + i]
    = sum;
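As a side note, the copy-step macros are just ceiling divisions: the block stages block_height rows of input using the block's threads in y. A hedged C++ sketch of the arithmetic (the thread count of 16 below is an assumption for illustration, not taken from the repository):

```cpp
// Copy steps needed to stage a column block of input in shared memory:
// block_height = expand * num_threads + kernel_length - 1 rows, copied
// num_threads rows at a time, i.e. ceil(block_height / num_threads).
// expand = 1 is the plain kernel; expand = 2 is the expanded one.
constexpr int cp_steps(int kernel_length, int num_threads, int expand = 1) {
    int block_height = expand * num_threads + kernel_length - 1;
    return (block_height + num_threads - 1) / num_threads;   // ceiling division
}
```

With KERNEL_LENGTH = 13 and 16 threads in y, the plain kernel needs 2 copy steps, and the expanded kernel of Section 二 needs 3.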
The grid dimensions can no longer be arbitrary; they must exactly cover the input size:
num_blocks.x = (width + (num_threads.x - 1)) / num_threads.x;
num_blocks.y = (height + (num_threads.y - 1)) / num_threads.y;
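These two lines are ceiling divisions; and since the unrolled kernel no longer range-checks, grid x block must cover the image exactly. A small C++ sketch of the sizing (the 20x20 block size is an assumption chosen so that it divides 1400; the actual block size in the repository may differ):

```cpp
// Minimal stand-in for CUDA's dim3, x/y only.
struct Dim { int x, y; };

// Grid size by ceiling division, mirroring the num_blocks computation.
Dim grid_for(Dim image, Dim threads) {
    return { (image.x + threads.x - 1) / threads.x,
             (image.y + threads.y - 1) / threads.y };
}
```

For a 1400x1400 image with 20x20 blocks this gives a 70x70 grid, and 70 * 20 == 1400 in both directions, so no thread falls outside the image.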
The runtime and the nvprof metrics are as below:
kernel size = 13 1400x1400 | Runtime for 100 rounds | Duration Column/Row | Control-Flow Instructions Column/Row | Global Store Throughput Column/Row | Global Load Throughput Column/Row |
---|---|---|---|---|---|
Non Specified | 243 ms | 461.886 / 479.101 μs | 11900000 / 11900000 | 33.906 / 24.965 GB/s | 54.794 / 51.86 GB/s |
Unrolling | 216 ms | 270.238 / 264.447 μs | 1960000 / 1960000 | 59.683 / 45.983 GB/s | 96.3 / 95.522 GB/s |
The reduction of branching improves the continuity of the procedures, which increases the memory throughput.
二. Expand the procedures.
Expanding the procedures merges the access actions and diminishes the access overhead; the increased throughput improves performance. Besides, the merged procedures further reduce the branching judgements.
The expansion is implemented by having one block of threads handle a larger block of data: here each thread block processes a 2x2 tile of the original data blocks, so each thread computes four outputs.
We expand the procedures in both the x-direction and the y-direction; the column function code is:
#define COLUMN_CP_STEPS_EXPANDING \
(KERNEL_LENGTH + 2 * Y_NUM_THREADS - 1 + (Y_NUM_THREADS - 1))/(Y_NUM_THREADS)
block_height = (2 * blockDim.y) + (kernel_length - 1);
shared_mem_pitch = 2 * blockDim.x;
shared_mem_pitch += padding;

j = 2 * blockDim.y*blockIdx.y + threadIdx.y;
i = 2 * blockDim.x*blockIdx.x + threadIdx.x;

#pragma unroll (COLUMN_CP_STEPS_EXPANDING - 1)
for (int jj = 0; jj < (COLUMN_CP_STEPS_EXPANDING - 1); jj++) {
    p_input_in_block[(threadIdx.y + jj*blockDim.y)*shared_mem_pitch
        + threadIdx.x]
        = p_extended_input_dev[(j + jj*blockDim.y)*extended_width
            + kernel_radius + i];
}/*for left half*/

#pragma unroll (COLUMN_CP_STEPS_EXPANDING - 1)
for (int jj = 0; jj < (COLUMN_CP_STEPS_EXPANDING - 1); jj++) {
    p_input_in_block[(threadIdx.y + jj*blockDim.y)*shared_mem_pitch
        + threadIdx.x + blockDim.x]
        = p_extended_input_dev[(j + jj*blockDim.y)*extended_width
            + kernel_radius + i + blockDim.x];
}/*for right half*/

jj = (COLUMN_CP_STEPS_EXPANDING - 1);
if (threadIdx.y + jj * blockDim.y < block_height) {
    p_input_in_block[(threadIdx.y + jj*blockDim.y)*shared_mem_pitch
        + threadIdx.x]
        = p_extended_input_dev[(j + jj*blockDim.y)*extended_width
            + kernel_radius + i];
    p_input_in_block[(threadIdx.y + jj*blockDim.y)*shared_mem_pitch
        + threadIdx.x + blockDim.x]
        = p_extended_input_dev[(j + jj*blockDim.y)*extended_width
            + kernel_radius + i + blockDim.x];
}/*if COLUMN_CP_STEPS_EXPANDING - 1*/

__syncthreads();

sum = 0;
#pragma unroll KERNEL_LENGTH
for (jj = 0; jj < KERNEL_LENGTH; jj++) {
    sum += kernel_const_mem[jj] * p_input_in_block[
        (threadIdx.y + jj)*shared_mem_pitch + threadIdx.x];
}/*for kernel*/
p_column_done_extended_output_dev[j*extended_width + kernel_radius + i]
    = sum;

sum = 0;
#pragma unroll KERNEL_LENGTH
for (jj = 0; jj < KERNEL_LENGTH; jj++) {
    sum += kernel_const_mem[jj] * p_input_in_block[
        (threadIdx.y + jj)*shared_mem_pitch + threadIdx.x + blockDim.x];
}/*for kernel*/
p_column_done_extended_output_dev[j*extended_width + kernel_radius
    + i + blockDim.x] = sum;

sum = 0;
#pragma unroll KERNEL_LENGTH
for (jj = 0; jj < KERNEL_LENGTH; jj++) {
    sum += kernel_const_mem[jj] * p_input_in_block[
        (threadIdx.y + jj + blockDim.y)*shared_mem_pitch + threadIdx.x];
}/*for kernel*/
p_column_done_extended_output_dev[(j + blockDim.y)*extended_width
    + kernel_radius + i] = sum;

sum = 0;
#pragma unroll KERNEL_LENGTH
for (jj = 0; jj < KERNEL_LENGTH; jj++) {
    sum += kernel_const_mem[jj] * p_input_in_block[
        (threadIdx.y + jj + blockDim.y)*shared_mem_pitch
        + threadIdx.x + blockDim.x];
}/*for kernel*/
p_column_done_extended_output_dev[(j + blockDim.y)*extended_width
    + kernel_radius + i + blockDim.x] = sum;
The required shared memory size and block number need to change once the procedures have been expanded:
block_height = (2 * num_threads.y) + (kernel_length - 1);
padding = 0;
shared_mem_size = sizeof(float)
* (2 * num_threads.x + padding) * (block_height);
num_blocks.x /= 2;
num_blocks.y /= 2;
However, the runtime gets worse.
kernel size = 13 1400x1400 | Runtime for 100 rounds |
---|---|
Non Specified | 243 ms |
Unrolling | 216 ms |
Unrolling + Expanding | 224 ms |
nvprof reveals why the expansion worsens the performance: the branching judgements are indeed reduced, but the memory throughput declines. The metric, shared memory transactions per request, indicates there is a bank conflict in the column function. This bank conflict restricts the memory throughput.
kernel size = 13 1400x1400 | Duration Column/Row | Control-Flow Instructions Column/Row | Global Load Throughput Column/Row | Shared Memory Store Transactions Per Request Column/Row | Global Store Throughput Column/Row | Shared Memory Load Transactions Per Request Column/Row |
---|---|---|---|---|---|---|
Non Specified | 461.886 / 479.101 μs | 11900000 / 11900000 | 54.794 / 51.86 GB/s | 1 / 1 | 33.906 / 24.965 GB/s | 1 / 1 |
Unrolling | 279.519 / 264.447 μs | 1960000 / 1960000 | 96.3 / 95.522 GB/s | 0.693 / 0.987 | 59.683 / 45.983 GB/s | 1 / 1 |
Unrolling + Expanding | 315.073 / 217.568 μs | 1470000 / 1470000 | 59.161 / 80.46 GB/s | 1.948 / 1 | 49.217 / 52.644 GB/s | 1.96 / 1 |
To fix the bank conflict, we can exploit the formula I mentioned in the last post:
padding = WARP_SIZE * n - (block_size + (WARP_SIZE - num_threads.x))
In this case, block_size = 2 * num_threads.x, so padding = WARP_SIZE*n - (2*num_threads.x + WARP_SIZE - num_threads.x); taking n = 2 gives WARP_SIZE - num_threads.x. Filling this value in the device caller, the bank conflict disappears.
:
padding = WARP_SIZE - num_threads.x;
:
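To see why this padding works, here is a hedged C++ model of the shared-memory banks (assuming 32 four-byte banks, and num_threads.x = 16 purely for illustration). Each thread (tx, ty) in the convolution loop reads p_input_in_block[ty * shared_mem_pitch + tx], and since CUDA linearizes threads x-first, one 32-thread warp spans 32 / num_threads.x consecutive ty rows.

```cpp
// Model one warp's shared-memory access pattern: thread (tx, ty) touches
// the float word at ty * pitch + tx; with 32 four-byte banks, word w lives
// in bank w % 32. Returns the worst-case number of threads hitting the
// same bank (1 = conflict-free, 2 = two-way conflict, ...).
int max_bank_collisions(int pitch, int num_threads_x) {
    int counts[32] = {0};
    for (int t = 0; t < 32; ++t) {       // one warp, linearized x-first
        int tx = t % num_threads_x;
        int ty = t / num_threads_x;
        ++counts[(ty * pitch + tx) % 32];
    }
    int worst = 0;
    for (int b = 0; b < 32; ++b)
        if (counts[b] > worst) worst = counts[b];
    return worst;
}
```

With the unpadded pitch 2 * 16 = 32, the warp's two half-rows land on the same 16 banks, a two-way conflict that matches the ~1.96 transactions per request measured above; with the padded pitch 32 + (32 - 16) = 48, every thread hits a distinct bank.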
After the fix, the result is:
kernel size = 13 1400x1400 | Runtime for 100 rounds |
---|---|
Non Specified | 243 ms |
Unrolling | 216 ms |
Unrolling + Expanding, bank conflict fixed | 197 ms |
nvprof shows the memory throughput has been improved, as expected:
kernel size = 13 1400x1400 | Duration Column/Row | Global Load Throughput Column/Row | Global Memory Load Efficiency Column/Row | Global Store Throughput Column/Row | Global Memory Store Efficiency Column/Row | Branches Column/Row |
---|---|---|---|---|---|---|
Non Specified | 461.886 / 479.101 μs | 54.794 / 51.86 GB/s | 66.8 / 68.1 % | 33.906 / 24.965 GB/s | 52.1 / 68.3 % | 1132500 / 1245000 |
Unrolling | 279.519 / 264.447 μs | 96.3 / 95.522 GB/s | 66.8 / 68.1 % | 55.906 / 45.983 GB/s | 52.1 / 68.3 % | 62500 / 62500 |
Unrolling + Expanding, bank conflict fixed | 203.646 / 217.568 μs | 95.696 / 80.46 GB/s | 66.7 / 69.2 % | 59.683 / 52.644 GB/s | 52.1 / 68.3 % | 32500 / 46250 |
The expanding version also benefits from the further reduction of branching judgements; its performance is clearly improved.
三. Summary
1. Avoid if-statements and loops whose iteration count is undetermined. If the iteration count of a loop is a compile-time constant, the compiler directive "#pragma unroll" can unwind the loop at compile time.
2. Expanding the procedures increases the continuity of processing, which curtails access overhead and further reduces branch conditions.
3. Bank conflicts are a critical issue when using shared memory; they severely hamper the process from reaching high performance.