Halide Lesson05: 向量化, 并行, 循环展开 以及 分块

Halide Lesson05: 向量化, 并行, 循环展开 以及 分块

注意:Halide 默认图像按列存储 column first,x为内循环,y为外循环

Func gradient("gradient");
gradient(x, y) = x + y;
gradient.trace_stores();

打印gradient的所有中间结果。

gradient.print_loop_nest();

打印gradient函数的循环过程

produce gradient:
  for y:
    for x:
      gradient(...) = ...
gradient.reorder(y, x);

调整循环顺序,将y作为内循环,x作为外循环,此时循环过程为:

Pseudo-code for the schedule:
produce gradient_col_major:
  for x:
    for y:
      gradient_col_major(...) = ...
gradient.split(x, x_outer, x_inner, 2);

x -> 拆分的维度
x_outer -> 拆分后的外层循环
x_inner-> 拆分后的内层循环
2 -> 拆分的因子,内存循环从0到factor,外层循环从0到x/factor,原来的index现在为index = outer * factor + inner

for (int y = 0; y < 4; y++) {
	for (int x_outer = 0; x_outer < 2; x_outer++) {
		for (int x_inner = 0; x_inner < 2; x_inner++) {
		    int x = x_outer * 2 + x_inner;
		    printf("Evaluating at x = %d, y = %d: %d\n", x, y, x + y);
		}
	}
}

如果拆分factor不能整除维度总数,Halide仍然可以很好的处理split,假设factor为3,x维度总数为7

gradient.split(x, x_outer, x_inner, 3);
Buffer<int> output = gradient.realize(7, 2);

此时等价循环为

for (int y = 0; y < 2; y++) {
    for (int x_outer = 0; x_outer < 3; x_outer++) { // Now runs from 0 to 2
        for (int x_inner = 0; x_inner < 3; x_inner++) {
            int x = x_outer * 3;
            // Before we add x_inner, make sure we don't
            // evaluate points outside of the 7x2 box. We'll
            // clamp x to be at most 4 (7 minus the split
            // factor).
            if (x > 4) x = 4;
            x += x_inner;
            printf("Evaluating at x = %d, y = %d: %d\n", x, y, x + y);
        }
    }
}
gradient.fuse(x, y, fused);

参数融合,将两个参数融合成一个参数

for (int fused = 0; fused < 4*4; fused++) {
    int y = fused / 4;
    int x = fused % 4;
    printf("Evaluating at x = %d, y = %d: %d\n", x, y, x + y);
}
gradient.vectorize(x_inner);
Var x_outer, x_inner;
gradient.split(x, x_outer, x_inner, 4);
gradient.vectorize(x_inner);

或者使用gradient.vectorize(x, 4);等价于上面split和vectorize的融合

循环过程为

for (int y = 0; y < 4; y++) {
    for (int x_outer = 0; x_outer < 2; x_outer++) {
        // The loop over x_inner has gone away, and has been
        // replaced by a vectorized version of the
        // expression. On x86 processors, Halide generates SSE
        // for all of this.
        int x_vec[] = {x_outer * 4 + 0,
                       x_outer * 4 + 1,
                       x_outer * 4 + 2,
                       x_outer * 4 + 3};
        int val[] = {x_vec[0] + y,
                     x_vec[1] + y,
                     x_vec[2] + y,
                     x_vec[3] + y};
        printf("Evaluating at <%d, %d, %d, %d>, <%d, %d, %d, %d>:"
               " <%d, %d, %d, %d>\n",
               x_vec[0], x_vec[1], x_vec[2], x_vec[3],
               y, y, y, y,
               val[0], val[1], val[2], val[3]);
    }
}
gradient.unroll(x_inner);

在某个维度上进行循环展开

Var x_outer, x_inner;
gradient.split(x, x_outer, x_inner, 2);
gradient.unroll(x_inner);

上面两个操作可以替换为gradient.unroll(x, 2);
循环展开为

for (int y = 0; y < 4; y++) {
    for (int x_outer = 0; x_outer < 2; x_outer++) {
        // Instead of a for loop over x_inner, we get two
        // copies of the innermost statement.
        {
            int x_inner = 0;
            int x = x_outer * 2 + x_inner;
            printf("Evaluating at x = %d, y = %d: %d\n", x, y, x + y);
        }
        {
            int x_inner = 1;
            int x = x_outer * 2 + x_inner;
            printf("Evaluating at x = %d, y = %d: %d\n", x, y, x + y);
        }
    }
}
Fusing, tiling, and parallelizing.

融合 分tile 并行三个操作

// First we'll tile, then we'll fuse the tile indices and
// parallelize across the combination.
Var x_outer, y_outer, x_inner, y_inner, tile_index;
gradient.tile(x, y, x_outer, y_outer, x_inner, y_inner, 4, 4);
gradient.fuse(x_outer, y_outer, tile_index);
gradient.parallel(tile_index);

Buffer<int> output = gradient.realize(8, 8);

可以写到一个序列中:

gradient
     .tile(x, y, x_outer, y_outer, x_inner, y_inner, 4, 4)
     .fuse(x_outer, y_outer, tile_index)
     .parallel(tile_index);

Buffer<int> output = gradient.realize(8, 8);

等价的循环为

// 仍然是列优先
for (int tile_index = 0; tile_index < 4; tile_index++) {
    int y_outer = tile_index / 2;
    int x_outer = tile_index % 2;
    for (int y_inner = 0; y_inner < 4; y_inner++) {
        for (int x_inner = 0; x_inner < 4; x_inner++) {
            int y = y_outer * 4 + y_inner;
            int x = x_outer * 4 + x_inner;
            printf("Evaluating at x = %d, y = %d: %d\n", x, y, x + y);
        }
    }
}
问题
  • vectorize的时候,如果数字不满足SIMD寄存器要求会怎么样?比如传入3 5 7个数
  • 各个操作的性能如何?
  • 0
    点赞
  • 1
    收藏
    觉得还不错? 一键收藏
  • 0
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值