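Preface
Continuing from the previous post, this article works through Halide's lesson_05_scheduling.
** Scheduling a Func in different ways **
Contents
This section covers four concepts used to speed up per-pixel image computation: vectorizing, parallelizing, unrolling, and tiling.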
row-major && column-major
// row-major (the default order: x is the innermost loop)
Var x, y;
Func gradient;
gradient(x, y) = x + y;
// column-major: reorder so that y becomes the innermost loop
gradient.reorder(y, x);
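To make the difference in traversal order concrete, here is a minimal sketch in plain C++ (no Halide needed) of the loop nests that the two schedules correspond to. The function names `row_major_order` and `column_major_order` are made up for this illustration; they only record the order in which pixels are visited.

```cpp
#include <utility>
#include <vector>

// Model of the default schedule: y is the outer loop, x the inner loop.
std::vector<std::pair<int, int>> row_major_order(int width, int height) {
    std::vector<std::pair<int, int>> visited;
    for (int y = 0; y < height; y++) {
        for (int x = 0; x < width; x++) {
            visited.push_back({x, y});
        }
    }
    return visited;
}

// Model of gradient.reorder(y, x): x is the outer loop, y the inner loop.
std::vector<std::pair<int, int>> column_major_order(int width, int height) {
    std::vector<std::pair<int, int>> visited;
    for (int x = 0; x < width; x++) {
        for (int y = 0; y < height; y++) {
            visited.push_back({x, y});
        }
    }
    return visited;
}
```

Both functions visit the same set of pixels; only the order differs, which is exactly what a schedule is allowed to change.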
split && fuse
** When split is given a factor that does not evenly divide the extent, some points are computed more than once. **
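// Break the loop over x into two nested loops.
Func gradient;
gradient(x, y) = x + y;
Var x_outer, x_inner;
gradient.split(x, x_outer, x_inner, 2); // 2 is the split factor, i.e. the extent of the inner loop
// Fuse two variables into one; the opposite of split.
// A fresh Func is used because x no longer exists on gradient after the split.
Func gradient_fused;
gradient_fused(x, y) = x + y;
Var fused;
gradient_fused.fuse(x, y, fused);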
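The repeated computation with a non-dividing split factor is easier to see in a loop model. The sketch below is plain C++ and assumes the behavior described in the Halide tutorial for this case: the outer extent is rounded up, and the last block is shifted back so it stays in range. `split_visit_counts` is a hypothetical helper, and it assumes `extent >= factor`.

```cpp
#include <algorithm>
#include <vector>

// Model of split(x, x_outer, x_inner, factor) over x in [0, extent).
// Returns how many times each x is evaluated.
std::vector<int> split_visit_counts(int extent, int factor) {
    std::vector<int> counts(extent, 0);
    int outer_extent = (extent + factor - 1) / factor;  // round up
    for (int x_outer = 0; x_outer < outer_extent; x_outer++) {
        // Shift the last block back inward so it does not run past the end.
        int x_base = std::min(x_outer * factor, extent - factor);
        for (int x_inner = 0; x_inner < factor; x_inner++) {
            counts[x_base + x_inner]++;
        }
    }
    return counts;
}
```

With an extent of 7 and a factor of 3, the three blocks start at 0, 3, and 4, so x = 4 and x = 5 are each evaluated twice. With a factor that divides the extent, every count is 1.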
tiled traversal
** Tiling can be good for performance if neighboring pixels use overlapping input data, for example in a blur. **
// Evaluate in 4x4 tiles; tile() is shorthand for two splits followed by a reorder.
Func gradient;
gradient(x, y) = x + y;
Var x_outer, x_inner, y_outer, y_inner;
gradient.tile(x, y, x_outer, y_outer, x_inner, y_inner, 4, 4); //4--split factor
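As a rough picture of what the tiled schedule does, the following plain C++ sketch writes out the four nested loops. It assumes the width and height are multiples of the tile size, and `tiled_order` is a name invented for this example.

```cpp
#include <utility>
#include <vector>

// Model of tile(x, y, x_outer, y_outer, x_inner, y_inner, tile, tile).
// Records the order in which pixels are visited.
std::vector<std::pair<int, int>> tiled_order(int width, int height, int tile) {
    std::vector<std::pair<int, int>> visited;
    for (int y_outer = 0; y_outer < height / tile; y_outer++) {
        for (int x_outer = 0; x_outer < width / tile; x_outer++) {
            // Inside one tile the pixels are traversed in row-major order.
            for (int y_inner = 0; y_inner < tile; y_inner++) {
                for (int x_inner = 0; x_inner < tile; x_inner++) {
                    int x = x_outer * tile + x_inner;
                    int y = y_outer * tile + y_inner;
                    visited.push_back({x, y});
                }
            }
        }
    }
    return visited;
}
```

For an 8x8 image with 4x4 tiles, the first 16 entries all lie in the top-left tile before the traversal moves on to the tile starting at x = 4.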
vectorize
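** Vectorization computes several adjacent pixels with one SIMD instruction, which is generally faster on hardware with SIMD support; the vector width is passed as the split factor. **
// Evaluate in vectors.
Func gradient;
gradient(x, y) = x + y;
// Split x by 4 and vectorize the inner loop,
// because on x86 we can use SSE to compute in 4-wide vectors.
gradient.vectorize(x, 4);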
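A scalar model can show the shape of the vectorized loop nest. This is only a sketch in plain C++: the four-element arrays stand in for SIMD registers, the lane loops stand in for single vector instructions, and the width is assumed to be a multiple of 4. `gradient_vectorized` is a hypothetical name.

```cpp
#include <vector>

// Model of gradient.vectorize(x, 4): x is split by 4 and the inner
// four iterations are treated as one vector operation.
std::vector<int> gradient_vectorized(int width, int height) {
    std::vector<int> out(width * height);
    for (int y = 0; y < height; y++) {
        for (int x_outer = 0; x_outer < width / 4; x_outer++) {
            // One "vector" holding four consecutive x coordinates.
            int x_vec[4] = {x_outer * 4 + 0, x_outer * 4 + 1,
                            x_outer * 4 + 2, x_outer * 4 + 3};
            int val[4];
            // Stands in for a single 4-wide SIMD add.
            for (int lane = 0; lane < 4; lane++) {
                val[lane] = x_vec[lane] + y;
            }
            // Stands in for a single 4-wide store.
            for (int lane = 0; lane < 4; lane++) {
                out[y * width + x_vec[lane]] = val[lane];
            }
        }
    }
    return out;
}
```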
unroll
** If multiple pixels share overlapping data, it can make sense to unroll a computation so that shared values are only computed or loaded once. **
// unroll replaces the inner loop with a straight-line sequence of statements, one per iteration.
Func gradient;
gradient(x, y) = x + y;
gradient.unroll(x, 2);
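Written out by hand, the unrolled schedule looks roughly like the plain C++ below. It is a sketch that assumes an even width; `gradient_unrolled` is an illustrative name, not part of Halide.

```cpp
#include <vector>

// Model of gradient.unroll(x, 2): x is split by 2 and the inner loop
// is replaced by two consecutive statements.
std::vector<int> gradient_unrolled(int width, int height) {
    std::vector<int> out(width * height);
    for (int y = 0; y < height; y++) {
        for (int x_outer = 0; x_outer < width / 2; x_outer++) {
            // First copy of the loop body (x_inner == 0).
            {
                int x = x_outer * 2 + 0;
                out[y * width + x] = x + y;
            }
            // Second copy of the loop body (x_inner == 1).
            {
                int x = x_outer * 2 + 1;
                out[y * width + x] = x + y;
            }
        }
    }
    return out;
}
```

In this gradient there is nothing to share between the two copies, so the benefit is small; the pattern pays off when both copies read the same inputs.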
tiles in parallel
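** Combining tiling, fusing, and parallelizing expresses a useful pattern: processing tiles in parallel. **
** This is where fusing shines. **
// Fusing the two outer tile loops into one and parallelizing that single loop avoids inefficient nested parallelism.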
//The tiles should occur in arbitrary order, but within each
// tile the pixels will be traversed in row-major order.
Func gradient;
gradient(x, y) = x + y;
Var x_outer, y_outer, x_inner, y_inner, tile_index;
gradient
.tile(x, y, x_outer, y_outer, x_inner, y_inner, 4, 4)
.fuse(x_outer, y_outer, tile_index)
.parallel(tile_index); // tile_index is the single dimension produced by fusing x_outer and y_outer
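The fused loop can be modeled in plain C++ by decoding `tile_index` back into the two tile coordinates. In this sketch the outer loop runs serially; in Halide it is the loop marked parallel, so its iterations may run in any order. The image size is assumed to be a multiple of the tile size, and `gradient_fused_tiles` is a hypothetical name.

```cpp
#include <vector>

// Model of tile(...).fuse(x_outer, y_outer, tile_index).parallel(tile_index).
std::vector<int> gradient_fused_tiles(int width, int height, int tile) {
    int tiles_x = width / tile;
    int tiles_y = height / tile;
    std::vector<int> out(width * height, -1);
    // Serial stand-in for the parallel loop over tiles.
    for (int tile_index = 0; tile_index < tiles_x * tiles_y; tile_index++) {
        // x_outer is the inner (fastest-varying) part of the fused index.
        int y_outer = tile_index / tiles_x;
        int x_outer = tile_index % tiles_x;
        for (int y_inner = 0; y_inner < tile; y_inner++) {
            for (int x_inner = 0; x_inner < tile; x_inner++) {
                int x = x_outer * tile + x_inner;
                int y = y_outer * tile + y_inner;
                out[y * width + x] = x + y;
            }
        }
    }
    return out;
}
```

Because each tile writes a disjoint region of the output, the tiles are independent of each other, which is what makes a single parallel loop over `tile_index` safe.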
Finally
** Putting it all together, using all of the features above. **
Func gradient_fast;
gradient_fast(x, y) = x + y;
// We'll process 64x64 tiles in parallel.
Var x_outer, y_outer, x_inner, y_inner, tile_index;
gradient_fast
.tile(x, y, x_outer, y_outer, x_inner, y_inner, 64, 64)
.fuse(x_outer, y_outer, tile_index)
.parallel(tile_index);
// We'll compute two scanlines at once while we walk across
// each tile. We'll also vectorize in x. The easiest way to
// express this is to recursively tile again within each tile
// into 4x2 subtiles, then vectorize the subtiles across x and
// unroll them across y:
Var x_inner_outer, y_inner_outer, x_vectors, y_pairs;
gradient_fast
.tile(x_inner, y_inner, x_inner_outer, y_inner_outer, x_vectors, y_pairs, 4, 2)
.vectorize(x_vectors)
.unroll(y_pairs);
// Note that we didn't do any explicit splitting or
// reordering. Those are the most important primitive
// operations, but mostly they are buried underneath tiling,
// vectorizing, or unrolling calls.
// Now let's evaluate this over a range which is not a
// multiple of the tile size.
// If you like you can turn on tracing, but it's going to
// produce a lot of printfs. Instead we'll compute the answer
// both in C and Halide and see if the answers match.
Buffer<int> result = gradient_fast.realize({350, 250});
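The comparison against C mentioned in the comments is not shown in these notes, so here is a rough sketch of what such a reference loop nest could look like. It is plain C++ and models only the schedule: tiles that would run past the edge are shifted back inward, the x_vectors loop stands in for a 4-wide vector operation, and the y_pairs loop stands in for the unrolled pair of scanlines. `count_mismatches` is a hypothetical helper that assumes both dimensions are at least 64; it fills a buffer following the schedule and counts pixels that are missing or wrong.

```cpp
#include <algorithm>
#include <vector>

// Model of the gradient_fast schedule over a width x height region.
// Returns the number of pixels that were never written or differ from x + y.
int count_mismatches(int width, int height) {
    const int tiles_x = (width + 63) / 64;   // round up
    const int tiles_y = (height + 63) / 64;  // round up
    std::vector<int> out(width * height, -1);
    // Serial stand-in for the parallel loop over fused tiles.
    for (int tile_index = 0; tile_index < tiles_x * tiles_y; tile_index++) {
        int y_outer = tile_index / tiles_x;
        int x_outer = tile_index % tiles_x;
        for (int y_inner_outer = 0; y_inner_outer < 64 / 2; y_inner_outer++) {
            for (int x_inner_outer = 0; x_inner_outer < 64 / 4; x_inner_outer++) {
                // Edge tiles are shifted back so they stay inside the region.
                int x_base = std::min(x_outer * 64, width - 64) + x_inner_outer * 4;
                int y_base = std::min(y_outer * 64, height - 64) + y_inner_outer * 2;
                for (int y_pairs = 0; y_pairs < 2; y_pairs++) {          // unrolled in Halide
                    for (int x_vectors = 0; x_vectors < 4; x_vectors++) { // vectorized in Halide
                        int x = x_base + x_vectors;
                        int y = y_base + y_pairs;
                        out[y * width + x] = x + y;
                    }
                }
            }
        }
    }
    int mismatches = 0;
    for (int y = 0; y < height; y++) {
        for (int x = 0; x < width; x++) {
            if (out[y * width + x] != x + y) mismatches++;
        }
    }
    return mismatches;
}
```

Calling `count_mismatches(350, 250)` exercises the same range as the `realize` call above, where neither dimension is a multiple of the tile size, so the edge tiles overlap their neighbors and recompute some pixels.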
** Note that in the Halide version, the algorithm is specified once at the top, separately from the optimizations, and there aren’t that many lines of code total. **
End
Only at this point do I feel I am getting to what makes Halide distinctive: the algorithm is specified once at the top, separately from the optimizations.