Halide Tutorial, Lesson 19
// Halide tutorial lesson 19: Wrapper Funcs
// This lesson demonstrates how to use Func::in and ImageParam::in to
// schedule a Func differently in different places, and to stage loads
// from a Func or an ImageParam.
// On linux, you can compile and run it like so:
// g++ lesson_19*.cpp -g -I ../include -L ../bin -lHalide -lpthread -ldl -o lesson_19 -std=c++11
// LD_LIBRARY_PATH=../bin ./lesson_19
// The only Halide header file you need is Halide.h. It includes all of Halide.
#include "Halide.h"
// We'll also include stdio for printf.
#include <stdio.h>
using namespace Halide;
int main(int argc, char **argv) {
// First we'll declare some Vars to use below.
Var x("x"), y("y"), xo("xo"), yo("yo"), xi("xi"), yi("yi");
// This lesson will be about "wrapping" a Func or an ImageParam using the
// Func::in and ImageParam::in directives.
{
// Consider a simple two-stage pipeline:
Func f("f_local"), g("g_local");
f(x, y) = x + y;
g(x, y) = 2 * f(x, y) + 3;
f.compute_root();
// This produces the following loop nests:
// for y:
//   for x:
//     f(x, y) = x + y
// for y:
//   for x:
//     g(x, y) = 2 * f(x, y) + 3
// Using Func::in, we can interpose a new Func in between f
// and g using the schedule alone:
Func f_in_g = f.in(g);
f_in_g.compute_root();
// Equivalently, we could also chain the schedules like so:
// f.in(g).compute_root();
// This produces the following three loop nests:
// for y:
//   for x:
//     f(x, y) = x + y
// for y:
//   for x:
//     f_in_g(x, y) = f(x, y)
// for y:
//   for x:
//     g(x, y) = 2 * f_in_g(x, y) + 3
g.realize(5, 5);
// See figures/lesson_19_wrapper_local.mp4 for a visualization.
// The schedule directive f.in(g) replaces all calls to 'f'
// inside 'g' with a wrapper Func and then returns that
// wrapper. Essentially, it rewrites the original pipeline
// above into the following:
{
Func f_in_g("f_in_g"), f("f"), g("g");
f(x, y) = x + y;
f_in_g(x, y) = f(x, y);
g(x, y) = 2 * f_in_g(x, y) + 3;
f.compute_root();
f_in_g.compute_root();
g.compute_root();
}
// In isolation, such a transformation seems pointless, but it
// can be used for a variety of scheduling tricks.
}
{
// In the schedule above, only the calls to 'f' made by 'g'
// are replaced. Other calls made to f would still call 'f'
// directly. If we wish to globally replace all calls to 'f'
// with a single wrapper, we simply say f.in().
// Consider a three stage pipeline, with two consumers of f:
Func f("f_global"), g("g_global"), h("h_global");
f(x, y) = x + y;
g(x, y) = 2 * f(x, y);
h(x, y) = 3 + g(x, y) - f(x, y);
f.compute_root();
g.compute_root();
h.compute_root();
// We will replace all calls to 'f' inside both 'g' and 'h'
// with calls to a single wrapper:
f.in().compute_root();
// The equivalent loop nests are:
// for y:
//   for x:
//     f(x, y) = x + y
// for y:
//   for x:
//     f_in(x, y) = f(x, y)
// for y:
//   for x:
//     g(x, y) = 2 * f_in(x, y)
// for y:
//   for x:
//     h(x, y) = 3 + g(x, y) - f_in(x, y)
h.realize(5, 5);
// See figures/lesson_19_wrapper_global.mp4 for a
// visualization of what this did.
}
{
// We could also give g and h their own unique wrappers of
// f. This time we'll schedule them each inside the loop nests
// of the consumer, which is not something we could do with a
// single global wrapper.
Func f("f_unique"), g("g_unique"), h("h_unique");
f(x, y) = x + y;
g(x, y) = 2 * f(x, y);
h(x, y) = 3 + g(x, y) - f(x, y);
f.compute_root();
g.compute_root();
h.compute_root();
f.in(g).compute_at(g, y);
f.in(h).compute_at(h, y);
// This creates the loop nests:
// for y:
//   for x:
//     f(x, y) = x + y
// for y:
//   for x:
//     f_in_g(x, y) = f(x, y)
//   for x:
//     g(x, y) = 2 * f_in_g(x, y)
// for y:
//   for x:
//     f_in_h(x, y) = f(x, y)
//   for x:
//     h(x, y) = 3 + g(x, y) - f_in_h(x, y)
h.realize(5, 5);
// See figures/lesson_19_wrapper_unique.mp4 for a visualization.
}
{
// So far this may seem like a lot of pointless copying of
// memory. However, Func::in can be combined with other
// scheduling directives for a variety of purposes. The first
// we will examine is creating distinct realizations of a Func
// for several consumers and scheduling each differently.
// We'll start with nearly the same pipeline.
Func f("f_sched"), g("g_sched"), h("h_sched");
f(x, y) = x + y;
g(x, y) = 2 * f(x, y);
// h will use a far-away region of f
h(x, y) = 3 + g(x, y) - f(x + 93, y - 87);
// This time we'll inline f.
// f.compute_root();
g.compute_root();
h.compute_root();
f.in(g).compute_at(g, y);
f.in(h).compute_at(h, y);
// g and h now call f via distinct wrappers. The wrappers are
// scheduled, but f is not, which means that f is inlined into
// its two wrappers. They will each independently compute the
// region of f required by their consumer. If we had scheduled
// f compute_root, we'd be computing the bounding box of the
// region required by g and the region required by h, most of
// which would be unused data.
// We can also schedule each of these wrappers
// differently. For scheduling purposes, wrappers inherit the
// pure vars of the Func they wrap, so we use the same x and y
// that we used when defining f:
f.in(g).vectorize(x, 4);
f.in(h).split(x, xo, xi, 2).reorder(xo, xi);
// Note that calling f.in(g) a second time returns the wrapper
// already created by the first call; it doesn't make a new one.
h.realize(8, 8);
// See figures/lesson_19_wrapper_vary_schedule.mp4 for a
// visualization.
// Note that because f is inlined into its two wrappers, it is
// the wrappers that do the work of computing f, rather than
// just loading from an existing computed realization.
}
{
// Func::in is useful to stage loads from a Func via some
// smaller intermediate buffer, perhaps on the stack or in
// shared GPU memory.
// Consider a pipeline that transposes some compute_root'd Func:
Func f("f_transpose"), g("g_transpose");
f(x, y) = sin(((x + y) * sqrt(y)) / 10);
f.compute_root();
g(x, y) = f(y, x);
// The execution strategy we want is to load a 4x4 tile of f
// into registers, transpose it in-register, and then write it
// out as a 4x4 tile of g. We will use Func::in to express this:
Func f_tile = f.in(g);
// We now have a three stage pipeline:
// f -> f_tile -> g
// f_tile will load vectors of f, and store them transposed
// into registers. g will then write this data back to main
// memory.
g.tile(x, y, xo, yo, xi, yi, 4, 4)
.vectorize(xi)
.unroll(yi);
// We will compute f_tile at tiles of g, and use
// Func::reorder_storage to state that f_tile should be
// stored column-major, so that the loads to it done by g can
// be dense vector loads.
f_tile.compute_at(g, xo)
.reorder_storage(y, x)
.vectorize(x)
.unroll(y);
// We take care to make sure f_tile is only ever accessed
// at constant indices. The full unrolling/vectorization of
// all loops that exist inside its compute_at level has this
// effect. Allocations that are only ever accessed at constant
// indices can be promoted into registers.
g.realize(16, 16);
// See figures/lesson_19_transpose.mp4 for a visualization
}
{
// ImageParam::in behaves the same way as Func::in, and you
// can use it to stage loads in similar ways. Instead of
// transposing again, we'll use ImageParam::in to stage tiles
// of an input image into GPU shared memory, effectively using
// shared/local memory as an explicitly-managed cache.
ImageParam img(Int(32), 2);
// We will compute a small blur of the input.
Func blur("blur");
blur(x, y) = (img(x - 1, y - 1) + img(x, y - 1) + img(x + 1, y - 1) +
img(x - 1, y ) + img(x, y ) + img(x + 1, y ) +
img(x - 1, y + 1) + img(x, y + 1) + img(x + 1, y + 1));
blur.compute_root().gpu_tile(x, y, xo, yo, xi, yi, 8, 8);
// The wrapper Func created by ImageParam::in has pure vars
// named _0, _1, etc. Schedule it per tile of "blur", and map
// _0 and _1 to gpu threads.
img.in(blur).compute_at(blur, xo).gpu_threads(_0, _1);
// Without ImageParam::in, computing an 8x8 tile of blur would
// do 8*8*9 loads to global memory. With ImageParam::in, the
// wrapper does 10*10 loads to global memory up front, and then
// blur does 8*8*9 loads to shared/local memory.
// Select an appropriate GPU API, as we did in lesson 12
Target target = get_host_target();
if (target.os == Target::OSX) {
target.set_feature(Target::Metal);
} else {
target.set_feature(Target::OpenCL);
}
// Create an interesting input image to use.
Buffer<int> input(258, 258);
input.set_min(-1, -1);
for (int y = input.top(); y <= input.bottom(); y++) {
for (int x = input.left(); x <= input.right(); x++) {
input(x, y) = x * 17 + y % 4;
}
}
img.set(input);
blur.compile_jit(target);
Buffer<int> out = blur.realize(256, 256);
// Check the output is what we expected
for (int y = out.top(); y <= out.bottom(); y++) {
for (int x = out.left(); x <= out.right(); x++) {
int val = out(x, y);
int expected = (input(x - 1, y - 1) + input(x, y - 1) + input(x + 1, y - 1) +
input(x - 1, y ) + input(x, y ) + input(x + 1, y ) +
input(x - 1, y + 1) + input(x, y + 1) + input(x + 1, y + 1));
if (val != expected) {
printf("out(%d, %d) = %d instead of %d\n",
x, y, val, expected);
return -1;
}
}
}
}
{
// Func::in can also be used to group multiple stages of a
// Func into the same loop nest. Consider the following
// pipeline, which computes a value per pixel, then sweeps
// from left to right and back across each scanline.
Func f("f_group"), g("g_group"), h("h_group");
// Initialize f
f(x, y) = sin(x - y);
RDom r(1, 7);
// Sweep from left to right
f(r, y) = (f(r, y) + f(r - 1, y)) / 2;
// Sweep from right to left
f(7 - r, y) = (f(7 - r, y) + f(8 - r, y)) / 2;
// Then we do something with a complicated access pattern: A
// 45 degree rotation with wrap-around
g(x, y) = f((x + y) % 8, (x - y) % 8);
// f should be scheduled compute_root, because its consumer
// accesses it in a complicated way. But that means all stages
// of f are computed in separate loop nests:
// for y:
//   for x:
//     f(x, y) = sin(x - y)
// for y:
//   for r:
//     f(r, y) = (f(r, y) + f(r - 1, y)) / 2
// for y:
//   for r:
//     f(7 - r, y) = (f(7 - r, y) + f(8 - r, y)) / 2
// for y:
//   for x:
//     g(x, y) = f((x + y) % 8, (x - y) % 8);
// We can get better locality if we schedule the work done by
// f to share a common loop over y. We can do this by
// computing f at scanlines of a wrapper like so:
f.in(g).compute_root();
f.compute_at(f.in(g), y);
// f is now computed at each scanline of its consumer, which is
// the wrapper f.in(g), so all of its stages (the pure
// definition and both update stages) share one loop over y.
// This generates the following loop nest, which has better
// locality:
// for y:
//   for x:
//     f(x, y) = sin(x - y)
//   for r:
//     f(r, y) = (f(r, y) + f(r - 1, y)) / 2
//   for r:
//     f(7 - r, y) = (f(7 - r, y) + f(8 - r, y)) / 2
//   for x:
//     f_in_g(x, y) = f(x, y)
// for y:
//   for x:
//     g(x, y) = f_in_g((x + y) % 8, (x - y) % 8);
// We'll additionally vectorize the initialization of f, and
// the transfer of pixel values from f into its wrapper:
f.vectorize(x, 4);
f.in(g).vectorize(x, 4);
g.realize(8, 8);
// See figures/lesson_19_group_updates.mp4 for a visualization.
}
printf("Success!\n");
return 0;
}
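The load counts quoted in the ImageParam::in example above can be checked with a little arithmetic. The sketch below is plain C++ rather than Halide; the names `TileLoads` and `loads_per_tile` are hypothetical and exist only for this illustration. It assumes a square output tile and a square stencil, as with the 8x8 tile and 3x3 blur in the lesson.

```cpp
#include <cassert>

// Hypothetical helper, not part of Halide: counts the loads needed to
// produce one square output tile of a (2 * radius + 1)^2 stencil.
struct TileLoads {
    long global_without_wrapper;  // every stencil tap reads global memory
    long global_with_wrapper;     // wrapper loads the tile plus its halo once
    long shared_with_wrapper;     // stencil taps then read the staged copy
};

inline TileLoads loads_per_tile(int tile, int radius) {
    assert(tile > 0 && radius >= 0);
    const long side = 2L * radius + 1;           // stencil width, e.g. 3
    const long taps = side * side;               // taps per output, e.g. 9
    const long halo_side = tile + 2L * radius;   // staged tile width, e.g. 10
    const long outputs = static_cast<long>(tile) * tile;
    return TileLoads{outputs * taps, halo_side * halo_side, outputs * taps};
}
```

For `loads_per_tile(8, 1)` this gives 8*8*9 global loads without the wrapper, against 10*10 global loads plus 8*8*9 shared/local loads with it, which are the figures given in the comments.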
Key takeaways:
1. Local wrapper Funcs
f(x, y) = x + y;
g(x, y) = 2 * f(x, y) + 3;
f_in_g = f.in(g);
f_in_g.compute_root();
is equivalent to
f(x, y) = x + y;
f_in_g(x, y) = f(x, y);
g(x, y) = 2 * f_in_g(x, y) + 3;
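To make the equivalence concrete, here is a minimal sketch in plain C++ (no Halide involved) of the three root-level loop nests that the local wrapper produces. The function names `g_via_wrapper` and `g_direct` are hypothetical, and the row-major `std::vector` buffers are an assumption standing in for whatever storage Halide would allocate.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Emulates f -> f_in_g -> g, each stage in its own root-level loop nest.
inline std::vector<int> g_via_wrapper(int width, int height) {
    assert(width > 0 && height > 0);
    const std::size_t n = static_cast<std::size_t>(width) * height;
    std::vector<int> f(n), f_in_g(n), g(n);
    // f.compute_root(): f(x, y) = x + y
    for (int y = 0; y < height; y++)
        for (int x = 0; x < width; x++)
            f[y * width + x] = x + y;
    // f_in_g.compute_root(): the wrapper is a plain copy of f
    for (int y = 0; y < height; y++)
        for (int x = 0; x < width; x++)
            f_in_g[y * width + x] = f[y * width + x];
    // g reads only the wrapper, never f itself
    for (int y = 0; y < height; y++)
        for (int x = 0; x < width; x++)
            g[y * width + x] = 2 * f_in_g[y * width + x] + 3;
    return g;
}

// Reference: g evaluated straight from its definition.
inline std::vector<int> g_direct(int width, int height) {
    assert(width > 0 && height > 0);
    std::vector<int> g(static_cast<std::size_t>(width) * height);
    for (int y = 0; y < height; y++)
        for (int x = 0; x < width; x++)
            g[y * width + x] = 2 * (x + y) + 3;
    return g;
}
```

Both functions return the same values; the wrapper changes only where intermediate data is stored and in which loops it is produced, not the result.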
2. Global wrapper Funcs
f(x, y) = x + y;
g(x, y) = 2 * f(x, y);
h(x, y) = 3 + g(x, y) - f(x, y);
f.in().compute_root();
is equivalent to
f(x, y) = x + y;
f_in(x, y) = f(x, y);
g(x, y) = 2 * f_in(x, y);
h(x, y) = 3 + g(x, y) - f_in(x, y);
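The global case can be sketched the same way. The plain C++ below (not Halide; `h_via_global_wrapper` is a hypothetical name) assumes f, g and h are all computed at root, as in the lesson's f_global example, and shows both consumers reading from a single shared wrapper buffer.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Emulates f -> f_in -> {g, h}, where g and h share one wrapper buffer.
inline std::vector<int> h_via_global_wrapper(int width, int height) {
    assert(width > 0 && height > 0);
    const std::size_t n = static_cast<std::size_t>(width) * height;
    std::vector<int> f(n), f_in(n), g(n), h(n);
    // f(x, y) = x + y
    for (int y = 0; y < height; y++)
        for (int x = 0; x < width; x++)
            f[y * width + x] = x + y;
    // f_in(x, y) = f(x, y): the single global wrapper
    for (int y = 0; y < height; y++)
        for (int x = 0; x < width; x++)
            f_in[y * width + x] = f[y * width + x];
    // g(x, y) = 2 * f_in(x, y)
    for (int y = 0; y < height; y++)
        for (int x = 0; x < width; x++)
            g[y * width + x] = 2 * f_in[y * width + x];
    // h(x, y) = 3 + g(x, y) - f_in(x, y), reading the same wrapper as g
    for (int y = 0; y < height; y++)
        for (int x = 0; x < width; x++)
            h[y * width + x] = 3 + g[y * width + x] - f_in[y * width + x];
    return h;
}
```

Since g is 2 * f, every output simplifies to 3 + x + y, which gives a quick way to sanity-check the result.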
3. Another benefit of wrapper Funcs is that multiple wrappers can be scheduled independently.
f(x, y) = x + y;
g(x, y) = 2 * f(x, y);
h(x, y) = 3 + g(x, y) - f(x, y);
f.in(g).compute_at(g, y);
f.in(h).compute_at(h, y);
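A rough picture of what scheduling each wrapper at its consumer's y loop means, again as plain C++ rather than Halide. The name `h_via_row_wrappers` is hypothetical, and the sketch assumes f, g and h are themselves compute_root, as in the lesson's f_unique example (the snippet above omits those calls). Each wrapper then needs only one scanline of storage, refilled on every iteration of y.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Emulates f.in(g).compute_at(g, y) and f.in(h).compute_at(h, y).
inline std::vector<int> h_via_row_wrappers(int width, int height) {
    assert(width > 0 && height > 0);
    const std::size_t n = static_cast<std::size_t>(width) * height;
    std::vector<int> f(n), g(n), h(n);
    // f at root: f(x, y) = x + y
    for (int y = 0; y < height; y++)
        for (int x = 0; x < width; x++)
            f[y * width + x] = x + y;
    // Each wrapper holds a single scanline.
    std::vector<int> f_in_g(static_cast<std::size_t>(width));
    std::vector<int> f_in_h(static_cast<std::size_t>(width));
    for (int y = 0; y < height; y++) {
        for (int x = 0; x < width; x++)   // f_in_g(x, y) = f(x, y)
            f_in_g[x] = f[y * width + x];
        for (int x = 0; x < width; x++)   // g(x, y) = 2 * f_in_g(x, y)
            g[y * width + x] = 2 * f_in_g[x];
    }
    for (int y = 0; y < height; y++) {
        for (int x = 0; x < width; x++)   // f_in_h(x, y) = f(x, y)
            f_in_h[x] = f[y * width + x];
        for (int x = 0; x < width; x++)   // h(x, y) = 3 + g(x, y) - f_in_h(x, y)
            h[y * width + x] = 3 + g[y * width + x] - f_in_h[x];
    }
    return h;
}
```

The outputs match the global-wrapper version; what differs is that each wrapper lives inside its own consumer's loop nest, which a single shared wrapper cannot do.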