Halide Study Notes: Reading the Halide Tutorial Source (19)

姑苏隐士

Article link: https://blog.csdn.net/luzhanbo207/article/details/78905381

Halide Introductory Tutorial 19

// Halide tutorial lesson 19: Wrapper Funcs
// Halide introductory lesson 19: wrapper functions

// This lesson demonstrates how to use Func::in and ImageParam::in to
// schedule a Func differently in different places, and to stage loads
// from a Func or an ImageParam.
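// In short: Func::in and ImageParam::in let you schedule the same Func
// differently in different places, and stage loads through an
// intermediate buffer.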

// On linux, you can compile and run it like so:
// g++ lesson_19*.cpp -g -I ../include -L ../bin -lHalide -lpthread -ldl -o lesson_19 -std=c++11
// LD_LIBRARY_PATH=../bin ./lesson_19

// The only Halide header file you need is Halide.h. It includes all of Halide.
#include "Halide.h"

// We'll also include stdio for printf.
#include <stdio.h>

using namespace Halide;

int main(int argc, char **argv) {
    // First we'll declare some Vars to use below.
    Var x("x"), y("y"), xo("xo"), yo("yo"), xi("xi"), yi("yi");

    // This lesson will be about "wrapping" a Func or an ImageParam using the
    // Func::in and ImageParam::in directives
    // In short: this lesson is about wrapping Funcs with the Func::in and
    // ImageParam::in directives.
    {
        // Consider a simple two-stage pipeline:
        Func f("f_local"), g("g_local");
        f(x, y) = x + y;
        g(x, y) = 2 * f(x, y) + 3;

        f.compute_root();

        // This produces the following loop nests:
        // for y:
        //   for x:
        //     f(x, y) = x + y
        // for y:
        //   for x:
        //     g(x, y) = 2 * f(x, y) + 3

        // Using Func::in, we can interpose a new Func in between f
        // and g using the schedule alone:
        // In other words, Func::in inserts a new Func between f and g
        // purely through the schedule; the algorithm is not edited.
        Func f_in_g = f.in(g);
        f_in_g.compute_root();

        // Equivalently, we could also chain the schedules like so:
        // f.in(g).compute_root();

        // This produces the following three loop nests:
        // for y:
        //   for x:
        //     f(x, y) = x + y
        // for y:
        //   for x:
        //     f_in_g(x, y) = f(x, y)
        // for y:
        //   for x:
        //     g(x, y) = 2 * f_in_g(x, y) + 3

        g.realize(5, 5);

        // See figures/lesson_19_wrapper_local.mp4 for a visualization.

        // The schedule directive f.in(g) replaces all calls to 'f'
        // inside 'g' with a wrapper Func and then returns that
        // wrapper. Essentially, it rewrites the original pipeline
        // above into the following:
        // In other words, f.in(g) swaps every call to f inside g for a call
        // to the wrapper, as if the pipeline had been written like this:
        {
            Func f_in_g("f_in_g"), f("f"), g("g");
            f(x, y) = x + y;
            f_in_g(x, y) = f(x, y);
            g(x, y) = 2 * f_in_g(x, y) + 3;

            f.compute_root();
            f_in_g.compute_root();
            g.compute_root();
        }

        // In isolation, such a transformation seems pointless, but it
        // can be used for a variety of scheduling tricks.
        // Taken alone the rewrite looks useless, but it enables several
        // scheduling tricks.
    }

    {
        // In the schedule above, only the calls to 'f' made by 'g'
        // are replaced. Other calls made to f would still call 'f'
        // directly. If we wish to globally replace all calls to 'f'
        // with a single wrapper, we simply say f.in().
        // In other words, f.in(g) only redirects the calls that g makes;
        // f.in() with no arguments redirects every call to f.

        // Consider a three stage pipeline, with two consumers of f:
        Func f("f_global"), g("g_global"), h("h_global");
        f(x, y) = x + y;
        g(x, y) = 2 * f(x, y);
        h(x, y) = 3 + g(x, y) - f(x, y);
        f.compute_root();
        g.compute_root();
        h.compute_root();

        // We will replace all calls to 'f' inside both 'g' and 'h'
        // with calls to a single wrapper:
        f.in().compute_root();

        // The equivalent loop nests are:
        // for y:
        //   for x:
        //     f(x, y) = x + y
        // for y:
        //   for x:
        //     f_in(x, y) = f(x, y)
        // for y:
        //   for x:
        //     g(x, y) = 2 * f_in(x, y)
        // for y:
        //   for x:
        //     h(x, y) = 3 + g(x, y) - f_in(x, y)

        h.realize(5, 5);

        // See figures/lesson_19_wrapper_global.mp4 for a
        // visualization of what this did.
    }

    {
        // We could also give g and h their own unique wrappers of
        // f. This time we'll schedule them each inside the loop nests
        // of the consumer, which is not something we could do with a
        // single global wrapper.
        // In other words, g and h each get their own wrapper of f, and each
        // wrapper is computed inside its consumer's loop nest, which a
        // single global wrapper cannot do.

        Func f("f_unique"), g("g_unique"), h("h_unique");
        f(x, y) = x + y;
        g(x, y) = 2 * f(x, y);
        h(x, y) = 3 + g(x, y) - f(x, y);

        f.compute_root();
        g.compute_root();
        h.compute_root();

        f.in(g).compute_at(g, y);
        f.in(h).compute_at(h, y);

        // This creates the loop nests:
        // for y:
        //   for x:
        //     f(x, y) = x + y
        // for y:
        //   for x:
        //     f_in_g(x, y) = f(x, y)
        //   for x:
        //     g(x, y) = 2 * f_in_g(x, y)
        // for y:
        //   for x:
        //     f_in_h(x, y) = f(x, y)
        //   for x:
        //     h(x, y) = 3 + g(x, y) - f_in_h(x, y)

        h.realize(5, 5);
        // See figures/lesson_19_wrapper_unique.mp4 for a visualization.
    }

    {
        // So far this may seem like a lot of pointless copying of
        // memory. Func::in can be combined with other scheduling
        // directives for a variety of purposes. The first we will
        // examine is creating distinct realizations of a Func for
        // several consumers and scheduling each differently.
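        // In other words, Func::in combines with other scheduling
        // directives; the first use is giving each consumer its own,
        // differently scheduled realization of f.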

        // We'll start with nearly the same pipeline.
        Func f("f_sched"), g("g_sched"), h("h_sched");
        f(x, y) = x + y;
        g(x, y) = 2 * f(x, y);
        // h will use a far-away region of f
        // That is, h reads f at coordinates far from its own (x, y).
        h(x, y) = 3 + g(x, y) - f(x + 93, y - 87);

        // This time we'll inline f.
        // f.compute_root();
        g.compute_root();
        h.compute_root();

        f.in(g).compute_at(g, y);
        f.in(h).compute_at(h, y);

        // g and h now call f via distinct wrappers. The wrappers are
        // scheduled, but f is not, which means that f is inlined into
        // its two wrappers. They will each independently compute the
        // region of f required by their consumer. If we had scheduled
        // f compute_root, we'd be computing the bounding box of the
        // region required by g and the region required by h, which
        // would mostly be unused data.
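        // In other words, g and h now call f through separate wrappers.
        // Only the wrappers are scheduled, so f is inlined into each one
        // and each wrapper computes just the region its consumer needs. A
        // compute_root f would cover the bounding box of both regions, and
        // most of that data would never be read.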

        // We can also schedule each of these wrappers
        // differently. For scheduling purposes, wrappers inherit the
        // pure vars of the Func they wrap, so we use the same x and y
        // that we used when defining f:
        // In other words, each wrapper can be scheduled on its own, and
        // because wrappers inherit f's pure vars we schedule them with x and y.
        f.in(g).vectorize(x, 4);
        f.in(h).split(x, xo, xi, 2).reorder(xo, xi);

        // Note that calling f.in(g) a second time returns the wrapper
        // already created by the first call, it doesn't make a new one.
        // That is, a second f.in(g) call returns the existing wrapper
        // rather than creating a new one.

        h.realize(8, 8);
        // See figures/lesson_19_wrapper_vary_schedule.mp4 for a
        // visualization.

        // Note that because f is inlined into its two wrappers, it is
        // the wrappers that do the work of computing f, rather than
        // just loading from an existing computed realization.
        // In other words, each wrapper computes f itself rather than
        // loading from an existing realization.
    }

    {
        // Func::in is useful to stage loads from a Func via some
        // smaller intermediate buffer, perhaps on the stack or in
        // shared GPU memory.
        // In other words, Func::in can stage loads through a small
        // intermediate buffer, for example on the stack or in shared GPU
        // memory.

        // Consider a pipeline that transposes some compute_root'd Func:

        Func f("f_transpose"), g("g_transpose");
        f(x, y) = sin(((x + y) * sqrt(y)) / 10);
        f.compute_root();

        g(x, y) = f(y, x);

        // The execution strategy we want is to load a 4x4 tile of f
        // into registers, transpose it in-register, and then write it
        // out as a 4x4 tile of g. We will use Func::in to express this:
        // In other words: read a 4x4 tile of f into registers, transpose
        // it there, then write it out as a 4x4 tile of g.

        Func f_tile = f.in(g);

        // We now have a three stage pipeline:
        // f -> f_tile -> g

        // f_tile will load vectors of f, and store them transposed
        // into registers. g will then write this data back to main
        // memory.
        // In other words, f_tile reads from f into registers, and g writes
        // the data back to memory.
        g.tile(x, y, xo, yo, xi, yi, 4, 4)
            .vectorize(xi)
            .unroll(yi);

        // We will compute f_tile (the wrapper of f_transpose) at tiles
        // of g, and use Func::reorder_storage to state that f_tile
        // should be stored column-major, so that the loads to it done
        // by g can be dense vector loads.
        // In other words, f_tile is computed per tile of g and stored
        // column-major via reorder_storage, so g can read it with dense
        // vector loads.
        f_tile.compute_at(g, xo)
            .reorder_storage(y, x)
            .vectorize(x)
            .unroll(y);

        // We take care to make sure f_tile is only ever accessed
        // at constant indices. The full unrolling/vectorization of
        // all loops that exist inside its compute_at level has this
        // effect. Allocations that are only ever accessed at constant
        // indices can be promoted into registers.
        // In other words, fully unrolling or vectorizing every loop inside
        // the compute_at level makes all accesses use constant indices, and
        // only allocations accessed that way can be promoted to registers.

        g.realize(16, 16);
        // See figures/lesson_19_transpose.mp4 for a visualization
    }

    {
        // ImageParam::in behaves the same way as Func::in, and you
        // can use it to stage loads in similar ways. Instead of
        // transposing again, we'll use ImageParam::in to stage tiles
        // of an input image into GPU shared memory, effectively using
        // shared/local memory as an explicitly-managed cache.
        // In other words, ImageParam::in works like Func::in and is used
        // the same way; here it stages tiles of the input image into GPU
        // shared memory.

        ImageParam img(Int(32), 2);

        // We will compute a small blur of the input.
        Func blur("blur");
        blur(x, y) = (img(x - 1, y - 1) + img(x, y - 1) + img(x + 1, y - 1) +
                      img(x - 1, y    ) + img(x, y    ) + img(x + 1, y    ) +
                      img(x - 1, y + 1) + img(x, y + 1) + img(x + 1, y + 1));

        blur.compute_root().gpu_tile(x, y, xo, yo, xi, yi, 8, 8);

        // The wrapper Func created by ImageParam::in has pure vars
        // named _0, _1, etc. Schedule it per tile of "blur", and map
        // _0 and _1 to gpu threads.
        img.in(blur).compute_at(blur, xo).gpu_threads(_0, _1);

        // Without Func::in, computing an 8x8 tile of blur would do
        // 8*8*9 loads to global memory. With Func::in, the wrapper
        // does 10*10 loads to global memory up front, and then blur
        // does 8*8*9 loads to shared/local memory.

        // Select an appropriate GPU API, as we did in lesson 12
        Target target = get_host_target();
        if (target.os == Target::OSX) {
            target.set_feature(Target::Metal);
        } else {
            target.set_feature(Target::OpenCL);
        }

        // Create an interesting input image to use.
        Buffer<int> input(258, 258);
        input.set_min(-1, -1);
        for (int y = input.top(); y <= input.bottom(); y++) {
            for (int x = input.left(); x <= input.right(); x++) {
                input(x, y) = x * 17 + y % 4;
            }
        }

        img.set(input);
        blur.compile_jit(target);
        Buffer<int> out = blur.realize(256, 256);

        // Check the output is what we expected
        for (int y = out.top(); y <= out.bottom(); y++) {
            for (int x = out.left(); x <= out.right(); x++) {
                int val = out(x, y);
                int expected = (input(x - 1, y - 1) + input(x, y - 1) + input(x + 1, y - 1) +
                                input(x - 1, y    ) + input(x, y    ) + input(x + 1, y    ) +
                                input(x - 1, y + 1) + input(x, y + 1) + input(x + 1, y + 1));
                if (val != expected) {
                    printf("out(%d, %d) = %d instead of %d\n",
                           x, y, val, expected);
                    return -1;
                }
            }
        }
    }

    {
        // Func::in can also be used to group multiple stages of a
        // Func into the same loop nest. Consider the following
        // pipeline, which computes a value per pixel, then sweeps
        // from left to right and back across each scanline.
        // In other words, Func::in can group several stages of a Func into
        // one loop nest.
        Func f("f_group"), g("g_group"), h("h_group");

        // Initialize f
        f(x, y) = sin(x - y);
        RDom r(1, 7);

        // Sweep from left to right
        f(r, y) = (f(r, y) + f(r - 1, y)) / 2;

        // Sweep from right to left
        f(7 - r, y) = (f(7 - r, y) + f(8 - r, y)) / 2;

        // Then we do something with a complicated access pattern: A
        // 45 degree rotation with wrap-around
        g(x, y) = f((x + y) % 8, (x - y) % 8);

        // f should be scheduled compute_root, because its consumer
        // accesses it in a complicated way. But that means all stages
        // of f are computed in separate loop nests:
        // In other words, f must be compute_root because its consumer's
        // access pattern is complex, but then each stage of f runs in its
        // own loop nest.

        // for y:
        //   for x:
        //     f(x, y) = sin(x - y)
        // for y:
        //   for r:
        //     f(r, y) = (f(r, y) + f(r - 1, y)) / 2
        // for y:
        //   for r:
        //     f(7 - r, y) = (f(7 - r, y) + f(8 - r, y)) / 2
        // for y:
        //   for x:
        //     g(x, y) = f((x + y) % 8, (x - y) % 8);

        // We can get better locality if we schedule the work done by
        // f to share a common loop over y. We can do this by
        // computing f at scanlines of a wrapper like so:
        // In other words, sharing one loop over y improves locality, so f
        // is computed per scanline of a wrapper.

        f.in(g).compute_root();
        f.compute_at(f.in(g), y);

        // Left unscheduled, f would get the default schedule for a Func
        // with update stages, which is to be computed at the innermost
        // loop of its consumer. The explicit compute_at above instead
        // computes all stages of f once per scanline (y) of its
        // consumer, which is now the wrapper f.in(g). This therefore
        // generates the following loop nest, which has better
        // locality:
        // In other words, every stage of f now runs inside the wrapper's
        // y loop, which is what improves locality.

        // for y:
        //   for x:
        //     f(x, y) = sin(x - y)
        //   for r:
        //     f(r, y) = (f(r, y) + f(r - 1, y)) / 2
        //   for r:
        //     f(7 - r, y) = (f(7 - r, y) + f(8 - r, y)) / 2
        //   for x:
        //     f_in_g(x, y) = f(x, y)
        // for y:
        //   for x:
        //     g(x, y) = f_in_g((x + y) % 8, (x - y) % 8);

        // We'll additionally vectorize the initialization of, and
        // then transfer of pixel values from f into its wrapper:
        // That is, vectorize both the initialization of f and the copy
        // from f into its wrapper.
        f.vectorize(x, 4);
        f.in(g).vectorize(x, 4);

        g.realize(8, 8);
        // See figures/lesson_19_group_updates.mp4 for a visualization.
    }

    printf("Success!\n");

    return 0;
}
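
The load counts quoted in the ImageParam::in block (8*8*9 without staging, 10*10 with it) follow from the tile size and the 3x3 stencil. Below is a minimal sketch of that arithmetic in plain C++, assuming a square tile and a square stencil with odd edge length. `TileLoads` and `count_tile_loads` are hypothetical names made up for this illustration; they are not part of Halide.

```cpp
#include <cassert>

// Global-memory load counts for one output tile of a stencil.
// Hypothetical helper for illustration only; not a Halide API.
struct TileLoads {
    int direct;  // each output pixel loads its whole stencil from global memory
    int staged;  // the wrapper loads the tile plus its halo exactly once
};

// tile:    edge length of the output tile (8 in the lesson).
// stencil: edge length of the stencil (3 for the 3x3 blur), assumed odd.
TileLoads count_tile_loads(int tile, int stencil) {
    assert(tile > 0 && stencil > 0 && stencil % 2 == 1);
    // Extra input pixels needed per axis: (stencil - 1) / 2 on each side.
    int halo = stencil - 1;
    TileLoads loads;
    loads.direct = tile * tile * stencil * stencil;
    loads.staged = (tile + halo) * (tile + halo);
    return loads;
}
```

For the lesson's 8x8 tile and 3x3 blur, `count_tile_loads(8, 3)` gives 576 direct loads against 100 staged loads. blur still performs its 8*8*9 reads, but they hit shared/local memory instead of global memory.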

Key takeaways:

1. Local wrapper function
f(x, y) = x + y;
g(x, y) = 2 * f(x, y) + 3;

f_in_g = f.in(g);
f_in_g.compute_root();
is equivalent to
f(x, y) = x + y;
f_in_g(x , y) = f(x , y);
g(x, y) = 2 * f_in_g(x, y) + 3;
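
To make the equivalence concrete, here is a hand-written plain C++ version of the three loop nests that this schedule produces. It is a sketch under stated assumptions, not code generated by Halide: the function name `run_local_wrapper` is hypothetical, buffers are stored row-major, and the region starts at (0, 0).

```cpp
#include <cassert>
#include <vector>

// Plain C++ stand-in for: f.compute_root(); f.in(g).compute_root();
// Hypothetical helper for illustration; W and H are the realized extents.
std::vector<int> run_local_wrapper(int W, int H) {
    assert(W > 0 && H > 0);
    std::vector<int> f(W * H), f_in_g(W * H), g(W * H);

    // Loop nest 1: f is computed at root into its own buffer.
    for (int y = 0; y < H; y++)
        for (int x = 0; x < W; x++)
            f[y * W + x] = x + y;

    // Loop nest 2: the wrapper copies f into a second buffer.
    for (int y = 0; y < H; y++)
        for (int x = 0; x < W; x++)
            f_in_g[y * W + x] = f[y * W + x];

    // Loop nest 3: g reads only from the wrapper, never from f.
    for (int y = 0; y < H; y++)
        for (int x = 0; x < W; x++)
            g[y * W + x] = 2 * f_in_g[y * W + x] + 3;

    return g;
}
```

Calling `run_local_wrapper(5, 5)` corresponds to `g.realize(5, 5)` in the lesson; the wrapper changes where data is staged, not the values of g.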

2. Global wrapper function
f(x, y) = x + y;
g(x, y) = 2 * f(x, y);
h(x, y) = 3 + g(x, y) - f(x, y);

f.in().compute_root();
is equivalent to
f(x, y) = x + y;
f_in(x, y) = f(x, y);
g(x, y) = 2 * f_in(x, y);
h(x, y) = 3 + g(x, y) - f_in(x, y);
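
The global case can be sketched the same way in plain C++. This assumes, as in the lesson, that f, g, h and the wrapper are all compute_root; `run_global_wrapper` is a hypothetical name, and the code is a hand-written stand-in for the loop nests rather than Halide output. The point to notice is that both consumers read the single `f_in` buffer.

```cpp
#include <cassert>
#include <vector>

// Plain C++ stand-in for: f.in().compute_root() with f, g, h at root.
// Hypothetical helper for illustration; buffers are row-major from (0, 0).
std::vector<int> run_global_wrapper(int W, int H) {
    assert(W > 0 && H > 0);
    std::vector<int> f(W * H), f_in(W * H), g(W * H), h(W * H);

    for (int y = 0; y < H; y++)
        for (int x = 0; x < W; x++)
            f[y * W + x] = x + y;

    // One shared wrapper buffer, filled once.
    for (int y = 0; y < H; y++)
        for (int x = 0; x < W; x++)
            f_in[y * W + x] = f[y * W + x];

    // First consumer of the wrapper.
    for (int y = 0; y < H; y++)
        for (int x = 0; x < W; x++)
            g[y * W + x] = 2 * f_in[y * W + x];

    // Second consumer reads the same wrapper buffer.
    for (int y = 0; y < H; y++)
        for (int x = 0; x < W; x++)
            h[y * W + x] = 3 + g[y * W + x] - f_in[y * W + x];

    return h;
}
```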

3. Another benefit of wrapper functions is that multiple wrappers can be scheduled separately.
f(x, y) = x + y;
g(x, y) = 2 * f(x, y);
h(x, y) = 3 + g(x, y) - f(x, y);

f.in(g).compute_at(g, y);
f.in(h).compute_at(h, y);
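
With compute_at(g, y) and compute_at(h, y), each wrapper only needs storage for one scanline at a time. The plain C++ sketch below illustrates that, assuming f, g and h are compute_root as in the lesson. `run_unique_wrappers` is a hypothetical name, and the per-row `std::vector` is only a stand-in for whatever allocation Halide actually makes.

```cpp
#include <cassert>
#include <vector>

// Plain C++ stand-in for: f.in(g).compute_at(g, y); f.in(h).compute_at(h, y);
// Hypothetical helper for illustration; buffers are row-major from (0, 0).
std::vector<int> run_unique_wrappers(int W, int H) {
    assert(W > 0 && H > 0);
    std::vector<int> f(W * H), g(W * H), h(W * H);

    for (int y = 0; y < H; y++)
        for (int x = 0; x < W; x++)
            f[y * W + x] = x + y;

    for (int y = 0; y < H; y++) {
        // g's private wrapper: one scanline, produced inside g's y loop.
        std::vector<int> f_in_g(W);
        for (int x = 0; x < W; x++)
            f_in_g[x] = f[y * W + x];
        for (int x = 0; x < W; x++)
            g[y * W + x] = 2 * f_in_g[x];
    }

    for (int y = 0; y < H; y++) {
        // h's private wrapper: independent of f_in_g.
        std::vector<int> f_in_h(W);
        for (int x = 0; x < W; x++)
            f_in_h[x] = f[y * W + x];
        for (int x = 0; x < W; x++)
            h[y * W + x] = 3 + g[y * W + x] - f_in_h[x];
    }

    return h;
}
```

Because the two wrappers are separate buffers in separate loop nests, each could be vectorized, split, or reordered without affecting the other, which is what the lesson's f_sched example goes on to do.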