halide编程技术指南（连载二）

最新推荐文章于 2024-03-20 10:00:09 发布

Aoulun

最新推荐文章于 2024-03-20 10:00:09 发布

阅读量1.1k

点赞数

分类专栏：深度学习

本文链接：https://blog.csdn.net/Aoulun/article/details/105249604

版权

深度学习专栏收录该内容

45 篇文章 5 订阅

订阅专栏

本文是halide编程指南的连载，已同步至公众号

第二章图像处理

第三章检查生成的代码

第四章使用tracing, print, 和print_when进行调试

第五章 Vectorize, parallelize, unroll and tile 你的代码

第二章图像处理

本章演示如何传递输入图像并进行操作。

// halide提供了对应方法用于导入png图像
#include "halide_image_io.h"

using namespace Halide::Tools;

int main(int argc, char **argv) {
    // 下面这个程序是对图像做亮度增强

    // 首先导入图像
    Halide::Buffer<uint8_t> input = load_image("images/rgb.png");

    // 就是下面这个图
    参考下面的图21

    // 下面定义一个代表这个图像处理管道的
    Halide::Func brighter;

    // ”Func” 有三个参数，代表了图像的位置和颜色通道，halide把颜色通道作为图像的额外维度。
    Halide::Var x, y, c;

    // 通常，我们可能会将整个函数定义写在一行上。 在这里，我们将其分解，以便我们可以解释我们在每个步骤中所做的事情。

    // 对于输入图像的每个像素
    Halide::Expr value = input(x, y, c);

    // 将其强制转换为浮点值
    value = Halide::cast<float>(value);

    // 乘以1.5使其变亮。halide将实数表示为浮点数，而不是双精度数，因此我们在常数的末尾加上“ f”。
    value = value * 1.5f;

    // 将其限制为小于255，因此将其强制转换回8位unsigned int时不会溢出。
    value = Halide::min(value, 255.0f);

    // 将其强制转换为8位无符号整数。
    value = Halide::cast<uint8_t>(value);

    // 定义函数
    brighter(x, y, c) = value;

    // 将上面所有的等效成一行
    // brighter(x, y, c) = Halide::cast<uint8_t>(min(input(x, y, c) * 1.5f, 255));
    // 在这个版本中
    // - 跳过了强制转换为浮点数，因为乘以1.5f会自动执行。
    // - 还使用了整数常量作为对min的调用中的第二个参数，因为它被强制转换为float以便与第一个参数兼容。
    // - 在调用min函数时，省略了halide::,因为使用了koenig lookup。(也叫做argument dependent lookup(ADL)。

    // 请记住，到目前为止，我们所做的只是在内存中构建Halide程序的表示形式。 我们实际上尚未处理任何像素。我们甚至还没有编译该Halide程序。

    // 因此，现在我们将实现Func。输出图像的尺寸应与输入图像的尺寸匹配。如果我们只是想加亮输入图像的一部分，我们可以要求使用较小的尺寸。如果我们请求更大的尺寸，那么Halide将在运行时抛出错误，告诉我们正试图读取输入图像上的边界。
    Halide::Buffer<uint8_t> output =
        brighter.realize(input.width(), input.height(), input.channels());

    // 保存输出以进行检查。它看起来像只鹦鹉。
    save_image(output, "brighter.png");

    // 有关输出的小版本，请参见下文。
    参考下面的图22
    printf("Success!\n");
    return 0;
}

图21

图22

第三章检查生成的代码

本节演示了如何检查Halide编译器正在生成的内容。

#include "Halide.h"
#include <stdio.h>

// 这次我们将导入整个 Halide 命名空间
using namespace Halide;
int main(int argc, char **argv) {
    // 我们将从第1课定义的简单的单阶段成像管道开始。
    // 本课将涉及调试，但是不幸的是，在C ++中，对象不知道自己的名称，这使我们很难理解生成的代码。要解决此问题，您可以将字符串传递给Func和Var构造函数以为其命名，从而进行调试。也就是说，给对象一个名字，在调试的时候，只需要看这个名字，就知道这个变量是，值是多少。
    Func gradient("gradient");
    Var x("x"), y("y");
    gradient(x, y) = x + y;

    // 实现该功能以产生输出图像。在本课程中，我们将使其保持很小的尺寸。
    Buffer<int> output = gradient.realize(8, 8);

    // 那一行编译并运行了管道。 尝试在环境变量HL_DEBUG_CODEGEN设置为1的情况下运行本课程。它将打印出编译的各个阶段以及最终管道的伪代码表示。
    // 如果将HL_DEBUG_CODEGEN设置为更高的数字，则可以看到有关Halide如何编译管道的越来越多的详细信息。设置 HL_DEBUG_CODEGEN = 2将显示编译的每个阶段的Halide代码，以及最后生成的llvm位代码。
    // Halide还将输出此输出的HTML版本，该版本支持语法突出显示和代码折叠，因此对于大型管道而言，阅读起来会更好。运行本教程后，用浏览器打开gradient.html。
    gradient.compile_to_lowered_stmt("gradient.html", {}, HTML);

    // 通常，您可以找出Halide使用此伪代码生成的代码。 在下一课中，我们将看到如何在运行时监听Halide。
    printf("Success!\n");
    return 0;
}

第四章使用tracing, print, 和print_when进行调试

本课演示如何在运行时遵循Halide的操作。

#include "Halide.h"
#include <stdio.h>
using namespace Halide;

int main(int argc, char **argv) {
    Var x("x"), y("y");
    // 在计算Funcs的值时将其打印出来。
    {
        // 我们将像以前一样定义渐变函数。
        Func gradient("gradient");
        gradient(x, y) = x + y;

        // 并告诉Halide我们希望收到所有评估的通知。
        gradient.trace_stores();

        // 在8x8区域上实现该功能。
        printf("Evaluating gradient\n");
        Buffer<int> output = gradient.realize(8, 8);
---------------------------------------
 > Begin pipeline gradient.0()
 > Store gradient.0(0, 0) = 0
 > Store gradient.0(1, 0) = 1
 > Store gradient.0(2, 0) = 2
 > Store gradient.0(3, 0) = 3
 > Store gradient.0(4, 0) = 4
 > Store gradient.0(5, 0) = 5
 > Store gradient.0(6, 0) = 6
 > Store gradient.0(7, 0) = 7
 > Store gradient.0(0, 1) = 1
 > Store gradient.0(1, 1) = 2
 > Store gradient.0(2, 1) = 3
 > Store gradient.0(3, 1) = 4
 > Store gradient.0(4, 1) = 5
 > Store gradient.0(5, 1) = 6
 > Store gradient.0(6, 1) = 7
 > Store gradient.0(7, 1) = 8
 > Store gradient.0(0, 2) = 2
 > Store gradient.0(1, 2) = 3
 > Store gradient.0(2, 2) = 4
 > Store gradient.0(3, 2) = 5
 > Store gradient.0(4, 2) = 6
 > Store gradient.0(5, 2) = 7
 > Store gradient.0(6, 2) = 8
 > Store gradient.0(7, 2) = 9
 > Store gradient.0(0, 3) = 3
 > Store gradient.0(1, 3) = 4
 > Store gradient.0(2, 3) = 5
 > Store gradient.0(3, 3) = 6
 > Store gradient.0(4, 3) = 7
 > Store gradient.0(5, 3) = 8
 > Store gradient.0(6, 3) = 9
 > Store gradient.0(7, 3) = 10
 > Store gradient.0(0, 4) = 4
 > Store gradient.0(1, 4) = 5
 > Store gradient.0(2, 4) = 6
 > Store gradient.0(3, 4) = 7
 > Store gradient.0(4, 4) = 8
 > Store gradient.0(5, 4) = 9
 > Store gradient.0(6, 4) = 10
 > Store gradient.0(7, 4) = 11
 > Store gradient.0(0, 5) = 5
 > Store gradient.0(1, 5) = 6
 > Store gradient.0(2, 5) = 7
 > Store gradient.0(3, 5) = 8
 > Store gradient.0(4, 5) = 9
 > Store gradient.0(5, 5) = 10
 > Store gradient.0(6, 5) = 11
 > Store gradient.0(7, 5) = 12
 > Store gradient.0(0, 6) = 6
 > Store gradient.0(1, 6) = 7
 > Store gradient.0(2, 6) = 8
 > Store gradient.0(3, 6) = 9
 > Store gradient.0(4, 6) = 10
 > Store gradient.0(5, 6) = 11
 > Store gradient.0(6, 6) = 12
 > Store gradient.0(7, 6) = 13
 > Store gradient.0(0, 7) = 7
 > Store gradient.0(1, 7) = 8
 > Store gradient.0(2, 7) = 9
 > Store gradient.0(3, 7) = 10
 > Store gradient.0(4, 7) = 11
 > Store gradient.0(5, 7) = 12
 > Store gradient.0(6, 7) = 13
 > Store gradient.0(7, 7) = 14
 > End pipeline gradient.0()
---------------------------------------
        // 这将打印出所有评估gradient（x，y）的时间。

        // 现在我们可以窥探Halide在做什么，让我们尝试第一个调度原语。我们将制作一个新版本的渐变，它可以并行处理每个扫描线。
        Func parallel_gradient("parallel_gradient");
        parallel_gradient(x, y) = x + y;

        // 我们还将跟踪此功能。
        parallel_gradient.trace_stores();

        // 到目前为止，情况是一样的。 我们已经定义了算法，但是还没有说明如何安排它。 通常，探索不同的调度决策不会改变描述算法的代码。

        // 现在，我们告诉Halide在y坐标上使用并行for循环。 在Linux上，我们使用线程池和任务队列来运行它。 在OS X上，我们调用大型中央调度，这对我们执行相同的操作。
        parallel_gradient.parallel(y);

        // 这次，printfs应该出现故障，因为每条扫描线都可能在不同的线程中处理。 线程数应适合您的系统，但是在  Linux上，您可以使用环境变量HL_NUM_THREADS手动控制它。
        printf("\nEvaluating parallel_gradient\n");
        parallel_gradient.realize(8, 8);
--------------------------------------------
 > Begin pipeline parallel_gradient.0()
 > Store parallel_gradient.0(0, 0) = 0
 > Store parallel_gradient.0(1, 0) = 1
 > Store parallel_gradient.0(2, 0) = 2
 > Store parallel_gradient.0(3, 0) = 3
 > Store parallel_gradient.0(4, 0) = 4
 > Store parallel_gradient.0(5, 0) = 5
 > Store parallel_gradient.0(6, 0) = 6
 > Store parallel_gradient.0(7, 0) = 7
 > Store parallel_gradient.0(0, 1) = 1
 > Store parallel_gradient.0(0, 3) = 3
 > Store parallel_gradient.0(0, 2) = 2
 > Store parallel_gradient.0(1, 3) = 4
 > Store parallel_gradient.0(1, 1) = 2
 > Store parallel_gradient.0(1, 2) = 3
 > Store parallel_gradient.0(2, 3) = 5
 > Store parallel_gradient.0(0, 4) = 4
 > Store parallel_gradient.0(2, 1) = 3
 > Store parallel_gradient.0(3, 3) = 6
 > Store parallel_gradient.0(2, 2) = 4
 > Store parallel_gradient.0(0, 5) = 5
 > Store parallel_gradient.0(0, 7) = 7
 > Store parallel_gradient.0(0, 6) = 6
 > Store parallel_gradient.0(1, 6) = 7
 > Store parallel_gradient.0(1, 7) = 8
 > Store parallel_gradient.0(3, 1) = 4
 > Store parallel_gradient.0(2, 6) = 8
 > Store parallel_gradient.0(2, 7) = 9
 > Store parallel_gradient.0(1, 4) = 5
 > Store parallel_gradient.0(3, 7) = 10
 > Store parallel_gradient.0(3, 6) = 9
 > Store parallel_gradient.0(4, 3) = 7
 > Store parallel_gradient.0(4, 6) = 10
 > Store parallel_gradient.0(4, 7) = 11
 > Store parallel_gradient.0(3, 2) = 5
 > Store parallel_gradient.0(1, 5) = 6
 > Store parallel_gradient.0(5, 7) = 12
 > Store parallel_gradient.0(5, 6) = 11
 > Store parallel_gradient.0(2, 4) = 6
 > Store parallel_gradient.0(6, 6) = 12
 > Store parallel_gradient.0(6, 7) = 13
 > Store parallel_gradient.0(7, 6) = 13
 > Store parallel_gradient.0(7, 7) = 14
 > Store parallel_gradient.0(4, 1) = 5
 > Store parallel_gradient.0(4, 2) = 6
 > Store parallel_gradient.0(2, 5) = 7
 > Store parallel_gradient.0(5, 1) = 6
 > Store parallel_gradient.0(3, 4) = 7
 > Store parallel_gradient.0(5, 3) = 8
 > Store parallel_gradient.0(5, 2) = 7
 > Store parallel_gradient.0(3, 5) = 8
 > Store parallel_gradient.0(6, 2) = 8
 > Store parallel_gradient.0(4, 5) = 9
 > Store parallel_gradient.0(7, 2) = 9
 > Store parallel_gradient.0(5, 5) = 10
 > Store parallel_gradient.0(6, 1) = 7
 > Store parallel_gradient.0(6, 3) = 9
 > Store parallel_gradient.0(6, 5) = 11
 > Store parallel_gradient.0(7, 3) = 10
 > Store parallel_gradient.0(7, 1) = 8
 > Store parallel_gradient.0(4, 4) = 8
 > Store parallel_gradient.0(7, 5) = 12
 > Store parallel_gradient.0(5, 4) = 9
 > Store parallel_gradient.0(6, 4) = 10
 > Store parallel_gradient.0(7, 4) = 11
 > End pipeline parallel_gradient.0()
----------------------------------------------
    }

    // 打印单个Exprs。
    {
        // trace_stores()只能打印Func的值。有时您想检查子表达式的值，而不是整个Func。内置函数'print'可以包装在任何Expr周围，以在每次评估时打印该Expr的值。

        // 例如，假设我们有一些Func，它是两个项的和：
        Func f;
        f(x, y) = sin(x) + cos(y);

        // 如果我们只想检查其中一个术语，则可以像这样包装“print”：
        Func g;
        g(x, y) = sin(x) + print(cos(y));

        printf("\nEvaluating sin(x) + cos(y), and just printing cos(y)\n");
        g.realize(4, 4);
--------------------------------------
 > 1.000000
 > 1.000000
 > 1.000000
 > 1.000000
 > 0.540302
 > 0.540302
 > 0.540302
 > 0.540302
 > -0.416147
 > -0.416147
 > -0.416147
 > -0.416147
 > -0.989992
 > -0.989992
 > -0.989992
 > -0.989992
-----------------------------------------
    }

    // 打印其他上下文。
    {
        // print可以使用多个参数。它会打印所有内容并评估为第一个。参数可以是Exprs或常量字符串。这可以用于在值旁边打印其他上下文：
        Func f;
        f(x, y) = sin(x) + print(cos(y), "<- this is cos(", y, ") when x =", x);

        printf("\nEvaluating sin(x) + cos(y), and printing cos(y) with more context\n");
        f.realize(4, 4);
------------------------------------------------
 > 1.000000 <- this is cos( 0 ) when x = 0
 > 1.000000 <- this is cos( 0 ) when x = 1
 > 1.000000 <- this is cos( 0 ) when x = 2
 > 1.000000 <- this is cos( 0 ) when x = 3
 > 0.540302 <- this is cos( 1 ) when x = 0
 > 0.540302 <- this is cos( 1 ) when x = 1
 > 0.540302 <- this is cos( 1 ) when x = 2
 > 0.540302 <- this is cos( 1 ) when x = 3
 > -0.416147 <- this is cos( 2 ) when x = 0
 > -0.416147 <- this is cos( 2 ) when x = 1
 > -0.416147 <- this is cos( 2 ) when x = 2
 > -0.416147 <- this is cos( 2 ) when x = 3
 > -0.989992 <- this is cos( 3 ) when x = 0
 > -0.989992 <- this is cos( 3 ) when x = 1
 > -0.989992 <- this is cos( 3 ) when x = 2
 > -0.989992 <- this is cos( 3 ) when x = 3
-------------------------------------------------
        // 将上面的表达式分成多行可能会很有用，以使调试时更容易打开和关闭打印某些值。
        Expr e = cos(y);
        // 取消注释以下行以打印cos（y）的值
        // e = print(e, "<- this is cos(", y, ") when x =", x);
        Func g;
        g(x, y) = sin(x) + e;
        g.realize(4, 4);
    }

    // 条件打印
    {
        // print和trace_stores均可产生大量输出。如果您正在寻找一个罕见的事件，或者只是想看看单个像素发生了什么，那么很难挖掘出如此大量的输出。可以使用函数print_when有条件地打印Expr。print_when的第一个参数是布尔Expr。 如果Expr的计算结果为true，则返回第二个参数并打印所有参数。如果Expr的计算结果为false，则仅返回第二个参数，并且不打印。

        Func f;
        Expr e = cos(y);
        e = print_when(x == 37 && y == 42, e, "<- this is cos(y) at x, y == (37, 42)");
        f(x, y) = sin(x) + e;
        printf("\nEvaluating sin(x) + cos(y), and printing cos(y) at a single pixel\n");
        f.realize(640, 480);
---------------------------------------------------------------
> -0.399985 <- this is cos(y) at x, y == (37, 42)
----------------------------------------------------------------
        // print_when也可以用来检查您不期望的值：
        Func g;
        e = cos(y);
        e = print_when(e < 0, e, "cos(y) < 0 at y ==", y);
        g(x, y) = sin(x) + e;
        printf("\nEvaluating sin(x) + cos(y), and printing whenever cos(y) < 0\n");
        g.realize(4, 4);
------------------------------------------
 > -0.416147 cos(y) < 0 at y == 2
 > -0.416147 cos(y) < 0 at y == 2
 > -0.416147 cos(y) < 0 at y == 2
 > -0.416147 cos(y) < 0 at y == 2
 > -0.989992 cos(y) < 0 at y == 3
 > -0.989992 cos(y) < 0 at y == 3
 > -0.989992 cos(y) < 0 at y == 3
 > -0.989992 cos(y) < 0 at y == 3
-------------------------------------------
    }

    // 在编译时打印表达式。
    {
        // 上面的代码跨几行代码构建了Halide Expr。如果您正在以编程方式构造一个复杂的表达式，并且想要检查所创建的Expr是否符合您的想法，则还可以使用C++流打印出表达式本身：
        Var fizz("fizz"), buzz("buzz");
        Expr e = 1;
        for (int i = 2; i < 100; i++) {
            if (i % 3 == 0 && i % 5 == 0) e += fizz*buzz;
            else if (i % 3 == 0) e += fizz;
            else if (i % 5 == 0) e += buzz;
            else e += i;
        }
        std::cout << "Printing a complex Expr: " << e << "\n";
-------------------------------------------------------------------------------
> Printing a complex Expr: ((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((1 + 2) + fizz) + 4) + buzz) + fizz) + 7) + 8) + fizz) + buzz) + 11) + fizz) + 13) + 14) + (fizz*buzz)) + 16) + 17) + fizz) + 19) + buzz) + fizz) + 22) + 23) + fizz) + buzz) + 26) + fizz) + 28) + 29) + (fizz*buzz)) + 31) + 32) + fizz) + 34) + buzz) + fizz) + 37) + 38) + fizz) + buzz) + 41) + fizz) + 43) + 44) + (fizz*buzz)) + 46) + 47) + fizz) + 49) + buzz) + fizz) + 52) + 53) + fizz) + buzz) + 56) + fizz) + 58) + 59) + (fizz*buzz)) + 61) + 62) + fizz) + 64) + buzz) + fizz) + 67) + 68) + fizz) + buzz) + 71) + fizz) + 73) + 74) + (fizz*buzz)) + 76) + 77) + fizz) + 79) + buzz) + fizz) + 82) + 83) + fizz) + buzz) + 86) + fizz) + 88) + 89) + (fizz*buzz)) + 91) + 92) + fizz) + 94) + buzz) + fizz) + 97) + 98) + fizz)    }
-----------------------------------------------------------------------------------
    printf("Success!\n");
    return 0;
}

第五章 Vectorize, parallelize, unroll and tile 你的代码

#include "Halide.h"
#include <stdio.h>
#include <algorithm>
using namespace Halide;
int main(int argc, char **argv) {
    // 我们将以几种不同的方式定义和安排渐变函数，并查看像素的计算顺序。
    Var x("x"), y("y");
    // 首先，我们观察默认顺序。
    {
        Func gradient("gradient");
        gradient(x, y) = x + y;
        gradient.trace_stores();

        // 默认情况下，我们沿着行走，然后沿着列走。这意味着x变化很快，而y变化缓慢。x是列，y是行，因此这是行主要遍历。
        printf("Evaluating gradient row-major\n");
        Buffer<int> output = gradient.realize(4, 4);
----------------------------------------------------------------
 > Begin pipeline gradient.0()
 > Store gradient.0(0, 0) = 0
 > Store gradient.0(1, 0) = 1
 > Store gradient.0(2, 0) = 2
 > Store gradient.0(3, 0) = 3
 > Store gradient.0(0, 1) = 1
 > Store gradient.0(1, 1) = 2
 > Store gradient.0(2, 1) = 3
 > Store gradient.0(3, 1) = 4
 > Store gradient.0(0, 2) = 2
 > Store gradient.0(1, 2) = 3
 > Store gradient.0(2, 2) = 4
 > Store gradient.0(3, 2) = 5
 > Store gradient.0(0, 3) = 3
 > Store gradient.0(1, 3) = 4
 > Store gradient.0(2, 3) = 5
 > Store gradient.0(3, 3) = 6
 > End pipeline gradient.0()
----------------------------------------------------------------
        // 有关此操作的可视化信息，请参见下图。
        参考下面图51

        // 等价的C代码为:
        printf("Equivalent C:\n");
        for (int y = 0; y < 4; y++) {
            for (int x = 0; x < 4; x++) {
                printf("Evaluating at x = %d, y = %d: %d\n", x, y, x + y);
            }
        }
        printf("\n\n");

        // Tracing 是了解执行步骤的一种有用方法。您还可以要求Halide打印出伪代码，以显示Halide正在生成哪些循环:
        printf("Pseudo-code for the schedule:\n");
        gradient.print_loop_nest();
        printf("\n");
----------------------------------------------------------
 > produce gradient:
 >   for y:
 >     for x:
 >       gradient(...) = ...
----------------------------------------------------------
        // 因为我们使用的是默认顺序，所以它应该打印:
        // compute gradient:
        //   for y:
        //     for x:
        //       gradient(...) = ...
    }

    // 重新排列变量.
    {
        Func gradient("gradient_col_major");
        gradient(x, y) = x + y;
        gradient.trace_stores();

        // 如果我们对x和y重新排序，我们可以改为沿列走。重新排序的调用采用Func的参数，并为生成的for循环设置新的嵌套顺序。参数是从最内层循环中指定的，因此以下调用将y放入内层循环:
        gradient.reorder(y, x);

        // 这意味着y（行）将快速变化，而x（列）将缓慢变化，因此这是列的主要遍历。

        printf("Evaluating gradient column-major\n");
        Buffer<int> output = gradient.realize(4, 4);
--------------------------------------------------------
 > Begin pipeline gradient_col_major.0()
 > Store gradient_col_major.0(0, 0) = 0
 > Store gradient_col_major.0(0, 1) = 1
 > Store gradient_col_major.0(0, 2) = 2
 > Store gradient_col_major.0(0, 3) = 3
 > Store gradient_col_major.0(1, 0) = 1
 > Store gradient_col_major.0(1, 1) = 2
 > Store gradient_col_major.0(1, 2) = 3
 > Store gradient_col_major.0(1, 3) = 4
 > Store gradient_col_major.0(2, 0) = 2
 > Store gradient_col_major.0(2, 1) = 3
 > Store gradient_col_major.0(2, 2) = 4
 > Store gradient_col_major.0(2, 3) = 5
 > Store gradient_col_major.0(3, 0) = 3
 > Store gradient_col_major.0(3, 1) = 4
 > Store gradient_col_major.0(3, 2) = 5
 > Store gradient_col_major.0(3, 3) = 6
 > End pipeline gradient_col_major.0()
-------------------------------------------------------
        // 有关此操作的可视化信息，请参见下图。
        参考下面图52

        printf("Equivalent C:\n");
        for (int x = 0; x < 4; x++) {
            for (int y = 0; y < 4; y++) {
                printf("Evaluating at x = %d, y = %d: %d\n", x, y, x + y);
            }
        }
        printf("\n");

        // 如果我们为此时间表打印伪代码，我们将看到y循环现在位于x循环内。
        printf("Pseudo-code for the schedule:\n");
        gradient.print_loop_nest();
        printf("\n");
----------------------------------------------------------------
 > produce gradient_col_major:
 >   for x:
 >     for y:
 >       gradient_col_major(...) = ...
----------------------------------------------------------------
    }

    // 将变量分成两个.
    {
        Func gradient("gradient_split");
        gradient(x, y) = x + y;
        gradient.trace_stores();

        // 对var执行的最强大的原始调度操作是将其拆分为内部和外部子变量：
        Var x_outer, x_inner;
        gradient.split(x, x_outer, x_inner, 2);

        // 这将x上的循环分为两个嵌套循环：x_outer上的外部循环和x_inner上的内部循环。 拆分的最后一个参数是“split factor(分离因子)”。 内部循环从零到分离因子。 外循环从零开始到所需的范围x（在这种情况下为4）除以分离因子。 在循环中，旧变量定义为外部*因子+内部。 如果旧循环以非零值开始，则该值也将添加到循环中。

        printf("Evaluating gradient with x split into x_outer and x_inner \n");
        Buffer<int> output = gradient.realize(4, 4);
------------------------------------------------------
 > Begin pipeline gradient_split.0()
 > Store gradient_split.0(0, 0) = 0
 > Store gradient_split.0(1, 0) = 1
 > Store gradient_split.0(2, 0) = 2
 > Store gradient_split.0(3, 0) = 3
 > Store gradient_split.0(0, 1) = 1
 > Store gradient_split.0(1, 1) = 2
 > Store gradient_split.0(2, 1) = 3
 > Store gradient_split.0(3, 1) = 4
 > Store gradient_split.0(0, 2) = 2
 > Store gradient_split.0(1, 2) = 3
 > Store gradient_split.0(2, 2) = 4
 > Store gradient_split.0(3, 2) = 5
 > Store gradient_split.0(0, 3) = 3
 > Store gradient_split.0(1, 3) = 4
 > Store gradient_split.0(2, 3) = 5
 > Store gradient_split.0(3, 3) = 6
 > End pipeline gradient_split.0()
--------------------------------------------------------
        printf("Equivalent C:\n");
        for (int y = 0; y < 4; y++) {
            for (int x_outer = 0; x_outer < 2; x_outer++) {
                for (int x_inner = 0; x_inner < 2; x_inner++) {
                    int x = x_outer * 2 + x_inner;
                    printf("Evaluating at x = %d, y = %d: %d\n", x, y, x + y);
                }
            }
        }
        printf("\n");

        printf("Pseudo-code for the schedule:\n");
        gradient.print_loop_nest();
        printf("\n");
------------------------------------------------------
 > produce gradient_split:
 >   for y:
 >     for x.x_outer:
 >       for x.x_inner in [0, 1]:
 >         gradient_split(...) = ...
------------------------------------------------------
        // 请注意，像素的评估顺序实际上并没有改变！ 拆分本身没有任何作用，但是它确实提供了我们将在下面探讨的所有调度可能性。
    }

    // 将两个变量融合为一个.
    {
        Func gradient("gradient_fused");
        gradient(x, y) = x + y;

        // 分离的相反是“融合”。 融合两个变量会将两个循环合并到扩展范围乘积上的单个for循环中。融合的重要性不如拆分，但它也很有用（我们将在本课程的后面看到）。像拆分一样，融合本身不会改变评估的顺序。
        Var fused;
        gradient.fuse(x, y, fused);

        printf("Evaluating gradient with x and y fused\n");
        Buffer<int> output = gradient.realize(4, 4);

        printf("Equivalent C:\n");
        for (int fused = 0; fused < 4*4; fused++) {
            int y = fused / 4;
            int x = fused % 4;
            printf("Evaluating at x = %d, y = %d: %d\n", x, y, x + y);
        }
        printf("\n");

        printf("Pseudo-code for the schedule:\n");
        gradient.print_loop_nest();
        printf("\n");
-------------------------------------------
 > produce gradient_fused:
 >   for x.fused:
 >     gradient_fused(...) = ...
-------------------------------------------
    }

    // 在图块中评估。
    {
        Func gradient("gradient_tiled");
        gradient(x, y) = x + y;
        gradient.trace_stores();

        // 现在我们既可以拆分又可以重新排序，我们可以进行切片评估。让我们将x和y分别除以4，然后对var重新排序以表示块遍历。块遍历将域划分为小的矩形块，并且最外层迭代在块上，并且在其中进行遍历每个块内的点。 如果相邻像素使用重叠的输入数据（例如模糊），则可能对性能有好处。 我们可以这样表示块遍历：
        Var x_outer, x_inner, y_outer, y_inner;
        gradient.split(x, x_outer, x_inner, 4);
        gradient.split(y, y_outer, y_inner, 4);
        gradient.reorder(x_inner, y_inner, x_outer, y_outer);

        // 这种模式很常见，因此有一个简写:
        // gradient.tile(x, y, x_outer, y_outer, x_inner, y_inner, 4, 4);

        printf("Evaluating gradient in 4x4 tiles\n");
        Buffer<int> output = gradient.realize(8, 8);
-------------------------------------------------------------
 > Begin pipeline gradient_tiled.0()
 > Store gradient_tiled.0(0, 0) = 0
 > Store gradient_tiled.0(1, 0) = 1
 > Store gradient_tiled.0(2, 0) = 2
 > Store gradient_tiled.0(3, 0) = 3
 > Store gradient_tiled.0(0, 1) = 1
 > Store gradient_tiled.0(1, 1) = 2
 > Store gradient_tiled.0(2, 1) = 3
 > Store gradient_tiled.0(3, 1) = 4
 > Store gradient_tiled.0(0, 2) = 2
 > Store gradient_tiled.0(1, 2) = 3
 > Store gradient_tiled.0(2, 2) = 4
 > Store gradient_tiled.0(3, 2) = 5
 > Store gradient_tiled.0(0, 3) = 3
 > Store gradient_tiled.0(1, 3) = 4
 > Store gradient_tiled.0(2, 3) = 5
 > Store gradient_tiled.0(3, 3) = 6
 > Store gradient_tiled.0(4, 0) = 4
 > Store gradient_tiled.0(5, 0) = 5
 > Store gradient_tiled.0(6, 0) = 6
 > Store gradient_tiled.0(7, 0) = 7
 > Store gradient_tiled.0(4, 1) = 5
 > Store gradient_tiled.0(5, 1) = 6
 > Store gradient_tiled.0(6, 1) = 7
 > Store gradient_tiled.0(7, 1) = 8
 > Store gradient_tiled.0(4, 2) = 6
 > Store gradient_tiled.0(5, 2) = 7
 > Store gradient_tiled.0(6, 2) = 8
 > Store gradient_tiled.0(7, 2) = 9
 > Store gradient_tiled.0(4, 3) = 7
 > Store gradient_tiled.0(5, 3) = 8
 > Store gradient_tiled.0(6, 3) = 9
 > Store gradient_tiled.0(7, 3) = 10
 > Store gradient_tiled.0(0, 4) = 4
 > Store gradient_tiled.0(1, 4) = 5
 > Store gradient_tiled.0(2, 4) = 6
 > Store gradient_tiled.0(3, 4) = 7
 > Store gradient_tiled.0(0, 5) = 5
 > Store gradient_tiled.0(1, 5) = 6
 > Store gradient_tiled.0(2, 5) = 7
 > Store gradient_tiled.0(3, 5) = 8
 > Store gradient_tiled.0(0, 6) = 6
 > Store gradient_tiled.0(1, 6) = 7
 > Store gradient_tiled.0(2, 6) = 8
 > Store gradient_tiled.0(3, 6) = 9
 > Store gradient_tiled.0(0, 7) = 7
 > Store gradient_tiled.0(1, 7) = 8
 > Store gradient_tiled.0(2, 7) = 9
 > Store gradient_tiled.0(3, 7) = 10
 > Store gradient_tiled.0(4, 4) = 8
 > Store gradient_tiled.0(5, 4) = 9
 > Store gradient_tiled.0(6, 4) = 10
 > Store gradient_tiled.0(7, 4) = 11
 > Store gradient_tiled.0(4, 5) = 9
 > Store gradient_tiled.0(5, 5) = 10
 > Store gradient_tiled.0(6, 5) = 11
 > Store gradient_tiled.0(7, 5) = 12
 > Store gradient_tiled.0(4, 6) = 10
 > Store gradient_tiled.0(5, 6) = 11
 > Store gradient_tiled.0(6, 6) = 12
 > Store gradient_tiled.0(7, 6) = 13
 > Store gradient_tiled.0(4, 7) = 11
 > Store gradient_tiled.0(5, 7) = 12
 > Store gradient_tiled.0(6, 7) = 13
 > Store gradient_tiled.0(7, 7) = 14
 > End pipeline gradient_tiled.0()
-----------------------------------------------------------------------
        // 可视化图如下
        参考下面图53

        printf("Equivalent C:\n");
        for (int y_outer = 0; y_outer < 2; y_outer++) {
            for (int x_outer = 0; x_outer < 2; x_outer++) {
                for (int y_inner = 0; y_inner < 4; y_inner++) {
                    for (int x_inner = 0; x_inner < 4; x_inner++) {
                        int x = x_outer * 4 + x_inner;
                        int y = y_outer * 4 + y_inner;
                        printf("Evaluating at x = %d, y = %d: %d\n", x, y, x + y);
                    }
                }
            }
        }
        printf("\n");

        printf("Pseudo-code for the schedule:\n");
        gradient.print_loop_nest();
        printf("\n");
-------------------------------------------------------------
 > produce gradient_tiled:
 >   for y.y_outer:
 >     for x.x_outer:
 >       for y.y_inner in [0, 3]:
 >         for x.x_inner in [0, 3]:
 >           gradient_tiled(...) = ...
-------------------------------------------------------------
    }

    // 用向量评估.
    {
        Func gradient("gradient_in_vectors");
        gradient(x, y) = x + y;
        gradient.trace_stores();

        // 拆分的好处在于，它确保内部变量从零到分割因子。在大多数情况下，分割因子将是编译时常量，因此我们可以使用单个向量来替换内部变量上的循环。这次我们将除以四，因为在X86上我们可以使用SSE来计算4宽向量。
        Var x_outer, x_inner;
        gradient.split(x, x_outer, x_inner, 4);
        gradient.vectorize(x_inner);

        // 拆分然后向量化内部变量非常普遍，以至于它有一个简写形式。 我们还可以说:
        // gradient.vectorize(x, 4);
        // 等价于:
        // gradient.split(x, x, x_inner, 4);
        // gradient.vectorize(x_inner);
        // 请注意，在这种情况下，我们将名称“x”作为新的外部变量重用。以后引用x的调度调用将引用这个名为x的新外部变量。

        // 这次，我们将对一个8x4的框进行评估，以便每条扫描线有多个工作向量.
        printf("Evaluating gradient with x_inner vectorized \n");
        Buffer<int> output = gradient.realize(8, 4);
---------------------------------------------------------------------------------
 > Begin pipeline gradient_in_vectors.0()
 > Store gradient_in_vectors.0(<0, 1, 2, 3>, <0, 0, 0, 0>) = <0, 1, 2, 3>
 > Store gradient_in_vectors.0(<4, 5, 6, 7>, <0, 0, 0, 0>) = <4, 5, 6, 7>
 > Store gradient_in_vectors.0(<0, 1, 2, 3>, <1, 1, 1, 1>) = <1, 2, 3, 4>
 > Store gradient_in_vectors.0(<4, 5, 6, 7>, <1, 1, 1, 1>) = <5, 6, 7, 8>
 > Store gradient_in_vectors.0(<0, 1, 2, 3>, <2, 2, 2, 2>) = <2, 3, 4, 5>
 > Store gradient_in_vectors.0(<4, 5, 6, 7>, <2, 2, 2, 2>) = <6, 7, 8, 9>
 > Store gradient_in_vectors.0(<0, 1, 2, 3>, <3, 3, 3, 3>) = <3, 4, 5, 6>
 > Store gradient_in_vectors.0(<4, 5, 6, 7>, <3, 3, 3, 3>) = <7, 8, 9, 10>
 > End pipeline gradient_in_vectors.0()
----------------------------------------------------------------------------------
        // 可视化如下
        参考下面图54

        printf("Equivalent C:\n");
        for (int y = 0; y < 4; y++) {
            for (int x_outer = 0; x_outer < 2; x_outer++) {
                // x_inner上的循环已消失，并由表达式的矢量化版本取代。在x86处理器上，Halide为此生成所有SSE.
                int x_vec[] = {x_outer * 4 + 0,
                               x_outer * 4 + 1,
                               x_outer * 4 + 2,
                               x_outer * 4 + 3};
                int val[] = {x_vec[0] + y,
                             x_vec[1] + y,
                             x_vec[2] + y,
                             x_vec[3] + y};
                printf("Evaluating at <%d, %d, %d, %d>, <%d, %d, %d, %d>:"
                       " <%d, %d, %d, %d>\n",
                       x_vec[0], x_vec[1], x_vec[2], x_vec[3],
                       y, y, y, y,
                       val[0], val[1], val[2], val[3]);
            }
        }
        printf("\n");

        printf("Pseudo-code for the schedule:\n");
        gradient.print_loop_nest();
        printf("\n");
--------------------------------------------------------------------
 > produce gradient_in_vectors:
 >   for y:
 >     for x.x_outer:
 >       vectorized x.x_inner in [0, 3]:
 >         gradient_in_vectors(...) = ...
----------------------------------------------------------------------
    }

    // 展开循环.
    {
        Func gradient("gradient_unroll");
        gradient(x, y) = x + y;
        gradient.trace_stores();

        // 如果多个像素共享重叠的数据，则展开计算是有意义的，以便共享值仅计算或加载一次。我们这样做与表示向量化的方式类似。我们拆分一个维度，然后完全展开内部变量的循环。展开不会改变评估的顺序.
        Var x_outer, x_inner;
        gradient.split(x, x_outer, x_inner, 2);
        gradient.unroll(x_inner);

        // 简写为:
        // gradient.unroll(x, 2);

        printf("Evaluating gradient unrolled by a factor of two\n");
        Buffer<int> result = gradient.realize(4, 4);
-----------------------------------------------------------------------------
 > Begin pipeline gradient_unroll.0()
 > Store gradient_unroll.0(0, 0) = 0
 > Store gradient_unroll.0(1, 0) = 1
 > Store gradient_unroll.0(2, 0) = 2
 > Store gradient_unroll.0(3, 0) = 3
 > Store gradient_unroll.0(0, 1) = 1
 > Store gradient_unroll.0(1, 1) = 2
 > Store gradient_unroll.0(2, 1) = 3
 > Store gradient_unroll.0(3, 1) = 4
 > Store gradient_unroll.0(0, 2) = 2
 > Store gradient_unroll.0(1, 2) = 3
 > Store gradient_unroll.0(2, 2) = 4
 > Store gradient_unroll.0(3, 2) = 5
 > Store gradient_unroll.0(0, 3) = 3
 > Store gradient_unroll.0(1, 3) = 4
 > Store gradient_unroll.0(2, 3) = 5
 > Store gradient_unroll.0(3, 3) = 6
 > End pipeline gradient_unroll.0()
-----------------------------------------------------------------------------
        printf("Equivalent C:\n");
        for (int y = 0; y < 4; y++) {
            for (int x_outer = 0; x_outer < 2; x_outer++) {
                // 我们得到了最里面的语句的两个副本，而不是x_inner上的for循环。
                {
                    int x_inner = 0;
                    int x = x_outer * 2 + x_inner;
                    printf("Evaluating at x = %d, y = %d: %d\n", x, y, x + y);
                }
                {
                    int x_inner = 1;
                    int x = x_outer * 2 + x_inner;
                    printf("Evaluating at x = %d, y = %d: %d\n", x, y, x + y);
                }
            }
        }
        printf("\n");

        printf("Pseudo-code for the schedule:\n");
        gradient.print_loop_nest();
        printf("\n");
----------------------------------------------------------------
 > produce gradient_unroll:
 >   for y:
 >     for x.x_outer:
 >       unrolled x.x_inner in [0, 1]:
 >         gradient_unroll(...) = ...
------------------------------------------------------------------
    }

    // 在不分割内容的条件下，利用因子进行分割
    {
        Func gradient("gradient_split_7x2");
        gradient(x, y) = x + y;
        gradient.trace_stores();

        // 拆分可确保内部循环从零到拆分因子，这对于我们上面看到的用法很重要。那么，当我们希望评估x 的总范围不是分割因子的倍数时，会发生什么？ 我们将系数除以三，然后评估7x2框而不是我们一直使用的4x4框的渐变。
        Var x_outer, x_inner;
        gradient.split(x, x_outer, x_inner, 3);

        printf("Evaluating gradient over a 7x2 box with x split by three \n");
        Buffer<int> output = gradient.realize(7, 2);
-----------------------------------------------------------------------
 > Begin pipeline gradient_split_7x2.0()
 > Store gradient_split_7x2.0(0, 0) = 0
 > Store gradient_split_7x2.0(1, 0) = 1
 > Store gradient_split_7x2.0(2, 0) = 2
 > Store gradient_split_7x2.0(3, 0) = 3
 > Store gradient_split_7x2.0(4, 0) = 4
 > Store gradient_split_7x2.0(5, 0) = 5
 > Store gradient_split_7x2.0(4, 0) = 4
 > Store gradient_split_7x2.0(5, 0) = 5
 > Store gradient_split_7x2.0(6, 0) = 6
 > Store gradient_split_7x2.0(0, 1) = 1
 > Store gradient_split_7x2.0(1, 1) = 2
 > Store gradient_split_7x2.0(2, 1) = 3
 > Store gradient_split_7x2.0(3, 1) = 4
 > Store gradient_split_7x2.0(4, 1) = 5
 > Store gradient_split_7x2.0(5, 1) = 6
 > Store gradient_split_7x2.0(4, 1) = 5
 > Store gradient_split_7x2.0(5, 1) = 6
 > Store gradient_split_7x2.0(6, 1) = 7
 > End pipeline gradient_split_7x2.0()
--------------------------------------------------------------------------
        // 请参阅以下内容以了解发生的情况。 请注意，有些点得到的评价不止一次！
        参考下面图55

        printf("Equivalent C:\n");
        for (int y = 0; y < 2; y++) {
            for (int x_outer = 0; x_outer < 3; x_outer++) { // Now runs from 0 to 2
                for (int x_inner = 0; x_inner < 3; x_inner++) {
                    int x = x_outer * 3;
                    // 在添加x_inner之前，请确保不要评估7x2框之外的点。 我们将x限制为最大4（7减去分割因子）。
                    if (x > 4) x = 4;
                    x += x_inner;
                    printf("Evaluating at x = %d, y = %d: %d\n", x, y, x + y);
                }
            }
        }
        printf("\n");

        printf("Pseudo-code for the schedule:\n");
        gradient.print_loop_nest();
        printf("\n");
--------------------------------------------------------------------
 > produce gradient_split_7x2:
 >   for y:
 >     for x.x_outer:
 >       for x.x_inner in [0, 2]:
 >         gradient_split_7x2(...) = ...
-------------------------------------------------------------------

        // 如果看输出，您将看到对某些坐标进行了多次评估。 通常没关系，因为纯Halide函数没有副作用，因此可以安全地多次评估同一点。 如果您像我们一样使用C函数，则有责任确保可以处理多次被评估的同一点。

        // 规则是: 如果我们要求x从x_min到x_min + x_extent，并除以因子“ factor”，则:
        // x_outer从0 to (x_extent + factor - 1)/factor
        // x_inner 从 0 to factor
        // x = min(x_outer * factor, x_extent - factor) + x_inner + x_min
        // 在本例中, x_min 是0, x_extent 是7, factor 是3.

        // 但是，如果您编写带有更新定义的Halide函数（请参阅第9节），那么多次评估同一点是不安全的，因此我们不会应用此技巧。 取而代之的是，计算的值范围将四舍五入到拆分因子的下一个倍数。
    }

    // 融合，平铺和并行化.
    {
        // 在上一课中，我们看到了可以对变量进行并行化的过程。 在这里，我们将其与融合和平铺结合起来以表达有用的模式-并行处理图块。

        // 这是融合的光芒。 当您想跨多个维度并行化而不引入嵌套并行性时，融合会有所帮助。嵌套并行性（并行for循环中的parallel for循环）受Halide支持，但与将并行变量融合到单个并行for循环中相比，性能通常较差。

        Func gradient("gradient_fused_tiles");
        gradient(x, y) = x + y;
        gradient.trace_stores();

        // 首先，我们将平铺，然后将平铺索引融合在一起，并在组合中进行并行化。
        Var x_outer, y_outer, x_inner, y_inner, tile_index;
        gradient.tile(x, y, x_outer, y_outer, x_inner, y_inner, 4, 4);
        gradient.fuse(x_outer, y_outer, tile_index);
        gradient.parallel(tile_index);

        // 调度调用全部返回对Func的引用，因此您也可以将它们链接到一个语句中，以使事情更加清晰:
        // gradient
        //     .tile(x, y, x_outer, y_outer, x_inner, y_inner, 2, 2)
        //     .fuse(x_outer, y_outer, tile_index)
        //     .parallel(tile_index);


        printf("Evaluating gradient tiles in parallel\n");
        Buffer<int> output = gradient.realize(8, 8);
---------------------------------------------------------------
 > Begin pipeline gradient_fused_tiles.0()
 > Store gradient_fused_tiles.0(0, 0) = 0
 > Store gradient_fused_tiles.0(1, 0) = 1
 > Store gradient_fused_tiles.0(2, 0) = 2
 > Store gradient_fused_tiles.0(3, 0) = 3
 > Store gradient_fused_tiles.0(0, 1) = 1
 > Store gradient_fused_tiles.0(1, 1) = 2
 > Store gradient_fused_tiles.0(2, 1) = 3
 > Store gradient_fused_tiles.0(3, 1) = 4
 > Store gradient_fused_tiles.0(0, 2) = 2
 > Store gradient_fused_tiles.0(1, 2) = 3
 > Store gradient_fused_tiles.0(4, 0) = 4
 > Store gradient_fused_tiles.0(2, 2) = 4
 > Store gradient_fused_tiles.0(5, 0) = 5
 > Store gradient_fused_tiles.0(0, 4) = 4
 > Store gradient_fused_tiles.0(6, 0) = 6
 > Store gradient_fused_tiles.0(1, 4) = 5
 > Store gradient_fused_tiles.0(7, 0) = 7
 > Store gradient_fused_tiles.0(3, 2) = 5
 > Store gradient_fused_tiles.0(4, 1) = 5
 > Store gradient_fused_tiles.0(2, 4) = 6
 > Store gradient_fused_tiles.0(0, 3) = 3
 > Store gradient_fused_tiles.0(5, 1) = 6
 > Store gradient_fused_tiles.0(3, 4) = 7
 > Store gradient_fused_tiles.0(6, 1) = 7
 > Store gradient_fused_tiles.0(1, 3) = 4
 > Store gradient_fused_tiles.0(4, 4) = 8
 > Store gradient_fused_tiles.0(7, 1) = 8
 > Store gradient_fused_tiles.0(0, 5) = 5
 > Store gradient_fused_tiles.0(5, 4) = 9
 > Store gradient_fused_tiles.0(2, 3) = 5
 > Store gradient_fused_tiles.0(4, 2) = 6
 > Store gradient_fused_tiles.0(3, 3) = 6
 > Store gradient_fused_tiles.0(1, 5) = 6
 > Store gradient_fused_tiles.0(5, 2) = 7
 > Store gradient_fused_tiles.0(2, 5) = 7
 > Store gradient_fused_tiles.0(6, 4) = 10
 > Store gradient_fused_tiles.0(6, 2) = 8
 > Store gradient_fused_tiles.0(3, 5) = 8
 > Store gradient_fused_tiles.0(7, 4) = 11
 > Store gradient_fused_tiles.0(7, 2) = 9
 > Store gradient_fused_tiles.0(0, 6) = 6
 > Store gradient_fused_tiles.0(4, 3) = 7
 > Store gradient_fused_tiles.0(4, 5) = 9
 > Store gradient_fused_tiles.0(1, 6) = 7
 > Store gradient_fused_tiles.0(5, 3) = 8
 > Store gradient_fused_tiles.0(5, 5) = 10
 > Store gradient_fused_tiles.0(6, 3) = 9
 > Store gradient_fused_tiles.0(2, 6) = 8
 > Store gradient_fused_tiles.0(6, 5) = 11
 > Store gradient_fused_tiles.0(3, 6) = 9
 > Store gradient_fused_tiles.0(7, 3) = 10
 > Store gradient_fused_tiles.0(7, 5) = 12
 > Store gradient_fused_tiles.0(0, 7) = 7
 > Store gradient_fused_tiles.0(4, 6) = 10
 > Store gradient_fused_tiles.0(1, 7) = 8
 > Store gradient_fused_tiles.0(5, 6) = 11
 > Store gradient_fused_tiles.0(2, 7) = 9
 > Store gradient_fused_tiles.0(6, 6) = 12
 > Store gradient_fused_tiles.0(3, 7) = 10
 > Store gradient_fused_tiles.0(7, 6) = 13
 > Store gradient_fused_tiles.0(4, 7) = 11
 > Store gradient_fused_tiles.0(5, 7) = 12
 > Store gradient_fused_tiles.0(6, 7) = 13
 > Store gradient_fused_tiles.0(7, 7) = 14
 > End pipeline gradient_fused_tiles.0()
-------------------------------------------------------------
        // 切片应以任意顺序出现，但在每个切片中，像素将以行优先顺序遍历。 请参阅下面的可视化内容。
        参考下面图56

        printf("Equivalent (serial) C:\n");
        // 这个最外面的循环应该是并行的for循环，但是在C语言中很难。
        for (int tile_index = 0; tile_index < 4; tile_index++) {
            int y_outer = tile_index / 2;
            int x_outer = tile_index % 2;
            for (int y_inner = 0; y_inner < 4; y_inner++) {
                for (int x_inner = 0; x_inner < 4; x_inner++) {
                    int y = y_outer * 4 + y_inner;
                    int x = x_outer * 4 + x_inner;
                    printf("Evaluating at x = %d, y = %d: %d\n", x, y, x + y);
                }
            }
        }
        printf("\n");

        printf("Pseudo-code for the schedule:\n");
        gradient.print_loop_nest();
        printf("\n");
------------------------------------------------------------------------
 > produce gradient_fused_tiles:
 >   parallel x.x_outer.tile_index:
 >     for y.y_inner in [0, 3]:
 >       for x.x_inner in [0, 3]:
 >         gradient_fused_tiles(...) = ...
--------------------------------------------------------------------------
    }

    // 融合到一起.
    {
        // 你准备好了吗？ 现在，我们将使用上面的所有功能。
        Func gradient_fast("gradient_fast");
        gradient_fast(x, y) = x + y;

        // 我们将并行处理64x64片。
        Var x_outer, y_outer, x_inner, y_inner, tile_index;
        gradient_fast
            .tile(x, y, x_outer, y_outer, x_inner, y_inner, 64, 64)
            .fuse(x_outer, y_outer, tile_index)
            .parallel(tile_index);

        // 当我们跨过每个图块时，我们将一次计算两条扫描线。 我们还将在x中向量化。 最简单的表达方式是在每个图块中再次递归地将其划分为4x2子图块，然后将x上的图块矢量化并在y上展开它们：
        Var x_inner_outer, y_inner_outer, x_vectors, y_pairs;
        gradient_fast
            .tile(x_inner, y_inner, x_inner_outer, y_inner_outer, x_vectors, y_pairs, 4, 2)
            .vectorize(x_vectors)
            .unroll(y_pairs);

        // 请注意，我们没有进行任何显式的拆分或重新排序。 这些是最重要的原始操作，但是大多数情况下，它们埋藏在平铺，矢量化或展开调用之下。
        // 现在让我们在不是图块大小倍数的范围内进行评估。

        // 如果愿意，您可以打开跟踪，但是它将产生很多printfs。 相反，我们将同时使用C语言和Halide计算答案，并查看答案是否匹配。
        Buffer<int> result = gradient_fast.realize(350, 250);

        // 看一下可视化的图
        参考下面图57
         

        printf("Checking Halide result against equivalent C...\n");
        for (int tile_index = 0; tile_index < 6 * 4; tile_index++) {
            int y_outer = tile_index / 4;
            int x_outer = tile_index % 4;
            for (int y_inner_outer = 0; y_inner_outer < 64/2; y_inner_outer++) {
                for (int x_inner_outer = 0; x_inner_outer < 64/4; x_inner_outer++) {
                    // 我们在x上向量化
                    int x = std::min(x_outer * 64, 350-64) + x_inner_outer*4;
                    int x_vec[4] = {x + 0,
                                    x + 1,
                                    x + 2,
                                    x + 3};

                    // 我们在y上展开
                    int y_base = std::min(y_outer * 64, 250-64) + y_inner_outer*2;
                    {
                        // y_pairs = 0
                        int y = y_base + 0;
                        int y_vec[4] = {y, y, y, y};
                        int val[4] = {x_vec[0] + y_vec[0],
                                      x_vec[1] + y_vec[1],
                                      x_vec[2] + y_vec[2],
                                      x_vec[3] + y_vec[3]};

                        // 检查结果.
                        for (int i = 0; i < 4; i++) {
                            if (result(x_vec[i], y_vec[i]) != val[i]) {
                                printf("There was an error at %d %d!\n",
                                       x_vec[i], y_vec[i]);
                                return -1;
                            }
                        }
                    }
                    {
                        // y_pairs = 1
                        int y = y_base + 1;
                        int y_vec[4] = {y, y, y, y};
                        int val[4] = {x_vec[0] + y_vec[0],
                                      x_vec[1] + y_vec[1],
                                      x_vec[2] + y_vec[2],
                                      x_vec[3] + y_vec[3]};

                        // 检查结果
                        for (int i = 0; i < 4; i++) {
                            if (result(x_vec[i], y_vec[i]) != val[i]) {
                                printf("There was an error at %d %d!\n",
                                       x_vec[i], y_vec[i]);
                                return -1;
                            }
                        }
                    }
                }
            }
        }
        printf("\n");

        printf("Pseudo-code for the schedule:\n");
        gradient_fast.print_loop_nest();
        printf("\n");
------------------------------------------------------------------------------
 > produce gradient_fast:
 >   parallel x.x_outer.tile_index:
 >     for y.y_inner.y_inner_outer:
 >       for x.x_inner.x_inner_outer:
 >         unrolled y.y_inner.y_pairs in [0, 1]:
 >           vectorized x.x_inner.x_vectors in [0, 3]:
 >             gradient_fast(...) = ...
-------------------------------------------------------------------------------
        // 请注意，在Halide版本中，算法与优化分开在算法顶部指定一次，并且总共没有多少行代码。将此与C版本进行比较。还有更多的代码（甚至没有并行化或向量化）。更令人讨厌的是，算法的语句（结果是x加y）被埋在混乱中的多个地方。 此C代码难以编写，难以读取，难以调试以及难以进一步优化。 这就是halide存在的原因。
    }
    printf("Success!\n");
    return 0;
}

图51

图52

图53

图54

图55

图56

图57

Aoulun

关注

0
点赞
踩
2

收藏

觉得还不错? 一键收藏
0
评论
halide编程技术指南（连载二）

本文是halide编程指南的连载，已同步至公众号目录第二章图像处理第三章检查生成的代码第四章使用tracing, print, 和print_when进行调试第五章 Vectorize, parallelize, unroll and tile 你的代码第二章图像处理本章演示如何传递输入图像并进行操作。// halide提供了对应方法用于导入png图像...
复制链接

扫一扫