Halide Study Notes: Reading the Halide Tutorial Source, Part 18

姑苏隐士
Category: HALIDE. Tags: reading source code
Article link: https://blog.csdn.net/luzhanbo207/article/details/78902355
Halide Introductory Tutorial 18

// Halide tutorial lesson 18: Factoring an associative reduction using rfactor

// This lesson demonstrates how to parallelize or vectorize an associative
// reduction using the scheduling directive 'rfactor'.

// On linux, you can compile and run it like so:
// g++ lesson_18*.cpp -g -I ../include -L ../bin -lHalide -lpthread -ldl -o lesson_18 -std=c++11
// LD_LIBRARY_PATH=../bin ./lesson_18

#include "Halide.h"
#include <stdio.h>
#include <stdlib.h>  // for rand()

using namespace Halide;

int main(int argc, char **argv) {
    // Declare some Vars to use below.
    Var x("x"), y("y"), i("i"), u("u"), v("v");

    // Create an input with random values.
    Buffer<uint8_t> input(8, 8, "input");
    for (int y = 0; y < 8; ++y) {
        for (int x = 0; x < 8; ++x) {
            input(x, y) = (rand() % 256);
        }
    }

    {
        // As mentioned previously in lesson 9, parallelizing variables that
        // are part of a reduction domain is tricky, since there may be data
        // dependencies across those variables.

        // Consider the histogram example in lesson 9:
        Func histogram("hist_serial");
        histogram(i) = 0;
        RDom r(0, input.width(), 0, input.height());
        histogram(input(r.x, r.y) / 32) += 1;

        histogram.vectorize(i, 8);
        histogram.realize(8);

        // See figures/lesson_18_hist_serial.mp4 for a visualization of
        // what this does.

        // We can vectorize the initialization of the histogram
        // buckets, but since there are data dependencies across r.x
        // and r.y in the update definition (i.e. the update refers to
        // a value computed in the previous iteration), we can't
        // parallelize or vectorize r.x or r.y without introducing a
        // race condition. The following code would produce an error:
        // histogram.update().parallel(r.y);
        // In short: the initialization can be vectorized, but the update
        // carries data dependencies, so parallelizing or vectorizing r.x
        // or r.y would introduce a race condition.
    }

    {
        // Note, however, that the histogram operation (which is a
        // kind of sum reduction) is associative. A common trick to
        // speed-up associative reductions is to slice up the
        // reduction domain into smaller slices, compute a partial
        // result over each slice, and then merge the results. Since
        // the computation of each slice is independent, we can
        // parallelize over slices.
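        // In short: because the histogram is associative, a common
        // speed-up is to cut the reduction domain into small slices,
        // accumulate a partial result on each slice, and then merge the
        // partial results. The slices are independent of one another,
        // so the work can be parallelized at the slice level.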

        // Going back to the histogram example, we slice the reduction
        // domain into rows by defining an intermediate function that
        // computes the histogram of each row independently:
        // Step one, slicing: cut along rows so that each row's histogram
        // is independent of the others.
        Func intermediate("intm_par_manual");
        intermediate(i, y) = 0;
        RDom rx(0, input.width());
        intermediate(input(rx, y) / 32, y) += 1;

        // We then define a second stage which sums those partial
        // results:
        // Step two, merging: define a second stage that adds up the
        // result of every slice.
        Func histogram("merge_par_manual");
        histogram(i) = 0;
        RDom ry(0, input.height());
        histogram(i) += intermediate(i, ry);

        // Since the intermediate no longer has data dependencies
        // across the y dimension, we can parallelize it over y:
        intermediate.compute_root().update().parallel(y);

        // We can also vectorize the initializations.
        intermediate.vectorize(i, 8);
        histogram.vectorize(i, 8);

        histogram.realize(8);

        // See figures/lesson_18_hist_manual_par.mp4 for a visualization of
        // what this does.
    }

    {
        // This manual factorization of an associative reduction can
        // be tedious and bug-prone. Although it's fairly easy to do
        // manually for the histogram, it can get complex pretty fast,
        // especially if the RDom has a predicate (RDom::where),
        // or when the function reduces onto a multi-dimensional
        // tuple.

        // Halide provides a way to do this type of factorization
        // through the scheduling directive 'rfactor'. rfactor splits
        // an associative update definition into an intermediate which
        // computes the partial results over slices of a reduction
        // domain and replaces the current update definition with a
        // new definition which merges those partial results.
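        // In short: Halide offers the scheduling directive rfactor for
        // this kind of factorization. It splits an associative update
        // into partial results over slices of the reduction domain and
        // then merges those partial results into the final result.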

        // Using rfactor, we don't need to change the algorithm at all:
        Func histogram("hist_rfactor_par");
        histogram(x) = 0;
        RDom r(0, input.width(), 0, input.height());
        histogram(input(r.x, r.y) / 32) += 1;

        // The task of factoring an associative reduction is moved
        // into the schedule, via rfactor. rfactor takes as input a
        // list of <RVar, Var> pairs, which lists the reduction
        // variables (RVars) to be made "parallelizable". In the
        // generated intermediate Func, all references to these
        // reduction variables are replaced with references to "pure"
        // variables (the Vars). Since, by construction, Vars are
        // race-condition free, the intermediate reduction is now
        // parallelizable across those dimensions. All reduction
        // variables not in the list are removed from the original
        // function and "lifted" to the intermediate.
        // In short: each pair names an RVar to be made parallelizable
        // and the pure Var that replaces it in the intermediate Func,
        // which can then be parallelized over the new dimension.

        // To generate the same code as the manually-factored version,
        // we do the following:
        Func intermediate = histogram.update().rfactor({{r.y, y}});
        // We pass {r.y, y} as the argument to rfactor to make the
        // histogram parallelizable across the y dimension, similar to
        // the manually-factored version.
        // Parallelize along y.
        intermediate.compute_root().update().parallel(y);

        // In the case where you are only slicing up the domain across
        // a single variable, you can actually drop the braces and
        // write the rfactor call the following way:
        // Func intermediate = histogram.update().rfactor(r.y, y);

        // Vectorize the initializations, as we did above.
        intermediate.vectorize(x, 8);
        histogram.vectorize(x, 8);

        // It is important to note that rfactor (or reduction
        // factorization in general) only works for associative
        // reductions. Associative reductions have the nice property
        // that their results are the same no matter how the
        // computation is grouped (i.e. split into chunks). If rfactor
        // can't prove the associativity of a reduction, it will throw
        // an error.
        // In short: rfactor only applies to associative reductions,
        // whose result is the same however the work is split up. If
        // rfactor cannot prove associativity, it reports an error.

        Buffer<int> halide_result = histogram.realize(8);

        // See figures/lesson_18_hist_rfactor_par.mp4 for a
        // visualization of what this does.

        // The equivalent C is:
        int c_intm[8][8];
        for (int y = 0; y < input.height(); y++) {
            for (int x = 0; x < 8; x++) {
                c_intm[y][x] = 0;
            }
        }
        /* parallel */ for (int y = 0; y < input.height(); y++) {
            for (int r_x = 0; r_x < input.width(); r_x++) {
                c_intm[y][input(r_x, y) / 32] += 1;
            }
        }

        int c_result[8];
        for (int x = 0; x < 8; x++) {
            c_result[x] = 0;
        }
        for (int x = 0; x < 8; x++) {
            for (int r_y = 0; r_y < input.height(); r_y++) {
                c_result[x] += c_intm[r_y][x];
            }
        }

        // Check the answers agree:
        for (int x = 0; x < 8; x++) {
            if (c_result[x] != halide_result(x)) {
                printf("halide_result(%d) = %d instead of %d\n",
                       x, halide_result(x), c_result[x]);
                return -1;
            }
        }
    }

    {
        // Now that we can factor associative reductions with the
        // scheduling directive 'rfactor', we can explore various
        // factorization strategies using the schedule alone. Given
        // the same serial histogram code:
        // Here we explore another factorization strategy.
        Func histogram("hist_rfactor_vec");
        histogram(x) = 0;
        RDom r(0, input.width(), 0, input.height());
        histogram(input(r.x, r.y) / 32) += 1;

        // Instead of r.y, we rfactor on r.x this time to slice the
        // domain into columns.
        Func intermediate = histogram.update().rfactor(r.x, u);

        // Now that we're computing an independent histogram
        // per-column, we can vectorize over columns.
        intermediate.compute_root().update().vectorize(u, 8);

        // Note that vectorizing the inner dimension changes the
        // order in which values are added to the final histogram
        // buckets, so this trick only works if the reduction is
        // associative *and* commutative. rfactor will attempt to
        // prove these properties hold and will throw an error if
        // it can't.
        // In short: vectorization reorders the additions, so the trick
        // is valid only for reductions, such as this running sum, that
        // are both associative and commutative; otherwise rfactor
        // reports an error.

        // Vectorize the initializations.
        intermediate.vectorize(x, 8);
        histogram.vectorize(x, 8);

        Buffer<int> halide_result = histogram.realize(8);

        // See figures/lesson_18_hist_rfactor_vec.mp4 for a
        // visualization of what this does.

        // The equivalent C is:
        int c_intm[8][8];
        for (int u = 0; u < input.width(); u++) {
            for (int x = 0; x < 8; x++) {
                c_intm[u][x] = 0;
            }
        }
        for (int r_y = 0; r_y < input.height(); r_y++) {
            for (int u = 0; u < input.width() / 8; u++) {
                /* vectorize */ for (int u_i = 0; u_i < 8; u_i++) {
                    c_intm[u*8 + u_i][input(u*8 + u_i, r_y) / 32] += 1;
                }
            }
        }

        int c_result[8];
        for (int x = 0; x < 8; x++) {
            c_result[x] = 0;
        }
        for (int x = 0; x < 8; x++) {
            for (int r_x = 0; r_x < input.width(); r_x++) {
                c_result[x] += c_intm[r_x][x];
            }
        }

        // Check the answers agree:
        for (int x = 0; x < 8; x++) {
            if (c_result[x] != halide_result(x)) {
                printf("halide_result(%d) = %d instead of %d\n",
                       x, halide_result(x), c_result[x]);
                return -1;
            }
        }
    }

    {
        // We can also slice a reduction domain up over multiple
        // dimensions at once. This time, we'll compute partial
        // histograms over tiles of the domain.
        // In short: slice along several dimensions at once, compute one
        // histogram per tile, then merge the per-tile histograms.
        Func histogram("hist_rfactor_tile");
        histogram(x) = 0;
        RDom r(0, input.width(), 0, input.height());
        histogram(input(r.x, r.y) / 32) += 1;

        // We first split both r.x and r.y by a factor of four.
        RVar rx_outer("rx_outer"), rx_inner("rx_inner");
        RVar ry_outer("ry_outer"), ry_inner("ry_inner");
        histogram.update()
            .split(r.x, rx_outer, rx_inner, 4)
            .split(r.y, ry_outer, ry_inner, 4);

        // We now call rfactor to make an intermediate function that
        // independently computes a histogram of each tile.
        Func intermediate = histogram.update().rfactor({{rx_outer, u}, {ry_outer, v}});

        // We can now parallelize the intermediate over tiles.
        intermediate.compute_root().update().parallel(u).parallel(v);

        // We also reorder the tile indices outermost to give the
        // classic tiled traversal.
        intermediate.update().reorder(rx_inner, ry_inner, u, v);

        // Vectorize the initializations.
        intermediate.vectorize(x, 8);
        histogram.vectorize(x, 8);

        Buffer<int> halide_result = histogram.realize(8);

        // See figures/lesson_18_hist_rfactor_tile.mp4 for a visualization of
        // what this does.

        // The equivalent C is (a split factor of 4 on an 8x8 input
        // gives 8 / 4 = 2 tiles along each dimension, each 4x4):
        int c_intm[2][2][8];
        for (int v = 0; v < input.height() / 4; v++) {
            for (int u = 0; u < input.width() / 4; u++) {
                for (int x = 0; x < 8; x++) {
                    c_intm[v][u][x] = 0;
                }
            }
        }
        /* parallel */ for (int v = 0; v < input.height() / 4; v++) {
            /* parallel */ for (int u = 0; u < input.width() / 4; u++) {
                for (int ry_inner = 0; ry_inner < 4; ry_inner++) {
                    for (int rx_inner = 0; rx_inner < 4; rx_inner++) {
                        c_intm[v][u][input(u*4 + rx_inner, v*4 + ry_inner) / 32] += 1;
                    }
                }
            }
        }

        int c_result[8];
        for (int x = 0; x < 8; x++) {
            c_result[x] = 0;
        }
        for (int x = 0; x < 8; x++) {
            for (int ry_outer = 0; ry_outer < input.height() / 4; ry_outer++) {
                for (int rx_outer = 0; rx_outer < input.width() / 4; rx_outer++) {
                    c_result[x] += c_intm[ry_outer][rx_outer][x];
                }
            }
        }

        // Check the answers agree:
        for (int x = 0; x < 8; x++) {
            if (c_result[x] != halide_result(x)) {
                printf("halide_result(%d) = %d instead of %d\n",
                       x, halide_result(x), c_result[x]);
                return -1;
            }
        }
    }

    printf("Success!\n");

    return 0;
}
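For readers without Halide at hand, the row-wise factorization that the program performs (first by hand, then through rfactor) can be sketched in plain C++. This is only an illustration and not part of the tutorial: the `Image` and `Hist` aliases, the helper names, and the deterministic pixel pattern used in place of `rand()` are all assumptions made for the example.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical types for this sketch: an 8x8 image indexed [y][x]
// and a histogram with 8 buckets of width 32.
using Image = std::array<std::array<uint8_t, 8>, 8>;
using Hist = std::array<int, 8>;

// Deterministic stand-in for the tutorial's random input (assumption).
Image make_image() {
    Image img{};
    for (int y = 0; y < 8; y++) {
        for (int x = 0; x < 8; x++) {
            img[y][x] = static_cast<uint8_t>((x * 37 + y * 101) % 256);
        }
    }
    return img;
}

// Serial reduction over the whole domain: every update may touch any
// bucket, so the iterations depend on one another.
Hist histogram_serial(const Image &img) {
    Hist h{};
    for (int y = 0; y < 8; y++) {
        for (int x = 0; x < 8; x++) {
            h[img[y][x] / 32] += 1;
        }
    }
    return h;
}

// Factored reduction: one partial histogram per row, then a merge.
Hist histogram_by_rows(const Image &img) {
    std::array<Hist, 8> partial{};  // zero-initialized
    // Iteration y writes only partial[y], so this loop could run in
    // parallel over y without a race.
    for (int y = 0; y < 8; y++) {
        for (int x = 0; x < 8; x++) {
            partial[y][img[y][x] / 32] += 1;
        }
    }
    // Merge stage: sum the per-row histograms bucket by bucket.
    Hist h{};
    for (int i = 0; i < 8; i++) {
        for (int y = 0; y < 8; y++) {
            h[i] += partial[y][i];
        }
    }
    return h;
}

// Sanity check: both orderings must agree bucket by bucket.
void check_factoring() {
    const Image img = make_image();
    assert(histogram_serial(img) == histogram_by_rows(img));
}
```

Calling `check_factoring()` from a `main` function exercises both versions. The row loop is left sequential here to keep the sketch dependency-free; the point is only that its iterations no longer share any state.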
This lesson mainly covers using rfactor to split a reduction domain so that it can be parallelized.
1. Split along y
intermediate = histogram.update().rfactor(r.y, y);
intermediate.compute_root().update().parallel(y);
2. Split along x, then vectorize
intermediate = histogram.update().rfactor(r.x, u);
intermediate.compute_root().update().vectorize(u, 8);
3. Split into tiles
histogram.update().split(r.x, rx_outer, rx_inner, 4).split(r.y, ry_outer, ry_inner, 4);
intermediate = histogram.update().rfactor({{rx_outer, u}, {ry_outer, v}});
intermediate.compute_root().update().parallel(u).parallel(v);
intermediate.update().reorder(rx_inner, ry_inner, u, v);
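All three schedules in this list are valid only because the histogram update is an associative (and commutative) sum. The plain C++ sketch below is independent of Halide and uses hypothetical helper names. It regroups a reduction into chunks, roughly the way a factored schedule does, and shows that addition survives the regrouping while subtraction does not. It makes no claim about which operators Halide's own associativity prover accepts.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

using BinaryOp = std::function<int(int, int)>;

// Left-to-right reduction: ((identity op v[0]) op v[1]) op ...
int reduce_serial(const std::vector<int> &v, int identity, const BinaryOp &op) {
    int acc = identity;
    for (int value : v) {
        acc = op(acc, value);
    }
    return acc;
}

// Factored reduction: reduce each chunk on its own, starting from the
// identity, then merge the partial results with the same operator.
int reduce_factored(const std::vector<int> &v, int identity,
                    std::size_t chunk, const BinaryOp &op) {
    int acc = identity;
    for (std::size_t begin = 0; begin < v.size(); begin += chunk) {
        const std::size_t end = std::min(begin + chunk, v.size());
        int partial = identity;
        for (std::size_t k = begin; k < end; k++) {
            partial = op(partial, v[k]);
        }
        acc = op(acc, partial);
    }
    return acc;
}

// Addition gives the same answer either way; subtraction does not.
void check_regrouping() {
    const std::vector<int> v = {1, 2, 3, 4};
    const BinaryOp add = [](int a, int b) { return a + b; };
    const BinaryOp sub = [](int a, int b) { return a - b; };
    assert(reduce_serial(v, 0, add) == reduce_factored(v, 0, 2, add));
    assert(reduce_serial(v, 0, sub) != reduce_factored(v, 0, 2, sub));
}
```

With the inputs in `check_regrouping`, both additions give 10, whereas the serial subtraction gives -10 and the chunked one gives 10. A factoring transformation therefore has to establish associativity before it may regroup the work.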