Bring Your Own Codegen to TVM

牛牛存

Article link: https://blog.csdn.net/weixin_42164269/article/details/104291635

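Introduction

As the number of hardware devices targeted by deep learning workloads keeps growing, so does the knowledge users need to achieve high performance on each of them. To free data scientists from worrying about performance while developing a new model, hardware backend providers either provide libraries such as MKLDNN or cuDNN that contain many commonly used deep learning operators, or provide frameworks such as TensorRT that let users describe their models in a certain way to achieve high performance. However, users have to learn a new programming interface whenever they try to work on a new library or device. As a result, the demand for a unified programming interface becomes more and more important to 1) let all users and hardware backend providers stand on the same page, and 2) provide a feasible way for specialized hardware or libraries to support only widely used operators with extremely high performance, while falling back unsupported operators to general devices such as CPU/GPU.

In this developer guide, we demonstrate how you, as a hardware backend provider, can implement your own codegen and register it as a Relay backend compiler to support your hardware device or library. This guide covers two types of codegen, depending on the graph representation you need:

1. You want to generate C code.

If your hardware already has a well-optimized C/C++ library, such as Intel CBLAS/MKL for CPUs or NVIDIA CUBLAS for GPUs, then this is what you are looking for. Fortunately, the C source code module is fully compatible with the TVM runtime module, which means the generated code can be compiled by any C/C++ compiler with proper compilation flags. Your only task is therefore to implement a codegen that generates C code for subgraphs and a C source module to integrate into the TVM runtime module. In the next section, we demonstrate how to implement a C codegen for your hardware.

2. You want to generate any other graph representation.

Your hardware may require another form of graph representation, such as JSON. In this case, you need to implement not only a codegen but also a customized TVM runtime module, so that the TVM runtime knows how this graph representation should be executed. If you already have a complete graph execution engine for your hardware, such as TensorRT for GPUs, this is a solution you can consider.

After you finish the codegen and runtime, you can let your customers annotate their models with your customized tag to make use of them. The tutorial for end users to annotate and launch a specific codegen is here (TBA).

Implementing a C Codegen

In this part, we demonstrate how to implement a codegen that generates C code with pre-implemented operator functions. To keep things simple, our example codegen does not depend on third-party libraries. Instead, we manually implement two macros in C: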

#define CSOURCE_BINARY_OP_1D(p_ID_, p_OP_, p_DIM1_)         \
    extern "C" void p_ID_(float* a, float* b, float* out) { \
        for (int64_t i = 0; i < p_DIM1_; ++i) {             \
            out[i] = a[i] p_OP_ b[i];                       \
        }                                                   \
    }

#define CSOURCE_BINARY_OP_2D(p_ID_, p_OP_, p_DIM1_, p_DIM2_)  \
    extern "C" void p_ID_(float* a, float* b, float* out) {   \
        for (int64_t i = 0; i < p_DIM1_; ++i) {               \
            for (int64_t j = 0; j < p_DIM2_; ++j) {           \
                int64_t k = i * p_DIM2_ + j;                  \
                out[k] = a[k] p_OP_ b[k];                     \
            }                                                 \
        }                                                     \
    }
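To see what one of these macros expands to, here is a minimal sketch that instantiates the 2D macro for a small (2, 3) shape and calls the generated function. The names demo_add_2d and RunDemoAdd are made up for this illustration and are not part of TVM.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Same macro as above: defines an element-wise binary function for a fixed 2D shape.
#define CSOURCE_BINARY_OP_2D(p_ID_, p_OP_, p_DIM1_, p_DIM2_)  \
    extern "C" void p_ID_(float* a, float* b, float* out) {   \
        for (int64_t i = 0; i < p_DIM1_; ++i) {               \
            for (int64_t j = 0; j < p_DIM2_; ++j) {           \
                int64_t k = i * p_DIM2_ + j;                  \
                out[k] = a[k] p_OP_ b[k];                     \
            }                                                 \
        }                                                     \
    }

// Expands to: extern "C" void demo_add_2d(float* a, float* b, float* out) { ... }
CSOURCE_BINARY_OP_2D(demo_add_2d, +, 2, 3)

// Hypothetical helper: adds two row-major (2, 3) tensors stored in vectors.
std::vector<float> RunDemoAdd(std::vector<float> a, std::vector<float> b) {
    assert(a.size() == 6 && b.size() == 6);
    std::vector<float> out(6, 0.0f);
    demo_add_2d(a.data(), b.data(), out.data());
    return out;
}
```

Because the shape is baked into the macro arguments, each distinct shape in a subgraph needs its own macro instantiation.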

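With these two macros, we can generate binary operators for 1-D and 2-D tensors. For example, consider the subgraph below, and assume all inputs are 2-D tensors with shape (10, 10).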

c_compiler_input0
       |
      add <-- c_compiler_input1
       |
    subtract <-- c_compiler_input2
       |
    multiply <-- c_compiler_input3
       |
      out

Our goal is to generate the following compilable code to execute the subgraph:

#include <tvm/runtime/c_runtime_api.h>
#include <tvm/runtime/packed_func.h>
#include <dlpack/dlpack.h>
#include <cstdint>
#include <cstdlib>
#include <cstring>
#include <iostream>

#define GCC_BINARY_OP_1D(p_ID_, p_OP_, p_DIM1_)           \
  extern "C" void p_ID_(float* a, float* b, float* out) { \
    for (int64_t i = 0; i < p_DIM1_; ++i) {               \
      out[i] = a[i] p_OP_ b[i];                           \
    }                                                     \
  }

#define GCC_BINARY_OP_2D(p_ID_, p_OP_, p_DIM1_, p_DIM2_)  \
  extern "C" void p_ID_(float* a, float* b, float* out) { \
    for (int64_t i = 0; i < p_DIM1_; ++i) {               \
      for (int64_t j = 0; j < p_DIM2_; ++j) {             \
        int64_t k = i * p_DIM2_ + j;                      \
        out[k] = a[k] p_OP_ b[k];                         \
      }                                                   \
    }                                                     \
  }

// Note 1
GCC_BINARY_OP_2D(gcc_0_0, *, 10, 10);
GCC_BINARY_OP_2D(gcc_0_1, -, 10, 10);
GCC_BINARY_OP_2D(gcc_0_2, +, 10, 10);

// Note 2
extern "C" void gcc_0_(float* gcc_input0, float* gcc_input1,
                       float* gcc_input2, float* gcc_input3, float* out) {
  float* buf_0 = (float*)malloc(4 * 100);
  float* buf_1 = (float*)malloc(4 * 100);
  gcc_0_2(gcc_input0, gcc_input1, buf_0);
  gcc_0_1(buf_0, gcc_input2, buf_1);
  gcc_0_0(buf_1, gcc_input3, out);
  free(buf_0);
  free(buf_1);
}

// Note 3
extern "C" int gcc_0_wrapper(DLTensor* arg0, DLTensor* arg1, DLTensor* arg2,
                             DLTensor* arg3, DLTensor* out) {
  gcc_0_(static_cast<float*>(arg0->data), static_cast<float*>(arg1->data),
         static_cast<float*>(arg2->data), static_cast<float*>(arg3->data),
         static_cast<float*>(out->data));
  return 0;
}
TVM_DLL_EXPORT_TYPED_FUNC(gcc_0, gcc_0_wrapper);

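Here we highlight the notes marked in the code above:

Note 1 is the function implementation for the three nodes in the subgraph.
Note 2 is a function that executes the subgraph by allocating intermediate buffers and invoking the corresponding functions.
Note 3 is a TVM runtime compatible wrapper function. It accepts a list of input tensors and one output tensor (the last argument), casts them to the right data type, and invokes the subgraph function described in Note 2. In addition, 【TVM_DLL_EXPORT_TYPED_FUNC】 is a TVM macro that generates another function 【gcc_0】 with unified function arguments by packing all tensors into 【TVMArgs】. As a result, the TVM runtime can directly invoke 【gcc_0】 to execute the subgraph without additional effort. With the above code generated, TVM is able to compile it along with the rest of the graph and export a single library for deployment.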
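The idea behind Note 3, a typed function hidden behind one uniform calling convention, can be sketched without any TVM headers. The snippet below is only a conceptual illustration: the signature int(void**, int) and the names demo_subgraph and demo_subgraph_packed are assumptions made for this sketch, not the signature that 【TVM_DLL_EXPORT_TYPED_FUNC】 actually generates.

```cpp
#include <cassert>

// Typed subgraph function: element-wise add over 4 floats (a stand-in for gcc_0_).
void demo_subgraph(const float* a, const float* b, float* out) {
    for (int i = 0; i < 4; ++i) {
        out[i] = a[i] + b[i];
    }
}

// Hypothetical uniform signature: every exported function takes an array of
// opaque pointers plus a count, so a runtime can call it without knowing its types.
int demo_subgraph_packed(void** args, int num_args) {
    assert(args != nullptr);
    if (num_args != 3) {
        return -1;  // wrong number of arguments
    }
    demo_subgraph(static_cast<const float*>(args[0]),
                  static_cast<const float*>(args[1]),
                  static_cast<float*>(args[2]));
    return 0;
}
```

The caller only needs the symbol name and the uniform signature; the casts to concrete types live entirely inside the wrapper.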

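In the rest of this section, we will implement a codegen step by step to generate the above code. Your own codegen has to be located at src/relay/backend/contrib/<your-codegen-name>/. In our example, we name our codegen "codegen_c" and put it here: https://github.com/apache/incubator-tvm/blob/master/src/relay/backend/contrib/codegen_c/codegen.cc. Feel free to check this file for a complete implementation.

Specifically, we are going to implement two classes in this file. Here is their relationship: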

                     subgraph                                subgraph
TVM backend -----------------------------> CSourceCodegen -------------> CodegenC
       ^                                       |    ^                       |
       |                                       |    |                       |
       ----------------------------------------      ------------------------
          generated C source runtime module              generated C code

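When the TVM backend finds a function (subgraph) in a Relay graph that is annotated with a registered compiler tag (【ccompiler】 in this example), it invokes 【CSourceCodegen】 and passes the subgraph. The member function 【CreateCSourceModule】 of 【CSourceCodegen】 will 1) generate C code for the subgraph, and 2) wrap the generated C code into a C source runtime module for the TVM backend to compile and deploy. In particular, the C code generation is delegated to the 【CodegenC】 class, which provides many useful utilities to ease the codegen implementation. The following sections implement these two classes in bottom-up order.

Implementing 【CodegenC】

In src/relay/backend/contrib/codegen_c/codegen.cc, we first create a codegen class skeleton under the 【tvm.relay.contrib】 namespace: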

#include <tvm/relay/expr_functor.h>
#include <tvm/relay/transform.h>
#include <tvm/relay/type.h>
#include <tvm/runtime/module.h>
#include <tvm/runtime/object.h>

#include <fstream>
#include <sstream>

#include "codegen_c.h"

namespace tvm {
namespace relay {
namespace contrib {

class CodegenC : public ExprVisitor, public CodegenCBase {
  public:
    explicit CodegenC(const std::string& id) { this->ext_func_id_ = id; }

    void VisitExpr_(const VarNode* node) { ; }
    void VisitExpr_(const CallNode* call) final { ; }
    std::string JIT() { ; }

  private:
    /*! \brief The function id that represents a C source function. */
    std::string ext_func_id_ = "";
    /*! \brief The index of a wrapped C function. */
    int func_idx = 0;
    /*! \brief The index of allocated buffers. */
    int buf_idx_ = 0;
    /*! \brief The arguments of a C compiler compatible function. */
    std::vector<std::string> ext_func_args_;
    /*! \brief The statements of a C compiler compatible function. */
    std::vector<std::string> ext_func_body;
    /*! \brief The declaration statements of a C compiler compatible function. */
    std::vector<std::string> func_decl_;
    /*! \brief The declaration statements of buffers. */
    std::vector<std::string> buf_decl_;
    /*! \brief The name and index pairs for output. */
    std::vector<std::pair<std::string, int>> out_;
};

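The 【CodegenC】 class inherits from two classes: 【ExprVisitor】 provides the ability to traverse the subgraph, collect the required information, and generate subgraph functions such as 【gcc_0_】; 【CodegenCBase】 provides the abilities and utilities to generate wrapper functions such as 【gcc_0】 in the example above. As can be seen, we only need to implement three functions in this codegen class to make it work.

Code Generation for Operators

We first implement 【VisitExpr_(const CallNode* call)】. This function visits all call nodes when traversing the subgraph. Each call node contains an operator that we want to offload to the hardware. As a result, we need to generate the corresponding C code with the correct operators in topological order. We implement this function step by step as follows.

1. Generate the function declaration

Example result: 【GCC_BINARY_OP_2D(gcc_0_0, *, 10, 10);】

As shown above, to generate the function declaration we need 1) a function name (e.g., gcc_0_0), 2) the type of the operator (e.g., *), and 3) the input tensor shape (e.g., (10, 10)). Fortunately, this information is easy to obtain from the 【CallNode】: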

std::ostringstream macro_stream;
std::ostringstream decl_stream;
std::ostringstream buf_stream;

// Generate a unique function name you like.
std::string func_name = ext_func_id_ + "_" + std::to_string(func_idx++);

// Extract the input tensor shape; its rank selects the 1D or 2D macro.
auto in_shape = GetShape(call->args[0]->checked_type());

// Make function declaration string.
macro_stream << "CSOURCE_BINARY_OP_" << in_shape.size() << "D(" << func_name << ", ";

// Check the operator type.
if (IsOp(call, "add")) {
  macro_stream << "+";
} else if (IsOp(call, "subtract")) {
  macro_stream << "-";
} else if (IsOp(call, "multiply")) {
  macro_stream << "*";
} else {
  LOG(FATAL) << "Unrecognized op";
}

// Append each dimension of the input shape as a macro argument.
for (size_t i = 0; i < in_shape.size(); ++i) {
  macro_stream << ", " << in_shape[i];
}
macro_stream << ");";
func_decl_.push_back(macro_stream.str());

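As can be seen, we push the generated code to the class member variable 【func_decl_】. This means that after we finish traversing the entire subgraph, we have collected all the required function declarations, and the only thing left to do is have them compiled by GCC. The rest of the 【VisitExpr_(const CallNode* call)】 implementation follows the same concept.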
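The string building in this step can be tried in isolation. The following sketch pulls it into a free function; MakeMacroDecl is a hypothetical helper, and choosing the macro suffix (1D/2D) from the rank of the shape is an assumption of this sketch.

```cpp
#include <cassert>
#include <cstdint>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper mirroring step 1: builds one macro invocation string.
// The macro suffix (1D/2D) is taken from the rank of the input shape.
std::string MakeMacroDecl(const std::string& func_name, const std::string& op,
                          const std::vector<int64_t>& shape) {
    assert(!shape.empty());
    std::ostringstream macro_stream;
    macro_stream << "CSOURCE_BINARY_OP_" << shape.size() << "D(" << func_name << ", " << op;
    for (int64_t dim : shape) {
        macro_stream << ", " << dim;  // one macro argument per dimension
    }
    macro_stream << ");";
    return macro_stream.str();
}
```

For example, calling MakeMacroDecl("gcc_0_0", "*", {10, 10}) yields the text CSOURCE_BINARY_OP_2D(gcc_0_0, *, 10, 10);.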

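2. Generate the function call

Example result: 【gcc_0_0(buf_1, gcc_input3, out);】

After generating the function declaration, we need to generate a function call with the proper inputs and output. To know which inputs or buffers to put in the call, we have to visit its arguments: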

bool first = true;
decl_stream << func_name << "(";
for (size_t i = 0; i < call->args.size(); ++i) {
  VisitExpr(call->args[i]); // Note 1
  for (auto out : out_) {
    if (!first) {
      decl_stream << ", ";
    }
    first = false;
    decl_stream << out.first;
  }
}
// Note 2

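Again, we highlight the notes in the code above:

Note 1: 【VisitExpr(call->args[i])】 is a recursive call that visits the arguments of the current function. An argument can be the output of another node or an input tensor. In our example implementation, we make sure every node updates the class variable 【out_】 before leaving the visitor. Here is an illustration: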

  arg_node                 arg_node <- Visit arg (Note 1)       arg_node
     |                        |                                    |
 curr_node <- Process      curr_node                            curr_node <- Put "buf_0" as an input buffer

(a) out_ = {}            (b) out_ = {}                   (c) out_ = {("buf_0", 20)}

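As the figure shows, the class variable 【out_】 is empty before visiting the argument node, and afterwards it is filled with the name and size of the output buffer of 【arg_node】. As a result, when we finish visiting the argument node, we know which input buffer to use by looking at 【out_】. You will see how we update 【out_】 at the end of this section as well as in the next section.

Note 2: You may notice that we did not close the function call string in this step. The current function call string looks like 【gcc_0_0(buf_1, gcc_input3】. This is because we have not put the last argument (i.e., the output) into the call. The output of a function call can be either an allocated temporary buffer or the subgraph output tensor. For simplicity, in this example we allocate an output buffer for every call node (next step) and copy the result in the very last buffer to the output tensor.

3. Generate the output buffer

Example result: 【float* buf_0 = (float*)malloc(4 * 100);】

As mentioned in the previous step, in addition to the subgraph input and output tensors, we may also need buffers to hold intermediate results. To generate a buffer, we extract the shape information to determine the buffer type and size: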

// This example only supports single output.
auto type_node = call->checked_type().as<TensorTypeNode>();
CHECK(type_node != nullptr && runtime::TypeMatch(type_node->dtype, kDLFloat, 32))
      << "Only support single output tensor with float type";

// Generate a unique buffer name.
std::string out = "buf_" + std::to_string(buf_idx_++);

// Extract the shape to be the buffer size.
auto out_shape = GetShape(call->checked_type());
int out_size = 1;
for (size_t i = 0; i < out_shape.size(); ++i) {
  out_size *= out_shape[i];
}

// Make the buffer allocation and push to the buffer declarations.
buf_stream << "float* " << out << " = (float*)std::malloc(4 * " << out_size << ");";
buf_decl_.push_back(buf_stream.str());

After allocating the output buffer, we can now close the function call string and push the generated function call to the class variable 【ext_func_body】.

decl_stream << ", " << out << ");";
ext_func_body.push_back(decl_stream.str());
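Steps 2 and 3 together produce two strings per call node: a buffer allocation and a complete function call. As a rough standalone sketch (MakeBufferDecl and MakeCall are hypothetical helpers, not functions in the TVM code base), they can be written as:

```cpp
#include <cassert>
#include <cstdint>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper for step 3: emits a malloc statement for a float buffer
// whose element count is the product of the shape dimensions.
std::string MakeBufferDecl(const std::string& buf_name, const std::vector<int64_t>& shape) {
    int64_t num_elements = 1;
    for (int64_t dim : shape) {
        num_elements *= dim;
    }
    std::ostringstream buf_stream;
    buf_stream << "float* " << buf_name << " = (float*)std::malloc(4 * " << num_elements << ");";
    return buf_stream.str();
}

// Hypothetical helper for steps 2 and 3: inputs come first, the output buffer last.
std::string MakeCall(const std::string& func_name, const std::vector<std::string>& inputs,
                     const std::string& out) {
    assert(!inputs.empty());
    std::ostringstream decl_stream;
    decl_stream << func_name << "(";
    for (const std::string& in : inputs) {
        decl_stream << in << ", ";
    }
    decl_stream << out << ");";
    return decl_stream.str();
}
```

Keeping the output as the last argument matches the convention of the operator macros, whose third parameter is the output pointer.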

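4. Update the output buffer

To let the next node, which accepts the output of the current call node as its input, know which buffer it should take, we need to update the class variable 【out_】 before leaving this visit function.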

out_.clear();
out_.push_back({out, out_size});

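Congratulations! We have finished the most difficult function in this class. In the next two sections, we only need to fill in some minor missing parts of this function.

Code Generation for Input Variables

Recall that we collected the input buffer information by visiting the arguments of a call node (step 2 in the previous section), and handled the case where an argument is another call node (step 4). In this section, we use 【VarNode】 as an example to demonstrate how to handle other node types.

【VarNode】 represents an input tensor in the model. The only important piece of information it carries is a name hint (e.g., data, weight). When visiting a 【VarNode】, we simply update the class variable 【out_】 to pass the name hint so that the descendant call nodes can generate the correct function call.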

void VisitExpr_(const VarNode* node) {
  ext_func_args_.push_back(node->name_hint());
  out_.clear();
  out_.push_back({node->name_hint(), 0});
}

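Note that in this example we assume the subgraph we are offloading contains only call nodes and variable nodes. If your subgraphs contain other types of nodes, such as 【TupleNode】, you also need to visit them and pass through the output buffer information.

Code Emitting

The final part of this codegen class is a 【JIT】 function that emits a C function for the subgraph and uses the C code we just generated as the function body. Remember, in addition to the subgraph function generated in the previous sections, we also need a wrapper function with unified arguments that the TVM runtime can invoke and pass data to. Fortunately, the base class we inherit from already provides an implementation, 【JitImpl】, to generate that function. For example, we can invoke 【JitImpl】 as follows: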

JitImpl("gcc_0" /* Subgraph symbol (ID) */,
        {"gcc_input0", "gcc_input1", "gcc_input2", "gcc_input3"} /* Input arguments */,
        {"float *buf_0 = (float*)malloc(4 * 20)", ...} /* Buffer allocations */,
        {"gcc_0_2(gcc_input0, gcc_input1, buf_0);"} /* Function body */,
        {"out"} /* Output */);

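The above call will generate three functions (one of them from the TVM wrapper macro):

1. The subgraph function 【gcc_0_】 (with one more underscore at the end of the function name), which contains all the C code we generated to execute the subgraph.

2. The wrapper function 【gcc_0__wrapper_】, which takes a list of 【DLTensor】 arguments, casts the data to the right type, and invokes 【gcc_0_】.

3. The TVM runtime compatible function 【gcc_0】, which takes TVM unified function arguments, unpacks the TVM packed tensors, and invokes 【gcc_0__wrapper_】.

Accordingly, the only thing the 【JIT】 implementation has to do is pass all the subgraph function code we generated to 【JitImpl】: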

std::string JIT() {
  // Write function macros
  for (auto decl : func_decl_) {
    code_stream_ << decl << "\n";
  }
  return JitImpl(ext_func_id_, ext_func_args_, buf_decl_, ext_func_body, out_);
}

All the variables we pass (【ext_func_id_】, etc.) are class variables that were filled in while we traversed the subgraph.
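To show how the visitor functions cooperate, here is a toy version of the whole traversal that runs without TVM. ToyExpr, Var, Call and ToyCodegen are hypothetical stand-ins for the Relay node types and for 【CodegenC】; the sketch assumes every tensor is 1-D with the same number of elements and that each variable is used only once.

```cpp
#include <cassert>
#include <memory>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Toy expression node (not a Relay type): a variable when `args` is empty,
// otherwise a binary call whose operator is "+", "-" or "*".
struct ToyExpr {
    std::string name_or_op;
    std::vector<std::shared_ptr<ToyExpr>> args;
};

std::shared_ptr<ToyExpr> Var(const std::string& name) {
    return std::make_shared<ToyExpr>(ToyExpr{name, {}});
}

std::shared_ptr<ToyExpr> Call(const std::string& op, std::shared_ptr<ToyExpr> lhs,
                              std::shared_ptr<ToyExpr> rhs) {
    return std::make_shared<ToyExpr>(ToyExpr{op, {std::move(lhs), std::move(rhs)}});
}

// Toy stand-in for CodegenC: a recursive traversal that records emitted strings.
class ToyCodegen {
 public:
    ToyCodegen(std::string id, int num_elements)
        : id_(std::move(id)), num_elements_(num_elements) {}

    void Visit(const ToyExpr& expr) {
        if (expr.args.empty()) {
            // Variable node: record it as a function argument and expose its name.
            func_args.push_back(expr.name_or_op);
            out = expr.name_or_op;
            return;
        }
        assert(expr.args.size() == 2);
        // Name the function before visiting arguments, so the outermost call gets index 0.
        const std::string func_name = id_ + "_" + std::to_string(func_idx_++);
        func_decl.push_back("CSOURCE_BINARY_OP_1D(" + func_name + ", " + expr.name_or_op +
                            ", " + std::to_string(num_elements_) + ");");

        std::ostringstream call;
        call << func_name << "(";
        for (const auto& arg : expr.args) {
            Visit(*arg);  // updates `out` with the argument's buffer or variable name
            call << out << ", ";
        }
        // Allocate the output buffer after the arguments, then close the call.
        const std::string buf = "buf_" + std::to_string(buf_idx_++);
        buf_decl.push_back("float* " + buf + " = (float*)std::malloc(4 * " +
                           std::to_string(num_elements_) + ");");
        call << buf << ");";
        func_body.push_back(call.str());
        out = buf;  // the parent call reads this as its input
    }

    std::vector<std::string> func_args, func_decl, buf_decl, func_body;
    std::string out;

 private:
    std::string id_;
    int num_elements_;
    int func_idx_ = 0;
    int buf_idx_ = 0;
};
```

Visiting the tree for (in0 + in1) - in2 with the id gcc_0 records the calls gcc_0_1(in0, in1, buf_0); and gcc_0_0(buf_0, in2, buf_1); in that order, which is the same inner-first ordering seen in the generated code earlier.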

Implementing 【CSourceCodegen】

Again, let's create a class skeleton and implement the required functions. Note that it inherits from 【CSourceModuleCodegenBase】.

class CSourceCodegen : public CSourceModuleCodegenBase {
 public:
  // Pass a subgraph function, and generate the C code.
  void GenCFunc(const Function& func) { ; }

  // Use GenCFunc to generate the C code and wrap it as a C source module.
  runtime::Module CreateCSourceModule(const NodeRef& ref) override { ; }

 private:
  std::ostringstream code_stream_;
};

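Implementing 【GenCFunc】

【GenCFunc】 simply uses the 【CodegenC】 class we just implemented to traverse a Relay function (subgraph) and obtain the generated C code. The built-in function 【GetExtSymbol】 retrieves a unique symbol name (e.g., gcc_0) from the Relay function, and we must use it as the C function name because this symbol is used for DSO runtime lookup.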

void GenCFunc(const Function& func) {
  CHECK(func.defined()) << "Input error: expect a Relay function.";

  // Record the external symbol for runtime lookup.
  auto sid = GetExtSymbol(func);

  CodegenC builder(sid);
  builder.VisitExpr(func->body);
  code_stream_ << builder.JIT();
}

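Implementing 【CreateCSourceModule】

This function creates a runtime module for the external library. In this example, we create a 【CSourceModule】 that can be directly compiled and linked together with a TVM-generated DSOModule. After you have implemented 【CodegenC】, implementing this function is relatively straightforward: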

runtime::Module CreateCSourceModule(const NodeRef& ref) override {
  // Create headers
  code_stream_ << "#include <cstdint>\n";
  code_stream_ << "#include <iostream>\n";
  code_stream_ << "#include <cstdlib>\n";
  code_stream_ << "#include <stdio.h>\n";
  code_stream_ << "#include <cstring>\n";
  code_stream_ << "#include <tvm/runtime/c_runtime_api.h>\n";
  code_stream_ << "#include <dlpack/dlpack.h>\n";

  // Append some common macro for operator definition.
  const char* operator_macro = R"op_macro(
  #define CSOURCE_BINARY_OP_1D(p_ID_, p_OP_, p_DIM1_)       \
    extern "C" void p_ID_(float* a, float* b, float* out) { \
      for (int64_t i = 0; i < p_DIM1_; ++i) {               \
        out[i] = a[i] p_OP_ b[i];                           \
      }                                                     \
    }

  #define CSOURCE_BINARY_OP_2D(p_ID_, p_OP_, p_DIM1_, p_DIM2_)  \
    extern "C" void p_ID_(float* a, float* b, float* out) {     \
      for (int64_t i = 0; i < p_DIM1_; ++i) {                   \
        for (int64_t j = 0; j < p_DIM2_; ++j) {                 \
          int64_t k = i * p_DIM2_ + j;                          \
          out[k] = a[k] p_OP_ b[k];                             \
        }                                                       \
      }                                                         \
    }
  )op_macro";

  code_stream_ << operator_macro << "\n\n";

  // Generate C code for the subgraph.
  if (ref->IsInstance<FunctionNode>()) {
    GenCFunc(Downcast<Function>(ref));
  } else if (ref->IsInstance<relay::ModuleNode>()) {
    relay::Module mod = Downcast<relay::Module>(ref);
    for (const auto& it : mod->functions) {
      GenCFunc(Downcast<Function>(it.second));
    }
  } else {
    LOG(FATAL) << "The input ref is expected to be a Relay function or module"
               << "\n";
  }

  // Create a CSourceModule
  const auto* pf = runtime::Registry::Get("module.csource_module_create");
  CHECK(pf != nullptr) << "Cannot find csource module to create the external runtime module";
  return (*pf)(code_stream_.str(), "cc");
}

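Registering Your Codegen

The last step is to register your codegen with the TVM backend. We first implement a simple function that invokes our codegen and generates a runtime module.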

runtime::Module CCompiler(const NodeRef& ref) {
  CSourceCodegen csource;
  return csource.CreateCSourceModule(ref);
}

Finally, we register this function with the TVM backend:

TVM_REGISTER_GLOBAL("relay.ext.ccompiler").set_body_typed(CCompiler);

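Here 【ccompiler】 is a customized tag that lets TVM know this is the codegen it should use to generate and offload subgraphs when a subgraph is annotated with 【ccompiler】.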
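The lookup-by-name mechanism behind this registration can be illustrated with a toy registry. Everything below (ToyCompiler, ToyRegistry, ToyRegister, ToyCompileSubgraph) is hypothetical and far simpler than TVM's real global function registry; it only shows how a tag such as ccompiler can be resolved to a function through the relay.ext. prefix.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <string>
#include <utility>

// Hypothetical codegen signature: takes a subgraph name, returns generated code.
using ToyCompiler = std::function<std::string(const std::string&)>;

// Global name -> function table, created on first use.
std::map<std::string, ToyCompiler>& ToyRegistry() {
    static std::map<std::string, ToyCompiler> registry;
    return registry;
}

void ToyRegister(const std::string& name, ToyCompiler fn) {
    assert(fn != nullptr);
    ToyRegistry()[name] = std::move(fn);
}

// Looks up "relay.ext.<tag>"; returns an empty string when no codegen is registered.
std::string ToyCompileSubgraph(const std::string& tag, const std::string& subgraph) {
    auto it = ToyRegistry().find("relay.ext." + tag);
    if (it == ToyRegistry().end()) {
        return "";
    }
    return it->second(subgraph);
}
```

With this shape, the backend never needs to know about a particular codegen at compile time; it only needs the tag that the subgraph was annotated with.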

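Finally, a good practice is to add a CMake configuration flag so that your compiler is included only for your customers. We first create a cmake file, 【cmake/modules/contrib/CODEGENC.cmake】: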

if(USE_CODEGENC)
  file(GLOB CSOURCE_RELAY_CONTRIB_SRC src/relay/backend/contrib/codegen_c/codegen.cc)
  list(APPEND COMPILER_SRCS ${CSOURCE_RELAY_CONTRIB_SRC})
endif(USE_CODEGENC)

This way, users can decide whether to include your compiler when configuring TVM by setting the following in 【config.cmake】:

set(USE_CODEGENC ON)

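Implementing a Codegen for Your Representation

Although we have demonstrated how to implement a C codegen, your hardware may require another form of graph representation, such as JSON. In this case, you can modify the 【CodegenC】 class we have implemented to generate your own graph representation, and implement a customized runtime module to let the TVM runtime know how this graph representation should be executed.

To keep things simple, we define a graph representation named "ExampleJSON" in this guide. ExampleJSON is not real JSON; it is just a simple representation for graphs without control flow. For example, assume we have the following subgraph named 【subgraph_0】: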

 input0
   |
  add <-- input1
   |
subtract <-- input2
   |
multiply <-- input3
   |
  out

Then the 【ExampleJSON】 of this subgraph looks like this:

subgraph_0
  input 0 10 10
  input 1 10 10
  input 2 10 10
  input 3 10 10
  add 4 inputs: 0 1 shape: 10 10
  sub 5 inputs: 4 2 shape: 10 10
  mul 6 inputs: 5 3 shape: 10 10

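The 【input】 keyword declares an input tensor with its ID and shape, while the other statements describe computations using the following syntax:

【<op> <output ID> inputs: [input ID] shape: [shape]】

In this section, our goal is to implement the following customized TVM runtime module to execute 【ExampleJSON】 graphs.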

runtime::Module ExampleJsonCompiler(const NodeRef& ref) {
    ExampleJsonCodeGen codegen;
    std::string code = codegen.gen(ref); // Note 1
    const auto* pf = runtime::Registry::Get("module.examplejson_module_create"); // Note 2
    CHECK(pf != nullptr) << "Cannot find ExampleJson module to create the external runtime module";
    return (*pf)(code);
}
TVM_REGISTER_GLOBAL("relay.ext.examplejsoncompiler").set_body_typed(ExampleJsonCompiler);

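Note 1: We will implement a customized codegen later to generate an ExampleJSON code string from a subgraph.

Note 2: This line obtains a pointer to a function that creates the customized runtime module. You can see that it takes the subgraph code in ExampleJSON format that we just generated and initializes a runtime module.

In the following sections, we are going to introduce 1) how to implement 【ExampleJsonCodeGen】 and 2) how to implement and register 【examplejson_module_create】.

Implementing 【ExampleJsonCodeGen】

Similar to the C codegen, we derive 【ExampleJsonCodeGen】 from 【ExprVisitor】 to make use of the visitor pattern for subgraph traversal. On the other hand, we do not have to inherit from 【CodegenCBase】 because we do not need TVM C++ wrappers. The codegen class is implemented as follows: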

#include <tvm/relay/expr_functor.h>
#include <tvm/relay/transform.h>
#include <tvm/relay/type.h>
#include <tvm/runtime/module.h>
#include <tvm/runtime/object.h>

#include <fstream>
#include <sstream>

namespace tvm {
namespace relay {
namespace contrib {

class ExampleJsonCodeGen : public ExprVisitor {
  public:
    explicit ExampleJsonCodeGen();

    // Note 1
    void VisitExpr_(const VarNode* node) { /* Skip in this example. */ }
    void VisitExpr_(const CallNode* call) final { /* Skip in this example. */ }

    // Note 2
    std::string gen(const NodeRef& ref) {
        this->code = "";
        if (ref->IsInstance<FunctionNode>()) {
            this->VisitExpr(Downcast<Function>(ref));
        } else if (ref->IsInstance<relay::ModuleNode>()) {
            relay::Module mod = Downcast<relay::Module>(ref);
            for (const auto& it : mod->functions) {
                this->VisitExpr(Downcast<Function>(it.second));
            }
        } else {
            LOG(FATAL) << "The input ref is expected to be a Relay function or module";
        }
        return this->code;
    }

  private:
    /*! \brief The generated ExampleJSON code. */
    std::string code;
};

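Note 1: We again implement the corresponding visitor functions to generate ExampleJSON code and store it in the class variable 【code】 (we skip the visitor function implementations in this example because their concepts are basically the same as for the C codegen). After the graph has been visited, we should have an ExampleJSON graph in 【code】.

Note 2: We define an internal API 【gen】 that takes a subgraph and generates ExampleJSON code. You can give this API any name you prefer.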
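Since the visitor bodies are skipped above, the following sketch shows one way the ExampleJSON text itself could be assembled. It is not the guide's implementation: ToyOp and EmitExampleJson are hypothetical, the function works on a plain list of operators instead of Relay nodes, and it assumes all tensors share one shape.

```cpp
#include <cassert>
#include <cstdint>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical description of one binary operator: its name and the IDs of its two inputs.
struct ToyOp {
    std::string op;
    int lhs;
    int rhs;
};

// Emits ExampleJSON text: the subgraph name, one "input" line per input tensor,
// then one line per operator. Operator outputs get IDs following the inputs.
std::string EmitExampleJson(const std::string& subgraph_name, int num_inputs,
                            const std::vector<ToyOp>& ops, const std::vector<int64_t>& shape) {
    assert(num_inputs >= 0);
    std::ostringstream shape_stream;
    for (int64_t dim : shape) {
        shape_stream << " " << dim;
    }
    const std::string shape_str = shape_stream.str();

    std::ostringstream code;
    code << subgraph_name << "\n";
    for (int id = 0; id < num_inputs; ++id) {
        code << "  input " << id << shape_str << "\n";
    }
    int next_id = num_inputs;
    for (const ToyOp& op : ops) {
        code << "  " << op.op << " " << next_id++ << " inputs: " << op.lhs << " " << op.rhs
             << " shape:" << shape_str << "\n";
    }
    return code.str();
}
```

Called with four inputs of shape (10, 10) and the operators add(0, 1), sub(4, 2), mul(5, 3), it produces text in the same format as the subgraph_0 listing shown earlier.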

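The next step is to implement a customized runtime that makes use of the output of 【ExampleJsonCodeGen】.

Implementing a Customized Runtime

In this section, we implement our customized TVM runtime step by step and register it with the TVM runtime modules. The customized runtime should be located at src/runtime/contrib/<your-runtime-name>/. In our example, we name our runtime "example_ext_runtime" and put it at src/runtime/contrib/example_ext_runtime/example_ext_runtime.cc. Feel free to check this file for a complete implementation.

Again, we first define a customized runtime class as follows. The class has to be derived from TVM's 【ModuleNode】 in order to be compatible with other TVM runtime modules.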

#include <dmlc/logging.h>
#include <tvm/runtime/c_runtime_api.h>
#include <tvm/runtime/memory.h>
#include <tvm/runtime/module.h>
#include <tvm/runtime/ndarray.h>
#include <tvm/runtime/object.h>
#include <tvm/runtime/packed_func.h>
#include <tvm/runtime/registry.h>

#include <fstream>
#include <cmath>
#include <map>
#include <sstream>
#include <string>
#include <vector>

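namespace tvm {
namespace runtime {

/*! \brief A node of the subgraph: its ID, its output data entry, and its input data entries. */
struct NodeEntry {
  int id;
  int output;
  std::vector<int> inputs;
};

class ExampleJsonModule : public ModuleNode {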
 public:
  explicit ExampleJsonModule(std::string graph_json);

  PackedFunc GetFunction(const std::string& name,
                         const ObjectPtr<Object>& sptr_to_self) final;

  const char* type_key() const { return "examplejson"; }

  void SaveToBinary(dmlc::Stream* stream) final;

  static Module LoadFromBinary(void* strm);

  static Module Create(const std::string& path);

  std::string GetSource(const std::string& format = "");

  void Run(int id, const std::vector<int>& inputs, int output);

  void ParseJson(const std::string& json);

 private:
  /* \brief The json string that represents a computational graph. */
  std::string graph_json_;
  /* \brief The subgraph that being processed. */
  std::string curr_subgraph_;
  /*! \brief A simple graph from subgraph id to node entries. */
  std::map<std::string, std::vector<NodeEntry> > graph_;
  /* \brief A simple pool to contain the tensor for each node in the graph. */
  std::vector<NDArray> data_entry_;
  /* \brief A mapping from node id to op name. */
  std::vector<std::string> op_id_;
};

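In particular, there are some functions derived from 【ModuleNode】 that we must implement in 【ExampleJsonModule】:

Constructor: The constructor of this class should accept a subgraph (in your representation), then process and store it in any format you like. The saved subgraph can be used by the following two functions.
【GetFunction】: This is the most important function in this class. When the TVM runtime wants to execute a subgraph with your compiler tag, it invokes this function from your customized runtime module. It provides the function name as well as the runtime arguments, and 【GetFunction】 should return a packed function implementation for the TVM runtime to execute.
【SaveToBinary】 and 【LoadFromBinary】: 【SaveToBinary】 serializes the runtime module to a binary format for later deployment. TVM calls this function when users use the 【export_library】 API. On the other hand, since we are now using our own graph representation, we have to make sure that 【LoadFromBinary】 is able to construct the same runtime module from the serialized binary generated by 【SaveToBinary】.
【GetSource】 (optional): If you would like to see the generated 【ExampleJSON】 code, you can implement this function to dump it; otherwise you can skip the implementation.

Other functions and class variables will be introduced along with the implementation of the must-have functions above.

Implementing the Constructor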

explicit ExampleJsonModule(std::string graph_json) {
  this->graph_json_ = graph_json;
  ParseJson(this->graph_json_);
}

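We then implement 【ParseJson】 to parse a subgraph in ExampleJSON format and construct an in-memory graph for later use. Since we do not support subgraphs with branches in this example, we simply use an array to store every node of the subgraph in order.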

void ParseJson(const std::string& json) {
  std::string line;
  std::string curr_subgraph;
  std::stringstream ss(json);

  while (std::getline(ss, line, '\n')) {
    std::stringstream ss2(line);
    std::string token;
    int id = 0;

    ss2 >> token;
    if (token.find("subgraph_") != std::string::npos) {
      curr_subgraph = token;
      continue;
    }

    ss2 >> id;
    if (op_id_.size() <= static_cast<size_t>(id)) {
      op_id_.resize(id + 1);
      data_entry_.resize(id + 1);
    }

    int64_t total_elements = 1;
    std::vector<int64_t> shape;
    if (token == "input") {
      int64_t size = 0;
      while (ss2 >> size) {
        total_elements *= size;
        shape.push_back(size);
      }
    } else {
      op_id_[id] = token; // Note 1
      bool shape_data = false;
      NodeEntry entry;
      while (ss2 >> token) {
        if (token == "shape:") {
          shape_data = true;
        } else if (shape_data) {
          total_elements *= std::stoll(token);
          shape.push_back(std::stoll(token));
        } else if (token != "inputs:") {
          entry.inputs.push_back(std::stoi(token));
        }
      }
      entry.id = id;
      entry.output = id;
      graph_[curr_subgraph].push_back(entry); // Note 2
    }
    DLContext ctx;
    ctx.device_type = static_cast<DLDeviceType>(1);
    ctx.device_id = 0;
    data_entry_[id] = NDArray::Empty(shape, DLDataType{kDLFloat, 32, 1}, ctx); // Note 3
  }
}

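Note 1: We use the class variable 【op_id_】 to map a subgraph node ID to the operator name (e.g., 【add】) so that we can invoke the corresponding operator function at runtime.

Note 2: We use the class variable 【graph_】 to map a subgraph name to an array of nodes. 【GetFunction】 will query the graph nodes by subgraph ID at runtime.

Note 3: We use the class variable 【data_entry_】 to map a subgraph node ID to a tensor data placeholder. We will put inputs and outputs into the corresponding data entries at runtime.

Implementing 【GetFunction】

After construction, the class variables above should be ready. We then implement 【GetFunction】 to provide executable subgraph functions to the TVM runtime: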

PackedFunc GetFunction(const std::string& name,
                       const ObjectPtr<Object>& sptr_to_self) final {
  if (this->graph_.find(name) != this->graph_.end()) {
    this->curr_subgraph_ = name;
    return PackedFunc([sptr_to_self, this](TVMArgs args, TVMRetValue* rv) {

      // Copy input tensors to corresponding data entries.
      for (auto i = 0; i < args.size(); ++i) {
        CHECK(args[i].type_code() == kNDArrayContainer || args[i].type_code() == kArrayHandle)
            << "Expect NDArray or DLTensor as inputs\n";
        if (args[i].type_code() == kArrayHandle) {
          DLTensor* arg = args[i];
          this->data_entry_[i].CopyFrom(arg);
        } else {
          NDArray arg = args[i];
          this->data_entry_[i].CopyFrom(arg);
        }
      }

      // Execute the subgraph.
      for (const auto& it : this->graph_[this->curr_subgraph_]) {
        this->Run(it.id, it.inputs, it.output);
      }
      CHECK_GT(graph_.count(this->curr_subgraph_), 0U);

      // Copy the output from a data entry back to TVM runtime argument.
      auto out_idx = graph_[this->curr_subgraph_].back().output;
      if (args[args.size() - 1].type_code() == kArrayHandle) {
        DLTensor* arg = args[args.size() - 1];
        this->data_entry_[out_idx].CopyTo(arg);
      } else {
        NDArray arg = args[args.size() - 1];
        this->data_entry_[out_idx].CopyTo(arg);
      }
      *rv = data_entry_.back();
    });
  } else {
    LOG(FATAL) << "Unknown subgraph: " << name << "\n";
    return PackedFunc();
  }
}

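As can be seen, 【GetFunction】 is composed of three major parts. The first part copies data from the TVM runtime arguments to the corresponding data entries we allocated in the constructor. The second part executes the subgraph with the 【Run】 function (implemented later) and saves the results to another data entry. The third part copies the results from the output data entry back to the corresponding TVM runtime argument for output.

Implementing 【Run】

Now let's implement the 【Run】 function. This function accepts 1) a subgraph node ID, 2) a list of input data entry indices, and 3) an output data entry index.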

void Run(int id, const std::vector<int>& inputs, int output) {
  // Make a list of data entry indices.
  std::vector<int> args(inputs.begin(), inputs.end());
  args.push_back(output);

  // Initialize data holders.
  std::vector<TVMValue> values(args.size());
  std::vector<int> type_codes(args.size());

  // Initialize a TVM arg setter with TVMValue and its type code.
  TVMArgsSetter setter(values.data(), type_codes.data());

  // Set each argument to its corresponding data entry.
  if (op_id_[id] == "add" || op_id_[id] == "sub" || op_id_[id] == "mul") {
    for (size_t i = 0; i < args.size(); i++) {
      setter(i, data_entry_[args[i]]);
    }
  }

  // Invoke the corresponding operator function.
  if (op_id_[id] == "add") {
    Add(values.data(), type_codes.data(), args.size());
  } else if (op_id_[id] == "sub") {
    Sub(values.data(), type_codes.data(), args.size());
  } else if (op_id_[id] == "mul") {
    Mul(values.data(), type_codes.data(), args.size());
  } else {
    LOG(FATAL) << "Unknown op: " << op_id_[id] << "\n";
  }
}

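The 【Run】 function has two main parts. The first part allocates a list of 【TVMValue】 and maps the corresponding data entry blocks; these become the arguments of our operator functions. The second part invokes our operator functions. Although we use the same C functions as in the previous example, you can replace 【Add】, 【Sub】, and 【Mul】 with your own engine. You only need to make sure your engine stores the results in the last argument so that they can be transferred back to the TVM runtime.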
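The data flow of 【GetFunction】 and 【Run】 (copy inputs into data entries, execute nodes in order, read the last output) can be reproduced in a small interpreter that has no TVM dependency. ToyNode, ToyRun and ToyExecute are hypothetical names, and plain float vectors stand in for the NDArray data entries; the sketch assumes a node never writes to one of its own inputs.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical node: operator name, input data entry indices, output data entry index.
struct ToyNode {
    std::string op;
    std::vector<int> inputs;
    int output;
};

// Executes one node: reads data_entry[inputs], writes data_entry[output].
void ToyRun(const ToyNode& node, std::vector<std::vector<float>>& data_entry) {
    assert(node.inputs.size() == 2);
    const std::vector<float>& a = data_entry[node.inputs[0]];
    const std::vector<float>& b = data_entry[node.inputs[1]];
    assert(a.size() == b.size());
    std::vector<float>& out = data_entry[node.output];
    out.resize(a.size());
    for (std::size_t i = 0; i < a.size(); ++i) {
        if (node.op == "add") {
            out[i] = a[i] + b[i];
        } else if (node.op == "sub") {
            out[i] = a[i] - b[i];
        } else if (node.op == "mul") {
            out[i] = a[i] * b[i];
        } else {
            assert(false && "unknown op");
        }
    }
}

// Mirrors the three parts described above: copy inputs into data entries,
// run the nodes in order, and return the output of the last node.
std::vector<float> ToyExecute(const std::vector<ToyNode>& graph,
                              const std::vector<std::vector<float>>& inputs,
                              int num_entries) {
    assert(!graph.empty());
    std::vector<std::vector<float>> data_entry(num_entries);
    for (std::size_t i = 0; i < inputs.size(); ++i) {
        data_entry[i] = inputs[i];
    }
    for (const ToyNode& node : graph) {
        ToyRun(node, data_entry);
    }
    return data_entry[graph.back().output];
}
```

Because the nodes are stored in execution order, a simple loop is enough; a graph with branches would need a real scheduling order instead.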

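With the above functions implemented, our customized codegen and runtime can now execute subgraphs. The last step is to register an API (【examplejson_module_create】) to create this module: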

TVM_REGISTER_GLOBAL("module.examplejson_module_create")
.set_body_typed([](std::string code){
    auto n = make_object<ExampleJsonModule>(code);
    return runtime::Module(n);
});

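Implementing 【SaveToBinary】 and 【LoadFromBinary】

So far we have implemented the main features of a customized runtime so that it can be used like any other TVM runtime. However, when users want to save the built runtime to disk for deployment, TVM has no idea how to save it. This is why we implement 【SaveToBinary】 and 【LoadFromBinary】, which tell TVM how to persist and restore the customized runtime.

We first implement 【SaveToBinary】 to let users save this module to disk.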

void SaveToBinary(dmlc::Stream* stream) final {
    stream->Write(this->graph_json_);
}

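This function is pretty simple. Recall that the only argument we took in the constructor is a subgraph representation, which means a subgraph representation is all we need to construct or restore this customized runtime module. As a result, 【SaveToBinary】 simply writes the subgraph to an output DMLC stream. That is, when users export the module with the 【export_library】 API, the customized module is stored as an ExampleJSON stream of the subgraph.

Similarly, 【LoadFromBinary】 reads the subgraph stream and reconstructs the customized runtime module: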

static Module LoadFromBinary(void* strm) {
  dmlc::Stream* stream = static_cast<dmlc::Stream*>(strm);
  std::string graph_json;
  stream->Read(&graph_json);
  auto n = tvm::runtime::make_object<ExampleJsonModule>(graph_json);
  return Module(n);
}
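The save/load pair is a plain round trip: whatever one side writes, the other must read back in the same layout. The sketch below imitates this with standard C++ streams and a length-prefixed string. ToySave and ToyLoad are hypothetical, and the byte layout is an assumption chosen for illustration, not a statement about the actual format used by the DMLC stream.

```cpp
#include <cassert>
#include <cstdint>
#include <sstream>
#include <string>

// Writes the string length followed by the raw characters.
void ToySave(std::ostream& stream, const std::string& graph_json) {
    const uint64_t size = graph_json.size();
    stream.write(reinterpret_cast<const char*>(&size), sizeof(size));
    stream.write(graph_json.data(), static_cast<std::streamsize>(size));
}

// Reads back exactly what ToySave wrote: first the length, then the characters.
std::string ToyLoad(std::istream& stream) {
    uint64_t size = 0;
    stream.read(reinterpret_cast<char*>(&size), sizeof(size));
    assert(!stream.fail());
    std::string graph_json(size, '\0');
    stream.read(&graph_json[0], static_cast<std::streamsize>(size));
    assert(!stream.fail());
    return graph_json;
}
```

A module rebuilt from the loaded string is equivalent to the original one as long as the constructor depends on nothing except that string, which is the case for the runtime class in this guide.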

We also need to register this function to enable the corresponding Python API:

TVM_REGISTER_GLOBAL("module.loadbinary_examplejson")
.set_body_typed(ExampleJsonModule::LoadFromBinary);

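The above registration means that when users call the 【tvm.runtime.load(lib_path)】 API and the exported library contains an ExampleJSON stream, our 【LoadFromBinary】 will be invoked to create the same customized runtime module.

In addition, if you want to support module creation directly from an ExampleJSON file, you can also implement a simple function and register a Python API as follows: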

static Module Create(const std::string& path) {
    std::ifstream filep;
    filep.open(path, std::ios::in);
    std::string graph_json;
    std::string line;
    while (std::getline(filep, line)) {
        graph_json += line;
        graph_json += "\n";
    }
    filep.close();
    auto n = tvm::runtime::make_object<ExampleJsonModule>(graph_json);
    return Module(n);
}

TVM_REGISTER_GLOBAL("module.loadfile_examplejson")
.set_body([](TVMArgs args, TVMRetValue* rv) {
    *rv = ExampleJsonModule::Create(args[0]);
});

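This means users can manually write or modify an ExampleJSON file and use the Python API 【tvm.runtime.load("mysubgraph.examplejson", "examplejson")】 to construct a customized module.

Summary

In summary, here is a checklist for your reference:

A codegen class derived from 【ExprVisitor】 and 【CodegenCBase】 (the latter only for C codegen) with the following functions.
- 【VisitExpr_(const CallNode* call)】 to collect call node information.
- Other visitor functions needed to collect subgraph information.
- 【JIT】 to generate subgraph code.
- Register the codegen.
A function to create a 【CSourceModule】 (for C codegen).
A runtime module class derived from 【ModuleNode】 with the following functions (for your graph representation).
- Constructor.
- 【GetFunction】 to generate a TVM runtime compatible 【PackedFunc】.
- 【Run】 to execute a subgraph.
- Register a runtime creation API.
- 【SaveToBinary】 and 【LoadFromBinary】 to serialize and deserialize the customized runtime module.
- Register the 【LoadFromBinary】 API to support 【tvm.runtime.load(your_module_lib_path)】.
- (Optional) 【Create】 to support customized runtime module construction from a subgraph file in your representation.
An annotator that annotates a user's Relay program so that it makes use of your compiler and runtime (TBA).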
