tvm tutorials notes (1)

The first article covered the contents of relay_quick_start.py, mainly showing how to build a network graph with Relay and which APIs to call to compile, optimize, and deploy the network.

Next, let's look at tensor_expr_get_started.py. The file is over 900 lines, the vast majority of which are comments; the author of this demo is very generous with explanation.

The author opens with an explanation; the original text is as follows:

In this tutorial we will turn our attention to how TVM works with Tensor Expression (TE) to define tensor computations and apply loop optimizations. TE describes tensor computations in a pure functional language (that is each expression has no side effects). When viewed in context of the TVM as a whole, Relay describes a computation as a set of operators, and each of these operators can be represented as a TE expression where each TE expression takes input tensors and produces an output tensor.

This is an introductory tutorial to the Tensor Expression language in TVM. TVM uses a domain specific tensor expression for efficient kernel construction. We will demonstrate the basic workflow with two examples of using the tensor expression language. The first example introduces TE and scheduling with vector addition. The second expands on these concepts with a step-by-step optimization of a matrix multiplication with TE. This matrix multiplication example will serve as the comparative basis for future tutorials covering more advanced features of TVM.

tensor expression & schedule

TVM provides the tensor expression concept and tooling for optimizing tensor computations: by changing loop structure and computation order, and by taking the characteristics of different hardware into account, it achieves high-performance tensor computation. Here we take vector addition as an example; the code is as follows:

import tvm
import tvm.testing
from tvm import te
import numpy as np

# You will get better performance if you can identify the CPU you are targeting
# and specify it. If you're using llvm, you can get this information from the
# command ``llc --version`` to get the CPU type, and you can check
# ``/proc/cpuinfo`` for additional extensions that your processor might
# support. For example, you can use "llvm -mcpu=skylake-avx512" for CPUs with
# AVX-512 instructions.

tgt = tvm.target.Target(target="llvm", host="llvm")

################################################################################
# Describing the Vector Computation
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# We describe a vector addition computation. TVM adopts tensor semantics, with
# each intermediate result represented as a multi-dimensional array. The user
# needs to describe the computation rule that generates the tensors. We first
# define a symbolic variable ``n`` to represent the shape. We then define two
# placeholder Tensors, ``A`` and ``B``, with given shape ``(n,)``. We then
# describe the result tensor ``C``, with a ``compute`` operation. The
# ``compute`` defines a computation, with the output conforming to the
# specified tensor shape and the computation to be performed at each position
# in the tensor defined by the lambda function. Note that while ``n`` is a
# variable, it defines a consistent shape between the ``A``, ``B`` and ``C``
# tensors. Remember, no actual computation happens during this phase, as we
# are only declaring how the computation should be done.

n = te.var("n")
A = te.placeholder((n,), name="A")
B = te.placeholder((n,), name="B")
C = te.compute(A.shape, lambda i: A[i] + B[i], name="C")

################################################################################
# .. note:: Lambda Functions
#
#   The second argument to the ``te.compute`` method is the function that
#   performs the computation. In this example, we're using an anonymous function,
#   also known as a ``lambda`` function, to define the computation, in this case
#   addition on the ``i``th element of ``A`` and ``B``.

################################################################################
# Create a Default Schedule for the Computation
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#
# While the above lines describe the computation rule, we can compute ``C`` in
# many different ways to fit different devices. For a tensor with multiple
# axes, you can choose which axis to iterate over first, or computations can be
# split across different threads. TVM requires the user to provide a
# schedule, which is a description of how the computation should be performed.
# Scheduling operations within TE can change loop orders, split computations
# across different threads, group blocks of data together, amongst other
# operations. An important concept behind schedules is that they only describe
# how the computation is performed, so different schedules for the same TE will
# produce the same result.
#
# TVM allows you to create a naive schedule that will compute ``C`` by
# iterating in row major order.
#
# .. code-block:: c
#
#   for (int i = 0; i < n; ++i) {
#     C[i] = A[i] + B[i];
#   }

s = te.create_schedule(C.op)
######################################################################
# Compile and Evaluate the Default Schedule
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# With the TE expression and a schedule, we can produce runnable code for our
# target language and architecture, in this case LLVM and a CPU. We provide
# TVM with the schedule, a list of the TE expressions that are in the schedule,
# the target and host, and the name of the function we are producing. The result
# of the output is a type-erased function that can be called directly from Python.
#
# In the following line, we use tvm.build to create a function. The build
# function takes the schedule, the desired signature of the function (including
# the inputs and outputs) as well as the target language we want to compile to.

fadd = tvm.build(s, [A, B, C], tgt, name="myadd")

################################################################################
# Let's run the function, and compare the output to the same computation in
# numpy. The compiled TVM function exposes a concise C API that can be invoked
# from any language. We begin by creating a device, which is a device (CPU in this
# example) that TVM can compile the schedule to. In this case the device is an
# LLVM CPU target. We can then initialize the tensors in our device and
# perform the custom addition operation. To verify that the computation is
# correct, we can compare the result of the output of the c tensor to the same
# computation performed by numpy.

dev = tvm.device(tgt.kind.name, 0)

n = 1024
a = tvm.nd.array(np.random.uniform(size=n).astype(A.dtype), dev)
b = tvm.nd.array(np.random.uniform(size=n).astype(B.dtype), dev)
c = tvm.nd.array(np.zeros(n, dtype=C.dtype), dev)
fadd(a, b, c)
tvm.testing.assert_allclose(c.numpy(), a.numpy() + b.numpy())

################################################################################
# To get a comparison of how fast this version is compared to numpy, create a
# helper function to run a profile of the TVM generated code.
import timeit

np_repeat = 100
np_running_time = timeit.timeit(
    setup="import numpy\n"
    "n = 32768\n"
    'dtype = "float32"\n'
    "a = numpy.random.rand(n, 1).astype(dtype)\n"
    "b = numpy.random.rand(n, 1).astype(dtype)\n",
    stmt="answer = a + b",
    number=np_repeat,
)
print("Numpy running time: %f" % (np_running_time / np_repeat))

The code above uses the te module in TVM to implement a custom high-performance tensor operation, in this case vector addition.
The part that describes the computation, i.e. how the operation is carried out, is the following:

n = te.var("n")
A = te.placeholder((n,), name="A")
B = te.placeholder((n,), name="B")
C = te.compute(A.shape, lambda i: A[i] + B[i], name="C")

The logic is implemented with a lambda expression, and the comments in this code stress that it only describes the computation logic and performs no actual computation. Here is the really interesting part: since the computation logic is written as a lambda expression, and describing the computation should ultimately yield an IR-like graph structure, how does te turn a lambda expression into IR?

If a computation graph is indeed being constructed, we should be able to print out a structure describing this tensor computation; the code later on does produce the constructed graph.
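
As a quick sanity check (a sketch, assuming the A, B, C defined above), we can peek at what te has recorded for the compute op; no values are computed, only an expression tree is stored:

print(type(C.op))   # a ComputeOp
print(C.op.body)    # the recorded expression, something like [A[i] + B[i]]
print(C.op.axis)    # the iteration variables over the output shape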

The compute function of the te module is implemented in operation.py; besides this function, var and placeholder are implemented in operation.py as well. Let's take a quick look at what's inside.

def compute(shape, fcompute, name="compute", tag="", attrs=None):
    """Construct a new tensor by computing over the shape domain.

    The compute rule is result[axis] = fcompute(axis)

    Parameters
    ----------
    shape: Tuple of Expr
        The shape of the tensor

    fcompute: lambda function of indices-> value
        Specifies the input source expression

    name: str, optional
        The name hint of the tensor

    tag: str, optional
        Additional tag information about the compute.

    attrs: dict, optional
        The additional auxiliary attributes about the compute.

    Returns
    -------
    tensor: Tensor
        The created tensor
    """
    if _tag.TagScope.get_current() is not None:
        if tag != "":
            raise ValueError("nested tag is not allowed for now")
        tag = _tag.TagScope.get_current().tag
    shape = (shape,) if isinstance(shape, tvm.tir.PrimExpr) else shape
    # for python3
    shape = tuple([int(s) if isinstance(s, float) else s for s in shape])
    ndim = len(shape)
    code = fcompute.__code__

    out_ndim = ndim
    if code.co_argcount == 0:
        arg_names = ["i%d" % i for i in range(ndim)]
    else:
        arg_names = code.co_varnames[: code.co_argcount]
        out_ndim = code.co_argcount

    if out_ndim != len(arg_names):
        raise ValueError("fcompute do not match dimension, ndim=%d" % ndim)

    dim_var = [tvm.tir.IterVar((0, s), x, 0) for x, s in zip(arg_names, shape[:out_ndim])]
    body = fcompute(*[v.var for v in dim_var])

    if isinstance(body, _tensor.TensorIntrinCall):
        for i, s in enumerate(shape[out_ndim:]):
            var_name = "ax" + str(i)
            dim_var.append(tvm.tir.IterVar((0, s), var_name, 4))
        op_node = _ffi_api.TensorComputeOp(
            name,
            tag,
            dim_var,
            body.reduce_axis,
            out_ndim,
            body.intrin,
            body.tensors,
            body.regions,
            body.scalar_inputs,
        )
    else:
        if not isinstance(body, (list, tuple)):
            body = [body]
        body = convert(body)
        op_node = _ffi_api.ComputeOp(name, tag, attrs, dim_var, body)

    num = op_node.num_outputs
    outputs = tuple(op_node.output(i) for i in range(num))
    return outputs[0] if num == 1 else outputs

There is one line in the code above:

code = fcompute.__code__

__code__ is an attribute of Python function objects: it is the compiled code object, from which you can read information such as the number and names of the function's arguments.
As we can see, once the argument names of the lambda are known, the code builds tvm.tir.IterVars for them, calls the lambda with those iteration variables to obtain the expression body, and finally passes everything through _ffi to the C++ side to obtain op_node.

This roughly explains how the computation logic is built from a lambda expression; we won't dig into the details for now.
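
To make this concrete, here is a tiny standalone sketch (plain Python, no TVM needed) of what compute reads off the lambda before ever calling it:

# A function's __code__ attribute is its compiled code object; te.compute uses it
# to discover how many index arguments the lambda expects and what they are called.
fcompute = lambda i, j: i + j
code = fcompute.__code__
print(code.co_argcount)                     # 2 -> the output tensor has 2 axes
print(code.co_varnames[:code.co_argcount])  # ('i', 'j') -> names used for the IterVars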

schedule

Once the computation logic is constructed, we can schedule it. The scheduling step is optional; after scheduling, compiling yields the scheduled computation. The line below only creates an "empty" (default) schedule and applies no transformation, but since the actual scheduling logic appears later in the tutorial, I collect it here.

Parallelization

The schedule strategy below executes the vector addition in parallel: the CPU will use multiple cores to carry out the computation.

s = te.create_schedule(C.op)
s[C].parallel(C.op.axis[0])
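
A minimal sketch of rebuilding and re-verifying with this schedule (the name "myadd_parallel" is just a label; dev, a, b and c are reused from above):

# Build the parallel schedule into a new function and check it against numpy.
fadd_parallel = tvm.build(s, [A, B, C], tgt, name="myadd_parallel")
fadd_parallel(a, b, c)
tvm.testing.assert_allclose(c.numpy(), a.numpy() + b.numpy())
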
Parallelization + vectorization

CPUs generally provide SIMD vector instructions; combined with the parallel scheduling primitive, they allow even better optimization.
The original vector addition walks over every element with a single loop. With multiple cores, parallelization can split the data across cores: assuming the machine has 4 cores and the two vectors have length 1024, the data can be split into 4 chunks and each core loops over 256 elements.
On top of that, each core can use vectorized instructions; assuming one vector instruction can operate on 4 floats at a time, each core only needs 64 loop iterations.

s = te.create_schedule(C.op)  # start from a fresh schedule, as the tutorial does for this variant
factor = 4  # split factor; 4 matches the SIMD-width assumption in the discussion above
outer, inner = s[C].split(C.op.axis[0], factor=factor)
s[C].parallel(outer)
s[C].vectorize(inner)

The code above first splits the loop into two nested loops; split returns the two loop axes, where the inner loop runs factor times and the outer loop runs roughly n / factor (rounded up) times.
The outer loop is then parallelized, i.e. distributed across multiple CPU cores;
the inner loop is handled with vectorized instructions.
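
As before, the scheduled version can be rebuilt, verified and timed; a sketch (reusing tgt, dev, a, b, c from above, with "myadd_vector" again just a label):

# Build the split + parallel + vectorize schedule and compare against numpy.
fadd_vector = tvm.build(s, [A, B, C], tgt, name="myadd_vector")
fadd_vector(a, b, c)
tvm.testing.assert_allclose(c.numpy(), a.numpy() + b.numpy())

evaluator = fadd_vector.time_evaluator(fadd_vector.entry_name, dev, number=10)
print("vectorized: %f s" % evaluator(a, b, c).mean)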

build

Compile the computation we just constructed for the specified device (note that this is tvm.build, not relay.build).
tvm.build is mainly used to compile custom logic (or operators), whereas relay.build compiles a whole network model, such as the mobilenet model in the previous article.

fadd = tvm.build(s, [A, B, C], tgt, name="myadd")

The build code is in tvm/python/tvm/driver/build_module.py; let's see what it looks like (the code is fairly long).

def build(
    inputs: Union[schedule.Schedule, PrimFunc, IRModule, Mapping[str, IRModule]],
    args: Optional[List[Union[Buffer, tensor.Tensor, Var]]] = None,
    target: Optional[Union[str, Target]] = None,
    target_host: Optional[Union[str, Target]] = None,
    name: Optional[str] = "default_function",
    binds: Optional[Mapping[tensor.Tensor, Buffer]] = None,
):
    """Build a function with arguments as signature. Code will be generated
    for devices coupled with target information.

    Parameters
    ----------
    inputs : Union[tvm.te.schedule.Schedule, tvm.tir.PrimFunc, IRModule, Mapping[str, IRModule]]
        The input to be built

    args : Optional[List[Union[tvm.tir.Buffer, tensor.Tensor, Var]]]
        The argument lists to the function.

    target : Optional[Union[str, Target]]
        The target and option of the compilation.

    target_host : Optional[Union[str, Target]]
        Host compilation target, if target is device.
        When TVM compiles device specific program such as CUDA,
        we also need host(CPU) side code to interact with the driver
        setup the dimensions and parameters correctly.
        target_host is used to specify the host side codegen target.
        By default, llvm is used if it is enabled,
        otherwise a stackvm intepreter is used.

    name : Optional[str]
        The name of result function.

    binds : Optional[Mapping[tensor.Tensor, tvm.tir.Buffer]]
        Dictionary that maps the binding of symbolic buffer to Tensor.
        By default, a new buffer is created for each tensor in the argument.

    Returns
    -------
    ret : tvm.module
        A module that combines both host and device code.

    Examples
    ________
    There are two typical example uses of this function depending on the type
    of the argument `inputs`:
    1. it is an IRModule.

    .. code-block:: python

        n = 2
        A = te.placeholder((n,), name='A')
        B = te.placeholder((n,), name='B')
        C = te.compute(A.shape, lambda *i: A(*i) + B(*i), name='C')
        s = tvm.te.create_schedule(C.op)
        m = tvm.lower(s, [A, B, C], name="test_add")
        rt_mod = tvm.build(m, target="llvm")

    2. it is a dict of compilation target to IRModule.

    .. code-block:: python

        n = 2
        A = te.placeholder((n,), name='A')
        B = te.placeholder((n,), name='B')
        C = te.compute(A.shape, lambda *i: A(*i) + B(*i), name='C')
        s1 = tvm.te.create_schedule(C.op)
        with tvm.target.cuda() as cuda_tgt:
          s2 = topi.cuda.schedule_injective(cuda_tgt, [C])
          m1 = tvm.lower(s1, [A, B, C], name="test_add1")
          m2 = tvm.lower(s2, [A, B, C], name="test_add2")
          rt_mod = tvm.build({"llvm": m1, "cuda": m2}, target_host="llvm")

    Note
    ----
    See the note on :any:`tvm.target` on target string format.
    """
    if isinstance(inputs, schedule.Schedule):
        if args is None:
            raise ValueError("args must be given for build from schedule")
        input_mod = lower(inputs, args, name=name, binds=binds)
    elif isinstance(inputs, (list, tuple, container.Array)):
        merged_mod = tvm.IRModule({})
        for x in inputs:
            merged_mod.update(lower(x))
        input_mod = merged_mod
    elif isinstance(inputs, (tvm.IRModule, PrimFunc)):
        input_mod = lower(inputs)
    elif not isinstance(inputs, (dict, container.Map)):
        raise ValueError(
            f"Inputs must be Schedule, IRModule or dict of target to IRModule, "
            f"but got {type(inputs)}."
        )

    if not isinstance(inputs, (dict, container.Map)):
        target = Target.current() if target is None else target
        target = target if target else "llvm"
        target_input_mod = {target: input_mod}
    else:
        target_input_mod = inputs

    for tar, mod in target_input_mod.items():
        if not isinstance(tar, (str, Target)):
            raise ValueError("The key of inputs must be str or " "Target when inputs is dict.")
        if not isinstance(mod, tvm.IRModule):
            raise ValueError("inputs must be Schedule, IRModule," "or dict of str to IRModule.")

    target_input_mod, target_host = Target.check_and_update_host_consist(
        target_input_mod, target_host
    )

    if not target_host:
        for tar, mod in target_input_mod.items():
            tar = Target(tar)
            device_type = ndarray.device(tar.kind.name, 0).device_type
            if device_type == ndarray.cpu(0).device_type:
                target_host = tar
                break
    if not target_host:
        target_host = "llvm" if tvm.runtime.enabled("llvm") else "stackvm"

    target_input_mod, target_host = Target.check_and_update_host_consist(
        target_input_mod, target_host
    )

    mod_host_all = tvm.IRModule({})

    device_modules = []
    for tar, input_mod in target_input_mod.items():
        mod_host, mdev = _build_for_device(input_mod, tar, target_host)
        mod_host_all.update(mod_host)
        device_modules.append(mdev)

    # Generate a unified host module.
    rt_mod_host = codegen.build_module(mod_host_all, target_host)

    # Import all modules.
    for mdev in device_modules:
        if mdev:
            rt_mod_host.import_module(mdev)

    if not isinstance(target_host, Target):
        target_host = Target(target_host)
    if (
        target_host.attrs.get("runtime", tvm.runtime.String("c++")) == "c"
        and target_host.attrs.get("system-lib", 0) == 1
    ):
        if target_host.kind.name == "c":
            create_csource_crt_metadata_module = tvm._ffi.get_global_func(
                "runtime.CreateCSourceCrtMetadataModule"
            )
            to_return = create_csource_crt_metadata_module([rt_mod_host], target_host)

        elif target_host.kind.name == "llvm":
            create_llvm_crt_metadata_module = tvm._ffi.get_global_func(
                "runtime.CreateLLVMCrtMetadataModule"
            )
            to_return = create_llvm_crt_metadata_module([rt_mod_host], target_host)
    else:
        to_return = rt_mod_host

    return OperatorModule.from_module(to_return, ir_module_by_target=target_input_mod, name=name)

The docstring even gives usage examples; again the author is very thorough. From the docstring we learn that the schedule is compiled into instruction code tightly coupled to the hardware. In the code above you can clearly see the line calling codegen.build_module; the return value is a module that combines both host-side and device-side code (instructions).
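
To see that hardware-specific code really is produced, the built module can be inspected directly; a sketch for the llvm target used above (for an llvm module, get_source typically returns the generated LLVM IR):

print(type(fadd))               # a tvm.runtime.Module (OperatorModule here)
print(fadd.imported_modules)    # device modules imported into the host module (empty for CPU-only)
print(fadd.get_source()[:500])  # the beginning of the generated code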

Generating IR

The lower operation generates IR, which lets you check whether the schedule did what you wanted; simple_mode=True prints a simplified, C-style statement.

print(tvm.lower(s, [A, B, C], simple_mode=True))

The printed result is as follows:

Part of the information in the IR is covered in the IR reference section of the official docs.
I haven't found documentation that explains the various symbols in the IR in detail (not even a rough explanation).
TVM's IR is derived from Halide's IR, but I haven't found detailed material on Halide IR either.

primfn(A_1: handle, B_1: handle, C_1: handle) -> ()
  attr = {"from_legacy_te_schedule": True, "global_symbol": "main", "tir.noalias": True}
  buffers = {C: Buffer(C_2: Pointer(float32), float32, [n: int32], [stride: int32], type="auto"),
             A: Buffer(A_2: Pointer(float32), float32, [n], [stride_1: int32], type="auto"),
             B: Buffer(B_2: Pointer(float32), float32, [n], [stride_2: int32], type="auto")}
  buffer_map = {A_1: A, B_1: B, C_1: C} {
  for (i.outer: int32, 0, floordiv((n + 3), 4)) "parallel" {
    for (i.inner.s: int32, 0, 4) {
      if @tir.likely((((i.outer*4) + i.inner.s) < n), dtype=bool) {
        C_2[(((i.outer*4) + i.inner.s)*stride)] = ((float32*)A_2[(((i.outer*4) + i.inner.s)*stride_1)] + (float32*)B_2[(((i.outer*4) + i.inner.s)*stride_2)])
      }
    }
  }
}

The computation as modified by the schedule is wrapped in a function named primfn. Each function parameter is annotated with handle, which presumably means the arguments are passed as opaque pointers to the tensors, though I'm not fully sure.
The next line is attr, which judging from the abbreviation means attributes; I didn't find it described in the docs.

The next line introduces buffers, which looks like a storage data structure. I couldn't find a precise explanation of its fields. My reading is that the first field, Pointer(float32), means the buffer is accessed through a float pointer; the second field, float32, is the element data type; the third field, [n], is the buffer shape; and the fourth field, [stride], is most likely the stride, i.e. something to do with how the data is laid out (for this contiguous vector it should be 1).

Next comes buffer_map, whose entries look just like a Python dict: key/value pairs where the key indexes the value. From the entries it's clear that the primfn parameters map to the buffers defined above.

for (i.outer: int32, 0, floordiv((n + 3), 4)) "parallel" 

A loop is generally described by four pieces of information: the loop variable, the start value, the trip count, and the step. Above these correspond to i.outer, 0 and floordiv((n + 3), 4); the step is implicitly 1 and is not written out (note this is not the stride from the buffers above). floordiv((n + 3), 4) is just ceil(n / 4), which comes from the split with factor=4.
The trailing "parallel" annotation is similar to an OpenMP-style pragma marking the loop for multi-core execution.

Now look at the conditional branch:

if @tir.likely((((i.outer*4) + i.inner.s) < n), dtype=bool)

It checks whether the index formed from the outer and inner loop variables exceeds the vector length (the @tir. prefix marks this as a TIR intrinsic call). likely is presumably analogous to the branch-prediction hints in C/C++ (e.g. __builtin_expect): it tells the compiler the condition is expected to hold almost all the time, i.e. the if rarely fails inside the loop.

Finally, the loop body:

C_2[(((i.outer*4) + i.inner.s)*stride)] = ((float32*)A_2[(((i.outer*4) + i.inner.s)*stride_1)] + (float32*)B_2[(((i.outer*4) + i.inner.s)*stride_2)])

The index computation from the two nested loops is clear enough; simplified (ignoring the strides) it is:

C[i.outer * 4 + i.inner.s] = A[i.outer * 4 + i.inner.s] + B[i.outer * 4 + i.inner.s]

Finally, let's see what lower actually does; the lower function also lives in driver/build_module.py.
The official docs give just this:

Lowering step before build into target.

Only one sentence: the currently generated IR is lowered into an intermediate representation closer to the hardware.
The Python code of the function is as follows:

 def lower(
     inp: Union[schedule.Schedule, PrimFunc, IRModule],
     args: Optional[List[Union[Buffer, tensor.Tensor, Var]]] = None,
     name: str = "main",
     binds: Optional[Mapping[tensor.Tensor, Buffer]] = None,
     simple_mode: bool = False,
 ) -> IRModule:
     """Lowering step before build into target.

     Parameters
     ----------
     inp : Union[tvm.te.schedule.Schedule, tvm.tir.PrimFunc, IRModule]
         The TE schedule or TensorIR PrimFunc/IRModule to be built

     args : Optional[List[Union[tvm.tir.Buffer, tensor.Tensor, Var]]]
         The argument lists to the function for TE schedule.
         It should be None if we want to lower TensorIR.

     name : str
         The name of the result function.

     binds : Optional[Mapping[tensor.Tensor, tvm.tir.Buffer]]
         Dictionary that maps the Tensor to Buffer which specified the data layout
         requirement of the function. By default, a new compact buffer is created
         for each tensor in the argument.

     simple_mode : bool
         Whether only output simple and compact statement, this will skip
         LoopPartition, api wrapper generation and Unrolling.

     Returns
     -------
     m : IRModule
        The result IRModule
     """
     if isinstance(inp, IRModule):
         return ffi.lower_module(inp, simple_mode)
     if isinstance(inp, PrimFunc):
         return ffi.lower_primfunc(inp, name, simple_mode)
     if isinstance(inp, schedule.Schedule): ############## this is the branch taken in our case
         return ffi.lower_schedule(inp, args, name, binds, simple_mode)
     raise ValueError("Expected input to be an IRModule, PrimFunc or Schedule, but got, ", type(inp))

Inside, it calls into the C++ API; we'll skip that for now.
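
One detail visible in the signature is that lower returns an IRModule; a quick sketch of inspecting it (assuming the schedule s from above, with the lowered function registered under the default name "main"):

# lower() produces an IRModule holding one PrimFunc; printing it gives the primfn text shown earlier.
mod = tvm.lower(s, [A, B, C], simple_mode=True)
print(type(mod))              # tvm.ir.IRModule
print(mod.get_global_vars())  # e.g. [GlobalVar(main)]
print(mod["main"])            # the PrimFunc itself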

Running

With the instruction code compiled, we can feed in data and run it.

dev = tvm.device(tgt.kind.name, 0)

n = 1024
a = tvm.nd.array(np.random.uniform(size=n).astype(A.dtype), dev)
b = tvm.nd.array(np.random.uniform(size=n).astype(B.dtype), dev)
c = tvm.nd.array(np.zeros(n, dtype=C.dtype), dev)
fadd(a, b, c)
tvm.testing.assert_allclose(c.numpy(), a.numpy() + b.numpy())

These few lines are straightforward: construct the tensors, call the fadd function, and finally check the result against the numpy computation.
