Hand-scheduling matmul with TVM's AVX512-VNNI intrinsic

The hand-written intrinsic

Relay Core Tensor Operators - tvm 0.9.dev0 documentation

python/tvm/topi/x86/tensor_intrin.py

def dot_16x1x16_uint8_int8_int32_skylake():
    """
    Int8 dot product by every 4 elements using AVX512 Skylake instructions.
    This function takes two arrays of uint8 and int8 datatype -- data[4] and
    kernel[16][4] -- and computes a dot product of data[4] with every
    4 elements of kernels, resulting in output[16] of int32 datatype.
    The pseudo code is as follows.
    .. code-block:: c
        void dot_16x1x16_uint8_int8_int32(uint8 data[4], int8 kernel[16][4],
                int32 output[16]){
            for (int i = 0; i < 16; i++){
                output[i] = 0;
                for (int k = 0; k < 4; k++){
                    output[i] += data[k] * kernel[i][k]
                }
            }
        }

    Physically, the kernel array sits in an AVX512 vector register and
    the data[4] is broadcasted to another AVX512 vector register. This
    function returns a TensorIntrin that can be used to tensorize
    a schedule.

    Returns
    -------
    intrin : TensorIntrin
        The Skylake int8 TensorIntrin that can be used in tensorizing schedule
    """

Pseudo code:

void dot_16x1x16_uint8_int8_int32(uint8 data[4], int8 kernel[16][4],
        int32 output[16]){
    for (int i = 0; i < 16; i++){
        output[i] = 0;
        for (int k = 0; k < 4; k++){
            output[i] += data[k] * kernel[i][k];
        }
    }
}
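To make the intrinsic's semantics concrete, here is a plain-Python/NumPy reference of the same computation (a sketch for illustration only, not TVM code): each of the 16 output lanes is a 4-element uint8 x int8 dot product accumulated in int32, which is exactly what one VNNI-style instruction computes per lane.

```python
import numpy as np

def dot_16x1x16_ref(data, kernel):
    """Reference semantics: data is uint8[4], kernel is int8[16][4];
    returns int32[16], matching the pseudo code above."""
    out = np.zeros(16, dtype=np.int32)
    for i in range(16):
        for k in range(4):
            # Widen to int32 before multiplying, as the hardware accumulates in int32.
            out[i] += np.int32(data[k]) * np.int32(kernel[i][k])
    return out

rng = np.random.default_rng(0)
data = rng.integers(0, 256, size=4, dtype=np.uint8)
kernel = rng.integers(-128, 128, size=(16, 4), dtype=np.int8)

# The loop nest is equivalent to a 16x4 by 4 matrix-vector product in int32.
assert np.array_equal(
    dot_16x1x16_ref(data, kernel),
    kernel.astype(np.int32) @ data.astype(np.int32),
)
```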


Applying the 16x1x16 intrinsic to matmul

tests/python/unittest/test_tir_schedule_tensorize.py::test_tensorize_vnni

def test_tensorize_vnni():
    m, n, k = 128, 128, 128
    func = get_matmul_packed(m, n, k, "uint8", 16)
    sch = tir.Schedule(func, debug_mask="all")
    block = sch.get_block("compute")
    _, j, k = sch.get_loops(block)
    _, ji = sch.split(j, factors=[None, 16])
    ko, ki = sch.split(k, factors=[None, 4])
    sch.reorder(ko, ji, ki)
    sch.decompose_reduction(block, ko)
    sch.tensorize(ji, VNNI_DOT_16x4_INTRIN)
    verify_trace_roundtrip(sch=sch, mod=func)

Now we need to map the TIR onto VNNI_DOT_16x4_INTRIN. For a (128,128) x (128,128) matrix multiplication,

_, ji = sch.split(j, factors=[None, 16])

splits the column axis j of matrix B, so the inner j loop has length 16. Since B has 128 columns and 128/16 = 8, the columns are partitioned into 8 tiles of width 16 each.
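To see what the schedule produces before tensorization, here is a plain-Python sketch (illustrative only, not TVM output) of the loop nest after `split(j, [None, 16])`, `split(k, [None, 4])`, and `reorder(ko, ji, ki)`; the loop names `jo`, `ji`, `ko`, `ki` mirror those in the test above. The innermost `(ji, ki)` pair is exactly the 16x4 shape that `tensorize` replaces with one VNNI dot-product instruction.

```python
import numpy as np

m, n, k = 128, 128, 128
rng = np.random.default_rng(1)
X = rng.integers(0, 8, size=(m, k), dtype=np.int32)
W = rng.integers(0, 8, size=(k, n), dtype=np.int32)
C = np.zeros((m, n), dtype=np.int32)  # init block, as separated by decompose_reduction

for i in range(m):
    for jo in range(n // 16):      # outer j: 8 column tiles of width 16
        for ko in range(k // 4):   # outer k: 32 reduction chunks of 4
            for ji in range(16):   # inner j: 16 output lanes of the intrinsic
                for ki in range(4):  # inner k: the 4-element dot product
                    C[i, jo * 16 + ji] += (
                        X[i, ko * 4 + ki] * W[ko * 4 + ki, jo * 16 + ji]
                    )

# The transformed loop nest still computes an ordinary matmul.
assert np.array_equal(C, X @ W)
```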

# from tvm.script import tir as T
@tvm.script.ir_module
class Module:
    @T.prim_func
    def main(X: T.Buffer[(128, 128), "uint8"], packedW: T.Buffer[