Hand-written operators
Relay Core Tensor Operators - tvm 0.9.dev0 documentation
python/tvm/topi/x86/tensor_intrin.py
def dot_16x1x16_uint8_int8_int32_skylake():
    """
    Int8 dot product by every 4 elements using AVX512 Skylake instructions.
    This function takes two arrays of uint8 and int8 datatype -- data[4] and
    kernel[16][4] -- and computes a dot product of data[4] with every
    4 elements of kernels, resulting in output[16] of int32 datatype.
    The pseudo code is as follows.

    .. code-block:: c

        void dot_16x1x16_uint8_int8_int32(uint8 data[4], int8 kernel[16][4],
                                          int32 output[16]){
            for (int i = 0; i < 16; i++){
                output[i] = 0;
                for (int k = 0; k < 4; k++){
                    output[i] += data[k] * kernel[i][k];
                }
            }
        }

    Physically, the kernel array sits in an AVX512 vector register and
    the data[4] is broadcasted to another AVX512 vector register. This
    function returns a TensorIntrin that can be used to tensorize
    a schedule.

    Returns
    -------
    intrin : TensorIntrin
        The Skylake int8 TensorIntrin that can be used in tensorizing schedule
    """
Pseudocode:
#include <stdint.h>

void dot_16x1x16_uint8_int8_int32(uint8_t data[4], int8_t kernel[16][4],
                                  int32_t output[16]) {
    for (int i = 0; i < 16; i++) {
        output[i] = 0;
        for (int k = 0; k < 4; k++) {
            // uint8 * int8 is promoted to int, then accumulated in int32
            output[i] += data[k] * kernel[i][k];
        }
    }
}
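The loop above is a 16×4 matrix times a 4-vector. The NumPy sketch below contrasts that exact int32 semantics with a model of how a pre-VNNI AVX512 path can realize it, following the register picture in the docstring: data[4] is broadcast to 64 bytes, multiplied byte-wise against the 64 kernel bytes, and reduced in groups of 4. The two-step reduction (a u8×s8 multiply-add of adjacent pairs into saturated int16, in the style of `vpmaddubsw`, then an int16 pairwise add into int32, in the style of `vpmaddwd` against a vector of ones) is an assumption about the Skylake implementation, whose body is not shown above; both function names are hypothetical.

```python
import numpy as np


def dot_16x1x16_reference(data, kernel):
    """Exact semantics of the pseudocode: output[i] = sum_k data[k] * kernel[i][k] in int32."""
    return kernel.astype(np.int32) @ data.astype(np.int32)


def dot_16x1x16_avx512_model(data, kernel):
    """Hypothetical model of a two-step AVX512 reduction over one 512-bit register.

    data: uint8[4], kernel: int8[16][4]; returns int32[16].
    """
    # Broadcast data[4] 16 times -> 64 uint8 lanes (512 bits), matching kernel's 64 int8 lanes.
    a = np.tile(data, 16).astype(np.int32)
    b = kernel.reshape(64).astype(np.int32)
    # Step 1 (vpmaddubsw-style): multiply byte lanes, add adjacent pairs,
    # and saturate each pair sum to the int16 range -> 32 int16 lanes.
    pairs = np.clip((a * b).reshape(32, 2).sum(axis=1), -(2**15), 2**15 - 1)
    # Step 2 (vpmaddwd-style, multiplying by ones): add adjacent int16 pairs -> 16 int32 lanes.
    return pairs.reshape(16, 2).sum(axis=1).astype(np.int32)


rng = np.random.default_rng(0)
# Small magnitudes: no pair sum can leave the int16 range, so both versions agree.
data = rng.integers(0, 100, size=4, dtype=np.uint8)
kernel = rng.integers(-50, 50, size=(16, 4), dtype=np.int8)
print(np.array_equal(dot_16x1x16_reference(data, kernel),
                     dot_16x1x16_avx512_model(data, kernel)))  # True

# Worst case: 255 * 127 * 2 = 64770 overflows int16, so the model saturates.
big_data = np.full(4, 255, dtype=np.uint8)
big_kernel = np.full((16, 4), 127, dtype=np.int8)
print(dot_16x1x16_reference(big_data, big_kernel)[0])     # 129540
print(dot_16x1x16_avx512_model(big_data, big_kernel)[0])  # 65534
```

The worst case also shows why the output is int32: the exact lane value 129540 does not fit in int16. The 129540 vs. 65534 gap belongs to the assumed two-step model only; a VNNI instruction such as `vpdpbusd` sums the four u8×s8 products directly into an int32 lane.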

Applying 16x1x16 to matmul
tests/python/unittest/test_tir_schedule_tensorize.py::test_tensorize_vnni
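from tvm import tir
from tvm.tir.schedule.testing import verify_trace_roundtrip
from tvm.tir.tensor_intrin.x86 import VNNI_DOT_16x4_INTRIN
# get_matmul_packed is a helper defined in the same test file

def test_tensorize_vnni():
    m, n, k = 128, 128, 128
    func = get_matmul_packed(m, n, k, "uint8", 16)
    sch = tir.Schedule(func, debug_mask="all")
    block = sch.get_block("compute")
    _, j, k = sch.get_loops(block)  # loops (i, j, k): j = output column, k = reduction
    _, ji = sch.split(j, factors=[None, 16])  # ji: the 16 int32 output lanes
    ko, ki = sch.split(k, factors=[None, 4])  # ki: the 4 products summed per lane
    sch.reorder(ko, ji, ki)  # loop order becomes i, jo, ko, ji, ki
    sch.decompose_reduction(block, ko)  # hoist the zero-init above ko
    sch.tensorize(ji, VNNI_DOT_16x4_INTRIN)  # (ji, ki) tile -> one intrinsic call
    verify_trace_roundtrip(sch=sch, mod=func)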
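To see what this schedule computes, here is a NumPy model of the loop nest after split, reorder and decompose_reduction, with the innermost (ji, ki) tile replaced by a single 16×4 dot product. It assumes packedW uses the layout packedW[j // 16, k // 4, j % 16, k % 4] = W[j, k] for an (N, K) int8 weight W, which matches the intrinsic's kernel[16][4] tile; `pack_weight`, `dot_16x4` and `scheduled_matmul` are hypothetical helpers, and this is a semantic sketch rather than TVM code.

```python
import numpy as np

LANES, RED = 16, 4  # extents of ji and ki after the two splits


def pack_weight(W):
    """(N, K) int8 -> (N//16, K//4, 16, 4) with packed[j//16, k//4, j%16, k%4] = W[j, k]."""
    n, k = W.shape
    return W.reshape(n // LANES, LANES, k // RED, RED).transpose(0, 2, 1, 3)


def dot_16x4(data4, kernel16x4):
    """Stand-in for one call of the 16x4 intrinsic: int32[16] = kernel16x4 @ data4."""
    return kernel16x4.astype(np.int32) @ data4.astype(np.int32)


def scheduled_matmul(X, packedW):
    m, k = X.shape
    n = packedW.shape[0] * LANES
    C = np.empty((m, n), dtype=np.int32)
    for i in range(m):
        for jo in range(n // LANES):
            cols = slice(jo * LANES, (jo + 1) * LANES)
            C[i, cols] = 0  # init block created by decompose_reduction(block, ko)
            for ko in range(k // RED):
                # tensorized body: one 16x4 tile per (i, jo, ko)
                C[i, cols] += dot_16x4(X[i, ko * RED:(ko + 1) * RED], packedW[jo, ko])
    return C


rng = np.random.default_rng(0)
X = rng.integers(0, 255, size=(128, 128), dtype=np.uint8)
W = rng.integers(-128, 127, size=(128, 128), dtype=np.int8)
C = scheduled_matmul(X, pack_weight(W))
print(np.array_equal(C, X.astype(np.int32) @ W.astype(np.int32).T))  # True
```

Putting ko outside ji is what leaves each (ji, ki) tile as a self-contained 16×4 update that one intrinsic call can cover. Since the tile only accumulates into C, the zero-initialization presumably has to be hoisted out once, above the ko loop, which is the role decompose_reduction plays here.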
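Now we need to map the TIR onto VNNI_DOT_16x4_INTRIN. For the (128,128)×(128,128) matrix multiplication,
_, ji = sch.split(j, factors=[None, 16])
splits j, the column loop of matrix B, so the inner loop ji has length 16. B has 128 columns and 128/16 = 8, so B is partitioned column-wise into 8 sub-matrices, each 16 columns wide.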
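The loop extents after the two splits follow from simple arithmetic: in `tir.Schedule.split`, a `None` factor is inferred from the loop extent. The plain-Python sketch below recomputes those extents for the 128×128×128 case and counts how many 16×4 intrinsic calls the matmul turns into; the variable names are illustrative only.

```python
M = N = K = 128
JI, KI = 16, 4  # split factors from the test: j -> (jo, 16), k -> (ko, 4)

jo = N // JI  # inferred extent of the outer column loop: 8 blocks of B's columns
ko = K // KI  # inferred extent of the outer reduction loop: 32 chunks of 4

calls = M * jo * ko          # one intrinsic call per (i, jo, ko)
macs_per_call = JI * KI      # 16 lanes x 4 products each

print(jo, ko, calls)                       # 8 32 32768
print(calls * macs_per_call == M * N * K)  # True
```

The last check confirms that the 32768 calls, at 64 multiply-accumulates each, cover exactly the 128³ multiply-accumulates of the original matmul.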
# from tvm.script import tir as T
@tvm.script.ir_module
class Module:
    @T.prim_func
    def main(X: T.Buffer[(128, 128), "uint8"], packedW: T.Buffer[