tvm tutorial (1.2)

cache writing

Following up on the previous post, which used array packing to optimize matrix multiplication, this post looks at the write-cache optimization (writing blocks of the output through a cache buffer).

There is quite a bit to say about how caches handle writes. There are two policies, one called write back and the other write through; for details see the link below:
https://blog.csdn.net/xingzhe22222/article/details/81988101

The writes back to matrix C are also strided, which is unfriendly to cache updates. So the idea here is to first store the computed results in a temporary buffer with sequential layout.

The TVM code is shown below. The vectorization and parallelization strategies are temporarily removed so that the generated IR is easier to read:

# Setup carried over from the previous post (sizes match the IR below)
import tvm
from tvm import te

M = K = N = 1024
bn = 32  # blocking factor
A = te.placeholder((M, K), name="A")
B = te.placeholder((K, N), name="B")
k = te.reduce_axis((0, K), name="k")

packedB = te.compute((N / bn, K, bn), lambda bigN, k, littleN: B[k, bigN * bn + littleN], name="packedB")
C = te.compute(
    (M, N),
    lambda x, y: te.sum(A[x, k] * packedB[y // bn, k, tvm.tir.indexmod(y, bn)], axis=k),
    name="C",
)
s = te.create_schedule(C.op)

# Allocate write cache
CC = s.cache_write(C, "global")

xo, yo, xi, yi = s[C].tile(C.op.axis[0], C.op.axis[1], bn, bn)

# Write cache is computed at yo
s[CC].compute_at(s[C], yo)

# New inner axes
xc, yc = s[CC].op.axis

(k,) = s[CC].op.reduce_axis
ko, ki = s[CC].split(k, factor=4)
s[CC].reorder(ko, xc, ki, yc)

print(tvm.lower(s, [A, B, C], simple_mode=True))

The generated IR is:

primfn(A_1: handle, B_1: handle, C_1: handle) -> ()
  attr = {"from_legacy_te_schedule": True, "global_symbol": "main", "tir.noalias": True}
  buffers = {A: Buffer(A_2: Pointer(float32), float32, [1024, 1024], []),
             C: Buffer(C_2: Pointer(float32), float32, [1024, 1024], []),
             B: Buffer(B_2: Pointer(float32), float32, [1024, 1024], [])}
  buffer_map = {A_1: A, B_1: B, C_1: C} {
  allocate(packedB: Pointer(global float32), float32, [1048576]), storage_scope = global;
  allocate(C.global: Pointer(global float32), float32, [1024]), storage_scope = global {
    for (bigN: int32, 0, 32) {
      for (k: int32, 0, 1024) {
        for (littleN: int32, 0, 32) {
          packedB[(((bigN*32768) + (k*32)) + littleN)] = (float32*)B_2[(((k*1024) + (bigN*32)) + littleN)]
        }
      }
    }
    for (x.outer: int32, 0, 32) {
      for (y.outer: int32, 0, 32) {
        for (x.c.init: int32, 0, 32) {
          for (y.c.init: int32, 0, 32) {
            C.global[((x.c.init*32) + y.c.init)] = 0f32
          }
        }
        for (k.outer: int32, 0, 256) {
          for (x.c: int32, 0, 32) {
            for (k.inner: int32, 0, 4) {
              for (y.c: int32, 0, 32) {
                C.global[((x.c*32) + y.c)] = ((float32*)C.global[((x.c*32) + y.c)] + ((float32*)A_2[((((x.outer*32768) + (x.c*1024)) + (k.outer*4)) + k.inner)]*(float32*)packedB[((((y.outer*32768) + (k.outer*128)) + (k.inner*32)) + y.c)]))
              }
            }
          }
        }
        for (x.inner: int32, 0, 32) {
          for (y.inner: int32, 0, 32) {
            C_2[((((x.outer*32768) + (x.inner*1024)) + (y.outer*32)) + y.inner)] = (float32*)C.global[((x.inner*32) + y.inner)]
          }
        }
      }
    }
  }
}

The s.cache_write call is described in the documentation as follows:

Create a cache write of original tensor, before storing into tensor.
This will mutate the body of the tensor. A new cache stage will created before feed into the tensor.
This function can be used to support data layout transformation. If there is a split/fuse/reorder on the data parallel axis of tensor before cache_write is called. The intermediate cache stores the data in the layout as the iteration order of leave axis. The data will be transformed back to the original layout in the original tensor. User can further call compute_inline to inline the original layout and keep the data stored in the transformed layout.

The first sentence of the documentation already captures the idea: "create a write cache of the original tensor, before storing into the tensor".
You can think of it as allocating a dedicated buffer for storing the output results (the CPU's hardware cache itself is not visible to the user).

(A rough paraphrase of the rest:)
This function can be used to support data layout transformation. If a split/fuse/reorder has been applied to the data-parallel axes of the tensor before cache_write is called, the intermediate cache stores the data in the layout given by the iteration order of the leaf axes. The data is then transformed back to the original layout when it is written to the original tensor. The user can further call compute_inline to inline the original-layout stage and keep the data stored in the transformed layout.

To summarize: this function can transform the data layout, for example from NCHW to NHWC.
If split/fuse/reorder operations are present, the data in the cache is laid out contiguously in iteration order and is transformed back to the original layout when it is written out.
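
As a rough illustration of that last point, here is a minimal sketch of my own (not from the post; the tensors X and D and the column-major reordering are made up for demonstration): reordering the parallel axes of the output before calling cache_write makes the cache stage adopt the reordered layout, while the final copy stage restores the original one.

import tvm
from tvm import te

n = 64
X = te.placeholder((n, n), name="X")
D = te.compute((n, n), lambda i, j: X[i, j] + 1.0, name="D")

s = te.create_schedule(D.op)
i, j = s[D].op.axis
s[D].reorder(j, i)               # change the leaf-axis iteration order first
DD = s.cache_write(D, "global")  # the cache stage follows the reordered order
# The lowered IR should show DD computed in the reordered layout and a final
# copy loop writing it back into D's original layout.
print(tvm.lower(s, [X, D], simple_mode=True))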

To compare, here is the generated IR when all of the schedule statements below this call are commented out:

primfn(A_1: handle, B_1: handle, C_1: handle) -> ()
  attr = {"from_legacy_te_schedule": True, "global_symbol": "main", "tir.noalias": True}
  buffers = {C: Buffer(C_2: Pointer(float32), float32, [1024, 1024], []),
             A: Buffer(A_2: Pointer(float32), float32, [1024, 1024], []),
             B: Buffer(B_2: Pointer(float32), float32, [1024, 1024], [])}
  buffer_map = {A_1: A, B_1: B, C_1: C} {
  allocate(packedB: Pointer(global float32), float32, [1048576]), storage_scope = global;
  allocate(C.global: Pointer(global float32), float32, [1048576]), storage_scope = global {
    for (bigN: int32, 0, 32) {
      for (k: int32, 0, 1024) {
        for (littleN: int32, 0, 32) {
          packedB[(((bigN*32768) + (k*32)) + littleN)] = (float32*)B_2[(((k*1024) + (bigN*32)) + littleN)]
        }
      }
    }
    for (x.c: int32, 0, 1024) {
      for (y.c: int32, 0, 1024) {
        C.global[((x.c*1024) + y.c)] = 0f32
        for (k_1: int32, 0, 1024) {
          C.global[((x.c*1024) + y.c)] = ((float32*)C.global[((x.c*1024) + y.c)] + ((float32*)A_2[((x.c*1024) + k_1)]*(float32*)packedB[(((floordiv(y.c, 32)*32768) + (k_1*32)) + floormod(y.c, 32))]))
        }
      }
    }
    for (x: int32, 0, 1024) {
      for (y: int32, 0, 1024) {
        C_2[((x*1024) + y)] = (float32*)C.global[((x*1024) + y)]
      }
    }
  }
}

If this statement (the cache_write) is removed as well, the IR becomes:

primfn(A_1: handle, B_1: handle, C_1: handle) -> ()
  attr = {"from_legacy_te_schedule": True, "global_symbol": "main", "tir.noalias": True}
  buffers = {A: Buffer(A_2: Pointer(float32), float32, [1024, 1024], []),
             C: Buffer(C_2: Pointer(float32), float32, [1024, 1024], []),
             B: Buffer(B_2: Pointer(float32), float32, [1024, 1024], [])}
  buffer_map = {A_1: A, B_1: B, C_1: C} {
  allocate(packedB: Pointer(global float32), float32, [1048576]), storage_scope = global {
    for (bigN: int32, 0, 32) {
      for (k: int32, 0, 1024) {
        for (littleN: int32, 0, 32) {
          packedB[(((bigN*32768) + (k*32)) + littleN)] = (float32*)B_2[(((k*1024) + (bigN*32)) + littleN)]
        }
      }
    }
    for (x: int32, 0, 1024) {
      for (y: int32, 0, 1024) {
        C_2[((x*1024) + y)] = 0f32
        for (k_1: int32, 0, 1024) {
          C_2[((x*1024) + y)] = ((float32*)C_2[((x*1024) + y)] + ((float32*)A_2[((x*1024) + k_1)]*(float32*)packedB[(((floordiv(y, 32)*32768) + (k_1*32)) + floormod(y, 32))]))
        }
      }
    }
  }
}

Stripping away the unrelated IR gives the following two snippets for comparison; the lower one is the version that uses s.cache_write:

    for (x: int32, 0, 1024) {
      for (y: int32, 0, 1024) {
        C_2[((x*1024) + y)] = 0f32
        for (k_1: int32, 0, 1024) {
          C_2[((x*1024) + y)] = ((float32*)C_2[((x*1024) + y)] + ((float32*)A_2[((x*1024) + k_1)]*(float32*)packedB[(((floordiv(y, 32)*32768) + (k_1*32)) + floormod(y, 32))]))
        }
      }
    }
  }
}
  allocate(C.global: Pointer(global float32), float32, [1048576]), storage_scope = global {
    for (x.c: int32, 0, 1024) {
      for (y.c: int32, 0, 1024) {
        C.global[((x.c*1024) + y.c)] = 0f32
        for (k_1: int32, 0, 1024) {
          C.global[((x.c*1024) + y.c)] = ((float32*)C.global[((x.c*1024) + y.c)] + ((float32*)A_2[((x.c*1024) + k_1)]*(float32*)packedB[(((floordiv(y.c, 32)*32768) + (k_1*32)) + floormod(y.c, 32))]))
        }
      }
    }
    for (x: int32, 0, 1024) {
      for (y: int32, 0, 1024) {
        C_2[((x*1024) + y)] = (float32*)C.global[((x*1024) + y)]
      }
    }
  }
}

It is clear that a temporary buffer C.global is created to hold the computation, and once the computation finishes the result is written back into the original tensor C_2.

Next, consider the compute_at function.
The official documentation describes it as:

Attach the stage at parent’s scope

That is, it attaches the stage to a particular loop axis within the parent stage's scope. Let's compare the IR generated with and without this statement (with the tile step added back).
The code is:

 s = te.create_schedule(C.op)
 CC = s.cache_write(C, "global")
 xo, yo, xi, yi = s[C].tile(C.op.axis[0], C.op.axis[1], bn, bn)
 s[CC].compute_at(s[C], yo)

The generated IR is:

primfn(A_1: handle, B_1: handle, C_1: handle) -> ()
  attr = {"from_legacy_te_schedule": True, "global_symbol": "main", "tir.noalias": True}
  buffers = {C: Buffer(C_2: Pointer(float32), float32, [1024, 1024], []),
             A: Buffer(A_2: Pointer(float32), float32, [1024, 1024], []),
             B: Buffer(B_2: Pointer(float32), float32, [1024, 1024], [])}
  buffer_map = {A_1: A, B_1: B, C_1: C} {
  allocate(packedB: Pointer(global float32), float32, [1048576]), storage_scope = global;
  allocate(C.global: Pointer(global float32), float32, [1024]), storage_scope = global {
    for (bigN: int32, 0, 32) {
      for (k: int32, 0, 1024) {
        for (littleN: int32, 0, 32) {
          packedB[(((bigN*32768) + (k*32)) + littleN)] = (float32*)B_2[(((k*1024) + (bigN*32)) + littleN)]
        }
      }
    }
    for (x.outer: int32, 0, 32) {
      for (y.outer: int32, 0, 32) {
        for (x.c: int32, 0, 32) {
          for (y.c: int32, 0, 32) {
            C.global[((x.c*32) + y.c)] = 0f32
            for (k_1: int32, 0, 1024) {
              C.global[((x.c*32) + y.c)] = ((float32*)C.global[((x.c*32) + y.c)] + ((float32*)A_2[(((x.outer*32768) + (x.c*1024)) + k_1)]*(float32*)packedB[(((y.outer*32768) + (k_1*32)) + y.c)]))
            }
          }
        }
        for (x.inner: int32, 0, 32) {
          for (y.inner: int32, 0, 32) {
            C_2[((((x.outer*32768) + (x.inner*1024)) + (y.outer*32)) + y.inner)] = (float32*)C.global[((x.inner*32) + y.inner)]
          }
        }
      }
    }
  }
}

When only tile is used, the code is:

 s = te.create_schedule(C.op)
 CC = s.cache_write(C, "global")
 xo, yo, xi, yi = s[C].tile(C.op.axis[0], C.op.axis[1], bn, bn)

The generated IR is:

primfn(A_1: handle, B_1: handle, C_1: handle) -> ()
  attr = {"from_legacy_te_schedule": True, "global_symbol": "main", "tir.noalias": True}
  buffers = {A: Buffer(A_2: Pointer(float32), float32, [1024, 1024], []),
             C: Buffer(C_2: Pointer(float32), float32, [1024, 1024], []),
             B: Buffer(B_2: Pointer(float32), float32, [1024, 1024], [])}
  buffer_map = {A_1: A, B_1: B, C_1: C} {
  allocate(packedB: Pointer(global float32), float32, [1048576]), storage_scope = global;
  allocate(C.global: Pointer(global float32), float32, [1048576]), storage_scope = global {
    for (bigN: int32, 0, 32) {
      for (k: int32, 0, 1024) {
        for (littleN: int32, 0, 32) {
          packedB[(((bigN*32768) + (k*32)) + littleN)] = (float32*)B_2[(((k*1024) + (bigN*32)) + littleN)]
        }
      }
    }
    for (x.c: int32, 0, 1024) {
      for (y.c: int32, 0, 1024) {
        C.global[((x.c*1024) + y.c)] = 0f32
        for (k_1: int32, 0, 1024) {
          C.global[((x.c*1024) + y.c)] = ((float32*)C.global[((x.c*1024) + y.c)] + ((float32*)A_2[((x.c*1024) + k_1)]*(float32*)packedB[(((floordiv(y.c, 32)*32768) + (k_1*32)) + floormod(y.c, 32))]))
        }
      }
    }
    for (x.outer: int32, 0, 32) {
      for (y.outer: int32, 0, 32) {
        for (x.inner: int32, 0, 32) {
          for (y.inner: int32, 0, 32) {
            C_2[((((x.outer*32768) + (x.inner*1024)) + (y.outer*32)) + y.inner)] = (float32*)C.global[((((x.outer*32768) + (x.inner*1024)) + (y.outer*32)) + y.inner)]
          }
        }
      }
    }
  }
}

Comparing the final part that writes C.global back into C_2, we can see that after calling s[CC].compute_at(s[C], yo), the cached computation and the write-back are nested inside C's computation, directly under the yo loop. As a consequence, C.global only needs to hold a single 32x32 block (1024 elements) instead of the whole 1024x1024 matrix, which is visible in the allocate statements of the two IR listings above.
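
For intuition, here is a plain-NumPy sketch of my own (not part of the post, and using smaller sizes so the pure-Python loops finish quickly) of the loop structure that the compute_at version implements: each bn x bn block of C is accumulated in a small contiguous buffer and then written back in one pass.

import numpy as np

M = N = K = 128   # smaller than the post's 1024 x 1024, just to keep it fast
bn = 32
A = np.random.rand(M, K).astype("float32")
B = np.random.rand(K, N).astype("float32")
C = np.zeros((M, N), dtype="float32")

for xo in range(M // bn):
    for yo in range(N // bn):
        # small contiguous buffer, playing the role of C.global
        c_block = np.zeros((bn, bn), dtype="float32")
        for kk in range(K):
            c_block += np.outer(A[xo * bn:(xo + 1) * bn, kk],
                                B[kk, yo * bn:(yo + 1) * bn])
        # write the finished block back to the real output in one pass
        C[xo * bn:(xo + 1) * bn, yo * bn:(yo + 1) * bn] = c_block

assert np.allclose(C, A @ B, rtol=1e-4)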

Finally, add back the vectorization, loop-unrolling, and parallelization logic. The TVM code is as follows:

s = te.create_schedule(C.op)

# Allocate write cache
CC = s.cache_write(C, "global")

xo, yo, xi, yi = s[C].tile(C.op.axis[0], C.op.axis[1], bn, bn)

# Write cache is computed at yo
s[CC].compute_at(s[C], yo)

# New inner axes
xc, yc = s[CC].op.axis

(k,) = s[CC].op.reduce_axis
ko, ki = s[CC].split(k, factor=4)
s[CC].reorder(ko, xc, ki, yc)
s[CC].unroll(ki)     # unroll the inner reduction loop
s[CC].vectorize(yc)  # vectorize along the inner output axis

x, y, z = s[packedB].op.axis
s[packedB].vectorize(z)  # vectorize the innermost packing axis
s[packedB].parallel(x)   # pack B in parallel across the outer axis

The resulting IR is:

primfn(A_1: handle, B_1: handle, C_1: handle) -> ()
  attr = {"from_legacy_te_schedule": True, "global_symbol": "main", "tir.noalias": True}
  buffers = {C: Buffer(C_2: Pointer(float32), float32, [1024, 1024], []),
             A: Buffer(A_2: Pointer(float32), float32, [1024, 1024], []),
             B: Buffer(B_2: Pointer(float32), float32, [1024, 1024], [])}
  buffer_map = {A_1: A, B_1: B, C_1: C} {
  allocate(packedB: Pointer(global float32x32), float32x32, [32768]), storage_scope = global;
  allocate(C.global: Pointer(global float32), float32, [1024]), storage_scope = global {
    for (bigN: int32, 0, 32) "parallel" {
      for (k: int32, 0, 1024) {
        packedB[ramp(((bigN*32768) + (k*32)), 1, 32)] = (float32x32*)B_2[ramp(((k*1024) + (bigN*32)), 1, 32)]
      }
    }
    for (x.outer: int32, 0, 32) {
      for (y.outer: int32, 0, 32) {
        for (x.c.init: int32, 0, 32) {
          C.global[ramp((x.c.init*32), 1, 32)] = broadcast(0f32, 32)
        }
        for (k.outer: int32, 0, 256) {
          for (x.c: int32, 0, 32) {
            C.global[ramp((x.c*32), 1, 32)] = ((float32x32*)C.global[ramp((x.c*32), 1, 32)] + (broadcast((float32*)A_2[(((x.outer*32768) + (x.c*1024)) + (k.outer*4))], 32)*(float32x32*)packedB[ramp(((y.outer*32768) + (k.outer*128)), 1, 32)]))
            C.global[ramp((x.c*32), 1, 32)] = ((float32x32*)C.global[ramp((x.c*32), 1, 32)] + (broadcast((float32*)A_2[((((x.outer*32768) + (x.c*1024)) + (k.outer*4)) + 1)], 32)*(float32x32*)packedB[ramp((((y.outer*32768) + (k.outer*128)) + 32), 1, 32)]))
            C.global[ramp((x.c*32), 1, 32)] = ((float32x32*)C.global[ramp((x.c*32), 1, 32)] + (broadcast((float32*)A_2[((((x.outer*32768) + (x.c*1024)) + (k.outer*4)) + 2)], 32)*(float32x32*)packedB[ramp((((y.outer*32768) + (k.outer*128)) + 64), 1, 32)]))
            C.global[ramp((x.c*32), 1, 32)] = ((float32x32*)C.global[ramp((x.c*32), 1, 32)] + (broadcast((float32*)A_2[((((x.outer*32768) + (x.c*1024)) + (k.outer*4)) + 3)], 32)*(float32x32*)packedB[ramp((((y.outer*32768) + (k.outer*128)) + 96), 1, 32)]))
          }
        }
        for (x.inner: int32, 0, 32) {
          for (y.inner: int32, 0, 32) {
            C_2[((((x.outer*32768) + (x.inner*1024)) + (y.outer*32)) + y.inner)] = (float32*)C.global[((x.inner*32) + y.inner)]
          }
        }
      }
    }
  }
}

Parallelization

Parallelization was already used in the earlier optimizations; the code is:

s = te.create_schedule(C.op)

# Allocate write cache
CC = s.cache_write(C, "global")

xo, yo, xi, yi = s[C].tile(C.op.axis[0], C.op.axis[1], bn, bn)

# Write cache is computed at yo
s[CC].compute_at(s[C], yo)

# Parallelize the outer loop over M
s[C].parallel(xo)

# New inner axes
xc, yc = s[CC].op.axis

(k,) = s[CC].op.reduce_axis
ko, ki = s[CC].split(k, factor=4)
s[CC].reorder(ko, xc, ki, yc)
s[CC].unroll(ki)
s[CC].vectorize(yc)

x, y, z = s[packedB].op.axis
s[packedB].vectorize(z)
s[packedB].parallel(x)

The only change to the schedule is a single extra directive on C: parallelizing the outer loop over M (xo). Note in the IR below that the C.global allocation now sits inside the parallel loop, so each thread gets its own private write cache.

The resulting IR is:

primfn(A_1: handle, B_1: handle, C_1: handle) -> ()
  attr = {"from_legacy_te_schedule": True, "global_symbol": "main", "tir.noalias": True}
  buffers = {C: Buffer(C_2: Pointer(float32), float32, [1024, 1024], []),
             A: Buffer(A_2: Pointer(float32), float32, [1024, 1024], []),
             B: Buffer(B_2: Pointer(float32), float32, [1024, 1024], [])}
  buffer_map = {A_1: A, B_1: B, C_1: C} {
  allocate(packedB: Pointer(global float32x32), float32x32, [32768]), storage_scope = global {
    for (bigN: int32, 0, 32) "parallel" {
      for (k: int32, 0, 1024) {
        packedB[ramp(((bigN*32768) + (k*32)), 1, 32)] = (float32x32*)B_2[ramp(((k*1024) + (bigN*32)), 1, 32)]
      }
    }
    for (x.outer: int32, 0, 32) "parallel" {
      allocate(C.global: Pointer(global float32), float32, [1024]), storage_scope = global;
      for (y.outer: int32, 0, 32) {
        for (x.c.init: int32, 0, 32) {
          C.global[ramp((x.c.init*32), 1, 32)] = broadcast(0f32, 32)
        }
        for (k.outer: int32, 0, 256) {
          for (x.c: int32, 0, 32) {
            C.global[ramp((x.c*32), 1, 32)] = ((float32x32*)C.global[ramp((x.c*32), 1, 32)] + (broadcast((float32*)A_2[(((x.outer*32768) + (x.c*1024)) + (k.outer*4))], 32)*(float32x32*)packedB[ramp(((y.outer*32768) + (k.outer*128)), 1, 32)]))
            C.global[ramp((x.c*32), 1, 32)] = ((float32x32*)C.global[ramp((x.c*32), 1, 32)] + (broadcast((float32*)A_2[((((x.outer*32768) + (x.c*1024)) + (k.outer*4)) + 1)], 32)*(float32x32*)packedB[ramp((((y.outer*32768) + (k.outer*128)) + 32), 1, 32)]))
            C.global[ramp((x.c*32), 1, 32)] = ((float32x32*)C.global[ramp((x.c*32), 1, 32)] + (broadcast((float32*)A_2[((((x.outer*32768) + (x.c*1024)) + (k.outer*4)) + 2)], 32)*(float32x32*)packedB[ramp((((y.outer*32768) + (k.outer*128)) + 64), 1, 32)]))
            C.global[ramp((x.c*32), 1, 32)] = ((float32x32*)C.global[ramp((x.c*32), 1, 32)] + (broadcast((float32*)A_2[((((x.outer*32768) + (x.c*1024)) + (k.outer*4)) + 3)], 32)*(float32x32*)packedB[ramp((((y.outer*32768) + (k.outer*128)) + 96), 1, 32)]))
          }
        }
        for (x.inner: int32, 0, 32) {
          for (y.inner: int32, 0, 32) {
            C_2[((((x.outer*32768) + (x.inner*1024)) + (y.outer*32)) + y.inner)] = (float32*)C.global[((x.inner*32) + y.inner)]
          }
        }
      }
    }
  }
}

Summary

The te schedule provides a set of primitives for optimizing computation, such as multi-core parallelism, loop unrolling, and loop splitting; users can apply them by hand for different data sizes and shapes.

The example code demonstrates the workflow for optimizing with the te module:

  1. Describe the computation over the data with a series of operations, i.e. express the compute logic with lambda expressions.
  2. Describe how to compute it with schedule primitives, i.e. the optimization process above.
  3. Compile to the target platform and generate code (a minimal build-and-run sketch follows this list).
  4. If needed, save the generated IR and machine code.
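
As a hedged sketch of steps 3 and 4 (my own minimal example using standard TVM APIs; the target string, the file names, and the function name "mmult" are assumptions, not from the post), the schedule built above can be compiled, run, checked against NumPy, and exported:

import numpy as np
import tvm

# Build the scheduled GEMM for CPU (target string is an assumption)
func = tvm.build(s, [A, B, C], target="llvm", name="mmult")

# Run the compiled function and verify the result against NumPy
dev = tvm.cpu(0)
a_np = np.random.rand(M, K).astype("float32")
b_np = np.random.rand(K, N).astype("float32")
a = tvm.nd.array(a_np, dev)
b = tvm.nd.array(b_np, dev)
c = tvm.nd.array(np.zeros((M, N), dtype="float32"), dev)
func(a, b, c)
np.testing.assert_allclose(c.numpy(), a_np @ b_np, rtol=1e-5)

# Optionally save the lowered IR and the compiled machine code
with open("gemm_ir.txt", "w") as f:           # hypothetical file name
    f.write(str(tvm.lower(s, [A, B, C], simple_mode=True)))
func.export_library("gemm.so")                # hypothetical file name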