iree 编译流程(2)——buildGlobalOptimizationPassPipeline

11 篇文章 0 订阅
3 篇文章 0 订阅

buildGlobalOptimizationPassPipeline

  • IREE::Util::createSimplifyGlobalAccessesPass
    这个pass主要做这几件事:

    • 将不可变global tensor的 load 提前到了 block 的开头,将global tensor的 store 安全地挪到 block 的结尾。
    • 进行以下化简:
      • 如果load after store,则把 load 直接替换成 store 的 source。比如,
      store %0, @p
      %1 = load @p
      return %1
      
      转换成,
      store %0, @p
      return %0
      
      • 如果store after store,则直接消除前一个 store
      store %0, @p
      store %1, @p
      
      转换成,
      store %1, @p
      
      • 如果load after load,则消除后一个 load
      %0 = load @p
      %1 = load @p
      return %1
      
      转换成,
      %0 = load @p
      return %0
      
  • IREE::Util::createApplyPatternsPass
    执行IREE::Util dialect ODS中定义的Canonicalization Patterns,并执行 block 和跳转命令参数化简操作。

    • block 参数化简
    br ^bb1(%0, %0 : index, index)
    ^bb1(%arg0: index, %arg1: index):
      ...
    

    折叠相同的参数,化简为

    br ^bb1(%0 : index)
    ^bb1(%arg0: index):  // %arg1 remapped to %arg0
      ...
    
    • 跳转命令参数消除
    func.func @foo(%arg0: index) {
      br ^bb1(%arg0 : index)
      ^bb1(%0: index):
        ...
    }
    

    消除参数后,

    func.func @foo(%arg0: index) {
      br ^bb1
      ^bb1:  // %0 remapped to %arg0
        ...
    }
    
  • IREE::Util::createFoldGlobalsPass
    这个 pass 继续对global tensor的 load 和 store 操作进行优化,主要包括:

    • 内联常量 store,比如
    util.global mutable @a : i32
    func.func @fool {
      %c5 = arith.constant 5 : i32
      util.global.store %c5, @a : i32
      return
    }
    

    转换成,

    util.global @a = 5 : i32
    
    • 內联常量 load,比如
    util.global @a = 5 : i32
    func.func @fool {
      %1 = util.global.load @a : i32
      ...
    }
    

    转换成,

    func.func @fool {
      %1 = arith.constant 5 : i32
      ...
    }
    
    • 重命名互为链式的global tensor
    • 如果一个mutable global tensor只在 init 函数中被 store 过,则将它修改为 immutable。
    • 删除没有 load 过的global tensor
    • 合并相同初始值的immutable global tensor
  • IREE::Flow::createTensorPadToTensorInsertSlicePass
    tensor.pad转换为linalg.fill + tensor.insert_slice

    func.func @foo(%arg0: !hal.buffer_view) -> !hal.buffer_view attributes {iree.abi.stub} {
      %cst = arith.constant 0.000000e+00 : f32
      %0 = hal.tensor.import %arg0 : !hal.buffer_view -> tensor<1x1xf32>
      %padded = tensor.pad %0 low[1, 2] high[3, 4] {
      ^bb0(%arg1: index, %arg2: index):
        tensor.yield %cst : f32
      } : tensor<1x1xf32> to tensor<5x7xf32>
      %1 = hal.tensor.export %padded : tensor<5x7xf32> -> !hal.buffer_view
      return %1 : !hal.buffer_view
    }
    

    转换为,

    func.func @foo(%arg0: !hal.buffer_view) -> !hal.buffer_view attributes {iree.abi.stub} {
      %cst = arith.constant 0.000000e+00 : f32
      %0 = hal.tensor.import %arg0 : !hal.buffer_view -> tensor<1x1xf32>
      %1 = tensor.empty() : tensor<5x7xf32>
      %2 = linalg.fill ins(%cst : f32) outs(%1 : tensor<5x7xf32>) -> tensor<5x7xf32>
      %inserted_slice = tensor.insert_slice %0 into %2[1, 2] [1, 1] [1, 1] : tensor<1x1xf32> into tensor<5x7xf32>
      %3 = hal.tensor.export %inserted_slice : tensor<5x7xf32> -> !hal.buffer_view
      return %3 : !hal.buffer_view
    }
    
  • mlir::createConvertElementwiseToLinalgPass
    把 elementwise 算子(带有Elementwise traits的 op)转换成linalg generic op,方便后续对elementwise op做算子融合。arith dialectmath dialect的 op 都是 Elementwise 的,所以实际上这个 pass 会把arith dialectmath dialect lowerlinalg dialect

    func.func @foo(%arg0: !hal.buffer_view) -> !hal.buffer_view attributes {iree.abi.stub} {
      %0 = hal.tensor.import %arg0 : !hal.buffer_view -> tensor<2x3xf32>
      %1 = arith.addf %0, %0 : tensor<2x3xf32>
      %2 = hal.tensor.export %1 : tensor<2x3xf32> -> !hal.buffer_view
      return %2 : !hal.buffer_view
    }
    

    转换成,

    func.func @foo(%arg0: !hal.buffer_view) -> !hal.buffer_view attributes {iree.abi.stub} {
      %0 = hal.tensor.import %arg0 : !hal.buffer_view -> tensor<2x3xf32>
      %1 = linalg.generic {indexing_maps = [affine_map<(d0, d1) -> (d0, d1)>, affine_map<(d0, d1) -> (d0, d1)>, affine_map<(d0, d1) -> (d0, d1)>], iterator_types = ["parallel", "parallel"]} ins(%0, %0 : tensor<2x3xf32>, tensor<2x3xf32>) outs(%0 : tensor<2x3xf32>) {
      ^bb0(%in: f32, %in_0: f32, %out: f32):
        %3 = arith.addf %in, %in_0 : f32
        linalg.yield %3 : f32
      } -> tensor<2x3xf32>
      %2 = hal.tensor.export %1 : tensor<2x3xf32> -> !hal.buffer_view
      return %2 : !hal.buffer_view
    }
    
  • mlir::createLinalgFoldUnitExtentDimsPass
    消除长度为 的维度或者循环。

    func.func @foo(%arg0: !hal.buffer_view) -> !hal.buffer_view attributes {iree.abi.stub} {
      %0 = hal.tensor.import %arg0 : !hal.buffer_view -> tensor<1x3xf32>
      %1 = linalg.generic {indexing_maps = [affine_map<(d0, d1) -> (d0, d1)>, affine_map<(d0, d1) -> (d0, d1)>], iterator_types = ["parallel", "parallel"]} ins(%0 : tensor<1x3xf32>) outs(%0 : tensor<1x3xf32>) {
      ^bb0(%in: f32, %out: f32):
        %3 = arith.addf %in, %in : f32
        linalg.yield %3 : f32
      } -> tensor<1x3xf32>
      %2 = hal.tensor.export %1 : tensor<1x3xf32> -> !hal.buffer_view
      return %2 : !hal.buffer_view
    }
    

    转换成,

    func.func @foo(%arg0: !hal.buffer_view) -> !hal.buffer_view attributes {iree.abi.stub} {
      %0 = hal.tensor.import %arg0 : !hal.buffer_view -> tensor<1x3xf32>
      %collapsed = tensor.collapse_shape %0 [[0, 1]] : tensor<1x3xf32> into tensor<3xf32>
      %collapsed_0 = tensor.collapse_shape %0 [[0, 1]] : tensor<1x3xf32> into tensor<3xf32>
      %1 = linalg.generic {indexing_maps = [affine_map<(d0) -> (d0)>, affine_map<(d0) -> (d0)>], iterator_types = ["parallel"]} ins(%collapsed : tensor<3xf32>) outs(%collapsed_0 : tensor<3xf32>) {
      ^bb0(%in: f32, %out: f32):
        %3 = arith.addf %in, %in : f32
        linalg.yield %3 : f32
      } -> tensor<3xf32>
      %expanded = tensor.expand_shape %1 [[0, 1]] : tensor<3xf32> into tensor<1x3xf32>
      %2 = hal.tensor.export %expanded : tensor<1x3xf32> -> !hal.buffer_view
      return %2 : !hal.buffer_view
    }
    

    linalg.generic由 2 层循环缩减成了单层循环

  • createInterchangeGenericOpsPass
    循环维度变换。将 reduction 循环维度交换到最内层,相应的 parallel 循环维度被交换到外层。

    // sum(%arg0: tensor<2x3xf32>, 0) -> tensor<3xf32>
    func.func @foo(%arg0: !hal.buffer_view) -> !hal.buffer_view attributes {iree.abi.stub} {
      %cst = arith.constant 0.000000e+00 : f32
      %0 = hal.tensor.import %arg0 : !hal.buffer_view -> tensor<2x3xf32>
      %1 = tensor.empty() : tensor<3xf32>
      %2 = linalg.fill ins(%cst : f32) outs(%1 : tensor<3xf32>) -> tensor<3xf32>
      %3 = linalg.generic {indexing_maps = [affine_map<(d0, d1) -> (d0, d1)>, affine_map<(d0, d1) -> (d1)>], iterator_types = ["reduction", "parallel"]} ins(%0 : tensor<2x3xf32>) outs(%2 : tensor<3xf32>) {
      ^bb0(%in: f32, %out: f32):
        %5 = arith.addf %in, %out : f32
        linalg.yield %5 : f32
      } -> tensor<3xf32>
      %4 = hal.tensor.export %3 : tensor<3xf32> -> !hal.buffer_view
      return %4 : !hal.buffer_view
    }
    

    交换循环之后转换成,

    func.func @foo(%arg0: !hal.buffer_view) -> !hal.buffer_view attributes {iree.abi.stub} {
      %cst = arith.constant 0.000000e+00 : f32
      %0 = hal.tensor.import %arg0 : !hal.buffer_view -> tensor<2x3xf32>
      %1 = tensor.empty() : tensor<3xf32>
      %2 = linalg.fill ins(%cst : f32) outs(%1 : tensor<3xf32>) -> tensor<3xf32>
      %3 = linalg.generic {indexing_maps = [affine_map<(d0, d1) -> (d1, d0)>, affine_map<(d0, d1) -> (d0)>], iterator_types = ["parallel", "reduction"]} ins(%0 : tensor<2x3xf32>) outs(%2 : tensor<3xf32>) {
      ^bb0(%in: f32, %out: f32):
        %5 = arith.addf %in, %out : f32
        linalg.yield %5 : f32
      } -> tensor<3xf32>
      %4 = hal.tensor.export %3 : tensor<3xf32> -> !hal.buffer_view
      return %4 : !hal.buffer_view
    }
    
  • memref::createResolveShapedTypeResultDimsPass

  • mlir::createCanonicalizerPass

  • mlir::createCSEPass

  • createFusionOfTensorOpsPass
    主要做 elementwise 的算子融合,其次也会将tensor.expand_shape转换成linalg generic op,方便进行算子融合。

    elementwise 算子融合的条件:

    • producer 和 comsumer 都是linalg generic op,且都为 tensor 语义。
    • producer 只有一个 user。
    • producer 所有维度的迭代类型都是 parallel,consumer 的 index map 必须和 producer 具有相同的循环嵌套层数。
    • producer 结果的 index map 必须是 Permutation,即结果的每个元素有且仅 store 一次(输出是 pointwise 的)。
    • consumer 可以包含 reduction 迭代类型,但需要保证融合后输入的 index map 可以覆盖每一个迭代维度,理由是如果缺失就无法确定该维度的循环边界。
    // reduce(mul(arg0, arg1), 0)
    // for (int d0 = 0; d0 < n; ++d0) {
    //   temp[d0] = arg0[d0] * arg1[d0];
    // }
    // result = 0;
    // for (int d0 = 0; d0 < n; ++d0) {
    //   result += temp[d0];
    // }
    func.func @foo(%arg0: !hal.buffer_view, %arg1: !hal.buffer_view) -> !hal.buffer_view attributes {iree.abi.stub} {
      %cst = arith.constant 0.000000e+00 : f32
      %0 = hal.tensor.import %arg0 : !hal.buffer_view -> tensor<2xf32>
      %1 = hal.tensor.import %arg1 : !hal.buffer_view -> tensor<2xf32>
      %2 = tensor.empty() : tensor<2xf32>
      %3 = linalg.generic {indexing_maps = [affine_map<(d0) -> (d0)>, affine_map<(d0) -> (d0)>, affine_map<(d0) -> (d0)>], iterator_types = ["parallel"]} ins(%0, %1 : tensor<2xf32>, tensor<2xf32>) outs(%2 : tensor<2xf32>) {
      ^bb0(%in: f32, %in_0: f32, %out: f32):
        %8 = arith.mulf %in, %in_0 : f32
        linalg.yield %8 : f32
      } -> tensor<2xf32>
      %4 = tensor.empty() : tensor<f32>
      %5 = linalg.fill ins(%cst : f32) outs(%4 : tensor<f32>) -> tensor<f32>
      %6 = linalg.generic {indexing_maps = [affine_map<(d0) -> (d0)>, affine_map<(d0) -> ()>], iterator_types = ["reduction"]} ins(%3 : tensor<2xf32>) outs(%5 : tensor<f32>) {
      ^bb0(%in: f32, %out: f32):
        %8 = arith.addf %in, %out : f32
        linalg.yield %8 : f32
      } -> tensor<f32>
      %7 = hal.tensor.export %6 : tensor<f32> -> !hal.buffer_view
      return %7 : !hal.buffer_view
    }
    

    融合mul和reduce之后转换成,

    // result = 0;
    // for (int d0 = 0; d0 < n; ++d0) {
    //   result += arg0[d0] * arg1[d0];
    // }
    func.func @foo(%arg0: !hal.buffer_view, %arg1: !hal.buffer_view) -> !hal.buffer_view attributes {iree.abi.stub} {
      %cst = arith.constant 0.000000e+00 : f32
      %0 = hal.tensor.import %arg0 : !hal.buffer_view -> tensor<2xf32>
      %1 = hal.tensor.import %arg1 : !hal.buffer_view -> tensor<2xf32>
      %2 = tensor.empty() : tensor<f32>
      %3 = linalg.fill ins(%cst : f32) outs(%2 : tensor<f32>) -> tensor<f32>
      %4 = linalg.generic {indexing_maps = [affine_map<(d0) -> (d0)>, affine_map<(d0) -> (d0)>, affine_map<(d0) -> ()>], iterator_types = ["reduction"]} ins(%0, %1 : tensor<2xf32>, tensor<2xf32>) outs(%3 : tensor<f32>) {
      ^bb0(%in: f32, %in_0: f32, %out: f32):
        %6 = arith.mulf %in, %in_0 : f32
        %7 = arith.addf %6, %out : f32
        linalg.yield %7 : f32
      } -> tensor<f32>
      %5 = hal.tensor.export %4 : tensor<f32> -> !hal.buffer_view
      return %5 : !hal.buffer_view
    }
    
  • mlir::createLinalgDetensorizePass
    将 0-D Tensor 转换为它的基础元素类型。

  • mlir::createCanonicalizerPass

  • mlir::createCSEPass

  • createSplitReductionPass
    将 matmul 和 topk 的单次 reduce 分成两次 reduce 操作(一次 batch matmul 和一次 add)。默认不开启,设置--iree-flow-split-matmul-reduction>=2可开启。

    func.func @test(%arg0: !hal.buffer_view, %arg1: !hal.buffer_view) -> !hal.buffer_view attributes {iree.abi.stub} {
      %cst = arith.constant 0.000000e+00 : f32
      %0 = hal.tensor.import %arg0 : !hal.buffer_view -> tensor<128x256xf32>
      %1 = hal.tensor.import %arg1 : !hal.buffer_view -> tensor<256x256xf32>
      %2 = linalg.init_tensor [128, 256] : tensor<128x256xf32>
      %3 = linalg.fill ins(%cst : f32) outs(%2 : tensor<128x256xf32>) -> tensor<128x256xf32>
      %4 = linalg.matmul ins(%0, %1 : tensor<128x256xf32>, tensor<256x256xf32>) outs(%3 : tensor<128x256xf32>) -> tensor<128x256xf32>
      %5 = hal.tensor.export %4 : tensor<128x256xf32> -> !hal.buffer_view
      return %5 : !hal.buffer_view
    }
    

    --iree-flow-split-matmul-reduction=2转换成,

    func.func @test(%arg0: !hal.buffer_view, %arg1: !hal.buffer_view) -> !hal.buffer_view attributes {iree.abi.stub} {
      %cst = arith.constant 0.000000e+00 : f32
      %0 = hal.tensor.import %arg0 : !hal.buffer_view -> tensor<128x256xf32>
      %1 = hal.tensor.import %arg1 : !hal.buffer_view -> tensor<256x256xf32>
      %2 = linalg.init_tensor [128, 256] : tensor<128x256xf32>
      %3 = linalg.fill ins(%cst : f32) outs(%2 : tensor<128x256xf32>) -> tensor<128x256xf32>
      %4 = tensor.expand_shape %0 [[0], [1, 2]] : tensor<128x256xf32> into tensor<128x2x128xf32>
      %5 = tensor.expand_shape %1 [[0, 1], [2]] : tensor<256x256xf32> into tensor<2x128x256xf32>
      %6 = linalg.init_tensor [2, 128, 256] : tensor<2x128x256xf32>
      %7 = linalg.fill ins(%cst : f32) outs(%6 : tensor<2x128x256xf32>) -> tensor<2x128x256xf32>
      %8 = linalg.generic {indexing_maps = [affine_map<(d0, d1, d2, d3) -> (d1, d0, d3)>, affine_map<(d0, d1, d2, d3) -> (d0, d3, d2)>, affine_map<(d0, d1, d2, d3) -> (d0, d1, d2)>], iterator_types = ["parallel", "parallel", "parallel", "reduction"]} ins(%4, %5 : tensor<128x2x128xf32>, tensor<2x128x256xf32>) outs(%7 : tensor<2x128x256xf32>) attrs =  {__internal_linalg_transform__ = "SPLIT", linalg.memoized_indexing_maps = [affine_map<(d0, d1, d2) -> (d0, d2)>, affine_map<(d0, d1, d2) -> (d2, d1)>, affine_map<(d0, d1, d2) -> (d0, d1)>]} {
      ^bb0(%arg2: f32, %arg3: f32, %arg4: f32):
        %11 = arith.mulf %arg2, %arg3 : f32
        %12 = arith.addf %arg4, %11 : f32
        linalg.yield %12 : f32
      } -> tensor<2x128x256xf32>
      %9 = linalg.generic {indexing_maps = [affine_map<(d0, d1, d2) -> (d0, d1, d2)>, affine_map<(d0, d1, d2) -> (d1, d2)>], iterator_types = ["reduction", "parallel", "parallel"]} ins(%8 : tensor<2x128x256xf32>) outs(%3 : tensor<128x256xf32>) attrs =  {__internal_linalg_transform__ = "SPLIT"} {
      ^bb0(%arg2: f32, %arg3: f32):
        %11 = arith.addf %arg2, %arg3 : f32
        linalg.yield %11 : f32
      } -> tensor<128x256xf32>
      %10 = hal.tensor.export %9 : tensor<128x256xf32> -> !hal.buffer_view
      return %10 : !hal.buffer_view
    }
    
  • createInterchangeGenericOpsPass
    循环维度变换。将 reduction 循环维度交换到最内层,相应的 parallel 循环维度被交换到外层。

  • createInterchangeTransposeGenericOpsPass
    当输入 indexing map 是 permutation 时,交换循环维度使得输入的 indexing map 是 identity 的,其作用是使得输入尽可能变成连续访存。

  • createDispatchWithTransformDialect
    根据transform dialect对算子进行调度和派遣,需要另外加载一个transform dialect的 module 文件,默认不做该变换。transform dialect定义了一套调度规则,用于引导目标 IR 进行变换,比如循环展开、tiling 等。

  • createFormDispatchRegionsPass
    以包含reduction looplinalg opnamed linalg op为中心(root),按一定规则合并 producers 和 comsumers,划分出dispatch region子图。dispatch region是 IREE 中的原子执行单元,dispatch region内部可以直接复用输入和输出的内存,从而避免了内部的内存分配操作,内存分配只发生在dispatch region的边界,同时dispatch region之间会自动插入同步操作。

    func.func @predict(%arg0: !hal.buffer_view, %arg1: !hal.buffer_view, %arg2: !hal.buffer_view) -> !hal.buffer_view attributes {iree.abi.stub} {
      %cst = arith.constant 0.000000e+00 : f32
      %0 = hal.tensor.import %arg0 : !hal.buffer_view -> tensor<2x10xf32>
      %1 = hal.tensor.import %arg1 : !hal.buffer_view -> tensor<10x5xf32>
      %2 = hal.tensor.import %arg2 : !hal.buffer_view -> tensor<5xf32>
      %3 = tensor.empty() : tensor<2x5xf32>
      %4 = linalg.fill ins(%cst : f32) outs(%3 : tensor<2x5xf32>) -> tensor<2x5xf32>
      %5 = linalg.matmul ins(%0, %1 : tensor<2x10xf32>, tensor<10x5xf32>) outs(%4 : tensor<2x5xf32>) -> tensor<2x5xf32>
      %6 = linalg.generic {indexing_maps = [affine_map<(d0, d1) -> (d0, d1)>, affine_map<(d0, d1) -> (d1)>, affine_map<(d0, d1) -> (d0, d1)>], iterator_types = ["parallel", "parallel"]} ins(%5, %2 : tensor<2x5xf32>, tensor<5xf32>) outs(%3 : tensor<2x5xf32>) {
      ^bb0(%in: f32, %in_0: f32, %out: f32):
        %8 = arith.addf %in, %in_0 : f32
        linalg.yield %8 : f32
      } -> tensor<2x5xf32>
      %7 = hal.tensor.export %6 : tensor<2x5xf32> -> !hal.buffer_view
      return %7 : !hal.buffer_view
    }
    

    转换成,

    func.func @predict(%arg0: !hal.buffer_view, %arg1: !hal.buffer_view, %arg2: !hal.buffer_view) -> !hal.buffer_view attributes {iree.abi.stub} {
      %cst = arith.constant 0.000000e+00 : f32
      %0 = hal.tensor.import %arg0 : !hal.buffer_view -> tensor<2x10xf32>
      %1 = hal.tensor.import %arg1 : !hal.buffer_view -> tensor<10x5xf32>
      %2 = hal.tensor.import %arg2 : !hal.buffer_view -> tensor<5xf32>
      %3 = tensor.empty() : tensor<2x5xf32>
      %4 = linalg.fill ins(%cst : f32) outs(%3 : tensor<2x5xf32>) -> tensor<2x5xf32>
      %c1 = arith.constant 1 : index
      %c0 = arith.constant 0 : index
      %c2 = arith.constant 2 : index
      %c1_0 = arith.constant 1 : index
      %5 = affine.apply affine_map<()[s0, s1, s2] -> ((s1 - s0) ceildiv s2)>()[%c0, %c2, %c1_0]
      %c0_1 = arith.constant 0 : index
      %c5 = arith.constant 5 : index
      %c1_2 = arith.constant 1 : index
      %6 = affine.apply affine_map<()[s0, s1, s2] -> ((s1 - s0) ceildiv s2)>()[%c0_1, %c5, %c1_2]
      %7 = flow.dispatch.region[%5, %6] -> (tensor<2x5xf32>) {
        %9 = linalg.matmul ins(%0, %1 : tensor<2x10xf32>, tensor<10x5xf32>) outs(%4 : tensor<2x5xf32>) -> tensor<2x5xf32>
        %10 = linalg.generic {indexing_maps = [affine_map<(d0, d1) -> (d0, d1)>, affine_map<(d0, d1) -> (d1)>, affine_map<(d0, d1) -> (d0, d1)>], iterator_types = ["parallel", "parallel"]} ins(%9, %2 : tensor<2x5xf32>, tensor<5xf32>) outs(%3 : tensor<2x5xf32>) {
        ^bb0(%in: f32, %in_3: f32, %out: f32):
          %11 = arith.addf %in, %in_3 : f32
          linalg.yield %11 : f32
        } -> tensor<2x5xf32>
        flow.return %10 : tensor<2x5xf32>
      } count(%arg3: index, %arg4: index) -> (index, index, index) {
        %x, %y, %z = flow.dispatch.workgroup_count_from_dag_root %arg3, %arg4
        flow.return %x, %y, %z : index, index, index
      }
      %8 = hal.tensor.export %7 : tensor<2x5xf32> -> !hal.buffer_view
      return %8 : !hal.buffer_view
    }
    
  • createFormDispatchWorkgroupsPass
    dispatch region转换成dispatch work group的形式,并将 cloneable 的 op(比如tensor.filltensor.empty等)拷贝到 work group 中。如果在linalg层做了tiling,该 pass 也会把tiling引入的tensor.extract_slicetensor.insert_slice尽可能转换成flow.tensor.slice和flow.tensor.update,转换不了的后续再转换成flow.dispatch.tensor.loadflow.dispatch.tensor.store

  • createCaptureDispatchDynamicDimsPass
    由于flow.dispatch.workgroups的参数中动态形状 tensor 被替换成了!flow.dispatch.tensor和相应的动态维度 index,该 pass 捕获 workgroups 参数中的动态维度 index,插入flow.dispatch.tie_shape将参数中的动态维度 index 和!flow.dispatch.tensor进行绑定。

  • mlir::createCanonicalizerPass

  • createCSEPass

  • createInitializeEmptyTensorsPass
    如果tensor.empty op的 user 中存在非 linalg 或 IREE LinalgExt op,则把该tensor.empty op转换成flow.tensor.emptyflow.tensor.splat op

  • IREE::Flow::createOutlineDispatchRegionsPass
    把每个dispatch region转换成flow.executable + flow.dispatch op

  • IREE::Util::createStripDebugOpsPass
    消除DebugOnly op。

  • mlir::createCanonicalizerPass

  • IREE::Flow::createDeduplicateExecutablesPass
    消除重复的flow.executable

  • IREE::Flow::createInjectDispatchTracingPass
    注入跟踪运行时 dispatch 函数输入和输出信息的 op。默认不开启。

  • IREE::Flow::createCleanupTensorShapesPass
    删除flow.tensor.tie_shape op,并确认 module 中不再包含tensor.dimtensor.rank这两类形状查询 op。

  • mlir::createCanonicalizerPass

  • mlir::createCSEPass

  • mlir::createCanonicalizerPass

  • mlir::createCSEPass

  • mlir::createSymbolDCEPass

未完待续…

  • 11
    点赞
  • 11
    收藏
    觉得还不错? 一键收藏
  • 0
    评论
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值