IREE Compilation Flow (2): buildGlobalOptimizationPassPipeline

哦豁灬

Published on 2024-07-29 13:46:23


Article link: https://blog.csdn.net/qq_38342510/article/details/140759512


buildGlobalOptimizationPassPipeline

IREE::Util::createSimplifyGlobalAccessesPass
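This pass mainly does the following:
- Hoists loads of immutable global tensors to the beginning of the block, and safely sinks stores to global tensors to the end of the block.
- Applies the following simplifications:
  - Load after store: the load is replaced directly by the source of the store. For example,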
```
store %0, @p
%1 = load @p
return %1
```
  becomes
```
store %0, @p
return %0
```
  - Store after store: the earlier store is eliminated. For example,
```
store %0, @p
store %1, @p
```
  becomes
```
store %1, @p
```
  - Load after load: the later load is eliminated. For example,
```
%0 = load @p
%1 = load @p
return %1
```
  becomes
```
%0 = load @p
return %0
```
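The three rewrites above can be sketched on a toy, single-block IR. This is only a minimal illustration under stated assumptions: the tuple encoding of ops and the name `simplify_block` are hypothetical, the block contains nothing but loads, stores and a return, and it is not how IREE implements the pass.

```python
from typing import Dict, List, Optional, Tuple

# Toy IR, hypothetical: ("store", value, global), ("load", result, global),
# or ("return", value, "").
Op = Tuple[str, str, str]


def simplify_block(ops: List[Op]) -> List[Op]:
    known: Dict[str, str] = {}       # global -> SSA value known to hold its contents
    last_store: Dict[str, int] = {}  # global -> position of its latest store in `out`
    replaced: Dict[str, str] = {}    # removed load result -> replacement value
    out: List[Optional[Op]] = []
    for kind, value, target in ops:
        value = replaced.get(value, value)  # rewrite uses of removed loads
        if kind == "store":
            if target in last_store:
                out[last_store[target]] = None  # store after store: drop the earlier one
            last_store[target] = len(out)
            known[target] = value
            out.append((kind, value, target))
        elif kind == "load" and target in known:
            replaced[value] = known[target]  # load after store / load after load
        else:
            if kind == "load":
                known[target] = value  # first load: later loads can reuse its result
            out.append((kind, value, target))
    return [op for op in out if op is not None]
```

Feeding it the three "before" snippets (encoded as tuples) yields the corresponding "after" snippets.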

IREE::Util::createApplyPatternsPass
Runs the canonicalization patterns defined in the ODS of the IREE::Util dialect, and simplifies block arguments and branch operands.

Block argument simplification

br ^bb1(%0, %0 : index, index)
^bb1(%arg0: index, %arg1: index):
  ...

Identical arguments are folded, which simplifies this to

br ^bb1(%0 : index)
^bb1(%arg0: index):  // %arg1 remapped to %arg0
  ...

Branch operand elimination

func.func @foo(%arg0: index) {
  br ^bb1(%arg0 : index)
  ^bb1(%0: index):
    ...
}

After the argument is eliminated:

func.func @foo(%arg0: index) {
  br ^bb1
  ^bb1:  // %0 remapped to %arg0
    ...
}
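The block argument folding can be sketched as follows. This is a hedged illustration, not IREE's code: the function name `dedupe_branch_operands` is hypothetical, and it assumes the target block has a single predecessor, so one branch fully determines the block arguments.

```python
def dedupe_branch_operands(operands, block_args):
    """Fold block arguments that always receive the same branch operand."""
    first_arg_for = {}  # operand -> the block argument that is kept for it
    kept_operands, kept_args, remap = [], [], {}
    for operand, arg in zip(operands, block_args):
        if operand in first_arg_for:
            # Duplicate operand: remap this block argument to the kept one.
            remap[arg] = first_arg_for[operand]
        else:
            first_arg_for[operand] = arg
            kept_operands.append(operand)
            kept_args.append(arg)
    return kept_operands, kept_args, remap
```

For `br ^bb1(%0, %0)` with arguments `%arg0, %arg1`, this keeps one operand and one argument and records that `%arg1` is remapped to `%arg0`.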

IREE::Util::createFoldGlobalsPass
This pass further optimizes the loads and stores of global tensors. It mainly does the following:
- Inlines constant stores, for example
```
util.global mutable @a : i32
func.func @fool() {
  %c5 = arith.constant 5 : i32
  util.global.store %c5, @a : i32
  return
}
```
becomes
```
util.global @a = 5 : i32
```
- Inlines constant loads, for example
```
util.global @a = 5 : i32
func.func @fool() {
  %1 = util.global.load @a : i32
  ...
}
```
becomes
```
func.func @fool() {
  %1 = arith.constant 5 : i32
  ...
}
```
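- Renames global tensors that are chained to one another.
- If a mutable global tensor is only ever stored to in an init function, changes it to immutable.
- Deletes global tensors that are never loaded.
- Merges immutable global tensors that have the same initial value.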
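Two of the rules in this list (dropping never-loaded globals and demoting init-only globals to immutable) can be sketched over a per-global usage summary. The `GlobalInfo` record and `fold_globals` function are hypothetical names introduced for illustration; the real pass works on the module IR rather than on such a summary.

```python
from dataclasses import dataclass, replace
from typing import List


@dataclass(frozen=True)
class GlobalInfo:  # hypothetical summary of how one global is used
    name: str
    mutable: bool
    stores_outside_init: int
    loads: int


def fold_globals(infos: List[GlobalInfo]) -> List[GlobalInfo]:
    kept = []
    for g in infos:
        if g.loads == 0:
            continue  # never loaded: delete the global
        if g.mutable and g.stores_outside_init == 0:
            g = replace(g, mutable=False)  # only written during init: make immutable
        kept.append(g)
    return kept
```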

IREE::Flow::createTensorPadToTensorInsertSlicePass
Converts tensor.pad into linalg.fill + tensor.insert_slice.

func.func @foo(%arg0: !hal.buffer_view) -> !hal.buffer_view attributes {iree.abi.stub} {
  %cst = arith.constant 0.000000e+00 : f32
  %0 = hal.tensor.import %arg0 : !hal.buffer_view -> tensor<1x1xf32>
  %padded = tensor.pad %0 low[1, 2] high[3, 4] {
  ^bb0(%arg1: index, %arg2: index):
    tensor.yield %cst : f32
  } : tensor<1x1xf32> to tensor<5x7xf32>
  %1 = hal.tensor.export %padded : tensor<5x7xf32> -> !hal.buffer_view
  return %1 : !hal.buffer_view
}

becomes

func.func @foo(%arg0: !hal.buffer_view) -> !hal.buffer_view attributes {iree.abi.stub} {
  %cst = arith.constant 0.000000e+00 : f32
  %0 = hal.tensor.import %arg0 : !hal.buffer_view -> tensor<1x1xf32>
  %1 = tensor.empty() : tensor<5x7xf32>
  %2 = linalg.fill ins(%cst : f32) outs(%1 : tensor<5x7xf32>) -> tensor<5x7xf32>
  %inserted_slice = tensor.insert_slice %0 into %2[1, 2] [1, 1] [1, 1] : tensor<1x1xf32> into tensor<5x7xf32>
  %3 = hal.tensor.export %inserted_slice : tensor<5x7xf32> -> !hal.buffer_view
  return %3 : !hal.buffer_view
}
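The equivalence behind this rewrite can be checked with a small pure-Python sketch on nested lists. The helper name `pad_via_fill_and_insert` is an assumption made for illustration; it mirrors the three ops in the converted IR for the 2-D case only.

```python
def pad_via_fill_and_insert(src, low, high, pad_value):
    """Pad a 2-D nested list by filling the padded shape, then inserting `src`."""
    rows, cols = len(src), len(src[0])
    out_rows = low[0] + rows + high[0]
    out_cols = low[1] + cols + high[1]
    # tensor.empty + linalg.fill: a buffer of the padded shape holding pad_value.
    out = [[pad_value] * out_cols for _ in range(out_rows)]
    # tensor.insert_slice: offsets = low, sizes = shape of src, strides = 1.
    for i in range(rows):
        for j in range(cols):
            out[low[0] + i][low[1] + j] = src[i][j]
    return out
```

With `low = (1, 2)` and `high = (3, 4)` a 1x1 input becomes a 5x7 result, matching `tensor<1x1xf32> to tensor<5x7xf32>` in the example.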

mlir::createConvertElementwiseToLinalgPass
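Converts elementwise ops (ops that carry the Elementwise traits) into linalg generic ops, which makes it easier to fuse elementwise ops later. The computational ops of the arith and math dialects are elementwise, so in practice this pass lowers arith and math ops that operate on tensors to the linalg dialect.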

func.func @foo(%arg0: !hal.buffer_view) -> !hal.buffer_view attributes {iree.abi.stub} {
  %0 = hal.tensor.import %arg0 : !hal.buffer_view -> tensor<2x3xf32>
  %1 = arith.addf %0, %0 : tensor<2x3xf32>
  %2 = hal.tensor.export %1 : tensor<2x3xf32> -> !hal.buffer_view
  return %2 : !hal.buffer_view
}

becomes

func.func @foo(%arg0: !hal.buffer_view) -> !hal.buffer_view attributes {iree.abi.stub} {
  %0 = hal.tensor.import %arg0 : !hal.buffer_view -> tensor<2x3xf32>
  %1 = linalg.generic {indexing_maps = [affine_map<(d0, d1) -> (d0, d1)>, affine_map<(d0, d1) -> (d0, d1)>, affine_map<(d0, d1) -> (d0, d1)>], iterator_types = ["parallel", "parallel"]} ins(%0, %0 : tensor<2x3xf32>, tensor<2x3xf32>) outs(%0 : tensor<2x3xf32>) {
  ^bb0(%in: f32, %in_0: f32, %out: f32):
    %3 = arith.addf %in, %in_0 : f32
    linalg.yield %3 : f32
  } -> tensor<2x3xf32>
  %2 = hal.tensor.export %1 : tensor<2x3xf32> -> !hal.buffer_view
  return %2 : !hal.buffer_view
}
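One way to read the resulting linalg.generic is as a loop nest with one loop per iterator and a scalar body. The sketch below is only an interpretation aid for the 2-D, identity-indexing-map case; `generic_elementwise` is a hypothetical name.

```python
def generic_elementwise(body, *inputs):
    """Apply a scalar `body` pointwise over 2-D nested lists of equal shape."""
    rows, cols = len(inputs[0]), len(inputs[0][0])
    out = [[None] * cols for _ in range(rows)]
    for d0 in range(rows):      # iterator_types[0] = "parallel"
        for d1 in range(cols):  # iterator_types[1] = "parallel"
            # Identity maps (d0, d1) -> (d0, d1) on every operand.
            out[d0][d1] = body(*(x[d0][d1] for x in inputs))
    return out
```

Calling `generic_elementwise(lambda a, b: a + b, x, x)` corresponds to the `arith.addf %0, %0` example.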

mlir::createLinalgFoldUnitExtentDimsPass
Eliminates dimensions or loops whose extent is 1.

func.func @foo(%arg0: !hal.buffer_view) -> !hal.buffer_view attributes {iree.abi.stub} {
  %0 = hal.tensor.import %arg0 : !hal.buffer_view -> tensor<1x3xf32>
  %1 = linalg.generic {indexing_maps = [affine_map<(d0, d1) -> (d0, d1)>, affine_map<(d0, d1) -> (d0, d1)>], iterator_types = ["parallel", "parallel"]} ins(%0 : tensor<1x3xf32>) outs(%0 : tensor<1x3xf32>) {
  ^bb0(%in: f32, %out: f32):
    %3 = arith.addf %in, %in : f32
    linalg.yield %3 : f32
  } -> tensor<1x3xf32>
  %2 = hal.tensor.export %1 : tensor<1x3xf32> -> !hal.buffer_view
  return %2 : !hal.buffer_view
}

becomes

func.func @foo(%arg0: !hal.buffer_view) -> !hal.buffer_view attributes {iree.abi.stub} {
  %0 = hal.tensor.import %arg0 : !hal.buffer_view -> tensor<1x3xf32>
  %collapsed = tensor.collapse_shape %0 [[0, 1]] : tensor<1x3xf32> into tensor<3xf32>
  %collapsed_0 = tensor.collapse_shape %0 [[0, 1]] : tensor<1x3xf32> into tensor<3xf32>
  %1 = linalg.generic {indexing_maps = [affine_map<(d0) -> (d0)>, affine_map<(d0) -> (d0)>], iterator_types = ["parallel"]} ins(%collapsed : tensor<3xf32>) outs(%collapsed_0 : tensor<3xf32>) {
  ^bb0(%in: f32, %out: f32):
    %3 = arith.addf %in, %in : f32
    linalg.yield %3 : f32
  } -> tensor<3xf32>
  %expanded = tensor.expand_shape %1 [[0, 1]] : tensor<3xf32> into tensor<1x3xf32>
  %2 = hal.tensor.export %expanded : tensor<1x3xf32> -> !hal.buffer_view
  return %2 : !hal.buffer_view
}

The linalg.generic is reduced from a two-level loop nest to a single loop.
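In loop terms, the change amounts to the following sketch on nested lists. Both function names are hypothetical, and the collapse and expand steps stand in for tensor.collapse_shape and tensor.expand_shape.

```python
def double_2d(t):
    """Original form: a 2-level loop nest over a 1x3 input."""
    return [[t[d0][d1] + t[d0][d1] for d1 in range(len(t[0]))]
            for d0 in range(len(t))]


def double_unit_folded(t):
    """After folding the unit dimension: collapse, single loop, expand."""
    assert len(t) == 1, "the leading dimension must have extent 1"
    collapsed = t[0]                     # collapse 1x3 -> 3
    result = [x + x for x in collapsed]  # single "parallel" loop
    return [result]                      # expand 3 -> 1x3
```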

createInterchangeGenericOpsPass
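Loop dimension interchange. Reduction loop dimensions are moved to the innermost position, and the corresponding parallel loop dimensions are moved outward.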

// sum(%arg0: tensor<2x3xf32>, 0) -> tensor<3xf32>
func.func @foo(%arg0: !hal.buffer_view) -> !hal.buffer_view attributes {iree.abi.stub} {
  %cst = arith.constant 0.000000e+00 : f32
  %0 = hal.tensor.import %arg0 : !hal.buffer_view -> tensor<2x3xf32>
  %1 = tensor.empty() : tensor<3xf32>
  %2 = linalg.fill ins(%cst : f32) outs(%1 : tensor<3xf32>) -> tensor<3xf32>
  %3 = linalg.generic {indexing_maps = [affine_map<(d0, d1) -> (d0, d1)>, affine_map<(d0, d1) -> (d1)>], iterator_types = ["reduction", "parallel"]} ins(%0 : tensor<2x3xf32>) outs(%2 : tensor<3xf32>) {
  ^bb0(%in: f32, %out: f32):
    %5 = arith.addf %in, %out : f32
    linalg.yield %5 : f32
  } -> tensor<3xf32>
  %4 = hal.tensor.export %3 : tensor<3xf32> -> !hal.buffer_view
  return %4 : !hal.buffer_view
}

After the loops are interchanged, this becomes

func.func @foo(%arg0: !hal.buffer_view) -> !hal.buffer_view attributes {iree.abi.stub} {
  %cst = arith.constant 0.000000e+00 : f32
  %0 = hal.tensor.import %arg0 : !hal.buffer_view -> tensor<2x3xf32>
  %1 = tensor.empty() : tensor<3xf32>
  %2 = linalg.fill ins(%cst : f32) outs(%1 : tensor<3xf32>) -> tensor<3xf32>
  %3 = linalg.generic {indexing_maps = [affine_map<(d0, d1) -> (d1, d0)>, affine_map<(d0, d1) -> (d0)>], iterator_types = ["parallel", "reduction"]} ins(%0 : tensor<2x3xf32>) outs(%2 : tensor<3xf32>) {
  ^bb0(%in: f32, %out: f32):
    %5 = arith.addf %in, %out : f32
    linalg.yield %5 : f32
  } -> tensor<3xf32>
  %4 = hal.tensor.export %3 : tensor<3xf32> -> !hal.buffer_view
  return %4 : !hal.buffer_view
}
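The two loop orders compute the same sum and differ only in which loop is outermost. A minimal sketch, with hypothetical function names and nested lists in place of tensors:

```python
def sum_axis0_reduction_outer(t):
    """Before: iterator_types = ["reduction", "parallel"]."""
    rows, cols = len(t), len(t[0])
    out = [0.0] * cols
    for d0 in range(rows):      # reduction loop (outer)
        for d1 in range(cols):  # parallel loop (inner)
            out[d1] += t[d0][d1]
    return out


def sum_axis0_reduction_inner(t):
    """After: iterator_types = ["parallel", "reduction"]."""
    rows, cols = len(t), len(t[0])
    out = [0.0] * cols
    for d0 in range(cols):      # parallel loop (outer)
        for d1 in range(rows):  # reduction loop (inner)
            out[d0] += t[d1][d0]  # input map (d0, d1) -> (d1, d0)
    return out
```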

memref::createResolveShapedTypeResultDimsPass
mlir::createCanonicalizerPass
mlir::createCSEPass

createFusionOfTensorOpsPass
Mainly fuses elementwise ops. It also converts tensor.expand_shape into linalg generic ops, which makes further fusion easier.

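Conditions for fusing elementwise ops:

- Both the producer and the consumer are linalg generic ops with tensor semantics.
- The producer has only one user.
- All iterator types of the producer are parallel, and the consumer's indexing map must have the same loop nest depth as the producer.
- The indexing map of the producer's result must be a permutation, that is, every element of the result is stored exactly once (the output is pointwise).
- The consumer may contain reduction iterators, but after fusion the input indexing maps must cover every iteration dimension. If a dimension were missing, its loop bound could not be determined.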
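The conditions above can be written down as a small predicate. This is a simplified sketch of the listed conditions only, not the check MLIR or IREE actually performs; all names are hypothetical, and an indexing map is encoded as the list of loop dimensions it reads, so `(d0, d1) -> (d1, d0)` is `[1, 0]`.

```python
def is_permutation(index_map, num_loops):
    """True if the map uses every loop dimension exactly once."""
    return sorted(index_map) == list(range(num_loops))


def covers_all_loops(input_maps, num_loops):
    """True if every loop dimension appears in at least one input map."""
    used = {d for m in input_maps for d in m}
    return used == set(range(num_loops))


def can_fuse(producer_iterators, producer_num_users, producer_result_map,
             fused_input_maps, fused_num_loops):
    return (producer_num_users == 1
            and all(it == "parallel" for it in producer_iterators)
            and is_permutation(producer_result_map, len(producer_iterators))
            and covers_all_loops(fused_input_maps, fused_num_loops))
```

For a 1-D multiply feeding a 1-D reduction, `can_fuse(["parallel"], 1, [0], [[0], [0]], 1)` holds; giving the producer a second user makes it fail.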

// reduce(mul(arg0, arg1), 0)
// for (int d0 = 0; d0 < n; ++d0) {
//   temp[d0] = arg0[d0] * arg1[d0];
// }
// result = 0;
// for (int d0 = 0; d0 < n; ++d0) {
//   result += temp[d0];
// }
func.func @foo(%arg0: !hal.buffer_view, %arg1: !hal.buffer_view) -> !hal.buffer_view attributes {iree.abi.stub} {
  %cst = arith.constant 0.000000e+00 : f32
  %0 = hal.tensor.import %arg0 : !hal.buffer_view -> tensor<2xf32>
  %1 = hal.tensor.import %arg1 : !hal.buffer_view -> tensor<2xf32>
  %2 = tensor.empty() : tensor<2xf32>
  %3 = linalg.generic {indexing_maps = [affine_map<(d0) -> (d0)>, affine_map<(d0) -> (d0)>, affine_map<(d0) -> (d0)>], iterator_types = ["parallel"]} ins(%0, %1 : tensor<2xf32>, tensor<2xf32>) outs(%2 : tensor<2xf32>) {
  ^bb0(%in: f32, %in_0: f32, %out: f32):
    %8 = arith.mulf %in, %in_0 : f32
    linalg.yield %8 : f32
  } -> tensor<2xf32>
  %4 = tensor.empty() : tensor<f32>
  %5 = linalg.fill ins(%cst : f32) outs(%4 : tensor<f32>) -> tensor<f32>
  %6 = linalg.generic {indexing_maps = [affine_map<(d0) -> (d0)>, affine_map<(d0) -> ()>], iterator_types = ["reduction"]} ins(%3 : tensor<2xf32>) outs(%5 : tensor<f32>) {
  ^bb0(%in: f32, %out: f32):
    %8 = arith.addf %in, %out : f32
    linalg.yield %8 : f32
  } -> tensor<f32>
  %7 = hal.tensor.export %6 : tensor<f32> -> !hal.buffer_view
  return %7 : !hal.buffer_view
}

After mul and reduce are fused, this becomes

// result = 0;
// for (int d0 = 0; d0 < n; ++d0) {
//   result += arg0[d0] * arg1[d0];
// }
func.func @foo(%arg0: !hal.buffer_view, %arg1: !hal.buffer_view) -> !hal.buffer_view attributes {iree.abi.stub} {
  %cst = arith.constant 0.000000e+00 : f32
  %0 = hal.tensor.import %arg0 : !hal.buffer_view -> tensor<2xf32>
  %1 = hal.tensor.import %arg1 : !hal.buffer_view -> tensor<2xf32>
  %2 = tensor.empty() : tensor<f32>
  %3 = linalg.fill ins(%cst : f32) outs(%2 : tensor<f32>) -> tensor<f32>
  %4 = linalg.generic {indexing_maps = [affine_map<(d0) -> (d0)>, affine_map<(d0) -> (d0)>, affine_map<(d0) -> ()>], iterator_types = ["reduction"]} ins(%0, %1 : tensor<2xf32>, tensor<2xf32>) outs(%3 : tensor<f32>) {
  ^bb0(%in: f32, %in_0: f32, %out: f32):
    %6 = arith.mulf %in, %in_0 : f32
    %7 = arith.addf %6, %out : f32
    linalg.yield %7 : f32
  } -> tensor<f32>
  %5 = hal.tensor.export %4 : tensor<f32> -> !hal.buffer_view
  return %5 : !hal.buffer_view
}

mlir::createLinalgDetensorizePass
Converts 0-D tensors to their underlying element type.
mlir::createCanonicalizerPass
mlir::createCSEPass

createSplitReductionPass
Splits the single reduction of matmul and topk into two reductions (one batch matmul and one add). Disabled by default; setting --iree-flow-split-matmul-reduction to a value >= 2 enables it.

func.func @test(%arg0: !hal.buffer_view, %arg1: !hal.buffer_view) -> !hal.buffer_view attributes {iree.abi.stub} {
  %cst = arith.constant 0.000000e+00 : f32
  %0 = hal.tensor.import %arg0 : !hal.buffer_view -> tensor<128x256xf32>
  %1 = hal.tensor.import %arg1 : !hal.buffer_view -> tensor<256x256xf32>
  %2 = linalg.init_tensor [128, 256] : tensor<128x256xf32>
  %3 = linalg.fill ins(%cst : f32) outs(%2 : tensor<128x256xf32>) -> tensor<128x256xf32>
  %4 = linalg.matmul ins(%0, %1 : tensor<128x256xf32>, tensor<256x256xf32>) outs(%3 : tensor<128x256xf32>) -> tensor<128x256xf32>
  %5 = hal.tensor.export %4 : tensor<128x256xf32> -> !hal.buffer_view
  return %5 : !hal.buffer_view
}

With --iree-flow-split-matmul-reduction=2, this becomes

func.func @test(%arg0: !hal.buffer_view, %arg1: !hal.buffer_view) -> !hal.buffer_view attributes {iree.abi.stub} {
  %cst = arith.constant 0.000000e+00 : f32
  %0 = hal.tensor.import %arg0 : !hal.buffer_view -> tensor<128x256xf32>
  %1 = hal.tensor.import %arg1 : !hal.buffer_view -> tensor<256x256xf32>
  %2 = linalg.init_tensor [128, 256] : tensor<128x256xf32>
  %3 = linalg.fill ins(%cst : f32) outs(%2 : tensor<128x256xf32>) -> tensor<128x256xf32>
  %4 = tensor.expand_shape %0 [[0], [1, 2]] : tensor<128x256xf32> into tensor<128x2x128xf32>
  %5 = tensor.expand_shape %1 [[0, 1], [2]] : tensor<256x256xf32> into tensor<2x128x256xf32>
  %6 = linalg.init_tensor [2, 128, 256] : tensor<2x128x256xf32>
  %7 = linalg.fill ins(%cst : f32) outs(%6 : tensor<2x128x256xf32>) -> tensor<2x128x256xf32>
  %8 = linalg.generic {indexing_maps = [affine_map<(d0, d1, d2, d3) -> (d1, d0, d3)>, affine_map<(d0, d1, d2, d3) -> (d0, d3, d2)>, affine_map<(d0, d1, d2, d3) -> (d0, d1, d2)>], iterator_types = ["parallel", "parallel", "parallel", "reduction"]} ins(%4, %5 : tensor<128x2x128xf32>, tensor<2x128x256xf32>) outs(%7 : tensor<2x128x256xf32>) attrs =  {__internal_linalg_transform__ = "SPLIT", linalg.memoized_indexing_maps = [affine_map<(d0, d1, d2) -> (d0, d2)>, affine_map<(d0, d1, d2) -> (d2, d1)>, affine_map<(d0, d1, d2) -> (d0, d1)>]} {
  ^bb0(%arg2: f32, %arg3: f32, %arg4: f32):
    %11 = arith.mulf %arg2, %arg3 : f32
    %12 = arith.addf %arg4, %11 : f32
    linalg.yield %12 : f32
  } -> tensor<2x128x256xf32>
  %9 = linalg.generic {indexing_maps = [affine_map<(d0, d1, d2) -> (d0, d1, d2)>, affine_map<(d0, d1, d2) -> (d1, d2)>], iterator_types = ["reduction", "parallel", "parallel"]} ins(%8 : tensor<2x128x256xf32>) outs(%3 : tensor<128x256xf32>) attrs =  {__internal_linalg_transform__ = "SPLIT"} {
  ^bb0(%arg2: f32, %arg3: f32):
    %11 = arith.addf %arg2, %arg3 : f32
    linalg.yield %11 : f32
  } -> tensor<128x256xf32>
  %10 = hal.tensor.export %9 : tensor<128x256xf32> -> !hal.buffer_view
  return %10 : !hal.buffer_view
}
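Numerically, the split works because a sum over K can be computed as a sum of partial sums over chunks of K. The sketch below uses nested lists and hypothetical function names; it assumes K is divisible by the split factor, as in the 256 = 2 x 128 example.

```python
def matmul(a, b):
    m, k, n = len(a), len(b), len(b[0])
    return [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(n)]
            for i in range(m)]


def split_k_matmul(a, b, split):
    m, k, n = len(a), len(b), len(b[0])
    assert k % split == 0, "this sketch assumes K is divisible by the split factor"
    chunk = k // split
    # Step 1: `split` independent partial matmuls (the new batch dim is parallel).
    partial = [[[sum(a[i][s * chunk + p] * b[s * chunk + p][j] for p in range(chunk))
                 for j in range(n)] for i in range(m)] for s in range(split)]
    # Step 2: a second, much smaller reduction adds the partial results.
    return [[sum(partial[s][i][j] for s in range(split)) for j in range(n)]
            for i in range(m)]
```

The first step exposes an extra parallel dimension, which is the motivation for splitting when K is large compared with M and N.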

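createInterchangeGenericOpsPass
Loop dimension interchange. Reduction loop dimensions are moved to the innermost position, and the corresponding parallel loop dimensions are moved outward.

createInterchangeTransposeGenericOpsPass
When an input indexing map is a permutation, interchanges the loop dimensions so that the input indexing map becomes the identity. The effect is to make accesses to the input as contiguous as possible.

createDispatchWithTransformDialect
Schedules and dispatches ops according to the transform dialect. This requires loading a separate transform dialect module file and is not applied by default. The transform dialect defines a set of scheduling rules that guide transformations of the target IR, such as loop unrolling and tiling.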

createFormDispatchRegionsPass
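Taking linalg ops that contain reduction loops, or named linalg ops, as the center (root), this pass merges producers and consumers according to certain rules and carves out dispatch region subgraphs. A dispatch region is the atomic execution unit in IREE. Inside a dispatch region the memory of inputs and outputs can be reused directly, which avoids memory allocation inside the region; allocation happens only at dispatch region boundaries, and synchronization is inserted automatically between dispatch regions.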

func.func @predict(%arg0: !hal.buffer_view, %arg1: !hal.buffer_view, %arg2: !hal.buffer_view) -> !hal.buffer_view attributes {iree.abi.stub} {
  %cst = arith.constant 0.000000e+00 : f32
  %0 = hal.tensor.import %arg0 : !hal.buffer_view -> tensor<2x10xf32>
  %1 = hal.tensor.import %arg1 : !hal.buffer_view -> tensor<10x5xf32>
  %2 = hal.tensor.import %arg2 : !hal.buffer_view -> tensor<5xf32>
  %3 = tensor.empty() : tensor<2x5xf32>
  %4 = linalg.fill ins(%cst : f32) outs(%3 : tensor<2x5xf32>) -> tensor<2x5xf32>
  %5 = linalg.matmul ins(%0, %1 : tensor<2x10xf32>, tensor<10x5xf32>) outs(%4 : tensor<2x5xf32>) -> tensor<2x5xf32>
  %6 = linalg.generic {indexing_maps = [affine_map<(d0, d1) -> (d0, d1)>, affine_map<(d0, d1) -> (d1)>, affine_map<(d0, d1) -> (d0, d1)>], iterator_types = ["parallel", "parallel"]} ins(%5, %2 : tensor<2x5xf32>, tensor<5xf32>) outs(%3 : tensor<2x5xf32>) {
  ^bb0(%in: f32, %in_0: f32, %out: f32):
    %8 = arith.addf %in, %in_0 : f32
    linalg.yield %8 : f32
  } -> tensor<2x5xf32>
  %7 = hal.tensor.export %6 : tensor<2x5xf32> -> !hal.buffer_view
  return %7 : !hal.buffer_view
}

becomes

func.func @predict(%arg0: !hal.buffer_view, %arg1: !hal.buffer_view, %arg2: !hal.buffer_view) -> !hal.buffer_view attributes {iree.abi.stub} {
  %cst = arith.constant 0.000000e+00 : f32
  %0 = hal.tensor.import %arg0 : !hal.buffer_view -> tensor<2x10xf32>
  %1 = hal.tensor.import %arg1 : !hal.buffer_view -> tensor<10x5xf32>
  %2 = hal.tensor.import %arg2 : !hal.buffer_view -> tensor<5xf32>
  %3 = tensor.empty() : tensor<2x5xf32>
  %4 = linalg.fill ins(%cst : f32) outs(%3 : tensor<2x5xf32>) -> tensor<2x5xf32>
  %c1 = arith.constant 1 : index
  %c0 = arith.constant 0 : index
  %c2 = arith.constant 2 : index
  %c1_0 = arith.constant 1 : index
  %5 = affine.apply affine_map<()[s0, s1, s2] -> ((s1 - s0) ceildiv s2)>()[%c0, %c2, %c1_0]
  %c0_1 = arith.constant 0 : index
  %c5 = arith.constant 5 : index
  %c1_2 = arith.constant 1 : index
  %6 = affine.apply affine_map<()[s0, s1, s2] -> ((s1 - s0) ceildiv s2)>()[%c0_1, %c5, %c1_2]
  %7 = flow.dispatch.region[%5, %6] -> (tensor<2x5xf32>) {
    %9 = linalg.matmul ins(%0, %1 : tensor<2x10xf32>, tensor<10x5xf32>) outs(%4 : tensor<2x5xf32>) -> tensor<2x5xf32>
    %10 = linalg.generic {indexing_maps = [affine_map<(d0, d1) -> (d0, d1)>, affine_map<(d0, d1) -> (d1)>, affine_map<(d0, d1) -> (d0, d1)>], iterator_types = ["parallel", "parallel"]} ins(%9, %2 : tensor<2x5xf32>, tensor<5xf32>) outs(%3 : tensor<2x5xf32>) {
    ^bb0(%in: f32, %in_3: f32, %out: f32):
      %11 = arith.addf %in, %in_3 : f32
      linalg.yield %11 : f32
    } -> tensor<2x5xf32>
    flow.return %10 : tensor<2x5xf32>
  } count(%arg3: index, %arg4: index) -> (index, index, index) {
    %x, %y, %z = flow.dispatch.workgroup_count_from_dag_root %arg3, %arg4
    flow.return %x, %y, %z : index, index, index
  }
  %8 = hal.tensor.export %7 : tensor<2x5xf32> -> !hal.buffer_view
  return %8 : !hal.buffer_view
}
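The two affine.apply ops in the converted IR evaluate `(s1 - s0) ceildiv s2`, that is, the trip count of a loop with lower bound `s0`, upper bound `s1` and step `s2`. These trip counts are the workload values passed to flow.dispatch.region. A small sketch of that arithmetic, with hypothetical helper names:

```python
def ceildiv(a, b):
    """Ceiling division for a positive divisor b, using integers only."""
    return -(-a // b)


def loop_trip_count(lower, upper, step):
    """Evaluate (upper - lower) ceildiv step, as in the affine_map above."""
    return ceildiv(upper - lower, step)
```

For the 2x5 result in the example, `loop_trip_count(0, 2, 1)` and `loop_trip_count(0, 5, 1)` give the workload `[2, 5]`.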

createFormDispatchWorkgroupsPass
Converts dispatch regions into the dispatch workgroups form and clones cloneable ops (such as linalg.fill and tensor.empty) into the workgroup. If tiling was done at the linalg level, this pass also converts the tensor.extract_slice and tensor.insert_slice ops introduced by tiling into flow.tensor.slice and flow.tensor.update wherever possible; those that cannot be converted are later turned into flow.dispatch.tensor.load and flow.dispatch.tensor.store.

createCaptureDispatchDynamicDimsPass
In the arguments of flow.dispatch.workgroups, dynamically shaped tensors have been replaced by !flow.dispatch.tensor plus the corresponding dynamic dimension index values. This pass captures those dynamic dimension indices among the workgroups arguments and inserts flow.dispatch.tie_shape to bind them to the !flow.dispatch.tensor.

mlir::createCanonicalizerPass
mlir::createCSEPass

createInitializeEmptyTensorsPass
If any user of a tensor.empty op is neither a linalg op nor an IREE LinalgExt op, converts that tensor.empty op into a flow.tensor.empty or flow.tensor.splat op.

IREE::Flow::createOutlineDispatchRegionsPass
Converts each dispatch region into a flow.executable plus a flow.dispatch op.

IREE::Util::createStripDebugOpsPass
Removes DebugOnly ops.

mlir::createCanonicalizerPass

IREE::Flow::createDeduplicateExecutablesPass
Removes duplicate flow.executable ops.

IREE::Flow::createInjectDispatchTracingPass
Injects ops that trace the inputs and outputs of dispatch functions at runtime. Disabled by default.

IREE::Flow::createCleanupTensorShapesPass
Removes flow.tensor.tie_shape ops and verifies that the module no longer contains the shape query ops tensor.dim and tensor.rank.
mlir::createCanonicalizerPass
mlir::createCSEPass
mlir::createCanonicalizerPass
mlir::createCSEPass
mlir::createSymbolDCEPass

To be continued…
