ScaleHLS-opt Options Explained: A Guide to the High-Level Synthesis Optimization Tool

https://github.com/UIUC-ChenLab/ScaleHLS-HIDA

Introduction

In today's era of heterogeneous computing, efficiently mapping high-level algorithms onto hardware accelerators such as FPGAs and ASICs has become critical. ScaleHLS is an open-source high-level synthesis (HLS) framework built on MLIR (Multi-Level Intermediate Representation), designed to provide more powerful and flexible optimization capabilities for hardware acceleration. As the core tool of ScaleHLS, scalehls-opt offers a rich set of optimization options that help designers generate high-quality hardware implementations from high-level descriptions.

MLIR is a compiler infrastructure originally developed at Google. It provides a unified intermediate representation framework that lets compilers for different domains share optimization techniques. ScaleHLS builds on MLIR, exploiting its extensibility and multi-level abstractions to provide a complete toolchain for high-level synthesis.

scalehls-opt is analogous to LLVM's opt tool: it takes MLIR as input, applies the specified optimization passes, and emits the optimized MLIR. These optimizations can significantly improve the performance of the generated hardware, reduce resource usage, and simplify the design process.

Basic Usage and General Options

The basic command-line syntax of scalehls-opt is:

scalehls-opt [options] <input file>

where <input file> is a file containing MLIR code and [options] are the options that control the optimization process.
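
For reference, a minimal input file might look like the following hand-written sketch (a hypothetical vector-add kernel; real inputs are usually produced by a C/C++ or PyTorch front end):

// input.mlir - a hypothetical vector-add kernel in the affine dialect.
func.func @kernel(%a: memref<64xf32>, %b: memref<64xf32>, %c: memref<64xf32>) {
  affine.for %i = 0 to 64 {
    %x = affine.load %a[%i] : memref<64xf32>
    %y = affine.load %b[%i] : memref<64xf32>
    %s = arith.addf %x, %y : f32
    affine.store %s, %c[%i] : memref<64xf32>
  }
  return
}

Running scalehls-opt on such a file with no options simply parses, verifies, and reprints the IR; the options below control what happens in between.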

Common general options

  • --help: display available options
  • --help-list: display available options as a list
  • --version: display the program version
  • --color: use colors in output (default: autodetect)
  • -o <filename>: specify the output filename
  • --split-input-file: split the input file and process each chunk independently
  • --allow-unregistered-dialect: allow operations from unregistered dialects
  • --no-implicit-module: disable the implicit addition of a top-level module during parsing

Debugging and analysis options

  • --mlir-print-ir-before=<pass-arg>: print the IR before the specified passes
  • --mlir-print-ir-after=<pass-arg>: print the IR after the specified passes
  • --mlir-print-ir-before-all: print the IR before each pass
  • --mlir-print-ir-after-all: print the IR after each pass
  • --mlir-pass-statistics: display the statistics of each pass
  • --mlir-timing: display execution times
  • --mlir-timing-display=<value>: display method for timing data (list or tree)
  • --dump-pass-pipeline: print the pipeline that will be run

Dialects Supported by ScaleHLS

A key feature of MLIR is its support for multiple "dialects", each representing the abstractions of a particular domain. scalehls-opt supports many dialects, including:

  • affine: affine transformations and loops
  • arith: arithmetic operations
  • bufferization: tensor-to-buffer conversion
  • func: functions and control flow
  • hls: ScaleHLS's own dialect for HLS-specific operations and types
  • linalg: linear algebra operations
  • memref: memory references
  • scf: structured control flow
  • tensor: tensor operations
  • tosa: machine learning operations
  • vector: vector operations

Together, these dialects provide a complete mapping path from high-level algorithm descriptions down to low-level hardware implementations.
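
To make these abstraction levels concrete, here is a hedged sketch of the tensor-level (tosa) form of the vector addition shown earlier; lowering it through linalg and bufferization eventually yields the affine/memref loop form that most scalehls-opt passes operate on:

// Tensor level: one tosa op on immutable tensors (hypothetical sketch).
func.func @add_tosa(%lhs: tensor<64xf32>, %rhs: tensor<64xf32>) -> tensor<64xf32> {
  %0 = "tosa.add"(%lhs, %rhs) : (tensor<64xf32>, tensor<64xf32>) -> tensor<64xf32>
  return %0 : tensor<64xf32>
}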

Key Optimization Passes by Category

scalehls-opt provides a rich set of optimization passes, which can be grouped by function as follows:

Loop optimizations

--scalehls-affine-loop-fusion                 - Fuse affine loop nests
  --fusion-compute-tolerance=<number>         - Additional computation tolerated while fusing (default: 100.0)
  --fusion-maximal                            - Enable maximal loop fusion
  --mode=<value>                              - Fusion mode (greedy/producer/sibling)

--scalehls-affine-loop-order-opt              - Optimize the order of affine loop nests
--scalehls-affine-loop-perfection             - Try to perfect nested loops
--scalehls-affine-loop-tile                   - Tile affine loop nests and annotate point loops
  --tile-size=<uint>                          - Tile size used for all loops

--scalehls-affine-loop-unroll-jam             - Unroll and jam affine loop nests
  --point-loop-only                           - Only apply unroll and jam to point loop bands
  --unroll-factor=<uint>                      - Unroll factor

These passes optimize loop structures to improve the parallelism and efficiency of the hardware implementation. For example, loop tiling breaks a large loop into smaller blocks to make better use of local storage, and loop unrolling increases parallelism at the cost of additional resources.
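
As a rough, hand-written illustration of tiling (an approximation of the effect, not literal pass output), tiling a 64-iteration loop by 16 produces an outer tile loop and an inner "point" loop, the band that the --point-loop-only options later target:

// Before tiling: a flat 64-iteration copy loop.
func.func @before(%a: memref<64xf32>, %out: memref<64xf32>) {
  affine.for %i = 0 to 64 {
    %x = affine.load %a[%i] : memref<64xf32>
    affine.store %x, %out[%i] : memref<64xf32>
  }
  return
}

// After tiling by 16 (conceptual): the outer loop steps over tiles while
// the inner point loop walks within one tile; unroll-and-jam with factor 4
// would subsequently replicate the body of %ii.
func.func @after(%a: memref<64xf32>, %out: memref<64xf32>) {
  affine.for %i = 0 to 64 step 16 {
    affine.for %ii = 0 to 16 {
      %x = affine.load %a[%i + %ii] : memref<64xf32>
      affine.store %x, %out[%i + %ii] : memref<64xf32>
    }
  }
  return
}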

Memory optimizations

--scalehls-array-partition                    - Apply an optimized array partition strategy
  --threshold=<uint>                          - Threshold for using LUTRAM

--scalehls-buffer-vectorize                   - Vectorize buffers
--scalehls-create-local-buffer                - Promote external buffers to on-chip buffers
  --external-buffer-only                      - Only handle external buffers
  --register-only                             - Only registers or single-element buffers

--scalehls-place-dataflow-buffer              - Place dataflow buffers
  --place-external-buffer                     - Place buffers in external memory
  --threshold=<uint>                          - Threshold for placing external buffers

These passes optimize memory access patterns and buffer organization, which is crucial for improving bandwidth and reducing access latency in a hardware design. For example, array partitioning splits a large array into several smaller ones that can be accessed in parallel, and buffer vectorization enables bulk data transfers.
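
As a hedged sketch of the idea behind array partitioning (hand-written, not actual pass output), splitting one buffer into two cyclic banks lets the two loads below proceed in the same cycle instead of contending for a single memory port:

// Before: both loads per iteration hit the same buffer.
func.func @before(%buf: memref<64xf32>, %out: memref<32xf32>) {
  affine.for %i = 0 to 32 {
    %even = affine.load %buf[%i * 2] : memref<64xf32>
    %odd  = affine.load %buf[%i * 2 + 1] : memref<64xf32>
    %s = arith.addf %even, %odd : f32
    affine.store %s, %out[%i] : memref<32xf32>
  }
  return
}

// After cyclic partitioning by 2 (conceptual): each bank serves one load.
func.func @after(%bank0: memref<32xf32>, %bank1: memref<32xf32>, %out: memref<32xf32>) {
  affine.for %i = 0 to 32 {
    %even = affine.load %bank0[%i] : memref<32xf32>
    %odd  = affine.load %bank1[%i] : memref<32xf32>
    %s = arith.addf %even, %odd : f32
    affine.store %s, %out[%i] : memref<32xf32>
  }
  return
}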

Dataflow optimizations

--scalehls-create-dataflow-from-affine        - Create a dataflow hierarchy from affine loops
--scalehls-create-dataflow-from-linalg        - Create a dataflow hierarchy from linalg
--scalehls-create-dataflow-from-tosa          - Create a dataflow hierarchy from tosa
--scalehls-balance-dataflow-node              - Balance dataflow nodes
--scalehls-bufferize-dataflow                 - Bufferize dataflow operations
--scalehls-lower-dataflow                     - Lower dataflow from task level to node level
  --split-external-access                     - Whether to split external memory accesses

--scalehls-parallelize-dataflow-node          - Unroll affine loop nests based on the dataflow structure
  --complexity-aware                          - Whether to consider node complexity
  --correlation-aware                         - Whether to consider node correlation
  --max-unroll-factor=<uint>                  - Maximum unroll factor
  --point-loop-only                           - Only apply unrolling to point loop bands

--scalehls-schedule-dataflow-node             - Schedule dataflow nodes
  --ignore-violations                         - Ignore multi-consumer or multi-producer violations
--scalehls-stream-dataflow-task               - Stream dataflow tasks

These passes handle the dataflow structure, which is key to high-performance HLS designs. A dataflow architecture lets different computation stages execute in parallel, much like a hardware pipeline, and can significantly increase throughput.
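
As a hand-written sketch of a natural dataflow cut (a conceptual illustration, not actual pass output), the two back-to-back loops below would become two dataflow nodes, with the intermediate buffer %tmp acting as the channel between them; once streamed, the consumer can start on element i while the producer works ahead:

func.func @two_stages(%in: memref<64xf32>, %out: memref<64xf32>) {
  %tmp = memref.alloc() : memref<64xf32>
  %c2 = arith.constant 2.0 : f32
  // Stage 1 (producer): scale the input.
  affine.for %i = 0 to 64 {
    %x = affine.load %in[%i] : memref<64xf32>
    %y = arith.mulf %x, %c2 : f32
    affine.store %y, %tmp[%i] : memref<64xf32>
  }
  // Stage 2 (consumer): add the original input back to the scaled values.
  affine.for %i = 0 to 64 {
    %y = affine.load %tmp[%i] : memref<64xf32>
    %x = affine.load %in[%i] : memref<64xf32>
    %z = arith.addf %y, %x : f32
    affine.store %z, %out[%i] : memref<64xf32>
  }
  return
}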

Interface and function optimizations

--scalehls-create-axi-interface               - Create AXI interfaces for the top function
  --top-func=<string>                         - The top function for HLS synthesis

--scalehls-func-duplication                   - Duplicate the function for each function call
--scalehls-func-pipelining                    - Apply function pipelining
  --target-func=<string>                      - The target function to be pipelined
  --target-ii=<uint>                          - The target initiation interval (II) to achieve

--scalehls-func-preprocess                    - Preprocess functions for subsequent ScaleHLS optimizations
  --top-func=<string>                         - The top function for HLS synthesis

These passes optimize function structure and interfaces, especially for top-level functions that interact with external systems. For example, AXI interface creation lets an FPGA design communicate with other system components over the standard AXI bus.
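
For instance, given a hypothetical top function like the sketch below, --scalehls-create-axi-interface="top-func=top" would expose its memref arguments over AXI so the kernel can be driven from the host (this shows the input to the pass, not its output):

// Each memref argument of the top function corresponds to a data port
// that the AXI interface pass wires to the bus.
func.func @top(%a: memref<1024xf32>, %b: memref<1024xf32>) {
  affine.for %i = 0 to 1024 {
    %x = affine.load %a[%i] : memref<1024xf32>
    affine.store %x, %b[%i] : memref<1024xf32>
  }
  return
}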

Performance Estimation and Design Space Exploration

--scalehls-qor-estimation                     - Estimate performance and resource utilization
  --target-spec=<string>                      - File path of the target backend specifications and configurations

--scalehls-dse                                - Optimize the HLS design at multiple abstraction levels
  --csv-path=<string>                         - File path for dumping the CSV of design spaces
  --output-path=<string>                      - File path for dumping the MLIR of Pareto design points
  --target-spec=<string>                      - File path of the target backend specifications and configurations

These passes estimate performance and resource usage and provide design space exploration, helping designers find the best design points.

Pass Pipelines

ScaleHLS provides several predefined pass pipelines. Each pipeline is a fixed combination of passes targeting a particular input type and optimization goal:

--hida-cpp-pipeline

Compiles C++ into optimized C++ code; this is the core optimization flow of ScaleHLS.

--hida-cpp-pipeline options:
  --axi-interface                            - Create AXI interface
  --balance-dataflow                         - Whether to balance the dataflow
  --complexity-aware                         - Whether to consider node complexity in the transform
  --correlation-aware                        - Whether to consider node correlation in the transform
  --debug-point=<uint>                       - Stop the pipeline at the given debug point
  --external-buffer-threshold=<uint>         - Threshold for placing external buffers
  --fake-quantize                            - Trigger fake quantization (for testing only)
  --fusion-tolerance=<number>                - Additional computation tolerated while fusing loops (default: 100.0)
  --loop-tile-size=<uint>                    - Tile size of each loop (must be >= 1)
  --loop-unroll-factor=<uint>                - Overall loop unrolling factor (set to 0 to disable)
  --place-external-buffer                    - Place buffers in external memories
  --top-func=<string>                        - Specify the top function of the design
  --tosa-input                               - Indicate that the input IR is TOSA
  --vectorize                                - Vectorize with a factor of 2

--hida-pytorch-pipeline

Compiles TOSA code converted from Torch-MLIR into HLS C++ code, incorporating the HIDA (hierarchical dataflow) methodology.

--hida-pytorch-pipeline options:
  [same options as --hida-cpp-pipeline]

--scalehls-dse-pipeline

Launches design space exploration (DSE) for C/C++ kernels.

--scalehls-dse-pipeline options:
  --target-spec=<string>                     - File path of the target backend specifications and configurations
  --top-func=<string>                        - Specify the top function of the design

These pipelines simplify the optimization process, letting users apply a whole series of optimizations with a single command.

Practical Examples

Example 1: Basic loop optimization

Suppose we have a simple matrix multiplication implementation and want to optimize its performance on an FPGA:

# Apply loop tiling, unrolling, and dataflow optimizations
scalehls-opt matmul.mlir \
  --scalehls-affine-loop-tile="tile-size=16" \
  --scalehls-affine-loop-unroll-jam="unroll-factor=4" \
  --scalehls-create-dataflow-from-affine \
  --scalehls-array-partition="threshold=256" \
  --scalehls-create-local-buffer \
  --scalehls-create-axi-interface="top-func=matmul" \
  -o optimized_matmul.mlir

This command sequence applies loop tiling (tile-size=16), loop unrolling (unroll-factor=4), dataflow creation, array partitioning, and local-buffer creation to the matrix multiplication, and finally creates an AXI interface for the top function.
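
For concreteness, matmul.mlir might contain a kernel like this hand-written sketch (sizes chosen arbitrarily):

// Hypothetical contents of matmul.mlir: a 32x32x32 matrix multiply.
func.func @matmul(%A: memref<32x32xf32>, %B: memref<32x32xf32>,
                  %C: memref<32x32xf32>) {
  affine.for %i = 0 to 32 {
    affine.for %j = 0 to 32 {
      affine.for %k = 0 to 32 {
        %a = affine.load %A[%i, %k] : memref<32x32xf32>
        %b = affine.load %B[%k, %j] : memref<32x32xf32>
        %c = affine.load %C[%i, %j] : memref<32x32xf32>
        %p = arith.mulf %a, %b : f32
        %s = arith.addf %c, %p : f32
        affine.store %s, %C[%i, %j] : memref<32x32xf32>
      }
    }
  }
  return
}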

Example 2: Optimizing a CNN model with a predefined pipeline

For a CNN model exported from PyTorch, we can use a predefined pipeline:

# Use the PyTorch optimization pipeline
scalehls-opt resnet18.mlir \
  --hida-pytorch-pipeline="top-func=forward loop-tile-size=8 loop-unroll-factor=2 balance-dataflow vectorize" \
  -o optimized_resnet18.mlir

This command optimizes the ResNet-18 model with the HIDA PyTorch pipeline, applying loop tiling and unrolling, dataflow balancing, and vectorization.
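
The input here is TOSA-dialect IR exported through Torch-MLIR. As a hedged sketch of its flavor (a real resnet18.mlir contains hundreds of ops such as tosa.conv2d and tosa.add, and attribute names may vary across MLIR versions):

// A hypothetical fragment of Torch-MLIR-exported TOSA IR: a ReLU as clamp.
func.func @forward(%x: tensor<1x64x56x56xf32>) -> tensor<1x64x56x56xf32> {
  %0 = "tosa.clamp"(%x) {min_fp = 0.0 : f32, max_fp = 3.4028235e+38 : f32,
                         min_int = 0 : i64, max_int = 2147483647 : i64}
       : (tensor<1x64x56x56xf32>) -> tensor<1x64x56x56xf32>
  return %0 : tensor<1x64x56x56xf32>
}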

Example 3: Multi-stage optimization

For complex designs, optimizations can be applied in stages, inspecting the intermediate results after each stage:

# Stage 1: loop optimizations
scalehls-opt input.mlir \
  --scalehls-affine-loop-tile="tile-size=16" \
  --scalehls-affine-loop-unroll-jam="unroll-factor=4" \
  -o stage1.mlir

# Stage 2: memory optimizations
scalehls-opt stage1.mlir \
  --scalehls-array-partition="threshold=256" \
  --scalehls-create-local-buffer \
  -o stage2.mlir

# Stage 3: dataflow optimizations
scalehls-opt stage2.mlir \
  --scalehls-create-dataflow-from-affine \
  --scalehls-schedule-dataflow-node \
  -o final.mlir

Example 4: Design space exploration

To find the best design configuration, use design space exploration:

# Launch design space exploration
scalehls-opt kernel.mlir \
  --scalehls-dse-pipeline="top-func=kernel target-spec=xilinx_u250.json" \
  -o pareto_designs.mlir

This command launches design space exploration for the specified kernel function, using the target specification defined in xilinx_u250.json, and outputs the Pareto-optimal design points.

Advanced Usage Tips

1. Combining multiple passes

Predefined pipelines are convenient, but sometimes finer control is needed. A custom optimization sequence can be created by combining individual passes:

scalehls-opt input.mlir \
  --scalehls-func-preprocess="top-func=kernel" \
  --scalehls-affine-loop-tile="tile-size=8" \
  --scalehls-affine-loop-order-opt \
  --scalehls-affine-loop-unroll-jam="unroll-factor=4 point-loop-only" \
  --scalehls-create-local-buffer \
  --scalehls-loop-pipelining="pipeline-level=1 target-ii=1" \
  -o output.mlir

2. Building a custom pipeline with --pass-pipeline

For frequently used optimization combinations, a custom pipeline can be defined with the --pass-pipeline option:

scalehls-opt input.mlir \
  --pass-pipeline="func.func(scalehls-affine-loop-tile{tile-size=8},scalehls-affine-loop-unroll-jam{unroll-factor=4})" \
  -o output.mlir

This command builds a custom pipeline that first tiles the loops inside each function and then applies unroll-and-jam.

3. Debugging the optimization process

Use the IR-printing options to trace the optimization process:

scalehls-opt input.mlir \
  --scalehls-affine-loop-tile="tile-size=16" \
  --mlir-print-ir-after=scalehls-affine-loop-tile \
  --scalehls-affine-loop-unroll-jam="unroll-factor=4" \
  --mlir-print-ir-after=scalehls-affine-loop-unroll-jam \
  -o output.mlir

This command prints the IR after each of the specified passes, which helps in understanding the effect of each optimization.

Full --help-list Output

scalehls-opt --help-list
OVERVIEW: ScaleHLS Optimization Tool
Available Dialects: affine, arith, bufferization, builtin, dlti, func, hls, linalg, llvm, math, memref, ml_program, scf, tensor, tosa, vector
USAGE: scalehls-opt [options] <input file>

OPTIONS:
  --allow-unregistered-dialect                         - Allow operation with no registered dialects
  --color                                              - Use colors in output (default=autodetect)
  --disable-i2p-p2i-opt                                - Disables inttoptr/ptrtoint roundtrip optimization
  --dot-cfg-mssa=<file name for generated dot file>    - file name for generated dot file
  --dump-pass-pipeline                                 - Print the pipeline that will be run
  --emit-bytecode                                      - Emit bytecode when generating output
  --generate-merged-base-profiles                      - When generating nested context-sensitive profiles, always generate extra base profile for function with all its context profiles merged into it.
  --help                                               - Display available options (--help-hidden for more)
  --mlir-debug-counter=<string>                        - Comma separated list of debug counter skip and count arguments
  --mlir-disable-threading                             - Disable multi-threading within MLIR, overrides any further call to MLIRContext::enableMultiThreading()
  --mlir-elide-elementsattrs-if-larger=<uint>          - Elide ElementsAttrs with "..." that have more elements than the given upper limit
  --mlir-pass-pipeline-crash-reproducer=<string>       - Generate a .mlir reproducer file at the given output path if the pass manager crashes or fails
  --mlir-pass-pipeline-local-reproducer                - When generating a crash reproducer, attempt to generate a reproducer with the smallest pipeline.
  --mlir-pass-statistics                               - Display the statistics of each pass
  --mlir-pass-statistics-display=<value>               - Display method for pass statistics
    =list                                              -   display the results in a merged list sorted by pass name
    =pipeline                                          -   display the results with a nested pipeline view
  --mlir-pretty-debuginfo                              - Print pretty debug info in MLIR output
  --mlir-print-debug-counter                           - Print out debug counter information after all counters have been accumulated
  --mlir-print-debuginfo                               - Print debug info in MLIR output
  --mlir-print-elementsattrs-with-hex-if-larger=<long> - Print DenseElementsAttrs with a hex string that have more elements than the given upper limit (use -1 to disable)
  --mlir-print-ir-after=<pass-arg>                     - Print IR after specified passes
  --mlir-print-ir-after-all                            - Print IR after each pass
  --mlir-print-ir-after-change                         - When printing the IR after a pass, only print if the IR changed
  --mlir-print-ir-after-failure                        - When printing the IR after a pass, only print if the pass failed
  --mlir-print-ir-before=<pass-arg>                    - Print IR before specified passes
  --mlir-print-ir-before-all                           - Print IR before each pass
  --mlir-print-ir-module-scope                         - When printing IR for print-ir-[before|after]{-all} always print the top-level operation
  --mlir-print-local-scope                             - Print with local scope and inline information (eliding aliases for attributes, types, and locations)
  --mlir-print-op-on-diagnostic                        - When a diagnostic is emitted on an operation, also print the operation as an attached note
  --mlir-print-stacktrace-on-diagnostic                - When a diagnostic is emitted, also print the stack trace as an attached note
  --mlir-print-value-users                             - Print users of operation results and block arguments as a comment
  --mlir-timing                                        - Display execution times
  --mlir-timing-display=<value>                        - Display method for timing data
    =list                                              -   display the results in a list sorted by total time
    =tree                                              -   display the results in a nested tree view
  --no-implicit-module                                 - Disable implicit addition of a top-level module op during parsing
  -o <filename>                                        - Output filename
  --opaque-pointers                                    - Use opaque pointers
  Compiler passes to run
    --pass-pipeline                                    -   A textual description of a pass pipeline to run
    Passes:
      --affine-data-copy-generate                      -   Generate explicit copying for affine memory operations
        --fast-mem-capacity=<ulong>                    - Set fast memory space capacity in KiB (default: unlimited)
        --fast-mem-space=<uint>                        - Fast memory space identifier for copy generation (default: 1)
        --generate-dma                                 - Generate DMA instead of point-wise copy
        --min-dma-transfer=<int>                       - Minimum DMA transfer size supported by the target in bytes
        --skip-non-unit-stride-loops                   - Testing purposes: avoid non-unit stride loop choice depths for copy placement
        --slow-mem-space=<uint>                        - Slow memory space identifier for copy generation (default: 0)
        --tag-mem-space=<uint>                         - Tag memory space identifier for copy generation (default: 0)
      --affine-expand-index-ops                        -   Lower affine operations operating on indices into more fundamental operations
      --affine-loop-coalescing                         -   Coalesce nested loops with independent bounds into a single loop
      --affine-loop-fusion                             -   Fuse affine loop nests
        --fusion-compute-tolerance=<number>            - Fractional increase in additional computation tolerated while fusing        
        --fusion-fast-mem-space=<uint>                 - Faster memory space number to promote fusion buffers to
        --fusion-local-buf-threshold=<ulong>           - Threshold size (KiB) for promoting local buffers to fast memory space       
        --fusion-maximal                               - Enables maximal loop fusion
        --mode=<value>                                 - fusion mode to attempt
    =greedy                                      -   Perform greedy (both producer-consumer and sibling)  fusion
    =producer                                    -   Perform only producer-consumer fusion
    =sibling                                     -   Perform only sibling fusion
      --affine-loop-invariant-code-motion              -   Hoist loop invariant instructions outside of affine loops
      --affine-loop-normalize                          -   Apply normalization transformations to affine loop-like ops
      --affine-loop-tile                               -   Tile affine loop nests
        --cache-size=<ulong>                           - Set size of cache to tile for in KiB (default: 512)
        --separate                                     - Separate full and partial tiles (default: false)
        --tile-size=<uint>                             - Use this tile size for all loops
        --tile-sizes=<uint>                            - List of tile sizes for each perfect nest (overridden by -tile-size)
      --affine-loop-unroll                             -   Unroll affine loops
        --cleanup-unroll                               - Fully unroll the cleanup loop when possible.
        --unroll-factor=<uint>                         - Use this unroll factor for all loops being unrolled
        --unroll-full                                  - Fully unroll loops
        --unroll-full-threshold=<uint>                 - Unroll all loops with trip count less than or equal to this
        --unroll-num-reps=<uint>                       - Unroll innermost loops repeatedly this many times
        --unroll-up-to-factor                          - Allow unrolling up to the factor specified
      --affine-loop-unroll-jam                         -   Unroll and jam affine loops
        --unroll-jam-factor=<uint>                     - Use this unroll jam factor for all loops (default 4)
      --affine-parallelize                             -   Convert affine.for ops into 1-D affine.parallel
        --max-nested=<uint>                            - Maximum number of nested parallel loops to produce. Defaults to unlimited (UINT_MAX).
        --parallel-reductions                          - Whether to parallelize reduction loops. Defaults to false.
      --affine-pipeline-data-transfer                  -   Pipeline non-blocking data transfers between explicitly managed levels of the memory hierarchy
      --affine-scalrep                                 -   Replace affine memref accesses by scalars by forwarding stores to loads and eliminating redundant loads
      --affine-simplify-structures                     -   Simplify affine expressions in maps/sets and normalize memrefs
      --affine-super-vectorize                         -   Vectorize to a target independent n-D vector abstraction
        --test-fastest-varying=<long>                  - Specify a 1-D, 2-D or 3-D pattern of fastest varying memory dimensions to match. See defaultPatterns in Vectorize.cpp for a description and examples. This is used for testing purposes
        --vectorize-reductions                         - Vectorize known reductions expressed via iter_args. Switched off by default.
        --virtual-vector-size=<long>                   - Specify an n-D virtual vector size for vectorization
      --arith-bufferize                                -   Bufferize Arith dialect ops.
        --alignment=<uint>                             - Create global memrefs with a specified alignment
      --arith-emulate-wide-int                         -   Emulate 2*N-bit integer operations using N-bit operations
        --widest-int-supported=<uint>                  - Widest integer type supported by the target
      --arith-expand                                   -   Legalize Arith ops to be convertible to LLVM.
      --arith-unsigned-when-equivalent                 -   Replace signed ops with unsigned ones where they are proven equivalent    
      --arm-neon-2d-to-intr                            -   Convert Arm NEON structured ops to intrinsics
      --async-parallel-for                             -   Convert scf.parallel operations to multiple async compute ops executed concurrently for non-overlapping iteration ranges
        --async-dispatch                               - Dispatch async compute tasks using recursive work splitting. If `false` async compute tasks will be launched using simple for loop in the caller thread.
        --min-task-size=<int>                          - The minimum task size for sharding parallel operation.
        --num-workers=<int>                            - The number of available workers to execute async operations. If `-1` the value will be retrieved from the runtime.
      --async-runtime-policy-based-ref-counting        -   Policy based reference counting for Async runtime operations
      --async-runtime-ref-counting                     -   Automatic reference counting for Async runtime operations
      --async-runtime-ref-counting-opt                 -   Optimize automatic reference counting operations for the Async runtime by removing redundant operations
      --async-to-async-runtime                         -   Lower high level async operations (e.g. async.execute) to the explicit async.runtime and async.coro operations
        --eliminate-blocking-await-ops                 - Rewrite functions with blocking async.runtime.await as coroutines with async.runtime.await_and_resume.
      --buffer-deallocation                            -   Adds all required dealloc operations for all allocations in the input program
      --buffer-hoisting                                -   Optimizes placement of allocation operations by moving them into common dominators and out of nested regions
      --buffer-loop-hoisting                           -   Optimizes placement of allocation operations by moving them out of loop nests
      --buffer-results-to-out-params                   -   Converts memref-typed function results to out-params
      --bufferization-bufferize                        -   Bufferize the `bufferization` dialect
      --canonicalize                                   -   Canonicalize operations
        --disable-patterns=<string>                    - Labels of patterns that should be filtered out during application
        --enable-patterns=<string>                     - Labels of patterns that should be used during application, all other patterns are filtered out
        --max-iterations=<long>                        - Seed the worklist in general top-down order
        --region-simplify                              - Seed the worklist in general top-down order
        --top-down                                     - Seed the worklist in general top-down order
      --control-flow-sink                              -   Sink operations into conditional blocks
      --convert-affine-for-to-gpu                      -   Convert top-level AffineFor Ops to GPU kernels
        --gpu-block-dims=<uint>                        - Number of GPU block dimensions for mapping
        --gpu-thread-dims=<uint>                       - Number of GPU thread dimensions for mapping
      --convert-amdgpu-to-rocdl                        -   Convert AMDGPU dialect to ROCDL dialect
        --chipset=<string>                             - Chipset that these operations will run on
      --convert-arith-to-llvm                          -   Convert Arith dialect to LLVM dialect
        --index-bitwidth=<uint>                        - Bitwidth of the index type, 0 to use size of machine word
      --convert-arith-to-spirv                         -   Convert Arith dialect to SPIR-V dialect
        --emulate-non-32-bit-scalar-types              - Emulate non-32-bit scalar types with 32-bit ones if missing native support  
        --enable-fast-math                             - Enable fast math mode (assuming no NaN and infinity for floating point values) when performing conversion
      --convert-async-to-llvm                          -   Convert the operations from the async dialect into the LLVM dialect       
      --convert-bufferization-to-memref                -   Convert operations from the Bufferization dialect to the MemRef dialect   
      --convert-cf-to-llvm                             -   Convert ControlFlow operations to the LLVM dialect
        --index-bitwidth=<uint>                        - Bitwidth of the index type, 0 to use size of machine word
      --convert-cf-to-spirv                            -   Convert ControlFlow dialect to SPIR-V dialect
        --emulate-non-32-bit-scalar-types              - Emulate non-32-bit scalar types with 32-bit ones if missing native support  
      --convert-complex-to-libm                        -   Convert Complex dialect to libm calls
      --convert-complex-to-llvm                        -   Convert Complex dialect to LLVM dialect
      --convert-complex-to-standard                    -   Convert Complex dialect to standard dialect
      --convert-elementwise-to-linalg                  -   Convert ElementwiseMappable ops to linalg
      --convert-func-to-llvm                           -   Convert from the Func dialect to the LLVM dialect
        --data-layout=<string>                         - String description (LLVM format) of the data layout that is expected on the produced module
        --index-bitwidth=<uint>                        - Bitwidth of the index type, 0 to use size of machine word
        --use-bare-ptr-memref-call-conv                - Replace FuncOp's MemRef arguments with bare pointers to the MemRef element types
      --convert-func-to-spirv                          -   Convert Func dialect to SPIR-V dialect
        --emulate-non-32-bit-scalar-types              - Emulate non-32-bit scalar types with 32-bit ones if missing native support  
      --convert-gpu-launch-to-vulkan-launch            -   Convert gpu.launch_func to vulkanLaunch external call
      --convert-gpu-to-nvvm                            -   Generate NVVM operations for gpu operations
        --index-bitwidth=<uint>                        - Bitwidth of the index type, 0 to use size of machine word
      --convert-gpu-to-rocdl                           -   Generate ROCDL operations for gpu operations
        --chipset=<string>                             - Chipset that these operations will run on
        --index-bitwidth=<uint>                        - Bitwidth of the index type, 0 to use size of machine word
        --runtime=<value>                              - Runtime code will be run on (default is Unknown, can also use HIP or OpenCl)
    =unknown                                     -   Unknown (default)
    =HIP                                         -   HIP
    =OpenCL                                      -   OpenCL
        --use-bare-ptr-memref-call-conv                - Replace memref arguments in GPU functions with bare pointers.All memrefs must have static shape
      --convert-gpu-to-spirv                           -   Convert GPU dialect to SPIR-V dialect
      --convert-index-to-llvm                          -   Lower the `index` dialect to the `llvm` dialect.
        --index-bitwidth=<uint>                        - Bitwidth of the index type, 0 to use size of machine word
      --convert-linalg-to-affine-loops                 -   Lower the operations from the linalg dialect into affine loops
      --convert-linalg-to-llvm                         -   Convert the operations from the linalg dialect into the LLVM dialect      
      --convert-linalg-to-loops                        -   Lower the operations from the linalg dialect into loops
      --convert-linalg-to-parallel-loops               -   Lower the operations from the linalg dialect into parallel loops
      --convert-linalg-to-spirv                        -   Convert Linalg dialect to SPIR-V dialect
      --convert-linalg-to-std                          -   Convert the operations from the linalg dialect into the Standard dialect  
      --convert-math-to-funcs                          -   Convert Math operations to calls of outlined implementations.
      --convert-math-to-libm                           -   Convert Math dialect to libm calls
      --convert-math-to-llvm                           -   Convert Math dialect to LLVM dialect
      --convert-math-to-spirv                          -   Convert Math dialect to SPIR-V dialect
      --convert-memref-to-llvm                         -   Convert operations from the MemRef dialect to the LLVM dialect
        --index-bitwidth=<uint>                        - Bitwidth of the index type, 0 to use size of machine word
        --use-aligned-alloc                            - Use aligned_alloc in place of malloc for heap allocations
        --use-generic-functions                        - Use generic allocation and deallocation functions instead of the classic 'malloc', 'aligned_alloc' and 'free' functions
      --convert-memref-to-spirv                        -   Convert MemRef dialect to SPIR-V dialect
        --bool-num-bits=<int>                          - The number of bits to store a boolean value
      --convert-nvgpu-to-nvvm                          -   Convert NVGPU dialect to NVVM dialect
      --convert-openacc-to-llvm                        -   Convert the OpenACC ops to LLVM dialect
      --convert-openacc-to-scf                         -   Convert the OpenACC ops to OpenACC with SCF dialect
      --convert-openmp-to-llvm                         -   Convert the OpenMP ops to OpenMP ops with LLVM dialect
      --convert-parallel-loops-to-gpu                  -   Convert mapped scf.parallel ops to gpu launch operations
      --convert-pdl-to-pdl-interp                      -   Convert PDL ops to PDL interpreter ops
      --convert-scf-to-cf                              -   Convert SCF dialect to ControlFlow dialect, replacing structured control flow with a CFG
      --convert-scf-to-openmp                          -   Convert SCF parallel loop to OpenMP parallel + workshare constructs.      
      --convert-scf-to-spirv                           -   Convert SCF dialect to SPIR-V dialect.
      --convert-shape-constraints                      -   Convert shape constraint operations to the standard dialect
      --convert-shape-to-std                           -   Convert operations from the shape dialect into the standard dialect       
      --convert-spirv-to-llvm                          -   Convert SPIR-V dialect to LLVM dialect
      --convert-tensor-to-linalg                       -   Convert some Tensor dialect ops to Linalg dialect
      --convert-tensor-to-spirv                        -   Convert Tensor dialect to SPIR-V dialect
        --emulate-non-32-bit-scalar-types              - Emulate non-32-bit scalar types with 32-bit ones if missing native support  
      --convert-vector-to-gpu                          -   Lower the operations from the vector dialect into the GPU dialect
        --use-nvgpu                                    - convert to NvGPU ops instead of GPU dialect ops
      --convert-vector-to-llvm                         -   Lower the operations from the vector dialect into the LLVM dialect        
        --enable-amx                                   - Enables the use of AMX dialect while lowering the vector dialect.
        --enable-arm-neon                              - Enables the use of ArmNeon dialect while lowering the vector dialect.       
        --enable-arm-sve                               - Enables the use of ArmSVE dialect while lowering the vector dialect.        
        --enable-x86vector                             - Enables the use of X86Vector dialect while lowering the vector dialect.     
        --force-32bit-vector-indices                   - Allows compiler to assume vector indices fit in 32-bit if that yields faster code
        --reassociate-fp-reductions                    - Allows llvm to reassociate floating-point reductions for speed
      --convert-vector-to-scf                          -   Lower the operations from the vector dialect into the SCF dialect
        --full-unroll                                  - Perform full unrolling when converting vector transfers to SCF
        --lower-permutation-maps                       - Replace permutation maps with vector transposes/broadcasts before lowering transfer ops
        --lower-tensors                                - Lower transfer ops that operate on tensors
        --target-rank=<uint>                           - Target vector rank to which transfer ops should be lowered
      --convert-vector-to-spirv                        -   Convert Vector dialect to SPIR-V dialect
      --cse                                            -   Eliminate common sub-expressions
      --decorate-spirv-composite-type-layout           -   Decorate SPIR-V composite type with layout info
      --drop-equivalent-buffer-results                 -   Remove MemRef return values that are equivalent to a bbArg
      --eliminate-alloc-tensors                        -   Try to eliminate all alloc_tensor ops.
      --empty-tensor-to-alloc-tensor                   -   Replace all empty ops by alloc_tensor ops.
      --finalizing-bufferize                           -   Finalize a partial bufferization
      --fold-memref-alias-ops                          -   Fold memref alias ops into consumer load/store ops
      --func-bufferize                                 -   Bufferize func/call/return ops
      --gpu-async-region                               -   Make GPU ops async
      --gpu-kernel-outlining                           -   Outline gpu.launch bodies to kernel functions
        --data-layout-str=<string>                     - String containing the data layout specification to be attached to the GPU kernel module
      --gpu-launch-sink-index-computations             -   Sink index computations into gpu.launch body
      --gpu-map-parallel-loops                         -   Greedily maps loops to GPU hardware dimensions.
      --gpu-to-llvm                                    -   Convert GPU dialect to LLVM dialect with GPU runtime calls
        --gpu-binary-annotation=<string>               - Annotation attribute string for GPU binary
        --use-bare-pointers-for-kernels                - Use bare pointers to pass memref arguments to kernels. The kernel must use the same setting for this option.
      --inline                                         -   Inline function calls
        --default-pipeline=<string>                    - The default optimizer pipeline used for callables
        --max-iterations=<uint>                        - Maximum number of iterations when inlining within an SCC
        --op-pipelines=<pass-manager>                  - Callable operation specific optimizer pipelines (in the form of `dialect.op(pipeline)`)
      --launch-func-to-vulkan                          -   Convert vulkanLaunch external call to Vulkan runtime external calls       
      --linalg-bufferize                               -   Bufferize the linalg dialect
      --linalg-detensorize                             -   Detensorize linalg ops
        --aggressive-mode                              - Detensorize all ops that qualify for detensoring along with branch operands and basic-block arguments.
      --linalg-fold-unit-extent-dims                   -   Remove unit-extent dimension in Linalg ops on tensors
        --fold-one-trip-loops-only                     - Only folds the one-trip loops from Linalg ops on tensors (for testing purposes only)
      --linalg-fuse-elementwise-ops                    -   Fuse elementwise operations on tensors
      --linalg-generalize-named-ops                    -   Convert named ops into generic ops
      --linalg-inline-scalar-operands                  -   Inline scalar operands into linalg generic ops
      --linalg-named-op-conversion                     -   Convert from one named linalg op to another.
      --llvm-legalize-for-export                       -   Legalize LLVM dialect to be convertible to LLVM IR
      --llvm-optimize-for-nvvm-target                  -   Optimize NVVM IR
      --llvm-request-c-wrappers                        -   Request C wrapper emission for all functions
      --loop-invariant-code-motion                     -   Hoist loop invariant instructions outside of the loop
      --lower-affine                                   -   Lower Affine operations to a combination of Standard and SCF operations   
      --lower-host-to-llvm                             -   Lowers the host module code and `gpu.launch_func` to LLVM
      --map-memref-spirv-storage-class                 -   Map numeric MemRef memory spaces to SPIR-V storage classes
        --client-api=<string>                          - The client API to use for populating mappings
      --memref-emulate-wide-int                        -   Emulate 2*N-bit integer operations using N-bit operations
        --widest-int-supported=<uint>                  - Widest integer type supported by the target
      --memref-expand                                  -   Legalize memref operations to be convertible to LLVM.
      --normalize-memrefs                              -   Normalize memrefs
      --nvgpu-optimize-shared-memory                   -   Optimizes accesses to shared memory memrefs in order to reduce bank conflicts.
      --one-shot-bufferize                             -   One-Shot Bufferize
        --allow-return-allocs                          - Allows returning/yielding new allocations from a block.
        --allow-unknown-ops                            - Allows unknown (not bufferizable) ops in the input IR.
        --analysis-fuzzer-seed=<uint>                  - Test only: Analyze ops in random order with a given seed (fuzzer)
        --analysis-heuristic=<string>                  - Heuristic that control the IR traversal during analysis
        --bufferize-function-boundaries                - Bufferize function boundaries (experimental).
        --copy-before-write                            - Skip the analysis. Make a buffer copy on every write.
        --create-deallocs                              - Specify if buffers should be deallocated. For compatibility with core bufferization passes.
        --dialect-filter=<string>                      - Restrict bufferization to ops from these dialects.
        --function-boundary-type-conversion=<string>   - Controls layout maps when bufferizing function signatures.
        --must-infer-memory-space                      - The memory space of an memref types must always be inferred. If unset, a default memory space of 0 is used otherwise.
        --print-conflicts                              - Test only: Annotate IR with RaW conflicts. Requires test-analysis-only.     
        --test-analysis-only                           - Test only: Only run inplaceability analysis and annotate IR
        --unknown-type-conversion=<string>             - Controls layout maps for non-inferrable memref types.
      --outline-shape-computation                      -   Using shape.func to preserve shape computation
      --print-op-stats                                 -   Print statistics of operations
        --json                                         - print the stats as JSON
      --promote-buffers-to-stack                       -   Promotes heap-based allocations to automatically managed stack-based allocations
        --max-alloc-size-in-bytes=<uint>               - Maximal size in bytes to promote allocations to stack.
        --max-rank-of-allocated-memref=<uint>          - Maximal memref rank to promote dynamic buffers.
      --reconcile-unrealized-casts                     -   Simplify and eliminate unrealized conversion casts
      --remove-shape-constraints                       -   Replace all cstr_ ops with a true witness
      --resolve-ranked-shaped-type-result-dims         -   Resolve memref.dim of result values of ranked shape type
      --resolve-shaped-type-result-dims                -   Resolve memref.dim of result values
      --scalehls-affine-loop-fusion                    -   Fuse affine loop nests
        --fusion-compute-tolerance=<number>            - Fractional increase in additional computation tolerated while fusing        
        --fusion-fast-mem-space=<uint>                 - Faster memory space number to promote fusion buffers to
        --fusion-local-buf-threshold=<ulong>           - Threshold size (KiB) for promoting local buffers to fast memory space       
        --fusion-maximal                               - Enables maximal loop fusion
        --mode=<value>                                 - fusion mode to attempt
    =greedy                                      -   Perform greedy fusion
    =producer                                    -   Perform only producer-consumer fusion
    =sibling                                     -   Perform only sibling fusion
      --scalehls-affine-loop-order-opt                 -   Optimize the order of affine loop nests
      --scalehls-affine-loop-perfection                -   Try to perfect a nested loop
      --scalehls-affine-loop-tile                      -   Tile affine loop nests and annotate point loops
        --tile-size=<uint>                             - Use this tile size for all loops
      --scalehls-affine-loop-unroll-jam                -   Unroll and jam affine loop nests
        --point-loop-only                              - Only apply unroll and jam to point loop band
        --unroll-factor=<uint>                         - Positive number: the factor of unrolling
      --scalehls-affine-store-forward                  -   Forward store to load, including conditional stores
      --scalehls-array-partition                       -   Apply optimized array partition strategy
        --threshold=<uint>                             - Positive number: the threshold of using LUTRAM
      --scalehls-balance-dataflow-node                 -   Balance dataflow nodes
      --scalehls-buffer-vectorize                      -   Vectorize buffers
      --scalehls-bufferize-dataflow                    -   Bufferize dataflow operations
      --scalehls-collapse-memref-unit-dims             -   Collapse memref's unit dimensions
      --scalehls-convert-dataflow-to-func              -   Convert dataflow to function dialect
        --split-external-access                        - whether split external memory accesses
      --scalehls-convert-tensor-to-linalg              -   Lower tosa::ReshapeOp and tensor::PadOp
      --scalehls-create-axi-interface                  -   Create AXI interfaces for the top function
        --top-func=<string>                            - The top function for HLS synthesis
      --scalehls-create-dataflow-from-affine           -   Create dataflow hierarchy from affine loops
      --scalehls-create-dataflow-from-linalg           -   Create dataflow hierarchy from linalg
      --scalehls-create-dataflow-from-tosa             -   Create dataflow hierarchy from tosa
      --scalehls-create-hls-primitive                  -   Create HLS C++ multiplication primitives
      --scalehls-create-local-buffer                   -   Promote external buffer to on-chip buffer
        --external-buffer-only                         - only handle external buffers
        --register-only                                - only registers or single-element buffers
      --scalehls-create-memref-subview                 -   Create subviews based on loop analysis
        --mode=<value>                                 - loop band mode to create subviews
    =point                                       -   Create subviews on point loop band
    =reduction                                   -   Create subviews on reduction loop band
      --scalehls-create-token-stream                   -   Create token stream channels for DRAM buffers
      --scalehls-dse                                   -   Optimize HLS design at multiple abstraction level
        --csv-path=<string>                            - File path: the path for dumping the CSV of design spaces
        --output-path=<string>                         - File path: the path for dumping the MLIR of pareto design points
        --target-spec=<string>                         - File path: target backend specifications and configurations
      --scalehls-eliminate-multi-consumer              -   Eliminate multi-consumer violations
      --scalehls-eliminate-multi-producer              -   Try to eliminate multi-producer violations
      --scalehls-func-duplication                      -   Duplicate function for each function call
      --scalehls-func-pipelining                       -   Apply function pipelining
        --target-func=<string>                         - The target function to be pipelined
        --target-ii=<uint>                             - Positive number: the targeted II to achieve
      --scalehls-func-preprocess                       -   Preprocess the functions for subsequent ScaleHLS optimizations
        --top-func=<string>                            - The top function for HLS synthesis
      --scalehls-legalize-dataflow                     -   Legalize dataflow by merging dataflow nodes
      --scalehls-linalg-analyze-model                  -   Analyze the operation number of a linalg model
      --scalehls-linalg-fake-quantize                  -   Convert to quantized model (only for testing use)
        --quan-bits=<uint>                             - the number of bits for quantization
      --scalehls-loop-pipelining                       -   Apply loop pipelining
        --pipeline-level=<uint>                        - Positive number: loop level to be pipelined (from innermost)
        --target-ii=<uint>                             - Positive number: the targeted II to achieve
      --scalehls-lower-affine                          -   Lower AffineSelectOp and AffineForOp
      --scalehls-lower-copy-to-affine                  -   Convert copy and assign to affine loops
        --internal-copy-only                           - only convert copy between internal buffers
      --scalehls-lower-dataflow                        -   Lower dataflow from task level to node level
        --split-external-access                        - whether split external memory accesses
      --scalehls-materialize-reduction                 -   Materialize loop reductions
      --scalehls-parallelize-dataflow-node             -   Unroll affine loop nests based on the dataflow structure
        --complexity-aware                             - Whether to consider node complexity in the transform
        --correlation-aware                            - Whether to consider node correlation in the transform
        --max-unroll-factor=<uint>                     - Positive number: the maximum factor of unrolling
        --point-loop-only                              - Only apply unroll and jam to point loop band
      --scalehls-place-dataflow-buffer                 -   Place dataflow buffers
        --place-external-buffer                        - Place buffers in external buffers
        --threshold=<uint>                             - Positive number: the threshold of placing external buffers
      --scalehls-qor-estimation                        -   Estimate the performance and resource utilization
        --target-spec=<string>                         - File path: target backend specifications and configurations
      --scalehls-raise-affine-to-copy                  -   Raise copy in affine loops to memref.copy
      --scalehls-reduce-initial-interval               -   Try to reduce the initial interval
      --scalehls-remove-variable-bound                 -   Try to remove variable loop bounds
      --scalehls-schedule-dataflow-node                -   Schedule dataflow nodes
        --ignore-violations                            - Ignore multi-consumer or producer violations
      --scalehls-simplify-affine-if                    -   Simplify affine if operations
      --scalehls-simplify-copy                         -   Simplify memref copy ops
      --scalehls-stream-dataflow-task                  -   Stream dataflow tasks
      --scalehls-tosa-fake-quantize                    -   Convert to 8-bits quantized model (only for testing use)
      --scalehls-tosa-simplify-graph                   -   Remove redundant TOSA operations
      --sccp                                           -   Sparse Conditional Constant Propagation
      --scf-bufferize                                  -   Bufferize the scf dialect.
      --scf-for-loop-canonicalization                  -   Canonicalize operations within scf.for loop bodies
      --scf-for-loop-peeling                           -   Peel `for` loops at their upper bounds.
        --skip-partial                                 - Do not peel loops inside of the last, partial iteration of another already peeled loop.
      --scf-for-loop-range-folding                     -   Fold add/mul ops into loop range
      --scf-for-loop-specialization                    -   Specialize `for` loops for vectorization
      --scf-for-to-while                               -   Convert SCF for loops to SCF while loops
      --scf-parallel-loop-collapsing                   -   Collapse parallel loops to use less induction variables
        --collapsed-indices-0=<uint>                   - Which loop indices to combine 0th loop index
        --collapsed-indices-1=<uint>                   - Which loop indices to combine into the position 1 loop index
        --collapsed-indices-2=<uint>                   - Which loop indices to combine into the position 2 loop index
      --scf-parallel-loop-fusion                       -   Fuse adjacent parallel loops
      --scf-parallel-loop-specialization               -   Specialize parallel loops for vectorization
      --scf-parallel-loop-tiling                       -   Tile parallel loops
        --no-min-max-bounds                            - Perform tiling with fixed upper bound with inbound check inside the internal loops
        --parallel-loop-tile-sizes=<long>              - Factors to tile parallel loops by
      --shape-bufferize                                -   Bufferize the shape dialect.
      --shape-to-shape-lowering                        -   Legalize Shape dialect to be convertible to Arith
      --simplify-extract-strided-metadata              -   Simplify extract_strided_metadata ops
      --snapshot-op-locations                          -   Generate new locations from the current IR
        --filename=<string>                            - The filename to print the generated IR
        --tag=<string>                                 - A tag to use when fusing the new locations with the original. If unset, the locations are replaced.
      --sparse-buffer-rewrite                          -   Rewrite sparse primitives on buffers to actual code
      --sparse-tensor-codegen                          -   Convert sparse tensors and primitives to actual code
      --sparse-tensor-conversion                       -   Convert sparse tensors and primitives to library calls
        --s2s-strategy=<int>                           - Set the strategy for sparse-to-sparse conversion
      --sparse-tensor-rewrite                          -   Applies sparse tensor rewriting rules prior to sparsification
        --enable-convert                               - Enable rewriting rules for the convert operator
        --enable-foreach                               - Enable rewriting rules for the foreach operator
        --enable-runtime-library                       - Enable runtime library for manipulating sparse tensors
      --sparsification                                 -   Automatically generate sparse tensor code from sparse tensor types        
        --parallelization-strategy=<value>             - Set the parallelization strategy
    =none                                        -   Turn off sparse parallelization.
    =dense-outer-loop                            -   Enable dense outer loop sparse parallelization.
    =any-storage-outer-loop                      -   Enable sparse parallelization regardless of storage for the outer loop.
    =dense-any-loop                              -   Enable dense parallelization for any loop.
    =any-storage-any-loop                        -   Enable sparse parallelization for any storage and loop.
      --spirv-canonicalize-gl                          -   Run canonicalization involving GLSL ops
      --spirv-lower-abi-attrs                          -   Decorate SPIR-V composite type with layout info
      --spirv-rewrite-inserts                          -   Rewrite sequential chains of spirv.CompositeInsert operations into spirv.CompositeConstruct operations
      --spirv-unify-aliased-resource                   -   Unify access of multiple aliased resources into access of one single resource
      --spirv-update-vce                               -   Deduce and attach minimal (version, capabilities, extensions) requirements to spirv.module ops
      --strip-debuginfo                                -   Strip debug info from all operations
      --symbol-dce                                     -   Eliminate dead symbols
      --symbol-privatize                               -   Mark symbols private
        --exclude=<string>                             - Comma separated list of symbols that should not be marked private
      --tensor-bufferize                               -   Bufferize the `tensor` dialect
      --tensor-copy-insertion                          -   Make all tensor IR inplaceable by inserting copies
        --allow-return-allocs                          - Allows returning/yielding new allocations from a block.
        --bufferize-function-boundaries                - Bufferize function boundaries (experimental).
        --create-deallocs                              - Specify if new allocations should be deallocated.
        --must-infer-memory-space                      - The memory space of an memref types must always be inferred. If unset, a default memory space of 0 is used otherwise.
      --topological-sort                               -   Sort regions without SSA dominance in topological order
      --tosa-infer-shapes                              -   Propagate shapes across TOSA operations
      --tosa-layerwise-constant-fold                   -   Fold layerwise operations on constant tensors
      --tosa-make-broadcastable                        -   TOSA rank Reshape to enable Broadcasting
      --tosa-optional-decompositions                   -   Applies Tosa operations optional decompositions
      --tosa-to-arith                                  -   Lower TOSA to the Arith dialect
        --include-apply-rescale                        - Whether to include the lowering for tosa.apply_rescale to arith
        --use-32-bit                                   - Whether to prioritze lowering to 32-bit operations
      --tosa-to-linalg                                 -   Lower TOSA to LinAlg on tensors
      --tosa-to-linalg-named                           -   Lower TOSA to LinAlg named operations
      --tosa-to-scf                                    -   Lower TOSA to the SCF dialect
      --tosa-to-tensor                                 -   Lower TOSA to the Tensor dialect
      --transform-dialect-check-uses                   -   warn about potential use-after-free in the transform dialect
      --vector-bufferize                               -   Bufferize Vector dialect ops
      --view-op-graph                                  -   Print Graphviz visualization of an operation
        --max-label-len=<uint>                         - Limit attribute/type length to number of chars
        --print-attrs                                  - Print attributes of operations
        --print-control-flow-edges                     - Print control flow edges
        --print-data-flow-edges                        - Print data flow edges
        --print-result-types                           - Print result types of operations
    Pass Pipelines:
      --hida-cpp-pipeline                              -   Compile C++ to optimized C++
        --axi-interface                                - Create AXI interface
        --balance-dataflow                             - Whether to balance the dataflow
        --complexity-aware                             - Whether to consider node complexity in the transform
        --correlation-aware                            - Whether to consider node correlation in the transform
        --debug-point=<uint>                           - Stop the pipeline at the given debug point
        --external-buffer-threshold=<uint>             - The threshold of placing external buffers
        --fake-quantize                                - Trigger the fake quantization (just for testing use)
        --fusion-tolerance=<number>                    - Additional computation tolerated during loop fusion (default: 100.0)
        --loop-tile-size=<uint>                        - The tile size of each loop (must be at least 1)
        --loop-unroll-factor=<uint>                    - The overall loop unrolling factor (set 0 to disable)
        --place-external-buffer                        - Place buffers in external memories
        --top-func=<string>                            - Specify the top function of the design
        --tosa-input                                   - Indicate that the input IR is TOSA
        --vectorize                                    - Vectorize with a factor of 2
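
A minimal usage sketch for this pipeline follows; the input file gemm.mlir (produced by a C/C++ front end) and the top-function name gemm are assumptions for illustration. Pipeline options are passed as a quoted, space-separated key=value list:

scalehls-opt gemm.mlir \
  --hida-cpp-pipeline="top-func=gemm loop-tile-size=4 loop-unroll-factor=2 axi-interface=true" \
  -o gemm_opt.mlir
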
      --hida-pytorch-pipeline                          -   Compile TOSA (from Torch-MLIR) to HLS C++ with HIDA
        --axi-interface                                - Create AXI interface
        --balance-dataflow                             - Whether to balance the dataflow
        --complexity-aware                             - Whether to consider node complexity in the transform
        --correlation-aware                            - Whether to consider node correlation in the transform
        --debug-point=<uint>                           - Stop the pipeline at the given debug point
        --external-buffer-threshold=<uint>             - The threshold of placing external buffers
        --fake-quantize                                - Trigger the fake quantization (just for testing use)
        --fusion-tolerance=<number>                    - Additional computation tolerated during loop fusion (default: 100.0)
        --loop-tile-size=<uint>                        - The tile size of each loop (must be at least 1)
        --loop-unroll-factor=<uint>                    - The overall loop unrolling factor (set 0 to disable)
        --place-external-buffer                        - Place buffers in external memories
        --top-func=<string>                            - Specify the top function of the design
        --tosa-input                                   - Indicate that the input IR is TOSA
        --vectorize                                    - Vectorize with a factor of 2
      --hida-pytorch-pipeline-post                     -   Compile TOSA (from Torch-MLIR) to HLS C++ with HIDA
        --axi-interface                                - Create AXI interface
        --balance-dataflow                             - Whether to balance the dataflow
        --complexity-aware                             - Whether to consider node complexity in the transform
        --correlation-aware                            - Whether to consider node correlation in the transform
        --debug-point=<uint>                           - Stop the pipeline at the given debug point
        --external-buffer-threshold=<uint>             - The threshold of placing external buffers
        --fake-quantize                                - Trigger the fake quantization (just for testing use)
        --fusion-tolerance=<number>                    - Additional computation tolerated during loop fusion (default: 100.0)
        --loop-tile-size=<uint>                        - The tile size of each loop (must be at least 1)
        --loop-unroll-factor=<uint>                    - The overall loop unrolling factor (set 0 to disable)
        --place-external-buffer                        - Place buffers in external memories
        --top-func=<string>                            - Specify the top function of the design
        --tosa-input                                   - Indicate that the input IR is TOSA
        --vectorize                                    - Vectorize with a factor of 2
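
The two PyTorch pipelines accept the same options. A hedged end-to-end sketch, assuming resnet18.mlir is a TOSA module exported through Torch-MLIR with top function forward, and assuming the companion scalehls-translate tool is used to emit HLS C++ (all names are illustrative):

scalehls-opt resnet18.mlir \
  --hida-pytorch-pipeline="top-func=forward loop-tile-size=8 loop-unroll-factor=4" \
  | scalehls-translate -scalehls-emit-hlscpp > resnet18.cpp
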
      --scalehls-dse-pipeline                          -   Launch design space exploration for C/C++ kernel
        --target-spec=<string>                         - File path: target backend specifications and configurations
        --top-func=<string>                            - Specify the top function of the design
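
A hedged invocation sketch; here config.json stands for a target-specification file describing the backend resources and constraints (the file name and its contents are assumptions):

scalehls-opt gemm.mlir \
  --scalehls-dse-pipeline="top-func=gemm target-spec=config.json" \
  -o gemm_dse.mlir
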
      --sparse-compiler                                -   The standard pipeline for taking sparsity-agnostic IR using the sparse-tensor type, and lowering it to LLVM IR with concrete representations and algorithms for sparse tensors.
        --enable-amx                                   - Enables the use of AMX dialect while lowering the vector dialect.
        --enable-arm-neon                              - Enables the use of ArmNeon dialect while lowering the vector dialect.       
        --enable-arm-sve                               - Enables the use of ArmSVE dialect while lowering the vector dialect.        
        --enable-index-optimizations                   - Allows the compiler to assume indices fit in 32 bits if that yields faster code
        --enable-runtime-library                       - Enable runtime library for manipulating sparse tensors
        --enable-x86vector                             - Enables the use of X86Vector dialect while lowering the vector dialect.     
        --parallelization-strategy=<value>             - Set the parallelization strategy
          =none                                        -   Turn off sparse parallelization.
          =dense-outer-loop                            -   Enable dense outer loop sparse parallelization.
          =any-storage-outer-loop                      -   Enable sparse parallelization regardless of storage for the outer loop.
          =dense-any-loop                              -   Enable dense parallelization for any loop.
          =any-storage-any-loop                        -   Enable sparse parallelization for any storage and loop.
        --reassociate-fp-reductions                    - Allows LLVM to reassociate floating-point reductions for speed
        --s2s-strategy=<int>                           - Set the strategy for sparse-to-sparse conversion
        --test-bufferization-analysis-only             - Run only the inplacability analysis
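
A hedged sketch of running this pipeline on a kernel that uses the sparse-tensor type (the file name spmv.mlir is illustrative):

scalehls-opt spmv.mlir \
  --sparse-compiler="enable-runtime-library=true parallelization-strategy=any-storage-outer-loop" \
  -o spmv_llvm.mlir
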
  --show-dialects                                      - Print the list of registered dialects
  --split-input-file                                   - Split the input file into pieces and process each chunk independently       
  --verify-diagnostics                                 - Check that emitted diagnostics match expected-* lines on the corresponding line
  --verify-each                                        - Run the verifier after each transformation pass
  --version                                            - Display the version of this program
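
The generic options above are handy when testing pass behavior. A brief sketch (the test file name is illustrative): --split-input-file processes each chunk separated by a // ----- line independently, and --verify-diagnostics checks emitted diagnostics against expected-* annotations in the file:

scalehls-opt test.mlir --split-input-file --verify-diagnostics

scalehls-opt --show-dialects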

Summary

scalehls-opt is the core optimization tool of the ScaleHLS framework, offering a rich set of options and passes for optimizing high-level synthesis designs. Used judiciously, these options let designers significantly improve the performance of the generated hardware, reduce resource usage, and simplify the design process.

This article has walked through the main option categories, the key passes, and practical usage examples of scalehls-opt, in the hope of helping readers understand and apply this powerful tool. With deeper familiarity and accumulated hands-on experience, designers can build more efficient, better-optimized hardware implementations that fully exploit the potential of FPGAs and ASICs.

Whether the task is tuning a single kernel or compiling a complex deep-learning model, scalehls-opt provides the tooling and flexibility needed to make high-level synthesis more efficient and controllable. By combining loop optimization, memory optimization, dataflow optimization, and interface creation, a high-level description can be carried all the way to a high-performance hardware implementation.
