hipcc 编译 amd gpu kernel 和 打包与解包的流程实验

1, hip cuda kernel 编译概观

编译的文件流:

.hip kernel    --(clang++)-->                    .o

.o                    --(lld)-->                           .out

.out      --(clang-offload-bundler)-->     .hipfb

2,示例 hipcc -###

代码:

__global__ void WWWWW()
{
	((int*)0x8888888)[3] = 0x77777;
}

操作过程:


$ hipcc -### param_00.hip --cuda-device-only --offload-arch=gfx906 
AMD clang version 17.0.0 (https://github.com/RadeonOpenCompute/llvm-project roc-6.0.2 24012 af27734ed982b52a9f1be0f035ac91726fc697e4)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /opt/rocm-6.0.2/llvm/bin
Configuration file: /opt/rocm-6.0.2/lib/llvm/bin/clang++.cfg
clang: warning: argument unused during compilation: '--rtlib=compiler-rt' [-Wunused-command-line-argument]
clang: warning: argument unused during compilation: '-unwindlib=libgcc' [-Wunused-command-line-argument]
 "/opt/rocm-6.0.2/lib/llvm/bin/clang-17" \
 "-cc1" "-triple" "amdgcn-amd-amdhsa" "-aux-triple" \
 "x86_64-unknown-linux-gnu" "-emit-obj" "-disable-free" \
 "-clear-ast-before-backend" "-disable-llvm-verifier" \
 "-discard-value-names" "-main-file-name" "param_00.hip" \
 "-mrelocation-model" "pic" "-pic-level" "2" "-fhalf-no-semantic-interposition" \
 "-mframe-pointer=none" "-fno-rounding-math" "-mconstructor-aliases" "-aux-target-cpu" \
 "x86-64" "-fcuda-is-device" "-mllvm" "-amdgpu-internalize-symbols" "-fcuda-allow-variadic-functions" \
 "-fvisibility=hidden" "-fapply-global-visibility-to-externs" \
 "-mlink-builtin-bitcode" "/opt/rocm-6.0.2/lib/llvm/lib/clang/17.0.0/lib/amdgcn/bitcode/hip.bc" \
 "-mlink-builtin-bitcode" "/opt/rocm-6.0.2/lib/llvm/lib/clang/17.0.0/lib/amdgcn/bitcode/ocml.bc" \
 "-mlink-builtin-bitcode" "/opt/rocm-6.0.2/lib/llvm/lib/clang/17.0.0/lib/amdgcn/bitcode/ockl.bc" \
 "-mlink-builtin-bitcode" "/opt/rocm-6.0.2/lib/llvm/lib/clang/17.0.0/lib/amdgcn/bitcode/oclc_daz_opt_off.bc" \
 "-mlink-builtin-bitcode" "/opt/rocm-6.0.2/lib/llvm/lib/clang/17.0.0/lib/amdgcn/bitcode/oclc_unsafe_math_off.bc" \
 "-mlink-builtin-bitcode" "/opt/rocm-6.0.2/lib/llvm/lib/clang/17.0.0/lib/amdgcn/bitcode/oclc_finite_only_off.bc" \
 "-mlink-builtin-bitcode" "/opt/rocm-6.0.2/lib/llvm/lib/clang/17.0.0/lib/amdgcn/bitcode/oclc_correctly_rounded_sqrt_on.bc" \
 "-mlink-builtin-bitcode" "/opt/rocm-6.0.2/lib/llvm/lib/clang/17.0.0/lib/amdgcn/bitcode/oclc_wavefrontsize64_on.bc" \
 "-mlink-builtin-bitcode" "/opt/rocm-6.0.2/lib/llvm/lib/clang/17.0.0/lib/amdgcn/bitcode/oclc_isa_version_906.bc" \
 "-mlink-builtin-bitcode" "/opt/rocm-6.0.2/lib/llvm/lib/clang/17.0.0/lib/amdgcn/bitcode/oclc_abi_version_500.bc" \
 "-target-cpu" "gfx906" \
 "-debugger-tuning=gdb" "-resource-dir" "/opt/rocm-6.0.2/lib/llvm/lib/clang/17.0.0" \
 "-internal-isystem" "/opt/rocm-6.0.2/lib/llvm/lib/clang/17.0.0/include/cuda_wrappers" \
 "-idirafter" "/opt/rocm-6.0.2/lib/llvm/bin/../../../include" "-include" \
 "__clang_hip_runtime_wrapper.h" "-c-isystem" "/opt/rocm-6.0.2/llvm/include/gpu-none-llvm" \
 "-isystem" "/opt/rocm-6.0.2/include" "-internal-isystem" "/usr/lib/gcc/x86_64-linux-gnu/12/../../../../include/c++/12" \
 "-internal-isystem" "/usr/lib/gcc/x86_64-linux-gnu/12/../../../../include/x86_64-linux-gnu/c++/12" \
 "-internal-isystem" "/usr/lib/gcc/x86_64-linux-gnu/12/../../../../include/c++/12/backward" \
 "-internal-isystem" "/usr/lib/gcc/x86_64-linux-gnu/12/../../../../include/c++/12" \
 "-internal-isystem" "/usr/lib/gcc/x86_64-linux-gnu/12/../../../../include/x86_64-linux-gnu/c++/12" \
 "-internal-isystem" "/usr/lib/gcc/x86_64-linux-gnu/12/../../../../include/c++/12/backward" \
 "-internal-isystem" "/opt/rocm-6.0.2/lib/llvm/lib/clang/17.0.0/include" \
 "-internal-isystem" "/usr/local/include" \
 "-internal-isystem" "/usr/lib/gcc/x86_64-linux-gnu/12/../../../../x86_64-linux-gnu/include" \
 "-internal-externc-isystem" "/usr/include/x86_64-linux-gnu" "-internal-externc-isystem" "/include" \
 "-internal-externc-isystem" "/usr/include" "-internal-isystem" "/opt/rocm-6.0.2/lib/llvm/lib/clang/17.0.0/include" \
 "-internal-isystem" "/usr/local/include" "-internal-isystem" "/usr/lib/gcc/x86_64-linux-gnu/12/../../../../x86_64-linux-gnu/include" \
 "-internal-externc-isystem" "/usr/include/x86_64-linux-gnu" "-internal-externc-isystem" "/include" \
 "-internal-externc-isystem" "/usr/include" "-O3" "-fdeprecated-macro" "-fno-autolink" \
 "-fdebug-compilation-dir=/home/hipper/ex_amd_gpu_compiler/ex/parameters_hip/param_00" \
 "-ferror-limit" "19" "-fhip-new-launch-api" "-fgnuc-version=4.2.1" "-fcxx-exceptions" "-fexceptions" \
 "-fcolor-diagnostics" "-vectorize-loops" "-vectorize-slp" "-mllvm" "-amdgpu-early-inline-all=true" "-mllvm" \
 "-amdgpu-function-calls=false" "-cuid=3e1885b9958b336f" "-fcuda-allow-variadic-functions" "-faddrsig" \
 "-D__GCC_HAVE_DWARF2_CFI_ASM=1" \
 "-o" "/tmp/param_00-gfx906-a7e858.o" \
 "-x" "hip" "param_00.hip"




 "/opt/rocm-6.0.2/llvm/bin/lld" \
 "-flavor" "gnu" "-m" "elf64_amdgpu" \
 "--no-undefined" "-shared" \
 "-plugin-opt=-amdgpu-internalize-symbols" \
 "-plugin-opt=mcpu=gfx906" \
 "-plugin-opt=O3" "--lto-CGO3" \
 "-plugin-opt=-amdgpu-early-inline-all=true" \
 "-plugin-opt=-amdgpu-function-calls=false" \
 "--whole-archive" \
 "-o" "/tmp/param_00-gfx906-65b179.out" \
 "/tmp/param_00-gfx906-a7e858.o" \
 "--no-whole-archive"



 "/opt/rocm-6.0.2/llvm/bin/clang-offload-bundler" \
 "-type=o" "-bundle-align=4096" \
 "-targets=host-x86_64-unknown-linux,hipv4-amdgcn-amd-amdhsa--gfx906" \
 "-input=/dev/null" \
 "-input=/tmp/param_00-gfx906-65b179.out" \
 "-output=param_00.hip-hip-amdgcn-amd-amdhsa.hipfb"

涉及到了三个命令:

clang++                                   -o xxx.o

lld                                             -o  xxx.out

clang-offload-bundler           -output=xxx.hipfb

3,分析中间文件

 

3.1 clang++ 编译生成的 .o

.o 是一个elf文件

这个.o 是使用自己编译出来的clang++ 编译的,

其中,将 cp /opt/rocm/bin/clang++.cfg  local_amdgpu/llvm/bin/

稍作路径调整,编译生成 .o  :

$ "../../../../local_amdgpu/bin/clang-19" \
 "-cc1" "-triple" "amdgcn-amd-amdhsa" "-aux-triple" \
 "x86_64-unknown-linux-gnu" "-emit-obj" "-disable-free" \
 "-clear-ast-before-backend" "-disable-llvm-verifier" \
 "-discard-value-names" "-main-file-name" "param_00.hip" \
 "-mrelocation-model" "pic" "-pic-level" "2" "-fhalf-no-semantic-interposition" \
 "-mframe-pointer=none" "-fno-rounding-math" "-mconstructor-aliases" "-aux-target-cpu" \
 "x86-64" "-fcuda-is-device" "-mllvm" "-amdgpu-internalize-symbols" "-fcuda-allow-variadic-functions" \
 "-fvisibility=hidden" "-fapply-global-visibility-to-externs" \
 "-mlink-builtin-bitcode" "/opt/rocm-6.0.2/lib/llvm/lib/clang/17.0.0/lib/amdgcn/bitcode/hip.bc" \
 "-mlink-builtin-bitcode" "/opt/rocm-6.0.2/lib/llvm/lib/clang/17.0.0/lib/amdgcn/bitcode/ocml.bc" \
 "-mlink-builtin-bitcode" "/opt/rocm-6.0.2/lib/llvm/lib/clang/17.0.0/lib/amdgcn/bitcode/ockl.bc" \
 "-mlink-builtin-bitcode" "/opt/rocm-6.0.2/lib/llvm/lib/clang/17.0.0/lib/amdgcn/bitcode/oclc_daz_opt_off.bc" \
 "-mlink-builtin-bitcode" "/opt/rocm-6.0.2/lib/llvm/lib/clang/17.0.0/lib/amdgcn/bitcode/oclc_unsafe_math_off.bc" \
 "-mlink-builtin-bitcode" "/opt/rocm-6.0.2/lib/llvm/lib/clang/17.0.0/lib/amdgcn/bitcode/oclc_finite_only_off.bc" \
 "-mlink-builtin-bitcode" "/opt/rocm-6.0.2/lib/llvm/lib/clang/17.0.0/lib/amdgcn/bitcode/oclc_correctly_rounded_sqrt_on.bc" \
 "-mlink-builtin-bitcode" "/opt/rocm-6.0.2/lib/llvm/lib/clang/17.0.0/lib/amdgcn/bitcode/oclc_wavefrontsize64_on.bc" \
 "-mlink-builtin-bitcode" "/opt/rocm-6.0.2/lib/llvm/lib/clang/17.0.0/lib/amdgcn/bitcode/oclc_isa_version_906.bc" \
 "-mlink-builtin-bitcode" "/opt/rocm-6.0.2/lib/llvm/lib/clang/17.0.0/lib/amdgcn/bitcode/oclc_abi_version_500.bc" \
 "-target-cpu" "gfx906" \
 "-debugger-tuning=gdb" "-resource-dir" "/opt/rocm-6.0.2/lib/llvm/lib/clang/17.0.0" \
 "-internal-isystem" "/opt/rocm-6.0.2/lib/llvm/lib/clang/17.0.0/include/cuda_wrappers" \
 "-idirafter" "/opt/rocm-6.0.2/lib/llvm/bin/../../../include" "-include" \
 "__clang_hip_runtime_wrapper.h" "-c-isystem" "/opt/rocm-6.0.2/llvm/include/gpu-none-llvm" \
 "-isystem" "/opt/rocm-6.0.2/include" "-internal-isystem" "/usr/lib/gcc/x86_64-linux-gnu/12/../../../../include/c++/12" \
 "-internal-isystem" "/usr/lib/gcc/x86_64-linux-gnu/12/../../../../include/x86_64-linux-gnu/c++/12" \
 "-internal-isystem" "/usr/lib/gcc/x86_64-linux-gnu/12/../../../../include/c++/12/backward" \
 "-internal-isystem" "/usr/lib/gcc/x86_64-linux-gnu/12/../../../../include/c++/12" \
 "-internal-isystem" "/usr/lib/gcc/x86_64-linux-gnu/12/../../../../include/x86_64-linux-gnu/c++/12" \
 "-internal-isystem" "/usr/lib/gcc/x86_64-linux-gnu/12/../../../../include/c++/12/backward" \
 "-internal-isystem" "/opt/rocm-6.0.2/lib/llvm/lib/clang/17.0.0/include" \
 "-internal-isystem" "/usr/local/include" \
 "-internal-isystem" "/usr/lib/gcc/x86_64-linux-gnu/12/../../../../x86_64-linux-gnu/include" \
 "-internal-externc-isystem" "/usr/include/x86_64-linux-gnu" "-internal-externc-isystem" "/include" \
 "-internal-externc-isystem" "/usr/include" "-internal-isystem" "/opt/rocm-6.0.2/lib/llvm/lib/clang/17.0.0/include" \
 "-internal-isystem" "/usr/local/include" "-internal-isystem" "/usr/lib/gcc/x86_64-linux-gnu/12/../../../../x86_64-linux-gnu/include" \
 "-internal-externc-isystem" "/usr/include/x86_64-linux-gnu" "-internal-externc-isystem" "/include" \
 "-internal-externc-isystem" "/usr/include" "-O3" "-fdeprecated-macro" "-fno-autolink" \
 "-fdebug-compilation-dir=/home/hipper/ex_amd_gpu_compiler/ex/parameters_hip/param_00" \
 "-ferror-limit" "19" "-fhip-new-launch-api" "-fgnuc-version=4.2.1" "-fcxx-exceptions" "-fexceptions" \
 "-fcolor-diagnostics" "-vectorize-loops" "-vectorize-slp" "-mllvm" "-amdgpu-early-inline-all=true" "-mllvm" \
 "-amdgpu-function-calls=false" "-cuid=3e1885b9958b336f" "-fcuda-allow-variadic-functions" "-faddrsig" \
 "-D__GCC_HAVE_DWARF2_CFI_ASM=1" \
 "-o" "./param_00-gfx906-a7e858.o" \
 "-x" "hip" "param_00.hip"

3.2 lld 链接得到的.out 文件

也属于 elf 文件,但是是DYN (Shared object file) 类型,不再是 relocationable 类型。

其生成命令也是稍作了路径改动:

"../../../../local_amdgpu/bin/lld" \
 "-flavor" "gnu" "-m" "elf64_amdgpu" \
 "--no-undefined" "-shared" \
 "-plugin-opt=-amdgpu-internalize-symbols" \
 "-plugin-opt=mcpu=gfx906" \
 "-plugin-opt=O3" "--lto-CGO3" \
 "-plugin-opt=-amdgpu-early-inline-all=true" \
 "-plugin-opt=-amdgpu-function-calls=false" \
 "--whole-archive" \
 "-o" "./param_00-gfx906-65b179.out" \
 "./param_00-gfx906-a7e858.o" \
 "--no-whole-archive"

3.3 hipfb 文件

这个类型的文件 是由 clang-offload-bundler 打包而成,这里没有什么新意,对 clang-offload-bundler做一个介绍:

clang-offload-bundler 是一个工具,它是 Clang/LLVM 编译器工具链的一部分,用于支持在多种设备上进行异构计算。这个工具主要用于处理和打包不同目标设备(如 CPU、GPU、FPGA 等)的代码,以便在一个单一的程序中支持多种计算设备。这种技术通常用于加速应用程序,特别是在需要大量并行处理的场景中。

3.3.1  生成 hipfb 文件的方法


 "../../../../local_amdgpu/bin/clang-offload-bundler" \
 "-type=o" "-bundle-align=4096" \
 "-targets=host-x86_64-unknown-linux,hipv4-amdgcn-amd-amdhsa--gfx906" \
 "-input=/dev/null" \
 "-input=./param_00-gfx906-65b179.out" \
 "-output=param_00.hip-hip-amdgcn-amd-amdhsa.hipfb"

3.3.2  主要功能和作用

1. 打包和解包代码对象:

  • clang-offload-bundler 能够将针对不同设备的代码对象(如 CPU 和 GPU 的代码)打包到一个单一的文件中。这使得管理和分发针对异构计算平台的应用程序变得更加简单。
  • 同样,它也可以从这种打包的文件中提取特定目标设备的代码对象,以便在适当的设备上执行。

2. 支持异构编程:

  • 在异构编程中,开发者可能需要将程序的不同部分优化并编译到不同的硬件平台上。clang-offload-bundler 通过管理这些不同的代码段,简化了构建和部署过程。

3. 简化编译和链接流程:

  • 在使用 OpenMP 或 CUDA 等并行编程模型时,clang-offload-bundler 能够处理主机代码和加速器代码之间的交互,包括数据传输和执行控制。这样,开发者可以更专注于代码的并行部分,而不是底层的数据管理和设备控制。

4. 提高性能和可移植性:

  • 通过允许代码针对特定硬件进行优化,clang-offload-bundler 帮助提高应用程序的性能。同时,它也支持代码的可移植性,因为同一个应用程序可以针对多种硬件平台进行编译和打包。

3.3.3  使用场景

  • 并行计算应用:在需要大量计算资源的应用中,如科学计算、图像处理、机器学习等,clang-offload-bundler 可以帮助开发者有效地利用多种计算资源。
  • 开发跨平台应用:对于需要在多种硬件设备上运行的软件,如桌面和移动设备,或者 CPU 和 GPU,clang-offload-bundler 提供了一种统一的方式来处理不同平台的代码。

意义:clang-offload-bundler 还是挺强大的,用于支持和简化异构计算环境中的编程和部署过程。它通过管理针对不同硬件平台的代码,使得开发高性能并行应用程序变得更加高效和可行。

3.3.4 解析 hipfb的方法

clang-offload-bundler -type=o -targets=hip-amdgcn-amd-amdhsa--gfx906 -input=param_00.hip-hip-amdgcn-amd-amdhsa.hipfb -output=device_output.o -unbundle

于是又得到了 bundle前的out 文件:

甚至连文件大小都一样:

  • 24
    点赞
  • 25
    收藏
    觉得还不错? 一键收藏
  • 0
    评论
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值