OpenAI Triton JIT: just-in-time compilation

huangma.

Published 2024-07-25 11:48:05

Column: PyTorch usage research. Tags: deep learning, pytorch, neural networks

Article link: https://blog.csdn.net/weixin_44670163/article/details/140685561


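Preface

JIT (Just-in-Time) and AOT (Ahead-of-Time) are the two most common compilation modes.
JIT compiles at run time. It is typically used during the development cycle: code can be delivered and executed dynamically, so development and testing are efficient, but run speed and execution performance suffer because compilation happens while the program is running.
AOT compiles in advance and produces binary code that can be executed directly. It runs fast and performs well, but the code has to be compiled before every run, so development and testing are less efficient.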

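What is Triton

Triton is an open-source, Python-like programming language that lets researchers with no CUDA experience write efficient GPU code, most of the time on par with code written by experts. It aims to reach peak hardware performance with relatively little effort.
See the official Triton documentation and the GitHub repository.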

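I. Triton compiler architecture

The Triton compilation architecture is shown in the figure below. Triton itself already implements the two most common GPU backends, NVIDIA and AMD.
The middle branch of the figure is the optimization path Triton supports for NVIDIA GPUs. The Triton DSL code written by the user is first translated into a Python AST, and the AST is then translated into the corresponding Triton dialect (ttir). From this step on, the user's hand-written code has formally moved into the MLIR ecosystem. ttir is then further optimized and lowered to the Triton GPU dialect (ttgir). From ttgir onward the flow is fairly standard LLVM code generation: the code is lowered to LLVM IR and then all the way down to PTX, which can finally run on NVIDIA GPUs.
[Figure: Triton compiler architecture]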

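Triton's compilation flow for CUDA is as follows:
● Parse the kernel function into an AST (abstract syntax tree)
● Generate Triton IR code from the AST
● Convert the Triton IR code into Triton GPU IR code
● Convert the Triton GPU IR code into LLVM IR code
● Use LLVM to convert the LLVM IR code into PTX code
● Use ptxas to convert the PTX code into cubin machine code
Triton IR, Triton GPU IR, and LLVM IR are abbreviated as ttir, ttgir, and llir respectively. ttir and ttgir are intermediate representations defined by the Triton project itself. Compared with ttir, ttgir mainly adds the Blocked Layout, which describes a block's memory access pattern. llir is the intermediate representation of the LLVM project.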
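The stage chain above can be pictured as an ordered list of functions, each consuming the previous stage's output. The following is only a minimal sketch of that idea: `make_stage`, `STAGES`, and `run_pipeline` are hypothetical names, and each stage merely wraps its input in a label instead of doing any real lowering.

```python
from typing import Callable, Dict, List, Tuple


def make_stage(name: str) -> Callable[[str], str]:
    """Build a stand-in stage that only tags its input with the stage name."""
    def stage(src: str) -> str:
        return f"{name}({src})"
    return stage


# Ordered like the list above: ttir -> ttgir -> llir -> ptx -> cubin.
STAGES: List[Tuple[str, Callable[[str], str]]] = [
    (name, make_stage(name)) for name in ("ttir", "ttgir", "llir", "ptx", "cubin")
]


def run_pipeline(ast_text: str) -> Dict[str, str]:
    """Run every stage in order and keep each intermediate result by name."""
    artifacts: Dict[str, str] = {}
    current = ast_text
    for name, stage in STAGES:
        current = stage(current)
        artifacts[name] = current
    return artifacts


asm = run_pipeline("ast")
print(asm["cubin"])
```

Keeping every intermediate result in a dictionary keyed by stage name makes it easy to inspect any one representation after compilation.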

II. Compilation implementation details (triton.jit)

1. jit() is a decorator. It returns a JITFunction object, which inherits from KernelInterface.

@overload
def jit(fn: T) -> JITFunction[T]:
    ...


@overload
def jit(
    *,
    version=None,
    do_not_specialize: Optional[Iterable[int]] = None,
    debug: Optional[bool] = None,
    noinline: Optional[bool] = None,
) -> Callable[[T], JITFunction[T]]:
    ...


def jit(
    fn: Optional[T] = None,
    *,
    version=None,
    do_not_specialize: Optional[Iterable[int]] = None,
    debug: Optional[bool] = None,
    noinline: Optional[bool] = None,
) -> Union[JITFunction[T], Callable[[T], JITFunction[T]]]:
    def decorator(fn: T) -> JITFunction[T]:
        assert callable(fn)
        if os.getenv("TRITON_INTERPRET", "0") == "1":
            return InterpretedFunction(fn)
        else:
            return JITFunction(
                fn,
                version=version,
                do_not_specialize=do_not_specialize,
                debug=debug,
                noinline=noinline,
            )
    if fn is not None:
        # Used as a bare decorator: @jit
        return decorator(fn)
    else:
        # Used with keyword options: @jit(debug=True)
        return decorator
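The pattern in jit() (one function that works both as `@jit` and as `@jit(...)`) can be reproduced in isolation. The sketch below is not Triton code: `FakeJITFunction` and `jit_like` are hypothetical stand-ins that only record the wrapped function and its options.

```python
class FakeJITFunction:
    """Hypothetical stand-in for JITFunction: stores the function and options."""

    def __init__(self, fn, debug=None, noinline=None):
        self.fn = fn
        self.debug = debug
        self.noinline = noinline


def jit_like(fn=None, *, debug=None, noinline=None):
    """Decorator usable both bare (@jit_like) and with options (@jit_like(...))."""
    def decorator(f):
        assert callable(f)
        return FakeJITFunction(f, debug=debug, noinline=noinline)

    if fn is not None:
        # Bare form: Python passed the function directly.
        return decorator(fn)
    # Option form: return the decorator so Python applies it next.
    return decorator


@jit_like
def kernel_a(x):
    return x


@jit_like(debug=True)
def kernel_b(x):
    return x


print(type(kernel_a).__name__, kernel_a.debug, kernel_b.debug)
```

Making the options keyword-only (the `*` in the signature) is what lets the function tell the two call forms apart: a positional argument can only be the decorated function.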

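The JITFunction object mainly does two things: it collects the kernel function's signature (input arguments, constexpr, configs, and so on), and it registers the backend and imports compiler.py. The run() method of JITFunction calls compile() in compiler.py to start compilation.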
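To give a feel for what "collecting the signature" involves, here is a small sketch built on the standard `inspect` module. It is an assumption-laden illustration, not JITFunction's actual logic: the `constexpr` class is a hypothetical marker standing in for `triton.language.constexpr`, and `collect_signature` is a made-up helper.

```python
import inspect


class constexpr:
    """Hypothetical marker type, standing in for triton.language.constexpr."""


def collect_signature(fn):
    """Return the argument names and the indices of constexpr-annotated arguments."""
    params = inspect.signature(fn).parameters
    arg_names = list(params)
    constexpr_idx = [
        i for i, p in enumerate(params.values()) if p.annotation is constexpr
    ]
    return arg_names, constexpr_idx


def add_kernel(x_ptr, y_ptr, n_elements, BLOCK_SIZE: constexpr):
    pass  # the body is irrelevant for signature collection


names, consts = collect_signature(add_kernel)
print(names, consts)
```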

2. compiler.py

The compile() function in compiler.py compiles the kernel in two phases: building a .so file and caching it, and running the IR conversions through several stages.
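2.1 Building a .so file
This phase has two main steps. The first creates a main.c file and writes the launcher code into it. The second compiles that file into a .so file. The dlc and cuda backends build differently: dlc builds in the DLC_Custom_Kernel environment, while cuda goes through an existing interface (the _build() function in build.py) that uses clang or gcc to compile for the CUDA/HIP environment.
The launcher code for cuda is cuLaunchKernel(function, XXXX).
The cuda implementation calls the make_stub() function in make_launcher.py to generate the .so file and returns so_path.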
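The "write main.c, build a .so, cache it" flow can be sketched as follows. This is a hypothetical illustration only: `make_stub_sketch`, `placeholder_build`, and the launcher template are invented for the example, the cache key scheme (a hash of the generated source) is an assumption, and the build step writes placeholder bytes where a real implementation would invoke gcc or clang.

```python
import hashlib
import os
import tempfile

# Hypothetical launcher template; the real generated C code is much longer.
LAUNCHER_TEMPLATE = """// launcher stub for kernel {name}
void launch(void) {{ /* a real stub would call the driver API here */ }}
"""


def placeholder_build(src_path: str, so_path: str) -> None:
    """Stand-in for the compile step (a real build would invoke gcc or clang)."""
    with open(so_path, "wb") as f:
        f.write(b"placeholder")


def make_stub_sketch(name: str, cache_dir: str, build=placeholder_build):
    """Return (so_path, cache_hit); build the stub only when it is not cached."""
    src = LAUNCHER_TEMPLATE.format(name=name)
    key = hashlib.sha256(src.encode()).hexdigest()
    build_dir = os.path.join(cache_dir, key)
    so_path = os.path.join(build_dir, f"{name}.so")
    if os.path.exists(so_path):
        return so_path, True
    os.makedirs(build_dir, exist_ok=True)
    src_path = os.path.join(build_dir, "main.c")
    with open(src_path, "w") as f:
        f.write(src)
    build(src_path, so_path)
    return so_path, False


cache_dir = tempfile.mkdtemp()
first_path, first_hit = make_stub_sketch("add_kernel", cache_dir)
second_path, second_hit = make_stub_sketch("add_kernel", cache_dir)
print(first_hit, second_hit)
```

The second call finds the file from the first call and skips the build, which is the point of caching the launcher.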

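2.2 The IR stage conversions for cuda are as follows:
● ast_to_ttir: CodeGenerator(ast.NodeVisitor) does most of the work; the result is an mlir::ModuleOp object exposed from the C++ side through Pybind
● ttir_to_ttgir: the pm variable is an mlir::PassManager object exposed from the C++ side through Pybind
● ttgir_to_llir: the core function gpu_to_llvmir is likewise the C++ function translateTritonGPUToLLVMIR exposed through Pybind
● llir_to_ptx: the core function translate_llvmir_to_ptx is mainly implemented by the C++ function translateLLVMIRToPTX
● ptx_to_cubin: the core function compile_ptx_to_cubin is mainly implemented on the C++ side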
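The first stage relies on Python's standard `ast.NodeVisitor`. The sketch below shows the visitor mechanism on a toy function; `OpCollector` is a hypothetical class that only records operation names, whereas the real CodeGenerator emits IR operations for each node.

```python
import ast
import textwrap

KERNEL_SRC = textwrap.dedent("""
    def add_kernel(x, y):
        z = x + y
        return z * 2
""")


class OpCollector(ast.NodeVisitor):
    """Toy visitor: records the operations it meets, in evaluation order."""

    def __init__(self):
        self.ops = []

    def visit_BinOp(self, node):
        self.generic_visit(node)  # visit the operands first
        self.ops.append(type(node.op).__name__)

    def visit_Return(self, node):
        self.generic_visit(node)  # visit the returned expression first
        self.ops.append("Return")


collector = OpCollector()
collector.visit(ast.parse(KERNEL_SRC))
print(collector.ops)
```

Nodes without a dedicated `visit_*` method fall through to `generic_visit`, which is why the visitor reaches the expressions nested inside the function body.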

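The last step of the compile function returns the compiled handle object, CompiledKernel(fn, so_path, metadata, asm). The main logic is implemented by CompiledKernel. In the jit function that calls compile, the statement that finally performs the launch is: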

bin.c_wrapper(grid_0, grid_1, grid_2, bin.num_warps, bin.shared, stream, bin.cu_function, triton.compiler.CompiledKernel.launch_enter_hook, triton.compiler.CompiledKernel.launch_exit_hook, bin, *args)

class CompiledKernel:

    def __init__(self, fn, so_path, metadata, asm):
        # initialize launcher
        import importlib.util
        spec = importlib.util.spec_from_file_location("__triton_launcher", so_path)
        mod = importlib.util.module_from_spec(spec)
        self.fn = fn
        spec.loader.exec_module(mod)
        self.c_wrapper = getattr(mod, "launch")    # <-------- this is the c_wrapper called above
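The importlib calls used in `__init__` can be tried without a compiled .so. In this sketch a small Python file plays the role of the launcher module; the file name, the module name, and the body of `launch` are all hypothetical.

```python
import importlib.util
import os
import tempfile

# Write a tiny Python module that plays the role of the compiled launcher.
tmp_dir = tempfile.mkdtemp()
launcher_path = os.path.join(tmp_dir, "fake_launcher.py")
with open(launcher_path, "w") as f:
    f.write("def launch(*args):\n    return ('launched', args)\n")

# Same three steps as in CompiledKernel.__init__: spec, module, exec.
spec = importlib.util.spec_from_file_location("__fake_launcher", launcher_path)
mod = importlib.util.module_from_spec(spec)
spec.loader.exec_module(mod)

# Look up the entry point by name, as done for c_wrapper.
c_wrapper = getattr(mod, "launch")
result = c_wrapper(1, 1, 1)
print(result)
```

Loading by file path rather than by module name means the launcher does not have to live on `sys.path`, which suits files stored in a cache directory.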

Parsing the kernel function into an AST and generating Triton IR code (compiler.py)
ast_to_ttir is defined in code_generator.py.
