How to Optimize the Softmax Operator with OpenAI Triton

Hi20240217

Last modified 2024-04-07 17:24:59

First published 2024-02-21 21:51:39

Article link: https://blog.csdn.net/m0_61864577/article/details/136221311

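This article shows how to optimize a Softmax operator with OpenAI Triton, from profiling the TensorRT Softmax operator to comparing different Triton Softmax implementations.
Background:

While profiling a model that contains a Softmax operator, the operator turned out to take longer than expected.
The operator has only 2 classes, so each row holds just 2 values. Processing one row at a time is bound to be inefficient.
Softmax was reimplemented in Triton and tested with two different tiling strategies. The more rows processed at a time, the shorter the runtime and the lower the lsu_mem_global_op_ld count (global-memory load traffic).
Triton greatly improves operator development efficiency.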
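
To make the row-blocking idea concrete, here is a plain-Python reference sketch. It is not the Triton kernel itself. It computes a numerically stable softmax for one row and counts how many program instances a launch would need when each instance handles 1 row versus 4*32 = 128 rows. The helper names `softmax_row` and `num_programs` are hypothetical and introduced only for illustration.

```python
import math

def softmax_row(row):
    """Numerically stable softmax of one row (subtract the max before exp)."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def num_programs(n_rows, rows_per_program):
    """Program instances needed to cover n_rows (ceiling division)."""
    return (n_rows + rows_per_program - 1) // rows_per_program

n_rows = 1024 * 1024                     # rows in the test input of this article
one_row = num_programs(n_rows, 1)        # 1 row per program instance
blocked = num_programs(n_rows, 4 * 32)   # 128 rows per program instance

print(softmax_row([0.0, 0.0]))  # two equal logits share the probability equally
print(one_row, blocked)
```

With only 2 values per row, the one-row variant launches over a million tiny work items, while the blocked variant needs 128 times fewer.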

Test environment

Property	Value
GPU model	NVIDIA GeForce RTX 3080 Ti
GPU development environment	Image built by following the linked steps
Triton version	2.1.0

Performance data

Implementation	Time (ms)	l1tex__t_bytes_pipe_lsu_mem_global_op_ld.sum (MB)
Theoretical		1024 x 1024 x 2 x 4 B = 8 MiB (8.39 MB)
TensorRT	0.6381	33.55
Triton, 1 row per load	0.74	33.55
Triton, 4*32 rows per load	0.02	8.39
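
The 4x gap between 8.39 MB and 33.55 MB in the table can be reproduced with simple arithmetic. The sketch below assumes that each global load is served in 32-byte sectors, so a load that touches only one 8-byte row still counts a full sector. This is an interpretation, not something stated in the measurements. The constant `SECTOR_BYTES` encodes that assumption.

```python
ROWS = 1024 * 1024      # rows in the input tensor
COLS = 2                # classes per row
BYTES_PER_ELEM = 4      # float32
SECTOR_BYTES = 32       # ASSUMPTION: granularity of one global-memory load

row_bytes = COLS * BYTES_PER_ELEM        # 8 bytes of useful data per row
ideal_bytes = ROWS * row_bytes           # every byte loaded exactly once
one_row_bytes = ROWS * SECTOR_BYTES      # one full sector fetched per row

print(f"ideal:          {ideal_bytes / 1e6:.2f} MB")
print(f"1 row per load: {one_row_bytes / 1e6:.2f} MB")
```

Under this assumption, loading one row at a time wastes three quarters of every sector, which is consistent with the two 33.55 MB rows in the table.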

Steps

Set up the environment

docker stop triton
docker rm triton
nvidia-docker run -ti -e NVIDIA_VISIBLE_DEVICES=all --privileged \
				--net=host -v $PWD:/home -w /home --name triton  cuda_dev_image:v1.0 /bin/bash

conda create -n triton python=3.9
conda activate triton
pip install torch onnx triton==2.1.0 matplotlib pandas -i https://pypi.tuna.tsinghua.edu.cn/simple

Measure the latency and memory traffic of the TensorRT Softmax operator

Model code

# model.py
# Minimal model with a single Softmax over 2 classes, exported to ONNX.

import torch

class UserModel(torch.nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, x):
        # Softmax along the class dimension (2 classes per row)
        return torch.softmax(x, dim=1)

torch.manual_seed(0)
# 1024*1024 rows, 2 classes per row, float32
x = torch.randn(1024*1024, 2)
model = UserModel()
output = model(x)
print(output.shape)

# Export with the sample input, which fixes the input shape
torch.onnx.export(model, x, "model.onnx",
                  export_params=True,
                  opset_version=10,
                  do_constant_folding=True)

Run the script to generate the ONNX model

python model.py

Build the engine with TensorRT and profile it to get the operator latency (0.6381 ms)

/usr/src/tensorrt/bin/trtexec --onnx=model.onnx --saveEngine=model.engine --fp16 --exportLayerInfo=fp16_layer.json --profilingVerbosity=detailed  \
        --dumpProfile=true --exportProfile="fp16_profile.json" --exportTimes="fp16_times.json" \
        --iterations=1 --duration=0
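
The per-layer time can then be read from the exported profile. The sketch below assumes the profile file is a JSON list whose per-layer entries carry `name` and `averageMs` keys. Check the actual file written by your trtexec version before relying on it. The helper `layer_times` and the inline sample (including the layer name `Softmax_0`) are hypothetical stand-ins so the snippet runs without a GPU.

```python
import json

def layer_times(profile_text, keyword):
    """Return {layer name: averageMs} for layers whose name contains keyword."""
    entries = json.loads(profile_text)
    return {
        e["name"]: e["averageMs"]
        for e in entries
        if isinstance(e, dict) and "name" in e
        and keyword.lower() in e["name"].lower()
    }

# HYPOTHETICAL sample in the assumed layout, not real trtexec output
sample = (
    '[{"count": 1},'
    ' {"name": "Softmax_0", "timeMs": 0.6381,'
    '  "averageMs": 0.6381, "percentage": 100.0}]'
)

softmax_times = layer_times(sample, "softmax")
print(softmax_times)
```

For a real run, pass the contents of `fp16_profile.json` instead of `sample`.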

Measure the memory traffic of one inference (LD: 33.55 MB, ST: 33.55 MB)

/usr/local/cuda/bin/ncu  \
    --metrics l1tex__t_bytes_pipe_lsu_mem_global_op_ld.sum,l1tex__t_bytes_pipe_lsu_mem_global_op_st.sum \
    /usr/src/tensorrt/bin/trtexec --loadEngine=model.engine --useSpinWait --iterations=1 --warmUp=0 --duration=0 --avgRuns=1