Pytorch中cuda kernel进行离散访存的几个基本函数

来自底层的仰望

于 2023-07-31 21:43:03 发布

阅读量163

点赞数

文章标签： pytorch 人工智能 c++

本文链接：https://blog.csdn.net/weixin_43969854/article/details/132030902

版权

基本概念

Pytorch原生CUDA C的大部分kernel都支持离散数据访存，其主要的逻辑都是通过thread id去匹配到输入输出Tensor的内存索引。

熟悉Pytorch的同学都知道，Tensor的内存索引在pytorch中是通过size，stride和offset来表示的。

基本的规则便是：

$index = \sum size[i] * stride[i]$

假设一个Tensor a 的size为（2, 4, 4），stride为（16, 4, 1）对其进行slice操作后，Tensor b的size为（2, 4, 2），stride为（16, 4, 2）

那么对应点 $b[1][3][1]$ 的位置便是原生内存上起始点向后的： $1\ast 16+3\ast 4+1\ast 2=30$ 偏移。

python代码：

>>> a = torch.randn((2,4,4))
>>> b = a[:,:,0:3:2]
>>> b.size()
torch.Size([2, 4, 2])
>>> b.stride()
(16, 4, 2)
>>> b[1][3][1]
tensor(0.4909)
>>> a.flatten()[30]
tensor(0.4909)

C++代码实现

对应C++层面的实现中，cuda c代码进行数据读取时就是通过size和stride类似的逻辑关系，和thread idx进行映射。其主要函数分别为：

1 TensorInfo类

代码位置： aten/src/ATen/cuda/detail/TensorInfo.cuh

其是对Tensor信息的抽象，主要抽象Tensor的dim，size和stride信息。存在两个模板参数：分别表示dptr的数据类型和size，stride的数据类型。

1.1）最大支持维度为25;

1.2）支持维度折叠的逻辑，代码路径： aten/src/ATen/CollapseDims.h

1.2.1）只支持正向的维度折叠功能，核心的处理是 size[i] = size[i] * stride[i];

1.2.2）支持中间插值的维度折叠功能，即excludeDim;

注释：其维度折叠的核心逻辑就是借助于公式 $size[i] = size[i+1] * stride[i+1]$ 来判断维度间是否连续。（维度值为1可以天然不考虑）

    for (; oldIndex < stopDim; ++oldIndex) {
      if (sizes[oldIndex] == 1) {
        continue;
      }

      if (strides[newIndex] == sizes[oldIndex] * strides[oldIndex]) {
        sizes[newIndex] *= sizes[oldIndex];
        strides[newIndex] = strides[oldIndex];
      } else {
        ++newIndex;
        sizes[newIndex] = sizes[oldIndex];
        strides[newIndex] = strides[oldIndex];
      }
    }

2 IndexToOffset类

代码位置： aten/src/ATen/cuda/detail/TensorInfo.cuh

其核心逻辑就是通过LinearId（通常就是GPU的thread_id）去映射返回Tensor的真实内存位置。

2.1）通过特化支持动态Dim的方式，也就是info.dim。但看起来这种方式不利于nvcc编译，导致编译时间长。

其核心逻辑get()函数就是通过size和stride取余进行获取内存偏移位置。

template <typename T, typename IndexType>
struct IndexToOffset<T, IndexType, -1> {
  static inline __host__ __device__ IndexType get(
    IndexType linearId,
    const TensorInfo<T, IndexType>& info) {

      IndexType offset = 0;

      for (int i = info.dims - 1; i > 0; --i) {
        IndexType curDimIndex = linearId % info.sizes[i];
        IndexType curDimOffset = curDimIndex * info.strides[i];
        offset += curDimOffset;
        linearId /= info.sizes[i];
      }

      return offset + linearId * info.strides[0];
  }
};

3 getTensorInfo函数实现

代码位置：aten/src/ATen/cuda/detail/IndexUtils.cuh

其核心逻辑就是通过构造TensorInfo对象并返回。其中scalarType是通过data_ptr<scalarType>来保证一致。

来自底层的仰望

关注

0
点赞
踩
1

收藏

觉得还不错? 一键收藏
1
评论
Pytorch中cuda kernel进行离散访存的几个基本函数

Pytorch原生CUDA C的大部分kernel都支持离散数据访存，其主要的逻辑都是通过thread id去匹配到输入输出Tensor的内存索引。熟悉Pytorch的同学都知道，Tensor的内存索引在pytorch中是通过size，stride和offset来表示的。基本的规则便是：假设一个Tensor a 的size为（2, 4, 4），stride为（16, 4, 1）对其进行slice操作后，Tensor b的size为（2, 4, 2），stride为（16, 4, 2）那么对应点。
复制链接

扫一扫