TH Library Study (2): A High-Level Understanding of THTensorApply (Simplified)

Latest recommended article published on 2024-03-03 09:27:59

爆米花好美啊

Category: TH库源码学习 (TH library source code study). Tags: TH, torch

Article link: https://blog.csdn.net/u013010889/article/details/79669903

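Note: most of the ideas and explanations in this article come from:
[1] PyTorch源码浅析（一） (A Brief Analysis of the PyTorch Source Code, Part 1)
[2] PyTorch源码浅析（二） (A Brief Analysis of the PyTorch Source Code, Part 2)
[3] tiny_lib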

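The TensorApply family of macros is the most important building block TH uses to implement element-wise tensor operations: the macros take an operation written for scalars and apply it to every element of one or more tensors. On the GPU side, the counterpart is essentially a map operation. The general approach is to process contiguous runs of memory first and only then step across the non-contiguous parts, which raises the CPU cache hit rate.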
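
To make "apply a scalar operation to many elements" concrete, here is a minimal sketch in C of the macro-with-a-code-body pattern. It is not TH's macro: `SKETCH_APPLY_CONTIGUOUS`, `sketch_scale` and `sketch_sum` are hypothetical names made up for illustration, and the sketch assumes the data is a single contiguous buffer, so it covers only the fast path described above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for an apply macro (NOT the TH one).
 * It only handles a contiguous buffer of N elements. CODE is pasted into
 * the loop body and reaches the current element through NAME##_data. */
#define SKETCH_APPLY_CONTIGUOUS(TYPE, NAME, N, CODE)                    \
  do {                                                                  \
    TYPE *NAME##_data = (NAME);                                         \
    ptrdiff_t NAME##_i;                                                 \
    for (NAME##_i = 0; NAME##_i < (N); NAME##_i++, NAME##_data++) {     \
      CODE                                                              \
    }                                                                   \
  } while (0)

/* Multiply every element of buf by k, in place. */
static void sketch_scale(float *buf, ptrdiff_t n, float k)
{
  assert(buf != NULL || n == 0);
  SKETCH_APPLY_CONTIGUOUS(float, buf, n, *buf_data *= k;);
}

/* Sum every element of buf. */
static double sketch_sum(const float *buf, ptrdiff_t n)
{
  double acc = 0.0;
  assert(buf != NULL || n == 0);
  SKETCH_APPLY_CONTIGUOUS(const float, buf, n, acc += *buf_data;);
  return acc;
}
```

The caller writes only the per-element statement; the macro owns the loop. Handling strided, non-contiguous tensors is what the rest of this post is about.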

/*
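1. Start looping from the outermost dimension (here, the last index), because by the stride/size
formula that dimension has stride 1 in a contiguous layout. Then move inward one dimension at a
time, checking whether memory is still contiguous, until reaching the first dimension where it is
not. Call the tensor formed by these outer dimensions A; it is contiguous, with stride 1. Call the
tensor formed by the remaining inner dimensions Tensor B. Treat one whole A as a single element of
Tensor B, so Tensor B is made up of one A after another.

2. Then loop over Tensor B, starting from its innermost dimension. Every element taken from
Tensor B is one contiguous A, which is processed with a plain loop over contiguous memory
(contiguous access benefits from prefetching, so the cache hit rate is high).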
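Example:
Contiguous memory, size 3,4  stride 4,1
1  2  3  4
5  6  7  8
9 10 11 12
One Tensor that views it has size 2,3  stride 4,1
2 3 4
6 7 8
Starting from the outermost dimension, dimension 1 is contiguous but dimension 0 is not:
stride[0] = 4, whereas size[1] * stride[1] = 3 (check this against the size/stride formula covered earlier).
The Tensor spanning the outermost dimension 1 is A, and the Tensor spanning the inner dimension 0 is B.
B has two elements, i.e. two copies of A. Each step takes one maximal contiguous run A and operates
on it, twice in total.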
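It all comes down to these two macros:
Collect the Tensor's memory layout and related information (stride, size, etc.)
#define __TH_TENSOR_APPLYX_PREAMBLE(TYPE, TENSOR, DIM, ALLOW_CONTIGUOUS)
Advance the address to the next contiguous block of memory
#define  __TH_TENSOR_APPLYX_UPDATE_COUNTERS(TENSOR, ALWAYS_UPDATE)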
*/
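
The A/B walk described in the comment above can be sketched in plain C. This is an illustration under stated assumptions, not the TH implementation: `sketch_run_length` and `sketch_gather` are hypothetical helpers, all sizes are assumed to be positive, and the offset of each run is recomputed from the counters for readability, whereas the real macros update the address incrementally.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_MAX_DIM 8  /* arbitrary limit chosen for this sketch */

/* Length of the contiguous run formed by the trailing dimensions ("A").
 * On return, *b_ndim is the number of leading dimensions left for "B"
 * (0 means the whole view is a single contiguous run). */
static ptrdiff_t sketch_run_length(int ndim, const ptrdiff_t *size,
                                   const ptrdiff_t *stride, int *b_ndim)
{
  ptrdiff_t run = 1;
  int d = ndim - 1;
  /* Walk from the last dimension towards the first while the stride
   * equals the number of elements spanned by the dimensions after it. */
  while (d >= 0 && stride[d] == run) {
    run *= size[d];
    d--;
  }
  *b_ndim = d + 1;
  return run;
}

/* Copy the elements of a strided view into out[] in logical order and
 * return how many were copied. data points at the view's first element,
 * i.e. storage + storageOffset. */
static ptrdiff_t sketch_gather(const int *data, int ndim,
                               const ptrdiff_t *size, const ptrdiff_t *stride,
                               int *out)
{
  ptrdiff_t counter[SKETCH_MAX_DIM] = {0};
  ptrdiff_t n = 0, run, i;
  int b_ndim, d;

  assert(ndim > 0 && ndim <= SKETCH_MAX_DIM);
  run = sketch_run_length(ndim, size, stride, &b_ndim);

  for (;;) {
    /* Offset of the current run: sum of counter[d] * stride[d] over B. */
    ptrdiff_t offset = 0;
    for (d = 0; d < b_ndim; d++) offset += counter[d] * stride[d];

    /* One contiguous run "A": a plain linear walk. */
    for (i = 0; i < run; i++) out[n++] = data[offset + i];

    /* Advance B's counters like an odometer, last B dimension fastest. */
    for (d = b_ndim - 1; d >= 0; d--) {
      if (++counter[d] < size[d]) break;
      counter[d] = 0;
    }
    if (d < 0) break;  /* every counter wrapped around: done */
  }
  return n;
}
```

For a view with size 2,3 and stride 4,1 that starts at the second element of the storage 1..12, the run length is 3, B has one dimension of size 2, and the elements come out as 2 3 4 6 7 8.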
// ##########################################################
/*
 * The basic strategy for apply is as follows:
 *
 * 1. Starting with the outermost index, loop until we reach a dimension where the
 * data is no longer contiguous, i.e. the stride at that dimension is not equal to
 * the size of the tensor defined by the outer dimensions. Let's call this outer
 * (contiguous) tensor A. Note that if the Tensor is contiguous, then A is equal
 * to the entire Tensor. Let's call the inner tensor B.
 *
 * 2. We loop through the indices in B, starting at its outermost dimension. For
 * example, if B is a 2x2 matrix, then we do:
 *
 * B[0][0]
 * B[0][1]
 * B[1][0]
 * B[1][1]
 *
 * We set the offset into the underlying storage as (storageOffset + stride_B * index_B),
 * i.e. basically we compute the offset into the storage as we would normally for a
 * Tensor. But because we are guaranteed the subsequent data is contiguous in memory, we
 * can simply loop for sizeof(A) iterations and perform the operation, without having to
 * follow the order described by the strides of A.
 *
 * 3. As an optimization, we merge dimensions of A that are contiguous in memory. For
 * example, if A is a 3x3x3x3 tensor narrowed from a 3x3x4x3 tensor, then the first two
 * dimensions can be merged for the purposes of APPLY, reducing the number of nested
 * loops.
 */
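
Step 3 of the comment above, merging dimensions that are contiguous with each other, can be sketched as follows. `sketch_merge_dims` is a hypothetical helper written for this post, not a TH function. It assumes the usual rule that dimension d can be folded into dimension d+1 when stride[d] == stride[d+1] * size[d+1].

```c
#include <assert.h>
#include <stddef.h>

/* Collapse adjacent dimensions that are contiguous with each other.
 * size[] and stride[] are rewritten in place; the return value is the
 * new number of dimensions. */
static int sketch_merge_dims(int ndim, ptrdiff_t *size, ptrdiff_t *stride)
{
  int out = 0;  /* index of the merged dimension currently being built */
  int d;

  assert(ndim > 0);
  for (d = 1; d < ndim; d++) {
    if (stride[out] == stride[d] * size[d]) {
      /* Dimension d continues the block started at out: fold it in. */
      size[out] *= size[d];
      stride[out] = stride[d];
    } else {
      /* There is a gap in memory here: start a new merged dimension. */
      out++;
      size[out] = size[d];
      stride[out] = stride[d];
    }
  }
  return out + 1;
}
```

Taking the example from the comment, a 3x3x3x3 tensor narrowed from a contiguous 3x3x4x3 tensor keeps the strides 36,12,3,1. The first two dimensions merge because 36 == 12 * 3, and by the same rule the last two merge because 3 == 1 * 3, so the sketch reduces four nested loops to two: size 9,9 with stride 12,1.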