cuDNN v8 Attention(注意力机制)相关API详解

最新推荐文章于 2024-03-11 12:52:13 发布

weixin_41671327

最新推荐文章于 2024-03-11 12:52:13 发布

阅读量758

点赞数 3

文章标签：深度学习神经网络 dnn transformer 算法

本文链接：https://blog.csdn.net/weixin_41671327/article/details/129795458

版权

MHA相关接口

Nvidia cuDNN 深度学习库中的多头注意力机制（multi-head attention, MHA）相关算子的API，是在cuDNN-v8中才加入的，实现在cudnn_adv_train.so和cudnn_adv_infer.so这两个动态库中，各大框架（pytorch，tensorflow等）对此支持的可能并不是很完善。为了使用cuDNN执行注意力机制，除了CUDA本身相关的定义，和一些cuDNN中常用的定义，需要使用如下一些接口（和相关结构体、枚举定义、宏定义）。

seq2seq类模型的接口

和RNN共用一套Sequence Data相关接口，主要为

Attention专用接口

#define CUDNN_ATTN_QUERYMAP_ALL_TO_ONE 0
#define CUDNN_ATTN_QUERYMAP_ONE_TO_ONE (1U << 0)
#define CUDNN_ATTN_DISABLE_PROJ_BIASES 0
#define CUDNN_ATTN_ENABLE_PROJ_BIASES (1U << 1)
cudnnMultiHeadAttnWeightKind_t
cudnnAttnDescriptor_t
cudnnCreateAttnDescriptor()
cudnnSetAttnDescriptor()
cudnnGetAttnDescriptor()
cudnnDestroyAttnDescriptor()
cudnnGetMultiHeadAttnBuffers()
cudnnGetMultiHeadAttnWeights()
cudnnMultiHeadAttnForward()
cudnnMultiHeadAttnBackwardData()
cudnnMultiHeadAttnBackwardWeights()

关于接口的实际使用，我们在eigenMHA的cudnn分支中有详细的案例，在代码中，我们用了eigen实现了与cuDNN这套API同样的计算并获得了相同的结果。对比看eigen实现和cudnn的内存分配、接口调用，可以更深入理解这套cuDNN接口对输入数据、输入权重的要求。（如果您觉得这篇文章和这个项目有用的话，麻烦您给这个eigenMHA加个Star，点个赞鼓励一下作者，谢谢~）

解释

概述

请添加图片描述
整体而言，MHA作为Tranformer模型中的一个模块，在训练中的前向传播中，需要将embedding（图中的Input Q，Input K，Input V），通过Q K V的线性层、 $S=Q*K^T$ （GEMM）、 $P = S o f t ma x (M a s k (S))$ 、 $P = Dro p o u t (P)$ 、 $O = P * V$ （GEMM）、O的线性层，计算得到MHA的输出（图中的Output O），用以继续下一层（可能是Layernorm）的计算。

在整个Transformer模型都前向传播结束后，通过将模型的输出数据和所期望的Target数据输入，损失代价函数进行对比，得到误差Loss。

在反向传播中，误差Loss会从模型输出端以梯度的形式被逐层传播回模型输入端。对于MHA来说，反向传播将Output O的梯度，通过O的线性层、 $O = P * V$ （GEMM）、 $P = Dro p o u t (P)$ 、 $P = S o f t ma x (S)$ 、 $S=Q*K^T$ （GEMM）、Q K V的线性层反向传播回输入端（embedding的梯度）。

反向传播的数据可以分为大类，第一种是模型的中间变量的梯度，第二种是模型的可训练参数的梯度。如果用 $\mathbf{Q}_{in}$ 来表示Input Q，那么MHA作为Transformer模型的一个环节，需要将中间变量 $\mathbf{Q}_{in}$ ， $\mathbf{K}_{in}$ ， $\mathbf{V}_{in}$ 的梯度

$\mathbf{grad\_Q}_{in} \quad \mathbf{grad\_K}_{in} \quad \mathbf{grad\_V}_{in}$

反向传播到MHA的上一层。而MHA的可训练参数 $\mathbf{W}_{Q}$ ， $\mathbf{W}_{K}$ ， $\mathbf{W}_{V}$ ， $\mathbf{W}_{O}$ ，的梯度

$\mathbf{grad\_W}_{Q} \quad \mathbf{grad\_W}_{K} \quad \mathbf{grad\_W}_{V} \quad \mathbf{grad\_W}_{O}$

需要保存并在训练框架的指示下（比如pytorch的.step()）对模型的参数进行更新。

Sequence Data相关接口

接口

cudnnSeqDataDescriptor_t seqDataDesc;
cudnnSeqDataAxis_t

/*
This function creates one instance of an opaque sequence data descriptor object by allocating the host memory for it and initializing all descriptor fields. The function writes NULL to seqDataDesc when the sequence data descriptor object cannot be allocated.

CUDNN_STATUS_SUCCESS - The descriptor object was created successfully.
CUDNN_STATUS_BAD_PARAM - An invalid input argument was encountered (seqDataDesc=NULL).
CUDNN_STATUS_ALLOC_FAILED - The memory allocation failed.
*/
cudnnStatus_t cudnnCreateSeqDataDescriptor(cudnnSeqDataDescriptor_t *seqDataDesc);	


/*
This function destroys the sequence data descriptor object and releases its memory. The seqDataDesc argument can be NULL. Invoking cudnnDestroySeqDataDescriptor() with a NULL argument is a no operation (NOP).

CUDNN_STATUS_SUCCESS - The descriptor was destroyed successfully.
*/
cudnnStatus_t cudnnDestroySeqDataDescriptor(cudnnSeqDataDescriptor_t seqDataDesc);


/*
This function initializes a previously created sequence data descriptor object. In the most simplified view, this descriptor defines dimensions (dimA) and the data layout (axes) of a four-dimensional tensor.

CUDNN_STATUS_SUCCESS - All input arguments were validated and the sequence data descriptor was successfully updated.
CUDNN_STATUS_BAD_PARAM - An invalid input argument was found. Some examples include:
	* seqDataDesc=NULL
	* dateType was not a valid type of cudnnDataType_t
	* nbDims was negative or zero
	* seqLengthArraySize did not match the expected length
	* some elements of seqLengthArray[] were invalid
CUDNN_STATUS_NOT_SUPPORTED - An unsupported input argument was encountered. Some examples include:
	* nbDims is not equal to 4 
	* paddingFill is not NULL
CUDNN_STATUS_ALLOC_FAILED - Failed to allocate storage for the sequence data descriptor object.
*/
cudnnStatus_t cudnnSetSeqDataDescriptor(
	cudnnSeqDataDescriptor_t seqDataDesc,
    cudnnDataType_t dataType,
	int nbDims,
	const int dimA[],
	const cudnnSeqDataAxis_t axes[],
	size_t seqLengthArraySize,
	const int seqLengthArray[],
	void *paddingFill);

/*
This function retrieves settings from a previously created sequence data descriptor. The user can assign NULL to any pointer except seqDataDesc when the retrieved value is not needed. The nbDimsRequested argument applies to both dimA[] and axes[] arrays. A positive value of nbDimsRequested or seqLengthSizeRequested is ignored when the corresponding array, dimA[], axes[], or seqLengthArray[] is NULL.

CUDNN_STATUS_SUCCESS - Requested sequence data descriptor fields were retrieved successfully.
CUDNN_STATUS_BAD_PARAM - An invalid input argument was found.
CUDNN_STATUS_INTERNAL_ERROR - An inconsistent internal state was encountered.
*/
cudnnStatus_t cudnnGetSeqDataDescriptor(
	const cudnnSeqDataDescriptor_t seqDataDesc,
	cudnnDataType_t *dataType,
	int *nbDims,
	int nbDimsRequested,
	int dimA[],
	cudnnSeqDataAxis_t axes[],
	size_t *seqLengthArraySize,
	size_t seqLengthSizeRequested,
	int seqLengthArray[],
	void *paddingFill);

序列描述设置

Sequence Descriptor（cudnnSeqDataDescriptor_t）是对seq2seq任务中输入数据的一种描述。在cuDNN中，一个seq2seq任务的数据在cuDNN可以由四个维度（cudnnSeqDataAxis_t）表示，维度可分为的

CUDNN_SEQDATA_BATCH_DIM
CUDNN_SEQDATA_BEAM_DIM
CUDNN_SEQDATA_TIME_DIM
CUDNN_SEQDATA_VECT_DIM

依次对应了通产而言数据的batch size，beam size，sequence length和embedding size。所以一个具体的Seqeunce数据尺寸由一个长为4的整形数组int dimA[]定义，dimA[axis]表示了axis这个维度的长度。而一个sequence数据的坐标轴维度顺序不是固定的，这个数组维度的顺序由坐标轴数组cudnnSeqDataAxis_t axes[]表示，axes[0]-axes[3]每一个维度的含义需要单独设置。最后一个axes[3]必须是CUDNN_SEQDATA_VECT_DIM，这样在row-major的数据索引中，CUDNN_SEQDATA_VECT_DIM这个维度的数据会保证在内存中相连。

Sequence Descriptors需要通过cudnnSetSeqDataDescriptor()来配置。在前向传播计算API cudnnMultiHeadAttnForward()和反向传播计算API cudnnMultiHeadAttnBackwardData()，cudnnMultiHeadAttnBackwardWeights()中会检测Sequence Descriptor和由cudnnSetAttnDescriptor()设置的模型descriptor是否匹配。

Dropout相关接口

接口

cudnnDropoutDescriptor_t dropoutDesc;

/*
This function creates a generic dropout descriptor object by allocating the memory needed to hold its opaque structure.

CUDNN_STATUS_SUCCESS - The object was created successfully.
CUDNN_STATUS_ALLOC_FAILED - The resources could not be allocated.
*/
cudnnStatus_t cudnnCreateDropoutDescriptor(cudnnDropoutDescriptor_t    *dropoutDesc)

/*
This function performs forward dropout operation over x returning results in y. If dropout was used as a parameter to cudnnSetDropoutDescriptor(), the approximate dropout fraction of x values will be replaced by a 0, and the rest will be scaled by 1/(1-dropout). This function should not be running concurrently with another cudnnDropoutForward() function using the same states.

CUDNN_STATUS_SUCCESS - The call was successful.
CUDNN_STATUS_INVALID_VALUE - The sizeInBytes argument is less than the value returned by cudnnDropoutGetStatesSize().
CUDNN_STATUS_ALLOC_FAILED - The function failed to temporarily extend the GPU stack.
CUDNN_STATUS_EXECUTION_FAILED - The function failed to launch on the GPU.
CUDNN_STATUS_INTERNAL_ERROR - Internally used CUDA functions returned an error status.
*/
cudnnStatus_t cudnnSetDropoutDescriptor(
    cudnnDropoutDescriptor_t    dropoutDesc,
    cudnnHandle_t               handle,
    float                       dropout,
    void                       *states,
    size_t                      stateSizeInBytes,
    unsigned long long          seed)
	
/*
This function destroys a previously created dropout descriptor object.

CUDNN_STATUS_SUCCESS - The object was destroyed successfully.
*/
cudnnStatus_t cudnnDestroyDropoutDescriptor(cudnnDropoutDescriptor_t dropoutDesc)

/*
This function is used to query the amount of space required to store the states of the random number generators used by the cudnnDropoutForward() function.

CUDNN_STATUS_SUCCESS - The query was successful.
*/
cudnnStatus_t cudnnDropoutGetStatesSize(
    cudnnHandle_t       handle,
    size_t             *sizeInBytes)

Dropout的设置

对于Dropout的实现本身，我们更关注Dropout Mask的实现。

Dropout Mask通常由一个介于0和1之间的浮点数dropout定义。这个Mask和需要施加Dropout操作的变量拥有相同的长度，在mask的所有值中，占比dropout的值为 $0$ ，占比(1-dropout)的值为 $1/ (1 - d ro p o u t)$ 。在训练的前向和反向传播时，该mask会与其对应的变量和对应变量的梯度进行element-wise相乘。

cudnnSetDropoutDescriptor()中指向GPU内存的指针states需要用户自己去根据cudnnDropoutGetStatesSize()的返回值sizeInBytes分配内存，但这个开辟的内存并不是mask所需要的，而是随机数生成器需要的。因此Dropout Mask本身是归Attention Descriptor管理的，其内存分配由Attention内存分配函数根据Attention Descriptor完成。

设置好的Dropout Descriptor会传入Attention Descriptor，如果是非融合算子，Dropout的前向传播API和反向传播API会在Attention的前向传播API和反向传播API中调用。

Attention参数的相关接口

接口

cudnnAttnDescriptor_t attnDesc;

/*
This function creates one instance of an attention descriptor object by allocating the
host memory for it and initializing all descriptor fields. The function writes NULL to attnDesc
when the attention descriptor object cannot be allocated.

CUDNN_STATUS_SUCCESS - The descriptor object was created successfully.
CUDNN_STATUS_BAD_PARAM - An invalid input argument was encountered (attnDesc=NULL).
CUDNN_STATUS_ALLOC_FAILED - The memory allocation failed.
*/
cudnnStatus_t cudnnCreateAttnDescriptor(cudnnAttnDescriptor_t *attnDesc);

/*
This function destroys the attention descriptor object and releases its memory. The attnDesc
argument can be NULL. Invoking cudnnDestroyAttnDescriptor() with a NULL argument is a
no operation (NOP). The cudnnDestroyAttnDescriptor() function is not able to detect if the attnDesc 
argument holds a valid address. Undefined behavior will occur in case of passing an invalid
pointer, not returned by the cudnnCreateAttnDescriptor() function, or in the double deletion
scenario of a valid address

CUDNN_STATUS_SUCCESS - The descriptor was destroyed successfully.
*/
cudnnStatus_t cudnnDestroyAttnDescriptor(cudnnAttnDescriptor_t attnDesc);

/*
This function configures a multi-head attention descriptor that was previously created. 

CUDNN_STATUS_SUCCESS - The attention descriptor was configured successfully.
CUDNN_STATUS_BAD_PARAM - An invalid input argument was encountered. Some examples include:
	* post projection and sizes were not equal
	* dataType, computePrec, or mathType were invalid
	* one or more of the following arguments were either negative or zero: nHeads, qSize,
	kSize, vSize, qoMaxSeqLength, kvMaxSeqLength, maxBatchSize, maxBeamSize
	* one or more of the following arguments were negative: qProjSize, kProjSize, vProjSize, smScaler
CUDNN_STATUS_NOT_SUPPORTED - A requested option or a combination of input arguments is not supported.
*/
cudnnStatus_t cudnnSetAttnDescriptor(
	cudnnAttnDescriptor_t attnDesc,
	unsigned attnMode,
	int nHeads,
	double smScaler,
	cudnnDataType_t dataType,
	cudnnDataType_t computePrec,
	cudnnMathType_t mathType,
	cudnnDropoutDescriptor_t attnDropoutDesc,
	cudnnDropoutDescriptor_t postDropoutDesc,
	int qSize,
	int kSize,
	int vSize,
	int qProjSize,
	int kProjSize,
	int vProjSize,
	int oProjSize,
	int qoMaxSeqLength,
	int kvMaxSeqLength,
	int maxBatchSize,
	int maxBeamSize);

/*
This function retrieves settings from the previously created attention descriptor. The user can
assign NULL to any pointer except attnDesc when the retrieved value is not needed.

CUDNN_STATUS_SUCCESS - Requested attention descriptor fields were retrieved successfully.
CUDNN_STATUS_BAD_PARAM - An invalid input argument was found.
*/
cudnnStatus_t cudnnGetAttnDescriptor(
    cudnnAttnDescriptor_t attnDesc,
    unsigned *attnMode,
    int *nHeads,
    double *smScaler,
    cudnnDataType_t *dataType,
    cudnnDataType_t *computePrec,
    cudnnMathType_t *mathType,
    cudnnDropoutDescriptor_t *attnDropoutDesc,
    cudnnDropoutDescriptor_t *postDropoutDesc,
    int *qSize,
    int *kSize,
    int *vSize,
    int *qProjSize,
    int *kProjSize,
    int *vProjSize,
    int *oProjSize,
    int *qoMaxSeqLength,
    int *kvMaxSeqLength,
    int *maxBatchSize,
    int *maxBeamSize);

参数设置方法

cudnnAttnDescriptor_t装载了Transformer模型中Attention层的参数，这些参数对于每一个Attention层来说是性质的表述，直接定义了每个attention的运算方法:

attnMode是接口将一个unsigned整数解读为多个二进制开关，通过二进制OR（|）方法来控制不同模式的选择（on/off），API的“非黑即白”参数可以通过这种方式合并，以减少额外的输入参数。传入前，首先要通过unsigned attnMode = 0将无符号数置零，再通过几个cuDNN预先定义的数来控制选择Attention计算的变体，其中主要有如下两个选择：
- CUDNN_ATTN_DISABLE_PROJ_BIASES / CUDNN_ATTN_ENABLE_PROJ_BIASES：线性层是否加Bias，一般情况下Attention中的线性层是只有weight而没有bias的，因此首选CUDNN_ATTN_DISABLE_PROJ_BIASES。
- CUDNN_ATTN_QUERYMAP_ALL_TO_ONE / CUDNN_ATTN_QUERYMAP_ONE_TO_ONE： Key和Value的beam_size是否为1，即不同的Q beam是对应同一个Key-Value（CUDNN_ATTN_QUERYMAP_ALL_TO_ONE）还是不同的Key-Value (CUDNN_ATTN_QUERYMAP_ONE_TO_ONE)。
nHeads，MHA中head的数目。
smScaler，在求softmax前，先将要做softmax的数乘一个乘子，常用数值为 $\sqrt{d}$ 。
dataType，Attention的输入、权重、输出的数据格式，可选为CUDNN_DATA_HALF/CUDNN_DATA_FLOAT/CUDNN_DATA_DOUBLE
computePrec，Attention的做计算时的使用的数据格式，可选为CUDNN_DATA_HALF/CUDNN_DATA_FLOAT/CUDNN_DATA_DOUBLE，其精度要小于等于dataType
mathType，做矩阵乘法时的Tensor Core选项（mma寄存器的数据类型）。
cuDNN attention模块存在两个dropout
- attnDropoutDesc，施加给FMHA中的softmax概率矩阵P的dropout
- postDropoutDesc，施加给输出矩阵O（加残差之前）的dropout
之后的几个参数涉及到Attention模块的线性层weight和bias的尺寸，一个Transformer模型中可能存在多个encoder（和/或）decoder，每个enc/dec中可能有多个attention。不同enc/dec层的attention的head number数量相同，固定为nHeads；而head size可能不同，因此hidden size可能不同。线性层的输入hidden size为上一层的hidden size尺寸，我们称之为hidden_size1，对应的head size称之为head_size1；线性层的输出hidden size为本层的hidden size尺寸，我们称之为hidden_size2，对应的head size称之为head_size2。
- qSize，输入Q的hidden size（hidden_size1），如果是第一个enc/dec层，即为经过input embedding和positional encoding输入的embedding size
- kSize，输入K的hidden size（hidden_size1），如果是第一个enc/dec层，即为经过input embedding和positional encoding输入的embedding size
- vSize，输入V的hidden size（hidden_size1），如果是第一个enc/dec层，即为经过input embedding和positional encoding输入的embedding size
- qProjSize，Q的线性层输出head size（head_size2）
- kProjSize，K的线性层输出head size（head_size2）
- vProjSize，V的线性层输出head size（head_size2）
- oProjSize，O线性层所有head合在一起输入输出的hidden size（hidden_size2）
- 这些参数所表示的qkv线性层的输入输出关系可表示为：xSize1==nHeads*xProjSize1==hidden_dim1!=hidden_dim2==nHeads*xProjSize2==xSize2（x为q、k、v中的一个）
以下几个参数定义了在device端预先分配的GPU内存Buffer的大小，实际输入数据的尺寸不能超过这些参数规定的上界。
- qoMaxSeqLength，QO序列的最大可能长度（token的数量）
- kvMaxSeqLength，KV序列的最大可能长度（token的数量）
- maxBatchSize，最大batch数
- maxBeamSize，最大beam数（beam可以理解为一种batch）

Attention计算的相关接口

接口

cudnnWgradMode_t addGrad;

/*
CUDNN_STATUS_SUCCESS - No errors were detected while processing API input arguments and launching GPU kernels.
CUDNN_STATUS_BAD_PARAM - An invalid or incompatible input argument was encountered. Some examples include: 
	*a required input pointer was NULL
	*currIdx was out of bound
	*the descriptor value for attention, query, key, value, and output were incompatible with one another
CUDNN_STATUS_EXECUTION_FAILED - The process of launching a GPU kernel returned an error, or an earlier kernel did not complete successfully.
CUDNN_STATUS_INTERNAL_ERROR - An inconsistent internal state was encountered.
CUDNN_STATUS_NOT_SUPPORTED - A requested option or a combination of input arguments is not supported.
CUDNN_STATUS_ALLOC_FAILED - Insufficient amount of shared memory to launch a GPU kernel.
*/
cudnnStatus_t cudnnMultiHeadAttnForward(
	cudnnHandle_t handle,
	const cudnnAttnDescriptor_t attnDesc,
	int currIdx,
	const int loWinIdx[],
	const int hiWinIdx[],
	const int devSeqLengthsQO[],
	const int devSeqLengthsKV[],
	const cudnnSeqDataDescriptor_t qDesc,
	const void *queries,
	const void *residuals,
	const cudnnSeqDataDescriptor_t kDesc,
	const void *keys,
	const cudnnSeqDataDescriptor_t vDesc,
	const void *values,
	const cudnnSeqDataDescriptor_t oDesc,
       void *out,
	size_t weightSizeInBytes,
	const void *weights,
	size_t workSpaceSizeInBytes,
	void *workSpace,
	size_t reserveSpaceSizeInBytes,
	void *reserveSpace);
	
/*
CUDNN_STATUS_SUCCESS - No errors were detected while processing API input arguments and launching GPU kernels.
CUDNN_STATUS_BAD_PARAM - An invalid or incompatible input argument was encountered.
CUDNN_STATUS_EXECUTION_FAILED - The process of launching a GPU kernel returned an error, or an earlier kernel did not complete successfully.
CUDNN_STATUS_INTERNAL_ERROR - An inconsistent internal state was encountered.
CUDNN_STATUS_NOT_SUPPORTED - A requested option or a combination of input arguments is not supported.
CUDNN_STATUS_ALLOC_FAILED - Insufficient amount of shared memory to launch a GPU kernel.
*/
cudnnStatus_t cudnnMultiHeadAttnBackwardData(
	cudnnHandle_t handle,
	const cudnnAttnDescriptor_t attnDesc,
	const int loWinIdx[],
	const int hiWinIdx[],
	const int devSeqLengthsDQDO[],
	const int devSeqLengthsDKDV[],
	const cudnnSeqDataDescriptor_t doDesc,
	const void *dout,
	const cudnnSeqDataDescriptor_t dqDesc,
	void *dqueries,
	const void *queries,
	const cudnnSeqDataDescriptor_t dkDesc,
	void *dkeys,
	const void *keys,
	const cudnnSeqDataDescriptor_t dvDesc,
	void *dvalues,
	const void *values,
	size_t weightSizeInBytes,
	const void *weights,
	size_t workSpaceSizeInBytes,
	void *workSpace,
	size_t reserveSpaceSizeInBytes,
	void *reserveSpace);
	
/*
CUDNN_STATUS_SUCCESS - No errors were detected while processing API input arguments and launching GPU kernels.
CUDNN_STATUS_BAD_PARAM - An invalid or incompatible input argument was encountered.
CUDNN_STATUS_EXECUTION_FAILED - The process of launching a GPU kernel returned an error, or an earlier kernel did not complete successfully.
CUDNN_STATUS_INTERNAL_ERROR - An inconsistent internal state was encountered.
CUDNN_STATUS_NOT_SUPPORTED - A requested option or a combination of input arguments is not supported.
*/
cudnnStatus_t cudnnMultiHeadAttnBackwardWeights(
	cudnnHandle_t handle,
	const cudnnAttnDescriptor_t attnDesc,
	cudnnWgradMode_t addGrad,
	const cudnnSeqDataDescriptor_t qDesc,
	const void *queries,
	const cudnnSeqDataDescriptor_t kDesc,
	const void *keys,
	const cudnnSeqDataDescriptor_t vDesc,
	const void *values,
	const cudnnSeqDataDescriptor_t doDesc,
	const void *dout,
	size_t weightSizeInBytes,
	const void *weights,
	void *dweights,
	size_t workSpaceSizeInBytes,
	void *workSpace,
	size_t reserveSpaceSizeInBytes,
	void *reserveSpace);

Attention计算简述

在推理中，只需要在推理模式（将reserveSpaceSizeInBytes置零，reserveSpace=NULL）下调用cudnnMultiHeadAttnForward()即可，计算图的中间结果（线性层的输出以及非FMHA下的中间矩阵）使用大小为workSpaceSizeInBytes的workSpace暂时存储。

在训练中这几个计算函数的调用是相邻的，但需要满足固定的顺序，且在前一个API中出现的变量需要在之后调用时保持相同。首先需要调用cudnnMultiHeadAttnForward()来计算前向传播，并将中间结果（线性层的输出，非FMHA下的中间矩阵，Dropout的Dropout Mask）长期保存在reserveSpace之中供cudnnMultiHeadAttnBackwardData()后续使用，cudnnMultiHeadAttnBackwardData()会计算反向传播中的输入数据和中间变量的梯度并在reserveSpace保存中间结果（中间变量的梯度）。最后调用的cudnnMultiHeadAttnBackwardWeights()利用cudnnMultiHeadAttnBackwardData()计算得出的中间变量的梯度来得到MHA几个线性层权重的梯度。

输入前向传播的是模型输入数据queries，keys，values和residuals和每个线性层的输入权重weights，最终生成输出数据out。和queries共享qDesc的residuals可根据residuals是否为NULL选择是否在输出时加在输出out上，当输出需要加残差时，oDesc对应的维度需要和qDesc对应的维度一致。

输入反向传播的是dout和原始输入数据/权重queries，keys，values，weights，最终生成变量的梯度dqueries，dkeys，dvalues以及权重的梯度dweights。

Attention计算参数设置

weightSizeInBytes，workSpaceSizeInBytes，reserveSpaceSizeInBytes由内存管理相关接口cudnnGetMultiHeadAttnBuffers计算得到，用来表示权重weights，工作区workSpace，保留区reserveSpace的大小。输入数据（及其反向传播梯度）由各自的Sequence Descriptor和指向各自GPU内存的指针定义，输入权重（及其反向传播梯度）由Attention模型的Attention Descriptor和一个指向一段保存了所有权重的整段GPU内存的指针定义。

其中，输入数据的尺寸需要通过类型为cudnnSeqDataDescriptor_t 的qDesc，kDesc，vDesc和oDesc来进行描述，输入权重的尺寸由通过cudnnGetMultiHeadAttnBuffers计算而得的weightSizeInBytes描述。根据反向传播的性质，一个变量的梯度（中间变量或权重）的尺寸，与原变量的尺寸相一致。因此queries和dqueries共用一个qDesc，keys和dkeys共用一个kDesc，values和dvalues共用一个vDesc，out和dout共用一个oDesc，weights和dweights共用一个weightSizeInBytes。

devSeqLengthsQO[]与devSeqLengthsKV[]在device端记录了每个batch（和beam）的实际序列长度（与devSeqLengthsDQDO[]、devSeqLengthsDKDV[]共用）。这些存储Sequence Length的数组貌似和在host端初始化的qDesc，kDesc，vDesc和oDesc所表示的数据特征存在冗余，但实际他们是必要的，因为在API中即存在需要在CPU端执行的代码（C++ Function）也存在需要在GPU端执行的代码（GPU Kernel），这两者的执行是异步的。这些descriptor作为在CPU端的数据结构，可能在实际GPU Kernel做计算时就已经被下一个循环中相关操作更改了，且Descriptor的数据结构占用空间较大，不适合直接保存在GPU kernel的函数传参区。

cudnnWgradMode_t表示是将权重的梯度累加在dweights之上（CUDNN_WGRAD_MODE_ADD）还是直接覆盖dweights（CUDNN_WGRAD_MODE_ADD）。

currIdx，loWinIdx[]与hiWinIdx[]我们会在下面用一定的篇幅着重讲解。粗略的说，如果将Query/Output Sequence的每一个token记作一个time-step，currIdx即表示当前的所处time-step，loWinIdx[]与hiWinIdx[]指示了Key-Value Sequence的可见范围。如果想理解currIdx，loWinIdx[]与hiWinIdx[]这套体系的工作模式，就要先回顾一下Transformer中的Attention类型和Attention中的Mask类型。

Transformer中的三种Attention
Transformer模型中的Attention总共有3种。其中，Encoder中有一种，我们称之为Self-Attention；Decoder中有两种，我们称之为Masked Self-Attention和Enc-Dec Attention。因此，Transformer中也对应存在3种尺寸的Mask，我们称之为src mask，trg mask，src-trg mask，分别对应Self-Attention，Masked Self-Attention，Enc-Dec Attention。

而将这3种尺寸的Mask按功能分类可以分为2种，一种是在Self-Attention和Enc-Dec Attention（二者区别于输入数据是否由同一套Embedding生成）中用来处理无实际意义的padding token的；另一种是在Masked Self-Attention中用来限制Attention输出的O矩阵的每一个token vector是和V矩阵的哪些个token vector是有关系的。通常所说的Causal Mask，以矩阵形式表示，是一个下三角全1矩阵，即O的每一个token vector只和当前时刻currIdx之前的V的token vectors是有关系的。

因此，Attention计算接口提供了两种机制，用以灵活的处理Transformer的（1）不区分训练和推理的Self-Attention和Enc-Dec Attention（2）区分训练和推理的Masked Self-Attention。

在Self-Attention，Enc-Dec Attention以及Masked Self-Attention的训练（整个target sequence作为训练Decoder的Masked Self-Attention的输入）中，我们将currIdx设置为负数，再手动设置所有time-step的loWinIdx[]与hiWinIdx[]，这样就会一次处理所有的time-step（遍历0到seqLenQ），并在做按行Softmax时忽略每行（即每个time-step）处于[loWinIdx[idx],hiWinIdx[idx])之外的数值，这在Masked Self-Attention的训练中会忽略要做Softmax的矩阵的上三角的值。
在Masked Self-Attention的推理中，由于其输入仅有Decoder在前一个time-step推理的输出与之前的Decoder输出组而成的含有currIdx个有效token的Sequence，因此我们可以通过将currIdx设为一个介于[0,seqLenQ]之间的数，只运算Input Sequence[0,currIdx)之间的Self-Attention，这种情况下，在计算currIdx的cudnnMultiHeadAttnForward()之前设置好loWinIdx[currIdx]和hiWinIdx[currIdx]就可以。

Attention内存相关接口

接口

/*
This function computes weight, work, and reserve space buffer sizes used by the following functions.

CUDNN_STATUS_ARCH_MISMATCH - The GPU device does not support the input data type.
CUDNN_STATUS_SUCCESS - The requested buffer sizes were computed successfully.
CUDNN_STATUS_BAD_PARAM - An invalid input argument was found.
*/
cudnnStatus_t cudnnGetMultiHeadAttnBuffers(
	cudnnHandle_t handle,
	const cudnnAttnDescriptor_t attnDesc,
	size_t *weightSizeInBytes,
	size_t *workSpaceSizeInBytes,
	size_t *reserveSpaceSizeInBytes);


cudnnStatus_t cudnnGetMultiHeadAttnWeights(
    cudnnHandle_t handle,
    const cudnnAttnDescriptor_t attnDesc,
    cudnnMultiHeadAttnWeightKind_t wKind,
    size_t weightSizeInBytes,
    const void *weights,
    cudnnTensorDescriptor_t wDesc,
    void **wAddr);

内存分配简述

cuDNN通过cudnnGetMultiHeadAttnBuffers()计算各个GPU内存Buffer所需的大小，简单而言，weightSizeInBytes是根据Attention Descriptor获得的权重以及权重梯度的内存空间大小，在推理时reserveSpaceSizeInBytes是0，在训练时workSpaceSizeInBytes是0。

cuDNN通过cudnnGetMultiHeadAttnWeights()获得存储在权重内存中的各个具体权重信息。想要获得一个具体的权重（尺寸wDesc、起始地址wAddr），就要提供一个权重内存的起始地址weights及其有效范围weightSizeInBytes，并通过cudnnMultiHeadAttnWeightKind_t指定权重的类型wKind：

CUDNN_MH_ATTN_Q_WEIGHTS, CUDNN_MH_ATTN_K_WEIGHTS, CUDNN_MH_ATTN_V_WEIGHTS, CUDNN_MH_ATTN_O_WEIGHTS
CUDNN_MH_ATTN_Q_BIASES, CUDNN_MH_ATTN_K_BIASES, CUDNN_MH_ATTN_V_BIASES, CUDNN_MH_ATTN_O_BIASES

cudnnGetMultiHeadAttnWeights既可以从GPU内存取数据，也可以从CPU内存取数据，即const void *weights可以是一个CPU端地址也可以是一个GPU端地址。

前向传播内存分配

前向传播时，除了需要用户为queries，keys，values 和weights（queries通常用作为residuals）以及输出out 分配GPU内存，还需要通过cudnnGetMultiHeadAttnBuffers()计算得到的workSpaceSizeInBytes为中间变量工作区workSpace分配内存。

workSpace内存分配方式如下：

当MHA API包含了线性层时，输入的queries，keys，values 需要先经过线性层映射，第二个GEMM的输出O还需要经过O的线性层映射得到out。
- 当Attention的GPU算子不是采用融合算子时（fused multi-head attention, FMHA），需要为三个输入线性层的输出Q，K，V、第一个GEMM的结果S、Softmax的结果P、和第二个GEMM的结果O分配内存。考虑到GPU内存服用，K和V的线性层输出可以复用一块存储空间。
- 当Attention的cuda算子采用FMHA时，只需要为三个输入线性层的输出Q，K，V、FMHA输出O分配内存。
当MHA API不包含线性层时，输入的queries，keys，values直接作为两个GEMM的输入，第二个GEMM的输出O直接作为API的输出
- 非融合算子情况下，需要为第一个GEMM的结果S、Softmax的结果P分配内存
- 融合算子情况下，不需要额外分配内存

反向传播内存分配

需要反向传播的数据有两种，一种是输出的权重梯度dweights，一种是数据的梯度：输入dout以及输出dqueries，dkeys，dvalues。因此，反向传播需要用户为这些梯度显式分配GPU内存。除此之外，还需要通过cudnnGetMultiHeadAttnBuffers()计算得到的reserveSpaceSizeInBytes为中间变量工作区reserveSpace分配内存。

reserveSpace内存分配方式这里只讨论当MHA API包含了线性层时：

非融合算子情况下，cudnnMultiHeadAttnForward()的计算需要为QKV的线性层输出、第一个GEMM的输出、Softmax的输出、Dropout Mask、第二个GEMM的输出这些中间变量分配内存。cudnnMultiHeadAttnBackwardData()在处理梯度数据时，相比前向传播的内存分配，既需要额外的空间来存储必要的梯度数据，也要对reserveSpace的内存空间进行复用。

其中，为了节省GPU内存，S矩阵与其自身的梯度复用一块存储空间，Q的线性层输出梯度可以与O的线性层输入梯度复用一块存储空间，K的线性层输出梯度可以和V的线性层输出复用一块存储空间。这些复用的空间，新的中间变量会对该内存位置此前存储的中间变量进行覆盖，但因为本身在计算链中的因果关系，并不会对效率和安全性产生任何影响，被覆盖掉的内容也不会在cudnnMultiHeadAttnBackwardWeights()被后续使用。而需要保留的原中间变量不能被覆盖，比如，cudnnMultiHeadAttnBackwardWeights()会使用O的线性层输入、QKV的线性层输出的梯度。

融合算子情况下，可以节省Softmax和Dropout相关的中间变量的存储空间（不包括Dropout Mask）。

以上只是一些简单的考虑和估计，实际中cudnn会为不同的输入尺寸、模型尺寸分配不同的workspace，这些均会在运行时由cudnnGetMultiHeadAttnBuffers的workSpaceSizeInBytes给出。且由于可能存在额外的索引，workSpaceSizeInBytes的值会略大于大于我们根据算法估算的字节数。

以上就应该是cuDNN v8 Transformer多头注意力机制有关的全部内容。如果要了解更多，我们在eigenMHA的cudnn分支中有详细的案例，在代码中，我们用了eigen实现了与cuDNN这套API同样的计算并获得了相同的结果。对比看eigen实现和cudnn的内存分配、接口调用，可以更深入理解这套cuDNN接口对输入数据、输入权重的要求。（如果您觉得这篇文章和这个项目有用的话，麻烦您给这个eigenMHA加个Star，点个赞鼓励一下作者，谢谢~）