41. cuBLAS Developer Guide -- Level-2 gemvBatched() in cuBLAS

2.6.24. cublas<t>gemvBatched()


cublasStatus_t cublasSgemvBatched(cublasHandle_t handle, cublasOperation_t trans,
                                  int m, int n,
                                  const float           *alpha,
                                  const float           *Aarray[], int lda,
                                  const float           *xarray[], int incx,
                                  const float           *beta,
                                  float           *yarray[], int incy,
                                  int batchCount)
cublasStatus_t cublasDgemvBatched(cublasHandle_t handle, cublasOperation_t trans,
                                  int m, int n,
                                  const double          *alpha,
                                  const double          *Aarray[], int lda,
                                  const double          *xarray[], int incx,
                                  const double          *beta,
                                  double          *yarray[], int incy,
                                  int batchCount)
cublasStatus_t cublasCgemvBatched(cublasHandle_t handle, cublasOperation_t trans,
                                  int m, int n,
                                  const cuComplex       *alpha,
                                  const cuComplex       *Aarray[], int lda,
                                  const cuComplex       *xarray[], int incx,
                                  const cuComplex       *beta,
                                  cuComplex       *yarray[], int incy,
                                  int batchCount)
cublasStatus_t cublasZgemvBatched(cublasHandle_t handle, cublasOperation_t trans,
                                  int m, int n,
                                  const cuDoubleComplex *alpha,
                                  const cuDoubleComplex *Aarray[], int lda,
                                  const cuDoubleComplex *xarray[], int incx,
                                  const cuDoubleComplex *beta,
                                  cuDoubleComplex *yarray[], int incy,
                                  int batchCount)
cublasStatus_t cublasHSHgemvBatched(cublasHandle_t handle, cublasOperation_t trans,
                                    int m, int n,
                                    const float           *alpha,
                                    const __half          *Aarray[], int lda,
                                    const __half          *xarray[], int incx,
                                    const float           *beta,
                                    __half                *yarray[], int incy,
                                    int batchCount)
cublasStatus_t cublasHSSgemvBatched(cublasHandle_t handle, cublasOperation_t trans,
                                    int m, int n,
                                    const float           *alpha,
                                    const __half          *Aarray[], int lda,
                                    const __half          *xarray[], int incx,
                                    const float           *beta,
                                    float                 *yarray[], int incy,
                                    int batchCount)
cublasStatus_t cublasTSTgemvBatched(cublasHandle_t handle, cublasOperation_t trans,
                                    int m, int n,
                                    const float           *alpha,
                                    const __nv_bfloat16   *Aarray[], int lda,
                                    const __nv_bfloat16   *xarray[], int incx,
                                    const float           *beta,
                                    __nv_bfloat16         *yarray[], int incy,
                                    int batchCount)
cublasStatus_t cublasTSSgemvBatched(cublasHandle_t handle, cublasOperation_t trans,
                                    int m, int n,
                                    const float           *alpha,
                                    const __nv_bfloat16   *Aarray[], int lda,
                                    const __nv_bfloat16   *xarray[], int incx,
                                    const float           *beta,
                                    float                 *yarray[], int incy,
                                    int batchCount)

This function performs the matrix-vector multiplication of a batch of matrices and vectors. The batch is considered to be "uniform", i.e. all instances have the same dimensions (m, n), leading dimension (lda), increments (incx, incy), and transposition (trans) for their respective A matrix, x vector and y vector. The addresses of the input matrices and vectors, and of the output vector of each instance of the batch, are read from arrays of pointers passed to the function by the caller.

$$ y[i] = \alpha\,\mathrm{op}(A[i])\,x[i] + \beta\,y[i], \quad \text{for } i \in [0, \text{batchCount}-1] $$

where $\alpha$ and $\beta$ are scalars, A is an array of pointers to matrices $A[i]$ stored in column-major format with dimension $m \times n$, and x and y are arrays of pointers to vectors. Also, for matrix $A[i]$,

$$ \mathrm{op}(A[i]) = \begin{cases} A[i] & \text{if } transa == \text{CUBLAS\_OP\_N}, \\ A[i]^T & \text{if } transa == \text{CUBLAS\_OP\_T}, \\ A[i]^H & \text{if } transa == \text{CUBLAS\_OP\_C} \end{cases} $$

Note: the y[i] vectors must not overlap, i.e. the individual gemv operations must be computable independently; otherwise, undefined behavior is expected.

On certain problem sizes, it might be advantageous to make multiple calls to cublas<t>gemv in different CUDA streams, rather than using this API.
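The sketch below is not part of the original guide; it illustrates one way the pointer-array arguments might be prepared on the host for a single-precision batched gemv. The sizes, all-ones data, and plain stdio error reporting are illustrative assumptions.

#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <stdio.h>

int main(void) {
    enum { m = 4, n = 3, batchCount = 2 };        /* small, arbitrary example sizes */
    const float alpha = 1.0f, beta = 0.0f;

    cublasHandle_t handle;
    cublasCreate(&handle);

    /* Host-side data: batchCount column-major m x n matrices and length-n vectors. */
    float hostA[batchCount * m * n], hostX[batchCount * n], hostY[batchCount * m];
    for (int i = 0; i < batchCount * m * n; ++i) hostA[i] = 1.0f;
    for (int i = 0; i < batchCount * n; ++i)     hostX[i] = 1.0f;

    /* One contiguous device allocation per operand; each batch instance is a slice of it. */
    float *dA, *dX, *dY;
    cudaMalloc((void**)&dA, sizeof(hostA));
    cudaMalloc((void**)&dX, sizeof(hostX));
    cudaMalloc((void**)&dY, sizeof(hostY));
    cudaMemcpy(dA, hostA, sizeof(hostA), cudaMemcpyHostToDevice);
    cudaMemcpy(dX, hostX, sizeof(hostX), cudaMemcpyHostToDevice);

    /* Aarray/xarray/yarray are arrays of device pointers and must themselves
       reside in device memory, so build them on the host and copy them over. */
    const float *hAptrs[batchCount], *hXptrs[batchCount];
    float *hYptrs[batchCount];
    for (int i = 0; i < batchCount; ++i) {
        hAptrs[i] = dA + i * m * n;
        hXptrs[i] = dX + i * n;
        hYptrs[i] = dY + i * m;
    }
    const float **Aarray, **xarray;
    float **yarray;
    cudaMalloc((void**)&Aarray, sizeof(hAptrs));
    cudaMalloc((void**)&xarray, sizeof(hXptrs));
    cudaMalloc((void**)&yarray, sizeof(hYptrs));
    cudaMemcpy(Aarray, hAptrs, sizeof(hAptrs), cudaMemcpyHostToDevice);
    cudaMemcpy(xarray, hXptrs, sizeof(hXptrs), cudaMemcpyHostToDevice);
    cudaMemcpy(yarray, hYptrs, sizeof(hYptrs), cudaMemcpyHostToDevice);

    /* y[i] = alpha * A[i] * x[i] + beta * y[i] for i in [0, batchCount-1]. */
    cublasStatus_t status = cublasSgemvBatched(handle, CUBLAS_OP_N, m, n,
                                               &alpha,
                                               Aarray, m,     /* lda = m (column-major) */
                                               xarray, 1,     /* incx = 1 */
                                               &beta,
                                               yarray, 1,     /* incy = 1 */
                                               batchCount);
    if (status != CUBLAS_STATUS_SUCCESS)
        fprintf(stderr, "cublasSgemvBatched failed: %d\n", (int)status);

    cudaMemcpy(hostY, dY, sizeof(hostY), cudaMemcpyDeviceToHost);
    printf("y[0][0] = %f\n", hostY[0]);   /* expect 3.0 with the all-ones inputs above */

    cudaFree(Aarray); cudaFree(xarray); cudaFree(yarray);
    cudaFree(dA); cudaFree(dX); cudaFree(dY);
    cublasDestroy(handle);
    return 0;
}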

| Param. | Memory | In/out | Meaning |
| --- | --- | --- | --- |
| handle | | input | Handle to the cuBLAS library context. |
| trans | | input | Operation op(A[i]) that is non- or (conj.) transpose. |
| m | | input | Number of rows of matrix A[i]. |
| n | | input | Number of columns of matrix A[i]. |
| alpha | host or device | input | <type> scalar used for multiplication. |
| Aarray | device | input | Array of pointers to <type> array, with each array of dim. lda x n with lda>=max(1,m). All pointers must meet certain alignment criteria. Please see below for details. |
| lda | | input | Leading dimension of two-dimensional array used to store each matrix A[i]. |
| xarray | device | input | Array of pointers to <type> array, with each dimension n if trans==CUBLAS_OP_N and m otherwise. All pointers must meet certain alignment criteria. Please see below for details. |
| incx | | input | Stride of each one-dimensional array x[i]. |
| beta | host or device | input | <type> scalar used for multiplication. If beta == 0, y does not have to be a valid input. |
| yarray | device | in/out | Array of pointers to <type> array. It has dimensions m if trans==CUBLAS_OP_N and n otherwise. Vectors y[i] should not overlap; otherwise, undefined behavior is expected. All pointers must meet certain alignment criteria. Please see below for details. |
| incy | | input | Stride of each one-dimensional array y[i]. |
| batchCount | | input | Number of pointers contained in Aarray, xarray and yarray. |

If the math mode enables fast math when cublasSgemvBatched() is used, the pointers (not the pointer arrays) placed in GPU memory must be properly aligned to avoid misaligned memory access errors. Ideally all pointers are aligned to at least 16 bytes. Otherwise it is recommended that they meet the following rule:

  • if k%4==0 then ensure intptr_t(ptr) % 16 == 0,
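As an illustration of the condition in this rule, a small hypothetical helper (not from the guide) testing the 16-byte requirement on a pointer could look like this:

#include <stdint.h>
#include <stdbool.h>

/* Hypothetical helper: returns true when a (device) pointer satisfies the
   intptr_t(ptr) % 16 == 0 condition described in the rule above. */
static bool is_16_byte_aligned(const void *ptr) {
    return ((intptr_t)ptr % 16) == 0;
}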

The error values that this function may return, and their meanings, are listed in the following table:

| Error Value | Meaning |
| --- | --- |
| CUBLAS_STATUS_SUCCESS | The operation completed successfully. |
| CUBLAS_STATUS_NOT_INITIALIZED | The library was not initialized. |
| CUBLAS_STATUS_INVALID_VALUE | The parameters m, n, batchCount < 0. |
| CUBLAS_STATUS_EXECUTION_FAILED | The function failed to launch on the GPU. |
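For reference, a minimal check of the returned status against the values in this table (an illustrative snippet, not from the guide) might look like:

#include <cublas_v2.h>
#include <stdio.h>

/* Illustrative check of the cublasStatus_t returned by cublas<t>gemvBatched(). */
static void check_status(cublasStatus_t status) {
    if (status == CUBLAS_STATUS_SUCCESS) return;
    fprintf(stderr, "cublas<t>gemvBatched returned status %d\n", (int)status);
}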