Using cuBLAS (4)

蓝天巨人

Last modified on 2023-03-16 17:06:55

First published on 2023-03-16 16:44:38

Article link: https://blog.csdn.net/qq_44632658/article/details/129590433

This chapter covers the Level-3 Basic Linear Algebra Subprograms (BLAS3) functions, which perform matrix-matrix operations.

cublas<t>gemm()

cublasStatus_t cublasSgemm(cublasHandle_t handle,
                           cublasOperation_t transa, cublasOperation_t transb,
                           int m, int n, int k,
                           const float           *alpha,
                           const float           *A, int lda,
                           const float           *B, int ldb,
                           const float           *beta,
                           float           *C, int ldc)
cublasStatus_t cublasDgemm(cublasHandle_t handle,
                           cublasOperation_t transa, cublasOperation_t transb,
                           int m, int n, int k,
                           const double          *alpha,
                           const double          *A, int lda,
                           const double          *B, int ldb,
                           const double          *beta,
                           double          *C, int ldc)
cublasStatus_t cublasCgemm(cublasHandle_t handle,
                           cublasOperation_t transa, cublasOperation_t transb,
                           int m, int n, int k,
                           const cuComplex       *alpha,
                           const cuComplex       *A, int lda,
                           const cuComplex       *B, int ldb,
                           const cuComplex       *beta,
                           cuComplex       *C, int ldc)
cublasStatus_t cublasZgemm(cublasHandle_t handle,
                           cublasOperation_t transa, cublasOperation_t transb,
                           int m, int n, int k,
                           const cuDoubleComplex *alpha,
                           const cuDoubleComplex *A, int lda,
                           const cuDoubleComplex *B, int ldb,
                           const cuDoubleComplex *beta,
                           cuDoubleComplex *C, int ldc)
cublasStatus_t cublasHgemm(cublasHandle_t handle,
                           cublasOperation_t transa, cublasOperation_t transb,
                           int m, int n, int k,
                           const __half *alpha,
                           const __half *A, int lda,
                           const __half *B, int ldb,
                           const __half *beta,
                           __half *C, int ldc)

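This function supports the 64-bit integer interface.
This function performs the matrix-matrix multiplication C = alpha * op(A) * op(B) + beta * C, where alpha and beta are scalars, and A, B and C are matrices stored in column-major format, with op(A) of size m x k, op(B) of size k x n and C of size m x n. op(X) is X, its transpose, or its conjugate transpose, as selected by transa and transb.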
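
To make the roles of m, n, k and the leading dimensions concrete, here is a minimal CPU sketch of the operation C = alpha * op(A) * op(B) + beta * C for real matrices in column-major storage. It is not how cuBLAS computes the product; `Op`, `at` and `gemm_ref` are names made up for this illustration, and the conjugate-transpose case is left out because the sketch only handles `float`.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for cublasOperation_t (real types only, so no conjugate).
enum class Op { N, T };

// Element (i, j) of op(M), where M is stored column-major with leading dimension ld.
inline float at(const float* M, int ld, Op op, int i, int j) {
    return op == Op::N ? M[i + static_cast<std::size_t>(j) * ld]
                       : M[j + static_cast<std::size_t>(i) * ld];
}

// CPU reference: C = alpha * op(A) * op(B) + beta * C, everything column-major.
void gemm_ref(Op transa, Op transb, int m, int n, int k,
              float alpha, const float* A, int lda,
              const float* B, int ldb,
              float beta, float* C, int ldc) {
    assert(ldc >= (m > 1 ? m : 1));
    for (int j = 0; j < n; ++j) {
        for (int i = 0; i < m; ++i) {
            float acc = 0.0f;
            for (int p = 0; p < k; ++p)
                acc += at(A, lda, transa, i, p) * at(B, ldb, transb, p, j);
            float& c = C[i + static_cast<std::size_t>(j) * ldc];
            // With beta == 0 the old contents of C are never read.
            c = (beta == 0.0f) ? alpha * acc : alpha * acc + beta * c;
        }
    }
}
```

Comparing the output of a real `cublasSgemm` call against a reference like this on small matrices is a simple way to catch a wrong transpose flag or leading dimension.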

Param.	Memory	In/out	Meaning
handle		input	handle to the cuBLAS library context.
transa		input	operation op(`A`) that is non- or (conj.) transpose.
transb		input	operation op(`B`) that is non- or (conj.) transpose.
m		input	number of rows of matrix op(`A`) and `C`.
n		input	number of columns of matrix op(`B`) and `C`.
k		input	number of columns of op(`A`) and rows of op(`B`).
alpha	host or device	input	<type> scalar used for multiplication.
A	device	input	<type> array of dimensions `lda x k` with `lda>=max(1,m)` if `transa == CUBLAS_OP_N` and `lda x m` with `lda>=max(1,k)` otherwise.
lda		input	leading dimension of two-dimensional array used to store the matrix `A`.
B	device	input	<type> array of dimension `ldb x n` with `ldb>=max(1,k)` if `transb == CUBLAS_OP_N` and `ldb x k` with `ldb>=max(1,n)` otherwise.
ldb		input	leading dimension of two-dimensional array used to store matrix `B`.
beta	host or device	input	<type> scalar used for multiplication. If `beta==0`, `C` does not have to be a valid input.
C	device	in/out	<type> array of dimensions `ldc x n` with `ldc>=max(1,m)`.
ldc		input	leading dimension of a two-dimensional array used to store the matrix `C`.

The possible error values returned by this function and their meanings are listed below.

Error Value	Meaning
`CUBLAS_STATUS_SUCCESS`	the operation completed successfully
`CUBLAS_STATUS_NOT_INITIALIZED`	the library was not initialized
`CUBLAS_STATUS_INVALID_VALUE`	If `m`, `n`, `k` < 0 or if `transa`, `transb` != `CUBLAS_OP_N`, `CUBLAS_OP_C`, `CUBLAS_OP_T` or if `lda` < max(1, `m`) if `transa` == `CUBLAS_OP_N` and `lda` < max(1, `k`) otherwise or if `ldb` < max(1, `k`) if `transb` == `CUBLAS_OP_N` and `ldb` < max(1, `n`) otherwise or if `ldc` < max(1, `m`) or if `alpha`, `beta` == NULL or `C` == NULL if `C` needs to be scaled
`CUBLAS_STATUS_ARCH_MISMATCH`	in the case of `cublasHgemm` the device does not support math in half precision.
`CUBLAS_STATUS_EXECUTION_FAILED`	the function failed to launch on the GPU

// CUDA runtime library + cuBLAS library
#include "cuda_runtime.h"
#include "cublas_v2.h"
#include <iostream>
#include <stdlib.h>

using namespace std;

// Dimensions of the test matrices
int const A_ROW = 5;
int const A_COL = 6;
int const B_ROW = 6;
int const B_COL = 7;

int main()
{
  // Status returned by cuBLAS calls
  cublasStatus_t status;
  float *h_A,*h_B,*h_C;   // matrices in host memory
  h_A = (float*)malloc(sizeof(float)*A_ROW*A_COL);  // allocate host memory
  h_B = (float*)malloc(sizeof(float)*B_ROW*B_COL);
  h_C = (float*)malloc(sizeof(float)*A_ROW*B_COL);

  // Fill the input matrices with random integers in the range 1-10
  for (int i=0; i<A_ROW*A_COL; i++) {
    h_A[i] = (float)(rand()%10+1);
  }
  for(int i=0;i<B_ROW*B_COL; i++) {
    h_B[i] = (float)(rand()%10+1);
  }
  // Print the input matrices (stored row-major on the host)
  cout << "Matrix A:" << endl;
  for (int i=0; i<A_ROW*A_COL; i++){
    cout << h_A[i] << " ";
    if ((i+1)%A_COL == 0) cout << endl;
  }
  cout << endl;
  cout << "Matrix B:" << endl;
  for (int i=0; i<B_ROW*B_COL; i++){
    cout << h_B[i] << " ";
    if ((i+1)%B_COL == 0) cout << endl;
  }
  cout << endl;

  float *d_A,*d_B,*d_C;    // matrices in device memory
  cudaMalloc((void**)&d_A,sizeof(float)*A_ROW*A_COL); // allocate device memory
  cudaMalloc((void**)&d_B,sizeof(float)*B_ROW*B_COL);
  cudaMalloc((void**)&d_C,sizeof(float)*A_ROW*B_COL);

  cublasHandle_t handle;
  status = cublasCreate(&handle);
  if (status != CUBLAS_STATUS_SUCCESS) {
    cerr << "cublasCreate failed" << endl;
    return EXIT_FAILURE;
  }
  cudaMemcpy(d_A,h_A,sizeof(float)*A_ROW*A_COL,cudaMemcpyHostToDevice); // copy host -> device
  cudaMemcpy(d_B,h_B,sizeof(float)*B_ROW*B_COL,cudaMemcpyHostToDevice);

  float a = 1, b = 0;
  status = cublasSgemm(
          handle,
          CUBLAS_OP_T,   // op(A): the row-major buffer looks like A^T to cuBLAS, so transpose it back
          CUBLAS_OP_T,   // op(B): same reasoning for B
          A_ROW,          // m: rows of op(A) and C
          B_COL,          // n: columns of op(B) and C
          A_COL,          // k: columns of op(A) and rows of op(B) (B_ROW has the same value)
          &a,             // alpha
          d_A,            // left operand A
          A_COL,          // lda: A is row-major, so its leading dimension is its column count
          d_B,            // right operand B
          B_COL,          // ldb: B is row-major, so its leading dimension is its column count
          &b,             // beta
          d_C,            // result matrix C
          A_ROW           // ldc: C is always written column-major, so its leading dimension is its row count
  );
  if (status != CUBLAS_STATUS_SUCCESS) {
    cerr << "cublasSgemm failed" << endl;
    return EXIT_FAILURE;
  }
  // C = A*B is now stored column-major. Read as a row-major buffer it is the
  // B_COL x A_ROW matrix (A*B)^T, so each printed row holds A_ROW elements.
  cout << "Transpose of the result, (A*B)^T:" << endl;

  cudaMemcpy(h_C,d_C,sizeof(float)*A_ROW*B_COL,cudaMemcpyDeviceToHost);
  for(int i=0;i<A_ROW*B_COL;++i) {
    cout<<h_C[i]<<" ";
    if((i+1)%A_ROW==0) cout<<endl;
  }
  cublasDestroy(handle);
  cudaFree(d_A);
  cudaFree(d_B);
  cudaFree(d_C);
  free(h_A);
  free(h_B);
  free(h_C);
  return 0;
}
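
The example above leaves the product in column-major order. A common alternative, when a row-major result is wanted directly, relies on the identity (A*B)^T = B^T * A^T: a row-major buffer is exactly the column-major storage of the transposed matrix, so swapping the operands and using no transposes produces C in row-major order. The sketch below shows the index arithmetic on the CPU. `gemm_nn_colmajor` is a made-up stand-in for a column-major GEMM, not a cuBLAS function. Under the same reasoning, the matching cuBLAS call for the example above would presumably be `cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, B_COL, A_ROW, A_COL, &a, d_B, B_COL, d_A, A_COL, &b, d_C, B_COL)`, which has not been run here.

```cpp
#include <cassert>
#include <vector>

// CPU stand-in for a column-major, non-transposed GEMM: C = A * B (alpha = 1, beta = 0).
void gemm_nn_colmajor(int m, int n, int k,
                      const float* A, int lda,
                      const float* B, int ldb,
                      float* C, int ldc) {
    for (int j = 0; j < n; ++j)
        for (int i = 0; i < m; ++i) {
            float acc = 0.0f;
            for (int p = 0; p < k; ++p)
                acc += A[i + p * lda] * B[p + j * ldb];
            C[i + j * ldc] = acc;
        }
}

// Row-major C (rows x cols) = A (rows x inner) * B (inner x cols), obtained by asking
// the column-major routine for C^T = B^T * A^T: operands swapped, no transposes.
std::vector<float> matmul_rowmajor(const std::vector<float>& A,
                                   const std::vector<float>& B,
                                   int rows, int inner, int cols) {
    assert(static_cast<int>(A.size()) == rows * inner);
    assert(static_cast<int>(B.size()) == inner * cols);
    std::vector<float> C(rows * cols);
    gemm_nn_colmajor(cols, rows, inner,
                     B.data(), cols,   // row-major B seen as column-major B^T (cols x inner)
                     A.data(), inner,  // row-major A seen as column-major A^T (inner x rows)
                     C.data(), cols);  // column-major C^T (cols x rows) is row-major C
    return C;
}
```

The design choice is between paying for two transposed reads (as in the example above) and getting a column-major result, or swapping the operands and getting a row-major result with no transposes at all.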

cublas<t>gemm3m()

cublasStatus_t cublasCgemm3m(cublasHandle_t handle,
                           cublasOperation_t transa, cublasOperation_t transb,
                           int m, int n, int k,
                           const cuComplex       *alpha,
                           const cuComplex       *A, int lda,
                           const cuComplex       *B, int ldb,
                           const cuComplex       *beta,
                           cuComplex       *C, int ldc)
cublasStatus_t cublasZgemm3m(cublasHandle_t handle,
                           cublasOperation_t transa, cublasOperation_t transb,
                           int m, int n, int k,
                           const cuDoubleComplex *alpha,
                           const cuDoubleComplex *A, int lda,
                           const cuDoubleComplex *B, int ldb,
                           const cuDoubleComplex *beta,
                           cuDoubleComplex *C, int ldc)

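This function supports the 64-bit integer interface.
This function performs the complex matrix-matrix multiplication using the Gauss complexity reduction algorithm. This can improve performance by up to 25%.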
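
The idea behind the Gauss complexity reduction can be sketched on the CPU: a complex product normally needs four real products (Ar*Br, Ai*Bi, Ar*Bi, Ai*Br), but three are enough if a few extra additions are accepted. The code below applies that identity to small square matrices split into real and imaginary parts. It only illustrates the algebra; how cuBLAS organizes the computation internally is not described in this article, and `Mat`, `matmul`, `axpy` and `gemm3m_sketch` are names invented for the sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Mat = std::vector<double>;  // n x n real matrix, column-major

// Plain real matrix product, the expensive building block.
Mat matmul(const Mat& X, const Mat& Y, int n) {
    Mat Z(X.size(), 0.0);
    for (int j = 0; j < n; ++j)
        for (int p = 0; p < n; ++p)
            for (int i = 0; i < n; ++i)
                Z[i + j * n] += X[i + p * n] * Y[p + j * n];
    return Z;
}

// Element-wise X + s * Y.
Mat axpy(const Mat& X, double s, const Mat& Y) {
    Mat Z(X);
    for (std::size_t i = 0; i < Z.size(); ++i) Z[i] += s * Y[i];
    return Z;
}

// (Cr + i*Ci) = (Ar + i*Ai) * (Br + i*Bi) using three real matrix products
// instead of the usual four.
void gemm3m_sketch(const Mat& Ar, const Mat& Ai, const Mat& Br, const Mat& Bi,
                   int n, Mat& Cr, Mat& Ci) {
    assert(Ar.size() == static_cast<std::size_t>(n) * n);
    assert(Ai.size() == Ar.size() && Br.size() == Ar.size() && Bi.size() == Ar.size());
    const Mat K1 = matmul(axpy(Ar, 1.0, Ai), Br, n);   // (Ar + Ai) * Br
    const Mat K2 = matmul(Ar, axpy(Bi, -1.0, Br), n);  // Ar * (Bi - Br)
    const Mat K3 = matmul(Ai, axpy(Br, 1.0, Bi), n);   // Ai * (Br + Bi)
    Cr = axpy(K1, -1.0, K3);  // Ar*Br - Ai*Bi
    Ci = axpy(K1, 1.0, K2);   // Ar*Bi + Ai*Br
}
```

Dropping one of four real products is consistent with the "up to 25%" figure quoted above. The extra additions and subtractions also change the rounding behaviour slightly compared with the four-product form.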

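Param.	Memory	In/out	Meaning
handle		input	handle to the cuBLAS library context.
transa		input	operation op(`A`) that is non- or (conj.) transpose.
transb		input	operation op(`B`) that is non- or (conj.) transpose.
m		input	number of rows of matrix op(`A`) and `C`.
n		input	number of columns of matrix op(`B`) and `C`.
k		input	number of columns of op(`A`) and rows of op(`B`).
alpha	host or device	input	<type> scalar used for multiplication.
A	device	input	<type> array of dimensions `lda x k` with `lda>=max(1,m)` if `transa == CUBLAS_OP_N` and `lda x m` with `lda>=max(1,k)` otherwise.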
lda		input	leading dimension of two-dimensional array used to store the matrix `A`.
B	device	input	<type> array of dimension `ldb x n` with `ldb>=max(1,k)` if `transb == CUBLAS_OP_N` and `ldb x k` with `ldb>=max(1,n)` otherwise.
ldb		input	leading dimension of two-dimensional array used to store matrix `B`.
beta	host or device	input	<type> scalar used for multiplication. If `beta==0`, `C` does not have to be a valid input.
C	device	in/out	<type> array of dimensions `ldc x n` with `ldc>=max(1,m)`.
ldc		input	leading dimension of a two-dimensional array used to store the matrix `C`.

The possible error values returned by this function and their meanings are listed below.

Error Value	Meaning
`CUBLAS_STATUS_SUCCESS`	the operation completed successfully
`CUBLAS_STATUS_NOT_INITIALIZED`	the library was not initialized
`CUBLAS_STATUS_INVALID_VALUE`	If `m`, `n`, `k` < 0 or if `transa`, `transb` != `CUBLAS_OP_N`, `CUBLAS_OP_C`, `CUBLAS_OP_T` or if `lda` < max(1, `m`) if `transa` == `CUBLAS_OP_N` and `lda` < max(1, `k`) otherwise or if `ldb` < max(1, `k`) if `transb` == `CUBLAS_OP_N` and `ldb` < max(1, `n`) otherwise or if `ldc` < max(1, `m`) or if `alpha`, `beta` == NULL or `C` == NULL if `C` needs to be scaled
`CUBLAS_STATUS_ARCH_MISMATCH`	the device has a compute capability lower than 5.0
`CUBLAS_STATUS_EXECUTION_FAILED`	the function failed to launch on the GPU

cublas<t>gemmBatched()

cublasStatus_t cublasHgemmBatched(cublasHandle_t handle,
                                  cublasOperation_t transa,
                                  cublasOperation_t transb,
                                  int m, int n, int k,
                                  const __half           *alpha,
                                  const __half           *Aarray[], int lda,
                                  const __half           *Barray[], int ldb,
                                  const __half           *beta,
                                  __half           *Carray[], int ldc,
                                  int batchCount)
cublasStatus_t cublasSgemmBatched(cublasHandle_t handle,
                                  cublasOperation_t transa,
                                  cublasOperation_t transb,
                                  int m, int n, int k,
                                  const float           *alpha,
                                  const float           *Aarray[], int lda,
                                  const float           *Barray[], int ldb,
                                  const float           *beta,
                                  float           *Carray[], int ldc,
                                  int batchCount)
cublasStatus_t cublasDgemmBatched(cublasHandle_t handle,
                                  cublasOperation_t transa,
                                  cublasOperation_t transb,
                                  int m, int n, int k,
                                  const double          *alpha,
                                  const double          *Aarray[], int lda,
                                  const double          *Barray[], int ldb,
                                  const double          *beta,
                                  double          *Carray[], int ldc,
                                  int batchCount)
cublasStatus_t cublasCgemmBatched(cublasHandle_t handle,
                                  cublasOperation_t transa,
                                  cublasOperation_t transb,
                                  int m, int n, int k,
                                  const cuComplex       *alpha,
                                  const cuComplex       *Aarray[], int lda,
                                  const cuComplex       *Barray[], int ldb,
                                  const cuComplex       *beta,
                                  cuComplex       *Carray[], int ldc,
                                  int batchCount)
cublasStatus_t cublasZgemmBatched(cublasHandle_t handle,
                                  cublasOperation_t transa,
                                  cublasOperation_t transb,
                                  int m, int n, int k,
                                  const cuDoubleComplex *alpha,
                                  const cuDoubleComplex *Aarray[], int lda,
                                  const cuDoubleComplex *Barray[], int ldb,
                                  const cuDoubleComplex *beta,
                                  cuDoubleComplex *Carray[], int ldc,
                                  int batchCount)

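This function supports the 64-bit integer interface.
This function performs the matrix-matrix multiplication of a batch of matrices: C[i] = alpha * op(A[i]) * op(B[i]) + beta * C[i] for i in [0, batchCount - 1]. The batch is considered "uniform", that is, all instances have the same dimensions (m, n, k), leading dimensions (lda, ldb, ldc) and transpositions (transa, transb) for their respective A, B and C matrices. The addresses of the input and output matrices of each instance are read from arrays of pointers that the caller passes to the function.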

Param.	Memory	In/out	Meaning
handle		input	handle to the cuBLAS library context.
transa		input	operation op(`A[i]`) that is non- or (conj.) transpose.
transb		input	operation op(`B[i]`) that is non- or (conj.) transpose.
m		input	number of rows of matrix op(`A[i]`) and `C[i]`.
n		input	number of columns of op(`B[i]`) and `C[i]`.
k		input	number of columns of op(`A[i]`) and rows of op(`B[i]`).
alpha	host or device	input	<type> scalar used for multiplication.
Aarray	device	input	array of pointers to <type> array, with each array of dim. `lda x k` with `lda>=max(1,m)` if `transa==CUBLAS_OP_N` and `lda x m` with `lda>=max(1,k)` otherwise. All pointers must meet certain alignment criteria. Please see below for details.
lda		input	leading dimension of two-dimensional array used to store each matrix `A[i]`.
Barray	device	input	array of pointers to <type> array, with each array of dim. `ldb x n` with `ldb>=max(1,k)` if `transb==CUBLAS_OP_N` and `ldb x k` with `ldb>=max(1,n)` otherwise. All pointers must meet certain alignment criteria. Please see below for details.
ldb		input	leading dimension of two-dimensional array used to store each matrix `B[i]`.
beta	host or device	input	<type> scalar used for multiplication. If `beta == 0`, `C` does not have to be a valid input.
Carray	device	in/out	array of pointers to <type> array. It has dimensions `ldc x n` with `ldc>=max(1,m)`. Matrices `C[i]` should not overlap; otherwise, undefined behavior is expected. All pointers must meet certain alignment criteria. Please see below for details.
ldc		input	leading dimension of two-dimensional array used to store each matrix `C[i]`.
batchCount		input	number of pointers contained in Aarray, Barray and Carray.

If math mode enables fast math modes when using cublasSgemmBatched(), pointers (not the pointer arrays) placed in the GPU memory must be properly aligned to avoid misaligned memory access errors. Ideally all pointers are aligned to at least 16 Bytes. Otherwise it is recommended that they meet the following rule:

if k%4==0 then ensure intptr_t(ptr) % 16 == 0.
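
The rule quoted above is easy to check on the host before filling the pointer arrays. The helper below is a hypothetical convenience function, not part of cuBLAS, and only encodes the condition exactly as stated. Pointers returned directly by `cudaMalloc` are aligned far more strictly than 16 bytes, so in practice the check matters mostly for pointers computed as offsets into a larger allocation.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: true when ptr satisfies the alignment rule quoted above.
bool meets_alignment_rule(const void* ptr, int k) {
    assert(k >= 0);
    if (k % 4 != 0) return true;  // the quoted rule only constrains the k % 4 == 0 case
    return reinterpret_cast<std::uintptr_t>(ptr) % 16 == 0;
}
```

A typical use is to loop over the host copy of each pointer array and report the first entry that fails the check.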

The possible error values returned by this function and their meanings are listed below.

Error Value	Meaning
`CUBLAS_STATUS_SUCCESS`	the operation completed successfully
`CUBLAS_STATUS_NOT_INITIALIZED`	the library was not initialized
`CUBLAS_STATUS_INVALID_VALUE`	If `m`, `n`, `k`, `batchCount` < 0 or if `transa`, `transb` != `CUBLAS_OP_N`, `CUBLAS_OP_C`, `CUBLAS_OP_T` or if `lda` < max(1, `m`) if `transa` == `CUBLAS_OP_N` and `lda` < max(1, `k`) otherwise or if `ldb` < max(1, `k`) if `transb` == `CUBLAS_OP_N` and `ldb` < max(1, `n`) otherwise or if `ldc` < max(1, `m`)
`CUBLAS_STATUS_EXECUTION_FAILED`	the function failed to launch on the GPU
`CUBLAS_STATUS_ARCH_MISMATCH`	`cublasHgemmBatched` is only supported on GPUs with compute capability 5.3 or higher

cublas<t>gemmStridedBatched()

cublasStatus_t cublasHgemmStridedBatched(cublasHandle_t handle,
                                  cublasOperation_t transa,
                                  cublasOperation_t transb,
                                  int m, int n, int k,
                                  const __half           *alpha,
                                  const __half           *A, int lda,
                                  long long int          strideA,
                                  const __half           *B, int ldb,
                                  long long int          strideB,
                                  const __half           *beta,
                                  __half                 *C, int ldc,
                                  long long int          strideC,
                                  int batchCount)
cublasStatus_t cublasSgemmStridedBatched(cublasHandle_t handle,
                                  cublasOperation_t transa,
                                  cublasOperation_t transb,
                                  int m, int n, int k,
                                  const float           *alpha,
                                  const float           *A, int lda,
                                  long long int          strideA,
                                  const float           *B, int ldb,
                                  long long int          strideB,
                                  const float           *beta,
                                  float                 *C, int ldc,
                                  long long int          strideC,
                                  int batchCount)
cublasStatus_t cublasDgemmStridedBatched(cublasHandle_t handle,
                                  cublasOperation_t transa,
                                  cublasOperation_t transb,
                                  int m, int n, int k,
                                  const double          *alpha,
                                  const double          *A, int lda,
                                  long long int          strideA,
                                  const double          *B, int ldb,
                                  long long int          strideB,
                                  const double          *beta,
                                  double                *C, int ldc,
                                  long long int          strideC,
                                  int batchCount)
cublasStatus_t cublasCgemmStridedBatched(cublasHandle_t handle,
                                  cublasOperation_t transa,
                                  cublasOperation_t transb,
                                  int m, int n, int k,
                                  const cuComplex       *alpha,
                                  const cuComplex       *A, int lda,
                                  long long int          strideA,
                                  const cuComplex       *B, int ldb,
                                  long long int          strideB,
                                  const cuComplex       *beta,
                                  cuComplex             *C, int ldc,
                                  long long int          strideC,
                                  int batchCount)
cublasStatus_t cublasCgemm3mStridedBatched(cublasHandle_t handle,
                                  cublasOperation_t transa,
                                  cublasOperation_t transb,
                                  int m, int n, int k,
                                  const cuComplex       *alpha,
                                  const cuComplex       *A, int lda,
                                  long long int          strideA,
                                  const cuComplex       *B, int ldb,
                                  long long int          strideB,
                                  const cuComplex       *beta,
                                  cuComplex             *C, int ldc,
                                  long long int          strideC,
                                  int batchCount)
cublasStatus_t cublasZgemmStridedBatched(cublasHandle_t handle,
                                  cublasOperation_t transa,
                                  cublasOperation_t transb,
                                  int m, int n, int k,
                                  const cuDoubleComplex *alpha,
                                  const cuDoubleComplex *A, int lda,
                                  long long int          strideA,
                                  const cuDoubleComplex *B, int ldb,
                                  long long int          strideB,
                                  const cuDoubleComplex *beta,
                                  cuDoubleComplex       *C, int ldc,
                                  long long int          strideC,
                                  int batchCount)

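This function supports the 64-bit integer interface.
This function performs the matrix-matrix multiplication of a batch of matrices: C + i*strideC = alpha * op(A + i*strideA) * op(B + i*strideB) + beta * (C + i*strideC) for i in [0, batchCount - 1]. The batch is considered "uniform", that is, all instances have the same dimensions (m, n, k), leading dimensions (lda, ldb, ldc) and transpositions (transa, transb) for their respective A, B and C matrices. The input matrices A, B and the output matrix C of each instance are located at a fixed offset, counted in elements, from their location in the previous instance. The caller passes pointers to the A, B and C matrices of the first instance together with the element offsets strideA, strideB and strideC, which determine where the matrices of the following instances are located.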
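
The relationship between the strided layout and the pointer-array layout of `cublas<t>gemmBatched()` is plain pointer arithmetic: instance i starts `stride * i` elements after the base pointer. A minimal sketch, where `strided_to_pointers` is a made-up helper name:

```cpp
#include <cassert>
#include <vector>

// Instance i of a strided batch starts stride * i elements after the base pointer.
template <typename T>
std::vector<T*> strided_to_pointers(T* base, long long stride, int batchCount) {
    assert(batchCount >= 0);
    std::vector<T*> ptrs(batchCount);
    for (int i = 0; i < batchCount; ++i)
        ptrs[i] = base + stride * i;
    return ptrs;
}
```

For densely packed column-major m x k matrices, the stride would be `lda * k` elements. Note that the parameter table of `cublas<t>gemmBatched()` lists the pointer arrays as device memory, so a host-side vector like this one would still have to be copied to the GPU before use, which is one reason the strided variant is convenient when the matrices are evenly spaced.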

Param.	Memory	In/out	Meaning
handle		input	handle to the cuBLAS library context.
transa		input	operation op(`A[i]`) that is non- or (conj.) transpose.
transb		input	operation op(`B[i]`) that is non- or (conj.) transpose.
m		input	number of rows of matrix op(`A[i]`) and `C[i]`.
n		input	number of columns of op(`B[i]`) and `C[i]`.
k		input	number of columns of op(`A[i]`) and rows of op(`B[i]`).
alpha	host or device	input	<type> scalar used for multiplication.
A	device	input	<type>* pointer to the A matrix corresponding to the first instance of the batch, with dimensions `lda x k` with `lda>=max(1,m)` if `transa==CUBLAS_OP_N` and `lda x m` with `lda>=max(1,k)` otherwise.
lda		input	leading dimension of two-dimensional array used to store each matrix `A[i]`.
strideA		input	Value of type long long int that gives the offset in number of elements between `A[i]` and `A[i+1]`
B	device	input	<type>* pointer to the B matrix corresponding to the first instance of the batch, with dimensions `ldb x n` with `ldb>=max(1,k)` if `transb==CUBLAS_OP_N` and `ldb x k` with `ldb>=max(1,n)` otherwise.
ldb		input	leading dimension of two-dimensional array used to store each matrix `B[i]`.
strideB		input	Value of type long long int that gives the offset in number of elements between `B[i]` and `B[i+1]`
beta	host or device	input	<type> scalar used for multiplication. If `beta == 0`, `C` does not have to be a valid input.
C	device	in/out	<type>* pointer to the C matrix corresponding to the first instance of the batch, with dimensions `ldc x n` with `ldc>=max(1,m)`. Matrices `C[i]` should not overlap; otherwise, undefined behavior is expected.
ldc		input	leading dimension of two-dimensional array used to store each matrix `C[i]`.
strideC		input	Value of type long long int that gives the offset in number of elements between `C[i]` and `C[i+1]`
batchCount		input	number of GEMMs to perform in the batch.

The possible error values returned by this function and their meanings are listed below.

Error Value	Meaning
`CUBLAS_STATUS_SUCCESS`	the operation completed successfully
`CUBLAS_STATUS_NOT_INITIALIZED`	the library was not initialized
`CUBLAS_STATUS_INVALID_VALUE`	If `m`, `n`, `k`, `batchCount` < 0 or if `transa`, `transb` != `CUBLAS_OP_N`, `CUBLAS_OP_C`, `CUBLAS_OP_T` or if `lda` < max(1, `m`) if `transa` == `CUBLAS_OP_N` and `lda` < max(1, `k`) otherwise or if `ldb` < max(1, `k`) if `transb` == `CUBLAS_OP_N` and `ldb` < max(1, `n`) otherwise or if `ldc` < max(1, `m`)
`CUBLAS_STATUS_EXECUTION_FAILED`	the function failed to launch on the GPU
`CUBLAS_STATUS_ARCH_MISMATCH`	`cublasHgemmStridedBatched` is only supported on GPUs with compute capability 5.3 or higher

cublas<t>symm()

cublasStatus_t cublasSsymm(cublasHandle_t handle,
                           cublasSideMode_t side, cublasFillMode_t uplo,
                           int m, int n,
                           const float           *alpha,
                           const float           *A, int lda,
                           const float           *B, int ldb,
                           const float           *beta,
                           float           *C, int ldc)
cublasStatus_t cublasDsymm(cublasHandle_t handle,
                           cublasSideMode_t side, cublasFillMode_t uplo,
                           int m, int n,
                           const double          *alpha,
                           const double          *A, int lda,
                           const double          *B, int ldb,
                           const double          *beta,
                           double          *C, int ldc)
cublasStatus_t cublasCsymm(cublasHandle_t handle,
                           cublasSideMode_t side, cublasFillMode_t uplo,
                           int m, int n,
                           const cuComplex       *alpha,
                           const cuComplex       *A, int lda,
                           const cuComplex       *B, int ldb,
                           const cuComplex       *beta,
                           cuComplex       *C, int ldc)
cublasStatus_t cublasZsymm(cublasHandle_t handle,
                           cublasSideMode_t side, cublasFillMode_t uplo,
                           int m, int n,
                           const cuDoubleComplex *alpha,
                           const cuDoubleComplex *A, int lda,
                           const cuDoubleComplex *B, int ldb,
                           const cuDoubleComplex *beta,
                           cuDoubleComplex *C, int ldc)

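This function supports the 64-bit integer interface.
This function performs the symmetric matrix-matrix multiplication C = alpha * A * B + beta * C if side == CUBLAS_SIDE_LEFT, or C = alpha * B * A + beta * C if side == CUBLAS_SIDE_RIGHT, where A is a symmetric matrix stored in lower or upper mode, B and C are m x n matrices, and alpha and beta are scalars.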
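
What "only one triangle of A is stored" means in practice can be shown with a small CPU reference: whenever an element from the unstored triangle is needed, its mirror image is read instead. This is a sketch of the semantics for `float`, not of the cuBLAS implementation; `Side`, `Fill`, `sym_at` and `symm_ref` are names invented here.

```cpp
#include <cassert>
#include <cstddef>

enum class Side { Left, Right };   // hypothetical stand-in for cublasSideMode_t
enum class Fill { Lower, Upper };  // hypothetical stand-in for cublasFillMode_t

// Element (i, j) of symmetric A, reading only the triangle named by uplo.
inline float sym_at(const float* A, int lda, Fill uplo, int i, int j) {
    const bool stored = (uplo == Fill::Lower) ? (i >= j) : (i <= j);
    return stored ? A[i + static_cast<std::size_t>(j) * lda]
                  : A[j + static_cast<std::size_t>(i) * lda];
}

// CPU reference: C = alpha*A*B + beta*C (Left) or C = alpha*B*A + beta*C (Right).
void symm_ref(Side side, Fill uplo, int m, int n,
              float alpha, const float* A, int lda,
              const float* B, int ldb,
              float beta, float* C, int ldc) {
    const int ka = (side == Side::Left) ? m : n;  // A is ka x ka
    assert(lda >= ka && ldb >= m && ldc >= m);
    for (int j = 0; j < n; ++j)
        for (int i = 0; i < m; ++i) {
            float acc = 0.0f;
            for (int p = 0; p < ka; ++p)
                acc += (side == Side::Left)
                           ? sym_at(A, lda, uplo, i, p) * B[p + static_cast<std::size_t>(j) * ldb]
                           : B[i + static_cast<std::size_t>(p) * ldb] * sym_at(A, lda, uplo, p, j);
            float& c = C[i + static_cast<std::size_t>(j) * ldc];
            c = (beta == 0.0f) ? alpha * acc : alpha * acc + beta * c;
        }
}
```

Because the unstored triangle is never read, it may hold arbitrary values, which is handy when A comes out of a routine that only fills one triangle.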

Param.	Memory	In/out	Meaning
handle		input	handle to the cuBLAS library context.
side		input	indicates if matrix `A` is on the left or right of `B`.
uplo		input	indicates if matrix `A` lower or upper part is stored, the other symmetric part is not referenced and is inferred from the stored elements.
m		input	number of rows of matrix `C` and `B`, with matrix `A` sized accordingly.
n		input	number of columns of matrix `C` and `B`, with matrix `A` sized accordingly.
alpha	host or device	input	<type> scalar used for multiplication.
A	device	input	<type> array of dimension `lda x m` with `lda>=max(1,m)` if `side == CUBLAS_SIDE_LEFT` and `lda x n` with `lda>=max(1,n)` otherwise.
lda		input	leading dimension of two-dimensional array used to store matrix `A`.
B	device	input	<type> array of dimension `ldb x n` with `ldb>=max(1,m)`.
ldb		input	leading dimension of two-dimensional array used to store matrix `B`.
beta	host or device	input	<type> scalar used for multiplication, if `beta == 0` then `C` does not have to be a valid input.
C	device	in/out	<type> array of dimension `ldc x n` with `ldc>=max(1,m)`.
ldc		input	leading dimension of two-dimensional array used to store matrix `C`.

The possible error values returned by this function and their meanings are listed below.

Error Value	Meaning
`CUBLAS_STATUS_SUCCESS`	the operation completed successfully
`CUBLAS_STATUS_NOT_INITIALIZED`	the library was not initialized
`CUBLAS_STATUS_INVALID_VALUE`	If `m`, `n` < 0 or if `side` != `CUBLAS_SIDE_LEFT`, `CUBLAS_SIDE_RIGHT` or if `uplo` != `CUBLAS_FILL_MODE_LOWER`, `CUBLAS_FILL_MODE_UPPER` or if `lda` < max(1, `m`) if `side` == `CUBLAS_SIDE_LEFT` and `lda` < max(1, `n`) otherwise or if `ldb` < max(1, `m`) or if `ldc` < max(1, `m`) or if `alpha` == NULL or `beta` == NULL or `C` == NULL if `C` needs to be scaled
`CUBLAS_STATUS_EXECUTION_FAILED`	the function failed to launch on the GPU

cublas<t>syrk()

cublasStatus_t cublasSsyrk(cublasHandle_t handle,
                           cublasFillMode_t uplo, cublasOperation_t trans,
                           int n, int k,
                           const float           *alpha,
                           const float           *A, int lda,
                           const float           *beta,
                           float           *C, int ldc)
cublasStatus_t cublasDsyrk(cublasHandle_t handle,
                           cublasFillMode_t uplo, cublasOperation_t trans,
                           int n, int k,
                           const double          *alpha,
                           const double          *A, int lda,
                           const double          *beta,
                           double          *C, int ldc)
cublasStatus_t cublasCsyrk(cublasHandle_t handle,
                           cublasFillMode_t uplo, cublasOperation_t trans,
                           int n, int k,
                           const cuComplex       *alpha,
                           const cuComplex       *A, int lda,
                           const cuComplex       *beta,
                           cuComplex       *C, int ldc)
cublasStatus_t cublasZsyrk(cublasHandle_t handle,
                           cublasFillMode_t uplo, cublasOperation_t trans,
                           int n, int k,
                           const cuDoubleComplex *alpha,
                           const cuDoubleComplex *A, int lda,
                           const cuDoubleComplex *beta,
                           cuDoubleComplex *C, int ldc)

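This function supports the 64-bit integer interface.
This function performs the symmetric rank-k update C = alpha * op(A) * op(A)^T + beta * C, where alpha and beta are scalars, C is a symmetric matrix stored in lower or upper mode, and op(A) is an n x k matrix.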
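
A CPU reference makes the two distinctive points of the rank-k update visible: A is multiplied by its own transpose, and only one triangle of C is touched. The sketch below covers the non-transposed case (`trans == CUBLAS_OP_N`) for `float`; `Fill` and `syrk_ref` are names made up for this illustration.

```cpp
#include <cassert>
#include <cstddef>

enum class Fill { Lower, Upper };  // hypothetical stand-in for cublasFillMode_t

// CPU reference for the non-transposed case: C = alpha * A * A^T + beta * C,
// with A n x k column-major. Only the uplo triangle of C is read or written.
void syrk_ref(Fill uplo, int n, int k,
              float alpha, const float* A, int lda,
              float beta, float* C, int ldc) {
    assert(lda >= n && ldc >= n);
    for (int j = 0; j < n; ++j) {
        const int i_begin = (uplo == Fill::Lower) ? j : 0;
        const int i_end   = (uplo == Fill::Lower) ? n : j + 1;
        for (int i = i_begin; i < i_end; ++i) {
            float acc = 0.0f;
            for (int p = 0; p < k; ++p)
                acc += A[i + static_cast<std::size_t>(p) * lda] *
                       A[j + static_cast<std::size_t>(p) * lda];
            float& c = C[i + static_cast<std::size_t>(j) * ldc];
            c = (beta == 0.0f) ? alpha * acc : alpha * acc + beta * c;
        }
    }
}
```

If the full matrix is needed afterwards, the untouched triangle has to be filled in by mirroring the computed one.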

Param.	Memory	In/out	Meaning
handle		input	handle to the cuBLAS library context.
uplo		input	indicates if matrix `C` lower or upper part is stored, the other symmetric part is not referenced and is inferred from the stored elements.
trans		input	operation op(`A`) that is non- or transpose.
n		input	number of rows of matrix op(`A`) and `C`.
k		input	number of columns of matrix op(`A`).
alpha	host or device	input	<type> scalar used for multiplication.
A	device	input	<type> array of dimension `lda x k` with `lda>=max(1,n)` if `trans == CUBLAS_OP_N` and `lda x n` with `lda>=max(1,k)` otherwise.
lda		input	leading dimension of two-dimensional array used to store matrix A.
beta	host or device	input	<type> scalar used for multiplication, if `beta==0` then `C` does not have to be a valid input.
C	device	in/out	<type> array of dimension `ldc x n`, with `ldc>=max(1,n)`.
ldc		input	leading dimension of two-dimensional array used to store matrix `C`.

The possible error values returned by this function and their meanings are listed below.

Error Value	Meaning
`CUBLAS_STATUS_SUCCESS`	the operation completed successfully
`CUBLAS_STATUS_NOT_INITIALIZED`	the library was not initialized
`CUBLAS_STATUS_INVALID_VALUE`	If `n`, `k` < 0 or if `trans` != `CUBLAS_OP_N`, `CUBLAS_OP_C`, `CUBLAS_OP_T` or if `uplo` != `CUBLAS_FILL_MODE_LOWER`, `CUBLAS_FILL_MODE_UPPER` or if `lda` < max(1, `n`) if `trans` == `CUBLAS_OP_N` and `lda` < max(1, `k`) otherwise or if `ldc` < max(1, `n`) or if `alpha` == NULL or `beta` == NULL or `C` == NULL if `C` needs to be scaled
`CUBLAS_STATUS_EXECUTION_FAILED`	the function failed to launch on the GPU

cublas<t>syr2k()

cublasStatus_t cublasSsyr2k(cublasHandle_t handle,
                            cublasFillMode_t uplo, cublasOperation_t trans,
                            int n, int k,
                            const float           *alpha,
                            const float           *A, int lda,
                            const float           *B, int ldb,
                            const float           *beta,
                            float           *C, int ldc)
cublasStatus_t cublasDsyr2k(cublasHandle_t handle,
                            cublasFillMode_t uplo, cublasOperation_t trans,
                            int n, int k,
                            const double          *alpha,
                            const double          *A, int lda,
                            const double          *B, int ldb,
                            const double          *beta,
                            double          *C, int ldc)
cublasStatus_t cublasCsyr2k(cublasHandle_t handle,
                            cublasFillMode_t uplo, cublasOperation_t trans,
                            int n, int k,
                            const cuComplex       *alpha,
                            const cuComplex       *A, int lda,
                            const cuComplex       *B, int ldb,
                            const cuComplex       *beta,
                            cuComplex       *C, int ldc)
cublasStatus_t cublasZsyr2k(cublasHandle_t handle,
                            cublasFillMode_t uplo, cublasOperation_t trans,
                            int n, int k,
                            const cuDoubleComplex *alpha,
                            const cuDoubleComplex *A, int lda,
                            const cuDoubleComplex *B, int ldb,
                            const cuDoubleComplex *beta,
                            cuDoubleComplex *C, int ldc)

Param.	Memory	In/out	Meaning
handle		input	handle to the cuBLAS library context.
uplo		input	indicates whether the lower or upper part of matrix `C` is stored; the other symmetric part is not referenced and is inferred from the stored elements.
trans		input	operation op(`A`), op(`B`): either non-transpose or transpose.
n		input	number of rows of matrix op(`A`), op(`B`) and `C`.
k		input	number of columns of matrix op(`A`) and op(`B`).
alpha	host or device	input	<type> scalar used for multiplication.
A	device	input	<type> array of dimension `lda x k` with `lda>=max(1,n)` if `trans == CUBLAS_OP_N` and `lda x n` with `lda>=max(1,k)` otherwise.
lda		input	leading dimension of two-dimensional array used to store matrix `A`.
B	device	input	<type> array of dimension `ldb x k` with `ldb>=max(1,n)` if `trans == CUBLAS_OP_N` and `ldb x n` with `ldb>=max(1,k)` otherwise.
ldb		input	leading dimension of two-dimensional array used to store matrix `B`.
beta	host or device	input	<type> scalar used for multiplication, if `beta==0`, then `C` does not have to be a valid input.
C	device	in/out	<type> array of dimensions `ldc x n` with `ldc>=max(1,n)`.
ldc		input	leading dimension of two-dimensional array used to store matrix `C`.
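
The table above only lists the arguments; the operation itself is the symmetric rank-2k update C = alpha * (op(A) * op(B)^T + op(B) * op(A)^T) + beta * C. As a rough host-side illustration (not the cuBLAS implementation), the sketch below covers only the `CUBLAS_OP_N`, lower-fill case on column-major arrays. The name `ref_ssyr2k_lower_n` and the `IDX` macro are hypothetical helpers introduced for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Column-major element access, matching the cuBLAS storage convention. */
#define IDX(i, j, ld) ((size_t)(j) * (size_t)(ld) + (size_t)(i))

/* Hypothetical host reference, trans == CUBLAS_OP_N and lower fill only:
 * C = alpha * (A * B^T + B * A^T) + beta * C, with A and B of size n x k.
 * Only the lower triangle of C is read or written. */
void ref_ssyr2k_lower_n(int n, int k, float alpha,
                        const float *A, int lda,
                        const float *B, int ldb,
                        float beta, float *C, int ldc)
{
    int min_ld = n > 1 ? n : 1;
    assert(n >= 0 && k >= 0);
    assert(lda >= min_ld && ldb >= min_ld && ldc >= min_ld);
    for (int j = 0; j < n; ++j) {
        for (int i = j; i < n; ++i) {              /* lower triangle: i >= j */
            float acc = 0.0f;
            for (int p = 0; p < k; ++p) {
                acc += A[IDX(i, p, lda)] * B[IDX(j, p, ldb)]
                     + B[IDX(i, p, ldb)] * A[IDX(j, p, lda)];
            }
            /* When beta == 0, C is not read, as the beta row above states. */
            float old = (beta == 0.0f) ? 0.0f : beta * C[IDX(i, j, ldc)];
            C[IDX(i, j, ldc)] = alpha * acc + old;
        }
    }
}
```

A reference like this can be handy for checking the lower triangle returned by `cublasSsyr2k` on small inputs; the strictly upper part of `C` is left untouched.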

cublas<t>trmm()

cublasStatus_t cublasStrmm(cublasHandle_t handle,
                           cublasSideMode_t side, cublasFillMode_t uplo,
                           cublasOperation_t trans, cublasDiagType_t diag,
                           int m, int n,
                           const float           *alpha,
                           const float           *A, int lda,
                           const float           *B, int ldb,
                           float                 *C, int ldc)
cublasStatus_t cublasDtrmm(cublasHandle_t handle,
                           cublasSideMode_t side, cublasFillMode_t uplo,
                           cublasOperation_t trans, cublasDiagType_t diag,
                           int m, int n,
                           const double          *alpha,
                           const double          *A, int lda,
                           const double          *B, int ldb,
                           double                *C, int ldc)
cublasStatus_t cublasCtrmm(cublasHandle_t handle,
                           cublasSideMode_t side, cublasFillMode_t uplo,
                           cublasOperation_t trans, cublasDiagType_t diag,
                           int m, int n,
                           const cuComplex       *alpha,
                           const cuComplex       *A, int lda,
                           const cuComplex       *B, int ldb,
                           cuComplex             *C, int ldc)
cublasStatus_t cublasZtrmm(cublasHandle_t handle,
                           cublasSideMode_t side, cublasFillMode_t uplo,
                           cublasOperation_t trans, cublasDiagType_t diag,
                           int m, int n,
                           const cuDoubleComplex *alpha,
                           const cuDoubleComplex *A, int lda,
                           const cuDoubleComplex *B, int ldb,
                           cuDoubleComplex       *C, int ldc)

Param.	Memory	In/out	Meaning
handle		input	handle to the cuBLAS library context.
side		input	indicates if matrix `A` is on the left or right of `B`.
uplo		input	indicates whether the lower or upper part of matrix `A` is stored; the other part is not referenced and is inferred from the stored elements.
trans		input	operation op(`A`) that is non- or (conj.) transpose.
diag		input	indicates if the elements on the main diagonal of matrix `A` are unity and should not be accessed.
m		input	number of rows of matrix `B`, with matrix `A` sized accordingly.
n		input	number of columns of matrix `B`, with matrix `A` sized accordingly.
alpha	host or device	input	<type> scalar used for multiplication, if `alpha==0` then `A` is not referenced and `B` does not have to be a valid input.
A	device	input	<type> array of dimension `lda x m` with `lda>=max(1,m)` if `side == CUBLAS_SIDE_LEFT` and `lda x n` with `lda>=max(1,n)` otherwise.
lda		input	leading dimension of two-dimensional array used to store matrix `A`.
B	device	input	<type> array of dimension `ldb x n` with `ldb>=max(1,m)`.
ldb		input	leading dimension of two-dimensional array used to store matrix `B`.
C	device	in/out	<type> array of dimension `ldc x n` with `ldc>=max(1,m)`.
ldc		input	leading dimension of two-dimensional array used to store matrix `C`.

The possible error values returned by this function and their meanings are listed below.

Error Value	Meaning
`CUBLAS_STATUS_SUCCESS`	the operation completed successfully
`CUBLAS_STATUS_NOT_INITIALIZED`	the library was not initialized
`CUBLAS_STATUS_INVALID_VALUE`	If `m`, `n` < 0 or if `trans` != `CUBLAS_OP_N`, `CUBLAS_OP_C`, `CUBLAS_OP_T` or if `uplo` != `CUBLAS_FILL_MODE_LOWER`, `CUBLAS_FILL_MODE_UPPER` or if `side` != `CUBLAS_SIDE_LEFT`, `CUBLAS_SIDE_RIGHT` or if `lda` < max(1, `m`) if `side` == `CUBLAS_SIDE_LEFT` and `lda` < max(1, `n`) otherwise or if `ldb` < max(1, `m`) or `C` == NULL if `C` needs to be scaled
`CUBLAS_STATUS_EXECUTION_FAILED`	the function failed to launch on the GPU
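
For orientation, `trmm` computes C = alpha * op(A) * B when `side == CUBLAS_SIDE_LEFT` (and C = alpha * B * op(A) on the right), with `A` triangular. The following is a minimal host-side sketch of the left, lower, non-transposed case only; it is not how cuBLAS implements the routine, and the names `ref_strmm_left_lower_n` and `unit_diag` are assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Column-major element access, matching the cuBLAS storage convention. */
#define IDX(i, j, ld) ((size_t)(j) * (size_t)(ld) + (size_t)(i))

/* Hypothetical host reference for side == LEFT, uplo == LOWER, trans == N:
 * C = alpha * A * B, with A m x m lower triangular and B, C of size m x n.
 * unit_diag != 0 mimics CUBLAS_DIAG_UNIT: the diagonal of A is taken as 1
 * and is never read. The strictly upper part of A is never read either. */
void ref_strmm_left_lower_n(int unit_diag, int m, int n, float alpha,
                            const float *A, int lda,
                            const float *B, int ldb,
                            float *C, int ldc)
{
    int min_ld = m > 1 ? m : 1;
    assert(m >= 0 && n >= 0);
    assert(lda >= min_ld && ldb >= min_ld && ldc >= min_ld);
    for (int j = 0; j < n; ++j) {
        for (int i = 0; i < m; ++i) {
            float acc = unit_diag ? B[IDX(i, j, ldb)]
                                  : A[IDX(i, i, lda)] * B[IDX(i, j, ldb)];
            for (int p = 0; p < i; ++p)            /* strictly lower part of row i */
                acc += A[IDX(i, p, lda)] * B[IDX(p, j, ldb)];
            C[IDX(i, j, ldc)] = alpha * acc;
        }
    }
}
```

Note that the sketch writes to a separate `C`, mirroring the out-of-place signature shown above.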

cublas<t>trsm()

cublasStatus_t cublasStrsm(cublasHandle_t handle,
                           cublasSideMode_t side, cublasFillMode_t uplo,
                           cublasOperation_t trans, cublasDiagType_t diag,
                           int m, int n,
                           const float           *alpha,
                           const float           *A, int lda,
                           float           *B, int ldb)
cublasStatus_t cublasDtrsm(cublasHandle_t handle,
                           cublasSideMode_t side, cublasFillMode_t uplo,
                           cublasOperation_t trans, cublasDiagType_t diag,
                           int m, int n,
                           const double          *alpha,
                           const double          *A, int lda,
                           double          *B, int ldb)
cublasStatus_t cublasCtrsm(cublasHandle_t handle,
                           cublasSideMode_t side, cublasFillMode_t uplo,
                           cublasOperation_t trans, cublasDiagType_t diag,
                           int m, int n,
                           const cuComplex       *alpha,
                           const cuComplex       *A, int lda,
                           cuComplex       *B, int ldb)
cublasStatus_t cublasZtrsm(cublasHandle_t handle,
                           cublasSideMode_t side, cublasFillMode_t uplo,
                           cublasOperation_t trans, cublasDiagType_t diag,
                           int m, int n,
                           const cuDoubleComplex *alpha,
                           const cuDoubleComplex *A, int lda,
                           cuDoubleComplex *B, int ldb)

Param.	Memory	In/out	Meaning
handle		input	handle to the cuBLAS library context.
side		input	indicates if matrix `A` is on the left or right of `X`.
uplo		input	indicates whether the lower or upper part of matrix `A` is stored; the other part is not referenced and is inferred from the stored elements.
trans		input	operation op(`A`) that is non- or (conj.) transpose.
diag		input	indicates if the elements on the main diagonal of matrix `A` are unity and should not be accessed.
m		input	number of rows of matrix `B`, with matrix `A` sized accordingly.
n		input	number of columns of matrix `B`, with matrix `A` sized accordingly.
alpha	host or device	input	<type> scalar used for multiplication, if `alpha==0` then `A` is not referenced and `B` does not have to be a valid input.
A	device	input	<type> array of dimension `lda x m` with `lda>=max(1,m)` if `side == CUBLAS_SIDE_LEFT` and `lda x n` with `lda>=max(1,n)` otherwise.
lda		input	leading dimension of two-dimensional array used to store matrix `A`.
B	device	in/out	<type> array. It has dimensions `ldb x n` with `ldb>=max(1,m)`.
ldb		input	leading dimension of two-dimensional array used to store matrix `B`.

The possible error values returned by this function and their meanings are listed below.

Error Value	Meaning
`CUBLAS_STATUS_SUCCESS`	the operation completed successfully
`CUBLAS_STATUS_NOT_INITIALIZED`	the library was not initialized
`CUBLAS_STATUS_INVALID_VALUE`	If `m` < 0 or `n` < 0 or if `trans` != `CUBLAS_OP_N`, `CUBLAS_OP_C`, `CUBLAS_OP_T` or if `uplo` != `CUBLAS_FILL_MODE_LOWER`, `CUBLAS_FILL_MODE_UPPER` or if `side` != `CUBLAS_SIDE_LEFT`, `CUBLAS_SIDE_RIGHT` or if `diag` != `CUBLAS_DIAG_NON_UNIT`, `CUBLAS_DIAG_UNIT` or if `lda` < max(1, `m`) if `side` == `CUBLAS_SIDE_LEFT` and `lda` < max(1, `n`) otherwise or if `ldb` < max(1, `m`) or `alpha` == NULL
`CUBLAS_STATUS_EXECUTION_FAILED`	the function failed to launch on the GPU
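
`trsm` solves the triangular system op(A) * X = alpha * B (left side) or X * op(A) = alpha * B (right side) and overwrites `B` with the solution `X`, which is why `B` is marked in/out above. A small host-side sketch of the left, lower, non-transposed case using forward substitution is given below. It is an illustration under those restrictions, not the cuBLAS algorithm, and `ref_strsm_left_lower_n` is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>

/* Column-major element access, matching the cuBLAS storage convention. */
#define IDX(i, j, ld) ((size_t)(j) * (size_t)(ld) + (size_t)(i))

/* Hypothetical host reference for side == LEFT, uplo == LOWER, trans == N:
 * solves A * X = alpha * B by forward substitution and stores X in B.
 * unit_diag != 0 mimics CUBLAS_DIAG_UNIT (diagonal assumed 1, never read). */
void ref_strsm_left_lower_n(int unit_diag, int m, int n, float alpha,
                            const float *A, int lda,
                            float *B, int ldb)
{
    int min_ld = m > 1 ? m : 1;
    assert(m >= 0 && n >= 0);
    assert(lda >= min_ld && ldb >= min_ld);
    for (int j = 0; j < n; ++j) {                  /* each column is independent */
        for (int i = 0; i < m; ++i) {
            float x = alpha * B[IDX(i, j, ldb)];
            for (int p = 0; p < i; ++p)            /* B[p][j] already holds X[p][j] */
                x -= A[IDX(i, p, lda)] * B[IDX(p, j, ldb)];
            if (!unit_diag)
                x /= A[IDX(i, i, lda)];
            B[IDX(i, j, ldb)] = x;
        }
    }
}
```

No singularity check is performed here; a zero on the diagonal simply produces an infinity or NaN in the result.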

cublas<t>trsmBatched()

cublasStatus_t cublasStrsmBatched( cublasHandle_t    handle,
                                   cublasSideMode_t  side,
                                   cublasFillMode_t  uplo,
                                   cublasOperation_t trans,
                                   cublasDiagType_t  diag,
                                   int m,
                                   int n,
                                   const float *alpha,
                                   const float *const A[],
                                   int lda,
                                   float *const B[],
                                   int ldb,
                                   int batchCount);
cublasStatus_t cublasDtrsmBatched( cublasHandle_t    handle,
                                   cublasSideMode_t  side,
                                   cublasFillMode_t  uplo,
                                   cublasOperation_t trans,
                                   cublasDiagType_t  diag,
                                   int m,
                                   int n,
                                   const double *alpha,
                                   const double *const A[],
                                   int lda,
                                   double *const B[],
                                   int ldb,
                                   int batchCount);
cublasStatus_t cublasCtrsmBatched( cublasHandle_t    handle,
                                   cublasSideMode_t  side,
                                   cublasFillMode_t  uplo,
                                   cublasOperation_t trans,
                                   cublasDiagType_t  diag,
                                   int m,
                                   int n,
                                   const cuComplex *alpha,
                                   const cuComplex *const A[],
                                   int lda,
                                   cuComplex *const B[],
                                   int ldb,
                                   int batchCount);
cublasStatus_t cublasZtrsmBatched( cublasHandle_t    handle,
                                   cublasSideMode_t  side,
                                   cublasFillMode_t  uplo,
                                   cublasOperation_t trans,
                                   cublasDiagType_t  diag,
                                   int m,
                                   int n,
                                   const cuDoubleComplex *alpha,
                                   const cuDoubleComplex *const A[],
                                   int lda,
                                   cuDoubleComplex *const B[],
                                   int ldb,
                                   int batchCount);

Param.	Memory	In/out	Meaning
handle		input	handle to the cuBLAS library context.
side		input	indicates if matrix `A[i]` is on the left or right of `X[i]`.
uplo		input	indicates whether the lower or upper part of matrix `A[i]` is stored; the other part is not referenced and is inferred from the stored elements.
trans		input	operation op(`A[i]`) that is non- or (conj.) transpose.
diag		input	indicates if the elements on the main diagonal of matrix `A[i]` are unity and should not be accessed.
m		input	number of rows of matrix `B[i]`, with matrix `A[i]` sized accordingly.
n		input	number of columns of matrix `B[i]`, with matrix `A[i]` sized accordingly.
alpha	host or device	input	<type> scalar used for multiplication, if `alpha==0` then `A[i]` is not referenced and `B[i]` does not have to be a valid input.
A	device	input	array of pointers to <type> array, with each array of dim. `lda x m` with `lda>=max(1,m)` if `side == CUBLAS_SIDE_LEFT` and `lda x n` with `lda>=max(1,n)` otherwise.
lda		input	leading dimension of two-dimensional array used to store matrix `A[i]`.
B	device	in/out	array of pointers to <type> array, with each array of dim. `ldb x n` with `ldb>=max(1,m)`. Matrices `B[i]` should not overlap; otherwise, undefined behavior is expected.
ldb		input	leading dimension of two-dimensional array used to store matrix `B[i]`.
batchCount		input	number of pointers contained in A and B.

The possible error values returned by this function and their meanings are listed below.

Error Value	Meaning
`CUBLAS_STATUS_SUCCESS`	the operation completed successfully
`CUBLAS_STATUS_NOT_INITIALIZED`	the library was not initialized
`CUBLAS_STATUS_INVALID_VALUE`	If `m` < 0 or `n` < 0 or `batchCount` < 0 or if `trans` != `CUBLAS_OP_N`, `CUBLAS_OP_C`, `CUBLAS_OP_T` or if `uplo` != `CUBLAS_FILL_MODE_LOWER`, `CUBLAS_FILL_MODE_UPPER` or if `side` != `CUBLAS_SIDE_LEFT`, `CUBLAS_SIDE_RIGHT` or if `diag` != `CUBLAS_DIAG_NON_UNIT`, `CUBLAS_DIAG_UNIT` or if `lda` < max(1, `m`) if `side` == `CUBLAS_SIDE_LEFT` and `lda` < max(1, `n`) otherwise or `ldb` < max(1, `m`)
`CUBLAS_STATUS_EXECUTION_FAILED`	the function failed to launch on the GPU
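
The batched variant takes arrays of pointers rather than single matrices, and the `B[i]` must not overlap. One common layout, assumed here purely for illustration, keeps all matrices in one contiguous buffer at a fixed stride. The sketch below builds such a pointer array and performs a conservative overlap check; `fill_batch_pointers` and `batch_is_disjoint` are hypothetical helpers. With real cuBLAS the addresses would be device addresses, and the pointer array itself would also have to be copied to device memory, as the Memory column indicates.

```c
#include <assert.h>
#include <stddef.h>

/* Conservative check (hypothetical helper): matrices stored as ld x cols
 * column-major blocks, spaced `stride` elements apart, cannot overlap
 * when stride >= ld * cols. */
int batch_is_disjoint(size_t stride, int ld, int cols)
{
    return stride >= (size_t)ld * (size_t)cols;
}

/* Fill ptrs[i] with the address of the i-th matrix inside one contiguous
 * buffer. ptrs must have room for batchCount entries. */
void fill_batch_pointers(float *base, size_t stride, int batchCount,
                         float **ptrs)
{
    assert(batchCount >= 0);
    for (int i = 0; i < batchCount; ++i)
        ptrs[i] = base + (size_t)i * stride;
}
```

For `B` with `ldb = 2` and `n = 2`, a stride of 4 elements passes the check, while a stride of 3 would let consecutive matrices share storage.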

cublas<t>hemm()

cublasStatus_t cublasChemm(cublasHandle_t handle,
                           cublasSideMode_t side, cublasFillMode_t uplo,
                           int m, int n,
                           const cuComplex       *alpha,
                           const cuComplex       *A, int lda,
                           const cuComplex       *B, int ldb,
                           const cuComplex       *beta,
                           cuComplex       *C, int ldc)
cublasStatus_t cublasZhemm(cublasHandle_t handle,
                           cublasSideMode_t side, cublasFillMode_t uplo,
                           int m, int n,
                           const cuDoubleComplex *alpha,
                           const cuDoubleComplex *A, int lda,
                           const cuDoubleComplex *B, int ldb,
                           const cuDoubleComplex *beta,
                           cuDoubleComplex *C, int ldc)

Param.	Memory	In/out	Meaning
handle		input	handle to the cuBLAS library context.
side		input	indicates if matrix `A` is on the left or right of `B`.
uplo		input	indicates whether the lower or upper part of matrix `A` is stored; the other Hermitian part is not referenced and is inferred from the stored elements.
m		input	number of rows of matrix `C` and `B`, with matrix `A` sized accordingly.
n		input	number of columns of matrix `C` and `B`, with matrix `A` sized accordingly.
alpha	host or device	input	<type> scalar used for multiplication.
A	device	input	<type> array of dimension `lda x m` with `lda>=max(1,m)` if `side==CUBLAS_SIDE_LEFT` and `lda x n` with `lda>=max(1,n)` otherwise. The imaginary parts of the diagonal elements are assumed to be zero.
lda		input	leading dimension of two-dimensional array used to store matrix `A`.
B	device	input	<type> array of dimension `ldb x n` with `ldb>=max(1,m)`.
ldb		input	leading dimension of two-dimensional array used to store matrix `B`.
beta	host or device	input	<type> scalar used for multiplication, if `beta==0` then `C` does not have to be a valid input.
C	device	in/out	<type> array of dimensions `ldc x n` with `ldc>=max(1,m)`.
ldc		input	leading dimension of two-dimensional array used to store matrix `C`.

The possible error values returned by this function and their meanings are listed below.

Error Value	Meaning
`CUBLAS_STATUS_SUCCESS`	the operation completed successfully
`CUBLAS_STATUS_NOT_INITIALIZED`	the library was not initialized
`CUBLAS_STATUS_INVALID_VALUE`	If `m` < 0 or `n` < 0 or if `side` != `CUBLAS_SIDE_LEFT`, `CUBLAS_SIDE_RIGHT` or if `uplo` != `CUBLAS_FILL_MODE_LOWER`, `CUBLAS_FILL_MODE_UPPER` or if `lda` < max(1, `m`) if `side` == `CUBLAS_SIDE_LEFT` and `lda` < max(1, `n`) otherwise or if `ldb` < max(1, `m`) or if `ldc` < max(1, `m`) or if `alpha` == NULL or `beta` == NULL or `C` == NULL
`CUBLAS_STATUS_EXECUTION_FAILED`	the function failed to launch on the GPU
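
`hemm` computes C = alpha * A * B + beta * C (left side) or C = alpha * B * A + beta * C (right side) with `A` Hermitian. To make "inferred from the stored elements" concrete, here is a host-side sketch of the left, lower-fill case. It uses C99 `float complex` as a stand-in for `cuComplex` (a two-float struct), which is an assumption of this example, and the function names are hypothetical.

```c
#include <assert.h>
#include <complex.h>
#include <stddef.h>

/* Column-major element access, matching the cuBLAS storage convention. */
#define IDX(i, j, ld) ((size_t)(j) * (size_t)(ld) + (size_t)(i))

/* Element (i, p) of a Hermitian matrix of which only the lower part is stored. */
static float complex herm_lower_at(const float complex *A, int lda, int i, int p)
{
    if (i == p) return crealf(A[IDX(i, i, lda)]);  /* imaginary part ignored */
    if (i > p)  return A[IDX(i, p, lda)];          /* stored element */
    return conjf(A[IDX(p, i, lda)]);               /* mirrored and conjugated */
}

/* Hypothetical host reference for side == LEFT, uplo == LOWER:
 * C = alpha * A * B + beta * C, with A m x m Hermitian, B and C m x n. */
void ref_chemm_left_lower(int m, int n, float complex alpha,
                          const float complex *A, int lda,
                          const float complex *B, int ldb,
                          float complex beta, float complex *C, int ldc)
{
    int min_ld = m > 1 ? m : 1;
    assert(m >= 0 && n >= 0);
    assert(lda >= min_ld && ldb >= min_ld && ldc >= min_ld);
    for (int j = 0; j < n; ++j) {
        for (int i = 0; i < m; ++i) {
            float complex acc = 0.0f;
            for (int p = 0; p < m; ++p)
                acc += herm_lower_at(A, lda, i, p) * B[IDX(p, j, ldb)];
            float complex val = alpha * acc;
            if (beta != 0.0f)                      /* C is read only if beta != 0 */
                val += beta * C[IDX(i, j, ldc)];
            C[IDX(i, j, ldc)] = val;
        }
    }
}
```

The strictly upper part of `A` and the imaginary parts of its diagonal never influence the result, which matches the description of `uplo` and `A` in the table.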

cublas<t>herk()

cublasStatus_t cublasCherk(cublasHandle_t handle,
                           cublasFillMode_t uplo, cublasOperation_t trans,
                           int n, int k,
                           const float  *alpha,
                           const cuComplex       *A, int lda,
                           const float  *beta,
                           cuComplex       *C, int ldc)
cublasStatus_t cublasZherk(cublasHandle_t handle,
                           cublasFillMode_t uplo, cublasOperation_t trans,
                           int n, int k,
                           const double *alpha,
                           const cuDoubleComplex *A, int lda,
                           const double *beta,
                           cuDoubleComplex *C, int ldc)

Param.	Memory	In/out	Meaning
handle		input	handle to the cuBLAS library context.
uplo		input	indicates whether the lower or upper part of matrix `C` is stored; the other Hermitian part is not referenced and is inferred from the stored elements.
trans		input	operation op(`A`) that is non- or (conj.) transpose.
n		input	number of rows of matrix op(`A`) and `C`.
k		input	number of columns of matrix op(`A`).
alpha	host or device	input	real scalar used for multiplication.
A	device	input	<type> array of dimension `lda x k` with `lda>=max(1,n)` if `trans == CUBLAS_OP_N` and `lda x n` with `lda>=max(1,k)` otherwise.
lda		input	leading dimension of two-dimensional array used to store matrix `A`.
beta	host or device	input	real scalar used for multiplication, if `beta==0` then `C` does not have to be a valid input.
C	device	in/out	<type> array of dimension `ldc x n`, with `ldc>=max(1,n)`. The imaginary parts of the diagonal elements are assumed and set to zero.
ldc		input	leading dimension of two-dimensional array used to store matrix `C`.

The possible error values returned by this function and their meanings are listed below.

Error Value	Meaning
`CUBLAS_STATUS_SUCCESS`	the operation completed successfully
`CUBLAS_STATUS_NOT_INITIALIZED`	the library was not initialized
`CUBLAS_STATUS_INVALID_VALUE`	If `n` < 0 or `k` < 0 or if `trans` != `CUBLAS_OP_N`, `CUBLAS_OP_C`, `CUBLAS_OP_T` or if `uplo` != `CUBLAS_FILL_MODE_LOWER`, `CUBLAS_FILL_MODE_UPPER` or if `lda` < max(1, `n`) if `trans` == `CUBLAS_OP_N` and `lda` < max(1, `k`) otherwise or if `ldc` < max(1, `n`) or if `alpha` == NULL or `beta` == NULL or `C` == NULL
`CUBLAS_STATUS_EXECUTION_FAILED`	the function failed to launch on the GPU
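
`herk` performs the Hermitian rank-k update C = alpha * op(A) * op(A)^H + beta * C with real `alpha` and `beta`. A minimal host-side sketch of the `CUBLAS_OP_N`, lower-fill case follows. It is an illustration rather than the library's implementation; `float complex` stands in for `cuComplex`, and `ref_cherk_lower_n` is a hypothetical name.

```c
#include <assert.h>
#include <complex.h>
#include <stddef.h>

/* Column-major element access, matching the cuBLAS storage convention. */
#define IDX(i, j, ld) ((size_t)(j) * (size_t)(ld) + (size_t)(i))

/* Hypothetical host reference, trans == CUBLAS_OP_N and lower fill only:
 * C = alpha * A * A^H + beta * C, with A n x k and real alpha, beta.
 * Diagonal imaginary parts of C are treated as zero and written as zero. */
void ref_cherk_lower_n(int n, int k, float alpha,
                       const float complex *A, int lda,
                       float beta, float complex *C, int ldc)
{
    int min_ld = n > 1 ? n : 1;
    assert(n >= 0 && k >= 0);
    assert(lda >= min_ld && ldc >= min_ld);
    for (int j = 0; j < n; ++j) {
        for (int i = j; i < n; ++i) {              /* lower triangle: i >= j */
            float complex acc = 0.0f;
            for (int p = 0; p < k; ++p)
                acc += A[IDX(i, p, lda)] * conjf(A[IDX(j, p, lda)]);
            float complex val = alpha * acc;
            if (beta != 0.0f) {                    /* C is read only if beta != 0 */
                float complex old = C[IDX(i, j, ldc)];
                if (i == j) old = crealf(old);
                val += beta * old;
            }
            if (i == j) val = crealf(val);         /* keep the diagonal real */
            C[IDX(i, j, ldc)] = val;
        }
    }
}
```

Because `alpha` and `beta` are real, the result stays Hermitian, so storing only one triangle loses no information.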

cublas<t>her2k()

cublasStatus_t cublasCher2k(cublasHandle_t handle,
                            cublasFillMode_t uplo, cublasOperation_t trans,
                            int n, int k,
                            const cuComplex       *alpha,
                            const cuComplex       *A, int lda,
                            const cuComplex       *B, int ldb,
                            const float  *beta,
                            cuComplex       *C, int ldc)
cublasStatus_t cublasZher2k(cublasHandle_t handle,
                            cublasFillMode_t uplo, cublasOperation_t trans,
                            int n, int k,
                            const cuDoubleComplex *alpha,
                            const cuDoubleComplex *A, int lda,
                            const cuDoubleComplex *B, int ldb,
                            const double *beta,
                            cuDoubleComplex *C, int ldc)

Param.	Memory	In/out	Meaning
handle		input	handle to the cuBLAS library context.
uplo		input	indicates whether the lower or upper part of matrix `C` is stored; the other Hermitian part is not referenced and is inferred from the stored elements.
trans		input	operation op(`A`) that is non- or (conj.) transpose.
n		input	number of rows of matrix op(`A`), op(`B`) and `C`.
k		input	number of columns of matrix op(`A`) and op(`B`).
alpha	host or device	input	<type> scalar used for multiplication.
A	device	input	<type> array of dimension `lda x k` with `lda>=max(1,n)` if `trans == CUBLAS_OP_N` and `lda x n` with `lda>=max(1,k)` otherwise.
lda		input	leading dimension of two-dimensional array used to store matrix `A`.
B	device	input	<type> array of dimension `ldb x k` with `ldb>=max(1,n)` if `trans == CUBLAS_OP_N` and `ldb x n` with `ldb>=max(1,k)` otherwise.
ldb		input	leading dimension of two-dimensional array used to store matrix `B`.
beta	host or device	input	real scalar used for multiplication, if `beta==0` then `C` does not have to be a valid input.
C	device	in/out	<type> array of dimension `ldc x n`, with `ldc>=max(1,n)`. The imaginary parts of the diagonal elements are assumed and set to zero.
ldc		input	leading dimension of two-dimensional array used to store matrix `C`.

The possible error values returned by this function and their meanings are listed below.

Error Value	Meaning
`CUBLAS_STATUS_SUCCESS`	the operation completed successfully
`CUBLAS_STATUS_NOT_INITIALIZED`	the library was not initialized
`CUBLAS_STATUS_INVALID_VALUE`	If `n` < 0 or `k` < 0 or if `trans` != `CUBLAS_OP_N`, `CUBLAS_OP_C`, `CUBLAS_OP_T` or if `uplo` != `CUBLAS_FILL_MODE_LOWER`, `CUBLAS_FILL_MODE_UPPER` or if `lda` < max(1, `n`) if `trans` == `CUBLAS_OP_N` and `lda` < max(1, `k`) otherwise or if `ldb` < max(1, `n`) if `trans` == `CUBLAS_OP_N` and `ldb` < max(1, `k`) otherwise or if `ldc` < max(1, `n`) or if `alpha` == NULL or `beta` == NULL or `C` == NULL
`CUBLAS_STATUS_EXECUTION_FAILED`	the function failed to launch on the GPU
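
`her2k` performs the Hermitian rank-2k update C = alpha * op(A) * op(B)^H + conj(alpha) * op(B) * op(A)^H + beta * C, where `alpha` is complex and `beta` is real. The conjugated `alpha` on the second term is what keeps the result Hermitian. Below is a host-side sketch of the `CUBLAS_OP_N`, lower-fill case. It is only an illustration, with `float complex` standing in for `cuComplex` and `ref_cher2k_lower_n` a hypothetical name.

```c
#include <assert.h>
#include <complex.h>
#include <stddef.h>

/* Column-major element access, matching the cuBLAS storage convention. */
#define IDX(i, j, ld) ((size_t)(j) * (size_t)(ld) + (size_t)(i))

/* Hypothetical host reference, trans == CUBLAS_OP_N and lower fill only:
 * C = alpha * A * B^H + conj(alpha) * B * A^H + beta * C,
 * with A and B of size n x k, complex alpha and real beta. */
void ref_cher2k_lower_n(int n, int k, float complex alpha,
                        const float complex *A, int lda,
                        const float complex *B, int ldb,
                        float beta, float complex *C, int ldc)
{
    int min_ld = n > 1 ? n : 1;
    assert(n >= 0 && k >= 0);
    assert(lda >= min_ld && ldb >= min_ld && ldc >= min_ld);
    for (int j = 0; j < n; ++j) {
        for (int i = j; i < n; ++i) {              /* lower triangle: i >= j */
            float complex ab = 0.0f, ba = 0.0f;
            for (int p = 0; p < k; ++p) {
                ab += A[IDX(i, p, lda)] * conjf(B[IDX(j, p, ldb)]);
                ba += B[IDX(i, p, ldb)] * conjf(A[IDX(j, p, lda)]);
            }
            float complex val = alpha * ab + conjf(alpha) * ba;
            if (beta != 0.0f) {                    /* C is read only if beta != 0 */
                float complex old = C[IDX(i, j, ldc)];
                if (i == j) old = crealf(old);
                val += beta * old;
            }
            if (i == j) val = crealf(val);         /* keep the diagonal real */
            C[IDX(i, j, ldc)] = val;
        }
    }
}
```

For example, with A = (1, i)^T, B = (1, 1)^T and alpha = i, the two terms cancel on the first diagonal entry and the off-diagonal entry becomes -1 - i.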

cublas<t>herkx()

cublasStatus_t cublasCherkx(cublasHandle_t handle,
                            cublasFillMode_t uplo, cublasOperation_t trans,
                            int n, int k,
                            const cuComplex       *alpha,
                            const cuComplex       *A, int lda,
                            const cuComplex       *B, int ldb,
                            const float  *beta,
                            cuComplex       *C, int ldc)
cublasStatus_t cublasZherkx(cublasHandle_t handle,
                            cublasFillMode_t uplo, cublasOperation_t trans,
                            int n, int k,
                            const cuDoubleComplex *alpha,
                            const cuDoubleComplex *A, int lda,
                            const cuDoubleComplex *B, int ldb,
                            const double *beta,
                            cuDoubleComplex *C, int ldc)

Param.	Memory	In/out	Meaning
handle		input	handle to the cuBLAS library context.
uplo		input	indicates whether the lower or upper part of matrix `C` is stored; the other Hermitian part is not referenced and is inferred from the stored elements.
trans		input	operation op(`A`) that is non- or (conj.) transpose.
n		input	number of rows of matrix op(`A`), op(`B`) and `C`.
k		input	number of columns of matrix op(`A`) and op(`B`).
alpha	host or device	input	<type> scalar used for multiplication.
A	device	input	<type> array of dimension `lda x k` with `lda>=max(1,n)` if `trans == CUBLAS_OP_N` and `lda x n` with `lda>=max(1,k)` otherwise.
lda		input	leading dimension of two-dimensional array used to store matrix `A`.
B	device	input	<type> array of dimension `ldb x k` with `ldb>=max(1,n)` if `trans == CUBLAS_OP_N` and `ldb x n` with `ldb>=max(1,k)` otherwise.
ldb		input	leading dimension of two-dimensional array used to store matrix `B`.
beta	host or device	input	real scalar used for multiplication, if `beta==0` then `C` does not have to be a valid input.
C	device	in/out	<type> array of dimension `ldc x n`, with `ldc>=max(1,n)`. The imaginary parts of the diagonal elements are assumed and set to zero.
ldc		input	leading dimension of two-dimensional array used to store matrix `C`.

The possible error values returned by this function and their meanings are listed below.

Error Value	Meaning
`CUBLAS_STATUS_SUCCESS`	the operation completed successfully
`CUBLAS_STATUS_NOT_INITIALIZED`	the library was not initialized
`CUBLAS_STATUS_INVALID_VALUE`	If `n` < 0 or `k` < 0 or if `trans` != `CUBLAS_OP_N`, `CUBLAS_OP_C`, `CUBLAS_OP_T` or if `uplo` != `CUBLAS_FILL_MODE_LOWER`, `CUBLAS_FILL_MODE_UPPER` or if `lda` < max(1, `n`) if `trans` == `CUBLAS_OP_N` and `lda` < max(1, `k`) otherwise or if `ldb` < max(1, `n`) if `trans` == `CUBLAS_OP_N` and `ldb` < max(1, `k`) otherwise or if `ldc` < max(1, `n`) or if `alpha` == NULL or `beta` == NULL or `C` == NULL
`CUBLAS_STATUS_EXECUTION_FAILED`	the function failed to launch on the GPU
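
`herkx` is a variation of the rank-k update that computes C = alpha * op(A) * op(B)^H + beta * C and leaves it to the caller to make sure the result is Hermitian, for instance when `B` is `A` scaled column by column with real factors. The host-side sketch below covers the `CUBLAS_OP_N`, lower-fill case. It is an illustration only, with `float complex` standing in for `cuComplex` and `ref_cherkx_lower_n` a hypothetical name.

```c
#include <assert.h>
#include <complex.h>
#include <stddef.h>

/* Column-major element access, matching the cuBLAS storage convention. */
#define IDX(i, j, ld) ((size_t)(j) * (size_t)(ld) + (size_t)(i))

/* Hypothetical host reference, trans == CUBLAS_OP_N and lower fill only:
 * C = alpha * A * B^H + beta * C, with A and B of size n x k.
 * The caller is assumed to pass A and B such that A * B^H is Hermitian. */
void ref_cherkx_lower_n(int n, int k, float complex alpha,
                        const float complex *A, int lda,
                        const float complex *B, int ldb,
                        float beta, float complex *C, int ldc)
{
    int min_ld = n > 1 ? n : 1;
    assert(n >= 0 && k >= 0);
    assert(lda >= min_ld && ldb >= min_ld && ldc >= min_ld);
    for (int j = 0; j < n; ++j) {
        for (int i = j; i < n; ++i) {              /* lower triangle: i >= j */
            float complex acc = 0.0f;
            for (int p = 0; p < k; ++p)
                acc += A[IDX(i, p, lda)] * conjf(B[IDX(j, p, ldb)]);
            float complex val = alpha * acc;
            if (beta != 0.0f) {                    /* C is read only if beta != 0 */
                float complex old = C[IDX(i, j, ldc)];
                if (i == j) old = crealf(old);
                val += beta * old;
            }
            if (i == j) val = crealf(val);         /* keep the diagonal real */
            C[IDX(i, j, ldc)] = val;
        }
    }
}
```

With `B = 3 * A` the call reproduces three times the `herk` result for the same `A`, which is a quick sanity check on small inputs.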