Mixed Precision and Heterogeneous Computing: Miscellaneous Notes

1. NVIDIA GPU Architecture

1.1 GPU Architecture

Figure 1 shows a full GA100 GPU with 128 SMs. The A100 is based on GA100 and has 108 SMs.

SM is short for streaming multiprocessor; each SM consists of four processing blocks, as shown in Figure 2.

Each SM has 64 INT32, 64 FP32, and 32 FP64 CUDA cores, plus 4 Tensor Cores. The L1 cache is shared within an SM.

CUDA cores handle general-purpose computation; Tensor Cores are units specifically optimized for deep learning, handling matrix operations and mixed-precision arithmetic.

Figure 1. GA100 Full GPU with 128 SMs. The A100 Tensor Core GPU has 108 SMs.

Figure 2. The GA100 streaming multiprocessor (SM)

Figure 3 shows the features of NVIDIA GPUs across generations.

Figure 3. GPU architectures across generations

1.2 Tensor Core

Tensor Cores are special processing units, introduced with the Volta architecture in 2017 and evolving since, targeted at the massive multiply-accumulate workloads of AI models.

On the A100, the third-generation Tensor Cores support all the common data types, including FP16, BF16, TF32, FP64, INT8, and INT4, and provide strong FP64 double-precision throughput for HPC workloads.

At the hardware operator level, mixed precision is the technique of using half precision (FP16) for inputs and outputs while computing intermediate results in full precision (FP32), so that not too much accuracy is lost. This hardware level is exactly the Tensor Core, so having Tensor Cores on the GPU is a prerequisite for accelerating training with mixed precision.

In other words, Tensor Cores are the hardware foundation of mixed-precision training.

The basic operation of a Tensor Core is D = A*B + C, where A, B, C, and D are all matrices. Each Tensor Core completes one 4x4 MMA (matrix multiply-accumulate) operation per clock cycle, i.e., one matrix multiplication plus one matrix addition. (D = A*B + C is the most important operator in deep learning: output = weight * input + bias.)
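As a concrete sketch (our own minimal example, not NVIDIA sample code): the CUDA WMMA API exposes the Tensor Cores at warp level in 16x16x16 tiles, with the hardware scheduling the underlying 4x4 MMAs. The fragment types spell out the mixed-precision contract: FP16 operands, FP32 accumulator.

#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

// One warp computes a single 16x16 tile of D = A*B + C on the Tensor
// Cores: A and B are FP16, the accumulator is FP32, which is exactly
// the mixed-precision mode described above. Launch with one warp
// (32 threads); all matrices are 16x16 with leading dimension 16.
__global__ void tensor_core_mma(const half *A, const half *B,
                                const float *C, float *D)
{
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc;

    wmma::load_matrix_sync(a, A, 16);
    wmma::load_matrix_sync(b, B, 16);
    wmma::load_matrix_sync(acc, C, 16, wmma::mem_row_major);

    wmma::mma_sync(acc, a, b, acc);   // acc = A*B + acc, accumulated in FP32

    wmma::store_matrix_sync(D, acc, 16, wmma::mem_row_major);
}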

Figure 4. Multiplication in FP16, accumulation in FP32 (so that small addends are not swallowed by large ones)

2. The MAGMA Library

2.1 Introduction to MAGMA

MAGMA is intended for CUDA enabled NVIDIA GPUs and HIP enabled AMD GPUs.

It supports NVIDIA's Kepler, Maxwell, Pascal, Volta, Turing, Ampere, and Hopper GPUs, and AMD's HIP GPUs.

Included are routines for the following algorithms:

    * LU, QR, and Cholesky factorizations

    * Hessenberg, bidiagonal, and tridiagonal reductions

    * Linear solvers based on LU, QR, and Cholesky

    * Eigenvalue and singular value (SVD) problem solvers

    * Generalized Hermitian-definite eigenproblem solver

    * Mixed-precision iterative refinement solvers based on LU, QR, and Cholesky

    * MAGMA BLAS including gemm, gemv, symv, and trsm

    * Batched MAGMA BLAS including gemm, gemv, herk, and trsm

    * Batched MAGMA LAPACK including LU, inverse (getri), QR, and Cholesky factorizations

    * MAGMA Sparse including CG, GMRES, BiCGSTAB, LOBPCG, iterative refinement, preconditioners, sparse kernels (SpMV, SpMM), and support for CSR, ELL, and SELL-P data formats

MAGMA's dense linear algebra library does include mixed-precision algorithms, namely iterative refinement, released in MAGMA 2.5.

For example, the two interfaces magma_dsgesv_iteref_gpu and magma_dhgesv_iteref_gpu:

1- The FP32-to-FP64 API magma_dsgesv_iteref_gpu, which is similar to the LAPACK dsgesv API. Here A, X, and B are FP64; the routine does the internal conversion and computation and returns an FP64 solution.

2- The FP16-to-FP64 API magma_dhgesv_iteref_gpu, which is similar to the magma_dsgesv_gpu API except that it uses the Tensor Cores and performs the computations in FP16. Here A, X, and B are FP64; the routine does the internal conversion and computation and returns an FP64 solution.
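To illustrate the pattern behind these routines, here is a self-contained toy sketch of mixed-precision iterative refinement, with FP32 standing in for the low precision and FP64 for the high precision. It shows only the algorithmic structure, not MAGMA's API: the expensive factorization is done once in low precision, and each cheap refinement step runs in FP64 to recover full accuracy.

#include <cstdio>
#include <cmath>

int main()
{
    // 2x2 system A*x = b with exact solution x = (1, 2).
    const double A[4] = {4.0, 1.0, 1.0, 3.0};
    const double b[2] = {6.0, 7.0};

    // Low-precision "factorization": invert the 2x2 matrix in float,
    // standing in for MAGMA's FP16 (or FP32) LU factorization.
    const float det = 4.0f * 3.0f - 1.0f * 1.0f;
    const float Ainv[4] = {3.0f / det, -1.0f / det, -1.0f / det, 4.0f / det};
    auto solve_lo = [&](const double r[2], double d[2]) {
        d[0] = (double)(Ainv[0] * (float)r[0] + Ainv[1] * (float)r[1]);
        d[1] = (double)(Ainv[2] * (float)r[0] + Ainv[3] * (float)r[1]);
    };

    double x[2];
    solve_lo(b, x);                       // initial low-precision solve
    for (int it = 0; it < 10; ++it) {     // refinement loop, all in FP64
        const double r[2] = {b[0] - (A[0]*x[0] + A[1]*x[1]),   // r = b - A*x
                             b[1] - (A[2]*x[0] + A[3]*x[1])};
        const double nrm = std::sqrt(r[0]*r[0] + r[1]*r[1]);
        std::printf("iter %d  ||r|| = %.3e\n", it, nrm);
        if (nrm < 1e-14) break;           // converged to FP64 accuracy
        double d[2];
        solve_lo(r, d);                   // cheap low-precision correction
        x[0] += d[0];                     // x = x + d, updated in FP64
        x[1] += d[1];
    }
    std::printf("x = (%.15f, %.15f)\n", x[0], x[1]);
    return 0;
}

Each refinement iteration shrinks the error by a factor of roughly cond(A) times the low-precision unit roundoff, which is why a handful of cheap iterations recovers an FP64-accurate solution.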

MAGMA added Tensor Core support and mixed-precision computation in MAGMA 2.5, released in 2018.

 2.5.0 - Nov 16, 2018
    * New routines: Magma is releasing the Nvidia Tensor Cores version
      of its linear mixed-precision solver that is able to provide an
      FP64 solution with up to 4X speedup. The release includes:
      magma_dhgesv_iteref_gpu (FP64-FP16 solver with FP64 input and solution)
      magma_dsgesv_iteref_gpu (FP64-FP32 solver with FP64 input and solution)
      magma_hgetrf_gpu        (mixed precision FP32-FP16 LU factorization)
      magma_htgetrf_gpu       (mixed precision FP32-FP16 LU factorization using Tensor Cores)

2.2 MAGMA Sparse

MAGMA Sparse supports the solution of sparse linear systems, but its solver interfaces have no mixed-precision variants with the ds or zc prefixes.

The iterative methods in MAGMA Sparse, such as BiCGSTAB, are heavily redesigned relative to the textbook algorithms in order to fully exploit the GPU; the speedup does not come simply from using mixed precision.

Below, the left side shows the original BiCGSTAB algorithm, and the right side shows a naive implementation that simply maps each operation onto cuBLAS calls. Implemented this way, the GPU's performance cannot be fully exploited; some special treatment is needed.

BiCGSTAB in MAGMA applies the following redesigns; the resulting algorithm is called merged-BiCGSTAB:

  1. Merge multiple arithmetic operations into a single kernel to reduce GPU memory traffic and CPU-GPU communication.
  2. Design an efficient dot-product kernel that computes several dot products simultaneously (see the sketch below).
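As a hedged illustration of point 2 (our own sketch, not MAGMA's kernel): BiCGSTAB needs several dot products against the same residual vector per iteration, and fusing them into one kernel means r is streamed from global memory once instead of twice, saving a kernel launch as well.

// Launch with 256 threads per block; out[2] must be zeroed beforehand
// (e.g. with cudaMemset). atomicAdd on double requires sm_60 or newer.
__global__ void merged_dots(const double *r, const double *rhat, int n,
                            double *out)   // out[0]=<rhat,r>, out[1]=<r,r>
{
    __shared__ double s0[256];
    __shared__ double s1[256];
    double p0 = 0.0, p1 = 0.0;
    // Grid-stride loop: each element of r is read exactly once
    // and contributes to both partial sums.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x) {
        const double ri = r[i];
        p0 += rhat[i] * ri;
        p1 += ri * ri;
    }
    s0[threadIdx.x] = p0;
    s1[threadIdx.x] = p1;
    __syncthreads();
    // Standard shared-memory tree reduction, done for both sums at once.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s) {
            s0[threadIdx.x] += s0[threadIdx.x + s];
            s1[threadIdx.x] += s1[threadIdx.x + s];
        }
        __syncthreads();
    }
    if (threadIdx.x == 0) {
        atomicAdd(&out[0], s0[0]);
        atomicAdd(&out[1], s1[0]);
    }
}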

For the performance comparison, double precision was specified for the computation, giving the results below.

MAGMA Sparse does include a built-in iterative refinement algorithm, in the functions magma_diterref, magma_siterref, magma_citerref, and magma_ziterref.

The source of magma_diterref is shown below; the implementation turns out to be quite simple:

/*
    -- MAGMA (version 2.8.0) --
       Univ. of Tennessee, Knoxville
       Univ. of California, Berkeley
       Univ. of Colorado, Denver
       @date March 2024

       @author Hartwig Anzt

       @generated from sparse/src/ziterref.cpp, normal z -> d, Thu Mar 28 12:29:00 2024
*/

#include "magmasparse_internal.h"

#define RTOLERANCE     lapackf77_dlamch( "E" )
#define ATOLERANCE     lapackf77_dlamch( "E" )


/**
    Purpose
    -------

    Solves a system of linear equations
       A * X = B
    where A is a real symmetric N-by-N positive definite matrix A.
    This is a GPU implementation of the Iterative Refinement method.
    The inner solver is passed via the preconditioner argument.

    Arguments
    ---------

    @param[in]
    A           magma_d_matrix
                input matrix A

    @param[in]
    b           magma_d_matrix
                RHS b

    @param[in,out]
    x           magma_d_matrix*
                solution approximation

    @param[in,out]
    solver_par  magma_d_solver_par*
                solver parameters

    @param[in,out]
    precond_par magma_d_preconditioner*
                inner solver
    @param[in]
    queue       magma_queue_t
                Queue to execute in.

    @ingroup magmasparse_dgesv
    ********************************************************************/

extern "C" magma_int_t
magma_diterref(
    magma_d_matrix A, magma_d_matrix b, magma_d_matrix *x,
    magma_d_solver_par *solver_par, magma_d_preconditioner *precond_par,
    magma_queue_t queue )
{
    magma_int_t info = MAGMA_NOTCONVERGED;
    
    // some useful variables
    double c_zero = MAGMA_D_ZERO;
    double c_one  = MAGMA_D_ONE;
    double c_neg_one = MAGMA_D_NEG_ONE;

    // prepare solver feedback
    solver_par->solver = Magma_ITERREF;
    solver_par->numiter = 0;
    solver_par->spmv_count = 0;
    
    magma_int_t dofs = A.num_rows*b.num_cols;

    // solver variables
    double nom, nom0;
    
    // workspace
    magma_d_matrix r={Magma_CSR}, z={Magma_CSR};
    CHECK( magma_dvinit( &r, Magma_DEV, A.num_rows, b.num_cols, c_zero, queue ));
    CHECK( magma_dvinit( &z, Magma_DEV, A.num_rows, b.num_cols, c_zero, queue ));

    double residual;
    CHECK( magma_dresidual( A, b, *x, &residual, queue ));
    solver_par->init_res = residual;
   

    // solver setup
    magma_dscal( dofs, c_zero, x->dval, 1, queue );                    // x = 0
    //CHECK(  magma_dresidualvec( A, b, *x, &r, nom, queue));
    magma_dcopy( dofs, b.dval, 1, r.dval, 1, queue );                    // r = b
    nom0 = magma_dnrm2( dofs, r.dval, 1, queue );                       // nom0 = || r ||
    nom = nom0 * nom0;
    solver_par->init_res = nom0;

    if( nom0 < solver_par->atol ||
        nom0/solver_par->init_res < solver_par->rtol ){
        solver_par->final_res = solver_par->init_res;
        solver_par->iter_res = solver_par->init_res;
        info = MAGMA_SUCCESS;
        goto cleanup;
    }
    
    //Chronometry
    real_Double_t tempo1, tempo2;
    tempo1 = magma_sync_wtime( queue );
    if ( solver_par->verbose > 0 ) {
        solver_par->res_vec[0] = nom0;
        solver_par->timing[0] = 0.0;
    }
    
    // start iteration
    for( solver_par->numiter= 1; solver_par->numiter<solver_par->maxiter;
                                                    solver_par->numiter++ ) {
        magma_dscal( dofs, MAGMA_D_MAKE(1./nom, 0.), r.dval, 1, queue );  // scale it
        CHECK( magma_d_precond( A, r, &z, precond_par, queue )); // inner solver:  A * z = r
        magma_dscal( dofs, MAGMA_D_MAKE(nom, 0.), z.dval, 1, queue );  // scale it
        magma_daxpy( dofs,  c_one, z.dval, 1, x->dval, 1, queue );        // x = x + z
        CHECK( magma_d_spmv( c_neg_one, A, *x, c_zero, r, queue ));      // r = - A x
        solver_par->spmv_count++;
        magma_daxpy( dofs,  c_one, b.dval, 1, r.dval, 1, queue );         // r = r + b
        nom = magma_dnrm2( dofs, r.dval, 1, queue );                    // nom = || r ||

        if ( solver_par->verbose > 0 ) {
            tempo2 = magma_sync_wtime( queue );
            if ( (solver_par->numiter)%solver_par->verbose==0 ) {
                solver_par->res_vec[(solver_par->numiter)/solver_par->verbose]
                        = (real_Double_t) nom;
                solver_par->timing[(solver_par->numiter)/solver_par->verbose]
                        = (real_Double_t) tempo2-tempo1;
            }
        }

        if( nom < solver_par->atol ||
            nom/solver_par->init_res < solver_par->rtol ){
            break;
        }
    }
    tempo2 = magma_sync_wtime( queue );
    solver_par->runtime = (real_Double_t) tempo2-tempo1;
    CHECK(  magma_dresidualvec( A, b, *x, &r, &residual, queue));
    solver_par->final_res = residual;
    solver_par->iter_res = nom;

    if ( solver_par->numiter < solver_par->maxiter ) {
        info = MAGMA_SUCCESS;
    } else if ( solver_par->init_res > solver_par->final_res ) {
        if ( solver_par->verbose > 0 ) {
            if ( (solver_par->numiter)%solver_par->verbose==0 ) {
                solver_par->res_vec[(solver_par->numiter)/solver_par->verbose]
                        = (real_Double_t) nom;
                solver_par->timing[(solver_par->numiter)/solver_par->verbose]
                        = (real_Double_t) tempo2-tempo1;
            }
        }
        info = MAGMA_SLOW_CONVERGENCE;
        if( solver_par->iter_res < solver_par->atol ||
            solver_par->iter_res/solver_par->init_res < solver_par->rtol ){
            info = MAGMA_SUCCESS;
        }
    }
    else {
        if ( solver_par->verbose > 0 ) {
            if ( (solver_par->numiter)%solver_par->verbose==0 ) {
                solver_par->res_vec[(solver_par->numiter)/solver_par->verbose]
                        = (real_Double_t) nom;
                solver_par->timing[(solver_par->numiter)/solver_par->verbose]
                        = (real_Double_t) tempo2-tempo1;
            }
        }
        info = MAGMA_DIVERGENCE;
    }
    
cleanup:
    magma_dmfree(&r, queue );
    magma_dmfree(&z, queue );

    solver_par->info = info;
    return info;
}   /* magma_diterref */

MAGMA Sparse has a number of mixed-precision operators; see the file magma-2.8.0\sparse\include\magmasparse_ds.h for details.

There is also a mixed-precision SpMV kernel, magma_dsgecsrmv_mixed_prec:

Purpose
-------

This routine computes y = alpha * A * x + beta * y on the GPU.
A is a matrix in mixed precision, i.e. the diagonal values are stored in
high precision, the offdiagonal values in low precision.
The input format is a CSR (val, row, col) in FloatComplex storing all
offdiagonal elements and an array containing the diagonal values in
DoubleComplex.

magma_dsgecsrmv_mixed_prec(
    magma_trans_t transA,
    magma_int_t m, magma_int_t n,
    double alpha,
    magmaDouble_ptr ddiagval,
    magmaFloat_ptr doffdiagval,
    magmaIndex_ptr drowptr,
    magmaIndex_ptr dcolind,
    magmaDouble_ptr dx,
    double beta,
    magmaDouble_ptr dy,
    magma_queue_t queue )
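Under the storage scheme described in that comment, the core of the kernel is easy to sketch on the host (our own illustration with our own function name, not MAGMA's implementation): the FP32 off-diagonal values are promoted to FP64 on read, so the accumulation itself runs entirely in double precision.

#include <vector>

// y = alpha * A * x + beta * y with a mixed-precision CSR matrix:
// diagonal entries stored in FP64, off-diagonal entries in FP32,
// all arithmetic carried out in FP64.
void spmv_mixed(int m,
                const std::vector<double> &diag,     // FP64 diagonal values
                const std::vector<float>  &offval,   // FP32 off-diagonal CSR values
                const std::vector<int>    &rowptr,   // CSR row pointers (off-diagonals only)
                const std::vector<int>    &colind,   // CSR column indices
                double alpha, const std::vector<double> &x,
                double beta,  std::vector<double> &y)
{
    for (int i = 0; i < m; ++i) {
        double dot = diag[i] * x[i];                 // high-precision diagonal term
        for (int k = rowptr[i]; k < rowptr[i + 1]; ++k)
            dot += (double)offval[k] * x[colind[k]]; // FP32 storage, FP64 arithmetic
        y[i] = alpha * dot + beta * y[i];
    }
}

Storing the (typically dominant) off-diagonal values in FP32 halves their memory traffic, while keeping the diagonal and the accumulation in FP64 limits the loss of accuracy.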

3. The Ginkgo Library

3.1 Introduction to Ginkgo

Ginkgo is a high-performance linear algebra library for manycore systems, focused on solving sparse linear systems. It is implemented in modern C++ (you need at least a C++14-compliant compiler to build it), with GPU kernels for CUDA (NVIDIA devices), HIP (AMD devices), and SYCL/DPC++ (Intel GPUs), as well as OpenMP (Intel/AMD/ARM multicore).

Ginkgo's acceleration relies mainly on mixed-precision algorithms, in three respects:

  1. Decouple the storage precision from the arithmetic precision. This accelerates memory-bound algorithms and can compensate for, or tolerate, some loss of information in the stored data.
  2. Allow linear operators and vectors in different precision formats to be combined without explicit conversions.
  3. Provide several mixed-precision algorithms, such as SAI preconditioning and CB-GMRES (Compressed Basis GMRES).

"Memory-bound algorithm"(内存绑定算法)是指在执行过程中主要受限于内存带宽或内存访问速度的算法。在计算机系统中,处理器(CPU)和内存(RAM)之间的数据传输速度通常是一个瓶颈。当算法的执行时间主要取决于内存访问的速度,而不是计算本身的速度时,就称该算法为内存绑定算法。

3.2 The CB-GMRES (Compressed Basis GMRES) Algorithm

The performance of Krylov solvers is bounded by memory access speed and communication on almost all hardware architectures. The most direct idea is therefore to compress the Krylov basis vectors, for example storing vectors that would normally be kept in double precision in single precision instead. This lower-precision storage, however, affects convergence and leads to more iterations. The hope is that the cost of the delayed convergence is outweighed by the faster memory access, reducing the overall solve time; the estimate below makes the break-even point explicit.
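A rough back-of-the-envelope model (our own estimate, assuming a perfectly memory-bound solver whose traffic is dominated by the basis):

\[
T \approx n_{\mathrm{iter}} \cdot \frac{B}{\beta}, \qquad
\mathrm{speedup} = \frac{T_{\mathrm{DP}}}{T_{\mathrm{CB}}}
= \frac{n_{\mathrm{DP}}}{n_{\mathrm{CB}}} \cdot \frac{B_{\mathrm{DP}}}{B_{\mathrm{CB}}},
\]

where B is the number of bytes streamed per iteration and beta the memory bandwidth. Storing the basis in single precision gives B_DP/B_CB = 2, so CB-GMRES wins whenever the compression inflates the iteration count by less than a factor of two.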

Basis compression is exactly Ginkgo's first idea above, realized as a "memory accessor", shown in the figure below. In the authors' words: "To decouple the memory access and conversion from the code development effort, we use a memory accessor that converts the data between DP and the memory storage/communication format on-the-fly (see Figure 2)."

This keeps the overhead introduced by the data conversion low.
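The idea is easy to sketch (a minimal illustration of ours, not Ginkgo's actual accessor class, which additionally handles strides, scaling, and device-side kernels): values live in memory as FP32, but every access converts to or from FP64 on the fly, so all arithmetic runs in double precision while memory traffic is halved.

#include <cstddef>

// Proxy returned by operator[]: reads convert FP32 -> FP64,
// writes convert FP64 -> FP32.
struct ReducedRef {
    float *p;
    operator double() const { return (double)*p; }
    ReducedRef &operator=(double v) { *p = (float)v; return *this; }
};

// Wraps a raw FP32 buffer and hands out converting proxies.
struct ReducedRange {
    float *data;
    ReducedRef operator[](std::size_t i) const { return {data + i}; }
};

// A dot product whose arithmetic is pure FP64 even though both
// operands are stored in memory as FP32.
double dot(ReducedRange a, ReducedRange b, std::size_t n)
{
    double s = 0.0;
    for (std::size_t i = 0; i < n; ++i)
        s += (double)a[i] * (double)b[i];
    return s;
}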

The figure below shows the speedup for each individual test case.

The overall statistics are as follows:

If the restart length is increased (i.e., the dimension of the Krylov subspace is enlarged), the speedup becomes even more pronounced, as shown below:

References:

https://developer.nvidia.com/blog/nvidia-ampere-architecture-in-depth/

"AI 工程师都应该知道的GPU工作原理,TensorCore" (GPU fundamentals every AI engineer should know: Tensor Core), Bilibili.

MAGMA's core algorithm publications: https://icl.utk.edu/~tomov/MAGMA_Publications.bib

The following are especially worth reading:

BiCGSTAB in MAGMA Sparse: Hartwig Anzt et al., 2014, Optimizing Krylov Subspace Solvers on Graphics Processing Units.

Mixed-precision iterative refinement solvers using Tensor Cores: Azzam Haidar, Stanimire Tomov, Jack Dongarra, and Nicholas J. Higham, 2018, Harnessing GPU Tensor Cores for Fast FP16 Arithmetic to Speed Up Mixed-Precision Iterative Refinement Solvers.

Ginkgo's publications: see the "Citing Ginkgo" page in the Ginkgo documentation.

The core references are:

Overall design: Ginkgo: A Modern Linear Operator Algebra Framework for High Performance Computing.

Algorithm overview: Advances in Mixed Precision Algorithms: 2021 Edition.

The CB-GMRES algorithm: José I. Aliaga, Hartwig Anzt, et al., Compressed Basis GMRES on High Performance GPUs.
