An NVIDIA Tensor Core is capable of performing one matrix-multiply-and-accumulate operation on a 4×4 matrix in one GPU clock cycle. In mixed-precision mode, Tensor Cores take input data in half floating-point precision, perform matrix multiplication in half precision and the accumulation in single precision.
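To make the mixed-precision behaviour concrete, the following is a minimal software sketch of a 4×4 multiply-and-accumulate, D = A·B + C, with half-precision inputs and a single-precision accumulator. It is an emulation for illustration only, not how the hardware is implemented. The names `to_half`, `to_single`, and `mma_4x4` are hypothetical helpers introduced here. The sketch assumes each product of two half-precision values is kept exactly before being added, and that the sum is rounded to single precision after every addition in a fixed order; the hardware's internal rounding and summation order may differ.

```python
import struct


def to_half(x: float) -> float:
    """Round x to the nearest IEEE 754 half-precision (binary16) value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_single(x: float) -> float:
    """Round x to the nearest IEEE 754 single-precision (binary32) value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def mma_4x4(a, b, c):
    """Emulate D = A*B + C for 4x4 matrices given as lists of lists.

    A and B are rounded to half precision on input; C and the running
    sums are held in single precision (an assumed, simplified model).
    """
    n = 4
    a_h = [[to_half(v) for v in row] for row in a]
    b_h = [[to_half(v) for v in row] for row in b]
    d = [[to_single(v) for v in row] for row in c]
    for i in range(n):
        for j in range(n):
            acc = d[i][j]
            for k in range(n):
                # Half values carry 11 significant bits, so their product
                # needs at most 22 bits and is exact in a Python float.
                acc = to_single(acc + a_h[i][k] * b_h[k][j])
            d[i][j] = acc
    return d
```

Calling `mma_4x4` with a first row of A equal to `[2048.0, 1.0, 0.0, 0.0]`, an all-ones B, and a zero C gives 2049.0 in the top-left entry of D. That value cannot be stored in half precision, which is why the accumulator is kept in single precision.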