Tensor Ops Made Easier in cuDNN

August 20, 2018

Neural network models have quickly taken advantage of NVIDIA Tensor Cores for deep learning since their introduction in the Tesla V100 GPU last year. For example, new performance records for ResNet50 training were announced recently with Tensor Core-based solutions. (See the NVIDIA developer post on new performance milestones for additional details).

NVIDIA’s cuDNN library enables CUDA programmers to optimize both recurrent neural networks and convolutional neural networks for GPU acceleration. We recently outlined easy ways for cuDNN users to take advantage of Tensor Cores for convolutions, complete with instructions and sample code. That article presented a few simple rules for cuDNN applications: FP16 data rules, tensor dimension rules, use of ALGO_1, etc.

Recent cuDNN versions now lift most of these constraints. The cuDNN 7.2 version lifted the FP16 data constraint, while cuDNN 7.3 removes the tensor dimension constraints (for packed NCHW tensor data). Let’s get right into the improvements.

New: Use FP32 Data for Tensor Ops

The post on using Tensor Cores in CUDA discussed the use of FP16 input for tensor operations, as shown in figure 1. While tensor ops still consume FP16 data, the cuDNN API for convolutions now allows the user to choose to have FP32 input data converted to FP16. The output data of the convolution are also converted back to FP32 if desired.

The CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION enum value, new in cuDNN 7.2, enables the cuDNN applications programmer to choose to convert FP32 data for tensor op use. This enum value is passed to the cudnnSetConvolutionMathType() call, just as is the CUDNN_TENSOR_OP_MATH enum value. This code snippet shows how you might do this:

// Set the math type to allow cuDNN to use Tensor Cores:
checkCudnnErr( cudnnSetConvolutionMathType(cudnnConvDesc, CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION) );

You can see the context in which this code fragment is used in a later section.

FP32 Data also for RNNs

Similar FP32 data conversions are now also enabled for RNNs. Simply pass the CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION enum value to the cudnnSetRNNMatrixMathType() call to allow FP32 data to be converted for use in your RNNs. Use this as follows:

// Set the math type to allow cuDNN to use Tensor Cores:
checkCudnnErr( cudnnSetRNNMatrixMathType(cudnnRnnDesc, CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION) );
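
For context, here is a minimal sketch of how that call fits into RNN setup; descriptor creation is shown, while the rest of the RNN configuration (cell type, hidden size, number of layers, and so on) is omitted:

// Create the RNN descriptor (the remaining RNN configuration calls are omitted here):
cudnnRNNDescriptor_t cudnnRnnDesc;
checkCudnnErr( cudnnCreateRNNDescriptor(&cudnnRnnDesc) );

// ... configure the RNN with FP32 (CUDNN_DATA_FLOAT) parameters and data ...

// Allow cuDNN to convert FP32 data to FP16 internally for Tensor Core use:
checkCudnnErr( cudnnSetRNNMatrixMathType(cudnnRnnDesc, CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION) );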

New: NCHW Tensor Dimension Constraints Eliminated

Earlier versions of cuDNN required the channel dimension of all tensors to be a multiple of 8. That constraint no longer applies to packed NCHW data; cuDNN now automatically pads the tensors as needed.

This padding is automatic for packed NCHW data in both the CUDNN_TENSOR_OP_MATH and the CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION cases. The padding occurs with negligible loss of performance.

// Set NCHW tensor dimensions, not necessarily as multiples of eight (only the input tensor is shown here):
int dimA[] = {1, 7, 32, 32};
int strideA[] = {7168, 1024, 32, 1};
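
For packed NCHW data, each stride is simply the product of the dimensions to its right; the small helper below (an illustration, not part of the original sample) makes the relationship explicit:

// Compute fully packed strides for an N-dimensional tensor:
// stride[i] = product of all dimensions after i.
// For dimA = {1, 7, 32, 32} this produces strideA = {7168, 1024, 32, 1}.
static void packedStrides(const int* dimA, int* strideA, int nbDims) {
    int stride = 1;
    for (int i = nbDims - 1; i >= 0; --i) {
        strideA[i] = stride;
        stride *= dimA[i];
    }
}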

The sample code in the section below demonstrates how you might use this.

Sample Code

The logic to use tensor ops for FP32 data and any channel dimensions is similar to the logic used when writing for earlier versions of cuDNN. Only the dimensions and data types have changed (along with the use of CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION):

// Create a cuDNN handle:
checkCudnnErr(cudnnCreate(&handle_));

// Create your tensor descriptors:
checkCudnnErr( cudnnCreateTensorDescriptor( &cudnnIdesc ));
checkCudnnErr( cudnnCreateFilterDescriptor( &cudnnFdesc ));
checkCudnnErr( cudnnCreateTensorDescriptor( &cudnnOdesc ));
checkCudnnErr( cudnnCreateConvolutionDescriptor( &cudnnConvDesc ));

// Set NCHW tensor dimensions, not necessarily as multiples of eight (only the input tensor is shown here):
int dimA[] = {1, 7, 32, 32};
int strideA[] = {7168, 1024, 32, 1};

checkCudnnErr( cudnnSetTensorNdDescriptor(cudnnIdesc, CUDNN_DATA_FLOAT,
convDim+2, dimA, strideA) );

// Allocate and initialize tensors (again, only the input tensor is shown):
checkCudaErr( cudaMalloc((void**)&(devPtrI), (insize) * sizeof(devPtrI[0]) ));
hostI = (T_ELEM*)calloc (insize, sizeof(hostI[0]) );

initImage(hostI, insize);

checkCudaErr( cudaMemcpy(devPtrI, hostI, sizeof(hostI[0]) * insize, cudaMemcpyHostToDevice));

// Set the compute data type (below as CUDNN_DATA_FLOAT):
checkCudnnErr( cudnnSetConvolutionNdDescriptor(cudnnConvDesc, convDim, padA, convstrideA, dilationA, CUDNN_CONVOLUTION, CUDNN_DATA_FLOAT) );

// Set the math type to allow cuDNN to use Tensor Cores:
checkCudnnErr( cudnnSetConvolutionMathType(cudnnConvDesc, CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION) );

// Choose a supported algorithm:
cudnnConvolutionFwdAlgo_t algo = CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_PRECOMP_GEMM;

// Allocate your workspace:
checkCudnnErr( cudnnGetConvolutionForwardWorkspaceSize(handle_, cudnnIdesc,
cudnnFdesc, cudnnConvDesc,
cudnnOdesc, algo, &workSpaceSize) );

if (workSpaceSize > 0) {
   cudaMalloc(&workSpace, workSpaceSize);
}

// Invoke the convolution:
checkCudnnErr( cudnnConvolutionForward(handle_, (void*)(&alpha), cudnnIdesc, devPtrI,
cudnnFdesc, devPtrF, cudnnConvDesc, algo,
workSpace, workSpaceSize, (void*)(&beta),
cudnnOdesc, devPtrO) );
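
The checkCudnnErr and checkCudaErr calls above are error-checking helpers that the post does not define; one minimal way to write them (an assumption shown here for completeness, not code from the original sample) is:

// Minimal error-checking helpers assumed by the sample above
// (requires <cstdio>, <cstdlib>, cudnn.h, and cuda_runtime.h):
#define checkCudnnErr(call)                                                             \
    do {                                                                                \
        cudnnStatus_t s_ = (call);                                                      \
        if (s_ != CUDNN_STATUS_SUCCESS) {                                               \
            fprintf(stderr, "cuDNN error %d at %s:%d\n", (int)s_, __FILE__, __LINE__);  \
            exit(EXIT_FAILURE);                                                         \
        }                                                                               \
    } while (0)

#define checkCudaErr(call)                                                              \
    do {                                                                                \
        cudaError_t e_ = (call);                                                        \
        if (e_ != cudaSuccess) {                                                        \
            fprintf(stderr, "CUDA error %s at %s:%d\n",                                 \
                    cudaGetErrorString(e_), __FILE__, __LINE__);                        \
            exit(EXIT_FAILURE);                                                         \
        }                                                                               \
    } while (0)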

FP32 Performance

Figure 2 shows the comparative performance of convolutions when using Tensor Cores for FP32 tensor data. The chart compares V100 tensor ops versus V100 FMA ops, so the gains are not quite as dramatic as earlier charts comparing V100 performance versus P100 FMA. Tensor ops used with FP32 input still represent significant gains over using FMA ops, nonetheless.

Remaining Constraints

While the major constraints for using tensor ops in cuDNN have been lifted, some minor constraints still remain. One limitation is the use of ALGO_1 (IMPLICIT_PRECOMP_GEMM for forward). No other convolution ALGOs in cuDNN make use of tensor ops yet.
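
If you want to confirm which algorithms can use tensor ops for your particular descriptors rather than hard-coding ALGO_1, you can query cuDNN; the sketch below reuses the descriptor names from the sample above and relies on the cudnnGetConvolutionForwardAlgorithm_v7 query available since cuDNN 7:

// Query the available forward algorithms and check which ones can use Tensor Core math:
int returnedAlgoCount = 0;
cudnnConvolutionFwdAlgoPerf_t perfResults[CUDNN_CONVOLUTION_FWD_ALGO_COUNT];
checkCudnnErr( cudnnGetConvolutionForwardAlgorithm_v7(handle_, cudnnIdesc, cudnnFdesc,
               cudnnConvDesc, cudnnOdesc, CUDNN_CONVOLUTION_FWD_ALGO_COUNT,
               &returnedAlgoCount, perfResults) );

for (int i = 0; i < returnedAlgoCount; i++) {
    if (perfResults[i].status == CUDNN_STATUS_SUCCESS &&
        perfResults[i].mathType == CUDNN_TENSOR_OP_MATH) {
        // perfResults[i].algo can run on Tensor Cores for these descriptors.
    }
}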

Another minor restriction is the size of the convolution filter, specifically the spatial dimensions (r and s). However, the FFT algorithms for convolution are very well suited for use cases with large filter dimensions. For maximum performance, simply switch your convolutions to the FFT algorithms well before the tensor op filter limits are exceeded.
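
As a sketch of that advice, a hypothetical selection policy might look like the following; filterH, filterW, and the threshold are illustrative only and not documented limits:

// Hypothetical policy: prefer an FFT-based algorithm once the spatial filter gets large.
// The threshold is illustrative; consult the cuDNN documentation for the actual tensor op limits.
cudnnConvolutionFwdAlgo_t algo =
    (filterH > 30 || filterW > 30) ? CUDNN_CONVOLUTION_FWD_ALGO_FFT_TILING
                                   : CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_PRECOMP_GEMM;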

Get Started with Tensor Cores in cuDNN Today

You can download the latest version of cuDNN here and get started using Tensor Cores today. See how Tensor Cores can supercharge your cuDNN applications. Read the latest Release Notes for a detailed list of new features and enhancements.

About the Authors

Scott Yokim

Scott Yokim is a senior software engineer on the CUDA libraries team at NVIDIA. He joined NVIDIA in 2008, prior to which he was a computer graphics programmer at various companies. Scott holds an MS in mathematics from Virginia Tech.

More Tensor Core articles:

https://devblogs.nvidia.com/tag/tensor-core/

A Comprehensive Analysis of NVIDIA Tensor Cores for Deep Learning (Part 1)
