TensorFlow Deep Learning Framework: Performance Optimization

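int32 intra_op_parallelism_threads:
    // Controls parallelism inside a single op. When an individual op (for example matrix multiplication or reduce_sum) can be parallelized internally, it runs on a thread pool with this many threads.
    // 0: let the system choose a suitable value.
int32 inter_op_parallelism_threads:
    // Controls parallelism across ops. When several ops are independent of each other (no direct path connects them in the graph), TensorFlow tries to run them concurrently; this parameter sets the number of threads in that pool.
    // 0: let the system choose a suitable value.
    // All sessions in a process share the value set by the first session, unless session_inter_op_thread_pool is configured separately.
repeated ThreadPoolOptionProto session_inter_op_thread_pool:
    // Limits the number of threads used when some session invocations need to run in a background pool.
    // For example, a session can be configured with one large pool (for regular compute) and one small pool (for periodic, low-priority work); using the small pool is currently the mechanism for limiting the inter-op parallelism of low-priority work. Note that it does not limit the parallelism of work spawned by a single op kernel implementation. This setting is usually unnecessary for training but can help some serving use cases. It is also generally recommended to set the global_name field of this proto to avoid creating multiple large pools; it is typically better to run the non-low-priority work, even across sessions, in a single large pool.
    // Under a fixed CPU core budget, choosing the thread counts sensibly can noticeably speed up a TensorFlow program.
int32 placement_period:
    // Period for assigning nodes to devices. Until the system warms up, the node-to-device assignment is recomputed every placement_period steps; after that, recomputation typically slows down automatically.
repeated string device_filters:
    // Device filters. When set, the session ignores all devices that do not match a filter. Each filter can be partially specified, e.g. "/job:ps" or "/job:worker/replica:3".
bool allow_soft_placement:
    // If allow_soft_placement=True, an op is placed on the CPU in three cases: 1. the op has no GPU implementation; 2. no GPU device can be recognized; 3. the op needs to co-locate with reftype input(s) which are from the CPU.
bool log_device_placement:
    // Whether to log device placement.
GraphOptions graph_options:
    // Options that apply to the TensorFlow graph.
int64 operation_timeout_in_ms:
    // Global timeout for all blocking operations in the session. If non-zero and not overridden on a per-operation basis, this is the maximum time any blocking operation will wait.
RPCOptions rpc_options:
    // Communication options for distributed execution.
ClusterDef cluster_def:
    // The list of workers.
bool isolate_session_state:
    // If True, the session's resources (variables, etc.) are not shared with other sessions.
message Experimental:
    string collective_group_leader = 1;
    //   Task name for group resolution.
    string executor_type:
    //   The type of executor.
    int32 recv_buf_max_chunk:
    //   Specifies the transfer format for large RecvBuf fields.
    bool use_numa_affinity:
    //   Whether to take the CPU's NUMA layout into account (NUMA affinity).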

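GPUOptions

double per_process_gpu_memory_fraction
// Fraction of GPU memory each process may allocate, in the range 0 to 1. 1 means all GPU memory; 0.5 means the process allocates at most about 50% of the available GPU memory.
// The GPU memory is preallocated unless allow_growth is enabled.
bool allow_growth
// If true, the allocator does not preallocate the entire specified GPU memory region; it starts small and grows as needed.
string allocator_type // Type of GPU allocation strategy.
int64 deferred_deletion_bytes // Delay deletion of up to this many bytes to reduce the number of interactions with GPU driver code. If 0, the system picks a reasonable default (a few MB).
string visible_device_list // List of GPU IDs. CUDA_VISIBLE_DEVICES/HIP_VISIBLE_DEVICES can be used to control which physical GPU devices are visible before TF starts.
int32 polling_active_delay_usecs
int32 polling_inactive_delay_msecs
bool force_gpu_compatible // On GPU-enabled TensorFlow, enabling this option forces all CPU tensors to be allocated in pinned (page-locked) memory. Normally TensorFlow infers which tensors should be allocated as pinned memory, but where that inference is incomplete this option can significantly speed up cross-device memory copies for models that fit in memory.
// Note that pinning too much memory can hurt overall host system performance.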

OptimizerOptions

bool do_common_subexpression_elimination // If true, optimize the graph using common subexpression elimination.
bool do_constant_folding //perform constant folding optimization on the graph.
int64 max_folded_constant_in_bytes
bool do_function_inlining // perform function inlining on the graph.
GraphOptions

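bool enable_recv_scheduling // Recv-node scheduling. If True, control flow is used to schedule the activation of Recv nodes. (Currently ignored.)
OptimizerOptions optimizer_options // Graph optimization options.
int64 build_cost_model // Number of steps to run before returning a cost model that details the memory usage and performance of each node in the graph. 0 means no cost model.
int64 build_cost_model_after // Number of steps to skip before collecting statistics for the cost model.
bool infer_shapes // Whether to infer shapes. If True, each node is annotated with its output shape whenever that shape can be statically inferred.
bool place_pruned_graph // Whether to place only the pruned graph. If True, only the subgraphs that are actually run are placed, rather than the whole graph. This is useful for interactive graph building, where a graph that cannot be placed might be produced while debugging; in particular, it lets the user continue a session after adding a node whose placement constraints cannot be satisfied.
bool enable_bfloat16_sendrecv // If True, float values transferred between processes are converted to bfloat16.
int32 timeline_step // Interval, in steps, at which a timeline is recorded. If greater than 0, a timeline is recorded every this many steps. Experimental: this currently has no effect on the master session.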

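RPCOptions

bool use_rpc_for_inprocess_master // Whether to use RPC even when the master is in the same process.
// If true, always use RPC to contact the session target.
// If false (the default), TensorFlow may use an optimized transport for the communication. This option is mainly useful for testing the RPC stack.
string compression_algorithm // Compression algorithm to use: "deflate" or "gzip".
int32 compression_level // Compression level.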

RunOptions

TraceLevel trace_level
  enum TraceLevel {
    NO_TRACE = 0;
    SOFTWARE_TRACE = 1;
    HARDWARE_TRACE = 2;
    FULL_TRACE = 3;
  }

int64 timeout_in_ms // Time to wait for the op computation to finish, in milliseconds.
int32 inter_op_thread_pool // The thread pool to use (if session_inter_op_thread_pool is configured).
bool output_partition_graphs // Whether to output, via RunMetadata, the partition graphs executed by the executors.
bool report_tensor_allocations_upon_oom
int64 collective_graph_key // If non-zero, declares that this graph is going to use collective ops and must synchronize step_ids with any other graph with this
// same group_key value (in a distributed computation where tasks run disjoint graphs).
bool use_run_handler_pool // If true, then operations (using the inter-op pool) across all session::run() calls will be centrally scheduled, optimizing for (median
// and tail) latency. Consider using this option for CPU-bound workloads like inference.

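Usage example

import tensorflow as tf  # TensorFlow 1.x API

config = tf.ConfigProto()
config.intra_op_parallelism_threads = 32
config.gpu_options.force_gpu_compatible = True
config.graph_options.timeline_step = 0
config.rpc_options.use_rpc_for_inprocess_master = False
with tf.Session(config=config) as sess:
    # Placeholder fetch so the snippet runs; pass your own ops to sess.run().
    print(sess.run(tf.constant(1.0)))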

GPU parameter settings

// Whether GPU devices use their own thread pools, and how those threads are used
TF_GPU_THREAD_MODE: one of 'global', 'gpu_shared', 'gpu_private'
// Methods to assign GPU host work to threads.
// global: all GPUs and CPUs share the same global threads;
// gpu_private: a private threadpool for each GPU;
// gpu_shared: all GPUs share the same threadpool.
TF_GPU_THREAD_COUNT
// The number of threads to use for GPU. Only valid when gpu_thread_mode is not global.
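
A minimal sketch of applying these two settings from Python. It assumes, as the description above suggests, that they are read from the environment, so they have to be set before TensorFlow is imported. The helper name gpu_thread_env is hypothetical and only used for illustration.

```python
import os

VALID_MODES = ("global", "gpu_shared", "gpu_private")

def gpu_thread_env(mode, count=None):
    """Return the environment entries for the requested GPU thread mode."""
    if mode not in VALID_MODES:
        raise ValueError("unknown TF_GPU_THREAD_MODE: %r" % (mode,))
    env = {"TF_GPU_THREAD_MODE": mode}
    # The thread count only applies when the mode is not 'global'.
    if count is not None and mode != "global":
        env["TF_GPU_THREAD_COUNT"] = str(count)
    return env

# Apply before `import tensorflow` so the runtime can see the values.
os.environ.update(gpu_thread_env("gpu_private", count=2))
```

Whether gpu_private or gpu_shared helps depends on the workload, so it is worth measuring both.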

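2. Locating and optimizing slow kernel computation

The story starts with TensorFlow

Nothing hurts like a side-by-side comparison.

Taking the TensorFlow framework as an example, we found that when training the ssd_mobilenet_v2 network, the DCU platform is far slower than the V100: between about 4.5x and 21x slower per step in the table below.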

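Accelerator	CPU	BatchSize	NUM-GPU	Performance (sec/step)
NV V100	Intel(R) Xeon(R) Gold 6142 CPU @ 2.60GHz, 32 cores, 2 sockets	1	1	0.093
NV V100	Intel(R) Xeon(R) Gold 6142 CPU @ 2.60GHz, 32 cores, 2 sockets	1	4	0.099
NV V100	Intel(R) Xeon(R) Gold 6142 CPU @ 2.60GHz, 32 cores, 2 sockets	16	1	0.227
NV V100	Intel(R) Xeon(R) Gold 6142 CPU @ 2.60GHz, 32 cores, 2 sockets	16	4	0.277
DCU (Kunshan)	Hygon C86 7185 32-core Processor, 32 cores, 1 socket	1	1	0.632
DCU (Kunshan)	Hygon C86 7185 32-core Processor, 32 cores, 1 socket	1	4	0.698
DCU (Kunshan)	Hygon C86 7185 32-core Processor, 32 cores, 1 socket	16	1	4.2
DCU (Kunshan)	Hygon C86 7185 32-core Processor, 32 cores, 1 socket	16	4	5.901
DCU (Kekong)	Intel(R) Xeon(R) Gold 5118 CPU @ 2.30GHz, 24 cores, 2 sockets	1	1	0.44
DCU (Kekong)	Intel(R) Xeon(R) Gold 5118 CPU @ 2.30GHz, 24 cores, 2 sockets	1	4	0.445
DCU (Kekong)	Intel(R) Xeon(R) Gold 5118 CPU @ 2.30GHz, 24 cores, 2 sockets	16	1	3.158
DCU (Kekong)	Intel(R) Xeon(R) Gold 5118 CPU @ 2.30GHz, 24 cores, 2 sockets	16	4	3.874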

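On the principle that whatever exists has a reason, we set out on a long hunt for the cause.

First, we need to know which step of the computation is dragging down overall performance; after all, the DCU does reasonably well on many other networks (for example the resnet, vgg, and inception families).

Before starting, we simplify the network (and take another hit): comparing single-card performance at BS=16 makes it fairly certain that this network itself is what computes slowly.

model	bs	V100	dcu
mobilenet_v2	16	1194	244

By the way, for a code walkthrough of the whole mobilenet_v2 network, see "Mobilenet_V2 source code analysis", which also compares depthwise (dw) convolution with ordinary convolution.

Determining which kernel/op is slow

In tensorflow 1.*, you first declare a graph and then start a session to evaluate the nodes in the graph.

Our goal is to find out which node takes the longest, which tensorflow op it corresponds to, and which kernel implements it.

The ROCm ecosystem that the DCU relies on provides two trace layers, HCC and HIP, for measuring kernel execution time.

HIP-layer environment variables

Variable	Meaning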
HIP_PRINT_ENV=1	Print HIP environment variables.
HIP_LAUNCH_BLOCKING=0	Make HIP kernel launches 'host-synchronous', so they block until any kernel launches. Alias: CUDA_LAUNCH_BLOCKING
HIP_LAUNCH_BLOCKING_KERNELS=	Comma-separated list of kernel names to make host-synchronous, so they block until completed.
HIP_API_BLOCKING= 0	Make HIP APIs 'host-synchronous', so they block until completed. Impacts hipMemcpyAsync, hipMemsetAsync
HIP_HIDDEN_FREE_MEM= 256	Amount of memory to hide from the free memory reported by hipMemGetInfo, specified in MB.Impacts hipMemGetInfo
HIP_DB = 0	Print debug info. Bitmask (HIP_DB=0xff) or flags separated by '+' (HIP_DB=api+sync+mem+copy)
HIP_TRACE_API=0	Trace each HIP API call. Print function name and return code to stderr as program executes.
HIP_TRACE_API_COLOR= green	Color to use for HIP_API. None/Red/Green/Yellow/Blue/Magenta/Cyan/White
HIP_PROFILE_API = 0	Add HIP API markers to ATP file generated with CodeXL. 0x1=short API name, 0x2=full API name including args
HIP_DB_START_API =	Comma-separated list of tid.api_seq_num for when to start debug and profiling.
HIP_DB_STOP_API =	Comma-separated list of tid.api_seq_num for when to stop debug and profiling.
HIP_VISIBLE_DEVICES = 0	Only devices whose index is present in the sequence are visible to HIP applications and they are enumerated in the order of sequence
HIP_WAIT_MODE = 0	Force synchronization mode. 1= force yield, 2=force spin, 0=defaults specified in application
HIP_FORCE_P2P_HOST = 0	Force use of host/staging copy for peer-to-peer copies.1=always use copies, 2=always return false for hipDeviceCanAccessPeer
HIP_FORCE_SYNC_COPY = 0	Force all copies (even hipMemcpyAsync) to use sync copies
HIP_FAIL_SOC = 0	Fault on Sub-Optimal-Copy, rather than use a slower but functional implementation.Bit 0x1=Fail on async copy with unpinned memory. Bit 0x2=Fail peer copy rather than use staging buffer copy
HIP_SYNC_HOST_ALLOC = 1	Sync before and after all host memory allocations. May help stability
HIP_SYNC_NULL_STREAM = 0	Synchronize on host for null stream submissions
HIP_HOST_COHERENT = 1	If set, all host memory will be allocated as fine-grained system memory.This allows thread fence_system to work but prevents host memory from being cached on GPU which may have performance impact.
HCC_OPT_FLUSH = 1	When set, use agent-scope fence operations rather than system-scope fence operations when possible. This flag controls both HIP and HCC behavior
HIP_EVENT_SYS_RELEASE = 0	If set, event are created with hipEventReleaseToSystem by default. If 0, events are created with hipEventReleaseToDevice by default. The defaults can be overridden by specifying hipEventReleaseToSystem or hipEventReleaseToDevice flag when creating the event.

HCC-layer environment variables

Variable	Meaning
HCC_PRINT_ENV=1	will print usage and current values for the HCC and HIP env variables.
HCC_PRINT_ENV = 1	Print values of HCC environment variables
HCC_SERIALIZE_KERNEL= 0	0x1=pre-serialize before each kernel launch, 0x2=post-serialize after each kernel launch, 0x3=both
HCC_SERIALIZE_COPY= 0	0x1=pre-serialize before each data copy, 0x2=post-serialize after each data copy, 0x3=both
HCC_DB = 0	Enable HCC trace debug
HCC_OPT_FLUSH = 1	Perform system-scope acquire/release only at CPU sync boundaries (rather than after each kernel)
HCC_MAX_QUEUES= 20	Set max number of HSA queues this process will use. accelerator_views will share the allotted queues and steal from each other as necessary
HCC_UNPINNED_COPY_MODE = 2	Select algorithm for unpinned copies. 0=ChooseBest(see thresholds), 1=PinInPlace, 2=StagingBuffer,3=Memcpy
HCC_CHECK_COPY = 0	Check dst == src after each copy operation. Only works on large-bar systems.
HCC_H2D_STAGING_THRESHOLD = 64	Min size (in KB) to use staging buffer algorithm for H2D copy if ChooseBest algorithm selected
HCC_H2D_PININPLACE_THRESHOLD = 4096	Min size (in KB) to use pin-in-place algorithm for H2D copy if ChooseBest algorithm selected
HCC_D2H_PININPLACE_THRESHOLD = 1024	Min size (in KB) to use pin-in-place for D2H copy if ChooseBest algorithm selected
HCC_PROFILE = 0	Enable HCC kernel and data profiling. 1=summary, 2=trace
HCC_PROFILE_VERBOSE = 31	Bitmask to control profile verbosity and format. 0x1=default, 0x2=show begin/end, 0x4=show barrier

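Besides the pile of variables above, our excellent colleague Dr. Han also developed a small tool called socprof, which is very handy.

socprof

Timeline analysis from the HCC layer

socprof --hcc-trace python *

In the resulting trace, we can see that the DepthwiseConv2dBackpropFilterGPUKernelNCHW kernel takes the most time.

(Worth mentioning: to shorten the trace and pin down the number of kernel calls, set both the batch size and the number of steps to 1.)

On the NV platform we use nvprof/nvvp to see which kernel takes the longest.

This tells us which kernel is slow.

Translating the kernel name

What on earth is that long string of jumbled characters? Let's demangle it.

Demangling gives not only the function name but also its parameters.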

c++filt

c++filt _ZN10tensorflow42DepthwiseConv2dBackpropFilterGPUKernelNCHWIfLi3ELi3ELi1EEEvNS_13DepthwiseArgsEPKT_S4_PS2_i

void tensorflow::DepthwiseConv2dBackpropFilterGPUKernelNCHW<float, 3, 3, 1>(tensorflow::DepthwiseArgs, float const*, float const*, float*, int)
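
When a trace contains many mangled names, the same demangling can be scripted. The sketch below simply shells out to c++filt; the function name demangle is hypothetical, and if c++filt is not installed the symbol is returned unchanged.

```python
import shutil
import subprocess

def demangle(symbol):
    """Demangle a C++ symbol with c++filt; return it unchanged if c++filt is missing."""
    tool = shutil.which("c++filt")
    if tool is None:
        return symbol
    result = subprocess.run([tool, symbol], stdout=subprocess.PIPE,
                            universal_newlines=True, check=True)
    return result.stdout.strip()

mangled = ("_ZN10tensorflow42DepthwiseConv2dBackpropFilterGPUKernelNCHW"
           "IfLi3ELi3ELi1EEEvNS_13DepthwiseArgsEPKT_S4_PS2_i")
print(demangle(mangled))
```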

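The whole mobilenet_v2 is still too big and awkward to debug. I only want the computation of this one op, run with the same parameter values it uses inside the full network.

So I need to write a test case that contains only this kernel's computation.

To write this test case, I need:

1. The op name exposed to the python client

2. The parameters this op is computed with in the mobilenet_v2 network

Next we need to find the implementation of DepthwiseConv2dBackpropFilterGPUKernelNCHW and its corresponding op name.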

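Kernel source and its mapping to the op

Kernel source

First, look at how the tensorflow code is organized: all kernel implementations live under tensorflow/core/kernels/ in the source tree.

The files related to depthwise are:

Depthwise implementation files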

depthwise_conv_grad_op.cc
depthwise_conv_op.cc
depthwise_conv_op.h
depthwise_conv_op_gpu.h*
depthwise_conv_op_gpu_double.cu.cc
depthwise_conv_op_gpu_float.cu.cc
depthwise_conv_op_gpu_half.cu.cc

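A quick grep finds the implementation of DepthwiseConv2dBackpropFilterGPUKernelNCHW.

Mapping to the op

Next, how does this kernel map to an op?

First, see how a new C++ op is defined in a tensorflow extension: "Adding a New Op".

From there, we can find the registration code for the depthwise C++ op in depthwise_conv_grad_op.cc. The keyword is REGISTER_KERNEL_BUILDER, and the C++ op name turns out to be DepthwiseConv2dNativeBackpropFilter.

Then look at how the op names used on the python side correspond to those on the C++ side: "python and cpp name mapping".

In one sentence: it is a conversion between CamelCase and snake_case.

So the op name on the python client is depthwise_conv2d_native_backprop_filter.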
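
The CamelCase to snake_case rule can be sketched in a couple of lines. This is only a heuristic written for illustration (the function name op_name_to_python is hypothetical); it is not TensorFlow's own name generator and may not match it for every op name.

```python
import re

def op_name_to_python(cpp_op_name):
    """Convert a CamelCase op name to snake_case (heuristic, not TensorFlow's own code)."""
    # Insert "_" wherever a lowercase letter is followed by an uppercase one.
    return re.sub(r"(?<=[a-z])(?=[A-Z])", "_", cpp_op_name).lower()

print(op_name_to_python("DepthwiseConv2dNativeBackpropFilter"))
# prints: depthwise_conv2d_native_backprop_filter
```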

See the official Tensorflow python API documentation: https://www.tensorflow.org/versions/r1.15/api_docs/python/tf/nn/depthwise_conv2d_backprop_filter

At this point I know the name of the slowest kernel when training mobilenet_v2 with tensorflow on the DCU, and the op name through which the C++ and python clients invoke it.

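Determining the kernel's input and output sizes

How do we determine the input and output sizes of a convolution?

Here we can use Tensorflow's VLOG.

# Read the source to decide which verbosity level to use
TF_CPP_MIN_VLOG_LEVEL

# Default level: show all messages
os.environ["TF_CPP_MIN_LOG_LEVEL"]='0'

# Hide INFO; show WARNING and ERROR
os.environ["TF_CPP_MIN_LOG_LEVEL"]='1'

# Hide INFO and WARNING; show ERROR
os.environ["TF_CPP_MIN_LOG_LEVEL"]='2'

# Hide INFO, WARNING, and ERROR
os.environ["TF_CPP_MIN_LOG_LEVEL"]='3'

Example

VLOG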

VLOG(2) << "DepthwiseConv2dNativeBackpropFilter: "
<< " Input: [" << batch << ", " << input_rows << ", " << input_cols
<< ", " << in_depth << "]; Filter: [" << filter_rows << ", "
<< filter_cols << ", " << in_depth << ", " << depth_multiplier
<< "]; Output: [" << batch << ", " << out_rows << ", " << out_cols
<< ", " << out_depth << "], stride = " << stride_
<< ", pad_rows = " << pad_rows << ", pad_cols = " << pad_cols
<< ", Use cuDNN: " << use_cudnn;

With TF_CPP_MIN_VLOG_LEVEL=2 set, this VLOG(2) message is printed.

Sample output

DepthwiseConv2dNativeBackpropFilter:Input:[1,150,150,96];Filter:[3,3,96,1];Output:[1,75,75,96],stride=2,pad_rows=0,pad_cols=0,UsecuDNN:0
DepthwiseConv2dNativeBackpropFilter:Input:[1,150,150,32];Filter:[3,3,32,1];Output:[1,150,150,32],stride=1,pad_rows=1,pad_cols=1,UsecuDNN:0

If the kernel you traced has no such log statement, add one and rebuild.
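
Once the log lines are available, pulling the shapes out by hand gets tedious. The following sketch parses one such line with regular expressions; the function name parse_dw_vlog is hypothetical, and the line format is assumed to be exactly the one shown in the sample output (with or without spaces).

```python
import re

def parse_dw_vlog(line):
    """Extract shapes and conv parameters from one DepthwiseConv2dNativeBackpropFilter log line."""
    def dims(label):
        match = re.search(label + r"\s*:\s*\[([^\]]*)\]", line)
        return [int(v) for v in match.group(1).split(",")]

    def scalar(label):
        return int(re.search(label + r"\s*=\s*(-?\d+)", line).group(1))

    return {
        "input": dims("Input"),
        "filter": dims("Filter"),
        "output": dims("Output"),
        "stride": scalar("stride"),
        "pad_rows": scalar("pad_rows"),
        "pad_cols": scalar("pad_cols"),
    }

sample = ("DepthwiseConv2dNativeBackpropFilter:Input:[1,150,150,96];"
          "Filter:[3,3,96,1];Output:[1,75,75,96],stride=2,pad_rows=0,"
          "pad_cols=0,UsecuDNN:0")
print(parse_dw_vlog(sample))
```

The returned dictionary can be fed directly into a test case like the one in the next section.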

Writing a test case for the kernel

Everything I need is now ready.

Time to write the test case.

tf test case

import time

import numpy as np
import tensorflow as tf  # TensorFlow 1.x API
from tensorflow.python.framework import constant_op

# Shapes are given in NCHW order; the filter is [rows, cols, in_depth, depth_multiplier].
input_size = [1, 32, 150, 150]
filter_size = [3, 3, 32, 1]
output_size = [1, 32, 150, 150]
data_format = 'NHWC'
input0 = np.random.rand(*input_size).astype(np.float32)
output0 = np.random.rand(*output_size).astype(np.float32)

# Build the ops inside the device scope: tf.device only affects ops created within it.
with tf.device('/gpu:0'):
    input1 = constant_op.constant(input0, shape=input_size)
    filter1 = constant_op.constant(filter_size, shape=[len(filter_size)])
    output1 = constant_op.constant(output0, shape=output_size)
    if data_format == 'NHWC':
        # Convert the NCHW data to NHWC before calling the op.
        input1 = tf.transpose(input1, [0, 2, 3, 1])
        output1 = tf.transpose(output1, [0, 2, 3, 1])
        x1 = tf.nn.depthwise_conv2d_native_backprop_filter(input1, filter1, output1, strides=[1, 1, 1, 1], padding='SAME', data_format='NHWC')
    else:
        # x = tf.nn.conv2d_backprop_filter(inputs, filter_sizes=[3,3,64,1], out_backprop=[1,1,1,64], strides=[2,2], padding='VALID', use_cudnn_on_gpu=False, data_format='NCHW')
        x1 = tf.nn.depthwise_conv2d_native_backprop_filter(input1, filter1, output1, strides=[1, 1, 1, 1], padding='SAME', data_format='NCHW')

config = tf.ConfigProto()
config.gpu_options.force_gpu_compatible = True
with tf.Session(config=config) as sess:
    start = time.time()
    sess.run(x1)
    end = time.time()
    print("dcu compute dw bw cost time is ", end - start)

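One thing to note here. The first time I wrote a test case myself, these few simple lines took me the better part of a week. (Slow? Yes, slow.)

The catch is that the computation has to be placed on the GPU.

Then take another look with socprof.

Good, the reproduction works: the trace shows exactly one call of this kernel. The first step of a very long march.

Now use this test case to compare the two platforms.

Yep, about 10x.

GPU	v100	dcu
time(ms)	37.9	377.571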
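
For reference, the ratio behind the "10x" can be computed directly from the two timings in the table:

```python
v100_ms = 37.9
dcu_ms = 377.571
slowdown = dcu_ms / v100_ms
print("DCU is %.2fx slower than V100 on this kernel" % slowdown)
```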

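Analyzing why it is slow

Honestly, I was not sure how to analyze it.

They say there is no problem that reading the source cannot solve.

They say gdb is quite handy and can show the function call chain.

They say pytorch uses MIOpen when computing dw convolution.

MIOpen is the acceleration library that is the counterpart of cuDNN.

Reading the source

kernel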

template <typename T, int kKnownFilterWidth, int kKnownFilterHeight,
          int kKnownDepthMultiplier>
__global__ void __launch_bounds__(640, 2)
    DepthwiseConv2dBackpropInputGPUKernelNCHW(const DepthwiseArgs args,
                                              const T* out_backprop,
                                              const T* filter, T* in_backprop,
                                              int num_in_backprop) {
  const int in_height = args.in_rows;
  const int in_width = args.in_cols;
  const int in_depth = args.in_depth;
  const int filter_height =
      kKnownFilterHeight < 0 ? args.filter_rows : kKnownFilterHeight;
  const int filter_width =
      kKnownFilterWidth < 0 ? args.filter_cols : kKnownFilterWidth;
  const int depth_multiplier =
      kKnownDepthMultiplier < 0 ? args.depth_multiplier : kKnownDepthMultiplier;
  const int stride = args.stride;
  const int pad_height = args.pad_rows;
  const int pad_width = args.pad_cols;
  const int out_height = args.out_rows;
  const int out_width = args.out_cols;
  const int out_depth = args.out_depth;

  // TODO(vrv): Consider assigning threads to output and using
  // atomics for accumulation, similar to the filter case.
  GPU_1D_KERNEL_LOOP(thread_id, num_in_backprop) {
    // Compute the indexes of this thread in the input.
    const int in_col = thread_id % in_width;
    const int in_row = (thread_id / in_width) % in_height;
    const int in_channel = (thread_id / in_width / in_height) % in_depth;
    const int batch = thread_id / in_depth / in_width / in_height;

    T sum = static_cast<T>(0);
    const int out_channel_start = in_channel * depth_multiplier;
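    const int out_channel_end = out_channel_start + depth_multiplier;

    // ... (the excerpt is truncated here in the original post; the lines below
    // come from the body of DepthwiseConv2dBackpropFilterGPUKernelNCHW) ...
          T partial_sum = ldg(input + input_offset) * out_bp;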
          T* addr =
              filter_backprop +
              (dm + depth_multiplier *
                        (in_channel +
                         in_depth * (filter_col + filter_width * filter_row)));
          GpuAtomicAdd(addr, partial_sum);
        }
      }
    } else {
      UNROLL for (int filter_row = 0; filter_row < filter_height;
                  ++filter_row) {
        const int in_row = in_row_start + filter_row;
        // Avoid repeated computation.
        const int input_offset_temp =
            (batch * in_depth * in_height * in_width) +
            (in_channel * in_height * in_width) + (in_row * in_width);
        UNROLL for (int filter_col = 0; filter_col < filter_width;
                    ++filter_col) {
          const int in_col = in_col_start + filter_col;
          const int addr_temp = filter_width * filter_row;

          if (in_row >= 0 && in_row < in_height && in_col >= 0 &&
              in_col < in_width) {
            const int input_offset = input_offset_temp + in_col;
            T partial_sum = ldg(input + input_offset) * out_bp;
            T* addr =
                filter_backprop +
                (dm + depth_multiplier *
                          (in_channel + in_depth * (filter_col + addr_temp)));
            // Potentially many threads can add to the same address so we have
            // to use atomic add here.
            // TODO(jmchen): If atomic add turns out to be slow, we can:
            // 1. allocate multiple buffers for the gradients (one for each
            // example in a batch, for example). This can reduce the
            // contention on the destination; 2. Have each thread compute one
            // gradient for an element in the filters. This should work well
            // when the input depth is big and filter size is not too small.
            GpuAtomicAdd(addr, partial_sum);
          }
        }
      }
    }
  }
}
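
To make the accumulation pattern easier to see, here is a plain-Python sketch of what the filter-gradient part of the excerpt computes, assuming NCHW data and depth_multiplier = 1. It is a reference written for illustration, not the TensorFlow implementation: the function name and the extra adds counter are hypothetical, and each loop iteration over (b, c, out_r, out_c) stands in for one GPU thread.

```python
def dw_backprop_filter_nchw(inp, out_bp, filt_h, filt_w, stride, pad):
    """Accumulate the depthwise filter gradient for NCHW data, depth_multiplier = 1.

    inp and out_bp are nested lists indexed [batch][channel][row][col].
    Returns (grad, adds): grad[row][col][channel] and the number of
    accumulations that landed on each gradient element.
    """
    batch, depth = len(inp), len(inp[0])
    in_h, in_w = len(inp[0][0]), len(inp[0][0][0])
    out_h, out_w = len(out_bp[0][0]), len(out_bp[0][0][0])
    grad = [[[0.0] * depth for _ in range(filt_w)] for _ in range(filt_h)]
    adds = [[[0] * depth for _ in range(filt_w)] for _ in range(filt_h)]
    # On the GPU each (b, c, out_r, out_c) iteration is a separate thread.
    for b in range(batch):
        for c in range(depth):
            for out_r in range(out_h):
                for out_c in range(out_w):
                    g = out_bp[b][c][out_r][out_c]
                    for fr in range(filt_h):
                        for fc in range(filt_w):
                            in_r = out_r * stride - pad + fr
                            in_c = out_c * stride - pad + fc
                            if 0 <= in_r < in_h and 0 <= in_c < in_w:
                                # This "+=" is the GpuAtomicAdd in the kernel.
                                grad[fr][fc][c] += inp[b][c][in_r][in_c] * g
                                adds[fr][fc][c] += 1
    return grad, adds

ones = [[[[1.0] * 3 for _ in range(3)]]]  # 1 batch, 1 channel, 3x3
grad, adds = dw_backprop_filter_nchw(ones, ones, 3, 3, stride=1, pad=1)
print(adds[1][1][0], adds[0][0][0])  # prints: 9 4
```

Even in this 3x3 toy case the center filter element receives 9 additions. For the 150x150, stride-1 shape from the log, each filter-gradient element receives on the order of 150 * 150 = 22500 additions per channel, all to the same address, which is the contention the TODO comment in the kernel refers to.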

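From the VLOG output above, we can see that DepthwiseConv2dNativeBackpropFilter did not use the acceleration library (Use cuDNN: 0).

Let's see why not; after all, the function does contain an if (use_cudnn) branch.

The call path when cudnn is used:

dw use cudnn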

if (use_cudnn) {
// Reshape from TF depthwise filter to cuDNN grouped convolution filter:
//
// | TensorFlow | cuDNN
// --------------------------------------------------------------------
// filter_out_depth | depth_multiplier | depth_multiplier * group_count
// filter_in_depth | in_depth | in_depth / group_count
//
// For depthwise convolution, we have group_count == in_depth.
int32 filter_in_depth = 1;
TensorShape shape =
TensorShape{filter_rows, filter_cols, filter_in_depth, out_depth};
Tensor reshaped_filter(/*type=*/dtype_);
OP_REQUIRES(
context, reshaped_filter.CopyFrom(*filter_backprop, shape),
errors::Internal(
"Failed to reshape filter tensor for grouped convolution."));

// TODO(yangzihao): Send in arbitrary dilation rates after the dilated
// conv is supported.
launcher_(context, use_cudnn_, cudnn_use_autotune_, out_backprop, input,
/*row_dilation=*/1, /*col_dilation=*/1, stride_, stride_,
padding_, /*explicit_paddings=*/{}, &reshaped_filter,
data_format_);
return;
}

// For in_depth == 1 and grouped convolutions.
LaunchConv2DBackpropFilterOp<Device, T> launcher_;
bool use_cudnn_;
bool cudnn_use_autotune_;
DataType dtype_;

The function LaunchConv2DBackpropFilterOp is implemented in tensorflow/core/kernels/conv_grad_filter_ops.cc

The call into miopen

#elif TENSORFLOW_USE_ROCM
ProfileResult best_result;
DnnScratchAllocator scratch_allocator(ConvolveBackwardFilterScratchSize,
ctx);
bool miopen_find_status = true;
if (TestMIOpenBFloat16Support<T>()) {
miopen_find_status =
stream
->ThenConvolveBackwardFilterWithAlgorithm(
input_desc, bfloat16_input_ptr, output_desc,
bfloat16_out_backprop_ptr, conv_desc, filter_desc,
&bfloat16_filter_backprop_ptr, &scratch_allocator,
AlgorithmConfig(), &best_result)
.ok();
} else {
miopen_find_status =
stream
->ThenConvolveBackwardFilterWithAlgorithm(
input_desc, input_ptr, output_desc, out_backprop_ptr,
conv_desc, filter_desc, &filter_backprop_ptr,
&scratch_allocator, AlgorithmConfig(), &best_result)
.ok();
}
OP_REQUIRES(ctx, miopen_find_status && best_result.is_valid(),
errors::NotFound("Failed to find backward filter algorithm!"));
algorithm_config.set_algorithm(best_result.algorithm());
algorithm_config.set_scratch_size(best_result.scratch_size());
#endif

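Analyzing with gdb

I spent two days on this. So far, apart from confirming that the kernel identified earlier is the right one, it has yielded nothing else.

However, I did write up a usage guide.

See "Debug TensorFlow 的C++代码" (debugging TensorFlow's C++ code).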

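Comparing with other excellent frameworks

The same story as yesterday, all over again.

This time the protagonist is the Pytorch framework.

I find Pytorch a lot easier to work with than Tensorflow.

Writing a test case for Pytorch

Why write a torch test case? There are two motivations:

1. See which MIOpen algorithm it calls, then investigate whether that algorithm can be used in tensorflow.

2. Take the miopendriver parameters of its computation and compare the time against cudnn or against the kernel in nv_tf (after all, I still do not know how tf's dw convolution enables cudnn); then, if possible, modify tf's dw computation.

Same story, same plot.

Locating the op involved

The first step is again to identify the target op.

This time the target is clear: look for anything named depthwise_conv.

No kernel trace tools are needed this time; torch ships with its own op trace tool.

Excellent.

torch prof

with torch.autograd.profiler.profile(use_cuda=True) as prof:
    ...  # the code to be profiled goes here
print(prof)

After adding this code, you will see the profiler's output.

It shows that pytorch does call MIOpen when computing dw convolution.

Parameters of the computation

For this purpose there is no need to debug further; just reuse the parameters traced from tensorflow.

Writing the pytorch test case

As before, we need a test case so that we can capture which MIOpen routine the dw convolution calls.

Let's start writing.

In torch, dw convolution is expressed by specifying the groups parameter of conv2d.

pytorch test case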

import torch
import torch.nn as nn
import torch.autograd as autograd

#input (minibatch,in_channels,H,W)
input = torch.randn(1,32,150,150).cuda()
#weight (out_channels,in_channels/groups,H,W)
#weight = autograd.Variable(torch.randn(32,32,3,3),requires_grad=True)
#groups: in_channels % groups == 0 and out_channels % groups == 0
groups = 32
output = autograd.Variable(torch.randn(1,32,150,150),requires_grad=True)
with autograd.profiler.profile(use_cuda=True) as prof:
    conv = nn.Conv2d(32,32,kernel_size=3,stride=1,padding=1,groups=32,bias=False).cuda()
    y = conv(input)
    loss=torch.mean(y)
    loss.backward()

#####################
# input2 = torch.randn(1,32,150,150)
# dw_conv = nn.Conv2d(32,32,3,1,1,32)
# conv1 = dw_conv(input2)
#fc = nn.Linear(88, 2816)
#fc1 = fc(conv1)
#fc1.backward()
#####################
print(prof)

To see the MIOpen calls, we can trace them with environment variables.

These variables work for any program that calls the MIOpen library.

Variable	Meaning
MIOPEN_LOG_LEVEL	0 - Default. Works as level 4 for Release builds, level 5 for Debug builds. 1 - Quiet. No logging messages. 2 - Fatal errors only (not used yet). 3 - Errors and fatals. 4 - All errors and warnings. 5 - Info. All the above plus information for debugging purposes. 6 - Detailed info. All the above plus more detailed information for debugging. 7 - Trace: the most detailed debugging info plus all above.
MIOPEN_ENABLE_LOGGING_MPMT	Print the log separately per process/thread
MIOPEN_ENABLE_LOGGING_ELAPSED_TIME	Add the elapsed time, in milliseconds, to the log messages
MIOPEN_ENABLE_LOGGING_CMD	Console output (the MIOpenDriver command line)
MIOPEN_ENABLE_LOGGING	Enable printing of MIOpen API call information

MIOpen LOG

MIOpen(HIP): Info [FindConvFwdAlgorithm] FW Chosen Algorithm: ConvOclDirectFwd , 0, 0.02032
MIOpen(HIP): miopenStatus_t miopenConvolutionForward(miopenHandle_t, const void *, const miopenTensorDescriptor_t, const void *, const miopenTensorDescriptor_t, const void *, const miopenConvolutionDescriptor_t, miopenConvFwdAlgorithm_t, const void *, const miopenTensorDescriptor_t, void *, void *, size_t){
MIOpen(HIP): handle = stream: 0, device_id: 0
MIOpen(HIP): alpha = 0x7fff91d20e40
MIOpen(HIP): xDesc = 1, 32, 150, 150
MIOpen(HIP): x = 0x7fc48bc00000
MIOpen(HIP): wDesc = 32, 1, 3, 3
MIOpen(HIP): w = 0x7fc475200000
MIOpen(HIP): convDesc = conv2d, miopenConvolution, miopenPaddingDefault, {1, 1}, {1, 1}, {1, 1}, 32,
MIOpen(HIP): algo = 1
MIOpen(HIP): beta = 0x7fff91d20e48
MIOpen(HIP): yDesc = 1, 32, 150, 150
MIOpen(HIP): y = 0x7fc48bebf200
MIOpen(HIP): workSpace = nullptr
MIOpen(HIP): workSpaceSize = 0
MIOpen(HIP): }
MIOpen(HIP): Command [LogCmdConvolution] ./bin/MIOpenDriver conv -n 1 -c 32 -H 150 -W 150 -k 32 -y 3 -x 3 -p 1 -q 1 -u 1 -v 1 -l 1 -j 1 -m conv -g 32 -F 1 -t 1

MIOpen(HIP): Info [FindConvBwdWeightsAlgorithm] BWrW Chosen Algorithm: ConvOclBwdWrW53 , 0, 0.07728
MIOpen(HIP): miopenStatus_t miopenConvolutionBackwardWeights(miopenHandle_t, const void *, const miopenTensorDescriptor_t, const void *, const miopenTensorDescriptor_t, const void *, const miopenConvolutionDescriptor_t, miopenConvBwdWeightsAlgorithm_t, const void *, const miopenTensorDescriptor_t, void *, void *, size_t){
MIOpen(HIP): handle = stream: 0, device_id: 0
MIOpen(HIP): alpha = 0x7fccb0043a50
MIOpen(HIP): dyDesc = 1, 32, 150, 150
MIOpen(HIP): dy = 0x7fc48c17e400
MIOpen(HIP): xDesc = 1, 32, 150, 150
MIOpen(HIP): x = 0x7fc48bc00000
MIOpen(HIP): convDesc = conv2d, miopenConvolution, miopenPaddingDefault, {1, 1}, {1, 1}, {1, 1}, 32,
MIOpen(HIP): algo = 1
MIOpen(HIP): beta = 0x7fccb0043a58
MIOpen(HIP): dwDesc = 32, 1, 3, 3
MIOpen(HIP): dw = 0x7fc475200a00
MIOpen(HIP): workSpace = nullptr
MIOpen(HIP): workSpaceSize = 0
MIOpen(HIP): }

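With that, the pytorch test case is written and we can see the relevant parameters.

Notice that no MIOpenDriver command was logged for the backward-weights (WrW) pass?

Brute force: delete $HOME/.config/miopen/*

Then run the program again and look at the output.

user db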

32-150-150-3x3-32-150-150-1-1x1-1x1-1x1-0-NCHW-FP32-F_g32=miopenConvolutionFwdAlgoDirect:ConvOclDirectFwd,0.02096,0,miopenConvolutionFwdAlgoDirect,<unused>;miopenConvolutionFwdAlgoWinograd:ConvBinWinogradRxSf2x3,0.28992,0,miopenConvolutionFwdAlgoWinograd,32x150x150x3x3x32x150x150x1xNCHWxFP32x1x1x1x1x1x1x32x1;miopenConvolutionFwdAlgoGEMM:gemm,1.98304,25920000,rocBlas,<unused>
32-150-150-3x3-32-150-150-1-1x1-1x1-1x1-0-NCHW-FP32-W_g32=miopenConvolutionBwdWeightsAlgoWinograd:ConvBinWinogradRxSf2x3,1.05056,0,miopenConvolutionBwdWeightsAlgoWinograd,32x150x150x3x3x32x150x150x1xNCHWxFP32x1x1x1x1x1x1x32x0;miopenConvolutionBwdWeightsAlgoDirect:ConvOclBwdWrW53,0.07888,0,miopenConvolutionBwdWeightsAlgoDirect,32x150x150x3x3x32x150x150x1xNCHWxFP32x1x1x1x1x1x1x32x0;miopenConvolutionBwdWeightsAlgoGEMM:gemm,0.33712,25920000,rocBlas,<unused>
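
These user db lines are dense, so a small parser helps when comparing algorithms. The sketch below ranks the entries of one line by their recorded time. The field layout (solver name first, time second, after the "algorithm:" prefix) is an assumption read off the lines above, not a documented format, and the function name rank_user_db_line is hypothetical.

```python
def rank_user_db_line(record):
    """Return [(solver, time), ...] sorted fastest-first for one user db line."""
    _key, _, value = record.partition("=")
    ranked = []
    for entry in value.split(";"):
        _algo, _, fields = entry.partition(":")
        parts = fields.split(",")
        # Assumed layout: solver name first, measured time second.
        ranked.append((parts[0], float(parts[1])))
    return sorted(ranked, key=lambda item: item[1])

record = (
    "32-150-150-3x3-32-150-150-1-1x1-1x1-1x1-0-NCHW-FP32-W_g32="
    "miopenConvolutionBwdWeightsAlgoWinograd:ConvBinWinogradRxSf2x3,1.05056,0,"
    "miopenConvolutionBwdWeightsAlgoWinograd,"
    "32x150x150x3x3x32x150x150x1xNCHWxFP32x1x1x1x1x1x1x32x0;"
    "miopenConvolutionBwdWeightsAlgoDirect:ConvOclBwdWrW53,0.07888,0,"
    "miopenConvolutionBwdWeightsAlgoDirect,"
    "32x150x150x3x3x32x150x150x1xNCHWxFP32x1x1x1x1x1x1x32x0;"
    "miopenConvolutionBwdWeightsAlgoGEMM:gemm,0.33712,25920000,rocBlas,<unused>"
)
for solver, t in rank_user_db_line(record):
    print(solver, t)
```

For the backward-weights line this puts ConvOclBwdWrW53 first, which matches the "BWrW Chosen Algorithm" line in the MIOpen log.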

The corresponding miopendriver command

miopendriver

/opt/rocm/miopen/bin/MIOpenDriver -F 4 -H 150 -W 150 -c 32 -k 32 -x 3 -y 3 -p 1 -q 1 -u 1 -v 1 -n 1 -g 32
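
To replay other layer shapes, the command line can be generated instead of typed. The sketch below only reuses the flags that appear in the logged command and in the command above; the function name miopen_driver_args is hypothetical, and the reading of -F (1 for the forward pass in the log, 4 for backward weights here) is inferred from those two commands.

```python
def miopen_driver_args(n, c, h, w, k, fy, fx, pad, stride, groups, direction):
    """Build a MIOpenDriver conv argument list from the flags seen in the log."""
    return ["conv",
            "-n", str(n), "-c", str(c), "-H", str(h), "-W", str(w),
            "-k", str(k), "-y", str(fy), "-x", str(fx),
            "-p", str(pad), "-q", str(pad),
            "-u", str(stride), "-v", str(stride),
            "-g", str(groups), "-F", str(direction)]

args = miopen_driver_args(n=1, c=32, h=150, w=150, k=32, fy=3, fx=3,
                          pad=1, stride=1, groups=32, direction=4)
print("/opt/rocm/miopen/bin/MIOpenDriver " + " ".join(args))
```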

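Does this give you any ideas about how to optimize?

The end of the story

After all of this, we now understand:

How dw convolution is computed in tensorflow

The parameters dw convolution is computed with in mobilenet_v2

A tensorflow dw conv test case

A pytorch dw conv test case

The corresponding miopen parameters

TensorFlow tools

TensorFlow visualization tools

performance trace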

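from tensorflow.core.protobuf import config_pb2
from tensorflow.python.client import timeline
from tensorflow.python.lib.io import file_io

# Fragment: sess, train_op and global_step come from the surrounding training script.
trace_run_options = config_pb2.RunOptions(
    trace_level=config_pb2.RunOptions.FULL_TRACE)
run_metadata = config_pb2.RunMetadata()
total_loss, np_global_step = sess.run([train_op, global_step],
                                      options=trace_run_options,
                                      run_metadata=run_metadata)
tl = timeline.Timeline(run_metadata.step_stats)
trace = tl.generate_chrome_trace_format()
trace_filename = *  # output path elided in the original
file_io.write_string_to_file(trace_filename, trace)

Open chrome://tracing/ in Chrome and load the generated file.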
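
Besides viewing the trace in the browser, the generated file can be summarized offline. This sketch sums the duration of complete events ("ph" == "X", "dur" in microseconds, per the Chrome trace event format) by event name. It assumes the timeline file stores its events under "traceEvents"; the function name and the tiny demo trace are made up for illustration and are not real profiling data.

```python
import json
from collections import defaultdict

def total_time_by_name(trace_json):
    """Sum the duration (microseconds) of complete events per event name."""
    totals = defaultdict(int)
    for event in json.loads(trace_json).get("traceEvents", []):
        if event.get("ph") == "X":
            totals[event["name"]] += event.get("dur", 0)
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)

# Tiny hand-written trace for illustration (not real profiling data).
demo = json.dumps({"traceEvents": [
    {"name": "DepthwiseConv2dNativeBackpropFilter", "ph": "X", "ts": 0, "dur": 300},
    {"name": "Conv2D", "ph": "X", "ts": 10, "dur": 40},
    {"name": "Conv2D", "ph": "X", "ts": 60, "dur": 50},
    {"name": "process_name", "ph": "M"},
]})
for name, micros in total_time_by_name(demo):
    print(name, micros)
```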

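TensorBoard visualization

TensorBoard is a powerful visualization tool that ships with Tensorflow; it is a suite of web applications. TensorBoard helps us build complex models.

Using TensorBoard takes three steps:

First, add visualization code where needed, i.e. create summaries and add them;
Second, run the code, which generates one or more event files;
Finally, start the TensorBoard web server.

Usage

Forward the port twice from the local machine
ssh -L 16006:127.0.0.1:16006 yangxuan@aaaa
ssh -L 16006:127.0.0.1:6006 yangxuan@bbb
tensorboard --logdir=./checkpoint-dir
/work/home/yangxuan/xuan_local/python27/bin/tensorboard  --logdir=./checkpoint-dir-inception-fusion/
Then open 127.0.0.1:16006 in a local browser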

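3. TensorFlow debugging tools

TensorFlow log output

# Read the source to decide which verbosity level to use
TF_CPP_MIN_VLOG_LEVEL

# Default level: show all messages
os.environ["TF_CPP_MIN_LOG_LEVEL"]='0'

# Hide INFO; show WARNING and ERROR
os.environ["TF_CPP_MIN_LOG_LEVEL"]='1'

# Hide INFO and WARNING; show ERROR
os.environ["TF_CPP_MIN_LOG_LEVEL"]='2'

# Hide INFO, WARNING, and ERROR
os.environ["TF_CPP_MIN_LOG_LEVEL"]='3'

tfdbg

tfdbg is a debugger built specifically for TensorFlow. During training or inference it lets us inspect the internal structure of the graph and the state of its nodes.

Usage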

import tensorflow as tf
from tensorflow.python import debug as tf_debug

session_wrapper=tf_debug.LocalCLIDebugWrapperSession

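Common commands

Command	Syntax and options	Description	Example
lt		List all tensors	lt
	-n	List tensors whose names match the given regular expression	lt -n Softmax.*
	-t	List tensors whose op types match the given regular expression	lt -t MatMul
	-s	Sort the tensor list by the given sort_key. The default is timestamp; other sort keys are dump_size, op_type, and tensor_name.	lt -s dump_size
	-r	Sort the tensor list in reverse order	lt -r -s dump_size
pt		Print the value of a tensor
ni		Show node information
li		List the inputs of a node
lo		List the recipients of a node's outputs
ps		Print the python source file being debugged
	ps	Print the source file source.py, annotated with the nodes created at each line	ps /path/to/source.py
	-t	Annotate the source with tensors instead of the default nodes	ps -t /path/to/source.py
run		Run until the next Session.run()	run
	-n	Run through the next Session.run() without debugging, so no debug information is shown	run -n
ri		Show the feeds and fetches of the current run	ri
help		Print general help	help
	help	Print help for the given command	help lt

Screenshots (partial)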

list tensor

node info

traceback

References

https://github.com/tensorflow/tensorboard/blob/master/tensorboard/plugins/debugger/README.md

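Debugging with pdb/ipdb

Installing ipdb

pip install ipdb

Usage

1. import ipdb
  ipdb.set_trace()
2. python -m ipdb code.py

Common commands

command

# Step over (execute the next line)
n(next)
# Print a variable
p(print)
pp(pretty print)
# Show the type of a variable
whatis variable
# Show the current position in the call stack
w(where)
# Step into a function
s(step)
# Inside a function, print all arguments passed to it
a(args)
# Set a breakpoint
b num_line(break)
# Continue to the next breakpoint
c(continue)
# Clear a breakpoint
cl clear_file:num_line
# Run until the current function returns
r(return)
# Jump to the given line, skipping the lines in between
j(jump)
# Show more source context
l [first,second](list)
# Restart debugging
restart
# Quit the debugger
q(quit)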

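Debugging with gdb

gdb (the GNU Debugger) is used to debug C and C++ programs.

Build options

To debug tensorflow with gdb, tensorflow has to be built with debug options.

bazel build -c dbg ***

Let's verify that the .so was compiled with -g.

gdb *.so

If the output ends with "Reading symbols from ... done." as below, without a "(no debugging symbols found)" message, the debug symbols were built in.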

gdb /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/_pywrap_tensorflow_internal.so
GNU gdb (Ubuntu 8.1-0ubuntu3.2) 8.1.0.20180409-git
Copyright (C) 2018 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/_pywrap_tensorflow_internal.so...done.

Usage

Add this code to the script

import os
input("pid: " + str(os.getpid()) +", press enter to continue")

Then attach with gdb -p pid

Common commands

Setting breakpoints

break / b

b /root/tensorflow/tensorflow/core/kernels/depthwise_conv_grad_op.cc:1092

b tensorflow::LaunchDepthwiseConvBackpropFilterOp::operator()

b tensorflow::DepthwiseConv2dBackpropInputGPUKernelNCHW

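Printing the function call stack

	Command	Usage	Meaning
Show the call stack	backtrace/bt
		bt n	Show the call stack, printing only the n frames at the top of the stack
		bt -n	Print only the n frames at the bottom of the stack
		set backtrace limit n	Set the maximum number of frames bt shows
		where, info stack	Aliases of bt with the same meaning
Inspect one frame of the stack	frame/f
		f n	Switch to frame number n in GDB
		down n	Move n frames down, toward the top of the stack
		up n	Move n frames up, toward the bottom of the stack
Show details of the current frame	info
		info frame ( i f )	Show details such as the function's address, the caller's address, the callee's address, the language the function is written in, the addresses and values of its arguments, and the addresses of its local variables
		info args	Print the names and values of the current function's arguments
		info locals	Print all local variables of the current function and their values
		info catch	Print the exception handling information of the current function

About the stack and stack frames

The stack region of memory is allocated and released automatically by the compiler. It holds function arguments, local variables, return addresses, and similar data, and it operates like the stack data structure: last in, first out. In the virtual address space, the stack grows from high addresses toward low addresses.

A stack frame is the data structure the compiler uses to implement function calls and is the basic unit of the stack region. The stack holds N such frames.

All function calls happen on the stack, and every invocation of every function has its own separate stack frame. The ebp register points to the bottom of the current frame (high address) and the esp register points to its top (low address); on x86-64 the corresponding registers are rbp and rsp.

Printing values

print / p

p config
$1 = {virtual_thread_count = 720000, thread_per_block = 1024, block_count = 60}

Debugging macros

Pass the -ggdb3 option when compiling with GCC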

References

Debug Tensorflow的C++代码 | 码农网

Top (Debugging with GDB)

DebuggingWithGdb - Python Wiki

cpython/Tools/gdb/libpython.py at 3.6 · python/cpython · GitHub

Debug Tensorflow的C++代码 | Chi’s Website

TensorFlow 拆包（一）：Session.Run() | Chenfan Blog

https://vimjc.com/categories/gdb/

https://www.cnblogs.com/lsgxeva/p/8078670.html

技术瘾君子1573
