TensorFlow Performance Optimization

Using batching to improve TF-Serving performance
OPTIMIZING, PROFILING, AND TUNING TENSORFLOW + GPUS
Save a model for inference that was trained with Horovod

The TF_GPU_THREAD_MODE environment variable
Latency gains: with TF_GPU_THREAD_MODE set to gpu_shared or gpu_private, warm-up time is unchanged, but average latency improves noticeably. Increasing TF_GPU_THREAD_COUNT to 4 or 8 brings no further improvement and can even make things slightly worse.
CPU gains: with TF_GPU_THREAD_MODE set to gpu_shared or gpu_private, CPU utilization drops noticeably.
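
A minimal sketch of setting these variables; they must be exported before the TensorFlow runtime creates its GPU devices, and the thread count of 2 is only an illustrative value:

import os

# Must be set before TensorFlow initializes its GPU devices
# (e.g. before importing tensorflow or launching tensorflow_model_server).
os.environ['TF_GPU_THREAD_MODE'] = 'gpu_private'
os.environ['TF_GPU_THREAD_COUNT'] = '2'  # threads per GPU in gpu_private mode

import tensorflow as tf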

frozen&trim&trt-optimized
The average latency of the frozen & trimmed graph is the same as that of the trt_optimized_graph, which means the main gain comes from the graph → frozen & trimmed graph step, most likely because many nodes are removed in that process. In the current graph, the subgraphs that TensorRT can optimize make up too small a fraction of the whole, so the TRT effect is not significant.

python bin/relevance_scripts/freeze_graph.py --input_saved_model_dir=/mnt/1564989211/ --output_graph=saved_model.pb --output_node_names=loss/out_query_emb,loss/out_image_emb

tf tensorrt
Speed up TensorFlow Inference on GPUs with TensorRT
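
A minimal TF-TRT sketch using the TF 1.x contrib API against the frozen graph produced below; the batch size, workspace size, and FP16 precision are illustrative assumptions:

from tensorflow.contrib import tensorrt as trt

# frozen_graph_def is a frozen tf.GraphDef, e.g. loaded with
# get_graph_def_from_file('frozen_model.pb') from the script below.
trt_graph_def = trt.create_inference_graph(
    input_graph_def=frozen_graph_def,
    outputs=['loss/out_query_emb', 'loss/out_image_emb'],
    max_batch_size=32,
    max_workspace_size_bytes=1 << 30,
    precision_mode='FP16')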

Recompiling
How to improve TensorFlow Serving performance by over 70%
Deep learning online prediction based on TensorFlow Serving

Optimization workflow
Optimizing TensorFlow Models for Serving
github
graph_transforms

import os
import numpy as np
from datetime import datetime
import sys

import tensorflow as tf
from tensorflow import data
from tensorflow.python.saved_model import tag_constants
from tensorflow.python.tools import freeze_graph
from tensorflow.python.framework import ops
from tensorflow.tools.graph_transforms import TransformGraph

def get_size(model_dir, model_file='saved_model.pb'):
  model_file_path = os.path.join(model_dir, model_file)
  print(model_file_path, '')
  pb_size = os.path.getsize(model_file_path)
  variables_size = 0
  if os.path.exists(
      os.path.join(model_dir,'variables/variables.data-00000-of-00001')):
    variables_size = os.path.getsize(os.path.join(
        model_dir,'variables/variables.data-00000-of-00001'))
    variables_size += os.path.getsize(os.path.join(
        model_dir,'variables/variables.index'))
  print('Model size: {} KB'.format(round(pb_size/(1024.0),3)))
  print('Variables size: {} KB'.format(round( variables_size/(1024.0),3)))
  print('Total Size: {} KB'.format(round((pb_size + variables_size)/(1024.0),3)))

def describe_graph(graph_def, show_nodes=False):
  print('Input Feature Nodes: {}'.format(
      [node.name for node in graph_def.node if node.op=='Placeholder']))
  print('')
  print('Unused Nodes: {}'.format(
      [node.name for node in graph_def.node if 'unused'  in node.name]))
  print('')
  print('Output Nodes: {}'.format(
      [node.name for node in graph_def.node if (
          'predictions' in node.name or 'loss/out' in node.name)]))
  print('')
  print('Quantization Nodes: {}'.format(
      [node.name for node in graph_def.node if 'quant' in node.name]))
  print('')
  print('Constant Count: {}'.format(
      len([node for node in graph_def.node if node.op=='Const'])))
  print('')
  print('Variable Count: {}'.format(
      len([node for node in graph_def.node if 'Variable' in node.op])))
  print('')
  print('Identity Count: {}'.format(
      len([node for node in graph_def.node if node.op=='Identity'])))
  print('', 'Total nodes: {}'.format(len(graph_def.node)), '')

  if show_nodes==True:
    for node in graph_def.node:
      print('Op:{} - Name: {}'.format(node.op, node.name))

def get_graph_def_from_file(graph_filepath):
  with ops.Graph().as_default():
    with tf.gfile.GFile(graph_filepath, 'rb') as f:
      graph_def = tf.GraphDef()
      graph_def.ParseFromString(f.read())
      return graph_def

def get_graph_def_from_saved_model(saved_model_dir):
  with tf.Session() as session:
    meta_graph_def = tf.saved_model.loader.load(
        session,
        tags=[tag_constants.SERVING],
        export_dir=saved_model_dir)
  return meta_graph_def.graph_def


def freeze_model(saved_model_dir, output_node_names, output_filename):
  output_graph_filename = os.path.join(saved_model_dir, output_filename)
  initializer_nodes = ''
  freeze_graph.freeze_graph(
      input_saved_model_dir=saved_model_dir,
      output_graph=output_graph_filename,
      saved_model_tags = tag_constants.SERVING,
      output_node_names=output_node_names,
      initializer_nodes=initializer_nodes,
      input_graph=None,
      input_saver=False,
      input_binary=False,
      input_checkpoint=None,
      restore_op_name=None,
      filename_tensor_name=None,
      clear_devices=False,
      input_meta_graph=False,
  )
  print('graph freezed!')


def optimize_graph(model_dir, graph_filename, transforms, output_node, output_filename):
  input_names = []
  output_names = output_node
  if graph_filename is None:
    graph_def = get_graph_def_from_saved_model(model_dir)
  else:
    graph_def = get_graph_def_from_file(os.path.join(model_dir, graph_filename))
  optimized_graph_def = TransformGraph(
      graph_def,
      input_names,
      output_names,
      transforms)
  tf.train.write_graph(optimized_graph_def,
                      logdir=model_dir,
                      as_text=False,
                      name=output_filename)
  print('Graph optimized!')


def convert_graph_def_to_saved_model(export_dir, graph_filepath):
  if tf.gfile.Exists(export_dir):
    tf.gfile.DeleteRecursively(export_dir)
  graph_def = get_graph_def_from_file(graph_filepath)
  with tf.Session(graph=tf.Graph()) as session:
    tf.import_graph_def(graph_def, name='')
    # tf.saved_model.simple_save(
    #     session,
    #     export_dir,
    #     inputs={
    #         node.name: session.graph.get_tensor_by_name(
    #             '{}:0'.format(node.name))
    #         for node in graph_def.node if node.op=='Placeholder'},
    #     outputs={'query_emb': session.graph.get_tensor_by_name(
    #         'loss/out_query_emb:0')}
    # )
    builder = tf.saved_model.builder.SavedModelBuilder(export_dir)
    builder.add_meta_graph_and_variables(session, ['serve'],
             {
                "serving_default":
                    tf.saved_model.signature_def_utils.build_signature_def(
                        inputs={
                            node.name: tf.saved_model.utils.build_tensor_info(session.graph.get_tensor_by_name('{}:0'.format(node.name)))
                            for node in graph_def.node if node.op=='Placeholder' and node.name!="image_list"
                        },
                        outputs={
                            'query_emb': tf.saved_model.utils.build_tensor_info(session.graph.get_tensor_by_name('loss/out_query_emb:0')),
                        },
                        method_name=tf.saved_model.signature_constants.PREDICT_METHOD_NAME
                    ),
                "serving_cnn":
                    tf.saved_model.signature_def_utils.build_signature_def(
                        inputs={
                            node.name: tf.saved_model.utils.build_tensor_info(session.graph.get_tensor_by_name('{}:0'.format(node.name)))
                            for node in graph_def.node if node.op=='Placeholder' and node.name=="image_list"
                        },
                        outputs={
                            'image_emb': tf.saved_model.utils.build_tensor_info(session.graph.get_tensor_by_name('loss/out_image_emb:0')),
                        },
                        method_name=tf.saved_model.signature_constants.PREDICT_METHOD_NAME
                    ),
             })

    builder.save()
    print('Optimized graph converted to SavedModel!')


saved_model_dir = "/mnt/cephfs_hl/common/ad/search_ad/huangqingkang/optim/"
output_node_names = "loss/out_query_emb,loss/out_image_emb"
output_node_list = output_node_names.split(",")

# freeze model
print("### starting freeze ###")
frozen_filename = "frozen_model.pb"
frozen_filepath = os.path.join(saved_model_dir, frozen_filename)

freeze_model(saved_model_dir, output_node_names, frozen_filename)

describe_graph(get_graph_def_from_file(frozen_filepath))
get_size(saved_model_dir, frozen_filename)


# optimize model
print("### starting optimize ###")
transforms = [
 "remove_nodes(op=Identity)",
 # "merge_duplicate_nodes",
 "strip_unused_nodes",
 "fold_constants(ignore_errors=true)",
 "fold_batch_norms"
]
optimized_filename = "optimized_model.pb"
optimized_filepath = os.path.join(saved_model_dir, optimized_filename)

optimize_graph(saved_model_dir, frozen_filename , transforms, output_node_list, optimized_filename)

describe_graph(get_graph_def_from_file(optimized_filepath))
get_size(saved_model_dir, optimized_filename)

# convert back
print("### starting convert back ###")
export_dir = os.path.join(saved_model_dir, "export")
convert_graph_def_to_saved_model(export_dir, optimized_filepath)

I. General optimization methods

1. Input pipeline optimization

(1) Run preprocessing on the CPU.

(2) Use the tf.data API instead of feed_dict, especially for large batches of input.

(3) For image decoding and cropping, use tf.image.decode_and_crop_jpeg.

(4) Use large input files to avoid I/O bottlenecks by converting the input files to TFRecord format. If memory allows, it is best to load the entire dataset into memory. A minimal input-pipeline sketch follows this list.
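
A minimal tf.data sketch in the same TF 1.x style as the script above; the feature keys, the 224x224 crop window, and the batch size are illustrative assumptions:

def make_dataset(tfrecord_files, batch_size=128):
  def parse_example(serialized):
    # Hypothetical feature spec; adapt the keys and shapes to the actual data.
    features = tf.parse_single_example(
        serialized,
        features={'image': tf.FixedLenFeature([], tf.string),
                  'label': tf.FixedLenFeature([], tf.int64)})
    # Decode and crop the JPEG in one op instead of decoding the full image.
    image = tf.image.decode_and_crop_jpeg(features['image'], [0, 0, 224, 224])
    return image, features['label']

  dataset = tf.data.TFRecordDataset(tfrecord_files)
  dataset = dataset.map(parse_example, num_parallel_calls=8)
  dataset = dataset.batch(batch_size)
  dataset = dataset.prefetch(1)  # overlap input preprocessing with compute
  return dataset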

2. Data format

Use NCHW instead of NHWC, where N is the batch size, H the image height, W the image width, and C the number of channels. NHWC is faster on CPU, while NCHW is faster on GPU. The default is NHWC.
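
A hedged example of selecting the layout through the data_format argument of tf.layers.conv2d; the placeholder shape and filter settings are arbitrary:

# NCHW input: [batch, channels, height, width]
images_nchw = tf.placeholder(tf.float32, [None, 3, 224, 224])
conv = tf.layers.conv2d(
    images_nchw, filters=64, kernel_size=3,
    data_format='channels_first')  # default is 'channels_last' (NHWC)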

3. Use fused operators

For example, tf.layers.batch_normalization.
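
A minimal sketch, assuming the conv tensor from the previous example and an is_training placeholder; fused=True requests the fused kernel explicitly, and axis=1 matches the NCHW layout:

is_training = tf.placeholder(tf.bool, name='is_training')
bn = tf.layers.batch_normalization(
    conv, axis=1, fused=True, training=is_training)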

4. RNN performance

(1) Avoid tf.nn.rnn_cell.BasicLSTMCell unless there is no alternative.

(2) tf.nn.static_rnn runs about as fast as tf.nn.dynamic_rnn, but it builds a much larger graph and therefore takes longer to construct.

(3) On GPU, the tf.contrib.cudnn_rnn cells can be used, but they do not support layer normalization.

(4) tf.contrib.rnn.LSTMBlockCell uses 3-4x less memory than tf.nn.rnn_cell.BasicLSTMCell; see the sketch after this list.
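
A drop-in usage sketch; num_units and the input tensor are illustrative:

# LSTMBlockCell fuses the LSTM internals into a single op and is a
# drop-in replacement for BasicLSTMCell.
cell = tf.contrib.rnn.LSTMBlockCell(num_units=256)
# sequence_inputs: [batch, time, features], assumed to exist.
outputs, state = tf.nn.dynamic_rnn(cell, sequence_inputs, dtype=tf.float32)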

5. Install TensorFlow from source

Build TensorFlow from source so that the binary is better tuned to the local CPU.

II. CPU-specific optimizations

1. Increase the number of threads

config = tf.ConfigProto()
config.intra_op_parallelism_threads = 44
config.inter_op_parallelism_threads = 44
tf.Session(config=config)

intra_op_parallelism_threads controls the parallelism within a single op, while inter_op_parallelism_threads controls how many ops can run in parallel.
Both default to the number of logical cores; an optional tuning step is to set them to the number of physical cores instead.

2. MKL optimizations

(1) Install TensorFlow from source

TensorFlow 1.2.0:

./configure
(pick the desired options)
bazel build --config=mkl --config=opt //tensorflow/tools/pip_package:build_pip_package

Or, answering the MKL prompts explicitly:

./configure
Do you wish to build TensorFlow with MKL support? [y/N] Y
Do you wish to download MKL LIB from the web? [Y/n] Y
(select the defaults for the rest of the options)
bazel build --config=mkl --copt="-DEIGEN_USE_VML" -c opt //tensorflow/tools/pip_package:build_pip_package
(2) The data format must be NCHW

(3) Tune the environment variables

KMP_BLOCKTIME: 0 ms for CNN models in general; AlexNet, GoogleNet, and VGG11 work best with 30 ms, 1 ms, and 1 ms respectively

KMP_AFFINITY: granularity=fine,verbose,compact,1,0

KMP_SETTINGS: prints the OpenMP runtime settings when enabled

OMP_NUM_THREADS: number of OpenMP threads; see https://software.intel.com/en-us/articles/tensorflow-optimizations-on-modern-intel-architecture

intra_op_parallelism_threads: should be set to the same value as OMP_NUM_THREADS

inter_op_parallelism_threads: setting this equal to the number of sockets is recommended
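
A hedged example of applying these settings together; the values assume a hypothetical two-socket host with 44 physical cores in total:

import os
import tensorflow as tf

# OpenMP / MKL tuning (illustrative values).
os.environ['KMP_BLOCKTIME'] = '0'
os.environ['KMP_AFFINITY'] = 'granularity=fine,verbose,compact,1,0'
os.environ['KMP_SETTINGS'] = 'TRUE'
os.environ['OMP_NUM_THREADS'] = '44'

config = tf.ConfigProto()
config.intra_op_parallelism_threads = 44  # same as OMP_NUM_THREADS
config.inter_op_parallelism_threads = 2   # one per socket
session = tf.Session(config=config)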

3. The MKL build is roughly 3x faster than the AVX2 build

III. GPU-specific optimizations

1. Multiple GPUs

(1) Tesla K80: if the GPUs are on the same PCI Express root complex and can use NVIDIA GPUDirect, placing the variables on the GPUs works well; otherwise place them on the CPU.

(2) Titan X (Maxwell and Pascal), M40, P100, and similar: for models such as ResNet and Inception V3, place the variables on the CPU. For models with many variables, such as AlexNet and VGG, placing the variables on the GPU works better. A placement sketch follows.
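
A minimal sketch of keeping shared variables on the CPU while running the computation on multiple GPUs; the variable shape and the two-tower loop are illustrative:

# Shared variables live on the CPU; each GPU tower reads them as needed.
with tf.device('/cpu:0'):
  weights = tf.get_variable('shared_weights', shape=[1024, 1024])

tower_outputs = []
for gpu_id in range(2):
  with tf.device('/gpu:{}'.format(gpu_id)):
    # Illustrative per-GPU computation that consumes the shared variable.
    tower_outputs.append(tf.matmul(tf.ones([8, 1024]), weights))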
