TensorFlow Network Model Porting and Training Guide (continuously updated)

风尘23187

Article link: https://blog.csdn.net/ygf666/article/details/124305920

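1. Limitations

Only TensorFlow 1.15 is supported.

3. Network Migration

3.1 Migration with Estimator

About Estimator
The Estimator API is a high-level TensorFlow API (tf.estimator) that greatly simplifies machine learning programming. Estimator has many advantages, such as good support for distributed training, simplified model creation, and code sharing between model developers.
To develop a training script with the Estimator API, perform the following steps.
Training flow
1 Data preprocessing: create the input function input_fn.
2 Model construction: build the model function model_fn.
3 Run configuration: instantiate Estimator and pass a RunConfig object as the run configuration.
4 Training: call Estimator.train() to train the model on the specified input for a fixed number of steps.
The following describes how to migrate an Estimator-based script for training on the Ascend AI Processor.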

3.1.1 Data preprocessing

3.1.2 Model construction

3.1.3 Setting the run configuration

# Original TensorFlow code.
config = tf.estimator.RunConfig(
    model_dir=FLAGS.model_dir,
    save_checkpoints_steps=FLAGS.save_checkpoints_steps,
    session_config=tf.ConfigProto(allow_soft_placement=True, log_device_placement=False))

# Code after migration.
from npu_bridge.estimator.npu.npu_config import NPURunConfig
npu_config = NPURunConfig(
    model_dir=FLAGS.model_dir,
    save_checkpoints_steps=FLAGS.save_checkpoints_steps,
    # Enable automatic device selection without logging the device selection event.
    session_config=tf.ConfigProto(allow_soft_placement=True, log_device_placement=False)
)

In addition, Ascend supports features such as automatic mixed precision. For how to enable them, see the API description.

3.1.4 Creating the Estimator

Migrate the TensorFlow Estimator to NPUEstimator.

# Original TensorFlow code.
mnist_classifier = tf.estimator.Estimator(
    model_fn=cnn_model_fn,
    config=config,
    # If model_dir is also set in config, both values must be identical.
    model_dir="/tmp/mnist_convnet_model")

# Code after migration.
from npu_bridge.estimator.npu.npu_estimator import NPUEstimator
mnist_classifier = NPUEstimator(
    model_fn=cnn_model_fn,
    config=npu_config,
    model_dir="/tmp/mnist_convnet_model"
)

3.1.5 Training

Train the model on the specified input for a fixed number of steps. The code snippet is ready to use in normal cases.

mnist_classifier.train(input_fn=train_input_fn,
                       steps=20000,
                       hooks=[logging_hook])

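3.2 Migration with sess.run

Training flow
1 Data preprocessing
2 Model construction, loss computation, and gradient update
3 Session creation and resource initialization
4 Training

3.2.1 Data preprocessing

Manual adjustment is required in the following case:
Only static shapes can be trained. If dataset.batch(batch_size) in the original network script returns a dynamic shape, set drop_remainder to True on the Ascend AI Processor, because the number of remaining samples may be smaller than the batch size.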

dataset = dataset.batch(batch_size, drop_remainder=True)

This may discard the last few samples so that every batch has a static shape (batch_size).
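The effect of drop_remainder on batch shapes can be illustrated without TensorFlow. The following is a minimal pure-Python sketch of the batching rule only; the helper name batch_sizes and the sample counts are made up for illustration.

```python
def batch_sizes(num_samples, batch_size, drop_remainder):
    """Return the size of each batch produced from num_samples items."""
    full, rest = divmod(num_samples, batch_size)
    sizes = [batch_size] * full
    if rest and not drop_remainder:
        # The short trailing batch is what makes the batch dimension dynamic.
        sizes.append(rest)
    return sizes


print(batch_sizes(10, 4, drop_remainder=False))  # [4, 4, 2]
print(batch_sizes(10, 4, drop_remainder=True))   # [4, 4]
```

With drop_remainder=True every batch has exactly batch_size items, at the cost of skipping the last 2 of the 10 samples in this example.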
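Note: during inference, if the last batch contains fewer than batch_size samples, pad it with blank data up to batch_size. Otherwise, if the script ends with an assertion such as

assert num_written_lines == num_actual_predict_examples

which checks that the number of results equals the number of samples, the assertion fails.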
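One possible way to satisfy that assertion is to pad the last batch and then discard the outputs that came from padding. The sketch below is pure Python and only shows the bookkeeping; pad_batches, the blank value 0, and the stand-in "inference" step are assumptions for illustration, not part of the original script.

```python
def pad_batches(samples, batch_size, blank=0):
    """Split samples into batches of exactly batch_size, padding the last one.

    Returns (batches, num_real), where num_real is the number of real samples.
    """
    batches = []
    for start in range(0, len(samples), batch_size):
        batch = list(samples[start:start + batch_size])
        batch += [blank] * (batch_size - len(batch))  # pad up to batch_size
        batches.append(batch)
    return batches, len(samples)


samples = [1, 2, 3, 4, 5]
batches, num_actual_predict_examples = pad_batches(samples, batch_size=2)

# Stand-in for inference: one output per input item, padding included.
outputs = [x * 10 for batch in batches for x in batch]
# Drop the outputs produced by the padded items before writing results.
written = outputs[:num_actual_predict_examples]
num_written_lines = len(written)
assert num_written_lines == num_actual_predict_examples
```

Every batch keeps the static size batch_size, and the number of written results still matches the number of real samples.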

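3.2.2 Model construction, loss computation, and gradient update

The code snippet is ready to use in normal cases. Manual adjustment is required only in the following cases:
1. If the original network uses tf.device, delete the related code.
2. Replace dropout in the original network with the corresponding AscendCL API:

# Original TensorFlow code (arguments omitted).
layers = tf.nn.dropout()
# Code after migration.
from npu_bridge.estimator import npu_ops
layers = npu_ops.dropout()

3. Replace gelu in the original network with the corresponding AscendCL API:

# Original TensorFlow code.
import numpy as np
import tensorflow as tf

def gelu(x):
    # Tanh approximation of GELU.
    cdf = 0.5 * (1.0 + tf.tanh(
        (np.sqrt(2 / np.pi) * (x + 0.044715 * tf.pow(x, 3)))))
    return x * cdf
layers = gelu(x)
# Code after migration.
from npu_bridge.estimator.npu_unary_ops import npu_unary_ops
layers = npu_unary_ops.gelu(x)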
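The original gelu() above is the tanh approximation of GELU. As a sanity check when swapping implementations, the sketch below compares that formula with the exact definition x * Phi(x) in pure Python. It does not call npu_unary_ops, and the source does not state which variant the NPU operator computes, so treat it only as a way to see how close the two formulas are.

```python
import math


def gelu_tanh(x):
    """Tanh approximation, the same formula as the original gelu() above."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return x * 0.5 * (1.0 + math.tanh(inner))


def gelu_exact(x):
    """Exact GELU: x * Phi(x), where Phi is the standard normal CDF."""
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
    print(x, gelu_tanh(x), gelu_exact(x))
```

The two columns agree to about three decimal places on these inputs, which is the kind of tolerance to expect when comparing outputs before and after migration.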

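3.2.3 Creating a session and initializing resources

When running a training script on the Ascend AI Processor with sess.run, note the following:
1. The following option is disabled by default and must not be enabled:

rewrite_options.disable_model_pruning

2. The following options are enabled by default and must not be disabled:

rewrite_options.function_optimization
rewrite_options.constant_folding
rewrite_options.shape_optimization
rewrite_options.arithmetic_optimization
rewrite_options.loop_optimization
rewrite_options.dependency_optimization
rewrite_options.layout_optimizer
rewrite_options.memory_optimization

3. The following option is enabled by default and must be explicitly disabled:

rewrite_options.remapping

4. In distributed scenarios, add the GradFusionOptimizer optimizer manually:

rewrite_options.optimizers.extend(["GradFusionOptimizer"])

5. The following option is disabled by default and must be explicitly enabled for training on the Ascend AI Processor:

custom_op.parameter_map["use_off_line"].b = True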

Original TensorFlow code:

# Construct the iterator.
iterator = tf.data.Iterator.from_structure(train_dataset.output_types, train_dataset.output_shapes)
# Obtain the batch data.
next_batch = iterator.get_next()
# Initialize the iterator.
training_init_op = iterator.make_initializer(train_dataset)
# Initialize the variables.
init = tf.global_variables_initializer()
sess = tf.Session()
sess.run(init)
# Obtain the number of training/validation steps per epoch.
train_batches_per_epoch = int(np.floor(train_size / batch_size))

Code after migration:

from npu_bridge.estimator import npu_ops
from tensorflow.core.protobuf.rewriter_config_pb2 import RewriterConfig
# Construct the iterator.
iterator = tf.data.Iterator.from_structure(train_dataset.output_types, train_dataset.output_shapes)
# Obtain the batch data.
next_batch = iterator.get_next()
# Initialize the iterator.
training_init_op = iterator.make_initializer(train_dataset)
# Initialize the variables.
init = tf.global_variables_initializer()
# Create a session.
config = tf.ConfigProto()
custom_op = config.graph_options.rewrite_options.custom_optimizers.add()
custom_op.name = "NpuOptimizer"
# Must be explicitly enabled for training on Ascend AI Processor.
custom_op.parameter_map["use_off_line"].b = True
# Remapping must be disabled explicitly.
config.graph_options.rewrite_options.remapping = RewriterConfig.OFF
# Required in the distributed training scenario.
config.graph_options.rewrite_options.optimizers.extend(["GradFusionOptimizer"])
sess = tf.Session(config=config)
sess.run(init)
# Obtain the number of training/validation steps per epoch.
train_batches_per_epoch = int(np.floor(train_size / batch_size))

The Ascend platform supports all native functions of tf.Session.
It also lets you enable features such as automatic mixed precision. For details, see the corresponding API description.

3.2.4 Training

The code snippet is ready to use in normal cases.

# Start cyclic iteration.
for epoch in range(num_epochs):
    # Initialize the iterator with the training dataset.
    sess.run(training_init_op)
    for step in range(train_batches_per_epoch):
        # Get the next batch of data.
        img_batch, label_batch = sess.run(next_batch)
        # Run the training op.
        _, train_loss = sess.run([train_op, loss], feed_dict={x: img_batch, y_: label_batch, is_training: True})
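The inner loop length comes from train_batches_per_epoch = floor(train_size / batch_size), so only full batches are consumed in each epoch. A small pure-Python sketch of that arithmetic follows; the numbers (1000 samples, batch size 32, 3 epochs) are hypothetical and the helper name is made up.

```python
import math


def steps_per_epoch(train_size, batch_size):
    """Number of full batches per epoch, as in train_batches_per_epoch."""
    return int(math.floor(train_size / batch_size))


# Hypothetical numbers, not taken from the original script.
train_size, batch_size, num_epochs = 1000, 32, 3
per_epoch = steps_per_epoch(train_size, batch_size)
unused = train_size - per_epoch * batch_size

print(per_epoch)               # 31 steps per epoch
print(unused)                  # 8 samples left out of each epoch
print(num_epochs * per_epoch)  # 93 training steps in total
```

Rounding down is consistent with drop_remainder=True: the partial last batch is never requested, so the loop does not run past the end of the iterator.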

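3.3 Keras migration

3.3.1 Introduction to Keras

Keras is similar to Estimator. Both are high-level TensorFlow APIs that provide convenient graph construction functions and convenient APIs for training, evaluation, validation, and export. To develop a training script with the Keras API, perform the following steps:
1 Preprocess the data
2 Construct the model
3 Build the model
4 Train the model
When Keras is migrated to the Ascend platform, some functions are restricted; for example, dynamic learning rates are not supported. Migrating network scripts developed with Keras to the Ascend platform is therefore not recommended. To run a Keras script on the Ascend platform, use one of the following two migration methods:

① On the Ascend platform, you can train directly with the native Keras API. However, the number of iterations per training loop on the Ascend AI Processor is fixed at 1 for each session.run call. For details, see Native Keras API support.
② To reduce the number of host-device interactions and shorten training time, use the model_to_npu_estimator API to convert the model built with Keras into an NPUEstimator object, and use the iterations_per_loop parameter of NPURunConfig to specify the number of iterations per training loop on the Ascend AI Processor for each sess.run() call. For details, see Converting Keras to NPUEstimator.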
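As a rough illustration of why a larger iterations_per_loop reduces host-device interaction, the sketch below counts sess.run() calls under the assumption that each call executes exactly iterations_per_loop training iterations on the device. The helper num_session_runs is hypothetical and not part of npu_bridge.

```python
def num_session_runs(total_steps, iterations_per_loop):
    """sess.run() calls needed to cover total_steps training iterations."""
    # Integer ceiling division: a final partial loop still needs one call.
    return (total_steps + iterations_per_loop - 1) // iterations_per_loop


print(num_session_runs(1000, 1))   # 1000 calls with the native Keras API
print(num_session_runs(1000, 10))  # 100 calls with iterations_per_loop=10
```

Under this assumption, 1000 training steps need ten times fewer round trips between host and device when iterations_per_loop is 10 than when it is fixed at 1.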

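3.3.2 Native Keras API support

On the Ascend platform, you can train directly with the native Keras API. However, the number of iterations per training loop on the Ascend AI Processor is fixed at 1 for each sess.run() call. When migrating a Keras-based network script to the Ascend platform for training, note the following:
1. Training on the Ascend AI Processor requires use_off_line to be enabled. Therefore, first create a TensorFlow session and register it with Keras. Close the session when training ends.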

import tensorflow as tf
import tensorflow.python.keras as keras
from tensorflow.python.keras import backend as K
from tensorflow.core.protobuf.rewriter_config_pb2 import RewriterConfig
from npu_bridge.estimator import npu_ops
sess_config = tf.ConfigProto()
custom_op = sess_config.graph_options.rewrite_options.custom_optimizers.add()
custom_op.name = "NpuOptimizer"
custom_op.parameter_map["use_off_line"].b = True
sess_config.graph_options.rewrite_options.remapping = RewriterConfig.OFF
# This line is required in the distributed training scenario.
sess_config.graph_options.rewrite_options.optimizers.extend(["GradFusionOptimizer"])
sess = tf.Session(config=sess_config)
K.set_session(sess)
# Preprocess the data...
# Construct a model...
# Build the model...
# Train the model...
sess.close()

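2. If the original network uses tf.device, delete the related code.
In addition, Ascend supports features such as automatic mixed precision. For how to enable them, see the API description.

3.3.3 Converting Keras to NPUEstimator

This section describes how to migrate a Keras-based network script to NPUEstimator and configure iterations_per_loop.
Data preprocessing
Migrate the Keras data preprocessing code into the input_fn of NPUEstimator yourself. The following is an example.
In the example, Keras reads image data from a folder, labels it automatically, applies data augmentation such as resizing, normalization, and horizontal flipping, and then outputs the data.
In Estimator mode, the data is preprocessed in the same way but is read from a file list. The difference is that the file name list must be read in advance and every image must be labeled to produce a label list. The data is output after the same augmentation operations: normalization, resizing, and horizontal flipping.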

# Original TensorFlow code.
# Keras reads images from the folder.
train_datagen = ImageDataGenerator(rescale=1./255,
                                   horizontal_flip=True)
train_generator = train_datagen.flow_from_directory('data/',
                                                    target_size=(224, 224),  # (height, width)
                                                    batch_size=32,
                                                    class_mode='sparse')
# Code after migration.
# Read the image file for each file name, normalize it, and resize it to a unified size.
def _parse_function(filename, label):
    image = tf.read_file(filename)
    # decode_jpeg with channels=3 yields a rank-3 tensor that resize_images accepts.
    image = tf.image.decode_jpeg(image, channels=3)
    image = tf.cast(image, tf.float32) / 255.0
    image = tf.image.resize_images(image, [224, 224])  # [height, width]
    image = tf.image.random_flip_left_right(image)
    return image, label
def input_fn():
    # List of image files. The image list needs to be generated by yourself.
    filenames = tf.constant(["/data/image1.jpg", "/data/image2.jpg", ...])
    # labels[i] is the label of the filenames[i] image. The label list needs to be generated by yourself.
    labels = tf.constant([0, 5, ...])
    # Now an element in the dataset is (filename, label).
    dataset = tf.data.Dataset.from_tensor_slices((filenames, labels)).repeat(10)
    # Now an element in the dataset is (image_resized, label).
    dataset = dataset.map(_parse_function)
    # Now an element in the dataset is (image_resized_batch, label_batch).
    # shuffle() requires a buffer size; 1000 is an illustrative value, not from the original.
    # drop_remainder=True keeps the batch shape static, as required on Ascend.
    dataset = dataset.shuffle(buffer_size=1000).batch(32, drop_remainder=True)
    return dataset
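The comments in input_fn() say that the file list and label list have to be generated by yourself. One possible way to do that, assuming the same one-subfolder-per-class layout that flow_from_directory reads, is sketched below with the standard library only. The function name list_images_and_labels and the accepted extensions are assumptions for illustration; class indices follow the sorted subfolder names.

```python
import os

# Assumed set of image extensions; adjust to the actual dataset.
IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png")


def list_images_and_labels(data_dir):
    """Return (filenames, labels) for a directory with one subfolder per class."""
    class_names = sorted(
        d for d in os.listdir(data_dir)
        if os.path.isdir(os.path.join(data_dir, d)))
    filenames, labels = [], []
    for label, class_name in enumerate(class_names):
        class_dir = os.path.join(data_dir, class_name)
        for name in sorted(os.listdir(class_dir)):
            if name.lower().endswith(IMAGE_EXTENSIONS):
                filenames.append(os.path.join(class_dir, name))
                labels.append(label)  # integer class index, like class_mode='sparse'
    return filenames, labels
```

The two returned lists can then be passed to tf.constant() in input_fn() in place of the elided literals.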

Model
Call the model_to_npu_estimator API to convert the model built with Keras into an NPUEstimator object, and then train it.

# Original TensorFlow code.
from keras.layers import Input, Dense
from keras.models import Model
from keras.applications.resnet50 import ResNet50
# This returns a tensor.
inputs = Input(shape=(224, 224, 3))
# This creates a ResNet-50 model on top of the Input layer.
keras_model = ResNet50(input_tensor=inputs, weights=None, include_top=True)
keras_model.compile(optimizer='rmsprop', loss='sparse_categorical_crossentropy')
keras_model.fit_generator(train_generator,
                          steps_per_epoch=100,
                          epochs=10)

# Code after migration.
from npu_bridge.estimator.npu.keras_to_npu import model_to_npu_estimator
from npu_bridge.estimator.npu.npu_config import NPURunConfig
run_config = NPURunConfig(save_checkpoints_steps=2,
                          model_dir=model_path,
                          iterations_per_loop=10)
# Convert the model constructed by using Keras to an NPUEstimator object.
est_resnet = model_to_npu_estimator(keras_model=keras_model, config=run_config)
# Perform training.
est_resnet.train(input_fn=lambda: input_fn(), max_steps=1000)

Note:
● Keras callbacks cannot be used after the model is converted to an NPUEstimator object.
● If the original network uses tf.device, delete the related code.

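4. Distributed training

5. Special topics

5.1 Mixed precision

5.2 Loss scaling

5.3 Mixed computing

5.4 Profiling

5.5 Data dump

5.6 Overflow detection

5.7 Iteration offloading

5.8 log and sum operators

5.9 Improving data preprocessing performance

5.10 Gradient segmentation strategy

5.11 Converting a trained .ckpt to an offline inference .pb model

6. Running training

6.1 Configuring processor resources

6.2 Configuring environment variables

7. Migration example

7.1 Training the ResNet-50 model with the ImageNet dataset

7.1.1 Preparation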

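Obtaining the dataset
This example uses the ImageNet dataset. Download it from http://www.image-net.org/.
About ResNet-50
ResNet-50 is a deep residual network that can be used to classify images from the CIFAR-10 dataset and the 1000 classes of the ImageNet dataset.

Obtaining the original model
The original ResNet network script is available at https://github.com/tensorflow/models/tree/r2.1_model_reference/official.
Directory structure
The directory is organized as follows. (Only some of the files involved are listed. For more files, see the original ResNet script.)
[Figure: directory structure; image not included in the source]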

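7.1.2 Training flow overview

Estimator
Estimator is a high-level TensorFlow API (tf.estimator) that greatly simplifies machine learning programming. Estimator has many advantages, such as good support for distributed training, simplified model creation, and code sharing between model developers. To develop a training script with the Estimator API, perform the following steps.
Training flow
1 Data preprocessing: create the input function input_fn.
2 Model construction: build the model function model_fn.
3 Run configuration: instantiate Estimator and pass a RunConfig object as the run configuration.
4 Training: call Estimator.train() to train the model on the specified input for a fixed number of steps.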

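7.1.3 Training code directory

Directory structure
The directory is organized as follows. (Only some of the files involved are listed. For more files, see the original ResNet script.)
[Figure: directory structure; image not included in the source]
PY file descriptions
imagenet_main.py:
Contains APIs related to ImageNet preprocessing, model construction definitions, and model running. The get_filenames(), parse_record(), input_fn(), get_synth_input_fn(), and _parse_example_proto() functions are used for data preprocessing. The ImagenetModel class and the imagenet_model_fn(), run_imagenet(), and define_imagenet_flags() functions are used for model operations.

imagenet_preprocessing.py:
Contains the ImageNet image preprocessing APIs. For training, images are sampled using the provided bounding boxes, cropped to the bounding box, randomly flipped, and resized to the target output size (without preserving the aspect ratio). For evaluation, aspect-preserving resizing and central cropping are used.

resnet_model.py:
Implements the ResNet model, including helper functions for building the ResNet model and the ResNet block definition functions.

resnet_run_loop.py:
Model runtime file, including input processing and the run loop. Input processing includes decoding the input data, converting formats, outputting images and labels, and setting up shuffling, batching, and prefetching depending on whether it is a training scenario. The run loop includes building the Estimator and performing training and validation. In general, the model is run in a specific environment so that data and errors flow through it and gradient descent can update the model parameters.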

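7.1.4 Data preparation

The data preprocessing flow is the same as that of the original model. Some code is modified to adapt to the Ascend 910 AI Processor for better computing performance. The code below shows the modifications.
Defining the input function input_fn
The following uses data preprocessing of the ImageNet dataset as an example. The .py files and functions modified for the Ascend 910 AI Processor are as follows.
Data preprocessing APIs
input_fn(): input function that processes the dataset for Estimator training and outputs real data.
(/official/r1/resnet/imagenet_main.py)
resnet_main(): main API covering data input, run configuration, training, and validation.
(/official/r1/resnet/resnet_run_loop.py)

1. Import the following modules into official/r1/resnet/imagenet_main.py:

from hccl.manage.api import get_rank_size
from hccl.manage.api import get_rank_id

2. Obtain the number of devices and the device ID to support data-parallel training.
Modify input_fn() in official/r1/resnet/imagenet_main.py. (The changes are enclosed by the "npu modify begin" and "npu modify end" comments.)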

def input_fn(is_training,
             data_dir,
             batch_size,
             num_epochs=1,
             dtype=tf.float32,
             datasets_num_private_threads=None,
             parse_record_fn=parse_record,
             input_context=None,
             drop_remainder=False,
             tf_data_experimental_slack=False):
    """Function that provides training and validation batches.

    Args:
      is_training: a bool indicating whether the input is used for training.
      data_dir: file path that contains the input dataset.
      batch_size: batch size.
      num_epochs: number of epochs.
      dtype: data type of an image or feature.
      datasets_num_private_threads: number of threads dedicated to tf.data.
      parse_record_fn: entry function for parsing TFRecords.
      input_context: tf.distribute.InputContext object passed by tf.distribute.Strategy.
      drop_remainder: specifies whether to retain or discard the last batch if the data
        volume of the last batch is smaller than the value of batch_size. If set to True,
        the batch dimension is fixed.
      tf_data_experimental_slack: specifies whether to enable the experimental_slack
        option of tf.data.

    Returns:
      A dataset that can be used for iteration.
    """
    # Obtain the file path.
    filenames = get_filenames(is_training, data_dir)
    # Split the file based on the first dimension.
    dataset = tf.data.Dataset.from_tensor_slices(filenames)
    if input_context:
        # Obtain the number of devices and device IDs to support data parallel training.
        ############## npu modify begin #############
        dataset = dataset.shard(get_rank_size(), get_rank_id())
        ############## npu modify end ###############
        # The original code for data parallel training has been commented out.
        # tf.compat.v1.logging.info(
        #     'Sharding the dataset: input_pipeline_id=%d num_input_pipelines=%d' % (
        #         input_context.input_pipeline_id, input_context.num_input_pipelines))
        # dataset = dataset.shard(input_context.num_input_pipelines,
        #                         input_context.input_pipeline_id)
    if is_training:
        # Shuffle the input files.
        dataset = dataset.shuffle(buffer_size=_NUM_TRAIN_FILES)
    # cycle_length=10: read and deserialize 10 files in parallel. You can increase the
    # value if the CPU resources are sufficient.
    dataset = dataset.interleave(
        tf.data.TFRecordDataset,
        cycle_length=10,
        num_parallel_calls=tf.data.experimental.AUTOTUNE)
    return resnet_run_loop.process_record_dataset(
        dataset=dataset,
        is_training=is_training,
        batch_size=batch_size,
        shuffle_buffer=_SHUFFLE_BUFFER,
        parse_record_fn=parse_record_fn,
        num_epochs=num_epochs,
        dtype=dtype,
        datasets_num_private_threads=datasets_num_private_threads,
        drop_remainder=drop_remainder,
        tf_data_experimental_slack=tf_data_experimental_slack,
    )

3. In input_fn() for training or testing scenarios, drop_remainder must be set to True.

Modify resnet_main() in official/r1/resnet/resnet_run_loop.py. (The input_fn_train() and input_fn_eval() subfunctions are modified.)

############## npu modify begin #############
def input_fn_train(num_epochs, input_context=None):
    # Use dtype=tf.float16 to improve data transfer performance.
    # In the current version, drop_remainder can only be set to True.
    # batch_size indicates the batch size of a single device instead of the global batch size.
    return input_function(
        is_training=True,
        data_dir=flags_obj.data_dir,
        batch_size=flags_obj.batch_size,
        num_epochs=num_epochs,
        dtype=tf.float16,
        input_context=input_context,
        drop_remainder=True)
def input_fn_eval():
    # Use dtype=tf.float16 to improve data transfer performance.
    # In the current version, drop_remainder can only be set to True.
    # batch_size indicates the batch size of a single device instead of the global batch size.
    return input_function(
        is_training=False,
        data_dir=flags_obj.data_dir,
        batch_size=flags_obj.batch_size,
        num_epochs=1,
        dtype=tf.float16,
        # A truthy input_context makes input_fn() shard the evaluation data by rank.
        input_context=True,
        drop_remainder=True)
############## npu modify end ###############
# The original input_fn() functions for training and validation are as follows.
# def input_fn_train(num_epochs, input_context=None):
#     return input_function(
#         is_training=True,
#         data_dir=flags_obj.data_dir,
#         batch_size=distribution_utils.per_replica_batch_size(
#             flags_obj.batch_size, flags_core.get_num_gpus(flags_obj)),
#         num_epochs=num_epochs,
#         dtype=flags_core.get_tf_dtype(flags_obj),
#         datasets_num_private_threads=flags_obj.datasets_num_private_threads,
#         input_context=input_context)
#
# def input_fn_eval():
#     return input_function(
#         is_training=False,
#         data_dir=flags_obj.data_dir,
#         batch_size=distribution_utils.per_replica_batch_size(
#             flags_obj.batch_size, flags_core.get_num_gpus(flags_obj)),
#         num_epochs=1,
#         dtype=flags_core.get_tf_dtype(flags_obj))
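The two NPU changes in this section, sharding the file list by rank and treating batch_size as a per-device value, can be illustrated in pure Python. The sketch below mimics the element selection of Dataset.shard (keep the elements whose position modulo num_shards equals index) and does not call hccl or TensorFlow. The file names, device count, and helper names are hypothetical.

```python
def shard(items, num_shards, index):
    """Keep the items whose position i satisfies i % num_shards == index."""
    return [item for i, item in enumerate(items) if i % num_shards == index]


def global_batch_size(per_device_batch_size, rank_size):
    """Samples consumed per step across all devices in data-parallel training."""
    return per_device_batch_size * rank_size


# Hypothetical setup: 8 input files, 4 devices, batch size 32 per device.
files = ["train-%05d" % i for i in range(8)]
rank_size = 4  # stand-in for get_rank_size()
for rank_id in range(rank_size):  # stand-in for get_rank_id() on each device
    print(rank_id, shard(files, rank_size, rank_id))
print(global_batch_size(32, rank_size))  # 128
```

Each device sees a disjoint quarter of the files, and a per-device batch size of 32 corresponds to 128 samples per step over the 4 devices. This is why the migrated code passes flags_obj.batch_size directly instead of dividing it by the device count as the original per_replica_batch_size call does.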

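7.1.5 Model construction

Model construction is the same as in the original model. Some code is modified to improve computing performance. The sample code in this section shows the modifications.
Defining the model function
The following uses a model function built on ImageNet as an example. The related APIs are as follows.
(To be continued)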
