reduce_grads = average_gradients(tower_grads)
opt = tf.train.SyncReplicasOptimizer(
    opt_gpu,
    replicas_to_aggregate=num_workers,
    total_num_replicas=num_workers,
    name="sync_replicas")
apply_gradient_op = opt.apply_gradients(reduce_grads, global_step=global_step)
...
hooks = [opt.make_session_run_hook((FLAGS.task_index == 0), num_tokens=0),
         tf.train.StopAtStepHook(last_step=1000000),
         tf.train.LoggingTensorHook(tensors={'step': global_step, 'loss': total_loss},
                                    every_n_iter=10)]
...
with tf.train.MonitoredTrainingSession(master=server.target,
                                       is_chief=(FLAGS.task_index == 0),
                                       checkpoint_dir="/weixue/my_bench/train_logs",
                                       hooks=hooks,
                                       scaffold=scaffold,
                                       config=config) as mon_sess:
Issues noted:
1. The ordering of opt (SyncReplicasOptimizer must wrap the base optimizer before apply_gradients is called).
2. Variable initialization.
3. The hook arguments gained num_tokens=0.
This is supposed to be executed in the beginning of the chief/sync thread so that even if the total_num_replicas is less than replicas_to_aggregate, the model can still proceed as the replicas can compute multiple steps per variable update. Make sure:
`num_tokens >= replicas_to_aggregate - total_num_replicas`.
When replicas_to_aggregate == total_num_replicas, do not add any extra tokens to the queue (i.e. num_tokens=0).
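As a quick sanity check on the token-queue rule above, here is a minimal helper (hypothetical, not part of the TensorFlow API) that computes the smallest valid num_tokens from the constraint `num_tokens >= replicas_to_aggregate - total_num_replicas`:

```python
def min_num_tokens(replicas_to_aggregate, total_num_replicas):
    """Smallest num_tokens satisfying
    num_tokens >= replicas_to_aggregate - total_num_replicas,
    clamped at 0 so that nothing extra is queued when the
    two counts are equal (the case in the snippet above)."""
    return max(0, replicas_to_aggregate - total_num_replicas)

# Every worker participates in aggregation: no extra tokens needed.
assert min_num_tokens(4, 4) == 0
# Fewer live replicas than aggregation slots: back-fill the queue
# so replicas can run multiple steps per variable update.
assert min_num_tokens(6, 4) == 2
```

In the snippet above, replicas_to_aggregate and total_num_replicas are both num_workers, which is why num_tokens=0 is passed to make_session_run_hook.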