TensorFlow from single-machine to distributed: tf.train.SyncReplicasOptimizer + MonitoredTrainingSession

# Average the per-tower gradients, then wrap the base optimizer so that
# updates are aggregated synchronously across all workers.
reduce_grads = average_gradients(tower_grads)
opt = tf.train.SyncReplicasOptimizer(
    opt_gpu,
    replicas_to_aggregate=num_workers,
    total_num_replicas=num_workers,
    name="sync_replicas")
apply_gradient_op = opt.apply_gradients(reduce_grads, global_step=global_step)
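
For reference, average_gradients is not shown in this post. A typical implementation (this sketch follows the pattern from the TensorFlow multi-GPU CIFAR-10 example; the actual version used here may differ) averages each variable's gradient across the towers:

def average_gradients(tower_grads):
    # tower_grads: list over towers, each a list of (gradient, variable) pairs.
    average_grads = []
    for grad_and_vars in zip(*tower_grads):
        # grad_and_vars is ((grad0, var), (grad1, var), ...) for one variable.
        grads = [tf.expand_dims(g, 0) for g, _ in grad_and_vars]
        grad = tf.reduce_mean(tf.concat(grads, 0), 0)
        average_grads.append((grad, grad_and_vars[0][1]))
    return average_grads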

...

# The sync hook must come from the same SyncReplicasOptimizer instance used in
# apply_gradients; only the chief (task_index == 0) manages the token queue.
hooks = [opt.make_session_run_hook((FLAGS.task_index == 0), num_tokens=0),
         tf.train.StopAtStepHook(last_step=1000000),
         tf.train.LoggingTensorHook(tensors={'step': global_step, 'loss': total_loss},
                                    every_n_iter=10)]

...
with tf.train.MonitoredTrainingSession(master=server.target,
                                       is_chief=(FLAGS.task_index == 0),
                                       checkpoint_dir="/weixue/my_bench/train_logs",
                                       hooks=hooks,
                                       scaffold=scaffold,
                                       config=config) as mon_sess:
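
Inside the session, the training loop just runs the train op until one of the hooks (here StopAtStepHook) signals a stop; a minimal sketch, assuming apply_gradient_op from above is the train op:

    while not mon_sess.should_stop():
        mon_sess.run(apply_gradient_op)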
Problems encountered:

1. The order of the optimizers: tf.train.SyncReplicasOptimizer wraps the base optimizer (opt_gpu above), so apply_gradients and make_session_run_hook must both be called on the wrapper, not on the base optimizer; see the sketch below.
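
A minimal sketch of the ordering, assuming a plain gradient-descent base optimizer (the concrete opt_gpu is not shown in this post):

opt_gpu = tf.train.GradientDescentOptimizer(learning_rate)  # assumed base optimizer
opt = tf.train.SyncReplicasOptimizer(opt_gpu,
                                     replicas_to_aggregate=num_workers,
                                     total_num_replicas=num_workers)
# Correct: apply through the sync wrapper so gradients are aggregated.
train_op = opt.apply_gradients(reduce_grads, global_step=global_step)
# Calling opt_gpu.apply_gradients(...) directly would bypass synchronization.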

2. Variable initialization: SyncReplicasOptimizer creates extra bookkeeping state (local step variables and the sync token queue) that the default init op does not cover, so the Scaffold passed to MonitoredTrainingSession has to wire in the optimizer's own init ops; see the sketch below.
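
A minimal Scaffold wiring, using the init ops that SyncReplicasOptimizer exposes (these attributes only exist after apply_gradients has been called):

is_chief = (FLAGS.task_index == 0)
scaffold = tf.train.Scaffold(
    # The chief runs the full sync init; workers only init their local step.
    local_init_op=opt.chief_init_op if is_chief else opt.local_step_init_op,
    ready_for_local_init_op=opt.ready_for_local_init_op)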

3. The hook gained the argument num_tokens=0. From the TensorFlow documentation on the initial-token op:

This is supposed to be executed in the beginning of the chief/sync thread so that even if the total_num_replicas is less than replicas_to_aggregate, the model can still proceed as the replicas can compute multiple steps per variable update. Make sure:
`num_tokens >= replicas_to_aggregate - total_num_replicas`.

When replicas_to_aggregate == total_num_replicas, do not push extra tokens into the queue: the constraint above reduces to num_tokens >= 0, so num_tokens=0 is the right choice here.
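
For reference, the hook ultimately calls opt.get_init_tokens_op(num_tokens) on the chief. With the default num_tokens=-1 it enqueues replicas_to_aggregate tokens at startup, whereas 0 produces a no-op and leaves the token queue empty:

# Equivalent of what make_session_run_hook does on the chief (sketch):
init_tokens_op = opt.get_init_tokens_op(num_tokens=0)  # no-op when num_tokens == 0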
