reduce_grads = average_gradients(tower_grads)
opt = tf.train.SyncReplicasOptimizer(
    opt_gpu,
    replicas_to_aggregate=num_workers,
    total_num_replicas=num_workers,
    name="sync_replicas")
apply_gradient_op = opt.apply_gradients(reduce_grads, global_step=global_step)
...
hooks = [opt.make_session_run_hook((FLAGS.task_index == 0), num_tokens=0),
         tf.train.StopAtStepHook(last_step=1000000),
         tf.train.LoggingTensorHook(tensors={'step': global_step, 'loss': total_loss},
                                    every_n_iter=10)]
...
with tf.train.MonitoredTrainingSession(master=server.target,
                                       is_chief=(FLAGS.task_index == 0),
                                       checkpoint_dir="/weixue/my_bench/train_logs",
                                       hooks=hooks,
                                       scaffold=scaffold,
                                       config=config) as mon_sess:
Issues noted:
1. The ordering of opt (SyncReplicasOptimizer must wrap the base optimizer before apply_gradients is called).
2. Variable initialization.
3. The hook arguments gained num_tokens=0.
This is supposed to be executed in the beginning of the chief/sync thread so that even if the total_num_replicas is less than replicas_to_aggregate, the model can still proceed as the replicas can compute multiple steps per variable update. Make sure:
`num_tokens >= replicas_to_aggregate - total_num_replicas`.
When replicas_to_aggregate == total_num_replicas, do not add any extra tokens to the queue (i.e. num_tokens=0).
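As a quick sanity check on the token-queue rule above, here is a minimal helper (hypothetical, not part of the TensorFlow API) that computes the smallest valid num_tokens from the constraint `num_tokens >= replicas_to_aggregate - total_num_replicas`:

```python
def min_num_tokens(replicas_to_aggregate, total_num_replicas):
    """Smallest num_tokens satisfying
    num_tokens >= replicas_to_aggregate - total_num_replicas,
    clamped at 0 so that nothing extra is queued when the
    two counts are equal (the case in the snippet above)."""
    return max(0, replicas_to_aggregate - total_num_replicas)

# Every worker participates in aggregation: no extra tokens needed.
assert min_num_tokens(4, 4) == 0
# Fewer live replicas than aggregation slots: back-fill the queue
# so replicas can run multiple steps per variable update.
assert min_num_tokens(6, 4) == 2
```

In the snippet above, replicas_to_aggregate and total_num_replicas are both num_workers, which is why num_tokens=0 is passed to make_session_run_hook.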