When running distributed training of a model across multiple machines with multiple GPUs each:
The launch script is:
CUDA_VISIBLE_DEVICES='' nohup python -u tf_cnn_benchmarks.py --batch_size=2048 --data_dir=// --data_name=imagenet --model=alexnet --num_batches=100 --num_gpus=4 --train_dir=// --ps_hosts=wx-test2:50000,wx-test:50000 --worker_hosts=wx-test2:50001,wx-test:50001 --task_index=0 --job_name=ps > alexnet/8gpu/8gpu_ps0.txt 2>&1 &
sleep 5
nohup python -u tf_cnn_benchmarks.py --batch_size=2048 --data_dir=// --data_name=imagenet --model=alexnet --num_batches=100 --num_gpus=4 --train_dir=/weixue/new/scripts/tf_cnn_benchmarks/alexnet/8gpu/ --ps_hosts=wx-test2:50000,wx-test:50000 --worker_hosts=wx-test:50001,wx-test2:50001 --task_index=0 --job_name=worker > alexnet/8gpu/8gpu_worker0.txt 2>&1 &
worker0 keeps reporting the following error:
tensorflow.python.framework.errors_impl.InvalidArgumentError: /job:ps/r
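
One detail worth checking in the launch script: the ps command passes --worker_hosts=wx-test2:50001,wx-test:50001, while the worker command passes --worker_hosts=wx-test:50001,wx-test2:50001. Since --task_index is a position in these lists, every process is normally expected to receive the same lists in the same order. The error message is cut off, so whether this mismatch is the actual cause is not confirmed here. Below is a minimal Python sketch for catching this kind of inconsistency before launching. The helper names cluster_flags and find_mismatches are hypothetical (they are not part of tf_cnn_benchmarks), and the command strings are abbreviated to the flags that matter.

```python
import shlex

# Flags that describe the cluster and must agree across all processes.
CLUSTER_FLAGS = ("ps_hosts", "worker_hosts")


def cluster_flags(command):
    """Extract the cluster flags from a launch command as ordered host lists."""
    flags = {}
    for token in shlex.split(command):
        for name in CLUSTER_FLAGS:
            prefix = "--" + name + "="
            if token.startswith(prefix):
                flags[name] = token[len(prefix):].split(",")
    return flags


def find_mismatches(commands):
    """Compare every task against the first one.

    Returns a list of (flag_name, reference_task, other_task) tuples for each
    flag whose ordered host list differs from the reference task's list.
    """
    parsed = {task: cluster_flags(cmd) for task, cmd in commands.items()}
    tasks = list(parsed)
    reference = tasks[0]
    mismatches = []
    for task in tasks[1:]:
        for name in CLUSTER_FLAGS:
            if parsed[task].get(name) != parsed[reference].get(name):
                mismatches.append((name, reference, task))
    return mismatches


# Abbreviated copies of the two commands from the launch script above.
commands = {
    "ps0": "python -u tf_cnn_benchmarks.py "
           "--ps_hosts=wx-test2:50000,wx-test:50000 "
           "--worker_hosts=wx-test2:50001,wx-test:50001 "
           "--task_index=0 --job_name=ps",
    "worker0": "python -u tf_cnn_benchmarks.py "
               "--ps_hosts=wx-test2:50000,wx-test:50000 "
               "--worker_hosts=wx-test:50001,wx-test2:50001 "
               "--task_index=0 --job_name=worker",
}

print(find_mismatches(commands))  # [('worker_hosts', 'ps0', 'worker0')]
```

With the two commands as posted, the check flags worker_hosts only; ps_hosts is identical in both. Making the lists match would be a reasonable first thing to try, though it is an assumption that this resolves the error.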