A note on how I solved this problem.
Environment:
tensorflow: tensorflow-gpu 1.14.0
horovod: 0.19.5
python: 3.7.9
CMake: 3.21.1
NCCL installed: yes
NCCL version: nccl_2.6.4-1+cuda10.0_x86_64
CUDA: 10.0
cuDNN: 7.6.5.32
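Running the official examples gave me an error of this kind. In my case it was "TypeError: DistributedOptimizer() got an unexpected keyword argument 'gradient_predivide_factor'". I found a likely fix on GitHub. The Horovod author answered it personally, so I consider it reliable. The issue concerns a different keyword argument, backward_passes_per_step, but the error is of the same type.
First, the issue link: https://github.com/horovod/horovod/issues/774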
Hey @lakshmiumenon, the parameter should be backward_passes_per_step with an additional underscore. That's one possible cause of this error (assuming that wasn't just a transcription error). That parameter was added very recently, and those changes haven't yet been packaged into a release yet, so it's possible that your installed version of Horovod is behind the version of the examples you're using.
I'd suggest checking out the version of the examples that is the same as your version of Horovod. For example, if you installed Horovod v0.15.2, then you could checkout that tag in your examples repo: git checkout v0.15.2.
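That is the author's reply. In short, the installed Horovod is older than the examples being run, so a recently added parameter is not recognized. The same applies to my case: Horovod 0.19.5 does not accept gradient_predivide_factor.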
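This diagnosis can be checked without starting a training run by inspecting the function signature. The following is a minimal sketch using only the standard-library inspect module. The helper names accepts_kwarg and filter_kwargs, and the stand-in function old_distributed_optimizer, are hypothetical and are not part of Horovod.

```python
import inspect


def accepts_kwarg(func, name):
    """Return True if `func` can be called with the keyword argument `name`."""
    params = inspect.signature(func).parameters
    keyword_kinds = (
        inspect.Parameter.POSITIONAL_OR_KEYWORD,
        inspect.Parameter.KEYWORD_ONLY,
    )
    if name in params and params[name].kind in keyword_kinds:
        return True
    # A **kwargs catch-all accepts any keyword.
    return any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values())


def filter_kwargs(func, **kwargs):
    """Keep only the keyword arguments that `func` accepts."""
    return {k: v for k, v in kwargs.items() if accepts_kwarg(func, k)}


def old_distributed_optimizer(optimizer, name=None, use_locking=False):
    """Hypothetical stand-in for an older release that lacks the new parameter."""
    return optimizer


wanted = {"name": "opt", "gradient_predivide_factor": 1.0}
accepted = filter_kwargs(old_distributed_optimizer, **wanted)
print("accepted:", accepted)
print("rejected:", sorted(set(wanted) - set(accepted)))
```

With Horovod installed, the same check would presumably be accepts_kwarg(hvd.DistributedOptimizer, "gradient_predivide_factor") after import horovod.tensorflow as hvd. Silently dropping an argument changes what the example does, so filter_kwargs is meant for diagnosis rather than as a fix.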
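Solution: upgrade Horovod to the latest version. Alternatively, as the reply suggests, check out the examples at the tag that matches the installed Horovod version.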
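Either way, it helps to know which Horovod version is actually installed. The following is a minimal sketch, assuming Python 3.8 or newer for importlib.metadata (on the 3.7.9 environment above it falls back to the importlib_metadata backport, which is assumed to be installed). The names installed_version and checkout_command are hypothetical helpers, and the v-prefixed tag format is taken from the git checkout v0.15.2 example in the reply.

```python
try:
    from importlib import metadata  # Python 3.8+
except ImportError:
    import importlib_metadata as metadata  # backport for Python 3.7 (assumed installed)


def installed_version(dist_name):
    """Return the installed version string of a distribution, or None if absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def checkout_command(version):
    """Build the git command for the examples tag matching `version`."""
    return "git checkout v{}".format(version)


version = installed_version("horovod")
if version is None:
    print("horovod is not installed in this environment")
else:
    print("installed horovod:", version)
    print("matching examples:", checkout_command(version))
```

Run inside the same Python environment used for training, so the reported version is the one the examples will import.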