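In my OpenStack cluster, I added a new compute node, and the nova-compute service hung on startup. The existing compute nodes did not have this problem, which suggests that the cluster as a whole is fine and that the problem lies with the newly added compute node.
The following errors show up in /var/log/nova/nova-compute.log: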
ERROR oslo.messaging._drivers.impl_rabbit [-] [1e21a744-9754-44d3-907b-92e72efdcd7d] AMQP server on controller-150:5672 is unreachable: [Errno 104] Connection reset by peer. Trying again in 1 seconds.: error: [Errno 104] Connection reset by peer
ERROR oslo.messaging._drivers.impl_rabbit [-] [1e21a744-9754-44d3-907b-92e72efdcd7d] AMQP server on controller-150:5672 is unreachable: [Errno 111] ECONNREFUSED. Trying again in 2 seconds.: error: [Errno 111] ECONNREFUSED
INFO oslo.messaging._drivers.impl_rabbit [-] A recoverable connection/channel error occurred, trying to reconnect: [Errno 104] Connection reset by peer
ERROR oslo.messaging._drivers.impl_rabbit [-] Connection failed: [Errno 111] ECONNREFUSED (retrying in 2.0 seconds): error: [Errno 111] ECONNREFUSED
INFO oslo.messaging._drivers.impl_rabbit [-] [1e21a744-9754-44d3-907b-92e72efdcd7d] Reconnected to AMQP server on controller-150:5672 via [amqp] client with port 40872.
WARNING nova.conductor.api [req-9a93ad73-7269-4375-9f67-987d98223d4d - - - - -] Timed out waiting for nova-conductor. Is it running? Or did this service start before nova-conductor? Reattempting establishment of nova-conductor connection...: MessagingTimeout: Timed out waiting for a reply to message ID b0de8895bd074645a7d7f6058fc5d8cf
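When a log like this scrolls by quickly, it can help to tally which failures dominate. The following is only a minimal sketch, not part of the original troubleshooting: the function name `summarize` is made up for illustration, and it assumes one log entry per line.

```python
import re
from collections import Counter

# Matches the "[Errno 104]" / "[Errno 111]" markers seen in the log above.
ERRNO_RE = re.compile(r"\[Errno (\d+)\]")


def summarize(lines):
    """Count log entries per errno, plus entries that mention MessagingTimeout.

    `summarize` is a hypothetical helper name, not an OpenStack API.
    """
    counts = Counter()
    for line in lines:
        # An entry usually repeats its errno twice; count it once per entry.
        for errno in set(ERRNO_RE.findall(line)):
            counts["errno " + errno] += 1
        if "MessagingTimeout" in line:
            counts["MessagingTimeout"] += 1
    return counts
```

Passing it an open file handle for /var/log/nova/nova-compute.log should work, since a file object iterates line by line. In the excerpt above, errno 104 (connection reset) and errno 111 (connection refused) alternate until the client reconnects, after which the RPC call to nova-conductor still times out.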
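Judging from the log, nova-compute cannot get a reply from nova-conductor on the controller node.
On the controller node, the nova-conductor and rabbitmq logs were being flooded with the entries below, and port 43602 belongs to the nova-conductor process. In other words, nova-conductor kept trying to communicate through rabbitmq and kept failing.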
/var/log/nova/nova-conductor.log
ERROR oslo.messaging._drivers.impl_rabbit [req-602672bc-7325-41ad-ac69-2b735b07c875 - - - - -] [73043348-2b34-44b3-8aba-7fa9769b64e0] AMQP server on controller-150:5672 is unreachable: [Errno 104] Connection reset by peer. Trying again in 1 seconds.: error: [Errno 104] Connection reset by peer
/var/log/rabbitmq/rabbit@controller-150.log
=ERROR REPORT==== 17-Aug-2020::20:53:24 ===
Channel error on connection <0.22032.9> (172.5.1.150:43602 -> 172.5.1.150:5672, vhost: '/', user: 'openstack'), channel 1:
operation basic.publish caused a channel exception not_found: no exchange 'reply_7fab30efd0cf4889a35e402ebf0c18ba' in vhost '/'
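The RabbitMQ report carries the two facts used in this diagnosis: the client-side port (43602) and the exchange that could not be found. As a rough sketch, they can be pulled out with a regular expression. The helper name `parse_channel_error` is invented here, and the pattern is only fitted to the report format shown above, so other RabbitMQ versions may need adjustments.

```python
import re

# Fitted to the "Channel error on connection" report shown above (an assumption
# about the format; not an official RabbitMQ log grammar).
CONNECTION_RE = re.compile(
    r"\((?P<src>[\d.]+):(?P<src_port>\d+) -> (?P<dst>[\d.]+):(?P<dst_port>\d+), "
    r"vhost: '(?P<vhost>[^']*)', user: '(?P<user>[^']*)'\)"
)
EXCHANGE_RE = re.compile(r"no exchange '(?P<exchange>[^']+)' in vhost")


def parse_channel_error(report):
    """Return connection details and the missing exchange, or None if absent.

    `parse_channel_error` is a hypothetical helper for illustration only.
    """
    conn = CONNECTION_RE.search(report)
    exch = EXCHANGE_RE.search(report)
    if conn is None or exch is None:
        return None
    info = conn.groupdict()
    info["src_port"] = int(info["src_port"])
    info["dst_port"] = int(info["dst_port"])
    info["exchange"] = exch.group("exchange")
    return info
```

The source port can then be matched to a process on the controller; something like `ss -tnp` is one way to do that, although the post does not say which tool was actually used. The `reply_` prefix on the exchange name suggests nova-conductor was publishing an RPC reply to a destination that no longer existed, which fits a caller that keeps dropping and re-creating its connection.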
After searching around, I found many people suggesting:
- Restart the openstack-nova-conductor service. That did not help.
- Reinstall rabbitmq-server. That did not help either.
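In the end I went back to my initial inference: the problem is on the newly added compute node.
However, the configuration files on that node come from a template and are identical to those on the other compute nodes, so a configuration error can be ruled out.
If the configuration is fine, the installed packages are the likely culprit:
The existing cluster runs OpenStack Queens, but the new node was first deployed with Train. When problems appeared, I removed Train and reinstalled Queens. Apparently the removal was not clean.
Conclusion: every node in an OpenStack cluster must run exactly the same release. Mixing releases leads to all kinds of baffling errors.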
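A quick consistency check across nodes could have caught this earlier. The sketch below is an illustration under assumptions: the node names and version strings in the usage note are hypothetical, and it expects the versions to have been collected beforehand, for example with `rpm -q --qf '%{VERSION}\n' openstack-nova-compute` on each node. For reference, nova 17.x corresponds to Queens and nova 20.x to Train.

```python
from collections import Counter


def major_version(version):
    """Return the leading numeric component of a version such as '17.0.13'."""
    return int(version.split(".", 1)[0])


def find_mismatched_nodes(versions):
    """Return {node: major} for nodes whose major version differs from the majority.

    `versions` maps a node name to a package version string. Both function
    names are hypothetical helpers, not part of any OpenStack tooling.
    """
    majors = {node: major_version(v) for node, v in versions.items()}
    if not majors:
        return {}
    # Treat the most common major version in the cluster as the reference.
    reference = Counter(majors.values()).most_common(1)[0][0]
    return {node: m for node, m in majors.items() if m != reference}
```

For example, with made-up data such as `{"compute-1": "17.0.13", "compute-2": "17.0.13", "compute-new": "20.6.1"}`, only `compute-new` is reported. Since the leftovers in this case were most likely dependencies rather than nova itself, the same comparison is worth repeating for other packages, such as the oslo libraries.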
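Solution:
1. Uninstalling with yum erase package_name did not clean everything up. I had expected it to remove the package's dependencies as well, but by default yum leaves behind the dependencies that were installed along with the package.
2. Run yum history list openstack-nova-compute to list every yum transaction involving openstack-nova-compute, then run yum history undo ID to roll back the earliest transaction, the one that installed Train. That removes it cleanly. Finally, reinstall. Problem solved.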
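The rollback procedure above can be sketched as a small script. This is only an outline under assumptions: the helper names are invented, the transaction ID has to be read manually from the `yum history list` output, and `yum install openstack-nova-compute` is assumed to be the reinstall step, which the post does not spell out. By default it only prints the commands.

```python
import subprocess

PACKAGE = "openstack-nova-compute"


def list_history_command(package=PACKAGE):
    """Command that lists the yum transactions involving the package."""
    return ["yum", "history", "list", package]


def undo_and_reinstall_commands(transaction_id, package=PACKAGE):
    """Commands that roll back one transaction and then reinstall the package.

    The reinstall command is an assumption; the original post only says
    "reinstall" without giving the exact command.
    """
    return [
        ["yum", "history", "undo", str(transaction_id)],
        ["yum", "install", package],
    ]


def run(commands, dry_run=True):
    """Print each command, and execute it only when dry_run is False."""
    for cmd in commands:
        print(" ".join(cmd))
        if not dry_run:
            # yum asks for confirmation interactively; check=True stops on failure.
            subprocess.run(cmd, check=True)
```

Keeping the dry run as the default is deliberate: undoing a yum transaction removes everything that transaction installed, so it is worth reading the printed commands and the transaction details before running them for real.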