OpenStack mixed-version deployment causes nova-compute to hang on startup

itachi-uchiha

Category: Cloud Computing-IaaS. Tags: openstack

Article link: https://blog.csdn.net/avatar_2009/article/details/108069944

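In my OpenStack cluster, I added a new compute node, and the nova-compute service hung on startup. The existing compute nodes did not have this problem, which suggests that the cluster as a whole is healthy and that the fault lies with the newly added compute node.

/var/log/nova/nova-compute.log shows the following errors: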

ERROR oslo.messaging._drivers.impl_rabbit [-] [1e21a744-9754-44d3-907b-92e72efdcd7d] AMQP server on controller-150:5672 is unreachable: [Errno 104] Connection reset by peer. Trying again in 1 seconds.: error: [Errno 104] Connection reset by peer

ERROR oslo.messaging._drivers.impl_rabbit [-] [1e21a744-9754-44d3-907b-92e72efdcd7d] AMQP server on controller-150:5672 is unreachable: [Errno 111] ECONNREFUSED. Trying again in 2 seconds.: error: [Errno 111] ECONNREFUSED

INFO oslo.messaging._drivers.impl_rabbit [-] A recoverable connection/channel error occurred, trying to reconnect: [Errno 104] Connection reset by peer

ERROR oslo.messaging._drivers.impl_rabbit [-] Connection failed: [Errno 111] ECONNREFUSED (retrying in 2.0 seconds): error: [Errno 111] ECONNREFUSED

INFO oslo.messaging._drivers.impl_rabbit [-] [1e21a744-9754-44d3-907b-92e72efdcd7d] Reconnected to AMQP server on controller-150:5672 via [amqp] client with port 40872.

WARNING nova.conductor.api [req-9a93ad73-7269-4375-9f67-987d98223d4d - - - - -] Timed out waiting for nova-conductor. Is it running? Or did this service start before nova-conductor? Reattempting establishment of nova-conductor connection...: MessagingTimeout: Timed out waiting for a reply to message ID b0de8895bd074645a7d7f6058fc5d8cf

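Judging from the log, the new node cannot reach nova-conductor on the controller node.

The nova-conductor and RabbitMQ logs on the controller node were flooding with the entries below. The process that owns port 43602 is nova-conductor, so nova-conductor keeps trying to communicate through RabbitMQ and keeps failing.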

/var/log/nova/nova-conductor.log

ERROR oslo.messaging._drivers.impl_rabbit [req-602672bc-7325-41ad-ac69-2b735b07c875 - - - - -] [73043348-2b34-44b3-8aba-7fa9769b64e0] AMQP server on controller-150:5672 is unreachable: [Errno 104] Connection reset by peer. Trying again in 1 seconds.: error: [Errno 104] Connection reset by peer

/var/log/rabbitmq/rabbit@controller-150.log

=ERROR REPORT==== 17-Aug-2020::20:53:24 ===
Channel error on connection <0.22032.9> (172.5.1.150:43602 -> 172.5.1.150:5672, vhost: '/', user: 'openstack'), channel 1:
operation basic.publish caused a channel exception not_found: no exchange 'reply_7fab30efd0cf4889a35e402ebf0c18ba' in vhost '/'
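
The step of tying the client port in the RabbitMQ log to a process can be sketched roughly as follows. This is only an illustration: the helper names `client_port_from_channel_error` and `missing_exchange_from_error` are hypothetical, and the commented commands assume that `ss` and `rabbitmqctl` are available on the controller node.

```shell
# Hypothetical helpers for reading the RabbitMQ channel-error lines shown above.

# Read log lines on stdin and print the client-side TCP port of a
# "Channel error on connection <...> (IP:PORT -> IP:PORT, ...)" line.
client_port_from_channel_error() {
    sed -n 's/.*(\([0-9.]*\):\([0-9]*\) -> .*/\2/p'
}

# Read log lines on stdin and print the exchange name from a
# "not_found: no exchange '...'" line.
missing_exchange_from_error() {
    sed -n "s/.*no exchange '\([^']*\)'.*/\1/p"
}

# Possible usage on the controller node (not executed here):
#   port=$(grep 'Channel error' /var/log/rabbitmq/rabbit@controller-150.log \
#            | tail -n 1 | client_port_from_channel_error)
#   ss -tnp | grep ":$port "                          # which process owns that port
#   rabbitmqctl list_exchanges name | grep '^reply_'  # which reply exchanges exist
```

If the port belongs to nova-conductor and the `reply_...` exchange it publishes to is absent, that matches the picture in the logs: the conductor is answering a caller whose reply exchange does not exist.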

Searching online turned up two commonly suggested fixes:

Restart the openstack-nova-conductor service. This did not help.
Reinstall rabbitmq-server. This did not help either.

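In the end I went back to my original inference: the problem is on the newly added compute node.

However, the configuration files on that node come from a template and are identical to those on the other compute nodes, so a configuration error can be ruled out.

Since the configuration is fine, the problem may lie in the installed packages:

The existing cluster runs OpenStack Queens, but the new node was initially deployed with Train. When problems appeared, I removed Train and reinstalled Queens. Evidently the removal was not clean.

Conclusion: all nodes in an OpenStack cluster must run exactly the same release. Mixing releases leads to many baffling errors.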
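
One way to check which release a node is actually running is to look at the installed Nova package version. The sketch below is an assumption-laden illustration rather than the method used in this post: `nova_release_name` is a hypothetical helper, the package name `openstack-nova-compute` assumes an RPM-based install, and the table covers only Queens through Train.

```shell
# Hypothetical helper: map a Nova package version (e.g. "17.0.13")
# to its OpenStack release name, using the major version number.
nova_release_name() {
    case "${1%%.*}" in
        17) echo queens ;;
        18) echo rocky ;;
        19) echo stein ;;
        20) echo train ;;
        *)  echo unknown ;;
    esac
}

# Possible usage on each compute node (not executed here):
#   ver=$(rpm -q --qf '%{VERSION}\n' openstack-nova-compute)
#   nova_release_name "$ver"
#   rpm -qa | grep -Ei 'openstack|nova|oslo' | sort   # look for leftover packages
```

Comparing this output between a working compute node and the new one, together with the full package list, may reveal packages left over from the other release.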

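Solution:

1. Uninstalling with yum erase package_name did not remove everything. I had expected this to remove the package's dependencies as well.

2. Run yum history list openstack-nova-compute to see every yum transaction involving openstack-nova-compute, then run yum history undo ID to roll back the earliest transaction, the one that installed Train. This removes the packages cleanly. Finally, reinstall. Problem solved.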
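
The rollback in step 2 can be sketched as a few small shell functions. The function names and the `DRY_RUN` switch are hypothetical conveniences added here so that the commands can be previewed before running them. The transaction ID must be read from the `yum history list` output on the affected node.

```shell
# Print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Show the yum transactions that touched a package.
show_history() {
    run yum history list "$1"
}

# Inspect one transaction, then roll it back.
# $1 is the transaction ID reported by `yum history list`.
undo_transaction() {
    run yum history info "$1"
    run yum history undo "$1"
}

# Possible usage (not executed here):
#   show_history openstack-nova-compute
#   DRY_RUN=1 undo_transaction <ID>   # preview the commands first
#   undo_transaction <ID>             # <ID> = transaction that installed Train
```

Running `yum history info` before the undo shows which packages the transaction installed, which helps confirm that it is the Train installation and not a later one.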