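On a running Ceph cluster, a routine inspection found that an OSD had gone down. Starting it by hand worked, and data recovery proceeded normally, but after running for 1-2 minutes the OSD went down again. Checking the OSD status showed: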
7f0231d85d80 -1 osd.2 199 log_to_monitors {default=true}
7f021f689700 -1 received signal: Interrupt from Kernel ( Could be generated by pthread_kill(), raise(), abort(), alarm() ) UID: 0
7f021f689700 -1 osd.2 237 *** Got signal Interrupt ***
7f021f689700 -1 osd.2 237 shutdown
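The rest of the OSD log consisted of data block transfer messages, with no errors and no warnings.
Several restarts of the OSD ended the same way. Disk space, memory, and CPU usage on the host all looked normal.
The key line in the process log is "signal: Interrupt from Kernel": it looked as if the kernel had killed the process. I stared at it for a long time and got nowhere.
Next I checked the Ceph cluster log. There were no errors, but there was one warning: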
cluster [INF] Health check cleared: OSD_DOWN (was: 1 osds down)
cluster [INF] Health check cleared: OSD_HOST_DOWN (was: 1 host (1 osds) down)
cluster [INF] osd.2 192.168.1.203:6800/66004 boot
cluster [DBG] osdmap e243: 3 total, 3 up, 3 in
cluster [DBG] osdmap e244: 3 total, 3 up, 3 in
cluster [WRN] Monitor daemon marked osd.2 down, but it is still running
cluster [DBG] map e242 wrongly marked me down at e241
"[WRN] Monitor daemon marked osd.2 down, but it is still running" is the odd part: why would the mon mark a healthy OSD down when nothing had been changed?
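A quick way to confirm this pattern is to count how often the "wrongly marked me down" message shows up in a log. The following is a minimal sketch: the function name count_wrong_markdowns is made up for illustration, and the path in the usage comment assumes the default /var/log/ceph layout on a monitor node.

```shell
# Hypothetical helper: count "wrongly marked me down" lines in a log file.
# Prints 0 when the file contains no such line.
count_wrong_markdowns() {
    grep -c 'wrongly marked me down' "$1"
}

# Usage (the path is an assumption, adjust it to your deployment):
#   count_wrong_markdowns /var/log/ceph/ceph.log
```

A count that keeps growing while the OSD process itself logs no errors points toward a communication problem between the OSD and the rest of the cluster rather than a fault inside the OSD.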
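Finally, while checking the environment, I found that the firewall on node 203 had been turned on. On an internal network Ceph is normally deployed with the firewall disabled outright, so no ports had been explicitly allowed. Turning the firewall off fixed it: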
systemctl stop firewalld
systemctl disable firewalld
systemctl start ceph-osd@2
After disabling the firewall and restarting the OSD, the abnormal exits did not recur.
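If the firewall has to stay on, an alternative to disabling it is to allow the Ceph ports. This is a minimal sketch, assuming firewalld and Ceph's default ports: 6789/tcp for monitors, 6800-7300/tcp for OSDs and other daemons, plus 3300/tcp for monitors on newer releases. The open_ceph_ports function and its dry-run argument are illustrative and were not part of the original setup.

```shell
# Hypothetical helper: allow Ceph's default ports in firewalld.
# $1 is an optional prefix; pass "echo" to print the commands instead of running them.
open_ceph_ports() {
    run="${1:-}"
    $run firewall-cmd --permanent --add-port=6789/tcp       # monitor (msgr v1)
    $run firewall-cmd --permanent --add-port=3300/tcp       # monitor (msgr v2, Nautilus and later)
    $run firewall-cmd --permanent --add-port=6800-7300/tcp  # OSDs and other daemons
    $run firewall-cmd --reload
}

open_ceph_ports echo   # dry run: only prints the four commands
```

Calling open_ceph_ports without an argument, as root, applies the rules. The same rules would be needed on every node that runs Ceph daemons.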
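The check also turned up a node whose clock was out of sync, 18 hours behind the cluster, so I synchronized it through chrony (chronyc sources -v shows the state of the time sources). Once the time had been synchronized, the OSD on that node misbehaved: the cluster reported it as down while the OSD process on the node was still running, and the log contained: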
-1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before xxxxx 19:22:34.374108)
Fix: restart that OSD.
systemctl restart ceph-osd@1
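Before and after fixing the clock it helps to see how far off the node actually is. This is a sketch, assuming chrony is the time service: chronyc tracking reports the current offset on its "System time" line, and chronyc makestep steps the clock immediately instead of slewing it gradually. The helper name clock_offset_seconds is illustrative.

```shell
# Hypothetical helper: read `chronyc tracking` output on stdin and print
# the offset from the "System time" line, in seconds.
clock_offset_seconds() {
    awk -F': *' '/^System time/ { split($2, f, " "); print f[1] }'
}

# Usage on a node running chronyd:
#   chronyc tracking | clock_offset_seconds
#   chronyc makestep     # step the clock at once (run as root)
```

Because a large time step can invalidate the rotating authentication keys the OSD is holding, restarting the OSD after the clock has been corrected, as above, is the simplest way to get it to re-authenticate.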