Background
After a new OSD was added to the Ceph cluster, it stayed in the down state.
Error messages
The status checks were as follows.
1. Check the OSD tree
[root@node1 Asia]# ceph osd tree
ID WEIGHT TYPE NAME UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 0.05388 root default
-2 0.01469 host node1
0 0.00490 osd.0 up 1.00000 1.00000
1 0.00490 osd.1 up 1.00000 1.00000
2 0.00490 osd.2 up 1.00000 1.00000
-3 0.01959 host node2
4 0.00490 osd.4 up 1.00000 1.00000
5 0.00490 osd.5 up 1.00000 1.00000
6 0.00490 osd.6 up 1.00000 1.00000
7 0.00490 osd.7 up 1.00000 1.00000
-4 0.01959 host node3
8 0.00490 osd.8 up 1.00000 1.00000
9 0.00490 osd.9 up 1.00000 1.00000
3 0.00490 osd.3 up 1.00000 1.00000
10 0.00490 osd.10 up 1.00000 1.00000
11 0 osd.11 down 0 1.00000
[root@node1 Asia]#
2. Check the OSD service status
[root@node1 /]# systemctl status ceph-osd@11
● ceph-osd@11.service - Ceph object storage daemon
Loaded: loaded (/usr/lib/systemd/system/ceph-osd@.service; enabled-runtime; vendor preset: disabled)
Active: failed (Result: start-limit) since Sun 2018-09-09 22:15:25 EDT; 4h 57min ago
Process: 10331 ExecStartPre=/usr/lib/ceph/ceph-osd-prestart.sh --cluster ${CLUSTER} --id %i (code=exited, status=1/FAILURE)
Sep 09 22:15:05 node1 systemd[1]: ceph-osd@11.service: control process exited, code=exited status=1
Sep 09 22:15:05 node1 systemd[1]: Failed to start Ceph object storage daemon.
Sep 09 22:15:05 node1 systemd[1]: Unit ceph-osd@11.service entered failed state.
Sep 09 22:15:05 node1 systemd[1]: ceph-osd@11.service failed.
Sep 09 22:15:25 node1 systemd[1]: ceph-osd@11.service holdoff time over, scheduling restart.
Sep 09 22:15:25 node1 systemd[1]: start request repeated too quickly for ceph-osd@11.service
Sep 09 22:15:25 node1 systemd[1]: Failed to start Ceph object storage daemon.
Sep 09 22:15:25 node1 systemd[1]: Unit ceph-osd@11.service entered failed state.
Sep 09 22:15:25 node1 systemd[1]: ceph-osd@11.service failed.
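Note the failure reason above: Result: start-limit. Once systemd has rate-limited a unit ("start request repeated too quickly"), a plain start will keep failing even after the root cause is fixed; the failed state has to be cleared first. A minimal sketch (the optional second argument is only a hypothetical dry-run hook for illustration; on a real node call it with just the OSD id):

```shell
# Clear systemd's start-limit counter for an OSD unit, then retry the start.
restart_osd() {
    local id="$1" runner="${2:-}"
    # reset-failed clears the "start request repeated too quickly" state
    $runner systemctl reset-failed "ceph-osd@${id}"
    $runner systemctl start "ceph-osd@${id}"
}
# On the node: restart_osd 11
```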
3. Try to start the OSD
[root@node1 /]# systemctl start ceph-osd@11
Job for ceph-osd@11.service failed because the control process exited with error code. See "systemctl status ceph-osd@11.service" and "journalctl -xe" for details.
4. Inspect the error
[root@node1 /]# journalctl -xe
Sep 10 03:12:52 node1 polkitd[723]: Unregistered Authentication Agent for unix-process:10473:4129481 (system bus name :1.52, object p
Sep 10 03:13:12 node1 systemd[1]: ceph-osd@11.service holdoff time over, scheduling restart.
Sep 10 03:13:12 node1 systemd[1]: Starting Ceph object storage daemon...
-- Subject: Unit ceph-osd@11.service has begun start-up
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit ceph-osd@11.service has begun starting up.
Sep 10 03:13:12 node1 ceph-osd-prestart.sh[10483]: OSD data directory /var/lib/ceph/osd/ceph-11 does not exist; bailing out.
Sep 10 03:13:12 node1 systemd[1]: ceph-osd@11.service: control process exited, code=exited status=1
Sep 10 03:13:12 node1 systemd[1]: Failed to start Ceph object storage daemon.
-- Subject: Unit ceph-osd@11.service has failed
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit ceph-osd@11.service has failed.
--
-- The result is failed.
Sep 10 03:13:12 node1 systemd[1]: Unit ceph-osd@11.service entered failed state.
Sep 10 03:13:12 node1 systemd[1]: ceph-osd@11.service failed.
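The prestart line above is the real clue: OSD data directory /var/lib/ceph/osd/ceph-11 does not exist. On a ceph-disk deployment this usually means the OSD's data partition was never mounted (or the directory was never created because provisioning aborted). A quick check, assuming the standard data path; the device name in the comments is an assumption, not taken from this cluster:

```shell
# Verify the OSD data directory that the prestart script complained about.
check_osd_dir() {
    local dir="/var/lib/ceph/osd/ceph-$1"
    if [ -d "$dir" ]; then
        echo "present: $dir"
    else
        echo "missing: $dir"
    fi
}
check_osd_dir 11
# If it is missing on a ceph-disk deployment, remount the data partition
# (replace /dev/sdd1 with the actual device for osd.11):
#   mkdir -p /var/lib/ceph/osd/ceph-11
#   mount /dev/sdd1 /var/lib/ceph/osd/ceph-11
#   chown ceph:ceph /var/lib/ceph/osd/ceph-11
```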
Honestly, I am not sure what caused the error above, but according to my notes the whole cluster was in the ERR state when this OSD was added.
Fix
Cluster status at the time the OSD was added:
[root@node1 ceph]# ceph -s
cluster 8eaa3f15-0946-4500-b018-6d31d1cc69f6
health HEALTH_ERR
clock skew detected on mon.node