Fixing a Ceph OSD that is down
During a routine inspection today I found that one OSD in the Ceph cluster was down.
Viewing it in the dashboard:
Clicking through to the details shows which node's OSD is down.
Next, check the OSD status from the command line.
①. Check the cluster status:
[root@ceph01 ~]# ceph -s
  cluster:
    id:     240a5732-02e5-11eb-8f5a-000c2945a4b1
    health: HEALTH_WARN
            Degraded data redundancy: 3972/11916 objects degraded (33.333%), 64 pgs degraded, 65 pgs undersized
            65 pgs not deep-scrubbed in time
            65 pgs not scrubbed in time

  services:
    mon: 3 daemons, quorum ceph01,ceph02,ceph03 (age 8d)
    mgr: ceph02.zopypt(active, since 10w), standbys: ceph03.ucynxg, ceph01.suwmox
    mds: cephfs:1 {0=cephfs.ceph02.axdsbo=up:active} 4 up:standby
    osd: 3 osds: 2 up (since 5w), 2 in (since 5w)

  data:
    pools:   3 pools, 65 pgs
    objects: 3.97k objects, 1.8 GiB
    usage:   6.0 GiB used, 2.0 TiB / 2.0 TiB avail
    pgs:     3972/11916 objects degraded (33.333%)
             64 active+undersized+degraded
             1  active+undersized

  io:
    client: 596 B/s wr, 0 op/s rd, 0 op/s wr
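A quick sanity check on those numbers: with three-way replication and one of three OSDs down, exactly one copy of every object is missing, so the degraded ratio is exactly one third (3972 degraded out of 3972 × 3 = 11916 placed copies). A small sketch of the arithmetic:

```shell
# With size=3 replication, 3.97k logical objects have 3972 * 3 = 11916
# placed copies; one OSD of three down means one copy of each object is
# missing, which is why `ceph -s` reports exactly 33.333% degraded.
awk 'BEGIN { printf "%.3f%%\n", 3972 / 11916 * 100 }'
```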
②. Check the OSD tree:
[root@ceph01 ~]# ceph osd tree
ID  CLASS  WEIGHT   TYPE NAME             STATUS  REWEIGHT  PRI-AFF
-1         3.00000  root default
-3         1.00000      host sjyt-ceph01
 0    hdd  1.00000          osd.0           down         0  1.00000
-5         1.00000      host sjyt-ceph02
 1    hdd  1.00000          osd.1             up   1.00000  1.00000
-7         1.00000      host sjyt-ceph03
 2    hdd  1.00000          osd.2             up   1.00000  1.00000
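On a cluster with many OSDs, picking the down ones out of the tree by eye gets tedious. They can be filtered from the same output; the sketch below runs against a captured sample of the osd rows, and the column positions assume the default `ceph osd tree` layout. (Recent releases also accept a state filter directly, e.g. `ceph osd tree down`.)

```shell
# List OSDs reported "down" from `ceph osd tree` output.
# Demonstrated on a captured sample; on a live node, pipe the real
# command instead:  ceph osd tree | awk '$5 == "down" { print $4 }'
sample='0 hdd 1.00000 osd.0 down 0 1.00000
1 hdd 1.00000 osd.1 up 1.00000 1.00000
2 hdd 1.00000 osd.2 up 1.00000 1.00000'

printf '%s\n' "$sample" | awk '$5 == "down" { print $4 }'
```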
Resolution — one way to handle this:
①. Restart the OSD service on the failed node
[root@sjyt-ceph01 ~]# systemctl status ceph-240a5732-02e5-11eb-8f5a-000c2945a4b1@osd.0.service
● ceph-240a5732-02e5-11eb-8f5a-000c2945a4b1@osd.0.service - Ceph osd.0 for 240a5732-02e5-11eb-8f5a-000c2945a4b1
   Loaded: loaded (/etc/systemd/system/ceph-240a5732-02e5-11eb-8f5a-000c2945a4b1@.service; enabled; vendor preset: disabled)
   Active: inactive (dead) since Mon 2021-02-01 19:24:37 CST; 1 months 5 days ago
  Process: 320045 ExecStopPost=/bin/bash /var/lib/ceph/240a5732-02e5-11eb-8f5a-000c2945a4b1/osd.0/unit.poststop (code=exited, status=0/SUCCESS)
  Process: 320033 ExecStop=/bin/podman stop ceph-240a5732-02e5-11eb-8f5a-000c2945a4b1-osd.0 (code=exited, status=125)
  Process: 153844 ExecStart=/bin/bash /var/lib/ceph/240a5732-02e5-11eb-8f5a-000c2945a4b1/osd.0/unit.run (code=exited, status=0/SUCCESS)
  Process: 153833 ExecStartPre=/bin/podman rm ceph-240a5732-02e5-11eb-8f5a-000c2945a4b1-osd.0 (code=exited, status=1/FAILURE)
 Main PID: 153844 (code=exited, status=0/SUCCESS)

Warning: Journal has been rotated since unit was started. Log output is incomplete or unavailable.
[root@sjyt-ceph01 ~]# systemctl start ceph-240a5732-02e5-11eb-8f5a-000c2945a4b1@osd.0.service
[root@sjyt-ceph01 ~]# systemctl status ceph-240a5732-02e5-11eb-8f5a-000c2945a4b1@osd.0.service
● ceph-240a5732-02e5-11eb-8f5a-000c2945a4b1@osd.0.service - Ceph osd.0 for 240a5732-02e5-11eb-8f5a-000c2945a4b1
   Loaded: loaded (/etc/systemd/system/ceph-240a5732-02e5-11eb-8f5a-000c2945a4b1@.service; enabled; vendor preset: disabled)
   Active: active (running) since Tue 2021-03-09 10:19:07 CST; 1s ago
  Process: 320045 ExecStopPost=/bin/bash /var/lib/ceph/240a5732-02e5-11eb-8f5a-000c2945a4b1/osd.0/unit.poststop (code=exited, status=0/SUCCESS)
  Process: 320033 ExecStop=/bin/podman stop ceph-240a5732-02e5-11eb-8f5a-000c2945a4b1-osd.0 (code=exited, status=125)
  Process: 2770303 ExecStartPre=/bin/podman rm ceph-240a5732-02e5-11eb-8f5a-000c2945a4b1-osd.0 (code=exited, status=1/FAILURE)
 Main PID: 2770314 (bash)
    Tasks: 13 (limit: 23968)
   Memory: 31.2M
   CGroup: /system.slice/system-ceph\x2d240a5732\x2d02e5\x2d11eb\x2d8f5a\x2d000c2945a4b1.slice/ceph-240a5732-02e5-11eb-8f5a-000c2945a4b1@osd.0.service
           ├─2770314 /bin/bash /var/lib/ceph/240a5732-02e5-11eb-8f5a-000c2945a4b1/osd.0/unit.run
           └─2770413 /bin/podman run --rm --net=host --ipc=host --privileged --group-add=disk --name ceph-240a5732-02e5-11eb-8f5a-000c2945a4b1-osd.0 -e CONTAINER_IMAGE=docker.io/ceph/ceph:v15 -e NODE_NAME=sjyt
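In a cephadm (containerized) deployment like this one, each daemon's systemd unit follows the pattern `ceph-<fsid>@<daemon>.service`, so the long unit name above can be derived from the cluster fsid instead of typed by hand. A minimal sketch, with the fsid hard-coded to match this cluster (on a live node you would take it from `ceph fsid`):

```shell
# Build the systemd unit name for an OSD in a cephadm deployment:
# ceph-<fsid>@osd.<id>.service. The fsid is hard-coded to match the
# cluster above; normally:  fsid=$(ceph fsid)
fsid=240a5732-02e5-11eb-8f5a-000c2945a4b1
osd_id=0
unit="ceph-${fsid}@osd.${osd_id}.service"
echo "$unit"
# Then: systemctl start "$unit" && systemctl status "$unit"
```

On cephadm clusters the orchestrator can restart the daemon without touching systemd directly, e.g. `ceph orch daemon restart osd.0`.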
②. Check the OSD status again
[root@sjyt-ceph01 ~]# ceph osd tree
ID  CLASS  WEIGHT   TYPE NAME             STATUS  REWEIGHT  PRI-AFF
-1         3.00000  root default
-3         1.00000      host sjyt-ceph01
 0    hdd  1.00000          osd.0             up   1.00000  1.00000
-5         1.00000      host sjyt-ceph02
 1    hdd  1.00000          osd.1             up   1.00000  1.00000
-7         1.00000      host sjyt-ceph03
 2    hdd  1.00000          osd.2             up   1.00000  1.00000
③. Check the cluster status
[root@sjyt-ceph01 ~]# ceph -s
  cluster:
    id:     240a5732-02e5-11eb-8f5a-000c2945a4b1
    health: HEALTH_WARN
            Degraded data redundancy: 2654/11916 objects degraded (22.273%), 39 pgs degraded, 39 pgs undersized
            64 pgs not deep-scrubbed in time
            64 pgs not scrubbed in time

  services:
    mon: 3 daemons, quorum sjyt-ceph01,sjyt-ceph02,sjyt-ceph03 (age 8d)
    mgr: sjyt-ceph02.zopypt(active, since 10w), standbys: sjyt-ceph03.ucynxg, sjyt-ceph01.suwmox
    mds: cephfs:1 {0=cephfs.sjyt-ceph02.axdsbo=up:active} 4 up:standby
    osd: 3 osds: 3 up (since 8m), 3 in (since 8m); 39 remapped pgs

  data:
    pools:   3 pools, 65 pgs
    objects: 3.97k objects, 1.8 GiB
    usage:   9.4 GiB used, 3.0 TiB / 3.0 TiB avail
    pgs:     1.538% pgs not active
             2654/11916 objects degraded (22.273%)
             38 active+undersized+degraded+remapped+backfill_wait
             25 active+clean
             1  active+undersized+degraded+remapped+backfilling
             1  peering

  io:
    client:   1.5 KiB/s wr, 0 op/s rd, 0 op/s wr
    recovery: 2.7 MiB/s, 1 keys/s, 1 objects/s
Once the OSD is back to normal, the data starts recovering (backfilling) onto the restored OSD node.
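The degraded figure in the final `ceph -s` doubles as a progress meter: it has already fallen from one third of all object copies before the restart to the fraction still waiting for backfill. A quick check of the two percentages reported above (`ceph -w` can then be used to watch the status stream continuously until it clears):

```shell
# Degraded object-copy ratio before the OSD restart and during backfill;
# these match the 33.333% and 22.273% figures reported by `ceph -s`.
awk 'BEGIN {
    printf "before restart:  %.3f%%\n", 3972 / 11916 * 100
    printf "during backfill: %.3f%%\n", 2654 / 11916 * 100
}'
```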