Environment: CentOS 7, Ceph Luminous. Servers are HP ProLiant DL380 Gen9 with HP iLO 4 installed.
(I) Remove the failed OSD from Ceph
1. Log in to a Ceph mon node and identify the failed OSD.
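A minimal way to spot it from the mon node, using standard status commands (osd.x throughout this note stands for the actual OSD id):
ceph osd tree
ceph health detail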
2. On the mon node, mark osd.x out:
ceph osd out osd.x
3. Remove osd.x from the CRUSH map so it receives no more data, delete its auth key, and remove the OSD:
ceph osd crush remove osd.x
ceph auth del osd.x
ceph osd rm osd.x
[root@bakmtr01 ~]# ceph -s
  cluster:
    id:     0e38e7c6-a704-4132-b0e3-76b87f18d8fa
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum bakmtr01,bakmtr02,bakmtr03
    mgr: bakmtr03(active), standbys: bakmtr01, bakmtr02
    osd: 99 osds: 99 up, 99 in
    rgw: 3 daemons active
...
Confirm the OSD has been removed. The id can also be marked as destroyed so that it can be reused later:
ceph osd destroy osd.x --yes-i-really-mean-it
Taken together, these steps amount to the same thing as:
ceph osd purge osd.x --yes-i-really-mean-it
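To double-check that osd.x is really gone, both of the following should now return nothing or an error:
ceph osd tree | grep osd.x
ceph auth get osd.x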
4. On the OSD node, unmount the OSD data directory:
umount /var/lib/ceph/osd/ceph-x
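If the failed OSD's daemon is somehow still running, it should be stopped before unmounting; a sketch assuming the systemd ceph-osd@ units used elsewhere in this note:
systemctl stop ceph-osd@x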
5. Find the device, LV, PV, and VG backing osd.x:
[root@bakcmp31 ~]# ceph-volume inventory /dev/sdt
====== Device report /dev/sdt ======
available False
rejected reasons locked
path /dev/sdt
scheduler mode deadline
rotational 1
vendor HP
human readable size 1.64 TB
sas address
removable 0
model LOGICAL VOLUME
ro 0
--- Logical Volume ---
cluster name ceph
name osd-data-3f2e912c-f327-4221-b350-a4b3de4376b6
osd id 1
cluster fsid 0e38e7c6-a704-4132-b0e3-76b87f18d8fa
type block
block uuid V8RGFc-omqm-B1E2-mKz1-TXfl-2lK3-CF2d0L
osd fsid 2f1aaa8a-f50d-4335-a812-5dd86e8042a3
You can also list the osd_id for every disk:
ceph-volume inventory --format json-pretty
Another option is ceph-volume lvm list:
[root@bakcmp31 ~]# ceph-volume lvm list | grep -A 16 "osd.1 "
====== osd.1 =======
[block] /dev/ceph-757f4a80-60e2-425b-a8fd-629a735a5acd/osd-data-3f2e912c-f327-4221-b350-a4b3de4376b6
type block
osd id 1
cluster fsid 0e38e7c6-a704-4132-b0e3-76b87f18d8fa
cluster name ceph
osd fsid 2f1aaa8a-f50d-4335-a812-5dd86e8042a3
encrypted 0
cephx lockbox secret
block uuid V8RGFc-omqm-B1E2-mKz1-TXfl-2lK3-CF2d0L
block device /dev/ceph-757f4a80-60e2-425b-a8fd-629a735a5acd/osd-data-3f2e912c-f327-4221-b350-a4b3de4376b6
vdo 0
crush device class None
devices /dev/sdt
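The same mapping can also be queried directly by device path, which is handy when only the failed disk is known (assuming /dev/sdt is that disk):
ceph-volume lvm list /dev/sdt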
6. Look up the LV and VG behind osd.1:
[root@bakcmp31 ~]# ceph-volume lvm list /dev/ceph-757f4a80-60e2-425b-a8fd-629a735a5acd/osd-data-3f2e912c-f327-4221-b350-a4b3de4376b6
====== osd.1 =======
[block] /dev/ceph-757f4a80-60e2-425b-a8fd-629a735a5acd/osd-data-3f2e912c-f327-4221-b350-a4b3de4376b6
...
block device /dev/ceph-757f4a80-60e2-425b-a8fd-629a735a5acd/osd-data-3f2e912c-f327-4221-b350-a4b3de4376b6
...
devices /dev/sdt
7. Remove the LV and VG:
[root@bakcmp31 ~]# lvremove ceph-757f4a80-60e2-425b-a8fd-629a735a5acd/osd-data-3f2e912c-f327-4221-b350-a4b3de4376b6
Do you really want to remove active logical volume ceph-757f4a80-60e2-425b-a8fd-629a735a5acd/osd-data-3f2e912c-f327-4221-b350-a4b3de4376b6? [y/n]: y
Logical volume "osd-data-3f2e912c-f327-4221-b350-a4b3de4376b6" successfully removed
[root@bakcmp31 ~]# vgremove ceph-757f4a80-60e2-425b-a8fd-629a735a5acd
Volume group "ceph-757f4a80-60e2-425b-a8fd-629a735a5acd" successfully removed
8. Remove the PV. Find the LVs that belong to the OSD and remove them; if that succeeds with no errors, remove the corresponding VG and PV as well. With the failed disk already gone, lvremove can instead fail like this:
[root@cmp17 ~]# lvremove /dev/ceph-a090a75a-bd1c-4c41-9505-55e9919c54c7/osd-data-c9e93977-654c-48ff-9c94-f92ffd1def69
WARNING: Device for PV Eeuf0S-XkKi-UwwB-35C8-Eozs-YNFR-0CUSw8 not found or rejected by a filter.
Couldn't find device with uuid Eeuf0S-XkKi-UwwB-35C8-Eozs-YNFR-0CUSw8.
Do you really want to remove active logical volume ceph-a090a75a-bd1c-4c41-9505-55e9919c54c7/osd-data-c9e93977-654c-48ff-9c94-f92ffd1def69? [y/n]: y
Aborting vg_write: No metadata areas to write to!
On this error, refresh the LVM cache:
pvscan --cache
Removing the PV by hand does not work here; running pvscan --cache to refresh the cache is what clears things up, after which the stale PV, VG, and LV entries are all gone (the difference is hard to see in the output).
[root@bakcmp31 ~]# pvscan --cache
[root@bakcmp31 ~]# pvs
PV VG Fmt Attr PSize PFree
/dev/sdb ceph-7db7008f-5eea-40b6-b289-6ae7d8a8ed91 lvm2 a-- <1.64t 0
...
/dev/sdt lvm2 --- <1.64t <1.64t
/dev/sdu ceph-02a02c1e-018b-4ea0-8c08-a4fb58547818 lvm2 a-- <1.64t 0
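If the device still shows up as an orphan PV with no VG (as /dev/sdt does above), the leftover LVM label can optionally be wiped as well (assuming /dev/sdt is the device being retired):
pvremove /dev/sdt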
9. The cleanup may still be incomplete, with stale device-mapper entries left behind. To check:
[root@cmp39 ~]# dmsetup ls
To remove a stale mapping:
[root@cmp39 ~]# dmsetup remove ***
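The device-mapper name is the VG name and LV name joined by a single dash, with every dash inside the names doubled; a sketch using the names from step 6:
dmsetup ls | grep ceph--757f4a80
dmsetup remove ceph--757f4a80--60e2--425b--a8fd--629a735a5acd-osd--data--3f2e912c--f327--4221--b350--a4b3de4376b6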
(II) After replacing the disk, rebuild the RAID 0 logical drive
Use iLO to find the corresponding physical drive and note its position (here 1I:1:20).
Or use hpssacli to see which logical drive (LD) the physical drive (PD) belongs to:
[root@cmp17 ~]# hpssacli ctrl slot=0 show config detail
Install hpssacli, downloaded from the HP support site: https://support.hpe.com/hpsc/swd/public/detail?swItemId=MTX_04bffb688a73438598fef81ddd
rpm -ivh hpssacli-2.40-13.0.x86_64.rpm
Common hpssacli commands:
hpssacli ctrl slot=0 pd all show
hpssacli ctrl slot=0 pd all show status
hpssacli ctrl slot=0 ld all show
hpssacli ctrl slot=0 ld all show status
All physical drives report healthy:
[root@cmp17 ~]# hpssacli ctrl slot=0 pd all show status
Logical drive 19 reports a failure:
[root@cmp17 ~]# hpssacli ctrl slot=0 ld all show status
...
logicaldrive 19 (1.6 TB, 0): Failed
logicaldrive 20 (1.6 TB, 0): OK
logicaldrive 21 (1.6 TB, 0): OK
Check the device name behind logical drive 19; nothing is shown, which means the RAID volume has not been rebuilt yet:
[root@cmp17 ~]# hpssacli ctrl slot=0 ld xx show
Delete logical drive 19:
[root@cmp17 ~]# hpssacli ctrl slot=0 ld xx delete
Re-create logical drive 19 as RAID 0 on the new physical drive:
[root@cmp17 ~]# hpssacli ctrl slot=0 create type=ld drives=1I:1:xx raid=0
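Afterwards the logical drive should come back OK and expose a block device; a quick check (the "Disk Name" field is assumed to appear in this hpssacli version's ld show output):
hpssacli ctrl slot=0 ld all show status
hpssacli ctrl slot=0 ld xx show | grep -i "disk name"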
(III) Add the OSD back on the node
Inspect the new device first; the same inventory command can also report the osd_id for every disk:
ceph-volume inventory /dev/sdx --format json-pretty
Create the OSD with the batch subcommand, which, per the ceph-volume help text, "Creates OSDs from a list of devices using a filestore or bluestore (default) setup":
[root@bakcmp31 ~]# ceph-volume lvm batch --bluestore /dev/sdx
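For a single disk, ceph-volume lvm create does the same prepare-and-activate in one step (an alternative sketch, assuming /dev/sdx is the freshly rebuilt logical drive):
ceph-volume lvm create --bluestore --data /dev/sdx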
Then activate the OSD(s):
[root@bakcmp31 ~]# ceph-volume lvm activate --all
Verify. On the OSD node:
[root@bakcmp31 ~]# systemctl status ceph-osd@x
On the mon node:
[root@bakmtr01 ~]# ceph -s
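Once the new OSD shows up and in, the cluster should rebalance back to HEALTH_OK; the new id can also be checked directly (osd.x stands for the newly created id):
ceph osd tree | grep osd.x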