1. Background
One node in the cluster failed, and at the same time another node lost a disk.
2. Problem
Checking the Ceph cluster's health shows that placement group pg 4.210 has one unfound object:
# ceph health detail
HEALTH_WARN 481/5647596 objects misplaced (0.009%); 1/1882532 objects unfound (0.000%); Degraded data redundancy: 965/5647596 objects degraded (0.017%), 1 pg degraded, 1 pg undersized
OBJECT_MISPLACED 481/5647596 objects misplaced (0.009%)
OBJECT_UNFOUND 1/1882532 objects unfound (0.000%)
pg 4.210 has 1 unfound objects
PG_DEGRADED Degraded data redundancy: 965/5647596 objects degraded (0.017%), 1 pg degraded, 1 pg undersized
pg 4.210 is stuck undersized for 38159.843116, current state active+recovery_wait+undersized+degraded+remapped, last acting [2]
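Before working on the PG itself, it is worth confirming that the degraded state really comes from the failed node and the failed disk mentioned above. A minimal check (just the usual status commands; the actual output is omitted here):
# ceph -s
# ceph osd tree
ceph -s reports how many OSDs are up and in, and ceph osd tree lists the OSDs grouped by host, so the OSDs on the broken node and on the broken disk should show up there as down.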
3. Resolution
3.1. First, get the cluster back to a usable state
Looking at pg 4.210, we can see that it currently has only one replica left (its acting set is just osd.2):
# ceph pg dump | grep 4.210
dumped all
4.210 482 1 965 481 1 2013720576 3461 3461 active+recovery_wait+undersized+degraded+remapped 2019-07-10 09:34:53.693724 9027'1835435 9027:1937140 [6,17,20] 6 [2] 2 6368'1830618 2019-07-07 01:36:16.289885 6368'1830618 2019-07-07 01:36:16.289885 2
# ceph pg map 4.210
osdmap e9181 pg 4.210 (4.210) -> up [26,20,2] acting [2]
Two of the replicas are gone, and worst of all, the primary copy is one of them…
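Since only osd.2 still holds data for this PG and the object is unfound even there, the next step is to see which OSDs the PG thinks might have the missing copy. The following is a sketch based on Ceph's documented PG troubleshooting workflow, not necessarily the exact commands run in this incident:
# ceph pg 4.210 query
# ceph pg 4.210 list_missing
In the query output, the recovery_state section (in particular might_have_unfound) shows which OSDs have already been probed for the unfound object and which are still being queried or are down; list_missing (named list_unfound in newer releases) prints the IDs of the unfound objects themselves. If the lost copies really cannot be brought back, the documented last resort for getting the PG healthy again, and therefore the cluster usable, is to give up on the unfound object:
# ceph pg 4.210 mark_unfound_lost revert
revert rolls the object back to an earlier version (or forgets it if it was newly created); the alternative, mark_unfound_lost delete, discards the object entirely and should only be used once it is certain the data cannot be recovered.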