1. Background
While practicing a Ceph deployment today, I found the cluster in an abnormal state:
ceph health
HEALTH_ERR 21 pgs are stuck inactive for more than 300 seconds; 21 pgs stale; 21 pgs stuck stale
My guess: while testing adding and removing OSDs, I either did not clean up completely or did not use the correct removal procedure.
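For reference, the usual manual sequence for removing an OSD on releases of this era can be sketched as follows. The helper name print_osd_removal_steps is hypothetical, and it only prints the commands instead of running them; the OSD ID passed in is a placeholder. Treat this as a rough checklist under those assumptions, not as the exact procedure that was skipped here.

```shell
# Hypothetical helper: print (do not run) the usual manual steps for
# removing one OSD. The argument is the numeric OSD ID.
print_osd_removal_steps() {
    id="$1"
    echo "ceph osd out ${id}"                # stop placing data on the OSD
    echo "systemctl stop ceph-osd@${id}"     # stop the daemon once data has migrated
    echo "ceph osd crush remove osd.${id}"   # remove it from the CRUSH map
    echo "ceph auth del osd.${id}"           # delete its authentication key
    echo "ceph osd rm ${id}"                 # remove the OSD from the cluster
}

# Print the steps for a placeholder OSD ID.
print_osd_removal_steps 1
```

Waiting for the cluster to finish rebalancing after the `ceph osd out` step, before stopping the daemon, is what keeps PGs from being left on an OSD that no longer exists.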
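2. Fix
The fix is to use ceph pg force_create_pg <pgid> to recreate the problematic PG from scratch. The recreated PG is empty, so this is only appropriate when the data in it is expendable, as on this test cluster.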
# List the problematic PGs
ceph health detail
HEALTH_ERR 22 pgs are stuck inactive for more than 300 seconds; 21 pgs stale; 1 pgs stuck inactive; 21 pgs stuck stale; 1 pgs stuck unclean
pg 0.18 is stuck inactive for 385.978354, current state creating, last acting [0]
pg 0.18 is stuck unclean for 385.978358, current state creating, last acting [0]
pg 0.38 is stuck stale for 5228.614958, current state stale+active+clean, last acting [1]
pg 0.2d is stuck stale for 5228.614920, current state stale+active+clean, last acting [1]
pg 0.2c is stuck stale for 5198.568749, current state stale+active+clean, last acting [2]
pg 0.2b is stuck stale for 5228.614940, current state stale+active+clean, last acting [1]
pg 0.2a is stuck stale for 5198.568753, current state stale+active+clean, last acting [2]
pg 0.1a is stuck stale for 5228.614950, current state stale+active+clean, last acting [1]
pg 0.1b is stuck stale for 5228.614951, current state stale+active+clean, last acting [1]
pg 0.d is stuck stale for 5198.568803, current state stale+active+clean, last acting [2]
pg 0.c is stuck stale for 5228.614961, current state stale+active+clean, last acting [1]
pg 0.22 is stuck stale for 5198.568796, current state stale+active+clean, last acting [2]
pg 0.1c is stuck stale for 5228.614956, current state stale+active+clean, last acting [1]
pg 0.5 is stuck stale for 5198.568804, current state stale+active+clean, last acting [2]
pg 0.3c is stuck stale for 5228.614978, current state stale+active+clean, last acting [1]
pg 0.3e is stuck stale for 5198.568821, current state stale+active+clean, last acting [2]
pg 0.34 is stuck stale for 5228.614975, current state stale+active+clean, last acting [1]
pg 0.1d is stuck stale for 5228.614962, current state stale+active+clean, last acting [1]
pg 0.20 is stuck stale for 5228.614962, current state stale+active+clean, last acting [1]
pg 0.36 is stuck stale for 5228.614981, current state stale+active+clean, last acting [1]
pg 0.1f is stuck stale for 5198.568809, current state stale+active+clean, last acting [2]
pg 0.35 is stuck stale for 5228.614983, current state stale+active+clean, last acting [1]
pg 0.1e is stuck stale for 5228.614968, current state stale+active+clean, last acting [1]
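Rather than copying PG IDs out of this listing by hand, they can be pulled out by state. The sketch below assumes the line format shown above ("pg <id> is stuck ... current state <state>, ..."); the function name pgs_in_state is hypothetical.

```shell
# Hypothetical helper: read `ceph health detail` text on stdin and print the
# unique IDs of PGs whose "current state" equals the state given as $1.
pgs_in_state() {
    awk -v want="$1" '
        $1 == "pg" {
            for (i = 1; i < NF; i++)
                if ($i == "state") {
                    s = $(i + 1)
                    sub(/,$/, "", s)          # drop the trailing comma
                    if (s == want) print $2
                }
        }' | sort -u
}

# Demo on two lines copied from the output above.
printf '%s\n' \
    'pg 0.18 is stuck inactive for 385.978354, current state creating, last acting [0]' \
    'pg 0.38 is stuck stale for 5228.614958, current state stale+active+clean, last acting [1]' |
    pgs_in_state stale+active+clean
```

Against a live cluster this would be used as `ceph health detail | pgs_in_state stale+active+clean`. Matching the state field exactly avoids picking up the summary line, whose second field is a count rather than a PG ID.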
Recreate the problematic PGs:
cat pg_id.sh
#!/bin/bash
PG_ID=(
0.18
0.38
0.2d
0.2c
0.2b
0.2a
0.1a
0.1b
0.d
0.c
0.22
0.1c
0.5
0.3c
0.3e
0.34
0.1d
0.20
0.36
0.1f
0.35
0.1e
)
# Recreate each listed PG; quoting keeps the IDs from being word-split or globbed
for id in "${PG_ID[@]}"; do
    echo "$id"
    ceph pg force_create_pg "$id"
done
# Run it (the script uses a bash array, so run it with bash rather than sh)
bash pg_id.sh
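Because force_create_pg recreates each PG empty, it can be worth previewing the commands before running them. The following is a minimal sketch of that idea, not part of the original script: the function name force_create_pgs and the DRY_RUN variable are hypothetical, and DRY_RUN defaults to print-only.

```shell
# Hypothetical wrapper: read PG IDs on stdin, one per line, and either print
# or run the force_create_pg command for each one.
# DRY_RUN defaults to 1 (print only); set DRY_RUN=0 to actually run ceph.
force_create_pgs() {
    while IFS= read -r pgid; do
        [ -n "$pgid" ] || continue            # skip blank lines
        if [ "${DRY_RUN:-1}" = "1" ]; then
            echo "ceph pg force_create_pg $pgid"
        else
            ceph pg force_create_pg "$pgid"
        fi
    done
}

# Preview the commands for two of the PGs listed above.
printf '%s\n' 0.38 0.2d | force_create_pgs
```

Reading IDs from stdin means the same function works with a hand-written list or with the output of a filter over `ceph health detail`.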
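If there are too many problem PGs to list by hand, the following loop can handle them in one go. Adjust the grep pattern to the state your cluster actually reports; for the output above it would be stale+active+clean.
for pg in $(ceph health detail | grep "stale+active+undersized+degraded" | awk '{print $2}' | sort -u); do
    ceph pg force_create_pg "$pg"
done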
Check the health status again:
ceph health detail
HEALTH_ERR 1 pgs are stuck inactive for more than 300 seconds; 1 pgs stuck inactive; 1 pgs stuck unclean
pg 0.18 is stuck inactive for 674.102608, current state creating, last acting [0]
pg 0.18 is stuck unclean for 674.102613, current state creating, last acting [0]
It looks like recreating the PGs resolved the stale ones, but one PG is still reported as both inactive and unclean.
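To see at a glance which PGs remain in each "stuck" category, the health output can be grouped by that field. This is a small sketch that assumes the "pg <id> is stuck <kind> for ..." line format shown above; the function name stuck_pgs_by_kind is hypothetical.

```shell
# Hypothetical helper: read `ceph health detail` text on stdin and print one
# "<kind> <pgid>" line per distinct stuck category and PG.
stuck_pgs_by_kind() {
    awk '$1 == "pg" && $4 == "stuck" { print $5, $2 }' | sort -u
}

# Demo with the two remaining lines from the output above.
printf '%s\n' \
    'pg 0.18 is stuck inactive for 674.102608, current state creating, last acting [0]' \
    'pg 0.18 is stuck unclean for 674.102613, current state creating, last acting [0]' |
    stuck_pgs_by_kind
```

Here both remaining entries point at the same PG, 0.18, which is the one stuck in creating.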
Dealing with the PG stuck in creating:
for pg in $(ceph health detail | grep "creating" | awk '{print $2}' | sort -u); do
    ceph pg map "$pg"
done
# After the loop finishes, restart all OSD services (osd.0 shown here; repeat for each OSD ID)
systemctl restart ceph-osd@0
Restart the services.
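Restarting every OSD on a host one ID at a time gets tedious, so the unit names can be derived from the OSD data directories. The sketch below assumes the default data path /var/lib/ceph/osd with directories named ceph-<id> (the default for a cluster named "ceph"); the function name osd_units is hypothetical.

```shell
# Hypothetical helper: print the systemd unit name for every OSD whose data
# directory exists under the given path (default: /var/lib/ceph/osd).
osd_units() {
    dir="${1:-/var/lib/ceph/osd}"
    for d in "$dir"/ceph-*; do
        [ -d "$d" ] || continue               # no match, or not a directory
        echo "ceph-osd@${d##*-}"              # the ID is the part after the last "-"
    done
}

# Print the restart commands for this host; pipe the output to sh to run them.
osd_units | sed 's/^/systemctl restart /'
```

Printing the commands first makes it easy to check that the list of OSD IDs is what you expect before anything is restarted.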