ceph:HEALTH_ERR 41 pgs are stuck inactive for more than 300 seconds;

1. Background

While learning to deploy Ceph today, I noticed the cluster was in an error state:

ceph health
HEALTH_ERR 21 pgs are stuck inactive for more than 300 seconds; 21 pgs stale; 21 pgs stuck stale


My guess: while testing adding and removing OSDs, I either did not clean them up completely or did not remove them the correct way.
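Since the guess points at how the OSDs were removed, here is the commonly documented manual procedure for reference: first mark the OSD out with ceph osd out <id> and wait until all PGs are active+clean again, then stop its daemon, remove it from the CRUSH map, delete its auth key, and remove it from the OSD map. The sketch below wraps the last four steps in a hypothetical remove_osd function, with a hypothetical DRY_RUN=1 switch that prints the commands instead of running them. It assumes systemd units named ceph-osd@<id> (as used later in this post) and does not check that rebalancing has finished. On Luminous and later, ceph osd purge <id> --yes-i-really-mean-it replaces the CRUSH, auth and OSD-map removal commands.

```shell
#!/bin/bash
# Hypothetical helper: remove an OSD that has already been marked out
# (ceph osd out <id>) and whose data has finished migrating.
# With DRY_RUN=1 the commands are printed instead of executed.
run() {
  if [ -n "${DRY_RUN:-}" ]; then
    echo "$*"
  else
    "$@"
  fi
}

remove_osd() {
  local id="$1"
  # Stop the daemon (assumes systemd units named ceph-osd@<id>)
  run systemctl stop "ceph-osd@${id}" || return
  # Remove it from the CRUSH map so no PGs are mapped to it any more
  run ceph osd crush remove "osd.${id}" || return
  # Delete its authentication key
  run ceph auth del "osd.${id}" || return
  # Remove it from the OSD map
  run ceph osd rm "$id"
}
```

For example, DRY_RUN=1 remove_osd 1 prints the four commands for osd.1 without touching the cluster. Stopping at the first failing step (the || return) avoids deleting the auth key of an OSD that is still running.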

2. Solution

The fix is to run ceph pg force_create_pg <pgid> to recreate each problematic PG (it is recreated as a new, empty PG).

# List the problematic PGs
ceph health detail
HEALTH_ERR 22 pgs are stuck inactive for more than 300 seconds; 21 pgs stale; 1 pgs stuck inactive; 21 pgs stuck stale; 1 pgs stuck unclean
pg 0.18 is stuck inactive for 385.978354, current state creating, last acting [0]
pg 0.18 is stuck unclean for 385.978358, current state creating, last acting [0]
pg 0.38 is stuck stale for 5228.614958, current state stale+active+clean, last acting [1]
pg 0.2d is stuck stale for 5228.614920, current state stale+active+clean, last acting [1]
pg 0.2c is stuck stale for 5198.568749, current state stale+active+clean, last acting [2]
pg 0.2b is stuck stale for 5228.614940, current state stale+active+clean, last acting [1]
pg 0.2a is stuck stale for 5198.568753, current state stale+active+clean, last acting [2]
pg 0.1a is stuck stale for 5228.614950, current state stale+active+clean, last acting [1]
pg 0.1b is stuck stale for 5228.614951, current state stale+active+clean, last acting [1]
pg 0.d is stuck stale for 5198.568803, current state stale+active+clean, last acting [2]
pg 0.c is stuck stale for 5228.614961, current state stale+active+clean, last acting [1]
pg 0.22 is stuck stale for 5198.568796, current state stale+active+clean, last acting [2]
pg 0.1c is stuck stale for 5228.614956, current state stale+active+clean, last acting [1]
pg 0.5 is stuck stale for 5198.568804, current state stale+active+clean, last acting [2]
pg 0.3c is stuck stale for 5228.614978, current state stale+active+clean, last acting [1]
pg 0.3e is stuck stale for 5198.568821, current state stale+active+clean, last acting [2]
pg 0.34 is stuck stale for 5228.614975, current state stale+active+clean, last acting [1]
pg 0.1d is stuck stale for 5228.614962, current state stale+active+clean, last acting [1]
pg 0.20 is stuck stale for 5228.614962, current state stale+active+clean, last acting [1]
pg 0.36 is stuck stale for 5228.614981, current state stale+active+clean, last acting [1]
pg 0.1f is stuck stale for 5198.568809, current state stale+active+clean, last acting [2]
pg 0.35 is stuck stale for 5228.614983, current state stale+active+clean, last acting [1]
pg 0.1e is stuck stale for 5228.614968, current state stale+active+clean, last acting [1]
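Every stale PG in this output last had a single OSD in its acting set, either osd.1 or osd.2, which is consistent with the guess that OSDs were removed without a proper cleanup. Counting the stuck PGs per last acting set makes this easy to see. The sketch below is a hypothetical count_by_acting function that reads ceph health detail text on stdin; it assumes the "..., last acting [N]" line format shown above.

```shell
#!/bin/bash
# Hypothetical helper: count stuck PGs per "last acting" set.
# Reads `ceph health detail` output on stdin; only lines whose first
# field is "pg" are considered, so the HEALTH_ERR summary line is skipped.
count_by_acting() {
  awk '$1 == "pg" {
         # The acting set is the last field, e.g. "[1]"
         acting = $NF
         # Count each PG once, even if it is listed as both
         # "stuck inactive" and "stuck unclean"
         if (!seen[$2]++) count[acting]++
       }
       END { for (a in count) print a, count[a] }' | sort
}
```

Run as ceph health detail | count_by_acting, the output above gives 14 PGs last acting on [1], 7 on [2], and 1 (the creating PG 0.18) on [0].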

Recreate the problematic PGs:

cat pg_id.sh
#!/bin/bash

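PG_ID=(
0.18
0.38
0.2d
0.2c
0.2b
0.2a
0.1a
0.1b
0.d
0.c
0.22
0.1c
0.5
0.3c
0.3e
0.34
0.1d
0.20
0.36
0.1f
0.35
0.1e
)

# Force-recreate each listed PG (0.18 is listed once, even though
# ceph health detail reports it twice)
for id in "${PG_ID[@]}"; do
  echo "$id"
  ceph pg force_create_pg "$id"
done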

# Run it with bash: the script uses a bash array, which plain sh (e.g. dash) does not support
bash pg_id.sh
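pg 0.18 shows up twice in the ceph health detail output (once as stuck inactive, once as stuck unclean), so a PG ID list copied from it can easily contain duplicates. The sketch below is a hypothetical recreate_pgs function that deduplicates its arguments before running the same ceph pg force_create_pg command used in this post; a hypothetical DRY_RUN=1 switch prints the commands instead. Newer releases (Luminous onward) use ceph osd force-create-pg <pgid> instead, so check the documentation for your version.

```shell
#!/bin/bash
# Hypothetical helper: force-recreate each distinct PG ID given as an argument.
# With DRY_RUN=1 the commands are only printed.
recreate_pgs() {
  local id
  # sort -u drops duplicate IDs such as a repeated 0.18
  for id in $(printf '%s\n' "$@" | sort -u); do
    if [ -n "${DRY_RUN:-}" ]; then
      echo "ceph pg force_create_pg $id"
    else
      echo "$id"
      ceph pg force_create_pg "$id"
    fi
  done
}
```

For example, DRY_RUN=1 recreate_pgs 0.18 0.18 0.38 prints one command for 0.18 and one for 0.38.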

If there are many problematic PGs, you can run the following loop instead:

# Match only per-PG lines ("pg <pgid> ..."), skipping the HEALTH_ERR summary line
for pg in $(ceph health detail | grep "^pg .*stale" | awk '{print $2}' | sort -u); do
  ceph pg force_create_pg "$pg"
done
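When filtering ceph health detail for stale PGs, note that the summary line also contains the word "stale" ("21 pgs stale"), so a plain grep stale would let it through and awk '{print $2}' would print 22 as if it were a PG ID. The sketch below is a hypothetical stuck_pg_ids function that reads ceph health detail text on stdin and prints each distinct PG ID whose current state contains a given string; it assumes the "current state <state>, last acting [...]" format shown in this post.

```shell
#!/bin/bash
# Hypothetical helper: print the distinct PG IDs whose current state
# contains a literal string, reading `ceph health detail` on stdin.
# Example arguments: "stale", "creating", "stale+active".
stuck_pg_ids() {
  local want="$1"
  # Only lines whose first field is "pg" describe a single PG; the state is
  # the text between "current state " and the next comma.
  awk -v want="$want" '
    $1 == "pg" {
      state = $0
      sub(/.*current state /, "", state)
      sub(/,.*/, "", state)
      # index() does a literal substring match, so "+" needs no escaping
      if (index(state, want) > 0) print $2
    }' | sort -u
}
```

For example: for pg in $(ceph health detail | stuck_pg_ids stale); do ceph pg force_create_pg "$pg"; done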

Check the health status again:

ceph health detail
HEALTH_ERR 1 pgs are stuck inactive for more than 300 seconds; 1 pgs stuck inactive; 1 pgs stuck unclean
pg 0.18 is stuck inactive for 674.102608, current state creating, last acting [0]
pg 0.18 is stuck unclean for 674.102613, current state creating, last acting [0]

So recreating the PGs fixed the stale ones, but one PG (0.18) is still stuck inactive and unclean in the creating state.
Fixing the PG stuck in creating:

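# Show which OSDs each PG stuck in "creating" maps to
for pg in $(ceph health detail | grep "creating" | awk '{print $2}' | sort -u); do
  ceph pg map "$pg"
done
# Then restart the OSD service. This restarts osd.0, the acting OSD of pg 0.18;
# on a systemd host, systemctl restart ceph-osd.target restarts every OSD on that host
systemctl restart ceph-osd@0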
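ceph pg map only reports which OSDs a PG maps to; it does not change anything by itself, and the step that acts on the creating PG is the OSD restart. Rather than restarting every OSD, one option is to restart just the OSDs in the PG's acting set. The sketch below is a hypothetical acting_osds function that extracts those OSD IDs from ceph pg map output, assuming the one-line format "osdmap eN pg <pgid> (<pgid>) -> up [...] acting [...]"; the sample lines in the test are illustrative, not taken from this cluster.

```shell
#!/bin/bash
# Hypothetical helper: print the OSD IDs in the acting set of each
# `ceph pg map` line read on stdin, one per line, without duplicates.
# Assumed input format:
#   osdmap e13 pg 0.18 (0.18) -> up [0] acting [0]
acting_osds() {
  # Keep only the comma-separated list inside "acting [...]",
  # split it into one ID per line, drop empty lines, sort numerically
  sed -n 's/.* acting \[\([0-9,]*\)\].*/\1/p' \
    | tr ',' '\n' \
    | sed '/^$/d' \
    | sort -un
}
```

For example: for osd in $(ceph pg map 0.18 | acting_osds); do systemctl restart "ceph-osd@$osd"; done. This assumes the OSDs run on the local host; on a multi-node cluster, each restart has to run on the node that hosts that OSD.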

Then restart the services.
