Notes on a serious Ceph cluster failure

520nobody

Published 2023-07-09 01:06:21

Tags: ceph

Original link: https://www.cnblogs.com/liangjiongyao/p/9370864.html

Problem: one disk in the cluster has failed, and the PG states look wrong
[root@ceph-1 ~]# ceph -s
cluster 72f44b06-b8d3-44cc-bb8b-2048f5b4acfe
health HEALTH_WARN
64 pgs degraded
64 pgs stuck degraded
64 pgs stuck unclean
64 pgs stuck undersized
64 pgs undersized
recovery 269/819 objects degraded (32.845%)
monmap e1: 1 mons at {ceph-1=192.168.101.11:6789/0}
election epoch 6, quorum 0 ceph-1
osdmap e38: 3 osds: 2 up, 2 in; 64 remapped pgs
flags sortbitwise,require_jewel_osds
pgmap v14328: 72 pgs, 2 pools, 420 bytes data, 275 objects
217 MB used, 40720 MB / 40937 MB avail
269/819 objects degraded (32.845%)
64 active+undersized+degraded
8 active+clean

[root@ceph-1 ~]# ceph osd tree
ID WEIGHT TYPE NAME UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 0.05846 root default
-2 0.01949 host ceph-1
0 0.01949 osd.0 up 1.00000 1.00000
-3 0.01949 host ceph-2
1 0.01949 osd.1 up 1.00000 1.00000
-4 0.01949 host ceph-3
2 0.01949 osd.2 down 0 1.00000
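
On a larger cluster the failed OSDs are harder to spot by eye, so the plain-text tree can be filtered. The following is a minimal sketch, assuming the Jewel-era column layout shown above (OSD rows read ID, WEIGHT, NAME, UP/DOWN, and so on). `down_osds` is a hypothetical helper name, not part of Ceph, and parsing `ceph osd tree -f json` would likely be more robust across versions.

```shell
#!/bin/sh
# down_osds: read `ceph osd tree` plain-text output on stdin and print
# the names of OSDs whose UP/DOWN column says "down".
# Assumption: OSD rows look like "2 0.01949 osd.2 down 0 1.00000".
down_osds() {
    awk '$3 ~ /^osd\.[0-9]+$/ && $4 == "down" { print $3 }'
}

# Example on a captured snippet; on a live cluster: ceph osd tree | down_osds
printf '%s\n' \
    '-3 0.01949 host ceph-2' \
    '1 0.01949 osd.1 up 1.00000 1.00000' \
    '-4 0.01949 host ceph-3' \
    '2 0.01949 osd.2 down 0 1.00000' | down_osds
```

Host rows are skipped because their third column is the bucket type rather than an `osd.N` name.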

Mark osd.2 as out
root@ceph-1:~# ceph osd out osd.2
osd.2 is already out.

Remove it from the cluster
root@ceph-1:~# ceph osd rm osd.2
removed osd.2

Remove it from the CRUSH map
root@ceph-1:~# ceph osd crush rm osd.2
removed item id 2 name 'osd.2' from crush map

Delete the authentication entry for osd.2
root@ceph02:~# ceph auth del osd.2
updated
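
The four commands above can be collected into one reviewable plan. This is a minimal sketch that only prints the commands and executes nothing. It uses the order commonly given for manual OSD removal (out, stop the daemon, crush remove, auth del, osd rm), which differs slightly from the session above, and it assumes a systemd host where the unit is named `ceph-osd@<id>`. `osd_removal_plan` is a hypothetical helper name.

```shell
#!/bin/sh
# osd_removal_plan: print (dry run) the commands to retire one OSD.
# Nothing is executed; review the output and run it by hand.
# Assumption: systemd unit name ceph-osd@<id>, run on the OSD's host.
osd_removal_plan() {
    id="$1"
    case "$id" in
        ''|*[!0-9]*)
            echo "usage: osd_removal_plan <numeric-osd-id>" >&2
            return 1
            ;;
    esac
    echo "ceph osd out osd.$id"
    echo "systemctl stop ceph-osd@$id"
    echo "ceph osd crush remove osd.$id"
    echo "ceph auth del osd.$id"
    echo "ceph osd rm osd.$id"
}

osd_removal_plan 2
```

Stopping the ceph-osd daemon before touching the data directory is the step the session above skipped, which is probably why the later umount reports the target as busy.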

umount fails
[root@ceph-3 ~]# umount /dev/vdb1
umount: /var/lib/ceph/osd/ceph-2: target is busy.
(In some cases useful info about processes that use
the device is found by lsof(8) or fuser(1))

Kill the ceph user's process that is holding the mount point
[root@ceph-3 ~]# fuser -mv /var/lib/ceph/osd/ceph-2
USER PID ACCESS COMMAND
/var/lib/ceph/osd/ceph-2:
root kernel mount /var/lib/ceph/osd/ceph-2
ceph 1517 F… ceph-osd
[root@ceph-3 ~]# kill -9 1517
[root@ceph-3 ~]# fuser -mv /var/lib/ceph/osd/ceph-2
USER PID ACCESS COMMAND
/var/lib/ceph/osd/ceph-2:
root kernel mount /var/lib/ceph/osd/ceph-2
[root@ceph-3 ~]# umount /var/lib/ceph/osd/ceph-2

Prepare the disk again
[root@ceph-deploy my-cluster]# ceph-deploy --overwrite-conf osd prepare ceph-3:/dev/vdb1

Activate the OSD disks or partitions on all nodes
[root@ceph-deploy my-cluster]# ceph-deploy osd activate ceph-1:/dev/vdb1 ceph-2:/dev/vdb1 ceph-3:/dev/vdb1

It fails with an error…
[ceph-3][ERROR ] RuntimeError: command returned non-zero exit status: 1
[ceph_deploy][ERROR ] RuntimeError: Failed to execute command: /usr/sbin/ceph-disk -v activate --mark-init systemd --mount /dev/vdb1

In frustration, I shut the machine down and rebooted it
[root@ceph-3 ~]# init 0
Connection to 192.168.101.13 closed by remote host.
Connection to 192.168.101.13 closed.

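After the reboot the OSD is back up, but the PG problem still does not seem to be resolved (note that osd.2 and host ceph-3 now show a CRUSH weight of 0 in the tree)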
[root@ceph-1 ~]# ceph -s
cluster 72f44b06-b8d3-44cc-bb8b-2048f5b4acfe
health HEALTH_WARN
64 pgs degraded
64 pgs stuck degraded
64 pgs stuck unclean
64 pgs stuck undersized
64 pgs undersized
recovery 269/819 objects degraded (32.845%)
monmap e1: 1 mons at {ceph-1=192.168.101.11:6789/0}
election epoch 6, quorum 0 ceph-1
osdmap e53: 3 osds: 3 up, 3 in
flags sortbitwise,require_jewel_osds
pgmap v14368: 72 pgs, 2 pools, 420 bytes data, 275 objects
5446 MB used, 55960 MB / 61406 MB avail
269/819 objects degraded (32.845%)
64 active+undersized+degraded
8 active+clean
[root@ceph-1 ~]# ceph osd tree
ID WEIGHT TYPE NAME UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 0.03897 root default
-2 0.01949 host ceph-1
0 0.01949 osd.0 up 1.00000 1.00000
-3 0.01949 host ceph-2
1 0.01949 osd.1 up 1.00000 1.00000
-4 0 host ceph-3
2 0 osd.2 up 1.00000 1.00000

Add one disk each to ceph-1 and ceph-2, then create OSDs on them
[root@ceph-deploy my-cluster]# ceph-deploy --overwrite-conf osd create ceph-1:/dev/vdd ceph-2:/dev/vdd

Check the cluster status: the PG count now seems to be too small
[root@ceph-1 ~]# ceph -s
cluster 72f44b06-b8d3-44cc-bb8b-2048f5b4acfe
health HEALTH_WARN
14 pgs degraded
14 pgs stuck degraded
64 pgs stuck unclean
14 pgs stuck undersized
14 pgs undersized
recovery 188/819 objects degraded (22.955%)
recovery 200/819 objects misplaced (24.420%)
too few PGs per OSD (28 < min 30)
monmap e1: 1 mons at {ceph-1=192.168.101.11:6789/0}
election epoch 6, quorum 0 ceph-1
osdmap e63: 5 osds: 5 up, 5 in; 50 remapped pgs
flags sortbitwise,require_jewel_osds
pgmap v14408: 72 pgs, 2 pools, 420 bytes data, 275 objects
5663 MB used, 104 GB / 109 GB avail
188/819 objects degraded (22.955%)
200/819 objects misplaced (24.420%)
26 active+remapped
24 active
14 active+undersized+degraded
8 active+clean
Increase pg_num and pgp_num
[root@ceph-1 ~]# ceph osd pool set rbd pg_num 128
[root@ceph-1 ~]# ceph osd pool set rbd pgp_num 128
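
The value 128 above was picked by hand. A widely quoted rule of thumb is to aim for roughly (number of OSDs × 100) / replica count placement groups across all pools, rounded up to a power of two. The sketch below only does that arithmetic. `suggest_pg_num` is a hypothetical helper name, and the replica count of 3 in the example is an assumption, not a value confirmed by the session.

```shell
#!/bin/sh
# suggest_pg_num: rule-of-thumb total PG count for a cluster.
#   $1 = number of OSDs, $2 = pool replica count (size)
# Computes osds * 100 / size and rounds up to the next power of two.
suggest_pg_num() {
    osds="$1"
    size="$2"
    target=$(( osds * 100 / size ))
    pg=1
    while [ "$pg" -lt "$target" ]; do
        pg=$(( pg * 2 ))
    done
    echo "$pg"
}

# 5 OSDs with an assumed replica count of 3
suggest_pg_num 5 3
```

The result is a cluster-wide budget to divide among pools, not a per-pool setting, so treat it as a starting point only. On this Ceph release pg_num can be increased but not decreased, so a smaller first step is the safer choice.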

The status then turns to HEALTH_ERR…
[root@ceph-1 ~]# ceph -s
cluster 72f44b06-b8d3-44cc-bb8b-2048f5b4acfe
health HEALTH_ERR
118 pgs are stuck inactive for more than 300 seconds
118 pgs peering
118 pgs stuck inactive
128 pgs stuck unclean
recovery 16/657 objects misplaced (2.435%)
monmap e2: 2 mons at {ceph-1=192.168.101.11:6789/0,ceph-3=192.168.101.13:6789/0}
election epoch 8, quorum 0,1 ceph-1,ceph-3
osdmap e74: 5 osds: 5 up, 5 in; 55 remapped pgs
flags sortbitwise,require_jewel_osds
pgmap v14459: 136 pgs, 2 pools, 356 bytes data, 221 objects
5665 MB used, 104 GB / 109 GB avail
16/657 objects misplaced (2.435%)
73 peering
45 remapped+peering
10 active+remapped
8 active+clean
[root@ceph-1 ~]# less /etc/ceph/ceph.co

So I rebooted all three OSD machines again, and after the reboot another OSD was down
[root@ceph-1 ~]# ceph -s
2018-07-25 15:18:17.207665 7fb4ec2ee700 0 -- :/1038496581 >> 192.168.101.12:6789/0 pipe(0x7fb4e8063fa0 sd=3 :0 s=1 pgs=0 cs=0 l=1 c=0x7fb4e805c610).fault
cluster 72f44b06-b8d3-44cc-bb8b-2048f5b4acfe
health HEALTH_WARN
16 pgs degraded
59 pgs stuck unclean
16 pgs undersized
recovery 134/819 objects degraded (16.361%)
recovery 88/819 objects misplaced (10.745%)
1/5 in osds are down
monmap e2: 2 mons at {ceph-1=192.168.101.11:6789/0,ceph-3=192.168.101.13:6789/0}
election epoch 12, quorum 0,1 ceph-1,ceph-3
osdmap e95: 5 osds: 4 up, 5 in; 43 remapped pgs
flags sortbitwise,require_jewel_osds
pgmap v14529: 136 pgs, 2 pools, 420 bytes data, 275 objects
5668 MB used, 104 GB / 109 GB avail
134/819 objects degraded (16.361%)
88/819 objects misplaced (10.745%)
77 active+clean
39 active+remapped
16 active+undersized+degraded
4 active

[root@ceph-1 ~]# ceph osd tree
2018-07-25 15:22:25.573039 7fe5ff87c700 0 -- :/3787750993 >> 192.168.101.12:6789/0 pipe(0x7fe604063fd0 sd=3 :0 s=1 pgs=0 cs=0 l=1 c=0x7fe60405c640).fault
ID WEIGHT TYPE NAME UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 0.10725 root default
-2 0.04388 host ceph-1
0 0.01949 osd.0 up 1.00000 1.00000
3 0.02440 osd.3 up 1.00000 1.00000
-3 0.04388 host ceph-2
1 0.01949 osd.1 down 0 1.00000
4 0.02440 osd.4 up 1.00000 1.00000
-4 0.01949 host ceph-3
2 0.01949 osd.2 up 1.00000 1.00000

After running out, rm, crush rm, and auth del for the OSD on the bad disk, the cluster is healthy again
[root@ceph-1 ~]# ceph -s
cluster 72f44b06-b8d3-44cc-bb8b-2048f5b4acfe
health HEALTH_OK
monmap e2: 2 mons at {ceph-1=192.168.101.11:6789/0,ceph-3=192.168.101.13:6789/0}
election epoch 12, quorum 0,1 ceph-1,ceph-3
osdmap e102: 4 osds: 4 up, 4 in
flags sortbitwise,require_jewel_osds
pgmap v14597: 136 pgs, 2 pools, 356 bytes data, 270 objects
5559 MB used, 86551 MB / 92110 MB avail
136 active+clean

Replace the bad disk and add the new disk back into the Ceph cluster (scaling out works the same way)
[root@ceph-deploy my-cluster]# ceph-deploy disk list ceph-2
[root@ceph-deploy my-cluster]# ceph-deploy disk zap ceph-2:vdb
[root@ceph-deploy my-cluster]# ceph-deploy --overwrite-conf osd create ceph-2:vdb:/dev/vdc1

The status now shows HEALTH_ERR
[root@ceph-1 ~]# ceph -s
cluster 72f44b06-b8d3-44cc-bb8b-2048f5b4acfe
health HEALTH_ERR
13 pgs are stuck inactive for more than 300 seconds
50 pgs degraded
2 pgs peering
1 pgs recovering
17 pgs recovery_wait
13 pgs stuck inactive
23 pgs stuck unclean
recovery 67/798 objects degraded (8.396%)
monmap e2: 2 mons at {ceph-1=192.168.101.11:6789/0,ceph-3=192.168.101.13:6789/0}
election epoch 12, quorum 0,1 ceph-1,ceph-3
osdmap e110: 5 osds: 5 up, 5 in
flags sortbitwise,require_jewel_osds
pgmap v14633: 136 pgs, 2 pools, 356 bytes data, 268 objects
5669 MB used, 104 GB / 109 GB avail
67/798 objects degraded (8.396%)
79 active+clean
32 activating+degraded
17 active+recovery_wait+degraded
5 activating
2 peering
1 active+recovering+degraded
client io 0 B/s wr, 0 op/s rd, 5 op/s wr

A little while later everything is back to normal
[root@ceph-1 ~]# ceph -s
cluster 72f44b06-b8d3-44cc-bb8b-2048f5b4acfe
health HEALTH_OK
monmap e2: 2 mons at {ceph-1=192.168.101.11:6789/0,ceph-3=192.168.101.13:6789/0}
election epoch 12, quorum 0,1 ceph-1,ceph-3
osdmap e110: 5 osds: 5 up, 5 in
flags sortbitwise,require_jewel_osds
pgmap v14666: 136 pgs, 2 pools, 356 bytes data, 267 objects
5669 MB used, 104 GB / 109 GB avail
136 active+clean
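
Since the cluster recovered by itself once peering and recovery finished, polling the health status is gentler than rebooting nodes. The following is a minimal sketch, assuming `ceph health` prints a line starting with `HEALTH_OK` when the cluster is healthy. `wait_for_health_ok` and the `HEALTH_CMD` override are hypothetical names introduced here for illustration and testing.

```shell
#!/bin/sh
# wait_for_health_ok: poll the cluster health until it reports HEALTH_OK.
#   $1 = maximum number of polls (default 30)
#   $2 = seconds to sleep between polls (default 10)
# HEALTH_CMD may be set to substitute another command for `ceph health`.
wait_for_health_ok() {
    tries="${1:-30}"
    delay="${2:-10}"
    i=0
    status=""
    while [ "$i" -lt "$tries" ]; do
        status=$(${HEALTH_CMD:-ceph health} 2>/dev/null)
        case "$status" in
            HEALTH_OK*)
                echo "healthy after $i retries"
                return 0
                ;;
        esac
        i=$(( i + 1 ))
        sleep "$delay"
    done
    echo "still not healthy: $status" >&2
    return 1
}
```

For example, `wait_for_health_ok 60 5` would poll for up to five minutes and return non-zero if the cluster has not settled by then.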

Problem: adding a mon fails
[root@ceph-deploy my-cluster]# ceph-deploy --overwrite-conf mon create ceph-2
[ceph-2][ERROR ] admin_socket: exception getting command descriptions: [Errno 2] No such file or directory
[ceph-2][WARNIN] neither public_addr nor public_network keys are defined for monitors

[root@ceph-2 ~]# less /var/log/ceph/ceph-mon.ceph-2.log
2018-07-25 15:52:02.566212 7efeec7d9780 -1 no public_addr or public_network specified, and mon.ceph-2 not present in monmap or ceph.conf

Cause: public_network is not configured in ceph.conf
[global]
fsid = 72f44b06-b8d3-44cc-bb8b-2048f5b4acfe
mon_initial_members = ceph-1,ceph-2,ceph-3
mon_host = 192.168.101.11,192.168.101.12,192.168.101.13
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
osd pool default size = 2

Edit ceph.conf
[root@ceph-deploy my-cluster]# vi ceph.conf
[global]
fsid = 72f44b06-b8d3-44cc-bb8b-2048f5b4acfe
mon_initial_members = ceph-1,ceph-2,ceph-3
mon_host = 192.168.101.11,192.168.101.12,192.168.101.13
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
osd pool default size = 2
public_network = 192.168.122.0/24
cluster_network = 192.168.101.0/24

Push the new configuration file to every node
[root@ceph-deploy my-cluster]# ceph-deploy --overwrite-conf config push ceph-1 ceph-2 ceph-3

Add ceph-2 as a mon
[root@ceph-deploy my-cluster]# ceph-deploy mon add ceph-2

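After it was added successfully, I found that ceph-2's IP in the mon cluster differs from the others. Going by the configuration file, ceph-1 and ceph-3 should be on the 122 subnet as well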
[root@ceph-1 ~]# ceph -s
cluster 72f44b06-b8d3-44cc-bb8b-2048f5b4acfe
health HEALTH_OK
monmap e3: 3 mons at {ceph-1=192.168.101.11:6789/0,ceph-2=192.168.122.12:6789/0,ceph-3=192.168.101.13:6789/0}
election epoch 14, quorum 0,1,2 ceph-1,ceph-3,ceph-2
osdmap e110: 5 osds: 5 up, 5 in
flags sortbitwise,require_jewel_osds
pgmap v14666: 136 pgs, 2 pools, 356 bytes data, 267 objects
5669 MB used, 104 GB / 109 GB avail
136 active+clean
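
The mismatch comes from mon_host listing 192.168.101.x addresses while public_network is 192.168.122.0/24. A check like the one below could catch this before the configuration is pushed. It is a minimal sketch for IPv4 only, using the values from the ceph.conf shown earlier. `ip_to_int` and `ip_in_cidr` are hypothetical helper names.

```shell
#!/bin/sh
# ip_to_int: convert a dotted-quad IPv4 address to a 32-bit integer.
ip_to_int() {
    old_ifs=$IFS
    IFS=.
    set -- $1          # split the address into four octets
    IFS=$old_ifs
    echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
}

# ip_in_cidr: succeed if address $1 lies inside network $2 (e.g. 10.0.0.0/8).
ip_in_cidr() {
    net=${2%/*}
    bits=${2#*/}
    mask=$(( (0xFFFFFFFF << (32 - bits)) & 0xFFFFFFFF ))
    [ $(( $(ip_to_int "$1") & mask )) -eq $(( $(ip_to_int "$net") & mask )) ]
}

# Values taken from the ceph.conf in this article.
public_network=192.168.122.0/24
for addr in 192.168.101.11 192.168.101.12 192.168.101.13; do
    if ip_in_cidr "$addr" "$public_network"; then
        echo "$addr is inside $public_network"
    else
        echo "$addr is OUTSIDE $public_network"
    fi
done
```

All three mon_host addresses fall outside the declared public network here, which suggests that the network settings and mon_host should be reconciled before any monitor is added or removed.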

So I changed the mon node IPs in ceph.conf to the 122 subnet
[root@ceph-deploy my-cluster]# vi ceph.conf
[global]
fsid = 72f44b06-b8d3-44cc-bb8b-2048f5b4acfe
mon_initial_members = ceph-1,ceph-2,ceph-3
mon_host = 192.168.122.11,192.168.122.12,192.168.122.13
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
osd pool default size = 2
public_network = 192.168.122.0/24
cluster_network = 192.168.101.0/24

Push the configuration again
[root@ceph-deploy my-cluster]# ceph-deploy --overwrite-conf config push ceph-1 ceph-2 ceph-3

Destroy two of the mons
[root@ceph-deploy my-cluster]# ceph-deploy mon destroy ceph-1 ceph-3

After that the whole cluster stopped working
[root@ceph-1 ~]# ceph -s
2018-07-25 16:35:21.723736 7f47dedfb700 0 -- 192.168.122.11:0/4277586904 >> 192.168.122.13:6789/0 pipe(0x7f47c8000c80 sd=4 :0 s=1 pgs=0 cs=0 l=1 c=0x7f47c8001f90).fault with nothing to send, going to standby
2018-07-25 16:35:27.723930 7f47dedfb700 0 -- 192.168.122.11:0/4277586904 >> 192.168.122.11:6789/0 pipe(0x7f47c8005330 sd=3 :0 s=1 pgs=0 cs=0 l=1 c=0x7f47c8002410).fault with nothing to send, going to standby
2018-07-25 16:35:33.725130 7f47deffd700 0 -- 192.168.122.11:0/4277586904 >> 192.168.122.13:6789/0 pipe(0x7f47c8005330 sd=3 :0 s=1 pgs=0 cs=0 l=1 c=0x7f47c80046e0).fault with nothing to send, going to standby

[root@ceph-1 ~]# ceph osd tree
2018-07-25 16:35:21.723736 7f47dedfb700 0 -- 192.168.122.11:0/4277586904 >> 192.168.122.13:6789/0 pipe(0x7f47c8000c80 sd=4 :0 s=1 pgs=0 cs=0 l=1 c=0x7f47c8001f90).fault with nothing to send, going to standby
2018-07-25 16:35:27.723930 7f47dedfb700 0 -- 192.168.122.11:0/4277586904 >> 192.168.122.11:6789/0 pipe(0x7f47c8005330 sd=3 :0 s=1 pgs=0 cs=0 l=1 c=0x7f47c8002410).fault with nothing to send, going to standby
2018-07-25 16:35:33.725130 7f47deffd700 0 -- 192.168.122.11:0/4277586904 >> 192.168.122.13:6789/0 pipe(0x7f47c8005330 sd=3 :0 s=1 pgs=0 cs=0 l=1 c=0x7f47c80046e0).fault with nothing to send, going to standby

It seems they cannot be added back either
[root@ceph-deploy my-cluster]# ceph-deploy mon add ceph-1 ceph-3
[ceph-1][WARNIN] 2018-07-25 16:37:52.760218 7f06739b9700 0 -- 192.168.122.11:0/2929495808 >> 192.168.122.11:6789/0 pipe(0x7f0668000c80 sd=3 :0 s=1 pgs=0 cs=0 l=1 c=0x7f0668005c20).fault with nothing to send, going to standby
[ceph-1][WARNIN] 2018-07-25 16:37:55.760830 7f06738b8700 0 -- 192.168.122.11:0/2929495808 >> 192.168.122.13:6789/0 pipe(0x7f066800d5e0 sd=4 :0 s=1 pgs=0 cs=0 l=1 c=0x7f066800e8a0).fault with nothing to send, going to standby
[ceph-1][WARNIN] 2018-07-25 16:37:58.760748 7f06739b9700 0 -- 192.168.122.11:0/2929495808 >> 192.168.122.11:6789/0 pipe(0x7f0668000c80 sd=4 :0 s=1 pgs=0 cs=0 l=1 c=0x7f066800be40).fault with nothing to send, going to standby

Not worried about making things worse, I destroyed the last mon as well
[root@ceph-deploy my-cluster]# ceph-deploy mon destroy ceph-2

[root@ceph-deploy my-cluster]# ceph-deploy new ceph-1 ceph-2 ceph-3

[root@ceph-deploy my-cluster]# ceph-deploy --overwrite-conf mon create-initial
[ceph-1][ERROR ] "ceph auth get-or-create for keytype admin returned 22
[ceph-1][DEBUG ] Error EINVAL: unknown cap type 'mgr'
[ceph-1][ERROR ] Failed to return 'admin' key from host ceph-1
[ceph-2][ERROR ] "ceph auth get-or-create for keytype admin returned 22
[ceph-2][DEBUG ] Error EINVAL: unknown cap type 'mgr'
[ceph-2][ERROR ] Failed to return 'admin' key from host ceph-2
[ceph-3][ERROR ] "ceph auth get-or-create for keytype admin returned 22
[ceph-3][DEBUG ] Error EINVAL: unknown cap type 'mgr'
[ceph-3][ERROR ] Failed to return 'admin' key from host ceph-3
[ceph_deploy.gatherkeys][ERROR ] Failed to connect to host:ceph-1, ceph-2, ceph-3
[ceph_deploy.gatherkeys][INFO ] Destroy temp directory /tmp/tmpnPWk4d
[ceph_deploy][ERROR ] RuntimeError: Failed to connect any mon

[root@ceph-deploy my-cluster]# ceph-deploy mon add ceph-1
[ceph-1][INFO ] monitor: mon.ceph-1 is running

[root@ceph-deploy my-cluster]# ceph-deploy mon add ceph-2
[ceph-2][INFO ] monitor: mon.ceph-2 is running

[root@ceph-deploy my-cluster]# ceph-deploy mon add ceph-3
[ceph-3][INFO ] monitor: mon.ceph-3 is running

[root@ceph-1 ceph-ceph-1]# ceph -s
2018-07-25 20:42:07.965513 7f1482a91700 0 librados: client.admin authentication error (1) Operation not permitted
Error connecting to cluster: PermissionError

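Normally, running ceph -s starts a client that connects to the Ceph cluster, and by default this client logs in with the client.admin user and its key. A plain ceph -s is therefore equivalent to running ceph -s --name client.admin --keyring /etc/ceph/ceph.client.admin.keyring. Note that every Ceph command run on the command line starts a client, talks to the cluster, and then closes the client again. Here is a very common error that is easy to run into when you are new to Ceph: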

[root@blog ~]# ceph -s
2017-08-03 02:22:27.352516 7fbd157b7700 0 librados: client.admin authentication error (1) Operation not permitted
Error connecting to cluster: PermissionError

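The error message is easy to understand: the operation is not permitted, meaning authentication failed. Because the default client.admin user and its key are used here, the key does not match what the Ceph cluster has on record. In other words, the content of /etc/ceph/ceph.client.admin.keyring was most likely left over from a previous cluster, or it records a wrong key. In that case, simply run ceph auth list as the mon. user to see the correct key: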

[root@ceph-1 ceph]# ceph auth get client.admin --name mon. --keyring /var/lib/ceph/mon/ceph-ceph-1/keyring
Error ENOENT: failed to find client.admin in keyring
[root@ceph-1 ceph]#

Take a look at the cluster as the mon. user
[root@ceph-1 ceph]# ceph -s --name mon. --keyring /var/lib/ceph/mon/ceph-ceph-1/keyring
cluster 053670e9-9b12-4297-aa04-41c430091f90
health HEALTH_ERR
64 pgs are stuck inactive for more than 300 seconds
64 pgs stuck inactive
64 pgs stuck unclean
no osds
monmap e1: 3 mons at {ceph-1=192.168.101.11:6789/0,ceph-2=192.168.101.12:6789/0,ceph-3=192.168.101.13:6789/0}
election epoch 8, quorum 0,1,2 ceph-1,ceph-2,ceph-3
osdmap e1: 0 osds: 0 up, 0 in
flags sortbitwise,require_jewel_osds
pgmap v2: 64 pgs, 1 pools, 0 bytes data, 0 objects
0 kB used, 0 kB / 0 kB avail
64 creating

Get the key for client.admin
[root@ceph-1 ceph]# ceph auth get client.admin --name mon. --keyring /var/lib/ceph/mon/ceph-ceph-1/keyring
Error ENOENT: failed to find client.admin in keyring

Add the client.admin user
[root@ceph-1 ceph]# ceph auth add client.admin --name mon. --keyring /var/lib/ceph/mon/ceph-ceph-1/keyring

Get the client.admin key again
[root@ceph-1 ceph]# ceph auth get client.admin --name mon. --keyring /var/lib/ceph/mon/ceph-ceph-1/keyring
exported keyring for client.admin
[client.admin]
key = AQAIf1hbmuPXBxAA5Q3g/Jz8gerf+S6znEHLBQ==

Update the local client.admin key
[root@ceph-1 ceph]# vi ceph.client.admin.keyring
[client.admin]

key = AQAnPVBbJJWsMhAAKEqaHkWdwEWndOvqDjtjXA==

    key = AQAIf1hbmuPXBxAA5Q3g/Jz8gerf+S6znEHLBQ==
    caps mds = "allow *"
    caps mon = "allow *"
    caps osd = "allow *"
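
To confirm that the local keyring really holds the key the monitors know about, the key can be extracted from the file and compared instead of being copied by eye. The following is a minimal sketch, assuming the keyring layout shown above (a `[entity]` header followed by a `key = ...` line). `keyring_key` is a hypothetical helper name.

```shell
#!/bin/sh
# keyring_key: read keyring text on stdin and print the key of entity $1.
# Assumption: sections start with "[entity]" and contain "key = <value>".
keyring_key() {
    awk -v section="[$1]" '
        /^\[/ { in_section = ($0 == section) }
        in_section && $1 == "key" && $2 == "=" { print $3; exit }
    '
}

# Example on the keyring content shown in this article.
printf '%s\n' \
    '[client.admin]' \
    '    key = AQAIf1hbmuPXBxAA5Q3g/Jz8gerf+S6znEHLBQ==' \
    '    caps mds = "allow *"' | keyring_key client.admin
```

On a live node the input would be `/etc/ceph/ceph.client.admin.keyring`, and the result could be compared with the output of `ceph auth get-key client.admin --name mon. --keyring /var/lib/ceph/mon/ceph-ceph-1/keyring`. A mismatch means the file still carries a stale key.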

Check the cluster status
[root@ceph-1 ceph]# ceph -s
2018-07-25 21:50:40.512039 7f0ca92d0700 0 librados: client.admin authentication error (13) Permission denied

Grant capabilities to the client.admin user
[root@ceph-1 ceph]# ceph auth add client.admin mon 'allow r' osd 'allow rw'
2018-07-25 21:57:45.263271 7f68398ea700 0 librados: client.admin authentication error (13) Permission denied

I had forgotten to add read permission to the ceph.client.admin.keyring that the earlier mon create-initial generated
[root@ceph-1 ceph]# chmod +r /etc/ceph/ceph.client.admin.keyring

[root@ceph-1 ceph]# ceph -s
2018-07-25 22:06:17.167512 7f449b116700 0 librados: client.admin authentication error (13) Permission denied

Grant capabilities to the client.admin user again
[root@ceph-1 ceph]# ceph auth add client.admin mon 'allow r' osd 'allow rw' --name mon. --keyring /var/lib/ceph/mon/ceph-ceph-1/keyring
Error EINVAL: entity client.admin exists but caps do not match

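After a lot of struggle I finally found a method via Google. Once the client.admin capabilities were restored, I saw that all of the cluster's OSDs were gone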
[root@ceph-1 ~]# cd /var/lib/ceph/mon
[root@ceph-1 mon]# ls
ceph-ceph-1
[root@ceph-1 mon]# cd ceph-ceph-1/
[root@ceph-1 ceph-ceph-1]# ls
done keyring store.db systemd
[root@ceph-1 ceph-ceph-1]# ceph -n mon. --keyring keyring auth caps client.admin mds 'allow *' osd 'allow *' mon 'allow *'
updated caps for client.admin
[root@ceph-1 ceph-ceph-1]# ceph -s
cluster 053670e9-9b12-4297-aa04-41c430091f90
health HEALTH_ERR
64 pgs are stuck inactive for more than 300 seconds
64 pgs stuck inactive
64 pgs stuck unclean
no osds
monmap e1: 3 mons at {ceph-1=192.168.101.11:6789/0,ceph-2=192.168.101.12:6789/0,ceph-3=192.168.101.13:6789/0}
election epoch 16, quorum 0,1,2 ceph-1,ceph-2,ceph-3
osdmap e1: 0 osds: 0 up, 0 in
flags sortbitwise,require_jewel_osds
pgmap v2: 64 pgs, 1 pools, 0 bytes data, 0 objects
0 kB used, 0 kB / 0 kB avail
64 creating

[root@ceph-1 ceph-ceph-1]# ceph osd tree
ID WEIGHT TYPE NAME UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 0 root default

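Running lsblk on each node shows that all the OSD mount points have been unmounted automatically. I took the opportunity to adjust the disk sizes and make them all 20G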
[root@ceph-1 ceph-ceph-1]# lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sr0 11:0 1 1024M 0 rom
vda 252:0 0 100G 0 disk
├─vda1 252:1 0 1G 0 part /boot
└─vda2 252:2 0 99G 0 part
├─centos-root 253:0 0 50G 0 lvm /
├─centos-swap 253:1 0 2G 0 lvm [SWAP]
└─centos-home 253:2 0 47G 0 lvm /home
vdb 252:16 0 20G 0 disk
└─vdb1 252:17 0 20G 0 part
vdc 252:32 0 20G 0 disk
└─vdc1 252:33 0 5G 0 part
vdd 252:48 0 30G 0 disk
├─vdd1 252:49 0 25G 0 part
└─vdd2 252:50 0 5G 0 part

Re-format the disks
[root@ceph-deploy my-cluster]# ceph-deploy disk zap ceph-1:vdb ceph-2:vdb ceph-3:vdb
[root@ceph-deploy my-cluster]# ceph-deploy osd prepare ceph-1:vdb:vdc ceph-2:vdb:vdc ceph-3:vdb:vdc

Activate the OSD; the failure appears to be caused by OSD authentication failing
[root@ceph-deploy my-cluster]# ceph-deploy osd activate ceph-1:vdb1:vdc
[ceph-1][WARNIN] ceph_disk.main.Error: Error: ceph osd create failed: Command '/usr/bin/ceph' returned non-zero exit status 1: 2018-07-26 10:34:36.851527 7f678c625700 0 librados: client.bootstrap-osd authentication error (1) Operation not permitted
[ceph-1][WARNIN] Error connecting to cluster: PermissionError
[ceph-1][WARNIN]
[ceph-1][ERROR ] RuntimeError: command returned non-zero exit status: 1
[ceph_deploy][ERROR ] RuntimeError: Failed to execute command: /usr/sbin/ceph-disk -v activate --mark-init systemd --mount /dev/vdb1

I will stop investigating here for now and leave this cluster as it is until I properly understand cephx

For a reinstall, see here
ceph-deploy purgedata {ceph-node} [{ceph-node}] ## wipe the data
ceph-deploy forgetkeys ## delete the previously generated keys
ceph-deploy purge {ceph-node} [{ceph-node}] ## uninstall the Ceph packages
If you execute purge, you must re-install Ceph.

ceph-deploy new {initial-monitor-node(s)}
ceph-deploy install {ceph-node} [{ceph-node} ...]
ceph-deploy mon create-initial
ceph-deploy disk list {node-name [node-name]…}
ceph-deploy disk zap osdserver1:sda
ceph-deploy osd prepare ceph-osd1:/dev/sda ceph-osd1:/dev/sdb
ceph-deploy osd activate ceph-osd1:/dev/sda1 ceph-osd1:/dev/sdb1
ceph-deploy admin {admin-node} {ceph-node}
chmod +r /etc/ceph/ceph.client.admin.keyring
