Solution Overview
Overview of the LVM Approach
The previous article built a Ceph cluster on Bcache; this one uses LVM instead. Bcache performs well but has drawbacks: the module is not loaded by default, so the kernel may need to be rebuilt, and Bcache tolerates failures poorly. After a failure it is fragile and hard to recover; usually the only option is to rebuild the Bcache device and recreate the OSD.
LVM (Logical Volume Manager) is a tool for managing disk partitions and logical volumes on Linux. It lets administrators carve physical disks into flexible logical volumes and provides advanced features such as snapshots, online capacity expansion, and live migration of logical volumes.
LVM Cache is an LVM feature that pairs a fast caching device (such as an SSD) with an ordinary hard disk to improve I/O performance. With the fast device acting as a caching tier, LVM Cache significantly cuts the time spent reading from disk and speeds up system response.
- Use case: SSDs serve as a cache in front of ordinary hard disks to improve I/O performance.
Cluster Environment Planning
Physical Network
The physical network is exactly the same as in the earlier Bcache deployment:
IP Address Plan
The IP address plan also matches the Bcache deployment:
Node | 1GbE (mgmt) | public | cluster |
node01 | 192.168.13.1 | 188.188.13.1 | 10.10.13.1 |
node02 | 192.168.13.2 | 188.188.13.2 | 10.10.13.2 |
node03 | 192.168.13.3 | 188.188.13.3 | 10.10.13.3 |
node04 | 192.168.13.4 | 188.188.13.4 | 10.10.13.4 |
client01 | 192.168.13.101 | 188.188.13.101 | - |
Disk Layout
For Ceph, each NVMe drive is split into five partitions of 350 GiB each.
Per OSD, 30 GiB goes to the DB and 15 GiB to the WAL, leaving 300 GiB of effective cache (the cache also needs an additional 2 GiB of metadata); 30 + 15 + 300 + 2 = 347 GiB, which fits inside the 350 GiB partition.
On physical machines, put all HDDs in JBOD mode. Here a virtual machine simulates 20 × 8 TB HDDs and 4 × 1.8 TB NVMe SSDs.
The Ceph release used is ceph version 15.2.11 octopus (stable).
Storage Node Base Environment
Operating System
The deployment is based on the following OS.
[root@node01 ~]# cat /etc/bclinux-release
BigCloud Enterprise Linux release 8.2.2107 (Core)
Configure Hostnames
Set the hostname on nodes node01-04; node01 is shown as an example.
[root@localhost ~]# hostnamectl set-hostname node01
Configure hosts
Configure /etc/hosts on nodes node01-04; node01 is shown as an example.
127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
::1 localhost localhost.localdomain localhost6 localhost6.localdomain6
192.168.13.1 node01
192.168.13.2 node02
192.168.13.3 node03
192.168.13.4 node04
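Rather than editing /etc/hosts on each node by hand, the finished file can be pushed from node01. This is a sketch, not part of the original procedure: it only prints the copy commands for review, and the output can be piped to `sh` to execute them (until passwordless SSH is configured later in this guide, each scp prompts for a password).

```shell
# Print one scp command per remaining node; pipe to `sh` to actually run them.
for n in node02 node03 node04; do
  echo "scp /etc/hosts $n:/etc/hosts"
done
```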
Firewall
Disable the firewall on all storage nodes; run the following on node01-node04.
[root@localhost ~]# systemctl stop firewalld
[root@localhost ~]# systemctl disable firewalld
Removed /etc/systemd/system/multi-user.target.wants/firewalld.service.
Removed /etc/systemd/system/dbus-org.fedoraproject.FirewallD1.service.
SELinux
Disable SELinux on all storage nodes (node01-node04), then reboot them. (Alternatively, setenforce 0 applies the change immediately without a reboot.)
[root@localhost ~]# cat /etc/selinux/config
# This file controls the state of SELinux on the system.
# SELINUX= can take one of these three values:
# enforcing - SELinux security policy is enforced.
# permissive - SELinux prints warnings instead of enforcing.
# disabled - No SELinux policy is loaded.
SELINUX=disabled
# SELINUXTYPE= can take one of these three values:
# targeted - Targeted processes are protected,
# minimum - Modification of targeted policy. Only selected processes are protected.
# mls - Multi Level Security protection.
SELINUXTYPE=targeted
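The edit above can be scripted instead of done by hand. The snippet below is a sketch: it works on a scratch copy (the /tmp path and the seed contents are illustrative) so it is safe to run anywhere; on a real node, point F at /etc/selinux/config and follow up with setenforce 0.

```shell
# Demonstrate the SELinux edit on a scratch copy of the config file.
F=/tmp/selinux.config.demo
printf 'SELINUX=enforcing\nSELINUXTYPE=targeted\n' > "$F"   # stand-in for /etc/selinux/config
sed -i 's/^SELINUX=.*/SELINUX=disabled/' "$F"               # flip the mode line only
grep '^SELINUX=' "$F"
```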
Passwordless SSH
Set up a trust relationship so that node01 can reach every storage node without a password.
Generate a key pair
[root@node01 ~]# ssh-keygen -t rsa
Generating public/private rsa key pair.
Enter file in which to save the key (/root/.ssh/id_rsa):
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /root/.ssh/id_rsa.
Your public key has been saved in /root/.ssh/id_rsa.pub.
……
| . o. S |
|=ooo o. |
|=X= o+.+ . |
|*E*=+++ + |
|&BB+**B= |
+----[SHA256]-----+
Copy the key from node01 to each of the other nodes:
[root@node01 ~]# ssh-copy-id node02
/usr/bin/ssh-copy-id: INFO: Source of key(s) to be installed: "/root/.ssh/id_rsa.pub"
/usr/bin/ssh-copy-id: INFO: attempting to log in with the new key(s), to filter out any that are already installed
/usr/bin/ssh-copy-id: INFO: 1 key(s) remain to be installed -- if you are prompted now it is to install the new keys
root@node02's password:
Number of key(s) added: 1
Now try logging into the machine, with: "ssh 'node02'"
and check to make sure that only the key(s) you wanted were added.
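The three ssh-copy-id runs can be driven from one loop. As a sketch, this prints the commands for review; pipe the output to `sh` to execute them (each node prompts for its root password once).

```shell
# Print one ssh-copy-id command per remaining node; pipe to `sh` to run them.
for n in node02 node03 node04; do
  echo "ssh-copy-id $n"
done
```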
Configure NTP
Make node01 the chrony server by editing /etc/chrony.conf.
# Use public servers from the pool.ntp.org project.
# Please consider joining the pool (http://www.pool.ntp.org/join.html).
pool 192.168.13.1 iburst
……
# Allow NTP client access from local network.
allow 192.168.0.0/16
# Serve time even if not synchronized to a time source.
local stratum 10
……
Point node02-node04 at node01 for time sync by editing /etc/chrony.conf.
# Use public servers from the pool.ntp.org project.
# Please consider joining the pool (http://www.pool.ntp.org/join.html).
pool 192.168.13.1 iburst
……
Restart chronyd on node01-node04 and confirm that time is synchronized.
[root@node01 ~]# systemctl restart chronyd
[root@node01 ~]# chronyc sources -v
.-- Source mode '^' = server, '=' = peer, '#' = local clock.
/ .- Source state '*' = current best, '+' = combined, '-' = not combined,
| / 'x' = may be in error, '~' = too variable, '?' = unusable.
|| .- xxxx [ yyyy ] +/- zzzz
|| Reachability register (octal) -. | xxxx = adjusted offset,
|| Log2(Polling interval) --. | | yyyy = measured offset,
|| \ | | zzzz = estimated error.
|| | | \
MS Name/IP address Stratum Poll Reach LastRx Last sample
===============================================================================
^? node01 0 6 3 - +0ns[ +0ns] +/- 0ns
[root@node01 ~]# date
Sun Dec 24 23:31:24 CST 2023
Configure a Local Yum Repository
node01 acts as the local yum server: after installing a web server, it shares the contents of the Linux ISO and the Ceph packages downloaded from the Internet over HTTP.
On node01, install httpd and createrepo from the ISO or an online yum repository.
[root@node01 ~]# yum install httpd createrepo -y
Unable to connect to Registration Management Service
Last metadata expiration check: 0:01:57 ago on Mon 25 Dec 2023 07:16:31 PM CST.
Dependencies resolved.
==============================================================================
Package Architecture Version
================================================================================
Installing:
createrepo_c x86_64 0.15.1-2.el8
httpd x86_64 2.4.37-21.0.1.module+el8.2.0+10157+66773459
Installing dependencies:
apr x86_64 1.6.3-9.el8
……
Complete!
[root@node01 ~]#
Copy the locally cached Ceph RPMs, plus the BaseOS and AppStream directories from the OS image, into /var/www/html/.
[root@node01 home]# ls
ceph ceph.15.2.11.tar.gz
[root@node01 home]# mv ceph /var/www/html/
[root@node01 home]# cp -r /media/BaseOS/ /var/www/html/
[root@node01 home]# cp -r /media/AppStream/ /var/www/html/
[root@node01 home]#
Create the repodata directories (the metadata yum needs to install RPMs), then start httpd.
[root@node01 ~]# createrepo /var/www/html/ceph/
Directory walk started
Directory walk done - 180 packages
Temporary output repo path: /var/www/html/ceph/.repodata/
Preparing sqlite DBs
Pool started (with 5 workers)
Pool finished
[root@node01 ~]# createrepo /var/www/html/BaseOS/
Directory walk started
Directory walk done - 1440 packages
Temporary output repo path: /var/www/html/BaseOS/.repodata/
Preparing sqlite DBs
Pool started (with 5 workers)
Pool finished
[root@node01 ~]# createrepo /var/www/html/AppStream/
Directory walk started
Directory walk done - 4837 packages
Temporary output repo path: /var/www/html/AppStream/.repodata/
Preparing sqlite DBs
Pool started (with 5 workers)
Pool finished
[root@node01 ~]# systemctl restart httpd
[root@node01 ~]#
Check the local repository in a browser.
Configure the repo Files
Point node01-node04 at the offline repository built above by editing their repo files, then verify that the repository works. (Back up the stock repo files under /etc/yum.repos.d/ to another directory first.)
[root@node01 yum.repos.d]# pwd
/etc/yum.repos.d
[root@node01 yum.repos.d]# cat local.repo
[baseos]
name=baseos
baseurl=http://192.168.13.1/BaseOS
enabled=1
gpgcheck=0
[app]
name=appstream
baseurl=http://192.168.13.1/AppStream
enabled=1
gpgcheck=0
[ceph]
name=ceph
baseurl=http://192.168.13.1/ceph
enabled=1
gpgcheck=0
priority=2
[root@node01 yum.repos.d]# yum makecache
Unable to connect to Registration Management Service
ceph 72 MB/s | 241 kB 00:00
baseos 3.0 MB/s | 3.0 kB 00:00
appstream 291 MB/s | 5.7 MB 00:00
Metadata cache created.
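Once the repo file is verified on node01, the same local.repo can be pushed to the other nodes and the cache rebuilt there. A sketch (assumes the passwordless SSH set up earlier): it prints the commands for review, and the output can be piped to `sh` to execute.

```shell
# Print the copy-and-refresh commands per node; pipe to `sh` to run them.
for n in node02 node03 node04; do
  echo "scp /etc/yum.repos.d/local.repo $n:/etc/yum.repos.d/"
  echo "ssh $n yum makecache"
done
```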
LVM Cache Setup
Partitioning
Each 1.8 TB NVMe SSD fronts five 8 TB HDDs, so every SSD is carved into 350 GiB partitions, as shown below.
[root@node01 ~]# lsblk |grep nvme
nvme0n1 259:0 0 1.8T 0 disk
nvme1n1 259:1 0 1.8T 0 disk
nvme2n1 259:2 0 1.8T 0 disk
nvme3n1 259:3 0 1.8T 0 disk
[root@node01 ~]# parted /dev/nvme0n1 print
Error: /dev/nvme0n1: unrecognised disk label
Model: NVMe Device (nvme)
Disk /dev/nvme0n1: 1933GB
Sector size (logical/physical): 512B/512B
Partition Table: unknown
Disk Flags:
[root@node01 ~]#
node01's nvme0n1 is shown; apply the same steps to the other NVMe drives on every node.
[root@node01 ~]# parted /dev/nvme0n1 mklabel gpt
Information: You may need to update /etc/fstab.
[root@node01 ~]# parted /dev/nvme0n1 mkpart primary 1MiB 350GiB
Information: You may need to update /etc/fstab.
[root@node01 ~]# parted /dev/nvme0n1 mkpart primary 350GiB 700GiB
Information: You may need to update /etc/fstab.
[root@node01 ~]# parted /dev/nvme0n1 mkpart primary 700GiB 1050GiB
Information: You may need to update /etc/fstab.
[root@node01 ~]# parted /dev/nvme0n1 mkpart primary 1050GiB 1400GiB
Information: You may need to update /etc/fstab.
[root@node01 ~]# parted /dev/nvme0n1 mkpart primary 1400GiB 1750GiB
Information: You may need to update /etc/fstab.
[root@node01 ~]# parted /dev/nvme0n1 print
Model: NVMe Device (nvme)
Disk /dev/nvme0n1: 1933GB
Sector size (logical/physical): 512B/512B
Partition Table: gpt
Disk Flags:
Number Start End Size File system Name Flags
1 1049kB 376GB 376GB primary
2 376GB 752GB 376GB primary
3 752GB 1127GB 376GB primary
4 1127GB 1503GB 376GB primary
5 1503GB 1879GB 376GB primary
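The twenty mkpart calls follow a fixed pattern (five 350 GiB slices per drive), so they can be generated in a loop. A sketch: with DRY_RUN=1 it only prints the parted commands for review; unset DRY_RUN to execute them on real, empty devices.

```shell
# Generate the GPT label and five 350GiB partitions for each NVMe device.
DRY_RUN=1
run() { if [ "${DRY_RUN:-}" = "1" ]; then echo "$@"; else "$@"; fi; }
for d in nvme0n1 nvme1n1 nvme2n1 nvme3n1; do
  run parted -s /dev/$d mklabel gpt
  start=1MiB
  for p in 1 2 3 4 5; do
    end=$((p * 350))GiB             # 350GiB, 700GiB, ... 1750GiB
    run parted -s /dev/$d mkpart primary $start $end
    start=$end
  done
done
```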
Create PVs
Create a PV on every HDD and every SSD partition on all nodes; node01 is shown below.
[root@node01 ~]# for i in {b..u};do pvcreate /dev/sd$i;done
Physical volume "/dev/sdb" successfully created.
Physical volume "/dev/sdc" successfully created.
……
Physical volume "/dev/sdu" successfully created.
[root@node01 ~]# for i in {1..5};do pvcreate /dev/nvme0n1p$i;done
Physical volume "/dev/nvme0n1p1" successfully created.
……
Physical volume "/dev/nvme0n1p5" successfully created.
[root@node01 ~]# for i in {1..5};do pvcreate /dev/nvme1n1p$i;done
Physical volume "/dev/nvme1n1p1" successfully created.
……
Physical volume "/dev/nvme1n1p5" successfully created.
[root@node01 ~]# for i in {1..5};do pvcreate /dev/nvme2n1p$i;done
Physical volume "/dev/nvme2n1p1" successfully created.
……
Physical volume "/dev/nvme2n1p5" successfully created.
[root@node01 ~]# for i in {1..5};do pvcreate /dev/nvme3n1p$i;done
Physical volume "/dev/nvme3n1p1" successfully created.
……
Physical volume "/dev/nvme3n1p5" successfully created.
Create VGs
Adjust the device names to match your environment. Run on all nodes; a subset of node01's disks is shown below.
[root@node01 ~]# vgcreate vgb /dev/sdb /dev/nvme0n1p1
Volume group "vgb" successfully created
[root@node01 ~]# vgcreate vgc /dev/sdc /dev/nvme0n1p2
Volume group "vgc" successfully created
[root@node01 ~]# vgcreate vgd /dev/sdd /dev/nvme0n1p3
Volume group "vgd" successfully created
[root@node01 ~]# vgcreate vge /dev/sde /dev/nvme0n1p4
Volume group "vge" successfully created
[root@node01 ~]# vgcreate vgf /dev/sdf /dev/nvme0n1p5
Volume group "vgf" successfully created
[root@node01 ~]# vgcreate vgg /dev/sdg /dev/nvme1n1p1
Volume group "vgg" successfully created
[root@node01 ~]# vgcreate vgh /dev/sdh /dev/nvme1n1p2
Volume group "vgh" successfully created
[root@node01 ~]# vgcreate vgi /dev/sdi /dev/nvme1n1p3
Volume group "vgi" successfully created
[root@node01 ~]# vgcreate vgj /dev/sdj /dev/nvme1n1p4
Volume group "vgj" successfully created
[root@node01 ~]# vgcreate vgk /dev/sdk /dev/nvme1n1p5
Volume group "vgk" successfully created
……
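The twenty vgcreate calls elided above pair HDDs sdb..sdu, in order, five per NVMe drive. A sketch that derives the pairing: with DRY_RUN=1 it prints each command for review; unset DRY_RUN to execute.

```shell
# Pair each HDD letter with its NVMe partition and emit/run vgcreate.
DRY_RUN=1
run() { if [ "${DRY_RUN:-}" = "1" ]; then echo "$@"; else "$@"; fi; }
i=0
for l in b c d e f g h i j k l m n o p q r s t u; do   # sdb..sdu
  run vgcreate vg$l /dev/sd$l /dev/nvme$((i / 5))n1p$((i % 5 + 1))
  i=$((i + 1))
done
```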
Create Data LVs
Adjust the device names as needed; run on all nodes.
[root@node01 ~]# for i in {b..u}; do lvcreate -l 100%PVS -n hd$i vg$i /dev/sd$i;done
Logical volume "hdb" created.
Logical volume "hdc" created.
Logical volume "hdd" created.
Logical volume "hde" created.
Logical volume "hdf" created.
Logical volume "hdg" created.
Logical volume "hdh" created.
Logical volume "hdi" created.
Logical volume "hdj" created.
Logical volume "hdk" created.
Logical volume "hdl" created.
Logical volume "hdm" created.
Logical volume "hdn" created.
Logical volume "hdo" created.
Logical volume "hdp" created.
Logical volume "hdq" created.
Logical volume "hdr" created.
Logical volume "hds" created.
Logical volume "hdt" created.
Logical volume "hdu" created.
[root@node01 ~]#
Create cache/meta LVs
Create the cache and metadata logical volumes, adjusting device names as needed.
Create the cache LVs; part of node01's disks are shown.
[root@node01 ~]# lvcreate -n cacheb -L 300GiB vgb /dev/nvme0n1p1
Logical volume "cacheb" created.
[root@node01 ~]# lvcreate -n cachec -L 300GiB vgc /dev/nvme0n1p2
Logical volume "cachec" created.
[root@node01 ~]# lvcreate -n cached -L 300GiB vgd /dev/nvme0n1p3
Logical volume "cached" created.
[root@node01 ~]# lvcreate -n cachee -L 300GiB vge /dev/nvme0n1p4
Logical volume "cachee" created.
[root@node01 ~]# lvcreate -n cachef -L 300GiB vgf /dev/nvme0n1p5
Logical volume "cachef" created.
[root@node01 ~]#
……
Create the meta LVs; part of node01's disks are shown.
[root@node01 ~]# lvcreate -n metab -L 2GiB vgb /dev/nvme0n1p1
Logical volume "metab" created.
[root@node01 ~]# lvcreate -n metac -L 2GiB vgc /dev/nvme0n1p2
Logical volume "metac" created.
[root@node01 ~]# lvcreate -n metad -L 2GiB vgd /dev/nvme0n1p3
Logical volume "metad" created.
[root@node01 ~]# lvcreate -n metae -L 2GiB vge /dev/nvme0n1p4
Logical volume "metae" created.
[root@node01 ~]# lvcreate -n metaf -L 2GiB vgf /dev/nvme0n1p5
Logical volume "metaf" created.
……
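Both the cache and meta LVs can be created in one pass over the same HDD-to-partition pairing used for the VGs. A sketch: with DRY_RUN=1 it prints each lvcreate for review; unset DRY_RUN to execute.

```shell
# Create a 300GiB cache LV and a 2GiB meta LV on the matching SSD partition.
DRY_RUN=1
run() { if [ "${DRY_RUN:-}" = "1" ]; then echo "$@"; else "$@"; fi; }
i=0
for l in b c d e f g h i j k l m n o p q r s t u; do   # sdb..sdu
  p=/dev/nvme$((i / 5))n1p$((i % 5 + 1))               # same pairing as the VGs
  run lvcreate -n cache$l -L 300GiB vg$l $p
  run lvcreate -n meta$l -L 2GiB vg$l $p
  i=$((i + 1))
done
```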
Create and Attach the Cache Pools
Create the cache pools, adjusting device names as needed. node01 is shown; run on every node.
[root@node01 ~]# for i in {b..u};do lvconvert --type cache-pool --poolmetadata vg$i/meta$i vg$i/cache$i -y;done
Using 512.00 KiB chunk size instead of default 64.00 KiB, so cache pool has less than 1000000 chunks.
WARNING: Converting vgb/cacheb and vgb/metab to cache pool's data and metadata volumes with metadata wiping.
THIS WILL DESTROY CONTENT OF LOGICAL VOLUME (filesystem etc.)
Converted vgb/cacheb and vgb/metab to cache pool.
Using 512.00 KiB chunk size instead of default 64.00 KiB, so cache pool has less than 1000000 chunks.
WARNING: Converting vgc/cachec and vgc/metac to cache pool's data and metadata volumes with metadata wiping.
THIS WILL DESTROY CONTENT OF LOGICAL VOLUME (filesystem etc.)
Converted vgc/cachec and vgc/metac to cache pool.
Using 512.00 KiB chunk size instead of default 64.00 KiB, so cache pool has less than 1000000 chunks.
WARNING: Converting vgd/cached and vgd/metad to cache pool's data and metadata volumes with metadata wiping.
THIS WILL DESTROY CONTENT OF LOGICAL VOLUME (filesystem etc.)
Converted vgd/cached and vgd/metad to cache pool.
Using 512.00 KiB chunk size instead of default 64.00 KiB, so cache pool has less than 1000000 chunks.
WARNING: Converting vge/cachee and vge/metae to cache pool's data and metadata volumes with metadata wiping.
……
Attach each cache pool to its data LV in writeback mode, adjusting device names as needed. node01 is shown; run on every node.
[root@node01 ~]# for i in {b..u};do lvconvert --type cache --cachepool vg$i/cache$i --cachemode writeback vg$i/hd$i -y;done
Logical volume vgb/hdb is now cached.
Logical volume vgc/hdc is now cached.
Logical volume vgd/hdd is now cached.
Logical volume vge/hde is now cached.
Logical volume vgf/hdf is now cached.
Logical volume vgg/hdg is now cached.
……
Logical volume vgu/hdu is now cached.
Create DB/WAL LVs
Create the DB LVs, adjusting device names as needed. node01 is shown; run on every node.
[root@node01 ~]# lvcreate -n dbb -L 30GiB vgb /dev/nvme0n1p1
Logical volume "dbb" created.
[root@node01 ~]# lvcreate -n dbc -L 30GiB vgc /dev/nvme0n1p2
Logical volume "dbc" created.
[root@node01 ~]# lvcreate -n dbd -L 30GiB vgd /dev/nvme0n1p3
Logical volume "dbd" created.
[root@node01 ~]# lvcreate -n dbe -L 30GiB vge /dev/nvme0n1p4
Logical volume "dbe" created.
[root@node01 ~]# lvcreate -n dbf -L 30GiB vgf /dev/nvme0n1p5
Logical volume "dbf" created.
……
Create the WAL LVs, adjusting device names as needed. node01 is shown; run on every node.
[root@node01 ~]# lvcreate -n walb -L 15GiB vgb /dev/nvme0n1p1
Logical volume "walb" created.
[root@node01 ~]# lvcreate -n walc -L 15GiB vgc /dev/nvme0n1p2
Logical volume "walc" created.
[root@node01 ~]# lvcreate -n wald -L 15GiB vgd /dev/nvme0n1p3
Logical volume "wald" created.
[root@node01 ~]# lvcreate -n wale -L 15GiB vge /dev/nvme0n1p4
Logical volume "wale" created.
[root@node01 ~]# lvcreate -n walf -L 15GiB vgf /dev/nvme0n1p5
Logical volume "walf" created.
[root@node01 ~]#
……
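The DB and WAL LVs follow the same pattern, so they too can be generated from the HDD-to-partition pairing. A sketch: with DRY_RUN=1 it prints each lvcreate for review; unset DRY_RUN to execute.

```shell
# Create a 30GiB DB LV and a 15GiB WAL LV on the matching SSD partition.
DRY_RUN=1
run() { if [ "${DRY_RUN:-}" = "1" ]; then echo "$@"; else "$@"; fi; }
i=0
for l in b c d e f g h i j k l m n o p q r s t u; do   # sdb..sdu
  p=/dev/nvme$((i / 5))n1p$((i % 5 + 1))               # same pairing as the VGs
  run lvcreate -n db$l -L 30GiB vg$l $p
  run lvcreate -n wal$l -L 15GiB vg$l $p
  i=$((i + 1))
done
```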
Verification
lsblk shows the mapping between HDDs and SSDs; sdb and nvme0n1p1 are used as an example.
[root@node01 ~]# lsblk /dev/sdb
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sdb 8:16 0 7.8T 0 disk
└─vgb-hdb_corig 253:25 0 7.8T 0 lvm
└─vgb-hdb 253:3 0 7.8T 0 lvm
[root@node01 ~]# lsblk /dev/nvme0n1p1
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
nvme0n1p1 259:4 0 350G 0 part
├─vgb-cacheb_cpool_cdata 253:23 0 300G 0 lvm
│ └─vgb-hdb 253:3 0 7.8T 0 lvm
├─vgb-cacheb_cpool_cmeta 253:24 0 2G 0 lvm
│ └─vgb-hdb 253:3 0 7.8T 0 lvm
├─vgb-dbb 253:83 0 30G 0 lvm
└─vgb-walb 253:103 0 15G 0 lvm
[root@node01 ~]#
Check PVs
[root@node01 ~]# pvs
PV VG Fmt Attr PSize PFree
/dev/nvme0n1p1 vgb lvm2 a-- <350.00g 1020.00m
/dev/nvme0n1p2 vgc lvm2 a-- <350.00g 1020.00m
/dev/nvme0n1p3 vgd lvm2 a-- <350.00g 1020.00m
……
/dev/nvme3n1p4 vgt lvm2 a-- <350.00g 1020.00m
/dev/nvme3n1p5 vgu lvm2 a-- <350.00g 1020.00m
/dev/sda3 bel lvm2 a-- 478.41g 0
/dev/sdb vgb lvm2 a-- 7.81t 0
/dev/sdc vgc lvm2 a-- 7.81t 0
/dev/sdd vgd lvm2 a-- 7.81t 0
/dev/sde vge lvm2 a-- 7.81t 0
……
/dev/sds vgs lvm2 a-- 7.81t 0
/dev/sdt vgt lvm2 a-- 7.81t 0
/dev/sdu vgu lvm2 a-- 7.81t 0
Check VGs
[root@node01 ~]# vgs
VG #PV #LV #SN Attr VSize VFree
bel 1 3 0 wz--n- 478.41g 0
vgb 2 3 0 wz--n- 8.15t 1020.00m
vgc 2 3 0 wz--n- 8.15t 1020.00m
vgd 2 3 0 wz--n- 8.15t 1020.00m
vge 2 3 0 wz--n- 8.15t 1020.00m
……
vgs 2 3 0 wz--n- 8.15t 1020.00m
vgt 2 3 0 wz--n- 8.15t 1020.00m
vgu 2 3 0 wz--n- 8.15t 1020.00m
[root@node01 ~]#
Check LVs
[root@node01 ~]# lvs
LV VG Attr LSize Pool Origin Data% Meta% Move Log Cpy%Sync Convert
home bel -wi-ao---- 424.41g
root bel -wi-ao---- 50.00g
swap bel -wi-ao---- 4.00g
dbb vgb -wi-a----- 30.00g
hdb vgb Cwi-a-C--- 7.81t [cacheb_cpool] [hdb_corig] 0.01 0.24 0.00
walb vgb -wi-a----- 15.00g
dbc vgc -wi-a----- 30.00g
hdc vgc Cwi-a-C--- 7.81t [cachec_cpool] [hdc_corig] 0.01 0.24 0.00
walc vgc -wi-a----- 15.00g
……
dbu vgu -wi-a----- 30.00g
hdu vgu Cwi-a-C--- 7.81t [cacheu_cpool] [hdu_corig] 0.01 0.24 0.00
walu vgu -wi-a----- 15.00g
[root@node01 ~]#
Ceph Cluster Deployment
Install the Ceph Packages
Install Ceph and its dependencies on node01-node04; node01 is shown below.
[root@node01 ~]# yum install -y python3 ceph-mgr-dashboard ceph-common
Unable to connect to Registration Management Service
Last metadata expiration check: 0:54:01 ago on Mon 25 Dec 2023 07:46:33 PM CST.
Package python36-3.6.8-2.module+el8.2.0+10008+eff94df6.x86_64 is already installed.
Dependencies resolved.
====================================================================
Package Architecture Version Repository Size
=====================================================================
Installing:
ceph-common x86_64 2:15.2.11-0.el8 ceph 21 M
……
Complete!
[root@node01 ~]# yum install -y snappy leveldb gdisk python3-ceph-argparse python3-flask gperftools-libs ceph
Unable to connect to Registration Management Service
Last metadata expiration check: 0:56:50 ago on Mon 25 Dec 2023 07:46:33 PM CST.
Package snappy-1.1.7-5.el8.x86_64 is already installed.
……
Installed:
ceph-2:15.2.11-0.el8.x86_64 ceph-mds-2:15.2.11-0.el8.x86_64 ceph-mon-2:15.2.11-0.el8.x86_64 ceph-osd-2:15.2.11-0.el8.x86_64 python3-click-6.7-8.el8.noarch python3-flask-1:0.12.2-4.el8.noarch python3-itsdangerous-0.24-14.el8.noarch
Complete!
[root@node01 ~]#
Install and Configure mon
Generate a UUID on node01 and record it.
[root@node01 ~]# uuidgen
275d16f8-56f2-4838-86cf-a904fbe97ba2
Create the keyrings on node01.
[root@node01 ~]# ceph-authtool --create-keyring /tmp/ceph.mon.keyring --gen-key -n mon. --cap mon 'allow *'
creating /tmp/ceph.mon.keyring
[root@node01 ~]# ceph-authtool --create-keyring /etc/ceph/ceph.client.admin.keyring --gen-key -n client.admin --cap mon 'allow *' --cap osd 'allow *' --cap mds 'allow *' --cap mgr 'allow *'
creating /etc/ceph/ceph.client.admin.keyring
[root@node01 ~]# ceph-authtool --create-keyring /var/lib/ceph/bootstrap-osd/ceph.keyring --gen-key -n client.bootstrap-osd --cap mon 'profile bootstrap-osd' --cap mgr 'allow r'
creating /var/lib/ceph/bootstrap-osd/ceph.keyring
[root@node01 ~]# ceph-authtool /tmp/ceph.mon.keyring --import-keyring /etc/ceph/ceph.client.admin.keyring
importing contents of /etc/ceph/ceph.client.admin.keyring into /tmp/ceph.mon.keyring
[root@node01 ~]# ceph-authtool /tmp/ceph.mon.keyring --import-keyring /var/lib/ceph/bootstrap-osd/ceph.keyring
importing contents of /var/lib/ceph/bootstrap-osd/ceph.keyring into /tmp/ceph.mon.keyring
Change the owner of /tmp/ceph.mon.keyring to ceph and verify. Run on node01.
[root@node01 ~]# chown ceph:ceph /tmp/ceph.mon.keyring
[root@node01 ~]# ls -l /tmp/ceph.mon.keyring
-rw-------. 1 ceph ceph 357 Dec 25 20:47 /tmp/ceph.mon.keyring
Build the monmap: each --add takes a mon node's hostname and public IP, and --fsid takes the UUID recorded above. Run only on node01.
[root@node01 ~]# monmaptool --create --add node01 188.188.13.1 --add node02 188.188.13.2 --add node03 188.188.13.3 --fsid 275d16f8-56f2-4838-86cf-a904fbe97ba2 /tmp/monmap
monmaptool: monmap file /tmp/monmap
monmaptool: set fsid to 275d16f8-56f2-4838-86cf-a904fbe97ba2
monmaptool: writing epoch 0 to /tmp/monmap (3 monitors)
[root@node01 ~]#
On node01, create /etc/ceph/ceph.conf with the contents below: fsid is the UUID, mon initial members lists all mon hostnames, mon host lists all mon public IPs, public network is the public subnet, and cluster network is the cluster subnet.
[root@node01 ~]# vi /etc/ceph/ceph.conf
[root@node01 ~]# cat /etc/ceph/ceph.conf
[global]
fsid = 275d16f8-56f2-4838-86cf-a904fbe97ba2
mon initial members = node01,node02,node03
mon host = 188.188.13.1,188.188.13.2,188.188.13.3
public network = 188.188.13.0/24
cluster network = 10.10.13.0/24
auth cluster required = cephx
auth service required = cephx
auth client required = cephx
osd journal size = 1024
osd pool default size = 3
osd pool default min size = 2
osd pool default pg num = 256
osd pool default pgp num = 256
osd crush chooseleaf type = 1
mon_allow_pool_delete = true
Copy node01's /etc/ceph directory to all other nodes. Run the following on node01.
[root@node01 ~]# for i in {2..4};do scp -r /etc/ceph/ node0$i:/etc/ ;done
rbdmap 100% 92 128.0KB/s 00:00
ceph.client.admin.keyring 100% 151 121.2KB/s 00:00
ceph.conf 100% 501 714.2KB/s 00:00
rbdmap 100% 92 133.2KB/s 00:00
ceph.client.admin.keyring 100% 151 74.5KB/s 00:00
ceph.conf 100% 501 643.7KB/s 00:00
rbdmap 100% 92 128.3KB/s 00:00
ceph.client.admin.keyring 100% 151 143.4KB/s 00:00
ceph.conf 100% 501 557.8KB/s 00:00
[root@node01 ~]#
Copy node01's /tmp/monmap and /tmp/ceph.mon.keyring to /tmp on every other Ceph node, and change their owner to ceph.
[root@node01 ~]# for i in {2..4};do scp /tmp/monmap /tmp/ceph.mon.keyring node0$i:/tmp/;done
monmap 100% 319 299.0KB/s 00:00
ceph.mon.keyring 100% 357 530.2KB/s 00:00
monmap 100% 319 263.9KB/s 00:00
ceph.mon.keyring 100% 357 786.1KB/s 00:00
monmap 100% 319 307.9KB/s 00:00
ceph.mon.keyring 100% 357 1.3MB/s 00:00
[root@node01 ~]# ssh node01 'cd /tmp/;chown -R ceph:ceph /tmp/monmap /tmp/ceph.mon.keyring'
[root@node01 ~]# ssh node02 'cd /tmp/;chown -R ceph:ceph /tmp/monmap /tmp/ceph.mon.keyring'
[root@node01 ~]# ssh node03 'cd /tmp/;chown -R ceph:ceph /tmp/monmap /tmp/ceph.mon.keyring'
[root@node01 ~]# ssh node04 'cd /tmp/;chown -R ceph:ceph /tmp/monmap /tmp/ceph.mon.keyring'
[root@node01 ~]# for i in {1..4};do ssh node0$i 'ls -l /tmp/|grep ceph';done
-rw-------. 1 ceph ceph 357 Dec 25 20:47 ceph.mon.keyring
-rw-r--r--. 1 ceph ceph 319 Dec 25 20:49 monmap
-rw-------. 1 ceph ceph 357 Dec 25 20:53 ceph.mon.keyring
-rw-r--r--. 1 ceph ceph 319 Dec 25 20:53 monmap
-rw-------. 1 ceph ceph 357 Dec 25 20:53 ceph.mon.keyring
-rw-r--r--. 1 ceph ceph 319 Dec 25 20:53 monmap
-rw-------. 1 ceph ceph 357 Dec 25 20:53 ceph.mon.keyring
-rw-r--r--. 1 ceph ceph 319 Dec 25 20:53 monmap
[root@node01 ~]#
Create the mon data directory on every mon node and change its owner to ceph. Run the following on node01.
[root@node01 ~]# for i in {1..3};do ssh node0$i 'mkdir /var/lib/ceph/mon/ceph-`hostname`';done
[root@node01 ~]# for i in {1..3};do ssh node0$i 'chown -R ceph:ceph /var/lib/ceph/mon/ceph-`hostname`';done
[root@node01 ~]# for i in {1..3};do ssh node0$i 'ls -l /var/lib/ceph/mon/';done
total 0
drwxr-xr-x. 2 ceph ceph 6 Dec 25 20:56 ceph-node01
total 0
drwxr-xr-x. 2 ceph ceph 6 Dec 25 20:56 ceph-node02
total 0
drwxr-xr-x. 2 ceph ceph 6 Dec 25 20:56 ceph-node03
[root@node01 ~]#
Initialize the mon service on all mon nodes; run the following on node01.
[root@node01 ~]# for i in {1..3};do ssh node0$i 'ceph-mon --mkfs -i `hostname` --monmap /tmp/monmap --keyring /tmp/ceph.mon.keyring';done
[root@node01 ~]#
Change the owner of /var/lib/ceph/mon/ on all mon nodes to ceph again. Run the following on node01.
[root@node01 ~]# for i in {1..3};do ssh node0$i 'chown -R ceph:ceph /var/lib/ceph/mon/ceph-`hostname`';done
[root@node01 ~]# for i in {1..3};do ssh node0$i 'ls -l /var/lib/ceph/mon/ceph-`hostname`';done
total 8
-rw-------. 1 ceph ceph 77 Dec 25 20:57 keyring
-rw-------. 1 ceph ceph 8 Dec 25 20:57 kv_backend
drwxr-xr-x. 2 ceph ceph 112 Dec 25 20:57 store.db
total 8
-rw-------. 1 ceph ceph 77 Dec 25 20:57 keyring
-rw-------. 1 ceph ceph 8 Dec 25 20:57 kv_backend
drwxr-xr-x. 2 ceph ceph 112 Dec 25 20:57 store.db
total 8
-rw-------. 1 ceph ceph 77 Dec 25 20:57 keyring
-rw-------. 1 ceph ceph 8 Dec 25 20:57 kv_backend
drwxr-xr-x. 2 ceph ceph 112 Dec 25 20:57 store.db
[root@node01 ~]#
Start ceph-mon on all mon nodes; run the following on node01.
[root@node01 ~]# for i in {1..3};do ssh node0$i 'systemctl start ceph-mon@`hostname`';done
[root@node01 ~]# for i in {1..3};do ssh node0$i 'systemctl status ceph-mon@`hostname`|grep active';done
Active: active (running) since Mon 2023-12-25 20:58:14 CST; 18s ago
Active: active (running) since Mon 2023-12-25 20:58:15 CST; 18s ago
Active: active (running) since Mon 2023-12-25 20:58:15 CST; 18s ago
[root@node01 ~]#
Enable msgr2 on all mon nodes and check the cluster status.
[root@node01 ~]# for i in {1..3};do ssh node0$i 'ceph mon enable-msgr2';done
[root@node01 ~]# ceph config set mon auth_allow_insecure_global_id_reclaim false
[root@node01 ~]# ceph -s
cluster:
id: 275d16f8-56f2-4838-86cf-a904fbe97ba2
health: HEALTH_OK
services:
mon: 3 daemons, quorum node01,node02,node03 (age 4m)
mgr: no daemons active
osd: 0 osds: 0 up, 0 in
data:
pools: 0 pools, 0 pgs
objects: 0 objects, 0 B
usage: 0 B used, 0 B / 0 B avail
pgs:
[root@node01 ~]#
Install and Configure mgr
Unless stated otherwise, run the commands in this subsection on node01.
Create the mgr directory and register the daemon.
[root@node01 ~]# mkdir /var/lib/ceph/mgr/ceph-node01_mgr/
[root@node01 ~]# ceph auth get-or-create mgr.node01_mgr mon 'allow profile mgr' osd 'allow *' mds 'allow *' > /var/lib/ceph/mgr/ceph-node01_mgr/keyring
[root@node01 ~]#
Add the [mon] section shown below to /etc/ceph/ceph.conf, sync the file to all Ceph nodes, and restart the mon service on every mon node.
[root@node01 ~]# cat /etc/ceph/ceph.conf
[global]
fsid = 275d16f8-56f2-4838-86cf-a904fbe97ba2
mon initial members = node01,node02,node03
mon host = 188.188.13.1,188.188.13.2,188.188.13.3
public network = 188.188.13.0/24
cluster network = 10.10.13.0/24
auth cluster required = cephx
auth service required = cephx
auth client required = cephx
osd journal size = 1024
osd pool default size = 3
osd pool default min size = 2
osd pool default pg num = 256
osd pool default pgp num = 256
osd crush chooseleaf type = 1
mon_allow_pool_delete = true
[mon]
mgr initial modules = dashboard balancer
[root@node01 ~]# for i in {2..4};do scp /etc/ceph/ceph.conf node0$i:/etc/ceph/ ;done
ceph.conf 100% 549 201.3KB/s 00:00
ceph.conf 100% 549 346.7KB/s 00:00
ceph.conf 100% 549 318.2KB/s 00:00
[root@node01 ~]# for i in {1..3};do ssh node0$i 'systemctl restart ceph-mon@`hostname`';done
[root@node01 ~]# for i in {1..3};do ssh node0$i 'systemctl status ceph-mon@`hostname`|grep active';done
Active: active (running) since Mon 2023-12-25 21:09:42 CST; 20s ago
Active: active (running) since Mon 2023-12-25 21:09:43 CST; 19s ago
Active: active (running) since Mon 2023-12-25 21:09:43 CST; 19s ago
Configure the mgr service.
[root@node01 ~]# ceph-mgr -i node01_mgr
[root@node01 ~]# systemctl start ceph-mgr@node01_mgr
[root@node01 ~]# systemctl enable ceph-mgr@node01_mgr
Created symlink /etc/systemd/system/ceph-mgr.target.wants/ceph-mgr@node01_mgr.service → /usr/lib/systemd/system/ceph-mgr@.service.
[root@node01 ~]# systemctl status ceph-mgr@node01_mgr
● ceph-mgr@node01_mgr.service - Ceph cluster manager daemon
Loaded: loaded (/usr/lib/systemd/system/ceph-mgr@.service; enabled; vendor preset: disabled)
Active: active (running) since Mon 2023-12-25 21:54:18 CST; 20s ago
Main PID: 43489 (ceph-mgr)
Tasks: 20 (limit: 822859)
Memory: 293.1M
CGroup: /system.slice/system-ceph\x2dmgr.slice/ceph-mgr@node01_mgr.service
└─43489 /usr/bin/ceph-mgr -f --cluster ceph --id node01_mgr --setuser ceph --setgroup ceph
Dec 25 21:54:18 node01 systemd[1]: Started Ceph cluster manager daemon.
Dec 25 21:54:27 node01 ceph-mgr[43489]: 2023-12-25T21:54:27.892+0800 7fa608338700 -1 mgr handle_mgr_map I was activ>
Dec 25 21:54:27 node01 ceph-mgr[43489]: ignoring --setuser ceph since I am not root
Dec 25 21:54:27 node01 ceph-mgr[43489]: ignoring --setgroup ceph since I am not root [root@node01 ~]#
Web UI Setup (Optional)
Install and configure the dashboard.
[root@node01 ~]# ceph mgr module enable dashboard
[root@node01 ~]# ceph config set mgr mgr/dashboard/ssl false
[root@node01 ~]# ceph config set mgr mgr/dashboard/server_addr 0.0.0.0
[root@node01 ~]# ceph config set mgr mgr/dashboard/server_port 3244
[root@node01 ~]# echo "11223344" >> password
[root@node01 ~]# ceph dashboard ac-user-create admin administrator -i password --force-password
{"username": "admin", "password": "$2b$12$yruot.mVxLsUC7tRdLrkW.Kgmd/JbXAbaRRzXw4MkbNhrGSKKQGHq", "roles": ["administrator"], "name": null, "email": null, "lastUpdate": 1703513298, "enabled": true, "pwdExpirationDate": null, "pwdUpdateRequired": false}
[root@node01 ~]#
Confirm the dashboard status in a browser.
The landing page after login is shown below.
Install and Configure OSDs
Sync /var/lib/ceph/bootstrap-osd/ceph.keyring to all OSD nodes; run the following on node01.
[root@node01 ~]# for i in {2..4};do scp /var/lib/ceph/bootstrap-osd/ceph.keyring node0$i:/var/lib/ceph/bootstrap-osd/;done
ceph.keyring 100% 129 82.5KB/s 00:00
ceph.keyring 100% 129 61.5KB/s 00:00
ceph.keyring 100% 129 91.1KB/s 00:00
[root@node01 ~]#
Create one OSD per vgX (X from b to u). Taking /dev/sdb as an example, after the LVM setup vgb-hdb is the data LV, vgb-dbb is the 30 GiB DB LV, and vgb-walb is the 15 GiB WAL LV.
[root@node01 ~]# lsblk /dev/sdb
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sdb 8:16 0 7.8T 0 disk
└─vgb-hdb_corig 253:25 0 7.8T 0 lvm
└─vgb-hdb 253:3 0 7.8T 0 lvm
[root@node01 ~]# lsblk /dev/nvme0n1p1
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
nvme0n1p1 259:4 0 350G 0 part
├─vgb-cacheb_cpool_cdata 253:23 0 300G 0 lvm
│ └─vgb-hdb 253:3 0 7.8T 0 lvm
├─vgb-cacheb_cpool_cmeta 253:24 0 2G 0 lvm
│ └─vgb-hdb 253:3 0 7.8T 0 lvm
├─vgb-dbb 253:83 0 30G 0 lvm
└─vgb-walb 253:103 0 15G 0 lvm
[root@node01 ~]#
Create the OSD services; node01 is shown here.
[root@node01 ~]# for i in {b..u};do ceph-volume lvm create --bluestore --data vg$i/hd$i --block.db vg$i/db$i --block.wal vg$i/wal$i;sleep 1 ;done
Running command: /usr/bin/ceph-authtool --gen-print-key
Running command: /usr/bin/ceph --cluster ceph --name client.bootstrap-osd --keyring /var/lib/ceph/bootstrap-osd/ceph.keyring -i - osd new 5af227de-4929-480c-ae5c-2b06afd95ba2
Running command: /usr/bin/ceph-authtool --gen-print-key
Running command: /usr/bin/mount -t tmpfs tmpfs /var/lib/ceph/osd/ceph-0
Running command: /usr/sbin/restorecon /var/lib/ceph/osd/ceph-0
Running command: /usr/bin/chown -h ceph:ceph /dev/vgb/hdb
Running command: /usr/bin/chown -R ceph:ceph /dev/dm-3
Running command: /usr/bin/ln -s /dev/vgb/hdb /var/lib/ceph/osd/ceph-0/block
Running command: /usr/bin/ceph --cluster ceph --name client.bootstrap-osd --keyring /var/lib/ceph/bootstrap-osd/ceph.keyring mon getmap -o /var/lib/ceph/osd/ceph-0/activate.monmap
stderr: 2023-12-26T11:10:34.132+0800 7f9fd89b1700 -1 auth: unable to find a keyring on /etc/ceph/ceph.client.bootstrap-osd.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin,: (2) No such file or directory
2023-12-26T11:10:34.132+0800 7f9fd89b1700 -1 AuthRegistry(0x7f9fd4059a70) no keyring found at /etc/ceph/ceph.client.bootstrap-osd.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin,, disabling cephx
stderr: got monmap epoch 2
Running command: /usr/bin/ceph-authtool /var/lib/ceph/osd/ceph-0/keyring --create-keyring --name osd.0 --add-key AQAoRIplLNUDDBAAac7FZJM6gMSa9tZ+EfjgGg==
……
stderr: Created symlink /etc/systemd/system/multi-user.target.wants/ceph-volume@lvm-19-f04372b2-0b7c-4717-98b4-d5c25f5b3e29.service → /usr/lib/systemd/system/ceph-volume@.service.
Running command: /usr/bin/systemctl enable --runtime ceph-osd@19
stderr: Created symlink /run/systemd/system/ceph-osd.target.wants/ceph-osd@19.service → /usr/lib/systemd/system/ceph-osd@.service.
Running command: /usr/bin/systemctl start ceph-osd@19
--> ceph-volume lvm activate successful for osd ID: 19
--> ceph-volume lvm create successful for: vgu/hdu
[root@node01 ~]#
Check the cluster status.
[root@node01 ~]# ceph -s
cluster:
id: 275d16f8-56f2-4838-86cf-a904fbe97ba2
health: HEALTH_OK
services:
mon: 3 daemons, quorum node01,node02,node03 (age 8m)
mgr: node01_mgr(active, starting, since 2s)
osd: 20 osds: 20 up (since 2m), 20 in (since 2m); 1 remapped pgs
data:
pools: 1 pools, 1 pgs
objects: 0 objects, 0 B
usage: 620 GiB used, 156 TiB / 157 TiB avail
pgs: 1 active+clean
[root@node01 ~]#
Create OSDs for every vgX on every node in the same way, specifying the matching WAL and DB LVs.
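The ceph-volume loop can be repeated on the remaining nodes over SSH, assuming every node uses the same vgb..vgu naming as node01. A sketch: it prints the ssh commands for review, and the output can be piped to `sh` to execute.

```shell
# Print the per-node OSD creation command; pipe to `sh` to run them.
for n in node02 node03 node04; do
  echo "ssh $n 'for i in {b..u}; do ceph-volume lvm create --bluestore" \
       "--data vg\$i/hd\$i --block.db vg\$i/db\$i --block.wal vg\$i/wal\$i; sleep 1; done'"
done
```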
After all OSDs are created, check the cluster status; it matches the Bcache-based deployment.
[root@node01 ~]# ceph -s
cluster:
id: 275d16f8-56f2-4838-86cf-a904fbe97ba2
health: HEALTH_OK
services:
mon: 3 daemons, quorum node01,node02,node03 (age 5s)
mgr: node01_mgr(active, since 4s)
osd: 80 osds: 80 up (since 4m), 80 in (since 4m)
data:
pools: 1 pools, 1 pgs
objects: 0 objects, 0 B
usage: 2.4 TiB used, 625 TiB / 627 TiB avail
pgs: 1 active+clean
Check the OSD distribution across the cluster.
[root@node01 ~]# ceph osd tree
ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
-1 627.34375 root default
-3 156.83594 host node01
0 ssd 7.84180 osd.0 up 1.00000 1.00000
1 ssd 7.84180 osd.1 up 1.00000 1.00000
2 ssd 7.84180 osd.2 up 1.00000 1.00000
3 ssd 7.84180 osd.3 up 1.00000 1.00000
4 ssd 7.84180 osd.4 up 1.00000 1.00000
5 ssd 7.84180 osd.5 up 1.00000 1.00000
6 ssd 7.84180 osd.6 up 1.00000 1.00000
7 ssd 7.84180 osd.7 up 1.00000 1.00000
8 ssd 7.84180 osd.8 up 1.00000 1.00000
9 ssd 7.84180 osd.9 up 1.00000 1.00000
10 ssd 7.84180 osd.10 up 1.00000 1.00000
11 ssd 7.84180 osd.11 up 1.00000 1.00000
12 ssd 7.84180 osd.12 up 1.00000 1.00000
13 ssd 7.84180 osd.13 up 1.00000 1.00000
14 ssd 7.84180 osd.14 up 1.00000 1.00000
15 ssd 7.84180 osd.15 up 1.00000 1.00000
16 ssd 7.84180 osd.16 up 1.00000 1.00000
17 ssd 7.84180 osd.17 up 1.00000 1.00000
18 ssd 7.84180 osd.18 up 1.00000 1.00000
19 ssd 7.84180 osd.19 up 1.00000 1.00000
-5 156.83594 host node02
20 ssd 7.84180 osd.20 up 1.00000 1.00000
……
74 ssd 7.84180 osd.74 up 1.00000 1.00000
75 ssd 7.84180 osd.75 up 1.00000 1.00000
76 ssd 7.84180 osd.76 up 1.00000 1.00000
77 ssd 7.84180 osd.77 up 1.00000 1.00000
78 ssd 7.84180 osd.78 up 1.00000 1.00000
79 ssd 7.84180 osd.79 up 1.00000 1.00000
Deployment is now complete. Application setup is identical to the Bcache article and is not repeated here.