在国内环境中快速构建稳定、安全、可观测的高可用 Kubernetes 集群,满足 40-80 节点规模的生产业务需求。
一、整体规划
1. 节点规划
节点类型 | 数量 | IP 地址 | 硬件配置 | 用途说明 |
---|---|---|---|---|
Master 节点 | 3 | 10.24.1.10 10.24.1.11 10.24.1.12 | CPU: 8 核 / 3.0GHz<br>内存: 32GB<br>存储: 2×50GB SSD(RAID1 系统盘) | 控制平面组件(API Server/Controller Manager/Scheduler) |
Worker 节点 | 40-80 | 10.24.1.100-10.24.1.179 | CPU: 4 核 / 2.5GHz<br>内存: 16GB<br>存储: 1×100GB SSD | 运行应用容器 |
Ceph 存储节点 | 3 | 10.24.1.13 10.24.1.14 10.24.1.15 | CPU: 6 核 / 2.4GHz<br>内存: 64GB<br>存储: 1×200GB SSD(元数据 / 日志)+ 4×4TB HDD(容量存储)<br>双网卡:管理网 (10.24.1.0/24)+ 存储网 (10.24.5.0/24) | 分布式存储系统(OSD/MON/MDS) |
Harbor 节点 | 3 | 10.24.1.16 10.24.1.17 10.24.1.18 | CPU: 4 核 / 2.0GHz<br>内存: 16GB<br>存储: 2×200GB SSD(镜像存储,RAID1) | 镜像仓库高可用部署 |
监控日志节点 | 3 | 10.24.1.19 10.24.1.20 10.24.1.21 | CPU: 6 核 / 2.4GHz<br>内存: 32GB<br>存储: 2×500GB HDD(日志存储,RAID1) | Prometheus/Grafana/EFK 部署 |
负载均衡节点 | 2 | 10.24.1.22 10.24.1.23 | CPU: 2 核 / 2.0GHz<br>内存: 8GB<br>存储: 1×20GB SSD | HAProxy+Keepalived(VIP: 10.24.1.24) |
APM+Kuboard 节点 | 1 | 10.24.1.25 | CPU: 4 核 / 2.4GHz<br>内存: 16GB<br>存储: 1×100GB SSD | Jaeger+Kuboard 部署 |
2. 网络规划
网络类型 | 网段 | 带宽 | 用途说明 |
---|---|---|---|
管理网络 | 10.24.1.0/24 | 1Gbps | 节点管理、API 通信、集群控制 |
存储网络 | 10.24.5.0/24 | 10Gbps | Ceph OSD 间数据传输(万兆) |
Pod 网络 | 10.200.0.0/16 | - | Pod 间通信(Calico IPIP 模式) |
Service 网络 | 10.90.0.0/16 | - | Service 虚拟 IP 段(IPVS 转发) |
Keepalived VIP | 10.24.1.24/24 | - | 控制平面入口 IP(6443 端口) |
3. 组件版本矩阵
组件名称 | 版本号 | 兼容性说明 |
---|---|---|
Kubernetes | v1.28.3 | 支持 IPVS 代理模式,PSP 替换为 PSA |
etcd | v3.5.9 | Kubernetes 推荐版本 |
Ceph | v17.2.6 (Quincy) | 支持 Cephx 认证,3 副本策略 |
Calico | v3.26.5 | 支持 Pod 网络策略与 IPIP 模式 |
Harbor | v2.8.2 | 支持 HTTPS 与镜像跨区域复制 |
Prometheus | v2.47.0 | 兼容 Kubernetes 1.28+ |
Elasticsearch | v8.10.4 | 支持 TLS 加密与 X-Pack 认证 |
Jaeger | v1.46.0 | 支持 OpenTracing 协议 |
Kuboard | v3.0.0 | 支持 Kubernetes 1.19+ |
二、提前准备内容
1.硬件与网络准备
- 服务器要求:所有节点安装并启用 chrony 时间同步(yum install -y chrony && systemctl enable --now chronyd),配置示例见本小节末尾
- 双网卡配置:Ceph 节点需配置独立管理网(eth0: 10.24.1.0/24)和存储网(eth1: 10.24.5.0/24)
- 端口开放:管理网放行 22(SSH)、6443(API Server)、2379/2380(etcd)、80/443(Harbor)、9090(Prometheus)、5601(Kibana)、16686(Jaeger)
- IP 规划:提前分配固定 IP,确保 VIP(10.24.1.24)未被占用,各节点 IP 无冲突
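下面给出一个 chrony 的最小配置示例(时间源选用 ntp.aliyun.com 仅为示例,可替换为内网 NTP 服务器):
# 将默认时间源替换为国内/内网 NTP(示例)
sed -i 's/^server /#server /' /etc/chrony.conf
echo "server ntp.aliyun.com iburst" >> /etc/chrony.conf
systemctl enable --now chronyd
chronyc sources -v   # 确认时间源可达、偏移已收敛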
2.软件依赖准备
- 所有节点提前安装:
yum install -y openssl-devel curl wget vim jq chrony lvm2 device-mapper-persistent-data
- 证书准备(自签证书流程):
  - Harbor 证书:按后文步骤生成,确保证书 CN 为 k8s-vip
  - Kibana 证书:生成时 CN 为 kibana.k8s.local,并配置到 Nginx
  - etcd 证书:包含所有 Master 节点 IP,通过 cfssl 生成并分发
3.存储准备
- Ceph 节点磁盘:
  - SSD(/dev/sdb):用作 BlueStore 的 block.db/WAL,保持裸盘即可,无需提前 mkfs;如有旧分区可执行 wipefs -a /dev/sdb 清理
  - HDD(/dev/sdc 等):用作容量存储(OSD 数据盘),同样保持裸盘,由 ceph-deploy/ceph-volume 负责创建
- Harbor 持久化存储:提前创建 /data/harbor 目录并赋予权限(mkdir -p /data/harbor && chmod 755 /data/harbor)
三、部署注意事项
1.版本兼容性
- 严格按照组件版本矩阵部署,避免版本冲突(如 Kubernetes v1.28.3 仅支持 etcd v3.5.0+)
- 确保容器运行时与 kubelet 的 Cgroup Driver 一致(均为 systemd);Kubernetes 1.24+ 已移除 dockershim,kubelet 需通过 containerd(或 cri-dockerd)对接 CRI
2.安全配置
- 证书轮换:每年更新自签证书,使用 openssl 重新生成并替换
- Harbor 权限:按项目隔离镜像仓库,启用基于角色的访问控制
- Kubernetes 审计:配置 API Server 审计日志,将所有请求记录到持久化存储(审计策略示例见下方)
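以下是一个最小化的审计策略与 kube-apiserver 参数示例(策略级别与日志路径为示例假设,可按合规要求调整):
# 审计策略(示例):以 Metadata 级别记录所有请求
cat <<EOF > /etc/kubernetes/audit-policy.yaml
apiVersion: audit.k8s.io/v1
kind: Policy
rules:
- level: Metadata
EOF
# 在 /etc/kubernetes/manifests/kube-apiserver.yaml 中追加以下参数,并挂载策略文件与日志目录:
#   --audit-policy-file=/etc/kubernetes/audit-policy.yaml
#   --audit-log-path=/var/log/kubernetes/audit.log
#   --audit-log-maxage=30 --audit-log-maxbackup=10 --audit-log-maxsize=100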
3.性能优化
# IPVS 连接超时优化(负载均衡节点),依次设置 tcp/tcpfin/udp 超时(秒)
ipvsadm --set 300 120 300
# Ceph 数据重建/回填限速,降低恢复流量对业务的影响
ceph tell 'osd.*' injectargs '--osd_max_backfills 1 --osd_recovery_max_active 1'
4.备份策略
- etcd 备份:每日 2:00 执行全量快照备份,保留 7 天历史,关键快照异地存储(备份脚本示例见下方)
- Ceph 快照:每周对核心存储池执行快照,使用 rbd snap create <pool>/<image>@<snap>
- 监控日志归档:每季度清理超过 90 天的监控与日志数据,释放磁盘空间
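下面是一个基于 etcdctl snapshot save 的每日备份脚本示例(备份目录 /backup/etcd 与证书路径沿用本文约定,属示例假设):
cat <<'EOF' > /usr/local/bin/etcd-backup.sh
#!/bin/bash
# 每日全量快照备份,保留 7 天
BACKUP_DIR=/backup/etcd
mkdir -p ${BACKUP_DIR}
ETCDCTL_API=3 etcdctl \
  --endpoints=https://10.24.1.10:2379 \
  --cacert=/etc/etcd/pki/ca.pem \
  --cert=/etc/etcd/pki/server.pem \
  --key=/etc/etcd/pki/server-key.pem \
  snapshot save ${BACKUP_DIR}/etcd-snapshot-$(date +%F).db
find ${BACKUP_DIR} -name "etcd-snapshot-*.db" -mtime +7 -delete
EOF
chmod +x /usr/local/bin/etcd-backup.sh
echo "0 2 * * * root /usr/local/bin/etcd-backup.sh" >> /etc/crontab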
四、基础环境准备(所有节点)
1.系统初始化
# 关闭系统服务
systemctl disable --now firewalld
sed -i 's/SELINUX=enforcing/SELINUX=disabled/' /etc/selinux/config
setenforce 0
swapoff -a
sed -i '/swap/s/^\(.*\)$/#\1/' /etc/fstab
# 加载 br_netfilter 模块(net.bridge.* 参数依赖该模块)并配置内核参数
modprobe br_netfilter
echo "br_netfilter" > /etc/modules-load.d/br_netfilter.conf
cat <<EOF > /etc/sysctl.d/k8s.conf
net.bridge.bridge-nf-call-iptables=1
net.bridge.bridge-nf-call-ip6tables=1
net.ipv4.ip_forward=1
vm.swappiness=0
fs.file-max=655360
EOF
sysctl --system   # -p 只加载 /etc/sysctl.conf,--system 会加载 /etc/sysctl.d/ 下的配置
2.配置主机名与 hosts 映射
# 设置主机名(各节点执行对应名称)
hostnamectl set-hostname master1   # Master1 节点
hostnamectl set-hostname master2   # Master2 节点
# 其他节点类似,如 ceph1、harbor1、lb1 等
# 所有节点添加 hosts 映射
cat <<EOF >> /etc/hosts
10.24.1.10 master1
10.24.1.11 master2
10.24.1.12 master3
10.24.1.13 ceph1
10.24.1.14 ceph2
10.24.1.15 ceph3
10.24.1.16 harbor1
10.24.1.17 harbor2
10.24.1.18 harbor3
10.24.1.19 monitor1
10.24.1.20 monitor2
10.24.1.21 monitor3
10.24.1.22 lb1
10.24.1.23 lb2
10.24.1.24 k8s-vip
10.24.1.25 apm-kuboard
EOF
3.安装 Docker(阿里云仓库)
# 添加阿里云 Docker 仓库
yum install -y yum-utils
yum-config-manager --add-repo https://mirrors.aliyun.com/docker-ce/linux/centos/docker-ce.repo
yum makecache fast
# 安装 Docker CE
yum install -y docker-ce-24.0.7 docker-ce-cli-24.0.7 containerd.io
# 配置 systemd Cgroup Driver(daemon.json 默认不存在,直接写入而非 sed 修改)
mkdir -p /etc/docker
cat <<EOF > /etc/docker/daemon.json
{
  "exec-opts": ["native.cgroupdriver=systemd"],
  "log-driver": "json-file",
  "log-opts": {"max-size": "100m"}
}
EOF
systemctl enable --now docker
systemctl restart docker
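Kubernetes 1.24+ 已移除 dockershim,kubelet 需要通过 CRI 对接容器运行时。以下示例将随 Docker 一同安装的 containerd 启用 CRI 并切换为 systemd cgroup(pause 镜像仓库地址为示例选择):
# docker 仓库安装的 containerd 默认在 config.toml 中禁用了 cri 插件,重新生成默认配置
containerd config default > /etc/containerd/config.toml
sed -i 's/SystemdCgroup = false/SystemdCgroup = true/' /etc/containerd/config.toml
sed -i 's#sandbox_image = ".*"#sandbox_image = "registry.aliyuncs.com/google_containers/pause:3.9"#' /etc/containerd/config.toml
systemctl enable --now containerd
systemctl restart containerd
# kubeadm 将使用 unix:///run/containerd/containerd.sock;仅当检测到多个运行时套接字时才需显式指定 --cri-socket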
4.安装 Kubernetes 组件(阿里云仓库)
# 添加阿里云 Kubernetes 仓库
cat <<EOF > /etc/yum.repos.d/kubernetes.repo
[kubernetes]
name=Kubernetes
baseurl=https://mirrors.aliyun.com/kubernetes/yum/repos/kubernetes-el7-x86_64
enabled=1
gpgcheck=1
repo_gpgcheck=1
gpgkey=https://mirrors.aliyun.com/kubernetes/yum/doc/yum-key.gpg https://mirrors.aliyun.com/kubernetes/yum/doc/rpm-package-key.gpg
EOF
# 安装组件
yum install -y kubelet-1.28.3 kubeadm-1.28.3 kubectl-1.28.3 --disableexcludes=kubernetes
systemctl enable --now kubelet
5.启用 IPVS 转发
yum install -y ipvsadm ipset
modprobe ip_vs
modprobe ip_vs_rr
modprobe ip_vs_wrr
modprobe ip_vs_sh
modprobe nf_conntrack
echo -e "ip_vs\nip_vs_rr\nip_vs_wrr\nip_vs_sh\nnf_conntrack" >> /etc/modules-load.d/ipvs.conf
五、高可用负载均衡器(HAProxy+Keepalived)
1.HAProxy 配置(负载均衡节点 10.24.1.22/23)
yum install -y haproxy
cat <<EOF > /etc/haproxy/haproxy.cfg
global
log 127.0.0.1 local2 info
maxconn 8192
chroot /var/lib/haproxy
pidfile /var/run/haproxy.pid
stats socket /var/lib/haproxy/stats mode 660 level admin
defaults
mode tcp
timeout connect 10s
timeout client 30s
timeout server 30s
balance roundrobin
option tcp-check
# Kubernetes API Server 负载均衡
frontend k8s-api
bind 0.0.0.0:6443
default_backend k8s-master
# Harbor 镜像仓库负载均衡
frontend harbor-https
bind 0.0.0.0:443
default_backend harbor-nodes
# 监控系统负载均衡
frontend monitoring
bind 0.0.0.0:9090
default_backend monitoring-backend
# 日志系统负载均衡
frontend logging
bind 0.0.0.0:5601
default_backend logging-backend
backend k8s-master
# 6443 为 TLS 端口,且 tcp-check expect status 并非合法语法,这里使用默认的 TCP 端口探活;
# 如需 7 层健康检查,可改用 option httpchk GET /healthz 并以 check-ssl 方式探测
server master1 10.24.1.10:6443 check inter 2000ms fall 3 rise 2
server master2 10.24.1.11:6443 check inter 2000ms fall 3 rise 2
server master3 10.24.1.12:6443 check inter 2000ms fall 3 rise 2
backend harbor-nodes
# 443 为 TLS 端口,同样使用 TCP 端口探活
server harbor1 10.24.1.16:443 check inter 2000ms fall 3 rise 2
server harbor2 10.24.1.17:443 check inter 2000ms fall 3 rise 2
server harbor3 10.24.1.18:443 check inter 2000ms fall 3 rise 2
# 监控后端配置
backend monitoring-backend
server monitor1 10.24.1.19:9090 check inter 2000ms fall 3 rise 2
server monitor2 10.24.1.20:9090 check inter 2000ms fall 3 rise 2
server monitor3 10.24.1.21:9090 check inter 2000ms fall 3 rise 2
# 日志后端配置
backend logging-backend
server log1 10.24.1.19:5601 check inter 2000ms fall 3 rise 2
server log2 10.24.1.20:5601 check inter 2000ms fall 3 rise 2
server log3 10.24.1.21:5601 check inter 2000ms fall 3 rise 2
EOF
systemctl enable --now haproxy
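启动后可按下面的方式快速校验配置与监听端口(均为通用检查命令):
haproxy -c -f /etc/haproxy/haproxy.cfg        # 语法检查,预期输出 "Configuration file is valid"
ss -lntp | grep -E '6443|:443|9090|5601'      # 确认各前端端口已监听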
2.Keepalived 配置
(1)主节点 10.24.1.22
yum install -y keepalived
cat <<EOF > /etc/keepalived/keepalived.conf
global_defs {
router_id LVS_K8S_VIP
}
vrrp_instance VI_1 {
state BACKUP            # 非抢占模式要求两个节点初始状态均为 BACKUP,由优先级决定 VIP 归属
interface eth0
virtual_router_id 51
priority 100
nopreempt # 禁止抢占,避免节点恢复后 VIP 来回漂移
advert_int 1
authentication {
auth_type PASS
auth_pass k8s-lb-2024
}
virtual_ipaddress {
10.24.1.24/24
}
}
EOF
systemctl enable --now keepalived
(2)备节点 10.24.1.23
yum install -y keepalived
cat <<EOF > /etc/keepalived/keepalived.conf
global_defs {
router_id LVS_K8S_VIP
}
vrrp_instance VI_1 {
state BACKUP
interface eth0
virtual_router_id 51
priority 90
nopreempt # 禁止抢占
advert_int 1
authentication {
auth_type PASS
auth_pass k8s-lb-2024
}
virtual_ipaddress {
10.24.1.24/24
}
}
EOF
systemctl enable --now keepalived
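Keepalived 默认只监测自身进程,建议增加对 HAProxy 的探活,HAProxy 故障时主动让出 VIP。以下为示例(两个负载均衡节点都需配置,脚本路径为示例假设):
cat <<'EOF' > /etc/keepalived/check_haproxy.sh
#!/bin/bash
# HAProxy 进程不存活时返回非 0,触发 Keepalived 降低优先级
systemctl is-active --quiet haproxy || exit 1
EOF
chmod +x /etc/keepalived/check_haproxy.sh
# 在 keepalived.conf 中于 vrrp_instance 之外定义探活脚本:
# vrrp_script chk_haproxy {
#     script "/etc/keepalived/check_haproxy.sh"
#     interval 2
#     weight -20
# }
# 并在 vrrp_instance VI_1 中引用:
# track_script { chk_haproxy }
systemctl restart keepalived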
六、高可用 etcd 集群(Master 节点)
1.生成 etcd 证书(Master1 节点操作)
# 安装 cfssl(若 pkg.cfssl.org 不可用,可改从 GitHub Releases 下载对应二进制)
curl -s -L -o /usr/local/bin/cfssl https://pkg.cfssl.org/R1.2/cfssl_linux-amd64
curl -s -L -o /usr/local/bin/cfssljson https://pkg.cfssl.org/R1.2/cfssljson_linux-amd64
chmod +x /usr/local/bin/cfssl /usr/local/bin/cfssljson
mkdir -p /etc/etcd/pki
cd /etc/etcd/pki
# 生成 CA 证书
cat <<EOF > ca-csr.json
{
  "CN": "etcd-ca",
  "key": { "algo": "rsa", "size": 2048 },
  "names": [{"C": "CN", "L": "Beijing", "O": "Kubernetes"}]
}
EOF
cfssl gencert -initca ca-csr.json | cfssljson -bare ca
# CA 签发配置(定义 etcd profile,供 -profile=etcd 使用)
cat <<EOF > ca-config.json
{
  "signing": {
    "default": { "expiry": "87600h" },
    "profiles": {
      "etcd": {
        "expiry": "87600h",
        "usages": ["signing", "key encipherment", "server auth", "client auth"]
      }
    }
  }
}
EOF
# 生成节点证书(包含所有 Master IP)
cat <<EOF > server-csr.json
{
  "CN": "etcd-server",
  "hosts": ["10.24.1.10", "10.24.1.11", "10.24.1.12", "127.0.0.1"],
  "key": { "algo": "rsa", "size": 2048 },
  "names": [{"C": "CN", "L": "Beijing", "O": "Kubernetes"}]
}
EOF
cfssl gencert -ca=ca.pem -ca-key=ca-key.pem -config=ca-config.json -profile=etcd server-csr.json | cfssljson -bare server
# 分发证书(目标节点需先创建 /etc/etcd/pki 目录)
ssh master2 "mkdir -p /etc/etcd/pki" && ssh master3 "mkdir -p /etc/etcd/pki"
scp ca.pem server.pem server-key.pem master2:/etc/etcd/pki/
scp ca.pem server.pem server-key.pem master3:/etc/etcd/pki/
2.安装与配置 etcd
# 下载 etcd(各 Master 节点执行)
ETCD_VER=v3.5.9
wget https://github.com/etcd-io/etcd/releases/download/${ETCD_VER}/etcd-${ETCD_VER}-linux-amd64.tar.gz
tar xzvf etcd-${ETCD_VER}-linux-amd64.tar.gz
mv etcd-${ETCD_VER}-linux-amd64/etcd* /usr/local/bin/
# 配置文件 /etc/etcd/etcd.conf(各节点的 NODE_IP 取本机管理网 IP)
NODE_IP=$(hostname -i | awk '{print $1}')
cat <<EOF > /etc/etcd/etcd.conf
ETCD_NAME=etcd-$(hostname)
ETCD_DATA_DIR=/var/lib/etcd
ETCD_LISTEN_PEER_URLS=https://${NODE_IP}:2380
ETCD_LISTEN_CLIENT_URLS=https://${NODE_IP}:2379,https://127.0.0.1:2379
ETCD_INITIAL_ADVERTISE_PEER_URLS=https://${NODE_IP}:2380
ETCD_ADVERTISE_CLIENT_URLS=https://${NODE_IP}:2379
ETCD_INITIAL_CLUSTER="etcd-master1=https://10.24.1.10:2380,etcd-master2=https://10.24.1.11:2380,etcd-master3=https://10.24.1.12:2380"
ETCD_INITIAL_CLUSTER_STATE=new
ETCD_INITIAL_CLUSTER_TOKEN=etcd-k8s-cluster
ETCD_CERT_FILE=/etc/etcd/pki/server.pem
ETCD_KEY_FILE=/etc/etcd/pki/server-key.pem
ETCD_TRUSTED_CA_FILE=/etc/etcd/pki/ca.pem
ETCD_PEER_CERT_FILE=/etc/etcd/pki/server.pem
ETCD_PEER_KEY_FILE=/etc/etcd/pki/server-key.pem
ETCD_PEER_TRUSTED_CA_FILE=/etc/etcd/pki/ca.pem
EOF
# systemd 单元、启动与健康检查见下方示例
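二进制方式安装的 etcd 不自带 systemd 单元,下面给出一个最小化的单元文件与集群健康检查示例(单元内容为常见写法,路径沿用上文约定):
cat <<'EOF' > /etc/systemd/system/etcd.service
[Unit]
Description=etcd key-value store
After=network.target

[Service]
Type=notify
EnvironmentFile=/etc/etcd/etcd.conf
ExecStart=/usr/local/bin/etcd
Restart=on-failure
LimitNOFILE=65536

[Install]
WantedBy=multi-user.target
EOF
systemctl daemon-reload && systemctl enable --now etcd
# 三个节点全部启动后验证集群健康
ETCDCTL_API=3 etcdctl \
  --endpoints=https://10.24.1.10:2379,https://10.24.1.11:2379,https://10.24.1.12:2379 \
  --cacert=/etc/etcd/pki/ca.pem --cert=/etc/etcd/pki/server.pem --key=/etc/etcd/pki/server-key.pem \
  endpoint health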
七、Kubernetes 集群初始化
1.主 Master 节点初始化(10.24.1.10)
# kubeadm init 不支持直接传 --etcd-servers、--apiserver-extra-args、--kube-proxy-extra-args 等参数,
# 外部 etcd 与 IPVS 需通过配置文件声明(IPVS 自 1.11 起已 GA,无需 SupportIPVSProxyMode feature-gate)
cat <<EOF > kubeadm-config.yaml
apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
kubernetesVersion: v1.28.3
controlPlaneEndpoint: "k8s-vip:6443"
imageRepository: registry.aliyuncs.com/google_containers
networking:
  podSubnet: 10.200.0.0/16
  serviceSubnet: 10.90.0.0/16
apiServer:
  certSANs: ["k8s-vip", "10.24.1.24"]
etcd:
  external:
    endpoints: ["https://10.24.1.10:2379", "https://10.24.1.11:2379", "https://10.24.1.12:2379"]
    caFile: /etc/etcd/pki/ca.pem
    certFile: /etc/etcd/pki/server.pem
    keyFile: /etc/etcd/pki/server-key.pem
---
apiVersion: kubeproxy.config.k8s.io/v1alpha1
kind: KubeProxyConfiguration
mode: ipvs
EOF
kubeadm init --config kubeadm-config.yaml --upload-certs
# 配置 kubectl
mkdir -p $HOME/.kube
cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
chown $(id -u):$(id -g) $HOME/.kube/config
2.添加其他 Master 节点(10.24.1.11/12)
# <token>/<hash>/<certificate-key> 在 kubeadm init 的输出中给出,
# 也可通过 kubeadm token create --print-join-command 重新获取;
# 外部 etcd 的证书路径已写入集群配置,join 时无需再指定 etcd 证书参数
kubeadm join k8s-vip:6443 \
  --token <token> \
  --discovery-token-ca-cert-hash sha256:<hash> \
  --control-plane \
  --certificate-key <certificate-key>
3.加入 Worker 节点
kubeadm join k8s-vip:6443 \
  --token <token> \
  --discovery-token-ca-cert-hash sha256:<hash>
八、Calico 网络插件部署
# 下载 Calico v3.26.5 清单(使用 GitHub 源,docs.projectcalico.org 的旧版路径已不再按版本提供)
wget https://raw.githubusercontent.com/projectcalico/calico/v3.26.5/manifests/calico.yaml
# 编辑 calico.yaml:取消 CALICO_IPV4POOL_CIDR 的注释并改为 10.200.0.0/16(默认 192.168.0.0/16,与本方案不一致)
kubectl apply -f calico.yaml
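部署后可按如下方式确认网络组件与 IPIP 模式生效(均为通用检查命令):
kubectl get pods -n kube-system -l k8s-app=calico-node -o wide   # 各节点 calico-node 均应为 Running
kubectl get ippools.crd.projectcalico.org default-ipv4-ippool -o yaml | grep -E 'cidr|ipipMode'   # cidr 应为 10.200.0.0/16,ipipMode 为 Always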
九、Ceph 分布式存储系统
1.存储节点准备(10.24.1.13-15)
# 安装 Ceph 部署工具(注:Quincy 官方推荐 cephadm,ceph-deploy 已停止维护,此处沿用 ceph-deploy 流程)
yum install -y https://download.ceph.com/rpm-quincy/el7/noarch/ceph-release-1-1.el7.noarch.rpm
yum install -y ceph-deploy
# 初始化部署目录(在 ceph1 上执行)
mkdir ceph-cluster && cd ceph-cluster
ceph-deploy new ceph1 ceph2 ceph3
2.配置 ceph.conf
# ceph-deploy new 已生成包含 fsid、mon_initial_members、mon_host 的 [global] 段,
# 这里只需追加网络与副本策略等配置(不要再用 uuidgen 生成新的 fsid,会与已生成的冲突)
cat <<EOF >> ceph.conf
public_network = 10.24.1.0/24
cluster_network = 10.24.5.0/24
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
osd_pool_default_size = 3
osd_pool_default_min_size = 2

[mon]
mon_allow_pool_delete = true
EOF
3.部署 Ceph 集群
# 安装组件
ceph-deploy install ceph1 ceph2 ceph3
# 初始化 Monitor、分发管理密钥并创建 MGR
ceph-deploy mon create-initial
ceph-deploy admin ceph1 ceph2 ceph3
ceph-deploy mgr create ceph1 ceph2 ceph3
# 创建 OSD(ceph-deploy 2.x 使用 osd create,已无 prepare/activate 子命令)
ceph-deploy osd create --data /dev/sdc ceph1
ceph-deploy osd create --data /dev/sdc ceph2
ceph-deploy osd create --data /dev/sdc ceph3
# 其余 HDD(/dev/sdd、/dev/sde 等)按相同方式逐块创建;
# 如需将 SSD 用作 block.db,先在 /dev/sdb 上为每个 OSD 划分分区并追加 --block-db 参数
4.安装 Ceph CSI 驱动
# ceph-csi 的 RBD 驱动由多个清单组成,建议克隆仓库后整体应用
git clone --depth=1 https://github.com/ceph/ceph-csi.git
cd ceph-csi/deploy/rbd/kubernetes
# 先编辑 csi-config-map.yaml,填入 Ceph 集群的 fsid 与 MON 地址(10.24.1.13-15:6789),再应用全部清单
kubectl apply -f .
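下面给出一个配套的认证 Secret 与 StorageClass 示例(存储池 kube、用户 admin、clusterID 等均为示例假设,需替换为 ceph -s 输出的实际 fsid 以及实际创建的池/用户):
# 在 Ceph 侧创建 RBD 存储池(示例池名 kube)
ceph osd pool create kube 128 && rbd pool init kube
# 在 Kubernetes 侧创建 Secret 与 StorageClass
kubectl apply -f - <<EOF
apiVersion: v1
kind: Secret
metadata:
  name: csi-rbd-secret
  namespace: default
stringData:
  userID: admin
  userKey: <ceph auth get-key client.admin 的输出>
---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ceph-rbd
provisioner: rbd.csi.ceph.com
parameters:
  clusterID: <ceph fsid>
  pool: kube
  imageFeatures: layering
  csi.storage.k8s.io/provisioner-secret-name: csi-rbd-secret
  csi.storage.k8s.io/provisioner-secret-namespace: default
  csi.storage.k8s.io/node-stage-secret-name: csi-rbd-secret
  csi.storage.k8s.io/node-stage-secret-namespace: default
  csi.storage.k8s.io/controller-expand-secret-name: csi-rbd-secret
  csi.storage.k8s.io/controller-expand-secret-namespace: default
reclaimPolicy: Delete
allowVolumeExpansion: true
EOF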
十、Harbor 高可用镜像仓库
1.安装 Docker Compose(Harbor 节点)
curl -L https://github.com/docker/compose/releases/download/1.29.2/docker-compose-$(uname -s)-$(uname -m) -o /usr/local/bin/docker-compose
chmod +x /usr/local/bin/docker-compose
2.生成自签证书(所有 Harbor 节点)
# 生成 Harbor 自签证书(在任一节点执行)
# 注:Docker/containerd 等 Go 客户端要求证书包含 SAN;若 openssl ≥ 1.1.1,
# 可追加 -addext "subjectAltName=DNS:k8s-vip,IP:10.24.1.24",否则需通过 -config 指定扩展
mkdir -p /etc/harbor/certs
openssl req -x509 -nodes -days 3650 -newkey rsa:4096 \
  -keyout /etc/harbor/certs/harbor.key \
  -out /etc/harbor/certs/harbor.crt \
  -subj "/C=CN/ST=Beijing/L=Beijing/O=Harbor/CN=k8s-vip"
# 分发证书到所有 Harbor 节点
scp -r /etc/harbor/certs harbor1:/etc/harbor/
scp -r /etc/harbor/certs harbor2:/etc/harbor/
scp -r /etc/harbor/certs harbor3:/etc/harbor/
# 集群所有节点信任该自签证书(否则拉取镜像会报 x509 错误),并重启容器运行时使其生效
cp /etc/harbor/certs/harbor.crt /etc/pki/ca-trust/source/anchors/
update-ca-trust enable && update-ca-trust extract
systemctl restart docker containerd
3.配置与启动 Harbor
# 下载并解压
wget https://github.com/goharbor/harbor/releases/download/v2.8.2/harbor-offline-installer-v2.8.2.tgz
tar xvf harbor-offline-installer-v2.8.2.tgz && cd harbor
# 基于模板生成 harbor.yml(内置 Redis 无需密码;只修改关键项,其余保持默认)
cp harbor.yml.tmpl harbor.yml
# 修改以下字段:
#   hostname: k8s-vip
#   external_url: https://k8s-vip:443
#   data_volume: /data/harbor
#   harbor_admin_password: Harbor123!
#   https.certificate: /etc/harbor/certs/harbor.crt
#   https.private_key: /etc/harbor/certs/harbor.key
#   database.password: db_Harbor123!
# 启动 Harbor
./install.sh
4.验证集群访问 Harbor
# 登录 Harbor 并推送一个测试镜像(library 为默认公开项目)
docker login k8s-vip:443 -u admin -p Harbor123!
docker pull busybox:1.35
docker tag busybox:1.35 k8s-vip:443/library/busybox:1.35
docker push k8s-vip:443/library/busybox:1.35
# 在集群中拉取镜像验证(补充 sleep 命令,避免容器启动后立即退出)
kubectl run harbor-test --image=k8s-vip:443/library/busybox:1.35 --command -- sleep 3600
kubectl get pod harbor-test   # 预期状态为 Running
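若镜像放在 Harbor 私有项目中,还需为命名空间创建拉取凭据并在 Pod 或 ServiceAccount 中引用。以下为示例(命名空间与凭据名称为示例假设):
kubectl create secret docker-registry harbor-cred \
  --docker-server=k8s-vip:443 \
  --docker-username=admin \
  --docker-password=Harbor123! \
  -n default
# Pod 中通过 spec.imagePullSecrets 引用,或直接绑定到 default ServiceAccount:
kubectl patch serviceaccount default -n default -p '{"imagePullSecrets":[{"name":"harbor-cred"}]}'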
十一、监控与日志系统部署
1.Prometheus+Grafana 监控
(1)Prometheus 安装与启动
wget https://github.com/prometheus/prometheus/releases/download/v2.47.0/prometheus-2.47.0.linux-amd64.tar.gz
tar xvf prometheus-2.47.0.linux-amd64.tar.gz && mv prometheus-2.47.0.linux-amd64 /opt/prometheus
cat <<EOF > /opt/prometheus/prometheus.yml
global:
  scrape_interval: 15s
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
  - job_name: 'kubernetes-nodes'
    # 集群外部署 Prometheus 时,服务发现需显式指定 API Server 与访问凭据(token 文件路径为示例);
    # 抓取 kubelet 指标还需在该 job 中配置 scheme: https 与相同凭据
    kubernetes_sd_configs:
      - role: node
        api_server: https://k8s-vip:6443
        tls_config:
          insecure_skip_verify: true
        bearer_token_file: /opt/prometheus/k8s-sa.token
EOF
nohup /opt/prometheus/prometheus --config.file=/opt/prometheus/prometheus.yml > /var/log/prometheus.log 2>&1 &
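生产环境不建议用 nohup 常驻进程,可参考下面的 systemd 单元托管(单元内容为常见写法示例,路径沿用上文;若已用后台方式启动,先停止原进程再启用):
cat <<'EOF' > /etc/systemd/system/prometheus.service
[Unit]
Description=Prometheus Server
After=network.target

[Service]
ExecStart=/opt/prometheus/prometheus \
  --config.file=/opt/prometheus/prometheus.yml \
  --storage.tsdb.path=/opt/prometheus/data \
  --storage.tsdb.retention.time=30d
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF
systemctl daemon-reload && systemctl enable --now prometheus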
(2)Grafana 安装
yum install -y https://dl.grafana.com/oss/release/grafana-10.2.3-1.x86_64.rpm
systemctl enable --now grafana-server
2.EFK 日志系统
(1)Elasticsearch 安装与配置启动
wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-8.10.4-x86_64.rpm
rpm -ivh elasticsearch-8.10.4-x86_64.rpm
# 默认配置中 network.host 处于注释状态,直接追加即可(sed 替换 127.0.0.1 不会命中)
echo "network.host: 0.0.0.0" >> /etc/elasticsearch/elasticsearch.yml
systemctl enable --now elasticsearch
# 三节点组集群所需的发现与安全配置见下方示例
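要让 monitor1-3 组成三节点 ES 集群,还需要节点发现配置;同时 ES 8 默认开启 X-Pack 安全,新节点加入需使用 enrollment token。下面是一个最小化示例(集群名与节点名沿用本文主机名,属示例写法,实际以 Elastic 官方流程为准):
# 各节点追加集群发现配置(node.name 取本机主机名)
cat <<EOF >> /etc/elasticsearch/elasticsearch.yml
cluster.name: k8s-logging
node.name: $(hostname)
discovery.seed_hosts: ["monitor1", "monitor2", "monitor3"]
cluster.initial_master_nodes: ["monitor1", "monitor2", "monitor3"]
EOF
# monitor1 启动后,为其余节点生成加入令牌,并重置 elastic 用户密码
/usr/share/elasticsearch/bin/elasticsearch-create-enrollment-token -s node        # 在 monitor1 执行
/usr/share/elasticsearch/bin/elasticsearch-reconfigure-node --enrollment-token <token>   # 在 monitor2/3 执行
/usr/share/elasticsearch/bin/elasticsearch-reset-password -u elastic              # 获取 elastic 用户密码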
(2)Kibana 安装与配置启动
# 生成自签证书(供 Nginx 反向代理使用)
mkdir -p /etc/nginx
openssl req -x509 -nodes -days 3650 -newkey rsa:4096 \
  -keyout /etc/nginx/kibana.key \
  -out /etc/nginx/kibana.crt \
  -subj "/C=CN/ST=Beijing/L=Beijing/O=Kibana/CN=kibana.k8s.local"
# 安装与启动
wget https://artifacts.elastic.co/downloads/kibana/kibana-8.10.4-x86_64.rpm
rpm -ivh kibana-8.10.4-x86_64.rpm
# 默认配置中 elasticsearch.hosts 处于注释状态,直接追加;ES 8 还需为 Kibana 配置服务账号令牌或 enrollment token
echo "elasticsearch.hosts: ['https://monitor1:9200', 'https://monitor2:9200', 'https://monitor3:9200']" >> /etc/kibana/kibana.yml
systemctl enable --now kibana
(3)Nginx 反向代理与 Basic Auth 认证
# 安装 Nginx 与 htpasswd 工具(htpasswd 由 httpd-tools 提供)
yum install -y nginx httpd-tools
# 创建 Basic Auth 账号
htpasswd -c /etc/nginx/.htpasswd kibana-user
# 配置文件 /etc/nginx/conf.d/kibana.conf
cat <<EOF > /etc/nginx/conf.d/kibana.conf
server {
listen 443 ssl;
server_name kibana.k8s.local;
ssl_certificate /etc/nginx/kibana.crt;
ssl_certificate_key /etc/nginx/kibana.key;
location / {
auth_basic "Restricted Access";
auth_basic_user_file /etc/nginx/.htpasswd;
proxy_pass http://10.24.1.19:5601;
proxy_set_header Host \$host;
}
}
EOF
systemctl enable --now nginx
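客户端需能把 kibana.k8s.local 解析到部署 Nginx 的节点(内网 DNS 或 hosts)。下面是一个快速验证示例(假设 Nginx 部署在 monitor1/10.24.1.19,属示例假设):
echo "10.24.1.19 kibana.k8s.local" >> /etc/hosts
curl -k -u kibana-user https://kibana.k8s.local/ -o /dev/null -w '%{http_code}\n'   # 输入密码后预期返回 200 或 302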
十二、APM 工具 Jaeger 部署(10.24.1.25)
# 安装 Docker Compose
curl -L https://github.com/docker/compose/releases/download/1.29.2/docker-compose-$(uname -s)-$(uname -m) -o /usr/local/bin/docker-compose
chmod +x /usr/local/bin/docker-compose
mkdir jaeger && cd jaeger
cat <<EOF > docker-compose.yml
version: '3'
services:
  jaeger:
    image: jaegertracing/all-in-one:1.46.0
    ports:
      - "16686:16686"   # Web 界面
      - "6831:6831/udp" # UDP 收集端口
      - "14268:14268"   # HTTP 收集端口
    environment:
      - COLLECTOR_ZIPKIN_HTTP_PORT=9411
    restart: always
EOF
docker-compose up -d
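应用侧只需把追踪数据上报到该节点。以下示例通过 kubectl set env 为某个 Deployment 注入 Jaeger 客户端常用的环境变量(demo-app 为示例 Deployment 名称;变量名是各语言 jaeger-client 的通用约定,具体以所用 SDK 文档为准):
kubectl set env deployment/demo-app JAEGER_AGENT_HOST=10.24.1.25 JAEGER_AGENT_PORT=6831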
十三、Kuboard 可视化界面(10.24.1.25)
# kuboard.cn 提供的 kuboard.yaml 是 Kubernetes 清单而非 docker-compose 文件,
# 在独立节点上以容器方式运行 Kuboard v3 可参考官方的 docker run 方式:
docker run -d \
  --restart=unless-stopped \
  --name=kuboard \
  -p 8080:80/tcp \
  -p 10081:10081/tcp \
  -e KUBOARD_ENDPOINT="http://10.24.1.25:8080" \
  -e KUBOARD_AGENT_SERVER_TCP_PORT="10081" \
  -v /root/kuboard-data:/data \
  eipwork/kuboard:v3
# 浏览器访问 http://10.24.1.25:8080(默认账号 admin / Kuboard123),按界面提示导入集群
十四、认证与授权配置
1.Pod 安全标准(PSA)
# Kubernetes 1.25+ 已移除 PodSecurityPolicy,1.28 使用 Pod Security Admission,
# 通过命名空间标签按 restricted 标准强制校验
kubectl label namespace default \
  pod-security.kubernetes.io/enforce=restricted \
  pod-security.kubernetes.io/enforce-version=v1.28 \
  pod-security.kubernetes.io/warn=restricted
2.网络策略(Calico)
kubectl apply -f - <<EOF
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
spec:
  podSelector: {}
  policyTypes:
  - Ingress
  - Egress
EOF
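default-deny 会同时阻断 DNS 查询,通常还需放行到 kube-system 中 CoreDNS 的 53 端口。以下为示例(标签选择器按 kubeadm 默认的 kube-dns 标签,属常见写法):
kubectl apply -f - <<EOF
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns-egress
spec:
  podSelector: {}
  policyTypes:
  - Egress
  egress:
  - to:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: kube-system
      podSelector:
        matchLabels:
          k8s-app: kube-dns
    ports:
    - protocol: UDP
      port: 53
    - protocol: TCP
      port: 53
EOF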
十五、数据加密配置
1.etcd 数据静态加密(通过 kube-apiserver 配置)
# 静态加密由 kube-apiserver 的 --encryption-provider-config 实现(而非 etcd 自身参数),在所有 Master 节点配置
# 创建加密密钥
head -c 32 /dev/urandom | base64 > /etc/kubernetes/encryption-key
# 加密配置 /etc/kubernetes/encryption-config.yaml
cat <<EOF > /etc/kubernetes/encryption-config.yaml
apiVersion: apiserver.config.k8s.io/v1
kind: EncryptionConfiguration
resources:
  - resources:
      - secrets
    providers:
      - aescbc:
          keys:
            - name: key1
              secret: $(cat /etc/kubernetes/encryption-key)
      - identity: {}
EOF
# 在 /etc/kubernetes/manifests/kube-apiserver.yaml 中追加参数并挂载该文件所在目录:
#   --encryption-provider-config=/etc/kubernetes/encryption-config.yaml
# 保存后 kubelet 会自动重建 kube-apiserver 静态 Pod
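配置生效后,可以创建一个测试 Secret 并直接读取 etcd 中的存储内容,确认已是 aescbc 密文(Secret 名称为示例假设):
kubectl create secret generic enc-test --from-literal=key=value
ETCDCTL_API=3 etcdctl \
  --endpoints=https://10.24.1.10:2379 \
  --cacert=/etc/etcd/pki/ca.pem --cert=/etc/etcd/pki/server.pem --key=/etc/etcd/pki/server-key.pem \
  get /registry/secrets/default/enc-test | hexdump -C | head   # 输出应包含 k8s:enc:aescbc:v1: 前缀,而非明文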
十六、集群验证与测试
1.核心组件状态检查
# 检查控制平面组件状态(componentstatuses 已废弃,但仍可用于快速查看)
kubectl get componentstatuses   # 预期输出:所有组件状态为 Healthy
# 验证 CoreDNS 运行状态
kubectl get pods -n kube-system -l k8s-app=kube-dns   # 预期输出:CoreDNS Pod 均为 Running
# 节点状态与角色检查
kubectl get nodes --show-labels   # 所有节点应为 Ready,Master 节点带有 node-role.kubernetes.io/control-plane 标签
# Worker 节点默认不带 role 标签,可手动添加便于识别:
kubectl label node <worker-node> node-role.kubernetes.io/worker=
2.监控与日志系统访问
服务名称 | 访问地址 | 验证方式 |
---|---|---|
Kubernetes API | https://k8s-vip:6443 | kubectl cluster-info 显示正常 |
Harbor | https://k8s-vip:443 | 用户名 admin、密码 Harbor123! 登录成功 |
Prometheus | http://k8s-vip:9090 | 页面显示指标数据 |
Kibana | https://kibana.k8s.local | Basic Auth 认证后显示日志界面 |
Jaeger | http://10.24.1.25:16686 | 显示分布式追踪界面 |
Kuboard | http://10.24.1.25:8080 | 登录后显示集群可视化界面 |
十七、灾难恢复
1.etcd 集群灾难恢复(含业务依赖验证)
(1)全量备份恢复流程
# 1. 停止故障节点 etcd 服务并备份旧数据目录
systemctl stop etcd
mv /var/lib/etcd /var/lib/etcd.bak
# 2. 恢复快照(使用最新全量备份;每个节点需以自己的名称与 peer 地址执行 restore)
ETCDCTL_API=3 etcdctl snapshot restore /backup/etcd/etcd-snapshot-latest.db \
  --name etcd-master1 \
  --initial-advertise-peer-urls=https://10.24.1.10:2380 \
  --initial-cluster="etcd-master1=https://10.24.1.10:2380,etcd-master2=https://10.24.1.11:2380,etcd-master3=https://10.24.1.12:2380" \
  --initial-cluster-state=new \
  --data-dir=/var/lib/etcd
# 3. 启动服务并验证集群与业务状态
systemctl start etcd
ETCDCTL_API=3 etcdctl --endpoints=https://10.24.1.10:2379 \
  --cacert=/etc/etcd/pki/ca.pem --cert=/etc/etcd/pki/server.pem --key=/etc/etcd/pki/server-key.pem \
  endpoint health                       # 检查节点健康
kubectl get pods --all-namespaces       # 验证业务 Pod 调度是否正常(依赖 etcd 数据)
(2)多节点故障重建(一致性检查)
# 当超过半数 Master 节点故障时:
# 1. 先用最新快照在一个健康节点上恢复,作为引导节点(参考上面的 restore 流程,--initial-cluster 中只保留该节点)
# 2. 其余节点逐个通过 member add 重新加入并同步数据:
ETCDCTL_API=3 etcdctl --endpoints=https://10.24.1.10:2379 \
  --cacert=/etc/etcd/pki/ca.pem --cert=/etc/etcd/pki/server.pem --key=/etc/etcd/pki/server-key.pem \
  member add etcd-master2 --peer-urls=https://10.24.1.11:2380
# 3. 确认集群恢复多数派一致性(v3 使用 endpoint health,cluster-health 为 v2 命令)
ETCDCTL_API=3 etcdctl --endpoints=https://10.24.1.10:2379,https://10.24.1.11:2379,https://10.24.1.12:2379 \
  --cacert=/etc/etcd/pki/ca.pem --cert=/etc/etcd/pki/server.pem --key=/etc/etcd/pki/server-key.pem \
  endpoint health --cluster
2.Ceph 集群灾难恢复(强化存储与业务联动)
(1)OSD 节点故障恢复(含存储卷验证)
# 场景:OSD 磁盘损坏导致数据副本降级
# 1. 故障处理(osd fail/remove 并非有效子命令,使用 out/purge)
ceph osd out osd.0
ceph osd purge osd.0 --yes-i-really-mean-it
# 2. 新磁盘部署
ceph-deploy osd create --data /dev/sdd ceph1   # 用新磁盘创建 OSD,加入后自动触发数据再平衡
ceph -s                                        # 观察 recovery/backfill 进度
# 3. 业务验证
kubectl get pvc test-pvc                       # 检查 PVC 状态是否仍为 Bound
kubectl exec -it app-pod -- df -h              # 验证 Pod 能否正常访问 Ceph 存储
(2)Monitor 节点故障(保障存储控制面可用性)
# 1. 单个 Monitor 节点故障时:
ceph mon remove ceph1        # 将故障 MON 从集群中移除
ceph-deploy mon add ceph1    # 节点修复后重新部署 MON(自动同步集群配置)
ceph -s                      # 确认 mon 数量恢复为 3,集群状态为 HEALTH_OK
# 2. 业务影响:短暂影响存储元数据操作,Kubernetes 存储控制器会自动重试,Pod 重建后即可恢复数据访问。
3.Kubernetes 控制平面恢复(保障控制平面高可用)
(1)Master 节点替换(含角色验证)
# 场景:master1 硬件故障,替换为新节点
# 1. 节点初始化:执行基础环境准备步骤,配置 IP 10.24.1.10,并以控制平面身份重新加入集群
kubeadm join k8s-vip:6443 \
  --token <token> \
  --discovery-token-ca-cert-hash sha256:<hash> \
  --control-plane \
  --certificate-key <key>   # 证书密钥可通过 kubeadm init phase upload-certs --upload-certs 重新生成
# 2. 角色确认
kubectl get nodes master1 -o jsonpath='{.metadata.labels.node-role\.kubernetes\.io/control-plane}'   # 验证控制平面标签
kubectl get componentstatuses   # 检查控制器管理器、调度器是否正常运行
(2)API Server 故障自愈(业务无感知恢复)
# 当 API Server Pod 异常时:
kubectl get pods -n kube-system -l component=kube-apiserver                   # 确认故障 Pod
kubectl delete pod <apiserver-pod> -n kube-system --grace-period=0 --force    # 强制删除,kubelet 自动重建静态 Pod
watch kubectl get pods -n kube-system                                         # 等待 Pod 进入 Running(HAProxy 自动切换流量)
4.Harbor 镜像仓库恢复(保障镜像拉取连续性)
(1)单节点故障(负载均衡自动切换)
# 场景:harbor1 节点故障,流量自动切换至 harbor2/harbor3
1. HAProxy 健康检查自动剔除故障节点(无需人工干预)
2. 新节点恢复步骤:
docker-compose down # 停止故障节点服务
rsync -av /data/harbor/ new-harbor-node:/data/harbor/ # 同步数据(基于共享存储或备份)
docker-compose up -d # 启动新节点,HAProxy 自动添加回后端
3. 业务验证:
kubectl run nginx --image=k8s-vip:443/library/nginx:alpine # 验证镜像拉取速度与可用性
(2)数据卷损坏(快速恢复镜像服务)
1. 基于备份存储快速恢复:
umount /data/harbor && mount /backup/harbor-volume /data/harbor # 挂载备份卷
docker-compose restart # 重启 Harbor 服务,10 分钟内恢复镜像访问
5.监控与日志系统恢复(保障可观测性连续性)
(1)Elasticsearch 节点故障(数据分片自动迁移)
# 场景:monitor1 节点 ES 故障,数据分片迁移至 monitor2/monitor3
1. 重启服务或替换节点后:
curl -k -u elastic:<密码> "https://monitor2:9200/_cluster/health?wait_for_status=green&timeout=120s"   # 等待集群恢复健康(ES 8 默认启用认证与 TLS)
2. 业务验证:
kubectl logs app-pod   # 确认日志采集正常
# Grafana 仪表盘刷新后应显示最新指标数据,验证监控数据连续性
(2)Prometheus 数据恢复(避免监控中断)
1. 基于远程存储(如 Ceph RBD)恢复:
systemctl stop prometheus
rm -rf /opt/prometheus/data/* # 清除损坏数据
mount /dev/rbd0 /opt/prometheus/data # 挂载远程存储卷(包含历史数据)
systemctl start prometheus # 1 分钟内恢复监控数据采集
6.整体容灾演练与业务连续性保障
(1)跨可用区容灾策略(增强版)
容灾维度 | 技术方案 | 业务恢复指标 |
---|---|---|
etcd 备份 | 每日全量备份至对象存储(MinIO),支持 15 分钟内恢复最新数据 | RTO ≤ 30 分钟 |
Ceph 副本 | 关键存储池启用 3 副本,跨 AZ 部署,故障时自动触发数据重建 | 数据零丢失(RPO=0) |
VIP 漂移 | Keepalived 非抢占模式,故障检测时间 < 5 秒,VIP 切换无感知 | 业务中断 < 10 秒 |
应用自愈 | Kubernetes 控制器自动重建故障节点 Pod,配合 Readiness Probe 确保服务可用 | Pod 恢复时间 < 2 分钟 |
(2)季度容灾演练清单
- 模拟场景:
  - 同时故障 2 个 Master 节点
  - Ceph 存储节点双网卡故障
  - Harbor 集群半数节点宕机
- 验证步骤:
  - etcd 恢复验证:
    ETCDCTL_API=3 etcdctl snapshot restore /backup/etcd/etcd-snapshot-latest.db --data-dir=/var/lib/etcd   # 按前文流程恢复最新快照
    kubectl get secrets -A        # 检查业务密钥是否完整
  - 存储恢复验证:
    ceph osd tree                 # 确认 OSD 节点重建完成
    kubectl exec -it db-pod -- mysql -h db-service   # 验证数据库存储访问
  - 镜像服务验证:
    docker pull k8s-vip:443/library/app:v1           # 测试镜像拉取,延迟应 < 500ms
十八、方案总结
1.本方案通过以下核心特性确保生产环境落地:
- 高可用架构:3 节点 Master + Keepalived + HAProxy,控制平面入口故障自动切换
- 完整自签证书体系:覆盖 Harbor、Kibana、etcd 的证书生成与信任配置,无需外部 CA
- 存储与网络隔离:Ceph 双网卡分离管理网与存储网,保障数据传输性能
- 安全与合规:启用 PSA、网络策略、组件间 TLS 与 Secret 静态加密,满足企业级安全要求
- 可观测性:集成 Prometheus+Grafana、EFK、Jaeger,实现监控、日志、分布式追踪全覆盖
2.最终验证清单
验证项 | 执行命令 / 操作 | 预期结果 |
---|---|---|
HAProxy 配置生效 | systemctl status haproxy | 服务运行正常,无错误日志 |
Keepalived VIP 绑定 | ip addr show dev eth0 | 显示 10.24.1.24/24 虚拟 IP |
CoreDNS 解析 | nslookup kubernetes.default.svc | 解析到 Service IP(10.90.0.1 示例) |
Harbor 镜像拉取 | kubectl run harbor-test --image=k8s-vip:443/library/busybox:1.35 --command -- sleep 3600 | Pod 状态为 Running |
Kibana 认证访问 | 浏览器访问 https://kibana.k8s.local,输入 Basic Auth 用户名密码 | 显示 Kibana 界面 |
节点角色正确性 | kubectl get nodes --show-labels | Master 节点含 control-plane 标签 |