29.5 Building a Kubernetes Cluster (ansible-playbook)
29.6 Kubernetes Cluster Backup and Restore
29.5 Building a Kubernetes Cluster (ansible-playbook)
Install a Kubernetes cluster from binaries using Ansible.
For details and installation steps, this document follows https://github.com/gjmzj/kubeasz
Extension: deploying a cluster with kubeadm: https://blog.frognew.com/2018/08/kubeadm-install-kubernetes-1.11.html
Official Kubernetes releases on GitHub: https://github.com/kubernetes/kubernetes/releases
Hardware and software requirements:
1) Four machines. CPU and memory: master at least 1 core / 2 GB, 2 cores / 4 GB recommended; node at least 1 core / 2 GB
2) Linux with kernel 3.10 or later; CentOS 7 / RHEL 7 recommended
3) Docker 1.9 or later; 1.12+ recommended
4) etcd 2.0 or later; 3.0+ recommended
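The minimums above can be checked on each machine before installing. A hedged sketch; `version_ge` is an illustrative helper, not part of kubeasz:

```shell
#!/bin/sh
# Preflight sketch: check the kernel minimum listed above.
# version_ge is an illustrative helper (not part of kubeasz).
version_ge() {
    # true if $1 >= $2 in version order (sort -V)
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n1)" = "$2" ]
}

kernel="$(uname -r | cut -d- -f1)"
if version_ge "$kernel" "3.10"; then
    echo "kernel $kernel OK"
else
    echo "kernel $kernel is older than 3.10" >&2
fi
```

The same helper can compare the Docker and etcd versions against their minimums.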
Node roles for a highly available cluster:
deploy node x1: the node that runs this set of Ansible playbooks
etcd nodes x3: an etcd cluster must have an odd number of members (1, 3, 5, 7, …)
master nodes x2: add more according to actual cluster size; an extra master VIP (virtual IP) must also be planned
lb nodes x2: two load-balancer nodes running haproxy + keepalived
node nodes x2: the nodes that actually run workloads; scale machine specs and node count as needed
Machine planning diagram: (diagram not reproduced here)
Preparation
On all four machines, run:

```shell
yum install -y epel-release
yum update -y
systemctl stop firewalld
systemctl disable firewalld
setenforce 0
```
On kun01:
1) Install Ansible

```shell
yum install -y ansible
```

or

```shell
yum install -y python-pip git
pip install pip --upgrade -i http://mirrors.aliyun.com/pypi/simple/ --trusted-host mirrors.aliyun.com
pip install --no-cache-dir ansible -i http://mirrors.aliyun.com/pypi/simple/ --trusted-host mirrors.aliyun.com
```
Problem: installing Ansible fails with

```
python-httplib2-0.9.2-1.el7.no FAILED
http://ftp.sjtu.edu.cn/centos/7.5.1804/extras/x86_64/Packages/python-httplib2-0.9.2-1.el7.noarch.rpm: [Errno -1] Package does not match intended download. Suggestion: run yum --enablerepo=extras clean metadata
Trying other mirror.
```

Solution: download the python-httplib2 RPM manually and install it:

```shell
rpm -ivh python-httplib2-0.9.2-1.el7.noarch.rpm
```
2) Generate an SSH key and copy it to the other machines

```shell
ssh-keygen                                                 # generate the key pair
for ip in 101 102 103 104; do ssh-copy-id 192.168.80.$ip; done
# ssh-copy-id copies the local public key to a remote machine
```
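After copying the keys, a quick check confirms that every node accepts key-based login. A hedged sketch; `node_ips` is an illustrative helper and the addresses mirror the plan above:

```shell
# Sketch: verify passwordless login to every node after ssh-copy-id.
# node_ips is an illustrative helper; the addresses mirror the plan above.
node_ips() { for i in 101 102 103 104; do echo "192.168.80.$i"; done; }

for ip in $(node_ips); do
    # BatchMode makes ssh fail fast instead of prompting for a password
    ssh -o BatchMode=yes -o ConnectTimeout=3 "$ip" hostname \
        || echo "WARN: key-based login not working for $ip" >&2
done
```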
3) Download kubeasz

```shell
git clone https://github.com/gjmzj/kubeasz.git
mv kubeasz/* /etc/ansible/
```

The kubeasz directory contains all the installation playbooks.
4) Download and unpack the binaries
Baidu network disk: https://pan.baidu.com/s/1c4RFaA#list/path=%2F
The tarballs contain the Kubernetes and Docker executables.
Download the 1-12 tarball and copy it to kun01, then:

```shell
tar zxvf k8s.1-12-1.tar.gz
mv bin/* /etc/ansible/bin/
```
5) Configure the cluster parameters

```shell
cp /etc/ansible/example/hosts.m-masters.example /etc/ansible/hosts   # copy the template
vim /etc/ansible/hosts
```

```ini
[deploy]
192.168.80.101 NTP_ENABLED=no

[etcd]
192.168.80.101 NODE_NAME=etcd1
192.168.80.102 NODE_NAME=etcd2
192.168.80.103 NODE_NAME=etcd3

[kube-master]
192.168.80.101
192.168.80.104

[lb]
192.168.80.101 LB_IF="ens33" LB_ROLE=master   # set according to the NIC actually in use
192.168.80.104 LB_IF="ens33" LB_ROLE=backup

[kube-node]
192.168.80.102
192.168.80.103

K8S_VER="v1.12"              # version
MASTER_IP="192.168.80.66"    # the VIP
```
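Before running any playbook, a hedged sanity check of the edited inventory can catch typos early; `check_group` and the grep patterns are illustrative assumptions, not kubeasz tooling:

```shell
# Sketch: sanity-check the edited inventory before running any playbook.
# check_group is an illustrative helper; HOSTS defaults to the path used above.
HOSTS="${HOSTS:-/etc/ansible/hosts}"

check_group() {  # true if the inventory defines group [$1]
    grep -q "^\[$1\]" "$HOSTS" 2>/dev/null
}

for g in deploy etcd kube-master lb kube-node; do
    check_group "$g" || echo "WARN: group [$g] missing from $HOSTS" >&2
done
grep -q '^MASTER_IP=' "$HOSTS" 2>/dev/null \
    || echo "WARN: MASTER_IP (the VIP) is not set" >&2
```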
Step-by-step installation
1) Create certificates and prepare for installation

```shell
cd /etc/ansible/
ansible-playbook 01.prepare.yml
```
2) Install the etcd cluster

```shell
ansible-playbook 02.etcd.yml
```

Check the health of the etcd nodes:

```shell
bash    # start a new shell so the freshly copied binaries take effect
for ip in 101 102 103; do ETCDCTL_API=3 etcdctl --endpoints=https://192.168.80.${ip}:2379 --cacert=/etc/kubernetes/ssl/ca.pem --cert=/etc/etcd/ssl/etcd.pem --key=/etc/etcd/ssl/etcd-key.pem endpoint health; done
```
3) Install Docker

```shell
ansible-playbook 03.docker.yml
```

4) Install the master nodes

```shell
ansible-playbook 04.kube-master.yml
```
Problem: cannot connect to the VIP

```
fatal: [192.168.80.101 -> 192.168.80.101]: FAILED! => {"attempts": 5, "changed": true, "cmd": ["/opt/kube/bin/kubectl", "get", "node"], "delta": "0:00:15.187609", "end": "2018-10-27 18:49:05.111291", "msg": "non-zero return code", "rc": 1, "start": "2018-10-27 18:48:49.923682", "stderr": "Unable to connect to the server: dial tcp 192.168.80.66:8443: connect: no route to host", "stderr_lines": ["Unable to connect to the server: dial tcp 192.168.80.66:8443: connect: no route to host"], "stdout": "", "stdout_lines": []}
```

Solution: the NIC name in the configuration file was wrong, so the keepalived service never started on the nodes.
Fix the hosts file:

```shell
vim /etc/ansible/hosts
# change LB_IF="eth33" to LB_IF="ens33"
```

Then fix the keepalived configuration on each node and start the service:

```shell
vim /etc/keepalived/keepalived.conf
# change "interface eth33" to "interface ens33"
systemctl start keepalived
```
Check the cluster status:

```shell
kubectl get cs
```

5) Install the worker nodes

```shell
ansible-playbook 05.kube-node.yml
```

List the nodes:

```shell
kubectl get no
```
6) Deploy the cluster network

```shell
ansible-playbook 06.network.yml
```

List the pods in the kube-system namespace; the flannel pods should appear:

```shell
kubectl get pod -n kube-system
```

7) Install the cluster add-ons (DNS, dashboard)

```shell
ansible-playbook 07.cluster-addon.yml
```

List the services in the kube-system namespace:

```shell
kubectl get svc -n kube-system
```
One-step installation

```shell
ansible-playbook 90.setup.yml
```

View cluster information:

```shell
kubectl cluster-info
```

View node/pod resource usage:

```shell
kubectl top node
kubectl top pod --all-namespaces
```
Test DNS
a) Create an nginx service

```shell
kubectl run nginx --image=nginx --expose --port=80
```

b) Create a busybox test pod

```shell
kubectl run busybox --rm -it --image=busybox /bin/sh
# inside busybox:
nslookup nginx.default.svc.cluster.local
# expected result:
#   Server:    10.68.0.2
#   Address:   10.68.0.2:53
#   Name:      nginx.default.svc.cluster.local
#   Address:   10.68.9.156
```
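The name queried above follows the service DNS pattern `<service>.<namespace>.svc.<cluster-domain>`. A tiny illustrative helper to compose it, assuming the default `cluster.local` domain:

```shell
# Compose a service FQDN; cluster.local is the default cluster domain here.
svc_fqdn() {
    echo "$1.${2:-default}.svc.${CLUSTER_DOMAIN:-cluster.local}"
}

svc_fqdn nginx                 # nginx.default.svc.cluster.local
svc_fqdn kube-dns kube-system  # kube-dns.kube-system.svc.cluster.local
```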
Adding a worker node
1) Set up passwordless login from the deploy node to the new node

```shell
ssh-copy-id <new node ip>
```

2) Edit /etc/ansible/hosts

```ini
[new-node]
172.7.15.117
```

3) Run the installation playbook

```shell
ansible-playbook /etc/ansible/20.addnode.yml
```

4) Verify

```shell
kubectl get node
kubectl get pod -n kube-system -o wide
```

5) Follow-up: edit /etc/ansible/hosts again and move all the IPs in the [new-node] group into the [kube-node] group.
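After the move, the relevant inventory sections would look like this (using the example IP above; [new-node] is left empty):

```ini
[kube-node]
192.168.80.102
192.168.80.103
172.7.15.117

[new-node]
```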
Adding a master node
See https://github.com/gjmzj/kubeasz/blob/master/docs/op/AddMaster.md
Upgrading the cluster
1) Back up etcd

```shell
ETCDCTL_API=3 etcdctl snapshot save backup.db
```

Inspect the backup file:

```shell
ETCDCTL_API=3 etcdctl --write-out=table snapshot status backup.db
```

2) Go to the root directory of the kubeasz project

```shell
cd /dir/to/kubeasz
```

Pull the latest code:

```shell
git pull origin master
```

3) Download the Kubernetes binaries for the target version (Baidu network disk: https://pan.baidu.com/s/1c4RFaA#list/path=%2F), unpack them, and replace the binaries under /etc/ansible/bin/ with them.
4) Docker upgrade (omitted here); unless there is a specific need, frequent Docker upgrades are not recommended.
5) If a service interruption is acceptable, run:

```shell
ansible-playbook -t upgrade_k8s,restart_dockerd 22.upgrade.yml
```

6) If even a brief interruption is not acceptable:

```shell
ansible-playbook -t upgrade_k8s 22.upgrade.yml
```

Then, on every node, one at a time:

```shell
kubectl cordon <node> && kubectl drain <node>   # evacuate the workload pods
systemctl restart docker
kubectl uncordon <node>                         # put the node back in service
```
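The per-node sequence above can be sketched as a loop. This is an illustrative dry run, not kubeasz code: NODES, the drain flag, and the `run` helper are assumptions; set DRY_RUN=0 to actually execute the commands:

```shell
# Dry-run sketch of the rolling (no-interruption) upgrade sequence above.
# NODES, the drain flag, and run() are illustrative assumptions.
NODES="192.168.80.102 192.168.80.103"

run() {  # print the command in dry-run mode, execute it otherwise
    if [ "${DRY_RUN:-1}" = "1" ]; then echo "+ $*"; else "$@"; fi
}

for n in $NODES; do
    run kubectl cordon "$n"
    run kubectl drain "$n" --ignore-daemonsets   # evacuate workload pods
    run ssh "$n" systemctl restart docker
    run kubectl uncordon "$n"                    # put the node back in service
done
```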
29.6 Kubernetes Cluster Backup and Restore
1) How backup and restore work:
Backup: save the data of the running etcd cluster to a file on disk.
Restore: load the etcd backup file back into an etcd cluster, then rebuild the whole cluster from it.
2) For a cluster created with kubeasz, besides the etcd data you also need to back up the CA certificate files and the Ansible hosts file.
Manual steps:
Back up the etcd data and the CA certificates:

```shell
mkdir -p /backup/k8s                                         # create the backup directory
ETCDCTL_API=3 etcdctl snapshot save /backup/k8s/snapshot.db  # back up the etcd data
cp /etc/kubernetes/ssl/ca* /backup/k8s/                      # back up the CA certificates
```
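The three steps above can be wrapped in one routine that timestamps each snapshot, similar in spirit to what 23.backup.yml does. A hedged sketch; `backup_k8s` and `snapshot_path` are illustrative helpers, not part of kubeasz:

```shell
# Sketch: timestamped backup of the etcd data plus the CA certificates.
# backup_k8s and snapshot_path are illustrative, not part of kubeasz.
snapshot_path() {  # usage: snapshot_path <dir>
    echo "$1/snapshot-$(date +%Y%m%d%H%M).db"
}

backup_k8s() {
    dir="${1:-/backup/k8s}"
    mkdir -p "$dir"
    ETCDCTL_API=3 etcdctl snapshot save "$(snapshot_path "$dir")"
    cp /etc/kubernetes/ssl/ca* "$dir/"
}
# e.g.: backup_k8s                    # default /backup/k8s
#       backup_k8s /mnt/nfs/k8s-backup
```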
Simulate a cluster crash:

```shell
ansible-playbook /etc/ansible/99.clean.yml
```

The restore steps are as follows (on the deploy node):
a) Restore the CA certificates

```shell
mkdir -p /etc/kubernetes/ssl
cp /backup/k8s/ca* /etc/kubernetes/ssl/
```

b) Rebuild the cluster

```shell
cd /etc/ansible
ansible-playbook 01.prepare.yml
ansible-playbook 02.etcd.yml
ansible-playbook 03.docker.yml
ansible-playbook 04.kube-master.yml
ansible-playbook 05.kube-node.yml
```
c) Restore the etcd data
Stop the etcd service:

```shell
ansible etcd -m service -a 'name=etcd state=stopped'
```

Clear the data directory:

```shell
ansible etcd -m file -a 'name=/var/lib/etcd/member/ state=absent'
```

Log in to every etcd node, consult that node's service file (cat /etc/systemd/system/etcd.service), and run the commands below after substituting the variables accordingly. First copy the backup to the other etcd nodes:

```shell
rsync -av /backup/ 192.168.80.102:/backup/
rsync -av /backup/ 192.168.80.103:/backup/
```
On kun01:

```shell
cd /backup/k8s/
ETCDCTL_API=3 etcdctl snapshot restore snapshot.db \
  --name etcd1 \
  --initial-cluster etcd1=https://192.168.80.101:2380,etcd2=https://192.168.80.102:2380,etcd3=https://192.168.80.103:2380 \
  --initial-cluster-token etcd-cluster-0 \
  --initial-advertise-peer-urls https://192.168.80.101:2380
```

On kun02:

```shell
cd /backup/k8s/
ETCDCTL_API=3 etcdctl snapshot restore snapshot.db \
  --name etcd2 \
  --initial-cluster etcd1=https://192.168.80.101:2380,etcd2=https://192.168.80.102:2380,etcd3=https://192.168.80.103:2380 \
  --initial-cluster-token etcd-cluster-0 \
  --initial-advertise-peer-urls https://192.168.80.102:2380
```

On kun03:

```shell
cd /backup/k8s/
ETCDCTL_API=3 etcdctl snapshot restore snapshot.db \
  --name etcd3 \
  --initial-cluster etcd1=https://192.168.80.101:2380,etcd2=https://192.168.80.102:2380,etcd3=https://192.168.80.103:2380 \
  --initial-cluster-token etcd-cluster-0 \
  --initial-advertise-peer-urls https://192.168.80.103:2380
```
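The three per-node commands above differ only in the member name and advertise IP, so they can be folded into one helper. An illustrative sketch; `restore_member` is not part of kubeasz:

```shell
# Sketch: the per-node snapshot restore, parameterized by member name and IP.
# restore_member is illustrative; run it from /backup/k8s on each etcd node.
CLUSTER="etcd1=https://192.168.80.101:2380,etcd2=https://192.168.80.102:2380,etcd3=https://192.168.80.103:2380"

restore_member() {  # usage: restore_member <name> <ip>
    ETCDCTL_API=3 etcdctl snapshot restore snapshot.db \
        --name "$1" \
        --initial-cluster "$CLUSTER" \
        --initial-cluster-token etcd-cluster-0 \
        --initial-advertise-peer-urls "https://$2:2380"
}
# e.g. on kun01: restore_member etcd1 192.168.80.101
#      on kun02: restore_member etcd2 192.168.80.102
```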
The steps above create etcd{1,2,3}.etcd directories under /backup/k8s/; copy each member directory into /var/lib/etcd/ and restart etcd.
On kun01:

```shell
cp -r etcd1.etcd/member /var/lib/etcd/
systemctl restart etcd
```

On kun02:

```shell
cp -r etcd2.etcd/member /var/lib/etcd/
systemctl restart etcd
```

On kun03:

```shell
cp -r etcd3.etcd/member /var/lib/etcd/
systemctl restart etcd
```
d) Rebuild the network on the deploy node

```shell
ansible-playbook /etc/ansible/tools/change_k8s_network.yml
```
If you would rather not restore by hand, Ansible can do it automatically.
First take a one-step backup:

```shell
ansible-playbook /etc/ansible/23.backup.yml
```

Check that files were created under /etc/ansible/roles/cluster-backup/files:

```shell
tree /etc/ansible/roles/cluster-backup/files/
```

```
├── ca                        # cluster CA backup
│   ├── ca-config.json
│   ├── ca.csr
│   ├── ca-csr.json
│   ├── ca-key.pem
│   └── ca.pem
├── hosts                     # ansible hosts backup
│   ├── hosts                 # most recent backup
│   └── hosts-201807231642
├── readme.md
└── snapshot                  # etcd data backup
    ├── snapshot-201807231642.db
    └── snapshot.db           # most recent backup
```
Simulate a failure:

```shell
ansible-playbook /etc/ansible/99.clean.yml
```

Edit /etc/ansible/roles/cluster-restore/defaults/main.yml to point at the etcd snapshot to restore; left unchanged, the most recent one is used.
Run the restore:

```shell
ansible-playbook /etc/ansible/24.restore.yml
ansible-playbook /etc/ansible/tools/change_k8s_network.yml
```