1. Environment preparation
OS: Ubuntu 20.04.3
Kubernetes version: 1.28.2
containerd version: 1.6.25
Three virtual machines:
master: 192.168.100.55
node1: 192.168.100.66
node2: 192.168.100.77
1.1 Initial VM setup
Step 1: Configure the IP address
The downloaded Ubuntu image may not ship the ifconfig command, so use the ip command to configure the VM's address (prerequisite: the VM can reach the host).
ip addr add 192.168.100.55/20 dev <NIC name>              # assign the address to the NIC
ip route add default via 192.168.100.1 dev <NIC name>     # add the default gateway
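These ip commands do not survive a reboot. A minimal netplan sketch for making the address persistent on Ubuntu 20.04, assuming the interface is called ens33 and the gateway also serves DNS (both are assumptions; adjust to the real environment):
# /etc/netplan/01-static.yaml   (hypothetical file name)
network:
  version: 2
  ethernets:
    ens33:                              # assumed NIC name
      addresses: [192.168.100.55/20]
      gateway4: 192.168.100.1
      nameservers:
        addresses: [192.168.100.1]      # assumed DNS server
Apply it with netplan apply.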
Step 2: Configure the apt mirror
First, add a DNS entry to the hosts file: vim /etc/hosts
Add the following line:
<internal mirror IP> aliyun.com    # Aliyun is used as the example here
Next, change the package sources. On an internal network, edit the source entries in /etc/apt/sources.list; with Internet access, a reference walkthrough is:
https://blog.csdn.net/xiangxianghehe/article/details/122856771
After configuring, run apt-get update.
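For reference, a commonly used /etc/apt/sources.list for Ubuntu 20.04 (focal) pointing at the Aliyun mirror looks roughly like this (a sketch; verify the exact paths against the mirror's own documentation):
deb https://mirrors.aliyun.com/ubuntu/ focal main restricted universe multiverse
deb https://mirrors.aliyun.com/ubuntu/ focal-updates main restricted universe multiverse
deb https://mirrors.aliyun.com/ubuntu/ focal-backports main restricted universe multiverse
deb https://mirrors.aliyun.com/ubuntu/ focal-security main restricted universe multiverse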
Step 3: Configure the Kubernetes apt repository
echo "deb <mirror URL> main" | sudo tee /etc/apt/sources.list.d/kubernetes.list
apt-get update
If the update fails because of a missing key, import the key from keyserver.ubuntu.com and run the update again.
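A sketch of the usual fix, assuming apt-get update reports a NO_PUBKEY error (the key ID is a placeholder copied from the error message):
# apt-get update prints something like: ... NO_PUBKEY <KEYID>
apt-key adv --keyserver keyserver.ubuntu.com --recv-keys <KEYID>
apt-get update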
1.2 Disable the firewall
root@master:~# systemctl disable ufw
Synchronizing state of ufw.service with SysV service script with /lib/systemd/systemd-sysv-install.
Executing: /lib/systemd/systemd-sysv-install disable ufw
1.3 Configure time synchronization
root@master:~# apt install -y ntpdate
root@master:~# ntpdate time1.aliyun.com    # or an internal NTP server address
1.4 Disable swap
root@master:~# swapoff -a
root@master:~# vim /etc/fstab   # comment out the swap line here; this particular VM has no swap entry at all
root@master:~# free -m
total used free shared buff/cache available
Mem: 3913 1161 263 3 2488 2385
Swap: 0 0 0
Why disable swap? In a compute cluster (note the word "compute": such clusters mainly run short-lived jobs that request a lot of memory, burn a lot of CPU, finish, emit results, and exit, rather than long-running services like MySQL), the desired behavior on OOM is to kill the process outright, alert the operator or job submitter, and fail the work over to another node. Using swap to keep a process limping along tends to hang the node, drags cluster performance down, and produces no alert. Worse, some clusters place swap on spinning-disk arrays, where heavy swapping is effectively a dead machine: even root cannot log in, let alone kill the offending process, and the usual outcome is a hard reboot.
Disabling swap is mandatory for Kubernetes; otherwise kubeadm will report an error.
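To make this permanent without editing the file by hand, the swap entry in /etc/fstab can be commented out with a one-liner (a sketch; review /etc/fstab afterwards):
swapoff -a
sed -ri 's/^([^#].*\sswap\s.*)$/#\1/' /etc/fstab
free -m    # the Swap line should now read 0 everywhere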
1.5 Ubuntu system configuration changes
modprobe ip_vs
modprobe ip_vs_rr
modprobe ip_vs_wrr
modprobe ip_vs_sh
modprobe nf_conntrack
modprobe nf_conntrack_ipv4   # on kernels >= 4.19 (as on Ubuntu 20.04) this module has been merged into nf_conntrack and the command can be skipped
modprobe br_netfilter
modprobe overlay
cat > /etc/modules-load.d/k8s-modules.conf <<EOF
ip_vs
ip_vs_rr
ip_vs_wrr
ip_vs_sh
nf_conntrack
nf_conntrack_ipv4
br_netfilter
overlay
EOF
cat <<EOF > /etc/sysctl.d/kubernetes.conf
# enable packet forwarding (needed for vxlan)
net.ipv4.ip_forward=1
# let iptables see bridged traffic
net.bridge.bridge-nf-call-iptables=1
net.bridge.bridge-nf-call-ip6tables=1
net.bridge.bridge-nf-call-arptables=1
# keep tcp_tw_recycle off; it conflicts with NAT and breaks connectivity
# (the key was removed in kernel 4.12+, so sysctl may report it as unknown)
net.ipv4.tcp_tw_recycle=0
# do not reuse TIME-WAIT sockets for new TCP connections
net.ipv4.tcp_tw_reuse=0
# upper limit of the listen() backlog
net.core.somaxconn=32768
# maximum number of tracked connections (default is nf_conntrack_buckets * 4)
net.netfilter.nf_conntrack_max=1000000
# avoid using swap; only swap when the system is close to OOM
vm.swappiness=0
# maximum number of memory map areas a process may have
vm.max_map_count=655360
# maximum number of file handles the kernel will allocate
fs.file-max=6553600
# TCP keepalive tuning
net.ipv4.tcp_keepalive_time=600
net.ipv4.tcp_keepalive_intvl=30
net.ipv4.tcp_keepalive_probes=10
EOF
sysctl -p /etc/sysctl.d/kubernetes.conf
ufw disable
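A quick sanity check that the modules and kernel parameters actually took effect (a sketch; the sysctl values should print 1):
lsmod | grep -E 'br_netfilter|ip_vs|overlay'
sysctl net.ipv4.ip_forward net.bridge.bridge-nf-call-iptables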
2. Configure containerd
2.1 Prerequisites
Step 1: Install the required system tools
sudo apt-get update
sudo apt-get -y install apt-transport-https ca-certificates curl software-properties-common
Step 2: Install the GPG key
mkdir -p /etc/apt/keyrings
curl -fsSL https://mirrors.aliyun.com/docker-ce/linux/ubuntu/gpg | sudo apt-key add -
Step 3: Add the repository
sudo add-apt-repository "deb [arch=amd64] https://mirrors.aliyun.com/docker-ce/linux/ubuntu $(lsb_release -cs) stable"
Step 4: Update the package index and install containerd
sudo apt-get -y update
sudo apt-get -y install containerd.io
Step 5: Check the containerd version
root@master:/home# containerd --version
containerd containerd.io 1.6.25 d8f198a4ed8892c764191ef7b3b06d8a2eeb5c7f
Step 6: Generate the default containerd configuration
root@master:~# containerd config default | sudo tee /etc/containerd/config.toml   # if this fails, create the directory first with mkdir -p /etc/containerd
2.2 Modify the containerd configuration
vim /etc/containerd/config.toml
sandbox_image = "k8s.gcr.io/pause:3.9"            # change the registry in sandbox_image from the Google repository to the Aliyun mirror address
systemd_cgroup = true                             # change the default false to true; otherwise kubelet may warn about the cgroup driver (it must match kubelet's cgroup driver)
runtime_type = "io.containerd.runtime.v1.linux"   # in this setup, image pulls failed later without this change
# reload systemd unit files
systemctl daemon-reload
# enable and restart the containerd service
systemctl enable --now containerd && systemctl restart containerd
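For reference, the more commonly documented way to switch containerd 1.6 to the systemd cgroup driver keeps the default runc v2 runtime and flips its SystemdCgroup option instead of using the legacy linux runtime. A sketch of that variant of /etc/containerd/config.toml (an alternative, not what was used above):
[plugins."io.containerd.grpc.v1.cri"]
  sandbox_image = "registry.aliyuncs.com/google_containers/pause:3.9"
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
  runtime_type = "io.containerd.runc.v2"
  [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
    SystemdCgroup = true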
Check the containerd status:
root@master:/home# systemctl status containerd
● containerd.service - containerd container runtime
Loaded: loaded (/lib/systemd/system/containerd.service; enabled; vendor preset: enabled)
Active: active (running) since Wed 2023-12-06 02:19:46 EST; 19h ago
Docs: https://containerd.io
Main PID: 32499 (containerd)
Tasks: 125
Memory: 121.9M
CPU: 14min 9.409s
CGroup: /system.slice/containerd.service
├─32499 /usr/bin/containerd
├─34173 /usr/bin/containerd-shim-runc-v2 -namespace k8s.io -id 3af88ea90654ada11aa917024580843784cb95aefd633404358575c22ed4d518 -address /run/>
├─34205 /usr/bin/containerd-shim-runc-v2 -namespace k8s.io -id 0ab295c65faa9fcc66b1c0bea85950b26fceb941c347819b9a19f56ca15f0cad -address /run/>
├─34237 /usr/bin/containerd-shim-runc-v2 -namespace k8s.io -id afb371525a1f9555d595394cfa2bffde592c65a57725b602e06ce8fc15b0c826 -address /run/>
├─34263 /usr/bin/containerd-shim-runc-v2 -namespace k8s.io -id 24adcfdfd2a194fabca3552cff2232c7618cab3e9c603e50ffd386b245ea4713 -address /run/>
├─34568 /usr/bin/containerd-shim-runc-v2 -namespace k8s.io -id b69a1dc421ff046aca6c96f6dff15ecd74f4f52859916f7689c72a34334815ea -address /run/>
├─38459 /usr/bin/containerd-shim-runc-v2 -namespace k8s.io -id 5b30d61ed93385cbb45ddcc6ffd07717daaff503e84d7de873f5999906972e78 -address /run/>
├─38509 /usr/bin/containerd-shim-runc-v2 -namespace k8s.io -id 6579c8e87a41c99f41b5e9012ee58f172497a658c6dd68aef0f270b9d6022302 -address /run/>
└─46491 /usr/bin/containerd-shim-runc-v2 -namespace k8s.io -id 999461451999fd3061e46eae3e279b1f5e631779312bfc27b70ba4aae17c1cef -address /run/>
Dec 06 04:48:57 master containerd[32499]: time="2023-12-06T04:48:57.563811417-05:00" level=info msg="shim disconnected" id=d000434f48430019b2f60cd0a9b9fd74>
Dec 06 04:48:57 master containerd[32499]: time="2023-12-06T04:48:57.563877892-05:00" level=warning msg="cleaning up after shim disconnected" id=d000434f484>
Dec 06 04:48:57 master containerd[32499]: time="2023-12-06T04:48:57.563889339-05:00" level=info msg="cleaning up dead shim"
Dec 06 04:48:57 master containerd[32499]: time="2023-12-06T04:48:57.575365944-05:00" level=warning msg="cleanup warnings time=\"2023-12-06T04:48:57-05:00\">
Dec 06 04:48:58 master containerd[32499]: time="2023-12-06T04:48:58.359166719-05:00" level=info msg="RemoveContainer for \"3081eb56e570bdaaa812baa43601906e>
Dec 06 04:48:59 master containerd[32499]: time="2023-12-06T04:48:59.259638215-05:00" level=info msg="RemoveContainer for \"3081eb56e570bdaaa812baa43601906e>
Dec 06 04:49:15 master containerd[32499]: time="2023-12-06T04:49:15.467269020-05:00" level=info msg="CreateContainer within sandbox \"afb371525a1f9555d5953>
Dec 06 04:49:18 master containerd[32499]: time="2023-12-06T04:49:18.356434471-05:00" level=info msg="CreateContainer within sandbox \"afb371525a1f9555d5953>
Dec 06 04:49:18 master containerd[32499]: time="2023-12-06T04:49:18.357125524-05:00" level=info msg="StartContainer for \"6668a22a2f3b799c2869c29d50cb08b2d>
Dec 06 04:49:18 master containerd[32499]: time="2023-12-06T04:49:18.848875292-05:00" level=info msg="StartContainer for \"6668a22a2f3b799c2869c29d50cb08b2d>
Check the ctr version:
root@master:/home# ctr --version
ctr containerd.io 1.6.25
With the configuration above in place, containerd is installed and running.
3. Install kubeadm, kubelet, and kubectl
3.1 Installation
The Kubernetes apt repository was configured earlier, so installation is a single step:
apt-get install -y kubelet kubeadm kubectl
# hold the packages so they are not automatically upgraded or removed
apt-mark hold kubelet kubeadm kubectl
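A quick check that the expected 1.28.x binaries were installed (a sketch):
kubeadm version -o short     # expect v1.28.x
kubelet --version
kubectl version --client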
3.2 Modify the crictl configuration
By default crictl still points at the Docker socket, but dockershim was removed in Kubernetes 1.24, so the endpoint has to be switched to containerd; otherwise kubeadm init will fail. Modify it as follows:
root@master:~# vim /etc/crictl.yaml
runtime-endpoint: "unix:///run/containerd/containerd.sock"
image-endpoint: "unix:///run/containerd/containerd.sock"
timeout: 10 # do not make the timeout too short; 10 seconds is used here
debug: false
pull-image-on-create: false
disable-pull-on-run: false
root@master:~# systemctl daemon-reload && systemctl restart containerd
root@master:~# crictl images
IMAGE TAG IMAGE ID SIZE
4. Initialize Kubernetes
4.1 Generate the kubeadm configuration file
kubeadm config print init-defaults --component-configs KubeletConfiguration > kubeadm_init.yaml
The configuration file looks like this:
apiVersion: kubeadm.k8s.io/v1beta3
bootstrapTokens:
- groups:
  - system:bootstrappers:kubeadm:default-node-token
  token: abcdef.0123456789abcdef
  ttl: 24h0m0s
  usages:
  - signing
  - authentication
kind: InitConfiguration
localAPIEndpoint:
  advertiseAddress: 192.168.100.55
  bindPort: 6443
nodeRegistration:
  criSocket: unix:///var/run/containerd/containerd.sock
  imagePullPolicy: IfNotPresent
  name: master
  taints: null
---
apiServer:
  timeoutForControlPlane: 4m0s
apiVersion: kubeadm.k8s.io/v1beta3
certificatesDir: /etc/kubernetes/pki
clusterName: kubernetes
controllerManager: {}
dns: {}
etcd:
  local:
    dataDir: /var/lib/etcd
imageRepository: registry.aliyuncs.com/google_containers
kind: ClusterConfiguration
kubernetesVersion: 1.28.2
networking:
  dnsDomain: cluster.local
  serviceSubnet: 10.96.0.0/12
scheduler: {}
---
apiVersion: kubelet.config.k8s.io/v1beta1
authentication:
  anonymous:
    enabled: false
  webhook:
    cacheTTL: 0s
    enabled: true
  x509:
    clientCAFile: /etc/kubernetes/pki/ca.crt
authorization:
  mode: Webhook
  webhook:
    cacheAuthorizedTTL: 0s
    cacheUnauthorizedTTL: 0s
cgroupDriver: systemd
clusterDNS:
- 10.96.0.10
clusterDomain: cluster.local
containerRuntimeEndpoint: ""
cpuManagerReconcilePeriod: 0s
evictionPressureTransitionPeriod: 0s
fileCheckFrequency: 0s
healthzBindAddress: 127.0.0.1
healthzPort: 10248
httpCheckFrequency: 0s
imageMinimumGCAge: 0s
kind: KubeletConfiguration
logging:
  flushFrequency: 0
  options:
    json:
      infoBufferSize: "0"
  verbosity: 0
memorySwap: {}
nodeStatusReportFrequency: 0s
nodeStatusUpdateFrequency: 0s
resolvConf: /run/systemd/resolve/resolv.conf
rotateCertificates: true
runtimeRequestTimeout: 0s
shutdownGracePeriod: 0s
shutdownGracePeriodCriticalPods: 0s
staticPodPath: /etc/kubernetes/manifests
streamingConnectionIdleTimeout: 0s
syncFrequency: 0s
volumeStatsAggPeriod: 0s
Fields that need to be changed:
| Field | Default | Modified value |
|---|---|---|
| cgroupDriver | systemd | containerd was already configured to use systemd, so keep the default |
| kubernetesVersion | 1.28.0 | 1.28.2 |
| imageRepository | k8s.gcr.io | registry.aliyuncs.com/google_containers |
| advertiseAddress | 1.2.3.4 | 192.168.100.55 (the master's IP) |
| nodeRegistration.name | node | the master's hostname (master here) |
4.2 Pull the images in advance
kubeadm config images pull --config kubeadm_init.yaml
4.3 Tag the pause image
ctr -n k8s.io i tag registry.aliyuncs.com/google_containers/pause:3.9 k8s.gcr.io/pause:3.9
Restart containerd:
systemctl restart containerd
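To confirm the retag worked, the pause image should now be listed under both names (a sketch):
ctr -n k8s.io images ls | grep pause
# expect both registry.aliyuncs.com/google_containers/pause:3.9 and k8s.gcr.io/pause:3.9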
4.4 Run kubeadm init
root@master:/home# kubeadm init --config kubeadm_init.yaml
[init] Using Kubernetes version: v1.28.2
······
[bootstrap-token] Creating the "cluster-info" ConfigMap in the "kube-public" namespace
[kubelet-finalize] Updating "/etc/kubernetes/kubelet.conf" to point to a rotatable kubelet client certificate and key
[addons] Applied essential addon: CoreDNS
[addons] Applied essential addon: kube-proxy
Your Kubernetes control-plane has initialized successfully!
To start using your cluster, you need to run the following as a regular user:
mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config
Alternatively, if you are the root user, you can run:
export KUBECONFIG=/etc/kubernetes/admin.conf
You should now deploy a pod network to the cluster.
Run "kubectl apply -f [podnetwork].yaml" with one of the options listed at:
https://kubernetes.io/docs/concepts/cluster-administration/addons/
Then you can join any number of worker nodes by running the following on each as root:
kubeadm join 192.168.100.55:6443 --token abcdef.0123456789abcdef \
--discovery-token-ca-cert-hash sha256:014dca58bef3df1ba0a8e75dac1ea6598487f28eec691782c5a78b8c117519b2
Following the hints above, run:
mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config
export KUBECONFIG=/etc/kubernetes/admin.conf
5. Configure the worker nodes
The VM image above can be cloned to create the worker nodes. Each worker joins the master by running the kubeadm join command printed by kubeadm init:
kubeadm join 192.168.100.55:6443 --token abcdef.0123456789abcdef \
--discovery-token-ca-cert-hash sha256:014dca58bef3df1ba0a8e75dac1ea6598487f28eec691782c5a78b8c117519b2
root@master:/home/node1# kubeadm join 192.168.100.55:6443 --token abcdef.0123456789abcdef --discovery-token-ca-cert-hash sha256:014dca58bef3df1ba0a8e75dac1ea6598487f28eec691782c5a78b8c117519b2
[preflight] Running pre-flight checks
[preflight] Reading configuration from the cluster...
[preflight] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -o yaml'
[kubelet-start] Writing kubelet configuration to file "/var/lib/kubelet/config.yaml"
[kubelet-start] Writing kubelet environment file with flags to file "/var/lib/kubelet/kubeadm-flags.env"
[kubelet-start] Starting the kubelet
[kubelet-start] Waiting for the kubelet to perform the TLS Bootstrap...
This node has joined the cluster:
* Certificate signing request was sent to apiserver and a response was received.
* The Kubelet was informed of the new secure connection details.
Run 'kubectl get nodes' on the control-plane to see this node join the cluster.
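The bootstrap token printed by kubeadm init is only valid for 24 hours (the ttl in the configuration above). If a node joins later, a fresh join command can be generated on the control plane (a sketch):
kubeadm token create --print-join-command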
At this point the basic Kubernetes cluster is installed; the network plugin still has to be deployed.
6. Configure the cluster network
Check whether the nodes are up:
root@master:/home# kubectl get pods -n kube-system
NAME READY STATUS RESTARTS AGE
coredns-86966648-5jc4x 0/1 Pending 0 61s
coredns-86966648-96sqz 0/1 Pending 0 61s
etcd-master 1/1 Running 1 73s
kube-apiserver-master 1/1 Running 1 75s
kube-controller-manager-master 1/1 Running 1 75s
kube-proxy-9nvdc 1/1 Running 0 61s
kube-scheduler-master 1/1 Running 1 70s
root@master:/home# kubectl get nodes
NAME STATUS ROLES AGE VERSION
master NotReady control-plane 2m29s v1.28.2
node1 NotReady <none> 4s v1.28.2
Both master and node1 show NotReady because no network plugin has been configured yet. The following sets up the Calico network.
6.1 Download Calico 3.26.4
Download from: https://github.com/projectcalico/calico/releases
Upload the release archive to the servers, extract it, and import the images into containerd:
ctr -n k8s.io images import calico-cni.tar
ctr -n k8s.io images import calico-kube-controllers.tar
ctr -n k8s.io images import calico-node.tar
Note: the same images must also be imported on every worker node.
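A quick check that the Calico images are visible to containerd on each node (a sketch; the names should match those referenced in calico.yaml):
ctr -n k8s.io images ls | grep calico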
Also edit the calico.yaml file under release-v3.26.4/manifests. Around line 4800, uncomment the two lines below and change the value to the pod CIDR configured during kubeadm init.
# chosen from this range. Changing this value after installation will have
# no effect. This should fall within `--cluster-cidr`.
# - name: CALICO_IPV4POOL_CIDR
# value: "10.96.0.0/12"
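After editing, the two lines should look roughly like this (the indentation must match the surrounding env entries in calico.yaml, and the CIDR is a placeholder for the cluster's pod CIDR):
- name: CALICO_IPV4POOL_CIDR
  value: "<pod CIDR configured at kubeadm init>"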
6.2 Deploy
kubectl apply -f calico.yaml
After it completes, the following can be observed:
root@master:/home# kubectl get pods --all-namespaces
NAMESPACE NAME READY STATUS RESTARTS AGE
kube-system calico-kube-controllers-7c968b5878-8sbkl 1/1 Running 2 (17h ago) 18h
kube-system calico-node-68m72 1/1 Running 0 18h
kube-system calico-node-vn95j 1/1 Running 0 18h
kube-system coredns-86966648-5jc4x 1/1 Running 0 20h
kube-system coredns-86966648-96sqz 1/1 Running 0 20h
kube-system etcd-master 1/1 Running 1 20h
kube-system kube-apiserver-master 1/1 Running 2 (17h ago) 20h
kube-system kube-controller-manager-master 1/1 Running 3 (17h ago) 20h
kube-system kube-proxy-9nvdc 1/1 Running 0 20h
kube-system kube-proxy-9xjz6 1/1 Running 0 20h
kube-system kube-scheduler-master 1/1 Running 2 (18h ago) 20h
root@master:/home# kubectl get nodes
NAME STATUS ROLES AGE VERSION
master Ready control-plane 20h v1.28.2
node1 Ready <none> 20h v1.28.2
At this point the basic cluster and its networking are configured. What remains is deploying application workloads (GitLab is a reasonable tool for driving deployments); the details depend on the workload. A fairly complete application-deployment walkthrough: https://cloud.tencent.com/developer/article/1821616
7. Troubleshooting
7.1 Problem 1
crictl images
WARN[0000] image connect using default endpoints: [unix:///var/run/dockershim.sock unix:///run/containerd/containerd.sock unix:///run/crio/crio.sock unix:///var/run/cri-dockerd.sock]. As the default settings are now deprecated, you should set the endpoint instead.
E0722 23:05:31.059137 34283 remote_image.go:119] "ListImages with filter from image service failed" err="rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial unix /var/run/dockershim.sock: connect: no such file or directory\"" filter="&ImageFilter{Image:&ImageSpec{Image:,Annotations:map[string]string{},},}"
FATA[0000] listing images: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial unix /var/run/dockershim.sock: connect: no such file or directory"
Solution:
The root cause is that /var/run/dockershim.sock does not exist at all: crictl still defaults to the Docker socket, so it has to be pointed at containerd. The error appears when the endpoint below was not set to unix:///run/containerd/containerd.sock; once it is set, the error goes away.
root@master:~# vim /etc/crictl.yaml
runtime-endpoint: "unix:///run/containerd/containerd.sock"
image-endpoint: "unix:///run/containerd/containerd.sock"
timeout: 10 # do not make the timeout too short; 10 seconds is used here
debug: false
pull-image-on-create: false
disable-pull-on-run: false
7.2 Problem 2
"Error getting node" err="node \"master\" not found"
Solution: the host's hostname does not match the node name under nodeRegistration in the kubeadm config; make them identical. If they already match and the error persists, the Kubernetes and containerd versions may be incompatible.
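A sketch of aligning the two, assuming the kubeadm config uses the name master:
hostnamectl set-hostname master
# then re-run kubeadm init / kubeadm join (run kubeadm reset first if an init attempt already partially completed)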
7.3 Problem 3
[kubelet-check] The HTTP call equal to 'curl -sSL http://localhost:10248/healthz' failed with error: Get "http://localhost:10248/healthz": dial tcp 127.0.0.1:10248: connect: connection refused.
Solution: swap was probably not turned off; run swapoff -a (and make sure the swap entry in /etc/fstab is commented out).
The solutions above may not apply to every situation; feel free to contact the author with any questions.