Kubernetes version: 1.12.1; source file: pkg/proxy/ipvs/proxier.go
This article covers only the IPVS-related parts; the startup flow is covered in an earlier post:
https://blog.csdn.net/zhonglinzhang/article/details/80185053
WHY IPVS
Although Kubernetes has supported 5000-node clusters since v1.6, kube-proxy in iptables mode is a real bottleneck for scaling to that size. With NodePort Services in a 5000-node cluster, 2000 Services with 10 pods each produce at least 20,000 iptables records on every worker node, which can keep the kernel very busy.
WHAT IS IPVS?
kube-proxy introduced IPVS mode. IPVS and iptables are both built on Netfilter, but IPVS uses hash tables, so when the number of Services grows very large the speed of hash lookups stands out and Service lookup performance stays high.
HOW IPVS?
kube-proxy startup parameters
/usr/bin/kube-proxy --bind-address=10.12.51.172 --hostname-override=10.12.51.172 --cluster-cidr=10.254.0.0/16 --kubeconfig=/etc/kubernetes/kube-proxy.kubeconfig --logtostderr=true --v=2
--ipvs-scheduler=wrr --ipvs-min-sync-period=5s --ipvs-sync-period=5s --proxy-mode=ipvs
Parameter --masquerade-all=true: IPVS masquerades all traffic destined for Service cluster IPs, matching the behavior of iptables mode.
Parameter --cluster-cidr=<cidr>: the cluster's pod CIDR; traffic reaching Service cluster IPs from outside this range is masqueraded.
Parameter --cleanup-ipvs: if true, clean up the IPVS configuration and iptables rules created in IPVS mode.
Parameter --ipvs-sync-period: the maximum interval for syncing IPVS rules (e.g. '5s', '1m').
Parameter --ipvs-min-sync-period: the minimum interval between IPVS rule syncs (e.g. '5s', '1m').
Parameter --ipvs-scheduler: the IPVS scheduling algorithm; defaults to rr. Supported values:
- rr: round-robin
- lc: least connection
- dh: destination hashing
- sh: source hashing
- sed: shortest expected delay
- nq: never queue
Prerequisites for enabling IPVS in kube-proxy
modprobe br_netfilter
cat > /etc/sysconfig/modules/ipvs.modules << EOF
#! /bin/bash
modprobe -- ip_vs
modprobe -- ip_vs_rr
modprobe -- ip_vs_wrr
modprobe -- ip_vs_sh
modprobe -- nf_conntrack_ipv4
EOF
chmod 755 /etc/sysconfig/modules/ipvs.modules
/etc/sysconfig/modules/ipvs.modules && lsmod | grep -e ip_vs -e nf_conntrack_ipv4
Cluster IPs created through Services are all bound to the kube-ipvs0 virtual interface.
How IPVS works
The following overview, taken from an article online, makes the flow clear at a glance:
ipvs: runs in kernel space and enforces the user-defined policies;
ipvsadm: runs in user space and is the tool used to define and manage cluster services.
- When a user request reaches the Director Server, the packet first hits the PREROUTING chain in kernel space. At this point the source IP is the CIP and the destination IP is the VIP.
- PREROUTING sees that the packet's destination IP is local and sends it to the INPUT chain.
- IPVS watches packets arriving at the INPUT chain. If the packet is for a cluster service, IPVS rewrites the destination IP to a backend server IP and sends the packet on to the POSTROUTING chain. Now the source IP is the CIP and the destination IP is the RIP.
- The POSTROUTING chain routes the packet to the Real Server.
- The Real Server sees the destination IP is its own, handles the request, and sends the response back to the Director Server. Now the source IP is the RIP and the destination IP is the CIP.
- Before responding to the client, the Director Server rewrites the source IP to its own VIP and then replies. Now the source IP is the VIP and the destination IP is the CIP.
IPVS supports three proxy modes: NAT (masq), IPIP, and DR.
Only NAT mode supports port mapping, so kube-proxy uses NAT mode.
How IPVS DR mode works
How ipset works
ipset is an extension to iptables that lets a rule match an entire set of addresses. While an ordinary iptables chain can only match a single IP per rule, ipset stores IP sets in indexed data structures, so lookups stay efficient even when the set is large. Official site: http://ipset.netfilter.org/
The IPVS proxier still uses iptables for packet filtering, SNAT, and masquerading. Specifically, it uses ipset to store the source or destination addresses of traffic that must be DROPped or masqueraded, which keeps the number of iptables rules constant.
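To connect the listings below with the code, here is a minimal, hypothetical sketch (not the real k8s.io/kubernetes/pkg/util/ipset package) of how a hash:ip,port member such as 10.200.0.1,tcp:443 is composed, and how a single match-set iptables rule covers the whole set:
package main

import "fmt"

// member mirrors the shape of a hash:ip,port ipset entry ("IP,protocol:port").
// It is only an illustration, not the proxier's real Entry type.
type member struct {
    IP       string
    Protocol string
    Port     int
}

func (m member) String() string { return fmt.Sprintf("%s,%s:%d", m.IP, m.Protocol, m.Port) }

func main() {
    clusterIPs := []member{
        {IP: "10.200.0.1", Protocol: "tcp", Port: 443},
        {IP: "10.200.254.254", Protocol: "udp", Port: 53},
    }
    // Each member would be added to the set, e.g. "ipset add KUBE-CLUSTER-IP <member>".
    for _, m := range clusterIPs {
        fmt.Println(m)
    }
    // A single iptables rule then matches every member of the set,
    // so the rule count no longer grows with the number of Services:
    fmt.Println("-A KUBE-SERVICES -m set --match-set KUBE-CLUSTER-IP dst,dst -j KUBE-MARK-MASQ")
}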
Kernel modules
Make sure the kernel modules IPVS needs are available: ip_vs, ip_vs_rr, ip_vs_wrr, ip_vs_sh, nf_conntrack_ipv4
var ipvsModules = []string{
"ip_vs",
"ip_vs_rr",
"ip_vs_wrr",
"ip_vs_sh",
"nf_conntrack_ipv4",
}
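As a rough illustration (not the proxier's actual check, which shells out to inspect the system), the presence of these modules could be verified from Go by scanning /proc/modules:
package main

import (
    "bufio"
    "fmt"
    "os"
    "strings"
)

var requiredModules = []string{"ip_vs", "ip_vs_rr", "ip_vs_wrr", "ip_vs_sh", "nf_conntrack_ipv4"}

// loadedModules returns the names of kernel modules listed in /proc/modules.
// Note: features compiled into the kernel (not as modules) will not appear here.
func loadedModules() (map[string]bool, error) {
    f, err := os.Open("/proc/modules")
    if err != nil {
        return nil, err
    }
    defer f.Close()

    mods := make(map[string]bool)
    scanner := bufio.NewScanner(f)
    for scanner.Scan() {
        // The module name is the first whitespace-separated field on each line.
        if fields := strings.Fields(scanner.Text()); len(fields) > 0 {
            mods[fields[0]] = true
        }
    }
    return mods, scanner.Err()
}

func main() {
    mods, err := loadedModules()
    if err != nil {
        fmt.Fprintf(os.Stderr, "failed to read /proc/modules: %v\n", err)
        os.Exit(1)
    }
    for _, m := range requiredModules {
        if !mods[m] {
            fmt.Printf("missing kernel module: %s (try: modprobe %s)\n", m, m)
        }
    }
}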
1. The NewProxier function
1.1 Set kernel parameters
- net/ipv4/conf/all/route_localnet: whether localhost (loopback) addresses can be reached from outside the host
- net/bridge/bridge-nf-call-iptables: when set to 1, packets forwarded by a Layer 2 bridge also go through the iptables FORWARD rules, i.e. L3 iptables rules end up filtering L2 frames
- net/ipv4/vs/conntrack: enable connection tracking for IPVS connections
- net/ipv4/ip_forward: whether IPv4 forwarding is enabled (0: off, 1: on)
// Set the route_localnet sysctl we need for
if err := sysctl.SetSysctl(sysctlRouteLocalnet, 1); err != nil {
return nil, fmt.Errorf("can't set sysctl %s: %v", sysctlRouteLocalnet, err)
}
// Proxy needs br_netfilter and bridge-nf-call-iptables=1 when containers
// are connected to a Linux bridge (but not SDN bridges). Until most
// plugins handle this, log when config is missing
if val, err := sysctl.GetSysctl(sysctlBridgeCallIPTables); err == nil && val != 1 {
glog.Infof("missing br-netfilter module or unset sysctl br-nf-call-iptables; proxy may not work as intended")
}
// Set the conntrack sysctl we need for
if err := sysctl.SetSysctl(sysctlVSConnTrack, 1); err != nil {
return nil, fmt.Errorf("can't set sysctl %s: %v", sysctlVSConnTrack, err)
}
// Set the ip_forward sysctl we need for
if err := sysctl.SetSysctl(sysctlForward, 1); err != nil {
return nil, fmt.Errorf("can't set sysctl %s: %v", sysctlForward, err)
}
1.2 Initialize the ipset list
Load the predefined ipsets into the ipsetList map.
// initialize ipsetList with all sets we needed
proxier.ipsetList = make(map[string]*IPSet)
for _, is := range ipsetInfo {
if is.isIPv6 {
proxier.ipsetList[is.name] = NewIPSet(ipset, is.name, is.setType, isIPv6, is.comment)
}
proxier.ipsetList[is.name] = NewIPSet(ipset, is.name, is.setType, false, is.comment)
}
1.3 Initialize syncRunner; the core function is syncProxyRules
proxier.syncRunner = async.NewBoundedFrequencyRunner("sync-runner", proxier.syncProxyRules, minSyncPeriod, syncPeriod, burstSyncs)
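NewBoundedFrequencyRunner comes from k8s.io/kubernetes/pkg/util/async. Conceptually, it guarantees syncProxyRules runs at least once every syncPeriod and, when Service/Endpoint events call Run(), no more often than minSyncPeriod allows. The toy sketch below only illustrates that behavior; it is not the real implementation (which also handles burstSyncs):
package main

import (
    "fmt"
    "time"
)

// toyRunner is a simplified stand-in for async.BoundedFrequencyRunner:
// fn runs at least once per maxInterval, and event-driven Run() requests
// are spaced at least minInterval apart.
type toyRunner struct {
    fn          func()
    minInterval time.Duration
    maxInterval time.Duration
    kick        chan struct{}
}

func newToyRunner(fn func(), minPeriod, maxPeriod time.Duration) *toyRunner {
    return &toyRunner{fn: fn, minInterval: minPeriod, maxInterval: maxPeriod, kick: make(chan struct{}, 1)}
}

// Run requests a sync, like proxier.syncRunner.Run() on a Service/Endpoint change.
func (r *toyRunner) Run() {
    select {
    case r.kick <- struct{}{}:
    default: // a request is already pending; coalesce it
    }
}

func (r *toyRunner) Loop(stop <-chan struct{}) {
    var lastRun time.Time
    for {
        select {
        case <-stop:
            return
        case <-time.After(r.maxInterval): // periodic full resync
        case <-r.kick: // event-driven resync
            if wait := r.minInterval - time.Since(lastRun); wait > 0 {
                time.Sleep(wait) // enforce the minimum interval between runs
            }
        }
        r.fn()
        lastRun = time.Now()
    }
}

func main() {
    r := newToyRunner(func() { fmt.Println("syncProxyRules ran at", time.Now().Format("15:04:05")) },
        2*time.Second, 5*time.Second)
    go r.Loop(make(chan struct{}))
    r.Run() // simulate a Service update
    time.Sleep(8 * time.Second)
}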
2. The syncProxyRules function
2.1 Reset the four buffers
Write *filter and *nat at the head of the buffers to mark the start of each table.
// Reset all buffers used later.
// This is to avoid memory reallocations and thus improve performance.
proxier.natChains.Reset()
proxier.natRules.Reset()
proxier.filterChains.Reset()
proxier.filterRules.Reset()
// Write table headers.
writeLine(proxier.filterChains, "*filter")
writeLine(proxier.natChains, "*nat")
2.2 Create the dummy device
# ip route show table local type local proto kernel
- 10.12.51.172 dev eth0 scope host src 10.12.51.172
- 10.254.0.1 dev kube-ipvs0 scope host src 10.254.0.1
- 10.254.0.2 dev kube-ipvs0 scope host src 10.254.0.2
- 10.254.69.27 dev kube-ipvs0 scope host src 10.254.69.27
- 10.254.86.39 dev kube-ipvs0 scope host src 10.254.86.39
- 127.0.0.0/8 dev lo scope host src 127.0.0.1
- 127.0.0.1 dev lo scope host src 127.0.0.1
- 172.30.46.1 dev docker0 scope host src 172.30.46.1
// make sure dummy interface exists in the system where ipvs Proxier will bind service address on it
_, err := proxier.netlinkHandle.EnsureDummyDevice(DefaultDummyDevice)
if err != nil {
glog.Errorf("Failed to create dummy interface: %s, error: %v", DefaultDummyDevice, err)
return
}
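EnsureDummyDevice (and later EnsureAddressBind) are thin wrappers over netlink operations. As a rough sketch of what they boil down to, assuming the github.com/vishvananda/netlink library and a made-up ClusterIP, the equivalent of ip link add kube-ipvs0 type dummy plus ip addr add 10.254.0.1/32 dev kube-ipvs0 looks roughly like this (Linux and root required):
package main

import (
    "fmt"
    "log"

    "github.com/vishvananda/netlink"
)

func main() {
    const dev = "kube-ipvs0"

    link, err := netlink.LinkByName(dev)
    if err != nil {
        // Device not found: create the dummy interface.
        dummy := &netlink.Dummy{LinkAttrs: netlink.LinkAttrs{Name: dev}}
        if err := netlink.LinkAdd(dummy); err != nil {
            log.Fatalf("failed to create %s: %v", dev, err)
        }
        link = dummy
    }

    // Bind a (hypothetical) Service ClusterIP to the dummy interface.
    addr, err := netlink.ParseAddr("10.254.0.1/32")
    if err != nil {
        log.Fatal(err)
    }
    if err := netlink.AddrAdd(link, addr); err != nil {
        // Typically "file exists" if the address is already bound; that is fine here.
        log.Printf("AddrAdd: %v", err)
    }
    fmt.Printf("%s exists and 10.254.0.1/32 is bound to it\n", dev)
}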
2.3 Each sync iterates over all Services
// Build IPVS rules for each service.
for svcName, svc := range proxier.serviceMap {
svcInfo, ok := svc.(*serviceInfo)
if !ok {
klog.Errorf("Failed to cast serviceInfo %q", svcName.String())
continue
}
protocol := strings.ToLower(string(svcInfo.Protocol()))
// Precompute svcNameString; with many services the many calls
// to ServicePortName.String() show up in CPU profiles.
svcNameString := svcName.String()
2.3.1 If the Service has endpoints on this node, add KUBE-LOOP-BACK ipset entries
# ipset -L KUBE-LOOP-BACK
Name: KUBE-LOOP-BACK
Type: hash:ip,port,ip
Revision: 5
Header: family inet hashsize 1024 maxelem 65536
Size in memory: 448
References: 1
Number of entries: 4
Members:
192.170.77.151,udp:53,192.170.77.151
192.170.77.151,tcp:53,192.170.77.151
192.170.77.151,tcp:9153,192.170.77.151
192.170.77.137,tcp:3306,192.170.77.137
// Handle traffic that loops back to the originator with SNAT.
for _, e := range proxier.endpointsMap[svcName] {
ep, ok := e.(*proxy.BaseEndpointInfo)
if !ok {
klog.Errorf("Failed to cast BaseEndpointInfo %q", e.String())
continue
}
if !ep.IsLocal {
continue
}
epIP := ep.IP()
epPort, err := ep.Port()
// Error parsing this endpoint has been logged. Skip to next endpoint.
if epIP == "" || err != nil {
continue
}
entry := &utilipset.Entry{
IP: epIP,
Port: epPort,
Protocol: protocol,
IP2: epIP,
SetType: utilipset.HashIPPortIP,
}
if valid := proxier.ipsetList[kubeLoopBackIPSet].validateEntry(entry); !valid {
klog.Errorf("%s", fmt.Sprintf(EntryInvalidErr, entry, proxier.ipsetList[kubeLoopBackIPSet].Name))
continue
}
proxier.ipsetList[kubeLoopBackIPSet].activeEntries.Insert(entry.String())
}
2.3.2 Add KUBE-CLUSTER-IP ipset entries
# ipset -L KUBE-CLUSTER-IP
Name: KUBE-CLUSTER-IP
Type: hash:ip,port
Revision: 5
Header: family inet hashsize 1024 maxelem 65536
Size in memory: 568
References: 2
Number of entries: 7
Members:
10.200.233.191,tcp:44134
10.200.194.33,tcp:3306
10.200.254.254,udp:53
10.200.0.1,tcp:443
10.200.254.254,tcp:9153
10.200.254.254,tcp:53
10.200.254.116,tcp:443
// Capture the clusterIP.
// ipset call
entry := &utilipset.Entry{
IP: svcInfo.ClusterIP().String(),
Port: svcInfo.Port(),
Protocol: protocol,
SetType: utilipset.HashIPPort,
}
// add service Cluster IP:Port to kubeServiceAccess ip set for the purpose of solving hairpin.
// proxier.kubeServiceAccessSet.activeEntries.Insert(entry.String())
if valid := proxier.ipsetList[kubeClusterIPSet].validateEntry(entry); !valid {
klog.Errorf("%s", fmt.Sprintf(EntryInvalidErr, entry, proxier.ipsetList[kubeClusterIPSet].Name))
continue
}
proxier.ipsetList[kubeClusterIPSet].activeEntries.Insert(entry.String())
2.3.3 Use the cluster IP as the IPVS virtual server and bind it to the dummy interface
// ipvs call
serv := &utilipvs.VirtualServer{
Address: svcInfo.ClusterIP(),
Port: uint16(svcInfo.Port()),
Protocol: string(svcInfo.Protocol()),
Scheduler: proxier.ipvsScheduler,
}
// Set session affinity flag and timeout for IPVS service
if svcInfo.SessionAffinityType() == v1.ServiceAffinityClientIP {
serv.Flags |= utilipvs.FlagPersistent
serv.Timeout = uint32(svcInfo.StickyMaxAgeSeconds())
}
// We need to bind ClusterIP to dummy interface, so set `bindAddr` parameter to `true` in syncService()
if err := proxier.syncService(svcNameString, serv, true); err == nil {
activeIPVSServices[serv.String()] = true
activeBindAddrs[serv.Address.String()] = true
// ExternalTrafficPolicy only works for NodePort and external LB traffic, does not affect ClusterIP
// So we still need clusterIP rules in onlyNodeLocalEndpoints mode.
if err := proxier.syncEndpoint(svcName, false, serv); err != nil {
klog.Errorf("Failed to sync endpoint for service: %v, err: %v", serv, err)
}
} else {
klog.Errorf("Failed to sync service: %v, err: %v", serv, err)
}
2.3.3.1 The syncService function
# ipvsadm -l
IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
-> RemoteAddress:Port Forward Weight ActiveConn InActConn
TCP master.node.local:35351 rr
-> 192.170.77.137:mysql Masq 1 0 0
TCP lin:https rr
Virtual servers that do not exist yet are added and changed ones are updated; EnsureAddressBind then adds the address directly to kube-ipvs0, equivalent to the command ip addr add $addr dev $link.
func (proxier *Proxier) syncService(svcName string, vs *utilipvs.VirtualServer, bindAddr bool) error {
appliedVirtualServer, _ := proxier.ipvs.GetVirtualServer(vs)
klog.Infof("zzlin syncService bindAddr: %v svcName: %v vs: %#v appliedVirtualServer: %#v", bindAddr, svcName, vs, appliedVirtualServer)
if appliedVirtualServer == nil || !appliedVirtualServer.Equal(vs) {
if appliedVirtualServer == nil {
// IPVS service is not found, create a new service
klog.V(3).Infof("Adding new service %q %s:%d/%s", svcName, vs.Address, vs.Port, vs.Protocol)
if err := proxier.ipvs.AddVirtualServer(vs); err != nil {
klog.Errorf("Failed to add IPVS service %q: %v", svcName, err)
return err
}
} else {
// IPVS service was changed, update the existing one
// During updates, service VIP will not go down
klog.V(3).Infof("IPVS service %s was changed", svcName)
if err := proxier.ipvs.UpdateVirtualServer(vs); err != nil {
klog.Errorf("Failed to update IPVS service, err:%v", err)
return err
}
}
}
// bind service address to dummy interface even if service not changed,
// in case that service IP was removed by other processes
if bindAddr {
klog.V(4).Infof("Bind addr %s", vs.Address.String())
_, err := proxier.netlinkHandle.EnsureAddressBind(vs.Address.String(), DefaultDummyDevice)
if err != nil {
klog.Errorf("Failed to bind service address to dummy device %q: %v", svcName, err)
return err
}
}
return nil
}
2.3.3.2 The syncEndpoint function
Fetch the virtual server; if it cannot be fetched, return an error.
func (proxier *Proxier) syncEndpoint(svcPortName proxy.ServicePortName, onlyNodeLocalEndpoints bool, vs *utilipvs.VirtualServer) error {
appliedVirtualServer, err := proxier.ipvs.GetVirtualServer(vs)
if err != nil || appliedVirtualServer == nil {
klog.Errorf("Failed to get IPVS service, error: %v", err)
return err
}
2.3.3.2.1 For new endpoints, add real servers
// Create new endpoints
for _, ep := range newEndpoints.List() {
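// Note: in the full source, ip and portNum are parsed from ep (net.SplitHostPort and strconv.Atoi); that parsing is omitted in this excerpt.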
newDest := &utilipvs.RealServer{
Address: net.ParseIP(ip),
Port: uint16(portNum),
Weight: 1,
}
err = proxier.ipvs.AddRealServer(appliedVirtualServer, newDest)
if err != nil {
klog.Errorf("Failed to add destination: %v, error: %v", newDest, err)
continue
}
}
2.3.3.2.2 For stale endpoints, gracefully delete the real servers
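The deletion path is not shown here. Conceptually, graceful deletion means the real server first gets weight 0 so the scheduler stops sending it new connections, and it is only removed once existing connections have drained (or a timeout expires). A tiny, hypothetical sketch of that idea, with an in-memory stand-in for the IPVS destination:
package main

import (
    "fmt"
    "time"
)

// realServer is an illustrative stand-in for an IPVS destination, not the real type.
type realServer struct {
    Address     string
    Port        uint16
    Weight      int
    activeConns int
}

// gracefulDelete sketches the idea: set the weight to 0 so no new connections
// are scheduled to this backend, wait for existing connections to drain (or a
// timeout), then actually remove the real server.
func gracefulDelete(rs *realServer, timeout time.Duration) {
    rs.Weight = 0 // scheduler stops picking this backend
    deadline := time.Now().Add(timeout)
    for rs.activeConns > 0 && time.Now().Before(deadline) {
        time.Sleep(100 * time.Millisecond)
        rs.activeConns-- // stands in for polling the kernel's ActiveConn counter
    }
    fmt.Printf("removing real server %s:%d (remaining conns: %d)\n", rs.Address, rs.Port, rs.activeConns)
}

func main() {
    gracefulDelete(&realServer{Address: "192.170.77.137", Port: 3306, Weight: 1, activeConns: 3}, 5*time.Second)
}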
2.3.4 For NodePort Services, add the node addresses
if svcInfo.NodePort() != 0 {
if len(nodeAddresses) == 0 || len(nodeIPs) == 0 {
// Skip nodePort configuration since an error occurred when
// computing nodeAddresses or nodeIPs.
continue
}
var lps []utilproxy.LocalPort
for _, address := range nodeAddresses {
lp := utilproxy.LocalPort{
Description: "nodePort for " + svcNameString,
IP: address,
Port: svcInfo.NodePort(),
Protocol: protocol,
}
if utilproxy.IsZeroCIDR(address) {
// Empty IP address means all
lp.IP = ""
lps = append(lps, lp)
// If we encounter a zero CIDR, then there is no point in processing the rest of the addresses.
break
}
lps = append(lps, lp)
}
2.3.5 For NodePort, open the port on the node, which really just means listening on it
// For ports on node IPs, open the actual port and hold it.
for _, lp := range lps {
if proxier.portsMap[lp] != nil {
klog.V(4).Infof("Port %s was open before and is still needed", lp.String())
replacementPortsMap[lp] = proxier.portsMap[lp]
// We do not start listening on SCTP ports, according to our agreement in the
// SCTP support KEP
} else if svcInfo.Protocol() != v1.ProtocolSCTP {
socket, err := proxier.portMapper.OpenLocalPort(&lp)
if err != nil {
klog.Errorf("can't open %s, skipping this nodePort: %v", lp.String(), err)
continue
}
if lp.Protocol == "udp" {
isIPv6 := utilnet.IsIPv6(svcInfo.ClusterIP())
conntrack.ClearEntriesForPort(proxier.exec, lp.Port, isIPv6, v1.ProtocolUDP)
}
replacementPortsMap[lp] = socket
} // We're holding the port, so it's OK to install ipvs rules.
}
2.3.6 Add ipset entries; a TCP NodePort, for example, goes into the KUBE-NODE-PORT-TCP set shown below
# ipset -L KUBE-NODE-PORT-TCP
Name: KUBE-NODE-PORT-TCP
Type: bitmap:port
Revision: 3
Header: range 0-65535
Size in memory: 8300
References: 1
Number of entries: 1
Members:
35351
switch protocol {
case "tcp":
nodePortSet = proxier.ipsetList[kubeNodePortSetTCP]
entries = []*utilipset.Entry{{
// No need to provide ip info
Port: svcInfo.NodePort(),
Protocol: protocol,
SetType: utilipset.BitmapPort,
}}
case "udp":
nodePortSet = proxier.ipsetList[kubeNodePortSetUDP]
entries = []*utilipset.Entry{{
// No need to provide ip info
Port: svcInfo.NodePort(),
Protocol: protocol,
SetType: utilipset.BitmapPort,
}}
case "sctp":
nodePortSet = proxier.ipsetList[kubeNodePortSetSCTP]
// Since hash ip:port is used for SCTP, all the nodeIPs to be used in the SCTP ipset entries.
entries = []*utilipset.Entry{}
for _, nodeIP := range nodeIPs {
entries = append(entries, &utilipset.Entry{
IP: nodeIP.String(),
Port: svcInfo.NodePort(),
Protocol: protocol,
SetType: utilipset.HashIPPort,
})
}
default:
// It should never hit
klog.Errorf("Unsupported protocol type: %s", protocol)
}
The NodePort path also calls syncService and syncEndpoint to sync the virtual server and real servers; the difference is that the addresses do not need to be bound to kube-ipvs0.
2.4 Sync the ipset entries
List all existing ipset entries, delete legacy ones, and add the active ones.
func (set *IPSet) syncIPSetEntries() {
appliedEntries, err := set.handle.ListEntries(set.Name)
if err != nil {
klog.Errorf("Failed to list ip set entries, error: %v", err)
return
}
// currentIPSetEntries represents Endpoints watched from API Server.
currentIPSetEntries := sets.NewString()
for _, appliedEntry := range appliedEntries {
currentIPSetEntries.Insert(appliedEntry)
}
if !set.activeEntries.Equal(currentIPSetEntries) {
// Clean legacy entries
for _, entry := range currentIPSetEntries.Difference(set.activeEntries).List() {
if err := set.handle.DelEntry(entry, set.Name); err != nil {
if !utilipset.IsNotFoundError(err) {
klog.Errorf("Failed to delete ip set entry: %s from ip set: %s, error: %v", entry, set.Name, err)
}
} else {
klog.V(3).Infof("Successfully delete legacy ip set entry: %s from ip set: %s", entry, set.Name)
}
}
// Create active entries
for _, entry := range set.activeEntries.Difference(currentIPSetEntries).List() {
if err := set.handle.AddEntry(entry, &set.IPSet, true); err != nil {
klog.Errorf("Failed to add entry: %v to ip set: %s, error: %v", entry, set.Name, err)
} else {
klog.V(3).Infof("Successfully add entry: %v to ip set: %s", entry, set.Name)
}
}
}
}
The resulting sets look like this:
Name | Type | Revision | Header | Size in memory | References | Members |
KUBE-LOOP-BACK | hash:ip,port,ip | 2 | family inet hashsize 1024 maxelem 65536 | 16824 | 1 | 172.30.46.39,tcp:6379,172.30.46.39 172.30.3.15,udp:53,172.30.3.15 |
KUBE-NODE-PORT-TCP | bitmap:port | 1 | range 0-65535 | 524432 | 1 | 31011 32371 |
KUBE-CLUSTER-IP | hash:ip,port | 2 | family inet hashsize 1024 maxelem 65536 | 16688 | 2 | 10.254.0.2,tcp:53 10.254.0.2,udp:53 10.254.86.39,tcp:6379 10.254.0.1,tcp:443 10.254.69.27,tcp:443 |
3. ipsetWithIptablesChain
When KUBE-POSTROUTING matches the KUBE-LOOP-BACK ipset, the traffic is masqueraded: -A KUBE-POSTROUTING -m comment --comment "Kubernetes endpoints dst ip:port, source ip for solving hairpin purpose" -m set --match-set KUBE-LOOP-BACK dst,dst,src -j MASQUERADE
-A KUBE-SERVICES -m addrtype --dst-type LOCAL -j KUBE-NODE-PORT
-A KUBE-SERVICES -m set --match-set KUBE-CLUSTER-IP dst,dst -j ACCEPT
// ipsetWithIptablesChain is the ipsets list with iptables source chain and the chain jump to
// `iptables -t nat -A <from> -m set --match-set <name> <matchType> -j <to>`
// example: iptables -t nat -A KUBE-SERVICES -m set --match-set KUBE-NODE-PORT-TCP dst -j KUBE-NODE-PORT
// ipsets with other match rules will be created Individually.
// Note: kubeNodePortLocalSetTCP must be prior to kubeNodePortSetTCP, the same for UDP.
var ipsetWithIptablesChain = []struct {
name string
from string
to string
matchType string
protocolMatch string
}{
{kubeLoopBackIPSet, string(kubePostroutingChain), "MASQUERADE", "dst,dst,src", ""},
{kubeLoadBalancerSet, string(kubeServicesChain), string(KubeLoadBalancerChain), "dst,dst", ""},
{kubeLoadbalancerFWSet, string(KubeLoadBalancerChain), string(KubeFireWallChain), "dst,dst", ""},
{kubeLoadBalancerSourceCIDRSet, string(KubeFireWallChain), "RETURN", "dst,dst,src", ""},
{kubeLoadBalancerSourceIPSet, string(KubeFireWallChain), "RETURN", "dst,dst,src", ""},
{kubeLoadBalancerLocalSet, string(KubeLoadBalancerChain), "RETURN", "dst,dst", ""},
{kubeNodePortLocalSetTCP, string(KubeNodePortChain), "RETURN", "dst", "tcp"},
{kubeNodePortSetTCP, string(KubeNodePortChain), string(KubeMarkMasqChain), "dst", "tcp"},
{kubeNodePortLocalSetUDP, string(KubeNodePortChain), "RETURN", "dst", "udp"},
{kubeNodePortSetUDP, string(KubeNodePortChain), string(KubeMarkMasqChain), "dst", "udp"},
{kubeNodePortSetSCTP, string(kubeServicesChain), string(KubeNodePortChain), "dst", "sctp"},
{kubeNodePortLocalSetSCTP, string(KubeNodePortChain), "RETURN", "dst", "sctp"},
}
4. writeIptablesRules
Write the rules into the nat rules buffer and the filter rules buffer; the long stretch of code below is all variations of this operation.
for _, set := range ipsetWithIptablesChain {
if _, find := proxier.ipsetList[set.name]; find && !proxier.ipsetList[set.name].isEmpty() {
args = append(args[:0], "-A", set.from)
if set.protocolMatch != "" {
args = append(args, "-p", set.protocolMatch)
}
args = append(args,
"-m", "comment", "--comment", proxier.ipsetList[set.name].getComment(),
"-m", "set", "--match-set", set.name,
set.matchType,
)
writeLine(proxier.natRules, append(args, "-j", set.to)...)
}
}
-A KUBE-SERVICES ! -s 10.254.0.0/16 -m comment --comment "Kubernetes service cluster ip + port for masquerade purpose" -m set --match-set KUBE-CLUSTER-IP dst,dst -j KUBE-MARK-MASQ
if !proxier.ipsetList[kubeClusterIPSet].isEmpty() {
args = append(args[:0],
"-A", string(kubeServicesChain),
"-m", "comment", "--comment", proxier.ipsetList[kubeClusterIPSet].getComment(),
"-m", "set", "--match-set", kubeClusterIPSet,
)
if proxier.masqueradeAll {
writeLine(proxier.natRules, append(args, "dst,dst", "-j", string(KubeMarkMasqChain))...)
} else if len(proxier.clusterCIDR) > 0 {
// This masquerades off-cluster traffic to a service VIP. The idea
// is that you can establish a static route for your Service range,
// routing to any node, and that node will bridge into the Service
// for you. Since that might bounce off-node, we masquerade here.
// If/when we support "Local" policy for VIPs, we should update this.
writeLine(proxier.natRules, append(args, "dst,dst", "! -s", proxier.clusterCIDR, "-j", string(KubeMarkMasqChain))...)
} else {
// Masquerade all OUTPUT traffic coming from a service ip.
// The kube dummy interface has all service VIPs assigned which
// results in the service VIP being picked as the source IP to reach
// a VIP. This leads to a connection from VIP:<random port> to
// VIP:<service port>.
// Always masquerading OUTPUT (node-originating) traffic with a VIP
// source ip and service port destination fixes the outgoing connections.
writeLine(proxier.natRules, append(args, "src,dst", "-j", string(KubeMarkMasqChain))...)
}
}
-A KUBE-LOAD-BALANCER -j KUBE-MARK-MASQ
// mark masq for KUBE-LOAD-BALANCER
writeLine(proxier.natRules, []string{
"-A", string(KubeLoadBalancerChain),
"-j", string(KubeMarkMasqChain),
}...)
// mark drop for KUBE-FIRE-WALL
writeLine(proxier.natRules, []string{
"-A", string(KubeFireWallChain),
"-j", string(KubeMarkDropChain),
}...)
-A KUBE-FORWARD -m comment --comment "kubernetes forwarding rules" -m mark --mark 0x4000/0x4000 -j ACCEPT
// If the masqueradeMark has been added then we want to forward that same
// traffic, this allows NodePort traffic to be forwarded even if the default
// FORWARD policy is not accept.
writeLine(proxier.filterRules,
"-A", string(KubeForwardChain),
"-m", "comment", "--comment", `"kubernetes forwarding rules"`,
"-m", "mark", "--mark", proxier.masqueradeMark,
"-j", "ACCEPT",
)
This mainly creates the following rules:
-A KUBE-FORWARD -s 10.254.0.0/16 -m comment --comment "kubernetes forwarding conntrack pod source rule" -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT
-A KUBE-FORWARD -d 10.254.0.0/16 -m comment --comment "kubernetes forwarding conntrack pod destination rule" -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT
// The following rules can only be set if clusterCIDR has been defined.
if len(proxier.clusterCIDR) != 0 {
// The following two rules ensure the traffic after the initial packet
// accepted by the "kubernetes forwarding rules" rule above will be
// accepted, to be as specific as possible the traffic must be sourced
// or destined to the clusterCIDR (to/from a pod).
writeLine(proxier.filterRules,
"-A", string(KubeForwardChain),
"-s", proxier.clusterCIDR,
"-m", "comment", "--comment", `"kubernetes forwarding conntrack pod source rule"`,
"-m", "conntrack",
"--ctstate", "RELATED,ESTABLISHED",
"-j", "ACCEPT",
)
writeLine(proxier.filterRules,
"-A", string(KubeForwardChain),
"-m", "comment", "--comment", `"kubernetes forwarding conntrack pod destination rule"`,
"-d", proxier.clusterCIDR,
"-m", "conntrack",
"--ctstate", "RELATED,ESTABLISHED",
"-j", "ACCEPT",
)
}
5. Use iptables-restore to load the rules in bulk
// Sync iptables rules.
// NOTE: NoFlushTables is used so we don't flush non-kubernetes chains in the table.
proxier.iptablesData.Reset()
proxier.iptablesData.Write(proxier.natChains.Bytes())
proxier.iptablesData.Write(proxier.natRules.Bytes())
proxier.iptablesData.Write(proxier.filterChains.Bytes())
proxier.iptablesData.Write(proxier.filterRules.Bytes())
glog.V(5).Infof("Restoring iptables rules: %s", proxier.iptablesData.Bytes())
err = proxier.iptables.RestoreAll(proxier.iptablesData.Bytes(), utiliptables.NoFlushTables, utiliptables.RestoreCounters)
if err != nil {
glog.Errorf("Failed to execute iptables-restore: %v\nRules:\n%s", err, proxier.iptablesData.Bytes())
// Revert new local ports.
utilproxy.RevertPorts(replacementPortsMap, proxier.portsMap)
return
}
*nat
:KUBE-SERVICES - [0:0]
:KUBE-POSTROUTING - [0:0]
:KUBE-FIREWALL - [0:0]
:KUBE-NODE-PORT - [0:0]
:KUBE-LOAD-BALANCER - [0:0]
:KUBE-MARK-MASQ - [0:0]
-A KUBE-POSTROUTING -m comment --comment "kubernetes service traffic requiring SNAT" -m mark --mark 0x00004000/0x00004000 -j MASQUERADE
-A KUBE-MARK-MASQ -j MARK --set-xmark 0x00004000/0x00004000
-A KUBE-POSTROUTING -m comment --comment "Kubernetes endpoints dst ip:port, source ip for solving hairpin purpose" -m set --match-set KUBE-LOOP-BACK dst,dst,src -j MASQUERADE
-A KUBE-SERVICES -m comment --comment "Kubernetes service cluster ip + port for masquerade purpose" -m set --match-set KUBE-CLUSTER-IP dst,dst ! -s 172.170.0.0/16 -j KUBE-MARK-MASQ
-A KUBE-SERVICES -m addrtype --dst-type LOCAL -j KUBE-NODE-PORT
-A KUBE-LOAD-BALANCER -j KUBE-MARK-MASQ
-A KUBE-FIREWALL -j KUBE-MARK-DROP
-A KUBE-SERVICES -m set --match-set KUBE-CLUSTER-IP dst,dst -j ACCEPT
COMMIT
*filter
:KUBE-FORWARD - [0:0]
-A KUBE-FORWARD -m comment --comment "kubernetes forwarding rules" -m mark --mark 0x00004000/0x00004000 -j ACCEPT
-A KUBE-FORWARD -s 172.170.0.0/16 -m comment --comment "kubernetes forwarding conntrack pod source rule" -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT
-A KUBE-FORWARD -m comment --comment "kubernetes forwarding conntrack pod destination rule" -d 172.170.0.0/16 -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT
COMMIT
6. Get the currently bound addresses
// Clean up legacy bind address
// currentBindAddrs represents ip addresses bind to DefaultDummyDevice from the system
currentBindAddrs, err := proxier.netlinkHandle.ListBindAddress(DefaultDummyDevice)
if err != nil {
glog.Errorf("Failed to get bind address, err: %v", err)
}
In IPVS mode, cluster IPs created through Services are all bound to the kube-ipvs0 virtual interface. Creating a ClusterIP Service does three things:
- ensures the virtual interface kube-ipvs0 exists on the node
- binds the Service IP address to the virtual interface
- creates an IPVS virtual server for each Service IP address
TCP 10.254.23.85:5566 wrr
-> 172.30.3.29:5566 Masq 1 0 0
-> 172.30.46.41:5566 Masq 1 0 0
Summary:
Apply Service and Endpoint changes: serviceChanges and endpointsChanges track updates to Services and Endpoints.
Create the top-level kube chains and the jump rules that link them, such as KUBE-POSTROUTING and KUBE-MARK-MASQ:
- {utiliptables.TableNAT, kubeServicesChain}
- {utiliptables.TableNAT, kubePostroutingChain}
- {utiliptables.TableNAT, KubeFireWallChain}
- {utiliptables.TableNAT, KubeNodePortChain}
- {utiliptables.TableNAT, KubeLoadBalancerChain}
- {utiliptables.TableNAT, KubeMarkMasqChain}
- {utiliptables.TableFilter, KubeForwardChain}
Ensure the kube-ipvs0 interface exists; Service addresses are bound to it afterwards.
Add the KUBE-LOOP-BACK and KUBE-CLUSTER-IP ipset entries.
Call syncService and syncEndpoint to bind the cluster IP to the dummy interface and add the virtual server and real servers (a Service may have no endpoints).
Iterate over all Services:
a. For Services with local endpoints, add KUBE-LOOP-BACK ipset entries.
b. Add KUBE-CLUSTER-IP ipset entries.
c. Create the IPVS virtual server if it does not exist, update it if it changed, and call EnsureAddressBind to add the address to kube-ipvs0, equivalent to ip addr add $addr dev $link.
d. Endpoints become real servers: new ones are added, stale ones are gracefully deleted.
e. For NodePort, make sure the port is listened on and add entries to the KUBE-NODE-PORT-TCP (and related) ipsets; no binding to the dummy interface is needed.
Issue a: increase the size of the IPVS hash table
The IPVS connection hash table defaults to 2^12 = 4096 entries; it can be raised to 2^20 = 1048576. ipvsadm -l shows the current hash table size.
How to change it: add a file ip_vs.conf under /etc/modprobe.d/ with the content: options ip_vs conn_tab_bits=20
Then reload the ip_vs module.
Bug 1: high kube-proxy CPU usage with a large number of Services
https://github.com/kubernetes/kubernetes/issues/73382
Fixed in Kubernetes v1.16. The root cause was netlink system calls: instead of querying the node IPs once per NodePort in every sync loop, each sync now queries the node IPs only once.
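A simplified before/after sketch of that kind of fix; getNodeIPs and the loop below are illustrative only, not the actual kube-proxy patch:
package main

import "fmt"

// getNodeIPs stands in for the expensive netlink query that lists the node's addresses.
func getNodeIPs() []string {
    fmt.Println("netlink: listing node addresses") // imagine a syscall here
    return []string{"10.12.51.172"}
}

func main() {
    nodePorts := []int{30080, 30443, 31000}

    // Before: the node IPs were fetched once per NodePort service in every sync,
    // so a cluster with thousands of NodePorts triggered thousands of netlink calls.
    for _, port := range nodePorts {
        for _, ip := range getNodeIPs() {
            _ = fmt.Sprintf("%s:%d", ip, port)
        }
    }

    // After: fetch the node IPs once per sync and reuse them in the loop.
    nodeIPs := getNodeIPs()
    for _, port := range nodePorts {
        for _, ip := range nodeIPs {
            _ = fmt.Sprintf("%s:%d", ip, port)
        }
    }
}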