【kubernetes/k8s source code analysis】kube-proxy IPVS source code analysis

Kubernetes version: 1.12.1   Source path: pkg/proxy/ipvs/proxier.go

This article only covers the IPVS-related parts; the startup flow is covered in an earlier post:

https://blog.csdn.net/zhonglinzhang/article/details/80185053

WHY IPVS

    Although Kubernetes has supported 5000 nodes since v1.6, kube-proxy in iptables mode is in practice the bottleneck for scaling a cluster to 5000 nodes. With NodePort services in a 5000-node cluster, 2000 services with 10 pods each produce at least 20000 iptables records on every worker node, which can keep the kernel very busy.

WHAT?

   kube-proxy added an IPVS mode. Both IPVS and iptables are built on Netfilter, but IPVS stores its rules in a hash table, so when the number of services grows very large the advantage of hash-table lookups becomes obvious and service lookup performance improves.

HOW IPVS?

  kube-proxy startup parameters

  /usr/bin/kube-proxy --bind-address=10.12.51.172 --hostname-override=10.12.51.172 --cluster-cidr=10.254.0.0/16 --kubeconfig=/etc/kubernetes/kube-proxy.kubeconfig --logtostderr=true --v=2

--ipvs-scheduler=wrr --ipvs-min-sync-period=5s --ipvs-sync-period=5s --proxy-mode=ipvs

Parameter --masquerade-all=true

    With this flag, ipvs masquerades all traffic destined for a Service's Cluster IP, which matches the behavior of iptables mode.

Parameter --cluster-cidr=<cidr>

Parameter --cleanup-ipvs: when true, clean up the IPVS configuration and iptables rules that were created in IPVS mode.

Parameter --ipvs-sync-period: the maximum interval for syncing IPVS rules (e.g. '5s', '1m').

Parameter --ipvs-min-sync-period: the minimum interval for syncing IPVS rules (e.g. '5s', '1m').

Parameter --ipvs-scheduler: the scheduling algorithm, defaults to rr

  • rr: round-robin
  • lc: least connection
  • dh: destination hashing
  • sh: source hashing
  • sed: shortest expected delay
  • nq: never queue

 

Prerequisites for enabling ipvs mode in kube-proxy

modprobe br_netfilter

cat > /etc/sysconfig/modules/ipvs.modules << EOF

#! /bin/bash

modprobe -- ip_vs

modprobe -- ip_vs_rr

modprobe -- ip_vs_wrr

modprobe -- ip_vs_sh

modprobe -- nf_conntrack_ipv4

EOF

chmod 755 /etc/sysconfig/modules/ipvs.modules 

/etc/sysconfig/modules/ipvs.modules && lsmod | grep -e ip_vs -e nf_conntrack_ipv4

 

  Cluster IPs created through a svc are all bound to the kube-ipvs0 virtual interface.

  

How IPVS works

        Excerpted from an article found online; it makes things clear at a glance.

        ipvs: runs in kernel space and is responsible for making the user-defined policies take effect;

        ipvsadm: runs in user space and is the tool for defining and managing cluster services.

  • When a user request reaches the Director Server, the packet first hits the PREROUTING chain in kernel space. At this point the source IP is CIP and the destination IP is VIP.
  • PREROUTING sees that the packet's destination IP is the local machine and passes it to the INPUT chain.
  • ipvs watches packets arriving at the INPUT chain; if the packet is for a cluster service, it rewrites the destination IP to a backend server IP and hands the packet to the POSTROUTING chain. At this point the source IP is CIP and the destination IP is RIP.
  • The POSTROUTING chain routes the packet to the Real Server.
  • The Real Server sees that the destination IP is its own, processes the request, and sends the response back to the Director Server. At this point the source IP is RIP and the destination IP is CIP.
  • Before replying to the client, the Director Server rewrites the source IP to its own VIP and then responds. At this point the source IP is VIP and the destination IP is CIP. (A small sketch of this flow follows the list.)
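The address rewriting in NAT mode can be made concrete with a tiny, self-contained Go sketch that just models the (source, destination) tuple at each step described above; the CIP/VIP/RIP values are made up, and the real rewriting of course happens inside the kernel's IPVS code:

package main

import "fmt"

// packet is a toy model of the (source, destination) tuple that IPVS NAT mode rewrites.
type packet struct{ src, dst string }

func main() {
	const (
		cip = "CIP 192.168.1.10" // client IP
		vip = "VIP 10.254.0.100" // virtual service IP on the Director Server
		rip = "RIP 172.30.46.41" // Real Server (backend) IP
	)

	// 1. Request reaches the director: src=CIP, dst=VIP (PREROUTING -> INPUT).
	req := packet{src: cip, dst: vip}
	fmt.Println("request in :", req)

	// 2. IPVS on the INPUT hook picks a real server and rewrites the destination:
	//    src=CIP, dst=RIP, then POSTROUTING routes it to the real server.
	req.dst = rip
	fmt.Println("request out:", req)

	// 3. The real server replies to the director: src=RIP, dst=CIP.
	rsp := packet{src: rip, dst: cip}
	fmt.Println("reply in   :", rsp)

	// 4. Before answering the client the director rewrites the source back to VIP:
	//    src=VIP, dst=CIP.
	rsp.src = vip
	fmt.Println("reply out  :", rsp)
}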

 

 

IPVS has three proxy modes:

       NAT (masq), IPIP, and DR.

       Only NAT mode supports port mapping, and kube-proxy relies on NAT mode for port mapping.

How IPVS DR mode works

 

 

How ipset works

   ipset is an extension of iptables that creates rules matching entire sets of addresses, whereas an ordinary iptables rule can only match a single IP. The IP sets are stored in indexed data structures, so lookups remain efficient even for very large sets. Official site: http://ipset.netfilter.org/

   ipvs mode still uses iptables for packet filtering, SNAT and masquerading. Specifically, it uses ipset to store the source or destination addresses of traffic that needs to be DROPped or masqueraded, which keeps the number of iptables rules constant.
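To see why this keeps the rule count constant, compare the naive one-rule-per-IP approach with a single set-match rule. The short Go sketch below only prints illustrative iptables rule strings in the same shape as the real ones shown later in this article; it does not touch the system:

package main

import "fmt"

func main() {
	clusterIPs := []string{"10.254.0.1", "10.254.0.2", "10.254.69.27", "10.254.86.39"}

	// Without ipset: one iptables rule per Cluster IP, so the rule count grows with the services.
	for _, ip := range clusterIPs {
		fmt.Printf("-A KUBE-SERVICES -d %s/32 -j KUBE-MARK-MASQ\n", ip)
	}

	// With ipset: the addresses live in a hash set and a single rule matches the whole set,
	// so adding services only adds set members, never iptables rules.
	fmt.Println("-A KUBE-SERVICES -m set --match-set KUBE-CLUSTER-IP dst,dst -j KUBE-MARK-MASQ")
}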

 

Kernel modules

    Make sure the kernel modules ipvs needs are available; the following are required: ip_vs, ip_vs_rr, ip_vs_wrr, ip_vs_sh, nf_conntrack_ipv4

var ipvsModules = []string{
	"ip_vs",
	"ip_vs_rr",
	"ip_vs_wrr",
	"ip_vs_sh",
	"nf_conntrack_ipv4",
}
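As a hedged illustration (not the actual check kube-proxy performs), whether these modules are loaded can be verified by scanning /proc/modules, roughly like this; modules built directly into the kernel will not show up there:

package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

var requiredModules = []string{"ip_vs", "ip_vs_rr", "ip_vs_wrr", "ip_vs_sh", "nf_conntrack_ipv4"}

// loadedModules returns the names of all modules listed in /proc/modules.
func loadedModules() (map[string]bool, error) {
	f, err := os.Open("/proc/modules")
	if err != nil {
		return nil, err
	}
	defer f.Close()

	loaded := map[string]bool{}
	s := bufio.NewScanner(f)
	for s.Scan() {
		if fields := strings.Fields(s.Text()); len(fields) > 0 {
			loaded[fields[0]] = true
		}
	}
	return loaded, s.Err()
}

func main() {
	loaded, err := loadedModules()
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	for _, m := range requiredModules {
		if !loaded[m] {
			fmt.Printf("missing kernel module: %s\n", m)
		}
	}
}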

    

1. The NewProxier function

  1.1 Set kernel parameters

  • net/ipv4/conf/all/route_localnet: whether to allow external access to localhost
  • net/bridge/bridge-nf-call-iptables: 1 means packets forwarded by a layer-2 bridge are also filtered by the iptables FORWARD rules, i.e. L3 iptables rules end up filtering L2 frames
  • net/ipv4/vs/conntrack
  • net/ipv4/ip_forward: whether IPv4 forwarding is enabled (0: disabled, 1: enabled)
	// Set the route_localnet sysctl we need for
	if err := sysctl.SetSysctl(sysctlRouteLocalnet, 1); err != nil {
		return nil, fmt.Errorf("can't set sysctl %s: %v", sysctlRouteLocalnet, err)
	}

	// Proxy needs br_netfilter and bridge-nf-call-iptables=1 when containers
	// are connected to a Linux bridge (but not SDN bridges).  Until most
	// plugins handle this, log when config is missing
	if val, err := sysctl.GetSysctl(sysctlBridgeCallIPTables); err == nil && val != 1 {
		glog.Infof("missing br-netfilter module or unset sysctl br-nf-call-iptables; proxy may not work as intended")
	}

	// Set the conntrack sysctl we need for
	if err := sysctl.SetSysctl(sysctlVSConnTrack, 1); err != nil {
		return nil, fmt.Errorf("can't set sysctl %s: %v", sysctlVSConnTrack, err)
	}

	// Set the ip_forward sysctl we need for
	if err := sysctl.SetSysctl(sysctlForward, 1); err != nil {
		return nil, fmt.Errorf("can't set sysctl %s: %v", sysctlForward, err)
	}
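The sysctl helper used above ultimately just reads and writes files under /proc/sys. A minimal standalone sketch of the same idea (not the k8s.io sysctl package itself; writing requires root):

package main

import (
	"fmt"
	"os"
	"path"
	"strconv"
	"strings"
)

// getSysctl reads an integer sysctl such as "net/ipv4/ip_forward" from /proc/sys.
func getSysctl(name string) (int, error) {
	data, err := os.ReadFile(path.Join("/proc/sys", name))
	if err != nil {
		return -1, err
	}
	return strconv.Atoi(strings.TrimSpace(string(data)))
}

// setSysctl writes an integer sysctl value back to /proc/sys.
func setSysctl(name string, value int) error {
	return os.WriteFile(path.Join("/proc/sys", name), []byte(strconv.Itoa(value)), 0640)
}

func main() {
	for _, s := range []string{
		"net/ipv4/conf/all/route_localnet",
		"net/bridge/bridge-nf-call-iptables",
		"net/ipv4/vs/conntrack",
		"net/ipv4/ip_forward",
	} {
		if v, err := getSysctl(s); err == nil {
			fmt.Printf("%s = %d\n", s, v)
		} else {
			fmt.Printf("%s: %v\n", s, err)
		}
	}
	_ = setSysctl // shown for completeness; call it as root to change a value
}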

  1.2 Initialize the IPSet list

    Load the defined IPSets into the ipsetList map

	// initialize ipsetList with all sets we needed
	proxier.ipsetList = make(map[string]*IPSet)
	for _, is := range ipsetInfo {
		if is.isIPv6 {
			proxier.ipsetList[is.name] = NewIPSet(ipset, is.name, is.setType, isIPv6, is.comment)
		} else {
			proxier.ipsetList[is.name] = NewIPSet(ipset, is.name, is.setType, false, is.comment)
		}
	}

  1.3 Initialize syncRunner; its main function is syncProxyRules

	proxier.syncRunner = async.NewBoundedFrequencyRunner("sync-runner", proxier.syncProxyRules, minSyncPeriod, syncPeriod, burstSyncs)
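Conceptually, the BoundedFrequencyRunner makes sure syncProxyRules runs at least once every syncPeriod and, when changes trigger a Run(), no more often than once every minSyncPeriod. The simplified sketch below captures that idea only; it is not the real async.BoundedFrequencyRunner (which also supports token-bucket bursts via burstSyncs):

package main

import (
	"fmt"
	"time"
)

// boundedRunner calls fn at least every maxInterval and, when Run() is signalled,
// no sooner than minInterval after the previous call. Signals between runs coalesce.
type boundedRunner struct {
	fn          func()
	minInterval time.Duration
	maxInterval time.Duration
	kick        chan struct{}
}

func newBoundedRunner(fn func(), min, max time.Duration) *boundedRunner {
	return &boundedRunner{fn: fn, minInterval: min, maxInterval: max, kick: make(chan struct{}, 1)}
}

// Run requests a sync; multiple requests collapse into a single pending one.
func (r *boundedRunner) Run() {
	select {
	case r.kick <- struct{}{}:
	default:
	}
}

// Loop drives the runner until the process exits.
func (r *boundedRunner) Loop() {
	var last time.Time
	timer := time.NewTimer(r.maxInterval)
	for {
		select {
		case <-r.kick:
			if wait := r.minInterval - time.Since(last); wait > 0 {
				time.Sleep(wait) // enforce the minimum spacing between two syncs
			}
		case <-timer.C:
		}
		r.fn()
		last = time.Now()
		if !timer.Stop() {
			select { // drain a timer that fired while we were running
			case <-timer.C:
			default:
			}
		}
		timer.Reset(r.maxInterval)
	}
}

func main() {
	r := newBoundedRunner(func() { fmt.Println("syncProxyRules", time.Now().Format("15:04:05")) },
		2*time.Second, 5*time.Second)
	go r.Loop()
	r.Run() // e.g. a Service/Endpoints change
	time.Sleep(12 * time.Second)
}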

 

2. The syncProxyRules function

  2.1 Reset the four buffers

    Write *filter and *nat at the head of each buffer to mark the start of the tables

// Reset all buffers used later.
// This is to avoid memory reallocations and thus improve performance.
proxier.natChains.Reset()
proxier.natRules.Reset()
proxier.filterChains.Reset()
proxier.filterRules.Reset()

// Write table headers.
writeLine(proxier.filterChains, "*filter")
writeLine(proxier.natChains, "*nat")
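writeLine, used throughout syncProxyRules, just joins its arguments with spaces and appends a newline to the target buffer, so the buffers end up holding an iptables-restore payload. A minimal equivalent:

package main

import (
	"bytes"
	"fmt"
	"strings"
)

// writeLine joins words with spaces and terminates the line with '\n',
// mirroring how the proxier builds its iptables-restore input.
func writeLine(buf *bytes.Buffer, words ...string) {
	buf.WriteString(strings.Join(words, " ") + "\n")
}

func main() {
	var filterChains, natChains bytes.Buffer
	writeLine(&filterChains, "*filter")
	writeLine(&natChains, "*nat")
	writeLine(&natChains, ":KUBE-SERVICES", "-", "[0:0]")
	fmt.Print(natChains.String(), filterChains.String())
}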

  2.2 Create the dummy device

    # ip route show table local type local proto kernel

  • 10.12.51.172 dev eth0  scope host  src 10.12.51.172 
  • 10.254.0.1 dev kube-ipvs0  scope host  src 10.254.0.1 
  • 10.254.0.2 dev kube-ipvs0  scope host  src 10.254.0.2 
  • 10.254.69.27 dev kube-ipvs0  scope host  src 10.254.69.27 
  • 10.254.86.39 dev kube-ipvs0  scope host  src 10.254.86.39 
  • 127.0.0.0/8 dev lo  scope host  src 127.0.0.1 
  • 127.0.0.1 dev lo  scope host  src 127.0.0.1 
  • 172.30.46.1 dev docker0  scope host  src 172.30.46.1
	// make sure dummy interface exists in the system where ipvs Proxier will bind service address on it
	_, err := proxier.netlinkHandle.EnsureDummyDevice(DefaultDummyDevice)
	if err != nil {
		glog.Errorf("Failed to create dummy interface: %s, error: %v", DefaultDummyDevice, err)
		return
	}
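EnsureDummyDevice and the later EnsureAddressBind boil down to netlink operations. With github.com/vishvananda/netlink (the library the proxier's netlinkHandle is built on) the equivalent of `ip link add kube-ipvs0 type dummy` plus `ip addr add $vip/32 dev kube-ipvs0` looks roughly like the sketch below; the VIP used here is just an example and error handling is simplified:

package main

import (
	"fmt"
	"log"

	"github.com/vishvananda/netlink"
)

const dummyDev = "kube-ipvs0"

func main() {
	// Ensure the dummy interface exists: ip link add kube-ipvs0 type dummy
	link, err := netlink.LinkByName(dummyDev)
	if err != nil {
		dummy := &netlink.Dummy{LinkAttrs: netlink.LinkAttrs{Name: dummyDev}}
		if err := netlink.LinkAdd(dummy); err != nil {
			log.Fatalf("create %s: %v", dummyDev, err)
		}
		if link, err = netlink.LinkByName(dummyDev); err != nil {
			log.Fatalf("lookup %s: %v", dummyDev, err)
		}
	}

	// Bind a service VIP to it: ip addr add 10.254.0.1/32 dev kube-ipvs0
	addr, err := netlink.ParseAddr("10.254.0.1/32")
	if err != nil {
		log.Fatal(err)
	}
	if err := netlink.AddrAdd(link, addr); err != nil {
		// An "exists" error here simply means the VIP was already bound.
		log.Printf("addr add: %v", err)
	}
	fmt.Println("kube-ipvs0 ensured and VIP bound")
}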

    2.3 Each sync loops over every service

// Build IPVS rules for each service.
for svcName, svc := range proxier.serviceMap {
	svcInfo, ok := svc.(*serviceInfo)
	if !ok {
		klog.Errorf("Failed to cast serviceInfo %q", svcName.String())
		continue
	}
	protocol := strings.ToLower(string(svcInfo.Protocol()))
	// Precompute svcNameString; with many services the many calls
	// to ServicePortName.String() show up in CPU profiles.
	svcNameString := svcName.String()

     2.3.1 If the service has local endpoints, add KUBE-LOOP-BACK ipset entries

# ipset -L KUBE-LOOP-BACK
Name: KUBE-LOOP-BACK
Type: hash:ip,port,ip
Revision: 5
Header: family inet hashsize 1024 maxelem 65536
Size in memory: 448
References: 1
Number of entries: 4
Members:
192.170.77.151,udp:53,192.170.77.151
192.170.77.151,tcp:53,192.170.77.151
192.170.77.151,tcp:9153,192.170.77.151
192.170.77.137,tcp:3306,192.170.77.137

// Handle traffic that loops back to the originator with SNAT.
for _, e := range proxier.endpointsMap[svcName] {
	ep, ok := e.(*proxy.BaseEndpointInfo)
	if !ok {
		klog.Errorf("Failed to cast BaseEndpointInfo %q", e.String())
		continue
	}
	if !ep.IsLocal {
		continue
	}
	epIP := ep.IP()
	epPort, err := ep.Port()
	// Error parsing this endpoint has been logged. Skip to next endpoint.
	if epIP == "" || err != nil {
		continue
	}
	entry := &utilipset.Entry{
		IP:       epIP,
		Port:     epPort,
		Protocol: protocol,
		IP2:      epIP,
		SetType:  utilipset.HashIPPortIP,
	}
	if valid := proxier.ipsetList[kubeLoopBackIPSet].validateEntry(entry); !valid {
		klog.Errorf("%s", fmt.Sprintf(EntryInvalidErr, entry, proxier.ipsetList[kubeLoopBackIPSet].Name))
		continue
	}
	proxier.ipsetList[kubeLoopBackIPSet].activeEntries.Insert(entry.String())
}

      2.3.2 Add KUBE-CLUSTER-IP ipset entries

# ipset -L KUBE-CLUSTER-IP
Name: KUBE-CLUSTER-IP
Type: hash:ip,port
Revision: 5
Header: family inet hashsize 1024 maxelem 65536
Size in memory: 568
References: 2
Number of entries: 7
Members:
10.200.233.191,tcp:44134
10.200.194.33,tcp:3306
10.200.254.254,udp:53
10.200.0.1,tcp:443
10.200.254.254,tcp:9153
10.200.254.254,tcp:53
10.200.254.116,tcp:443

// Capture the clusterIP.
// ipset call
entry := &utilipset.Entry{
	IP:       svcInfo.ClusterIP().String(),
	Port:     svcInfo.Port(),
	Protocol: protocol,
	SetType:  utilipset.HashIPPort,
}
// add service Cluster IP:Port to kubeServiceAccess ip set for the purpose of solving hairpin.
// proxier.kubeServiceAccessSet.activeEntries.Insert(entry.String())
if valid := proxier.ipsetList[kubeClusterIPSet].validateEntry(entry); !valid {
	klog.Errorf("%s", fmt.Sprintf(EntryInvalidErr, entry, proxier.ipsetList[kubeClusterIPSet].Name))
	continue
}
proxier.ipsetList[kubeClusterIPSet].activeEntries.Insert(entry.String())

     2.3.3 Use the cluster IP as the ipvs virtual server and bind the cluster IP to the dummy interface

// ipvs call
serv := &utilipvs.VirtualServer{
	Address:   svcInfo.ClusterIP(),
	Port:      uint16(svcInfo.Port()),
	Protocol:  string(svcInfo.Protocol()),
	Scheduler: proxier.ipvsScheduler,
}
// Set session affinity flag and timeout for IPVS service
if svcInfo.SessionAffinityType() == v1.ServiceAffinityClientIP {
	serv.Flags |= utilipvs.FlagPersistent
	serv.Timeout = uint32(svcInfo.StickyMaxAgeSeconds())
}
// We need to bind ClusterIP to dummy interface, so set `bindAddr` parameter to `true` in syncService()
if err := proxier.syncService(svcNameString, serv, true); err == nil {
	activeIPVSServices[serv.String()] = true
	activeBindAddrs[serv.Address.String()] = true
	// ExternalTrafficPolicy only works for NodePort and external LB traffic, does not affect ClusterIP
	// So we still need clusterIP rules in onlyNodeLocalEndpoints mode.
	if err := proxier.syncEndpoint(svcName, false, serv); err != nil {
		klog.Errorf("Failed to sync endpoint for service: %v, err: %v", serv, err)
	}
} else {
	klog.Errorf("Failed to sync service: %v, err: %v", serv, err)
}

     2.3.3.1 The syncService function

# ipvsadm -l
IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
TCP  master.node.local:35351 rr
  -> 192.170.77.137:mysql         Masq    1      0          0         
TCP  lin:https rr

      Virtual servers that do not exist yet are added and changed ones are updated; EnsureAddressBind adds the address directly to kube-ipvs0, the equivalent of `ip addr add $addr dev $link`.

func (proxier *Proxier) syncService(svcName string, vs *utilipvs.VirtualServer, bindAddr bool) error {
	appliedVirtualServer, _ := proxier.ipvs.GetVirtualServer(vs)

	klog.Infof("zzlin syncService bindAddr: %v svcName: %v   vs: %#v  appliedVirtualServer: %#v", bindAddr, svcName, vs, appliedVirtualServer)
	if appliedVirtualServer == nil || !appliedVirtualServer.Equal(vs) {
		if appliedVirtualServer == nil {
			// IPVS service is not found, create a new service
			klog.V(3).Infof("Adding new service %q %s:%d/%s", svcName, vs.Address, vs.Port, vs.Protocol)
			if err := proxier.ipvs.AddVirtualServer(vs); err != nil {
				klog.Errorf("Failed to add IPVS service %q: %v", svcName, err)
				return err
			}
		} else {
			// IPVS service was changed, update the existing one
			// During updates, service VIP will not go down
			klog.V(3).Infof("IPVS service %s was changed", svcName)
			if err := proxier.ipvs.UpdateVirtualServer(vs); err != nil {
				klog.Errorf("Failed to update IPVS service, err:%v", err)
				return err
			}
		}
	}

	// bind service address to dummy interface even if service not changed,
	// in case that service IP was removed by other processes
	if bindAddr {
		klog.V(4).Infof("Bind addr %s", vs.Address.String())
		_, err := proxier.netlinkHandle.EnsureAddressBind(vs.Address.String(), DefaultDummyDevice)
		if err != nil {
			klog.Errorf("Failed to bind service address to dummy device %q: %v", svcName, err)
			return err
		}
	}
	return nil
}

     2.3.3.2 The syncEndpoint function

      Get the virtual server; if it cannot be found, report an error

func (proxier *Proxier) syncEndpoint(svcPortName proxy.ServicePortName, onlyNodeLocalEndpoints bool, vs *utilipvs.VirtualServer) error {
	appliedVirtualServer, err := proxier.ipvs.GetVirtualServer(vs)
	if err != nil || appliedVirtualServer == nil {
		klog.Errorf("Failed to get IPVS service, error: %v", err)
		return err
	}

     2.3.3.2.1 For new endpoints, add real servers

// Create new endpoints
for _, ep := range newEndpoints.List() {

	newDest := &utilipvs.RealServer{
		Address: net.ParseIP(ip),
		Port:    uint16(portNum),
		Weight:  1,
	}

	err = proxier.ipvs.AddRealServer(appliedVirtualServer, newDest)
	if err != nil {
		klog.Errorf("Failed to add destination: %v, error: %v", newDest, err)
		continue
	}
}
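For comparison with `ipvsadm -A` / `ipvsadm -a`, a virtual server plus one real server can be created programmatically with the github.com/moby/ipvs bindings (the successor of the docker/libnetwork/ipvs package that kube-proxy's ipvs util wraps). This is only a sketch under that assumption, with made-up addresses, and it needs root plus the ip_vs module:

package main

import (
	"log"
	"net"
	"syscall"

	"github.com/moby/ipvs"
)

func main() {
	handle, err := ipvs.New("") // netlink handle in the current network namespace
	if err != nil {
		log.Fatalf("ipvs handle: %v", err)
	}

	// Virtual server, i.e. `ipvsadm -A -t 10.254.23.85:5566 -s rr`.
	svc := &ipvs.Service{
		AddressFamily: syscall.AF_INET,
		Protocol:      syscall.IPPROTO_TCP,
		Address:       net.ParseIP("10.254.23.85"),
		Port:          5566,
		SchedName:     "rr",
		Netmask:       0xffffffff,
	}
	if err := handle.NewService(svc); err != nil {
		log.Fatalf("add virtual server: %v", err)
	}

	// Real server, i.e. `ipvsadm -a -t 10.254.23.85:5566 -r 172.30.3.29:5566 -m -w 1`.
	dst := &ipvs.Destination{
		AddressFamily: syscall.AF_INET,
		Address:       net.ParseIP("172.30.3.29"),
		Port:          5566,
		Weight:        1, // the default forwarding method is masquerade (NAT)
	}
	if err := handle.NewDestination(svc, dst); err != nil {
		log.Fatalf("add real server: %v", err)
	}
	log.Println("virtual server and real server created")
}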

     2.3.3.2.2 For stale endpoints, gracefully delete the real servers

     2.3.4 For nodePort services, include the node addresses

if svcInfo.NodePort() != 0 {
	if len(nodeAddresses) == 0 || len(nodeIPs) == 0 {
		// Skip nodePort configuration since an error occurred when
		// computing nodeAddresses or nodeIPs.
		continue
	}

	var lps []utilproxy.LocalPort
	for _, address := range nodeAddresses {
		lp := utilproxy.LocalPort{
			Description: "nodePort for " + svcNameString,
			IP:          address,
			Port:        svcInfo.NodePort(),
			Protocol:    protocol,
		}
		if utilproxy.IsZeroCIDR(address) {
			// Empty IP address means all
			lp.IP = ""
			lps = append(lps, lp)
			// If we encounter a zero CIDR, then there is no point in processing the rest of the addresses.
			break
		}
		lps = append(lps, lp)
	}

     2.3.5 Open the port for the nodePort case, which really just means listening on it

// For ports on node IPs, open the actual port and hold it.
for _, lp := range lps {
	if proxier.portsMap[lp] != nil {
		klog.V(4).Infof("Port %s was open before and is still needed", lp.String())
		replacementPortsMap[lp] = proxier.portsMap[lp]
		// We do not start listening on SCTP ports, according to our agreement in the
		// SCTP support KEP
	} else if svcInfo.Protocol() != v1.ProtocolSCTP {
		socket, err := proxier.portMapper.OpenLocalPort(&lp)
		if err != nil {
			klog.Errorf("can't open %s, skipping this nodePort: %v", lp.String(), err)
			continue
		}
		if lp.Protocol == "udp" {
			isIPv6 := utilnet.IsIPv6(svcInfo.ClusterIP())
			conntrack.ClearEntriesForPort(proxier.exec, lp.Port, isIPv6, v1.ProtocolUDP)
		}
		replacementPortsMap[lp] = socket
	} // We're holding the port, so it's OK to install ipvs rules.
}
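"Opening" a nodePort here just means binding a socket so that no other process can grab the port while the ipvs rules reference it; the socket is held, not served. A minimal TCP-only sketch of that idea (the real OpenLocalPort also handles UDP):

package main

import (
	"fmt"
	"log"
	"net"
)

// openLocalPort binds the TCP port and returns the listener so the caller can
// hold it for as long as the corresponding ipvs/ipset rules exist.
func openLocalPort(ip string, port int) (net.Listener, error) {
	return net.Listen("tcp", fmt.Sprintf("%s:%d", ip, port))
}

func main() {
	// An empty IP means all node addresses, mirroring the zero-CIDR case above.
	l, err := openLocalPort("", 30080)
	if err != nil {
		log.Fatalf("can't open nodePort 30080: %v", err)
	}
	defer l.Close()
	fmt.Println("holding", l.Addr(), "- the listener only reserves the port, connections are never accepted")
}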

     2.3.6 Add ipset entries; for example, TCP nodePorts go into the KUBE-NODE-PORT-TCP set shown below

# ipset -L KUBE-NODE-PORT-TCP
Name: KUBE-NODE-PORT-TCP
Type: bitmap:port
Revision: 3
Header: range 0-65535
Size in memory: 8300
References: 1
Number of entries: 1
Members:
35351     

switch protocol {
case "tcp":
	nodePortSet = proxier.ipsetList[kubeNodePortSetTCP]
	entries = []*utilipset.Entry{{
		// No need to provide ip info
		Port:     svcInfo.NodePort(),
		Protocol: protocol,
		SetType:  utilipset.BitmapPort,
	}}
case "udp":
	nodePortSet = proxier.ipsetList[kubeNodePortSetUDP]
	entries = []*utilipset.Entry{{
		// No need to provide ip info
		Port:     svcInfo.NodePort(),
		Protocol: protocol,
		SetType:  utilipset.BitmapPort,
	}}
case "sctp":
	nodePortSet = proxier.ipsetList[kubeNodePortSetSCTP]
	// Since hash ip:port is used for SCTP, all the nodeIPs to be used in the SCTP ipset entries.
	entries = []*utilipset.Entry{}
	for _, nodeIP := range nodeIPs {
		entries = append(entries, &utilipset.Entry{
			IP:       nodeIP.String(),
			Port:     svcInfo.NodePort(),
			Protocol: protocol,
			SetType:  utilipset.HashIPPort,
		})
	}
default:
	// It should never hit
	klog.Errorf("Unsupported protocol type: %s", protocol)
}

    The nodePort path also calls the syncService and syncEndpoint interfaces to sync the virtual server and real servers; the difference is that nothing needs to be bound to kube-ipvs0.

 

  2.4 Sync the ipset entries

      List all existing ipset entries, then delete stale entries and add new ones

func (set *IPSet) syncIPSetEntries() {
	appliedEntries, err := set.handle.ListEntries(set.Name)
	if err != nil {
		klog.Errorf("Failed to list ip set entries, error: %v", err)
		return
	}

	// currentIPSetEntries represents Endpoints watched from API Server.
	currentIPSetEntries := sets.NewString()
	for _, appliedEntry := range appliedEntries {
		currentIPSetEntries.Insert(appliedEntry)
	}

	if !set.activeEntries.Equal(currentIPSetEntries) {
		// Clean legacy entries
		for _, entry := range currentIPSetEntries.Difference(set.activeEntries).List() {
			if err := set.handle.DelEntry(entry, set.Name); err != nil {
				if !utilipset.IsNotFoundError(err) {
					klog.Errorf("Failed to delete ip set entry: %s from ip set: %s, error: %v", entry, set.Name, err)
				}
			} else {
				klog.V(3).Infof("Successfully delete legacy ip set entry: %s from ip set: %s", entry, set.Name)
			}
		}
		// Create active entries
		for _, entry := range set.activeEntries.Difference(currentIPSetEntries).List() {
			if err := set.handle.AddEntry(entry, &set.IPSet, true); err != nil {
				klog.Errorf("Failed to add entry: %v to ip set: %s, error: %v", entry, set.Name, err)
			} else {
				klog.V(3).Infof("Successfully add entry: %v to ip set: %s", entry, set.Name)
			}
		}
	}
}

    The resulting sets look like this:

Name: KUBE-LOOP-BACK
Type: hash:ip,port,ip
Header: family inet hashsize 1024 maxelem 65536
Members:
172.30.46.39,tcp:6379,172.30.46.39
172.30.3.15,udp:53,172.30.3.15

Name: KUBE-NODE-PORT-TCP
Type: bitmap:port
Header: range 0-65535
Members:
32371

Name: KUBE-CLUSTER-IP
Type: hash:ip,port
Header: family inet hashsize 1024 maxelem 65536
Members:
10.254.0.2,tcp:53
10.254.0.2,udp:53
10.254.86.39,tcp:6379
10.254.0.1,tcp:443
10.254.69.27,tcp:443
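The sync above is a plain set difference: delete entries that exist on the node but are no longer desired, add entries that are desired but missing. A condensed sketch of that pattern using the same k8s.io/apimachinery string sets (the set name and entries are made up):

package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/util/sets"
)

func main() {
	// activeEntries: what this sync round computed from Services/Endpoints.
	active := sets.NewString("10.254.0.1,tcp:443", "10.254.0.2,tcp:53", "10.254.0.2,udp:53")
	// currentEntries: what the node's ipset actually contains right now.
	current := sets.NewString("10.254.0.2,tcp:53", "10.254.9.9,tcp:80")

	if !active.Equal(current) {
		for _, stale := range current.Difference(active).List() {
			fmt.Println("del from KUBE-CLUSTER-IP:", stale) // DelEntry in the real code
		}
		for _, missing := range active.Difference(current).List() {
			fmt.Println("add to KUBE-CLUSTER-IP:", missing) // AddEntry in the real code
		}
	}
}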

 

3. ipsetWithIptablesChain

If KUBE-POSTROUTING matches the KUBE-LOOP-BACK ipset, the traffic is masqueraded: -A KUBE-POSTROUTING -m comment --comment "Kubernetes endpoints dst ip:port, source ip for solving hairpin purpose" -m set --match-set KUBE-LOOP-BACK dst,dst,src -j MASQUERADE

-A KUBE-SERVICES -m addrtype --dst-type LOCAL -j KUBE-NODE-PORT
-A KUBE-SERVICES -m set --match-set KUBE-CLUSTER-IP dst,dst -j ACCEPT

// ipsetWithIptablesChain is the ipsets list with iptables source chain and the chain jump to
// `iptables -t nat -A <from> -m set --match-set <name> <matchType> -j <to>`
// example: iptables -t nat -A KUBE-SERVICES -m set --match-set KUBE-NODE-PORT-TCP dst -j KUBE-NODE-PORT
// ipsets with other match rules will be created Individually.
// Note: kubeNodePortLocalSetTCP must be prior to kubeNodePortSetTCP, the same for UDP.
var ipsetWithIptablesChain = []struct {
	name          string
	from          string
	to            string
	matchType     string
	protocolMatch string
}{
	{kubeLoopBackIPSet, string(kubePostroutingChain), "MASQUERADE", "dst,dst,src", ""},
	{kubeLoadBalancerSet, string(kubeServicesChain), string(KubeLoadBalancerChain), "dst,dst", ""},
	{kubeLoadbalancerFWSet, string(KubeLoadBalancerChain), string(KubeFireWallChain), "dst,dst", ""},
	{kubeLoadBalancerSourceCIDRSet, string(KubeFireWallChain), "RETURN", "dst,dst,src", ""},
	{kubeLoadBalancerSourceIPSet, string(KubeFireWallChain), "RETURN", "dst,dst,src", ""},
	{kubeLoadBalancerLocalSet, string(KubeLoadBalancerChain), "RETURN", "dst,dst", ""},
	{kubeNodePortLocalSetTCP, string(KubeNodePortChain), "RETURN", "dst", "tcp"},
	{kubeNodePortSetTCP, string(KubeNodePortChain), string(KubeMarkMasqChain), "dst", "tcp"},
	{kubeNodePortLocalSetUDP, string(KubeNodePortChain), "RETURN", "dst", "udp"},
	{kubeNodePortSetUDP, string(KubeNodePortChain), string(KubeMarkMasqChain), "dst", "udp"},
	{kubeNodePortSetSCTP, string(kubeServicesChain), string(KubeNodePortChain), "dst", "sctp"},
	{kubeNodePortLocalSetSCTP, string(KubeNodePortChain), "RETURN", "dst", "sctp"},
}

 

4. writeIptablesRules

    Writes the rules into the nat rule buffer and the filter buffer; the long block below is all variations of this same operation

	for _, set := range ipsetWithIptablesChain {
		if _, find := proxier.ipsetList[set.name]; find && !proxier.ipsetList[set.name].isEmpty() {
			args = append(args[:0], "-A", set.from)
			if set.protocolMatch != "" {
				args = append(args, "-p", set.protocolMatch)
			}
			args = append(args,
				"-m", "comment", "--comment", proxier.ipsetList[set.name].getComment(),
				"-m", "set", "--match-set", set.name,
				set.matchType,
			)
			writeLine(proxier.natRules, append(args, "-j", set.to)...)
		}
	}

 

-A KUBE-SERVICES ! -s 10.254.0.0/16 -m comment --comment "Kubernetes service cluster ip + port for masquerade purpose" -m set --match-set KUBE-CLUSTER-IP dst,dst -j KUBE-MARK-MASQ

	if !proxier.ipsetList[kubeClusterIPSet].isEmpty() {
		args = append(args[:0],
			"-A", string(kubeServicesChain),
			"-m", "comment", "--comment", proxier.ipsetList[kubeClusterIPSet].getComment(),
			"-m", "set", "--match-set", kubeClusterIPSet,
		)
		if proxier.masqueradeAll {
			writeLine(proxier.natRules, append(args, "dst,dst", "-j", string(KubeMarkMasqChain))...)
		} else if len(proxier.clusterCIDR) > 0 {
			// This masquerades off-cluster traffic to a service VIP.  The idea
			// is that you can establish a static route for your Service range,
			// routing to any node, and that node will bridge into the Service
			// for you.  Since that might bounce off-node, we masquerade here.
			// If/when we support "Local" policy for VIPs, we should update this.
			writeLine(proxier.natRules, append(args, "dst,dst", "! -s", proxier.clusterCIDR, "-j", string(KubeMarkMasqChain))...)
		} else {
			// Masquerade all OUTPUT traffic coming from a service ip.
			// The kube dummy interface has all service VIPs assigned which
			// results in the service VIP being picked as the source IP to reach
			// a VIP. This leads to a connection from VIP:<random port> to
			// VIP:<service port>.
			// Always masquerading OUTPUT (node-originating) traffic with a VIP
			// source ip and service port destination fixes the outgoing connections.
			writeLine(proxier.natRules, append(args, "src,dst", "-j", string(KubeMarkMasqChain))...)
		}
	}

-A KUBE-LOAD-BALANCER -j KUBE-MARK-MASQ

	// mark drop for KUBE-LOAD-BALANCER
	writeLine(proxier.natRules, []string{
		"-A", string(KubeLoadBalancerChain),
		"-j", string(KubeMarkMasqChain),
	}...)

	// mark drop for KUBE-FIRE-WALL
	writeLine(proxier.natRules, []string{
		"-A", string(KubeFireWallChain),
		"-j", string(KubeMarkDropChain),
	}...)

-A KUBE-FORWARD -m comment --comment "kubernetes forwarding rules" -m mark --mark 0x4000/0x4000 -j ACCEPT

	// If the masqueradeMark has been added then we want to forward that same
	// traffic, this allows NodePort traffic to be forwarded even if the default
	// FORWARD policy is not accept.
	writeLine(proxier.filterRules,
		"-A", string(KubeForwardChain),
		"-m", "comment", "--comment", `"kubernetes forwarding rules"`,
		"-m", "mark", "--mark", proxier.masqueradeMark,
		"-j", "ACCEPT",
	)

 

  This mainly creates:

-A KUBE-FORWARD -s 10.254.0.0/16 -m comment --comment "kubernetes forwarding conntrack pod source rule" -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT
-A KUBE-FORWARD -d 10.254.0.0/16 -m comment --comment "kubernetes forwarding conntrack pod destination rule" -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT

	// The following rules can only be set if clusterCIDR has been defined.
	if len(proxier.clusterCIDR) != 0 {
		// The following two rules ensure the traffic after the initial packet
		// accepted by the "kubernetes forwarding rules" rule above will be
		// accepted, to be as specific as possible the traffic must be sourced
		// or destined to the clusterCIDR (to/from a pod).
		writeLine(proxier.filterRules,
			"-A", string(KubeForwardChain),
			"-s", proxier.clusterCIDR,
			"-m", "comment", "--comment", `"kubernetes forwarding conntrack pod source rule"`,
			"-m", "conntrack",
			"--ctstate", "RELATED,ESTABLISHED",
			"-j", "ACCEPT",
		)
		writeLine(proxier.filterRules,
			"-A", string(KubeForwardChain),
			"-m", "comment", "--comment", `"kubernetes forwarding conntrack pod destination rule"`,
			"-d", proxier.clusterCIDR,
			"-m", "conntrack",
			"--ctstate", "RELATED,ESTABLISHED",
			"-j", "ACCEPT",
		)
	}

 

5. Use iptables-restore to import the firewall rules in one batch

	// Sync iptables rules.
	// NOTE: NoFlushTables is used so we don't flush non-kubernetes chains in the table.
	proxier.iptablesData.Reset()
	proxier.iptablesData.Write(proxier.natChains.Bytes())
	proxier.iptablesData.Write(proxier.natRules.Bytes())
	proxier.iptablesData.Write(proxier.filterChains.Bytes())
	proxier.iptablesData.Write(proxier.filterRules.Bytes())

	glog.V(5).Infof("Restoring iptables rules: %s", proxier.iptablesData.Bytes())
	err = proxier.iptables.RestoreAll(proxier.iptablesData.Bytes(), utiliptables.NoFlushTables, utiliptables.RestoreCounters)
	if err != nil {
		glog.Errorf("Failed to execute iptables-restore: %v\nRules:\n%s", err, proxier.iptablesData.Bytes())
		// Revert new local ports.
		utilproxy.RevertPorts(replacementPortsMap, proxier.portsMap)
		return
	}

*nat
:KUBE-SERVICES - [0:0]
:KUBE-POSTROUTING - [0:0]
:KUBE-FIREWALL - [0:0]
:KUBE-NODE-PORT - [0:0]
:KUBE-LOAD-BALANCER - [0:0]
:KUBE-MARK-MASQ - [0:0]
-A KUBE-POSTROUTING -m comment --comment "kubernetes service traffic requiring SNAT" -m mark --mark 0x00004000/0x00004000 -j MASQUERADE
-A KUBE-MARK-MASQ -j MARK --set-xmark 0x00004000/0x00004000
-A KUBE-POSTROUTING -m comment --comment "Kubernetes endpoints dst ip:port, source ip for solving hairpin purpose" -m set --match-set KUBE-LOOP-BACK dst,dst,src -j MASQUERADE
-A KUBE-SERVICES -m comment --comment "Kubernetes service cluster ip + port for masquerade purpose" -m set --match-set KUBE-CLUSTER-IP dst,dst ! -s 172.170.0.0/16 -j KUBE-MARK-MASQ
-A KUBE-SERVICES -m addrtype --dst-type LOCAL -j KUBE-NODE-PORT
-A KUBE-LOAD-BALANCER -j KUBE-MARK-MASQ
-A KUBE-FIREWALL -j KUBE-MARK-DROP
-A KUBE-SERVICES -m set --match-set KUBE-CLUSTER-IP dst,dst -j ACCEPT
COMMIT
*filter
:KUBE-FORWARD - [0:0]
-A KUBE-FORWARD -m comment --comment "kubernetes forwarding rules" -m mark --mark 0x00004000/0x00004000 -j ACCEPT
-A KUBE-FORWARD -s 172.170.0.0/16 -m comment --comment "kubernetes forwarding conntrack pod source rule" -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT
-A KUBE-FORWARD -m comment --comment "kubernetes forwarding conntrack pod destination rule" -d 172.170.0.0/16 -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT
COMMIT
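RestoreAll ends up piping the assembled buffer (like the payload shown above) into iptables-restore with --noflush so that chains kube-proxy does not own are left alone. A rough standalone equivalent, which needs root to actually apply anything:

package main

import (
	"bytes"
	"fmt"
	"log"
	"os/exec"
)

// restoreAll feeds a complete *nat/*filter payload to iptables-restore.
// --noflush keeps unrelated chains intact; --counters restores packet counters.
func restoreAll(rules []byte) error {
	cmd := exec.Command("iptables-restore", "--noflush", "--counters")
	cmd.Stdin = bytes.NewReader(rules)
	if out, err := cmd.CombinedOutput(); err != nil {
		return fmt.Errorf("iptables-restore failed: %v, output: %s", err, out)
	}
	return nil
}

func main() {
	payload := []byte("*filter\n:KUBE-FORWARD - [0:0]\nCOMMIT\n")
	if err := restoreAll(payload); err != nil {
		log.Fatal(err)
	}
	fmt.Println("rules restored")
}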
 

 

6. Get the currently bound addresses

	// Clean up legacy bind address
	// currentBindAddrs represents ip addresses bind to DefaultDummyDevice from the system
	currentBindAddrs, err := proxier.netlinkHandle.ListBindAddress(DefaultDummyDevice)
	if err != nil {
		glog.Errorf("Failed to get bind address, err: %v", err)
	}

In ipvs mode, Cluster IPs created through a svc are all bound to the kube-ipvs0 virtual interface. Creating a ClusterIP service performs three things:

  • make sure the virtual interface kube-ipvs0 exists on the node
  • bind the service IP address to the virtual interface
  • create an IPVS virtual server for each service IP address

TCP  10.254.23.85:5566 wrr
  -> 172.30.3.29:5566             Masq    1      0          0         
  -> 172.30.46.41:5566            Masq    1      0          0

 

Summary:

    serviceChanges and endpointsChanges are used to track updates to Services and Endpoints.

    Create the top-level kube chains and the jump rules that link them, e.g. KUBE-POSTROUTING and KUBE-MARK-MASQ:

{utiliptables.TableNAT, kubeServicesChain},
{utiliptables.TableNAT, kubePostroutingChain},
{utiliptables.TableNAT, KubeFireWallChain},
{utiliptables.TableNAT, KubeNodePortChain},
{utiliptables.TableNAT, KubeLoadBalancerChain},
{utiliptables.TableNAT, KubeMarkMasqChain},
{utiliptables.TableFilter, KubeForwardChain},

    Make sure the kube-ipvs0 interface exists; service addresses are bound to it later.

    Add the KUBE-LOOP-BACK and KUBE-CLUSTER-IP ipset sets.

    Call syncService and syncEndpoint to bind the clusterIP to the dummy interface and add the virtual server and real servers; note that a service may have no endpoints.

    Loop over every service:

    a. For services with local endpoints, add KUBE-LOOP-BACK ipset entries

    b. Add KUBE-CLUSTER-IP ipset entries

    c. Create the ipvs virtual server if it does not exist, update it if it changed, and call EnsureAddressBind to add the address directly to kube-ipvs0, the equivalent of `ip addr add $addr dev $link`

    d. Endpoints become real servers: new ones are added, stale ones are gracefully deleted

    e. For nodePort, make sure the port is being listened on and add ipset entries such as KUBE-NODE-PORT-TCP; there is no need to bind anything to the dummy interface

 

Issue a: increase the ipvs module's hash table size
    The ipvs hash table defaults to 2^12 = 4096 entries; increase it to 2^20 = 1048576. `ipvsadm -l` shows the current hash table size in its header (the size=... value).

    How to change it: add a file ip_vs.conf under /etc/modprobe.d/ with the content:

    options ip_vs conn_tab_bits=20

    then reload the ipvs module.

 

Bug 1. kube-proxy CPU usage is high with a large number of services

    https://github.com/kubernetes/kubernetes/issues/73382 

    Fixed in kubernetes v1.16. The root cause was netlink system calls: instead of issuing a netlink call to look up the node IPs for every nodePort in every sync, each sync now queries the node IPs only once.

     https://github.com/kubernetes/kubernetes/pull/79444
