If you want to build a Docker network that spans multiple hosts, for example with Docker Swarm, the default is the overlay network driver, and its performance is relatively poor. With macvlan, performance is much better.
macvlan is a Linux kernel module that allows multiple MAC addresses to be configured on a single physical NIC, i.e. multiple virtual interfaces, each of which can have its own IP address. macvlan is essentially a NIC virtualization technique, and its biggest advantage is excellent performance: unlike other approaches, macvlan does not need to create a Linux bridge, but connects to the physical network directly over Ethernet. Running lsmod confirms it really is a module loaded by the kernel:
[root@linux3 ~]# lsmod | grep mac
macvlan 19239 0
From modinfo you can see that macvlan was written by Patrick McHardy:
[root@linux3 ~]# modinfo macvlan
filename: /lib/modules/3.10.0-1160.el7.x86_64/kernel/drivers/net/macvlan.ko.xz
alias: rtnl-link-macvlan
description: Driver for MAC address based VLANs
author: Patrick McHardy <kaber@trash.net>
license: GPL
retpoline: Y
rhelversion: 7.9
srcversion: 140D211EA232257B4320276
depends:
intree: Y
vermagic: 3.10.0-1160.el7.x86_64 SMP mod_unload modversions
signer: CentOS Linux kernel signing key
sig_key: E1:FD:B0:E2:A7:E8:61:A1:D1:CA:80:A2:3D:CF:0D:BA:3A:A4:AD:F5
sig_hashalgo: sha256
Normally one NIC has one MAC address, fixed by the manufacturer when the card is made. With macvlan, one NIC can expose multiple virtual MAC addresses. Suppose your machine has only a single physical NIC: with macvlan it behaves as if it had n NICs, no extra hardware needed. For example, the diagram below shows two servers, each with one physical NIC, each virtualizing four interfaces with their own MAC addresses. What does that buy us?
Note: for all four virtual interfaces (each with its own MAC address) to send traffic through the one physical NIC, the physical NIC has to be put into promiscuous mode with the Linux command ip link set ens33 promisc on. Here ens33 is the physical NIC's name; run ifconfig to see the interface names on your machine.
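As a quick sanity check, the promiscuous bit can be read back from sysfs. A minimal Python sketch, assuming a Linux host; IFF_PROMISC is the standard flag value from linux/if.h, and ens33 is just the example interface name from above:

```python
# Check whether a NIC is in promiscuous mode by reading its flags from sysfs.
IFF_PROMISC = 0x100  # from linux/if.h

def is_promisc(ifname):
    # /sys/class/net/<if>/flags holds the interface flags as a hex string
    with open("/sys/class/net/%s/flags" % ifname) as f:
        return bool(int(f.read().strip(), 16) & IFF_PROMISC)

print(is_promisc("lo"))  # try is_promisc("ens33") after "ip link set ens33 promisc on"
```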
[root@linux3 ~]# docker network ls
NETWORK ID NAME DRIVER SCOPE
987befb5a3b1 bridge bridge local
581ad0a5ce75 host host local
0a451b66cef2 none null local
[root@linux3 ~]# docker network --help
Usage: docker network COMMAND
Manage networks
Commands:
connect Connect a container to a network
create Create a network
disconnect Disconnect a container from a network
inspect Display detailed information on one or more networks
ls List networks
prune Remove all unused networks
rm Remove one or more networks
Run 'docker network COMMAND --help' for more information on a command.
[root@linux3 ~]# docker network create --help
Usage: docker network create [OPTIONS] NETWORK
Create a network
Options:
--attachable Enable manual container attachment
--aux-address map Auxiliary IPv4 or IPv6 addresses used by Network driver (default map[])
--config-from string The network from which to copy the configuration
--config-only Create a configuration only network
-d, --driver string Driver to manage the Network (default "bridge")
--gateway strings IPv4 or IPv6 Gateway for the master subnet
--ingress Create swarm routing-mesh network
--internal Restrict external access to the network
--ip-range strings Allocate container ip from a sub-range
--ipam-driver string IP Address Management Driver (default "default")
--ipam-opt map Set IPAM driver specific options (default map[])
--ipv6 Enable IPv6 networking
--label list Set metadata on a network
-o, --opt map Set driver specific options (default map[])
--scope string Control the network's scope
--subnet strings Subnet in CIDR format that represents a network segment
The command-line help shows how to create a macvlan network:
[root@linux3 ~]# docker network create --driver macvlan \
> --subnet=172.16.86.0/24 \
> --gateway=172.16.86.1 \
> -o parent=ens33 macvlan_net1
5ab5dd6287bf3177518b573f1006f5500499102ce887618ee8b9a3c94f3fa105
As you can see, this created a macvlan-type Docker network on this machine:
[root@linux3 ~]# docker network ls
NETWORK ID NAME DRIVER SCOPE
987befb5a3b1 bridge bridge local
581ad0a5ce75 host host local
5ab5dd6287bf macvlan_net1 macvlan local
0a451b66cef2 none null local
Check the details with inspect:
[root@linux3 ~]# docker network inspect macvlan_net1
[
{
"Name": "macvlan_net1",
"Id": "5ab5dd6287bf3177518b573f1006f5500499102ce887618ee8b9a3c94f3fa105",
"Created": "2021-09-13T13:45:15.941104269+08:00",
"Scope": "local",
"Driver": "macvlan",
"EnableIPv6": false,
"IPAM": {
"Driver": "default",
"Options": {},
"Config": [
{
"Subnet": "172.16.86.0/24",
"Gateway": "172.16.86.1"
}
]
},
"Internal": false,
"Attachable": false,
"Ingress": false,
"ConfigFrom": {
"Network": ""
},
"ConfigOnly": false,
"Containers": {},
"Options": {
"parent": "ens33"
},
"Labels": {}
}
]
Because a macvlan network is meant to span hosts (even though its scope is still local), the create command above has to be run on every host involved. Now start a container on this network and give it a fixed IP address:
docker run -itd --name box1 --ip=172.16.86.20 --network macvlan_net1 busybox
5c8be65f8680df44296c84df6343f029517619b3777179a34180880bce0a3c08
Start a second container, box2, with an IP ending in .21:
docker run -itd --name box2 --ip=172.16.86.21 --network macvlan_net1 busybox
f2353faf15d47acdeaaac51937a7cc352a59976a70b1b8b11ad7ab1ebad41636
Then have them ping each other. Nice, the IPs are reachable:
[root@linux3 ~]# docker exec box2 ping -c 3 172.16.86.20
PING 172.16.86.20 (172.16.86.20): 56 data bytes
64 bytes from 172.16.86.20: seq=0 ttl=64 time=0.138 ms
64 bytes from 172.16.86.20: seq=1 ttl=64 time=0.083 ms
64 bytes from 172.16.86.20: seq=2 ttl=64 time=0.058 ms
--- 172.16.86.20 ping statistics ---
3 packets transmitted, 3 packets received, 0% packet loss
round-trip min/avg/max = 0.058/0.093/0.138 ms
[root@linux3 ~]# docker exec box1 ping -c 3 172.16.86.21
PING 172.16.86.21 (172.16.86.21): 56 data bytes
64 bytes from 172.16.86.21: seq=0 ttl=64 time=0.052 ms
64 bytes from 172.16.86.21: seq=1 ttl=64 time=0.082 ms
64 bytes from 172.16.86.21: seq=2 ttl=64 time=0.051 ms
--- 172.16.86.21 ping statistics ---
3 packets transmitted, 3 packets received, 0% packet loss
round-trip min/avg/max = 0.051/0.061/0.082 ms
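Incidentally, notice that Docker normally derives a container's MAC address from its IP: the prefix 02:42 (a locally administered address) followed by the four IP octets. This is observed default behavior, not a documented guarantee; a small sketch of the scheme:

```python
# Sketch of Docker's observed default MAC derivation:
# 02:42 (locally administered prefix) + the 4 IPv4 octets of the container IP.
def mac_from_ip(ip):
    octets = [int(o) for o in ip.split(".")]
    return ":".join("%02x" % b for b in [0x02, 0x42] + octets)

print(mac_from_ip("172.16.86.20"))  # 02:42:ac:10:56:14
```

So box1 at 172.16.86.20 would typically show up as 02:42:ac:10:56:14 in ip link output.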
Note that there is still only one bridge, docker0: brctl show confirms this server has a single bridge. Creating a macvlan-type Docker network does not create a new bridge, and no bridge is used at all; this is why macvlan mode outperforms bridge mode.
[root@linux3 ~]# brctl show
bridge name bridge id STP enabled interfaces
docker0 8000.0242ce22dc79 no veth03b63f8
[root@linux3 ~]# docker network ls
NETWORK ID NAME DRIVER SCOPE
987befb5a3b1 bridge bridge local
581ad0a5ce75 host host local
5ab5dd6287bf macvlan_net1 macvlan local
0a451b66cef2 none null local
[root@linux3 ~]# ifconfig
docker0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet 172.17.0.1 netmask 255.255.0.0 broadcast 172.17.255.255
inet6 fe80::42:ceff:fe22:dc79 prefixlen 64 scopeid 0x20<link>
ether 02:42:ce:22:dc:79 txqueuelen 0 (Ethernet)
RX packets 177695 bytes 16034206 (15.2 MiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 217980 bytes 584716285 (557.6 MiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
Inspect the container's interfaces with the ip link command: besides lo, the container has only eth0@if2. That if2 part matters: the 2 is an interface index. Run ip link back on the host and you can see what it points at; the container's eth0 is simply a virtual interface carved out of the host's ens33 by macvlan.
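The number after @if is the kernel's interface index, the same number ip link prints in its first column. Python can look these indices up directly; lo is used below only because it exists everywhere:

```python
import socket

# Every network interface has a kernel-assigned index; "eth0@if2" inside the
# container means eth0 is tied to the host interface whose index is 2.
print(socket.if_nametoindex("lo"))  # lo is almost always index 1
for idx, name in socket.if_nameindex():
    print(idx, name)
```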
Now look at the default bridge network Docker creates at install time:
docker network ls
NETWORK ID NAME DRIVER SCOPE
6dbe26dc7d0c bridge bridge local
581ad0a5ce75 host host local
0a451b66cef2 none null local
[root@linux4 andycui]# docker network inspect bridge
[
{
"Name": "bridge",
"Id": "6dbe26dc7d0cfeb290ebd653e3db12e4be5918d9f934b26c1964d7ede83c30ba",
"Created": "2021-08-25T15:17:57.124884782+08:00",
"Scope": "local",
"Driver": "bridge",
"EnableIPv6": false,
"IPAM": {
"Driver": "default",
"Options": null,
"Config": [
{
"Subnet": "172.17.0.0/16",
"Gateway": "172.17.0.1"
}
]
},
"Internal": false,
"Attachable": false,
"Ingress": false,
"ConfigFrom": {
"Network": ""
},
"ConfigOnly": false,
"Containers": {},
"Options": {
"com.docker.network.bridge.default_bridge": "true",
"com.docker.network.bridge.enable_icc": "true",
"com.docker.network.bridge.enable_ip_masquerade": "true",
"com.docker.network.bridge.host_binding_ipv4": "0.0.0.0",
"com.docker.network.bridge.name": "docker0",
"com.docker.network.driver.mtu": "1500"
},
"Labels": {}
}
]
bridge networks are isolated networks on a single Engine installation. If you want to create a network that spans multiple Docker hosts each running an Engine, you must create an overlay network.
Docker's default bridge mode is isolated per host: the Subnet 172.17.0.0/16 above only hands out IPs to containers on this machine, it cannot span hosts, and containers on different hosts may well end up with identical IPs. To span hosts you must create an overlay network.
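The collision risk is easy to see: two hosts that both allocate from 172.17.0.0/16 independently hand out the same addresses. A small illustration with the standard ipaddress module; the sequential allocation here is a simplification of what Docker's IPAM driver actually does:

```python
import ipaddress

subnet = ipaddress.ip_network("172.17.0.0/16")
hosts = subnet.hosts()
gateway = next(hosts)          # 172.17.0.1, taken by docker0 on each host
first_container = next(hosts)  # 172.17.0.2

# Two independent hosts run this same logic, so the first container on
# host 1 and the first container on host 2 both get 172.17.0.2.
print(gateway, first_container)
```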
By default, though, Docker's overlay network performs rather poorly. For example, if you use docker swarm init to build a cross-host Docker network, overlay is what you get, and its performance is not great:
[root@linux3 ~]# docker swarm init --help
Usage: docker swarm init [OPTIONS]
Initialize a swarm
Options:
--advertise-addr string Advertised address (format: <ip|interface>[:port])
--autolock Enable manager autolocking (requiring an unlock key to start a stopped manager)
--availability string Availability of the node ("active"|"pause"|"drain") (default "active")
--cert-expiry duration Validity period for node certificates (ns|us|ms|s|m|h) (default 2160h0m0s)
--data-path-addr string Address or interface to use for data path traffic (format: <ip|interface>)
--data-path-port uint32 Port number to use for data path traffic (1024 - 49151). If no value is set or is set to 0, the default port (4789) is used.
--default-addr-pool ipNetSlice default address pool in CIDR format (default [])
--default-addr-pool-mask-length uint32 default address pool subnet mask length (default 24)
--dispatcher-heartbeat duration Dispatcher heartbeat period (ns|us|ms|s|m|h) (default 5s)
--external-ca external-ca Specifications of one or more certificate signing endpoints
--force-new-cluster Force create a new cluster from current state
--listen-addr node-addr Listen address (format: <ip|interface>[:port]) (default 0.0.0.0:2377)
--max-snapshots uint Number of additional Raft snapshots to retain
--snapshot-interval uint Number of log entries between Raft snapshots (default 10000)
--task-history-limit int Task history retention limit (default 5)
With Kubernetes there is the flannel CNI plugin, whose performance is also mediocre. Roughly how it works:
- First there is one large global IP range, stored in etcd, say 10,000 addresses.
- Each host is then assigned a small subnetwork out of that range, say 200 addresses.
- Containers on a host can only be assigned IPs from that host's small range.
- flannel has to record which range each host owns and which IP each container has; all of this lives in etcd.
- Every host runs a flanneld process that encapsulates container packets before sending them out through the host's NIC, and decapsulates incoming packets before handing them to containers. This encapsulate/decapsulate step is what hurts performance, though it is still usable. Supported backends include udp and vxlan; in udp mode the container's network packet is simply wrapped in a UDP datagram and sent out.
- This means that when container A on host 1 wants to talk to container B on host 2, the flanneld on host 1 must already know which physical host container B lives on, because to send the UDP datagram it needs host 2's IP address.
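The steps above can be sketched end to end. The toy code below imitates flannel's udp backend on one machine: a "container packet" is wrapped in a UDP datagram addressed to the remote host, then unwrapped on arrival. Everything here (addresses, ports, the framing) is invented for illustration; real flannel uses a TUN device and its own wire format.

```python
import socket

# "flanneld" on host 2: receives the encapsulated datagram and unwraps it.
receiver = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
receiver.bind(("127.0.0.1", 0))          # stand-in for host 2's physical IP
host2_addr = receiver.getsockname()

# "flanneld" on host 1: container A's packet is wrapped in UDP and sent to
# host 2's physical address, which host 1 must know in advance (flannel
# learns the container-subnet -> host mapping from etcd).
container_packet = b"IP packet from container A to container B"
sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sender.sendto(container_packet, host2_addr)

# On host 2 the payload is unwrapped and would be delivered to container B.
unwrapped, _ = receiver.recvfrom(4096)
print(unwrapped == container_packet)
sender.close(); receiver.close()
```

The extra copy through flanneld on both ends is exactly the overhead the text describes.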
Finally, you can talk to the dockerd daemon through its API; the docker command is not the only way to interact with the server. Through the API you can manage containers on a remote server just as well. See: Develop with Docker Engine SDKs | Docker Documentation.
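For example, the Engine API is reachable over the local unix socket with nothing but the standard library. The sketch below lists networks, the API equivalent of docker network ls; the API version v1.41 and the socket path are common defaults and may need adjusting for your installation:

```python
import http.client, json, os, socket

class UnixHTTPConnection(http.client.HTTPConnection):
    """http.client connection that talks to dockerd's unix socket."""
    def __init__(self, path="/var/run/docker.sock"):
        super().__init__("localhost")
        self.unix_path = path

    def connect(self):
        s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        s.connect(self.unix_path)
        self.sock = s

def list_networks(path="/var/run/docker.sock"):
    """GET /networks: the Engine API call behind docker network ls."""
    conn = UnixHTTPConnection(path)
    conn.request("GET", "/v1.41/networks")
    return json.loads(conn.getresponse().read())

# Only attempt the call if a local daemon socket is actually present.
if os.path.exists("/var/run/docker.sock"):
    for net in list_networks():
        print(net["Name"], net["Driver"])
```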