The Essence of Containers
A container is essentially just a process. It is isolated with Linux Namespaces so that it cannot see the outside world, its resource usage is limited with Cgroups, and the pivot_root (or chroot) system call switches the process's root directory, mounting the container image as the root filesystem (rootfs). The rootfs contains not only the application to run, but also all of its dependent libraries and the operating system's directories and files. Because the rootfs packages the application's complete runtime environment, it guarantees consistency across development, testing, and production.
Once you see the essence of containers, many things become easy to understand. For example, running docker exec drops us into a running container as if logging in to a standalone virtual machine. In reality this just uses the setns system call to move the current process into the container process's namespaces, after which it can "see" the inside of the container.
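This mechanism can be reproduced by hand with nsenter, a thin wrapper around the setns system call. A minimal sketch, assuming a running container (the name mycontainer is a hypothetical placeholder):

```shell
# Look up the PID of the container's init process
# ("mycontainer" is a hypothetical container name)
PID=$(docker inspect --format '{{.State.Pid}}' mycontainer)

# setns() into the container's mount, UTS, IPC, network and PID
# namespaces, then start a shell that "sees" only the container
nsenter --target "$PID" --mount --uts --ipc --net --pid /bin/sh
```

Note that docker exec additionally joins the container's cgroups and applies its security profile; nsenter only switches namespaces, which is exactly the point being illustrated here.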
Container Networking
How do containers connect to one another? Docker offers several network models; the most common are:
- Single host: bridge
- Cross-host: overlay
This article uses hands-on experiments to understand the bridge model.
Basic Concepts
Veth pairs: a veth is a pair of virtual NICs; a packet sent on one end is always received on the other. Using this property, we can "place" one end of the pair inside a container and plug the other end into a virtual switch. Containers attached to the same virtual switch can then reach each other over the network.
Linux bridge: a switch is a data-link-layer device that forwards layer-2 frames. The simplest forwarding strategy is to broadcast every frame arriving at an input port to all output ports. A better strategy is to learn while forwarding, recording the mapping between switch ports and MAC addresses, so that subsequent frames can be sent to the right output port based on their destination MAC.
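The learned MAC-to-port mappings can be inspected directly. A sketch using the iproute2 bridge tool (bridge1 is the device created in the experiments below; brctl showmacs bridge1 gives similar output):

```shell
# Dump the forwarding database (learned MAC addresses and the
# bridge port each was seen on) for the bridge device bridge1
bridge fdb show br bridge1
```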
We can think of a Linux bridge as a virtual switch: containers attached to the same bridge form a LAN, while different bridges are isolated from each other. docker network create [NETWORK NAME] effectively creates such a virtual switch.
iptables: containers need to access the outside world and to expose services for external access; this is where iptables comes in. Isolation between different bridges also relies on iptables.
What we call iptables comprises the user-space configuration tool (/sbin/iptables) and the in-kernel netfilter modules; the iptables command configures rules in netfilter.
Experiments
Experiment topology:
Experiment 1: container-to-container networking on a single host
Steps:
1. Create the containers (network namespaces)
[root@master1 ~]# ip netns add docker1
[root@master1 ~]# ip netns add docker2
Verify:
[root@master1 ~]# ip netns ls
docker2
docker1
[root@master1 ~]# ls /var/run/netns/ -l
-r--r--r-- 1 root root 0 May 24 15:28 docker1
-r--r--r-- 1 root root 0 May 24 15:28 docker2
2. Create the veth pairs
[root@master1 ~]# ip link add veth10 type veth peer name veth11
[root@master1 ~]# ip link add veth20 type veth peer name veth21
Verify:
[root@master1 ~]# ip link show type veth
25: veth10@veth11: <BROADCAST,MULTICAST,M-DOWN> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
link/ether 76:d1:d2:22:00:77 brd ff:ff:ff:ff:ff:ff
26: veth11@veth10: <BROADCAST,MULTICAST,M-DOWN> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
link/ether 92:4f:f5:bb:36:76 brd ff:ff:ff:ff:ff:ff
27: veth20@veth21: <BROADCAST,MULTICAST,M-DOWN> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
link/ether ba:d0:9d:df:4c:62 brd ff:ff:ff:ff:ff:ff
28: veth21@veth20: <BROADCAST,MULTICAST,M-DOWN> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
link/ether 56:96:e1:88:88:69 brd ff:ff:ff:ff:ff:ff
3. Move one end of each veth pair into its container
[root@master1 ~]# ip link set veth11 netns docker1
[root@master1 ~]# ip link set veth21 netns docker2
Verify:
[root@master1 ~]# ip netns exec docker1 ip link
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN mode DEFAULT group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
26: veth11@if25: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
link/ether 92:4f:f5:bb:36:76 brd ff:ff:ff:ff:ff:ff link-netnsid 0
[root@master1 ~]# ip netns exec docker2 ip link
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN mode DEFAULT group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
28: veth21@if27: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
link/ether 56:96:e1:88:88:69 brd ff:ff:ff:ff:ff:ff link-netnsid 0
4. Create the bridge device
[root@master1 ~]# ip link add bridge1 type bridge
The brctl tool can also be used.
Verify:
[root@master1 ~]# ip link show type bridge
29: bridge1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
link/ether f2:10:a7:ae:a1:53 brd ff:ff:ff:ff:ff:ff
5. Attach the other end of each veth pair to bridge1
[root@master1 ~]# ip link set veth10 master bridge1
[root@master1 ~]# ip link set veth20 master bridge1
Verify:
[root@master1 ~]# ip link show type veth |grep master
25: veth10@if26: <BROADCAST,MULTICAST> mtu 1500 qdisc noop master bridge1 state DOWN mode DEFAULT group default qlen 1000
27: veth20@if28: <BROADCAST,MULTICAST> mtu 1500 qdisc noop master bridge1 state DOWN mode DEFAULT group default qlen 1000
or
[root@master1 ~]# brctl show bridge1
bridge name bridge id STP enabled interfaces
bridge1 8000.76d1d2220077 no veth10
veth20
6. Assign IP addresses to the in-container NICs and bring them up
[root@master1 ~]# ip netns exec docker1 ip addr add 172.30.0.100/24 dev veth11
[root@master1 ~]# ip netns exec docker1 ip link set dev veth11 up
[root@master1 ~]# ip netns exec docker2 ip addr add 172.30.0.200/24 dev veth21
[root@master1 ~]# ip netns exec docker2 ip link set dev veth21 up
7. Assign an IP to bridge1 and bring the devices up
[root@master1 ~]# ip addr add 172.30.0.1/24 dev bridge1
[root@master1 ~]# ip link set dev bridge1 up
[root@master1 ~]# ip link set dev veth10 up
[root@master1 ~]# ip link set dev veth20 up
8. Ping test
1. Capture packets on bridge1 with tcpdump
[root@master1 ~]# tcpdump -i bridge1 -n
2. Ping docker2's IP from the docker1 namespace
[root@master1 ~]# ip netns exec docker1 ping -c 2 172.30.0.200
PING 172.30.0.200 (172.30.0.200) 56(84) bytes of data.
64 bytes from 172.30.0.200: icmp_seq=1 ttl=64 time=0.285 ms
64 bytes from 172.30.0.200: icmp_seq=2 ttl=64 time=0.068 ms
3. Capture results:
# 1. ARP request: docker1 -> docker2
15:43:35.290903 ARP, Request who-has 172.30.0.200 tell 172.30.0.100, length 28
15:43:35.290946 ARP, Reply 172.30.0.200 is-at 56:96:e1:88:88:69, length 28
# 2. ICMP echo packets
15:43:35.291013 IP 172.30.0.100 > 172.30.0.200: ICMP echo request, id 47979, seq 1, length 64
15:43:35.291068 IP 172.30.0.200 > 172.30.0.100: ICMP echo reply, id 47979, seq 1, length 64
15:43:36.343154 IP 172.30.0.100 > 172.30.0.200: ICMP echo request, id 47979, seq 2, length 64
15:43:36.343197 IP 172.30.0.200 > 172.30.0.100: ICMP echo reply, id 47979, seq 2, length 64
# 3. ARP request: docker2 -> docker1
15:43:40.439240 ARP, Request who-has 172.30.0.100 tell 172.30.0.200, length 28
15:43:40.439276 ARP, Reply 172.30.0.100 is-at 92:4f:f5:bb:36:76, length 28
Experiment 2: accessing a container's network from the host
1. Start a service inside the container, listening on port 80
[root@master1 ~]# ip netns exec docker1 nc -lp 80
2. Access it from the host
[root@master1 ~]# telnet 172.30.0.100 80
Trying 172.30.0.100...
Connected to 172.30.0.100.
Escape character is '^]'.
3. How it works
[root@master1 ~]# ip r |grep 172.30
172.30.0.0/24 dev bridge1 proto kernel scope link src 172.30.0.1
[root@master1 ~]# ip netns exec docker1 ip r
172.30.0.0/24 dev veth11 proto kernel scope link src 172.30.0.100
[root@master1 ~]# ip netns exec docker1 tcpdump -i veth11 -n
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on veth11, link-type EN10MB (Ethernet), capture size 262144 bytes
15:59:02.039154 IP6 fe80::74d1:d2ff:fe22:77 > ff02::2: ICMP6, router solicitation, length 16
15:59:10.176221 IP 172.30.0.1.55530 > 172.30.0.100.http: Flags [S], seq 674548384, win 64240, options [mss 1460,sackOK,TS val 2826573941 ecr 0,nop,wscale 7], length 0
15:59:10.176244 IP 172.30.0.100.http > 172.30.0.1.55530: Flags [S.], seq 773882025, ack 674548385, win 65160, options [mss 1460,sackOK,TS val 2679766513 ecr 2826573941,nop,wscale 7], length 0
15:59:10.176278 IP 172.30.0.1.55530 > 172.30.0.100.http: Flags [.], ack 1, win 502, options [nop,nop,TS val 2826573941 ecr 2679766513], length 0
15:59:15.351119 ARP, Request who-has 172.30.0.1 tell 172.30.0.100, length 28
15:59:15.351279 ARP, Request who-has 172.30.0.100 tell 172.30.0.1, length 28
15:59:15.351287 ARP, Reply 172.30.0.100 is-at 92:4f:f5:bb:36:76, length 28
15:59:15.351289 ARP, Reply 172.30.0.1 is-at 76:d1:d2:22:00:77, length 28
15:59:18.125125 IP 172.30.0.1.55530 > 172.30.0.100.http: Flags [P.], seq 1:6, ack 1, win 502, options [nop,nop,TS val 2826581890 ecr 2679766513], length 5: HTTP
15:59:18.125156 IP 172.30.0.100.http > 172.30.0.1.55530: Flags [.], ack 6, win 510, options [nop,nop,TS val 2679774462 ecr 2826581890], length 0
!! Note that the source IP is the IP of the bridge1 device
Experiment 3: container access to the outside world (SNAT)
1. Enable IP forwarding on the host
[root@master1 ~]# sysctl net.ipv4.conf.all.forwarding=1
2. Check the policy of the iptables FORWARD chain
[root@master1 ~]# iptables -L |grep FORWARD
Chain FORWARD (policy ACCEPT)
3. Set each container's default gateway to bridge1's IP
[root@master1 ~]# ip netns exec docker1 route add default gw 172.30.0.1 veth11
[root@master1 ~]# ip netns exec docker2 route add default gw 172.30.0.1 veth21
4. Configure the iptables SNAT rule
The container's IP address means nothing to the outside world. For a container to reach an external network, the source address of its packets must be replaced with the host's IP before they leave, so that the external host can send its response back to the host's IP as the destination.
One more thing to note: the kernel's netfilter tracks connections, so when we add an SNAT rule the system automatically adds an implicit reverse rule; in returning packets the host IP is translated back to the container IP.
[root@master1 ~]# iptables -t nat -A POSTROUTING -s 172.30.0.0/24 ! -o bridge1 -j MASQUERADE
[root@master1 ~]# iptables -t nat -L |grep 172.30
MASQUERADE all -- 172.30.0.0/24 anywhere
! The command above adds a rule to the POSTROUTING chain of the nat table: when a packet's source address is in 172.30.0.0/24 and its output device is not bridge1, apply the MASQUERADE action.
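The implicit reverse translation can be observed in the kernel's connection-tracking table while traffic is flowing. A sketch, assuming the conntrack-tools package is installed:

```shell
# List source-NATed entries in the conntrack table; each entry
# shows the original tuple (container IP) and the reply tuple
# (host IP) that netfilter rewrites automatically
conntrack -L --src-nat | grep 172.30
```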
5. Access an external address from the container
[root@master1 ~]# ip netns exec docker1 ping -c 2 114.114.114.114
PING 114.114.114.114 (114.114.114.114) 56(84) bytes of data.
64 bytes from 114.114.114.114: icmp_seq=1 ttl=83 time=33.5 ms
64 bytes from 114.114.114.114: icmp_seq=2 ttl=73 time=30.1 ms
6. Packet capture analysis
# Capture on veth11
[root@master1 ~]# ip netns exec docker1 tcpdump -i veth11 -p icmp -n
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on veth11, link-type EN10MB (Ethernet), capture size 262144 bytes
16:20:14.169626 IP 172.30.0.100 > 114.114.114.114: ICMP echo request, id 20867, seq 1, length 64
16:20:14.200613 IP 114.114.114.114 > 172.30.0.100: ICMP echo reply, id 20867, seq 1, length 64
16:20:15.170228 IP 172.30.0.100 > 114.114.114.114: ICMP echo request, id 20867, seq 2, length 64
16:20:15.201280 IP 114.114.114.114 > 172.30.0.100: ICMP echo reply, id 20867, seq 2, length 64
# Capture on eth0
[root@master1 ~]# tcpdump -i eth0 -p icmp -n
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes
16:20:14.169683 IP 10.51.104.5 > 114.114.114.114: ICMP echo request, id 20867, seq 1, length 64
16:20:14.200581 IP 114.114.114.114 > 10.51.104.5: ICMP echo reply, id 20867, seq 1, length 64
16:20:15.170309 IP 10.51.104.5 > 114.114.114.114: ICMP echo request, id 20867, seq 2, length 64
16:20:15.201238 IP 114.114.114.114 > 10.51.104.5: ICMP echo reply, id 20867, seq 2, length 64
Experiment 4: external access to a container's network (DNAT)
1. Configure the DNAT rule
[root@master1 ~]# iptables -t nat -A PREROUTING ! -i bridge1 -p tcp --dport 80 -j DNAT --to-destination 172.30.0.100:80
The command above adds a rule to the PREROUTING chain of the nat table: when the input device is not bridge1 and the destination TCP port is 80, perform destination address translation, replacing the host IP with the container IP.
2. Start a service in the container listening on port 80
[root@master1 ~]# ip netns exec docker1 nc -lp 80
3. Test access from another host
[root@node1 ~]# ip a |grep 51
inet 10.51.104.8/24 brd 10.51.104.255 scope global dynamic eth0
valid_lft 63517sec preferred_lft 63517sec
[root@node1 ~]# telnet 10.51.104.5 80
Trying 10.51.104.5...
Connected to 10.51.104.5.
Escape character is '^]'.
4. Packet capture analysis
# Capture on eth0
[root@master1 ~]# tcpdump -i eth0 tcp port 80 -n
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes
16:40:25.496628 IP 10.51.104.8.33818 > 10.51.104.5.http: Flags [S], seq 3457572293, win 64240, options [mss 1410,sackOK,TS val 317524495 ecr 0,nop,wscale 7], length 0
16:40:25.496772 IP 10.51.104.5.http > 10.51.104.8.33818: Flags [S.], seq 1034954862, ack 3457572294, win 65160, options [mss 1460,sackOK,TS val 106860443 ecr 317524495,nop,wscale 7], length 0
16:40:25.498787 IP 10.51.104.8.33818 > 10.51.104.5.http: Flags [.], ack 1, win 502, options [nop,nop,TS val 317524500 ecr 106860443], length 0
16:40:26.487974 IP 10.51.104.5.39144 > 31.13.81.4.http: Flags [S], seq 149664692, win 64240, options [mss 1460,sackOK,TS val 291139983 ecr 0,nop,wscale 7], length 0
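Once the experiments are done, the lab environment can be torn down by removing the resources in roughly reverse order. A sketch of the cleanup:

```shell
# Remove the NAT rules added in experiments 3 and 4
iptables -t nat -D PREROUTING ! -i bridge1 -p tcp --dport 80 \
    -j DNAT --to-destination 172.30.0.100:80
iptables -t nat -D POSTROUTING -s 172.30.0.0/24 ! -o bridge1 -j MASQUERADE

# Deleting the namespaces also deletes veth11/veth21 and,
# with them, their peers veth10/veth20 on the host
ip netns del docker1
ip netns del docker2

# Finally remove the virtual switch
ip link del bridge1
```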