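Incident background
1. The Docker data directory on the production host had not been relocated off the root filesystem, so the root partition filled up. We were forced to reboot the server, which released the data held in memory.
2. The alc server running in a production Docker container has to maintain 100,000 concurrent session connections.
3. The nf_conntrack table filled up, so we switched the container to host networking, disabled the server firewall, and disabled IP forwarding (net.ipv4.ip_forward=0).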
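Root cause
1. /var/log/messages was flooded with "kernel: nf_conntrack: table full, dropping packet". Once the table is full the kernel drops packets that belong to new connections, which caused severe inbound and outbound packet loss.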
Jun 21 20:33:11 vm10-254-136-6 kernel: nf_conntrack: table full, dropping packet
Jun 21 20:33:11 vm10-254-136-6 kernel: nf_conntrack: table full, dropping packet
Jun 21 20:33:11 vm10-254-136-6 kernel: nf_conntrack: table full, dropping packet
Jun 21 20:33:11 vm10-254-136-6 kernel: nf_conntrack: table full, dropping packet
Jun 21 20:33:11 vm10-254-136-6 kernel: nf_conntrack: table full, dropping packet
Jun 21 20:33:11 vm10-254-136-6 kernel: nf_conntrack: table full, dropping packet
Jun 21 20:33:11 vm10-254-136-6 kernel: nf_conntrack: table full, dropping packet
Jun 21 20:33:11 vm10-254-136-6 kernel: nf_conntrack: table full, dropping packet
Jun 21 20:33:11 vm10-254-136-6 kernel: nf_conntrack: table full, dropping packet
Jun 21 20:33:11 vm10-254-136-6 kernel: nf_conntrack: table full, dropping packet
Jun 21 20:33:16 vm10-254-136-6 kernel: nf_conntrack: table full, dropping packet
Jun 21 20:33:16 vm10-254-136-6 kernel: nf_conntrack: table full, dropping packet
Jun 21 20:33:16 vm10-254-136-6 kernel: nf_conntrack: table full, dropping packet
Jun 21 20:33:16 vm10-254-136-6 kernel: nf_conntrack: table full, dropping packet
Jun 21 20:33:16 vm10-254-136-6 kernel: nf_conntrack: table full, dropping packet
Jun 21 20:33:16 vm10-254-136-6 kernel: nf_conntrack: table full, dropping packet
Jun 21 20:33:16 vm10-254-136-6 kernel: nf_conntrack: table full, dropping packet
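To see how close the table is to its limit before the kernel starts logging drops, the current entry count can be compared with the maximum. The following is a minimal sketch; the helper name conntrack_usage is hypothetical, and it assumes a kernel that exposes the nf_conntrack counters under /proc/sys/net/netfilter.

```shell
#!/bin/sh
# Print conntrack table usage as an integer percentage.
# $1: current number of tracked connections, $2: table maximum.
# (conntrack_usage is an illustrative helper, not part of any tool.)
conntrack_usage() {
    echo $(( $1 * 100 / $2 ))
}

count_file=/proc/sys/net/netfilter/nf_conntrack_count
max_file=/proc/sys/net/netfilter/nf_conntrack_max

# These files exist only while the nf_conntrack module is loaded.
if [ -r "$count_file" ] && [ -r "$max_file" ]; then
    count=$(cat "$count_file")
    max=$(cat "$max_file")
    echo "conntrack: $count / $max ($(conntrack_usage "$count" "$max")%)"
else
    echo "nf_conntrack is not loaded; nothing to report"
fi
```

A value that stays near 100% means new connections are about to be dropped, or already are.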
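Troubleshooting
1. Re-checked the server firewall, the Docker network mode, and the net.ipv4.ip_forward=0 setting. The problem turned out to be ip_forward, which was back to 1: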
cat /proc/sys/net/ipv4/ip_forward
1
service docker status
Redirecting to /bin/systemctl status docker.service
● docker.service - Docker Application Container Engine
Loaded: loaded (/usr/lib/systemd/system/docker.service; enabled; vendor preset: disabled)
Active: active (running) since Wed 2016-06-22 09:40:58 CST; 2h 29min ago
Docs: https://docs.docker.com
Main PID: 748 (docker)
Memory: 95.6M
CGroup: /system.slice/docker.service
├─ 748 /usr/bin/docker daemon -H fd://
├─1845 docker-containerd -l /var/run/docker/libcontainerd/docker-containerd.sock --runtime docker-runc --start-timeout 2m
├─2042 docker-containerd-shim 6ee6ae880f800d3b9cb0caa42d75ed6c7711ceea037a322be3af7a4f0adffa45 /var/run/docker/libcontainerd/6ee...
├─2070 docker-containerd-shim dca4a84bdf21720ead39cf60eea3833401dd4788049ac74fe98bff4083bd0d37 /var/run/docker/libcontainerd/dca...
├─2110 docker-containerd-shim 928b687adb9f993e19526b245e8e416c60d06de7da5cde44dc173fb57761e01f /var/run/docker/libcontainerd/928...
├─2251 docker-containerd-shim 24619ec28fd7cd50893043201a10e9702cc1053e50384c50ddfd0d428516c751 /var/run/docker/libcontainerd/246...
├─2285 docker-containerd-shim ff1f6a34eb3b252bedf02896a4a9b258549d4adef20f611d891579366e91426e /var/run/docker/libcontainerd/ff1...
├─3169 docker-containerd-shim 7c0010dc518ddc74259dc72d74af8f25a04db9a1c8a4255873c1db40d8335f77 /var/run/docker/libcontainerd/7c0...
├─3611 docker-containerd-shim 6b844bb2f3b8f96b054e7f1b8b99f92dd8a08975b8f2022bc915a73b066c0f56 /var/run/docker/libcontainerd/6b8...
├─3750 docker-containerd-shim 6b844bb2f3b8f96b054e7f1b8b99f92dd8a08975b8f2022bc915a73b066c0f56 /var/run/docker/libcontainerd/6b8...
├─3779 docker-containerd-shim 6b844bb2f3b8f96b054e7f1b8b99f92dd8a08975b8f2022bc915a73b066c0f56 /var/run/docker/libcontainerd/6b8...
├─3807 docker-containerd-shim 6b844bb2f3b8f96b054e7f1b8b99f92dd8a08975b8f2022bc915a73b066c0f56 /var/run/docker/libcontainerd/6b8...
└─3832 docker-containerd-shim 6b844bb2f3b8f96b054e7f1b8b99f92dd8a08975b8f2022bc915a73b066c0f56 /var/run/docker/libcontainerd/6b8...
Jun 22 11:18:57 vm10-254-136-6.ksc.com docker[748]: time="2016-06-22T11:18:57.803889188+08:00" level=error msg="Handler for GET /v1.2...56e1a"
Jun 22 11:18:58 vm10-254-136-6.ksc.com docker[748]: time="2016-06-22T11:18:58.252585878+08:00" level=error msg="Handler for GET /v1.2...56e1a"
Jun 22 11:19:08 vm10-254-136-6.ksc.com docker[748]: time="2016-06-22T11:19:08.797803102+08:00" level=error msg="Handler for GET /v1.2...56e1a"
Jun 22 11:19:09 vm10-254-136-6.ksc.com docker[748]: time="2016-06-22T11:19:09.174213090+08:00" level=error msg="Handler for GET /v1.2...56e1a"
Jun 22 11:19:19 vm10-254-136-6.ksc.com docker[748]: time="2016-06-22T11:19:19.675408402+08:00" level=error msg="Handler for GET /v1.2...56e1a"
Jun 22 11:19:20 vm10-254-136-6.ksc.com docker[748]: time="2016-06-22T11:19:20.202181622+08:00" level=error msg="Handler for GET /v1.2...56e1a"
Jun 22 11:19:30 vm10-254-136-6.ksc.com docker[748]: time="2016-06-22T11:19:30.450311518+08:00" level=error msg="Handler for GET /v1.2...56e1a"
Jun 22 11:19:30 vm10-254-136-6.ksc.com docker[748]: time="2016-06-22T11:19:30.910231364+08:00" level=error msg="Handler for GET /v1.2...56e1a"
Jun 22 11:19:41 vm10-254-136-6.ksc.com docker[748]: time="2016-06-22T11:19:41.147869043+08:00" level=error msg="Handler for GET /v1.2...56e1a"
Jun 22 11:19:41 vm10-254-136-6.ksc.com docker[748]: time="2016-06-22T11:19:41.486591196+08:00" level=error msg="Handler for GET /v1.2...56e1a"
Warning: docker.service changed on disk. Run 'systemctl daemon-reload' to reload units.
Hint: Some lines were ellipsized, use -l to show in full.
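Note: the daemon command line, /usr/bin/docker daemon -H fd://, carries no --ip-forward option, so Docker runs with its default --ip-forward=true. After the server rebooted, docker.service started and the daemon set net.ipv4.ip_forward back to 1, which is why the conntrack table kept filling up and packets kept being dropped.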
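The same check can be scripted. This is a small sketch, assuming the option is always written in the --ip-forward=VALUE form; the function name effective_ip_forward is hypothetical.

```shell
#!/bin/sh
# Print the effective --ip-forward setting for a docker daemon command line.
# When the option is absent, Docker's default (true) applies.
# (effective_ip_forward is an illustrative helper; it only understands
# the --ip-forward=VALUE spelling.)
effective_ip_forward() {
    # $1 is left unquoted on purpose so the command line is split into words.
    for arg in $1; do
        case $arg in
            --ip-forward=*) echo "${arg#--ip-forward=}"; return ;;
        esac
    done
    echo true
}

effective_ip_forward "/usr/bin/docker daemon -H fd://"                     # prints: true
effective_ip_forward "/usr/bin/docker daemon -H fd:// --ip-forward=false"  # prints: false
```

On a live host the argument could come from the process list, for example ps -o args= -C docker, assuming the daemon process is named docker as in the status output above.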
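Solution
1. As the status output above shows, the unit file is /usr/lib/systemd/system/docker.service. The simplest fix is to append --ip-forward=false to the ExecStart command. Some Docker packages are set up slightly differently; there you may be able to add the option in /etc/sysconfig/docker-network instead, depending on the platform.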
[Unit]
Description=Docker Application Container Engine
Documentation=https://docs.docker.com
After=network.target docker.socket
Requires=docker.socket
[Service]
Type=notify
# the default is not to use systemd for cgroups because the delegate issues still
# exists and systemd currently does not support the cgroup feature set required
# for containers run by docker
ExecStart=/usr/bin/docker daemon -H fd:// --ip-forward=false
MountFlags=slave
LimitNOFILE=1048576
LimitNPROC=1048576
LimitCORE=infinity
TimeoutStartSec=0
# set delegate yes so that systemd does not reset the cgroups of docker containers
Delegate=yes
[Install]
WantedBy=multi-user.target
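An edited unit file takes effect only after systemd reloads it and the daemon restarts (the status output above already warns that docker.service changed on disk). The sketch below lists the steps; the run wrapper, the DRY_RUN variable and the function name apply_docker_unit_change are hypothetical, and by default the script only prints the commands. Two caveats: restarting the daemon typically stops running containers, and --ip-forward=false only stops Docker from enabling forwarding, so a value that is already 1 has to be reset by hand.

```shell
#!/bin/sh
# DRY_RUN=1 (the default here) only prints the commands.
# Set DRY_RUN=0 and run as root to actually apply them.
DRY_RUN=${DRY_RUN:-1}

# Illustrative wrapper: echo the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

apply_docker_unit_change() {
    run systemctl daemon-reload          # pick up the edited unit file
    run systemctl restart docker         # restart the daemon with --ip-forward=false
    run sysctl -w net.ipv4.ip_forward=0  # the option does not reset a value that is already 1
    run sysctl net.ipv4.ip_forward       # confirm the result
}

apply_docker_unit_change
```

To keep the value at 0 across reboots, net.ipv4.ip_forward = 0 would also need to be present in /etc/sysctl.conf.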
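2. If the problem persists, tune the conntrack kernel parameters: raise the table maximum to match the system's maximum number of open file handles. Our production servers use 1,000,000 on some hosts and 2,000,000 on others.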
net.ipv4.netfilter.ip_conntrack_max = 2000000
net.ipv4.netfilter.ip_conntrack_tcp_timeout_established = 1200
If that fails:
net.netfilter.nf_conntrack_max = 2000000
net.netfilter.nf_conntrack_tcp_timeout_established = 1200
Note: if sysctl -p reports an "unknown key" error, the kernel is too new for the ip_conntrack names; use the nf_conntrack parameters shown above instead.
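Rather than waiting for sysctl -p to fail, the key names can be chosen by checking which entries the running kernel exposes. A minimal sketch follows; the function name conntrack_sysctl_lines and its optional root argument are hypothetical, and the values are simply the ones used above.

```shell
#!/bin/sh
# Print sysctl.conf lines using whichever conntrack key names this kernel exposes.
# $1: root of the sysctl tree (defaults to /proc/sys; overridable for testing).
# (conntrack_sysctl_lines is an illustrative helper.)
conntrack_sysctl_lines() {
    root=${1:-/proc/sys}
    if [ -e "$root/net/netfilter/nf_conntrack_max" ]; then
        prefix=net.netfilter.nf_conntrack          # newer kernels
    elif [ -e "$root/net/ipv4/netfilter/ip_conntrack_max" ]; then
        prefix=net.ipv4.netfilter.ip_conntrack     # older kernels
    else
        echo "no conntrack sysctl keys found (module not loaded?)" >&2
        return 1
    fi
    echo "${prefix}_max = 2000000"
    echo "${prefix}_tcp_timeout_established = 1200"
}

conntrack_sysctl_lines || echo "load the module first, e.g. with modprobe nf_conntrack"
```

The printed lines can be appended to /etc/sysctl.conf and loaded with sysctl -p.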