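Problem
A server with a freshly installed Linux system kept dropping off the network. Swapping the network cable and the switch did not help, which ruled out a hardware fault.
Finding the pattern
Watching the connection with ping showed the server repeatedly disconnecting from and reconnecting to the network: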
…
…
…
…
Subtracting the sequence numbers shows that the intervals are regular:
Connected: 317 - 137 = 180
Disconnected: 616 - 318 = 298
Connected: 795 - 617 = 178
Disconnected: 1095 - 796 = 299
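The subtraction above can be reproduced with a few lines of shell. This is only a sketch: `span` is a helper name made up for illustration, and it assumes ping's default rate of one sequence number per second, so each difference approximates a duration in seconds.

```shell
#!/usr/bin/env bash
# span FIRST LAST: difference between the last and first icmp_seq of a run.
# Hypothetical helper; assumes one ping per second (ping's default interval).
span() {
    echo $(( $2 - $1 ))
}

echo "connected:    $(span 137 317)"
echo "disconnected: $(span 318 616)"
echo "connected:    $(span 617 795)"
echo "disconnected: $(span 796 1095)"
```

Under that assumption the link stays up for roughly three minutes and then stays down for roughly five, over and over.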
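Checking the driver
Suspecting a system configuration problem, I inspected the NIC with the ethtool em1 command and compared the output with that of other servers. It was identical, so the driver was ruled out.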
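One way to make that comparison less error-prone than reading two screens side by side is to save the output on each machine and diff the files. The sketch below assumes the dumps have already been collected; `same_nic_settings`, the file names, and `other-host` are placeholders, not part of the original investigation. Note that plain `ethtool em1` reports link settings, while `ethtool -i em1` reports the driver name and version, so both may be worth saving.

```shell
#!/usr/bin/env bash
# same_nic_settings FILE_A FILE_B
# Hypothetical helper: succeeds when two saved ethtool dumps are identical.
same_nic_settings() {
    diff -q "$1" "$2" >/dev/null
}

# Usage (file and host names are placeholders):
#   ethtool em1 > /tmp/this-host.txt               # on the problem server
#   ssh other-host ethtool em1 > /tmp/other.txt    # on a healthy server
#   same_nic_settings /tmp/this-host.txt /tmp/other.txt && echo "identical"
```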
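Investigating the system log
/var/log/messages showed the NetworkManager service connecting and disconnecting over and over. NetworkManager keeps requesting an address for em1 over DHCP, times out, marks the connection as failed, and immediately starts again. I concluded that NetworkManager was conflicting with the network service and that this was what kept taking the network down. An excerpt from the log: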
Mar 22 19:03:00 master01 NetworkManager[1345]: <warn> [1616454180.2962] dhcp4 (em1): request timed out
Mar 22 19:03:00 master01 NetworkManager[1345]: <info> [1616454180.2963] dhcp4 (em1): state changed unknown -> timeout
Mar 22 19:03:00 master01 NetworkManager[1345]: <info> [1616454180.3126] dhcp4 (em1): canceled DHCP transaction, DHCP client pid 52872
Mar 22 19:03:00 master01 NetworkManager[1345]: <info> [1616454180.3126] dhcp4 (em1): state changed timeout -> done
Mar 22 19:03:00 master01 NetworkManager[1345]: <info> [1616454180.3130] device (em1): state change: ip-config -> failed (reason 'ip-config-unavailable', sys-iface-state: 'managed')
Mar 22 19:03:00 master01 NetworkManager[1345]: <info> [1616454180.3137] manager: NetworkManager state is now DISCONNECTED
Mar 22 19:03:00 master01 NetworkManager[1345]: <warn> [1616454180.3142] device (em1): Activation: failed for connection 'em1'
Mar 22 19:03:00 master01 NetworkManager[1345]: <info> [1616454180.3146] device (em1): state change: failed -> disconnected (reason 'none', sys-iface-state: 'managed')
Mar 22 19:03:00 master01 NetworkManager[1345]: <info> [1616454180.3232] policy: auto-activating connection 'em1' (cd606acc-5c5e-415c-9be6-4bdb98572b1a)
Mar 22 19:03:00 master01 NetworkManager[1345]: <info> [1616454180.3240] device (em1): Activation: starting connection 'em1' (cd606acc-5c5e-415c-9be6-4bdb98572b1a)
Mar 22 19:03:00 master01 NetworkManager[1345]: <info> [1616454180.3242] device (em1): state change: disconnected -> prepare (reason 'none', sys-iface-state: 'managed')
Mar 22 19:03:00 master01 NetworkManager[1345]: <info> [1616454180.3248] manager: NetworkManager state is now CONNECTING
Mar 22 19:03:00 master01 NetworkManager[1345]: <info> [1616454180.3252] device (em1): state change: prepare -> config (reason 'none', sys-iface-state: 'managed')
Mar 22 19:03:00 master01 NetworkManager[1345]: <info> [1616454180.3751] device (em1): state change: config -> ip-config (reason 'none', sys-iface-state: 'managed')
Mar 22 19:03:00 master01 NetworkManager[1345]: <info> [1616454180.3756] dhcp4 (em1): activation: beginning transaction (timeout in 45 seconds)
Mar 22 19:03:00 master01 NetworkManager[1345]: <info> [1616454180.3779] dhcp4 (em1): dhclient started with pid 52911
Mar 22 19:03:00 master01 dhclient[52911]: DHCPDISCOVER on em1 to 255.255.255.255 port 67 interval 8 (xid=0x37146d2d)
Mar 22 19:03:08 master01 dhclient[52911]: DHCPDISCOVER on em1 to 255.255.255.255 port 67 interval 15 (xid=0x37146d2d)
Mar 22 19:03:23 master01 dhclient[52911]: DHCPDISCOVER on em1 to 255.255.255.255 port 67 interval 17 (xid=0x37146d2d)
Mar 22 19:03:40 master01 dhclient[52911]: DHCPDISCOVER on em1 to 255.255.255.255 port 67 interval 17 (xid=0x37146d2d)
Mar 22 19:03:45 master01 NetworkManager[1345]: <warn> [1616454225.2992] dhcp4 (em1): request timed out
Mar 22 19:03:45 master01 NetworkManager[1345]: <info> [1616454225.2993] dhcp4 (em1): state changed unknown -> timeout
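In the excerpt above, the two "request timed out" lines are 45 seconds apart, which matches the "timeout in 45 seconds" announced when the DHCP transaction begins. To see how often this happens over a whole log file, the timeouts can be counted. This is a minimal sketch: `count_dhcp_timeouts` is a made-up helper, and it matches only the exact message text seen in this log.

```shell
#!/usr/bin/env bash
# count_dhcp_timeouts IFACE [LOGFILE]
# Hypothetical helper: counts NetworkManager "request timed out" lines for IFACE.
# Reads LOGFILE if given, otherwise standard input.
# Like grep, it exits non-zero when the count is 0.
count_dhcp_timeouts() {
    local iface=$1
    shift
    grep -c -F "dhcp4 (${iface}): request timed out" "$@"
}

# Usage:
#   count_dhcp_timeouts em1 /var/log/messages
```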
Checking the status of NetworkManager shows the same connection activity in its log:
[root@master01 ~]# systemctl status NetworkManager
● NetworkManager.service - Network Manager
Loaded: loaded (/usr/lib/systemd/system/NetworkManager.service; enabled; vendor preset: enabled)
Active: active (running) since 一 2021-03-22 08:05:59 EDT; 12h ago
Docs: man:NetworkManager(8)
Main PID: 1345 (NetworkManager)
CGroup: /system.slice/NetworkManager.service
├─ 1345 /usr/sbin/NetworkManager --no-daemon
└─58651 /sbin/dhclient -d -q -sf /usr/libexec/nm-dhcp-helper -pf /var/run/dhclient-em1.pid -lf /var/lib/NetworkManager/dhclient-cd606acc-5c5e-41...
3月 22 21:02:15 master01.qd.china NetworkManager[1345]: <info> [1616461335.2920] device (em1): state change: disconnected -> prepare (re...aged')
3月 22 21:02:15 master01.qd.china NetworkManager[1345]: <info> [1616461335.2926] manager: NetworkManager state is now CONNECTING
3月 22 21:02:15 master01.qd.china NetworkManager[1345]: <info> [1616461335.2930] device (em1): state change: prepare -> config (reason '...aged')
3月 22 21:02:15 master01.qd.china NetworkManager[1345]: <info> [1616461335.3289] device (em1): state change: config -> ip-config (reason...aged')
3月 22 21:02:15 master01.qd.china NetworkManager[1345]: <info> [1616461335.3298] dhcp4 (em1): activation: beginning transaction (timeout...conds)
3月 22 21:02:15 master01.qd.china NetworkManager[1345]: <info> [1616461335.3337] dhcp4 (em1): dhclient started with pid 58651
3月 22 21:02:15 master01.qd.china dhclient[58651]: DHCPDISCOVER on em1 to 255.255.255.255 port 67 interval 5 (xid=0x3013b871)
3月 22 21:02:20 master01.qd.china dhclient[58651]: DHCPDISCOVER on em1 to 255.255.255.255 port 67 interval 7 (xid=0x3013b871)
3月 22 21:02:27 master01.qd.china dhclient[58651]: DHCPDISCOVER on em1 to 255.255.255.255 port 67 interval 15 (xid=0x3013b871)
3月 22 21:02:42 master01.qd.china dhclient[58651]: DHCPDISCOVER on em1 to 255.255.255.255 port 67 interval 21 (xid=0x3013b871)
Hint: Some lines were ellipsized, use -l to show in full.
Fix
Stop and disable the NetworkManager service:
systemctl status NetworkManager
systemctl stop NetworkManager
systemctl disable NetworkManager
The problem did not recur in later testing, so the issue was resolved.
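To confirm that the service really is stopped and will stay off after a reboot, its state can be checked with `systemctl is-active` and `systemctl is-enabled`. The helper below is a hypothetical sketch that only interprets the two words those commands print. It deliberately accepts nothing but "inactive" together with "disabled" or "masked", and treats every other state as "not off".

```shell
#!/usr/bin/env bash
# nm_is_off ACTIVE_STATE ENABLED_STATE
# Hypothetical helper: succeeds only when the unit is not running
# ("inactive") and will not start at boot ("disabled" or "masked").
nm_is_off() {
    local active=$1 enabled=$2
    [ "$active" = "inactive" ] &&
        { [ "$enabled" = "disabled" ] || [ "$enabled" = "masked" ]; }
}

# Usage on the server itself:
#   nm_is_off "$(systemctl is-active NetworkManager)" \
#             "$(systemctl is-enabled NetworkManager)" \
#       && echo "NetworkManager is stopped and disabled"
```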