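I recently ran into a problem with too many connections on a server, which caused some trouble. This post covers how I dealt with an excessive number of processes. If anything here is wrong or there is a better way, I would welcome corrections!
1. How to check
Linux:
a. The simplest approach: count the processes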
ps -ef |wc -l
b. Count per user, sorted in descending order
ps -ef |awk '{print $1}' |sort |uniq -c |sort -rn
[root@amenity03 ~]<20210830 17:05:16># ps -ef |awk '{print $1}' |sort |uniq -c |sort -rn
198 root
2 68
1 UID
1 rpc
1 ntp
1 dbus
With awk you can break the count down to see which users, or which kinds of processes, account for the largest share; the $1 after awk…
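To see which kinds of processes dominate, the same counting idea can be applied to the user and the command name together. This is a minimal sketch, assuming the procps version of ps; the function name count_by_user_and_command is made up for illustration, and it only looks at the first word of the command name.

```shell
# Hypothetical helper (not from the original post): reads "USER COMMAND"
# pairs on stdin and prints "count user command", most frequent first.
count_by_user_and_command() {
    awk '{print $1, $2}' | sort | uniq -c | sort -rn
}

# Usage:
#   ps -eo user=,comm= | count_by_user_and_command | head -n 10
```

A single user and command with a very large count is usually where to look first.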
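2. Approach:
1. On my server I saw a little under 3000 processes, and the per-user limit on that server is 3000, so the server was already seriously affected.
2. All the processes I found belonged to the batch user, and once grouped they all turned out to be sshd service processes. That account is shared by several parties, so the user name alone could not identify the source and I had to find another way.
3. Source IP: since the process count had grown recently, I used last to list the IPs that still had a session open.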
[etlgs@etldispatch01kf ~]<20210830 16:56:47>$ last |grep still
etlds pts/13 55.11.38.208 Tue Aug 31 17:13 still logged in
etlgs pts/9 99.15.197.27 Tue Aug 31 17:13 still logged in
etlgs pts/8 99.11.235.45 Tue Aug 31 17:11 still logged in
etlgs pts/2 99.15.197.222 Tue Aug 31 17:06 still logged in
etlgs pts/3 99.11.233.93 Tue Aug 31 17:04 still logged in
etlds pts/11 55.11.38.208 Tue Aug 31 17:03 still logged in
etlgs pts/12 55.11.39.216 Tue Aug 31 15:18 still logged in
etlgs pts/10 55.11.39.209 Tue Aug 31 11:07 still logged in
qadmsom pts/5 55.11.39.195 Tue Aug 31 10:59 still logged in
etlgs pts/7 55.11.38.137 Tue Aug 31 09:43 still logged in
etlgs pts/0 55.11.39.209 Tue Aug 31 09:07 still logged in
qadmsom pts/6 99.12.39.225 Mon Aug 30 09:11 still logged in
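And so on. Unfortunately, the production servers all sit behind container-platform addresses, so I only learned which IPs were connected and still could not determine how many connections each IP held.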
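For a rough per-IP tally of the sessions that last does report, the listing can be counted by its third column. This is only a sketch: the function name count_active_logins_by_ip is hypothetical, and sessions without a terminal may not appear in last at all, so the numbers can undercount.

```shell
# Hypothetical helper: reads `last` output on stdin and counts the
# "still logged in" entries per source address (third column).
count_active_logins_by_ip() {
    awk '/still logged in/ {print $3}' | sort | uniq -c | sort -rn
}

# Usage:
#   last | count_active_logins_by_ip
```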
4. Higher privileges: there was no way around it, so I asked the administrator for the audit log, /var/log/secure.
[root@etldispatch01kf ~]<20210830 17:25:51># cat /var/log/secure |egrep 'Accepted password for|Received disconnect from' |awk -F']:' '{print $2}' |awk -F':' '{print $1}' |awk -F'port' '{print $1}' |sort |uniq -c |sort -rn
1648 Accepted password for etlds from 55.9.131.95
1439 Received disconnect from 55.9.131.95
1352 Received disconnect from 55.9.10.202
1321 Accepted password for etlds from 55.6.136.249
1269 Accepted password for etlds from 55.9.10.202
1235 Received disconnect from 55.6.136.249
1224 Accepted password for etlds from 55.9.6.43
1204 Received disconnect from 55.9.6.43
636 Accepted password for etlds from 55.9.131.26
580 Accepted password for etlgs from 55.9.10.202
520 Accepted password for etlds from 55.9.6.124
515 Received disconnect from 55.9.6.126
513 Accepted password for etlgs from 55.9.131.95
494 Accepted password for etlds from 55.6.136.237
489 Received disconnect from 55.9.131.26
487 Accepted password for etlgs from 55.9.6.43
478 Received disconnect from 55.9.6.124
450 Accepted password for etlgs from 55.6.136.249
382 Received disconnect from 55.6.136.237
Comparing the Accepted counts with the disconnect counts makes it easy to see which IP's connections were holding on to sshd processes.
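The comparison can also be done per IP in one pass instead of by eye. This is a sketch under assumptions: the function name sessions_left_open_by_ip is made up, it expects sshd lines shaped like the ones counted above, and the difference is only a hint, because log rotation or sessions that ended without a "Received disconnect" line will skew it.

```shell
# Hypothetical helper: reads sshd log lines on stdin and prints, per IP,
#   <accepted - disconnected> <accepted> <disconnected> <ip>
# sorted so that the IPs with the most unreleased logins come first.
sessions_left_open_by_ip() {
    awk '
        /Accepted password for/ || /Received disconnect from/ {
            for (i = 1; i < NF; i++) {
                if ($i == "from") {
                    ip = $(i + 1)
                    sub(/:$/, "", ip)      # older sshd prints "IP:" here
                    seen[ip] = 1
                    if ($0 ~ /Accepted password for/) accepted[ip]++
                    else closed[ip]++
                    break
                }
            }
        }
        END {
            for (ip in seen)
                print accepted[ip] - closed[ip], accepted[ip] + 0, closed[ip] + 0, ip
        }
    ' | sort -rn
}

# Usage (needs read access to the log):
#   sessions_left_open_by_ip < /var/log/secure | head
```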
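5. Root cause: a newly launched application logged in to the server but did not release its connections after finishing its work, which drove the process count on the server to abnormal levels.
Find the sshd process of my own session: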
[root@etldispatch01kf ~]<20210830 17:36:36># ps
PID TTY TIME CMD
17509 pts/6 00:00:00 ps
35645 pts/6 00:00:00 bash
[root@etldispatch01kf ~]<20210830 17:37:06># ps -ef |grep 35645
root 17591 35645 1 17:37 pts/6 00:00:00 ps -ef
root 17592 35645 0 17:37 pts/6 00:00:00 grep 35645
root 35645 35641 0 Aug30 pts/6 00:00:00 -bash
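The parent PID in the third column of the -bash line is the process that started my shell. The lookup can be sketched as a small filter; the function name parent_pid_from_ps is hypothetical, and if the shell was started through su or sudo the parent may not be sshd, so it is worth checking the result by hand.

```shell
# Hypothetical helper: reads `ps -ef` output on stdin and prints the
# parent PID (third column) of the process whose PID is given as $1.
parent_pid_from_ps() {
    awk -v pid="$1" '$2 == pid {print $3; exit}'
}

# Usage, with $$ being the PID of the current shell:
#   ps -ef | parent_pid_from_ps "$$"
```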
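Excluding my own login session (sshd PID 35641 above, the parent of my bash) and any processes that must be kept, kill the rest. Watch out for applications the server exposes to others; this is risky, so use it with caution.
ps -ef |grep '^etlds ' |grep 'sshd' |awk '{print $2}' |grep -vx 35641 |xargs -I{} kill {}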
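Because the kill is hard to undo, it can help to split selection from killing so the PID list can be reviewed first. This is a minimal sketch, not the command I actually ran: select_sshd_pids is a made-up name, and it matches the user and the kept PID exactly rather than by substring.

```shell
# Hypothetical helper: reads `ps -ef` output on stdin and prints the PIDs
# of sshd processes owned by user $1, skipping the PID given as $2.
select_sshd_pids() {
    awk -v user="$1" -v keep="$2" \
        '$1 == user && /sshd/ && $2 != keep {print $2}'
}

# Usage: dry run first, then kill only after checking the list.
#   ps -ef | select_sshd_pids etlds 35641
#   ps -ef | select_sshd_pids etlds 35641 | xargs -r kill   # -r: GNU xargs
```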
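Tip: the server has since been migrated to the ACS cloud platform, which by design cuts off long-lived connections after 240 s; that gives the server a measure of self-protection.
Summary:
1. Monitoring: monitoring on my server was insufficient, so I only found out when the process count was about to blow up.
2. Application testing: the new application was not tested enough, and its connections were not released in time.
3. Before using kill, always confirm carefully that the operation is safe; the risk to running applications is very high.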
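For the monitoring point, even a small periodic check against the per-user limit would have warned me earlier. This is a sketch only: the function name check_process_limit and the 80% default threshold are assumptions, and the limit of 3000 is the one from my server.

```shell
# Hypothetical helper: warns when a process count reaches a given
# percentage of the limit.
#   $1 = current count, $2 = limit, $3 = threshold in percent (default 80)
check_process_limit() {
    cpl_count=$1
    cpl_limit=$2
    cpl_threshold=${3:-80}
    cpl_percent=$(( cpl_count * 100 / cpl_limit ))   # integer percentage
    if [ "$cpl_percent" -ge "$cpl_threshold" ]; then
        echo "WARNING: $cpl_count of $cpl_limit processes in use (${cpl_percent}%)"
        return 1
    fi
    echo "OK: $cpl_count of $cpl_limit processes in use (${cpl_percent}%)"
}

# Usage, for example from cron:
#   check_process_limit "$(ps -u etlds -o pid= | wc -l)" 3000 80
```

The non-zero return status on a warning lets a scheduler or wrapper script trigger an alert.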