Partially adapted from: Troubleshooting and fixing the "Too many open files" problem on Linux - Grey Zeng - 博客园 (cnblogs.com)
To meet a customer's high-concurrency, high-availability requirements, we built a gateway service for this project. We did not put Nginx (or a similar proxy) in front for load balancing because the business logic must dispatch each new request based on the current task-queue depth of every backend node; Nginx's random, round-robin, and ip_hash strategies cannot express that.
During testing the customer reported that, after a long stress test, the gateway stopped accepting new requests. The logs contained a large number of "too many open files" entries:
{"level":"error","timestamp":"2022-03-21T09:34:21+08:00","caller":"asrfile/upload.go:79","msg":"proxy request failed","lang type":"en-US","sample rate":16000,"ID":"en4","error":"dial tcp4 xxx.xxx.xxx.xxx:7100: socket: too many open files"}
{"level":"error","timestamp":"2022-03-21T09:34:21+08:00","caller":"asrfile/upload.go:79","msg":"proxy request failed","lang type":"en-US","sample rate":16000,"ID":"en3","error":"dial tcp4 xxx.xxx.xxx.xxx:7100: socket: too many open files"}
{"level":"error","timestamp":"2022-03-21T09:34:21+08:00","caller":"asrfile/upload.go:79","msg":"proxy request failed","lang type":"en-US","sample rate":16000,"ID":"en2","error":"dial tcp4 xxx.xxx.xxx.xxx:7100: socket: too many open files"}
{"level":"error","timestamp":"2022-03-21T09:34:21+08:00","caller":"asrfile/upload.go:79","msg":"proxy request failed","lang type":"en-US","sample rate":16000,"ID":"en1","error":"dial tcp4 xxx.xxx.xxx.xxx:7100: socket: too many open files"}
We checked the open files setting with the command below and saw 65535, which was already the configured maximum.
ulimit -a
We later found an article explaining that services managed by supervisor get a default open-file-descriptor limit of 1024. Checking /proc/&lt;PID&gt;/limits confirmed it: the gateway process's max open files was indeed 1024, not the 65535 reported by ulimit.
Following the fix described in that article, we edited /etc/supervisor.conf:
[supervisord]
minfds=65535 ; min. avail startup file descriptors; default 1024
Then restart the supervisor service with `systemctl restart supervisord` so the new limit takes effect.
Beyond the limit itself, the gateway also had a connection leak: some connections were never closed. You can list a process's currently open files and sockets with `lsof -p <PID>` to watch the count grow.