Fixing the nginx error "recv() failed (104: Connection reset by peer)"
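First, a bit about how I ran into this problem and the pitfalls I hit along the way.
To meet business needs, we set up a load-balanced architecture. Afterwards the site occasionally returned 500 pages, but when I went through the logs on the real backend servers I found nothing wrong.
The system was not built entirely by our own developers: the company purchased an existing system and then developed on top of it, so a lot of parties are involved. In the debug log, whenever a page returned 500 there was an interface timeout, and the call that timed out was a request to the system vendor's interface. At first I assumed the vendor was at fault, since the requests to their interface were the ones timing out, but when I asked them they said everything was fine on their side. Time to change approach.
Since the debug log pointed at a timeout, I changed every timeout in nginx and PHP to a very large value. I had thought 500s was already large, but it did not solve the problem. After reading a lot of documentation, I raised my timeouts as far as I could, as follows: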
fastcgi_buffer_size 128k;
fastcgi_buffers 4 128k;
fastcgi_busy_buffers_size 256k;
fastcgi_connect_timeout 600;
fastcgi_send_timeout 600;
fastcgi_read_timeout 600;
proxy_buffers 4 128k;
proxy_busy_buffers_size 128k;
proxy_connect_timeout 600s;
proxy_read_timeout 1200;
proxy_send_timeout 1200;
keepalive_timeout 65s;
client_header_timeout 120s;
client_body_timeout 120s;
send_timeout 30s;
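But the pages still returned 500 from time to time. Strangely, the error log showed nothing. In the php-fpm error log, however, a "busy" warning occasionally appeared, as follows: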
WARNING: [pool www] seems busy (you may need to increase pm.start_servers, or pm.min/max_spare_servers), spawning 16 children, there are 9 idle, and 62 total children
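This warning appears because the number of processes configured in the php-fpm configuration file is too low; increasing it makes the warning go away. The main parameters are: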
pm.max_children
pm.start_servers
pm.min_spare_servers
pm.max_spare_servers
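Suitable values can be worked out from your own server's capacity. After the change the php-fpm error log stopped reporting the warning, but the 500 pages were still not fixed. How frustrating!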
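As a rough illustration of working out a value from the server's capacity, a common rule of thumb divides the memory you are willing to give php-fpm by the average size of one worker process. The sketch below is not from the original troubleshooting; the function name, the parameter names (`available_mb`, `avg_process_mb`, `headroom`) and the 0.8 safety factor are assumptions made for illustration.

```python
def estimate_max_children(available_mb: float,
                          avg_process_mb: float,
                          headroom: float = 0.8) -> int:
    """Estimate an upper bound for pm.max_children from a memory budget.

    available_mb   -- memory (MB) that may be used by php-fpm workers (assumed input)
    avg_process_mb -- measured average resident size of one worker (assumed input)
    headroom       -- fraction of the budget actually handed out (assumed 0.8)
    """
    if available_mb <= 0 or avg_process_mb <= 0:
        raise ValueError("memory figures must be positive")
    if not 0 < headroom <= 1:
        raise ValueError("headroom must be in (0, 1]")
    # Floor division: a partial worker does not count. Never return less than 1.
    return max(1, int(available_mb * headroom // avg_process_mb))


# Example: 4096 MB budget, workers averaging 40 MB each
print(estimate_max_children(4096, 40))
```

The result is only a ceiling for pm.max_children; pm.start_servers and the min/max spare settings still need to be chosen below it and adjusted to the observed load.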
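I kept digging and wondered whether load balancing itself was the cause, so I removed the load balancer entirely and served traffic directly from a single web server. After watching it for a full day and night the problem still occurred, which ruled out the architecture, so I put the load balancer back. Only then did it occur to me that the load balancer also keeps its own error log. I looked at it and, wow, it was full of one and the same error: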
2019/06/19 07:10:53 [error] 6744#0: *7779178 recv() failed (104: Connection reset by peer) while reading response header from upstream, client: 104.206.96.10, server: c.bailitop.com, request: "GET /login HTTP/1.1", upstream: "http:/****/login", host: "c.bailitop.com", referrer: "http://c.bailitop.com/my/courses/learning"
2019/06/19 07:11:09 [error] 6746#0: *7779311 recv() failed (104: Connection reset by peer) while reading response header from upstream, client: 157.55.39.99, server: c.bailitop.com, request: "GET /user/3417/learn HTTP/1.1", upstream: "http://****/user/3417/learn", host: "c.bailitop.com"
2019/06/19 07:11:29 [error] 6747#0: *7779428 recv() failed (104: Connection reset by peer) while reading response header from upstream, client: 223.72.77.219, server: c.bailitop.com, request: "GET /my/course/1337 HTTP/1.1", upstream: "http://****/my/course/1337", host: "c.bailitop.com", referrer: "http://c.bailitop.com/"
2019/06/19 07:11:29 [error] 6742#0: *7744356 recv() failed (104: Connection reset by peer) while reading response header from upstream, client: 106.39.2.230, server: c.bailitop.com, request: "GET /course/1414/activity/50238/live_trigger?eventName=doing&data%5BlastTime%5D=1560859104&data%5Bevents%5D%5Bwatching%5D%5BwatchTime%5D=60 HTTP/1.1", upstream: "http://****/course/1414/activity/50238/live_trigger?eventName=doing&data%5BlastTime%5D=1560859104&data%5Bevents%5D%5Bwatching%5D%5BwatchTime%5D=60", host: "c.bailitop.com", referrer: "http://c.bailitop.com/course/1414/activity/50238/live_entry"
2019/06/19 07:11:58 [error] 6746#0: *7779597 recv() failed (104: Connection reset by peer) while reading response header from upstream, client: 40.77.167.145, server: c.bailitop.com, request: "GET /user/9470/teach HTTP/1.1", upstream: "http://****/user/9470/teach", host: "c.bailitop.com"
2019/06/19 07:12:13 [error] 6741#0: *7779646 recv() failed (104: Connection reset by peer) while reading response header from upstream, client: 40.77.167.145, server: c.bailitop.com, request: "GET /course/explore/djt?subCategory=mylg1&orderBy=latest HTTP/1.1", upstream: "http://****/course/explore/djt?subCategory=mylg1&orderBy=latest", host: "c.bailitop.com"
2019/06/19 07:12:28 [error] 6742#0: *7779697 recv() failed (104: Connection reset by peer) while reading response header from upstream, client: 104.206.96.10, server: c.bailitop.com, request: "GET /register?goto=%2Fcourse%2Fexplore%2Fact%3Ffilter%255BcurrentLevelId%255D%3Dall%26filter%255Bprice%255D%3Dall%26filter%255Btype%255D%3Dall%26orderBy%3DrecommendedSeq%26page%3D8%26tag%255Btags%255D%255B8%255D%3D34 HTTP/1.1", upstream: "http://****/register?goto=%2Fcourse%2Fexplore%2Fact%3Ffilter%255BcurrentLevelId%255D%3Dall%26filter%255Bprice%255D%3Dall%26filter%255Btype%255D%3Dall%26orderBy%3DrecommendedSeq%26page%3D8%26tag%255Btags%255D%255B8%255D%3D34", host: "c.bailitop.com", referrer: "http://www.bailitop.com"
2019/06/19 07:12:29 [error] 6742#0: *7779705 recv() failed (104: Connection reset by peer) while reading response header from upstream, client: 106.39.2.230, server: c.bailitop.com, request: "GET /course/1414/activity/50238/live_trigger?eventName=doing&data%5BlastTime%5D=1560859104&data%5Bevents%5D%5Bwatching%5D%5BwatchTime%5D=60 HTTP/1.1", upstream: "http://****/course/1414/activity/50238/live_trigger?eventName=doing&data%5BlastTime%5D=1560859104&data%5Bevents%5D%5Bwatching%5D%5BwatchTime%5D=60", host: "c.bailitop.com", referrer: "http://c.bailitop.com/course/1414/activity/50238/live_entry"
2019/06/19 07:13:29 [error] 6742#0: *7779705 recv() failed (104: Connection reset by peer) while reading response header from upstream, client: 106.39.2.230, server: c.bailitop.com, request: "GET /course/1414/activity/50238/live_trigger?eventName=doing&data%5BlastTime%5D=1560859104&data%5Bevents%5D%5Bwatching%5D%5BwatchTime%5D=60 HTTP/1.1", upstream: "http://****/course/1414/activity/50238/live_trigger?eventName=doing&data%5BlastTime%5D=1560859104&data%5Bevents%5D%5Bwatching%5D%5BwatchTime%5D=60", host: "c.bailitop.com", referrer: "http://c.bailitop.com/course/1414/activity/50238/live_entry"
2019/06/19 07:13:32 [error] 6742#0: *7779959 recv() failed (104: Connection reset by peer) while reading response header from upstream, client: 14.116.141.249, server: c.bailitop.com, request: "GET /files/course/2019/03-01/1717233a72f9696815.jpg HTTP/1.1", upstream: "http://****/files/course/2019/03-01/1717233a72f9696815.jpg", host: "c.bailitop.com", referrer: "http://usa.bailitop.com/topics/20120712/8653.html"
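When the log is full of entries like the ones above, it can help to see which request paths are affected most. The following is a minimal sketch, assuming the error-log line format shown above; the function name `count_resets_by_path` and the regular expression are illustrative choices, not part of nginx.

```python
import re
from collections import Counter

# The exact message text as it appears in the nginx error log above.
RESET_MARKER = "recv() failed (104: Connection reset by peer)"

# Captures the URI from a fragment such as: request: "GET /login HTTP/1.1"
REQUEST_RE = re.compile(r'request: "[A-Z]+ (?P<uri>\S+) [^"]*"')


def count_resets_by_path(lines):
    """Count 'connection reset by peer' errors per request path.

    `lines` is any iterable of log lines (a list or an open file).
    Query strings are dropped so that variants of one URL are grouped.
    """
    counts = Counter()
    for line in lines:
        if RESET_MARKER not in line:
            continue  # unrelated log entry
        match = REQUEST_RE.search(line)
        if match is None:
            continue  # entry without a request field
        path = match.group("uri").split("?", 1)[0]
        counts[path] += 1
    return counts
```

Passing an open log file to `count_resets_by_path` and printing `.most_common(10)` on the result gives a quick ranking of the paths that hit the error most often.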
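At last, an actual error message. I searched Baidu for this error and found many posts, all recommending the same fix: set request_terminate_timeout to 0 in the php-fpm configuration file. I did that and restarted php-fpm, but the problem was not solved and the error kept appearing. In the end I determined that the cause was buffer values that were set too small. The final fix was:
Increase the buffer values in the nginx configuration file. Here is a comparison of the settings before and after the change:
Before: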
client_max_body_size 1024m;
client_body_buffer_size 512k;
client_header_buffer_size 512k;
proxy_buffers 4 64k;
proxy_busy_buffers_size 64k;
After:
client_max_body_size 1024m;
client_body_buffer_size 10m;
client_header_buffer_size 10m;
proxy_buffers 4 128k;
proxy_busy_buffers_size 128k;
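When raising these values, note that nginx checks proxy_busy_buffers_size against proxy_buffers at configuration time: as far as I understand the check, it must be at least as large as the bigger of proxy_buffer_size and a single buffer, and must not exceed the total of all buffers minus one. The sketch below reproduces that check under those assumptions; `parse_size` and `busy_buffers_ok` are hypothetical helper names, and the 4k default for proxy_buffer_size is an assumption (the real default depends on the platform page size).

```python
def parse_size(text: str) -> int:
    """Convert an nginx size such as '128k' or '10m' to bytes."""
    text = text.strip().lower()
    units = {"k": 1024, "m": 1024 * 1024}
    if text and text[-1] in units:
        return int(text[:-1]) * units[text[-1]]
    return int(text)  # plain number of bytes


def busy_buffers_ok(buffers_num: int, buffers_size: str,
                    busy_size: str, buffer_size: str = "4k") -> bool:
    """Check a proxy_busy_buffers_size value against proxy_buffers.

    buffers_num, buffers_size -- the two arguments of proxy_buffers
    busy_size                 -- the proxy_busy_buffers_size value
    buffer_size               -- proxy_buffer_size (assumed default: 4k)
    """
    one_buffer = parse_size(buffers_size)
    busy = parse_size(busy_size)
    lower = max(parse_size(buffer_size), one_buffer)
    upper = (buffers_num - 1) * one_buffer
    return lower <= busy <= upper


# The "after" settings from this post: proxy_buffers 4 128k; busy 128k
print(busy_buffers_ok(4, "128k", "128k"))
```

Running `nginx -t` after editing the configuration remains the authoritative check; this helper only makes the relationship between the two directives explicit.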
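After increasing the parameters I monitored the error log live: the recv() failed (104: Connection reset by peer) error no longer appeared, and I saw no more 500 pages. Problem solved. It took a long time because I kept looking in the wrong direction. The lesson for next time is to go after the real cause from the start and avoid so many detours.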
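Another possible cause of this error is that the port between the real backend server and the load balancer has not been opened. For example, if the backend server listens on port 8091 and the firewall does not allow port 8091, the same error can occur.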
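To rule out the port problem quickly, you can try a plain TCP connection from the load balancer to the backend. This is a minimal sketch using Python's standard `socket` module; the function name `port_is_reachable` and the host in the usage comment are placeholders, and only port 8091 comes from the example above.

```python
import socket


def port_is_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be established.

    Both a refused connection and a silently dropped one (timeout),
    which is typical of a firewall rule, are reported as False.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Usage, run on the load balancer (the host name is a placeholder):
# port_is_reachable("backend.internal.example", 8091)
```

A False result only says that the TCP handshake failed; it does not tell you whether the firewall, the routing, or a stopped backend service is responsible.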