面对现象PHP,【PHP问题定位】2018-07-02 fpm掉底分析

顺风车运营研发团队 黄桃

问题现象

某机器这段时间出现cpu-idle掉地的报警 如图:

bVbc16m?w=1754&h=920

原因分析

查看当时的监控(php-fpm-idle、cpu-idle,io-wait、io-write 等)

(1)php-fpm-idle今天出现两次突降,一次是12点左右,一次是16:30左右,查看整周也经常出现突降,如图

bVbc16H?w=2558&h=900

(2)io-wait在11:58分时 突然升高

bVbc16S?w=2550&h=950

(3)io-write也在11:58时出现大量写:

bVbc16X?w=2558&h=956

(4)cpu-idle当时短暂出现降低,之后出现徒增,但是结合整周的曲线来看,一直维持在70-80之间,徒增原因待分析:

bVbc17E?w=2556&h=928

(5)原因推测:因为当时出现了大批量的写日志,导致io-wait上升,php-fpm进程因为写文件出现延时,造成整体响应过慢,从而导致了fpm掉地; 对同一个文件进行 write时,大批量并行会出现等待,阻塞

验证推测

(1)查看当时的php-fpm的慢日志,看当时阻塞的地方,基本是在调用fwrite阻塞

bVbc18g?w=1744&h=1034

(2)查看当时的程序日志trace.log的大小,日志文件越大的时间段,正好是fpm-idle下滑严重的阶段:

bVbc18q?w=988&h=990

(3)在通过 sar命令,验证下当时的写入磁盘情况,在出现掉地的时间段确实出现极大的写入,wr_sec/s 从每秒几百低峰的几百增长到十几万:

bVbc18W?w=1686&h=1222

问题原因及优化建议

写入日志大可能两方面原因:

(1)当时请求暴增

(2)请求未暴增,但是某些请求触发了某些不合理的打日志

验证原因1,出现问题时间段每秒的流量

32 [02/Jul/2018:12:01:12

18 [02/Jul/2018:12:01:13

18 [02/Jul/2018:12:01:14

42 [02/Jul/2018:12:01:15

30 [02/Jul/2018:12:01:16

35 [02/Jul/2018:12:01:17

26 [02/Jul/2018:12:01:18

30 [02/Jul/2018:12:01:19

1 [02/Jul/2018:12:01:22

108 [02/Jul/2018:12:01:24

17 [02/Jul/2018:12:01:25

1 [02/Jul/2018:12:01:27

1 [02/Jul/2018:12:01:29

1 [02/Jul/2018:12:01:30

9 [02/Jul/2018:12:01:33

1 [02/Jul/2018:12:01:31

1 [02/Jul/2018:12:01:32

146 [02/Jul/2018:12:01:33

62 [02/Jul/2018:12:01:34

44 [02/Jul/2018:12:01:35

1 [02/Jul/2018:12:01:37

1 [02/Jul/2018:12:01:38

1 [02/Jul/2018:12:01:41

2 [02/Jul/2018:12:01:44

2 [02/Jul/2018:12:01:50

1 [02/Jul/2018:12:01:45

12 [02/Jul/2018:12:01:50

2 [02/Jul/2018:12:01:45

7 [02/Jul/2018:12:01:50

1 [02/Jul/2018:12:01:46

1 [02/Jul/2018:12:01:50

1 [02/Jul/2018:12:01:46

15 [02/Jul/2018:12:01:50

7 [02/Jul/2018:12:01:48

2 [02/Jul/2018:12:01:50

1 [02/Jul/2018:12:01:48

342 [02/Jul/2018:12:01:50

65 [02/Jul/2018:12:01:51

46 [02/Jul/2018:12:01:52

54 [02/Jul/2018:12:01:53

1 [02/Jul/2018:12:01:55

1 [02/Jul/2018:12:01:56

1 [02/Jul/2018:12:01:57

1 [02/Jul/2018:12:01:59

16 [02/Jul/2018:12:02:03

1 [02/Jul/2018:12:02:01

1 [02/Jul/2018:12:02:02

42 [02/Jul/2018:12:02:03

1 [02/Jul/2018:12:02:02

187 [02/Jul/2018:12:02:03

39 [02/Jul/2018:12:02:04

40 [02/Jul/2018:12:02:05

25 [02/Jul/2018:12:02:06

44 [02/Jul/2018:12:02:07

29 [02/Jul/2018:12:02:08

正常情况是QPS是30左右,出问题时间段则非常不稳定,时高时低,相差非常大,例如在 12:01:50 时 342 qps,但是前十几秒则基本都是个位数;原因?前几十秒都阻塞了,响应不了,积累到了12:01:50才响应;整体流量并未出现暴增;

验证原因2,查看当时traceId日志,查看写进去的内容:

发现这段写入非常巨大,一行就119kb ,总共写了 33555 行,总大小占:33555 * 119KB = 3993045KB =3899M 基本可以断定为这行出的问题了

bVbc2bk?w=2162&h=1102

优化建议:在底层打日志的类中对字符串长度做限制,避免这种大批量的写入;

评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值