tornado底层epoll陷入死循环,cpu100%
该进程所有fd详情信息:其中fd对应信息如下所示
4:tornado的epoll监听树的(epfd)句柄
3:listen监听socket
5:怀疑是gevent启动的epoll监听树的(epfd)句柄
lrwx------ 1 root root 64 Feb 22 15:42 0 -> /1
lrwx------ 1 root root 64 Feb 22 15:42 1 -> /1
lrwx------ 1 root root 64 Feb 22 15:42 10 -> socket:[3858679084]
lrwx------ 1 root root 64 Feb 22 15:42 11 -> socket:[3858531359]
lrwx------ 1 root root 64 Feb 22 15:42 12 -> socket:[3893062439]
lrwx------ 1 root root 64 Feb 22 15:42 13 -> socket:[4021272049]
lrwx------ 1 root root 64 Feb 22 15:42 14 -> socket:[3736829588]
lrwx------ 1 root root 64 Feb 22 15:42 15 -> socket:[3858821377]
lrwx------ 1 root root 64 Feb 19 13:42 2 -> /1
lrwx------ 1 root root 64 Feb 22 15:42 3 -> socket:[3736762451]
lrwx------ 1 root root 64 Feb 22 15:42 35 -> socket:[3858851998]
lrwx------ 1 root root 64 Feb 22 15:42 36 -> socket:[3858856577]
lrwx------ 1 root root 64 Feb 22 15:42 37 -> socket:[3858548454]
lrwx------ 1 root root 64 Feb 22 15:42 38 -> socket:[3858769612]
lrwx------ 1 root root 64 Feb 22 15:42 39 -> socket:[3858868371]
lrwx------ 1 root root 64 Feb 22 15:42 4 -> anon_inode:[eventpoll]
lrwx------ 1 root root 64 Feb 22 15:42 40 -> socket:[3858870568]
lrwx------ 1 root root 64 Feb 22 15:42 41 -> socket:[3858870569]
lrwx------ 1 root root 64 Feb 22 15:42 5 -> anon_inode:[eventfd]
l-wx------ 1 root root 64 Feb 22 15:42 6 -> /opt/log/api_20220225102901.log
lrwx------ 1 root root 64 Feb 22 15:42 7 -> socket:[3736762452]
lrwx------ 1 root root 64 Feb 22 15:42 8 -> socket:[3736745829]
lrwx------ 1 root root 64 Feb 22
tornado服务,监听fd为:3和7
python 32 root 3u IPv4 3736762451 0t0 TCP *:http (LISTEN)
python 32 root 7u IPv6
堆栈一直在循环打印下面的信息(10,35,36,37,15,39,11,38)
15:24:57.286761 epoll_wait(4, [{EPOLLIN, {u32=10, u64=145376053035018}}, {EPOLLIN, {u32=35, u64=119443040501795}}, {EPOLLIN|EPOLLERR|EPOLLHUP, {u32=36, u64=105853763977252}}, {EPOLLIN|EPOLLERR|EPOLLHUP, {u32=37, u64=92676804313125}}, {EPOLLIN|EPOLLERR|EPOLLHUP, {u32=15, u64=163247411953679}}, {EPOLLIN|EPOLLERR|EPOLLHUP, {u32=39, u64=47201690583079}}, {EPOLLIN, {u32=11, u64=156190780686347}}, {EPOLLIN|EPOLLERR|EPOLLHUP, {u32=38, u64=66795331387430}}], 64, 140) = 8
15:24:57.286845 epoll_ctl(4, EPOLL_CTL_MOD, 36, {EPOLLIN, {u32=36, u64=105853763977252}}) = 0
15:24:57.286902 epoll_ctl(4, EPOLL_CTL_MOD, 37, {EPOLLIN, {u32=37, u64=92676804313125}}) = 0
15:24:57.286960 epoll_ctl(4, EPOLL_CTL_MOD, 15, {EPOLLIN, {u32=15, u64=163247411953679}}) = 0
15:24:57.287029 epoll_ctl(4, EPOLL_CTL_MOD, 39, {EPOLLIN, {u32=39, u64=47201690583079}}) = 0
15:24:57.287088 epoll_ctl(4, EPOLL_CTL_MOD, 38, {EPOLLIN, {u32=38, u64=66795331387430}}) = 0
疑惑点
如果是陷入tornado epoll的死循环,epoll_wait为啥不能接受新的连接请求(也就是获取到客户端的请求),因为epoll_wait没有看到fd=3的读事件,按理说是可以的。
是fd=3 socket已经重红黑树里面删掉了还是有其他原因。当然也有可能是没看到,如果有的话,应该会多处于CLOSE_WAIT状态的socket
其中fd=10,fd=35,fd=11的都是tcp连接,服务端TCP状态为Close_WAIT(远程客户端调用服务端的请求,信息发送完毕,客户端已经发起并关闭socket,服务端socket接收缓冲区有数据),不知道为啥不会处理?
下面是这两个fd对应的socket信息:netstat -p 查出来的数据
Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name
tcp 1586 0 zte-rdcloud-testce:http 172.30.0.4:60970 CLOSE_WAIT 32/python
tcp 4088 0 zte-rdcloud-testce:http 172.30.0.4:43076 CLOSE_WAIT 32/python
tcp 4088 0 zte-rdcloud-testce:http 172.30COMMAND PID USER FD TYPE .0.4:43076 CLOSE_WAIT 32/python
fd:10,35,31用lsof查出来的数据:
lsof:
python 32 root 10u IPv4 3858679084 0t0 TCP zte-rdcloud-testcenter-testlib-41-fwx5f:http->172.30.0.4:60970 (CLOSE_WAIT)
python 32 root 35u IPv4 3858851998 0t0 TCP zte-rdcloud-testcenter-testlib-41-fwx5f:http->172.30.0.4:43076 (CLOSE_WAIT)
python 32 root 11u IPv4 3858531359 0t0 TCP zte-rdcloud-testcenter-testlib-41-fwx5f:http->172.30.0.3:40016 (CLOSE_WAIT)
其他fd(15,36,37,38,39):netstat 查不出来,lsof查出来的数据
lsof:
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
python 32 root 15u sock 0,7 0t0 3858821377 protocol: TCP
python 32 root 36u sock 0,7 0t0 3858856577 protocol: TCP
python 32 root 37u sock 0,7 0t0 3858548454 protocol: TCP
python 32 root 38u sock 0,7 0t0 3858769612 protocol: TCP
python 32 root 39u sock 0,7 0t0 3858868371 protocol: TCP
---------------------------------------------------------------------------------------------------------------------------------
新进展:
进程号:41
tornado:死循环调用epoll_wait,又不处理socket里面的数据
strace堆栈信息:循环打印下面的信息,显示一直循环调用。这这个socket都是客户端的连接,并且已经发送完数据
epoll_wait(4, [{EPOLLIN, {u32=24, u64=523904405733400}}, {EPOLLIN, {u32=25, u64=551181743030297}}, {EPOLLIN, {u32=15, u64=528938107404303}}, {EPOLLIN, {u32=37, u64=382690176008229}}, {EPOLLIN, {u32=27, u64=446019468787739}}, {EPOLLIN, {u32=26, u64=499624955609114}}, {EPOLLIN, {u32=38, u64=326211356065830}}], 64, 256) = 7
netstat:查看这几个socket,tcp状态信息(显示接受队列有数据,但是不读取)
Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name
tcp 1 0 zte-rdcloud-testce:http 172.30.0.2:50762 CLOSE_WAIT 41/python
tcp 1830 0 zte-rdcloud-testce:http 172.30.0.4:38716 CLOSE_WAIT 41/python
tcp 884 0 zte-rdcloud-testce:http 172.30.0.3:51108 CLOSE_WAIT 41/python
tcp 1011 0 zte-rdcloud-testce:http 172.30.0.3:36216 CLOSE_WAIT 41/python
tcp 1 0 zte-rdcloud-testce:http 172.30.0.2:53818 CLOSE_WAIT 41/python
tcp 1 0 zte-rdcloud-testce:http 172.30.0.4:56326 CLOSE_WAIT 41/python
tcp 6979 0 zte-rdcloud-testce:http 172.30.0.3:50746 CLOSE_WAIT 41/python
通过strace命令分析tornado底层机制
1.web:tornado,service:one thread(单实例)
服务启动的状态
socket信息:
r-x------ 1 zxj zxj 64 2月 26 16:01 0 -> pipe:[13412432]
l-wx------ 1 zxj zxj 64 2月 26 16:01 1 -> pipe:[13412433]
l-wx------ 1 zxj zxj 64 2月 26 16:01 2 -> pipe:[13412434]
lrwx------ 1 zxj zxj 64 2月 26 16:01 3 -> socket:[13403746]
lrwx------ 1 zxj zxj 64 2月 26 16:01 4 -> anon_inode:[eventpoll]
lrwx------ 1 zxj zxj 64 2月 26 16:02 5 -> socket:[13403747]
lrwx------ 1 zxj zxj 64 2月 26 16:01 6 -> socket:[13403748]
0,1,2:标准的输入输出和错误流
3:监听socket
4.epoll句柄(epoll_create返回的句柄fd)
5,6:不清楚是干什么用的
netstat 状态
netstat -p | grep 7246
Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name
unix 3 [ ] 流 已连接 13403748 7246/python3.8
unix 3 [ ] 流 已连接 13403747 7246/python3.8
lsof:
lsof -p 7246
ython3.8 7246 zxj 0r FIFO 0,12 0t0 13412432 pipe
python3.8 7246 zxj 1w FIFO 0,12 0t0 13412433 pipe
python3.8 7246 zxj 2w FIFO 0,12 0t0 13412434 pipe
python3.8 7246 zxj 3u IPv4 13403746 0t0 TCP localhost:http-alt (LISTEN)
python3.8 7246 zxj 4u a_inode 0,13 0 10692 [eventpoll]
python3.8 7246 zxj 5u unix 0xffff9cad2e347800 0t0 13403747 type=STREAM
python3.8 7246 zxj 6u unix 0xffff9cad2e340c00 0t0 13403748 type=STREAM
epfd里面目前只注册了listen socket的读事件
sreace:
strace: Process 7246 attached
epoll_wait(4,
没有任何读写事件发生,阻塞
(1)正常连接和关闭
strace:
监听到客户端连接,创建一个socket:fd=7 和客户端通信,请求完成后,并没有关闭socket:fd=7,应该是会复用
epoll_wait(4, [{EPOLLIN, {u32=3, u64=19828060119367683}}], 2, -1) = 1
accept4(3, {sa_family=AF_INET, sin_port=htons(41198), sin_addr=inet_addr("127.0.0.1")}, [16], SOCK_CLOEXEC) = 7
getsockname(7, {sa_family=AF_INET, sin_port=htons(8080), sin_addr=inet_addr("127.0.0.1")}, [16]) = 0
ioctl(7, FIONBIO, [1]) = 0
accept4(3, 0x7ffc93b1caa0, 0x7ffc93b1ca8c, SOCK_CLOEXEC) = -1 EAGAIN (Resource temporarily unavailable)
epoll_wait(4, [], 2, 0) = 0
recvfrom(7, "GET /test2 HTTP/1.1\r\nUser-Agent:"..., 65536, 0, NULL, NULL) = 231
mmap(NULL, 262144, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7eff5208d000
epoll_ctl(4, EPOLL_CTL_ADD, 7, {EPOLLIN, {u32=7, u64=19828060119367687}}) = 0
epoll_wait(4, [], 3, 0) = 0
sendto(7, "HTTP/1.1 200 OK\r\nServer: Tornado"..., 199, 0, NULL, 0) = 199
setsockopt(7, SOL_TCP, TCP_NODELAY, [1], 4) = 0
setsockopt(7, SOL_TCP, TCP_NODELAY, [0], 4) = 0
epoll_wait(4, [], 3, 0) = 0
epoll_wait(4, [], 3, 0) = 0
recvfrom(7, 0x25f52c0, 65536, 0, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
epoll_wait(4,
(2)客户端非正常关闭
网络状态:
tcp 1 0 localhost:http-alt localhost:41198 CLOSE_WAIT 7246/python3.8
unix 3 [ ] 流 已连接 13403748 7246/python3.8
unix 3 [ ] 流 已连接 13403747 7246/python3.8
python3.8 7246 zxj 7u IPv4 13454501 0t0 TCP localhost:http-alt->localhost:41198 (CLOSE_WAIT)
strace:发现客户端socket异常后关闭socket,并重红黑树里面删去
epoll_wait(4, [{EPOLLIN, {u32=7, u64=19828060119367687}}], 3, 3600000) = 1
recvfrom(7, "GET /test1 HTTP/1.1\r\nContent-Typ"..., 65536, 0, NULL, NULL) = 515
epoll_ctl(4, EPOLL_CTL_DEL, 7, 0x7ffc93b1c650) = 0
epoll_wait(4, [], 2, 0) = 0
epoll_wait(4, [], 2, 0) = 0
epoll_ctl(4, EPOLL_CTL_ADD, 7, {EPOLLIN, {u32=7, u64=19828060119367687}}) = 0
epoll_wait(4, [], 3, 0) = 0
select(0, NULL, NULL, NULL, {35, 0}) = 0 (Timeout)
write(1, "xxxxx", 5) = 5
write(1, "\n", 1) = 1
sendto(7, "HTTP/1.1 200 OK\r\nServer: Tornado"..., 199, 0, NULL, 0) = 199
setsockopt(7, SOL_TCP, TCP_NODELAY, [1], 4) = 0
setsockopt(7, SOL_TCP, TCP_NODELAY, [0], 4) = 0
epoll_wait(4, [{EPOLLIN|EPOLLERR|EPOLLHUP, {u32=7, u64=19828060119367687}}], 3, 0) = 1
recvfrom(7, "", 65536, 0, NULL, NULL) = 0
epoll_ctl(4, EPOLL_CTL_DEL, 7, 0x7ffc93b1bfb0) = 0
close(7) = 0
epoll_wait(4, [], 2, 0) = 0
epoll_wait(4, [], 2, 0) = 0
epoll_wait(4,
2.tornado+gevent
(1):使用,monkey.patch()替换阻塞的系统调用
发现:一个实例的场景下,time.Sleep(),不会阻塞,这个时候两个请求打过来,服务端陷入
while true:
epoll_wait()的循环调用里面,cpu吃到100%,但是不理解的是,为啥无法处理listen socket的连接请求
cpu:100%
(2)gevent 使用gevent.spawn(),内部使用阻塞的系统调用
两个请求打过来,能正常处理listen socket的连接请求
(3)gevent 内部死循环
因为GIL锁,实例CPU陷入cpu循环调用,cpu100%,而且无法进入到epoll的IO_loop里面
综上所述:猜测上述形成的原因是使用了monkey.patch,但是没有使用gevent.spawn,调用阻塞系统调用导致的
下一个需要澄清的问题:进入io_loop,并且监听socket fd里面有可读事件,但是没法处理