tornado服务某个实例核吃满问题

tornado底层epoll陷入死循环,cpu100%

该进程所有fd详情信息:其中fd对应信息如下所示
4:tornado的epoll监听树的(epfd)句柄
3:listen监听socket
5:怀疑是gevent启动的epoll监听树的(epfd)句柄            
lrwx------ 1 root root 64 Feb 22 15:42 0 -> /1
lrwx------ 1 root root 64 Feb 22 15:42 1 -> /1
lrwx------ 1 root root 64 Feb 22 15:42 10 -> socket:[3858679084]
lrwx------ 1 root root 64 Feb 22 15:42 11 -> socket:[3858531359]
lrwx------ 1 root root 64 Feb 22 15:42 12 -> socket:[3893062439]
lrwx------ 1 root root 64 Feb 22 15:42 13 -> socket:[4021272049]
lrwx------ 1 root root 64 Feb 22 15:42 14 -> socket:[3736829588]
lrwx------ 1 root root 64 Feb 22 15:42 15 -> socket:[3858821377]
lrwx------ 1 root root 64 Feb 19 13:42 2 -> /1
lrwx------ 1 root root 64 Feb 22 15:42 3 -> socket:[3736762451]
lrwx------ 1 root root 64 Feb 22 15:42 35 -> socket:[3858851998]
lrwx------ 1 root root 64 Feb 22 15:42 36 -> socket:[3858856577]
lrwx------ 1 root root 64 Feb 22 15:42 37 -> socket:[3858548454]
lrwx------ 1 root root 64 Feb 22 15:42 38 -> socket:[3858769612]
lrwx------ 1 root root 64 Feb 22 15:42 39 -> socket:[3858868371]
lrwx------ 1 root root 64 Feb 22 15:42 4 -> anon_inode:[eventpoll]
lrwx------ 1 root root 64 Feb 22 15:42 40 -> socket:[3858870568]
lrwx------ 1 root root 64 Feb 22 15:42 41 -> socket:[3858870569]
lrwx------ 1 root root 64 Feb 22 15:42 5 -> anon_inode:[eventfd]
l-wx------ 1 root root 64 Feb 22 15:42 6 -> /opt/log/api_20220225102901.log
lrwx------ 1 root root 64 Feb 22 15:42 7 -> socket:[3736762452]
lrwx------ 1 root root 64 Feb 22 15:42 8 -> socket:[3736745829]
lrwx------ 1 root root 64 Feb 22


tornado服务,监听fd为:3和7
python   32 root    3u     IPv4         3736762451      0t0        TCP *:http (LISTEN)
python   32 root    7u     IPv6     

堆栈一直在循环打印下面的信息(10,35,36,37,15,39,11,38)
15:24:57.286761 epoll_wait(4, [{EPOLLIN, {u32=10, u64=145376053035018}}, {EPOLLIN, {u32=35, u64=119443040501795}}, {EPOLLIN|EPOLLERR|EPOLLHUP, {u32=36, u64=105853763977252}}, {EPOLLIN|EPOLLERR|EPOLLHUP, {u32=37, u64=92676804313125}}, {EPOLLIN|EPOLLERR|EPOLLHUP, {u32=15, u64=163247411953679}}, {EPOLLIN|EPOLLERR|EPOLLHUP, {u32=39, u64=47201690583079}}, {EPOLLIN, {u32=11, u64=156190780686347}}, {EPOLLIN|EPOLLERR|EPOLLHUP, {u32=38, u64=66795331387430}}], 64, 140) = 8
15:24:57.286845 epoll_ctl(4, EPOLL_CTL_MOD, 36, {EPOLLIN, {u32=36, u64=105853763977252}}) = 0
15:24:57.286902 epoll_ctl(4, EPOLL_CTL_MOD, 37, {EPOLLIN, {u32=37, u64=92676804313125}}) = 0
15:24:57.286960 epoll_ctl(4, EPOLL_CTL_MOD, 15, {EPOLLIN, {u32=15, u64=163247411953679}}) = 0
15:24:57.287029 epoll_ctl(4, EPOLL_CTL_MOD, 39, {EPOLLIN, {u32=39, u64=47201690583079}}) = 0
15:24:57.287088 epoll_ctl(4, EPOLL_CTL_MOD, 38, {EPOLLIN, {u32=38, u64=66795331387430}}) = 0

疑惑点
如果是陷入tornado epoll的死循环,epoll_wait为啥不能接受新的连接请求(也就是获取到客户端的请求),因为epoll_wait没有看到fd=3的读事件,按理说是可以的。
是fd=3 socket已经重红黑树里面删掉了还是有其他原因。当然也有可能是没看到,如果有的话,应该会多处于CLOSE_WAIT状态的socket


其中fd=10,fd=35,fd=11的都是tcp连接,服务端TCP状态为Close_WAIT(远程客户端调用服务端的请求,信息发送完毕,客户端已经发起并关闭socket,服务端socket接收缓冲区有数据),不知道为啥不会处理?
下面是这两个fd对应的socket信息:netstat -p 查出来的数据
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name
tcp     1586      0 zte-rdcloud-testce:http 172.30.0.4:60970        CLOSE_WAIT  32/python
tcp     4088      0 zte-rdcloud-testce:http 172.30.0.4:43076        CLOSE_WAIT  32/python
tcp     4088      0 zte-rdcloud-testce:http 172.30COMMAND PID USER   FD      TYPE     .0.4:43076        CLOSE_WAIT  32/python

fd:10,35,31用lsof查出来的数据:
lsof:
python   32 root   10u     IPv4         3858679084      0t0        TCP zte-rdcloud-testcenter-testlib-41-fwx5f:http->172.30.0.4:60970 (CLOSE_WAIT)
python   32 root   35u     IPv4         3858851998      0t0        TCP zte-rdcloud-testcenter-testlib-41-fwx5f:http->172.30.0.4:43076 (CLOSE_WAIT)
python   32 root   11u     IPv4         3858531359      0t0        TCP zte-rdcloud-testcenter-testlib-41-fwx5f:http->172.30.0.3:40016 (CLOSE_WAIT)


其他fd(15,36,37,38,39):netstat 查不出来,lsof查出来的数据
lsof:
COMMAND PID USER   FD      TYPE             DEVICE SIZE/OFF       NODE NAME
python   32 root   15u     sock                0,7      0t0 3858821377 protocol: TCP
python   32 root   36u     sock                0,7      0t0 3858856577 protocol: TCP
python   32 root   37u     sock                0,7      0t0 3858548454 protocol: TCP
python   32 root   38u     sock                0,7      0t0 3858769612 protocol: TCP
python   32 root   39u     sock                0,7      0t0 3858868371 protocol: TCP

---------------------------------------------------------------------------------------------------------------------------------

新进展:

进程号:41
tornado:死循环调用epoll_wait,又不处理socket里面的数据

strace堆栈信息:循环打印下面的信息,显示一直循环调用。这这个socket都是客户端的连接,并且已经发送完数据
epoll_wait(4, [{EPOLLIN, {u32=24, u64=523904405733400}}, {EPOLLIN, {u32=25, u64=551181743030297}}, {EPOLLIN, {u32=15, u64=528938107404303}}, {EPOLLIN, {u32=37, u64=382690176008229}}, {EPOLLIN, {u32=27, u64=446019468787739}}, {EPOLLIN, {u32=26, u64=499624955609114}}, {EPOLLIN, {u32=38, u64=326211356065830}}], 64, 256) = 7

netstat:查看这几个socket,tcp状态信息(显示接受队列有数据,但是不读取)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name
tcp        1      0 zte-rdcloud-testce:http 172.30.0.2:50762        CLOSE_WAIT  41/python
tcp     1830      0 zte-rdcloud-testce:http 172.30.0.4:38716        CLOSE_WAIT  41/python
tcp      884      0 zte-rdcloud-testce:http 172.30.0.3:51108        CLOSE_WAIT  41/python
tcp     1011      0 zte-rdcloud-testce:http 172.30.0.3:36216        CLOSE_WAIT  41/python
tcp        1      0 zte-rdcloud-testce:http 172.30.0.2:53818        CLOSE_WAIT  41/python
tcp        1      0 zte-rdcloud-testce:http 172.30.0.4:56326        CLOSE_WAIT  41/python
tcp     6979      0 zte-rdcloud-testce:http 172.30.0.3:50746        CLOSE_WAIT  41/python

通过strace命令分析tornado底层机制

1.web:tornado,service:one thread(单实例)

服务启动的状态

socket信息:

r-x------ 1 zxj zxj 64 2月 26 16:01 0 -> pipe:[13412432]

l-wx------ 1 zxj zxj 64 2月 26 16:01 1 -> pipe:[13412433]

l-wx------ 1 zxj zxj 64 2月 26 16:01 2 -> pipe:[13412434]

lrwx------ 1 zxj zxj 64 2月 26 16:01 3 -> socket:[13403746]

lrwx------ 1 zxj zxj 64 2月 26 16:01 4 -> anon_inode:[eventpoll]

lrwx------ 1 zxj zxj 64 2月 26 16:02 5 -> socket:[13403747]

lrwx------ 1 zxj zxj 64 2月 26 16:01 6 -> socket:[13403748]

0,1,2:标准的输入输出和错误流

3:监听socket

4.epoll句柄(epoll_create返回的句柄fd)

5,6:不清楚是干什么用的

netstat 状态

netstat -p | grep 7246

Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name

unix 3 [ ] 流 已连接 13403748 7246/python3.8

unix 3 [ ] 流 已连接 13403747 7246/python3.8

lsof:

lsof -p 7246

ython3.8 7246 zxj 0r FIFO 0,12 0t0 13412432 pipe

python3.8 7246 zxj 1w FIFO 0,12 0t0 13412433 pipe

python3.8 7246 zxj 2w FIFO 0,12 0t0 13412434 pipe

python3.8 7246 zxj 3u IPv4 13403746 0t0 TCP localhost:http-alt (LISTEN)

python3.8 7246 zxj 4u a_inode 0,13 0 10692 [eventpoll]

python3.8 7246 zxj 5u unix 0xffff9cad2e347800 0t0 13403747 type=STREAM

python3.8 7246 zxj 6u unix 0xffff9cad2e340c00 0t0 13403748 type=STREAM

epfd里面目前只注册了listen socket的读事件

sreace:

strace: Process 7246 attached

epoll_wait(4,

没有任何读写事件发生,阻塞

(1)正常连接和关闭

strace:

监听到客户端连接,创建一个socket:fd=7 和客户端通信,请求完成后,并没有关闭socket:fd=7,应该是会复用

epoll_wait(4, [{EPOLLIN, {u32=3, u64=19828060119367683}}], 2, -1) = 1

accept4(3, {sa_family=AF_INET, sin_port=htons(41198), sin_addr=inet_addr("127.0.0.1")}, [16], SOCK_CLOEXEC) = 7

getsockname(7, {sa_family=AF_INET, sin_port=htons(8080), sin_addr=inet_addr("127.0.0.1")}, [16]) = 0

ioctl(7, FIONBIO, [1]) = 0

accept4(3, 0x7ffc93b1caa0, 0x7ffc93b1ca8c, SOCK_CLOEXEC) = -1 EAGAIN (Resource temporarily unavailable)

epoll_wait(4, [], 2, 0) = 0

recvfrom(7, "GET /test2 HTTP/1.1\r\nUser-Agent:"..., 65536, 0, NULL, NULL) = 231

mmap(NULL, 262144, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7eff5208d000

epoll_ctl(4, EPOLL_CTL_ADD, 7, {EPOLLIN, {u32=7, u64=19828060119367687}}) = 0

epoll_wait(4, [], 3, 0) = 0

sendto(7, "HTTP/1.1 200 OK\r\nServer: Tornado"..., 199, 0, NULL, 0) = 199

setsockopt(7, SOL_TCP, TCP_NODELAY, [1], 4) = 0

setsockopt(7, SOL_TCP, TCP_NODELAY, [0], 4) = 0

epoll_wait(4, [], 3, 0) = 0

epoll_wait(4, [], 3, 0) = 0

recvfrom(7, 0x25f52c0, 65536, 0, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)

epoll_wait(4,

(2)客户端非正常关闭

网络状态:

tcp 1 0 localhost:http-alt localhost:41198 CLOSE_WAIT 7246/python3.8

unix 3 [ ] 流 已连接 13403748 7246/python3.8

unix 3 [ ] 流 已连接 13403747 7246/python3.8

python3.8 7246 zxj 7u IPv4 13454501 0t0 TCP localhost:http-alt->localhost:41198 (CLOSE_WAIT)

strace:发现客户端socket异常后关闭socket,并重红黑树里面删去

epoll_wait(4, [{EPOLLIN, {u32=7, u64=19828060119367687}}], 3, 3600000) = 1

recvfrom(7, "GET /test1 HTTP/1.1\r\nContent-Typ"..., 65536, 0, NULL, NULL) = 515

epoll_ctl(4, EPOLL_CTL_DEL, 7, 0x7ffc93b1c650) = 0

epoll_wait(4, [], 2, 0) = 0

epoll_wait(4, [], 2, 0) = 0

epoll_ctl(4, EPOLL_CTL_ADD, 7, {EPOLLIN, {u32=7, u64=19828060119367687}}) = 0

epoll_wait(4, [], 3, 0) = 0

select(0, NULL, NULL, NULL, {35, 0}) = 0 (Timeout)

write(1, "xxxxx", 5) = 5

write(1, "\n", 1) = 1

sendto(7, "HTTP/1.1 200 OK\r\nServer: Tornado"..., 199, 0, NULL, 0) = 199

setsockopt(7, SOL_TCP, TCP_NODELAY, [1], 4) = 0

setsockopt(7, SOL_TCP, TCP_NODELAY, [0], 4) = 0

epoll_wait(4, [{EPOLLIN|EPOLLERR|EPOLLHUP, {u32=7, u64=19828060119367687}}], 3, 0) = 1

recvfrom(7, "", 65536, 0, NULL, NULL) = 0

epoll_ctl(4, EPOLL_CTL_DEL, 7, 0x7ffc93b1bfb0) = 0

close(7) = 0

epoll_wait(4, [], 2, 0) = 0

epoll_wait(4, [], 2, 0) = 0

epoll_wait(4,

2.tornado+gevent

(1):使用,monkey.patch()替换阻塞的系统调用

发现:一个实例的场景下,time.Sleep(),不会阻塞,这个时候两个请求打过来,服务端陷入

while true:

   epoll_wait()的循环调用里面,cpu吃到100%,但是不理解的是,为啥无法处理listen socket的连接请求

cpu:100%

(2)gevent 使用gevent.spawn(),内部使用阻塞的系统调用

两个请求打过来,能正常处理listen socket的连接请求

(3)gevent 内部死循环

因为GIL锁,实例CPU陷入cpu循环调用,cpu100%,而且无法进入到epoll的IO_loop里面

综上所述:猜测上述形成的原因是使用了monkey.patch,但是没有使用gevent.spawn,调用阻塞系统调用导致的

下一个需要澄清的问题:进入io_loop,并且监听socket fd里面有可读事件,但是没法处理

  • 0
    点赞
  • 0
    收藏
    觉得还不错? 一键收藏
  • 0
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值