Socket Error 104 bug

Original link: https://segmentfault.com/a/1190000017864652

Bug overview

Tech stack

  • nginx
  • uwsgi
  • bottle

Error details

The alerting bot frequently reports warnings like the following:

<27>1 2018-xx-xxT06:59:03.038Z 660ece0ebaad admin/admin 14 - - Socket Error: 104
<31>1 2018-xx-xxT06:59:03.038Z 660ece0ebaad admin/admin 14 - - Removing timeout for next heartbeat interval
<28>1 2018-xx-xxT06:59:03.039Z 660ece0ebaad admin/admin 14 - - Socket closed when connection was open
<31>1 2018-xx-xxT06:59:03.039Z 660ece0ebaad admin/admin 14 - - Added: {'callback': <bound method SelectConnection._on_connection_start of <pika.adapters.select_connection.SelectConnection object at 0x7f74752525d0>>, 'only': None, 'one_shot': True, 'arguments': None, 'calls': 1}
<28>1 2018-xx-xxT06:59:03.039Z 660ece0ebaad admin/admin 14 - - Disconnected from RabbitMQ at xx_host:5672 (0): Not specified
<31>1 2018-xx-xxT06:59:03.039Z 660ece0ebaad admin/admin 14 - - Processing 0:_on_connection_closed
<31>1 2018-xx-xxT06:59:03.040Z 660ece0ebaad admin/admin 14 - - Calling <bound method _CallbackResult.set_value_once of <pika.adapters.blocking_connection._CallbackResult object at 0x7f74752513f8>> for "0:_on_connection_closed"

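Debugging process

Locating where the error is logged

Having a log line makes this straightforward: first find out where it is emitted. There are three places to look.

Our own code

Not there.

The uwsgi code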

root@660ece0ebaad:/# uwsgi --version
2.0.14
Checked out the source from GitHub: not there either.

Python library code

Run the following inside the container:

>>> import sys
>>> sys.path
['', '/usr/lib/python2.7', '/usr/lib/python2.7/plat-x86_64-linux-gnu', '/usr/lib/python2.7/lib-tk', '/usr/lib/python2.7/lib-old', '/usr/lib/python2.7/lib-dynload', '/usr/local/lib/python2.7/dist-packages', '/usr/lib/python2.7/dist-packages', '/usr/lib/python2.7/dist-packages/PILcompat', '/usr/lib/python2.7/dist-packages/gtk-2.0']

Grepping under these directories finds the message in pika:

root@660ece0ebaad:/usr/local/lib/python2.7# grep "Socket Error" -R .
Binary file ./dist-packages/pika/adapters/base_connection.pyc matches
./dist-packages/pika/adapters/base_connection.py:            LOGGER.error("Fatal Socket Error: %r", error_value)
./dist-packages/pika/adapters/base_connection.py:            LOGGER.error("Socket Error: %s", error_code)
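When grep is not available in the container, the same search can be approximated from Python. The following is a minimal sketch; the function name find_log_string is hypothetical and not part of any library, and it only scans .py source files.

```python
import os
import sys


def find_log_string(needle, roots=None):
    """Return sorted paths of .py files under roots whose bytes contain needle."""
    roots = sys.path if roots is None else roots
    wanted = needle.encode("utf-8")
    hits = set()
    for root in roots:
        if not os.path.isdir(root):
            continue  # sys.path may contain '' or zip archives
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                if not name.endswith(".py"):
                    continue
                path = os.path.join(dirpath, name)
                try:
                    with open(path, "rb") as handle:
                        if wanted in handle.read():
                            hits.add(path)
                except (IOError, OSError):
                    continue  # unreadable file: skip it
    return sorted(hits)
```

Calling find_log_string("Socket Error") with no second argument walks every directory on sys.path, which mirrors the grep above but can be slow on large installations.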

Confirm the pika version:

>>> import pika
>>> pika.__version__
'0.10.0'

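Determining the error logic

The code shows that the number logged after "Socket Error" is an errno code, and code 104 means the peer reset the connection (sent a RST).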

>>> import errno
>>> errno.errorcode[104]
'ECONNRESET'
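The numeric value of an errno constant is platform-specific (104 is the Linux value of ECONNRESET), so it is usually safer to look it up by name. A small sketch, where describe_errno is a hypothetical helper:

```python
import errno
import os


def describe_errno(code):
    """Return (symbolic name, OS message) for an errno value."""
    name = errno.errorcode.get(code, "UNKNOWN")
    return name, os.strerror(code)


# Look the code up by name so the example does not depend on the platform.
print(describe_errno(errno.ECONNRESET))
```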

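The first suspicion was a wrong RabbitMQ server address, since a port with no listener does answer with a RST. Verification showed that was not the cause.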
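One way to check this hypothesis locally: a RST received during connect normally surfaces as ECONNREFUSED rather than ECONNRESET. The sketch below assumes a Linux-like TCP stack and a usable loopback interface; the function name is hypothetical. It binds a socket without calling listen() so the port is reserved but has no listener.

```python
import errno
import socket


def connect_errno_without_listener():
    """Connect to a bound-but-not-listening TCP port and return the errno."""
    holder = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    holder.bind(("127.0.0.1", 0))  # reserve a free port, never listen()
    port = holder.getsockname()[1]
    client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    client.settimeout(5)
    try:
        client.connect(("127.0.0.1", port))
        return 0  # unexpected: something accepted the connection
    except socket.error as exc:
        return exc.errno
    finally:
        client.close()
        holder.close()


print(errno.errorcode.get(connect_errno_without_listener()))
```

If the address had been wrong, the failure would most likely have appeared at connect time with this error, not later as error 104 on an established connection.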
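The next suspicion was that the connection had timed out and been closed without the client noticing. The RabbitMQ server log contained many entries like this: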

=ERROR REPORT==== 7-Dec-2018::20:43:18 ===
closing AMQP connection <0.9753.18> (172.17.0.19:27542 -> 192.168.44.112:5672):
missed heartbeats from client, timeout: 60s
--
=ERROR REPORT==== 7-Dec-2018::20:43:18 ===
closing AMQP connection <0.9768.18> (172.17.0.19:27544 -> 192.168.44.112:5672):
missed heartbeats from client, timeout: 60s

All connections between the RabbitMQ server and the admin Docker container had already been closed; the following command printed nothing:

root@xxxxxxx:/home/dingxinglong# netstat -nap | grep 5672  | grep "172.17.0.19"

So why does the RabbitMQ server drop the connections that pika opened? The pika docstring says:

    :param int heartbeat_interval: How often to send heartbeats.
                              Min between this value and server's proposal
                              will be used. Use 0 to deactivate heartbeats
                              and None to accept server's proposal.
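The rule in this docstring can be written out as a small function. This is only a sketch of the rule as the docstring words it, not pika's actual negotiation code, and effective_heartbeat is a hypothetical name.

```python
def effective_heartbeat(client_value, server_value):
    """Heartbeat interval in seconds per the docstring's stated rule.

    None -> accept the server's proposal
    0    -> heartbeats deactivated
    else -> the smaller of the two values
    """
    if client_value is None:
        return server_value
    if client_value == 0:
        return 0
    return min(client_value, server_value)


print(effective_heartbeat(None, 60))  # client passed nothing, server proposes 60
print(effective_heartbeat(0, 60))     # client disables heartbeats
```

Under this reading, a client that passes no value against a server proposing 60 ends up with a 60 second heartbeat, which matches the timeout in the server log above.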

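We did not pass a heartbeat interval, so in theory the client should use the server's default of 60 s. In practice, the client never sent a single heartbeat frame, so we kept reading the code.
Print statements confirmed that the HeartbeatChecker object was created and that it registered its timer, but the timer callback never fired.
Following the code further: we use blocking_connection, and the docstring of its add_timeout says: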

def add_timeout(self, deadline, callback_method):
    """Create a single-shot timer to fire after deadline seconds. Do not
    confuse with Tornado's timeout where you pass in the time you want to
    have your callback called. Only pass in the seconds until it's to be
    called.

    NOTE: the timer callbacks are dispatched only in the scope of
    specially-designated methods: see
    `BlockingConnection.process_data_events` and
    `BlockingChannel.start_consuming`.

    :param float deadline: The number of seconds to wait to call callback
    :param callable callback_method: The callback method with the signature
        callback_method()

Timers are dispatched only from process_data_events, which we never call, so the client heartbeat was never sent. We simply disabled heartbeats to resolve the problem.
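The behaviour described in that docstring can be illustrated without a broker. The sketch below is a toy model, not pika code: LazyTimerQueue and process_events are hypothetical names standing in for the blocking connection and its process_data_events method, and the clock is injected so the example is deterministic.

```python
import heapq
import itertools


class LazyTimerQueue(object):
    """Single-threaded timers that fire only inside process_events()."""

    def __init__(self, clock):
        self._clock = clock
        self._heap = []
        self._counter = itertools.count()  # tie-breaker for equal deadlines

    def add_timeout(self, delay, callback):
        """Schedule callback to become due delay seconds from now."""
        entry = (self._clock() + delay, next(self._counter), callback)
        heapq.heappush(self._heap, entry)

    def process_events(self):
        """Run every due callback; return how many were run."""
        fired = 0
        now = self._clock()
        while self._heap and self._heap[0][0] <= now:
            _deadline, _seq, callback = heapq.heappop(self._heap)
            callback()
            fired += 1
        return fired


now = [0.0]   # fake clock, advanced by hand
sent = []
timers = LazyTimerQueue(clock=lambda: now[0])
timers.add_timeout(30, lambda: sent.append("heartbeat"))

now[0] = 120.0            # two minutes pass, nobody drives the loop
print(sent)               # [] : the timer is overdue but has not fired
timers.process_events()
print(sent)               # ['heartbeat']
```

A web worker that only publishes occasionally is in the first situation: the timer exists, but nothing ever gives it a chance to run.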

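The exact trigger

The calling code is shown below. It never runs a main loop, so the FIN packet from the RabbitMQ server is never processed and the client cannot track the connection state.
We traced the code of the basic_publish interface all the way down.
On send, a RST comes back, and execution ends up in the _handle_error function at base_connection.py:452, which logs the socket error.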

import json

import pika


def connect_mq():
    mq_conf = xxxxx
    connection = pika.BlockingConnection(
        pika.ConnectionParameters(mq_conf['host'],
                                  int(mq_conf['port']),
                                  mq_conf['path'],
                                  pika.PlainCredentials(mq_conf['user'],
                                                        mq_conf['pwd']),
                                  heartbeat_interval=0))
    channel = connection.channel()
    channel.exchange_declare(exchange=xxxxx, type='direct', durable=True)
    return channel

channel = connect_mq()

def notify_xxxxx():
    global channel

    def _publish(product):
        channel.basic_publish(exchange=xxxxx,
                              routing_key='xxxxx',
                              body=json.dumps({'msg': 'xxxxx'}))