缘由
这篇博文源于群里一个群友的提问
在 redis 里面存放了一个 1000w 长度的 list,然后使用 lrange 0 -1 全取出来,这会用很久。这时候我新建个连接,继续其他 key 的读写操作都是可以的。不应该是阻塞吗?
那么接下来就来分析为什么会这样,也就是对应标题中 Redis 是如何回复命令的。
注:本文中 Redis 版本为 6.2.4
数据存放位置
Redis 执行完命令后,会把回复的内存写入到当前客户端的两个地方 buf 和 reply,其中 buf 长度固定,大小为 16k, reply 是链表形式,存放前者超出的数据。
// server.h
#define PROTO_REPLY_CHUNK_BYTES (16*1024) /* 16k output buffer */
typedef struct client {
……
list *reply; /* List of reply objects to send to the client. */
……
/* Response buffer */
int bufpos;
char buf[PROTO_REPLY_CHUNK_BYTES];
……
}
以 lrange 来追下代码,便可一目了然。
// t_list.c
lrangeCommand
addListRangeReply
addReplyBulkCBuffer
addReplyProto
接下来着重看 addReplyProto 方法。
// networking.c
void addReplyProto(client *c, const char *s, size_t len) {
if (prepareClientToWrite(c) != C_OK) return;
if (_addReplyToBuffer(c,s,len) != C_OK)
_addReplyProtoToList(c,s,len);
}
上面代码中第二个判断说的就是,若是当前客户端中 buf 数组没有可用的空间,便存入到 reply 链表里,具体逻辑可见以下两个函数。
// networking.c
int _addReplyToBuffer(client *c, const char *s, size_t len) {
size_t available = sizeof(c->buf)-c->bufpos;
if (c->flags & CLIENT_CLOSE_AFTER_REPLY) return C_OK;
/* If there already are entries in the reply list, we cannot
* add anything more to the static buffer. */
if (listLength(c->reply) > 0) return C_ERR;
/* Check that the buffer has enough space available for this string. */
if (len > available) return C_ERR;
memcpy(c->buf+c->bufpos,s,len);
c->bufpos+=len;
return C_OK;
}
/* Adds the reply to the reply linked list.
* Note: some edits to this function need to be relayed to AddReplyFromClient. */
void _addReplyProtoToList(client *c, const char *s, size_t len) {
if (c->flags & CLIENT_CLOSE_AFTER_REPLY) return;
listNode *ln = listLast(c->reply);
clientReplyBlock *tail = ln? listNodeValue(ln): NULL;
/* Note that 'tail' may be NULL even if we have a tail node, because when
* addReplyDeferredLen() is used, it sets a dummy node to NULL just
* fo fill it later, when the size of the bulk length is set. */
/* Append to tail string when possible. */
if (tail) {
/* Copy the part we can fit into the tail, and leave the rest for a
* new node */
size_t avail = tail->size - tail->used;
size_t copy = avail >= len? len: avail;
memcpy(tail->buf + tail->used, s, copy);
tail->used += copy;
s += copy;
len -= copy;
}
if (len) {
/* Create a new node, make sure it is allocated to at
* least PROTO_REPLY_CHUNK_BYTES */
size_t size = len < PROTO_REPLY_CHUNK_BYTES? PROTO_REPLY_CHUNK_BYTES: len;
tail = zmalloc(size + sizeof(clientReplyBlock));
/* take over the allocation's internal fragmentation */
tail->size = zmalloc_usable_size(tail) - sizeof(clientReplyBlock);
tail->used = len;
memcpy(tail->buf, s, len);
listAddNodeTail(c->reply, tail);
c->reply_bytes += tail->size;
closeClientOnOutputBufferLimitReached(c, 1);
}
}
发送时机
特别需要注意的是,Redis 在处理完一个命令后,不会立即发送数据的,而是会把当前客户端添加到 clients_pending_write 等待写入返回值客户端队列中。
lrangeCommand
addListRangeReply
addReplyBulkCBuffer
addReply
prepareClientToWrite
clientInstallWriteHandler
/* This function puts the client in the queue of clients that should write
* their output buffers to the socket. Note that it does not *yet* install
* the write handler, to start clients are put in a queue of clients that need
* to write, so we try to do that before returning in the event loop (see the
* handleClientsWithPendingWrites() function).
* If we fail and there is more data to write, compared to what the socket
* buffers can hold, then we'll really install the handler. */
void clientInstallWriteHandler(client *c) {
/* Schedule the client to write the output buffers to the socket only
* if not already done and, for slaves, if the slave can actually receive
* writes at this stage. */
if (!(c->flags & CLIENT_PENDING_WRITE) &&
(c->replstate == REPL_STATE_NONE ||
(c->replstate == SLAVE_STATE_ONLINE && !c->repl_put_online_on_ack)))
{
/* Here instead of installing the write handler, we just flag the
* client and put it into a list of clients that have something
* to write to the socket. This way before re-entering the event
* loop, we can try to directly write to the client sockets avoiding
* a system call. We'll only really install the write handler if
* we'll not be able to write the whole reply at once. */
c->flags |= CLIENT_PENDING_WRITE;
listAddNodeHead(server.clients_pending_write,c);
}
}
到此,命令执行逻辑就结束了,需要等待 Redis 下一次事件循环处理时,将数据从输出缓冲区写入到 socket 中。
void aeMain(aeEventLoop *eventLoop) {
eventLoop->stop = 0;
while (!eventLoop->stop) {
aeProcessEvents(eventLoop, AE_ALL_EVENTS|
AE_CALL_BEFORE_SLEEP|
AE_CALL_AFTER_SLEEP);
}
}
int aeProcessEvents(aeEventLoop *eventLoop, int flags)
{
……
if (eventLoop->beforesleep != NULL && flags & AE_CALL_BEFORE_SLEEP)
eventLoop->beforesleep(eventLoop);
……
}
每次事件循环时,都会执行 beforesleep 方法,注意,Redis 给客户端发送数据就是在这里完成的。
注:Redis 6 版本中在接收和发送数据的时候加入了多线程,不过默认是关闭的,这里便不考虑多线程模式。
void beforeSleep(struct aeEventLoop *eventLoop) {
……
/* Handle writes with pending output buffers. */
handleClientsWithPendingWritesUsingThreads();
……
}
int handleClientsWithPendingWritesUsingThreads(void) {
int processed = listLength(server.clients_pending_write);
if (processed == 0) return 0; /* Return ASAP if there are no clients. */
/* If I/O threads are disabled or we have few clients to serve, don't
* use I/O threads, but the boring synchronous code. */
if (server.io_threads_num == 1 || stopThreadedIOIfNeeded()) {
return handleClientsWithPendingWrites();
}
……
}
/* This function is called just before entering the event loop, in the hope
* we can just write the replies to the client output buffer without any
* need to use a syscall in order to install the writable event handler,
* get it called, and so forth. */
int handleClientsWithPendingWrites(void) {
listIter li;
listNode *ln;
int processed = listLength(server.clients_pending_write);
listRewind(server.clients_pending_write,&li);
while((ln = listNext(&li))) {
client *c = listNodeValue(ln);
c->flags &= ~CLIENT_PENDING_WRITE;
listDelNode(server.clients_pending_write,ln);
/* If a client is protected, don't do anything,
* that may trigger write error or recreate handler. */
if (c->flags & CLIENT_PROTECTED) continue;
/* Don't write to clients that are going to be closed anyway. */
if (c->flags & CLIENT_CLOSE_ASAP) continue;
/* Try to write buffers to the client socket. */
if (writeToClient(c,0) == C_ERR) continue;
/* If after the synchronous writes above we still have data to
* output to the client, we need to install the writable handler. */
if (clientHasPendingReplies(c)) {
int ae_barrier = 0;
/* For the fsync=always policy, we want that a given FD is never
* served for reading and writing in the same event loop iteration,
* so that in the middle of receiving the query, and serving it
* to the client, we'll call beforeSleep() that will do the
* actual fsync of AOF to disk. the write barrier ensures that. */
if (server.aof_state == AOF_ON &&
server.aof_fsync == AOF_FSYNC_ALWAYS)
{
ae_barrier = 1;
}
if (connSetWriteHandlerWithBarrier(c->conn, sendReplyToClient, ae_barrier) == C_ERR) {
freeClientAsync(c);
}
}
}
return processed;
}
beforesleep 方法最终会调用 handleClientsWithPendingWrites() 方法,此方法便会遍历上述中 clients_pending_write 列表,调用 writeToClient 方法来将返回数据从输出缓存区写入到 soceket 中,如果还未写完,这时会调用 connSetWriteHandlerWithBarrier 方法中 aeCreateFileEvent 来注册一个写数据事件处理器 sendReplyToClient,等待 Redis 事件机制的再次调用。
上述中的事件机制就在主循环体中,也就是 beforesleep 执行之后的 aeApiPoll 中,因为刚刚注册了一个事件,这里便可以把此事件给读到,然后再执行写入的动作。
int aeProcessEvents(aeEventLoop *eventLoop, int flags)
{
……
if (eventLoop->beforesleep != NULL && flags & AE_CALL_BEFORE_SLEEP)
eventLoop->beforesleep(eventLoop);
/* Call the multiplexing API, will return only on timeout or when
* some event fires. */
numevents = aeApiPoll(eventLoop, tvp);
……
for (j = 0; j < numevents; j++) {
……
/* Fire the writable event. */
if (fe->mask & mask & AE_WRITABLE) {
if (!fired || fe->wfileProc != fe->rfileProc) {
fe->wfileProc(eventLoop,fd,fe->clientData,mask);
fired++;
}
}
……
}
}
static int aeApiPoll(aeEventLoop *eventLoop, struct timeval *tvp) {
aeApiState *state = eventLoop->apidata;
int retval, numevents = 0;
retval = epoll_wait(state->epfd,state->events,eventLoop->setsize,
tvp ? (tvp->tv_sec*1000 + (tvp->tv_usec + 999)/1000) : -1);
if (retval > 0) {
int j;
numevents = retval;
for (j = 0; j < numevents; j++) {
int mask = 0;
struct epoll_event *e = state->events+j;
if (e->events & EPOLLIN) mask |= AE_READABLE;
if (e->events & EPOLLOUT) mask |= AE_WRITABLE;
if (e->events & EPOLLERR) mask |= AE_WRITABLE|AE_READABLE;
if (e->events & EPOLLHUP) mask |= AE_WRITABLE|AE_READABLE;
eventLoop->fired[j].fd = e->data.fd;
eventLoop->fired[j].mask = mask;
}
}
return numevents;
}
如果执行完 sendReplyToClient 方法后还是未写完,那么就等下一次的事件循环处理执行 beforesleep (套娃)。
结语
至此,就可以解释了文章开头的那个问题,lrange 中 1000 万的数据的发送应该就在最后一步了,而且不是一次性的发送,而是分批发送,此时并不会阻塞其他命令执行,因此 Redis 可以执行其他命令。