1 Introduction
The previous post covered the conceptual flow of Redis replication; this one walks through the source code that implements it.
There are three ways to establish replication (illustrated right after this list):
- Configure slaveof <masterip> <masterport> in redis.conf, then start Redis with that config file.
- Start redis-server with --slaveof <masterip> <masterport> on the command line.
- Execute the slaveof <masterip> <masterport> command directly on the slave.
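For example, assuming a master at 127.0.0.1:6379 and a slave listening on 6380 (the addresses and ports here are just placeholders):

# in the slave's redis.conf, then start with that file
slaveof 127.0.0.1 6379

# or on the command line
redis-server --slaveof 127.0.0.1 6379

# or at runtime, against the running slave
redis-cli -p 6380 slaveof 127.0.0.1 6379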
2 Client side
2.1 Setting the master's address and port
The command handler is slaveofCommand(), in replication.c:
// SLAVEOF host port command implementation
void slaveofCommand(client *c) {
    /* SLAVEOF is not allowed in cluster mode as replication is automatically
     * configured using the current address of the master node. */
    if (server.cluster_enabled) {
        addReplyError(c,"SLAVEOF not allowed in cluster mode.");
        return;
    }
    /* The special host/port combination "NO" "ONE" turns the instance
     * into a master. Otherwise the new master address is set. */
    // SLAVEOF NO ONE turns a slave back into a master
    if (!strcasecmp(c->argv[1]->ptr,"no") &&
        !strcasecmp(c->argv[2]->ptr,"one")) {
        // A master address is currently set
        if (server.masterhost) {
            // Cancel replication; this server becomes a master
            replicationUnsetMaster();
            // Build the client's info as an sds string and log it
            sds client = catClientInfoString(sdsempty(),c);
            serverLog(LL_NOTICE,"MASTER MODE enabled (user request from '%s')",
                client);
            sdsfree(client);
        }
    } else {
        long port;
        // Parse the port argument
        if ((getLongFromObjectOrReply(c, c->argv[2], &port, NULL) != C_OK))
            return;
        /* Check if we are already attached to the specified slave */
        // If the given host and port already match the current master,
        // reply +OK and do nothing else
        if (server.masterhost && !strcasecmp(server.masterhost,c->argv[1]->ptr)
            && server.masterport == port) {
            serverLog(LL_NOTICE,"SLAVE OF would result into synchronization with the master we are already connected with. No operation performed.");
            addReplySds(c,sdsnew("+OK Already connected to specified master\r\n"));
            return;
        }
        /* There was no previous master or the user specified a different one,
         * we can continue. */
        // No previous master, or a new one was specified: start replication
        // by recording the master's IP and port
        replicationSetMaster(c->argv[1]->ptr, port);
        // Build the client's info as an sds string and log it
        sds client = catClientInfoString(sdsempty(),c);
        serverLog(LL_NOTICE,"SLAVE OF %s:%d enabled (user request from '%s')",
            server.masterhost, server.masterport, client);
        sdsfree(client);
    }
    addReply(c,shared.ok); // reply OK
}
When a client executes SLAVEOF on the slave, the command is encoded in the Redis protocol and sent to the slave server, which invokes slaveofCommand() to execute it. The main steps of slaveofCommand are:
- Check whether we are in cluster mode; the command is not allowed there.
- Check whether the command is SLAVEOF NO ONE, which breaks the master-slave relationship and makes the current node a master again.
- Otherwise record the IP and port of the master this slave now belongs to, by calling replicationSetMaster().
Now let's look at replicationSetMaster:
/* Set replication to the specified master address and port. */
// Turn this server into a slave of the given address
void replicationSetMaster(char *ip, int port) {
    // Clear the previous master's host, if any
    sdsfree(server.masterhost);
    // Record the new IP and port
    server.masterhost = sdsnew(ip);
    server.masterport = port;
    // If we already had a master, free it first.
    // E.g. server 1 is the master of server 2; if server 2 is now told to
    // replicate server 3, server 3 becomes its master and server 1 must be freed.
    if (server.master) freeClient(server.master);
    // Unblock all blocked clients
    disconnectAllBlockedClients(); /* Clients blocked in master, now slave. */
    // Drop the connections of all our own slaves, forcing them to resync with us
    disconnectSlaves(); /* Force our slaves to resync with us as well. */
    // Discard the cached master structure: no partial resync (PSYNC) will be tried
    replicationDiscardCachedMaster(); /* Don't try a PSYNC. */
    // Free the replication backlog
    freeReplicationBacklog(); /* Don't allow our chained slaves to PSYNC. */
    // Abort any replication handshake in progress
    cancelReplicationHandshake();
    // Mark that we must (re)connect to the master (the key step)
    server.repl_state = REPL_STATE_CONNECT;
    // Reset the replication offset
    server.master_repl_offset = 0;
    // Reset the "down since" timestamp
    server.repl_down_since = 0;
}
So the function mainly cleans up the state of the previous master and records the new master's IP and port; there is no actual copying of the master's data here.
The book describes slaveof as an asynchronous command: when executed, the slave stores the master's address, and once the master-slave relationship is established the command returns immediately; the replication flow itself runs asynchronously inside the node. So where is the real implementation?
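The asynchrony is easy to observe from redis-cli: the command returns OK immediately while the actual synchronization runs in the background (the port here is a placeholder):

127.0.0.1:6380> slaveof 127.0.0.1 6379
OK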
2.2 Establishing the master-slave socket
The periodic function serverCron() is registered as a time-event handler when the Redis server initializes, and it contains the replication-related call:
/* Replication cron function -- used to reconnect to master,
* detect transfer failures, start background RDB transfers and so forth. */
// Run the replication cron task, once per second
run_with_period(1000) replicationCron();
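run_with_period() is a macro from server.h that rate-limits a block inside serverCron(); roughly:

// From server.h (approximately): run the body once every _ms_ milliseconds,
// given that serverCron() itself is called server.hz times per second.
#define run_with_period(_ms_) \
    if ((_ms_ <= 1000/server.hz) || !(server.cronloops%((_ms_)/(1000/server.hz))))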
replicationCron() therefore runs once per second; the source is in replication.c. It monitors the various states of the replication process and reacts to each one. Let's first look at the part of replicationCron() that connects to the master, connectWithMaster():
/* Check if we should connect to a MASTER */
// If we are in the "must connect to the master" state, try to connect
if (server.repl_state == REPL_STATE_CONNECT) {
    serverLog(LL_NOTICE,"Connecting to MASTER %s:%d",
        server.masterhost, server.masterport);
    // Connect to the master in a non-blocking way
    if (connectWithMaster() == C_OK) {
        serverLog(LL_NOTICE,"MASTER <-> SLAVE sync started");
    }
}
// Connect to the master in a non-blocking way
int connectWithMaster(void) {
    int fd;
    // Connect to the master
    fd = anetTcpNonBlockBestEffortBindConnect(NULL,
        server.masterhost,server.masterport,NET_FIRST_BIND_ADDR);
    if (fd == -1) {
        serverLog(LL_WARNING,"Unable to connect to MASTER: %s",
            strerror(errno));
        return C_ERR;
    }
    // Monitor read and write events on the master fd and bind the file event handler
    if (aeCreateFileEvent(server.el,fd,AE_READABLE|AE_WRITABLE,syncWithMaster,NULL) ==
            AE_ERR)
    {
        close(fd);
        serverLog(LL_WARNING,"Can't create readable event for SYNC");
        return C_ERR;
    }
    // Time of the last transfer I/O
    server.repl_transfer_lastio = server.unixtime;
    // The socket the slave uses to sync with the master
    server.repl_transfer_s = fd;
    server.repl_state = REPL_STATE_CONNECTING; // move the state to "connecting"
    return C_OK;
}
What connectWithMaster() does can be summed up as:
Connect to the master non-blockingly using the saved IP and port, store the resulting fd for master-slave communication in server.repl_transfer_s, and move the state from REPL_STATE_CONNECT to REPL_STATE_CONNECTING.
Register both readable and writable events on the fd, with syncWithMaster() as the event handler.
With that, the master-slave network connection is established.
2.3 Handshake and synchronization
Next comes syncWithMaster(). The flow here is complex: sending PING, authentication, sending the port, the IP, CAPA, then the synchronization itself. The state flow below summarizes it for reference.
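The slave-side state machine driven by syncWithMaster(), reconstructed from the code below (the original post had a figure here; all states are REPL_STATE_*):

CONNECTING    --PING-->                     RECEIVE_PONG
RECEIVE_PONG  --AUTH (if masterauth)-->     RECEIVE_AUTH
SEND_PORT     --REPLCONF listening-port-->  RECEIVE_PORT
SEND_IP       --REPLCONF ip-address-->      RECEIVE_IP   (skipped without slave-announce-ip)
SEND_CAPA     --REPLCONF capa eof-->        RECEIVE_CAPA
SEND_PSYNC    --PSYNC-->                    RECEIVE_PSYNC
RECEIVE_PSYNC --+FULLRESYNC-->              TRANSFER  (receive the RDB file)
RECEIVE_PSYNC --+CONTINUE-->                CONNECTED (partial resync)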
// Callback used by the slave to synchronize with the master
void syncWithMaster(aeEventLoop *el, int fd, void *privdata, int mask) {
    char tmpfile[256], *err = NULL;
    int dfd = -1, maxtries = 5;
    int sockerr = 0, psync_result;
    socklen_t errlen = sizeof(sockerr);
    UNUSED(el);
    UNUSED(privdata);
    UNUSED(mask);

    /* If this event fired after the user turned the instance into a master
     * with SLAVEOF NO ONE we must just return ASAP. */
    // In SLAVEOF NO ONE mode (replication turned off), just close the fd
    if (server.repl_state == REPL_STATE_NONE) {
        close(fd);
        return;
    }

    /* Check for errors in the socket. */
    // Check for socket errors
    if (getsockopt(fd, SOL_SOCKET, SO_ERROR, &sockerr, &errlen) == -1)
        sockerr = errno;
    if (sockerr) {
        serverLog(LL_WARNING,"Error condition on socket for SYNC: %s",
            strerror(sockerr));
        goto error;
    }

    /* Send a PING to check the master is able to reply without errors. */
    // In REPL_STATE_CONNECTING, send a PING to check that the master
    // replies with a proper PONG
    if (server.repl_state == REPL_STATE_CONNECTING) {
        serverLog(LL_NOTICE,"Non blocking connect for SYNC fired the event.");
        /* Delete the writable event so that the readable event remains
         * registered and we can wait for the PONG reply. */
        // Stop monitoring the writable event; only the readable event stays
        // registered while we wait for the PONG
        aeDeleteFileEvent(server.el,fd,AE_WRITABLE);
        // New state: waiting for the PONG reply
        server.repl_state = REPL_STATE_RECEIVE_PONG;
        /* Send the PING, don't check for errors at all, we have the timeout
         * that will take care about this. */
        // Send the PING synchronously
        err = sendSynchronousCommand(SYNC_CMD_WRITE,fd,"PING",NULL);
        if (err) goto write_error;
        // wait for the PONG to arrive
        return;
    }

    /* Receive the PONG command. */
    // Receive the PONG
    if (server.repl_state == REPL_STATE_RECEIVE_PONG) {
        // Read the PONG from the master
        err = sendSynchronousCommand(SYNC_CMD_READ,fd,NULL);
        /* We accept only two replies as valid, a positive +PONG reply
         * (we just check for "+") or an authentication error.
         * Note that older versions of Redis replied with "operation not
         * permitted" instead of using a proper error code, so we test
         * both. */
        // Only two replies are valid: "+PONG" and the auth error "-NOAUTH".
        // Older versions replied "-ERR operation not permitted" instead.
        if (err[0] != '+' &&
            strncmp(err,"-NOAUTH",7) != 0 &&
            strncmp(err,"-ERR operation not permitted",28) != 0)
        {
            // No valid reply to PING was received
            serverLog(LL_WARNING,"Error reply to PING from master: '%s'",err);
            sdsfree(err);
            goto error;
        } else {
            serverLog(LL_NOTICE,
                "Master replied to PING, replication can continue...");
        }
        sdsfree(err);
        // PONG received; next, send the AUTH command to the master
        server.repl_state = REPL_STATE_SEND_AUTH;
    }

    /* AUTH with the master if required. */
    // Authenticate if needed
    if (server.repl_state == REPL_STATE_SEND_AUTH) {
        // If an authentication password is configured
        if (server.masterauth) {
            // Send AUTH to the master
            err = sendSynchronousCommand(SYNC_CMD_WRITE,fd,"AUTH",server.masterauth,NULL);
            if (err) goto write_error;
            // New state: waiting for the AUTH reply
            server.repl_state = REPL_STATE_RECEIVE_AUTH;
            return;
        } else { // No password configured; go straight to sending our port
            server.repl_state = REPL_STATE_SEND_PORT;
        }
    }

    /* Receive AUTH reply. */
    // Receive the AUTH reply
    if (server.repl_state == REPL_STATE_RECEIVE_AUTH) {
        // Read the reply from the master
        err = sendSynchronousCommand(SYNC_CMD_READ,fd,NULL);
        // An error reply means authentication failed
        if (err[0] == '-') {
            serverLog(LL_WARNING,"Unable to AUTH to MASTER: %s",err);
            sdsfree(err);
            goto error;
        }
        sdsfree(err);
        // New state: send our port to the master
        server.repl_state = REPL_STATE_SEND_PORT;
    }

    /* Set the slave port, so that Master's INFO command can list the
     * slave listening port correctly. */
    // Send the slave's port to the master so that the master's INFO
    // command can list the slave's listening port correctly
    if (server.repl_state == REPL_STATE_SEND_PORT) {
        // Get the port to announce
        sds port = sdsfromlonglong(server.slave_announce_port ?
            server.slave_announce_port : server.port);
        // Send the port to the master
        err = sendSynchronousCommand(SYNC_CMD_WRITE,fd,"REPLCONF",
                "listening-port",port, NULL);
        sdsfree(port);
        if (err) goto write_error;
        sdsfree(err);
        // New state: waiting for the reply to REPLCONF listening-port
        server.repl_state = REPL_STATE_RECEIVE_PORT;
        return;
    }

    /* Receive REPLCONF listening-port reply. */
    // Waiting for the listening-port reply
    if (server.repl_state == REPL_STATE_RECEIVE_PORT) {
        // Read the reply from the master
        err = sendSynchronousCommand(SYNC_CMD_READ,fd,NULL);
        /* Ignore the error if any, not all the Redis versions support
         * REPLCONF listening-port. */
        // Ignore errors: not every Redis version supports REPLCONF listening-port
        if (err[0] == '-') {
            serverLog(LL_NOTICE,"(Non critical) Master does not understand "
                "REPLCONF listening-port: %s", err);
        }
        sdsfree(err);
        // New state: send our IP
        server.repl_state = REPL_STATE_SEND_IP;
    }

    /* Skip REPLCONF ip-address if there is no slave-announce-ip option set. */
    // If slave_announce_ip is not set there is no IP to announce;
    // skip straight to sending capa
    if (server.repl_state == REPL_STATE_SEND_IP &&
        server.slave_announce_ip == NULL)
    {
        server.repl_state = REPL_STATE_SEND_CAPA;
    }

    /* Set the slave ip, so that Master's INFO command can list the
     * slave IP address port correctly in case of port forwarding or NAT. */
    // Send our IP
    if (server.repl_state == REPL_STATE_SEND_IP) {
        // Send the IP to the master
        err = sendSynchronousCommand(SYNC_CMD_WRITE,fd,"REPLCONF",
                "ip-address",server.slave_announce_ip, NULL);
        if (err) goto write_error;
        sdsfree(err);
        // New state: waiting for the reply to REPLCONF ip-address
        server.repl_state = REPL_STATE_RECEIVE_IP;
        return;
    }

    /* Receive REPLCONF ip-address reply. */
    // Waiting for the ip-address reply
    if (server.repl_state == REPL_STATE_RECEIVE_IP) {
        // Read the reply from the master
        err = sendSynchronousCommand(SYNC_CMD_READ,fd,NULL);
        /* Ignore the error if any, not all the Redis versions support
         * REPLCONF ip-address. */
        // Error reply (non critical)
        if (err[0] == '-') {
            serverLog(LL_NOTICE,"(Non critical) Master does not understand "
                "REPLCONF ip-address: %s", err);
        }
        sdsfree(err);
        // New state: send REPLCONF capa
        server.repl_state = REPL_STATE_SEND_CAPA;
    }

    /* Inform the master of our capabilities. While we currently send
     * just one capability, it is possible to chain new capabilities here
     * in the form of REPLCONF capa X capa Y capa Z ...
     * The master will ignore capabilities it does not understand. */
    // Tell the master what this slave is capable of
    if (server.repl_state == REPL_STATE_SEND_CAPA) {
        // Send the slave's capabilities to the master
        err = sendSynchronousCommand(SYNC_CMD_WRITE,fd,"REPLCONF",
                "capa","eof",NULL);
        if (err) goto write_error;
        sdsfree(err);
        // New state: waiting for the capa reply
        server.repl_state = REPL_STATE_RECEIVE_CAPA;
        return;
    }

    /* Receive CAPA reply. */
    // Waiting for the capa reply
    if (server.repl_state == REPL_STATE_RECEIVE_CAPA) {
        // Read the capa reply from the master
        err = sendSynchronousCommand(SYNC_CMD_READ,fd,NULL);
        /* Ignore the error if any, not all the Redis versions support
         * REPLCONF capa. */
        // Error reply (non critical)
        if (err[0] == '-') {
            serverLog(LL_NOTICE,"(Non critical) Master does not understand "
                "REPLCONF capa: %s", err);
        }
        sdsfree(err);
        // New state: send the PSYNC command
        server.repl_state = REPL_STATE_SEND_PSYNC;
    }

    /* Try a partial resynchonization. If we don't have a cached master
     * slaveTryPartialResynchronization() will at least try to use PSYNC
     * to start a full resynchronization so that we get the master run id
     * and the global offset, to try a partial resync at the next
     * reconnection attempt. */
    // Send PSYNC and try a partial resync. Without a cached master,
    // slaveTryPartialResynchronization() will at least use PSYNC to start a
    // full resync, so we learn the master's run id and global offset and can
    // attempt a partial resync on the next reconnection.
    if (server.repl_state == REPL_STATE_SEND_PSYNC) {
        // Write a PSYNC command to the master; the 0 argument means
        // "write only, don't read the reply yet"
        if (slaveTryPartialResynchronization(fd,0) == PSYNC_WRITE_ERROR) {
            // Sending PSYNC failed
            err = sdsnew("Write error sending the PSYNC command.");
            goto write_error;
        }
        // New state: waiting for the PSYNC reply
        server.repl_state = REPL_STATE_RECEIVE_PSYNC;
        return;
    }

    /* If reached this point, we should be in REPL_STATE_RECEIVE_PSYNC. */
    // At this point we must be in REPL_STATE_RECEIVE_PSYNC; anything else is an error
    if (server.repl_state != REPL_STATE_RECEIVE_PSYNC) {
        serverLog(LL_WARNING,"syncWithMaster(): state machine error, "
            "state should be RECEIVE_PSYNC but is %d",
            server.repl_state);
        goto error;
    }
    // Second call: read the master's reply to decide between a partial
    // resync and a full resync
    psync_result = slaveTryPartialResynchronization(fd,1);
    // PSYNC_WAIT_REPLY: the reply isn't complete yet, run this function again later
    if (psync_result == PSYNC_WAIT_REPLY) return; /* Try again later... */

    /* Note: if PSYNC does not return WAIT_REPLY, it will take care of
     * uninstalling the read handler from the file descriptor. */
    // PSYNC_CONTINUE: a partial resync will be performed; we are done here
    if (psync_result == PSYNC_CONTINUE) {
        serverLog(LL_NOTICE, "MASTER <-> SLAVE sync: Master accepted a Partial Resynchronization.");
        return;
    }

    /* PSYNC failed or is not supported: we want our slaves to resync with us
     * as well, if we have any (chained replication case). The master may
     * transfer us an entirely different data set and we have no way to
     * incrementally feed our slaves after that. */
    // PSYNC failed or is unsupported: drop the connections of our own slaves
    // too, forcing them to resync with us
    disconnectSlaves(); /* Force our slaves to resync with us as well. */
    // Free the backlog so our chained slaves cannot PSYNC either
    freeReplicationBacklog(); /* Don't allow our chained slaves to PSYNC. */

    /* Fall back to SYNC if needed. Otherwise psync_result == PSYNC_FULLRESYNC
     * and the server.repl_master_runid and repl_master_initial_offset are
     * already populated. */
    // The master does not support PSYNC: fall back to the version-compatible SYNC
    if (psync_result == PSYNC_NOT_SUPPORTED) {
        serverLog(LL_NOTICE,"Retrying with SYNC...");
        // Send SYNC to the master
        if (syncWrite(fd,"SYNC\r\n",6,server.repl_syncio_timeout*1000) == -1) {
            serverLog(LL_WARNING,"I/O error writing to MASTER: %s",
                strerror(errno));
            goto error;
        }
    }

    // If we get here, psync_result == PSYNC_FULLRESYNC or PSYNC_NOT_SUPPORTED
    /* Prepare a suitable temp file for bulk transfer */
    // Prepare a temp file to store the RDB data the master will send
    while(maxtries--) {
        // Build the file name
        snprintf(tmpfile,256,
            "temp-%d.%ld.rdb",(int)server.unixtime,(long int)getpid());
        // Open the temp file write-only, creating it exclusively, mode 0644
        dfd = open(tmpfile,O_CREAT|O_WRONLY|O_EXCL,0644);
        // Opened successfully; leave the loop
        if (dfd != -1) break;
        sleep(1);
    }
    if (dfd == -1) {
        serverLog(LL_WARNING,"Opening the temp file needed for MASTER <-> SLAVE synchronization: %s",strerror(errno));
        goto error;
    }

    /* Setup the non blocking download of the bulk file. */
    // Register a readable event to read the master's RDB file,
    // with readSyncBulkPayload as the handler
    if (aeCreateFileEvent(server.el,fd, AE_READABLE,readSyncBulkPayload,NULL)
            == AE_ERR)
    {
        serverLog(LL_WARNING,
            "Can't create readable event for SYNC: %s (fd=%d)",
            strerror(errno),fd);
        goto error;
    }

    // New state: receiving the RDB file from the master
    server.repl_state = REPL_STATE_TRANSFER;
    // RDB file size, unknown for now
    server.repl_transfer_size = -1;
    // bytes read so far
    server.repl_transfer_read = 0;
    // offset of the last fsync
    server.repl_transfer_last_fsync_off = 0;
    // fd of the temp file the RDB is written to
    server.repl_transfer_fd = dfd;
    // time of the last read of RDB data
    server.repl_transfer_lastio = server.unixtime;
    // name of the temp RDB file
    server.repl_transfer_tmpfile = zstrdup(tmpfile);
    return;

// Error handling
error:
    // Remove all event listeners for fd
    aeDeleteFileEvent(server.el,fd,AE_READABLE|AE_WRITABLE);
    // Close dfd
    if (dfd != -1) close(dfd);
    // Close fd
    close(fd);
    // Reset the sync socket
    server.repl_transfer_s = -1;
    // Back to the "must reconnect to the master" state
    server.repl_state = REPL_STATE_CONNECT;
    return;

// Write error handling
write_error: /* Handle sendSynchronousCommand(SYNC_CMD_WRITE) errors. */
    serverLog(LL_WARNING,"Sending command to master in replication handshake: %s", err);
    sdsfree(err);
    goto error;
}
A rough outline of the flow: steps 1-5 are the handshake; a quick pass over that code is enough. Step 6 is the one to focus on.
1. Send PING: when the master-slave network is set up, both AE_READABLE and AE_WRITABLE are registered on the fd, so a writable event fires right away and syncWithMaster() handles it. In the REPL_STATE_CONNECTING state the slave sends PING to the master. The PING serves to:
- check whether the network between master and slave is usable;
- check whether the master and slave can currently accept and process commands.
The master reads the PING from the fd and writes a PONG back, via addReply(c,shared.pong). That makes the fd readable on the slave side, so syncWithMaster() runs again, now in the REPL_STATE_RECEIVE_PONG state, waiting for the master's PONG. It calls sendSynchronousCommand() in read mode, which returns "+PONG\r\n" in err. If the PONG was received correctly, the slave sets server.repl_state = REPL_STATE_SEND_AUTH and moves on to authentication.
2. Authenticate: if the slave has an authentication password configured, it calls sendSynchronousCommand() in write mode to write the AUTH command and the password to the fd, and sets server.repl_state = REPL_STATE_RECEIVE_AUTH to await the result. The master reads the AUTH command and handles it with authCommand(), comparing the server.masterauth the slave sent against its own server.requirepass; if they match, it replies "+OK\r\n". When the master writes that reply to the fd, the slave's readable event fires again and syncWithMaster() reads the AUTH result from the fd. On success the slave sets server.repl_state = REPL_STATE_SEND_PORT, meaning it will send its port to the master next, the same state reached when no password is configured.
3. Send the port: continuing in syncWithMaster(), the slave writes a REPLCONF listening-port command to the fd and sets server.repl_state = REPL_STATE_RECEIVE_PORT to await the master's reply.
The master reads REPLCONF listening-port <port> from the fd, handles it in replconfCommand(), and writes an "+OK\r\n" status reply back to the fd. The reply triggers the slave's readable event again; syncWithMaster() then checks that the master received the port correctly.
If it did, the slave sets server.repl_state = REPL_STATE_SEND_IP, meaning an IP will be sent to the master next.
4. Send the IP: after the port has been sent and acknowledged, syncWithMaster() runs the IP-sending code. Sending the IP works almost exactly like sending the port. If the master accepts the IP, the slave's state becomes server.repl_state = REPL_STATE_SEND_CAPA, meaning the slave's capabilities will be sent next.
5. Send capabilities: same pattern as the port and IP above. If the master accepts the capa, the slave's state becomes server.repl_state = REPL_STATE_SEND_PSYNC, meaning a PSYNC command will be sent.
6. Send PSYNC
The slave sends PSYNC to the master to synchronize the master's data set. There are two kinds of synchronization:
- full sync: used the first time replication runs;
- partial sync: used when data was lost because of a network interruption or similar.
The slave calls slaveTryPartialResynchronization() to attempt the resync. It has two halves, a writing half and a reading half. Here is the source:
// The slave sends PSYNC to attempt a partial resynchronization
int slaveTryPartialResynchronization(int fd, int read_reply) {
    char *psync_runid;
    char psync_offset[32];
    sds reply;

    /* Writing half */
    // With read_reply == 0, write a PSYNC command to the socket
    if (!read_reply) {
        /* Initially set repl_master_initial_offset to -1 to mark the current
         * master run_id and offset as not valid. Later if we'll be able to do
         * a FULL resync using the PSYNC command we'll set the offset at the
         * right value, so that this information will be propagated to the
         * client structure representing the master into server.master. */
        // -1 marks the master's run_id and global offset as invalid for now.
        // If a FULL resync happens via PSYNC, the offset will be set properly
        // so the information propagates into the client structure that
        // represents the master.
        server.repl_master_initial_offset = -1;

        if (server.cached_master) {
            // A cached master exists: try a partial resync with
            // "PSYNC <master_run_id> <repl_offset>"
            // cached run id
            psync_runid = server.cached_master->replrunid;
            // offset already replicated, plus one
            snprintf(psync_offset,sizeof(psync_offset),"%lld", server.cached_master->reploff+1);
            serverLog(LL_NOTICE,"Trying a partial resynchronization (request %s:%s).", psync_runid, psync_offset);
        } else {
            // No cached master: send "PSYNC ? -1" to ask for a full resync
            serverLog(LL_NOTICE,"Partial resynchronization not possible (no cached master)");
            psync_runid = "?";
            memcpy(psync_offset,"-1",3);
        }

        /* Issue the PSYNC command */
        // Send the PSYNC command to the master
        reply = sendSynchronousCommand(SYNC_CMD_WRITE,fd,"PSYNC",psync_runid,psync_offset,NULL);
        // A non-NULL reply here means the write failed
        if (reply != NULL) {
            serverLog(LL_WARNING,"Unable to send PSYNC to master: %s",reply);
            sdsfree(reply);
            // Remove the readable event and report the write error
            aeDeleteFileEvent(server.el,fd,AE_READABLE);
            return PSYNC_WRITE_ERROR;
        }
        // Return PSYNC_WAIT_REPLY; the caller will call again with
        // read_reply set to 1 to run the reading half below
        return PSYNC_WAIT_REPLY;
    }

    /* Reading half */
    // Read one reply from the master into reply
    reply = sendSynchronousCommand(SYNC_CMD_READ,fd,NULL);
    if (sdslen(reply) == 0) {
        /* The master may send empty newlines after it receives PSYNC
         * and before to reply, just to keep the connection alive. */
        // The master may send empty newlines between receiving PSYNC and
        // replying, just to keep the connection alive
        sdsfree(reply);
        return PSYNC_WAIT_REPLY;
    }

    // Got a real reply: remove the readable event on fd
    aeDeleteFileEvent(server.el,fd,AE_READABLE);

    // "+FULLRESYNC": a full resynchronization will be performed
    if (!strncmp(reply,"+FULLRESYNC",11)) {
        char *runid = NULL, *offset = NULL;

        /* FULL RESYNC, parse the reply in order to extract the run id
         * and the replication offset. */
        // Parse the reply to extract the run id and the replication offset
        runid = strchr(reply,' ');
        if (runid) {
            runid++; // move to the run id
            offset = strchr(runid,' ');
            if (offset) offset++; // move to the offset
        }
        // Validate the run id (runid or offset missing, or wrong length)
        if (!runid || !offset || (offset-runid-1) != CONFIG_RUN_ID_SIZE) {
            serverLog(LL_WARNING,
                "Master replied with wrong +FULLRESYNC syntax.");
            /* This is an unexpected condition, actually the +FULLRESYNC
             * reply means that the master supports PSYNC, but the reply
             * format seems wrong. To stay safe we blank the master
             * runid to make sure next PSYNCs will fail. */
            // The master supports PSYNC but sent a malformed run id;
            // blank our saved run id so the next PSYNC fails
            memset(server.repl_master_runid,0,CONFIG_RUN_ID_SIZE+1);
        } else {
            // Save the master's run ID
            memcpy(server.repl_master_runid, runid, offset-runid-1);
            server.repl_master_runid[CONFIG_RUN_ID_SIZE] = '\0';
            // And the master's initial offset
            server.repl_master_initial_offset = strtoll(offset,NULL,10);
            // Log that this is a FULL resync
            serverLog(LL_NOTICE,"Full resync from master: %s:%lld",
                server.repl_master_runid,
                server.repl_master_initial_offset);
        }
        /* We are going to full resync, discard the cached master structure. */
        // A full resync makes the cached master structure useless; discard it
        replicationDiscardCachedMaster();
        sdsfree(reply);
        // Report the result
        return PSYNC_FULLRESYNC;
    }

    // "+CONTINUE": a partial resynchronization will be performed
    if (!strncmp(reply,"+CONTINUE",9)) {
        /* Partial resync was accepted, set the replication state accordingly */
        serverLog(LL_NOTICE,
            "Successful partial resynchronization with master.");
        sdsfree(reply);
        // A partial resync reuses the cached master structure,
        // turning it back into the current master we sync from
        replicationResurrectCachedMaster(fd);
        // Report the result
        return PSYNC_CONTINUE;
    }

    /* If we reach this point we received either an error since the master does
     * not understand PSYNC, or an unexpected reply from the master.
     * Return PSYNC_NOT_SUPPORTED to the caller in both cases. */
    // An error was received; two possibilities:
    // 1. the master does not support PSYNC (Redis before 2.8)
    // 2. we read an unexpected reply from the master
    if (strncmp(reply,"-ERR",4)) {
        /* If it's not an error, log the unexpected event. */
        serverLog(LL_WARNING,
            "Unexpected reply to PSYNC from master: %s", reply);
    } else {
        serverLog(LL_NOTICE,
            "Master does not support PSYNC or is in "
            "error state (reply: %s)", reply);
    }
    sdsfree(reply);
    // Discard the cached master; no partial resync will happen
    replicationDiscardCachedMaster();
    // Report that PSYNC is not supported
    return PSYNC_NOT_SUPPORTED;
}
slaveTryPartialResynchronization() first checks whether a cached_master exists; if it does, the PSYNC command carries its runid and offset, and the master decides between a full and a partial sync. On the next event-loop iteration the function reads the master's response; the possible results are PSYNC_WAIT_REPLY, PSYNC_CONTINUE, PSYNC_NOT_SUPPORTED, PSYNC_FULLRESYNC and PSYNC_WRITE_ERROR.
When the master supports PSYNC and this is the slave's first sync with it, syncWithMaster() creates a tmpfile to receive the master's RDB file, registers readSyncBulkPayload() as the readable-event handler, and sets server.repl_state to REPL_STATE_TRANSFER.
readSyncBulkPayload() then reads the RDB data the master sends, one readable event at a time. Once the transfer completes it flushes the db, initializes the database from the received RDB file, and sets repl_state to REPL_STATE_CONNECTED. Full synchronization is done, and incremental synchronization begins.
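At the protocol level the exchange looks like this (each reply maps to one of the result codes above):

slave  -> master: PSYNC <cached runid> <reploff+1>   (cached master present)
slave  -> master: PSYNC ? -1                         (no cached master, ask for full resync)
master -> slave : +FULLRESYNC <runid> <offset>       => PSYNC_FULLRESYNC
master -> slave : +CONTINUE                          => PSYNC_CONTINUE
master -> slave : -ERR ...                           => PSYNC_NOT_SUPPORTED (fall back to SYNC)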
3 Server side
3.1 syncCommand
/* SYNC and PSYNC command implementation. */
void syncCommand(client *c) {
    /* ignore SYNC if already slave or in monitor mode */
    if (c->flags & CLIENT_SLAVE) return;

    /* Refuse SYNC requests if we are a slave but the link with our master
     * is not ok... */
    // If we are a slave ourselves but the link to our own master is not
    // ready yet, refuse the SYNC
    if (server.masterhost && server.repl_state != REPL_STATE_CONNECTED) {
        addReplyError(c,"Can't SYNC while not connected with my master");
        return;
    }

    /* SYNC can't be issued when the server has pending data to send to
     * the client about already issued commands. We need a fresh reply
     * buffer registering the differences between the BGSAVE and the current
     * dataset, so that we can copy to other slaves if needed. */
    // If the client's reply buffer still holds pending data, SYNC cannot run
    if (clientHasPendingReplies(c)) {
        addReplyError(c,"SYNC and PSYNC are invalid with pending output");
        return;
    }

    serverLog(LL_NOTICE,"Slave %s asks for synchronization",
        replicationGetSlaveName(c));

    /* Try a partial resynchronization if this is a PSYNC command.
     * If it fails, we continue with usual full resynchronization, however
     * when this happens masterTryPartialResynchronization() already
     * replied with:
     *
     * +FULLRESYNC <runid> <offset>
     *
     * So the slave knows the new runid and offset to try a PSYNC later
     * if the connection with the master is lost. */
    // For PSYNC, try a partial resync first. If it fails,
    // masterTryPartialResynchronization() has already replied with
    // "+FULLRESYNC <runid> <offset>", so the slave knows the runid and
    // offset for a later PSYNC, and we fall through to the full resync.
    if (!strcasecmp(c->argv[0]->ptr,"psync")) {
        // The master attempts a partial resync; C_OK means it worked
        if (masterTryPartialResynchronization(c) == C_OK) {
            // Partial resync accepted; bump the counter
            server.stat_sync_partial_ok++;
            // No full resync needed, return
            return; /* No full resync needed, return. */
        } else { // Partial resync not possible; a full resync is required
            char *master_runid = c->argv[1]->ptr;

            /* Increment stats for failed PSYNCs, but only if the
             * runid is not "?", as this is used by slaves to force a full
             * resync on purpose when they are not able to partially
             * resync. */
            // A runid of "?" means the slave forced a full resync on
            // purpose; only count other cases as PSYNC failures
            if (master_runid[0] != '?') server.stat_sync_partial_err++;
        }
    } else {
        /* If a slave uses SYNC, we are dealing with an old implementation
         * of the replication protocol (like redis-cli --slave). Flag the client
         * so that we don't expect to receive REPLCONF ACK feedbacks. */
        // Old implementation: flag the SYNC client so that no
        // REPLCONF ACK feedback is expected from it
        c->flags |= CLIENT_PRE_PSYNC;
    }

    // From here on, a full resynchronization is performed...
    /* Full resynchronization. */
    // Bump the full-resync counter
    server.stat_sync_full++;

    /* Setup the slave as one waiting for BGSAVE to start. The following code
     * paths will change the state if we handle the slave differently. */
    // Mark the client as a slave waiting for a BGSAVE to start
    c->replstate = SLAVE_STATE_WAIT_BGSAVE_START;
    // Should TCP_NODELAY be disabled after SYNC?
    if (server.repl_disable_tcp_nodelay)
        // If so, enable Nagle's algorithm
        anetDisableTcpNoDelay(NULL, c->fd); /* Non critical if it fails. */
    // fd of the RDB file that will be sent to this slave; none yet
    c->repldbfd = -1;
    // Flag the client as a slave
    c->flags |= CLIENT_SLAVE;
    // Add it to the server's list of slaves
    listAddNodeTail(server.slaves,c);

    /* CASE 1: BGSAVE is in progress, with disk target. */
    // Case 1: a BGSAVE is in progress, writing to disk
    if (server.rdb_child_pid != -1 &&
        server.rdb_child_type == RDB_CHILD_TYPE_DISK)
    {
        /* Ok a background save is in progress. Let's check if it is a good
         * one for replication, i.e. if there is another slave that is
         * registering differences since the server forked to save. */
        client *slave;
        listNode *ln;
        listIter li;

        listRewind(server.slaves,&li);
        // Walk the slave list
        while((ln = listNext(&li))) {
            slave = ln->value;
            // If some slave is already waiting for the in-progress RDB child
            // to finish, the RDB file it produces can be reused; stop here
            if (slave->replstate == SLAVE_STATE_WAIT_BGSAVE_END) break;
        }
        /* To attach this slave, we check that it has at least all the
         * capabilities of the slave that triggered the current BGSAVE. */
        // This slave must have at least the capabilities of the slave that
        // triggered the current BGSAVE
        if (ln && ((c->slave_capa & slave->slave_capa) == slave->slave_capa)) {
            /* Perfect, the server is already registering differences for
             * another slave. Set the right state, and copy the buffer. */
            // Copy the whole output buffer of that slave to this one
            copyClientOutputBuffer(c,slave);
            // Set this slave up for a full resync with the same initial
            // offset; the function moves it to SLAVE_STATE_WAIT_BGSAVE_END
            replicationSetupSlaveForFullResync(c,slave->psync_initial_offset);
            serverLog(LL_NOTICE,"Waiting for end of BGSAVE for SYNC");
        } else {
            /* No way, we need to wait for the next BGSAVE in order to
             * register differences. */
            serverLog(LL_NOTICE,"Can't attach the slave to the current BGSAVE. Waiting for next BGSAVE for SYNC");
        }

    /* CASE 2: BGSAVE is in progress, with socket target. */
    // Case 2: a BGSAVE is in progress, writing directly to sockets (diskless)
    } else if (server.rdb_child_pid != -1 &&
               server.rdb_child_type == RDB_CHILD_TYPE_SOCKET)
    {
        /* There is an RDB child process but it is writing directly to
         * children sockets. We need to wait for the next BGSAVE
         * in order to synchronize. */
        // An RDB child exists but it writes directly to sockets,
        // so we must wait for the next BGSAVE
        serverLog(LL_NOTICE,"Current BGSAVE has socket target. Waiting for next BGSAVE for SYNC");

    /* CASE 3: There is no BGSAVE is progress. */
    // Case 3: no BGSAVE is in progress
    } else {
        // Diskless sync is enabled and the slave supports it
        if (server.repl_diskless_sync && (c->slave_capa & SLAVE_CAPA_EOF)) {
            /* Diskless replication RDB child is created inside
             * replicationCron() since we want to delay its start a
             * few seconds to wait for more slaves to arrive. */
            // The diskless-replication child is created in replicationCron(),
            // so its start can be delayed to wait for more slaves to arrive
            if (server.repl_diskless_sync_delay)
                serverLog(LL_NOTICE,"Delay next BGSAVE for diskless SYNC");
        } else { // Disk-backed replication
            /* Target is disk (or the slave is not capable of supporting
             * diskless replication) and we don't have a BGSAVE in progress,
             * let's start one. */
            // No BGSAVE running and no AOF rewrite running: start a BGSAVE
            // for replication, writing the RDB file to disk
            if (server.aof_child_pid == -1) {
                // Start the bgsave; the slave will later be moved to
                // SLAVE_STATE_WAIT_BGSAVE_END
                startBgsaveForReplication(c->slave_capa);
            } else {
                serverLog(LL_NOTICE,
                    "No BGSAVE in progress, but an AOF rewrite is active. "
                    "BGSAVE for replication delayed");
            }
        }
    }

    // If this is the only slave and there is no backlog yet, create one
    if (listLength(server.slaves) == 1 && server.repl_backlog == NULL)
        createReplicationBacklog();
    return;
}
The sync command has a few parts: first a partial resync is attempted; if that fails, a full resync follows. For the full resync there are three cases:
1. An RDB child is running with a disk target: reuse the data already being prepared, copy the other slave's output buffer, and start the full sync via replicationSetupSlaveForFullResync().
2. An RDB child is running with a socket target: nothing to do but wait for the next BGSAVE.
3. No RDB child: decide between writing to disk and streaming directly over the socket, and call startBgsaveForReplication().
The source is in replication.c:
// Set up a slave for full resynchronization
int replicationSetupSlaveForFullResync(client *slave, long long offset) {
    char buf[128];
    int buflen;

    // Record the initial offset of the full resync
    slave->psync_initial_offset = offset;
    // Move the slave to "waiting for BGSAVE to end"; from this point on,
    // the diffs for this slave start to accumulate
    slave->replstate = SLAVE_STATE_WAIT_BGSAVE_END;
    /* We are going to accumulate the incremental changes for this
     * slave as well. Set slaveseldb to -1 in order to force to re-emit
     * a SELECT statement in the replication stream. */
    // Setting slaveseldb to -1 forces a SELECT command to be re-emitted
    // in the replication stream
    server.slaveseldb = -1;

    /* Don't send this reply to slaves that approached us with
     * the old SYNC command. */
    // CLIENT_PRE_PSYNC means the slave runs a Redis older than 2.8;
    // don't send this reply to it
    if (!(slave->flags & CLIENT_PRE_PSYNC)) {
        buflen = snprintf(buf,sizeof(buf),"+FULLRESYNC %s %lld\r\n",
                          server.runid,offset);
        // Otherwise write the full-resync information to the slave
        if (write(slave->fd,buf,buflen) != buflen) {
            freeClientAsync(slave);
            return C_ERR;
        }
    }
    return C_OK;
}
// Start a BGSAVE for replication, targeting disk or socket depending on the
// configuration; the script cache is flushed before starting
int startBgsaveForReplication(int mincapa) {
    int retval;
    // Write directly to the slaves' sockets?
    int socket_target = server.repl_diskless_sync && (mincapa & SLAVE_CAPA_EOF);
    listIter li;
    listNode *ln;

    serverLog(LL_NOTICE,"Starting BGSAVE for SYNC with target: %s",
        socket_target ? "slaves sockets" : "disk");

    if (socket_target)
        // Diskless: fork a child that writes the RDB directly to the sockets
        // of the slaves in the "waiting for BGSAVE to start" state
        retval = rdbSaveToSlavesSockets();
    else
        // Otherwise run a background RDB save (BGSAVE) to disk
        retval = rdbSaveBackground(server.rdb_filename);

    /* If we failed to BGSAVE, remove the slaves waiting for a full
     * resynchronization from the list of slaves, inform them with
     * an error about what happened, close the connection ASAP. */
    // BGSAVE failed: remove the slaves waiting for a full resync from the
    // slave list, tell them what happened, and close their connections ASAP
    if (retval == C_ERR) {
        serverLog(LL_WARNING,"BGSAVE for replication failed");
        listRewind(server.slaves,&li);
        // Walk the slave list
        while((ln = listNext(&li))) {
            client *slave = ln->value;
            // Remove the slaves waiting for a full resync
            if (slave->replstate == SLAVE_STATE_WAIT_BGSAVE_START) {
                slave->flags &= ~CLIENT_SLAVE;
                listDelNode(server.slaves,ln);
                addReplyError(slave,
                    "BGSAVE failed, replication can't continue");
                // Close the client's connection right after the reply
                slave->flags |= CLIENT_CLOSE_AFTER_REPLY;
            }
        }
        return retval;
    }

    /* If the target is socket, rdbSaveToSlavesSockets() already setup
     * the slaves for a full resync. Otherwise for disk target do it now.*/
    // With a socket target, rdbSaveToSlavesSockets() has already set the
    // slaves up for a full resync; with a disk target, do it here
    if (!socket_target) {
        listRewind(server.slaves,&li);
        // Walk the slave list
        while((ln = listNext(&li))) {
            client *slave = ln->value;
            // Slaves waiting for the BGSAVE to start...
            if (slave->replstate == SLAVE_STATE_WAIT_BGSAVE_START) {
                // ...are set up for a full resynchronization
                replicationSetupSlaveForFullResync(slave,
                    getPsyncInitialOffset());
            }
        }
    }

    /* Flush the script cache, since we need that slave differences are
     * accumulated without requiring slaves to match our cached scripts. */
    // Flush the script cache
    if (retval == C_OK) replicationScriptCacheFlush();
    return retval;
}
In either path, once the BGSAVE has been started, replicationSetupSlaveForFullResync() (shown above) is what switches the slave into full synchronization.
That function, however, triggers nothing by itself: it makes no further calls and registers no file events, so the actual transfer must be kicked off from serverCron(). The call chain is:
serverCron()
->backgroundSaveDoneHandler()
->backgroundSaveDoneHandlerDisk()
->updateSlavesWaitingBgsave()
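The two intermediate steps live in rdb.c; condensed below (not the full functions, but the shape matches the 3.x source):

// serverCron() notices the RDB child exited and calls this handler,
// which dispatches on the RDB child type.
void backgroundSaveDoneHandler(int exitcode, int bysignal) {
    switch(server.rdb_child_type) {
    case RDB_CHILD_TYPE_DISK:
        backgroundSaveDoneHandlerDisk(exitcode,bysignal);
        break;
    case RDB_CHILD_TYPE_SOCKET:
        backgroundSaveDoneHandlerSocket(exitcode,bysignal);
        break;
    }
}

// backgroundSaveDoneHandlerDisk() ends by notifying replication:
//     updateSlavesWaitingBgsave((!bysignal && exitcode == 0) ? C_OK : C_ERR,
//                               RDB_CHILD_TYPE_DISK);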
The source of updateSlavesWaitingBgsave() is in replication.c:
void updateSlavesWaitingBgsave(int bgsaveerr, int type) {
    listNode *ln;
    int startbgsave = 0;
    int mincapa = -1;
    listIter li;

    listRewind(server.slaves,&li);
    // Walk all slaves
    while((ln = listNext(&li))) {
        client *slave = ln->value;

        // If this slave is still waiting for a BGSAVE to start,
        // a new RDB file must be generated for it
        if (slave->replstate == SLAVE_STATE_WAIT_BGSAVE_START) {
            startbgsave = 1; // remember to start a new BGSAVE
            mincapa = (mincapa == -1) ? slave->slave_capa :
                                        (mincapa & slave->slave_capa);
        } else if (slave->replstate == SLAVE_STATE_WAIT_BGSAVE_END) {
            // This slave was waiting for the RDB child to finish, and it has
            struct redis_stat buf;

            /* If this was an RDB on disk save, we have to prepare to send
             * the RDB from disk to the slave socket. Otherwise if this was
             * already an RDB -> Slaves socket transfer, used in the case of
             * diskless replication, our work is trivial, we can just put
             * the slave online. */
            // Disk target: prepare to send the RDB file from disk to the
            // slave's socket. Socket target (diskless): the transfer has
            // already happened, so just put the slave online.
            if (type == RDB_CHILD_TYPE_SOCKET) {
                serverLog(LL_NOTICE,
                    "Streamed RDB transfer with slave %s succeeded (socket). Waiting for REPLCONF ACK from slave to enable streaming",
                    replicationGetSlaveName(slave));
                /* Note: we wait for a REPLCONF ACK message from slave in
                 * order to really put it online (install the write handler
                 * so that the accumulated data can be transfered). However
                 * we change the replication state ASAP, since our slave
                 * is technically online now. */
                // Mark the slave online
                slave->replstate = SLAVE_STATE_ONLINE;
                // Install the write handler only after a REPLCONF ACK arrives
                slave->repl_put_online_on_ack = 1;
                // Refresh the ack time
                slave->repl_ack_time = server.unixtime; /* Timeout otherwise. */
            } else {
                // The RDB was written to disk
                // If the BGSAVE failed, drop this slave
                if (bgsaveerr != C_OK) {
                    freeClient(slave);
                    serverLog(LL_WARNING,"SYNC failed. BGSAVE child returned an error");
                    continue;
                }
                // Open the RDB file and store the fd used for the transfer
                if ((slave->repldbfd = open(server.rdb_filename,O_RDONLY)) == -1 ||
                    redis_fstat(slave->repldbfd,&buf) == -1) {
                    freeClient(slave);
                    serverLog(LL_WARNING,"SYNC failed. Can't open/stat DB after BGSAVE: %s", strerror(errno));
                    continue;
                }
                // Offset into the RDB file sent so far
                slave->repldboff = 0;
                // Size of the RDB file
                slave->repldbsize = buf.st_size;
                // New state: sending the RDB file to the slave
                slave->replstate = SLAVE_STATE_SEND_BULK;
                // The RDB size as a protocol preamble: "$<size>\r\n"
                slave->replpreamble = sdscatprintf(sdsempty(),"$%lld\r\n",
                    (unsigned long long) slave->repldbsize);

                // Remove any previously installed writable handler
                aeDeleteFileEvent(server.el,slave->fd,AE_WRITABLE);
                // Register sendBulkToSlave as the writable-event handler
                // to send the RDB file to the slave
                if (aeCreateFileEvent(server.el, slave->fd, AE_WRITABLE, sendBulkToSlave, slave) == AE_ERR) {
                    freeClient(slave);
                    continue;
                }
            }
        }
    }
    // If needed, start a BGSAVE for the slaves that are still waiting for
    // one, using the common subset of their capabilities
    if (startbgsave) startBgsaveForReplication(mincapa);
}
updateSlavesWaitingBgsave() mainly adjusts each slave client's state according to the transfer mode. For the disk mode it registers a writable event so that data starts flowing to the slave, with sendBulkToSlave() as the writable-event handler. Let's look at it:
// Writable-event handler the master uses to send the RDB file to a slave
void sendBulkToSlave(aeEventLoop *el, int fd, void *privdata, int mask) {
    client *slave = privdata;
    UNUSED(el);
    UNUSED(mask);
    char buf[PROTO_IOBUF_LEN];
    ssize_t nwritten, buflen;

    /* Before sending the RDB file, we send the preamble as configured by the
     * replication process. Currently the preamble is just the bulk count of
     * the file in the form "$<length>\r\n". */
    // First send "$<length>\r\n", announcing the size of the RDB file
    if (slave->replpreamble) {
        nwritten = write(fd,slave->replpreamble,sdslen(slave->replpreamble));
        if (nwritten == -1) {
            serverLog(LL_VERBOSE,"Write error sending RDB preamble to slave: %s",
                strerror(errno));
            freeClient(slave);
            return;
        }
        // Update the count of bytes written to the network
        server.stat_net_output_bytes += nwritten;
        // Drop the bytes already written, keep the rest
        sdsrange(slave->replpreamble,nwritten,-1);
        // If the whole preamble has been written, free it
        if (sdslen(slave->replpreamble) == 0) {
            sdsfree(slave->replpreamble);
            slave->replpreamble = NULL;
            /* fall through sending data. */
        } else {
            return;
        }
    }

    /* If the preamble was already transfered, send the RDB bulk data. */
    // Seek to the current transfer offset in the RDB file
    lseek(slave->repldbfd,slave->repldboff,SEEK_SET);
    // Read a chunk of the RDB file into buf
    buflen = read(slave->repldbfd,buf,PROTO_IOBUF_LEN);
    if (buflen <= 0) {
        serverLog(LL_WARNING,"Read error sending DB to slave: %s",
            (buflen == 0) ? "premature EOF" : strerror(errno));
        freeClient(slave);
        return;
    }
    // Write the chunk of RDB data to the slave
    if ((nwritten = write(fd,buf,buflen)) == -1) {
        if (errno != EAGAIN) {
            serverLog(LL_WARNING,"Write error sending DB to slave: %s",
                strerror(errno));
            freeClient(slave);
        }
        return;
    }
    // Advance the offset into the RDB file
    slave->repldboff += nwritten;
    // Update the count of bytes written to the network
    server.stat_net_output_bytes += nwritten;
    // Transfer complete: the bytes sent equal the file size
    if (slave->repldboff == slave->repldbsize) {
        // Close the RDB file descriptor
        close(slave->repldbfd);
        slave->repldbfd = -1;
        // Remove the writable event for this slave
        aeDeleteFileEvent(server.el,slave->fd,AE_WRITABLE);
        // Put the slave online and install sendReplyToClient
        // as the new writable-event handler
        putSlaveOnline(slave);
    }
}
sendBulkToSlave() reads the RDB file and streams its contents to the slave; when the transfer finishes it removes the writable event and then calls putSlaveOnline().
void putSlaveOnline(client *slave) {
    // Mark the slave ONLINE
    slave->replstate = SLAVE_STATE_ONLINE;
    // No need to wait for a REPLCONF ACK before installing the write handler
    slave->repl_put_online_on_ack = 0;
    // Refresh the ack time
    slave->repl_ack_time = server.unixtime; /* Prevent false timeout. */
    // Install sendReplyToClient as the writable-event handler
    if (aeCreateFileEvent(server.el, slave->fd, AE_WRITABLE,
        sendReplyToClient, slave) == AE_ERR) {
        serverLog(LL_WARNING,"Unable to register writable event for slave bulk transfer: %s", strerror(errno));
        freeClient(slave);
        return;
    }
    // Refresh the count of slaves in good condition
    refreshGoodSlavesCount();
    serverLog(LL_NOTICE,"Synchronization with slave %s succeeded",
        replicationGetSlaveName(slave));
}
putSlaveOnline() marks the slave as online, meaning the transfer is complete, and creates a writable event used to reply to the slave from then on.
To recap this part: when the master's periodic function learns that the BGSAVE is done, it removes the previously monitored writable event and immediately registers a new one; the writable event fires and sendBulkToSlave() writes the RDB file into the fd, which triggers readable events on the slave side; the slave's readSyncBulkPayload() handler (registered by syncWithMaster()) loads the RDB data into its database. That completes the initial master-slave synchronization.
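readSyncBulkPayload() itself is long; here is only a condensed outline of what it does (error handling and the diskless EOF-mark protocol omitted, names as in replication.c):

// Condensed outline of readSyncBulkPayload() (replication.c); not the full function.
void readSyncBulkPayload_outline(aeEventLoop *el, int fd, void *privdata, int mask) {
    // 1. On the first call, parse the "$<length>\r\n" header into
    //    server.repl_transfer_size.
    // 2. On each readable event, read() a chunk from fd and write() it to
    //    the temp file server.repl_transfer_fd, fsync'ing periodically and
    //    updating repl_transfer_read / repl_transfer_lastio.
    // 3. Once repl_transfer_read reaches repl_transfer_size:
    //    - rename() the temp file to server.rdb_filename,
    //    - empty the old data set and load the new RDB with rdbLoad(),
    //    - create server.master from the fd (replicationCreateMasterClient),
    //    - set server.repl_state = REPL_STATE_CONNECTED.
}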
3.2 Sending the output buffer
After the master finishes sending the RDB file, putSlaveOnline() sets the slave client's replication state to SLAVE_STATE_ONLINE: the RDB transfer is done and the buffered updates have to follow. A new writable event is registered, with sendReplyToClient as the handler, and the slave client is passed into sendReplyToClient() as the private data, making its output buffer the one to flush.
Registering the writable event triggers a first write immediately, running sendReplyToClient(), which in turn simply calls writeToClient(fd,privdata,1); that sends the contents of the slave client's output buffer to the slave server. This brings the master and slave database states back in line.
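For reference, sendReplyToClient() in networking.c is only a thin wrapper:

// Writable-event handler: flush the client's output buffer
void sendReplyToClient(aeEventLoop *el, int fd, void *privdata, int mask) {
    UNUSED(el);
    UNUSED(mask);
    writeToClient(fd,privdata,1);
}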
3.3 Command propagation
Master and slave are consistent after the first full sync, but every write command the master executes afterwards changes its database state and can make the two diverge again. To bring them back in line, the master must propagate every command it executes to its slaves.
Propagation goes through propagate() in server.c:
void propagate(struct redisCommand *cmd, int dbid, robj **argv, int argc,
               int flags)
{
    // Propagate to the AOF
    if (server.aof_state != AOF_OFF && flags & PROPAGATE_AOF)
        feedAppendOnlyFile(cmd,dbid,argv,argc);
    // Propagate to the slaves
    if (flags & PROPAGATE_REPL)
        replicationFeedSlaves(server.slaves,dbid,argv,argc);
}
There are two propagation targets: the AOF, where the command is appended to the AOF file, and the slaves, handled by replicationFeedSlaves():
// The master sends the command (the argument vector) to its slaves
void replicationFeedSlaves(list *slaves, int dictid, robj **argv, int argc) {
    listNode *ln;
    listIter li;
    int j, len;
    char llstr[LONG_STR_SIZE];

    /* If there aren't slaves, and there is no backlog buffer to populate,
     * we can return ASAP. */
    // No backlog and no slaves means replication was never started; return at once
    if (server.repl_backlog == NULL && listLength(slaves) == 0) return;

    /* We can't have slaves attached and no backlog. */
    serverAssert(!(listLength(slaves) != 0 && server.repl_backlog == NULL));

    /* Send SELECT command to every slave if needed. */
    // If the db the slaves last selected differs from the target db,
    // a SELECT command must be emitted
    if (server.slaveseldb != dictid) {
        robj *selectcmd;

        /* For a few DBs we have pre-computed SELECT command. */
        // For 0 <= id < 10 a shared, pre-built SELECT command exists
        if (dictid >= 0 && dictid < PROTO_SHARED_SELECT_CMDS) {
            selectcmd = shared.select[dictid];
        } else { // Otherwise build the SELECT command in protocol form
            int dictid_len;

            dictid_len = ll2string(llstr,sizeof(llstr),dictid);
            selectcmd = createObject(OBJ_STRING,
                sdscatprintf(sdsempty(),
                "*2\r\n$6\r\nSELECT\r\n$%d\r\n%s\r\n",
                dictid_len, llstr));
        }

        /* Add the SELECT command into the backlog. */
        // Add the SELECT command to the backlog
        if (server.repl_backlog) feedReplicationBacklogWithObject(selectcmd);

        /* Send it to slaves. */
        // Send it to the slaves
        listRewind(slaves,&li);
        // Walk the slave list
        while((ln = listNext(&li))) {
            client *slave = ln->value;
            // Slaves still waiting for a BGSAVE to start get nothing yet; skip them
            if (slave->replstate == SLAVE_STATE_WAIT_BGSAVE_START) continue;
            // Append the SELECT command to this slave's reply
            addReply(slave,selectcmd);
        }
        // Free the temporary object
        if (dictid < 0 || dictid >= PROTO_SHARED_SELECT_CMDS)
            decrRefCount(selectcmd);
    }
    // Record the db the slaves now have selected
    server.slaveseldb = dictid;

    /* Write the command to the replication backlog if any. */
    // Write the command to the backlog, if there is one
    if (server.repl_backlog) {
        char aux[LONG_STR_SIZE+3];

        /* Add the multi bulk reply length. */
        // Build the multi-bulk header "*<argc>\r\n"
        aux[0] = '*';
        len = ll2string(aux+1,sizeof(aux)-1,argc);
        aux[len+1] = '\r';
        aux[len+2] = '\n';
        // Append it to the backlog
        feedReplicationBacklog(aux,len+3);

        // For every argument
        for (j = 0; j < argc; j++) {
            // Length of the argument object
            long objlen = stringObjectLen(argv[j]);

            /* We need to feed the buffer with the object as a bulk reply
             * not just as a plain string, so create the $..CRLF payload len
             * and add the final CRLF */
            // Each argument is fed as a bulk reply: "$<len>\r\n<argv>\r\n"
            aux[0] = '$';
            len = ll2string(aux+1,sizeof(aux)-1,objlen);
            aux[len+1] = '\r';
            aux[len+2] = '\n';
            // Append "$<len>\r\n"
            feedReplicationBacklog(aux,len+3);
            // Append the argument itself
            feedReplicationBacklogWithObject(argv[j]);
            // Append the trailing "\r\n"
            feedReplicationBacklog(aux+len+1,2);
        }
    }

    /* Write the command to every slave. */
    // Propagate the command to every slave
    listRewind(server.slaves,&li);
    // Walk the slave list
    while((ln = listNext(&li))) {
        client *slave = ln->value;

        /* Don't feed slaves that are still waiting for BGSAVE to start */
        // Skip slaves still waiting for a BGSAVE to start
        if (slave->replstate == SLAVE_STATE_WAIT_BGSAVE_START) continue;

        /* Feed slaves that are waiting for the initial SYNC (so these commands
         * are queued in the output buffer until the initial SYNC completes),
         * or are already in sync with the master. */
        // Feed slaves waiting for the initial SYNC (the commands queue up in
        // their output buffers until it completes) and slaves already in sync

        /* Add the multi bulk length. */
        // Add the multi bulk length of the reply
        addReplyMultiBulkLen(slave,argc);

        /* Finally any additional argument that was not stored inside the
         * static buffer if any (from j to argc). */
        // Add all the arguments to the slave's output buffer
        for (j = 0; j < argc; j++)
            addReplyBulk(slave,argv[j]);
    }
}
As you can see, replicationFeedSlaves() writes the master's command first into the replication backlog and then into each slave's output buffer. The copy kept in the master's backlog, server.repl_backlog, is what makes partial resynchronization possible after a brief network outage.
4 Partial resynchronization
The analysis above covered full synchronization, but not the failure case: what does Redis do if the network fails while the RDB file is being transferred and the master-slave connection drops? Before version 2.8, Redis would reconnect and perform another full sync, which is very inefficient; later versions implement partial resynchronization.
Within the replication flow, partial resync corresponds to step 6 of section 2.3, sending the PSYNC command; all the other steps still happen. The only difference is the master's reply to the slave: +CONTINUE means a partial resync follows, +FULLRESYNC means a full sync.
4.1 Heartbeats
How does the master discover that the connection to a slave has dropped? After the connection is established, master and slave keep the long-lived link alive by sending heartbeat commands to each other, each side acting as a client of the other.
By default the master sends PING every 10 seconds to check the slaves' connection state.
Config option: repl-ping-slave-period, default 10.
/* First, send PING according to ping_slave_period. */
// Send a PING to all connected (ONLINE) slaves
if ((replication_cron_loops % server.repl_ping_slave_period) == 0) {
    // Build the PING command object
    ping_argv[0] = createStringObject("PING",4);
    // Feed the PING command to the slaves
    replicationFeedSlaves(server.slaves, server.slaveseldb,
        ping_argv, 1);
    decrRefCount(ping_argv[0]);
}
- The slave, in the main thread, sends REPLCONF ACK <offset> once per second, reporting its current replication offset to the master.
/* Send ACK to master from time to time.
 * Note that we do not send periodic acks to masters that don't
 * support PSYNC and replication offsets. */
// Periodically send an ACK to the master; masters flagged
// CLIENT_PRE_PSYNC (older Redis, no PSYNC or replication offsets)
// are skipped
if (server.masterhost && server.master &&
    !(server.master->flags & CLIENT_PRE_PSYNC))
    replicationSendAck();
Both snippets are in the periodic function replicationCron().
In short: the master sends PING to its slaves, and also sends newlines to refresh the ack time so that waiting for an RDB is not mistaken for a lost link; the slave periodically reports its offset to the master; and when the master-slave interaction times out, the peer's client is freed and the connection released.
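replicationSendAck() is short; this is essentially what it does in replication.c:

// Send REPLCONF ACK <offset> to the master, reporting how much of the
// replication stream the slave has processed so far.
void replicationSendAck(void) {
    client *c = server.master;

    if (c != NULL) {
        // Master clients normally get no replies; force this one
        c->flags |= CLIENT_MASTER_FORCE_REPLY;
        addReplyMultiBulkLen(c,3);
        addReplyBulkCString(c,"REPLCONF");
        addReplyBulkCString(c,"ACK");
        addReplyBulkLongLong(c,c->reploff);
        c->flags &= ~CLIENT_MASTER_FORCE_REPLY;
    }
}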
4.2 The replication backlog
The replication backlog is a circular buffer, 1MB by default. During command propagation the master not only sends each command to all its slaves but also writes it into the backlog (see command propagation in 3.3 above).
That means the backlog holds at most the latest 1MB (by default) of the replication stream. If master and slave stay disconnected long enough that the master accepts more than that amount of writes in the meantime, the data the slave needs has been overwritten; a partial resync is then impossible and only a full resync remains.
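Writing into the backlog is a plain circular-buffer copy; feedReplicationBacklog() in replication.c does essentially the following:

// Append len bytes to the circular backlog, advancing the global offset.
void feedReplicationBacklog(void *ptr, size_t len) {
    unsigned char *p = ptr;

    // The master's global replication offset always advances
    server.master_repl_offset += len;

    // Copy in chunks, wrapping around at the end of the buffer
    while(len) {
        size_t thislen = server.repl_backlog_size - server.repl_backlog_idx;
        if (thislen > len) thislen = len;
        memcpy(server.repl_backlog+server.repl_backlog_idx,p,thislen);
        server.repl_backlog_idx += thislen;
        if (server.repl_backlog_idx == server.repl_backlog_size)
            server.repl_backlog_idx = 0;
        len -= thislen;
        p += thislen;
        server.repl_backlog_histlen += thislen;
    }
    // The history can never exceed the buffer size
    if (server.repl_backlog_histlen > server.repl_backlog_size)
        server.repl_backlog_histlen = server.repl_backlog_size;
    // Offset of the first byte still available in the backlog
    server.repl_backlog_off = server.master_repl_offset -
                              server.repl_backlog_histlen + 1;
}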
masterTryPartialResynchronization(), called from the syncCommand() introduced in 3.1, is what attempts the partial resync. During the first-time full sync analyzed earlier this attempt fails and syncCommand() continues with a full sync; once a dropped master-slave connection is re-established, the partial resync can succeed instead. Here is the function:
// Handles the PSYNC command from the master's point of view, i.e. a partial
// resynchronization request from a slave.
// Returns C_OK if the partial resync is performed, C_ERR otherwise.
int masterTryPartialResynchronization(client *c) {
    long long psync_offset, psync_len;
    char *master_runid = c->argv[1]->ptr; // run ID the slave asked for
    char buf[128];
    int buflen;

    /* Is the runid of this master the same advertised by the wannabe slave
     * via PSYNC? If runid changed this master is a different instance and
     * there is no way to continue. */
    // Does the run ID the slave passed to PSYNC match ours? If the run ID
    // changed, this master is a different instance and the old replication
    // stream cannot be continued.
    if (strcasecmp(master_runid, server.runid)) {
        /* Run id "?" is used by slaves that want to force a full resync. */
        // A run id of "?" means the slave wants to force a full resync
        if (master_runid[0] != '?') {
            serverLog(LL_NOTICE,"Partial resynchronization not accepted: "
                "Runid mismatch (Client asked for runid '%s', my runid is '%s')",
                master_runid, server.runid);
        } else {
            serverLog(LL_NOTICE,"Full resync requested by slave %s",
                replicationGetSlaveName(c));
        }
        goto need_full_resync;
    }

    /* We still have the data our slave is asking for? */
    // Extract psync_offset from the argument
    if (getLongLongFromObjectOrReply(c,c->argv[2],&psync_offset,NULL) !=
        C_OK) goto need_full_resync;
    // If psync_offset falls outside the backlog, a full resync is needed
    if (!server.repl_backlog ||
        psync_offset < server.repl_backlog_off ||
        psync_offset > (server.repl_backlog_off + server.repl_backlog_histlen))
    {
        serverLog(LL_NOTICE,
            "Unable to partial resync with slave %s for lack of backlog (Slave request was: %lld).", replicationGetSlaveName(c), psync_offset);
        if (psync_offset > server.master_repl_offset) {
            serverLog(LL_WARNING,
                "Warning: slave %s tried to PSYNC with an offset that is greater than the master replication offset.", replicationGetSlaveName(c));
        }
        goto need_full_resync;
    }

    /* If we reached this point, we are able to perform a partial resync:
     * 1) Set client state to make it a slave.
     * 2) Inform the client we can continue with +CONTINUE
     * 3) Send the backlog data (from the offset to the end) to the slave. */
    // Here a partial resync is possible:
    // 1. flag the client as a slave
    // 2. reply +CONTINUE, meaning the partial resync is accepted
    // 3. send the backlog data (from the offset to the end) to the slave
    // Flag the client as a slave
    c->flags |= CLIENT_SLAVE;
    // The slave is online right away: no RDB transfer, only diff data to send
    c->replstate = SLAVE_STATE_ONLINE;
    // Record the ack time
    c->repl_ack_time = server.unixtime;
    // No need to wait for a REPLCONF ACK before putting it online
    c->repl_put_online_on_ack = 0;
    // Add the client to the slave list
    listAddNodeTail(server.slaves,c);
    /* We can't use the connection buffers since they are used to accumulate
     * new commands at this stage. But we are sure the socket send buffer is
     * empty so this write will never fail actually. */
    // Send +CONTINUE to the slave
    buflen = snprintf(buf,sizeof(buf),"+CONTINUE\r\n");
    if (write(c->fd,buf,buflen) != buflen) {
        freeClientAsync(c);
        return C_OK;
    }
    // Send the backlog data to the slave
    psync_len = addReplyReplicationBacklog(c,psync_offset);
    serverLog(LL_NOTICE,
        "Partial resynchronization request from %s accepted. Sending %lld bytes of backlog starting from offset %lld.",
        replicationGetSlaveName(c),
        psync_len, psync_offset);
    /* Note that we don't need to set the selected DB at server.slaveseldb
     * to -1 to force the master to emit SELECT, since the slave already
     * has this state from the previous connection with the master. */

    // Refresh the count of slaves whose lag is below min-slaves-max-lag
    refreshGoodSlavesCount();
    return C_OK; /* The caller can return, no full resync needed. */

need_full_resync:
    /* We need a full resync for some reason... Note that we can't
     * reply to PSYNC right now if a full SYNC is needed. The reply
     * must include the master offset at the time the RDB file we transfer
     * is generated, so we need to delay the reply to that moment. */
    return C_ERR;
}
The function first decides whether a partial resync is possible; if it is, the master replies "+CONTINUE\r\n" to the slave's PSYNC (see 2.3) and then calls addReplyReplicationBacklog() to send the backlog data to the slave, which completes the partial resynchronization.
All addReplyReplicationBacklog() does is copy the relevant part of the backlog into the slave client's output buffer.
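Its core is locating psync_offset inside the circular buffer and feeding everything from there to the end of the backlog; roughly:

// Feed slave 'c' with the backlog starting from 'offset'; returns the
// number of bytes sent (slightly simplified from replication.c).
long long addReplyReplicationBacklog(client *c, long long offset) {
    long long j, skip, len;

    if (server.repl_backlog_histlen == 0) return 0;

    // How many bytes of the available history to skip
    skip = offset - server.repl_backlog_off;
    // Index of the oldest byte in the circular buffer...
    j = (server.repl_backlog_idx +
        (server.repl_backlog_size-server.repl_backlog_histlen)) %
        server.repl_backlog_size;
    // ...moved forward by 'skip' bytes
    j = (j + skip) % server.repl_backlog_size;

    // The data may wrap around, so feed it in at most two chunks
    len = server.repl_backlog_histlen - skip;
    while(len) {
        long long thislen =
            ((server.repl_backlog_size - j) < len) ?
            (server.repl_backlog_size - j) : len;
        addReplySds(c,sdsnewlen(server.repl_backlog + j, thislen));
        len -= thislen;
        j = 0;
    }
    return server.repl_backlog_histlen - skip;
}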
******************
To sum up: replication.c is long, and it is easy to lose the thread on a first read. Don't get stuck on the details of individual functions; first get a rough picture of what the master and the slave each go through and how the calls correspond, then go back for the details, or it is easy to get dizzy.
References:
https://blog.csdn.net/men_wen/article/details/72628439
http://morningxb.cn/2017/11/22/redis%E6%BA%90%E7%A0%81-21-%E2%80%94%E2%80%94replication-c%E7%AF%87/