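Contents
1.1 Replication-related members of the redisServer struct
1.3.2 Key sub-step: replicationSetMaster
1.3.3 The range of server.repl_state values
1.5.1 serverCron calls replicationCron
1.5.3 connectWithMaster: establishing the master-replica connection
0. Reading and references
menwen - Redis 复制(replicate)源码详细解析 (a detailed source-code analysis of Redis replication)
《Redis开发与运维》, Chapter 6
《Redis5 设计与源码分析》, Chapter 21, "Master-Replica Replication"
《Redis深度历险》, Principle 8, "Master-Replica Synchronization"
Search this article for the keyword 【待确定】 (TBC, "to be confirmed") to find the points that still need review.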
1. Source analysis of the slave (replica) side
1.1 Replication-related members of the redisServer struct
struct redisServer {
...
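/* All connected replicas, kept in a linked list whose node values are of type client. */
/* All clients in MONITOR mode, also a linked list whose node values are of type client. */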
list *slaves, *monitors; /* List of slaves and MONITORs */
/* Replication (master) */
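/* Replication ID currently used by this instance. */
char replid[CONFIG_RUN_ID_SIZE+1]; /* My current replication ID. */
/* Previous replication ID, inherited from the former master. */
char replid2[CONFIG_RUN_ID_SIZE+1]; /* replid inherited from master */
/* Current replication offset: the offset of the last byte produced under the current replication ID. */
long long master_repl_offset; /* My current replication offset */
/* Highest offset that is still accepted for the previous replication ID (replid2). */
long long second_replid_offset; /* Accept offsets up to this for replid2*/
/* TBC (待确定) */
int slaveseldb; /* Last SELECTed DB in replication output */
/*
The master and its replicas exchange data over a long-lived TCP connection, so heartbeats
must be sent periodically to check that the connection is still alive. This field is the
heartbeat period: the master PINGs all of its replicas once per period. It is set with
repl-ping-replica-period (or repl-ping-slave-period) and defaults to 10 seconds.
*/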
int repl_ping_slave_period; /* Master pings the slave every N seconds */
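/*
Replication backlog: a buffer holding the commands the master has already executed, so they
can be sent to a replica during a partial resync. Its size is given by repl_backlog_size,
which is set with repl-backlog-size and defaults to 1MB.
*/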
char *repl_backlog; /* Replication backlog for partial syncs */
/* Size of the replication backlog. */
long long repl_backlog_size; /* Backlog circular buffer size */
/* Length of the command data currently stored in the backlog. */
long long repl_backlog_histlen; /* Backlog actual data length */
/* Index just past the last byte stored in the backlog, i.e. where the next write starts. */
long long repl_backlog_idx; /* Backlog circular buffer current offset,
that is the next byte we'll write to.*/
/* Replication offset of the first byte in the backlog. */
long long repl_backlog_off; /* Replication "master offset" of first
byte in the replication backlog buffer.*/
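/* TBC (待确定) */
time_t repl_backlog_time_limit; /* Time without slaves after the backlog
gets released. */
/* Unix time since which this instance has had no replicas. */
time_t repl_no_slaves_since; /* We have no slaves since that time.
Only valid if server.slaves len is 0. */
/* When the number of good replicas is below this value, the master refuses write commands. */
int repl_min_slaves_to_write; /* Min number of slaves to write. */
/* Lag threshold used to decide whether a replica is considered failed. */
int repl_min_slaves_max_lag; /* Max lag of <count> slaves to write. */
/*
Number of replicas currently considered good.
What makes a replica good? The master and its replicas talk over a long-lived TCP connection
and use heartbeats to check it. The master records, per replica, the time of the last
successful heartbeat in repl_ack_time, and periodically checks whether the time elapsed
since repl_ack_time exceeds a threshold. If it does, the replica is considered failed.
That threshold is stored in repl_min_slaves_max_lag, set with min-slaves-max-lag
(or min-replicas-max-lag), default 10, in seconds.
*/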
int repl_good_slaves_count; /* Number of slaves with lag <= max_lag. */
/* TODO: to be studied (待研究) */
int repl_diskless_sync; /* Master send RDB to slaves sockets directly. */
/* TODO: to be studied (待研究) */
int repl_diskless_load; /* Slave parse RDB directly from the socket.
* see REPL_DISKLESS_LOAD_* enum */
/* TODO: to be studied (待研究) */
int repl_diskless_sync_delay; /* Delay to start a diskless repl BGSAVE. */
/* Replication (slave) */
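/* User name sent with AUTH when authenticating with the master. */
char *masteruser; /* AUTH with this user and masterauth with master */
/* Password that goes with masteruser. When the master is configured with
"requirepass password", a replica must authenticate before it can sync from the master,
so the replica needs a matching "masterauth <master-password>" setting, which is the
password used when requesting the sync.
*/
char *masterauth; /* AUTH with this password with master */
/* Host name or IP of the master. */
char *masterhost; /* Hostname of master */
/* Port of the master. */
int masterport; /* Port of master */
/* TODO: to be studied (待研究) */
int repl_timeout; /* Timeout after N seconds of master idle */
/* Once the connection is established, the replica is a client of the master, and the master
is likewise represented as a client on the replica. master is that client; its type is client.
*/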
client *master; /* Client that is master for this slave */
client *cached_master; /* Cached master to be reused for PSYNC. */
int repl_syncio_timeout; /* Timeout for synchronous I/O calls */
/* Progress of the replication process (state of this instance as a replica). */
int repl_state; /* Replication status if the instance is a slave */
off_t repl_transfer_size; /* Size of RDB to read from master during sync. */
off_t repl_transfer_read; /* Amount of RDB read from master during sync. */
off_t repl_transfer_last_fsync_off; /* Offset when we fsync-ed last time. */
connection *repl_transfer_s; /* Slave -> Master SYNC connection */
int repl_transfer_fd; /* Slave -> Master SYNC temp file descriptor */
char *repl_transfer_tmpfile; /* Slave-> master SYNC temp file name */
time_t repl_transfer_lastio; /* Unix time of the latest read, for timeout */
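/* Whether the replica keeps serving commands while the link with the master is down.
Set with slave-serve-stale-data (or replica-serve-stale-data); defaults to 1, i.e. the
replica keeps serving commands.
*/
int repl_serve_stale_data; /* Serve stale data when link is down? */
/* Whether the replica is read-only (rejects write commands other than those sent by the master).
Set with slave-read-only (or replica-read-only); defaults to 1, i.e. the replica rejects
write commands unless they come from the master.
*/
int repl_slave_ro; /* Slave is read only? */
/* Whether the replica ignores maxmemory, i.e. does not evict keys on its own. */
int repl_slave_ignore_maxmemory; /* If true slaves do not evict. */
/* Unix time at which the link with the master went down. */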
time_t repl_down_since; /* Unix time at which link with master went down */
int repl_disable_tcp_nodelay; /* Disable TCP_NODELAY after SYNC? */
int slave_priority; /* Reported in INFO and used by Sentinel. */
int slave_announce_port; /* Give the master this listening port. */
char *slave_announce_ip; /* Give the master this ip address. */
/* The following two fields is where we store master PSYNC replid/offset
* while the PSYNC is in progress. At the end we'll copy the fields into
* the server->master client structure. */
char master_replid[CONFIG_RUN_ID_SIZE+1]; /* Master PSYNC runid. */
long long master_initial_offset; /* Master PSYNC offset. */
int repl_slave_lazy_flush; /* Lazy FLUSHALL before loading DB? */
/* Replication script cache. */
dict *repl_scriptcache_dict; /* SHA1 all slaves are aware of. */
list *repl_scriptcache_fifo; /* First in, first out LRU eviction. */
unsigned int repl_scriptcache_size; /* Max number of elements. */
/* Synchronous replication. */
list *clients_waiting_acks; /* Clients waiting in WAIT command. */
int get_ack_from_slaves; /* If true we send REPLCONF GETACK. */
...
}
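To make the relationship between repl_backlog_idx, repl_backlog_histlen, repl_backlog_off and master_repl_offset concrete, here is a minimal sketch of a circular backlog. The type backlog_t and the function backlog_feed are hypothetical names used only for illustration, and the update rules are an assumption modeled on the field comments above, not a copy of the Redis implementation.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified stand-in for the backlog fields of redisServer. */
typedef struct {
    char *buf;             /* plays the role of repl_backlog */
    long long size;        /* repl_backlog_size */
    long long idx;         /* repl_backlog_idx: where the next byte is written */
    long long histlen;     /* repl_backlog_histlen: number of valid bytes stored */
    long long off;         /* repl_backlog_off: replication offset of the first valid byte */
    long long master_off;  /* master_repl_offset */
} backlog_t;

/* Append len bytes, wrapping to the start when the end of the buffer is reached. */
static void backlog_feed(backlog_t *b, const char *p, size_t len) {
    b->master_off += (long long)len;
    while (len) {
        size_t room = (size_t)(b->size - b->idx);
        size_t n = len < room ? len : room;
        memcpy(b->buf + b->idx, p, n);
        b->idx += (long long)n;
        if (b->idx == b->size) b->idx = 0;   /* wrap around */
        b->histlen += (long long)n;
        p += n;
        len -= n;
    }
    /* The buffer can never hold more than size bytes of history. */
    if (b->histlen > b->size) b->histlen = b->size;
    /* Offset of the oldest byte that is still available. */
    b->off = b->master_off - b->histlen + 1;
}
```

With an 8-byte buffer, feeding 5 bytes and then 5 more leaves idx at 2, histlen at 8 and off at 3: the two oldest bytes have been overwritten, so a replica asking for offset 1 or 2 could no longer be served from the backlog.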
The function refreshGoodSlavesCount implements the check that decides which replicas count as good.
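The idea behind that check can be sketched as follows. This is a simplified illustration, not the Redis code: replica_t, count_good_replicas and writes_allowed are hypothetical names, and the real function walks server.slaves and reads the thresholds from the server struct.

```c
#include <stddef.h>
#include <time.h>

/* Hypothetical, reduced view of a replica client. */
typedef struct {
    int online;            /* stands in for replstate == SLAVE_STATE_ONLINE */
    time_t repl_ack_time;  /* time of the last acknowledgement from this replica */
} replica_t;

/* Count replicas that are online and whose lag does not exceed max_lag seconds. */
static int count_good_replicas(const replica_t *r, size_t n,
                               time_t now, time_t max_lag) {
    int good = 0;
    for (size_t i = 0; i < n; i++) {
        if (!r[i].online) continue;
        if (now - r[i].repl_ack_time <= max_lag) good++;
    }
    return good;
}

/* A write is accepted when the feature is off (min_replicas == 0)
 * or when enough good replicas are attached. */
static int writes_allowed(int good, int min_replicas) {
    return min_replicas == 0 || good >= min_replicas;
}
```

The result of the count corresponds to repl_good_slaves_count, and the comparison against repl_min_slaves_to_write is what makes the master refuse writes.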
1.2 Starting from the slaveof command
/* Surprisingly, the slaveof command does not call slaveofCommand but replicaofCommand. */
struct redisCommand redisCommandTable[] = {
...
{"slaveof",replicaofCommand,3,
"admin no-script ok-stale",
0,NULL,0,0,0,0,0,0},
/* Note: both commands call the same function. */
{"replicaof",replicaofCommand,3,
"admin no-script ok-stale",
0,NULL,0,0,0,0,0,0},
...
}
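1.3 A look at replicaofCommand
1.3.1 Implementation of replicaofCommand
What replicaofCommand does:
- Check whether the server runs in cluster mode. If so, the command is not allowed: reply with an error and return.
- If the command is slaveof no one, end the master-replica relationship and make the current node a master.
- Check whether this server is already a replica of the master identified by the given host and port. If so, there is nothing to do: reply accordingly and return.
- If none of the three cases above applies, call replicationSetMaster to make the server that executes slaveof a replica of the master identified by the given host and port.
/* Implementation of the slaveof host port command */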
void replicaofCommand(client *c) {
/* SLAVEOF is not allowed in cluster mode as replication is automatically
* configured using the current address of the master node. */
/* The command is not allowed when the server runs in cluster mode. */
if (server.cluster_enabled) {
addReplyError(c,"REPLICAOF not allowed in cluster mode.");
return;
}
/* The special host/port combination "NO" "ONE" turns the instance
* into a master. Otherwise the new master address is set. */
/* SLAVEOF NO ONE turns replication off on this replica and promotes it back to a master;
the data set synchronized so far is not discarded. */
if (!strcasecmp(c->argv[1]->ptr,"no") &&
!strcasecmp(c->argv[2]->ptr,"one")) {
/* If this server currently has a master (masterhost is not NULL) */
if (server.masterhost) {
/* Stop replicating and make this server a master. */
replicationUnsetMaster();
/* Get the client's information as an sds string and write it to the log. */
sds client = catClientInfoString(sdsempty(),c);
serverLog(LL_NOTICE,"MASTER MODE enabled (user request from '%s')",
client);
/* Free the sds string. */
sdsfree(client);
}
} else {
/* Temporary variable for the port. */
long port;
/* If the client issuing the command is itself a replica */
if (c->flags & CLIENT_SLAVE)
{
/* If a client is already a replica they cannot run this command,
* because it involves flushing all replicas (including this
* client) */
/* Reply with an error: the command is not valid when the client is a replica. */
addReplyError(c, "Command is not valid when client is a replica.");
return;
}
/* Parse the port number. */
if ((getLongFromObjectOrReply(c, c->argv[2], &port, NULL) != C_OK))
return;
/* Check if we are already attached to the specified master */
/*
If we already have a master, and the host and port given in the command equal
server.masterhost and server.masterport, reply that we are already connected to the
specified master and return.
*/
if (server.masterhost && !strcasecmp(server.masterhost,c->argv[1]->ptr)
&& server.masterport == port) {
serverLog(LL_NOTICE,"REPLICAOF would result into synchronization "
"with the master we are already connected "
"with. No operation performed.");
addReplySds(c,sdsnew("+OK Already connected to specified "
"master\r\n"));
return;
}
/* There was no previous master or the user specified a different one,
* we can continue. */
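/* Either this is the first time a master is set, or the user specified a different master;
in both cases we can continue. */
/* Set the master's host and port. */
replicationSetMaster(c->argv[1]->ptr, port);
/* Get the client's information as an sds string, write it to the log, then free it. */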
sds client = catClientInfoString(sdsempty(),c);
serverLog(LL_NOTICE,"REPLICAOF %s:%d enabled (user request from '%s')",
server.masterhost, server.masterport, client);
sdsfree(client);
}
/* Reply OK. */
addReply(c,shared.ok);
}
/* Below: getLongFromObjectOrReply, used above to parse the port, and the helpers it relies on. */
int getLongFromObjectOrReply(client *c, robj *o, long *target, const char *msg) {
long long value;
if (getLongLongFromObjectOrReply(c, o, &value, msg) != C_OK) return C_ERR;
if (value < LONG_MIN || value > LONG_MAX) {
if (msg != NULL) {
addReplyError(c,(char*)msg);
} else {
addReplyError(c,"value is out of range");
}
return C_ERR;
}
*target = value;
return C_OK;
}
int getLongLongFromObject(robj *o, long long *target) {
long long value;
if (o == NULL) {
value = 0;
} else {
serverAssertWithInfo(NULL,o,o->type == OBJ_STRING);
if (sdsEncodedObject(o)) {
if (string2ll(o->ptr,sdslen(o->ptr),&value) == 0) return C_ERR;
} else if (o->encoding == OBJ_ENCODING_INT) {
value = (long)o->ptr;
} else {
serverPanic("Unknown string encoding");
}
}
if (target) *target = value;
return C_OK;
}
#define serverAssertWithInfo(_c,_o,_e) ((_e)?(void)0 : (_serverAssertWithInfo(_c,_o,#_e,__FILE__,__LINE__),_exit(1)))
void _serverAssertWithInfo(const client *c, const robj *o, const char *estr, const char *file, int line) {
if (c) _serverAssertPrintClientInfo(c);
if (o) _serverAssertPrintObject(o);
_serverAssert(estr,file,line);
}
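1.3.2 Key sub-step: replicationSetMaster
1.3.2.1
/* Set replication to the specified master address and port. */
/* Make this server a replica of the master identified by the given ip and port. */
void replicationSetMaster(char *ip, int port) {
/* == binds tighter than =, so the comparison is evaluated first. */
/* was_master records whether server.masterhost was NULL, i.e. whether this instance was a master until now. */
int was_master = server.masterhost == NULL;
/* Free the previous contents of server.masterhost. */
sdsfree(server.masterhost);
/* Set the new values of server.masterhost and server.masterport. */
server.masterhost = sdsnew(ip);
server.masterport = port;
/* One way to read the next step:
suppose the current node is B and its master is A, and B now wants to become a replica of C. B must then release what it holds about A.
While A is B's master the two keep a connection open, and B represents A as a client stored in server.master.
*/
/* If server.master is not NULL */
if (server.master) {
freeClient(server.master); /* release the client that represents the old master */
}
/*
Disconnect all blocked clients. This instance may have been a master and is now becoming
a replica, so clients that blocked on it while it was a master can no longer be served as before.
*/
disconnectAllBlockedClients(); /* Clients blocked in master, now slave. */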
/* Update oom_score_adj */
/* Update the OOM score adjustment (oom_score_adj). */
setOOMScoreAdj(-1);
/* Force our slaves to resync with us as well. They may hopefully be able
* to partially resync with us, but we can notify the replid change. */
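/* Close the connections of all our replicas, forcing them to resync with us. */
disconnectSlaves();
/* Cancel any replication handshake that is in progress. */
cancelReplicationHandshake();
/* Before destroying our master state, create a cached master using
* our own parameters, to later PSYNC with the new master. */
/* If this instance was a master (server.masterhost was NULL) */
if (was_master) {
/* Discard any previously cached master state; see 1.3.2.3. */
replicationDiscardCachedMaster();
/* Build a cached master from our own replication ID and offset, so that a partial resync
with the new master may be possible later. The function ends by setting
server.cached_master = server.master; see 1.3.2.4.
*/
replicationCacheMasterUsingMyself();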
}
/* Fire the role change modules event. */
/* Fire the module event for the server role change. */
moduleFireServerEvent(REDISMODULE_EVENT_REPLICATION_ROLE_CHANGED,
REDISMODULE_EVENT_REPLROLECHANGED_NOW_REPLICA,
NULL);
/* Fire the master link modules event. */
/* If server.repl_state is REPL_STATE_CONNECTED, fire the module event reporting that the master link is down. */
if (server.repl_state == REPL_STATE_CONNECTED)
moduleFireServerEvent(REDISMODULE_EVENT_MASTER_LINK_CHANGE,
REDISMODULE_SUBEVENT_MASTER_LINK_DOWN,
NULL);
server.repl_state = REPL_STATE_CONNECT;
}
1.3.2.2
/*
This function aborts a non-blocking replication attempt that is in progress. If a handshake
(or the initial bulk transfer) was in progress, it returns 1 and sets server.repl_state to
REPL_STATE_CONNECT; otherwise it returns 0 and does nothing.
*/
/* This function aborts a non blocking replication attempt if there is one
* in progress, by canceling the non-blocking connect attempt or
* the initial bulk transfer.
*
* If there was a replication handshake in progress 1 is returned and
* the replication state (server.repl_state) set to REPL_STATE_CONNECT.
*
* Otherwise zero is returned and no operation is performed at all. */
int cancelReplicationHandshake(void) {
if (server.repl_state == REPL_STATE_TRANSFER) {
replicationAbortSyncTransfer();
server.repl_state = REPL_STATE_CONNECT;
} else if (server.repl_state == REPL_STATE_CONNECTING ||
slaveIsInHandshakeState())
{
undoConnectWithMaster();
server.repl_state = REPL_STATE_CONNECT;
} else {
return 0;
}
return 1;
}
1.3.2.3
/* Free the cached master. Called when the conditions for a partial resync on reconnection no longer hold. */
/* Free a cached master, called when there are no longer the conditions for
* a partial resync on reconnection. */
void replicationDiscardCachedMaster(void) {
if (server.cached_master == NULL) return;
serverLog(LL_NOTICE,"Discarding previously cached master state.");
server.cached_master->flags &= ~CLIENT_MASTER;
freeClient(server.cached_master);
server.cached_master = NULL;
}
1.3.2.4
/* This function is called when a master is turned into a slave, in order to
* create from scratch a cached master for the new client, that will allow
* to PSYNC with the slave that was promoted as the new master after a
* failover.
*
* Assuming this instance was previously the master instance of the new master,
* the new master will accept its replication ID, and potentially also the
* current offset if no data was lost during the failover. So we use our
* current replication ID and offset in order to synthesize a cached master. */
void replicationCacheMasterUsingMyself(void) {
serverLog(LL_NOTICE,
"Before turning into a replica, using my own master parameters "
"to synthesize a cached master: I may be able to synchronize with "
"the new master with just a partial transfer.");
/* This will be used to populate the field server.master->reploff
* by replicationCreateMasterClient(). We'll later set the created
* master as server.cached_master, so the replica will use such
* offset for PSYNC. */
server.master_initial_offset = server.master_repl_offset;
/* The master client we create can be set to any DBID, because
* the new master will start its replication stream with SELECT. */
replicationCreateMasterClient(NULL,-1);
/* Use our own ID / offset. */
memcpy(server.master->replid, server.replid, sizeof(server.replid));
/* Set as cached master. */
unlinkClient(server.master);
server.cached_master = server.master;
server.master = NULL;
}
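1.3.3 The range of server.repl_state values
As seen in 1.1, the repl_state member holds the replication state of this server when it acts as a replica.
The code in 1.3.2 tests server.repl_state in many places. Understanding its state transitions helps in understanding the replication flow, so let's look at the states a replica can be in: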
/* Slave replication state. Used in server.repl_state for slaves to remember
* what to do next. */
/* State of a replica; it tells the replica what to do next. */
#define REPL_STATE_NONE 0 /* No active replication */
#define REPL_STATE_CONNECT 1 /* Must connect to master */
#define REPL_STATE_CONNECTING 2 /* Connecting to master */
/* --- Handshake states, must be ordered --- */
#define REPL_STATE_RECEIVE_PONG 3 /* Wait for PING reply */
#define REPL_STATE_SEND_AUTH 4 /* Send AUTH to master */
#define REPL_STATE_RECEIVE_AUTH 5 /* Wait for AUTH reply */
#define REPL_STATE_SEND_PORT 6 /* Send REPLCONF listening-port */
#define REPL_STATE_RECEIVE_PORT 7 /* Wait for REPLCONF reply */
#define REPL_STATE_SEND_IP 8 /* Send REPLCONF ip-address */
#define REPL_STATE_RECEIVE_IP 9 /* Wait for REPLCONF reply */
#define REPL_STATE_SEND_CAPA 10 /* Send REPLCONF capa */
#define REPL_STATE_RECEIVE_CAPA 11 /* Wait for REPLCONF reply */
#define REPL_STATE_SEND_PSYNC 12 /* Send PSYNC */
#define REPL_STATE_RECEIVE_PSYNC 13 /* Wait for PSYNC reply */
/* --- End of handshake states --- */
#define REPL_STATE_TRANSFER 14 /* Receiving .rdb from master */
#define REPL_STATE_CONNECTED 15 /* Connected to master */
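The states mean the following.
❏ REPL_STATE_NONE: replication is not enabled; this server is a plain Redis instance.
❏ REPL_STATE_CONNECT: a socket connection to the master still has to be initiated.
❏ REPL_STATE_CONNECTING: the non-blocking connect to the master has been started and is in progress.
❏ REPL_STATE_RECEIVE_PONG: PING has been sent; waiting for the master's PONG reply.
❏ REPL_STATE_SEND_AUTH: password authentication still has to be sent.
❏ REPL_STATE_RECEIVE_AUTH: the authentication request "AUTH <password>" has been sent; waiting for the master's reply.
❏ REPL_STATE_SEND_PORT: the port number still has to be sent.
❏ REPL_STATE_RECEIVE_PORT: "REPLCONF listening-port <port>" has been sent; waiting for the master's reply.
❏ REPL_STATE_SEND_IP: the IP address still has to be sent.
❏ REPL_STATE_RECEIVE_IP: "REPLCONF ip-address <ip>" has been sent; waiting for the master's reply. According to the comments in syncWithMaster, the announced port and IP let the master's INFO output list the replica's address correctly, for example behind port forwarding or NAT.
❏ REPL_STATE_SEND_CAPA: replication has been improved across versions, so different Redis versions may support different capabilities. The replica therefore tells the master which replication capabilities it supports with "REPLCONF capa <capability>".
❏ REPL_STATE_RECEIVE_CAPA: waiting for the master's reply.
❏ REPL_STATE_SEND_PSYNC: the PSYNC command still has to be sent.
❏ REPL_STATE_RECEIVE_PSYNC: waiting for the master's reply to PSYNC.
❏ REPL_STATE_TRANSFER: receiving the RDB file.
❏ REPL_STATE_CONNECTED: the RDB file has been received and loaded and the replication link is established; from here on the replica just receives the data the master propagates.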
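Because the handshake states are numbered contiguously ("must be ordered"), membership in the handshake can be tested with a simple range check. The sketch below illustrates that idea together with the cancel logic from 1.3.2.2. The names is_handshake_state and cancel_handshake are hypothetical, the state is passed in explicitly instead of being read from server.repl_state, and the cleanup calls of the real code are omitted.

```c
/* Same values as the REPL_STATE_* macros listed above, written as an enum. */
enum {
    REPL_STATE_NONE = 0,
    REPL_STATE_CONNECT,
    REPL_STATE_CONNECTING,
    REPL_STATE_RECEIVE_PONG,   /* first handshake state */
    REPL_STATE_SEND_AUTH,
    REPL_STATE_RECEIVE_AUTH,
    REPL_STATE_SEND_PORT,
    REPL_STATE_RECEIVE_PORT,
    REPL_STATE_SEND_IP,
    REPL_STATE_RECEIVE_IP,
    REPL_STATE_SEND_CAPA,
    REPL_STATE_RECEIVE_CAPA,
    REPL_STATE_SEND_PSYNC,
    REPL_STATE_RECEIVE_PSYNC,  /* last handshake state */
    REPL_STATE_TRANSFER,
    REPL_STATE_CONNECTED
};

/* The handshake states form one contiguous range, so a range check is enough. */
static int is_handshake_state(int s) {
    return s >= REPL_STATE_RECEIVE_PONG && s <= REPL_STATE_RECEIVE_PSYNC;
}

/* Sketch of the state handling in cancelReplicationHandshake:
 * returns 1 and resets the state to CONNECT if an attempt was in progress. */
static int cancel_handshake(int *state) {
    if (*state == REPL_STATE_TRANSFER ||
        *state == REPL_STATE_CONNECTING ||
        is_handshake_state(*state)) {
        *state = REPL_STATE_CONNECT;
        return 1;
    }
    return 0;
}
```

This also shows why a fully connected replica (REPL_STATE_CONNECTED) is left untouched by the cancel step: that case is handled by freeing server.master instead.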
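1.4 Questions raised by the reading
From reading replicaofCommand in 1.3, it appears that the command only clears the old master's information, closes some old connections where needed, and sets server.repl_state to REPL_STATE_CONNECT (a socket connection to the master still has to be initiated).
That leaves two questions:
(1) When is the connection to the specified master established?
(2) How do the transitions of server.repl_state happen, and which states does it pass through?
Section 1.5 looks at both questions in detail.
1.5 A look at replicationCron
1.5.1 serverCron calls replicationCron
replicaofCommand does not itself connect to the master, so that step must be asynchronous and is most likely done in a time event. Looking at the time event handler serverCron shows that the replication work runs once per second: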
int serverCron(struct aeEventLoop *eventLoop, long long id, void *clientData) {
...
/* Replication cron function -- used to reconnect to master,
* detect transfer failures, start background RDB transfers and so forth. */
run_with_period(1000) replicationCron();
...
}
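The run_with_period(1000) guard turns a handler that fires server.hz times per second into work that runs roughly once per second. A minimal sketch of that scheduling idea follows. It assumes the guard compares the requested period with the length of one cron tick and otherwise uses the loop counter modulo the number of ticks per period; due_this_tick and its parameters are hypothetical names, and the exact macro in the Redis source may differ.

```c
/* Returns 1 when work with the given period should run on this cron tick.
 * period_ms: desired period in milliseconds
 * hz:        how many times per second the cron handler fires
 * cronloops: number of cron ticks executed so far */
static int due_this_tick(int period_ms, int hz, long long cronloops) {
    if (hz <= 0) return 1;                 /* defensive: treat as "always run" */
    int tick_ms = 1000 / hz;
    if (tick_ms <= 0) tick_ms = 1;         /* avoid dividing by zero for very high hz */
    if (period_ms <= tick_ms) return 1;    /* period shorter than a tick: run every tick */
    return cronloops % (period_ms / tick_ms) == 0;
}
```

With hz set to 10, a tick lasts 100 ms, so a 1000 ms period is due on every tenth tick (cronloops 0, 10, 20, and so on).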
1.5.2 Implementation of replicationCron
/* --------------------------- REPLICATION CRON ---------------------------- */
/* Replication cron function, called 1 time per second. */
void replicationCron(void) {
/* */
static long long replication_cron_loops = 0;
/* */
/* Non blocking connection timeout? */
if (server.masterhost &&
(server.repl_state == REPL_STATE_CONNECTING ||
slaveIsInHandshakeState()) &&
(time(NULL)-server.repl_transfer_lastio) > server.repl_timeout)
{
serverLog(LL_WARNING,"Timeout connecting to the MASTER...");
cancelReplicationHandshake();
}
/* */
/* Bulk transfer I/O timeout? */
if (server.masterhost && server.repl_state == REPL_STATE_TRANSFER &&
(time(NULL)-server.repl_transfer_lastio) > server.repl_timeout)
{
serverLog(LL_WARNING,"Timeout receiving bulk data from MASTER... If the problem persists try to set the 'repl-timeout' parameter in redis.conf to a larger value.");
cancelReplicationHandshake();
}
/* */
/* Timed out master when we are an already connected slave? */
if (server.masterhost && server.repl_state == REPL_STATE_CONNECTED &&
(time(NULL)-server.master->lastinteraction) > server.repl_timeout)
{
serverLog(LL_WARNING,"MASTER timeout: no data nor PING received...");
freeClient(server.master);
}
/* Check whether we should try to connect to the master. When server.repl_state is
REPL_STATE_CONNECT (a connection to the master must be initiated), start connecting.
*/
/* Check if we should connect to a MASTER */
if (server.repl_state == REPL_STATE_CONNECT) {
serverLog(LL_NOTICE,"Connecting to MASTER %s:%d",
server.masterhost, server.masterport);
/* Connect to the master in non-blocking mode. */
if (connectWithMaster() == C_OK) {
serverLog(LL_NOTICE,"MASTER <-> REPLICA sync started");
}
}
/* When server.masterhost and server.master are both non-NULL and the master supports
partial resync, send REPLCONF ACK to the master to report the offset processed so far.
One detail to note: how do we know whether the master supports partial resync? This is
explained further below.
CLIENT_PRE_PSYNC --- flag for an instance that does not understand PSYNC (PSYNC includes partial resync)
#define CLIENT_PRE_PSYNC (1<<16)   // Instance doesn't understand PSYNC.
*/
/* Send ACK to master from time to time.
* Note that we do not send periodic acks to masters that don't
* support PSYNC and replication offsets. */
if (server.masterhost && server.master &&
!(server.master->flags & CLIENT_PRE_PSYNC))
replicationSendAck();
/* */
/* If we have attached slaves, PING them from time to time.
* So slaves can implement an explicit timeout to masters, and will
* be able to detect a link disconnection even if the TCP connection
* will not actually go down. */
listIter li;
listNode *ln;
robj *ping_argv[1];
/* */
/* First, send PING according to ping_slave_period. */
if ((replication_cron_loops % server.repl_ping_slave_period) == 0 &&
listLength(server.slaves))
{
/* Note that we don't send the PING if the clients are paused during
* a Redis Cluster manual failover: the PING we send will otherwise
* alter the replication offsets of master and slave, and will no longer
* match the one stored into 'mf_master_offset' state. */
int manual_failover_in_progress =
server.cluster_enabled &&
server.cluster->mf_end &&
clientsArePaused();
if (!manual_failover_in_progress) {
ping_argv[0] = createStringObject("PING",4);
replicationFeedSlaves(server.slaves, server.slaveseldb,
ping_argv, 1);
decrRefCount(ping_argv[0]);
}
}
/* */
/* Second, send a newline to all the slaves in pre-synchronization
* stage, that is, slaves waiting for the master to create the RDB file.
*
* Also send the a newline to all the chained slaves we have, if we lost
* connection from our master, to keep the slaves aware that their
* master is online. This is needed since sub-slaves only receive proxied
* data from top-level masters, so there is no explicit pinging in order
* to avoid altering the replication offsets. This special out of band
* pings (newlines) can be sent, they will have no effect in the offset.
*
* The newline will be ignored by the slave but will refresh the
* last interaction timer preventing a timeout. In this case we ignore the
* ping period and refresh the connection once per second since certain
* timeouts are set at a few seconds (example: PSYNC response). */
listRewind(server.slaves,&li);
while((ln = listNext(&li))) {
client *slave = ln->value;
int is_presync =
(slave->replstate == SLAVE_STATE_WAIT_BGSAVE_START ||
(slave->replstate == SLAVE_STATE_WAIT_BGSAVE_END &&
server.rdb_child_type != RDB_CHILD_TYPE_SOCKET));
if (is_presync) {
connWrite(slave->conn, "\n", 1);
}
}
/* */
/* Disconnect timedout slaves. */
if (listLength(server.slaves)) {
listIter li;
listNode *ln;
listRewind(server.slaves,&li);
while((ln = listNext(&li))) {
client *slave = ln->value;
if (slave->replstate != SLAVE_STATE_ONLINE) continue;
if (slave->flags & CLIENT_PRE_PSYNC) continue;
if ((server.unixtime - slave->repl_ack_time) > server.repl_timeout)
{
serverLog(LL_WARNING, "Disconnecting timedout replica: %s",
replicationGetSlaveName(slave));
freeClient(slave);
}
}
}
/* */
/* If this is a master without attached slaves and there is a replication
* backlog active, in order to reclaim memory we can free it after some
* (configured) time. Note that this cannot be done for slaves: slaves
* without sub-slaves attached should still accumulate data into the
* backlog, in order to reply to PSYNC queries if they are turned into
* masters after a failover. */
if (listLength(server.slaves) == 0 && server.repl_backlog_time_limit &&
server.repl_backlog && server.masterhost == NULL)
{
time_t idle = server.unixtime - server.repl_no_slaves_since;
if (idle > server.repl_backlog_time_limit) {
/* When we free the backlog, we always use a new
* replication ID and clear the ID2. This is needed
* because when there is no backlog, the master_repl_offset
* is not updated, but we would still retain our replication
* ID, leading to the following problem:
*
* 1. We are a master instance.
* 2. Our slave is promoted to master. It's repl-id-2 will
* be the same as our repl-id.
* 3. We, yet as master, receive some updates, that will not
* increment the master_repl_offset.
* 4. Later we are turned into a slave, connect to the new
* master that will accept our PSYNC request by second
* replication ID, but there will be data inconsistency
* because we received writes. */
changeReplicationId();
clearReplicationId2();
freeReplicationBacklog();
serverLog(LL_NOTICE,
"Replication backlog freed after %d seconds "
"without connected replicas.",
(int) server.repl_backlog_time_limit);
}
}
/* */
/* If AOF is disabled and we no longer have attached slaves, we can
* free our Replication Script Cache as there is no need to propagate
* EVALSHA at all. */
if (listLength(server.slaves) == 0 &&
server.aof_state == AOF_OFF &&
listLength(server.repl_scriptcache_fifo) != 0)
{
replicationScriptCacheFlush();
}
/* Start a BGSAVE good for replication if we have slaves in
* WAIT_BGSAVE_START state.
*
* In case of diskless replication, we make sure to wait the specified
* number of seconds (according to configuration) so that other slaves
* have the time to arrive before we start streaming. */
if (!hasActiveChildProcess()) {
time_t idle, max_idle = 0;
int slaves_waiting = 0;
int mincapa = -1;
listNode *ln;
listIter li;
listRewind(server.slaves,&li);
while((ln = listNext(&li))) {
client *slave = ln->value;
if (slave->replstate == SLAVE_STATE_WAIT_BGSAVE_START) {
idle = server.unixtime - slave->lastinteraction;
if (idle > max_idle) max_idle = idle;
slaves_waiting++;
mincapa = (mincapa == -1) ? slave->slave_capa :
(mincapa & slave->slave_capa);
}
}
if (slaves_waiting &&
(!server.repl_diskless_sync ||
max_idle > server.repl_diskless_sync_delay))
{
/* Start the BGSAVE. The called function may start a
* BGSAVE with socket target or disk target depending on the
* configuration and slaves capabilities. */
startBgsaveForReplication(mincapa);
}
}
/* */
/* Remove the RDB file used for replication if Redis is not running
* with any persistence. */
removeRDBUsedToSyncReplicas();
/* */
/* Refresh the number of slaves with lag <= min-slaves-max-lag. */
refreshGoodSlavesCount();
replication_cron_loops++; /* Incremented with frequency 1 HZ. */
}
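The last block of replicationCron decides whether to start a BGSAVE for waiting replicas. Its logic can be isolated as below. This is a sketch under simplifying assumptions: waiting_replica_t and should_start_bgsave are hypothetical names, the replica list is a plain array, and the check for an already running child process is left out.

```c
#include <stddef.h>

/* Hypothetical, reduced view of a replica as seen by the BGSAVE decision. */
typedef struct {
    int waiting;  /* stands in for replstate == SLAVE_STATE_WAIT_BGSAVE_START */
    long idle;    /* seconds since the last interaction with this replica */
    int capa;     /* capability bit mask announced by the replica */
} waiting_replica_t;

/* Returns 1 if a BGSAVE should be started now.
 * *mincapa_out receives the capabilities common to all waiting replicas
 * (bitwise AND), or -1 if no replica is waiting. */
static int should_start_bgsave(const waiting_replica_t *r, size_t n,
                               int diskless, long diskless_delay,
                               int *mincapa_out) {
    long max_idle = 0;
    int slaves_waiting = 0;
    int mincapa = -1;
    for (size_t i = 0; i < n; i++) {
        if (!r[i].waiting) continue;
        if (r[i].idle > max_idle) max_idle = r[i].idle;
        slaves_waiting++;
        mincapa = (mincapa == -1) ? r[i].capa : (mincapa & r[i].capa);
    }
    *mincapa_out = mincapa;
    /* Disk-based sync starts right away; diskless sync waits for the delay
     * so that more replicas can join the same transfer. */
    return slaves_waiting && (!diskless || max_idle > diskless_delay);
}
```

The AND over the capability masks means the BGSAVE is started with only those features that every waiting replica understands.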
1.5.3 connectWithMaster: establishing the master-replica connection
/* Establish the connection to the master in non-blocking mode. */
int connectWithMaster(void) {
/* connection *repl_transfer_s; -- member of redisServer
int tls_replication; -- member of redisServer, part of the TLS configuration
Set server.repl_transfer_s: if TLS is configured, call connCreateTLS() to get an
encrypted connection, otherwise call connCreateSocket() to get a plain one.
*/
/* Allocate and initialize the connection object. */
server.repl_transfer_s = server.tls_replication ? connCreateTLS() : connCreateSocket();
/* Create the socket, register the event on the event loop, and set syncWithMaster as the connect handler. */
if (connConnect(server.repl_transfer_s, server.masterhost, server.masterport,
NET_FIRST_BIND_ADDR, syncWithMaster) == C_ERR) {
/* If any of those steps fails, log a warning, close the connection, and set the
replication connection back to NULL.
*/
serverLog(LL_WARNING,"Unable to connect to MASTER: %s",
connGetLastError(server.repl_transfer_s));
connClose(server.repl_transfer_s);
server.repl_transfer_s = NULL;
return C_ERR;
}
/* Time of the latest I/O on the replication link, used later for the timeout checks.
server.unixtime is a cached time; it is refreshed, for example, in rdbLoadProgressCallback.
*/
server.repl_transfer_lastio = server.unixtime;
/* Set server.repl_state to REPL_STATE_CONNECTING: the connect is now in progress. */
server.repl_state = REPL_STATE_CONNECTING;
/* Return C_OK. */
return C_OK;
}
/* Create a plain (unencrypted) connection. */
connection *connCreateSocket() {
connection *conn = zcalloc(sizeof(connection));
conn->type = &CT_Socket;
conn->fd = -1;
return conn;
}
/* Create a TLS (encrypted) connection. */
connection *connCreateTLS(void) {
tls_connection *conn = zcalloc(sizeof(tls_connection));
conn->c.type = &CT_TLS;
conn->c.fd = -1;
conn->ssl = SSL_new(redis_tls_ctx);
return (connection *) conn;
}
/* Dispatch to the connect function of the connection's type. */
static inline int connConnect(
connection *conn,
const char *addr,
int port,
const char *src_addr,
ConnectionCallbackFunc connect_handler)
{
return conn->type->connect(conn, addr, port, src_addr, connect_handler);
}
/* The connection type definitions that the dispatch above relies on */
typedef struct ConnectionType {
...
int (*connect)(struct connection *conn, const char *addr, int port, const char *source_addr, ConnectionCallbackFunc connect_handler);
...
} ConnectionType;
ConnectionType CT_TLS = {
...
.connect = connTLSConnect,
...
};
ConnectionType CT_Socket = {
...
.connect = connSocketConnect,
...
};
/* Without TLS, the connConnect call effectively runs connSocketConnect */
static inline int connConnect(
connection *conn,
const char *addr,
int port,
const char *src_addr,
ConnectionCallbackFunc connect_handler)
{
return connSocketConnect(conn, addr, port, src_addr, connect_handler);
}
/* The implementation of connSocketConnect */
static int connSocketConnect
(connection *conn,
const char *addr,
int port,
const char *src_addr,
ConnectionCallbackFunc connect_handler)
{
/* Create the socket and start connecting in non-blocking mode. */
int fd = anetTcpNonBlockBestEffortBindConnect(NULL,addr,port,src_addr);
if (fd == -1) {
conn->state = CONN_STATE_ERROR;
conn->last_errno = errno;
return C_ERR;
}
conn->fd = fd; /* store the descriptor of the new socket */
conn->state = CONN_STATE_CONNECTING; /* mark the connection as CONN_STATE_CONNECTING */
conn->conn_handler = connect_handler; /* set the connect handler */
aeCreateFileEvent(server.el, conn->fd, AE_WRITABLE,
conn->type->ae_handler, conn); /* register a writable event for this descriptor on the event loop */
return C_OK;
}
int anetTcpNonBlockBestEffortBindConnect(char *err, const char *addr, int port,
const char *source_addr)
{
return anetTcpGenericConnect(err,addr,port,source_addr,
ANET_CONNECT_NONBLOCK|ANET_CONNECT_BE_BINDING);
}
static int anetTcpGenericConnect(char *err, const char *addr, int port,
const char *source_addr, int flags)
{
...
/* Calls socket() to create the socket and connect() to connect it
   (the replica acts as the client, the master as the server). */
...
}
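What connectWithMaster() does can be summarized as:
- Depending on whether TLS is configured, call connCreateTLS() or connCreateSocket() and assign the returned connection to server.repl_transfer_s.
- Call connConnect to create the socket, register the event on the event loop, and set syncWithMaster as the connect handler (this step can fail, in which case the function returns C_ERR).
- Record the current time in server.repl_transfer_lastio (used later for the timeout checks).
- Set the replication state server.repl_state to REPL_STATE_CONNECTING (connect in progress).
- Return success.
This step initiates the network connection between replica and master; syncWithMaster runs once the non-blocking connect completes.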
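The choice between the TLS and the plain connection is made once, when the connection object is created; after that every call goes through the function pointers of the connection type. The sketch below shows this dispatch pattern in isolation. All names (conn, conn_type, conn_create, conn_connect, plain_connect, tls_connect, sketch_handler) are hypothetical, and the connect functions only pretend to open a socket so that the example runs without a network.

```c
#include <stddef.h>

struct conn;
typedef void (*connect_cb)(struct conn *);

/* Table of operations, one instance per kind of connection. */
typedef struct conn_type {
    const char *name;
    int (*connect)(struct conn *c, const char *addr, int port, connect_cb cb);
} conn_type;

typedef struct conn {
    const conn_type *type;  /* selects the implementation */
    int fd;                 /* -1 until connected */
    connect_cb handler;     /* called when the connect completes */
} conn;

/* Fake implementations: a real one would call socket()/connect() here. */
static int plain_connect(conn *c, const char *addr, int port, connect_cb cb) {
    (void)addr; (void)port;
    c->fd = 3;              /* pretend descriptor */
    c->handler = cb;
    return 0;
}
static int tls_connect(conn *c, const char *addr, int port, connect_cb cb) {
    (void)addr; (void)port;
    c->fd = 4;              /* pretend descriptor */
    c->handler = cb;
    return 0;
}

static const conn_type CT_PLAIN_SKETCH = { "plain", plain_connect };
static const conn_type CT_TLS_SKETCH   = { "tls",   tls_connect };

/* The only place where the kind of connection is decided. */
static conn conn_create(int use_tls) {
    conn c = { use_tls ? &CT_TLS_SKETCH : &CT_PLAIN_SKETCH, -1, NULL };
    return c;
}

/* Callers never name the implementation; they go through the table. */
static int conn_connect(conn *c, const char *addr, int port, connect_cb cb) {
    return c->type->connect(c, addr, port, cb);
}

/* Sample handler that just counts how often it was invoked. */
static int handler_calls = 0;
static void sketch_handler(conn *c) { (void)c; handler_calls++; }
```

A caller would create the connection with conn_create(use_tls), call conn_connect with the handler to run on completion, and later the event loop would invoke c.handler(&c), which corresponds to syncWithMaster being called in the walkthrough above.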
1.5.4 Implementation of syncWithMaster
/* This handler fires when the non blocking connect was able to
* establish a connection with the master. */
void syncWithMaster(connection *conn) {
char tmpfile[256], *err = NULL;
int dfd = -1, maxtries = 5;
int psync_result;
/* If this event fired after the user turned the instance into a master
* with SLAVEOF NO ONE we must just return ASAP. */
if (server.repl_state == REPL_STATE_NONE) {
connClose(conn);
return;
}
/* Check for errors in the socket: after a non blocking connect() we
* may find that the socket is in error state. */
if (connGetState(conn) != CONN_STATE_CONNECTED) {
serverLog(LL_WARNING,"Error condition on socket for SYNC: %s",
connGetLastError(conn));
goto error;
}
/* Send a PING to check the master is able to reply without errors. */
if (server.repl_state == REPL_STATE_CONNECTING) {
serverLog(LL_NOTICE,"Non blocking connect for SYNC fired the event.");
/* Delete the writable event so that the readable event remains
* registered and we can wait for the PONG reply. */
connSetReadHandler(conn, syncWithMaster);
connSetWriteHandler(conn, NULL);
server.repl_state = REPL_STATE_RECEIVE_PONG;
/* Send the PING, don't check for errors at all, we have the timeout
* that will take care about this. */
err = sendSynchronousCommand(SYNC_CMD_WRITE,conn,"PING",NULL);
if (err) goto write_error;
return;
}
/* Receive the PONG command. */
if (server.repl_state == REPL_STATE_RECEIVE_PONG) {
err = sendSynchronousCommand(SYNC_CMD_READ,conn,NULL);
/* We accept only two replies as valid, a positive +PONG reply
* (we just check for "+") or an authentication error.
* Note that older versions of Redis replied with "operation not
* permitted" instead of using a proper error code, so we test
* both. */
if (err[0] != '+' &&
strncmp(err,"-NOAUTH",7) != 0 &&
strncmp(err,"-NOPERM",7) != 0 &&
strncmp(err,"-ERR operation not permitted",28) != 0)
{
serverLog(LL_WARNING,"Error reply to PING from master: '%s'",err);
sdsfree(err);
goto error;
} else {
serverLog(LL_NOTICE,
"Master replied to PING, replication can continue...");
}
sdsfree(err);
server.repl_state = REPL_STATE_SEND_AUTH;
}
/* AUTH with the master if required. */
if (server.repl_state == REPL_STATE_SEND_AUTH) {
if (server.masteruser && server.masterauth) {
err = sendSynchronousCommand(SYNC_CMD_WRITE,conn,"AUTH",
server.masteruser,server.masterauth,NULL);
if (err) goto write_error;
server.repl_state = REPL_STATE_RECEIVE_AUTH;
return;
} else if (server.masterauth) {
err = sendSynchronousCommand(SYNC_CMD_WRITE,conn,"AUTH",server.masterauth,NULL);
if (err) goto write_error;
server.repl_state = REPL_STATE_RECEIVE_AUTH;
return;
} else {
server.repl_state = REPL_STATE_SEND_PORT;
}
}
/* Receive AUTH reply. */
if (server.repl_state == REPL_STATE_RECEIVE_AUTH) {
err = sendSynchronousCommand(SYNC_CMD_READ,conn,NULL);
if (err[0] == '-') {
serverLog(LL_WARNING,"Unable to AUTH to MASTER: %s",err);
sdsfree(err);
goto error;
}
sdsfree(err);
server.repl_state = REPL_STATE_SEND_PORT;
}
/* Set the slave port, so that Master's INFO command can list the
* slave listening port correctly. */
if (server.repl_state == REPL_STATE_SEND_PORT) {
int port;
if (server.slave_announce_port) port = server.slave_announce_port;
else if (server.tls_replication && server.tls_port) port = server.tls_port;
else port = server.port;
sds portstr = sdsfromlonglong(port);
err = sendSynchronousCommand(SYNC_CMD_WRITE,conn,"REPLCONF",
"listening-port",portstr, NULL);
sdsfree(portstr);
if (err) goto write_error;
sdsfree(err);
server.repl_state = REPL_STATE_RECEIVE_PORT;
return;
}
/* Receive REPLCONF listening-port reply. */
if (server.repl_state == REPL_STATE_RECEIVE_PORT) {
err = sendSynchronousCommand(SYNC_CMD_READ,conn,NULL);
/* Ignore the error if any, not all the Redis versions support
* REPLCONF listening-port. */
if (err[0] == '-') {
serverLog(LL_NOTICE,"(Non critical) Master does not understand "
"REPLCONF listening-port: %s", err);
}
sdsfree(err);
server.repl_state = REPL_STATE_SEND_IP;
}
/* Skip REPLCONF ip-address if there is no slave-announce-ip option set. */
if (server.repl_state == REPL_STATE_SEND_IP &&
server.slave_announce_ip == NULL)
{
server.repl_state = REPL_STATE_SEND_CAPA;
}
/* Set the slave ip, so that Master's INFO command can list the
* slave IP address port correctly in case of port forwarding or NAT. */
if (server.repl_state == REPL_STATE_SEND_IP) {
err = sendSynchronousCommand(SYNC_CMD_WRITE,conn,"REPLCONF",
"ip-address",server.slave_announce_ip, NULL);
if (err) goto write_error;
sdsfree(err);
server.repl_state = REPL_STATE_RECEIVE_IP;
return;
}
/* Receive REPLCONF ip-address reply. */
if (server.repl_state == REPL_STATE_RECEIVE_IP) {
err = sendSynchronousCommand(SYNC_CMD_READ,conn,NULL);
/* Ignore the error if any, not all the Redis versions support
* REPLCONF listening-port. */
if (err[0] == '-') {
serverLog(LL_NOTICE,"(Non critical) Master does not understand "
"REPLCONF ip-address: %s", err);
}
sdsfree(err);
server.repl_state = REPL_STATE_SEND_CAPA;
}
/* Inform the master of our (slave) capabilities.
*
* EOF: supports EOF-style RDB transfer for diskless replication.
* PSYNC2: supports PSYNC v2, so understands +CONTINUE <new repl ID>.
*
* The master will ignore capabilities it does not understand. */
if (server.repl_state == REPL_STATE_SEND_CAPA) {
err = sendSynchronousCommand(SYNC_CMD_WRITE,conn,"REPLCONF",
"capa","eof","capa","psync2",NULL);
if (err) goto write_error;
sdsfree(err);
server.repl_state = REPL_STATE_RECEIVE_CAPA;
return;
}
/* Receive CAPA reply. */
if (server.repl_state == REPL_STATE_RECEIVE_CAPA) {
err = sendSynchronousCommand(SYNC_CMD_READ,conn,NULL);
/* Ignore the error if any, not all the Redis versions support
* REPLCONF capa. */
if (err[0] == '-') {
serverLog(LL_NOTICE,"(Non critical) Master does not understand "
"REPLCONF capa: %s", err);
}
sdsfree(err);
server.repl_state = REPL_STATE_SEND_PSYNC;
}
/* Try a partial resynchonization. If we don't have a cached master
* slaveTryPartialResynchronization() will at least try to use PSYNC
* to start a full resynchronization so that we get the master run id
* and the global offset, to try a partial resync at the next
* reconnection attempt. */
if (server.repl_state == REPL_STATE_SEND_PSYNC) {
if (slaveTryPartialResynchronization(conn,0) == PSYNC_WRITE_ERROR) {
err = sdsnew("Write error sending the PSYNC command.");
goto write_error;
}
server.repl_state = REPL_STATE_RECEIVE_PSYNC;
return;
}
/* If reached this point, we should be in REPL_STATE_RECEIVE_PSYNC. */
if (server.repl_state != REPL_STATE_RECEIVE_PSYNC) {
serverLog(LL_WARNING,"syncWithMaster(): state machine error, "
"state should be RECEIVE_PSYNC but is %d",
server.repl_state);
goto error;
}
psync_result = slaveTryPartialResynchronization(conn,1);
if (psync_result == PSYNC_WAIT_REPLY) return; /* Try again later... */
/* If the master is in an transient error, we should try to PSYNC
* from scratch later, so go to the error path. This happens when
* the server is loading the dataset or is not connected with its
* master and so forth. */
if (psync_result == PSYNC_TRY_LATER) goto error;
/* Note: if PSYNC does not return WAIT_REPLY, it will take care of
* uninstalling the read handler from the file descriptor. */
if (psync_result == PSYNC_CONTINUE) {
serverLog(LL_NOTICE, "MASTER <-> REPLICA sync: Master accepted a Partial Resynchronization.");
if (server.supervised_mode == SUPERVISED_SYSTEMD) {
redisCommunicateSystemd("STATUS=MASTER <-> REPLICA sync: Partial Resynchronization accepted. Ready to accept connections.\n");
redisCommunicateSystemd("READY=1\n");
}
return;
}
/* PSYNC failed or is not supported: we want our slaves to resync with us
* as well, if we have any sub-slaves. The master may transfer us an
* entirely different data set and we have no way to incrementally feed
* our slaves after that. */
disconnectSlaves(); /* Force our slaves to resync with us as well. */
freeReplicationBacklog(); /* Don't allow our chained slaves to PSYNC. */
/* Fall back to SYNC if needed. Otherwise psync_result == PSYNC_FULLRESYNC
* and the server.master_replid and master_initial_offset are
* already populated. */
if (psync_result == PSYNC_NOT_SUPPORTED) {
serverLog(LL_NOTICE,"Retrying with SYNC...");
if (connSyncWrite(conn,"SYNC\r\n",6,server.repl_syncio_timeout*1000) == -1) {
serverLog(LL_WARNING,"I/O error writing to MASTER: %s",
strerror(errno));
goto error;
}
}
/* Prepare a suitable temp file for bulk transfer */
if (!useDisklessLoad()) {
while(maxtries--) {
snprintf(tmpfile,256,
"temp-%d.%ld.rdb",(int)server.unixtime,(long int)getpid());
dfd = open(tmpfile,O_CREAT|O_WRONLY|O_EXCL,0644);
if (dfd != -1) break;
sleep(1);
}
if (dfd == -1) {
serverLog(LL_WARNING,"Opening the temp file needed for MASTER <-> REPLICA synchronization: %s",strerror(errno));
goto error;
}
server.repl_transfer_tmpfile = zstrdup(tmpfile);
server.repl_transfer_fd = dfd;
}
/* Setup the non blocking download of the bulk file. */
if (connSetReadHandler(conn, readSyncBulkPayload)
== C_ERR)
{
char conninfo[CONN_INFO_LEN];
serverLog(LL_WARNING,
"Can't create readable event for SYNC: %s (%s)",
strerror(errno), connGetInfo(conn, conninfo, sizeof(conninfo)));
goto error;
}
server.repl_state = REPL_STATE_TRANSFER;
server.repl_transfer_size = -1;
server.repl_transfer_read = 0;
server.repl_transfer_last_fsync_off = 0;
server.repl_transfer_lastio = server.unixtime;
return;
error:
if (dfd != -1) close(dfd);
connClose(conn);
server.repl_transfer_s = NULL;
if (server.repl_transfer_fd != -1)
close(server.repl_transfer_fd);
if (server.repl_transfer_tmpfile)
zfree(server.repl_transfer_tmpfile);
server.repl_transfer_tmpfile = NULL;
server.repl_transfer_fd = -1;
server.repl_state = REPL_STATE_CONNECT;
return;
write_error: /* Handle sendSynchronousCommand(SYNC_CMD_WRITE) errors. */
serverLog(LL_WARNING,"Sending command to master in replication handshake: %s", err);
sdsfree(err);
goto error;
}
1.5.5 The replica sends a PING command to the master
1. Locate the branch of syncWithMaster in which the replica sends PING
void syncWithMaster(connection *conn) {
char *err = NULL;
...
/* Send a PING to check the master is able to reply without errors. */
/* If the replica's replication state is REPL_STATE_CONNECTING, send a PING to check that the master can reply with a PONG. */
if (server.repl_state == REPL_STATE_CONNECTING) {
serverLog(LL_NOTICE,"Non blocking connect for SYNC fired the event.");
/* Delete the writable event so that the readable event remains
* registered and we can wait for the PONG reply. */
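/* Install syncWithMaster as the handler for readable events. */
connSetReadHandler(conn, syncWithMaster);
/* Set the write handler to NULL so that writable events are ignored for now. */
connSetWriteHandler(conn, NULL);
/* Move server.repl_state to "PING sent, waiting for the master's PONG reply". */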
server.repl_state = REPL_STATE_RECEIVE_PONG;
/* Send the PING, don't check for errors at all, we have the timeout
* that will take care about this. */
err = sendSynchronousCommand(SYNC_CMD_WRITE,conn,"PING",NULL);
if (err) goto write_error;
return;
}
...
}
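2. Understanding SYNC_CMD_WRITE
The flags form a bitmask: SYNC_CMD_WRITE only sends the command, SYNC_CMD_READ only reads one reply line, and SYNC_CMD_FULL does both in a single call.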
#define SYNC_CMD_READ (1<<0)
#define SYNC_CMD_WRITE (1<<1)
#define SYNC_CMD_FULL (SYNC_CMD_READ|SYNC_CMD_WRITE)
3. The write path of sendSynchronousCommand calls connSyncWrite, which ultimately calls syncWrite
char *sendSynchronousCommand(int flags, connection *conn, ...) {
/* Create the command to send to the master, we use redis binary
* protocol to make sure correct arguments are sent. This function
* is not safe for all binary data. */
if (flags & SYNC_CMD_WRITE) {
char *arg;
va_list ap;
sds cmd = sdsempty();
sds cmdargs = sdsempty();
size_t argslen = 0;
va_start(ap,conn);
while(1) {
arg = va_arg(ap, char*);
if (arg == NULL) break;
cmdargs = sdscatprintf(cmdargs,"$%zu\r\n%s\r\n",strlen(arg),arg);
argslen++;
}
va_end(ap);
cmd = sdscatprintf(cmd,"*%zu\r\n",argslen);
cmd = sdscatsds(cmd,cmdargs);
sdsfree(cmdargs);
/* Transfer command to the server. */
if (connSyncWrite(conn,cmd,sdslen(cmd),server.repl_syncio_timeout*1000)
== -1)
{
sdsfree(cmd);
return sdscatprintf(sdsempty(),"-Writing to master: %s",
connGetLastError(conn));
}
sdsfree(cmd);
}
/* Read the reply from the server. */
if (flags & SYNC_CMD_READ) {
char buf[256];
if (connSyncReadLine(conn,buf,sizeof(buf),server.repl_syncio_timeout*1000)
== -1)
{
return sdscatprintf(sdsempty(),"-Reading from master: %s",
strerror(errno));
}
server.repl_transfer_lastio = server.unixtime;
return sdsnew(buf);
}
return NULL;
}
static inline ssize_t connSyncWrite(connection *conn, char *ptr, ssize_t size, long long timeout) {
return conn->type->sync_write(conn, ptr, size, timeout);
}
ConnectionType CT_Socket = {
...
.sync_write = connSocketSyncWrite,
...
}
static ssize_t connSocketSyncWrite(connection *conn, char *ptr, ssize_t size, long long timeout) {
return syncWrite(conn->fd, ptr, size, timeout);
}
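To make the wire format built by the write branch concrete, the following minimal sketch encodes an argument list in the same shape: a `*<argc>` header followed by one `$<len>` bulk string per argument. `resp_encode` and its caller-supplied buffer are hypothetical names used only for illustration, not part of Redis. Like the original, it relies on strlen and is therefore not safe for binary arguments.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not a Redis function): encode argv[0..argc-1] as a
 * RESP array of bulk strings into out. Returns the encoded length, or -1
 * if out is too small. */
int resp_encode(char *out, size_t outlen, int argc, const char **argv) {
    assert(out != NULL && argv != NULL);
    /* Array header: "*<argc>\r\n" */
    int n = snprintf(out, outlen, "*%d\r\n", argc);
    if (n < 0 || (size_t)n >= outlen) return -1;
    size_t used = (size_t)n;
    for (int i = 0; i < argc; i++) {
        /* One bulk string per argument: "$<len>\r\n<arg>\r\n" */
        n = snprintf(out + used, outlen - used, "$%zu\r\n%s\r\n",
                     strlen(argv[i]), argv[i]);
        if (n < 0 || (size_t)n >= outlen - used) return -1;
        used += (size_t)n;
    }
    return (int)used;
}
```

Under these assumptions, encoding the single argument `PING` produces the 14 bytes `*1\r\n$4\r\nPING\r\n`.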
4. Implementation of syncWrite
/* Write the specified payload to 'fd'. If writing the whole payload will be
* done within 'timeout' milliseconds the operation succeeds and 'size' is
* returned. Otherwise the operation fails, -1 is returned, and an unspecified
* partial write could be performed against the file descriptor. */
ssize_t syncWrite(int fd, char *ptr, ssize_t size, long long timeout) {
ssize_t nwritten, ret = size;
long long start = mstime();
long long remaining = timeout;
while(1) {
long long wait = (remaining > SYNCIO__RESOLUTION) ?
remaining : SYNCIO__RESOLUTION;
long long elapsed;
/* Optimistically try to write before checking if the file descriptor
* is actually writable. At worst we get EAGAIN. */
nwritten = write(fd,ptr,size);
if (nwritten == -1) {
if (errno != EAGAIN) return -1;
} else {
ptr += nwritten;
size -= nwritten;
}
if (size == 0) return ret;
/* Wait */
aeWait(fd,AE_WRITABLE,wait);
elapsed = mstime() - start;
if (elapsed >= timeout) {
errno = ETIMEDOUT;
return -1;
}
remaining = timeout - elapsed;
}
}
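The deadline bookkeeping in syncWrite can be tried outside Redis with plain POSIX calls. The sketch below is a simplified stand-in, not the Redis implementation: `now_ms` and `write_all_deadline` are hypothetical names, poll() takes the place of aeWait(), and the wait is simply the time left until the deadline rather than the SYNCIO__RESOLUTION-based value used in the original.

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <sys/time.h>
#include <sys/types.h>
#include <unistd.h>

/* Current time in milliseconds (stand-in for Redis's mstime()). */
long long now_ms(void) {
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return (long long)tv.tv_sec * 1000 + tv.tv_usec / 1000;
}

/* Hypothetical sketch: write all 'size' bytes to fd within timeout_ms.
 * Returns size on success, -1 on error or timeout (errno = ETIMEDOUT). */
ssize_t write_all_deadline(int fd, const char *ptr, ssize_t size,
                           long long timeout_ms) {
    assert(ptr != NULL && size >= 0);
    ssize_t total = size;
    long long start = now_ms();
    while (size > 0) {
        /* Optimistically try to write first, as syncWrite does. */
        ssize_t n = write(fd, ptr, (size_t)size);
        if (n >= 0) {
            ptr += n;
            size -= n;
            continue;
        }
        if (errno != EAGAIN && errno != EINTR) return -1;
        /* Not writable yet: wait, but never past the deadline. */
        long long remaining = timeout_ms - (now_ms() - start);
        if (remaining <= 0) {
            errno = ETIMEDOUT;
            return -1;
        }
        struct pollfd pfd = { .fd = fd, .events = POLLOUT };
        poll(&pfd, 1, (int)remaining);
    }
    return total;
}
```

A pipe is enough to exercise it: writing 6 bytes to the write end with a 1000 ms timeout returns 6 immediately, because the pipe buffer has room.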
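Reading the connSocketConnect implementation in 1.5.3 shows that, once the connection has been initiated, the fd's AE_WRITABLE event is registered with the event loop. When the connection completes, an AE_WRITABLE event fires and syncWithMaster() is called to handle it.
(To investigate: the writable event is registered after connect() has been called. Can epoll, given only the fd, report an event that happened before registration? A likely explanation is that ae registers fds in level-triggered mode, so epoll reports the socket's current writability regardless of when it became writable.)
In the REPL_STATE_CONNECTING state the replica sends a PING command to the master. The PING serves to:
- check that the network between master and replica is usable;
- check that the master is currently able to accept and process commands.
The main logic for sending PING is:
- stop handling write events on the fd of the connection to the master, because the next step is to read the master's PONG reply and only read events matter;
- set the replica's replication state to REPL_STATE_RECEIVE_PONG and wait for the master's PONG reply;
- call sendSynchronousCommand() in write mode to send a PING command to the master.
The replica's replication state changes as follows:
- REPL_STATE_CONNECTING ---> REPL_STATE_RECEIVE_PONG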
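The pattern of "send, change state, return, and get called again when the fd is readable" can be modelled without any networking. The sketch below is a toy model under stated assumptions: `fake_conn`, `step_state` and `handshake_step` are hypothetical names, and the handlers are invoked by hand instead of by an event loop.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { ST_CONNECTING, ST_RECEIVE_PONG, ST_SEND_AUTH } step_state;

typedef struct fake_conn fake_conn;
typedef void (*conn_handler)(fake_conn *);

/* Hypothetical stand-in for a connection with its two event handlers. */
struct fake_conn {
    conn_handler read_handler;
    conn_handler write_handler;
    step_state state;
    int pings_sent;
};

/* One handler serves both events; the state decides which branch runs. */
void handshake_step(fake_conn *c) {
    assert(c != NULL);
    if (c->state == ST_CONNECTING) {
        c->read_handler = handshake_step;  /* wait for the PONG */
        c->write_handler = NULL;           /* ignore writable events */
        c->state = ST_RECEIVE_PONG;
        c->pings_sent++;                   /* models sending PING */
        return;                            /* back to the event loop */
    }
    if (c->state == ST_RECEIVE_PONG) {
        c->state = ST_SEND_AUTH;           /* models an accepted reply */
    }
}
```

Calling the write handler once moves the model to ST_RECEIVE_PONG and clears the write handler; calling the read handler afterwards moves it to ST_SEND_AUTH, which mirrors the two invocations of syncWithMaster() described in 1.5.5 and 1.5.6.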
1.5.6 The replica receives and parses the master's reply to PING
1. Locate the branch of syncWithMaster in which the replica receives the master's reply to PING
void syncWithMaster(connection *conn) {
char tmpfile[256], *err = NULL;
...
/* Receive the PONG command. */
/* The replication state is REPL_STATE_RECEIVE_PONG. */
if (server.repl_state == REPL_STATE_RECEIVE_PONG) {
/* Read the reply (read-only call, nothing is sent). */
err = sendSynchronousCommand(SYNC_CMD_READ,conn,NULL);
/* We accept only two replies as valid, a positive +PONG reply
* (we just check for "+") or an authentication error.
* Note that older versions of Redis replied with "operation not
* permitted" instead of using a proper error code, so we test
* both. */
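/* Only two kinds of reply are accepted as valid:
(1) "+PONG" when everything is fine (only the leading '+' is checked);
(2) an authentication error: "-NOAUTH", "-NOPERM", or the legacy
"-ERR operation not permitted".
The four conditions are joined with &&, so the error branch is taken only
when the reply matches none of the accepted forms, for example "-abc".
A reply such as "-NOAUTH ..." makes the second condition false and
therefore does not reach the error branch. This is intentional: the
handshake continues with the AUTH step.
*/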
if (err[0] != '+' &&
strncmp(err,"-NOAUTH",7) != 0 &&
strncmp(err,"-NOPERM",7) != 0 &&
strncmp(err,"-ERR operation not permitted",28) != 0)
{
serverLog(LL_WARNING,"Error reply to PING from master: '%s'",err);
sdsfree(err);
goto error;
} else {
serverLog(LL_NOTICE,
"Master replied to PING, replication can continue...");
}
sdsfree(err);
server.repl_state = REPL_STATE_SEND_AUTH;
}
...
}
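2. As in the analysis of 1.5.5, sendSynchronousCommand(SYNC_CMD_READ,conn,NULL) calls connSyncReadLine, which for a socket connection ends up in syncReadLine; that function reads the reply one byte at a time through
syncRead, implemented as follows: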
/* Read the specified amount of bytes from 'fd'. If all the bytes are read
* within 'timeout' milliseconds the operation succeed and 'size' is returned.
* Otherwise the operation fails, -1 is returned, and an unspecified amount of
* data could be read from the file descriptor. */
ssize_t syncRead(int fd, char *ptr, ssize_t size, long long timeout) {
ssize_t nread, totread = 0;
long long start = mstime();
long long remaining = timeout;
if (size == 0) return 0;
while(1) {
long long wait = (remaining > SYNCIO__RESOLUTION) ?
remaining : SYNCIO__RESOLUTION;
long long elapsed;
/* Optimistically try to read before checking if the file descriptor
* is actually readable. At worst we get EAGAIN. */
nread = read(fd,ptr,size);
if (nread == 0) return -1; /* short read. */
if (nread == -1) {
if (errno != EAGAIN) return -1;
} else {
ptr += nread;
size -= nread;
totread += nread;
}
if (size == 0) return totread;
/* Wait */
aeWait(fd,AE_READABLE,wait);
elapsed = mstime() - start;
if (elapsed >= timeout) {
errno = ETIMEDOUT;
return -1;
}
remaining = timeout - elapsed;
}
}
In this step, when the replication state is REPL_STATE_RECEIVE_PONG, sendSynchronousCommand() is called in read mode. If nothing went wrong, the replica reads "+PONG\r\n" from the master, sets server.repl_state = REPL_STATE_SEND_AUTH, and moves on to the next step. The replica's replication state changes as follows:
REPL_STATE_RECEIVE_PONG ---> REPL_STATE_SEND_AUTH
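The four-part condition in this branch is easier to follow when written as a positive test. The sketch below restates it as a predicate; `ping_reply_ok` is a hypothetical name, and the prefixes and lengths are copied from the branch shown in this section.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper: returns 1 if the reply to PING lets the handshake
 * continue, 0 if it must be treated as an error. */
int ping_reply_ok(const char *reply) {
    assert(reply != NULL);
    if (reply[0] == '+') return 1;                      /* +PONG */
    if (strncmp(reply, "-NOAUTH", 7) == 0) return 1;    /* auth required */
    if (strncmp(reply, "-NOPERM", 7) == 0) return 1;    /* no permission */
    if (strncmp(reply, "-ERR operation not permitted", 28) == 0)
        return 1;                                       /* legacy wording */
    return 0;                                           /* anything else */
}
```

The original `if` is the negation of this predicate: the error branch runs exactly when `ping_reply_ok` would return 0, so `-NOAUTH Authentication required.` continues to the AUTH step while `-ERR unknown command` aborts.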
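1.5.7 Sending the authentication information
/*
The configuration file explains masterauth and masteruser as follows. When
the master requires password authentication, a replica that wants to
synchronize data from it must set masterauth <master-password>; on Redis 6
and later it is better to also configure a dedicated user with masteruser.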
# If the master is password protected (using the "requirepass" configuration
# directive below) it is possible to tell the replica to authenticate before
# starting the replication synchronization process, otherwise the master will
# refuse the replica request.
#
# masterauth <master-password>
#
# However this is not enough if you are using Redis ACLs (for Redis version
# 6 or greater), and the default user is not capable of running the PSYNC
# command and/or other commands needed for replication. In this case it's
# better to configure a special user to use with replication, and specify the
# masteruser configuration as such:
#
# masteruser <username>
#
*/
void syncWithMaster(connection *conn) {
char tmpfile[256], *err = NULL;
...
/* AUTH with the master if required. */
if (server.repl_state == REPL_STATE_SEND_AUTH) {
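/*
If both the user name and the password for the master are configured
(non-NULL), send "AUTH <masteruser> <masterauth>" to the master and set
the replica's replication state to REPL_STATE_RECEIVE_AUTH.
*/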
if (server.masteruser && server.masterauth) {
err = sendSynchronousCommand(SYNC_CMD_WRITE,conn,"AUTH",
server.masteruser,server.masterauth,NULL);
if (err) goto write_error;
server.repl_state = REPL_STATE_RECEIVE_AUTH;
return;
}
/* If only the password is configured (non-NULL), send "AUTH <masterauth>"
to the master and set the replica's replication state to
REPL_STATE_RECEIVE_AUTH.
*/
else if (server.masterauth)
{
err = sendSynchronousCommand(SYNC_CMD_WRITE,conn,"AUTH",server.masterauth,NULL);
if (err) goto write_error;
server.repl_state = REPL_STATE_RECEIVE_AUTH;
return;
}
else
/* If no authentication is configured, set the replica's replication state
to REPL_STATE_SEND_PORT. */
{
server.repl_state = REPL_STATE_SEND_PORT;
}
}
...
}
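In this step, when the replication state is REPL_STATE_SEND_AUTH, sendSynchronousCommand() is called in write mode if the configuration requires authentication. After this step the replica's replication state has changed in one of two ways:
- REPL_STATE_SEND_AUTH ---> REPL_STATE_RECEIVE_AUTH (masterauth is configured)
- REPL_STATE_SEND_AUTH ---> REPL_STATE_SEND_PORT (no masterauth is configured)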
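The three-way choice in this branch boils down to which arguments follow AUTH. A minimal sketch, assuming a hypothetical helper `build_auth_args` that only assembles the argument vector and sends nothing:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: fill argv with the AUTH command arguments for the
 * given masteruser / masterauth values. Returns argc, where 0 means that
 * the AUTH step is skipped. */
int build_auth_args(const char *user, const char *pass, const char *argv[3]) {
    assert(argv != NULL);
    if (pass == NULL) return 0;      /* no masterauth: nothing to send */
    int argc = 0;
    argv[argc++] = "AUTH";
    if (user != NULL) argv[argc++] = user;   /* AUTH <user> <pass> */
    argv[argc++] = pass;                     /* AUTH <pass> */
    return argc;
}
```

As in the branch above, a masteruser without a masterauth does not trigger AUTH at all: the function returns 0 and the state would go straight to REPL_STATE_SEND_PORT.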
1.5.8 Receiving the master's reply to the authentication request
void syncWithMaster(connection *conn) {
char tmpfile[256], *err = NULL;
...
/* Receive AUTH reply. */
if (server.repl_state == REPL_STATE_RECEIVE_AUTH) {
err = sendSynchronousCommand(SYNC_CMD_READ,conn,NULL);
if (err[0] == '-') {
serverLog(LL_WARNING,"Unable to AUTH to MASTER: %s",err);
sdsfree(err);
goto error;
}
sdsfree(err);
server.repl_state = REPL_STATE_SEND_PORT;
}
...
}
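The master reads the AUTH command and handles it in authCommand(). It compares the server.masterauth value sent by the replica with its own server.requirepass and, if they match, replies "+OK\r\n". (When a masteruser is sent as well, the credentials are presumably checked against that ACL user instead.) If the reply starts with '-', the replica logs the failure and aborts the handshake through the error path.
After this step the replica's replication state changes as follows:
- REPL_STATE_RECEIVE_AUTH ---> REPL_STATE_SEND_PORT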
1.5.9 Sending the port number to the master
/* Description of the port and IP in the configuration file
# A Redis master is able to list the address and port of the attached
# replicas in different ways. For example the "INFO replication" section
# offers this information, which is used, among other tools, by
# Redis Sentinel in order to discover replica instances.
# Another place where this info is available is in the output of the
# "ROLE" command of a master.
#
# The listed IP and address normally reported by a replica is obtained
# in the following way:
#
# IP: The address is auto detected by checking the peer address
# of the socket used by the replica to connect with the master.
#
# Port: The port is communicated by the replica during the replication
# handshake, and is normally the port that the replica is using to
# listen for connections.
#
# However when port forwarding or Network Address Translation (NAT) is
# used, the replica may be actually reachable via different IP and port
# pairs. The following two options can be used by a replica in order to
# report to its master a specific set of IP and port, so that both INFO
# and ROLE will report those values.
#
# There is no need to use both the options if you need to override just
# the port or the IP address.
#
# replica-announce-ip 5.5.5.5
# replica-announce-port 1234
*/
void syncWithMaster(connection *conn) {
char tmpfile[256], *err = NULL;
...
/* Set the slave port, so that Master's INFO command can list the
* slave listening port correctly. */
/* Report our own port to the master so that the master can later list its
replicas correctly in the output of INFO.
*/
if (server.repl_state == REPL_STATE_SEND_PORT) {
int port;
if (server.slave_announce_port) port = server.slave_announce_port;
else if (server.tls_replication && server.tls_port) port = server.tls_port;
else port = server.port;
sds portstr = sdsfromlonglong(port);
err = sendSynchronousCommand(SYNC_CMD_WRITE,conn,"REPLCONF",
"listening-port",portstr, NULL);
sdsfree(portstr);
if (err) goto write_error;
sdsfree(err);
server.repl_state = REPL_STATE_RECEIVE_PORT;
return;
}
...
}
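The replica sends its port number to the master with the REPLCONF listening-port command, sets the replication state to REPL_STATE_RECEIVE_PORT, and waits for the master's reply. The master reads REPLCONF listening-port <port> from the fd and handles it in replconfCommand(), which is defined in replication.c. REPLCONF can set several different options; after parsing the port number, the master stores it in the client structure that represents the replica (c->slave_listening_port = port) and finally replies "+OK\r\n". When the master writes the reply to the fd, the replica's readable event fires again and the replica calls syncWithMaster() to process the reply. After this step the replica's replication state changes as follows:
REPL_STATE_SEND_PORT ---> REPL_STATE_RECEIVE_PORT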
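The order of precedence used to pick the announced port can be isolated in a small pure function. `announced_port` is a hypothetical name; its parameters stand for server.slave_announce_port, server.tls_replication, server.tls_port and server.port from the branch above.

```c
#include <assert.h>

/* Hypothetical helper mirroring the port selection in REPL_STATE_SEND_PORT:
 * 1. an explicit announce port wins;
 * 2. otherwise the TLS port, if replication uses TLS and a TLS port is set;
 * 3. otherwise the regular listening port. */
int announced_port(int announce_port, int tls_replication, int tls_port,
                   int port) {
    assert(port >= 0);
    if (announce_port) return announce_port;
    if (tls_replication && tls_port) return tls_port;
    return port;
}
```

With an announce port of 0, TLS replication enabled, a TLS port of 6380 and a regular port of 6379, the function returns 6380; setting the announce port to 1234 overrides both.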
1.5.10 Receiving the master's reply to the port information
void syncWithMaster(connection *conn) {
char tmpfile[256], *err = NULL;
...
/* Receive REPLCONF listening-port reply. */
if (server.repl_state == REPL_STATE_RECEIVE_PORT) {
err = sendSynchronousCommand(SYNC_CMD_READ,conn,NULL);
/* Ignore the error if any, not all the Redis versions support
* REPLCONF listening-port. */
if (err[0] == '-') {
serverLog(LL_NOTICE,"(Non critical) Master does not understand "
"REPLCONF listening-port: %s", err);
}
sdsfree(err);
server.repl_state = REPL_STATE_SEND_IP;
}
...
}
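After the previous rounds this branch is easy to read. Note that even when the master's reply is negative, the replica ignores it, because not every Redis version supports "REPLCONF listening-port". After this step the replica's replication state changes as follows:
REPL_STATE_RECEIVE_PORT ---> REPL_STATE_SEND_IP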
1.5.11 Sending the IP address to the master
void syncWithMaster(connection *conn) {
char tmpfile[256], *err = NULL;
...
/* Skip REPLCONF ip-address if there is no slave-announce-ip option set. */
/* Skip this step if slave-announce-ip is not set. */
if (server.repl_state == REPL_STATE_SEND_IP &&
server.slave_announce_ip == NULL)
{
server.repl_state = REPL_STATE_SEND_CAPA;
}
/* Set the slave ip, so that Master's INFO command can list the
* slave IP address port correctly in case of port forwarding or NAT. */
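/*
If the replica's IP is set, the master can list the replica's real IP in
the output of INFO; otherwise, behind port forwarding or NAT, the master
may be unable to list the real IP address.
*/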
if (server.repl_state == REPL_STATE_SEND_IP) {
err = sendSynchronousCommand(SYNC_CMD_WRITE,conn,"REPLCONF",
"ip-address",server.slave_announce_ip, NULL);
if (err) goto write_error;
sdsfree(err);
server.repl_state = REPL_STATE_RECEIVE_IP;
return;
}
...
}
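The replica sends REPLCONF ip-address <server.slave_announce_ip> to the master, writing its IP to the fd, sets the replication state to REPL_STATE_RECEIVE_IP, and returns to wait for the fd's readable event. The master again handles REPLCONF in replconfCommand() (replication.c): it parses REPLCONF ip-address <ip>, stores the IP in c->slave_ip of the client that represents the replica, and writes "+OK\r\n" to the fd. The replica then sees a readable event on the fd and calls syncWithMaster() to check whether the master accepted the IP. After this step the replica's replication state has changed in one of two ways:
REPL_STATE_SEND_IP ---> REPL_STATE_RECEIVE_IP (slave-announce-ip is set)
REPL_STATE_SEND_IP ---> REPL_STATE_SEND_CAPA (slave-announce-ip is not set)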
1.5.12 Receiving the master's reply to the IP information
void syncWithMaster(connection *conn) {
char tmpfile[256], *err = NULL;
...
/* Receive REPLCONF ip-address reply. */
if (server.repl_state == REPL_STATE_RECEIVE_IP) {
err = sendSynchronousCommand(SYNC_CMD_READ,conn,NULL);
/* Ignore the error if any, not all the Redis versions support
* REPLCONF listening-port. */
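/* Note: the comment above appears to be copied from the listening-port
branch; this branch handles REPLCONF ip-address. */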
if (err[0] == '-') {
serverLog(LL_NOTICE,"(Non critical) Master does not understand "
"REPLCONF ip-address: %s", err);
}
sdsfree(err);
server.repl_state = REPL_STATE_SEND_CAPA;
}
...
}
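The replica receives the master's reply about the IP information. Even when the reply is negative, the replica ignores it, because not every Redis version supports "REPLCONF ip-address". After this step the replica's replication state changes as follows:
REPL_STATE_RECEIVE_IP ---> REPL_STATE_SEND_CAPA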
1.5.13 Sending CAPA (capabilities) to the master
void syncWithMaster(connection *conn) {
char tmpfile[256], *err = NULL;
...
/* Inform the master of our (slave) capabilities.
*
* EOF: supports EOF-style RDB transfer for diskless replication.
* PSYNC2: supports PSYNC v2, so understands +CONTINUE <new repl ID>.
*
* The master will ignore capabilities it does not understand. */
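/* Tell the master which capabilities this replica has (they describe the
synchronization mechanisms it supports):
(1) EOF: supports the EOF-style RDB transfer used by diskless replication
(the RDB is streamed over the socket and terminated by an EOF marker);
(2) PSYNC2: supports PSYNC v2 and therefore understands the
+CONTINUE <new repl ID> reply.
The master ignores capabilities it does not understand.
*/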
if (server.repl_state == REPL_STATE_SEND_CAPA) {
err = sendSynchronousCommand(SYNC_CMD_WRITE,conn,"REPLCONF",
"capa","eof","capa","psync2",NULL);
if (err) goto write_error;
sdsfree(err);
server.repl_state = REPL_STATE_RECEIVE_CAPA;
return;
}
...
}
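The replica writes the command REPLCONF capa eof capa psync2 to the fd and sends it to the master. The master again handles REPLCONF in replconfCommand() (replication.c): it parses the command, records the capabilities in c->slave_capa of the client, and writes "+OK\r\n" to the fd. The replica then sees a readable event on the fd and calls syncWithMaster() to check whether the master accepted the capabilities. After this step the replica's replication state changes as follows:
REPL_STATE_SEND_CAPA ---> REPL_STATE_RECEIVE_CAPA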
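The master-side handling of the three REPLCONF options described in 1.5.9, 1.5.11 and this section can be approximated by a loop over option/value pairs. This is a simplified sketch, not the code of replconfCommand(): `replica_info`, `parse_replconf` and the CAPA_* constants are hypothetical names, atoi stands in for Redis's own integer parsing, and other REPLCONF options are not modelled.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <strings.h>

#define CAPA_NONE   0
#define CAPA_EOF    (1 << 0)
#define CAPA_PSYNC2 (1 << 1)

/* Hypothetical per-replica record kept by the master. */
typedef struct {
    int listening_port;
    char ip[46];
    int capa;           /* bitmask of CAPA_* flags */
} replica_info;

/* Parse "REPLCONF <option> <value> ..." where argv[0] is "REPLCONF".
 * Returns 0 on success, -1 on a syntax error or an unknown option. */
int parse_replconf(replica_info *r, int argc, const char **argv) {
    assert(r != NULL && argv != NULL);
    if (argc < 3 || argc % 2 == 0) return -1;  /* need option/value pairs */
    for (int j = 1; j < argc; j += 2) {
        const char *opt = argv[j], *val = argv[j + 1];
        if (strcasecmp(opt, "listening-port") == 0) {
            r->listening_port = atoi(val);
        } else if (strcasecmp(opt, "ip-address") == 0) {
            if (strlen(val) >= sizeof(r->ip)) return -1;
            strcpy(r->ip, val);
        } else if (strcasecmp(opt, "capa") == 0) {
            if (strcasecmp(val, "eof") == 0) r->capa |= CAPA_EOF;
            else if (strcasecmp(val, "psync2") == 0) r->capa |= CAPA_PSYNC2;
            /* Capabilities that are not understood are ignored. */
        } else {
            return -1;
        }
    }
    return 0;
}
```

Storing the capabilities as a bitmask lets later code test a single flag, for example `r.capa & CAPA_PSYNC2`, and an unknown capability name leaves the mask unchanged, which matches the "master will ignore capabilities it does not understand" rule.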
1.5.14 Receiving the master's reply to the CAPA information
void syncWithMaster(connection *conn) {
char tmpfile[256], *err = NULL;
...
/* Receive CAPA reply. */
if (server.repl_state == REPL_STATE_RECEIVE_CAPA) {
err = sendSynchronousCommand(SYNC_CMD_READ,conn,NULL);
/* Ignore the error if any, not all the Redis versions support
* REPLCONF capa. */
if (err[0] == '-') {
serverLog(LL_NOTICE,"(Non critical) Master does not understand "
"REPLCONF capa: %s", err);
}
sdsfree(err);
server.repl_state = REPL_STATE_SEND_PSYNC;
}
...
}
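The replica receives the master's reply about the CAPA information. Even when the reply is negative, the replica ignores it, because not every Redis version supports "REPLCONF capa". After this step the replica's replication state changes as follows:
REPL_STATE_RECEIVE_CAPA ---> REPL_STATE_SEND_PSYNC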
1.5.15 Sending the PSYNC command to the master
For reasons of length, the analysis continues in the next article.
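2. Open questions
2.1 How to tell whether the master supports partial resynchronization
2.2 Tracing the replication logic through the code
1) Connect the socket;
2) Send a PING request to confirm that the connection works;
3) Authenticate with a password (if required);
4) Exchange information (listening port, IP address, capabilities);
5) Send the PSYNC command;
6) Receive the RDB file and load it;
7) The connection is established; wait for the master to propagate command requests.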
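Steps 2) to 5) correspond to the handshake states traced in section 1.5. The sketch below collects the success-path transitions from those subsections into one function. It is a model for study under stated assumptions: the HS_* names, `hs_config`, `hs_next` and `hs_round_trips` are hypothetical, and error paths are left out.

```c
#include <assert.h>
#include <stddef.h>

typedef enum {
    HS_CONNECTING, HS_RECEIVE_PONG, HS_SEND_AUTH, HS_RECEIVE_AUTH,
    HS_SEND_PORT, HS_RECEIVE_PORT, HS_SEND_IP, HS_RECEIVE_IP,
    HS_SEND_CAPA, HS_RECEIVE_CAPA, HS_SEND_PSYNC
} hs_state;

typedef struct {
    int has_auth;         /* masterauth is configured */
    int has_announce_ip;  /* slave-announce-ip is configured */
} hs_config;

/* Next state on the success path, following the transitions in 1.5.5
 * through 1.5.14. */
hs_state hs_next(hs_state s, const hs_config *cfg) {
    assert(cfg != NULL);
    switch (s) {
    case HS_CONNECTING:   return HS_RECEIVE_PONG;
    case HS_RECEIVE_PONG: return HS_SEND_AUTH;
    case HS_SEND_AUTH:    return cfg->has_auth ? HS_RECEIVE_AUTH : HS_SEND_PORT;
    case HS_RECEIVE_AUTH: return HS_SEND_PORT;
    case HS_SEND_PORT:    return HS_RECEIVE_PORT;
    case HS_RECEIVE_PORT: return HS_SEND_IP;
    case HS_SEND_IP:      return cfg->has_announce_ip ? HS_RECEIVE_IP : HS_SEND_CAPA;
    case HS_RECEIVE_IP:   return HS_SEND_CAPA;
    case HS_SEND_CAPA:    return HS_RECEIVE_CAPA;
    case HS_RECEIVE_CAPA: return HS_SEND_PSYNC;
    default:              return HS_SEND_PSYNC;
    }
}

/* Count the commands written before PSYNC. Entering a RECEIVE_* state
 * means that one command was sent and one reply is awaited. */
int hs_round_trips(const hs_config *cfg) {
    int trips = 0;
    hs_state s = HS_CONNECTING;
    while (s != HS_SEND_PSYNC) {
        s = hs_next(s, cfg);
        if (s == HS_RECEIVE_PONG || s == HS_RECEIVE_AUTH ||
            s == HS_RECEIVE_PORT || s == HS_RECEIVE_IP ||
            s == HS_RECEIVE_CAPA) trips++;
    }
    return trips;
}
```

In this model a replica with neither masterauth nor slave-announce-ip exchanges three commands before PSYNC (PING, REPLCONF listening-port, REPLCONF capa); configuring both raises the count to five.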
2.3 Tracing the partial resynchronization logic through the code