Redis source code analysis -- 14. Implementing multi-machine databases (2) -- how replication is implemented: SLAVEOF and PSYNC

Article link: https://blog.csdn.net/qq_16399991/article/details/110203645

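Environment: Redis source version 5.0.3. I annotated the source while reading it; the Git repository is at https://gitee.com/xiaoangg/redis_annotation
Corrections are welcome.
Reference book: "Redis Design and Implementation"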


Recommended reading
Setting up replication: https://blog.csdn.net/qq_16399991/article/details/99881319
How replication works: https://blog.csdn.net/qq_16399991/article/details/109748991

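1. Implementing replication

1.1 Setting the master's address and port

Sending the SLAVEOF command to a slave makes it replicate a master:

# Replicate the master at 127.0.0.1, port 6379
SLAVEOF 127.0.0.1 6379

The main job of SLAVEOF is to record the master's address and port on the slave; they are stored in the slave's masterhost and masterport attributes.

SLAVEOF is an asynchronous command: once the settings are saved it replies OK to the client, and the actual replication work only starts after that OK has been returned.

The entry point of the SLAVEOF command is replication.c/replicaofCommand: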


/**
 * Implementation of the replication commands
 * SLAVEOF and REPLICAOF.
 */
void replicaofCommand(client *c) {
    /* SLAVEOF is not allowed in cluster mode as replication is automatically
     * configured using the current address of the master node. */
    if (server.cluster_enabled) {
        addReplyError(c,"REPLICAOF not allowed in cluster mode.");
        return;
    }

    /* The special host/port combination "NO" "ONE" turns the instance
     * into a master. Otherwise the new master address is set. */
    if (!strcasecmp(c->argv[1]->ptr,"no") &&
        !strcasecmp(c->argv[2]->ptr,"one")) { // "SLAVEOF NO ONE" stops replication
        if (server.masterhost) {
            replicationUnsetMaster();
            sds client = catClientInfoString(sdsempty(),c);
            serverLog(LL_NOTICE,"MASTER MODE enabled (user request from '%s')",
                client);
            sdsfree(client);
        }
    } else {
        long port;

        // Read the master's port number from the command arguments
        if ((getLongFromObjectOrReply(c, c->argv[2], &port, NULL) != C_OK))
            return;

        /* Check if we are already attached to the specified master;
         * if so, reply that nothing needs to be done. */
        if (server.masterhost && !strcasecmp(server.masterhost,c->argv[1]->ptr)
            && server.masterport == port) {
            serverLog(LL_NOTICE,"REPLICAOF would result into synchronization with the master we are already connected with. No operation performed.");
            addReplySds(c,sdsnew("+OK Already connected to specified master\r\n"));
            return;
        }
        
        /* There was no previous master or the user specified a different one,
         * we can continue. replicationSetMaster() stores the master's host and
         * port in server.masterhost and server.masterport. */
        replicationSetMaster(c->argv[1]->ptr, port);

        // catClientInfoString() returns the client's information as a string
        sds client = catClientInfoString(sdsempty(),c);
        serverLog(LL_NOTICE,"REPLICAOF %s:%d enabled (user request from '%s')",
            server.masterhost, server.masterport, client);
        sdsfree(client);
    }
    // Reply OK
    addReply(c,shared.ok);
}
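To make the "asynchronous" behaviour concrete, here is a minimal, self-contained sketch of the bookkeeping SLAVEOF performs. Everything prefixed with toy_ or TOY_ is hypothetical and invented for illustration; it is not the real struct redisServer, and the error string is only modelled after the usual integer-parsing error. The point it illustrates is that the command only records the address and flips a state flag, leaving the connection to a later periodic task.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <strings.h>   /* strcasecmp (POSIX) */

/* Hypothetical, simplified stand-in for the replication fields of the
 * server state; not the real Redis definition. */
typedef enum { TOY_REPL_NONE, TOY_REPL_CONNECT } toy_repl_state;

typedef struct {
    char masterhost[64];   /* empty string means "no master" */
    int masterport;
    toy_repl_state state;
} toy_server;

/* Models the SLAVEOF/REPLICAOF decision and returns the reply the client
 * would see. Only bookkeeping happens here; no socket is opened. */
const char *toy_replicaof(toy_server *s, const char *host, const char *port) {
    /* "NO ONE" turns the instance back into a master. */
    if (!strcasecmp(host, "no") && !strcasecmp(port, "one")) {
        s->masterhost[0] = '\0';
        s->masterport = 0;
        s->state = TOY_REPL_NONE;
        return "+OK";
    }

    int p;
    if (sscanf(port, "%d", &p) != 1)
        return "-ERR value is not an integer or out of range";

    /* Same master as before: nothing to do. */
    if (s->masterhost[0] && !strcasecmp(s->masterhost, host) &&
        s->masterport == p)
        return "+OK Already connected to specified master";

    /* Record the new master; a periodic task connects later. */
    snprintf(s->masterhost, sizeof(s->masterhost), "%s", host);
    s->masterport = p;
    s->state = TOY_REPL_CONNECT;
    return "+OK";
}
```

Calling toy_replicaof twice with the same host and port yields the "Already connected" reply the second time, matching the early return in replicaofCommand above.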

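1.2 Establishing the socket connection

After SLAVEOF returns, the slave connects to the master using the configured IP address and port. The call path is server.c/serverCron > replication.c/replicationCron > replication.c/connectWithMaster.

If the connection is set up successfully, the slave associates the socket with a file event handler dedicated to replication work: replication.c/syncWithMaster.

syncWithMaster drives the replication work that follows the connection: it runs the handshake described in the next sections and, when a full resynchronization is needed, arranges for the RDB file to be received. Write commands propagated by the master arrive later, once synchronization is complete.

replication.c/connectWithMaster is shown below: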


/**
 * Connect this slave to the master.
 */
int connectWithMaster(void) {
    int fd;

    // Best-effort, non-blocking connect to the master
    fd = anetTcpNonBlockBestEffortBindConnect(NULL,
        server.masterhost,server.masterport,NET_FIRST_BIND_ADDR);
    if (fd == -1) {
        serverLog(LL_WARNING,"Unable to connect to MASTER: %s",
            strerror(errno));
        return C_ERR;
    }

    // Register the file event handler that performs the replication work
    if (aeCreateFileEvent(server.el,fd,AE_READABLE|AE_WRITABLE,syncWithMaster,NULL) ==
            AE_ERR)
    {
        close(fd);
        serverLog(LL_WARNING,"Can't create readable event for SYNC");
        return C_ERR;
    }

    server.repl_transfer_lastio = server.unixtime;
    server.repl_transfer_s = fd;
    server.repl_state = REPL_STATE_CONNECTING;
    return C_OK;
}

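1.3 Sending the PING command

The first thing the slave does after connecting is send a PING command to the master. PING serves two purposes:

It checks that the socket can be read from and written to correctly.
It checks that the master is able to process commands normally.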


void syncWithMaster(aeEventLoop *el, int fd, void *privdata, int mask) {
    //......

    /* Send a PING to check the master is able to reply without errors. */
    if (server.repl_state == REPL_STATE_CONNECTING) {
        serverLog(LL_NOTICE,"Non blocking connect for SYNC fired the event.");
        /* Delete the writable event so that the readable event remains
         * registered and we can wait for the PONG reply. */
        aeDeleteFileEvent(server.el,fd,AE_WRITABLE);
        server.repl_state = REPL_STATE_RECEIVE_PONG;
        /* Send the PING, don't check for errors at all, we have the timeout
         * that will take care about this. */
        err = sendSynchronousCommand(SYNC_CMD_WRITE,fd,"PING",NULL);
        if (err) goto write_error;
        return;
    }

    /* Receive the PONG command. */
    if (server.repl_state == REPL_STATE_RECEIVE_PONG) {
        err = sendSynchronousCommand(SYNC_CMD_READ,fd,NULL);

        /* We accept only two replies as valid, a positive +PONG reply
         * (we just check for "+") or an authentication error.
         * Note that older versions of Redis replied with "operation not
         * permitted" instead of using a proper error code, so we test
         * both. */
        if (err[0] != '+' &&
            strncmp(err,"-NOAUTH",7) != 0 &&
            strncmp(err,"-ERR operation not permitted",28) != 0)
        {
            serverLog(LL_WARNING,"Error reply to PING from master: '%s'",err);
            sdsfree(err);
            goto error;
        } else {
            serverLog(LL_NOTICE,
                "Master replied to PING, replication can continue...");
        }
        sdsfree(err);
        server.repl_state = REPL_STATE_SEND_AUTH;
    }


    //.....

}
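The reply check in the excerpt above can be isolated into a small pure function. This is a sketch for illustration only: ping_reply_acceptable is a hypothetical name, and the function takes a plain C string instead of the sds value used in Redis.

```c
#include <assert.h>
#include <string.h>

/* Returns 1 if the reply to PING lets the handshake continue, 0 otherwise.
 * Mirrors the three accepted cases in the excerpt: a positive reply, an
 * authentication error, or the legacy "operation not permitted" error. */
int ping_reply_acceptable(const char *reply) {
    if (reply[0] == '+') return 1;                        /* e.g. +PONG */
    if (strncmp(reply, "-NOAUTH", 7) == 0) return 1;      /* AUTH is needed */
    if (strncmp(reply, "-ERR operation not permitted", 28) == 0) return 1;
    return 0;                                             /* anything else aborts */
}
```

Authentication errors are accepted here because they still prove that the master is alive and answering; the password is dealt with in the next step.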

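1.4 Authentication

After receiving the PONG reply, the slave's next step is authentication.

If the slave has masterauth configured, it sends an AUTH command to the master; otherwise it skips this step.

[Figure: AUTH exchange between slave and master]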
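As a rough model of this step, the sketch below enumerates the possible outcomes from the two relevant settings: masterauth on the slave and requirepass on the master. The names toy_auth_result and toy_auth_step are hypothetical, and the model deliberately ignores what happens later in the handshake when the slave skips AUTH against a password-protected master.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef enum { AUTH_NOT_SENT, AUTH_ACCEPTED, AUTH_REJECTED } toy_auth_result;

/* masterauth:  password configured on the slave (NULL if unset).
 * requirepass: password configured on the master (NULL if unset). */
toy_auth_result toy_auth_step(const char *masterauth, const char *requirepass) {
    if (masterauth == NULL) return AUTH_NOT_SENT;    /* slave skips AUTH */
    if (requirepass == NULL) return AUTH_REJECTED;   /* master has no password set */
    return strcmp(masterauth, requirepass) == 0 ? AUTH_ACCEPTED
                                                : AUTH_REJECTED;
}
```

In this model a rejected AUTH means the slave gives up on the current synchronization attempt, while AUTH_NOT_SENT simply moves on to the port announcement.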

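1.5 Sending the port information

After authentication, the slave executes REPLCONF listening-port <port-number> to tell the master which port the slave is listening on.

On receiving this command, the master records the port in the slave_listening_port attribute of the slave's client state.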

void syncWithMaster(aeEventLoop *el, int fd, void *privdata, int mask) {
    //.......

    // Send REPLCONF to the master to announce this slave's listening port
    /* Set the slave port, so that Master's INFO command can list the
     * slave listening port correctly. */
    if (server.repl_state == REPL_STATE_SEND_PORT) {
        sds port = sdsfromlonglong(server.slave_announce_port ?
            server.slave_announce_port : server.port);
        err = sendSynchronousCommand(SYNC_CMD_WRITE,fd,"REPLCONF",
                "listening-port",port, NULL);
        sdsfree(port);
        if (err) goto write_error;
        sdsfree(err);
        server.repl_state = REPL_STATE_RECEIVE_PORT;
        return;
    }

    /* Receive REPLCONF listening-port reply. */
    if (server.repl_state == REPL_STATE_RECEIVE_PORT) {
        err = sendSynchronousCommand(SYNC_CMD_READ,fd,NULL);
        /* Ignore the error if any, not all the Redis versions support
         * REPLCONF listening-port. */
        if (err[0] == '-') {
            serverLog(LL_NOTICE,"(Non critical) Master does not understand "
                                "REPLCONF listening-port: %s", err);
        }
        sdsfree(err);
        server.repl_state = REPL_STATE_SEND_IP;
    }

    //......
}
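The excerpt passes the command as a NULL-terminated argument list. The sketch below shows one way such a list can be turned into bytes on the wire, assuming the simple inline framing (arguments joined by spaces and terminated by CRLF) that, to my understanding, this version of sendSynchronousCommand uses. toy_build_inline_command is a hypothetical helper and not part of Redis.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

/* Joins a NULL-terminated list of string arguments with single spaces and
 * appends CRLF. Returns the number of bytes written (excluding the
 * terminating NUL), or -1 if the buffer is too small. */
int toy_build_inline_command(char *buf, size_t buflen, ...) {
    va_list ap;
    size_t used = 0;
    const char *arg;

    va_start(ap, buflen);
    while ((arg = va_arg(ap, const char *)) != NULL) {
        /* Prefix every argument except the first with a space. */
        int n = snprintf(buf + used, buflen - used, "%s%s",
                         used ? " " : "", arg);
        if (n < 0 || (size_t)n >= buflen - used) {
            va_end(ap);
            return -1;
        }
        used += (size_t)n;
    }
    va_end(ap);

    if (used + 3 > buflen) return -1;
    memcpy(buf + used, "\r\n", 3);   /* CR, LF and the terminating NUL */
    return (int)(used + 2);
}
```

For the arguments "REPLCONF", "listening-port", "6380" this produces the line REPLCONF listening-port 6380 followed by CRLF.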

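1.6 Synchronization & command propagation

In this step the slave sends the PSYNC command to the master.

Once synchronization completes, the master and slave enter the command propagation phase: the master sends every write command it executes to all of its slaves, and each slave executes the write commands it receives.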
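The steps of section 1 form a small state machine that syncWithMaster advances each time the socket becomes ready. The sketch below models only the steps covered in this article; the HS_ names are hypothetical stand-ins for the REPL_STATE_ constants, and the real code has additional states between the port announcement and PSYNC (for example REPL_STATE_SEND_IP, visible in the excerpt in 1.5).

```c
#include <assert.h>

typedef enum {
    HS_CONNECTING, HS_RECEIVE_PONG, HS_SEND_AUTH, HS_RECEIVE_AUTH,
    HS_SEND_PORT, HS_RECEIVE_PORT, HS_SEND_PSYNC, HS_RECEIVE_PSYNC, HS_DONE
} toy_hs_state;

/* Returns the state that follows s when the current step succeeds. */
toy_hs_state toy_hs_next(toy_hs_state s, int has_masterauth) {
    switch (s) {
    case HS_CONNECTING:    return HS_RECEIVE_PONG;  /* PING was sent */
    case HS_RECEIVE_PONG:  return HS_SEND_AUTH;
    case HS_SEND_AUTH:     /* AUTH is skipped when no password is configured */
        return has_masterauth ? HS_RECEIVE_AUTH : HS_SEND_PORT;
    case HS_RECEIVE_AUTH:  return HS_SEND_PORT;
    case HS_SEND_PORT:     return HS_RECEIVE_PORT;
    case HS_RECEIVE_PORT:  return HS_SEND_PSYNC;    /* simplified: steps omitted */
    case HS_SEND_PSYNC:    return HS_RECEIVE_PSYNC;
    case HS_RECEIVE_PSYNC: return HS_DONE;
    default:               return HS_DONE;
    }
}

/* Counts the transitions needed to complete the modelled handshake. */
int toy_hs_steps(int has_masterauth) {
    int steps = 0;
    toy_hs_state s = HS_CONNECTING;
    while (s != HS_DONE) {
        s = toy_hs_next(s, has_masterauth);
        steps++;
    }
    return steps;
}
```

Because each transition is triggered by a socket event, the slave never blocks on the whole handshake at once, which is why the excerpts return after each write step.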

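2. Implementing the PSYNC command

PSYNC can be invoked in two ways:

PSYNC ? -1 (full resynchronization)
If the slave has never replicated a master, or has executed SLAVEOF NO ONE (which cancels replication), it sends PSYNC ? -1.
PSYNC <runid> <offset> (partial resynchronization)
If the slave has already replicated a master, it sends PSYNC <runid> <offset>, where runid is the ID of the master it last replicated and offset is the slave's current replication offset.

When the master receives PSYNC <runid> <offset>, it decides whether a partial resynchronization is possible and replies accordingly.

The master can send one of three replies:

+FULLRESYNC <runid> <offset>: perform a full resynchronization.
+CONTINUE: perform a partial resynchronization.
-ERR: the master does not support PSYNC; the slave falls back to sending the SYNC command.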
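On the slave side, the three replies listed above have to be told apart. The following is a minimal sketch of such a classifier; toy_parse_psync_reply and its enum are hypothetical names, and reporting a malformed +FULLRESYNC line as a separate result is a simplification chosen for this example rather than what Redis does.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

typedef enum {
    REPLY_FULLRESYNC, REPLY_CONTINUE, REPLY_NOT_SUPPORTED, REPLY_MALFORMED
} toy_psync_reply;

/* Classifies the master's reply to PSYNC. On REPLY_FULLRESYNC, replid
 * (a buffer of at least 41 bytes) and *offset are filled in. */
toy_psync_reply toy_parse_psync_reply(const char *reply, char *replid,
                                      long long *offset) {
    if (strncmp(reply, "+FULLRESYNC", 11) == 0) {
        /* Expected shape: +FULLRESYNC <40-char id> <offset> */
        if (sscanf(reply, "+FULLRESYNC %40s %lld", replid, offset) == 2)
            return REPLY_FULLRESYNC;
        return REPLY_MALFORMED;
    }
    if (strncmp(reply, "+CONTINUE", 9) == 0) return REPLY_CONTINUE;
    return REPLY_NOT_SUPPORTED;   /* e.g. -ERR: fall back to SYNC */
}
```

Note that +CONTINUE is matched by prefix, so a reply that carries an extra replication ID after the keyword is still classified as a partial resynchronization.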

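The PSYNC command enters the server through replication.c/syncCommand, which calls masterTryPartialResynchronization (shown below) to decide whether a partial resynchronization is possible: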


/* This function handles the PSYNC command from the point of view of a
 * master receiving a request for partial resynchronization.
 *
 * On success return C_OK, otherwise C_ERR is returned and we proceed
 * with the usual full resync. */
int masterTryPartialResynchronization(client *c) {
    // psync_offset: the replication offset requested in the PSYNC command
    long long psync_offset, psync_len;
    // The replication ID (run ID of the master) sent by the slave
    char *master_replid = c->argv[1]->ptr;
    char buf[128];
    int buflen;

    /* PSYNC syntax: PSYNC <MASTER_RUN_ID> <OFFSET>; "PSYNC ? -1" requests a
     * full resync.
     *
     * Parse the replication offset asked by the slave. Go to full sync
     * on parse error: this should never happen but we try to handle
     * it in a robust way compared to aborting. */
    if (getLongLongFromObjectOrReply(c,c->argv[2],&psync_offset,NULL) !=
       C_OK) goto need_full_resync;

    
    /* Is the replication ID of this master the same advertised by the wannabe
     * slave via PSYNC? If the replication ID changed this master has a
     * different replication history, and there is no way to continue.
     *
     * Note that there are two potentially valid replication IDs: the ID1
     * and the ID2. The ID2 however is only valid up to a specific offset. */
    if (strcasecmp(master_replid, server.replid) &&  // the requested ID differs from our current replication ID, and
        (strcasecmp(master_replid, server.replid2) || // it also differs from our previous ID (replid2),
         psync_offset > server.second_replid_offset)) // or the offset lies beyond the point up to which replid2 is valid
    {
        
        /* Run id "?" is used by slaves that want to force a full resync. */
        if (master_replid[0] != '?') {
            if (strcasecmp(master_replid, server.replid) &&
                strcasecmp(master_replid, server.replid2)) // the requested ID matches neither of our replication IDs
            {
                serverLog(LL_NOTICE,"Partial resynchronization not accepted: "
                    "Replication ID mismatch (Replica asked for '%s', my "
                    "replication IDs are '%s' and '%s')",
                    master_replid, server.replid, server.replid2);
            } else {
                serverLog(LL_NOTICE,"Partial resynchronization not accepted: "
                    "Requested offset for second ID was %lld, but I can reply "
                    "up to %lld", psync_offset, server.second_replid_offset);
            }
        } else {
            serverLog(LL_NOTICE,"Full resync requested by replica %s",
                replicationGetSlaveName(c));
        }
        goto need_full_resync;
    }

    /* We still have the data our slave is asking for? If not, fall back
     * to a full resync. */
    if (!server.repl_backlog ||
        psync_offset < server.repl_backlog_off ||
        psync_offset > (server.repl_backlog_off + server.repl_backlog_histlen)) // the requested offset is outside the replication backlog
    {
        serverLog(LL_NOTICE,
            "Unable to partial resync with replica %s for lack of backlog (Replica request was: %lld).", replicationGetSlaveName(c), psync_offset);
        if (psync_offset > server.master_repl_offset) {
            serverLog(LL_WARNING,
                "Warning: replica %s tried to PSYNC with an offset that is greater than the master replication offset.", replicationGetSlaveName(c));
        }
        goto need_full_resync;
    }

    /* If we reached this point, we are able to perform a partial resync:
     * 1) Set client state to make it a slave.
     * 2) Inform the client we can continue with +CONTINUE
     * 3) Send the backlog data (from the offset to the end) to the slave. */
    c->flags |= CLIENT_SLAVE;  // mark this client as a slave
    c->replstate = SLAVE_STATE_ONLINE; // replication state: online, no RDB transfer pending, only updates are sent
    c->repl_ack_time = server.unixtime; // time of the last replication ACK
    c->repl_put_online_on_ack = 0;
    listAddNodeTail(server.slaves,c); 

    /* We can't use the connection buffers since they are used to accumulate
     * new commands at this stage. But we are sure the socket send buffer is
     * empty so this write will never fail actually. */
    if (c->slave_capa & SLAVE_CAPA_PSYNC2) { // the slave supports the PSYNC2 protocol
        buflen = snprintf(buf,sizeof(buf),"+CONTINUE %s\r\n", server.replid);
    } else {
        buflen = snprintf(buf,sizeof(buf),"+CONTINUE\r\n");
    }

    // Write the +CONTINUE reply to the slave
    if (write(c->fd,buf,buflen) != buflen) {
        freeClientAsync(c);
        return C_OK;
    }
    psync_len = addReplyReplicationBacklog(c,psync_offset);
    serverLog(LL_NOTICE,
        "Partial resynchronization request from %s accepted. Sending %lld bytes of backlog starting from offset %lld.",
            replicationGetSlaveName(c),
            psync_len, psync_offset);
    /* Note that we don't need to set the selected DB at server.slaveseldb
     * to -1 to force the master to emit SELECT, since the slave already
     * has this state from the previous connection with the master. */

    refreshGoodSlavesCount();
    return C_OK; /* The caller can return, no full resync needed. */

need_full_resync:
    /* We need a full resync for some reason... Note that we can't
     * reply to PSYNC right now if a full SYNC is needed. The reply
     * must include the master offset at the time the RDB file we transfer
     * is generated, so we need to delay the reply to that moment. */
    return C_ERR;
}
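Stripped of logging and client bookkeeping, the function above boils down to two checks: does the requested ID match a replication history this master can serve, and is the requested offset still inside the backlog? The sketch below restates that decision as a pure function. toy_master and toy_can_partial_resync are hypothetical names; the fields mirror the server fields used in the excerpt.

```c
#include <assert.h>
#include <strings.h>   /* strcasecmp (POSIX) */

/* Hypothetical snapshot of the master-side fields used by the decision. */
typedef struct {
    const char *replid;              /* current replication ID */
    const char *replid2;             /* previous replication ID */
    long long second_replid_offset;  /* replid2 is valid up to this offset */
    int has_backlog;                 /* non-zero if a backlog exists */
    long long backlog_off;           /* offset of the first backlog byte */
    long long backlog_histlen;       /* bytes of history in the backlog */
} toy_master;

/* Returns 1 if a partial resync can be served, 0 if a full resync is needed. */
int toy_can_partial_resync(const toy_master *m, const char *req_id,
                           long long req_off) {
    int id1_ok = strcasecmp(req_id, m->replid) == 0;
    int id2_ok = strcasecmp(req_id, m->replid2) == 0 &&
                 req_off <= m->second_replid_offset;
    if (!id1_ok && !id2_ok) return 0;          /* unknown history, or "?" */

    if (!m->has_backlog) return 0;
    if (req_off < m->backlog_off) return 0;    /* data already overwritten */
    if (req_off > m->backlog_off + m->backlog_histlen) return 0;
    return 1;
}
```

With a backlog covering offsets 50 to 150, a request at offset 120 under the current ID can be served partially, while offset 10 (already overwritten) forces a full resynchronization.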

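3. Heartbeat detection

During the command propagation phase, the slave sends REPLCONF ACK <replication_offset> to the master, once per second by default; replication_offset is the slave's current replication offset.

REPLCONF ACK serves three main purposes:

Checking the network connection between master and slave
The master and slave check that the connection is healthy by sending and receiving REPLCONF ACK commands.
If the master has not received a REPLCONF ACK from a slave for more than one second, it knows something is wrong with the connection to that slave.
Supporting the min-slaves options
The min-slaves-to-write and min-slaves-max-lag options keep the master from executing write commands under unsafe conditions.

For example, suppose the master is configured as follows:
min-slaves-max-lag 10
min-slaves-to-write 3

The master then refuses write commands whenever fewer than 3 slaves have a lag of at most 10 seconds, that is, when there are fewer than 3 slaves or too many of them lag by more than 10 seconds.
Detecting lost commands
If a write command propagated by the master is lost because of a network problem, then when the slave sends REPLCONF ACK the master notices that the slave's replication offset is smaller than its own.

The master then resends the missing data to the slave.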
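The min-slaves rule can be sketched as a small counting function over the time of each slave's last ACK. The names below are hypothetical, and the sketch assumes (as I understand the Redis 5 logic in refreshGoodSlavesCount) that a slave counts as "good" when its lag is less than or equal to min-slaves-max-lag.

```c
#include <assert.h>
#include <stddef.h>

/* Counts slaves whose last REPLCONF ACK is at most max_lag seconds old.
 * ack_times holds the time of the last ACK for each slave, in seconds. */
int toy_count_good_slaves(const long *ack_times, size_t n, long now,
                          long max_lag) {
    int good = 0;
    for (size_t i = 0; i < n; i++)
        if (now - ack_times[i] <= max_lag) good++;
    return good;
}

/* Returns 1 if the master may accept writes under the min-slaves rule. */
int toy_writes_allowed(const long *ack_times, size_t n, long now,
                       int min_slaves, long max_lag) {
    return toy_count_good_slaves(ack_times, n, now, max_lag) >= min_slaves;
}
```

With three slaves lagging 5, 10 and 20 seconds and the configuration shown above (max lag 10, minimum 3 slaves), only two slaves qualify, so writes are refused.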

The call path for heartbeat detection is server.c/serverCron > replication.c/replicationCron > replication.c/replicationSendAck:

// Replication cron function
// Called once per second
void replicationCron(void) {
    //..............

    /* Send ACK to master from time to time.
     * Note that we do not send periodic acks to masters that don't
     * support PSYNC and replication offsets. */
    if (server.masterhost && server.master &&
        !(server.master->flags & CLIENT_PRE_PSYNC))
        replicationSendAck();

    //..............

}


/* Send a REPLCONF ACK command to the master to inform it about the current
 * processed offset. If we are not connected with a master, the command has
 * no effects. */
void replicationSendAck(void) {
    client *c = server.master;

    if (c != NULL) {
        c->flags |= CLIENT_MASTER_FORCE_REPLY;
        addReplyMultiBulkLen(c,3);
        addReplyBulkCString(c,"REPLCONF");
        addReplyBulkCString(c,"ACK");
        addReplyBulkLongLong(c,c->reploff);
        c->flags &= ~CLIENT_MASTER_FORCE_REPLY;
    }
}
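The three addReply calls in replicationSendAck emit a three-element multi-bulk request. As a sketch of what ends up on the wire, the hypothetical helper below formats the same command with snprintf using the Redis protocol's array and bulk-string framing; it is an illustration and not the code path Redis uses.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Encodes "REPLCONF ACK <offset>" as a three-element array of bulk strings.
 * Returns the number of bytes written, or -1 if the buffer is too small. */
int toy_encode_replconf_ack(char *buf, size_t buflen, long long offset) {
    char num[32];
    /* The offset is sent as a bulk string, so its length in bytes is needed. */
    int numlen = snprintf(num, sizeof(num), "%lld", offset);
    int n = snprintf(buf, buflen,
                     "*3\r\n$8\r\nREPLCONF\r\n$3\r\nACK\r\n$%d\r\n%s\r\n",
                     numlen, num);
    if (n < 0 || (size_t)n >= buflen) return -1;
    return n;
}
```

For offset 1234 the buffer holds an array header (*3), then the bulk strings REPLCONF, ACK and 1234, each preceded by its length and terminated by CRLF.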