RocketMQ Source Code Analysis: Master-Slave Data Replication


Preface

In RocketMQ's master-slave architecture, data is synchronized between the master and the slave. This synchronization covers metadata replication and commitlog replication. Why doesn't the synchronized data include the consumequeue and indexFile? Think about it: on the master, the consumequeue and indexFile are built from the commitlog, so after the slave has synchronized the commitlog it only needs to rebuild the consumequeue and indexFile from it locally. This article analyzes how the master and slave synchronize data.


I. Metadata Replication

1. Entry point of metadata replication
In RocketMQ's master-slave architecture, starting a slave node schedules a periodic task whose job is to pull metadata from the master:

private void handleSlaveSynchronize(BrokerRole role) {
        if (role == BrokerRole.SLAVE) {
            if (null != slaveSyncFuture) {
                slaveSyncFuture.cancel(false);
            }
            this.slaveSynchronize.setMasterAddr(null);
            slaveSyncFuture = this.scheduledExecutorService.scheduleAtFixedRate(new Runnable() {
                @Override
                public void run() {
                    try {
                        BrokerController.this.slaveSynchronize.syncAll();
                    }
                    catch (Throwable e) {
                        log.error("ScheduledTask SlaveSynchronize syncAll error.", e);
                    }
                }
            }, 1000 * 3, 1000 * 10, TimeUnit.MILLISECONDS);
        } else {
            //handle the slave synchronise
            if (null != slaveSyncFuture) {
                slaveSyncFuture.cancel(false);
            }
            this.slaveSynchronize.setMasterAddr(null);
        }
    }

public void syncAll() {
        this.syncTopicConfig();
        this.syncConsumerOffset();
        this.syncDelayOffset();
        this.syncSubscriptionGroupConfig();
    }

2. What does metadata replication include?
As syncAll() shows, metadata replication mainly covers the following files:
(1) topics.json: topic configuration
(2) consumerOffset.json: consumer consumption progress
(3) delayOffset.json: delayed-message pull progress
(4) subscriptionGroup.json: consumer group configuration
3. Metadata synchronization flow
All four metadata files are synchronized the same way in RocketMQ, so topics.json serves as the example here.
As the syncAll() method above shows, topic configuration synchronization starts in the syncTopicConfig() method:

private void syncTopicConfig() {
        String masterAddrBak = this.masterAddr;
        if (masterAddrBak != null && !masterAddrBak.equals(brokerController.getBrokerAddr())) {
            try {
                TopicConfigSerializeWrapper topicWrapper =
                    this.brokerController.getBrokerOuterAPI().getAllTopicConfig(masterAddrBak);
                if (!this.brokerController.getTopicConfigManager().getDataVersion()
                    .equals(topicWrapper.getDataVersion())) {

                    this.brokerController.getTopicConfigManager().getDataVersion()
                        .assignNewOne(topicWrapper.getDataVersion());
                    this.brokerController.getTopicConfigManager().getTopicConfigTable().clear();
                    this.brokerController.getTopicConfigManager().getTopicConfigTable()
                        .putAll(topicWrapper.getTopicConfigTable());
                    this.brokerController.getTopicConfigManager().persist();

                    log.info("Update slave topic config from master, {}", masterAddrBak);
                }
            } catch (Exception e) {
                log.error("SyncTopicConfig Exception, {}", masterAddrBak, e);
            }
        }
    }

First, the slave calls getAllTopicConfig, which synchronously sends a RequestCode.GET_ALL_TOPIC_CONFIG request to the master to fetch the topic configuration.

public TopicConfigSerializeWrapper getAllTopicConfig(
        final String addr) throws RemotingConnectException, RemotingSendRequestException,
        RemotingTimeoutException, InterruptedException, MQBrokerException {
        RemotingCommand request = RemotingCommand.createRequestCommand(RequestCode.GET_ALL_TOPIC_CONFIG, null);

        RemotingCommand response = this.remotingClient.invokeSync(MixAll.brokerVIPChannel(true, addr), request, 3000);
        assert response != null;
        switch (response.getCode()) {
            case ResponseCode.SUCCESS: {
                return TopicConfigSerializeWrapper.decode(response.getBody(), TopicConfigSerializeWrapper.class);
            }
            default:
                break;
        }

        throw new MQBrokerException(response.getCode(), response.getRemark(), addr);
    }

When the master receives the slave's request, it is handled in AdminBrokerProcessor, specifically by its getAllTopicConfig method, which encodes the master's topicConfigTable and dataVersion into a JSON string and returns it to the slave.

private RemotingCommand getAllTopicConfig(ChannelHandlerContext ctx, RemotingCommand request) {
        final RemotingCommand response = RemotingCommand.createResponseCommand(GetAllTopicConfigResponseHeader.class);
        // final GetAllTopicConfigResponseHeader responseHeader =
        // (GetAllTopicConfigResponseHeader) response.readCustomHeader();

        String content = this.brokerController.getTopicConfigManager().encode();
        if (content != null && content.length() > 0) {
            try {
                response.setBody(content.getBytes(MixAll.DEFAULT_CHARSET));
            } catch (UnsupportedEncodingException e) {
                log.error("", e);

                response.setCode(ResponseCode.SYSTEM_ERROR);
                response.setRemark("UnsupportedEncodingException " + e);
                return response;
            }
        } else {
            log.error("No topic in this broker, client: {}", ctx.channel().remoteAddress());
            response.setCode(ResponseCode.SYSTEM_ERROR);
            response.setRemark("No topic in this broker");
            return response;
        }

        response.setCode(ResponseCode.SUCCESS);
        response.setRemark(null);

        return response;
    }

After receiving the master's response, the slave first checks whether its local dataVersion matches the one returned by the master; if they differ, it:
(1) updates the slave's dataVersion
(2) clears the slave's topicConfigTable and copies in the data returned by the master
(3) persists the topic configuration
The figure below summarizes the whole flow:
(Figure: topic configuration synchronization between slave and master)
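The slave's update steps can be condensed into a small sketch. The class and method names below are illustrative, not RocketMQ's; it only mirrors the version check, clear-and-copy, and (elided) persist step from syncTopicConfig:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Minimal sketch (hypothetical names, not RocketMQ classes) of the
// version-gated update the slave performs: replace the local table
// only when the master's dataVersion differs from the local one.
public class ConfigSyncSketch {
    long localVersion = 0;
    final Map<String, String> localTable = new ConcurrentHashMap<>();

    // Returns true when the local table was replaced.
    boolean applyFromMaster(long masterVersion, Map<String, String> masterTable) {
        if (localVersion == masterVersion) {
            return false; // nothing changed on the master, skip the copy and persist
        }
        localVersion = masterVersion;   // assignNewOne(...)
        localTable.clear();             // topicConfigTable.clear()
        localTable.putAll(masterTable); // putAll(masterTable)
        // persist() would write the table to disk here
        return true;
    }
}
```

Because the whole table is cleared and replaced rather than merged, a stale slave converges to exactly the master's view as soon as the versions differ.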
The rest of RocketMQ's metadata is synchronized the same way, only with a different request code for each. While reading the source I noticed one oddity: topic configuration is synchronized over the VIP channel (the master's port 10909), while the other three kinds of metadata are synchronized over port 10911. Why do the other three use 10911 instead of 10909? I opened an issue on GitHub to discuss this; in my view all metadata synchronization should use port 10909, so I also submitted a PR on GitHub to fix it.

II. Commitlog Replication

How is the commitlog replication service started? During startup the broker starts the DefaultMessageStore; while starting it, the broker checks whether DLedger is enabled, and if not, it starts the HAService:

public void start() throws Exception {

        lock = lockFile.getChannel().tryLock(0, 1, false);
        if (lock == null || lock.isShared() || !lock.isValid()) {
            throw new RuntimeException("Lock failed,MQ already started");
        }

        lockFile.getChannel().write(ByteBuffer.wrap("lock".getBytes()));
        lockFile.getChannel().force(true);
        {
            /**
             * 1. Make sure the fast-forward messages to be truncated during the recovering according to the max physical offset of the commitlog;
             * 2. DLedger committedPos may be missing, so the maxPhysicalPosInLogicQueue maybe bigger that maxOffset returned by DLedgerCommitLog, just let it go;
             * 3. Calculate the reput offset according to the consume queue;
             * 4. Make sure the fall-behind messages to be dispatched before starting the commitlog, especially when the broker role are automatically changed.
             */
            long maxPhysicalPosInLogicQueue = commitLog.getMinOffset();
            for (ConcurrentMap<Integer, ConsumeQueue> maps : this.consumeQueueTable.values()) {
                for (ConsumeQueue logic : maps.values()) {
                    if (logic.getMaxPhysicOffset() > maxPhysicalPosInLogicQueue) {
                        maxPhysicalPosInLogicQueue = logic.getMaxPhysicOffset();
                    }
                }
            }
            if (maxPhysicalPosInLogicQueue < 0) {
                maxPhysicalPosInLogicQueue = 0;
            }
            ...

        if (!messageStoreConfig.isEnableDLegerCommitLog()) {
            this.haService.start();
            this.handleScheduleMessageService(messageStoreConfig.getBrokerRole());
        }

        this.flushConsumeQueueService.start();
        this.commitLog.start();
        this.storeStatsService.start();

        this.createTempFile();
        this.addScheduleTask();
        this.shutdown = false;
    }

The overall commitlog replication process between master and slave is:
1. The master starts and listens for slave connections.
2. The slave starts and establishes a connection to the master.
3. The slave sends the master the physical offset from which it wants to pull data.
4. The master packages data starting from that offset and sends it to the slave.
5. The slave appends the data sent by the master and wakes up the reputMessageService to build the consumequeue and indexFile.
The master-slave interaction and the commitlog synchronization process are analyzed in detail below.

1. Starting the master and listening for slave connections

The master's startup is shown below. The parts relevant to the master are acceptSocketService and groupTransferService. groupTransferService is involved in synchronous commitlog replication and is covered in detail later; acceptSocketService is responsible for listening for slave connections on the master.

public void start() throws Exception {
        this.acceptSocketService.beginAccept();
        this.acceptSocketService.start();
        //groupTransferService is involved in synchronous commitlog replication
        this.groupTransferService.start();
        this.haClient.start();
    }

(1) acceptSocketService.beginAccept()
The flow of this method is as follows:
(Figure: beginAccept flow)

public void beginAccept() throws Exception {
            this.serverSocketChannel = ServerSocketChannel.open();
            this.selector = RemotingUtil.openSelector();
            this.serverSocketChannel.socket().setReuseAddress(true);
            this.serverSocketChannel.socket().bind(this.socketAddressListen);
            this.serverSocketChannel.configureBlocking(false);
            this.serverSocketChannel.register(this.selector, SelectionKey.OP_ACCEPT);
        }

(2) acceptSocketService.start()
The flow of this method is as follows:
(Figure: acceptSocketService run loop)

public void run() {
            log.info(this.getServiceName() + " service started");

            while (!this.isStopped()) {
                try {
                    this.selector.select(1000);
                    Set<SelectionKey> selected = this.selector.selectedKeys();

                    if (selected != null) {
                        for (SelectionKey k : selected) {
                            if ((k.readyOps() & SelectionKey.OP_ACCEPT) != 0) {
                                SocketChannel sc = ((ServerSocketChannel) k.channel()).accept();

                                if (sc != null) {
                                    HAService.log.info("HAService receive new connection, "
                                        + sc.socket().getRemoteSocketAddress());

                                    try {
                                        HAConnection conn = new HAConnection(HAService.this, sc);
                                        //start HAConnection's readSocketService and writeSocketService
                                        conn.start();
                                        //the connection is added to a list because one master may have multiple slaves
                                        HAService.this.addConnection(conn);
                                    } catch (Exception e) {
                                        log.error("new HAConnection exception", e);
                                        sc.close();
                                    }
                                }
                            } else {
                                log.warn("Unexpected ops in select " + k.readyOps());
                            }
                        }

                        selected.clear();
                    }
                } catch (Exception e) {
                    log.error(this.getServiceName() + " service has exception.", e);
                }
            }

            log.info(this.getServiceName() + " service end");
        }

An important step above is starting an HAConnection, which represents the network connection between a master and a slave. HAConnection holds two important objects: readSocketService, the master's read handler, and writeSocketService, the master's write handler:

public void start() {
        this.readSocketService.start();
        this.writeSocketService.start();
    }

2. Starting the slave, connecting to the master, and reporting the pull offset

The slave's startup mainly starts the HAClient:

public void run() {
            log.info(this.getServiceName() + " service started");

            while (!this.isStopped()) {
                try {
                    //connect to the master
                    if (this.connectMaster()) {
                        //check whether it is time to report the slave's current max commitlog physical offset to the master
                        if (this.isTimeToReportOffset()) {
                            //report the slave's current max commitlog physical offset to the master
                            boolean result = this.reportSlaveMaxOffset(this.currentReportedOffset);
                            if (!result) {
                                this.closeMaster();
                            }
                        }

                        this.selector.select(1000);
                        //the slave processes the data returned by the master
                        boolean ok = this.processReadEvent();
                        if (!ok) {
                            this.closeMaster();
                        }
                        //if the slave's max commitlog physical offset has grown, update currentReportedOffset and report it via reportSlaveMaxOffset
                        if (!reportSlaveMaxOffsetPlus()) {
                            continue;
                        }

                        long interval =
                            HAService.this.getDefaultMessageStore().getSystemClock().now()
                                - this.lastWriteTimestamp;
                        if (interval > HAService.this.getDefaultMessageStore().getMessageStoreConfig()
                            .getHaHousekeepingInterval()) {
                            log.warn("HAClient, housekeeping, found this connection[" + this.masterAddress
                                + "] expired, " + interval);
                            this.closeMaster();
                            log.warn("HAClient, master not response some time, so close connection");
                        }
                    } else {
                        this.waitForRunning(1000 * 5);
                    }
                } catch (Exception e) {
                    log.warn(this.getServiceName() + " service has exception. ", e);
                    this.waitForRunning(1000 * 5);
                }
            }

            log.info(this.getServiceName() + " service end");
        }

A few important points deserve a closer look:
(1) The slave connects to the master
The connectMaster method establishes the connection to the master. It mainly does the following:

  • resolves the master's address and connects to it; note that the channel is first configured as blocking and switched back to non-blocking once connect completes
public static SocketChannel connect(SocketAddress remote, final int timeoutMillis) {
        SocketChannel sc = null;
        try {
            sc = SocketChannel.open();
            sc.configureBlocking(true);
            sc.socket().setSoLinger(false, -1);
            sc.socket().setTcpNoDelay(true);
            sc.socket().setReceiveBufferSize(1024 * 64);
            sc.socket().setSendBufferSize(1024 * 64);
            sc.socket().connect(remote, timeoutMillis);
            sc.configureBlocking(false);
            return sc;
        } catch (Exception e) {
            if (sc != null) {
                try {
                    sc.close();
                } catch (IOException e1) {
                    e1.printStackTrace();
                }
            }
        }

        return null;
    }
  • registers for OP_READ events
  • reads the slave's max commitlog physical offset and caches it in currentReportedOffset
  • updates lastWriteTimestamp, which is used to measure the synchronization interval between master and slave
private boolean connectMaster() throws ClosedChannelException {
            if (null == socketChannel) {
                String addr = this.masterAddress.get();
                if (addr != null) {

                    SocketAddress socketAddress = RemotingUtil.string2SocketAddress(addr);
                    if (socketAddress != null) {
                        this.socketChannel = RemotingUtil.connect(socketAddress);
                        if (this.socketChannel != null) {
                            this.socketChannel.register(this.selector, SelectionKey.OP_READ);
                        }
                    }
                }

                this.currentReportedOffset = HAService.this.defaultMessageStore.getMaxPhyOffset();

                this.lastWriteTimestamp = System.currentTimeMillis();
            }

            return this.socketChannel != null;
        }

(2) isTimeToReportOffset()
isTimeToReportOffset() decides whether to report the slave's max commitlog physical offset to the master. It computes the interval between now and lastWriteTimestamp and returns true when that interval exceeds haSendHeartbeatInterval (5 seconds by default, configurable in the broker configuration file).

private boolean isTimeToReportOffset() {
            long interval =
                HAService.this.defaultMessageStore.getSystemClock().now() - this.lastWriteTimestamp;
            boolean needHeart = interval > HAService.this.defaultMessageStore.getMessageStoreConfig()
                .getHaSendHeartbeatInterval();

            return needHeart;
        }

(3) reportSlaveMaxOffset(this.currentReportedOffset)
reportSlaveMaxOffset reports the slave's max commitlog physical offset, i.e. currentReportedOffset, to the master. The offset is stored in reportOffset (an 8-byte buffer), which is then written to the channel. Note that the write is wrapped in a loop, presumably because the channel is non-blocking: a single write may not drain the buffer, so up to three attempts are made to finish sending reportOffset. After the writes, lastWriteTimestamp is updated, and the method returns true if reportOffset has no bytes remaining.

private boolean reportSlaveMaxOffset(final long maxOffset) {
            this.reportOffset.position(0);
            this.reportOffset.limit(8);
            this.reportOffset.putLong(maxOffset);
            this.reportOffset.position(0);
            this.reportOffset.limit(8);

            for (int i = 0; i < 3 && this.reportOffset.hasRemaining(); i++) {
                try {
                    this.socketChannel.write(this.reportOffset);
                } catch (IOException e) {
                    log.error(this.getServiceName()
                        + "reportSlaveMaxOffset this.socketChannel.write exception", e);
                    return false;
                }
            }

            lastWriteTimestamp = HAService.this.defaultMessageStore.getSystemClock().now();
            return !this.reportOffset.hasRemaining();
        }
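The position/limit handling in reportSlaveMaxOffset is easy to get wrong, so here is an isolated sketch (the class name is made up) of why the buffer is rewound before the channel write: after putLong the position sits at 8, and a write would see zero remaining bytes.

```java
import java.nio.ByteBuffer;

// Sketch of the buffer preparation in reportSlaveMaxOffset: write the
// 8-byte offset, then rewind position to 0 so a channel write sees
// exactly 8 remaining bytes to send.
public class ReportBufferSketch {
    static ByteBuffer buildReport(long maxOffset) {
        ByteBuffer reportOffset = ByteBuffer.allocate(8);
        reportOffset.position(0);
        reportOffset.limit(8);
        reportOffset.putLong(maxOffset); // position is now 8: nothing "remaining"
        reportOffset.position(0);        // rewind so the write starts from byte 0
        reportOffset.limit(8);
        return reportOffset;
    }
}
```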

(4) reportSlaveMaxOffsetPlus()
reportSlaveMaxOffsetPlus checks whether the slave's max commitlog physical offset has grown; if it has, it updates currentReportedOffset and calls reportSlaveMaxOffset to report it to the master.

private boolean reportSlaveMaxOffsetPlus() {
            boolean result = true;
            long currentPhyOffset = HAService.this.defaultMessageStore.getMaxPhyOffset();
            if (currentPhyOffset > this.currentReportedOffset) {
                this.currentReportedOffset = currentPhyOffset;
                result = this.reportSlaveMaxOffset(this.currentReportedOffset);
                if (!result) {
                    this.closeMaster();
                    log.error("HAClient, reportSlaveMaxOffset error, " + this.currentReportedOffset);
                }
            }

            return result;
        }

3. The master packages data from the requested offset and sends it to the slave

3.1 The master reads data from the slave

As shown above, the master reads the data sent by the slave in the processReadEvent method of HAConnection's readSocketService:
(1) First check whether byteBufferRead, the buffer that stores the data the master reads, still has space; if not, flip it and reset processPosition, the position up to which byteBufferRead has been processed, to 0.
(2) Read data from the channel into byteBufferRead.
(3) Check whether the difference between byteBufferRead's position and processPosition is at least 8. The check uses 8 because the max commitlog physical offset the slave reports occupies 8 bytes, so a difference of at least 8 means there is a complete report to process. When the condition holds:

  • compute the end of the most recent complete report in byteBufferRead. In the figure below, each cell represents 8 bytes and position lies somewhere in (24, 32); the end of the nearest complete report is this.byteBufferRead.position() - (this.byteBufferRead.position() % 8), and subtracting 8 from that end gives its start

(Figure: locating the most recent complete 8-byte report in byteBufferRead)

  • read the data in [pos - 8, pos) into readOffset
  • move processPosition to pos

(4) Store the reported offset in slaveAckOffset.
(5) If slaveRequestOffset is less than 0, set it to the offset reported this time (slaveAckOffset records the physical offset up to which the slave has already replicated; slaveRequestOffset records the physical offset from which the slave requests data).

private boolean processReadEvent() {
            int readSizeZeroTimes = 0;

            if (!this.byteBufferRead.hasRemaining()) {
                this.byteBufferRead.flip();
                this.processPosition = 0;
            }

            while (this.byteBufferRead.hasRemaining()) {
                try {
                    int readSize = this.socketChannel.read(this.byteBufferRead);
                    if (readSize > 0) {
                        readSizeZeroTimes = 0;
                        this.lastReadTimestamp = HAConnection.this.haService.getDefaultMessageStore().getSystemClock().now();
                        if ((this.byteBufferRead.position() - this.processPosition) >= 8) {
                            int pos = this.byteBufferRead.position() - (this.byteBufferRead.position() % 8);
                            long readOffset = this.byteBufferRead.getLong(pos - 8);
                            this.processPosition = pos;

                            HAConnection.this.slaveAckOffset = readOffset;
                            if (HAConnection.this.slaveRequestOffset < 0) {
                                HAConnection.this.slaveRequestOffset = readOffset;
                                log.info("slave[" + HAConnection.this.clientAddr + "] request offset " + readOffset);
                            }

                            HAConnection.this.haService.notifyTransferSome(HAConnection.this.slaveAckOffset);
                        }
                    } else if (readSize == 0) {
                        if (++readSizeZeroTimes >= 3) {
                            break;
                        }
                    } else {
                        log.error("read socket[" + HAConnection.this.clientAddr + "] < 0");
                        return false;
                    }
                } catch (IOException e) {
                    log.error("processReadEvent exception", e);
                    return false;
                }
            }

            return true;
        }
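The alignment arithmetic in step (3) can be checked in isolation. The sketch below (illustrative names, not RocketMQ code) recovers only the newest complete 8-byte report from a buffer that may also hold a partially arrived one:

```java
import java.nio.ByteBuffer;

// Illustrative sketch of the arithmetic in processReadEvent: slave
// reports are 8 bytes each, so the end of the newest complete report
// is position - (position % 8), and the report is the long at pos - 8.
public class OffsetAlignSketch {
    static int latestCompletePos(int position) {
        return position - (position % 8); // e.g. 29 -> 24, 32 -> 32
    }

    // Mirrors byteBufferRead.getLong(pos - 8) on the master side.
    // Assumes at least one complete report is present (position >= 8).
    static long newestReport(ByteBuffer buf) {
        int pos = latestCompletePos(buf.position());
        return buf.getLong(pos - 8); // absolute read: does not move position
    }
}
```

With two complete reports (16 bytes) plus 3 stray bytes in the buffer, position is 19, the aligned end is 16, and only the second report is read; the partial third report waits for more data.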

3.2 The master writes data to the slave

Next, what the master does after reading the requested offset. This is where writeSocketService comes in: it writes data from the master to the slave. Its logic is all in the run method:
(1) First check whether slaveRequestOffset equals -1; if so, the master has not yet received the slave's pull request. Note that slaveRequestOffset is set when the master receives the pull offset reported by the slave, i.e. in readSocketService's processReadEvent method.
(2) Check whether nextTransferFromWhere equals -1. nextTransferFromWhere is the physical offset from which the master will next synchronize data to the slave; -1 means this is the first synchronization. If slaveRequestOffset equals 0, transfer starts from the beginning of the last file of the current commitlog; otherwise nextTransferFromWhere is set to slaveRequestOffset, so transfer starts from the position the slave requested.
(3) lastWriteOver indicates whether the previous transfer completed. If it did, and the interval since the last write exceeds haSendHeartbeatInterval (5 seconds by default, configurable), the master sends the slave a 12-byte packet whose first 8 bytes store nextTransferFromWhere and whose last 4 bytes store 0. If the previous transfer did not complete, the remaining data is transferred first; if it still cannot complete, this round of event handling ends and the transfer resumes in a later round.
(4) If the previous transfer completed, the master fetches all data after nextTransferFromWhere. If there is no data after that offset, it waits 100 ms. If there is data, the length is first capped at haTransferBatchSize (32 KB by default), which is why the slave may well receive an incomplete message from the master. Then thisOffset records the starting offset of this transfer, nextTransferFromWhere is advanced, selectResult's ByteBuffer is limited to the transfer size, and selectResult is assigned to selectMappedBufferResult. Before the transfer, the start offset thisOffset and the data size are written into byteBufferHeader.
(5) transferData() performs the transfer: byteBufferHeader first, then selectMappedBufferResult, which is released once the transfer completes.

To summarize: the packets the master sends the slave come in two kinds:

  • Packets without messages
    These packets are 12 bytes: the first 8 bytes store the offset from which the master synchronizes data to the slave, and the last 4 bytes store the message length, which is 0 here.
    (Figure: heartbeat packet layout)
  • Packets with messages
    These packets have three parts: the first 8 bytes store the starting offset of this transfer, the next 4 bytes store the length of the transferred messages, and the last part, size bytes long, holds the messages themselves.
    (Figure: data packet layout)
public void run() {
            HAConnection.log.info(this.getServiceName() + " service started");

            while (!this.isStopped()) {
                try {
                    this.selector.select(1000);

                    if (-1 == HAConnection.this.slaveRequestOffset) {
                        Thread.sleep(10);
                        continue;
                    }

                    if (-1 == this.nextTransferFromWhere) {
                        if (0 == HAConnection.this.slaveRequestOffset) {
                            long masterOffset = HAConnection.this.haService.getDefaultMessageStore().getCommitLog().getMaxOffset();
                            masterOffset =
                                masterOffset
                                    - (masterOffset % HAConnection.this.haService.getDefaultMessageStore().getMessageStoreConfig()
                                    .getMappedFileSizeCommitLog());

                            if (masterOffset < 0) {
                                masterOffset = 0;
                            }

                            this.nextTransferFromWhere = masterOffset;
                        } else {
                            this.nextTransferFromWhere = HAConnection.this.slaveRequestOffset;
                        }

                        log.info("master transfer data from " + this.nextTransferFromWhere + " to slave[" + HAConnection.this.clientAddr
                            + "], and slave request " + HAConnection.this.slaveRequestOffset);
                    }

                    if (this.lastWriteOver) {

                        long interval =
                            HAConnection.this.haService.getDefaultMessageStore().getSystemClock().now() - this.lastWriteTimestamp;

                        if (interval > HAConnection.this.haService.getDefaultMessageStore().getMessageStoreConfig()
                            .getHaSendHeartbeatInterval()) {

                            // Build Header
                            this.byteBufferHeader.position(0);
                            this.byteBufferHeader.limit(headerSize);
                            this.byteBufferHeader.putLong(this.nextTransferFromWhere);
                            this.byteBufferHeader.putInt(0);
                            this.byteBufferHeader.flip();

                            this.lastWriteOver = this.transferData();
                            if (!this.lastWriteOver)
                                continue;
                        }
                    } else {
                        this.lastWriteOver = this.transferData();
                        if (!this.lastWriteOver)
                            continue;
                    }

                    SelectMappedBufferResult selectResult =
                        HAConnection.this.haService.getDefaultMessageStore().getCommitLogData(this.nextTransferFromWhere);
                    if (selectResult != null) {
                        int size = selectResult.getSize();
                        if (size > HAConnection.this.haService.getDefaultMessageStore().getMessageStoreConfig().getHaTransferBatchSize()) {
                            size = HAConnection.this.haService.getDefaultMessageStore().getMessageStoreConfig().getHaTransferBatchSize();
                        }

                        long thisOffset = this.nextTransferFromWhere;
                        this.nextTransferFromWhere += size;

                        selectResult.getByteBuffer().limit(size);
                        this.selectMappedBufferResult = selectResult;

                        // Build Header
                        this.byteBufferHeader.position(0);
                        this.byteBufferHeader.limit(headerSize);
                        this.byteBufferHeader.putLong(thisOffset);
                        this.byteBufferHeader.putInt(size);
                        this.byteBufferHeader.flip();

                        this.lastWriteOver = this.transferData();
                    } else {

                        HAConnection.this.haService.getWaitNotifyObject().allWaitForRunning(100);
                    }
                } catch (Exception e) {

                    HAConnection.log.error(this.getServiceName() + " service has exception.", e);
                    break;
                }
            }

            HAConnection.this.haService.getWaitNotifyObject().removeFromWaitingThreadTable();

            if (this.selectMappedBufferResult != null) {
                this.selectMappedBufferResult.release();
            }

            this.makeStop();

            readSocketService.makeStop();

            haService.removeConnection(HAConnection.this);

            SelectionKey sk = this.socketChannel.keyFor(this.selector);
            if (sk != null) {
                sk.cancel();
            }

            try {
                this.selector.close();
                this.socketChannel.close();
            } catch (IOException e) {
                HAConnection.log.error("", e);
            }

            HAConnection.log.info(this.getServiceName() + " service end");
        }
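The two packet kinds described above share the same 12-byte header. A minimal sketch (illustrative class name, not RocketMQ code) of building such a header; a heartbeat is simply a header whose body size is 0:

```java
import java.nio.ByteBuffer;

// Sketch of the header writeSocketService builds before each transfer:
// an 8-byte start offset followed by a 4-byte body size.
public class HaHeaderSketch {
    static final int HEADER_SIZE = 8 + 4; // offset + size

    static ByteBuffer buildHeader(long offset, int bodySize) {
        ByteBuffer header = ByteBuffer.allocate(HEADER_SIZE);
        header.putLong(offset);  // first 8 bytes: start offset of this transfer
        header.putInt(bodySize); // next 4 bytes: body length (0 = heartbeat)
        header.flip();           // ready the buffer for a channel write
        return header;
    }
}
```

A slave-side reader does the inverse: getLong for the start offset, then getInt for the body size, which is exactly what dispatchReadRequest does in the next section.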

4. The slave reads the packets sent by the master

The slave handles read events in HAClient's processReadEvent method, with byteBufferRead as the slave's read buffer. The logic: as long as byteBufferRead still has space, read data from the channel into it, then call dispatchReadRequest to process the data that was read.

private boolean processReadEvent() {
            int readSizeZeroTimes = 0;
            while (this.byteBufferRead.hasRemaining()) {
                try {
                    int readSize = this.socketChannel.read(this.byteBufferRead);
                    if (readSize > 0) {
                        readSizeZeroTimes = 0;
                        boolean result = this.dispatchReadRequest();
                        if (!result) {
                            log.error("HAClient, dispatchReadRequest error");
                            return false;
                        }
                    } else if (readSize == 0) {
                        if (++readSizeZeroTimes >= 3) {
                            break;
                        }
                    } else {
                        log.info("HAClient, processReadEvent read socket < 0");
                        return false;
                    }
                } catch (IOException e) {
                    log.info("HAClient, processReadEvent read socket exception", e);
                    return false;
                }
            }

            return true;
        }

The dispatchReadRequest method deserves a closer look:
(1) Record byteBufferRead's current position in readSocketPos.
(2) Compute diff, the difference between byteBufferRead's current position and dispatchPosition (the pointer to the position in byteBufferRead up to which data has been processed).
(3) Check whether diff is at least 12 bytes. The check uses 12 because the first 12 bytes of every packet the master sends contain the starting physical offset of the transferred data and its length. If the condition holds, read those two fields from byteBufferRead into masterPhyOffset and bodySize.
(4) Read the local node's max commitlog physical offset into slavePhyOffset and check that it equals masterPhyOffset; normally the two are equal, and an error is logged if they are not.
(5) Check whether diff is at least msgHeaderSize + bodySize; if so, a complete packet is available, and the following is done:

  • set byteBufferRead's position to dispatchPosition + msgHeaderSize, the start of the packet body
  • read the messages in the packet into bodyData
  • call appendToCommitLog to append the data
  • reset byteBufferRead's position back to readSocketPos
  • advance dispatchPosition by msgHeaderSize + bodySize
  • call reportSlaveMaxOffsetPlus to check whether the slave's commitlog has grown; if it has, report currentReportedOffset to the master
private boolean dispatchReadRequest() {
            final int msgHeaderSize = 8 + 4; // phyoffset + size
            int readSocketPos = this.byteBufferRead.position();

            while (true) {
                int diff = this.byteBufferRead.position() - this.dispatchPosition;
                if (diff >= msgHeaderSize) {
                    long masterPhyOffset = this.byteBufferRead.getLong(this.dispatchPosition);
                    int bodySize = this.byteBufferRead.getInt(this.dispatchPosition + 8);

                    long slavePhyOffset = HAService.this.defaultMessageStore.getMaxPhyOffset();

                    if (slavePhyOffset != 0) {
                        if (slavePhyOffset != masterPhyOffset) {
                            log.error("master pushed offset not equal the max phy offset in slave, SLAVE: "
                                + slavePhyOffset + " MASTER: " + masterPhyOffset);
                            return false;
                        }
                    }

                    if (diff >= (msgHeaderSize + bodySize)) {
                        byte[] bodyData = new byte[bodySize];
                        this.byteBufferRead.position(this.dispatchPosition + msgHeaderSize);
                        this.byteBufferRead.get(bodyData);

                        HAService.this.defaultMessageStore.appendToCommitLog(masterPhyOffset, bodyData);

                        this.byteBufferRead.position(readSocketPos);
                        this.dispatchPosition += msgHeaderSize + bodySize;

                        if (!reportSlaveMaxOffsetPlus()) {
                            return false;
                        }

                        continue;
                    }
                }

                if (!this.byteBufferRead.hasRemaining()) {
                    this.reallocateByteBuffer();
                }

                break;
            }

            return true;
        }

Next, look closely at the appendToCommitLog method, which performs two important operations:

  • appendData(startOffset, data): appends the data to the commitlog
  • this.reputMessageService.wakeup(): wakes up the reputMessageService, whose job is to build the consumequeue and indexFile. This is why the consumequeue and indexFile do not need to be replicated.
public boolean appendToCommitLog(long startOffset, byte[] data) {
        if (this.shutdown) {
            log.warn("message store has shutdown, so appendToPhyQueue is forbidden");
            return false;
        }

        boolean result = this.commitLog.appendData(startOffset, data);
        if (result) {
            this.reputMessageService.wakeup();
        } else {
            log.error("appendToPhyQueue failed " + startOffset + " " + data.length);
        }

        return result;
    }

5. Summary

The interaction described above can be summarized in one diagram:
(figure: commitlog replication interaction between master and slave)

三、Synchronous Replication vs. Asynchronous Replication

The difference between synchronous and asynchronous commitlog replication: with synchronous replication, the producer waits until the master has stored the data, flushed it to disk, and replicated it to the slave; with asynchronous replication, the master only needs to store and flush the data. The entry point is therefore the master's put-message path; only part of the code is shown here:

CompletableFuture<PutMessageStatus> flushResultFuture = submitFlushRequest(result, msg);
        CompletableFuture<PutMessageStatus> replicaResultFuture = submitReplicaRequest(result, msg);
        return flushResultFuture.thenCombine(replicaResultFuture, (flushStatus, replicaStatus) -> {
            if (flushStatus != PutMessageStatus.PUT_OK) {
                putMessageResult.setPutMessageStatus(flushStatus);
            }
            if (replicaStatus != PutMessageStatus.PUT_OK) {
                putMessageResult.setPutMessageStatus(replicaStatus);
                if (replicaStatus == PutMessageStatus.FLUSH_SLAVE_TIMEOUT) {
                    log.error("do sync transfer other node, wait return, but failed, topic: {} tags: {} client address: {}",
                            msg.getTopic(), msg.getTags(), msg.getBornHostNameString());
                }
            }
            return putMessageResult;
        });
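The effect of the thenCombine call above: the final put status stays PUT_OK only if both the flush stage and the replica stage report PUT_OK, and a non-OK replica status (e.g. FLUSH_SLAVE_TIMEOUT) overrides it. A minimal sketch of this combination logic, with a simplified stand-in for the PutMessageStatus enum:

```java
import java.util.concurrent.CompletableFuture;

public class ReplicaCombineDemo {
    enum Status { PUT_OK, FLUSH_DISK_TIMEOUT, FLUSH_SLAVE_TIMEOUT }

    // Combine flush and replica results the way CommitLog does:
    // a non-OK status from either stage overrides PUT_OK.
    static Status combine(Status flushStatus, Status replicaStatus) {
        Status result = Status.PUT_OK;
        if (flushStatus != Status.PUT_OK) {
            result = flushStatus;
        }
        if (replicaStatus != Status.PUT_OK) {
            result = replicaStatus;
        }
        return result;
    }

    static CompletableFuture<Status> put(CompletableFuture<Status> flushFuture,
                                         CompletableFuture<Status> replicaFuture) {
        // The combined future completes only when BOTH stages complete
        return flushFuture.thenCombine(replicaFuture, ReplicaCombineDemo::combine);
    }

    public static void main(String[] args) {
        Status s = put(CompletableFuture.completedFuture(Status.PUT_OK),
                       CompletableFuture.completedFuture(Status.FLUSH_SLAVE_TIMEOUT)).join();
        System.out.println(s); // FLUSH_SLAVE_TIMEOUT
    }
}
```

Because thenCombine waits for both futures, a synchronous put returns to the producer only after both disk flush and slave replication have settled.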

As shown above, the master handles the replication step in the submitReplicaRequest method:

public CompletableFuture<PutMessageStatus> submitReplicaRequest(AppendMessageResult result, MessageExt messageExt) {
        if (BrokerRole.SYNC_MASTER == this.defaultMessageStore.getMessageStoreConfig().getBrokerRole()) {
            HAService service = this.defaultMessageStore.getHaService();
            if (messageExt.isWaitStoreMsgOK()) {
                if (service.isSlaveOK(result.getWroteBytes() + result.getWroteOffset())) {
                    GroupCommitRequest request = new GroupCommitRequest(result.getWroteOffset() + result.getWroteBytes(),
                            this.defaultMessageStore.getMessageStoreConfig().getSyncFlushTimeout());
                    service.putRequest(request);
                    service.getWaitNotifyObject().wakeupAll();
                    return request.future();
                }
                else {
                    return CompletableFuture.completedFuture(PutMessageStatus.SLAVE_NOT_AVAILABLE);
                }
            }
        }
        return CompletableFuture.completedFuture(PutMessageStatus.PUT_OK);
    }

In submitReplicaRequest, the broker role determines whether replication is synchronous or asynchronous: if the role is ASYNC_MASTER (asynchronous replication), the method returns PUT_OK immediately; if the role is SYNC_MASTER, synchronous replication proceeds. Let's look at the synchronous path in detail:
(1) Check that at least one slave is connected to the master and that the gap between the slave's and the master's commitlog offsets is smaller than haSlaveFallbehindMax (256 MB by default):

public boolean isSlaveOK(final long masterPutWhere) {
        boolean result = this.connectionCount.get() > 0;
        result =
            result
                && ((masterPutWhere - this.push2SlaveMaxOffset.get()) < this.defaultMessageStore
                .getMessageStoreConfig().getHaSlaveFallbehindMax());
        return result;
    }
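The availability check above boils down to two conditions: at least one slave connection exists, and the replication lag (masterPutWhere - push2SlaveMaxOffset) is below haSlaveFallbehindMax. A self-contained sketch of the same logic (hypothetical demo class, with the default 256 MB limit hard-coded):

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicLong;

public class SlaveOkDemo {
    static final long HA_SLAVE_FALLBEHIND_MAX = 256L * 1024 * 1024; // default 256 MB

    final AtomicInteger connectionCount = new AtomicInteger(0);   // connected slaves
    final AtomicLong push2SlaveMaxOffset = new AtomicLong(0);     // max offset acked by any slave

    // Mirrors HAService#isSlaveOK: a connected slave that is not too far behind
    boolean isSlaveOK(long masterPutWhere) {
        return connectionCount.get() > 0
            && (masterPutWhere - push2SlaveMaxOffset.get()) < HA_SLAVE_FALLBEHIND_MAX;
    }
}
```

If either condition fails, synchronous replication is refused up front with SLAVE_NOT_AVAILABLE rather than letting the producer wait on a slave that cannot catch up in time.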

(2) If the condition in (1) is not met, SLAVE_NOT_AVAILABLE is returned; otherwise a GroupCommitRequest is built and added to the requestsWrite queue of GroupTransferService. This connects back to the master's NIO startup covered earlier: GroupTransferService is started during that startup, and its job is to determine whether synchronous commitlog replication has completed. GroupTransferService holds two queues, requestsWrite and requestsRead; this mirrors the design of the flush threads and exists to avoid lock contention. It also holds a notifyTransferObject used for synchronous waiting. Since GroupTransferService extends ServiceThread, its core logic lives in the run method, which calls two key functions: waitForRunning(10) and doWaitTransfer().

public void run() {
            log.info(this.getServiceName() + " service started");

            while (!this.isStopped()) {
                try {
                    this.waitForRunning(10);
                    this.doWaitTransfer();
                } catch (Exception e) {
                    log.warn(this.getServiceName() + " service has exception. ", e);
                }
            }

            log.info(this.getServiceName() + " service end");
        }
  • While waitForRunning executes, GroupTransferService's onWaitEnd method is invoked; its core job is to swap requestsWrite and requestsRead
protected void onWaitEnd() {
            this.swapRequests();
        }
private void swapRequests() {
            List<CommitLog.GroupCommitRequest> tmp = this.requestsWrite;
            this.requestsWrite = this.requestsRead;
            this.requestsRead = tmp;
        }
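The requestsWrite/requestsRead double buffer lets producers (putRequest) and the consumer (doWaitTransfer) each work on a different list, so neither blocks the other for long. A minimal, self-contained sketch of this pattern (demo class, not RocketMQ code; synchronization is simplified to a single monitor):

```java
import java.util.ArrayList;
import java.util.List;

public class SwapQueueDemo<T> {
    private List<T> requestsWrite = new ArrayList<>();
    private List<T> requestsRead = new ArrayList<>();

    // Producers append to the write list under a short lock
    public synchronized void putRequest(T req) {
        requestsWrite.add(req);
    }

    // The consumer swaps the two lists, then can drain requestsRead
    // without contending with producers
    public synchronized void swapRequests() {
        List<T> tmp = requestsWrite;
        requestsWrite = requestsRead;
        requestsRead = tmp;
    }

    // Drain everything the last swap made visible
    public List<T> drainRead() {
        List<T> drained = new ArrayList<>(requestsRead);
        requestsRead.clear();
        return drained;
    }
}
```

After a swap, new putRequest calls land in the other (already drained) buffer, which is why doWaitTransfer can iterate requestsRead while producers keep submitting.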
  • doWaitTransfer iterates over the requestsRead queue and makes two checks for each request. First, it checks whether push2SlaveMaxOffset (the maximum offset up to which any connected slave has completed replication) is greater than or equal to req.getNextOffset(); if so, replication for this request has completed. Second, it enforces a time bound: replication must not take longer than syncFlushTimeout (5 seconds by default). The while loop exits as soon as replication completes or the deadline passes; while replication is incomplete and the deadline has not passed, GroupTransferService waits in 1-second intervals. If replication has not completed by the deadline, "transfer messsage to slave timeout" is logged. Finally, the request is completed according to the outcome: PutMessageStatus.PUT_OK if replication succeeded, PutMessageStatus.FLUSH_SLAVE_TIMEOUT otherwise.
private void doWaitTransfer() {
            synchronized (this.requestsRead) {
                if (!this.requestsRead.isEmpty()) {
                    for (CommitLog.GroupCommitRequest req : this.requestsRead) {
                        boolean transferOK = HAService.this.push2SlaveMaxOffset.get() >= req.getNextOffset();
                        long waitUntilWhen = HAService.this.defaultMessageStore.getSystemClock().now()
                            + HAService.this.defaultMessageStore.getMessageStoreConfig().getSyncFlushTimeout();
                        while (!transferOK && HAService.this.defaultMessageStore.getSystemClock().now() < waitUntilWhen) {
                            this.notifyTransferObject.waitForRunning(1000);
                            transferOK = HAService.this.push2SlaveMaxOffset.get() >= req.getNextOffset();
                        }

                        if (!transferOK) {
                            log.warn("transfer messsage to slave timeout, " + req.getNextOffset());
                        }

                        req.wakeupCustomer(transferOK ? PutMessageStatus.PUT_OK : PutMessageStatus.FLUSH_SLAVE_TIMEOUT);
                    }

                    this.requestsRead.clear();
                }
            }
        }

So when is GroupTransferService woken up? When the master handles a slave's read event, after updating slaveAckOffset and slaveRequestOffset it executes HAConnection.this.haService.notifyTransferSome(HAConnection.this.slaveAckOffset); to wake up GroupTransferService. The wake-up happens only when slaveAckOffset is greater than push2SlaveMaxOffset and push2SlaveMaxOffset is successfully advanced.

public void notifyTransferSome(final long offset) {
        for (long value = this.push2SlaveMaxOffset.get(); offset > value; ) {
            boolean ok = this.push2SlaveMaxOffset.compareAndSet(value, offset);
            if (ok) {
                this.groupTransferService.notifyTransferSome();
                break;
            } else {
                value = this.push2SlaveMaxOffset.get();
            }
        }
    }
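notifyTransferSome advances push2SlaveMaxOffset monotonically with a CAS loop: only a strictly larger ack offset wins, stale acks are ignored, and the waiter is notified once per successful advance. The same pattern in isolation (demo class; the notification is replaced by a counter for illustration):

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicLong;

public class CasMaxDemo {
    final AtomicLong push2SlaveMaxOffset = new AtomicLong(0);
    // Stands in for groupTransferService.notifyTransferSome()
    final AtomicInteger notifications = new AtomicInteger(0);

    void notifyTransferSome(long offset) {
        // Loop only while the new ack offset is ahead of the recorded max
        for (long value = push2SlaveMaxOffset.get(); offset > value; ) {
            if (push2SlaveMaxOffset.compareAndSet(value, offset)) {
                notifications.incrementAndGet(); // wake the waiting GroupTransferService
                break;
            }
            value = push2SlaveMaxOffset.get();   // lost the race: re-read and retry
        }
    }
}
```

The CAS loop makes the update safe when multiple HAConnection threads (one per slave) report acks concurrently: the offset can only move forward, never backward.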

One more detail of synchronous replication: in submitReplicaRequest, after the GroupCommitRequest is added to GroupTransferService's requestsWrite queue, service.getWaitNotifyObject().wakeupAll() wakes up writeSocketService so it continues transferring data to the slave.
Finally, a diagram summarizing synchronous commitlog replication:
(figure: commitlog synchronous replication flow)

四、Metadata Replication vs. Commitlog Replication

First, in terms of data volume: metadata is small, while the commitlog is very large. Second, in terms of implementation: metadata replication uses RocketMQ's Netty-based remoting module driven by a scheduled task, while commitlog replication is built directly on Java NIO.
