DolphinScheduler v2.0.1: Analysis of the Master and Worker Execution Flow (Part 5)


Author: 人生有如两个橘子


Original article: https://blog.csdn.net/qq_37706484/article/details/126861011


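How a command is consumed by exactly one master

The mechanism has three steps:

1. Assign each master a slot

Both on initial startup and in the registry listener, the core call a master makes is syncMasterNodes(), reached through updateMasterNodes() shown below.

This method updates the global MASTER_SIZE and the master's own SLOT_LIST. SLOT_LIST stores only this master's own slot value.

After this step, every master knows the total number of masters and its own slot.

The overall flow is: clear slot -> acquire lock -> update master nodes -> release lock.

Pay special attention to SLOT_LIST.clear() and the distributed lock here; we will come back to them later.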

    private void updateMasterNodes() {
        // Clear the slot; at this point every master's slot is 0
        SLOT_LIST.clear();
        this.masterNodes.clear();
        String nodeLock = Constants.REGISTRY_DOLPHINSCHEDULER_LOCK_MASTERS;
        try {
            // Acquire the distributed lock
            registryClient.getLock(nodeLock);
            Collection<String> currentNodes = registryClient.getMasterNodesDirectly();
            List<Server> masterNodes = registryClient.getServerList(NodeType.MASTER);
            syncMasterNodes(currentNodes, masterNodes);
        } catch (Exception e) {
            logger.error("update master nodes error", e);
        } finally {
            // Release the distributed lock
            registryClient.releaseLock(nodeLock);
        }
    }

    private void syncMasterNodes(Collection<String> nodes, List<Server> masterNodes) {
        masterLock.lock();
        try {
            this.masterNodes.addAll(nodes);
            this.masterPriorityQueue.clear();
            this.masterPriorityQueue.putList(masterNodes);
            int index = masterPriorityQueue.getIndex(NetUtils.getHost());
            if (index >= 0) {
                // Update the master count and this master's own slot
                MASTER_SIZE = nodes.size();
                SLOT_LIST.add(masterPriorityQueue.getIndex(NetUtils.getHost()));
            }
            logger.info("update master nodes, master size: {}, slot: {}",
                    MASTER_SIZE, SLOT_LIST.toString()
            );
        } finally {
            masterLock.unlock();
        }
    }
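A shared, ordered view of the registered masters is what gives each master a distinct slot. Below is a minimal sketch of that idea under two assumptions: every master sees the same host list, and sorting the host strings stands in for the ordering that MasterPriorityQueue applies (the real queue orders Server objects, and its exact ordering key is not shown in this article). SlotAssignmentSketch and slotOf are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch: derive this master's slot from a host list shared by all masters.
// Sorting host strings is an assumed stand-in for the MasterPriorityQueue ordering.
final class SlotAssignmentSketch {

    // Returns the local host's index in the ordered list, or -1 if it is not registered,
    // mirroring the "index >= 0" check in syncMasterNodes().
    static int slotOf(List<String> registeredHosts, String localHost) {
        List<String> ordered = new ArrayList<>(registeredHosts);
        Collections.sort(ordered);
        return ordered.indexOf(localHost);
    }

    public static void main(String[] args) {
        List<String> hosts = List.of("10.0.0.3:5678", "10.0.0.1:5678", "10.0.0.2:5678");
        int masterSize = hosts.size(); // plays the role of MASTER_SIZE
        for (String host : hosts) {
            System.out.println(host + " -> slot " + slotOf(hosts, host) + " of " + masterSize);
        }
    }
}
```

Because every master computes the same ordering from the same registry data, slots 0 to MASTER_SIZE - 1 are handed out once each, as long as all masters have finished syncing.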

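2. Consume commands

Consumption condition: a master consumes commands normally as long as MASTER_SIZE is not 0.

Consumption logic: a command belongs to the master whose slot satisfies command ID % MASTER_SIZE == slot. Only one command is taken per scan; newer versions already fetch several at once.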
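The ownership rule is plain modular arithmetic. The sketch below, with made-up command IDs and a hypothetical class name, checks that when every master agrees on MASTER_SIZE, each ID lands on exactly one slot.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the "command ID % MASTER_SIZE == slot" ownership rule.
final class CommandRoutingSketch {

    // True if the master holding `slot` owns the command with this id.
    static boolean owns(int commandId, int masterSize, int slot) {
        return masterSize != 0 && commandId % masterSize == slot;
    }

    // Collects the ids that the master with `slot` would pick up.
    static List<Integer> idsForSlot(List<Integer> commandIds, int masterSize, int slot) {
        List<Integer> owned = new ArrayList<>();
        for (int id : commandIds) {
            if (owns(id, masterSize, slot)) {
                owned.add(id);
            }
        }
        return owned;
    }

    public static void main(String[] args) {
        List<Integer> pending = List.of(1, 2, 3, 4, 5, 6, 7);
        int masterSize = 3;
        for (int slot = 0; slot < masterSize; slot++) {
            System.out.println("slot " + slot + " -> " + idsForSlot(pending, masterSize, slot));
        }
        // slot 0 -> [3, 6], slot 1 -> [1, 4, 7], slot 2 -> [2, 5]
    }
}
```

With consecutive IDs the slots receive commands in round-robin order, and the three lists never overlap.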

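In theory every master has its own slot, so no command should be picked up by more than one master. But if a command is picked up by several masters anyway, it must still not be consumed twice; that is what step 3 is for.

Pay special attention to the condition under which a master may consume commands; we will come back to it later.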

    private Command findOneCommand() {
        int pageNumber = 0;
        Command result = null;
        while (Stopper.isRunning()) {
            // Commands are consumed normally as long as MASTER_SIZE is not 0
            if (ServerNodeManager.MASTER_SIZE == 0) {
                return null;
            }
            List<Command> commandList = processService.findCommandPage(ServerNodeManager.MASTER_SIZE, pageNumber);
            if (commandList.size() == 0) {
                return null;
            }
            for (Command command : commandList) {
                int slot = ServerNodeManager.getSlot();
                // Pick the command that belongs to this master
                if (ServerNodeManager.MASTER_SIZE != 0
                        && command.getId() % ServerNodeManager.MASTER_SIZE == slot) {
                    result = command;
                    break;
                }
            }
            if (result != null) {
                logger.info("find command {}, slot:{} :",
                        result.getId(),
                        ServerNodeManager.getSlot());
                break;
            }
            pageNumber += 1;
        }
        return result;
    }
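The paging loop in findOneCommand() can be simulated in memory. This is a hypothetical stand-in: a sorted list of IDs plays the command table, and the first argument of findCommandPage is treated as the page size, which is an assumption read from the call site above rather than checked against ProcessService.

```java
import java.util.List;
import java.util.Optional;

// Hypothetical in-memory stand-in for findOneCommand(): a sorted list of pending command ids
// plays the role of the command table and is read page by page.
final class CommandScanSketch {

    static Optional<Integer> findOne(List<Integer> pendingIds, int masterSize, int slot, int pageSize) {
        if (masterSize == 0 || pageSize <= 0) {
            return Optional.empty(); // like "MASTER_SIZE == 0 -> return null"
        }
        for (int from = 0; from < pendingIds.size(); from += pageSize) {
            List<Integer> page = pendingIds.subList(from, Math.min(from + pageSize, pendingIds.size()));
            for (int id : page) {
                if (id % masterSize == slot) {
                    return Optional.of(id); // first command owned by this slot
                }
            }
        }
        return Optional.empty(); // no more pages, like an empty commandList
    }

    public static void main(String[] args) {
        List<Integer> pending = List.of(10, 13, 16, 17);
        // MASTER_SIZE = 3, slot = 2, page size = MASTER_SIZE as in the call above
        System.out.println(findOne(pending, 3, 2, 3)); // Optional[17]
    }
}
```

Here the first page (10, 13, 16) holds nothing for slot 2, so the scan moves to the next page, just as pageNumber += 1 does in the original loop.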

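3. Prevent duplicate consumption

handleCommand() runs in a transaction and deletes the command row at the end. If the delete removes no row, the command has already been consumed by another master, so an exception is thrown and the transaction, including the process instance it saved, is rolled back.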

    @Transactional
    public ProcessInstance handleCommand(Logger logger, String host, Command command, HashMap<String, ProcessDefinition> processDefinitionCacheMaps
    ) {
        ProcessInstance processInstance = constructProcessInstance(command, host, processDefinitionCacheMaps);
        // cannot construct process instance, return null
        if (processInstance == null) {
            logger.error("scan command, command parameter is error: {}", command);
            moveToErrorCommand(command, "process instance is null");
            return null;
        }
        processInstance.setCommandType(command.getCommandType());
        processInstance.addHistoryCmd(command.getCommandType());
        saveProcessInstance(processInstance);
        this.setSubProcessParam(processInstance);
        // Delete the command and check that the delete succeeded
        this.deleteCommandWithCheck(command.getId());
        return processInstance;
    }
    
    private void deleteCommandWithCheck(int commandId) {
        int delete = this.commandMapper.deleteById(commandId);
        // Delete + transaction together guarantee the command is consumed only once
        if (delete != 1) {
            throw new ServiceException("delete command fail, id:" + commandId);
        }
    }
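The delete works as a claim: only one caller can remove a given row. The sketch below models that with a ConcurrentHashMap, whose remove() is atomic per key, standing in for the command table. It does not model the database transaction or rollback; DeleteAsClaimSketch and race are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical analogue of deleteCommandWithCheck(): a ConcurrentHashMap plays the command table,
// and map.remove() plays the row delete, succeeding for exactly one caller per key.
final class DeleteAsClaimSketch {

    static final Map<Integer, String> COMMANDS = new ConcurrentHashMap<>();

    // Mirrors "delete != 1 -> throw": only the caller that actually removed the row may continue.
    static void deleteCommandWithCheck(int commandId) {
        if (COMMANDS.remove(commandId) == null) {
            throw new IllegalStateException("delete command fail, id:" + commandId);
        }
    }

    // Lets `masters` threads try to consume the same command concurrently and counts the winners.
    static int race(int commandId, int masters) {
        COMMANDS.put(commandId, "payload");
        AtomicInteger winners = new AtomicInteger();
        ExecutorService pool = Executors.newFixedThreadPool(masters);
        try {
            List<Future<?>> futures = new ArrayList<>();
            for (int i = 0; i < masters; i++) {
                futures.add(pool.submit(() -> {
                    try {
                        deleteCommandWithCheck(commandId);
                        winners.incrementAndGet();
                    } catch (IllegalStateException e) {
                        // In handleCommand() this exception would roll back the transaction.
                    }
                }));
            }
            for (Future<?> future : futures) {
                future.get();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
        return winners.get();
    }

    public static void main(String[] args) {
        System.out.println("winners: " + race(42, 3)); // winners: 1
    }
}
```

However the threads interleave, exactly one of them removes the entry; the others hit the exception path, which in DolphinScheduler is where the rollback happens.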

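Discussion

Why can a command be consumed more than once?

Once all masters have started and their slot values are stable, no command is consumed twice. Duplicate consumption is only possible while a master is going online or offline.

Assume there are pending commands:

First, recall from step 2 that a master consumes commands whenever MASTER_SIZE != 0. From step 1, when a master goes online or offline, every other master runs updateMasterNodes() from its listener, which does two things:

1) It calls SLOT_LIST.clear(), which means getSlot() returns 0 on every master.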

    public static Integer getSlot() {
        if (SLOT_LIST.size() > 0) {
            return SLOT_LIST.get(0);
        }
        return 0;
    }

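2) It competes for the distributed lock, and remember that this step is serialized. A master that has not yet obtained the lock therefore has slot = 0 while MASTER_SIZE still equals the old number of masters. Such a master keeps consuming commands normally, and all masters in this state target the same commands (those whose ID % MASTER_SIZE == 0).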
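The window described above can be reproduced with a small model. This is a hypothetical sketch: MasterView stands in for one master's ServerNodeManager state, and all three masters are frozen in the moment after SLOT_LIST.clear() but before they obtain the lock.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of the window inside updateMasterNodes(): SLOT_LIST has been cleared
// but the lock has not been acquired yet, so MASTER_SIZE still holds the old count.
final class ResyncWindowSketch {

    static final class MasterView {
        final List<Integer> slotList = new ArrayList<>();
        int masterSize;

        MasterView(int masterSize, int slot) {
            this.masterSize = masterSize;
            this.slotList.add(slot);
        }

        // Same fallback as ServerNodeManager.getSlot(): an empty list yields slot 0.
        int getSlot() {
            return slotList.isEmpty() ? 0 : slotList.get(0);
        }

        boolean owns(int commandId) {
            return masterSize != 0 && commandId % masterSize == getSlot();
        }
    }

    public static void main(String[] args) {
        List<MasterView> masters = List.of(new MasterView(3, 0), new MasterView(3, 1), new MasterView(3, 2));
        // A master goes online or offline: every master clears SLOT_LIST, then waits for the lock.
        masters.forEach(m -> m.slotList.clear());
        int commandId = 6;
        long claimants = masters.stream().filter(m -> m.owns(commandId)).count();
        System.out.println("masters targeting command " + commandId + ": " + claimants); // 3
    }
}
```

In the same window, a command such as 7 (7 % 3 == 1) is targeted by no master at all, so besides the collisions, commands for the other slots wait until the resync finishes.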

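Optimization suggestions

Step 3's transaction already guarantees that no command is consumed twice, but there is still room to reduce how often masters collide on the same command.

1) While masters are going online or offline, a master that has not yet obtained the lock stops working temporarily: just set MASTER_SIZE = 0.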

    private void updateMasterNodes() {
        SLOT_LIST.clear();
        // Set to 0 so that findOneCommand() returns null until the sync finishes
        MASTER_SIZE = 0;
        ......
    }
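Suggestion 1 can be sketched with two volatile statics standing in for MASTER_SIZE and SLOT_LIST. This is a hypothetical model, not DolphinScheduler code; unlike the snippet above, it zeroes the size before resetting the slot, which narrows the window in which a scan sees slot 0 together with the old size.

```java
// Hypothetical sketch of suggestion 1: pause consumption by zeroing MASTER_SIZE before waiting
// for the lock. Two volatile statics stand in for MASTER_SIZE and SLOT_LIST.
final class PauseDuringResyncSketch {

    static volatile int masterSize = 3;
    static volatile int slot = 1;

    // Reads masterSize once, so a concurrent reset to 0 cannot cause a division by zero
    // between the "!= 0" check and the "%" operation.
    static boolean canConsume(int commandId) {
        int size = masterSize;
        return size != 0 && commandId % size == slot;
    }

    static void beginResync() {
        masterSize = 0; // suggestion 1: stop consuming first
        slot = 0;       // then SLOT_LIST.clear(): getSlot() falls back to 0
    }

    static void finishResync(int newMasterSize, int newSlot) {
        slot = newSlot;
        masterSize = newMasterSize; // written last, so consumption resumes only with the new slot in place
    }

    public static void main(String[] args) {
        System.out.println(canConsume(4)); // true: 4 % 3 == 1
        beginResync();
        System.out.println(canConsume(6)); // false: paused instead of colliding on slot 0
        finishResync(2, 1);
        System.out.println(canConsume(5)); // true: 5 % 2 == 1
    }
}
```

One side effect worth checking: findOneCommand() reads ServerNodeManager.MASTER_SIZE twice in `MASTER_SIZE != 0 && command.getId() % MASTER_SIZE == slot`. Once MASTER_SIZE can be set to 0 concurrently, that double read could in principle throw an ArithmeticException; copying the value into a local first, as the sketch does, avoids it.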

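2) While masters are going online or offline, a master that has not yet obtained the lock keeps its old slot and continues working normally: move SLOT_LIST.clear() to after the lock is acquired.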

    private void updateMasterNodes() {
        // Removed from here
        // SLOT_LIST.clear();
        this.masterNodes.clear();
        String nodeLock = Constants.REGISTRY_DOLPHINSCHEDULER_LOCK_MASTERS;
        try {
            registryClient.getLock(nodeLock);
            // Moved here, after the lock is acquired
            SLOT_LIST.clear();
            Collection<String> currentNodes = registryClient.getMasterNodesDirectly();
            List<Server> masterNodes = registryClient.getServerList(NodeType.MASTER);
            syncMasterNodes(currentNodes, masterNodes);
        } catch (Exception e) {
            logger.error("update master nodes error", e);
        } finally {
            registryClient.releaseLock(nodeLock);
        }
    }
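A variation on suggestion 2, sketched under assumptions: a ReentrantLock stands in for the registry's distributed lock, sorting host strings stands in for MasterPriorityQueue, and MASTER_SIZE and the slot are published together as one immutable snapshot. The snapshot goes a step beyond the suggestion above: with two separate static fields, a scan can still briefly see the cleared slot (0) between SLOT_LIST.clear() and SLOT_LIST.add(). Pausing when the local host is missing is also a choice of this sketch; the original code leaves MASTER_SIZE unchanged in that case. All names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical sketch: recompute the slot only after the lock is held, and swap the old
// (MASTER_SIZE, slot) pair for the new one in a single volatile write.
final class ResyncUnderLockSketch {

    record SlotState(int masterSize, int slot) { }

    private static final ReentrantLock REGISTRY_LOCK = new ReentrantLock(); // stand-in for the distributed lock
    private volatile SlotState state;

    ResyncUnderLockSketch(SlotState initial) {
        this.state = initial;
    }

    // Scanners read one consistent snapshot; while resync() waits for the lock, this is the old one.
    SlotState current() {
        return state;
    }

    void resync(List<String> registeredHosts, String localHost) {
        REGISTRY_LOCK.lock();
        try {
            List<String> ordered = new ArrayList<>(registeredHosts);
            Collections.sort(ordered); // stand-in for the MasterPriorityQueue ordering
            int index = ordered.indexOf(localHost);
            // Sketch's choice: pause (size 0) if this host is not registered.
            state = index >= 0 ? new SlotState(ordered.size(), index) : new SlotState(0, 0);
        } finally {
            REGISTRY_LOCK.unlock();
        }
    }

    public static void main(String[] args) {
        ResyncUnderLockSketch master = new ResyncUnderLockSketch(new SlotState(3, 2));
        System.out.println(master.current()); // SlotState[masterSize=3, slot=2]
        master.resync(List.of("10.0.0.1:5678", "10.0.0.3:5678"), "10.0.0.3:5678");
        System.out.println(master.current()); // SlotState[masterSize=2, slot=1]
    }
}
```

Until resync() actually holds the lock, the master keeps scanning with its old, still distinct slot, which is the behavior suggestion 2 is after.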