Ansible Source Code Walkthrough: How the forks Concurrency Mechanism Is Implemented

(This article is based on Ansible 2.7.)
The forks option is Ansible's built-in mechanism for parallel execution. Its default can be set in the configuration file, overridden at runtime when invoking ansible, or assigned directly when developing against the Ansible API.
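
For example, the default can come from ansible.cfg (forks = 10 under [defaults]), be overridden on the command line with -f 10, or be supplied through the Python API. Below is a rough sketch of the API route, following the 2.x-era namedtuple options pattern; the exact field list is illustrative rather than an authoritative signature:

    from collections import namedtuple

    # In Ansible 2.x the CLI packs the parsed command-line options into a simple
    # object. When driving the API directly, a namedtuple with a 'forks' field is
    # commonly used in its place and later handed to TaskQueueManager(options=...),
    # which is where the self._options.forks seen below comes from.
    Options = namedtuple('Options', ['connection', 'module_path', 'forks',
                                     'become', 'become_method', 'become_user',
                                     'check', 'diff'])
    options = Options(connection='smart', module_path=None, forks=10,
                      become=None, become_method=None, become_user=None,
                      check=False, diff=False)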

The forks option is received and processed in lib/ansible/cli/__init__.py, lines 442-444:

        if fork_opts:
            parser.add_option('-f', '--forks', dest='forks', default=C.DEFAULT_FORKS, type='int',
                              help="specify number of parallel processes to use (default=%s)" % C.DEFAULT_FORKS)

Lines 389-391 show that the value must not be less than 1:

        if fork_opts:
            if op.forks < 1:
                self.parser.error("The number of processes (--forks) must be >= 1")

The help text describes this option as the "number of parallel processes to use", defaulting to C.DEFAULT_FORKS. We can pass -f when starting ansible to override that default.
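
As a standalone illustration of the same pattern (plain optparse, which is what the Ansible 2.7 CLI is built on), the default is applied at parse time and the lower bound is checked afterwards:

    import optparse

    DEFAULT_FORKS = 5  # C.DEFAULT_FORKS ships as 5 unless overridden in configuration

    parser = optparse.OptionParser()
    parser.add_option('-f', '--forks', dest='forks', default=DEFAULT_FORKS, type='int')
    opts, _ = parser.parse_args(['-f', '10'])

    if opts.forks < 1:
        parser.error("The number of processes (--forks) must be >= 1")
    print(opts.forks)  # -> 10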

So how does Ansible use this value at runtime? We know that Ansible runs tasks through the TaskQueueManager class (see the earlier post "Ansible 源码解析: Ansible的运行过程"), so creating the worker processes should also be TaskQueueManager's job. Let's look at its run method.
lib/ansible/executor/task_queue_manager.py, lines 220-299:

    def run(self, play):
        '''
        Iterates over the roles/tasks in a play, using the given (or default)
        strategy for queueing tasks. The default is the linear strategy, which
        operates like classic Ansible by keeping all hosts in lock-step with
        a given task (meaning no hosts move on to the next task until all hosts
        are done with the current task).
        '''

        if not self._callbacks_loaded:
            self.load_callbacks()

        all_vars = self._variable_manager.get_vars(play=play)
        warn_if_reserved(all_vars)
        templar = Templar(loader=self._loader, variables=all_vars)

        new_play = play.copy()
        new_play.post_validate(templar)
        new_play.handlers = new_play.compile_roles_handlers() + new_play.handlers

        self.hostvars = HostVars(
            inventory=self._inventory,
            variable_manager=self._variable_manager,
            loader=self._loader,
        )

        play_context = PlayContext(new_play, self._options, self.passwords, self._connection_lockfile.fileno())
        for callback_plugin in self._callback_plugins:
            if hasattr(callback_plugin, 'set_play_context'):
                callback_plugin.set_play_context(play_context)

        self.send_callback('v2_playbook_on_play_start', new_play)

        # initialize the shared dictionary containing the notified handlers
        self._initialize_notified_handlers(new_play)

        # build the iterator
        iterator = PlayIterator(
            inventory=self._inventory,
            play=new_play,
            play_context=play_context,
            variable_manager=self._variable_manager,
            all_vars=all_vars,
            start_at_done=self._start_at_done,
        )

        # adjust to # of workers to configured forks or size of batch, whatever is lower
        self._initialize_processes(min(self._options.forks, iterator.batch_size))

        # load the specified strategy (or the default linear one)
        strategy = strategy_loader.get(new_play.strategy, self)
        if strategy is None:
            raise AnsibleError("Invalid play strategy specified: %s" % new_play.strategy, obj=play._ds)

        # Because the TQM may survive multiple play runs, we start by marking
        # any hosts as failed in the iterator here which may have been marked
        # as failed in previous runs. Then we clear the internal list of failed
        # hosts so we know what failed this round.
        for host_name in self._failed_hosts.keys():
            host = self._inventory.get_host(host_name)
            iterator.mark_host_failed(host)

        self.clear_failed_hosts()

        # during initialization, the PlayContext will clear the start_at_task
        # field to signal that a matching task was found, so check that here
        # and remember it so we don't try to skip tasks on future plays
        if getattr(self._options, 'start_at_task', None) is not None and play_context.start_at_task is None:
            self._start_at_done = True

        # and run the play using the strategy and cleanup on way out
        play_return = strategy.run(iterator, play_context)

        # now re-save the hosts that failed from the iterator to our internal list
        for host_name in iterator.get_failed_hosts():
            self._failed_hosts[host_name] = True

        strategy.cleanup()
        self._cleanup_processes()
        return play_return

In particular, lines 266-267:

        # adjust to # of workers to configured forks or size of batch, whatever is lower
        self._initialize_processes(min(self._options.forks, iterator.batch_size))

show that the number of worker processes is the smaller of the forks value from the options and the size of the current host batch.
The _initialize_processes method itself, however, merely builds a list of empty worker slots (lines 113-117):

    def _initialize_processes(self, num):
        self._workers = []

        for i in range(num):
            self._workers.append(None)

No process is created here at all.
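
Putting the two pieces together with some hypothetical numbers: running with -f 20 against a batch of 6 hosts reserves only 6 empty worker slots, and no child process exists yet:

    forks = 20
    batch_size = 6                   # hosts in the current serial batch
    num = min(forks, batch_size)     # 6: never more workers than hosts in the batch

    workers = [None] * num           # equivalent to the append-None loop above
    print(workers)                   # [None, None, None, None, None, None]
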
Notice instead that at lines 290-291 the actual play run is delegated to the strategy:

        # and run the play using the strategy and cleanup on way out
        play_return = strategy.run(iterator, play_context)

and the strategy object is created from self, i.e. the TaskQueueManager instance (lines 269-270):

        # load the specified strategy (or the default linear one)
        strategy = strategy_loader.get(new_play.strategy, self)

So the next place to look is the strategy itself.
The default strategy is linear (lib/ansible/plugins/strategy/linear.py), but it references the workers only once, to process result messages, which is clearly not what we are after.
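
Conceptually, the linear strategy's run loop works in lock-step: for each task it queues the task once per host and then waits for every host's result before starting the next task. A rough schematic (not the real linear.py, which carries far more bookkeeping):

    # Schematic of the lock-step queueing flow; _queue_task here is a stand-in
    # for StrategyBase._queue_task, which is examined next.
    hosts = ['web1', 'web2', 'web3']
    tasks = ['ping', 'setup']

    def _queue_task(host, task):
        print("queue %s on %s" % (task, host))

    for task in tasks:                 # lock-step: one task at a time
        for host in hosts:
            _queue_task(host, task)    # hands the work to a worker slot
        # ... wait for all pending results before moving to the next task
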
Searching the strategy base class StrategyBase instead, we find the _queue_task method, which the linear strategy's run calls:

lib/ansible/plugins/strategy/__init__.py, lines 279-336:

    def _queue_task(self, host, task, task_vars, play_context):
        ''' handles queueing the task up to be sent to a worker '''

        display.debug("entering _queue_task() for %s/%s" % (host.name, task.action))

        # Add a write lock for tasks.
        # Maybe this should be added somewhere further up the call stack but
        # this is the earliest in the code where we have task (1) extracted
        # into its own variable and (2) there's only a single code path
        # leading to the module being run.  This is called by three
        # functions: __init__.py::_do_handler_run(), linear.py::run(), and
        # free.py::run() so we'd have to add to all three to do it there.
        # The next common higher level is __init__.py::run() and that has
        # tasks inside of play_iterator so we'd have to extract them to do it
        # there.

        if task.action not in action_write_locks.action_write_locks:
            display.debug('Creating lock for %s' % task.action)
            action_write_locks.action_write_locks[task.action] = Lock()

        # and then queue the new task
        try:

            # create a dummy object with plugin loaders set as an easier
            # way to share them with the forked processes
            shared_loader_obj = SharedPluginLoaderObj()

            queued = False
            starting_worker = self._cur_worker
            while True:
                worker_prc = self._workers[self._cur_worker]
                if worker_prc is None or not worker_prc.is_alive():
                    self._queued_task_cache[(host.name, task._uuid)] = {
                        'host': host,
                        'task': task,
                        'task_vars': task_vars,
                        'play_context': play_context
                    }

                    worker_prc = WorkerProcess(self._final_q, task_vars, host, task, play_context, self._loader, self._variable_manager, shared_loader_obj)
                    self._workers[self._cur_worker] = worker_prc
                    worker_prc.start()
                    display.debug("worker is %d (out of %d available)" % (self._cur_worker + 1, len(self._workers)))
                    queued = True
                self._cur_worker += 1
                if self._cur_worker >= len(self._workers):
                    self._cur_worker = 0
                if queued:
                    break
                elif self._cur_worker == starting_worker:
                    time.sleep(0.0001)

            self._pending_results += 1
        except (EOFError, IOError, AssertionError) as e:
            # most likely an abort
            display.debug("got an error while queuing: %s" % e)
            return
        display.debug("exiting _queue_task() for %s/%s" % (host.name, task.action))

Here, self._cur_worker is a round-robin index into the _workers list. Each pass through the loop checks one slot and then advances the index by one; when the index reaches the length of _workers it wraps back to zero. A new WorkerProcess is started in the first slot that is still empty (None) or whose process is no longer alive; if the index comes all the way back to where it started without finding a free slot, the loop sleeps briefly and scans again. In effect, the _workers list behaves like a pool of child-process slots of size forks: until every task has been dispatched, whenever a child exits, a fresh child is started in its slot to run the next queued task.
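
To make the slot-scanning loop concrete, here is a standalone simulation (plain multiprocessing, not Ansible code) of the same scheme: a fixed-size pool of slots, round-robin scanning for a slot that is empty or whose process has exited, and a short sleep whenever every slot is still busy:

    import time
    from multiprocessing import Process

    def run_task(name):
        time.sleep(0.1)                 # stand-in for executing a module on a host

    workers = [None] * 3                # forks=3 -> three worker slots
    cur_worker = 0

    def queue_task(name):
        global cur_worker
        queued = False
        starting_worker = cur_worker
        while True:
            prc = workers[cur_worker]
            if prc is None or not prc.is_alive():
                prc = Process(target=run_task, args=(name,))
                workers[cur_worker] = prc
                prc.start()
                queued = True
            cur_worker = (cur_worker + 1) % len(workers)
            if queued:
                break
            elif cur_worker == starting_worker:
                time.sleep(0.0001)      # every slot busy: pause briefly, rescan

    if __name__ == '__main__':
        for i in range(10):
            queue_task('task-%d' % i)
        for prc in workers:
            prc.join()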
