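Project scenario:
After running for a while, the airflow scheduler suddenly stops working or spins idly. The background process still exists, but it does no work.
Problem description
The logs show that the scheduler has no heartbeat: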
================================================================================
[2022-06-14 14:51:06,301] {manager.py:1065} INFO - Finding 'running' jobs without a recent heartbeat
[2022-06-14 14:51:06,304] {manager.py:1069} INFO - Failing jobs without heartbeat after 2022-06-14 06:46:06.304241+00:00
[2022-06-14 14:51:16,495] {manager.py:1065} INFO - Finding 'running' jobs without a recent heartbeat
[2022-06-14 14:51:16,497] {manager.py:1069} INFO - Failing jobs without heartbeat after 2022-06-14 06:46:16.497371+00:00
The scheduler error log shows:
Traceback (most recent call last):
  File "/home/airflow/env/lib/python3.9/site-packages/sqlalchemy/pool/base.py", line 697, in _finalize_fairy
    fairy._reset(pool)
  File "/home/airflow/env/lib/python3.9/site-packages/sqlalchemy/pool/base.py", line 893, in _reset
    pool._dialect.do_rollback(self)
  File "/home/airflow/env/lib/python3.9/site-packages/sqlalchemy/dialects/mysql/base.py", line 2513, in do_rollback
    dbapi_connection.rollback()
MySQLdb._exceptions.OperationalError: (2013, 'Lost connection to MySQL server during query')
(env) *@:/home/>tail airflow/conf/airflow-scheduler.err
Traceback (most recent call last):
  File "/home/airflow/env/lib/python3.9/threading.py", line 950, in _bootstrap_inner
    self.run()
  File "/home/airflow/env/lib/python3.9/concurrent/futures/process.py", line 317, in run
    result_item, is_broken, cause = self.wait_result_broken_or_wakeup()
  File "/home/airflow/env/lib/python3.9/concurrent/futures/process.py", line 376, in wait_result_broken_or_wakeup
    worker_sentinels = [p.sentinel for p in self.processes.values()]
  File "/home/airflow/env/lib/python3.9/concurrent/futures/process.py", line 376, in <listcomp>
    worker_sentinels = [p.sentinel for p in self.processes.values()]
RuntimeError: dictionary changed size during iteration
Cause analysis:
A stack dump of the scheduler process shows:
Process 13749: python3.9 /home/airflow/env/bin/airflow scheduler -D
Python v3.9.0 (/home/airflow/env/bin/python3.9)
Thread 13749 (idle): "MainThread"
    wait (threading.py:312)
    result (concurrent/futures/_base.py:435)
    result_iterator (concurrent/futures/_base.py:600)
    _chain_from_iterable_of_lists (concurrent/futures/process.py:559)
    _send_tasks_to_celery (airflow/executors/celery_executor.py:325)
    _process_tasks (airflow/executors/celery_executor.py:277)
    trigger_tasks (airflow/executors/celery_executor.py:268)
    heartbeat (airflow/executors/base_executor.py:158)
    _run_scheduler_loop (airflow/jobs/scheduler_job.py:734)
    _execute (airflow/jobs/scheduler_job.py:651)
    run (airflow/jobs/base_job.py:246)
    _run_scheduler_job (airflow/cli/commands/scheduler_command.py:46)
    scheduler (airflow/cli/commands/scheduler_command.py:70)
    wrapper (airflow/utils/cli.py:92)
    command (airflow/cli/cli_parser.py:48)
    main (airflow/__main__.py:48)
    <module> (airflow:8)
Thread 2111 (idle): "QueueFeederThread"
    wait (threading.py:312)
    _feed (multiprocessing/queues.py:233)
    run (threading.py:888)
    _bootstrap_inner (threading.py:950)
    _bootstrap (threading.py:908)
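The MainThread is idle inside result(), waiting on a future from the process pool that the Celery executor uses to send tasks. The pool's management thread has already died with the RuntimeError shown above, so the result never arrives and the scheduler process waits forever, starved of work. On inspection, this is actually a bug in Python 3.9.
See issue43498 in the Python bug tracker for details.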
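The error class in the second traceback can be reproduced in isolation. The following is a minimal single-threaded sketch, not the CPython code: the dictionary name `processes` only mirrors the traceback, both helper names are hypothetical, and the in-loop mutation is a stand-in for the change that, in the real bug, is assumed to come from another thread.

```python
def iterate_while_mutating():
    """Iterate over a dict view while the dict grows; return the error text."""
    processes = {1: "worker-1", 2: "worker-2"}
    try:
        for _ in processes.values():
            # Stand-in for another thread adding a worker mid-iteration.
            processes[len(processes) + 1] = "worker-new"
    except RuntimeError as exc:
        return str(exc)
    return None


def iterate_over_snapshot():
    """Iterate over a list copy of the values, so growth cannot break the loop."""
    processes = {1: "worker-1", 2: "worker-2"}
    seen = []
    for name in list(processes.values()):
        processes[len(processes) + 1] = "worker-new"
        seen.append(name)
    return seen
```

Taking a snapshot with `list(...)` before iterating is the usual way to avoid this class of error. Whether the upstream fix uses exactly this approach is not stated in this article.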
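Solution:
Temporary workaround
Monitor the scheduler's status through the web UI or from the command line:
airflow jobs check --job-type SchedulerJob --allow-multiple --limit 100
If the process has no heartbeat, restart it.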
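The check-and-restart loop can be automated. Below is a rough sketch under two assumptions: that `airflow jobs check` exits with a non-zero status when it finds no live scheduler job, and that the scheduler can be restarted with the placeholder `RESTART_CMD` (a hypothetical systemd unit name; substitute whatever manages the scheduler in your deployment). The function names are illustrative only.

```python
import subprocess

# The check command is the one quoted in this article.
CHECK_CMD = ["airflow", "jobs", "check", "--job-type", "SchedulerJob",
             "--allow-multiple", "--limit", "100"]
# Hypothetical restart command; replace it with your own service manager call.
RESTART_CMD = ["systemctl", "restart", "airflow-scheduler"]


def scheduler_is_alive(run=subprocess.run):
    """Return True when the check command exits with status 0."""
    return run(CHECK_CMD).returncode == 0


def watchdog_once(run=subprocess.run):
    """Run one check and restart the scheduler if no live job is found."""
    if scheduler_is_alive(run):
        return "alive"
    run(RESTART_CMD)
    return "restarted"
```

Calling `watchdog_once()` periodically, for example from a cron job, gives a simple watchdog. The `run` parameter exists only so the logic can be exercised without a real airflow installation.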
Permanent fix
Upgrade the Python version:
Python >= 3.9.10 or >= 3.10.1
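To see whether a given interpreter falls below those releases, a small check like the following may help. It is only a sketch: the thresholds are taken from this article as-is, the function name `is_affected` is hypothetical, and treating every version outside the 3.9 and 3.10 series as unaffected is an assumption.

```python
import sys


def is_affected(version=None):
    """Return True if the version is below the fixed releases named in the article."""
    major, minor, micro = (version or sys.version_info)[:3]
    if (major, minor) == (3, 9):
        return micro < 10  # article: fixed in 3.9.10
    if (major, minor) == (3, 10):
        return micro < 1   # article: fixed in 3.10.1
    # Assumption: other series are treated as unaffected.
    return False
```

Run inside the scheduler's virtualenv, `is_affected()` checks the interpreter that airflow actually uses, which for the process dumped above is Python 3.9.0.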