PySpark 3.0: adding log output for Python worker process crashes

In Spark versions earlier than 3.2.0, when a Python UDF crashes at runtime with a segmentation fault, the Spark executor's error log is vague and shows only a single line:

Python worker exited unexpectedly (crashed)

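When a process dumps core, an ordinary language-level try/except cannot catch the failure, which makes troubleshooting very unfriendly. This problem was fixed in Spark 3.2; for details see the issue: [SPARK-36062] Try to capture faulthanlder when a Python worker crashes. - ASF JIRA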
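To make the point about try/except concrete, here is a minimal sketch (not Spark code) that runs a crashing snippet in a child interpreter. The child script and the helper name run_crashing_child are made up for illustration; the crash itself uses the same ctypes.string_at(0) call as the Python documentation example.

```python
import subprocess
import sys
import textwrap

# Child program: tries to guard a crashing call with try/except.
CHILD = textwrap.dedent("""
    import ctypes
    try:
        ctypes.string_at(0)      # reads address 0, which crashes the process
    except BaseException:
        print("caught")          # not reached: the interpreter is already gone
""")


def run_crashing_child():
    """Run the child in a separate interpreter and report how it ended."""
    return subprocess.run(
        [sys.executable, "-c", CHILD],
        capture_output=True,
        text=True,
    )


result = run_crashing_child()
# On POSIX a negative return code means the child was killed by that signal.
print("return code:", result.returncode)
print("stdout:", repr(result.stdout))
```

The parent only learns that the child died; nothing printed by the except branch ever shows up, which is the same situation the executor is in when a Python worker crashes.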

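On Spark versions below 3.2.0, this feature can be backported. I tried to cherry-pick it onto Spark 3.0.1 but ran into many conflicts, so to be safe I merged the changes by hand instead. With the patch in place, if a Python process crashes again, the executor error log shown above becomes the following, much more detailed log: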

23/02/28 18:44:06 ERROR Executor: Exception in task 2.0 in stage 4.0 (TID 8)
org.apache.spark.SparkException: Python worker exited unexpectedly (crashed): Fatal Python error: Segmentation fault

Current thread 0x00007f247aafe740 (most recent call first):
  File "<frozen importlib._bootstrap>", line 219 in _call_with_frames_removed
  File "<frozen importlib._bootstrap_external>", line 922 in create_module
  File "<frozen importlib._bootstrap>", line 571 in module_from_spec
  File "<frozen importlib._bootstrap>", line 658 in _load_unlocked
  File "<frozen importlib._bootstrap>", line 684 in _load
  File "/data1/emr/yarn/local/usercache/test/appcache/application_1653035898918_2410947/container_e32_1653035898918_2410947_01_000003/f0becb3851ea1c4ce2647d4b63c4e2a7/lib/python3.6/imp.py", line 343 in load_dynamic
  File "/data1/emr/yarn/local/usercache/test/appcache/application_1653035898918_2410947/container_e32_1653035898918_2410947_01_000003/f0becb3851ea1c4ce2647d4b63c4e2a7/lib/python3.6/imp.py", line 243 in load_module
  File "/data1/emr/yarn/local/usercache/test/appcache/application_1653035898918_2410947/container_e32_1653035898918_2410947_01_000003/f0becb3851ea1c4ce2647d4b63c4e2a7/lib/python3.6/site-packages/tensorflow/python/pywrap_tensorflow_internal.py", line 24 in swig_import_helper
  File "/data1/emr/yarn/local/usercache/test/appcache/application_1653035898918_2410947/container_e32_1653035898918_2410947_01_000003/f0becb3851ea1c4ce2647d4b63c4e2a7/lib/python3.6/site-packages/tensorflow/python/pywrap_tensorflow_internal.py", line 28 in <module>
  File "<frozen importlib._bootstrap>", line 219 in _call_with_frames_removed
  File "<frozen importlib._bootstrap_external>", line 678 in exec_module
  File "<frozen importlib._bootstrap>", line 665 in _load_unlocked
  File "<frozen importlib._bootstrap>", line 955 in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 971 in _find_and_load
  File "/data1/emr/yarn/local/usercache/test/appcache/application_1653035898918_2410947/container_e32_1653035898918_2410947_01_000003/f0becb3851ea1c4ce2647d4b63c4e2a7/lib/python3.6/site-packages/tensorflow/python/pywrap_tensorflow.py", line 58 in <module>
  File "<frozen importlib._bootstrap>", line 219 in _call_with_frames_removed
  File "<frozen importlib._bootstrap_external>", line 678 in exec_module
  File "<frozen importlib._bootstrap>", line 665 in _load_unlocked
  File "<frozen importlib._bootstrap>", line 955 in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 971 in _find_and_load
  File "<frozen importlib._bootstrap>", line 219 in _call_with_frames_removed
  File "<frozen importlib._bootstrap>", line 1023 in _handle_fromlist
  File "/data1/emr/yarn/local/usercache/test/appcache/application_1653035898918_2410947/container_e32_1653035898918_2410947_01_000003/f0becb3851ea1c4ce2647d4b63c4e2a7/lib/python3.6/site-packages/tensorflow/python/__init__.py", line 49 in <module>
  File "<frozen importlib._bootstrap>", line 219 in _call_with_frames_removed
  File "<frozen importlib._bootstrap_external>", line 678 in exec_module
  File "<frozen importlib._bootstrap>", line 665 in _load_unlocked
  File "<frozen importlib._bootstrap>", line 955 in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 971 in _find_and_load
  File "/data1/emr/yarn/local/usercache/test/appcache/application_1653035898918_2410947/container_e32_1653035898918_2410947_01_000003/f0becb3851ea1c4ce2647d4b63c4e2a7/lib/python3.6/site-packages/tensorflow/__init__.py", line 24 in <module>
  File "<frozen importlib._bootstrap>", line 219 in _call_with_frames_removed
  File "<frozen importlib._bootstrap_external>", line 678 in exec_module
  File "<frozen importlib._bootstrap>", line 665 in _load_unlocked
  File "<frozen importlib._bootstrap>", line 955 in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 971 in _find_and_load
  File "/data1/emr/yarn/local/usercache/test/appcache/application_1653035898918_2410947/container_e32_1653035898918_2410947_01_000003/f0becb3851ea1c4ce2647d4b63c4e2a7/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py", line 5 in <module>
  File "<frozen importlib._bootstrap>", line 219 in _call_with_frames_removed
  File "<frozen importlib._bootstrap_external>", line 678 in exec_module
  File "<frozen importlib._bootstrap>", line 665 in _load_unlocked
  File "<frozen importlib._bootstrap>", line 955 in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 971 in _find_and_load
  File "/data1/emr/yarn/local/usercache/test/appcache/application_1653035898918_2410947/container_e32_1653035898918_2410947_01_000003/f0becb3851ea1c4ce2647d4b63c4e2a7/lib/python3.6/site-packages/keras/backend/load_backend.py", line 90 in <module>
  File "<frozen importlib._bootstrap>", line 219 in _call_with_frames_removed
  File "<frozen importlib._bootstrap_external>", line 678 in exec_module
  File "<frozen importlib._bootstrap>", line 665 in _load_unlocked
  File "<frozen importlib._bootstrap>", line 955 in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 971 in _find_and_load
  File "/data1/emr/yarn/local/usercache/test/appcache/application_1653035898918_2410947/container_e32_1653035898918_2410947_01_000003/f0becb3851ea1c4ce2647d4b63c4e2a7/lib/python3.6/site-packages/keras/backend/__init__.py", line 1 in <module>
  File "<frozen importlib._bootstrap>", line 219 in _call_with_frames_removed
  File "<frozen importlib._bootstrap_external>", line 678 in exec_module
  File "<frozen importlib._bootstrap>", line 665 in _load_unlocked
  File "<frozen importlib._bootstrap>", line 955 in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 971 in _find_and_load
  File "<frozen importlib._bootstrap>", line 219 in _call_with_frames_removed
  File "<frozen importlib._bootstrap>", line 1023 in _handle_fromlist
  File "/data1/emr/yarn/local/usercache/test/appcache/application_1653035898918_2410947/container_e32_1653035898918_2410947_01_000003/f0becb3851ea1c4ce2647d4b63c4e2a7/lib/python3.6/site-packages/keras/utils/conv_utils.py", line 9 in <module>
  File "<frozen importlib._bootstrap>", line 219 in _call_with_frames_removed
  File "<frozen importlib._bootstrap_external>", line 678 in exec_module
  File "<frozen importlib._bootstrap>", line 665 in _load_unlocked
  File "<frozen importlib._bootstrap>", line 955 in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 971 in _find_and_load
  File "<frozen importlib._bootstrap>", line 219 in _call_with_frames_removed
  File "<frozen importlib._bootstrap>", line 1023 in _handle_fromlist
  File "/data1/emr/yarn/local/usercache/test/appcache/application_1653035898918_2410947/container_e32_1653035898918_2410947_01_000003/f0becb3851ea1c4ce2647d4b63c4e2a7/lib/python3.6/site-packages/keras/utils/__init__.py", line 6 in <module>
  File "<frozen importlib._bootstrap>", line 219 in _call_with_frames_removed
  File "<frozen importlib._bootstrap_external>", line 678 in exec_module
  File "<frozen importlib._bootstrap>", line 665 in _load_unlocked
  File "<frozen importlib._bootstrap>", line 955 in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 971 in _find_and_load
  File "<frozen importlib._bootstrap>", line 219 in _call_with_frames_removed
  File "<frozen importlib._bootstrap>", line 1023 in _handle_fromlist
  File "/data1/emr/yarn/local/usercache/test/appcache/application_1653035898918_2410947/container_e32_1653035898918_2410947_01_000003/f0becb3851ea1c4ce2647d4b63c4e2a7/lib/python3.6/site-packages/keras/__init__.py", line 3 in <module>
  File "<frozen importlib._bootstrap>", line 219 in _call_with_frames_removed
  File "<frozen importlib._bootstrap_external>", line 678 in exec_module
  File "<frozen importlib._bootstrap>", line 665 in _load_unlocked
  File "<frozen importlib._bootstrap>", line 955 in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 971 in _find_and_load
  File "<frozen importlib._bootstrap>", line 219 in _call_with_frames_removed
  File "<frozen importlib._bootstrap>", line 941 in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 971 in _find_and_load
  File "/data1/emr/yarn/local/usercache/test/appcache/application_1653035898918_2410947/container_e32_1653035898918_2410947_01_000003/f0becb3851ea1c4ce2647d4b63c4e2a7/lib/python3.6/site-packages/cloudpickle/cloudpickle.py", line 908 in subimport

This makes troubleshooting much easier: you can see clearly which dependency caused the crash.
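For clusters that are already on Spark 3.2.0 or later, no backport is needed, but to my knowledge the feature is off by default and is switched on with the spark.python.worker.faulthandler.enabled option. The sketch below assumes that option name (please verify it against the configuration page for your Spark version); the helper names and the application name are made up for illustration.

```python
# Assumption: Spark 3.2.0+ and this option name; check your version's docs.
FAULTHANDLER_CONF = {"spark.python.worker.faulthandler.enabled": "true"}


def spark_submit_args(conf=FAULTHANDLER_CONF):
    """Return the equivalent --conf arguments for a spark-submit command line."""
    args = []
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args


def build_session(app_name="faulthandler-demo"):
    """Create a SparkSession with the option set (requires pyspark installed)."""
    from pyspark.sql import SparkSession  # imported lazily on purpose

    builder = SparkSession.builder.appName(app_name)
    for key, value in FAULTHANDLER_CONF.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


print(" ".join(spark_submit_args()))
```

The option has to be set before the Python workers are launched, so it belongs in the session builder or on the spark-submit command line rather than in the UDF itself.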

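Capturing a segmentation-fault crash is possible thanks to the faulthandler module, which was added in Python 3.3. Once enabled, it installs handlers for fatal signals such as SIGSEGV, and its functions can dump the Python traceback on a fault, after a timeout, or on a user signal.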
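The general pattern can be sketched as follows: the child enables faulthandler with a file target, and the parent reads that file after the child dies. This is only an illustration of the technique under my own assumptions, not the code from the Spark patch; the per-pid file name, passing the directory as an argument, and the helper name run_and_collect are all hypothetical, and matching the file by pid assumes a POSIX system.

```python
import os
import subprocess
import sys
import tempfile
import textwrap

# Child program: write any fatal-error traceback to <dump_dir>/<pid>.
CHILD = textwrap.dedent("""
    import ctypes, faulthandler, os, sys
    # Hypothetical convention: the parent passes the dump directory as argv[1].
    log_path = os.path.join(sys.argv[1], str(os.getpid()))
    log_file = open(log_path, "w")      # must stay open while enabled
    faulthandler.enable(file=log_file)
    ctypes.string_at(0)                 # crash on purpose
""")


def run_and_collect(dump_dir):
    """Run the child, wait for it to die, then read its faulthandler dump."""
    proc = subprocess.Popen([sys.executable, "-c", CHILD, dump_dir])
    proc.wait()
    dump_path = os.path.join(dump_dir, str(proc.pid))
    with open(dump_path) as f:
        return proc.returncode, f.read()


with tempfile.TemporaryDirectory() as dump_dir:
    returncode, dump = run_and_collect(dump_dir)

print("return code:", returncode)
print(dump)
```

Writing to a file rather than stderr lets the parent attach the traceback to its own error message, which is the effect visible in the detailed executor log above.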

A small example from the official Python documentation:

python3 -c "import ctypes; ctypes.string_at(0)"
Segmentation fault

python3 -q -X faulthandler
	>>> import ctypes
	>>> ctypes.string_at(0)
	Fatal Python error: Segmentation fault

	Current thread 0x00007fb899f39700 (most recent call first):
	File "/home/python/cpython/Lib/ctypes/__init__.py", line 486 in string_at
	File "<stdin>", line 1 in <module>
	Segmentation fault
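Besides the -X faulthandler command-line option used above, the module can also be switched on with the PYTHONFAULTHANDLER environment variable or by calling faulthandler.enable() in code. The following sketch checks the first two from a parent process; the helper name is_enabled_with is made up for illustration.

```python
import os
import subprocess
import sys

# One-liner run in the child: report whether faulthandler is active.
CHECK = "import faulthandler; print(faulthandler.is_enabled())"


def is_enabled_with(extra_args=(), extra_env=None):
    """Start a child interpreter and return faulthandler.is_enabled() there."""
    # Drop inherited settings that would turn faulthandler on by themselves.
    env = {
        k: v
        for k, v in os.environ.items()
        if k not in ("PYTHONFAULTHANDLER", "PYTHONDEVMODE")
    }
    env.update(extra_env or {})
    out = subprocess.run(
        [sys.executable, *extra_args, "-c", CHECK],
        capture_output=True,
        text=True,
        env=env,
        check=True,
    )
    return out.stdout.strip() == "True"


print("plain interpreter:", is_enabled_with())
print("-X faulthandler:", is_enabled_with(extra_args=["-X", "faulthandler"]))
print("PYTHONFAULTHANDLER=1:", is_enabled_with(extra_env={"PYTHONFAULTHANDLER": "1"}))
```

The environment variable is handy when you cannot change how the interpreter is launched, while faulthandler.enable() is the only one of the three that lets you choose the output file.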

If you are interested, see the faulthandler page ("Dump the Python traceback") in the Python 3.11.2 documentation.
