Long running JOB due to parallel execution

最新推荐文章于 2025-07-12 20:12:26 发布

iteye_20478

最新推荐文章于 2025-07-12 20:12:26 发布

阅读量115

点赞数

文章标签：数据库

本文介绍了一个ETL并行作业长时间运行的问题排查过程。通过查询特定等待事件PXDeq:ExecuteReply发现并行扫描导致的readbyothersession等待事件成为性能瓶颈。关闭并行后，作业运行时间大幅缩短至30分钟内。

摘要生成于 C知道，由 DeepSeek-R1 满血版支持，前往体验 >

今天遇到一个并行运行案例， ETL开发人员发信说某个job跑了6小时了还没跑完.

他们只给了我一个SID, 该SID=4861, 于是我根据SID 查询等待事件，该等待事件是

PX Deq: Execute Reply , 只要是PX 开头的等待事件，它通常与并行执行有关，这里分享

一个小心得，遇到并行执行，我们不要去查SID，而是要去查询SQL_ID

select (sysdate-logon_time)*24*60 minutes, username,machine,sid,serial#,event,p1,p2,p3,

row_wait_obj# obj#,row_wait_file# file#,row_wait_block# block#,sql_id,sql_child_number from

gv$session where sql_id='9tuyvbacwkc5k';

MINUTES USERNAME MACHINE SID SERIAL# EVENT P1 P2 P3 OBJ# FILE# BLOCK# SQL_ID SQL_CHILD_NUMBER 442 ADWU_OPTIMA_WE11_ETL bdhp4462 2237 180 PX Deq: Execution Msg 268566527 10600 0 -1 0 0 9tuyvbacwkc5k 1 442 ADWU_OPTIMA_WE11_ETL bdhp4462 2303 19672 read by other session 80 982406 1 61027312 80 982406 9tuyvbacwkc5k 1 442 ADWU_OPTIMA_WE11_ETL bdhp4462 2375 29536 read by other session 80 982342 1 61027312 80 982342 9tuyvbacwkc5k 1 442 ADWU_OPTIMA_WE11_ETL bdhp4462 2402 56932 PX Deq: Execution Msg 268566527 10600 0 -1 0 0 9tuyvbacwkc5k 1 442 ADWU_OPTIMA_WE11_ETL bdhp4462 4816 60934 read by other session 80 982406 1 61027312 80 982406 9tuyvbacwkc5k 1 442 ADWU_OPTIMA_WE11_ETL bdhp4462 4818 5555 unspecified wait event 0 0 0 61027312 80 982406 9tuyvbacwkc5k 1 442 ADWU_OPTIMA_WE11_ETL bdhp4462 4827 51041 PX Deq: Execution Msg 268566527 10601 0 -1 0 0 9tuyvbacwkc5k 1 443 ADWU_OPTIMA_WE11_ETL bdhp4462 4861 8746 PX Deq: Execute Reply 200 768 0 65764751 217 54823 9tuyvbacwkc5k 0 442 ADWU_OPTIMA_WE11_ETL bdhp4462 4867 42001 PX Deq: Execution Msg 268566527 10600 0 -1 0 0 9tuyvbacwkc5k 1

大家请看 SID=2303, SID=4816，他们都是read by other session 并且 p1, p2都相等.

我查看了执行计划，有个表会扫描8次，而且是并行的扫描，所以就出现了read by other session

我们只需要关闭并行，这个job就可以跑完了.最终这个JOB跑了不到30分钟

read by other session 等待事件的解释(Metalink 732891.1):

When information is requested from the database, Oracle will first read the data from disk into the database buffer cache. If two or more sessions request the same information, the first session will read the data into the buffer cache while other sessions wait. In previous versions, this wait was classified under the "buffer busy waits" event. However, in Oracle 10.1 and higher, this wait time is now broken out into the "read by other session" wait event. Excessive waits for this event are typically due to several processes repeatedly reading the same blocks, e.g. many sessions scanning the same index or performing full table scans on the same table. Tuning this issue is a matter of finding and eliminating this contention.

也就是说当多个session需要读取同一个block到buffer cache中(注意：该block没有在buffer cache中)

这个时候只能有一个session读，另外的session只能等待

在10g之前，这个等待事件叫做 buffer busy waits，到了10g之后，这个等待事件叫做read by other session

不过让我搞不明白的就是为什么并行扫描会读取block到 buffer cache中而不是 direct path read.莫非和跨节点并行扫描有关？