CASE：DB shutdown/open 过程中发生异常导致JOB不能自动执行

最新推荐文章于 2022-03-15 10:26:31 发布

zhou1862324

最新推荐文章于 2022-03-15 10:26:31 发布

阅读量1.4k

点赞数 1

分类专栏： Oracle Database 文章标签： database oracle

本文链接：https://blog.csdn.net/zhou1862324/article/details/17934461

版权

Oracle Database 专栏收录该内容

77 篇文章 0 订阅

订阅专栏

CASE：DB shutdown/open 过程中发生异常导致JOB不能自动执行

现象：
一个DB中的所有JOB在3月25日之后就不再自动运行，查询DBA_JOBS，发现LAST_DATE定格在3月25日之前。
SQL> show parameter job

NAME TYPE VALUE
------------------------------------ ----------- ------------------------------
job_queue_processes integer 5

SQL> select job,LAST_DATE,LAST_SEC,NEXT_DATE,NEXT_SEC,TOTAL_TIME,BROKEN,INTERVAL,what from dba_jobs;

JOB LAST_DATE LAST_SEC NEXT_DATE NEXT_SEC TOTAL_TIME BROKEN INTERVAL WHAT
---------- ----------- ---------------- ----------- ---------------- ---------- ------ ---------------------------------------- --------------------------------------------------
1 2013/3/25 1 17:26:39 2013/3/25 1 17:27:39 180875 N sysdate + 1 / (24 * 60) EMD_MAINTENANCE.EXECUTE_EM_DBMS_JOB_PROCS();
3 2013/3/25 6 06:10:18 2013/5/3 11 11:22:47 758193 N sysdate+1 P_IMG_CSU_STAT_INFO;
4 2013/3/22 1 12:07:20 2013/5/3 10 10:33:37 4165016 N sysdate + 1 p_IMG_csu_stat_daily_ex_exec;
5 2013/3/23 1 14:27:31 2013/4/10 6 06:00:00 13238996 N sysdate + 1 P_IMG_CSU_TWS_MONTHLY_EXEC;
122 2013/3/25 1 10:55:58 2013/4/10 1 10:55:58 140067 N sysdate+1 p_IMG_csu_stat_wt_daily_exec;
42 2013/3/25 1 17:23:29 2013/4/10 8 08:00:00 6803650 N SYSDATE+1/288 qpg_kj_tj;
62 2013/3/22 0 00:10:05 2013/4/10 0 00:10:00 7147390 N TRUNC(SYSDATE)+1/(24*6)+1 neimg.pack_qpg_IMG_dailytj.DAILY_STATISTICS;
63 2013/3/22 0 00:00:01

分析：
Step1,查看dba_jobs等视图
Step2,查看alert日志：
Mon Mar 25 09:05:22 2013
Errors in file /app/oracle/admin/neimg/bdump/neimg_j006_27877.trc: <===
ORA-12012: error on auto execute of job 42
ORA-01403: no data found
ORA-06512: at "neimg.qpg_KJ_TJ", line 33
ORA-06512: at line 1
Mon Mar 25 09:08:29 2013

通过alert日志的报错，追溯到neimg_j006_27877.trc，但没有有价值的信息。
Step3,从OS层面trace进程cjq0
[oracle@IMG_PIDB udump]$ ps -ef|grep cjq
oracle 4161 4051 0 08:30 pts/0 00:00:00 grep cjq
oracle 11984 1 0 Mar25 ? 00:00:02 ora_cjq0_neimg
[oracle@IMG_PIDB udump]$ strace -p 11984
Process 11984 attached - interrupt to quit
times(NULL) = 790492277
semtimedop(393217, 0x7fff39107150, 1, {1, 840000000}) = -1 EAGAIN (Resource temporarily unavailable)
times(NULL) = 790492461
times(NULL) = 790492461
semtimedop(393217, 0x7fff39107150, 1, {3, 0}) = -1 EAGAIN (Resource temporarily unavailable) <<====
getrusage(RUSAGE_SELF, {ru_utime={1, 664746}, ru_stime={0, 500923}, ...}) = 0
getrusage(RUSAGE_SELF, {ru_utime={1, 664746}, ru_stime={0, 500923}, ...}) = 0
times(NULL) = 790492761
times(NULL) = 790492761
semtimedop(393217, 0x7fff39107150, 1, {3, 0} <unfinished ...>
Process 11984 detached

semtimedop(393217, 0x7fff39107150, 1, {3, 0}) = -1 EAGAIN (Resource temporarily unavailable)
通过trace，发现自从3月25日DB重启后一直在semtimedop内存段’0x7fff39107150'

Step4,删除job 4，并重建，新的job号是292，但是新建的JOB 292仍然不能自动执行。列THIS_DATE，THIS_SEC值为空。

Step5,做10046,如下：
SQL> conn /as sysdba
Connected.
SQL> oradebug setospid 11984;
oradebug unlimit;
oradebug event 10046 trace name context forever,level 12;
oradebug tracefile_nameOracle pid: 13, Unix process pid: 11984, image: oracle@IMG_PIDB (CJQ0)
SQL> Statement processed.
SQL> Statement processed.
SQL>
/app/oracle/admin/neimg/bdump/neimg_cjq0_11984.trc
SQL> conn neimg
Enter password:
Connected.
SQL> exec DBMS_JOB.NEXT_DATE(292,sysdate+1/24/60); <===设定为1分钟后执行

PL/SQL procedure successfully completed.

SQL> commit;

Commit complete.

SQL>

Unix process pid: 11984, image: oracle@IMG_PIDB (CJQ0)

*** 2013-05-07 09:12:19.739
Received ORADEBUG command 'unlimit' from process Unix process pid: 30143, image:
Received ORADEBUG command 'event 10046 trace name context forever,level 12' from process Unix process pid: 30143, image:
WAIT #0: nam='rdbms ipc message' ela= 23719 timeout=237 p2=0 p3=0 obj#=-1 tim=1335829238050367
WAIT #0: nam='rdbms ipc message' ela= 2295845 timeout=235 p2=0 p3=0 obj#=-1 tim=1335829240346359
WAIT #0: nam='rdbms ipc message' ela= 2930432 timeout=300 p2=0 p3=0 obj#=-1 tim=1335829243276873
Received ORADEBUG command 'tracefile_name' from process Unix process pid: 30143, image:
WAIT #0: nam='rdbms ipc message' ela= 1843426 timeout=300 p2=0 p3=0 obj#=-1 tim=1335829245120391
WAIT #0: nam='rdbms ipc message' ela= 1084937 timeout=111 p2=0 p3=0 obj#=-1 tim=1335829246205459
*** 2013-05-07 09:12:31.115
WAIT #0: nam='rdbms ipc message' ela= 2930461 timeout=300 p2=0 p3=0 obj#=-1 tim=1335829249136013
WAIT #0: nam='rdbms ipc message' ela= 2930386 timeout=300 p2=0 p3=0 obj#=-1 tim=1335829252066575
WAIT #0: nam='rdbms ipc message' ela= 2930460 timeout=300 p2=0 p3=0 obj#=-1 tim=1335829254997126
WAIT #0: nam='rdbms ipc message' ela= 2931251 timeout=300 p2=0 p3=0 obj#=-1 tim=1335829257928469

并作了system dump：

=== ODM Data Collection ===

PROCESS 13:
----------------------------------------
SO: 0x1f218b2f0, type: 2, owner: (nil), flag: INIT/-/-/0x00
(process) Oracle pid=13, calls cur/top: 0x1f62141c0/0x1f62141c0, flag: (2) SYSTEM
int error: 0, call error: 0, sess error: 0, txn error 0
(post info) last post received: 0 0 33
last post received-location: ksrpublish
last process to post me: 1f41b0ab0 224 0
last post sent: 0 0 48
last post sent-location: ksoreq_reply
last process posted by me: 1fe1dc698 1 0
(latch info) wait_event=0 bits=0
Process Group: DEFAULT, pseudo proc: 0x1f31ae220
O/S info: user: oracle, term: UNKNOWN, ospid: 11984
OSD pid info: Unix process pid: 11984, image: oracle@IMG_PIDB (CJQ0)
Short stack dump: ksdxfstk()+32<-ksdxcb()+1547<-sspuser()+90<-<0x3d0180eb10>

----------------------------------------
SO: 0x1f220f3b8, type: 4, owner: 0x1f218b2f0, flag: INIT/-/-/0x00
(session) sid: 544 trans: (nil), creator: 0x1f218b2f0, flag: (51) USR/- BSY/-/-/-/-/-
DID: 0000-0000-00000000, short-term DID: 0000-0000-00000000
txn branch: (nil)
oct: 0, prv: 0, sql: (nil), psql: (nil), user: 0/SYS
waiting for 'rdbms ipc message' blocking sess=0x(nil) seq=47 wait_time=0 seconds since wait started=0
timeout=39, =0, =0
Dumping Session Wait History
for 'rdbms ipc message' count=1 wait_time=2373861
timeout=12c, =0, =0
for 'rdbms ipc message' count=1 wait_time=2930517
timeout=12c, =0, =0
for 'rdbms ipc message' count=1 wait_time=2930558
timeout=12c, =0, =0
for 'rdbms ipc message' count=1 wait_time=2930496
timeout=12c, =0, =0
for 'rdbms ipc message' count=1 wait_time=2930497
timeout=12c, =0, =0
for 'rdbms ipc message' count=1 wait_time=2930666
timeout=12c, =0, =0
for 'rdbms ipc message' count=1 wait_time=2930559
timeout=12c, =0, =0
for 'rdbms ipc message' count=1 wait_time=2931411
timeout=12c, =0, =0
for 'rdbms ipc message' count=1 wait_time=2930440
timeout=12c, =0, =0
for 'rdbms ipc message' count=1 wait_time=2930679
timeout=12c, =0, =0
temporary object counter: 0

还是没有一丝线索。
Step6,查看Metalink上是否有相关的BUG或者案例（搜索job start）。
Note = Jobs Not Executing Automatically [ID 313102.1]
==>
11) Check for unsuccessful shutdowns:

A shutdown immediate may get canceled because active sessions prevent the database close operation.

Please review the alert log for the last two shutdown / startups and the messages:
SHUTDOWN: Active sessions prevent database close operation

Please refer to
Note 434690.1 - Database Jobs Do Not Run After a Failed 'Shutdown Immediate'

Database Jobs Do Not Run After a Failed 'Shutdown Immediate' [ID 434690.1]
==>
Symptoms

After issuing a 'shutdown immediate' command, if the command does not execute after an hour, the Oracle Server automatically cancels the operation. A message will appear in the alert log as follows:

SHUTDOWN: Active sessions prevent database close operation

Unfortunately this has the added effect of shutting down all the job queue slaves as part of the shutdown process. Any jobs listed in the DBA_JOBS view will not run. This can affect a wide variety of functions, such as statistics gathering, replication propagation, materialized view refresh, etc.
Cause

This is expected behavior. Once the 'shutdown' command is issued, the shutdown must proceed; the shutdown process cannot be undone. This is because the instance is not expected to live after a shutdown is canceled.
那么是否可能是这两篇文档中提到的情况呢？客户恰恰是在3月25日对DB进行了重启，于是仔细看了那天shutdown和open的日志

Mon Mar 25 13:11:48 2013
Shutting down instance (immediate)
License high water mark = 51
Mon Mar 25 13:11:48 2013
Stopping Job queue slave processes
Mon Mar 25 13:11:53 2013
Process OS id : 14300 alive after kill
Errors in file
Mon Mar 25 13:11:58 2013
Job queue slave processes stopped
All dispatchers and shared servers shutdown
Mon Mar 25 13:12:09 2013
PMON failed to acquire latch, see PMON dump
PMON failed to acquire latch, see PMON dump
PMON failed to acquire latch, see PMON dump
PMON failed to acquire latch, see PMON dump
PMON failed to acquire latch, see PMON dump
...
Mon Mar 25 13:13:43 2013
Completed: ALTER DATABASE CLOSE NORMAL
Mon Mar 25 13:13:43 2013
ALTER DATABASE DISMOUNT
Completed: ALTER DATABASE DISMOUNT
ARCH: Archival disabled due to shutdown: 1089
Shutting down archive processes
Archiving is disabled
Archive process shutdown avoided: 0 active
ARCH: Archival disabled due to shutdown: 1089
Shutting down archive processes
Archiving is disabled
Archive process shutdown avoided: 0 active <===shutdown过程没有问题
Mon Mar 25 13:40:19 2013
Starting ORACLE instance (normal)
...
Mon Mar 25 17:27:03 2013
Shutting down instance (abort)
License high water mark = 55
Instance terminated by USER, pid = 11609
Mon Mar 25 17:27:22 2013
Starting ORACLE instance (normal)
...
Mon Mar 25 17:27:33 2013
Errors in file /app/oracle/admin/neimg/udump/neimg_ora_11994.trc:
ORA-00604: error occurred at recursive SQL level 2
ORA-01013: user requested cancel of current operation ==> 启动过程中有客户有执行CANCEL的操作，并且有ora-7445报出
Mon Mar 25 17:27:33 2013
Errors in file /app/oracle/admin/neimg/udump/neimg_ora_11994.trc:
ORA-07445: exception encountered: core dump [kkttrex()+1365] [SIGSEGV] [Address not mapped to object] [0x000000074] [] []
...
Mon Mar 25 17:27:51 2013
alter database open
Mon Mar 25 17:27:51 2013
ORA-1531 signalled during: alter database open... <===正常启动成功后会有'Completed: ALTER DATABASE OPEN'的信息，但这里没有
Mon Mar 25 17:33:30 2013
Shutting down archive processes
Mon Mar 25 17:33:35 2013
ARCH shutting down
ARC2: Archival stopped
所有极有可能是在DB启动中途的那次客户手动CANCEL引起的，建议客户尝试重启DB，看能否解决这个问题。
shutdown abort
startup restrict
shutdown immediate
startup
重启DB之后问题解决。
总结：看来DB在做SHUTDOWN或者OPEN中途不能随意CTRL+C执行CANCEL，否则会导致DB状态异常，影响系统的一些功能，出现各种各样诡异的问题。

zhou1862324

关注

1
点赞
踩
0

收藏

觉得还不错? 一键收藏
0
评论
CASE：DB shutdown/open 过程中发生异常导致JOB不能自动执行

CASE：DB shutdown/open 过程中发生异常导致JOB不能自动执行现象：一个DB中的所有JOB在3月25日之后就不再自动运行，查询DBA_JOBS，发现LAST_DATE定格在3月25日之前。SQL> show parameter jobNAME TYPE VALUE------------------------------------ --------
复制链接

扫一扫