下午开发人员告诉我,他们的应用程序断开连接了,于是我查看了告警日志有如下内容
Thu Jun 08
16:30:32 CST 2017
Process J000
died, see its trace file
Thu Jun 08
16:30:32 CST 2017
kkjcre1p: unable
to spawn jobq slave process
Thu Jun 08
16:30:32 CST 2017
Errors in file /oracle/admin/pmcpdp/bdump/pmcpdp1_cjq0_8324292.trc:
查看trace文件
$ more
/oracle/admin/pmcpdp/bdump/pmcpdp1_cjq0_8324292.trc:
/oracle/admin/pmcpdp/bdump/pmcpdp1_cjq0_8324292.trc
Oracle Database
10g Enterprise Edition Release 10.2.0.5.0 - 64bit Production
With the
Partitioning, Real Application Clusters, OLAP, Data Mining
and Real
Application Testing options
ORACLE_HOME =
/oracle/products/10.2/db
System
name: AIX
Node name: cmspdb1
Release: 1
Version: 6
Machine: 00F80C854C00
Instance name:
pmcpdp1
Redo thread
mounted by this instance: 1
Oracle process
number: 48
Unix process
pid: 8324292, image: oracle@cmspdb1 (CJQ0)
*** SERVICE
NAME:(SYS$BACKGROUND) 2017-06-08 16:30:32.727
*** SESSION
ID:(4359.1) 2017-06-08 16:30:32.726
*** 2017-06-08
16:30:32.726
Process J000 is
dead (pid=28508394, state=3):
*** 2017-06-08
16:32:43.735
Process J000 is
dead (pid=26674236, state=3):
*** 2017-06-08
16:32:48.736
Process J000 is
dead (pid=4457010, state=3):
*** 2017-06-08
16:32:53.7385%)
Process J000 is
dead (pid=26674250, state=3):
*** 2017-06-08
16:32:59.7390%)
Process J000 is
dead (pid=4457028, state=3):
*** 2017-06-08
16:33:04.741
Process J000 is
dead (pid=3408006, state=3):
*** 2017-06-08
16:33:09.742
Process J000 is
dead (pid=27133088, state=3):
*** 2017-06-08
16:35:04.743
Process J000 is
dead (pid=27591610, state=3):
*** 2017-06-08
16:35:10.749
Process J000 is
dead (pid=27591880, state=3):
*** 2017-06-08
16:35:15.751
Process J000 is
dead (pid=19137444, state=3):
*** 2017-06-08
16:35:20.752
Process J000 is
dead (pid=22872976, state=3):
$
cpdp1_cjq0_8324292.trc: END
我们知道,j000是任务调度进程CJQ0的slave进程,j000进程无法启动,也就是说CJQ0进程没有办法启动他的slave进程,从trace中的Oracle process
number: 48
Unix process
pid: 8324292, image: oracle@cmspdb1 (CJQ0)
可以看出,问题进程应该在CJQ0,也可用通过SQL查询找到问题进程
SQL> select program from v$process
where pid=48;
PROGRAM
------------------------------------------------
oracle@cmspdb2 (CJQ0)
然后我们来分析原因,首先是session,process是否足够
SQL> select
count(*) from v$process;
COUNT(*)
----------
593
SQL> show parameter
session;
NAME
------------------------------------
TYPE VALUE
--------------------------------------------
------------------------------
java_max_sessionspace_size
integer 0
java_soft_sessionspace_limit
integer 0
license_max_sessions
integer 0
license_sessions_warning
integer 0
logmnr_max_persistent_sessions
NAME
------------------------------------
TYPE VALUE
--------------------------------------------
------------------------------
integer 1
session_cached_cursors
integer 600
session_max_open_files
integer 10
sessions
integer 4405
shared_server_sessions
integer
SQL> select
count(*) from v$session;
COUNT(*)
----------
588
Process和session都没有问题,下面我们来看看job_queue_processes参数
SQL> show
parameter job_queue_processes;
NAME TYPE VALUE
------------------------------------
-------------------------------------------- ------------------------------
job_queue_processes integer 10
查看j000进程到底占用了多少。$ ps -ef
|grep ora_j |grep -v grep
oracle 27263908 1
0 17时10分05秒- 0:00 ora_j002_pmcpdp1
oracle 26215608 1 104 17时10分00秒- 0:29 ora_j000_pmcpdp1
oracle 26740142 1
0 17时10分00秒- 0:05 ora_j001_pmcpdp1
不幸的是,开发人员在告诉我他们的应用丢失数据库连接时,报错已经消失了,最晚的告警日志报错信息是在16点36分左右。
但是我们可以很容易推测,在空闲时期j000已经有3个了,10个是可能不够用的,所以我们要调整job_queue_processes的值预防下次出现此类错误
SQL> Alter
system set job_queue_processes=50 scope=both sid='*';
System altere