今天碰到一个比较严重的bug,导致了RAC环境的一个实例重新启动。
在alert文件中,错误信息如下:
Wed Oct 21 11:15:19 2009
Errors in file /data/oracle/admin/newtrade/bdump/newtrade2_lmon_16650.trc:
ORA-00600: internal error code, arguments: [kclexpand_5], [], [], [], [], [], [], []
Wed Oct 21 11:15:19 2009
Errors in file /data/oracle/admin/newtrade/bdump/newtrade2_lmon_16650.trc:
ORA-00600: internal error code, arguments: [kclexpand_5], [], [], [], [], [], [], []
Wed Oct 21 11:15:19 2009
LMON: terminating instance due to error 481
Wed Oct 21 11:15:20 2009
Errors in file /data/oracle/admin/newtrade/bdump/newtrade2_lms0_16654.trc:
ORA-00481: LMON process terminated with error
Wed Oct 21 11:15:20 2009
Errors in file /data/oracle/admin/newtrade/bdump/newtrade2_lms1_16658.trc:
ORA-00481: LMON process terminated with error
Wed Oct 21 11:15:20 2009
Errors in file /data/oracle/admin/newtrade/bdump/newtrade2_mman_16662.trc:
ORA-00481: LMON process terminated with error
Wed Oct 21 11:15:20 2009
Trace dumping is performing id=[cdmp_20091021111520]
Wed Oct 21 11:15:20 2009
Errors in file /data/oracle/admin/newtrade/bdump/newtrade2_lmd0_16652.trc:
ORA-00481: LMON process terminated with error
Wed Oct 21 11:15:21 2009
Shutting down instance (abort)
License high water mark = 129
Wed Oct 21 11:15:24 2009
Instance terminated by LMON, pid = 16650
Wed Oct 21 11:15:26 2009
Instance terminated by USER, pid = 6371
Wed Oct 21 11:15:44 2009
Starting ORACLE instance (normal)
Wed Oct 21 11:16:04 2009
LICENSE_MAX_SESSION = 0
LICENSE_SESSIONS_WARNING = 0
Interface type 1 ce1 10.0.1.0 configured from OCR for use as a cluster interconnect
Interface type 1 ce0 172.0.2.0 configured from OCR for use as a public interface
Picked latch-free SCN scheme 3
Autotune of undo retention is turned on.
LICENSE_MAX_USERS = 0
SYS auditing is disabled
ksdpec: called for event 13740 prior to event group initialization
Starting up ORACLE RDBMS Version: 10.2.0.3.0.
System parameters with non-default values:
processes = 500
sessions = 555
__shared_pool_size = 7482638336
__large_pool_size = 16777216
__java_pool_size = 16777216
__streams_pool_size = 0
spfile = /dev/vx/rdsk/datavg/newtrade_spfile
backup_tape_io_slaves = FALSE
sga_target = 20971520000
control_files = /dev/vx/rdsk/datavg/newtrade_control1, /dev/vx/rdsk/datavg/newtrade_control2, /dev/vx/rdsk/datavg/newtrade_control3
control_file_record_keep_time= 30
db_block_size = 16384
__db_cache_size = 13438550016
compatible = 10.2.0.3.0
log_archive_dest_1 = LOCATION=/data1/oracle/oradata/newtrade/archivelog
log_archive_dest_2 = service=newtrade1
log_archive_local_first = FALSE
log_archive_format = %t_%s_%r.dbf
db_file_multiblock_read_count= 16
cluster_database = TRUE
cluster_database_instances= 2
thread = 2
instance_number = 2
undo_management = AUTO
undo_tablespace = UNDOTBS2
undo_retention = 10800
remote_login_passwordfile= EXCLUSIVE
db_domain = us.oracle.com
dispatchers = (PROTOCOL=TCP) (SERVICE=newtradeXDB)
local_listener = (ADDRESS = (PROTOCOL = TCP)(HOST = 172.0.2.59)(PORT = 1521))
remote_listener = LISTENERS_NEWTRADE
job_queue_processes = 10
background_dump_dest = /data/oracle/admin/newtrade/bdump
user_dump_dest = /data/oracle/admin/newtrade/udump
core_dump_dest = /data/oracle/admin/newtrade/cdump
audit_file_dest = /data/oracle/admin/newtrade/adump
db_name = newtrade
open_cursors = 5000
pga_aggregate_target = 4294967296
Cluster communication is configured to use the following interface(s) for this instance
10.0.1.4
Wed Oct 21 11:16:05 2009
cluster interconnect IPC version:Oracle UDP/IP (generic)
IPC Vendor 1 proto 2
PMON started with pid=3, OS id=7660
DIAG started with pid=5, OS id=7662
PSP0 started with pid=7, OS id=7664
LMON started with pid=9, OS id=7666
LMD0 started with pid=2, OS id=7668
LMS0 started with pid=4, OS id=7670
LMS1 started with pid=11, OS id=7674
MMAN started with pid=13, OS id=7678
DBW0 started with pid=6, OS id=7680
DBW1 started with pid=15, OS id=7693
LGWR started with pid=17, OS id=7713
CKPT started with pid=19, OS id=7715
SMON started with pid=21, OS id=7717
RECO started with pid=23, OS id=7719
CJQ0 started with pid=8, OS id=7721
MMON started with pid=25, OS id=7723
Wed Oct 21 11:16:06 2009
starting up 1 dispatcher(s) for network address '(ADDRESS=(PARTIAL=YES)(PROTOCOL=TCP))'...
MMNL started with pid=27, OS id=7725
Wed Oct 21 11:16:06 2009
starting up 1 shared server(s) ...
Wed Oct 21 11:16:06 2009
lmon registered with NM - instance id 2 (internal mem no 1)
Wed Oct 21 11:16:10 2009
Reconfiguration started (old inc 0, new inc 34)
List of nodes:
0 1
Global Resource Directory frozen
* allocate domain 0, invalid = TRUE
Communication channels reestablished
* domain 0 valid = 1 according to instance 0
Wed Oct 21 11:16:10 2009
Master broadcasted resource hash value bitmaps
Non-local Process blocks cleaned out
Wed Oct 21 11:16:10 2009
LMS 1: 0 GCS shadows cancelled, 0 closed
Wed Oct 21 11:16:10 2009
LMS 0: 0 GCS shadows cancelled, 0 closed
Set master node info
Submitted all remote-enqueue requests
Dwn-cvts replayed, VALBLKs dubious
All grantable enqueues granted
Wed Oct 21 11:16:11 2009
LMS 0: 0 GCS shadows traversed, 0 replayed
Wed Oct 21 11:16:11 2009
LMS 1: 0 GCS shadows traversed, 0 replayed
Wed Oct 21 11:16:11 2009
Submitted all GCS remote-cache requests
Fix write in gcs resources
Reconfiguration complete
LCK0 started with pid=14, OS id=7992
Wed Oct 21 11:16:16 2009
ALTER DATABASE MOUNT
Wed Oct 21 11:16:20 2009
Setting recovery target incarnation to 1
Wed Oct 21 11:16:20 2009
Successful mount of redo thread 2, with mount id 2817896731
Wed Oct 21 11:16:20 2009
Database mounted in Shared Mode (CLUSTER_DATABASE=TRUE)
Completed: ALTER DATABASE MOUNT
Wed Oct 21 11:16:21 2009
ALTER DATABASE OPEN
Picked broadcast on commit scheme to generate SCNs
Wed Oct 21 11:16:21 2009
LGWR: STARTING ARCH PROCESSES
ARC0 started with pid=16, OS id=8141
Wed Oct 21 11:16:21 2009
ARC0: Archival started
ARC1: Archival started
LGWR: STARTING ARCH PROCESSES COMPLETE
ARC1 started with pid=31, OS id=8143
Wed Oct 21 11:16:21 2009
Thread 2 opened at log sequence 5162
Current log# 6 seq# 5162 mem# 0: /dev/vx/rdsk/datavg/newtrade_redo2_3_1_1g
Current log# 6 seq# 5162 mem# 1: /dev/vx/rdsk/datavg/newtrade_redo2_3_2_1g
Successful open of redo thread 2
Wed Oct 21 11:16:21 2009
MTTR advisory is disabled because FAST_START_MTTR_TARGET is not set
Wed Oct 21 11:16:21 2009
ARC1: Becoming the 'no FAL' ARCH
ARC1: Becoming the 'no SRL' ARCH
Wed Oct 21 11:16:21 2009
ARC0: Becoming the heartbeat ARCH
Wed Oct 21 11:16:21 2009
SMON: enabling cache recovery
Wed Oct 21 11:16:26 2009
Successfully onlined Undo Tablespace 4.
Wed Oct 21 11:16:26 2009
SMON: enabling tx recovery
Wed Oct 21 11:16:26 2009
Database Characterset is ZHS16GBK
replication_dependency_tracking turned off (no async multimaster replication found)
Starting background process QMNC
QMNC started with pid=18, OS id=8302
Wed Oct 21 11:16:28 2009
Completed: ALTER DATABASE OPEN
在对应的trace文件中包含下面的信息:
Dump event group for SESSION
Dump event group for SYSTEM
DYNAMICALLY MODIFIED PARAMETERS:
control_file_record_keep_time= 30
log_archive_dest_1 = LOCATION=/data1/oracle/oradata/newtrade/archivelog
log_archive_dest_2 = service=newtrade1
log_archive_local_first = FALSE
undo_retention = 10800
local_listener = (ADDRESS = (PROTOCOL = TCP)(HOST = 172.0.2.59)(PORT = 1521))
remote_listener = LISTENERS_NEWTRADE
open_cursors = 5000
error 600 detected in background process
ORA-00600: internal error code, arguments: [kclexpand_5], [], [], [], [], [], [], []
ksuitm: waiting for [5] seconds before killing DIAG
由于ORA-600错误出现后,Oracle开始结束后台的进程,且上面的trace文件中包含了ksuitm: waiting for [5] seconds before killing DIAG的信息,不难推断出是由于ORA-600(kclexpand_5)错误的出现,才导致了当前节点上实例的重启。
查询METALINK,果然是Oracle的bug,文档描述为:Doc ID: 6658484.8。在RAC节点间获取主站点信息时出现的问题,导致了其中一个实例被重新启动。这个bug影响10.2.0.3和10.2.0.4版本。在10.2.0.5和11.1.0.7中,这个问题被解决。
引发错误的具体还不清楚,因此暂时不好确定这个问题出现的几率。不过这个bug会导致一个实例重启,应该说对数据库环境的影响还是很大的。
Oracle虽然提供了单独的补丁,但是提供的平台十分有限,以SOLARIS SPARC 64平台为例,如果要打这个补丁,数据库的版本必须是10.2.0.4。
来自 “ ITPUB博客 ” ,链接:http://blog.itpub.net/4227/viewspace-617117/,如需转载,请注明出处,否则将追究法律责任。
转载于:http://blog.itpub.net/4227/viewspace-617117/