最近生产机上db2 sql replication 的capture端进程总是自动退出,现已经第5次出现了。

ibm的分析结果

从日志来看Capture异常停止地原因应该是capture应用程序在申请私有内存地时候,内存地址已经达到上限(3G)无法再分配内存导致。
V9.5开始DB2改为线程模式,instance_memory限制整个实例地内存大小(共享+私有),因为目前无法改变DB2使用内存的上限,我们只能减小共享内存大小来满足私有内存的增长(减小共享内存可能会降低DB2性能)。
解决方法:
1
)将实例升级到64位,并将instance_memory改为AUTOMATIC
2
)将instance_memory改为3G,减小共享内存总量,共享内存主要构成:(bufferpools + database heap + utility heap + locklist + packagecache  + SHEAPTHRES),并减小database_memory大小。
3)
instance_memory改为3G,self_tuning开关打开,,所有和内存相关参数设置为AUTOMATIC的内存堆将会根据系统负载自动调整内存使用量,若应用程序申请私有内存时DB2无法提供足够内存,那么可以从其他内存区域进行调整。若可调整的内存还不足以满足要求,可能还会报错。目前共享内存只要使用者是bufferpool,其大小是不可调整的,可减小bufferpool大小来释放共享内存使用量。


db2diag.log

2012-09-06-17.46.01.105000-300 E5839119H728       LEVEL: Error (OS)
PID     : 552                  TID  : 4884        PROC : db2syscs.exe
INSTANCE: DB2                  NODE : 000         DB   : sample

APPHDL  : 0-8907               APPID: *LOCAL.DB2.120907043724
AUTHID  : sample_CAP
EDUID   : 4884                 EDUNAME: db2agent (sample)
FUNCTION: DB2 UDB, SQO Memory Management, sqloLogMemoryCondition, probe:100
CALLED  : OS, -, VirtualAlloc
OSERR   : 8 "Not enough storage is available to process this command."
MESSAGE : Private memory and/or virtual address space exhausted
DATA #1 : Requested size, PD_TYPE_MEM_REQUESTED_SIZE, 4 bytes
16842752
DATA #2 : Current set size, PD_TYPE_SET_SIZE, 4 bytes
1636696064

2012-09-06-17.46.01.120000-300 E5839849H717       LEVEL: Warning
PID     : 552                  TID  : 4884        PROC : db2syscs.exe
INSTANCE: DB2                  NODE : 000         DB   : sample

APPHDL  : 0-8907               APPID: *LOCAL.DB2.120907043724
AUTHID  : sample_CAP
EDUID   : 4884                 EDUNAME: db2agent (sample)
FUNCTION: DB2 UDB, SQO Memory Management, sqloMemLogPoolConditions, probe:30
DATA #1 : <preformatted>
Out of memory failure for Private Heap.
Requested block size           : 16777216 bytes.
Physical heap size             : 17760256 bytes.
Configured heap size           : 2147483648 bytes.
Unreserved memory used by heap : 0 bytes.
Unreserved memory left in set  : 0 bytes.

capture日志错误信息

2012-09-06-17.51.30.378000 <logrd8::readTheLog> ASN8041D  "Capture" : "ASN_CCD" : "WorkerThread" : db2LogRead API is sending us backwards in theDB2 Log: First LSN is "0000:0000:0278:6982:f576" while Next Start LSN is "0000:0000:0000:0000:0000"
2012-09-06-17.51.30.378000 <logrd::readTheLog> ASN0005E  CAPTURE "ASN_CCD" : "WorkerThread". The Capture program encountered an error when reading the DB2 log.  The log sequence number is "0000:0000:0278:6982:F576", the SQLCODE is "-2038", and the reason code is "-2038".
2012-09-06-17.51.30.378000 <logrd::readTheLog> ASN8999D  "Capture" : "ASN_CCD" : "WorkerThread" :