DB2数据库内存耗尽故障处理经验分享

问题现象

P595服务器中有一个DB2数据库,在连接时,系统PAGING SPACE资源耗尽,从而宕机。

问题分析

问题初步分析

1.如果非计算内存太多,计算内存太少,有可能是这个问题导致计算内存不够用,使用到PAGING SPACE。可以使用vmo -L maxperm%命令来检查文件系统缓存参数的设置

2.如果maxperm%设置为20%左右(对于DB2环境来说,比较适合),那么是不是DB2参数/对象本身分配的内存过多造成计算内存不够用。需要仔细检查实例参数、数据库参数是否设置不当。

3.如果上述原因排除,那么分析db2mtrk得到的数据,分析DB2数据库在运行过程中是否分配了过多的计算内存(内存参数、缓冲池可能使用了AUTOMATIC设置;在这种情况下,内存是按需分配的,理论上可以无限大),如果是,那么分析是由于DB2实例、数据库占用的内存过多,还是代理进程(应用程序)占用的内存过多(比如使用动态数组,在死循环中不停的填充动态数组,导致动态数组无限增长,无论多少内存都能耗光),如果存在这个问题,那么继续分析.

可以使用如下命令跟踪DB2数据库使用内存的情况。

db2mtrk -i -d -p -r 300 500
>> db2mtrk.out
#每5分钟收集一次,收集500次

4.如果上述原因排除,那么结合ipcs得到的数据,分析AIX操作系统中除了DB2使用的内存段之外,还有没有其他的非DB2进程在使用大内存段占用过多的内存,如果有,那么继续分析是否是它导致的.

服务器内存使用

内存:8G

文件系统缓存设置:maxperm%=30

操作系统约消耗1G内存

所以留给DB2(实例、数据库、代理)使用的内存约为:8G-8G*30%-1G=4.5G

检查实例、数据库参数

Database Configuration for Database XXXX

Database configuration release level
= 0x0b00

Database release level
= 0x0b00

Database territory
= CN

Database code page
= 1386

Database code set
= GBK

Database country/region code
= 86

Database collating sequence
= UNIQUE

Alternate collating sequence
(ALT_COLLATE) =

Database page size
= 4096

Dynamic SQL Query management
(DYN_QUERY_MGMT) = DISABLE

Discovery support for this database
(DISCOVER_DB) = ENABLE

Restrict access
= NO

Default query optimization class
(DFT_QUERYOPT) = 9

Degree of parallelism
(DFT_DEGREE) = 1

Continue upon arithmetic exceptions
(DFT_SQLMATHWARN) = NO

Default refresh age
(DFT_REFRESH_AGE) = 0

Default maintained table types for opt (DFT_MTTB_TYPES) = SYSTEM

Number of frequent values retained
(NUM_FREQVALUES) = 10

Number of quantiles retained
(NUM_QUANTILES) = 20

Backup pending
= NO

Database is consistent
= YES

Rollforward pending
= NO

Restore pending
= NO

Multi-page file allocation enabled
= YES

Log retain for recovery status
= NO

User exit for logging status
= NO

Self tuning memory
(SELF_TUNING_MEM) = ON

Size of database shared memory (4KB)
(DATABASE_MEMORY) = AUTOMATIC

Database memory threshold
(DB_MEM_THRESH) = 10

Max storage for lock list (4KB)
(LOCKLIST) = AUTOMATIC

Percent. of lock lists per application
(MAXLOCKS) = AUTOMATIC

Package cache size (4KB)
(PCKCACHESZ) = AUTOMATIC

Sort heap thres for shared sorts (4KB) (SHEAPTHRES_SHR) = AUTOMATIC

Sort list heap (4KB)
(SORTHEAP) = AUTOMATIC

Database heap (4KB)
(DBHEAP) = 4137

Catalog cache size (4KB)
(CATALOGCACHE_SZ) = 579

Log buffer size (4KB)
(LOGBUFSZ) = 108

Utilities heap size (4KB)
(UTIL_HEAP_SZ) = 524288

Buffer pool size (pages)
(BUFFPAGE) = 1000

Max size of appl. group mem set (4KB) (APPGROUP_MEM_SZ) = 40933

Percent of mem for appl. group heap
(GROUPHEAP_RATIO) = 70

Max appl. control heap size (4KB)
(APP_CTL_HEAP_SZ) = 128

SQL statement heap (4KB)
(STMTHEAP) = 4096

Default application heap (4KB)
(APPLHEAPSZ) = 1024

Statistics heap size (4KB)
(STAT_HEAP_SZ) = 4384

Interval for checking deadlock (ms)
(DLCHKTIME) = 10000

Lock timeout (sec)
(LOCKTIMEOUT) = 11

Changed pages threshold
(CHNGPGS_THRESH) = 80

Number of asynchronous page cleaners
(NUM_IOCLEANERS) = AUTOMATIC

Number of I/O servers
(NUM_IOSERVERS) = AUTOMATIC

Index sort flag
(INDEXSORT) = YES

Sequential detect flag
(SEQDETECT) = YES

Default prefetch size (pages)
(DFT_PREFETCH_SZ) = AUTOMATIC

Track modified pages
(TRACKMOD) = OFF

Default number of containers
= 1

Default tablespace extentsize (pages)
(DFT_EXTENT_SZ) = 32

Max number of active applications
(MAXAPPLS) = AUTOMATIC

Average number of active applications
(AVG_APPLS) = AUTOMATIC

Max DB files open per application
(MAXFILOP) = 64

Log file size (4KB)
(LOGFILSIZ) = 40960

Number of primary log files
(LOGPRIMARY) = 50

Number of secondary log files
(LOGSECOND) = 50

Changed path to log files
(NEWLOGPATH) =

Path to log files
=

Overflow log path
(OVERFLOWLOGPATH) =

Mirror log path
(MIRRORLOGPATH) =
First active log file
=
Block log on disk full
(BLK_LOG_DSK_FUL) = NO

Percent max primary log space by transaction
(MAX_LOG) = 0

Num. of active log files for 1 active UOW(NUM_LOG_SPAN) = 0

Group commit count
(MINCOMMIT) = 1

Percent log file reclaimed before soft chckpt (SOFTMAX) = 2000

Log retain for recovery enabled
(LOGRETAIN) = OFF

User exit for logging enabled
(USEREXIT) = OFF

HADR database role
= STANDARD

HADR local host name
(HADR_LOCAL_HOST) =

HADR local service name
(HADR_LOCAL_SVC) =

HADR remote host name
(HADR_REMOTE_HOST) =

HADR remote service name
(HADR_REMOTE_SVC) =

HADR instance name of remote server
(HADR_REMOTE_INST) =

HADR timeout value
(HADR_TIMEOUT) = 120

HADR log write synchronization mode
(HADR_SYNCMODE) = NEARSYNC

First log archive method
(LOGARCHMETH1) = OFF

Options for logarchmeth1
(LOGARCHOPT1) =

Second log archive method
(LOGARCHMETH2) = OFF

Options for logarchmeth2
(LOGARCHOPT2) =

Failover log archive path
(FAILARCHPATH) =

Number of log archive retries on error
(NUMARCHRETRY) = 5

Log archive retry Delay (secs)
(ARCHRETRYDELAY) = 20

Vendor options
(VENDOROPT) =

Auto restart enabled
(AUTORESTART) = ON

Index re-creation time and redo index build
(INDEXREC) = SYSTEM (RESTART)

Log pages during index build
(LOGINDEXBUILD) = OFF

Default number of loadrec sessions
(DFT_LOADREC_SES) = 1

Number of database backups to retain
(NUM_DB_BACKUPS) = 12

Recovery history retention (days)
(REC_HIS_RETENTN) = 366

TSM management class
(TSM_MGMTCLASS) =

TSM node name
(TSM_NODENAME) =

TSM owner
(TSM_OWNER) =

TSM password
(TSM_PASSWORD) =

Automatic maintenance
(AUTO_MAINT) = ON
Automatic database backup
(AUTO_DB_BACKUP) = OFF

Automatic table maintenance
(AUTO_TBL_MAINT) = ON

Automatic runstats
(AUTO_RUNSTATS) = ON

Automatic statistics profiling
(AUTO_STATS_PROF) = OFF

Automatic profile updates
(AUTO_PROF_UPD) = OFF
Automatic reorganization
(AUTO_REORG) = OFF
Database Manager Configuration
Node type = Enterprise Server Edition with local and remote clients

Database manager configuration release level
= 0x0b00

CPU speed (millisec/instruction)
(CPUSPEED) = 2.676617e-07

Communications bandwidth (MB/sec)
(COMM_BANDWIDTH) = 1.000000e+02

Max number of concurrently active databases
(NUMDB) = 8

Federated Database System Support
(FEDERATED) = NO

Transaction processor monitor name
(TP_MON_NAME) =

Default charge-back account
(DFT_ACCOUNT_STR) =

Java Development Kit installation path
(JDK_PATH) = /home/sqllib/java/jdk64

Diagnostic error capture level
(DIAGLEVEL) = 3

Notify Level
(NOTIFYLEVEL) = 3

Diagnostic data directory path
(DIAGPATH) = /home/sqllib/db2dump

Default database monitor switches

Buffer pool
(DFT_MON_BUFPOOL) = ON

Lock
(DFT_MON_LOCK) = ON

Sort
(DFT_MON_SORT) = ON

Statement
(DFT_MON_STMT) = ON

Table
(DFT_MON_TABLE) = ON

Timestamp
(DFT_MON_TIMESTAMP) = ON

Unit of work
(DFT_MON_UOW) = OFF

Monitor health of instance and databases
(HEALTH_MON) = ON

SYSADM group name
(SYSADM_GROUP) = DB2GRP1

SYSCTRL group name
(SYSCTRL_GROUP) =

SYSMAINT group name
(SYSMAINT_GROUP) =

SYSMON group name
(SYSMON_GROUP) =

Client Userid-Password Plugin
(CLNT_PW_PLUGIN) =

Client Kerberos Plugin
(CLNT_KRB_PLUGIN) =

Group Plugin
(GROUP_PLUGIN) =

GSS Plugin for Local Authorization
(LOCAL_GSSPLUGIN) =

Server Plugin Mode
(SRV_PLUGIN_MODE) = UNFENCED

Server List of GSS Plugins
(SRVCON_GSSPLUGIN_LIST) =

Server Userid-Password Plugin
(SRVCON_PW_PLUGIN) =

Server Connection Authentication
(SRVCON_AUTH) = NOT_SPECIFIED

Database manager authentication
(AUTHENTICATION) = SERVER

Cataloging allowed without authority
(CATALOG_NOAUTH) = NO

Trust all clients
(TRUST_ALLCLNTS) = YES

Trusted client authentication
(TRUST_CLNTAUTH) = CLIENT

Bypass federated authentication
(FED_NOAUTH) = NO

Default database path
(DFTDBPATH) = /home

Database monitor heap size (4KB)
(MON_HEAP_SZ) = 90

Java Virtual Machine heap size (4KB)
(JAVA_HEAP_SZ) = 2048

Audit buffer size (4KB)
(AUDIT_BUF_SZ) = 0

Size of instance shared memory (4KB)
(INSTANCE_MEMORY) = AUTOMATIC

Backup buffer default size (4KB)
(BACKBUFSZ) = 1024

Restore buffer default size (4KB)
(RESTBUFSZ) = 1024

Sort heap threshold (4KB)
(SHEAPTHRES) = 0

Directory cache support
(DIR_CACHE) = YES

Application support layer heap size (4KB)
(ASLHEAPSZ) = 15

Max requester I/O block size (bytes)
(RQRIOBLK) = 32767

Query heap size (4KB)
(QUERY_HEAP_SZ) = 1000

Workload impact by throttled utilities(UTIL_IMPACT_LIM) = 10
Priority of agents
(AGENTPRI) = SYSTEM

Max number of existing agents
(MAXAGENTS) = 400

Agent pool size
(NUM_POOLAGENTS) = 400

Initial number of agents in pool
(NUM_INITAGENTS) = 0

Max number of coordinating agents
(MAX_COORDAGENTS) = (MAXAGENTS - NUM_INITAGENTS)

Max no. of concurrent coordinating agents
(MAXCAGENTS) = MAX_COORDAGENTS

Max number of client connections
(MAX_CONNECTIONS) = MAX_COORDAGENTS

Keep fenced process
(KEEPFENCED) = YES

Number of pooled fenced processes
(FENCED_POOL) = MAX_COORDAGENTS

Initial number of fenced processes
(NUM_INITFENCED) = 0

Index re-creation time and redo index build
(INDEXREC) = RESTART

Transaction manager database name
(TM_DATABASE) = 1ST_CONN

Transaction resync interval (sec)
(RESYNC_INTERVAL) = 180

SPM name
(SPM_NAME) =

SPM log size
(SPM_LOG_FILE_SZ) = 256

SPM resync agent limit
(SPM_MAX_RESYNC) = 20

SPM log path
(SPM_LOG_PATH) =

TCP/IP Service name
(SVCENAME) = 60009

Discovery mode
(DISCOVER) = SEARCH

Discover server instance
(DISCOVER_INST) = ENABLE

Maximum query degree of parallelism
(MAX_QUERYDEGREE) = 7

Enable intra-partition parallelism
(INTRA_PARALLEL) = YES

Maximum Asynchronous TQs per query
(FEDERATED_ASYNC) = 0

No. of int. communication buffers(4KB)(FCM_NUM_BUFFERS) = AUTOMATIC

No. of int. communication channels
(FCM_NUM_CHANNELS) = AUTOMATIC

Node connection elapse time (sec)
(CONN_ELAPSE) = 10

Max number of node connection retries (MAX_CONNRETRIES) = 5

Max time difference between nodes (min) (MAX_TIME_DIFF) = 60

db2start/db2stop timeout (min)
(START_STOP_TIME) = 10

从以上实例、数据库参数分析,没有任何相关参数设置不当,导致分配了过多的内存。

DB2诊断日志分析

问题数据库版本为V9.1,DB2从版本9开始支持数据库内存自动调优。本数据库大部分内存参数也是使用AUTOMATIC方式由数据库引擎自动计算决定(由db2stmm进程进行实际调整),通过上面的从诊断日志中摘取的片段也可以看出来,编号①~⑤表示5个日志事件,日志①②表示db2stmm进程将IBMDEFAULTBP缓冲池根据需要调整了2次大小(第一次从1000调整到1512、第二次从1512调整到2280),日志③表示db2stmm进程将SORTHEAP从61调整到65,到这时为止,数据库的内存使用没有任何问题,问题出在日志④上,有人人为的将IBMDEFAULTBP缓冲池从5160一下子调整到了1300000——为什么说是人为的呢?因为这次调整不是由db2stmm进程完成,而是由db2agent代理程序(应用程序在数据库服务器中的替身)完成 ,而IBMDEFAULTBP缓冲池是在Database_memory总大小中分配的,而Database_memory被设置为AUTOMATIC,所以db2stmm进程必须先将Database_memory调整到足够大,以容纳IBMDEFAULTBP缓冲池(1300000*4K 大约为 5G),于是日志⑤便证明了我们的推断——db2stmm进程将Database_memory从789400 (789400*4K 约等于3G) 一下子调整到了2373371(2373371*4K 约等于9G),而操作系统留给DB2使用的内存只有4.5G左右,所以这时数据库进程将使用PAGING SPACE(也就是虚拟内存,相当于物理内存的扩展)来满足需要,一旦进程使用到了PAGING SPACE,那么操作系统将发生大量的PAGE IN/PAGE OUT操作,PAGE IN/OUT操作将导致系统震荡,响应缓慢,操作系统出于自我保护,甚至会杀死DB2服务进程,最严重的情况是:当PAGING SPACE也用完时,系统出于自我保护目的,会自动宕机。我们这个数据库由于最后PAGING SPACE也被用完,最终导致了系统宕机。

问题解决

通过以上的分析,我们知道IBMDEFAULTBP设置的太大了(1300000*4K),连接数据库时导致系统宕机。所以我们下面设法将IBMDEFAULTBP改小。

操作步骤:

db2set DB2_OVERRIDE_BPF=10000
#使IBMDEFAULTBP使用这个值

db2 terminate

db2stop

db2start

db2 connect to db
#连上db

db2 alter bufferpool ibmdefaultbp numblockpages 0
#原来的块SIZE太大,我们这里禁用它

db2 force applications all

db2 connect to db

db2 alter bufferpool ibmdefaultbp immediate size 50000
#将SIZE改小

db2set DB2_OVERRIDE_BPF= #设置为空,还原回去

db2 terminate

db2 force applications all

db2stop

db2start

OK!!!!

评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值