Db2性能问题:临时表空间太大,导致连不上数据库

问题现象:

某Db2系统经常出现连不上数据库的情况(连库一直HANG住),过一段时间自己就好了。期间观察到有latch现象。于是收集了db2pd -latch和db2pd -stack all的数据。

数据分析:

1. 首先查看latch wait现象,有109个应用在等SQLO_LT_sqeLocalDatabase__dblatch,这个latch是连库所必须的, 就是因为等这个latch,所以连不上库:

qingsong@db2a:~$ grep -i SQLO_LT_sqeLocalDatabase__dblatch latch.out | wc -l
109

等这个latch的应用的stack显示在尝试连库:
<StackTrace>
-------Frame------ ------Function + Offset------
0x09000000078484FC sqloXlatchConflict + 0x70
0x09000000078486C0 sqloXlatchConflict@glue1AA + 0x78
0x0900000007A56F48 Start UsingLocalDatabase__8sqeDBMgrFP8SQLE_BWAP8sqeAgentRccP8sqlo_gmt + 0xC6C
0x0900000007A53510 AppStartUsing__14sqeApplicationFP8SQLE_BWAP8sqeAgentcT3P5sqlcaPc + 0x17C
0x0900000007A8B460 AppLocalStart__14sqeApplicationFP14db2UCinterface + 0x49C
0x0900000007994828 sqlelostWrp__FP14db2UCinterface + 0x3C
0x09000000079948D8 sqleUC engnInit__FP14db2UCinterfaceUs + 0x60
0x0900000007A94D80 sqleUC agentConnect + 0x2A8
0x0900000007A96294 sqljs ConnectAttach__FP13sqljsDrdaAsCbP14db2UCinterface + 0x104
0x0900000007A958E4 sqljs_ddm_accsec__FP14db2UCinterfaceP13sqljDDMObject + 0x340
0x0900000007A9550C sqljsParseConnect__FP13sqljsDrdaAsCbP13sqljDDMObjectP14db2UCinterface + 0x50
0x0900000007A95450 .sqljsParse.fdpr.clone.0__FP13sqljsDrdaAsCbP14db2UCinterfaceP8sqeAgentb + 0x1C
0x0900000007940BA4 @63@sqljsSqlam__FP14db2UCinterfaceP8sqeAgentb + 0x2D4
0x0900000007ACF5DC @63@sqljsDriveRequests__FP8sqeAgentP14db2UCconHandle + 0xB4
0x0900000007ACF290 @63@sqljsDrdaAsInnerDriver__FP18SQLCC_INITSTRUCT_Tb + 0x2D0
0x0900000007ACED1C sqljsDrdaAsDriver__FP18SQLCC_INITSTRUCT_T + 0x100
0x0900000007796D1C RunEDU__8sqeAgentFv + 0x2F4
0x0900000007793418 EDUDriver__9sqzEDUObjFv + 0xF4
0x090000000778A1CC sqloEDUEntry + 0x264
</StackTrace>
<LatchInformation>

Waiting on latch type: (SQLO_LT_sqeLocalDatabase__dblatch) - Address: (7800000002eb208), Line: 1013, File: sqle_database_services.C

</LatchInformation>

2. 再看hold这个latch的,都是同一个应用,200876,看看这个应用的stack:

xxxxxxxx.200876.000.stack.txt
<StackTrace>
-------Frame------ ------Function + Offset------
0x0900000000007758 stat64 + 0x18
0x0900000007983800 sqlofindn2 + 0x84
0x09000000082745AC sqlofindn2@glue1F2 + 0x7C
0x0900000007B8C20C next__13SQLB_FIND_BUFFPP16sqlo_filefindbuf + 0x24
0x0900000009283660 sqlb MonSMSGetContainerPages__FP12SQLB_POOL_CBiPUi + 0x214
0x0900000009564370 sqlb SnapshotTablespace__FRC16sqlm_bp_int_infoP16sqeLocalDatabaseCUiT3P15sqlm_tablespaceP19sqm_snapshot_bufferP16sqlm_header_infob + 0xB0
0x09000000092ABE98 @92@write_sqlm_ts__FPC16sqm_agent_entityP16 sqeLocalDatabaseCUiT3P19sqm_snapshot_bufferP16sqlm_header_info + 0x200
0x09000000092AB838 @92@snap_tspaces_for_db__FP16sqm_agent_entityP19sqm_snapshot_bufferP16sqeLocalDatabaseP14sqlm_collectedPCcCUiPP22sqlm_tablespace_headerP16sqlm_header_info + 0x420
0x0900000009512DB0 sqlmonssagnt__FUi13sqm_entity_idP6sqlmaiT1PvP14sqlm_collectedUsT7P5sqlca + 0xA8
0x090000000841D444 sqlmonssbackend__FP12SQLE_DB2RA_T + 0x22C
0x0900000008D9F2BC sqlesrvr__FP14db2UCinterface + 0x678
0x0900000007A4C6CC sqleMappingFnServer__FP5sqldaP5sqlca + 0x1D0
0x0900000007A4BFD4 sqlerKnownProcedure__FiPcPiP5sqldaT4P13sqlerFmpTableP8sqeAgentP5sqlca + 0x344
0x0900000007A25694 sqlerCallDL__FP14db2UCinterfaceP9UCstpInfo + 0x580
0x0900000007A25C64 .sqljs_ddm_excsqlstt.fdpr.clone.104__FP14db2UCinterfaceP13sqljDDMObject + 0xC0
0x090000000794213C sqljsParseRdbAccessed__FP13sqljsDrdaAsCbP13sqljDDMObjectP14db2UCinterface + 0x114
0x0900000007941EE4 .sqljsParse.fdpr.clone.0__FP13sqljsDrdaAsCbP14db2UCinterfaceP8sqeAgentb + 0x6DC
0x0900000007940BA4 @63@sqljsSqlam__FP14db2UCinterfaceP8sqeAgentb + 0x2D4
0x0900000007ACF5DC @63@sqljsDriveRequests__FP8sqeAgentP14db2UCconHandle + 0xB4
0x0900000007ACF290 @63@sqljsDrdaAsInnerDriver__FP18SQLCC_INITSTRUCT_Tb + 0x2D0
0x0900000007ACED1C sqljsDrdaAsDriver__FP18SQLCC_INITSTRUCT_T + 0x100
0x0900000007796D1C RunEDU__8sqeAgentFv + 0x2F4
0x0900000007793418 EDUDriver__9sqzEDUObjFv + 0xF4
0x090000000778A1CC sqloEDUEntry + 0x264
</StackTrace>

可以通过stack看到,它是在做表空间的snapshot,进一步来讲,MonSMSGetContainerPages是在 MONITOR SMS 表空间时获取容器(CONTAINER)的数据页(PAGE)。在Db2的设计中,snapshot是要获取dblatch的。所以,根据以上信息可以推断出连不上库的原因是临时表空间页数过多,导致snapshot获取它的容器信息时间过长,没有释放连库所需要的latch。

3. 再看一下其他应用的stack信息,搜索Temp关键字,发现有75个应用都在写临时表空间。可以由stack看到,是卡在了pwrite64,也就是写临时数据。
qingsong@db2a:~$ grep -i "sqldInsertTemp" -l *.stack.txt | wc -l
75

以xxxxxxxx.150760.000.stack.txt为例
<StackTrace>
-------Frame------ ------Function + Offset------
0x090000000002A678 pwrite64 + 0x38
0x09000000078AFC7C .sqloseekwrite64.fdpr.clone.29 + 0x134
0x09000000078AFABC sqloWriteBlocks + 0xD4
0x09000000078AF960 sqlbWriteBlocks__FP16SqlbOpenFileInfoPvlUlUiPUlP12SQLB_GLOBALS + 0x50
0x09000000078AE83C sqlbWritePageToDisk__FP11SQLO_IO_REQP16SqlbOpenFileInfoP12SQLB_GLOBALS + 0x190
0x09000000078AE60C sqlbWritePage + 0x100
0x09000000078AE45C sqlbWritePage@glue4FC + 0x78
0x09000000078D0728 .@71@sqlbCheckVictimSlot.fdpr.clone.1873__FP12SQLB_GLOBALSP15SQLB_BufferPoolP19SQLB_PAGE_VICTIM_CBP11SQLB_FIX_CBP8SQLB_BPDP13SQLB_BPD_BASEPP8SQLB_BPD + 0x1C
0x09000000078CEDF4 .sqlbFreeUpSlot.fdpr.clone.231__FP11SQLB_FIX_CBPUiPP8SQLB_BPD + 0xC68
0x09000000078CF1A0 sqlbGetVictimSlot__FP11SQLB_FIX_CBiPUiPP8SQLB_BPDUl + 0x250
0x09000000078D3970 .sqlb GetPageFromDisk.fdpr.clone.101__FP11SQLB_FIX_CBi + 0xD0
0x09000000078D637C sqlbfix__FP11SQLB_FIX_CB + 0xCC0
0x09000000078D6A68 @147@497002@sqlbFixPage__FP11SQLB_FIX_CBiUiT2 + 0x54
0x09000000078B5664 sqlbFixNextNewPage__FP11SQLB_FIX_CBbUiT3@glueDB2 + 0xE0
0x0900000007972D58 sqld AppendTempRow__FP8sqeAgentPciT3P8SQLZ_RIDP8SQLD_TCB + 0x2E0
0x0900000007972820 @101@sqld InsertTemp__FP8sqeAgentP8SQLD_TCBUlT3PP10SQLD_VALUEP8SQLZ_RID + 0x1B0
0x09000000078868C4 sqld RowInsert__FP8sqeAgentUsT2UcUliPP10SQLD_VALUEP13SQLD_TDATARECP8SQLZ_RID + 0x30
0x090000000A3016AC sqlri_hsjnFlushSinglePartition__FP8sqlrr_cbP11sqlri_hsjnoUiP20sqlri_hsjnTupleBlock + 0x97C
0x090000000A300188 @44@sqlri_hsjnFlushBlocks__FP8sqlrr_cbP11sqlri_hsjnoP18sqlri_shared_hsjnoPP20sqlri_hsjnTupleBlockliT6 + 0x5E8
0x09000000099F5DB8 sqlri_hsjnPartitionFull__FP8sqlrr_cbiP25sqlri_hsjnPartitionBucketP11sqlri_hsjnoP18sqlri_shared_hsjno + 0x43C
0x090000000788F824 sqlri_hsjnNewTuple__FP8sqlrr_cbP11sqlri_hsjno + 0x24
0x090000000788FF64 sqlrihsjnpd__FP8sqlrr_cb + 0x84
0x09000000078FD2EC sqldEvalDataPred__FP13SQLD_DFM_WORKPUlP10SQLD_DPRED + 0x240
0x09000000078FA970 sqldReadNorm__FP13SQLD_DFM_WORKl + 0x540
0x09000000078A0248 @101@sqldFetchNext__FP13SQLD_DFM_WORK + 0x190
0x09000000078A0938 sqldFetchNext__FP13SQLD_DFM_WORK@glueD8F + 0x74
0x09000000078A0B64 sqldfrd__FP13SQLD_DFM_WORK + 0x170
0x09000000078EB5C0 sqldfrd__FP13SQLD_DFM_WORK@glue5A9 + 0x74
0x09000000078EB400 sqldRowFetch__FP8sqeAgentP8SQLD_CCBUlT3PP10SQLD_VALUEP8SQLZ_RIDT3P12SQLD_ID_LISTP9SQLP_LSN8 + 0xF0
0x09000000078ECDEC sqlritaSimplePerm__FP8sqlrr_cb + 0x184
0x09000000078F3624 sqlriSectInvoke__FP8sqlrr_cbP12sqlri_opparm + 0x30
0x09000000080A8098 sqlrr_dss_router__FP8sqlrr_cb + 0x6F4
0x090000000937FAA8 sqlrr_subagent_router__FP8sqeAgentP12SQLE_DB2RA_T + 0x5CC
0x09000000086F0FE4 sqleSubRequestRouter__FP8sqeAgentPUiT2 + 0x3FC
0x09000000075E4F60 sqleProcessSubRequest__FP8sqeAgent + 0x2D5C
0x0900000007D05BEC RunEDU__8sqeAgentFv + 0xEC
0x0900000007793418 EDUDriver__9sqzEDUObjFv + 0xF4
0x090000000778A1CC sqloEDUEntry + 0x264
</StackTrace>

<LatchInformation>
Holding Latch type: (SQLO_LT_SQLB_POOL_CB__writeLatch) - Address: (7000000490fb838), Line: 2476, File: sqlbpacc.C HoldCount: 1
Holding Latch type: (SQLO_LT_SQLD_APP_CB__inDMSlatch) - Address: (700000027860098), Line: 7052, File: /view/db2_v97fp8_aix64_s130316/vbs/engn/include/sqldinlines.h HoldCount: 1
</LatchInformation>


结论

综上:所以问题应该是由 临时表空间过大->运行监控函数时间长,一直持有latch->新的应用不能连接。在HANG期间,一般建议至少要收集db2pd -eve和db2pd -stack all的数据,由于没有db2pd -eve的数据,看不到出问题时候的临时表空间使用量。不过,在后续的监控中,果然发现出问题时临时表空间使用量在疯狂增长。找到应用对应的SQL,并对其进行调优之后,问题消失。
另外,有一个类似的文章,是原IBM Db2开发人员郝庆运写的,也是由于临时表空间比较大,导致应用性能问题,如下:
https://www.ibm.com/developerworks/cn/data/library/ba-cn-db2-tempspace-oversize-performance/index.html
  • 0
    点赞
  • 0
    收藏
    觉得还不错? 一键收藏
  • 0
    评论
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值