客户在insert某条数据的时候出了一个DSNISGRT:500A的abend,送来这个abend的dump,通过初步分析,可以看到如下信息
IP VERBX MTRACE
B000000 KOE6 22259 19:38:37.12 S0672209 00000291 DSNI014I )DSZP DSNISGRT DATA IN USE DURING ABEND 204
204 00000091 REASON 00C90101
204 00000091 ERQUAL 500A
204 00000091 TYPE 00000308
204 00000091 NAME RKWWZDD0.RKTS1178.X'0253'.X'000000C7' <----TS page
204 00000091 CONNECTION-ID=DSZP
204 00000091 CORRELATION-ID=rsa-saison-e
204 00000091 LUW-ID=800800CE.8FB6.220916173256=141781
B000000 KOE6 22259 19:38:37.12 S0672209 00000291 DSNI014I )DSZP DSNISGRT DATA IN USE DURING ABEND 205
205 00000091 REASON 00C90101
205 00000091 ERQUAL 500A
205 00000091 TYPE 00000303
205 00000091 NAME RKWWZDD0.T1178X0A.X'002E73EA' <---IX page
205 00000091 CONNECTION-ID=DSZP
205 00000091 CORRELATION-ID=rsa-saison-e
205 00000091 LUW-ID=800800CE.8FB6.220916173256=141781
==SQL Statement
-------------
INSERT INTO RKWWC0.T1178_ZD_SAISONZM ( BILANZSTELLE, BEREICH, KO
STENSTELLE, PRUEFZIFFER, ARTIKEL_NR, WOCHE, ZMW, ZM1, ZM2,
ZM3, ZM4, ZM5, ZM6, ZM7 ) VALUES ( ?, ?, ?, ?, ?,
?, ?, 0, 0, 0, 0, 0, 0, 0 )
497 Getp S-L DC1DF1DAF19BB802 0169.01DC.0000.002E73EA 00007100 88040004 DSNKTRAV+00010376 0000015217D9DB40 0000015217519000 0000
498 Relp Unl DC1DF1DAF19BBBE6 0169.01DC.0000.0029AF65 00007100 80040024 DSNKTRAV+00015996 00000151FC5D1680 00000151FC168000 0000
499 Getp DC1DF1DAF19BC42E 0169.01D9.00FD.00000001 00000100 C8041010 DSNISUPI+000003C2 00000150B0D4AF80 00000150AFE34000 0000
500 Getp X-L DC1DF1DAF19BD6C4 0169.01D9.00FD.000000C7 80003900 C8040000 DSNISGRT+000053C4 000001510E5295C0 000001510D367000 0000
所以是在执行上述insert语句,通过index leaf page X'002E73EA' 定位到candidate data page X'000000C7', 然后往data page X'000000C7' insert 某条记录出了DSNISGRT:500A的abend,那么下一步是搞清楚insert那条记录报的这个错,以及报错的原因是什么?
通过查看Db2z的源码,我们直到DSNISGRT:500A报错的原因是Db2需要利用已经free掉的hole的space去插入新的数据,Db2检测到一旦插入这条record,剩余的hole的长度是个负数,即当前待插入的hole的长度比待插入的record的长度还要短。通过dump分析,我们很容易知道是要在page C7 offset x'006B'处的overflow hole C7-08插入长度为9的data record,Db2源码标识overflow hold的record长度是0,所以一旦insert了之后,剩余hole的length就会变长-9,这样Db2就报了DSNISGRT:500A的abend。下面是内存buffer pool里面page C7的镜像:
PAGE: # 000000C7 ----------------------------------------------------------------------------------
DATA PAGE: PGCOMB='00'X PGBIGRBA='00DC1DF1DAF19BE0DA00'X PGNUM='000000C7'X PGFLAGS='80'X
PGFREE=2054 PGFREE='0806'X PGFREEP=3603 PGFREEP='0E13'X PGHOLE1='0662'X
PGMAXID='EC'X PGNANCH=104
PGBIGTAIL: PGPRETAIL='000000000000000000DC1DF1DAF19BE0DA00'X PGIDFREE='00'X PGEND='42'X
ID-MAP FOLLOWS:
01 0014 001F 002A 0035 0044 0053 005F 8000
09 0074 0085 0097 00A9 00BA 00CC 00DB 00EA
…
RECORD: XOFFSET='005F'X PGSFLAGS='06'X PGSLTH=12 PGSLTH='000C'X PGSOBD='0000'X PGSBID='07'X
F085F036 F915F6F1 315403F0 202301F0 00000000 F0000000 00F00000 0000F000 0e0.9.61...0...0....0....0....0.
000000F0 00000000 F0000000 00F00000 0000F000 000000 ...0....0....0....0....
RECORD: XOFFSET='006B'X PGSFLAGS='58'X PGSRIDOF='000000C400'X <---PGSNODAT & PGSXE means deleted pointer record,长度为0,因为58后面的2个字节是x'0000'
RECORD: XOFFSET='0071'X PGSFLAGS='68'X PGSOLDOW='D86D0600'X PGSBID='11'X
RECORD: XOFFSET='0077'X PGSFLAGS='00'X PGSLTH=9 PGSLTH='0009'X PGSOBD='70B2'X PGSBID='33'X
35E102 ...
我们可以看到dump时间点附近有如下archive log:
00000004D1498A3E3A99 TYPE( UNDO REDO ) URID(00000004D1498A1FD520)
LRSN(00DC1DF1DAF19BE0DA00) DBID(0169) OBID(01D9) PART(00FE) PAGE(000000C7) 19:38:37 22.259
SUBTYPE(INSERT IN A DATA PAGE) CLR(NO) PROCNAME(DSNISGRT) COMPRESSED DATA
*LRH* 0000008D 00C80009 0EA00000 00000000 00000004 D1498A1F D5200000 00000000 * H J N
00000004 D1498A3E 39D15000 06000001 00000000 0004D149 8A3E39D1 000000DC * J J& J J
1DF1DAF1 9BE0DA00 00020000 00000000 * 1 1
*LG** 88016901 D9000000 C70000DC 1DF1DAF1 34FB0400 4C40C000 00FE1DF1 DAC89000 *h R G 1 1 < 1 H
00000000 *
0000 00194008 01DA4013 00000100 00000000 04000900 000868D8 6D * Q_
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 112572 Line(s) not Displayed
00000004D1498DEDA2E4 TYPE( UNDO REDO ) URID(00000004D1498A1FD520)
LRSN(00DC1DF1E3FCA9C9C800) DBID(0169) OBID(01D9) PART(00FE) PAGE(000000C7) 19:38:46 22.259
SUBTYPE(DELETE IN A DATA PAGE) CLR(YES) PROCNAME(DSNISGRT)
COMPRESSED DATA
*LRH* 0000008D 00BE0009 0EE00000 00000000 00000004 D1498A1F D5200000 00000000 * J N
00000004 D1498A6E 2CEE5000 06000001 00000000 0004D149 8A3E39D1 000000DC * J > & J J
1DF1E3FC A9C9C800 00020000 00000000 * 1T zIH
*LG** 88016901 D9000000 C70000DC 1DF1DAF1 9BE0DA00 4C40C000 00FE1DF1 DAC89000 *h R G 1 1 < 1 H
00000000 *
0000 00192008 01DA4013 00000100 00000000 04000900 000868D8 6D * Q_
所以在LRSN 00DC1DF1DAF19BE0DA00的时候确实是往C7-08的位置插入长度为9的data record。 下一步是要搞清楚这个pseudo delete pointer record是怎么搞出来的,很遗憾,客户只保留了15天的archive log,并没有产生这些pseudo delete pointer record的log,但是anyway,有大量基于这些pseudo delete pointer record的insert操作,因此Db2的developer同意开APAR在做insert的时候,判断一下利用pseudo delete pointer hole的情况。