测试ogg的时候发现不知道什么时候datapump居然挂掉了,报错OGG-01733,搜索了一下mos,发现是trail文件损坏导致。按照下面的方法成功恢复。
In this Document
Symptoms |
Cause |
Solution |
Oracle to Oracle database replication recovery steps: |
Replicat Abend with ERROR OGG-01733 |
Extract pump Abend with ERROR OGG-01733 |
APPLIES TO:
Oracle GoldenGate - Version 11.2.1.0.1 and laterInformation in this document applies to any platform.
SYMPTOMS
Extract pump or replicat reading a trail gets ERROR OGG-01733 Oracle GoldenGate Capture for Oracle, example:
2013-12-11 22:20:26 ERROR OGG-01733 Oracle GoldenGate Capture for Oracle, pp2p101.prm: Trail file header file size value 46359762 for trail file '/ggtrail/EP2P201/se000021' differs from the actual size of the file (47308971).
CAUSE
An unplanned system failure or reboot can cause dirty disk cache not to flush to disk. When the system is recovered, the data that was written by extract can be lost. This is an Operating System issue. Goldengate checkpoint data may not match the actual eof of the trail. When this happens, extract will abend with ERROR OGG-01733.
SOLUTION
As trail data is now lost, the trail(s) has to be rebuilt.
Oracle to Oracle database replication recovery steps:
- Stop all Goldengate processes that writes or reads this trail.
- Stop the replicat at target that consume these data and get the checkpoint information by doing ggsci> info <full replicat name(Do not use wildcards or ALL)>
- execute Logdump, open the trail and position as per info replicat above.
logdump
> open <trail as per info>
> pos <rba>
> ggstokens detail
> ghdr on
> next
Record the LogCSN ggstoken value. - Go to the source local trail as written by the tranlog extract to locate the OGG record using this LogCSN. Use logdump open the trail that you think has this record and search for this. This can be achieved by
logdump
> open <trail, you might need to search for more than one trail>
> ghdr on
> ggstokens detail
> filter include LogCSN <recorded LogCSN>
> next - Record the trail location, trail sequence and RBA of this record.
- Follow the guidelines depending on which component fails:
Replicat Abend with ERROR OGG-01733
-
- At the source, ggsci
> alter <extract pump>, etrollover
> alter <extract pump>, extseqno <trail sequence>, extrba <RBA>
> start <extract pump> - At the target, ggsci
> alter <replicat>, extseqno <new trail created by extract pump>, extrba 0
> start <replicat>, ATCSN <LogCSN>
- At the source, ggsci
Extract pump Abend with ERROR OGG-01733
-
- At the source use LogMiner to get the timestamp for LogCSN.
- At the source, ggsci
> alter <tranlog extract>, begin <timestamp - a buffer**>
> alter <tranlog extract>, etrollover
> start <tranlog extract> (take note of new trail sequence)
> alter <extract pump>, extseqno <new sequence from tranlog extract>, extrba 0
> alter <extract pump>, etrollover
> start <extract pump>
At the target, ggsci
> > alter <replicat>, extseqno <new trail created by extract pump>, extrba 0
> start <replicat>, ATCSN <LogCSN> - (Alternatively find the SCN in new trails manually, and position the pump or replicat to the beginning of next transaction)
Note:
** Buffer The obtained SCN is commit SCN, and there is no way to know the start time of the oldest pending transaction at that SCN point. The customer has to make a decision here.
Example, if this is OLTP env, and customer knows that transactions should not last more than 1 hour then we may position the extract 2hr before the SCN point.
GGSCI (bre2) 4> dblogin userid ggt ,password ggt
Successfully logged into database.
GGSCI (bre2 as ggt@bre2) 5> info replicat detail
ERROR: REPLICAT does not exist.
GGSCI (bre2 as ggt@bre2) 6> info rep_t1 detail
REPLICAT REP_T1 Last Started 2015-09-13 03:47 Status RUNNING
Checkpoint Lag 00:00:00 (updated 00:00:08 ago)
Process ID 5657
Log Read Checkpoint File ./dirdat/rt000110
2015-09-13 20:57:55.127364 RBA 48477
Current Log BSN value: 2095819
Last Committed Transaction CSN value: 2132870
Extract Source Begin End
./dirdat/rt000110 2015-09-13 03:45 2015-09-13 20:57
./dirdat/rt000012 2015-09-13 03:19 2015-09-13 03:45
./dirdat/rt000012 2015-09-13 03:16 2015-09-13 03:19
./dirdat/rt000012 2015-09-13 03:16 2015-09-13 03:16
./dirdat/rt000012 2015-09-13 03:16 2015-09-13 03:16
./dirdat/rt000012 2015-09-13 03:16 2015-09-13 03:16
./dirdat/rt000012 2015-09-13 03:16 2015-09-13 03:16
./dirdat/rt000012 2015-09-13 03:16 2015-09-13 03:16
./dirdat/rt000012 2015-09-13 03:16 2015-09-13 03:16
./dirdat/rt000012 2015-09-13 03:16 2015-09-13 03:16
./dirdat/rt000012 2015-09-13 01:15 2015-09-13 03:16
./dirdat/rt000012 2015-09-13 01:06 2015-09-13 01:15
./dirdat/rt000012 2015-09-10 20:55 2015-09-13 01:06
./dirdat/rt000010 2015-09-09 21:25 2015-09-10 20:55
./dirdat/rt000010 2015-09-09 21:09 2015-09-09 21:25
./dirdat/rt000008 2015-09-09 20:59 2015-09-09 21:09
./dirdat/rt000006 2015-09-09 20:54 2015-09-09 20:59
./dirdat/rt000004 2015-09-09 20:33 2015-09-09 20:54
./dirdat/rt000002 2015-09-09 20:26 2015-09-09 20:33
./dirdat/rt000002 * Initialized * 2015-09-09 20:26
./dirdat/rt000000 * Initialized * First Record
Current directory /ogg/oracle
Report file /ogg/oracle/dirrpt/REP_T1.rpt
Parameter file /ogg/oracle/dirprm/rep_t1.prm
Checkpoint file /ogg/oracle/dirchk/REP_T1.cpr
Checkpoint table ggt.chkpt
Process file /ogg/oracle/dirpcs/REP_T1.pcr
Error log /ogg/oracle/ggserr.log
GGSCI (bre1) 19> info dp_t1
EXTRACT DP_T1 Last Started 2015-09-13 21:07 Status ABENDED
Checkpoint Lag 11:54:18 (updated 00:01:00 ago)
Log Read Checkpoint File ./dirdat/lt000012
2015-09-13 09:12:53.000000 RBA 2311
使用logdump查看是什么时候的什么操作导致的问题:
Logdump 139 >open lt000012
Current LogTrail is /ogg/oracle/dirdat/lt000012
Logdump 140 >ggstoken detail
Logdump 141 >ghdr on
Logdump 143 >pos 2311
Reading forward from RBA 2311
Logdump 144 >n
___________________________________________________________________
Hdr-Ind : E (x45) Partition : . (x04)
UndoFlag : . (x00) BeforeAfter: A (x41)
RecLength : 23 (x0017) IO Time : 2015/09/13 18:02:58.000.000
IOType : 5 (x05) OrigNode : 255 (xff)
TransInd : . (x03) FormatType : R (x52)
SyskeyLen : 0 (x00) Incomplete : . (x00)
AuditRBA : 159 AuditPos : 12360208
Continued : N (x00) RecCount : 1 (x01)
2015/09/13 18:02:58.000.000 Insert Len 23 RBA 2311
Name: TEST2.T3
After Image: Partition 4 G s
0000 0006 0000 0002 3337 0001 0009 0000 0005 746f | ........37........to
6d61 73 | mas
GGS tokens:
TokenID x52 'R' ORAROWID Info x00 Length 20
4141 4145 444e 4141 4541 4141 5a50 2b41 4141 0001 | AAAEDNAAEAAAZP+AAA..
TokenID x4c 'L' LOGCSN Info x00 Length 7
3231 3238 3032 35 | 2128025
TokenID x36 '6' TRANID Info x00 Length 8
322e 3130 2e39 3637 | 2.10.967
按照文档操作:
GGSCI (bre1) 27> alter extract ext_t1 ,begin 2015-09-13 18:02:58
EXTRACT altered.
--下面这一步表示重新生成队列文件,虽然有一个错误提示,但是不影响
GGSCI (bre1) 28> alter extract ext_t1,etrollover
2015-09-13 21:17:07 INFO OGG-01520 Rollover performed. For each affected output trail of Version 10 or higher format, after starting the source extract, issue ALTER EXTSEQNO for that trail's reader (either pump EXTRACT or REPLICAT) to move the reader's scan to the new trail file; it will not happen automatically.
EXTRACT altered.
GGSCI (bre1) 32> start ext_t1
Sending START request to MANAGER ...
EXTRACT EXT_T1 starting
--下面这一步是让datapump从16号trail重新发送
GGSCI (bre1) 1> alter extract dp_t1 ,extseqno 16 ,extrba 0
GGSCI (bre1) 2> alter extract dp_t1,etrollover
2015-09-13 21:25:12 INFO OGG-01520 Rollover performed. For each affected output trail of Version 10 or higher format, after starting the source extract, issue ALTER EXTSEQNO for that trail's reader (either pump EXTRACT or REPLICAT) to move the reader's scan to the new trail file; it will not happen automatically.
EXTRACT altered.
GGSCI (bre1) 4> start dp_t1
Sending START request to MANAGER ...
EXTRACT DP_T1 starting
GGSCI (bre1) 5> info all
Program Status Group Lag at Chkpt Time Since Chkpt
MANAGER RUNNING
EXTRACT RUNNING DP_T1 00:00:00 00:00:12
EXTRACT RUNNING EXT_T1 00:00:00 00:00:07
alter replicat rep_t1,extseqno 16,extrba 0
start replicat rep_t1,atcsn 2132870