故障处理
故障阶段 | 本地 | 远端 | 状态检查 | 处理方法 | 解释 |
PREPARE PHASE | GC | CPS | 本地: select local_tran_id,state from dba_2pc_pending; LOCAL_TRAN_ID STATE ---------------------- ---------------- 2.12.64845 collecting 远程: select local_tran_id,state from dba_2pc_pending; no rows selected | 本地 execute DBMS_TRANSACTION.PURGE_LOST_DB_ENTRY('local_tran_id'); | 本地数据库要求其他点做好commit或者rollback准备,现在正在“收集”其他点的数据库的返回信息,但是远程数据库未知状态(in doubt)。我们需要将本地的Global Coordinator的状态清除掉 |
本地: select local_tran_id,state from dba_2pc_pending; LOCAL_TRAN_ID STATE ---------------------- ---------------- 2.12.64845 prepared 远程: select local_tran_id,state from dba_2pc_pending; no rows selected | 本地: rollback force 'local_tran_id'; execute DBMS_TRANSACTION.PURGE_LOST_DB_ENTRY('local_tran_id'); | 本地Global Coordinator已经做好准备,已经将分布式锁放到各个事务的表上,但是远程数据库的状态再次未知(in doubt),我们需要手工的将本地的transaction rollback掉,并且清除分布式事务信息 | |||
COMMIT PHASE | GC | CPS | 本地: select local_tran_id,state from dba_2pc_pending; LOCAL_TRAN_ID STATE ---------------------- ---------------- 2.12.64845 prepared 远程: select local_tran_id,state from dba_2pc_pending; LOCAL_TRAN_ID STATE ---------------------- ---------------- 1.92.66874 commited | 本地: commit force 'local_tran_id'; execute DBMS_TRANSACTION.PURGE_LOST_DB_ENTRY('local_tran_id'); 远程: execute DBMS_TRANSACTION.PURGE_LOST_DB_ENTRY('local_tran_id'); | 远程数据库可能已经commit,但是本地Global Coordinator未知远程数据库的状态,还是处于prepare的状态 |
本地: select local_tran_id,state from dba_2pc_pending; LOCAL_TRAN_ID STATE ---------------------- ---------------- 2.12.64845 commited 远程: select local_tran_id,state from dba_2pc_pending; LOCAL_TRAN_ID STATE ---------------------- ---------------- 1.92.66874 commited | 本地: execute DBMS_TRANSACTION.PURGE_LOST_DB_ENTRY('local_tran_id'); 远程: execute DBMS_TRANSACTION.PURGE_LOST_DB_ENTRY('local_tran_id'); | 远程数据库和本地数据库均已经完成commit,但是分布式事务的信息尚未清除,我们需要在本地和远程运行 | |||
FORGET PHASE | GC | CPS | 本地: select local_tran_id,state from dba_2pc_pending; LOCAL_TRAN_ID STATE ---------------------- ---------------- 2.12.64845 commited 远程: select local_tran_id,state from dba_2pc_pending; no rows selected | 本地: execute DBMS_TRANSACTION.PURGE_LOST_DB_ENTRY('local_tran_id'); | 远程commit point site已经完成commit,通知Global Coordinator清除本地的分布式事务信息,但是Global Coordinator没有收到该信息 |
特殊情况 | |||||
情况1,当我们遇到使用rollback/commit force的时候,无法正常清除分布式事务的信息,会报错ORA-02058: no prepared transaction found with ID X.XX.XXXXX时,需要通过手工方式来清除该信息 | 在dba_2pc表中还有事务记录,但是实际已经不存在该事务了: select local_tran_id, state from dba_2pc_pending; LOCAL_TRAN_ID STATE ---------------------- ---------------- 1.92.66874 prepared (注:'1.92.66874' 的结构为rbs#, slot#, wrap#,此事务在rollback segment #1) 找出使用rollback segment #1且状态是active的transaction: SELECT KTUXEUSN, KTUXESLT, KTUXESQN, /* Transaction ID */ KTUXESTA Status, KTUXECFL Flags FROM x$ktuxe WHERE ktuxesta!='INACTIVE' AND ktuxeusn= 1; <== 这是rollback segment#,即rbs# no rows selected 因此我们在rollback force的时候会报错: ORA-02058: no prepared transaction found with ID 1.92.66874 | 我们需要如下处理: set transaction use rollback segment SYSTEM; delete from sys.pending_trans$ where local_tran_id = '1.92.66874'; delete from sys.pending_sessions$ where local_tran_id = '1.92.66874'; delete from sys.pending_sub_sessions$ where local_tran_id = '1.92.66874'; commit; | |||
情况2,这种情况比较少见,在dba_2pc表中无法查到分布式事务信息,但是实际上却是存在该分布式事务的 | 我们在alertlog中可以看到: ORA-1591: lock held by in-doubt distributed transaction 1.92.66874 查询dba_2pc的表,发现没有分布式事务信息: select local_tran_id, state from dba_2pc_pending where local_tran_id='1.92.66874'; no rows selected 但是去查实际的rollback segment信息,却发现有prepared状态的分布式事务存在: SELECT KTUXEUSN, KTUXESLT, KTUXESQN, /* Transaction ID */ KTUXESTA Status, KTUXECFL Flags FROM x$ktuxe WHERE ktuxesta!='INACTIVE' AND ktuxeusn= 1; KTUXEUSN KTUXESLT KTUXESQN STATUS FLAGS ---------- ---------- ---------- ---------------- ------------------------ 1 92 66874 PREPARED SCO|COL|REV|DEAD 我们无法做commit force或者rollback force: rollback force '1.92.66874'; ORA-02058: no prepared transaction found with ID 1.92.66874 | 我们用如下的方式手工清理: alter system disable distributed recovery; insert into pending_trans$ ( LOCAL_TRAN_ID, GLOBAL_TRAN_FMT, GLOBAL_ORACLE_ID, STATE, STATUS, SESSION_VECTOR, RECO_VECTOR, TYPE#, FAIL_TIME, RECO_TIME) values( '1.92.66874', /* <== 此处为你的local tran id */ 306206, /* */ 'XXXXXXX.12345.1.2.3', /* 这些值不必更改, */ 'prepared','P', /* 是静态参数,可以直接 */ hextoraw( '00000001' ), /* 在这个sql中使用 */ hextoraw( '00000000' ), /* */ 0, sysdate, sysdate ); insert into pending_sessions$ values( '1.92.66874',/* <==此处为你的local tran id */ 1, hextoraw('05004F003A1500000104'), 'C', 0, 30258592, '', 146); commit; commit force '1.92.66874'; 此时如果commit force还是出现报错,需要继续执行: delete from pending_trans$ where local_tran_id='1.92.66874'; delete from pending_sessions$ where local_tran_id='1.92.66874'; commit; alter system enable distributed recovery; 此时如果没有报错,则执行以下语句: alter system enable distributed recovery; connect / as sysdba alter session set "_smu_debug_mode" = 4; (注:如果使用auto的undo管理方式,需要执行此步骤,此步骤能避免在后续执行purge_lost_db_entry出现ORA-01453 的报错,详细信息可见Bug2191458) commit; exec dbms_transaction.purge_lost_db_entry( '1.92.66874' ) |
分布式锁-解锁
运行shell脚本后,会生成 roll.sql文件。
只需要运行roll.sql就会解除dba_2pc_pending 中的分布式锁。
注意:根据情况决定shell脚本中是否需要where 条件,比如过滤status 是prepare的,还是其他的,还是全部的。
----------------------shell脚本-------------------------
#!/bin/bash
. /home/oracle/.profile
sqlplus -S / as sysdba <<EOF > roll.sql
set heading off pagesize 0 echo off linesize 200 feedback off trimspool on
select 'rollback force ''' || LOCAL_TRAN_ID || ''';' || chr(10) ||
'execute DBMS_TRANSACTION.PURGE_LOST_DB_ENTRY(''' || LOCAL_TRAN_ID
|| ''');' || chr(10) || 'commit;' from DBA_2PC_PENDING;
exit;
EOF
------------------------上述sh生成的roll.sql文件格式-----------------------
rollback force '1506.7.4851';
execute DBMS_TRANSACTION.PURGE_LOST_DB_ENTRY('1506.7.4851');
commit;
rollback force '84.4.6215898';
execute DBMS_TRANSACTION.PURGE_LOST_DB_ENTRY('84.4.6215898');
commit;
参考