银行某系统存储过程报错,重新调起仍报错,而前几天这个一直正常。通过应用日志定位到报错信息,
ORA-03113: 通信通道的文件结尾
进程 ID: 36503726
会话 ID: 586 序列号: 65345
遇到这种错误,习惯性在网上找答案。然而网上那些关于这个经典错误的经典案例无一能和我的错误匹配
这时候我会习惯性的"看看SQL"
从存储过程里面定位到这个导致报错的SQL,单独拿出来跑,报错!!SQL语句如下:
SELECT *
FROM (SELECT ROW_NUMBER() OVER(ORDER BY 1) AS RN, BUSINESS_QUERY.*
FROM (select distinct t.*,
t7.org_id as grant_user_id,
t6.user_name,
t2.grant_user,
t4.cust_zh_name,
t5.name,
t2.grant_name,
t2.grant_date,
t2.grant_type,
t2.grant_start_date,
t2.grant_end_date
from tttttttttttt1 t
left join tttttttttttt2 t2
on t.grant_id = t2.id
left join scrm.tttttttttttt4 t4
on t.cust_id = t4.cust_id
left join tttttttttttt6 t6
on t2.grant_user = t6.account_name
left join tttttttttttt5 t5
on t5.emp_id = t.emp_id
left join tttttttttttt7 t7
on t7.account_name = t2.grant_user
where t.id in (select max(t1.id)
from tttttttttttt1 t1
group by t1.cust_id, t1.emp_id)
and t2.grant_end_date >=
to_date('2018-10-12', 'YYYY-mm-dd')
and t2.grant_type = '1'
and t7.org_id IN
(SELECT UNITID
FROM SYS_UNITS
WHERE UNITSEQ LIKE
'%' || (SELECT UNITSEQ
FROM SYS_UNITS
WHERE UNITID = 'B9999'))
and t4.data_dt in( select crm_dt-1 from tttttttttttt9 )
order by t.grant_id desc,
t2.grant_date desc,
t2.grant_name desc) BUSINESS_QUERY
where rownum <= 100) SUB_QUERY
WHERE RN >= 1
这个SQL看起来 平平无奇,如果单通过这个SQL想去定位问题,非常难!但是系统负责人那边还有个信息
"这个SQL之前一直没有出问题,前几天一切正常!"
如果是往常 ,我会看一下SQL的执行计划,因为执行计划“突变”,这类问题解决过太多次。而针对这个我觉得没有看执行计划的必要,因为ORA-03113不算“性能”类的错误,我坚信执行计划那里获取不到有用的信息,那么,下一步 该怎么走
-----------------------------------------------------------一起思考一下-----------------------------------------------------------
-----------------------------------------------------------再往下想一点-----------------------------------------------------------
既然是前几天没有报错。。我们抓住了救命稻草:看SQL中的 时间列,时间列,时间过滤列
SQL中的时间过滤列就两个
and t2.grant_end_date >= to_date('2018-10-12', 'YYYY-mm-dd')
and t4.data_dt in( select crm_dt-1 from o_pub_project_draw )
代入 昨天的时间 果然SQL正常~~~
下面的诊断方法,简单 粗暴 高效-------注释掉t2和t4所在的部分再跑SQL看看会不会报错
我这里先查一下T2和T4表大小 以往的经验告诉我,大表出问题的概率一般比较大
tttttttttttt4 t4是一张视图,里面就一张基表 AAAAAAAAAA_SUM .表信息如下:
SQL> select segment_name,sum(bytes/1024/1024/1024)as GB from dba_segments where segment_name in('AAAAAAAAAA_SUM','TTTTTTTTTTT2') GROUP BY segment_name;
SEGMENT_NAME GB
-------------------------------------------------------------------------------- -----------------------
AAAAAAAAAA_SUM 60.9638671
TTTTTTTTTTT2 0.0001264
我们这里注释点t4部分 ,SQL无报错!
这里我们定位到T4表出了问题.........
我们先排查一下表上面有没有约束,主外键? pass掉
剩下的排查方向就是两个:表有问题,表上面索引有问题
因为是分区表,我习惯性先排查global索引(解决过太多global索引失效的问题),查看索引状态
SQL> SELECT INDEX_NAME,TABLE_NAME,TABLESPACE_NAME,BLEVEL,LEAF_BLOCKS,STATUS,LAST_ANALYZED,DEGREE,PARTITIONED FROM DBA_INDEXES WHERE TABLE_NAME IN('A_M_C_CUST_BASE_SUM','A_M_C_CUST_BASE_SUM_1010');
INDEX_NAME TABLE_NAME TABLESPACE_NAME BLEVEL LEAF_BLOCKS STATUS LAST_ANALYZED DEGREE PARTITIONED
------------------------------ ------------------------------ ------------------------------ ---------- ----------- -------- ------------- ---------------------------------------- -----------
IDX_AAAAAAAAAA_SUM_NO AAAAAAAAAA_SUM 2 865562 N/A 2018/9/29 17: 16 YES
IDX_AAAAAAAAAA_SUM_02 AAAAAAAAAA_SUM XXXXXXXXXXXX 3 1915940 VALID 2018/10/10 15 1 NO
IDX_AAAAAAAAAA_SUM_ID AAAAAAAAAA_SUM XXXXXXXXXXXX 3 984802 VALID 2018/10/11 14 4 NO
global索引没有问题~~~~~
排查表的问题没什么特别好的方法,使用cts(create table as)重建一张一模一样的表AAAAAAAAAA_SUM_1010
开整之前我还怀疑cts语句 执行会不会报错。结果~没事。而且把表AAAAAAAAAA_SUM_1010代入到原SQL语句中,SQL完美执行,没有报错 !!表确实有问题,但是CTS为何没有报错?(我当时排查的时候一直往表坏块的方向走)
正绝望中,看到了上面排查索引的结果:
STATUS LAST_ANALYZED DEGREE PARTITIONED
-------- ------------- ------- -----------
N/A 2018/9/29 17 16 YES
VALID 2018/10/10 15 1 NO
VALID 2018/10/11 14 4 NO
分区索引会不会失效?
select distinct status from dba_ind_partitions where index_name in ('IDX_AAAAAAAAAA_SUM_NO')
status
----------
UNUSABLE
VALID
顺便查一下又哪些索引失效
select 'alter index CCCC.IDX_AAAAAAAAAA_SUM_NO rebuild partition '||partition_name||' parallel 8;' from dba_ind_partitions where index_name in ('IDX_A_M_C_CUST_BASE_CUST_NO') and status='UNUSABLE';
alter index CCCC.IDX_AAAAAAAAAA_SUM_NO rebuild partition P_20180829 parallel 8;
alter index CCCC.IDX_AAAAAAAAAA_SUM_NO rebuild partition P_20180830 parallel 8;
alter index CCCC.IDX_AAAAAAAAAA_SUM_NO rebuild partition P_20180831 parallel 8;
alter index CCCC.IDX_AAAAAAAAAA_SUM_NO rebuild partition P_20180901 parallel 8;
alter index CCCC.IDX_AAAAAAAAAA_SUM_NO rebuild partition P_20180902 parallel 8;
alter index CCCC.IDX_AAAAAAAAAA_SUM_NO rebuild partition P_20180903 parallel 8;
alter index CCCC.IDX_AAAAAAAAAA_SUM_NO rebuild partition P_20180904 parallel 8;
alter index CCCC.IDX_AAAAAAAAAA_SUM_NO rebuild partition P_20180905 parallel 8;
alter index CCCC.IDX_AAAAAAAAAA_SUM_NO rebuild partition P_20180906 parallel 8;
alter index CCCC.IDX_AAAAAAAAAA_SUM_NO rebuild partition P_20180908 parallel 8;
alter index CCCC.IDX_AAAAAAAAAA_SUM_NO rebuild partition P_20180909 parallel 8;
alter index CCCC.IDX_AAAAAAAAAA_SUM_NO rebuild partition P_20180910 parallel 8;
alter index CCCC.IDX_AAAAAAAAAA_SUM_NO rebuild partition P_20180911 parallel 8;
alter index CCCC.IDX_AAAAAAAAAA_SUM_NO rebuild partition P_20180912 parallel 8;
alter index CCCC.IDX_AAAAAAAAAA_SUM_NO rebuild partition P_20180913 parallel 8;
... ...
... ...
... ...
alter index CCCC.IDX_AAAAAAAAAA_SUM_NO noparallel;
索引REBUILD 重新调起存储过程~完美通过
这个系统上次给他们处理过一次问题,删除分区之后都会加update global indexes。没想到local index会出问题
总结:1.删除分区会导致全局和分区索引都失效,使用update global indexes对local索引没有效果。
2.ORA-03113报错无分析头绪的时候查看一下相关表的索引状态