[20151214]exadata-豆腐渣系统的保护神.txt
--很久以前写的http://www.itpub.net/thread-1927992-1-1.html,忘记贴到blog,补上。
--这个完全是由感而发,这个是前几个星期遇到的事情,由于上面统一发票格式,对程序做一些改动.
--上线后我发现一条sql语句存在问题,实际加入的部分使用union all连接起来,前面那部分以前也是存在问题的,
--我使用sql profile把执行计划稳定下来,在加入union all后,执行计划再次回到原来的状态.
--其中2个表1个500M,1个7.XG,全表扫描,做bloom filter,再使用hash join连接起来,我看了大约有10台机器在执行这个语句,
--是一个刷屏命令,30秒定时执行1次.这样1个小时大约1200次.
--我自己执行看了大约4秒执行完成,真快!!我这次没有使用sql profile稳定计划,因为即使那样,逻辑度也很高.
--我直接提交要求开发改代码.当然我并没有讲如何改,因为我认为开发应该很容易发现问题修改代码.
--结果,没有任何人重视,就这样一直拖了2个星期,直到星期六早上出现性能问题,我事后看了报表:
--10:11点:
Top 10 Foreground Events by Total Wait Time
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Tota Wait % DB
Event Waits Time Avg(ms) time Wait Class
------------------------------ ------------ ---- ------- ------ ----------
DB CPU 10.8 48.9
gc buffer busy acquire 12,869,510 4860 0 22.0 Cluster
read by other session 10,487,086 4153 0 18.8 User I/O
cell single block physical rea 5,050,453 1499 0 6.8 User I/O
cell multiblock physical read 3,725,544 1301 0 5.9 User I/O
SQL*Net more data from client 230,252 157. 1 .7 Network
SQL*Net more data to client 4,702,358 57.2 0 .3 Network
log file sync 115,817 44.2 0 .2 Commit
SQL*Net break/reset to client 137,108 37.7 0 .2 Applicatio
SQL*Net message to client 14,156,077 30.1 0 .1 Network
--很明显问题出在rac的内联方面。
--查看SQL ordered by Cluster Wait Time部分:
SQL ordered by Cluster Wait Time DB/Inst: DBCN/dbcn1 Snaps: 5465-5466
-> %Total - Cluster Time as a percentage of Total Cluster Wait Time
-> %Clu - Cluster Time as a percentage of Elapsed Time
-> %CPU - CPU Time as a percentage of Elapsed Time
-> %IO - User I/O Time as a percentage of Elapsed Time
-> Only SQL with Cluster Wait Time > .005 seconds is reported
-> Total Cluster Wait Time (s): 4,863
-> Captured SQL account for 98.1% of Total
Cluster Elapsed
Wait Time (s) Executions %Total Time(s) %Clu %CPU %IO SQL Id
-------------- ------------ ------ ---------- ------ ------ ------ -------------
4,772.2 1,050 98.1 18,000.4 26.5 38.0 37.6 dz7bbcwhuhhp7
Module: XXX.EXE
SELECT MS_CF01.CFHM, MS_CF01.CFSB, MS_CF01.BRXM,
MS_CF01.BRID, MS_CF01.FPHM, MS_CF01.CFLX, MS_CF0
1.KFRQ, MS_CF01.CFTS, MS_CF01.FYRQ, MS_CF01.
FYCK, MS_CF01.KSDM, MS_CF01.YSDM, MS_MZXX.CZGH,
--在另外的实例2上看到:
SQL ordered by Cluster Wait Time DB/Inst: DBCN/dbcn2 Snaps: 5465-5466
-> %Total - Cluster Time as a percentage of Total Cluster Wait Time
-> %Clu - Cluster Time as a percentage of Elapsed Time
-> %CPU - CPU Time as a percentage of Elapsed Time
-> %IO - User I/O Time as a percentage of Elapsed Time
-> Only SQL with Cluster Wait Time > .005 seconds is reported
-> Total Cluster Wait Time (s): 2
-> Captured SQL account for 54.4% of Total
Cluster Elapsed
Wait Time (s) Executions %Total Time(s) %Clu %CPU %IO SQL Id
-------------- ------------ ------ ---------- ------ ------ ------ -------------
1.0 1 56.2 13.0 7.4 73.6 11.1 b6usrg82hwsa3
Module: DBMS_SCHEDULER
call dbms_stats.gather_database_stats_job_proc ( )
--估计是系统分析了表ms_cf01导致数据大部分在第2个实例,导致出现cluster相关的等待事件。
--实际上我看最后的分析时间并没有变化,并没有分析成功.有点不明白?而且上面的执行时间很短仅仅13秒.
--我们正常这条语句都在实例1上执行,为什么表ms_cf01数据大量在实例2上,还是不明白?
--实例1上:
Segments by Global Cache Buffer Busy
% of Capture shows % of GC Buffer Busy for each top segment compared
with GC Buffer Busy for all segments captured by the Snapshot
Owner Tablespace Name Object Name Subobject Name Obj. Type GC Buffer Busy % of Capture
xxxxxx_xxx xxxxxx_xxx MS_CF01 TABLE 18,524,883 100.00
--实际上当时紧急修改了代码,但是依旧有机器在执行这条语句.我们的系统要退出才会自动升级.
--至少当时的下午已经恢复正常.
--我那这条语句在dg上执行,第1次4分钟,第2次1分钟多一点.
--正好前一阵子帮别人解决问题,听说对方也准备上exadata,我感觉国内许多大单位都有意上exadata,而不愿意花精力在解决程序以及
--设计的问题上。