今日对某局方的数据库进行巡检,发现alert.log日志里面有大量的ORA-32701: Possible hangs up to hang ID=57 detected报错,完整的日志报错如下:
Sun Dec 13 01:08:12 2015
Errors in file /oracle/app/oracle/diag/rdbms/huibuy/huibuy1/trace/huibuy1_dia0_80848.trc (incident=200081):
ORA-32701: Possible hangs up to hang ID=57 detected
Incident details in: /oracle/app/oracle/diag/rdbms/huibuy/huibuy1/incident/incdir_200081/huibuy1_dia0_80848_i200081.trc
DIA0 requesting termination of session sid:5217 with serial # 33617 (ospid:41409) on instance 2
due to a GLOBAL, HIGH confidence hang with ID=57.
Hang Resolution Reason: Although the number of affected sessions did not
justify automatic hang resolution initially, this previously ignored
hang was automatically resolved.
DIA0: Examine the alert log on instance 2 for session termination status of hang with ID=57.
进一步跟踪日志:
/oracle/app/oracle/diag/rdbms/huibuy/huibuy1/trace/huibuy1_dia0_80848.trc,
发现:
Incident details in: /oracle/app/oracle/diag/rdbms/huibuy/huibuy1/incident/incdir_200081/huibuy1_dia0_80848_i200081.trc
inst# SessId Ser# OSPID PrcNm Event
1 6610 9703 89862 M000 enq: WF - contention
2 5217 33617 41409 M000 not in wait
M000进程出现等待,根据运维人员的反馈,出现hang的时间很短,业务侧反馈,数据库主机只能够承受1000左右的连接,256G内存,64CORE的X86性能怎么会如此低下?还会导致库经常出现hang?感觉非常不科学。并且1000左右的连接就消耗完了整个主机内存。基本定位出,用户连接不合理导致消耗大量的内存,M000等待内存分配从而出现hang。
附上完整的数据库hang分析步骤:
- 收集systemstate dump:
Oracle$ sqlplus -prelim / as sysdba
SQL>oradebug setmypid
SQL>oradebug unlimit
SQL>oradebug –g all dump systemstate 266
Wait for 30 seconds
SQL>oradebug -g all dump systemstate 266
Wait for 30 seconds
SQL>oradebug -g all dump systemstate 266
SQL>oradebug tracefile_name 显示trace file name
- hanganalyze:
SQL>sqlplus -prelim / as sysdba
SQL>oradebug setmypid
SQL>oradebug unlimit;
SQL>oradebug –g all hanganalyze 3
Wait for 30 seconds
SQL>oradebug –g all hanganalyze 3
Wait for 30 seconds
SQL>oradebug –g all hanganalyze 3
SQL>oradebug tracefile_name 显示trace file name
这个hang只是一个表象,真实的原因还是内存的分配,结合awr报告,可以看到两个节点之间有大量的gc,还有当前x86主机的Pagetables占用了将近25GB。所以一个整体的优化思路是:
一、优化gc
二、优化cache buffer chain和lantch
三、优化内存,采用Hugepages。
具体优化手段见后续的(二)。