这周又出了一次生产事件,发生在另一个运营了近四年的云环境中,受影响客户较多,好在影响时间不长,较快的恢复了生产。具体的排查过程就不说了,在这里先做一下排查记录。
服务器message日志
根据此日志文件可以查看oracle是因为什么宕掉的,这次事件通过此文件发现,是因为free swap 为0KB 引发了系统主动kill了oracle进程
[root@57373ded4b19 log]# more /var/log/messages
应用日志
app日志
查询DB宕机时间段的应用状态
web-apache日志
统计应用的交易请求量
oracle日志
alert*.log 是oracle的警告日志文件,能够看出来出问题的时候oracle在做什么,是因为什么引发的问题产生
trace 日志
[oracle@57373ded4b19 trace]$ sqlplus / as sysdba
SQL> show parameter dump
NAME TYPE VALUE
------------------------------------ ----------- ------------------------------
background_core_dump string partial
background_dump_dest string /home/oracle/app/oracle/diag/rdbms/helowin/helowin/trace
core_dump_dest string /home/oracle/app/oracle/diag/rdbms/helowin/helowin/cdump
max_dump_file_size string unlimited
shadow_core_dump string partial
user_dump_dest string /home/oracle/app/oracle/diag/rdbms/helowin/helowin/trace
SQL> select * from v$diag_info;
INST_ID NAME VALUE
---------- ---------------------------------------------------------------- -------------------------------------------------------------------------------------
1 Diag Enabled TRUE
1 ADR Base /home/oracle/app/oracle
1 ADR Home /home/oracle/app/oracle/diag/rdbms/helowin/helowin
1 Diag Trace /home/oracle/app/oracle/diag/rdbms/helowin/helowin/trace
1 Diag Alert /home/oracle/app/oracle/diag/rdbms/helowin/helowin/alert
1 Diag Incident /home/oracle/app/oracle/diag/rdbms/helowin/helowin/incident
1 Diag Cdump /home/oracle/app/oracle/diag/rdbms/helowin/helowin/cdump
1 Health Monitor /home/oracle/app/oracle/diag/rdbms/helowin/helowin/hm
1 Default Trace File /home/oracle/app/oracle/diag/rdbms/helowin/helowin/trace/helowin_ora_25878.trc
1 Active Problem Count 1
1 Active Incident Count 26
11 rows selected.
SQL>
[oracle@57373ded4b19 alert]$ cd /home/oracle/app/oracle/diag/rdbms/helowin/helowin/alert
[oracle@57373ded4b19 alert]$ ls
log.xml
[oracle@57373ded4b19 alert]$ cd /home/oracle/app/oracle/diag/rdbms/helowin/helowin/trace
[oracle@57373ded4b19 trace]$ ls alert_helowin.log
alert_helowin.log
[oracle@57373ded4b19 trace]$
ASM日志
[oracle@57373ded4b19 trace]$ sqlplus / as sysasm
SQL> show parameter dump
NAME TYPE VALUE
------------------------------------ ----------- ------------------------------
background_core_dump string partial
background_dump_dest string /home/oracle/app/grid/diag/asm/+asm/+A
SM1/trace
core_dump_dest string /home/oracle/app/grid/diag/asm/+asm/+A
SM1/cdump
max_dump_file_size string unlimited
shadow_core_dump string partial
user_dump_dest string /home/oracle/app/grid/diag/asm/+asm/+ASM1/trace
SQL> select * from v$diag_info;
INST_ID NAME VALUE
---------- ---------------------------------------------------------------- -------------------------------------------------------------------------------------
1 Diag Enabled TRUE
1 ADR Base /home/oracle/app/grid
1 ADR Home /home/oracle/app/grid/diag/asm/+asm/+ASM1
1 Diag Trace /home/oracle/app/grid/diag/asm/+asm/+ASM1/trace
1 Diag Alert /home/oracle/app/grid/diag/asm/+asm/+ASM1/alert
1 Diag Incident /home/oracle/app/grid/diag/asm/+asm/+ASM1/incident
1 Diag Cdump /home/oracle/app/grid/diag/asm/+asm/+ASM1/cdump
1 Health Monitor /home/oracle/app/grid/diag/asm/+asm/+ASM1/hm
1 Default Trace File /home/oracle/app/grid/diag/asm/+asm/+ASM1/trace/+ASM1_ora_11192.trc
1 Active Problem Count 0
1 Active Incident Count 0
11 rows selected.
[oracle@57373ded4b19 trace]$ cd /home/oracle/app/grid/diag/asm/+asm/+ASM1/trace
[oracle@57373ded4b19 trace]$ more alert_+ASM1.log
oracle 导出AWR
sqlplus / as sysdba
SQL> @?/rdbms/admin/awrrpt.sql
然后根据提示输入:
- 导出文件类型
‘html’ HTML format (default)
‘text’ Text format
‘active-html’ Includes Performance Hub active report
- 导出的AWR报告天数
- 根据提示输入开始和结束时间点的 Snap Id
- 输入导出的文件名称
即可导出AWR 报告。
每一次生产问题的排查与解决都是从成堆的日志文件中的不知道多少行的多少字符中筛选那么一点信息,去比对定位。
*哎,天天脑壳疼 *