昨天一套核心库出现故障,几百个session堵塞,最后宕机,由于数据库是客服库,很敏感,所以办公室顿时嘈杂起来。
我们处理这类故障的原则是先恢复业务,其它的都随后在展开,在随后的分析工作中,走了点弯路,下面我把整个故障的情况展现给大家。
首先,在alert里面发现了ora-01555,这个是从7点左右开始的,一直持续到下午2点都有这样的事件.
Tue May 22 07:11:56 2012
ORA-01555 caused by SQL statement below (SQL ID: gd31z0zdh33nj, Query Duration=101 sec, SCN: 0x0b3c.37ef1c98)
Tue May 22 08:45:34 2012
Errors in file /u01/oracle/app/oracle/admin/crmkf/bdump/crmkf1_smon_8004.trc:
ORA-30036: unable to extend segment by 4 in undo tablespace 'UNDOTBS1'
直到八点多报出ORA-30036,在随后一直有ORA-01555,的确当时UNDO消耗殆尽,大家都走入了误区,
把数据库Hang都集中在undo相关的因素上,在这个基础上去找原因,但是最后也没啥进展。
看看如下的统计,这是在事故发生的前几分钟的信息:
ORA-01555 caused by SQL statement below (SQL ID: gd31z0zdh33nj, Query Duration=101 sec, SCN: 0x0b3c.37ef1c98):
ORA-01555 caused by SQL statement below (SQL ID: 18hwx0w69f9pg, Query Duration=107 sec, SCN: 0x0b3c.381970a9):
ORA-01555 caused by SQL statement below (SQL ID: 18hwx0w69f9pg, Query Duration=152 sec, SCN: 0x0b3c.380ff215):
ORA-01555 caused by SQL statement below (SQL ID: 18hwx0w69f9pg, Query Duration=73 sec, SCN: 0x0b3c.3822c0ef):
ORA-01555 caused by SQL statement below (SQL ID: dx2gpf5jk1jnu, Query Duration=113 sec, SCN: 0x0b3c.38e2030f):
ORA-01555 caused by SQL statement below (SQL ID: f1v5vww09yzzd, Query Duration=170 sec, SCN: 0x0b3c.3a53f9b0):
ORA-01555 caused by SQL statement below (SQL ID: 948xj46pvh0j4, Query Duration=28 sec, SCN: 0x0b3c.3a84c77d):
ORA-01555 caused by SQL statement below (SQL ID: gd31z0zdh33nj, Query Duration=186 sec, SCN: 0x0b3c.3a7d6206):
ORA-01555 caused by SQL statement below (SQL ID: gd31z0zdh33nj, Query Duration=226 sec, SCN: 0x0b3c.3a76b4ab):
ORA-01555 caused by SQL statement below (SQL ID: dx2gpf5jk1jnu, Query Duration=29 sec, SCN: 0x0b3c.3b871228):
ORA-01555 caused by SQL statement below (SQL ID: 02pnhfvmwua8r, Query Duration=37 sec, SCN: 0x0b3c.3be2e941):
ORA-01555 caused by SQL statement below (SQL ID: 05m5p2na5f8yd, Query Duration=6 sec, SCN: 0x0b3c.3c17f5fa):
ORA-01555 caused by SQL statement below (SQL ID: 1xj2gpb5x1a38, Query Duration=6 sec, SCN: 0x0b3c.3c24d864):
ORA-01555 caused by SQL statement below (SQL ID: 4qcd9jnyt4j27, Query Duration=66 sec, SCN: 0x0b3c.3c4c2d04):
ORA-01555 caused by SQL statement below (SQL ID: 2xp5nc6rmd42t, Query Duration=15 sec, SCN: 0x0b3c.3c63797d):
ORA-01555 caused by SQL statement below (SQL ID: 8r4xzv0amrkyb, Query Duration=64 sec, SCN: 0x0b3c.3c5c7c15):
ORA-01555 caused by SQL statement below (SQL ID: 8v307x8qzc78y, Query Duration=12 sec, SCN: 0x0b3c.3d084c98):
很短小的事务也失败,这个有点不可理解,duration=6的也报出ora-01555。
再来看看当时数据库的等待事件:
col samp_time for a25
col event for a30
select session_id sid,
to_char(SAMPLE_TIME,'YY-MM-DD:HH:MI:SS') samp_time,
session_state,
event,
p1,
p2,
p3,
wait_time,
time_waited
from DBA_HIST_ACTIVE_SESS_HISTORY
where to_char(SAMPLE_TIME,'YY-MM-DD:HH24:MI:SS') between
'12-05-22:14:45:00' and '12-05-22:14:55:00' --and event like'cursor%' and p1=2057858921
order by sample_id desc;
SID SAMP_TIME SESSION_STATE EVENT P1 P2 P3 TIME_WAITED
---------- ------------------------- ------------- ---------------------- ---------- ---------- ---------- -----------
5608 12-05-22:02:52:30 WAITING log file sync 2734 0 0 986328
5613 12-05-22:02:52:30 WAITING log file sync 2734 0 0 986320
5615 12-05-22:02:52:30 WAITING log file sync 2734 0 0 986324
5616 12-05-22:02:52:30 WAITING log file sync 2734 0 0 986314
5619 12-05-22:02:52:30 WAITING log file sync 2734 0 0 986327
5621 12-05-22:02:52:30 WAITING log file sync 2734 0 0 986321
5622 12-05-22:02:52:30 WAITING log file sync 2734 0 0 986311
5623 12-05-22:02:52:30 WAITING enq: FB - contention 1178730502 5 42451933 498021
5625 12-05-22:02:52:30 WAITING log file sync 2734 0 0 986323
5627 12-05-22:02:52:30 WAITING log file sync 2734 0 0 986321
5631 12-05-22:02:52:30 WAITING log file sync 2734 0 0 986322
5633 12-05-22:02:52:30 WAITING log file sync 2734 0 0 986323
5634 12-05-22:02:52:30 WAITING log file sync 2734 0 0 986327
5637 12-05-22:02:52:30 WAITING log file sync 2734 0 0 986308
5638 12-05-22:02:52:30 WAITING enq: FB - contention 1178730502 5 42451933 498022
5640 12-05-22:02:52:30 WAITING log file sync 2734 0 0 986328
5641 12-05-22:02:52:30 WAITING log file sync 4294967295 0 0 986328
5642 12-05-22:02:52:30 WAITING log file sync 2734 0 0 986329
5656 12-05-22:02:52:30 WAITING log file sync 2734 0 0 986315
5658 12-05-22:02:52:30 WAITING log file sync 2734 0 0 986278
这里只摘录了一部分,看看等待的统计信息:
SQL> select event,count(*) from
2 dba_hist_active_sess_history
3 where to_char(SAMPLE_TIME,'YY-MM-DD:HH24:MI:SS') between
4 '12-05-22:14:45:00' and '12-05-22:14:55:00'
5 group by event
6 order by 2 desc;
EVENT COUNT(*)
--------------------------------- --------
log file sync 12370
enq: FB - contention 1018
buffer busy waits 497
db file sequential read 290
db file parallel write 231
direct path read 165
control file sequential read 92
enq: HW - contention 79
log file parallel write 27
control file parallel write 25
read by other session 13
gcs log flush sync 2
wait for scn ack 1
当时数据库基本都在等待log file sync,enq:FB-contention,尤其log file sync等待的时间是s级别。这个基本可以断定IO肯定有问题了,
于是查看系统log,有了实质性的发现,导致redo写入异常,从而引发数据库的不正常,废话少说了,看下面log:
May 22 14:51:14 zjddkf02 vmunix: LVM: WARNING: VG 64 0x090000: LV 3: Some I/O requests to this LV are waiting
May 22 14:51:14 zjddkf02 vmunix: LVM: WARNING: VG 64 0x140000: LV 10: Some I/O requests to this LV are waiting
May 22 14:51:18 zjddkf02 vmunix: LVM: WARNING: VG 64 0x090000: LV 1: Some I/O requests to this LV are waiting
May 22 14:51:19 zjddkf02 vmunix: LVM: WARNING: VG 64 0x140000: LV 12: Some I/O requests to this LV are waiting
May 22 14:51:26 zjddkf02 vmunix: LVM: WARNING: VG 64 0x110000: LV 18: Some I/O requests to this LV are waiting
May 22 14:51:26 zjddkf02 vmunix: LVM: VG 64 0x170000: PVLink 1 0x000078 Failed! The PV is not accessible.
May 22 14:51:29 zjddkf02 vmunix: LVM: VG 64 0x120000: PVLink 1 0x000035 Failed! The PV is not accessible.
May 22 14:51:26 zjddkf02 vmunix: indefinitely for an unavailable PV. These requests will be queued until
May 22 14:51:38 zjddkf02 above message repeats 25 times
May 22 14:51:38 zjddkf02 vmunix: LVM: WARNING: VG 64 0x190000: LV 2: Some I/O requests to this LV are waiting
May 22 14:51:26 zjddkf02 vmunix: the PV becomes available (or a timeout is specified for the LV).
May 22 14:51:38 zjddkf02 above message repeats 25 times
May 22 14:51:38 zjddkf02 vmunix: indefinitely for an unavailable PV. These requests will be queued until
May 22 14:51:38 zjddkf02 vmunix: the PV becomes available (or a timeout is specified for the LV).
May 22 14:51:39 zjddkf02 vmunix: LVM: VG 64 0x090000: PVLink 1 0x000054 Failed! The PV is not accessible.
May 22 14:51:41 zjddkf02 vmunix: LVM: WARNING: VG 64 0x190000: LV 14: Some I/O requests to this LV are waiting
May 22 14:51:47 zjddkf02 vmunix: LVM: VG 64 0x130000: PVLink 1 0x000037 Failed! The PV is not accessible.
May 22 14:51:47 zjddkf02 vmunix: LVM: VG 64 0x130000: PVLink 1 0x000038 Failed! The PV is not accessible.
May 22 14:51:41 zjddkf02 vmunix: indefinitely for an unavailable PV. These requests will be queued until
May 22 14:51:47 zjddkf02 vmunix: LVM: VG 64 0x130000: PVLink 1 0x00003c Failed! The PV is not accessible.
May 22 14:51:41 zjddkf02 vmunix: the PV becomes available (or a timeout is specified for the LV).
May 22 14:51:47 zjddkf02 vmunix: LVM: VG 64 0x130000: PVLink 1 0x000044 Failed! The PV is not accessible.
May 22 14:51:49 zjddkf02 cmdisklockd[4459]: Still trying to inquire cluster lock disk /dev/disk/disk274
May 22 14:51:59 zjddkf02 vmunix: LVM: VG 64 0x180000: PVLink 1 0x000083 Failed! The PV is not accessible.
May 22 14:51:59 zjddkf02 vmunix: LVM: VG 64 0x180000: PVLink 1 0x000085 Failed! The PV is not accessible.
...
May 22 14:55:37 zjddkf02 above message repeats 4 times
May 22 14:55:37 zjddkf02 vmunix: LVM: WARNING: VG 64 0x110000: LV 19: Some I/O requests to this LV are waiting
May 22 14:55:38 zjddkf02 vmunix: LVM: VG 64 0x090000: PVLink 1 0x000054 Recovered.
May 22 14:55:38 zjddkf02 vmunix: LVM: NOTICE: VG 64 0x090000: LV 1: All I/O requests to this LV that were
May 22 14:55:37 zjddkf02 vmunix: indefinitely for an unavailable PV. These requests will be queued until
May 22 14:55:38 zjddkf02 vmunix: waiting indefinitely for an unavailable PV have now completed.
May 22 14:55:37 zjddkf02 vmunix: the PV becomes available (or a timeout is specified for the LV).
May 22 14:55:38 zjddkf02 vmunix: LVM: NOTICE: VG 64 0x090000: LV 3: All I/O requests to this LV that were
May 22 14:55:38 zjddkf02 vmunix: LVM: NOTICE: VG 64 0x090000: LV 5: All I/O requests to this LV that were
May 22 14:55:38 zjddkf02 vmunix: LVM: NOTICE: VG 64 0x090000: LV 4: All I/O requests to this LV that were
May 22 14:55:38 zjddkf02 vmunix: LVM: VG 64 0x140000: PVLink 1 0x000045 Recovered.
May 22 14:55:38 zjddkf02 vmunix: LVM: VG 64 0x140000: PVLink 1 0x000046 Recovered.
May 22 14:55:38 zjddkf02 vmunix: LVM: VG 64 0x140000: PVLink 1 0x000047 Recovered.
May 22 14:55:38 zjddkf02 vmunix: waiting indefinitely for an unavailable PV have now completed.
May 22 14:55:38 zjddkf02 above message repeats 3 times
May 22 14:55:38 zjddkf02 vmunix: LVM: VG 64 0x140000: PVLink 1 0x000049 Recovered.
May 22 14:55:38 zjddkf02 vmunix: LVM: VG 64 0x140000: PVLink 1 0x000050 Recovered.
May 22 14:55:38 zjddkf02 vmunix: LVM: VG 64 0x140000: PVLink 1 0x000053 Recovered.
May 22 14:55:38 zjddkf02 vmunix: LVM: VG 64 0x140000: PVLink 1 0x000051 Failed! The PV is not accessible.
May 22 14:55:38 zjddkf02 vmunix: LVM: NOTICE: VG 64 0x140000: LV 5: All I/O requests to this LV that were
May 22 14:55:38 zjddkf02 vmunix: waiting indefinitely for an unavailable PV have now completed.
May 22 14:55:38 zjddkf02 vmunix: LVM: NOTICE: VG 64 0x140000: LV 16: All I/O requests to this LV that were
May 22 14:55:38 zjddkf02 vmunix: LVM: NOTICE: VG 64 0x140000: LV 2: All I/O requests to this LV that were
May 22 14:55:38 zjddkf02 vmunix: LVM: WARNING: VG 64 0x140000: LV 12: Some I/O requests to this LV are waiting
May 22 14:55:38 zjddkf02 vmunix: indefinitely for an unavailable PV. These requests will be queued until
May 22 14:55:38 zjddkf02 vmunix: the PV becomes available (or a timeout is specified for the LV).
May 22 14:55:38 zjddkf02 vmunix: waiting indefinitely for an unavailable PV have now completed.
May 22 14:55:38 zjddkf02 vmunix: LVM: WARNING: VG 64 0x190000: LV 14: Some I/O requests to this LV are waiting
May 22 14:55:38 zjddkf02 vmunix: LVM: WARNING: VG 64 0x130000: LV 16: Some I/O requests to this LV are waiting
May 22 14:55:38 zjddkf02 vmunix: indefinitely for an unavailable PV. These requests will be queued until
May 22 14:55:38 zjddkf02 above message repeats 2 times
May 22 14:55:38 zjddkf02 vmunix: the PV becomes available (or a timeout is specified for the LV).
May 22 14:55:38 zjddkf02 above message repeats 2 times
May 22 14:55:39 zjddkf02 vmunix: LVM: VG 64 0x150000: PVLink 1 0x000056 Recovered.
May 22 14:55:39 zjddkf02 vmunix: LVM: VG 64 0x150000: PVLink 1 0x000057 Failed! The PV is not accessible.
May 22 14:55:39 zjddkf02 vmunix: LVM: VG 64 0x150000: PVLink 1 0x000058 Failed! The PV is not accessible.
May 22 14:55:39 zjddkf02 vmunix: LVM: NOTICE: VG 64 0x150000: LV 29: All I/O requests to this LV that were
May 22 14:55:39 zjddkf02 vmunix: waiting indefinitely for an unavailable PV have now completed.
May 22 14:55:39 zjddkf02 vmunix: LVM: VG 64 0x120000: PVLink 1 0x000027 Recovered.
May 22 14:55:39 zjddkf02 vmunix: LVM: VG 64 0x120000: PVLink 1 0x000028 Recovered.
May 22 14:55:39 zjddkf02 vmunix: LVM: VG 64 0x120000: PVLink 1 0x000029 Recovered.
May 22 14:55:39 zjddkf02 vmunix: LVM: NOTICE: VG 64 0x120000: LV 27: All I/O requests to this LV that were
May 22 14:55:39 zjddkf02 vmunix: waiting indefinitely for an unavailable PV have now completed.
May 22 14:55:39 zjddkf02 vmunix: LVM: NOTICE: VG 64 0x130000: LV 16: All I/O requests to this LV that were
May 22 14:55:39 zjddkf02 vmunix: LVM: VG 64 0x180000: PVLink 1 0x000083 Recovered.
May 22 14:55:39 zjddkf02 vmunix: LVM: VG 64 0x180000: PVLink 1 0x000084 Recovered.
May 22 14:55:39 zjddkf02 vmunix: waiting indefinitely for an unavailable PV have now completed.
摘录部分信息,从14:51:14开始出现问题,pv不可得,物理链路丢失,但是到14:55:39已经recovered完成。唉,悲催的,有人动了SAN的相关配置文件。
又是误操作,,,,
再来说说UNDO tablespace,其实我们的UNDO完全是够用的,看看现在的配置:
SQL> show parameter undo
NAME TYPE VALUE
------------------------------------ -------------------------------- ------------------------------
_gc_undo_affinity boolean FALSE
undo_management string AUTO
undo_retention integer 36000
undo_tablespace string UNDOTBS1
SQL> SELECT d.undo_size/(1024*1024) "ACTUAL UNDO SIZE [MByte]",
2 SUBSTR(e.value,1,25) "UNDO RETENTION [Sec]",
3 (TO_NUMBER(e.value) * TO_NUMBER(f.value) *
4 g.undo_block_per_sec) / (1024*1024)
5 "NEEDED UNDO SIZE [MByte]"
6 FROM (
7 SELECT SUM(a.bytes) undo_size
8 FROM v$datafile a,
9 v$tablespace b,
10 dba_tablespaces c
11 WHERE c.contents = 'UNDO'
12 AND c.status = 'ONLINE'
13 AND b.name = c.tablespace_name
14 AND a.ts# = b.ts#
15 ) d,
16 v$parameter e,
17 v$parameter f,
18 (
19 SELECT MAX(undoblks/((end_time-begin_time)*3600*24))
20 undo_block_per_sec
21 FROM v$undostat
22 ) g
23 WHERE e.name = 'undo_retention'
24 AND f.name = 'db_block_size'
25 /
ACTUAL UNDO SIZE [MByte] UNDO RETENTION [Sec] NEEDED UNDO SIZE [MByte]
------------------------ ----------------------------------- ------------------------
96000 36000 35242.5
要保证undo retention需要的表空间大小事35242.5M,我们的UNDO表空间大小是96000M,为什么早晨会开始报ORA-01555呢,
经过和应用沟通,最近他们把本来晚上跑的批处理,报表,放在白天开始跑。隐隐蛋疼的感觉,这么大的操作,也能随意调时间,
白天业务忙,事务量高,一个报表都要跑1-2小时,不报01555才怪呢。
汇报到这里吧,IT系统有风险,操作需谨慎,可惜误操作的哥们,撞到雷了,,,,一定要小心,小心,再小心