SMON: recover undo segment与并行事务恢复
现象:
20150107早上10点
客户反映数据库响应很慢,系统cpu很忙、io很高,业务请求几乎无法响应、时好时坏;
smon回滚产生的trc文件不停地增长迅速,几个G;
最近操作:
20150106晚上做数据库维护,备份数据、drop某个业务表、建索引,导入数据,数据量大概几个G,几十万条记录;
第二天早上发现导入还没完成,怕影响白天的生产,于是手工取消了操作;
接下来发现异常,数据库响应变慢,smon回滚产生的trc文件增长迅速;
客户早上10点shutdown abort关闭了数据库后,重新启动了数据库;
检查告警日志:
数据库版本 9.2.0.1.0.
操作系统 AIX 5.3
Wed Jan 7 09:37:40 2015
ARC1: Evaluating archive log 2 thread 1 sequence 83967
ARC1: Beginning to archive log 2 thread 1 sequence 83967
Creating archive destination LOG_ARCHIVE_DEST_1: '/oradata/admin/orcl/archive/orcl_83967_1.arc'
Wed Jan 7 09:37:58 2015
ARC1: Completed archiving log 2 thread 1 sequence 83967
Wed Jan 7 09:38:35 2015
Errors in file /oradata/admin/orcl/udump/orcl_ora_1302584.trc:
ORA-00600: internal error code, arguments: [6122], [0], [1], [0], [], [], [], []
Wed Jan 7 09:38:38 2015
Recovery of Online Redo Log: Thread 1 Group 3 Seq 83968 Reading mem 0
Mem# 0 errs 0: /oradata/orcl/redo03.log
Recovery of Online Redo Log: Thread 1 Group 3 Seq 83968 Reading mem 0
Mem# 0 errs 0: /oradata/orcl/redo03.log
Wed Jan 7 10:00:44 2015
Shutting down instance: further logons disabled
Wed Jan 7 10:01:27 2015
ARC1: Completed archiving log 2 thread 1 sequence 83985
Wed Jan 7 10:01:27 2015
Shutting down instance (immediate)
Wed Jan 7 10:06:44 2015
Starting ORACLE instance (normal)
Wed Jan 7 10:07:37 2015
SMON: enabling tx recovery
Wed Jan 7 10:07:37 2015
Database Characterset is ZHS16GBK
Wed Jan 7 10:07:37 2015
SMON: about to recover undo segment 45
SMON: about to recover undo segment 45
SMON: mark undo segment 45 as available
SMON: about to recover undo segment 45
……
SMON: mark undo segment 45 as available
Wed Jan 7 10:24:04 2015
Undo Segment 43 Onlined
Wed Jan 7 10:24:04 2015
Undo Segment 44 Onlined
Wed Jan 7 10:24:05 2015
Undo Segment 45 Onlined
Wed Jan 7 10:24:05 2015
Undo Segment 46 Onlined
Wed Jan 7 10:24:05 2015
Undo Segment 47 Onlined
Wed Jan 7 10:34:26 2015
ARC1: Evaluating archive log 1 thread 1 sequence 84008
ARC1: Beginning to archive log 1 thread 1 sequence 84008
Creating archive destination LOG_ARCHIVE_DEST_1: '/oradata/admin/orcl/archive/orcl_84008_1.arc'
Wed Jan 7 10:35:01 2015
Errors in file /oradata/admin/orcl/udump/orcl_ora_1577162.trc:
ORA-00600: internal error code, arguments: [6122], [0], [1], [0], [], [], [], []
Wed Jan 7 10:35:04 2015
Recovery of Online Redo Log: Thread 1 Group 2 Seq 84009 Reading mem 0
Mem# 0 errs 0: /oradata/orcl/redo02.log
Wed Jan 7 10:35:11 2015
ARC1: Completed archiving log 1 thread 1 sequence 84008
Wed Jan 7 10:35:51 2015
Thread 1 advanced to log sequence 84010
Current log# 3 seq# 84010 mem# 0: /oradata/orcl/redo03.log
Wed Jan 7 10:35:51 2015
ARC0: Evaluating archive log 2 thread 1 sequence 84009
ARC0: Beginning to archive log 2 thread 1 sequence 84009
Creating archive destination LOG_ARCHIVE_DEST_1: '/oradata/admin/orcl/archive/orcl_84009_1.arc'
Wed Jan 7 10:36:20 2015
ARC0: Completed archiving log 2 thread 1 sequence 84009
Wed Jan 7 10:36:46 2015
Errors in file /oradata/admin/orcl/udump/orcl_ora_1577162.trc:
ORA-00600: internal error code, arguments: [6122], [0], [1], [0], [], [], [], []
分析与处理:
客户早上取消了导入操作,数据库需要回滚,10点钟客户使用shutdown abort方式关闭了数据库,随后启动数据库,10点07分开始大事务回滚,告警日志信息表明45号回滚段出现问题,SMON不断尝试恢复该回滚段,同时不断产生trace占用空间;这个恢复尝试占用了大量的cpu资源。
询问用户当时的情况是,数据库运行及其缓慢,用户请求得不到响应,就强行关闭了数据库。在此提醒大家,使用abort方式关闭数据库是具有相当风险的,具体执行时应该相当谨慎。
进一步检查数据库的当前状态,查询视图,发现数据库当前存在大量等待:
select sid,event from v$session_wait where event not like 'SQL%';
其中存在大量队列竞争,我们进而检查视图v$lock;
SQL> select * from v$lock where type<>'MR';
从以上输出中注意到,大量session的请求都被阻塞,而阻塞这些session的进程正是smon进程(sid=5)
最终解决方法
调整参数
设置fast_start_parallel_rollback参数为false,关闭数据库的并行恢复功能,重启数据库,数据库正常,故障消失。
ALTER SYSTEM SET fast_start_parallel_rollback='FALSE' SCOPE=BOTH;
客户反映,磁盘io等待从100%迅速下降,系统响应恢复正常,可以了。
FAST_START_PARALLEL_ROLLBACK determines the maximum number of processes that can exist for performing parallel rollback. This parameter is useful on systems in which some or all of the transactions are long running.
- FALSE indicates that parallel rollback is disabled
- LOW limits the number of rollback processes to 2 * CPU_COUNT
- HIGH limits the number of rollback processes to 4 * CPU_COUNT
通过这三种情况下的每秒钟回滚undo数据块数量比较可以知道在LOW状态下最快,HIGH状态下次之,FALSE最慢。其实这个实验没有任何实际说明力,只是想说明几个问题:
1)Oracle大事物回滚,是没有办法取消,但是可以通过FAST_START_PARALLEL_ROLLBACK干预回滚速度
2)数据库的并发效率高于低,取决于系统的资源情况(如果你系统的cpu非常强大,那么可能设置HIGH速度最快)
3)回滚的数据类型,在回滚表中数据时可能设置并发比FALSE快,
但是如果是要回滚串行数据(如:index),那么可能串行方法方式速度更快
4)根据你的系统的使用状况,比如你想让系统的业务受到的影响最小,那么设置FALSE可能是个不错的选择。
遗留问题
密切监控
ORA-00600: 内部错误代码,参数: [6122] 索引损坏 需要重建或者删除后再建立
参考文件
http://www.eygle.com/archives/2010/02/smon_recover_undo_segment.html
http://blog.itpub.net/7199859/viewspace-608944/
http://www.xifenfei.com/2534.html
http://blog.itpub.net/26634508/viewspace-735208/
来自 “ ITPUB博客 ” ,链接:http://blog.itpub.net/21256317/viewspace-1813783/,如需转载,请注明出处,否则将追究法律责任。
转载于:http://blog.itpub.net/21256317/viewspace-1813783/