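This morning I found that I could not log in to the database on the Grid Control server; the operating system had switched to read-only mode.
Platform: Oracle Linux 5.6, Oracle 11gR2
Main symptoms: -- Cannot log in to the database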
[oracle@gc ~]$ sqlplus "/as sysdba"
SQL*Plus: Release 11.2.0.2.0 Production on Thu Dec 1 09:14:29 2011
Copyright (c) 1982, 2010, Oracle. All rights reserved.
ERROR:
ORA-09925: Unable to create audit trail file
Linux-x86_64 Error: 30: Read-only file system
Additional information: 9925
ORA-09925: Unable to create audit trail file
Linux-x86_64 Error: 30: Read-only file system
Additional information: 9925
-- Errors in the alert log
Thread 1 cannot allocate new log, sequence 13854
Private strand flush not complete
Current log# 2 seq# 13853 mem# 0: /u01/app/ora11g/oradata/gcdb/redo02.log
Thread 1 advanced to log sequence 13854 (LGWR switch)
Current log# 3 seq# 13854 mem# 0: /u01/app/ora11g/oradata/gcdb/redo03.log
Thu Dec 01 04:58:26 2011
Archived Log entry 13846 added for thread 1 sequence 13853 ID 0x6b0802ee dest 1:
Thu Dec 01 05:13:18 2011
KCF: read, write or open error, block=0x2f3 online=1
file=7 '/u01/app/ora11g/oradata/gcdb/mgmt.dbf'
error=27072 txt: 'Linux-x86_64 Error: 5: Input/output error
Additional information: 4
Additional information: 755
Additional information: -1'
-- The file system is read-only; files cannot be created
[oracle@gc trace]$ touch test
touch: cannot touch `test': Read-only file system
[oracle@gc trace]$ mkdir test
mkdir: cannot create directory `test': Read-only file system
[oracle@gc gcdb]$ su - root
[root@gc ~]# touch test
touch: cannot touch `test': Read-only file system
[root@gc ~]# rm -f install.log.syslog
rm: cannot remove `install.log.syslog': Read-only file system
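Rather than probing directories one at a time with touch, the kernel's mount table can be asked directly which file systems are currently read-only. The following is only a minimal sketch: the function name ro_mounts and its optional file argument are my own for illustration, and it assumes the usual /proc/mounts column layout (device, mount point, type, options).

```shell
# List mount points whose option list contains "ro".
# Input is a mounts table in /proc/mounts format:
#   device mountpoint fstype options dump pass
# The optional file argument (hypothetical, for testing against a saved
# copy) defaults to /proc/mounts.
ro_mounts() {
    awk '{
        n = split($4, opts, ",")
        for (i = 1; i <= n; i++)
            if (opts[i] == "ro") {
                # Print: mountpoint (device, fstype)
                print $2, "(" $1 ", " $3 ")"
                break
            }
    }' "${1:-/proc/mounts}"
}
```

Calling ro_mounts with no argument reads the live table. It matches the exact option "ro" only, so an entry such as errors=remount-ro on a file system that is still writable is not reported.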
-- System log
[root@gc log]# cd /var/log
[root@gc log]# more messages
Nov 27 04:02:02 gc syslogd 1.4.1: restart.
Nov 27 09:50:01 gc auditd[3207]: Audit daemon rotating log files
Nov 28 17:23:14 gc avahi-daemon[4053]: Invalid query packet.
Nov 28 17:29:23 gc avahi-daemon[4053]: Invalid query packet.
Nov 28 17:31:45 gc last message repeated 13 times
Nov 28 17:32:24 gc last message repeated 6 times
Dec 1 05:08:00 gc kernel: INFO: task kjournald:360 blocked for more than 120 seconds.
Dec 1 05:08:00 gc kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Dec 1 05:08:00 gc kernel: kjournald D 0000000000000006 0 360 2 0x00000000
Dec 1 05:08:00 gc kernel: ffff880c235d1c40 0000000000000046 0000000000000000 ffffffffb1b637ea
Dec 1 05:08:00 gc kernel: ffff880c235c4400 ffff880624dda100 ffff880c235c47d8 00000003261137fc
Dec 1 05:08:00 gc kernel: 00000000235d1cd0 0000000000000000 0000000000000000 ffff880c235c4400
Dec 1 05:08:00 gc kernel: Call Trace:
Dec 1 05:08:00 gc kernel: [] io_schedule+0x42/0x5c
Dec 1 05:08:00 gc kernel: [] sync_buffer+0x2a/0x2e
Dec 1 05:08:00 gc kernel: [] __wait_on_bit+0x4a/0x7c
Dec 1 05:08:00 gc kernel: [] ? sync_buffer+0x0/0x2e
Dec 1 05:08:00 gc kernel: [] ? sync_buffer+0x0/0x2e
Dec 1 05:08:00 gc kernel: [] out_of_line_wait_on_bit+0x73/0x80
Dec 1 05:13:18 gc auditd[3207]: fsync: Audit daemon detected an error writing an event to disk (Input/output error)
Dec 1 05:13:18 gc kernel: [] ? wake_bit_function+0x0/0x2f
Dec 1 05:13:18 gc kernel: [] ? submit_bh+0x136/0x144
Dec 1 05:13:18 gc kernel: [] __wait_on_buffer+0x24/0x26
Dec 1 05:13:18 gc kernel: [] wait_on_buffer+0x31/0x35
Dec 1 05:13:18 gc kernel: [] journal_commit_transaction+0x4d9/0xe34
[root@gc log]#
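Paging through messages with more works, but on a busy server the few relevant lines are easy to miss. A rough sketch of a filter is below. The function name storage_errors is hypothetical, and the pattern list is only my guess at what is worth looking for: the first two patterns appear in the log above, while "I/O error" and "Remounting filesystem read-only" are other kernel messages that did not show up in this excerpt.

```shell
# Print lines from a syslog-style file that match a few storage-related
# patterns. The list is illustrative, not exhaustive.
# The optional file argument defaults to /var/log/messages.
storage_errors() {
    grep -E 'blocked for more than [0-9]+ seconds|Input/output error|I/O error|Remounting filesystem read-only' \
        "${1:-/var/log/messages}"
}
```

Run as root, storage_errors prints the matching lines with their timestamps, which makes it easier to line them up against the times in the database alert log.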
-- The klogd process was consuming a large share of CPU and could not be killed
[oracle@gc trace]$ top
top - 09:29:20 up 106 days, 23:59, 0 users, load average: 7.16, 7.11, 7.02
Tasks: 348 total, 5 running, 343 sleeping, 0 stopped, 0 zombie
Cpu(s): 1.3%us, 16.5%sy, 0.0%ni, 52.9%id, 29.3%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 49548068k total, 31446072k used, 18101996k free, 937524k buffers
Swap: 40957676k total, 0k used, 40957676k free, 27066012k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
3235 root 20 0 3820 440 352 R 47.6 0.0 131:28.80 klogd
3232 root 20 0 5924 640 516 S 37.4 0.0 90:08.47 syslogd
23128 oracle 20 0 483m 16m 13m D 35.4 0.0 89:35.41 oracle
22940 oracle 20 0 483m 16m 13m R 34.7 0.0 89:54.43 oracle
23210 oracle 20 0 483m 16m 13m R 34.7 0.0 89:04.26 oracle
22894 oracle 20 0 483m 16m 13m D 34.4 0.0 90:29.37 oracle
24512 oracle 20 0 482m 14m 12m D 34.4 0.0 84:05.51 oracle
23273 oracle 20 0 483m 16m 13m R 34.1 0.0 89:34.17 oracle
6352 oracle 20 0 484m 17m 14m D 33.7 0.0 108:53.11 oracle
491 oracle 20 0 12888 1316 820 S 0.7 0.0 0:00.38 top
56 root 20 0 0 0 0 S 0.3 0.0 2:46.82 events/5
490 oracle 20 0 12888 1312 820 R 0.3 0.0 0:00.37 top
3294 root 20 0 0 0 0 S 0.3 0.0 20:57.02 kondemand/0
3898 root 20 0 66960 2324 784 S 0.3 0.0 2:46.35 sendmail
4020 oracle 20 0 365m 39m 19m S 0.3 0.1 207:20.91 ohasd.bin
4506 oracle 20 0 1526m 662m 29m S 0.3 1.4 509:29.38 java
6197 oracle 20 0 242m 20m 10m S 0.3 0.0 46:14.85 cssdagent
6217 oracle 20 0 156m 16m 8724 S 0.3 0.0 11:28.67 diskmon.bin
1 root 20 0 10364 692 580 S 0.0 0.0 2:13.02 init
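Several of the oracle processes in the top output are in state D (uninterruptible sleep, usually waiting on I/O), which fits the high %wa figure. A small sketch for pulling out just those processes follows. The function name d_state is my own, and it assumes input in the three-column form produced by ps -eo pid,stat,comm.

```shell
# Filter "pid stat comm" lines down to processes whose state starts
# with D (uninterruptible sleep). Prints "pid comm".
# Only the first word of the command name is kept.
d_state() {
    awk '$2 ~ /^D/ { print $1, $3 }'
}

# Typical use (not executed here):
#   ps -eo pid,stat,comm | d_state
```

Processes stuck in D state cannot be killed until the I/O they are waiting on completes or fails, so a growing list here is a hint that the problem sits below the database.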
-- Attempting to remount read-write failed
[root@gc ~]# mount -o remount,rw /
mount: block device /dev/sdc2 is write-protected, mounting read-only
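Analysis: I searched online and found that other people have hit the same problem. The commonly suggested fix is:
try a remount first; if that fails, boot into rescue mode and run fsck.
The procedure is as follows:
Boot into Linux rescue mode (set the machine to boot from CD, insert the installation disc, type linux rescue at the boot prompt of the installer, press Enter, and follow the wizard).
Once at the shell, run the following to repair the / file system.
# umount /
# fsck -fn /     (check only, answering "no" to every prompt)
or
# fsck -fy /     (repair, answering "yes" to every prompt)
After the repair completes:
# mount /
# reboot -f
Note: in rescue mode the installed root file system is normally mounted under /mnt/sysimage rather than /, so in practice the umount and fsck target would be that mount point or the underlying device (/dev/sdc2 on this server, according to the remount message above). fsck should only repair a file system that is not mounted read-write.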
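When running fsck by hand it helps to check its exit status afterwards, because the status is a bit mask and not a simple pass/fail. The sketch below decodes it using the values listed in the fsck(8) manual page. The function name fsck_status is hypothetical.

```shell
# Decode the exit status of fsck(8), which is the sum of these bits:
#   1 errors corrected          2 system should be rebooted
#   4 errors left uncorrected   8 operational error
#  16 usage or syntax error    32 canceled by user request
# 128 shared-library error
fsck_status() {
    code=$1
    if [ "$code" -eq 0 ]; then
        echo "no errors"
        return 0
    fi
    msgs=""
    for pair in \
        "1:errors corrected" \
        "2:system should be rebooted" \
        "4:errors left uncorrected" \
        "8:operational error" \
        "16:usage or syntax error" \
        "32:canceled by user request" \
        "128:shared-library error"
    do
        bit=${pair%%:*}     # number before the colon
        text=${pair#*:}     # description after the colon
        if [ $(( code & bit )) -ne 0 ]; then
            msgs="${msgs:+$msgs; }$text"
        fi
    done
    echo "$msgs"
}
```

For example, after a check one could run fsck_status $? to see at a glance whether errors were left uncorrected (bit 4) or a reboot is being requested (bit 2).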
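Resolution: -- Tried to shut the system down with a command; there was no response
[root@gc ~]# shutdown -h now
-- After that, messages scrolled continuously across the console
-- In the end I cut the power and restarted, and everything came back normally
-- An error was reported during system startup, but it did not affect use of the system; neither the database nor the operating system showed any abnormality.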
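Summary: this problem was odd. It happened at around 5 a.m., so it is unlikely that someone changed system files by mistake; that leaves a system defect or bug as the likely cause. The logs point in the same direction: both the alert log and /var/log/messages record Input/output errors at 05:13:18, kjournald was blocked in a journal commit, and the remount attempt reported /dev/sdc2 as write-protected, which suggests an I/O failure below the file system, although the root cause was never confirmed.
Fortunately it happened on the monitoring server, so there was time to look into the cause and possible fixes, even though in the end the only way out was to restart the system. In a production environment it would probably have been a much bigger scramble.
At least I will know what to expect if this happens in production next time. It seems Linux is not completely stable and reliable either.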