1. Problem:
An environment running CentOS 6.8 + Oracle 11.2.0.1 RAC. Over the past few days, the two hosts have each been rebooting repeatedly.
2. Troubleshooting:
The database alert log shows:
opiodr aborting process unknown ospid (65714) as a result of ORA-609
Fri Apr 02 04:19:48 2021
Process q002 died, see its trace file
Fri Apr 02 04:34:57 2021
PMON (ospid: 11122): terminating the instance due to error 471
Fri Apr 02 04:34:57 2021
System state dump is made for local instance
System State dumped to trace file /home/app/oracle/diag/rdbms/xxdb/xxdb1/trace/xxdb1_diag_11130.trc
Fri Apr 02 04:34:58 2021
ORA-1092 : opitsk aborting process
Fri Apr 02 04:34:58 2021
License high water mark = 64
Instance terminated by PMON, pid = 11122
USER (ospid: 66586): terminating the instance
Instance terminated by USER, pid = 66586
Fri Apr 02 04:35:33 2021
Starting ORACLE instance (normal)
LICENSE_MAX_SESSION = 0
The OS log (/var/log/messages) shows:
Apr 1 11:14:05 XXRAC01 ntpd[8971]: 0.0.0.0 0515 05 clock_sync
Apr 1 15:07:38 XXRAC01 sshd[24383]: Accepted password for oracle from 192.168.108.80 port 57869 ssh2
Apr 2 04:19:06 XXRAC01 kernel: INFO: task jbd2/dm-11-8:6568 blocked for more than 120 seconds.
Apr 2 04:19:06 XXRAC01 kernel: Not tainted 2.6.32-642.el6.x86_64 #1
Apr 2 04:19:06 XXRAC01 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Apr 2 04:19:06 XXRAC01 kernel: jbd2/dm-11-8 D 0000000000000032 0 6568 2 0x00000000
Apr 2 04:19:06 XXRAC01 kernel: ffff884017f23d20 0000000000000046 0000000000000000 ffff884017f23ce4
Apr 2 04:19:06 XXRAC01 kernel: ffff880100000000 ffff88407fe87600 000038b1992f04b9 ffff880191b56ec0
Apr 2 04:19:06 XXRAC01 kernel: 0000000000000200 0000000103b37a02 ffff88401339b068 ffff884017f23fd8
Apr 2 04:19:06 XXRAC01 kernel: Call Trace:
Apr 2 04:19:06 XXRAC01 kernel: [<ffffffffa04c080f>] jbd2_journal_commit_transaction+0x19f/0x14f0 [jbd2]
Apr 2 04:19:06 XXRAC01 kernel: [<ffffffff8108f14c>] ? lock_timer_base+0x3c/0x70
Apr 2 04:19:06 XXRAC01 kernel: [<ffffffff810a6ac0>] ? autoremove_wake_function+0x0/0x40
Apr 2 04:19:06 XXRAC01 kernel: [<ffffffffa04c6a38>] kjournald2+0xb8/0x220 [jbd2]
Apr 2 04:19:06 XXRAC01 kernel: [<ffffffff810a6ac0>] ? autoremove_wake_function+0x0/0x40
Apr 2 04:19:06 XXRAC01 kernel: [<ffffffffa04c6980>] ? kjournald2+0x0/0x220 [jbd2]
Apr 2 04:19:06 XXRAC01 kernel: [<ffffffff810a662e>] kthread+0x9e/0xc0
Apr 2 04:19:06 XXRAC01 kernel: [<ffffffff8100c28a>] child_rip+0xa/0x20
Apr 2 04:19:06 XXRAC01 kernel: [<ffffffff810a6590>] ? kthread+0x0/0xc0
Apr 2 04:19:06 XXRAC01 kernel: [<ffffffff8100c280>] ? child_rip+0x0/0x20
Apr 2 04:19:06 XXRAC01 kernel: INFO: task oraagent.bin:9962 blocked for more than 120 seconds.
Apr 2 04:19:06 XXRAC01 kernel: Not tainted 2.6.32-642.el6.x86_64 #1
Apr 2 04:19:06 XXRAC01 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Apr 2 04:19:06 XXRAC01 kernel: oraagent.bin D 0000000000000015 0 9962 1 0x00000000
Apr 2 04:19:06 XXRAC01 kernel: ffff881a456aba88 0000000000000086 0000000000000000 ffffffff8117fb38
Apr 2 04:19:06 XXRAC01 kernel: ffff881a456aba70 ffff8814f3b79000 000038b1a838d49b ffff8840103ec7e0
Apr 2 04:19:06 XXRAC01 kernel: ffff881a456aba28 0000000103b37ab5 ffff884017215068 ffff881a456abfd8
Apr 2 04:19:06 XXRAC01 kernel: Call Trace:
Apr 2 04:19:06 XXRAC01 kernel: [<ffffffff8117fb38>] ? ____cache_alloc_node+0x108/0x160
Apr 2 04:19:06 XXRAC01 kernel: [<ffffffffa04befca>] start_this_handle+0x25a/0x480 [jbd2]
Apr 2 04:19:06 XXRAC01 kernel: [<ffffffff810a6ac0>] ? autoremove_wake_function+0x0/0x40
Apr 2 04:19:06 XXRAC01 kernel: [<ffffffffa04bf3d5>] jbd2_journal_start+0xb5/0x100 [jbd2]
Apr 2 04:19:06 XXRAC01 kernel: [<ffffffffa04fef96>] ext4_journal_start_sb+0x56/0xe0 [ext4]
Apr 2 04:19:06 XXRAC01 kernel: [<ffffffffa04e7f6a>] ext4_dirty_inode+0x2a/0x60 [ext4]
Apr 2 04:19:06 XXRAC01 kernel: [<ffffffff811c695b>] __mark_inode_dirty+0x3b/0x160
Apr 2 04:19:06 XXRAC01 kernel: [<ffffffff811b7032>] file_update_time+0xf2/0x170
Apr 2 04:19:06 XXRAC01 kernel: [<ffffffff811309e0>] __generic_file_aio_write+0x230/0x490
Apr 2 04:19:06 XXRAC01 kernel: [<ffffffff81130cc8>] generic_file_aio_write+0x88/0x100
Apr 2 04:19:06 XXRAC01 kernel: [<ffffffffa04e1e08>] ext4_file_write+0x58/0x190 [ext4]
Apr 2 04:19:06 XXRAC01 kernel: [<ffffffff81199b4a>] do_sync_write+0xfa/0x140
Apr 2 04:19:06 XXRAC01 kernel: [<ffffffff810a6ac0>] ? autoremove_wake_function+0x0/0x40
Apr 2 04:19:06 XXRAC01 kernel: [<ffffffff8123ac26>] ? security_file_permission+0x16/0x20
Apr 2 04:19:06 XXRAC01 kernel: [<ffffffff81199e48>] vfs_write+0xb8/0x1a0
Apr 2 04:19:06 XXRAC01 kernel: [<ffffffff8119b35f>] ? fget_light_pos+0x3f/0x50
Apr 2 04:19:06 XXRAC01 kernel: [<ffffffff8119a981>] sys_write+0x51/0xb0
Apr 2 04:19:06 XXRAC01 kernel: [<ffffffff8100b0d2>] system_call_fastpath+0x16/0x1b
Apr 2 04:19:06 XXRAC01 kernel: INFO: task cssdmonitor:10000 blocked for more than 120 seconds.
Apr 2 04:19:06 XXRAC01 kernel: Not tainted 2.6.32-642.el6.x86_64 #1
Apr 2 04:19:06 XXRAC01 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Apr 2 04:19:06 XXRAC01 kernel: cssdmonitor D 0000000000000020 0 10000 1 0x00000000
Apr 2 04:19:06 XXRAC01 kernel: ffff881a4573ba88 0000000000000086 0000000000000000 0000000200000000
Apr 2 04:19:06 XXRAC01 kernel: 000000010000001e ffff880191dd0b00 000038b2702f3525 0000000000000008
Apr 2 04:19:06 XXRAC01 kernel: 0000000000016ec0 0000000103b3883c ffff88401cb27ad8 ffff881a4573bfd8
Apr 2 04:19:06 XXRAC01 kernel: Call Trace:
Apr 2 04:27:42 XXRAC01 kernel: Out of memory: Kill process 65090 (oracle) score 111 or sacrifice child
Apr 2 04:27:42 XXRAC01 kernel: Killed process 65090, UID 54321, (oracle) total-vm:157506752kB, anon-rss:6100kB, file-rss:31659480kB
Apr 2 04:27:55 XXRAC01 kernel: oracle invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=0, oom_score_adj=0
Apr 2 04:27:55 XXRAC01 kernel: oracle cpuset=/ mems_allowed=0
Apr 2 04:27:55 XXRAC01 kernel: Pid: 11224, comm: oracle Not tainted 2.6.32-642.el6.x86_64 #1
Apr 2 04:27:55 XXRAC01 kernel: Call Trace:
Checking memory usage with free -g shows relatively little free memory. (This was observed after the reboot.)
# free -g
total used free shared buffers cached
Mem: 252 157 94 0 0 2
-/+ buffers/cache: 155 97
Swap: 19 0 19
Check HugePages usage:
#cat /proc/meminfo |grep Huge
AnonHugePages: 0 kB
HugePages_Total: 76900
HugePages_Free: 76900
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 2048 kB
#cat /etc/security/limits.conf
oracle hard memlock 118536840
From the above, the system has reserved roughly 150 GB of HugePages (76900 × 2 MB), yet HugePages_Free equals HugePages_Total: Oracle is not using huge pages at all. The reason is that the oracle memlock hard limit (118536840 kB ≈ 113 GB) is smaller than the SGA, so Oracle cannot lock the SGA in huge pages.
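The mismatch is plain shell arithmetic over the figures above (a sketch; every number is taken from the meminfo and limits.conf output):

```shell
# HugePages reserved at boot vs. the oracle memlock limit (both in kB)
hugepages_kb=$((76900 * 2048))        # 157491200 kB
memlock_kb=118536840                  # hard memlock from limits.conf
echo $((hugepages_kb / 1024 / 1024))  # 150 (GB of HugePages)
echo $((memlock_kb / 1024 / 1024))    # 113 (GB lockable by oracle)
# memlock < HugePages size (and < sga_max_size = 150G), so Oracle cannot
# lock its SGA in huge pages and silently falls back to regular 4 kB pages,
# while the 150 GB of HugePages stays reserved and unusable by anyone.
```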
Database SGA size:
SQL> show parameter sga
NAME TYPE VALUE
------------------------------------ ----------- ------------------------------
lock_sga boolean FALSE
pre_page_sga boolean FALSE
sga_max_size big integer 150G
sga_target big integer 150G
SQL> show parameter pga
NAME TYPE VALUE
------------------------------------ ----------- ------------------------------
pga_aggregate_target big integer 40G
Because the SGA is therefore allocated from regular 4 KB pages while ~150 GB of RAM stays reserved as unused HugePages, the system runs out of memory once the instance's SGA usage grows to around 100 GB: the OOM killer kills oracle processes, and the memory shortage ultimately causes the operating system to reboot.
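The memory accounting behind that conclusion can be sketched the same way (rough figures in GB, from the output above; per-process and OS overhead are ignored):

```shell
# rough memory accounting (GB)
total=252                              # physical RAM (free -g)
hugepages=$((76900 * 2 / 1024))        # ~150 GB reserved but unused by Oracle
sga=150                                # sga_target, now on regular 4 kB pages
pga=40                                 # pga_aggregate_target
echo $((total - hugepages))            # 102 GB left for SGA + PGA + OS
# once the regular-page SGA grows toward ~100 GB the box is out of memory
# even before counting PGA, and the OOM killer starts killing processes
echo $((sga + pga > total - hugepages))   # 1 (demand exceeds what is left)
```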
3. Solution
Therefore, raise the following memlock limits in limits.conf to a value larger than the total HugePages memory, which prevents the OOM from recurring:
#cat /etc/security/limits.conf
oracle soft memlock 209715200
oracle hard memlock 209715200
Finally, restart the database after the adjustment.
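Note that the new limit applies only to sessions opened after the edit. Assuming the stock pam_limits setup on CentOS 6, a quick check from root is:

```shell
# open a fresh login shell as oracle and check max locked memory (reported in kB)
su - oracle -c 'ulimit -l'
# expected: 209715200, i.e. 200 GB, comfortably above the ~150 GB of HugePages
echo $((209715200 / 1024 / 1024))     # 200
```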
After the system had been running for a while, the observed HugePages usage changed:
# cat /proc/meminfo |grep Huge
AnonHugePages: 0 kB
HugePages_Total: 76900
HugePages_Free: 8075
HugePages_Rsvd: 7976
HugePages_Surp: 0
Hugepagesize: 2048 kB
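These counters can be converted to GB to confirm the SGA now lives in huge pages (a sketch; HugePages_Rsvd pages are committed to the SGA but not yet faulted in):

```shell
# derive huge-page usage from the counters above (2 MB pages)
total=76900; free=8075; rsvd=7976
echo $(( (total - free) * 2 / 1024 ))          # 134 GB actually faulted in
echo $(( (total - free + rsvd) * 2 / 1024 ))   # 150 GB committed
# the committed figure matches sga_max_size = 150G: the SGA is in huge pages
```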