Analysis of a RAC problem caused by insufficient process resources

Analysis of an abnormal ONS resource in a RAC database

Around 09:30 on 2014-12-11 we received an SMS alert:
orclrac1[172.19.0.87] resoure offline NAME=ora.ons
TYPE=ora.ons.type
TARGET=ONLINE , ONLINE
STATE=OFFLINE, ONLINE on orclrac2
The alert shows that the ons resource went offline (the problem occurred multiple times).

Impact:
Business staff reported a large number of abnormal orders at the time; order processing was failing.

Emergency handling at the time:
We first tried to start the stopped ons resource, but the start failed, so we restarted the entire cluster stack:
crsctl stop has
crsctl start has
After the restart the ons resource was back to normal.

Further analysis of the cause:
cd /oracle/grid/11ggrid/app/11.2.0/grid/log/orclrac1
1. Check the clusterware alert log alertorclrac1.log
It contains the following errors, which say that new processes could not be started; the root cause is not yet clear from this log alone.

[ctssd(9679)]CRS-2408:The clock on host orclrac1 has been updated by the Cluster Time Synchronization Service to be synchronous with the mean cluster time.
2014-12-11 09:21:38.403:
[/oracle/grid/11ggrid/app/11.2.0/grid/bin/oraagent.bin(9946)]CRS-5016:Process "/oracle/grid/11ggrid/app/11.2.0/grid/opmn/bin/onsctli" spawned by agent "/oracle/grid/11ggrid/app/11.2.0/grid/bin/oraagent.bin" for action "check" failed: details at "(:CLSN00010:)" in "/oracle/grid/11ggrid/app/11.2.0/grid/log/orclrac1/agent/crsd/oraagent_grid/oraagent_grid.log"
2014-12-11 09:23:06.674:
[/oracle/grid/11ggrid/app/11.2.0/grid/bin/oraagent.bin(9946)]CRS-5013:Agent "/oracle/grid/11ggrid/app/11.2.0/grid/bin/oraagent.bin" failed to start process "/oracle/grid/11ggrid/app/11.2.0/grid/bin/lsnrctl" for action "check": details at "(:CLSN00008:)" in "/oracle/grid/11ggrid/app/11.2.0/grid/log/orclrac1/agent/crsd/oraagent_grid/oraagent_grid.log"
2014-12-11 09:23:20.058:
[/oracle/grid/11ggrid/app/11.2.0/grid/bin/oraagent.bin(9946)]CRS-5016:Process "/oracle/grid/11ggrid/app/11.2.0/grid/opmn/bin/onsctli" spawned by agent "/oracle/grid/11ggrid/app/11.2.0/grid/bin/oraagent.bin" for action "check" failed: details at "(:CLSN00010:)" in "/oracle/grid/11ggrid/app/11.2.0/grid/log/orclrac1/agent/crsd/oraagent_grid/oraagent_grid.log"
2014-12-11 09:23:51.216:
[/oracle/grid/11ggrid/app/11.2.0/grid/bin/oraagent.bin(9946)]CRS-5016:Process "/oracle/grid/11ggrid/app/11.2.0/grid/opmn/bin/onsctli" spawned by agent "/oracle/grid/11ggrid/app/11.2.0/grid/bin/oraagent.bin" for action "start" failed: details at "(:CLSN00010:)" in "/oracle/grid/11ggrid/app/11.2.0/grid/log/orclrac1/agent/crsd/oraagent_grid/oraagent_grid.log"
2014-12-11 09:33:50.969:

2. Check "/oracle/grid/11ggrid/app/11.2.0/grid/log/orclrac1/agent/crsd/oraagent_grid/oraagent_grid.log"
The log below shows execCmd's call to the OS routine scls_process_spawn() failing (return code 1). Spawning a new process failed, which also means the node could not fork processes for new connections, so logins were failing:
[check] (:CLSN00008:) category: -2, operation: fork, loc: spawnproc28, OS error: 11, other: forked failed [-1]

2014-12-11 09:22:54.932: [ora.ons][944477952]{0:1:71} [check] getOracleHomeAttrib: oracle_home = /oracle/grid/11ggrid/app/11.2.0/grid
2014-12-11 09:22:54.933: [ora.ons][944477952]{0:1:71} [check] Utils:execCmd action = 3 flags = 6 ohome = /oracle/grid/11ggrid/app/11.2.0/grid/opmn/ cmdname = onsctli. 
2014-12-11 09:23:06.524: [ora.asm][949495552]{1:37481:2} [check] CrsCmd::ClscrsCmdData::stat entity 1 statflag 33 useFilter 0
2014-12-11 09:23:06.544: [ora.asm][949495552]{1:37481:2} [check] AsmProxyAgent::check clsagfw_res_status 0
2014-12-11 09:23:06.666: [ora.LISTENER.lsnr][961095424]{1:37481:2} [check] Utils:execCmd action = 3 flags = 38 ohome = (null) cmdname = lsnrctl. 
2014-12-11 09:23:06.666: [ora.LISTENER.lsnr][961095424]{1:37481:2} [check] (:CLSN00008:)Utils:execCmd scls_process_spawn() failed 1
2014-12-11 09:23:06.666: [ora.LISTENER.lsnr][961095424]{1:37481:2} [check] (:CLSN00008:) category: -2, operation: fork, loc: spawnproc28, OS error: 11, other: forked failed [-1]
2014-12-11 09:23:06.674: [ora.LISTENER.lsnr][961095424]{1:37481:2} [check] clsnUtils::error Exception type=2 string=
CRS-5013: Agent "/oracle/grid/11ggrid/app/11.2.0/grid/bin/oraagent.bin" failed to start process "/oracle/grid/11ggrid/app/11.2.0/grid/bin/lsnrctl" for action "check": details at "(:CLSN00008:)" in "/oracle/grid/11ggrid/app/11.2.0/grid/log/orclrac1/agent/crsd/oraagent_grid/oraagent_grid.log"
2014-12-11 09:23:06.681: [ora.LISTENER.lsnr][961095424]{1:37481:2} [check] clsn_agent::check: Exception SclsProcessSpawnException
2014-12-11 09:23:06.681: [    AGFW][952690432]{1:37481:2} ora.LISTENER.lsnr orclrac1 1 state changed from: ONLINE to: UNKNOWN
2014-12-11 09:23:06.681: [    AGFW][952690432]{1:37481:2} ora.LISTENER.lsnr orclrac1 1 would be continued to monitored!
2014-12-11 09:23:06.681: [    AGFW][952690432]{0:1:73} Generating new Tint for unplanned state change. Original Tint: {1:37481:2}
2014-12-11 09:23:06.681: [    AGFW][952690432]{0:1:73} Agent sending message to PE: RESOURCE_STATUS[Proxy] ID 20481:1780005
2014-12-11 09:23:20.057: [ora.ons][944477952]{0:1:71} [check] (:CLSN00010:)ons is not running ...
2014-12-11 09:23:20.057: [ora.ons][944477952]{0:1:71} [check] (:CLSN00010:)
2014-12-11 09:23:20.058: [ora.ons][944477952]{0:1:71} [check] (:CLSN00010:)Utils:execCmd scls_process_join() uret 2
2014-12-11 09:23:20.058: [ora.ons][944477952]{0:1:71} [check] execCmd ret = 2
2014-12-11 09:23:20.058: [ora.ons][944477952]{0:1:71} [check] (:CLSN00010:)ons is not running ...
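The "OS error: 11" in the trace is the key detail: on Linux, errno 11 is EAGAIN, the code fork() and pthread_create() return when a per-user limit such as RLIMIT_NPROC is exhausted. A quick check in Python:

```python
import errno, os

# On Linux, fork()/pthread_create() report EAGAIN when RLIMIT_NPROC is
# exhausted; errno.EAGAIN is 11 there, matching "OS error: 11" in the trace.
print(errno.EAGAIN)               # 11 on Linux
print(os.strerror(errno.EAGAIN))  # "Resource temporarily unavailable"
```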

Points 1 and 2 suggested that the OS maximum-process setting was too low.

3. Continue with the ons.out log

cd /oracle/grid/11ggrid/app/11.2.0/grid/opmn/logs
[grid@orclrac1 logs]$ ls
ons.log.orclrac1  ons.out

It shows the thread-creation call pthread_create() failing with "Resource temporarily unavailable" — some OS resource could not be obtained:
[2014-12-11T09:21:54+08:00] [ons] [TRACE:2] [] [internal] ONS worker process stopped (2)
pthread_create() Resource temporarily unavailable
pthread_create() Resource temporarily unavailable
[2014-12-11T09:23:35+08:00] [ons] [ERROR:1] [] [internal] ONS: fork failed -- Resource temporarily unavailable
pthread_create() Resource temporarily unavailable
[2014-12-11T09:24:06+08:00] [ons] [TRACE:2] [] [internal] ONS worker process stopped (2)
pthread_create() Resource temporarily unavailable
pthread_create() Resource temporarily unavailable
pthread_create() Resource temporarily unavailable
pthread_create() Resource temporarily unavailable
pthread_create() Resource temporarily unavailable
[2014-12-11T10:31:38+08:00] [ons] [TRACE:2] [] [internal] ONS worker process stopped (2)
pthread_create() Resource temporarily unavailable
pthread_create() Resource temporarily unavailable

This analysis points to the system's process limit as the cause.

4. Check the limits configuration: cat /etc/security/limits.conf

oracle   soft  nproc  16384
oracle   hard  nproc  16384
oracle   soft  nofile 16384
oracle   hard  nofile  65536
grid   soft   nofile    16384
grid   hard   nofile    65536
grid   soft   nproc    16384
grid   hard   nproc    16384
grid   soft   stack    10240
grid   hard   stack    32768

These settings should be more than sufficient — so why was process creation failing?

5. Checking with ulimit -u, we were surprised to find that limits.conf had not taken effect: the maximum number of processes was 1024.
For the grid user:
[grid@orclrac1 orclrac1]$ ulimit -u
1024
For the oracle user:
[oracle@orclrac1 ~]$ ulimit -u
1024
For the root user:
[root@orclrac1 ~]# ulimit -u
1024
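ulimit -u reports the same kernel limit that getrlimit(RLIMIT_NPROC) exposes, so the check can also be scripted. A minimal sketch (Linux):

```python
import resource

# getrlimit returns (soft, hard); RLIM_INFINITY marks "unlimited".
soft, hard = resource.getrlimit(resource.RLIMIT_NPROC)

def fmt(v):
    return "unlimited" if v == resource.RLIM_INFINITY else str(v)

print(f"max user processes: soft={fmt(soft)} hard={fmt(hard)}")
```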

6. Research showed that CentOS 6 and later ship an additional file, /etc/security/limits.d/90-nproc.conf, with the following content:
[root@orclrac1 ~]# cat /etc/security/limits.d/90-nproc.conf
# Default limit for number of user's processes to prevent
# accidental fork bombs.
# See rhbz #432903 for reasoning.
*          soft    nproc     1024

Ordinary users were picking up the soft nproc limit from /etc/security/limits.d/90-nproc.conf rather than the per-user entries in limits.conf, while root fell back to the limits.conf hard limit: files under /etc/security/limits.d/ are read after limits.conf, so their matching entries take precedence.
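On these pam_limits versions the resolution rule is effectively "last matching line wins", with limits.d files read after limits.conf. A toy model of that precedence (illustrative only, not the pam_limits source; the entries mirror this system's files):

```python
def effective_limit(entries, user, item, kind):
    """entries: (domain, kind, item, value) lines in the order pam reads them.
    The last matching line wins, so entries from limits.d override limits.conf."""
    value = None
    for domain, typ, itm, val in entries:
        if itm == item and typ == kind and domain in (user, "*"):
            value = val
    return value

entries = [
    ("grid", "soft", "nproc", 16384),  # /etc/security/limits.conf
    ("*",    "soft", "nproc", 1024),   # /etc/security/limits.d/90-nproc.conf (read later)
]
print(effective_limit(entries, "grid", "nproc", "soft"))  # 1024: the limits.d wildcard wins
```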

We changed "*          soft    nproc     1024" in /etc/security/limits.d/90-nproc.conf to "*          soft    nproc     131072", raised the oracle user's nproc entries, and set root's maximum process count to unlimited, so limits.conf now reads:
root     soft  nproc  unlimited
root     hard  nproc  unlimited
oracle   soft  nproc  139264
oracle   hard  nproc  139264
oracle   soft  nofile  16384
oracle   hard  nofile  65536
grid   soft   nofile    16384
grid   hard   nofile    65536
grid   soft   nproc    16384
grid   hard   nproc    16384
grid   soft   stack    10240
grid   hard   stack    32768

Verify that the settings took effect:
[grid@orclrac1 logs]$ ulimit -u
139264
[root@orclrac1 ~]# ulimit -u
unlimited
[oracle@orclrac1 ~]$ ulimit -u
139264

With that, the problem seemed settled: ons was running normally. It still needed a period of observation and continued monitoring.

Some time later the problem recurred.

The earlier analysis concluded that the resource went offline because the OS maximum-process setting was too small, yet the problem returned after the change, so it had to be located again from scratch.
With the problem recurring repeatedly, we worked overtime to re-analyze it.
The main lines of attack for the new analysis:
collect all error logs from the RAC environment and the OSWatcher monitoring logs
monitor the user's process and session counts
re-analyze all of the log information
Analysis process:
1. Log in to the environment and collect logs
Copy all RAC-related logs and the OSWatcher logs to a local machine (so they are not overwritten by rotation).
2. Monitor the user's process and session counts
The previous conclusion was that failed process creation kept users from connecting to the database, so we wanted the database's session and process counts at the moment of failure. The following script was deployed on the RAC:
-- Create table
create table MONSESSION
(
  MTIME    DATE,
  INDST_ID NUMBER,
  COUN     NUMBER,
  NAME     VARCHAR2(20)
);

create or replace procedure wproc_session
as
begin
  insert into monsession
    (select sysdate, inst_id, count(*), 'session'
       from gv$session
      group by inst_id);
  insert into monsession
    (select sysdate, inst_id, count(*), 'process'
       from gv$process
      group by inst_id);
  commit;
exception
  when others then
    wprc_exp_log('wproc_session', 'error');
    commit;
end wproc_session;
/
-- run the job every 2 minutes
begin
  sys.dbms_job.submit(job => :job,
                      what => 'wproc_session;',
                      next_date => to_date('14-12-2014 10:01:02', 'dd-mm-yyyy hh24:mi:ss'),
                      interval => 'sysdate+2/1440');
  commit;
end;
/
3. Re-analyze the logs
Since the problem recurred after the earlier change, its cause cannot be that the OS maximum-process setting was too small, so a different line of reasoning was needed.
Re-reading the logs, the conclusion is still that some OS resource could not be obtained
("Resource temporarily unavailable") — but which resource, exactly?
Candidate causes:
insufficient memory
insufficient disk space
insufficient swap space
process limit set too low (already ruled out)
By examining the OSWatcher logs at the time of the failure,
we ruled out memory, disk-space, and swap shortages:
zzz ***Sun Dec 14 05:02:18 CST 2014
top - 05:02:20 up 120 days, 18:47,  7 users,  load average: 0.35, 0.33, 0.36
Tasks: 1740 total,   1 running, 1739 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.9%us,  0.4%sy,  0.0%ni, 98.7%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  32853644k total, 31692916k used,  1160728k free,   233892k buffers
Swap: 33554424k total,   675160k used, 32879264k free, 15600456k cached
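The top snapshot above supports ruling out memory pressure: buffers and page cache are reclaimable, so they count toward available memory. A quick sanity check of the numbers:

```python
# Values in kB, read from the top snapshot above.
free_kb, buffers_kb, cached_kb = 1160728, 233892, 15600456

# Reclaimable buffers and page cache count as available memory.
available_kb = free_kb + buffers_kb + cached_kb
print(f"approx. available memory: {available_kb / 1024 / 1024:.1f} GB")  # ~16.2 GB
```

Roughly 16 GB of the 32 GB total was free or reclaimable, and swap usage was small, so neither memory nor swap was the scarce resource.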
Putting all the logs together, the conclusion was still a process-count problem.

Reaching the same conclusion as last time was frustrating, and it raised the suspicion that an Oracle bug might be involved, so the analysis turned in that direction.

First, the maximum-process setting was confirmed to be in effect at the OS level (the ulimit -u checks above).

So was it in effect inside Oracle, and how would one check? Searching online eventually turned up a MOS note:
The processes and resources started by CRS (Grid Infrastructure) do not inherit the ulimit setting for "max user processes" from /etc/security/limits.conf setting (Doc ID 1594606.1)
The title alone was cause for excitement.
The note describes the symptoms as:
SYMPTOMS
Database alert.log shows the following:
WARNING: Oracle instance running on a system with low open file 
descriptor limit. Tune your system to increase this limit 
to avoid severe performance degradation. Current limit = <nnnn>
OR
ORA-27300: OS system dependent operation:fork failed with status: 11 
ORA-27301: OS failure message: Resource temporarily unavailable 
ORA-27302: failure occurred at: skgpspawn3
These match our symptoms exactly.

The note attributes the problem to an Oracle bug:
CAUSE
Due to bug 17301761 THE MAXIMUM PROCESS VALUE OF OHASD,BIN IS TOO SMALL IF THE CRS STARTS MANUALLY
Duplicates: bug 16875455 bug 17007666 bug 17232135
Affected versions:
Oracle Database - Enterprise Edition - Version 11.2.0.1 and later
Information in this document applies to any platform.
Our database is 11.2.0.4, so it falls within the affected range.
The note then gives the solutions:
1. Apply the patch
2. Work around it
First, check the current Oracle process limits.

On the database server:
[root@rac1 ~]# ps -ef |grep pmon
grid      46506      1  0 Dec12 ?        00:00:44 asm_pmon_+ASM1
oracle    47051      1  0 Dec12 ?        00:02:05 ora_pmon_orcl1
root     115832  91913  0 11:40 pts/12   00:00:00 grep pmon
Check the pmon processes' resource limits:
[root@rac1 ~]# cd /proc
[root@rac1 proc]# cd 46506
 [root@rac1 46506]# cat limits 
Limit                     Soft Limit           Hard Limit           Units     
Max cpu time              unlimited            unlimited            seconds   
Max file size             unlimited            unlimited            bytes     
Max data size             unlimited            unlimited            bytes     
Max stack size            33554432             unlimited            bytes     
Max core file size        unlimited            unlimited            bytes     
Max resident set          unlimited            unlimited            bytes     
Max processes             1024                 256514               processes   
Max open files            65536                65536                files     
Max locked memory         unlimited            unlimited            bytes     
Max address space         unlimited            unlimited            bytes     
Max file locks            unlimited            unlimited            locks     
Max pending signals       256514               256514               signals   
Max msgqueue size         819200               819200               bytes     
Max nice priority         0                    0                    
Max realtime priority     0                    0                    
Max realtime timeout      unlimited            unlimited            us        
[root@rac1 46506]# cd ../47051
 [root@rac1 47051]# cat limits 
Limit                     Soft Limit           Hard Limit           Units     
Max cpu time              unlimited            unlimited            seconds   
Max file size             unlimited            unlimited            bytes     
Max data size             unlimited            unlimited            bytes     
Max stack size            33554432             unlimited            bytes     
Max core file size        unlimited            unlimited            bytes     
Max resident set          unlimited            unlimited            bytes     
Max processes             1024                 256514               processes 
Max open files            65536                65536                files     
Max locked memory         unlimited            unlimited            bytes     
Max address space         unlimited            unlimited            bytes     
Max file locks            unlimited            unlimited            locks     
Max pending signals       256514               256514               signals   
Max msgqueue size         819200               819200               bytes     
Max nice priority         0                    0                    
Max realtime priority     0                    0                    
Max realtime timeout      unlimited            unlimited            us        
[root@rac1 47051]#
The Max processes rows above show a soft limit of 1024 for both the database and ASM pmon processes.
This is why the problem kept recurring: the OS maximum-process setting never took effect for the processes started by the Oracle stack.
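A fresh shell's ulimit -u only shows what a new login would get; what matters is what the long-running daemon actually inherited, which is exactly what /proc/<pid>/limits records. A small Linux-only parser for one row of that file (the helper name is my own, as a sketch):

```python
import os

def proc_limit(pid, name):
    """Return (soft, hard) from one row of /proc/<pid>/limits (Linux only);
    'unlimited' stays a string, numeric limits become ints."""
    with open(f"/proc/{pid}/limits") as f:
        for line in f:
            if line.startswith(name):
                soft, hard = line[len(name):].split()[:2]
                return tuple(v if v == "unlimited" else int(v)
                             for v in (soft, hard))
    return None

# Inspect the current process; on the server you would pass pmon's pid instead.
print(proc_limit(os.getpid(), "Max processes"))
```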

The solution MOS gives:
Explicitly set the maximum process count by modifying two scripts.

The ohasd script needs to be modified to set the ulimit explicitly for all grid and database resources that are started by the Grid Infrastructure (GI).

1) go to GI_HOME/bin

2) make a backup of ohasd script file

3) in the ohasd script file, locate the following code:

    Linux)
        # MEMLOCK limit is for Bug 9136459
        ulimit -l unlimited
        if [ "$?" != "0" ]
        then
            $CLSECHO -phas -f crs -l -m 6021 "l" "unlimited"
        fi
        ulimit -c unlimited
        if [ "$?" != "0" ]
        then
            $CLSECHO -phas -f crs -l -m 6021 "c" "unlimited"
        fi
        ulimit -n 65536

In the above code, insert the following line just before the line with "ulimit -n 65536"

       ulimit -u 16384

4) Recycle CRS manually so that ohasd will now use the new ulimit setting for open files.
After the database is started, please issue "ps -ef | grep pmon" and get the pid of it.
Then, issue "cat /proc/<pid of the pmon process>/limits | grep process" and find out if the Max processes is set to 16384.
Setting the number of processes to 16384 should be enough for most servers, since having 16384 processes normally means the server is loaded very heavily. Using a smaller number like 4096 or 8192 should also suffice for most users.
In addition to the above, the ohasd template needs to be modified to ensure that the new ulimit setting persists even after a patch is applied.
1) go to GI_HOME/crs/sbs

2) make a backup of crswrap.sh.sbs

3) in crswrap.sh.sbs, insert the following line just before the line "# MEMLOCK limit is for Bug 9136459"

       ulimit -u 16384
Finally, although the above setting successfully increases the number-of-processes limit, please test it on a test server first before setting the ulimit on production.


Following the MOS solution, we modified the scripts and restarted the CRS stack. The problem was resolved and has not recurred since.

