接到一个问题,客户的某个系统突然CPU冲高,一个小时内CPU从5%冲到60%以上,在数据库中发现大量的会话在等待CSS initialization。
该系统是非RAC非ASM的系统,一般来说,不会出现CSS(cluster synchronization service)的等待,只有在访问voting disk或ocr的时候才会有这个等待,但是为什么在一个非RAC非ASM系统中出现了这个等待呢?
进一步检查会话的SQL,发现是一个监控程序发起的,里面有关于空间的检查,有查到了v$asm_diskgroup视图。这个倒是可以理解,查asm diskgroup的时候,出现了css的等待,但是这个等待一般不会持续很长时间,为什么会长时间的hang在这里?我们在测试机上执行select * from v$asm_diskgroup不到1秒就出结果了,从10046的trace看,虽然期间也有CSS initialization的等待,但是只是维持了很短一瞬间,而在客户的生产环境中,这个等待却持续很长时间,5分钟了还一直hang在那里不动。
查询metalink,发现有个Bug 10024824 – Database/session hang with ‘CSS initialization’ ,版本是10.2.0.5,而我们的数据库版本也正好是这个版本。
==============================================================================
== Bug 10024824 : DATABASE/SESSION HANGS WITH 'CSS INITIALIZATION' ==
PROBLEM:
--------
10.2.0.5 non-RAC ASM set up. After upgrading to 10.2.0.5 patchset, running
dbms_stats, hangs everything.
DIAGNOSTIC ANALYSIS:
--------------------
Happens only when ( and immediately ) when dbms_stats is run from sqlplus.
Does not seem to happen when run automatically during the job's window.
From systemstate dump ( uploaded ) we see process 42 is waiting on 'CSS
initialization'. There is nothing unusual in CSS logs
WORKAROUND:
-----------
RELATED BUGS:
-------------
9917797
==============================================================================
== Bug 10024824 - Database/session hang with 'CSS initialization' [ID 10024824.8] ==
Description
Database/session hang with 'CSS initialization' can occur
when the OH/log//client directory has the wrong permissions
in a RAC environment.
Workaround
Change the permission of OH/log//client directory to 771
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
==============================================================================
==Bug10024824:DATABASE/SESSIONHANGSWITH'CSS INITIALIZATION'==
PROBLEM:
--------
10.2.0.5non-RACASMsetup.Afterupgradingto10.2.0.5patchset,running
dbms_stats,hangseverything.
DIAGNOSTICANALYSIS:
--------------------
Happensonlywhen(andimmediately)whendbms_statsisrunfromsqlplus.
Doesnotseemtohappenwhenrunautomaticallyduringthejob's window.
From systemstate dump ( uploaded ) we see process 42 is waiting on 'CSS
initialization'. There is nothing unusual in CSS logs
WORKAROUND:
-----------
RELATED BUGS:
-------------
9917797
==============================================================================
== Bug 10024824 - Database/session hang with 'CSSinitialization' [ID 10024824.8] ==
Description
Database/session hang with 'CSSinitialization'canoccur
whentheOH/log//clientdirectoryhasthewrongpermissions
inaRACenvironment.
Workaround
ChangethepermissionofOH/log//clientdirectoryto771
看上去像是和目录权限有关。我们去检查该目录的权限,发现权限是751:
mydb2013:ctdb > pwd
/orac/my/ctdb/data/app/oracle/product/10.2.0/log/mydb2013
mydb2013:ctdb > ls -l
total 1024
drwxr-xr-x 2 opctdb dba 516096 Jul 9 18:22 client
mydb2013:ctdb >
1
2
3
4
5
6
mydb2013:ctdb>pwd
/orac/my/ctdb/data/app/oracle/product/10.2.0/log/mydb2013
mydb2013:ctdb>ls-l
total1024
drwxr-xr-x2opctdbdba516096Jul918:22client
mydb2013:ctdb>
难道真的要把权限改成771才能解决这个问题?
在之前的测试机上的测试中,我们发现每查询一次v$asm_diskgroup都会在client目录下生成一个新的cssN.log,如果是按照bug来说,是权限的问题,那么client目录一定是不能被写入,所以才hang住。我们继续检查目录和目录中的文件。
检查时我们发现这个目录下面存在大量的文件,ls的结果返回很多很多,如果想看看是否有最近的文件,ls -lrt根本没法看,所以我们只能先ls -lrt到一个文件,再tail看最后的几行是不是最近的,以确定是不是一直在写:
mydb2013:ctdb > ls -lrt >111.txt
mydb2013:ctdb > cat 111.txt |wc -l
20189
mydb2013:ctdb > tail -20 111.txt
-rw-r--r-- 1 opctdb dba 182 Jul 9 16:59 css19996.log
-rw-r--r-- 1 opctdb dba 182 Jul 9 17:00 css19997.log
-rw-r--r-- 1 opctdb dba 182 Jul 9 17:00 css19998.log
-rw-r--r-- 1 opctdb dba 182 Jul 9 17:00 css19999.log
-rw-r--r-- 1 opctdb dba 182 Jul 9 17:00 css20000.log
-rw-r--r-- 1 opctdb dba 182 Jul 9 17:00 css20001.log
-rw-r--r-- 1 opctdb dba 182 Jul 9 17:00 css20002.log
-rw-r--r-- 1 opctdb dba 182 Jul 9 17:05 css20003.log
-rw-r--r-- 1 opctdb dba 182 Jul 9 17:05 css20004.log
-rw-r--r-- 1 opctdb dba 182 Jul 9 17:05 css20005.log
-rw-r--r-- 1 opctdb dba 182 Jul 9 17:10 css20006.log
-rw-r--r-- 1 opctdb dba 182 Jul 9 17:14 css20007.log
-rw-r--r-- 1 opctdb dba 182 Jul 9 17:18 css20008.log
-rw-r--r-- 1 opctdb dba 182 Jul 9 17:37 css20009.log
-rw-r--r-- 1 opctdb dba 182 Jul 9 17:56 css20010.log
-rw-r--r-- 1 opctdb dba 182 Jul 9 18:16 css20011.log
-rw-r--r-- 1 opctdb dba 182 Jul 9 18:16 css20012.log
-rw-r--r-- 1 opctdb dba 182 Jul 9 18:18 css20013.log
-rw-r--r-- 1 opctdb dba 182 Jul 9 18:22 css20014.log
-rw-r--r-- 1 opctdb dba 0 Jul 9 18:26 111.txt
mydb2013:ctdb >
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
mydb2013:ctdb>ls-lrt>111.txt
mydb2013:ctdb>cat111.txt|wc-l
20189
mydb2013:ctdb>tail-20111.txt
-rw-r--r--1opctdbdba182Jul916:59css19996.log
-rw-r--r--1opctdbdba182Jul917:00css19997.log
-rw-r--r--1opctdbdba182Jul917:00css19998.log
-rw-r--r--1opctdbdba182Jul917:00css19999.log
-rw-r--r--1opctdbdba182Jul917:00css20000.log
-rw-r--r--1opctdbdba182Jul917:00css20001.log
-rw-r--r--1opctdbdba182Jul917:00css20002.log
-rw-r--r--1opctdbdba182Jul917:05css20003.log
-rw-r--r--1opctdbdba182Jul917:05css20004.log
-rw-r--r--1opctdbdba182Jul917:05css20005.log
-rw-r--r--1opctdbdba182Jul917:10css20006.log
-rw-r--r--1opctdbdba182Jul917:14css20007.log
-rw-r--r--1opctdbdba182Jul917:18css20008.log
-rw-r--r--1opctdbdba182Jul917:37css20009.log
-rw-r--r--1opctdbdba182Jul917:56css20010.log
-rw-r--r--1opctdbdba182Jul918:16css20011.log
-rw-r--r--1opctdbdba182Jul918:16css20012.log
-rw-r--r--1opctdbdba182Jul918:18css20013.log
-rw-r--r--1opctdbdba182Jul918:22css20014.log
-rw-r--r--1opctdbdba0Jul918:26111.txt
mydb2013:ctdb>
我们发现cssN.log文件一直在产生,最近的一个是4分钟前产生,平均2,3分钟就产生一个,忙的时候每分钟产生5,6个。目前已经大约有2万多个了,从属主看,权限设置正常,看上去不像权限设置771之后就能解决的问题。
我们继续分析,我们再次测试运行一个查询,并且用truss追踪该进程,我们此时才发现了问题的根源:进程大部分的时间是花在遍历client下cssN.log文件:
truss -o /tmp/mylog.log -laefdD -p xxx
2896/1: 14.3703 0.0000 brk(0x10666A310) = 0
2896/1: 14.3706 0.0003 access("/var/tmp/.oracle", F_OK) = 0
2896/1: 14.3708 0.0002 chmod("/var/tmp/.oracle", 01777) Err#1 EPERM [ALL]
2896/1: 14.3709 0.0001 brk(0x10666A310) = 0
2896/1: 14.3710 0.0001 brk(0x10666E310) = 0
2896/1: 14.3711 0.0001 so_socket(PF_UNIX, SOCK_STREAM, 0, "", SOV_DEFAULT) = 14
2896/1: 14.3712 0.0001 access("/var/tmp/.oracle/sOCSSD_LL_mydb2013_", F_OK) Err#2 ENOENT
2896/1: 14.3713 0.0001 access("/var/tmp/o/sOCSSD_LL_mydb2013_", F_OK) Err#2 ENOENT
2896/1: 14.3714 0.0001 close(14) = 0
2896/1: 14.3716 0.0002 getcwd("/orac/my/ctdb/data/app/oracle/product/10.2.0/dbs", 1024) = 0
2896/1: 14.3718 0.0002 chdir("/orac/my/ctdb/data/app/oracle/product/10.2.0/log/mydb2013/client") = 0
2896/1: 14.3719 0.0001 getcwd("/orac/my/ctdb/data/app/oracle/product/10.2.0/log/mydb2013/client", 1024) = 0
2896/1: 14.3721 0.0002 chdir("/orac/my/ctdb/data/app/oracle/product/10.2.0/dbs") = 0
2896/1: 14.3722 0.0001 brk(0x10666E310) = 0
2896/1: 14.3723 0.0001 brk(0x106672310) = 0
2896/1: 14.3724 0.0001 brk(0x106672310) = 0
2896/1: 14.3725 0.0001 brk(0x106692310) = 0
2896/1: 14.3730 0.0005 access("/orac/my/ctdb/data/app/oracle/product/10.2.0/log/mydb2013/client/css.log", F_OK) = 0 //<==从css.log开始遍历起
2896/1: 14.4276 0.0546 getcwd("/orac/my/ctdb/data/app/oracle/product/10.2.0/dbs", 1024) = 0
2896/1: 14.4278 0.0002 chdir("/orac/my/ctdb/data/app/oracle/product/10.2.0/log/mydb2013/client") = 0
2896/1: 14.4279 0.0001 getcwd("/orac/my/ctdb/data/app/oracle/product/10.2.0/log/mydb2013/client", 1024) = 0
2896/1: 14.4282 0.0003 chdir("/orac/my/ctdb/data/app/oracle/product/10.2.0/dbs") = 0
2896/1: 14.4287 0.0005 access("/orac/my/ctdb/data/app/oracle/product/10.2.0/log/mydb2013/client/css1.log", F_OK) = 0 //<==读css1.log
2896/1: 14.4716 0.0429 getcwd("/orac/my/ctdb/data/app/oracle/product/10.2.0/dbs", 1024) = 0
2896/1: 14.4719 0.0003 chdir("/orac/my/ctdb/data/app/oracle/product/10.2.0/log/mydb2013/client") = 0
2896/1: 14.4720 0.0001 getcwd("/orac/my/ctdb/data/app/oracle/product/10.2.0/log/mydb2013/client", 1024) = 0
2896/1: 14.4722 0.0002 chdir("/orac/my/ctdb/data/app/oracle/product/10.2.0/dbs") = 0
2896/1: 14.4725 0.0003 access("/orac/my/ctdb/data/app/oracle/product/10.2.0/log/mydb2013/client/css2.log", F_OK) = 0 //<==读css2.log
2896/1: 14.4870 0.0145 getcwd("/orac/my/ctdb/data/app/oracle/product/10.2.0/dbs", 1024) = 0
2896/1: 14.4872 0.0002 chdir("/orac/my/ctdb/data/app/oracle/product/10.2.0/log/mydb2013/client") = 0
2896/1: 14.4873 0.0001 getcwd("/orac/my/ctdb/data/app/oracle/product/10.2.0/log/mydb2013/client", 1024) = 0
2896/1: 14.4874 0.0001 chdir("/orac/my/ctdb/data/app/oracle/product/10.2.0/dbs") = 0
2896/1: 14.4877 0.0003 access("/orac/my/ctdb/data/app/oracle/product/10.2.0/log/mydb2013/client/css3.log", F_OK) = 0
2896/1: 14.5022 0.0145 getcwd("/orac/my/ctdb/data/app/oracle/product/10.2.0/dbs", 1024) = 0
2896/1: 14.5025 0.0003 chdir("/orac/my/ctdb/data/app/oracle/product/10.2.0/log/mydb2013/client") = 0
2896/1: 14.5026 0.0001 getcwd("/orac/my/ctdb/data/app/oracle/product/10.2.0/log/mydb2013/client", 1024) = 0
2896/1: 14.5028 0.0002 chdir("/orac/my/ctdb/data/app/oracle/product/10.2.0/dbs") = 0
2896/1: 14.5031 0.0003 access("/orac/my/ctdb/data/app/oracle/product/10.2.0/log/mydb2013/client/css4.log", F_OK) = 0
2896/1: 14.5585 0.0554 getcwd("/orac/my/ctdb/data/app/oracle/product/10.2.0/dbs", 1024) = 0
2896/1: 14.5587 0.0002 chdir("/orac/my/ctdb/data/app/oracle/product/10.2.0/log/mydb2013/client") = 0
2896/1: 14.5589 0.0002 getcwd("/orac/my/ctdb/data/app/oracle/product/10.2.0/log/mydb2013/client", 1024) = 0
2896/1: 14.5590 0.0001 chdir("/orac/my/ctdb/data/app/oracle/product/10.2.0/dbs") = 0
2896/1: 14.5595 0.0005 access("/orac/my/ctdb/data/app/oracle/product/10.2.0/log/mydb2013/client/css5.log", F_OK) = 0
2896/1: 14.5738 0.0143 getcwd("/orac/my/ctdb/data/app/oracle/product/10.2.0/dbs", 1024) = 0
2896/1: 14.5739 0.0001 chdir("/orac/my/ctdb/data/app/oracle/product/10.2.0/log/mydb2013/client") = 0
2896/1: 14.5741 0.0002 getcwd("/orac/my/ctdb/data/app/oracle/product/10.2.0/log/mydb2013/client", 1024) = 0
2896/1: 14.5742 0.0001 chdir("/orac/my/ctdb/data/app/oracle/product/10.2.0/dbs") = 0
......
2896/1: 149.3333 0.0003 access("/orac/my/ctdb/data/app/oracle/product/10.2.0/log/mydb2013/client/css9069.log", F_OK) = 0
2896/1: 149.3470 0.0137 getcwd("/orac/my/ctdb/data/app/oracle/product/10.2.0/dbs", 1024) = 0
2896/1: 149.3471 0.0001 chdir("/orac/my/ctdb/data/app/oracle/product/10.2.0/log/mydb2013/client") = 0
2896/1: 149.3473 0.0002 getcwd("/orac/my/ctdb/data/app/oracle/product/10.2.0/log/mydb2013/client", 1024) = 0
2896/1: 149.3474 0.0001 chdir("/orac/my/ctdb/data/app/oracle/product/10.2.0/dbs") = 0
2896/1: 149.3478 0.0004 access("/orac/my/ctdb/data/app/oracle/product/10.2.0/log/mydb2013/client/css9070.log", F_OK) = 0
2896/1: 149.3616 0.0138 getcwd("/orac/my/ctdb/data/app/oracle/product/10.2.0/dbs", 1024) = 0
2896/1: 149.3618 0.0002 chdir("/orac/my/ctdb/data/app/oracle/product/10.2.0/log/mydb2013/client") = 0
2896/1: 149.3619 0.0001 getcwd("/orac/my/ctdb/data/app/oracle/product/10.2.0/log/mydb2013/client", 1024) = 0
2896/1: 149.3620 0.0001 chdir("/orac/my/ctdb/data/app/oracle/product/10.2.0/dbs") = 0
2896/1: 149.3623 0.0003 access("/orac/my/ctdb/data/app/oracle/product/10.2.0/log/mydb2013/client/css9071.log", F_OK) = 0
2896/1: 149.3763 0.0140 getcwd("/orac/my/ctdb/data/app/oracle/product/10.2.0/dbs", 1024) = 0
2896/1: 149.3765 0.0002 chdir("/orac/my/ctdb/data/app/oracle/product/10.2.0/log/mydb2013/client") = 0
2896/1: 149.3766 0.0001 getcwd("/orac/my/ctdb/data/app/oracle/product/10.2.0/log/mydb2013/client", 1024) = 0
2896/1: 149.3767 0.0001 chdir("/orac/my/ctdb/data/app/oracle/product/10.2.0/dbs") = 0
2896/1: 149.3770 0.0003 access("/orac/my/ctdb/data/app/oracle/product/10.2.0/log/mydb2013/client/css9072.log", F_OK) = 0 //<==读css9072.log
2896/1: 149.3907 0.0137 getcwd("/orac/my/ctdb/data/app/oracle/product/10.2.0/dbs", 1024) = 0
2896/1: 149.3909 0.0002 chdir("/orac/my/ctdb/data/app/oracle/product/10.2.0/log/mydb2013/client") = 0
2896/1: 149.3910 0.0001 getcwd("/orac/my/ctdb/data/app/oracle/product/10.2.0/log/mydb2013/client", 1024) = 0
2896/1: 149.3911 0.0001 chdir("/orac/my/ctdb/data/app/oracle/product/10.2.0/dbs") = 0
2896/1: 149.3914 0.0003 access("/orac/my/ctdb/data/app/oracle/product/10.2.0/log/mydb2013/client/css9073.log", F_OK) = 0 //<==读css9073.log
2896/1: 149.4051 0.0137 getcwd("/orac/my/ctdb/data/app/oracle/product/10.2.0/dbs", 1024) = 0
...... //<==后面还一直读cssN.log
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
truss-o/tmp/mylog.log-laefdD-pxxx
2896/1:14.37030.0000brk(0x10666A310)=0
2896/1:14.37060.0003access("/var/tmp/.oracle",F_OK)=0
2896/1:14.37080.0002chmod("/var/tmp/.oracle",01777)Err#1 EPERM [ALL]
2896/1:14.37090.0001brk(0x10666A310)=0
2896/1:14.37100.0001brk(0x10666E310)=0
2896/1:14.37110.0001so_socket(PF_UNIX,SOCK_STREAM,0,"",SOV_DEFAULT)=14
2896/1:14.37120.0001access("/var/tmp/.oracle/sOCSSD_LL_mydb2013_",F_OK)Err#2 ENOENT
2896/1:14.37130.0001access("/var/tmp/o/sOCSSD_LL_mydb2013_",F_OK)Err#2 ENOENT
2896/1:14.37140.0001close(14)=0
2896/1:14.37160.0002getcwd("/orac/my/ctdb/data/app/oracle/product/10.2.0/dbs",1024)=0
2896/1:14.37180.0002chdir("/orac/my/ctdb/data/app/oracle/product/10.2.0/log/mydb2013/client")=0
2896/1:14.37190.0001getcwd("/orac/my/ctdb/data/app/oracle/product/10.2.0/log/mydb2013/client",1024)=0
2896/1:14.37210.0002chdir("/orac/my/ctdb/data/app/oracle/product/10.2.0/dbs")=0
2896/1:14.37220.0001brk(0x10666E310)=0
2896/1:14.37230.0001brk(0x106672310)=0
2896/1:14.37240.0001brk(0x106672310)=0
2896/1:14.37250.0001brk(0x106692310)=0
2896/1:14.37300.0005access("/orac/my/ctdb/data/app/oracle/product/10.2.0/log/mydb2013/client/css.log",F_OK)=0//<==从css.log开始遍历起
2896/1:14.42760.0546getcwd("/orac/my/ctdb/data/app/oracle/product/10.2.0/dbs",1024)=0
2896/1:14.42780.0002chdir("/orac/my/ctdb/data/app/oracle/product/10.2.0/log/mydb2013/client")=0
2896/1:14.42790.0001getcwd("/orac/my/ctdb/data/app/oracle/product/10.2.0/log/mydb2013/client",1024)=0
2896/1:14.42820.0003chdir("/orac/my/ctdb/data/app/oracle/product/10.2.0/dbs")=0
2896/1:14.42870.0005access("/orac/my/ctdb/data/app/oracle/product/10.2.0/log/mydb2013/client/css1.log",F_OK)=0//<==读css1.log
2896/1:14.47160.0429getcwd("/orac/my/ctdb/data/app/oracle/product/10.2.0/dbs",1024)=0
2896/1:14.47190.0003chdir("/orac/my/ctdb/data/app/oracle/product/10.2.0/log/mydb2013/client")=0
2896/1:14.47200.0001getcwd("/orac/my/ctdb/data/app/oracle/product/10.2.0/log/mydb2013/client",1024)=0
2896/1:14.47220.0002chdir("/orac/my/ctdb/data/app/oracle/product/10.2.0/dbs")=0
2896/1:14.47250.0003access("/orac/my/ctdb/data/app/oracle/product/10.2.0/log/mydb2013/client/css2.log",F_OK)=0//<==读css2.log
2896/1:14.48700.0145getcwd("/orac/my/ctdb/data/app/oracle/product/10.2.0/dbs",1024)=0
2896/1:14.48720.0002chdir("/orac/my/ctdb/data/app/oracle/product/10.2.0/log/mydb2013/client")=0
2896/1:14.48730.0001getcwd("/orac/my/ctdb/data/app/oracle/product/10.2.0/log/mydb2013/client",1024)=0
2896/1:14.48740.0001chdir("/orac/my/ctdb/data/app/oracle/product/10.2.0/dbs")=0
2896/1:14.48770.0003access("/orac/my/ctdb/data/app/oracle/product/10.2.0/log/mydb2013/client/css3.log",F_OK)=0
2896/1:14.50220.0145getcwd("/orac/my/ctdb/data/app/oracle/product/10.2.0/dbs",1024)=0
2896/1:14.50250.0003chdir("/orac/my/ctdb/data/app/oracle/product/10.2.0/log/mydb2013/client")=0
2896/1:14.50260.0001getcwd("/orac/my/ctdb/data/app/oracle/product/10.2.0/log/mydb2013/client",1024)=0
2896/1:14.50280.0002chdir("/orac/my/ctdb/data/app/oracle/product/10.2.0/dbs")=0
2896/1:14.50310.0003access("/orac/my/ctdb/data/app/oracle/product/10.2.0/log/mydb2013/client/css4.log",F_OK)=0
2896/1:14.55850.0554getcwd("/orac/my/ctdb/data/app/oracle/product/10.2.0/dbs",1024)=0
2896/1:14.55870.0002chdir("/orac/my/ctdb/data/app/oracle/product/10.2.0/log/mydb2013/client")=0
2896/1:14.55890.0002getcwd("/orac/my/ctdb/data/app/oracle/product/10.2.0/log/mydb2013/client",1024)=0
2896/1:14.55900.0001chdir("/orac/my/ctdb/data/app/oracle/product/10.2.0/dbs")=0
2896/1:14.55950.0005access("/orac/my/ctdb/data/app/oracle/product/10.2.0/log/mydb2013/client/css5.log",F_OK)=0
2896/1:14.57380.0143getcwd("/orac/my/ctdb/data/app/oracle/product/10.2.0/dbs",1024)=0
2896/1:14.57390.0001chdir("/orac/my/ctdb/data/app/oracle/product/10.2.0/log/mydb2013/client")=0
2896/1:14.57410.0002getcwd("/orac/my/ctdb/data/app/oracle/product/10.2.0/log/mydb2013/client",1024)=0
2896/1:14.57420.0001chdir("/orac/my/ctdb/data/app/oracle/product/10.2.0/dbs")=0
......
2896/1:149.33330.0003access("/orac/my/ctdb/data/app/oracle/product/10.2.0/log/mydb2013/client/css9069.log",F_OK)=0
2896/1:149.34700.0137getcwd("/orac/my/ctdb/data/app/oracle/product/10.2.0/dbs",1024)=0
2896/1:149.34710.0001chdir("/orac/my/ctdb/data/app/oracle/product/10.2.0/log/mydb2013/client")=0
2896/1:149.34730.0002getcwd("/orac/my/ctdb/data/app/oracle/product/10.2.0/log/mydb2013/client",1024)=0
2896/1:149.34740.0001chdir("/orac/my/ctdb/data/app/oracle/product/10.2.0/dbs")=0
2896/1:149.34780.0004access("/orac/my/ctdb/data/app/oracle/product/10.2.0/log/mydb2013/client/css9070.log",F_OK)=0
2896/1:149.36160.0138getcwd("/orac/my/ctdb/data/app/oracle/product/10.2.0/dbs",1024)=0
2896/1:149.36180.0002chdir("/orac/my/ctdb/data/app/oracle/product/10.2.0/log/mydb2013/client")=0
2896/1:149.36190.0001getcwd("/orac/my/ctdb/data/app/oracle/product/10.2.0/log/mydb2013/client",1024)=0
2896/1:149.36200.0001chdir("/orac/my/ctdb/data/app/oracle/product/10.2.0/dbs")=0
2896/1:149.36230.0003access("/orac/my/ctdb/data/app/oracle/product/10.2.0/log/mydb2013/client/css9071.log",F_OK)=0
2896/1:149.37630.0140getcwd("/orac/my/ctdb/data/app/oracle/product/10.2.0/dbs",1024)=0
2896/1:149.37650.0002chdir("/orac/my/ctdb/data/app/oracle/product/10.2.0/log/mydb2013/client")=0
2896/1:149.37660.0001getcwd("/orac/my/ctdb/data/app/oracle/product/10.2.0/log/mydb2013/client",1024)=0
2896/1:149.37670.0001chdir("/orac/my/ctdb/data/app/oracle/product/10.2.0/dbs")=0
2896/1:149.37700.0003access("/orac/my/ctdb/data/app/oracle/product/10.2.0/log/mydb2013/client/css9072.log",F_OK)=0//<==读css9072.log
2896/1:149.39070.0137getcwd("/orac/my/ctdb/data/app/oracle/product/10.2.0/dbs",1024)=0
2896/1:149.39090.0002chdir("/orac/my/ctdb/data/app/oracle/product/10.2.0/log/mydb2013/client")=0
2896/1:149.39100.0001getcwd("/orac/my/ctdb/data/app/oracle/product/10.2.0/log/mydb2013/client",1024)=0
2896/1:149.39110.0001chdir("/orac/my/ctdb/data/app/oracle/product/10.2.0/dbs")=0
2896/1:149.39140.0003access("/orac/my/ctdb/data/app/oracle/product/10.2.0/log/mydb2013/client/css9073.log",F_OK)=0//<==读css9073.log
2896/1:149.40510.0137getcwd("/orac/my/ctdb/data/app/oracle/product/10.2.0/dbs",1024)=0
......//<==后面还一直读cssN.log
因此,我们判断,在每次查询v$asm_diskgroup的时候,都会在client下生成一个新的cssN.log文件(10.2.0.5才有,其他版本没发现),生成的命名规则是前一个数字加1。因此,生成新的cssN.log文件时,需要遍历整个client目录下的cssN.log文件,才能知道最大的数字是多少,才能生成第N+1的文件。而在client下不断生成大量文件,这个和oracle的一个unpublish bug 6004127 有关(ID 729349.1)。目前没有patch,文档上说解决的方法是用crontab定期清理client下的cssN.log
综上,所以监控软件查询v$asm_diskgroup时,总是会hang住(其实也不是完全hang住,等遍历完client下所有文件,就能跑完了,但是这会随着每次查询,每生成一个cssN.log,变得越来越慢)。当比较多的foglight进程查询v$asm_diskgroup就会堵住。
解决方案:定期清理$ORACLE_HOME/log/hostname/client下的文件,或者升级数据库到10.2.0.5以上,或者更改代码,不查询v$asm_diskgroup。
ps:当晚客户清理了目录,测试查询v$asm_diskgroup,不到1秒就出结果了。