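Today I handled a case of high CPU usage on Solaris.
Description:
The machine is used for nca. Its load is normally low, under 30%, but for the past two days, for reasons unknown, CPU utilization has kept hitting 100%, with kernel time alone above 90%.
Running top showed a system load of 33-40. At first the process using the most CPU was tar; later it became syslogd, and this continued until the system stopped responding.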
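Analysis:
1.
Use any one of the following four commands to check per-process CPU usage, find out which processes are using the most CPU, and see what they are doing.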
prstat -cvm/prstat -a/ps -eo pid,pcpu,args |sort +1n/top
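As a rough sketch of the ps variant above, the ranking can be wrapped in a small function. The name top_cpu is hypothetical, and the sketch assumes a sort that accepts the -k option and a head that accepts -n; it reads "pid pcpu args" lines from standard input so it can be tried on saved output as well.

```shell
#!/bin/sh
# Hypothetical helper: rank "pid pcpu args" lines (as printed by
# `ps -eo pid,pcpu,args`) by CPU share, highest first.
top_cpu() {
    # $1 = number of rows to keep (defaults to 5).
    # Drop the header line (first field is "PID"), then sort
    # numerically on the second field in descending order.
    awk '$1 != "PID"' | sort -k2,2nr | head -n "${1:-5}"
}

# Typical use on a live system:
#   ps -eo pid,pcpu,args | top_cpu 10
```

Unlike sort +1n, which lists the busiest processes last, this puts them first, so the top few rows are the ones worth investigating.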
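2.
Pick out a few suspicious processes
A. syslogd:
This is the system logging process; it is often the one that drives kernel utilization up.
B. insnca_compard.sh:
I wrote this script to collect nca information. It puts little load on the system, but this shell script, which normally finishes quickly, had become very slow, so a large number of insnca_compard.sh instances were running at once.
C. tar:
This process backs nca up to the local disk. It runs every night at 12:00, and lately it has not been finishing until daytime, which never used to happen.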
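3.
Deal with the easy ones first.
A.
Killed every insnca_compard.sh with a for x in .... loop; the problem remained.
B.
Stopped tar. CPU usage dropped, but five minutes later it climbed again.
C.
While I was working on this, suddenly...
The system started reporting the error below nonstop, filling the screen, and syslogd's CPU usage rose to 60%.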
message overflow on /dev/log minor #6 -- is syslogd(1M) running?
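The system load kept rising too, and after 10 minutes the machine was too slow to use. It seemed fairly certain that the system itself had a problem: it was writing log entries nonstop, which left it unable to respond.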
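For step 3A, the original for loop is not shown in full, so the following is only one possible way to collect the PIDs of the stuck script instances, not the command that was actually run. The function name pids_matching and the file /tmp/ps.out are hypothetical, and the awk -v option assumes a POSIX awk (on Solaris, nawk or /usr/xpg4/bin/awk rather than the old /usr/bin/awk).

```shell
#!/bin/sh
# Hypothetical helper: print the PID of every process whose command
# line contains the given string. Expects `ps -eo pid,args` output
# on standard input.
pids_matching() {
    # $1 = substring to look for, e.g. insnca_compard.sh
    # NR > 1 skips the header line; index() does a plain substring
    # match, so dots in the name are not treated as wildcards.
    awk -v pat="$1" 'NR > 1 && index($0, pat) > 0 { print $1 }'
}

# Possible use (snapshot ps first so the awk process itself is not listed):
#   ps -eo pid,args > /tmp/ps.out
#   for pid in `pids_matching insnca_compard.sh < /tmp/ps.out`; do
#       kill "$pid"
#   done
```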
Analyzing and fixing the root cause:
First rule out the basic mistakes by running the following commands in turn:
df -k/share
After rebooting the system and downloading the logs, I looked through them and found the following errors:
...
Jun 3 09:41:36 nca mibiisa: [ID 942500 daemon.error] Can not set up management information base (MIB)
Jun 3 09:41:54 nca sendmail[270]: [ID 702911 mail.alert] unable to qualify my own domain name (nca) -- using short name
Jun 3 13:13:32 nca SUNW,UltraSPARC-IIe: [ID 960406 kern.info] [AFT0] Corrected Memory Error detected by CPU0, errID 0x00000c01.b4db2a6e
Jun 3 13:13:32 nca AFSR 0x00000000.00100000 AFAR 0x00000000.8ffbd150
Jun 3 13:13:32 nca AFSR.PSYND 0x0000(Score 05) AFSR.ETS 0x00 Fault_PC 0xfe441f8c
Jun 3 13:13:32 nca UDBH Syndrome 0x1a Memory Module Invalid Syndrome
Jun 3 13:13:32 nca SUNW,UltraSPARC-IIe: [ID 725850 kern.info] [AFT0] errID 0x00000c01.b4db2a6e Corrected Memory Error on Invalid Syndrome is Intermittent
Jun 3 13:13:32 nca SUNW,UltraSPARC-IIe: [ID 240906 kern.info] [AFT0] errID 0x00000c01.b4db2a6e ECC Data Bit 49 was in error and corrected
Jun 3 13:13:33 nca SUNW,UltraSPARC-IIe: [ID 286331 kern.info] [AFT0] Corrected Memory Error detected by CPU0, errID 0x00000c01.e3aec4b7
Jun 3 13:13:33 nca AFSR 0x00000000.00100000 AFAR 0x00000000.8ffdc058
Jun 3 13:13:33 nca AFSR.PSYND 0x0000(Score 05) AFSR.ETS 0x00 Fault_PC 0x4c0df0
Jun 3 13:13:33 nca UDBH Syndrome 0x1a Memory Module Invalid Syndrome
Jun 3 13:13:33 nca SUNW,UltraSPARC-IIe: [ID 121157 kern.info] [AFT0] errID 0x00000c01.e3aec4b7 Corrected Memory Error on Invalid Syndrome is Intermittent
Jun 3 13:13:33 nca SUNW,UltraSPARC-IIe: [ID 388550 kern.info] [AFT0] errID 0x00000c01.e3aec4b7 ECC Data Bit 49 was in error and corrected
Jun 3 13:13:33 nca SUNW,UltraSPARC-IIe: [ID 232456 kern.info] [AFT0] Corrected Memory Error detected by CPU0, errID 0x00000c01.e3b3f62c
Jun 3 13:13:33 nca AFSR 0x00000000.00100000 AFAR 0x00000000.8ffdc0a0
Jun 3 13:13:33 nca AFSR.PSYND 0x0000(Score 05) AFSR.ETS 0x00 Fault_PC 0x4c1080
Jun 3 13:13:33 nca UDBH Syndrome 0x1a Memory Module Invalid Syndrome
Jun 3 13:13:33 nca SUNW,UltraSPARC-IIe: [ID 997157 kern.info] [AFT0] errID 0x00000c01.e3b3f62c Corrected Memory Error on Invalid Syndrome is Intermittent
Jun 3 13:13:33 nca SUNW,UltraSPARC-IIe: [ID 826550 kern.info] [AFT0] errID 0x00000c01.e3b3f62c ECC Data Bit 49 was in error and corrected
Jun 3 13:13:33 nca SUNW,UltraSPARC-IIe: [ID 730196 kern.info] [AFT0] Corrected Memory Error detected by CPU0, errID 0x00000c01.e3b6f114
Jun 3 13:13:33 nca AFSR 0x00000000.00100000 AFAR 0x00000000.8ffdc100
Jun 3 13:13:33 nca AFSR.PSYND 0x0000(Score 05) AFSR.ETS 0x00 Fault_PC 0x4c120c
Jun 3 13:13:33 nca UDBH Syndrome 0x1a Memory Module Invalid Syndrome
...
The system emitted more than 10 error messages within a single second; no wonder it went down.
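To put a number on that burst rate, the log lines can be grouped by timestamp. This is a minimal sketch: the function name msgs_per_second is hypothetical, and /var/adm/messages in the usage comment is an assumption, since the post does not say which log file was downloaded.

```shell
#!/bin/sh
# Hypothetical helper: count syslog lines per timestamp and list the
# busiest seconds first. The first three fields of each line are
# month, day and hh:mm:ss.
msgs_per_second() {
    # $1 = number of rows to keep (defaults to 5).
    awk '{ print $1, $2, $3 }' | sort | uniq -c | sort -rn | head -n "${1:-5}"
}

# Possible use:
#   msgs_per_second 3 < /var/adm/messages
```

Each output row is a count followed by the timestamp it belongs to, so a second with an unusually high count stands out at the top.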
prtdiag -v
Source: ITPUB blog, link: http://blog.itpub.net/10325341/viewspace-573114/. Please credit the source when reposting; otherwise legal action may be taken.