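Recently we ran into a very strange problem. We first noticed that VAS was not working properly on one machine. We suspected that the machine had dropped out of the domain and tried to rejoin it, but the join failed with the error shown below.
(The system is SUSE Linux Enterprise Server 10 SP2.)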
ecnshxenlx0041:~ # uname -a
Linux ecnshxenlx0041 2.6.16.60-0.21-bigsmp #1 SMP Tue May 6 12:41:02 UTC 2008 i686 i686 i386 GNU/Linux
ecnshxenlx0041:~ #
ecnshxenlx0041:~ # cat /etc/SuSE-release
SUSE Linux Enterprise Server 10 (i586)
VERSION = 10
PATCHLEVEL = 2
ecnshxenlx0041:~ #
ecnshxenlx0041:~ # date
Mon May 28 01:39:18 CST 2012
ecnshxenlx0041:~ #
ecnshxenlx0041:~ #
ecnshxenlx0041:~ # cd /etc/opt/quest/vas/
ecnshxenlx0041:/etc/opt/quest/vas #
ecnshxenlx0041:/etc/opt/quest/vas # sh lastjoin
Checking whether computer is already joined to a domain ... no
ERROR: Could not authenticate as instcnshsv.
Clock skew error, time sync failed
VAS_ERR_NOT_FOUND: An ntp client is already running
ERROR: Could not join to the domain
ecnshxenlx0041:/etc/opt/quest/vas #
Next I tried a flush to refresh the VAS information and got a similar error:
ecnshxenlx0041:/etc/opt/quest/vas # /opt/quest/bin/vastool flush
Stopping vasd: ..done
Could not load caches- Authentication failed, error = VAS_ERR_NOT_FOUND: Not found
   Caused by:
   VAS_ERR_KRB5: System time out of sync with realm EAPAC.ERICSSON.SE (eapaccnsh01.eapac.ericsson.se)
   Caused by:
   KRB5KRB_AP_ERR_SKEW (-1765328347): Clock skew too great
It appears that the computer object has not yet replicated to the Global Catalog.
vasd will stay in disconnected mode until this replication takes place.
You do not need to rejoin this computer.
fork_ns_ipc_handler_process: Could not load NS caches - Authentication failed, error = VAS_ERR_NOT_FOUND: Not found
   Caused by:
   VAS_ERR_KRB5: System time out of sync with realm EAPAC.ERICSSON.SE (eapaccnsh01.eapac.ericsson.se)
   Caused by:
   KRB5KRB_AP_ERR_SKEW (-1765328347): Clock skew too great
Waiting for computer object to be replicated throughout the domain.
The NS IPC handler will be in disconnected mode until the replication takes place.
Starting vasd: ..done
ecnshxenlx0041:/etc/opt/quest/vas #
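After a lot of checking, we happened to notice that the system time on this server was badly off: it was running about 30 minutes ahead of the physical machine's time. We suspected that the offset between the system time and the domain time was what had knocked the machine out of the domain, so we tried to restore the system time.
hwclock -r: show the hardware clock time
hwclock -systohc: what we typed in order to set the system time from the hardware clock. Note that the documented options for that direction are hwclock --hctosys (short form -s); the long option --systohc, with two dashes, does the opposite and writes the system time to the hardware clock. In the session below the system time did move back to roughly the hardware clock's value, which is the --hctosys behavior, probably because the single-dash form was read as short options starting with -s.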
ecnshxenlx0041:~ #
ecnshxenlx0041:~ # hwclock -r
Wed May 23 14:33:05 2012 -0.016208 seconds
ecnshxenlx0041:~ #
ecnshxenlx0041:~ # date
Wed May 23 15:01:46 CST 2012
ecnshxenlx0041:~ #
ecnshxenlx0041:~ # hwclock -systohc
ecnshxenlx0041:~ #
ecnshxenlx0041:~ # ### hwclock -systohc
ecnshxenlx0041:~ #
ecnshxenlx0041:~ # date
Wed May 23 14:38:56 CST 2012
ecnshxenlx0041:~ #
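To put a number on the offset, the two readings from the transcript above (hwclock -r and date, taken a few seconds apart) can be subtracted. This is only a rough sketch: it assumes GNU date for the -d option, and it assumes the realm uses the Kerberos default tolerance of 300 seconds, which may have been changed on a given site. The function name skew_seconds is made up for this example.

```shell
#!/bin/sh
# Assumption: the realm uses the Kerberos default maximum clock skew (300 s).
MAX_SKEW=300

# skew_seconds A B: print B - A in seconds for two "YYYY-MM-DD HH:MM:SS"
# timestamps. Relies on GNU date's -d option (hypothetical helper name).
skew_seconds() {
    a=$(date -u -d "$1" +%s) || return 1
    b=$(date -u -d "$2" +%s) || return 1
    echo $((b - a))
}

# Hardware clock reading vs. system clock reading from the transcript.
skew=$(skew_seconds "2012-05-23 14:33:05" "2012-05-23 15:01:46")
echo "skew: ${skew}s"

# ${skew#-} strips a leading minus sign, giving the absolute value.
if [ "${skew#-}" -gt "$MAX_SKEW" ]; then
    echo "exceeds the assumed Kerberos tolerance of ${MAX_SKEW}s"
fi
```

With these two readings the difference is 1721 seconds (28 minutes 41 seconds), far beyond a 300-second tolerance, which fits the "Clock skew too great" error from vastool.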
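After resetting the clock this way, the system time was correct again and VAS went back to normal. However, every time we rebooted, the system time would inexplicably drift by ten-odd minutes. We tried configuring NTP to correct it, but that did not solve the problem at all. Strangest of all, when we watched the system clock while the machine was running, it would suddenly jump by several minutes for no apparent reason.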
ecnshxenlx0041:~ #
ecnshxenlx0041:~ # watch -n 1 date
Every 1.0s: date                                Fri May 25 13:59:08 2012
Fri May 25 13:59:08 CST 2012
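We ran the following tests:
With NTP enabled: the time still drifted
With NTP disabled: the time still drifted
We had suspected that NTP was either disturbing the system time or would help correct it, but the tests showed that neither was the case.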
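Watching date by eye works, but the jumps are easy to miss. A small filter like the one below could log them instead. It is a sketch, not what we actually ran: detect_jumps is a made-up name, and it simply compares consecutive epoch timestamps against the expected sampling interval.

```shell
#!/bin/sh
# detect_jumps INTERVAL THRESHOLD   (hypothetical helper)
# Reads epoch timestamps from stdin, one per line, that were sampled
# INTERVAL seconds apart. Reports every step whose size differs from
# INTERVAL by more than THRESHOLD seconds.
detect_jumps() {
    awk -v interval="$1" -v threshold="$2" '
        NR > 1 {
            drift = ($1 - prev) - interval
            if (drift > threshold || drift < -threshold)
                printf "jump of %d s between sample %d and %d\n", drift, NR - 1, NR
        }
        { prev = $1 }
    '
}

# Live use, runs until interrupted with Ctrl-C:
#   while :; do date +%s; sleep 1; done | detect_jumps 1 5
```

Running the loop once with ntp started and once with it stopped would give a log of jumps for each case, which is easier to compare than two terminal sessions.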
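After a colleague searched online, we tried adding a parameter to the kernel boot line:
(In the original post, cyan marked the file and the position I changed, and red marked what I actually added: the clock=pmtmr hpet=disable options at the end of the first kernel line.)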
ecnshxenlx0041:~ # cat /boot/grub/menu.lst
# Modified by YaST2. Last modification on Fri Jun 15 04:31:58 UTC 2012
default 0
timeout 8
##YaST - generic_mbr
gfxmenu (hd0,0)/message
##YaST - activate

###Don't change this comment - YaST2 identifier: Original name: linux###
title SUSE Linux Enterprise Server 10 SP2
    root (hd0,0)
    kernel /vmlinuz-2.6.16.60-0.21-bigsmp root=/dev/system/root resume=/dev/system/swap splash=silent showopts clock=pmtmr hpet=disable
    initrd /initrd-2.6.16.60-0.21-bigsmp

###Don't change this comment - YaST2 identifier: Original name: failsafe###
title Failsafe -- SUSE Linux Enterprise Server 10 SP2
    root (hd0,0)
    kernel /vmlinuz-2.6.16.60-0.21-bigsmp root=/dev/system/root showopts ide=nodma apm=off acpi=off noresume nosmp noapic maxcpus=0 edd=off 3
    initrd /initrd-2.6.16.60-0.21-bigsmp
ecnshxenlx0041:~ #
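After the reboot the system time no longer drifted, although we cannot say for certain whether this parameter is what fixed it.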
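One thing that can at least be confirmed is whether the options from menu.lst really reached the running kernel, since /proc/cmdline shows the boot line the kernel was started with. The check below is a sketch; has_boot_params is a made-up helper name, and it only proves the options were passed, not that they cured the drift.

```shell
#!/bin/sh
# has_boot_params CMDLINE PARAM...   (hypothetical helper)
# Succeeds only if every PARAM appears as a whole word in CMDLINE;
# prints the first missing one otherwise.
has_boot_params() {
    cmdline=" $1 "          # pad with spaces so whole-word matching works
    shift
    for param in "$@"; do
        case "$cmdline" in
            *" $param "*) ;;
            *) echo "missing: $param"; return 1 ;;
        esac
    done
    return 0
}

# On the running system:
#   has_boot_params "$(cat /proc/cmdline)" clock=pmtmr hpet=disable \
#       && echo "both options are active"
```

If the check reports a missing option, the machine probably booted a different menu.lst entry, for example the Failsafe one, which does not carry these parameters.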