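Problem description
A host rebooted after a failure. After the reboot, the clusterware on that node could not start normally, while the other nodes continued to provide service as usual.
Troubleshooting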
- Check the cluster status
The CSS service fails to start.
- Check the cluster logs
[gpnpd(231513)]CRS-2328:GPNPD started on node xxx.
2023-08-14 19:46:09.210:
[cssd(231620)]CRS-1713:CSSD daemon is started in clustered mode
2023-08-14 19:46:09.219:
[cssd(231620)]CRS-1656:The CSS daemon is terminating due to a fatal error; Details at (:CSSSC00011:) in /opt/oracle/grid/11.2.0/grid/log/db2/cssd/ocssd.log
2023-08-14 19:46:11.034:
[ohasd(229354)]CRS-2767:Resource state recovery not attempted for 'ora.diskmon' as its target state is OFFLINE
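The CRS-1656 message above already names the log file to read next. As a minimal sketch, the path can be pulled out of the alert log automatically. The function name `css_fatal_details` is hypothetical, and the alert log path is passed in as an argument because its location is not given in this write-up. The sketch assumes the message ends with `in <path>`, as it does in the excerpt above.

```shell
# Print the log path named in each CRS-1656 message of a clusterware alert log.
# Assumption: the message has the form "...; Details at (...) in <path>".
css_fatal_details() {
    local alert_log="$1"
    grep 'CRS-1656' "$alert_log" | sed -n 's|.* in \(/[^ ]*\)$|\1|p' | sort -u
}

# Usage: css_fatal_details <path-to-alert-log>
```

Deduplicating with `sort -u` keeps the output short when the daemon has been restarted many times and logged the same message repeatedly.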
- Check ocssd.log
view /opt/oracle/grid/11.2.0/grid/log/db2/cssd/ocssd.log
view /opt/oracle/grid/11.2.0/grid/log/jcsjdb2/cssd/ocssd.log
2023-08-14 19:49:55.911: [ CSSD][3743803200](:CSSSC00011:)clssscExit: A fatal error occurred during initialization
2023-08-14 19:59:56.862: [ CSSD][1963468608]clsu_load_ENV_levels: Module = CSSD, LogLevel = 2, TraceLevel = 0
2023-08-14 19:59:56.862: [ CSSD][1963468608]clsu_load_ENV_levels: Module = GIPCNM, LogLevel = , TraceLevel = 0
2023-08-14 19:59:56.862: [ CSSD][1963468608]clsu_load_ENV_levels: Module = GIPCGM, LogLevel = 2, TraceLevel = 0
2023-08-14 19:59:56.862: [ CSSD][1963468608]clsu_load_ENV_levels: Module = GIPCCM, LogLevel = 2, TraceLevel = 0
2023-08-14 19:59:56.862: [ CSSD][1963468608]clsu_load_ENV_levels: Module = CLSF, LogLevel = 0, TraceLevel = 0
2023-08-14 19:59:56.862: [ CSSD][1963468608]clsu_load_ENV_levels: Module = SKGFD, LogLevel = 0, TraceLevel = 0
2023-08-14 19:59:56.862: [ CSSD][1963468608]clsu_load_ENV_levels: Module = GPNP, LogLevel = 1, TraceLevel = 0
2023-08-14 19:59:56.862: [ CSSD][1963468608]clsu_load_ENV_levels: Module = OLR, LogLevel = 0, TraceLevel = 0
[ CSSD][1963468608]clsugetconf : Configuration type [4].
2023-08-14 19:59:56.862: [ CSSD][1963468608]clssscmain:Starting CSS daemon, version 11.2.0.4.0, in (clustered) mode with uniqueness value 1692014396
2023-08-14 19:59:56.863: [ CSSD][1963468608]clssscmain:Environment is production
2023-08-14 19:59:56.863: [ CSSD][1963468608]clssscmain:Core file size limit extended
2023-08-14 19:59:56.868: [ CSSD][1963468608]clssscmain:GIPCHA down 0
2023-08-14 19:59:56.870: [ CSSD][1963468608]clssscGetParameterOLR: OLR fetch for parameter logsize (8) failed with rc 21
2023-08-14 19:59:56.870: [ CSSD][1963468608]clssscExtendLimits: The current soft limit for file descriptors is 65536,hard limit is 65536
2023-08-14 19:59:56.870: [ CSSD][1963468608]clssscExtendLimits: The current soft limit for locked memory is 4294967295, hard limit is 4294967295
2023-08-14 19:59:56.871: [ CSSD][1963468608]clssscGetParameterOLR: OLR fetch for parameter priority (15) failed with rc 21
2023-08-14 19:59:56.871: [ CSSD][1963468608]clssscSetPrivEnv: Setting priority to 4
2023-08-14 19:59:56.881: [ CSSD][1963468608]clssscSetPrivEnv: unable to set priority to 4
2023-08-14 19:59:56.881: [ CSSD][1963468608]SLOS: cat=-2, opn=scls_set_priority_realtime, dep=1, loc=setsched
unable to escalate to real time
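ocssd.log shows that the ocssd process could not obtain a higher priority at startup and could not escalate to real-time scheduling.
The symptoms described in "Linux: GI OCSSD Fails to Start After cgroups Setting Change (Doc ID 1577784.1)" closely match this behavior.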
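To spot this signature without reading the whole log, a small helper can search for the three messages seen in the excerpt above. This is only a sketch: the function name `find_rt_priority_failure` is hypothetical, and the patterns are copied from this one log, so other versions may word the messages differently.

```shell
# Print (with line numbers) the ocssd.log lines showing that CSSD failed
# to switch to real-time priority. Returns non-zero if none are found.
# Assumption: the message text matches the excerpt in this write-up.
find_rt_priority_failure() {
    local ocssd_log="$1"
    grep -nE 'unable to set priority|scls_set_priority_realtime|unable to escalate to real time' "$ocssd_log"
}

# Usage: find_rt_priority_failure /opt/oracle/grid/11.2.0/grid/log/db2/cssd/ocssd.log
```

Because the function returns grep's exit status, it can also be used in a condition such as `if find_rt_priority_failure "$log" >/dev/null; then ...`.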
- Check cgconfig.conf: it contains no configuration, only the default comments.
cat /etc/cgconfig.conf
#
# Copyright IBM Corporation. 2007
#
# Authors: Balbir Singh <balbir@linux.vnet.ibm.com>
# This program is free software; you can redistribute it and/or modify it
# under the terms of version 2.1 of the GNU Lesser General Public License
# as published by the Free Software Foundation.
#
# This program is distributed in the hope that it would be useful, but
# WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
#
#
# By default, we expect systemd mounts everything on boot,
# so there is not much to do.
# See man cgconfig.conf for further details, how to create groups
# on system boot using this file.
- Check /sys/fs/cgroup/cpu/cpu.rt_*
cat /sys/fs/cgroup/cpu/cpu.rt_period_us
1000000
cat /sys/fs/cgroup/cpu/cpu.rt_runtime_us
950000
cpu.rt_runtime_us is already at the recommended value of 950000, with cpu.rt_period_us at 1000000.
The solution in "Linux: GI OCSSD Fails to Start After cgroups Setting Change (Doc ID 1577784.1)" therefore does not apply here.
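The check above only looks at the root cpu cgroup. As an assumption that goes beyond the original text, child cgroups may carry their own cpu.rt_runtime_us, and a value of 0 there would leave processes in that group without a real-time budget. The sketch below lists such groups. The function name `list_zero_rt_cgroups` is hypothetical, and it assumes a cgroup v1 layout like the one shown above.

```shell
# List cpu cgroup directories under a root whose real-time budget is 0.
# Assumption: cgroup v1, with one cpu.rt_runtime_us file per cgroup directory.
list_zero_rt_cgroups() {
    local root="${1:-/sys/fs/cgroup/cpu}"
    local f
    # -H follows the root path if it is a symlink (e.g. cpu -> cpu,cpuacct).
    find -H "$root" -name cpu.rt_runtime_us 2>/dev/null | sort | while IFS= read -r f; do
        if [ "$(cat "$f" 2>/dev/null)" = "0" ]; then
            dirname "$f"
        fi
    done
}

# Usage: list_zero_rt_cgroups /sys/fs/cgroup/cpu
```

An empty result suggests that no child group is restricting real-time scheduling. A non-empty result is a hint to look at what created those groups.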
- Red Hat's official guidance on the relevant CPU settings
How to configure a RHEL 7 or RHEL 8 system to be able to run programs requiring Real-Time Scheduling
According to this article, real-time processes cannot be created when CPUAccounting is enabled. Checking system.conf shows that CPUAccounting is not enabled there.
find /etc/systemd/system.conf /etc/systemd/system /usr/lib/systemd -type f | xargs grep -e CPUAccounting -e CPUWeight -e StartupCPUWeight -e CPUShares -e StartupCPUShares -e CPUQuota
# Output
/etc/systemd/system.conf: #DefaultCPUAccounting=no
/usr/lib/systemd/system/titanagent.service:CPUQuota=50%
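The output shows that /usr/lib/systemd/system/titanagent.service sets CPUQuota=50%. Configuring CPUQuota implicitly enables CPUAccounting, so CPU accounting is turned on even though the system.conf check in the previous step found that CPUAccounting was not explicitly enabled.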
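The search above also matches commented-out defaults such as `#DefaultCPUAccounting=no`, which are not in effect. A variant that reports only active directives might look like the sketch below. The function name `find_cpu_directives` is hypothetical, and the default search paths are the ones used in the command above.

```shell
# Print "file:directive" for every active (uncommented) CPU-related
# systemd directive found in the given files or directories.
find_cpu_directives() {
    # Default to the locations searched in the original command.
    [ "$#" -gt 0 ] || set -- /etc/systemd/system.conf /etc/systemd/system /usr/lib/systemd
    local f
    find -H "$@" -type f 2>/dev/null | sort | while IFS= read -r f; do
        grep -HE '^[[:space:]]*(DefaultCPUAccounting|CPUAccounting|CPUWeight|StartupCPUWeight|CPUShares|StartupCPUShares|CPUQuota)=' "$f" || true
    done
}

# Usage: find_cpu_directives
#        find_cpu_directives /usr/lib/systemd/system
```

Anchoring the pattern at the start of the line is what excludes commented entries, so every hit is a setting that systemd actually applies.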
- After disabling titanagent.service and rebooting the host, the cluster started normally.
systemctl stop titanagent.service
systemctl disable titanagent.service
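After the reboot, one way to confirm the fix is to check that the CSS daemon really runs in a real-time scheduling class. The sketch below filters `ps` output for this. It is not part of the original procedure: the function name `ocssd_sched_report` is hypothetical, and it assumes the daemon's command name starts with `ocssd` and that `ps` reports the RR or FF class for real-time processes.

```shell
# Read `ps -eo pid,cls,rtprio,comm` output on stdin and report, for each
# ocssd process, whether it runs in a real-time class (RR or FF).
# Assumption: the CSS daemon's command name starts with "ocssd".
ocssd_sched_report() {
    awk 'NR > 1 && $4 ~ /^ocssd/ {
        state = ($2 == "RR" || $2 == "FF") ? "realtime" : "NOT realtime"
        print $1, $4, $2, state
    }'
}

# Usage on a live node: ps -eo pid,cls,rtprio,comm | ocssd_sched_report
```

Reading from stdin rather than calling `ps` inside the function makes it easy to run against saved output from another node.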