ASM on Non-First Node (Second or Others) Fails to Start: PMON (ospid: nnnn): terminating the instance due to error 481

This note analyzes why Oracle ASM fails to start on a non-first node and how to resolve it, covering common scenarios such as network misconfiguration, firewall settings, and missing HAIP routes.


APPLIES TO:

Oracle Database - Enterprise Edition - Version 11.2.0.1 and later
Information in this document applies to any platform.

PURPOSE

This note lists common causes of ASM startup failure with the following error on the non-first node (second or others):
  • alert_<ASMn>.log from non-first node
lmon registered with NM - instance number 2 (internal mem no 1)
Tue Dec 06 06:16:15 2011
System state dump requested by (instance=2, osid=19095 (PMON)), summary=[abnormal instance termination].
System State dumped to trace file /g01/app/oracle/diag/asm/+asm/+ASM2/trace/+ASM2_diag_19138.trc
Tue Dec 06 06:16:15 2011
PMON (ospid: 19095): terminating the instance due to error  481
Dumping diagnostic data in directory=[cdmp_20111206061615], requested by (instance=2, osid=19095 (PMON)), summary=[abnormal instance termination].
Tue Dec 06 06:16:15 2011
ORA-1092 : opitsk aborting process

Note: the ASM instance terminates shortly after "lmon registered with NM"

If ASM on the non-first node was running previously, the following will likely appear in its alert.log from the time it originally failed:
IPC Send timeout detected. Sender: ospid 32231 [oracle@ftdcslsedw01b (PING)]
..
ORA-29740: evicted by instance number 1, group incarnation 10

diag trace from non-first ASM (+ASMn_diag_<pid>.trc)
kjzdattdlm: Can not attach to DLM (LMON up=[TRUE], DB mounted=[FALSE]).
kjzdattdlm: Can not attach to DLM (LMON up=[TRUE], DB mounted=[FALSE])
alert_<ASMn>.log from first node
LMON (ospid: 15986) detects hung instances during IMR reconfiguration
LMON (ospid: 15986) tries to kill the instance 2 in 37 seconds.
Please check instance 2's alert log and LMON trace file for more details.
..
Remote instance kill is issued with system inc 64
Remote instance kill map (size 1) : 2
LMON received an instance eviction notification from instance 1
The instance eviction reason is 0x20000000
The instance eviction map is 2
Reconfiguration started (old inc 64, new inc 66)

If the issue happens while running root script (root.sh or rootupgrade.sh) as part of Grid Infrastructure installation/upgrade process, the following symptoms will present:
root script screen output
Start of resource "ora.asm" failed

CRS-2672: Attempting to start 'ora.asm' on 'racnode1'
CRS-5017: The resource action "ora.asm start" encountered the following error:
ORA-03113: end-of-file on communication channel
Process ID: 0
Session ID: 0 Serial number: 0
. For details refer to "(:CLSN00107:)" in "/ocw/grid/log/racnode1/agent/ohasd/oraagent_grid/oraagent_grid.log".
CRS-2674: Start of 'ora.asm' on 'racnode1' failed
..
Failed to start ASM at /ispiris-qa/app/11.2.0.3/crs/install/crsconfig_lib.pm line 1272
$GRID_HOME/cfgtoollogs/crsconfig/rootcrs_<nodename>.log
2011-11-29 15:56:48: Executing cmd: /ispiris-qa/app/11.2.0.3/bin/crsctl start resource ora.asm -init
..
>  CRS-2672: Attempting to start 'ora.asm' on 'racnode1'
>  CRS-5017: The resource action "ora.asm start" encountered the following error:
>  ORA-03113: end-of-file on communication channel
>  Process ID: 0
>  Session ID: 0 Serial number: 0
>  . For details refer to "(:CLSN00107:)" in "/ispiris-qa/app/11.2.0.3/log/racnode1/agent/ohasd/oraagent_grid/oraagent_grid.log".
>  CRS-2674: Start of 'ora.asm' on 'racnode1' failed
>  CRS-2679: Attempting to clean 'ora.asm' on 'racnode1'
>  CRS-2681: Clean of 'ora.asm' on 'racnode1' succeeded
..
>  CRS-4000: Command Start failed, or completed with errors.
>End Command output
2011-11-29 15:59:00: Executing cmd: /ispiris-qa/app/11.2.0.3/bin/crsctl check resource ora.asm -init
2011-11-29 15:59:00: Executing cmd: /ispiris-qa/app/11.2.0.3/bin/crsctl status resource ora.asm -init
2011-11-29 15:59:01: Checking the status of ora.asm
..
2011-11-29 15:59:53: Start of resource "ora.asm" failed
For 12.1.0.2, the root.sh on the 2nd node could report:

PRVG-6056 : Insufficient ASM instances found.  Expected 2 but found 1, on nodes "racnode2".

DETAILS

Case1: link-local IP (169.254.x.x) is being used by another adapter/network

Symptoms:
$GRID_HOME/log/<nodename>/alert<nodename>.log
[/ocw/grid/bin/orarootagent.bin(4813)]CRS-5018:(:CLSN00037:) Removed unused HAIP route:  169.254.95.0 / 255.255.255.0 / 0.0.0.0 / usb0
OS messages (optional)
Dec  6 06:11:14 racnode1 dhclient: DHCPREQUEST on usb0 to 255.255.255.255 port 67
Dec  6 06:11:14 racnode1 dhclient: DHCPACK from 169.254.95.118
ifconfig -a
usb0      Link encap:Ethernet  HWaddr E6:1F:13:AD:EE:D3
        inet addr:169.254.95.120  Bcast:169.254.95.255  Mask:255.255.255.0
..

Note: it is usb0 in this case, but it can be any other adapter that uses a link-local address
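A quick way to spot which adapter holds a link-local address (a hedged check; run on every cluster node, adapter names will vary):

$ /sbin/ip -4 addr show | grep -B 2 "inet 169\.254\."

Only the HAIP aliases on the private interconnect adapter(s) should appear; any other adapter in the output is a candidate for this case.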
Solution:

A link-local IP must not be used by any other network on the cluster nodes. In this case, a USB network device obtained the address 169.254.95.118 from a DHCP server, which disrupted HAIP routing; the solution is to blacklist the device in udev so it is not activated automatically.
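As an illustration only (the rules file name is hypothetical and the KERNEL match must be adjusted to the adapter on your platform), a udev rule along these lines keeps such a device from being brought up automatically:

# /etc/udev/rules.d/90-no-usb-net.rules -- hypothetical file; adjust the KERNEL match to your adapter
SUBSYSTEM=="net", ACTION=="add", KERNEL=="usb*", RUN+="/sbin/ip link set %k down"

Reload the rules (udevadm control --reload-rules) or reboot, then confirm with "ifconfig -a" that the adapter no longer carries a 169.254.x.x address.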

The Dell iDRAC service module may use a link-local address; engage Dell to change the subnet.

On Sun T-series servers, ILOM (adapter name usbecm0) uses a link-local address by default; engage Oracle Support for advice.

Case2: firewall exists between nodes on private network (iptables etc)

No firewall is allowed on the private network (cluster_interconnect) between nodes, including software firewalls such as iptables, ipmon, etc.
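To verify on Linux, for example, list the active iptables rules and disable the service on each node (commands vary by release; on systemd-based platforms firewalld is the usual equivalent):

# /sbin/iptables -L -n
# service iptables stop
# chkconfig iptables off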

Case3: HAIP is up on some nodes but not on all

Symptoms:
alert_<+ASMn>.log for some instances
Cluster communication is configured to use the following interface(s) for this instance
10.1.0.1
alert_<+ASMn>.log for other instances
Cluster communication is configured to use the following interface(s) for this instance
169.254.201.65

Note: some instances are using HAIP while others are not, so they cannot talk to each other
Solution:
The solution is to bring up HAIP on all nodes.

To find out HAIP status, execute the following on all nodes:
$GRID_HOME/bin/crsctl stat res ora.cluster_interconnect.haip -init
If it's offline, try to bring it up as root:
$GRID_HOME/bin/crsctl start res ora.cluster_interconnect.haip -init
If HAIP fails to start, refer to Note 1210883.1 for known issues. Once HAIP is restarted, the ASM/DB instances need to be restarted to use it; if OCR is on an ASM diskgroup, GI needs to be restarted.
If the "up node" is not using HAIP and no outage is allowed, the workaround is to set the init.ora/spfile parameter cluster_interconnects to the private IP of each node so that ASM/DB can come up on the "down node". Once a maintenance window is available, the parameter must be removed to allow HAIP to work.
The following article may assist in determining the reason for the failure to start HAIP:
note 1640865.1 - Known Issues: Grid Infrastructure Redundant Interconnect and ora.cluster_interconnect.haip
If the issue happened in the middle of GI upgrade, refer to: 
note 2063676.1 - rootupgrade.sh fails on node1 as HAIP was not starting from old home but starting from new home

Case4: HAIP is up on all nodes but some do not have route info

Symptoms:
alert_<+ASMn>.log for all instances
Cluster communication is configured to use the following interface(s) for this instance
169.254.xxx.xxx
"netstat -rn" for some nodes (surviving nodes) missing HAIP route
netstat -rn
Destination     Gateway         Genmask         Flags   MSS Window  irtt Iface
161.130.90.0     0.0.0.0         255.255.248.0   U         0 0          0 bond0
160.131.11.0     0.0.0.0         255.255.255.0   U         0 0          0 bond2
0.0.0.0      160.11.80.1     0.0.0.0         UG        0 0          0   bond0

The line for the HAIP route is missing, i.e.:

169.254.0.0     0.0.0.0         255.255.0.0     U         0 0          0 bond2

Note: As the HAIP route info is missing on some nodes, the HAIP address is not pingable; usually a newly restarted node will have the HAIP route info

Solution:
The solution is to manually add the HAIP route on the node(s) where it is missing:

4.1. Execute "netstat -rn" on any node that has HAIP route info and locate the following:

169.254.0.0     0.0.0.0         255.255.0.0     U         0 0          0 bond2

Note: the first field is the HAIP subnet ID (which starts with 169.254), the third field is the HAIP subnet netmask, and the last field is the private network adapter name

4.2. Execute the following as root on the node that's missing HAIP route:
# route add -net <HAIP subnet ID> netmask <HAIP subnet netmask> dev <private network adapter>

i.e.

# route add -net 169.254.0.0 netmask 255.255.0.0 dev bond2
4.3. Start ora.crsd as root on the node that's partially up:
# $GRID_HOME/bin/crsctl start res ora.crsd -init
The other workaround is to restart GI on the node that's missing the HAIP route with the "crsctl stop crs -f" and "crsctl start crs" commands as root.
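Either way, a quick check that the fix worked (the adapter name and remote HAIP address below are examples) is to confirm the route is back and the remote node's HAIP answers:

# netstat -rn | grep 169.254
# ping -c 3 -I bond2 169.254.201.65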

Case5: HAIP is up on all nodes and route info is present but HAIP is not pingable

Symptom:
HAIP is present on both nodes and the route information is also present, but neither node can ping or traceroute the other node's HAIP address.
[oracle@racnode2 script]$ netstat -r
Kernel IP routing table
Destination     Gateway         Genmask         Flags   MSS Window  irtt Iface
192.168.3.0     *               255.255.255.0   U         0 0          0 eth2
192.168.2.0     *               255.255.255.0   U         0 0          0 eth1
192.168.1.0     *               255.255.255.0   U         0 0          0 eth0
link-local      *               255.255.0.0     U         0 0          0 eth2
default         192.168.1.1     0.0.0.0         UG        0 0          0 eth0



[oracle@racnode1 trace]$ netstat -r
Kernel IP routing table
Destination     Gateway         Genmask         Flags   MSS Window  irtt Iface
192.168.3.0     *               255.255.255.0   U         0 0          0 eth2
192.168.2.0     *               255.255.255.0   U         0 0          0 eth1
192.168.1.0     *               255.255.255.0   U         0 0          0 eth0
link-local      *               255.255.0.0     U         0 0          0 eth2
default         192.168.1.1     0.0.0.0         UG        0 0          0 eth0

[oracle@racnode2 script]$ ping 169.254.100.135 
PING 169.254.100.135 (169.254.100.135) 56(84) bytes of data.

^C
--- 169.254.100.135 ping statistics ---
39 packets transmitted, 0 received, 100% packet loss, time 38841ms

[oracle@racnode1 trace]$ ping 169.254.26.132
PING 169.254.26.132 (169.254.26.132) 56(84) bytes of data.

^C
--- 169.254.26.132 ping statistics ---
35 packets transmitted, 0 received, 100% packet loss, time 34555ms

Solution:
For an OpenStack cloud implementation, engage the system administrator to create another neutron port to map the link-local traffic. For other environments, engage the system/network administrators to review the routing and network setup.
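While the setup is being reviewed, a packet capture on the private adapter of each node can show whether link-local traffic leaves one side and arrives at the other (a hedged example; the adapter name and HAIP address are placeholders taken from the output above):

# tcpdump -i eth2 -n host 169.254.100.135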

REFERENCES

NOTE:1640865.1  - Known Issues: Grid Infrastructure Redundant Interconnect and ora.cluster_interconnect.haip
NOTE:1210883.1  - Grid Infrastructure Redundant Interconnect and ora.cluster_interconnect.haip
NOTE:1386709.1  - The Basics of IPv4 Subnet and Oracle Clusterware








Reposted from the ITPUB blog: http://blog.itpub.net/22996654/viewspace-2146517/