udevd worker unexpectedly returned with status 0x0100

Latest recommended article published on 2023-02-26 22:45:26

victoruu

Category: Linux KB. Tags: Redhat Knowledgebase

Environment

Red Hat Enterprise Linux (RHEL) 6, several minor versions
udevd

Issue

A host was losing paths to storage configured with device mapper multipath, and as the paths came back, many errors like the following were logged:

udevd[11136]: worker [13191] failed while handling '/devices/pci0000:00/0000:00:03.2/0000:04:00.1/host2/rport-2:0-4/target2:0:2/2:0:2:31/scsi_device/2:0:2:31'
udevd[11136]: worker [13192] unexpectedly returned with status 0x0100
udevd[11136]: worker [13193] failed while handling '/devices/pci0000:00/0000:00:03.2/0000:04:00.1/host2/rport-2:0-5/target2:0:3/2:0:3:32/block/sdvv'
udevd[11136]: worker [13204] unexpectedly returned with status 0x0100

When rebooting a system, errors similar to the ones above are logged.
After patching and rebooting a RHEL 6 system, the boot hangs with udev errors.
A RHEL 6 server sometimes hangs, panics or reboots with the udev error message: udevd worker unexpectedly returned with status 0x0100.
An Oracle RAC server running Red Hat Enterprise Linux 6 hung with the error message "udevd worker unexpectedly returned with status 0x0100" and had to be rebooted manually to recover.
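
To get a quick picture of how many workers failed and which devices were involved, the messages shown above can be summarized from a syslog file. The following is a minimal sketch; the function name is hypothetical, and it assumes the messages have exactly the wording quoted in this article.

```shell
#!/bin/bash
# Summarize udevd worker failures found in a syslog-style file.
# Usage: summarize_udev_failures /var/log/messages
summarize_udev_failures() {
    local logfile="$1" exits
    # Count workers that exited with the 0x0100 status.
    exits=$(grep -c "unexpectedly returned with status 0x0100" "$logfile")
    echo "workers exited with 0x0100: ${exits}"
    # List each distinct sysfs device path whose event handling failed.
    echo "devices with failed events:"
    sed -n "s/.*failed while handling '\(.*\)'.*/\1/p" "$logfile" | sort -u
}
```

Many distinct device paths under the same host or rport suggest that the failures coincide with path recovery rather than with a single faulty device.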

Resolution

Ensure the kernel option log_buf_len=4M or larger is used. This enlarges the kernel log buffer and prevents cases where the kernel fills up the log with messages, e.g. about discovered LUNs.
Update the packages udev, libudev and libgudev1 to 147-2.63.el6_7.1 (released with RHBA-2015-2654) or later, which include fixes for the known issues. After the package update, rebuild the initramfs image and reboot:
- yum update udev libudev libgudev1
- dracut -f
- Perform a cold boot after updating the packages.
  - shutdown -h now
  - Wait a few minutes and then boot the system
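
Before and after the update, it can help to confirm which builds are actually installed. The sketch below compares the installed version-release of each package against the fixed build named above. The function names are hypothetical, and the comparison relies on GNU `sort -V`, which is assumed to be close enough to RPM's own ordering for these strings.

```shell
#!/bin/bash
# Fixed version-release named in this article.
FIXED_UDEV="147-2.63.el6_7.1"

# Return success if version string $1 sorts at or after $2.
# Assumption: "sort -V" approximates RPM version ordering here.
version_at_least() {
    local have="$1" want="$2"
    # The smaller string sorts first; if that is "want", then have >= want.
    [ "$(printf '%s\n%s\n' "$have" "$want" | sort -V | head -n 1)" = "$want" ]
}

# Report, for each package, whether the installed build contains the fix.
check_udev_packages() {
    local pkg have
    for pkg in udev libudev libgudev1; do
        have=$(rpm -q --qf '%{VERSION}-%{RELEASE}\n' "$pkg" | head -n 1)
        # Skip packages that rpm reports as missing.
        case "$have" in
            ""|*"not installed"*) continue ;;
        esac
        if version_at_least "$have" "$FIXED_UDEV"; then
            echo "$pkg $have: fixed"
        else
            echo "$pkg $have: needs update"
        fi
    done
}
```

Calling `check_udev_packages` on the affected host prints one line per installed package.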
Disable hal if possible: if the system is not used as a desktop, disable hal. Details can be found in "What is hald service used for?".
In addition, if EMC PowerPath is installed and this issue is observed, update the PowerPath software to an appropriate version after updating the RHEL OS. Contact EMC for assistance with this.
We have seen cases where starting the system with a single CPU restored normal operation. If you hit such a situation, two things should be done:
- Check whether the system also starts without kernel options restricting the initial CPUs to 1 once the CPU rule in /lib/udev/rules.d/40-redhat.rules is commented out, i.e. after changing ACTION=="add", KERNEL=="cpu[0-9]*", RUN+="/bin/bash -c 'echo 1 > /sys/devices/system/cpu/%k/online'" into #ACTION=="add", KERNEL=="cpu[0-9]*", RUN+="/bin/bash -c 'echo 1 > /sys/devices/system/cpu/%k/online'". This modification can be kept in place; the only downside is that CPUs added on the fly (without a reboot) will not be brought online automatically. This can affect real hardware as well as virtual guests, e.g. KVM guests getting CPUs added.
- If this change improves the situation, please contact Red Hat Support and ask for a comment to be left in (private) bz1310159.
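
The rule change described above can be scripted so that a backup is kept. This is a minimal sketch with a hypothetical function name; it assumes the rule line starts exactly as quoted in this article and that `sed -i` with a backup suffix is available.

```shell
#!/bin/bash
# Comment out the CPU auto-online rule in a udev rules file.
# A backup is kept next to the file with a .bak suffix.
# Usage (as root): disable_cpu_online_rule /lib/udev/rules.d/40-redhat.rules
disable_cpu_online_rule() {
    local rules_file="$1"
    # Match only uncommented lines that start with the rule's first two keys.
    # The bracket and the star are escaped so sed treats them literally.
    sed -i.bak '/^ACTION=="add", KERNEL=="cpu\[0-9]\*"/ s/^/#/' "$rules_file"
}
```

Because the pattern is anchored on the uncommented form, running the function a second time leaves the already commented line unchanged.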
If you believe you are still hitting this issue, contact Red Hat Support to open a case and reference this article.
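
For the first item in this section, the following sketch checks whether a kernel command line already carries a sufficiently large log_buf_len. The function name is hypothetical, and the parser assumes the value is a plain number with an optional K, M or G suffix.

```shell
#!/bin/bash
# Print the log_buf_len value from a kernel command line in bytes,
# or 0 if the option is absent.
# Usage: log_buf_len_bytes "$(cat /proc/cmdline)"
log_buf_len_bytes() {
    local cmdline="$1" value num
    # Take the last log_buf_len=... token, if any.
    value=$(printf '%s\n' "$cmdline" | tr ' ' '\n' \
            | sed -n 's/^log_buf_len=//p' | tail -n 1)
    if [ -z "$value" ]; then
        echo 0
        return
    fi
    # Strip a trailing size suffix and scale accordingly.
    num=${value%[KkMmGg]}
    case "$value" in
        *[Kk]) echo $((num * 1024)) ;;
        *[Mm]) echo $((num * 1024 * 1024)) ;;
        *[Gg]) echo $((num * 1024 * 1024 * 1024)) ;;
        *)     echo $((num)) ;;
    esac
}
```

A result below 4194304 (4M) means the option is missing or smaller than this article recommends.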

Root Cause

The number of spawned udevd workers depends only on the amount of RAM. As a result, on machines with a lot of RAM and many disks, a large number of udevd workers run in parallel and saturate CPU and I/O. These hardware bottlenecks can cause udev events to time out.
A further fix governs the multiple parallel driver loads that were triggered via modprobe. It prevents unnecessary driver loads, which contributed to high system resource use during device discovery and could also cause udev events to time out.
This and related issues were fixed in udev-147-2.63.el6_7.1 via (private) bugzillas 1281469 and 1281467. Additional fixes from (private) bugzillas 1170313, 885978 and 816724 address other related issues that contribute to 0x0100 messages being displayed.

Diagnostic Steps

The kernel option maxcpus=1 can be used as a workaround in some cases (this option degrades performance and is meant for debugging).
If the issue occurs post-boot, enable additional udev logging:
```
# udevadm control --log-priority=info
```

Try increasing the timeout by adding this line to /lib/udev/rules.d/10-dm.rules:

OPTIONS+="event_timeout=600"

Like this:

...
ENV{DM_UDEV_RULES_VSN}="2"
OPTIONS+="event_timeout=600" 
ENV{DM_UDEV_DISABLE_DM_RULES_FLAG}!="1", ENV{DM_NAME}=="?*", SYMLINK+="(DM_DIR)/$env{DM_NAME}"
...
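
The edit shown above can also be applied with a small script that keeps a backup and does nothing if a timeout is already set. This is a sketch only; the function name is hypothetical, and it assumes the ENV{DM_UDEV_RULES_VSN}="2" line appears in the file exactly as in the excerpt.

```shell
#!/bin/bash
# Insert OPTIONS+="event_timeout=600" after the DM_UDEV_RULES_VSN line.
# Usage (as root): add_event_timeout /lib/udev/rules.d/10-dm.rules
add_event_timeout() {
    local rules_file="$1" tmp
    # Leave the file alone if an event_timeout option is already set.
    if grep -q 'event_timeout=' "$rules_file"; then
        return 0
    fi
    # Keep a backup copy next to the original.
    cp -p "$rules_file" "$rules_file.bak"
    tmp=$(mktemp)
    # Copy every line; after the DM_UDEV_RULES_VSN line, emit the new option.
    awk '{ print }
         index($0, "ENV{DM_UDEV_RULES_VSN}=\"2\"") == 1 {
             print "OPTIONS+=\"event_timeout=600\""
         }' "$rules_file" > "$tmp" && cat "$tmp" > "$rules_file"
    rm -f "$tmp"
}
```

Writing the result back with `cat` rather than `mv` keeps the original file's ownership and permissions.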