Environment
- Red Hat Enterprise Linux (RHEL) 6, several minor versions
- udevd
Issue
- A host was losing paths to storage configured with device mapper multipath, and as the paths came back, many errors like the following were logged:
    udevd[11136]: worker [13191] failed while handling '/devices/pci0000:00/0000:00:03.2/0000:04:00.1/host2/rport-2:0-4/target2:0:2/2:0:2:31/scsi_device/2:0:2:31'
    udevd[11136]: worker [13192] unexpectedly returned with status 0x0100
    udevd[11136]: worker [13193] failed while handling '/devices/pci0000:00/0000:00:03.2/0000:04:00.1/host2/rport-2:0-5/target2:0:3/2:0:3:32/block/sdvv'
    udevd[11136]: worker [13204] unexpectedly returned with status 0x0100
- When rebooting a system, errors similar to the ones above are logged.
- After patching and rebooting a RHEL 6 system, the system fails to boot and the boot hangs with udev errors.
- A RHEL 6 server sometimes hangs, panics, or reboots with the udev error message: udevd worker unexpectedly returned with status 0x0100.
- An Oracle RAC server running Red Hat Enterprise Linux 6 hung with the error message "udevd worker unexpectedly returned with status 0x0100" and had to be rebooted manually to recover.
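To gauge how often the messages above occur, the syslog file can be searched for them. The following is a minimal sketch only: the function name count_udev_worker_failures is hypothetical, and the default path /var/log/messages is an assumption about where syslog writes on the affected host.

```shell
# Hypothetical helper: count udevd worker failure lines in a syslog file.
# The default log path is an assumption; pass another file as the first argument.
count_udev_worker_failures() {
    logfile="${1:-/var/log/messages}"
    # grep -c prints the number of matching lines. It exits non-zero when
    # nothing matches, so the status is ignored with "|| true".
    grep -c -E 'udevd\[[0-9]+\]: worker \[[0-9]+\] (failed while handling|unexpectedly returned with status)' "$logfile" || true
}
```

For example, count_udev_worker_failures /var/log/messages prints the number of matching lines in that file. Comparing the count before and after a change gives a rough indication of whether the change helped.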
Resolution
- Ensure the kernel option log_buf_len=4M (or a larger value) is used. This enlarges the kernel log buffer and prevents cases where the kernel fills the buffer with messages, e.g. about discovered LUNs.
- Update the udev packages udev, libudev and libgudev1 to 147-2.63.el6_7.1 (released with RHBA-2015-2654) or later, which includes fixes for the known issues. After the package update, rebuild the initramfs image:
    yum update udev libudev libgudev1
    dracut -f
  - Perform a cold boot after updating the packages:
      shutdown -h now
  - Wait a few minutes, then boot the system.
- Disable hal if possible: if the system is not used as a desktop, disable hal. Details can be found in "What is hald service used for?".
- In addition, if EMC PowerPath is installed and the issue is observed, update the PowerPath software to an appropriate version after updating the RHEL OS. Contact EMC for assistance with this.
- We have seen cases where starting the system with a single CPU restored normal operation. If you hit such a situation, do two things:
  - Test whether the system starts without kernel options restricting the initial CPUs to 1 after commenting out the following rule in /lib/udev/rules.d/40-redhat.rules, i.e. changing
      ACTION=="add", KERNEL=="cpu[0-9]*", RUN+="/bin/bash -c 'echo 1 > /sys/devices/system/cpu/%k/online'"
    into
      #ACTION=="add", KERNEL=="cpu[0-9]*", RUN+="/bin/bash -c 'echo 1 > /sys/devices/system/cpu/%k/online'"
    This modification can be kept in place. The only downside is that CPUs added on the fly (without a reboot) are no longer onlined automatically. This can affect physical hardware as well as virtual guests, e.g. KVM guests that get CPUs added.
  - If this change improves the situation, please contact Red Hat Support and ask for a comment to be left in (private) bz1310159.
- If you believe you are still hitting this issue, contact Red Hat Support to open a case and reference this article.
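Before and after applying the steps above, it can help to confirm whether log_buf_len is actually set on the running kernel. The following is a minimal sketch: the function name get_log_buf_len is hypothetical, and it only parses a command-line string that is passed to it.

```shell
# Hypothetical helper: print the log_buf_len value found in a kernel
# command line string, or print nothing if the option is absent.
# If the option appears more than once, the last occurrence is printed.
get_log_buf_len() (
    set -f    # disable globbing while the argument is split into words
    value=""
    for arg in $1; do
        case "$arg" in
            log_buf_len=*) value="${arg#log_buf_len=}" ;;
        esac
    done
    if [ -n "$value" ]; then
        printf '%s\n' "$value"
    fi
)
```

On a running host this could be called as get_log_buf_len "$(cat /proc/cmdline)"; an empty result means the option is not set. The installed package versions can be compared against 147-2.63.el6_7.1 with rpm -q udev libudev libgudev1.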
Root Cause
- The number of spawned udevd workers depends only on the amount of RAM. As a result, on machines with large amounts of RAM and many disks, many udevd workers run in parallel and saturate CPU and I/O. These hardware bottlenecks can cause udev events to time out.
- The updated package includes a fix that helps govern the multiple parallel driver loads that were occurring via modprobe. It prevents unnecessary driver loads, which contributed to high system resource use during device discovery and could also cause udev events to time out.
- This and related issues were fixed in udev-147-2.63.el6_7.1 via (private) bugzillas 1281469 and 1281467. Additional fixes from (private) bugzillas 1170313, 885978 and 816724 address other related issues that contribute to the 0x100 messages being displayed.
Diagnostic Steps
- The kernel option maxcpus=1 can be used as a workaround in some cases (this option degrades performance and is meant for debugging).
- If the issue occurs post-boot, enable additional udev logging:
    # udevadm control --log-priority=info
- Try increasing the timeout by adding this line to /lib/udev/rules.d/10-dm.rules:
    OPTIONS+="event_timeout=600"
  Like this:
    ...
    ENV{DM_UDEV_RULES_VSN}="2"
    OPTIONS+="event_timeout=600"
    ENV{DM_UDEV_DISABLE_DM_RULES_FLAG}!="1", ENV{DM_NAME}=="?*", SYMLINK+="(DM_DIR)/$env{DM_NAME}"
    ...
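Before editing the rules file, it may be worth checking whether an event_timeout option is already active in it. The following is a minimal sketch: the function name find_event_timeout is hypothetical, and the default path is the file named in the step above.

```shell
# Hypothetical helper: list active (not commented-out) event_timeout
# options in a udev rules file, prefixed with their line numbers.
find_event_timeout() {
    rules_file="${1:-/lib/udev/rules.d/10-dm.rules}"
    # Only whitespace is allowed before OPTIONS, so lines starting with '#'
    # are skipped. grep exits non-zero when nothing matches, hence "|| true".
    grep -n -E '^[[:space:]]*OPTIONS\+="event_timeout=[0-9]+"' "$rules_file" || true
}
```

Calling find_event_timeout with no argument inspects /lib/udev/rules.d/10-dm.rules. Empty output suggests the timeout line has not been added yet.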