This bug shows up in /var/log/messages as: BROADCOM[32717]: ERROR SemCreate() semget() failed! No space left on device

The message is baffling at first sight, but it turns out to be a semaphore management bug. The full note follows:


Leaked semaphore arrays on X7 database servers lead to database instance startup failure or OEDA step "Create Virtual Machine" failure (Doc ID 2421498.1)

In this Document

  Symptoms
  Cause
  Solution


APPLIES TO:

Oracle Exadata Storage Server Software - Version 18.1.0.0.0 to 18.1.6.0.0 [Release 12.2]
Exadata X7-2 Hardware - Version All Versions and later
Exadata X7-8 Hardware - Version All Versions and later
Information in this document applies to any platform.

SYMPTOMS

On X7-2 and X7-8 Exadata database servers running 18.1.0 through 18.1.6, a third-party tool used by CheckHWnFWProfile leaks one semaphore array per invocation, which typically happens once per day.  In an OVM configuration this occurs in dom0 only.  The maximum number of semaphore arrays (semmni) is typically set to 256.  Once the number of allocated semaphore arrays reaches 256, a variety of failures may occur, such as the following:

  • In a non-virtualized configuration new database instances fail to start with an error indicating insufficient semaphores, such as the following:

    SQL> startup nomount 
    ORA-27154: post/wait create failed 
    ORA-27300: OS system dependent operation:semget failed with status: 28 
    ORA-27301: OS failure message: No space left on device 
    ORA-27302: failure occurred at: sskgpcreates


    Already running database instances are unaffected.  

  • In a virtualized (OVM) configuration OneCommand (OEDA) step "Create Virtual Machine" fails with the following error:

    Error: Command [/opt/exadata_ovm/exadata.img.domu_maker start-domain /EXAVMIMAGES/conf/final-vm.xml] run on node n1v1.oracle.com as user root did not execute successfully...
    Error running oracle.onecommand.deploy.machines.VmUtils method createVms

    /var/log/exadata.img.domu_maker.trc contains the following error:

    [WARNING][/opt/exadata_ovm/exadata.img.domu_maker - 6214][exadata_img_domu_maker_start_domain][] [CMD: kpartx -a -v /EXAVMIMAGES/GuestImages/texa2b3npv07adm.de.t-internal.com/System.img] [CMD_STATUS: 3]
    ----- START STDERR -----
    Limit for the maximum number of semaphores reached. You can check and set the limits in /proc/sys/kernel/sem. create/reload failed on loop6p1
    Limit for the maximum number of semaphores reached. You can check and set the limits in /proc/sys/kernel/sem. create/reload failed on loop6p2
    Limit for the maximum number of semaphores reached. You can check and set the limits in /proc/sys/kernel/sem. create/reload failed on loop6p3


    Already running domUs are unaffected.

  • CheckHWnFWProfile does not properly identify the firmware version of 25G Ethernet devices.

  • Customer-installed software that relies on the ability to allocate a semaphore array to run will fail.
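
The semmni limit mentioned above is the fourth field of /proc/sys/kernel/sem, so how close a server is to the limit can be checked before any of these failures appear. The following is a minimal sketch and not part of the original note: the function names semmni_limit and count_sem_arrays are made up for illustration, and it assumes the usual Linux "ipcs -s" layout in which every array row starts with a hexadecimal key.

```shell
#!/bin/bash
# Sketch only: compare allocated semaphore arrays with the semmni limit.
# Function names are hypothetical, not taken from the Oracle note.

# Print the semmni limit, i.e. the 4th field of /proc/sys/kernel/sem
# (the fields are semmsl, semmns, semopm, semmni).
# An alternative file can be passed as $1.
semmni_limit() {
    awk '{print $4}' "${1:-/proc/sys/kernel/sem}"
}

# Count semaphore arrays in "ipcs -s" output read from stdin.
# Header and separator lines are skipped because only data rows
# begin with a hex key such as 0x00000000.
count_sem_arrays() {
    awk '$1 ~ /^0x[0-9a-fA-F]+$/ {n++} END {print n+0}'
}

# Usage, as root:
#   echo "$(ipcs -s | count_sem_arrays) of $(semmni_limit) semaphore arrays in use"
```

Since the leak is described as roughly one array per day, the gap between the two numbers gives a rough idea of how many days remain before new allocations start failing.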

 

Additional symptoms include the following:

/var/log/messages contains the following errors:

BROADCOM[32717]: ERROR SemCreate() semget() failed! No space left on device
BROADCOM[32717]: ERROR ngBmapiInitialize() LockCreate() failed!


BROADCOM[32717]: ERROR /usr/share/hwdata/pci.ids file should be updated
BROADCOM[32717]: ERROR GetSriovInfo() fopen() /sys/bus/pci/devices/0000:5e:00.0/virtfn0/uevent failed! 2

A large number of semaphore arrays exist, each containing 3 semaphores, that are not associated with any running process.  To find them, run the following bash code as the root user.  It lists semaphore arrays that are still allocated although the process they were associated with no longer exists.

# for semid in $(ipcs -s | egrep ' 3[ ]*$' | awk '{print $2}'); do 
   for pid in $(ipcs -s -p -i $semid | awk '/^[0-9]/{print $NF}'|sort -u); do 
      if ! ps -p $pid >/dev/null 2>&1; then 
         echo "safe to remove semid $semid - no pid $pid"
      fi
   done
done

safe to remove semid 98306 - no pid 15494
safe to remove semid 1703939 - no pid 33680
safe to remove semid 3309572 - no pid 57755
safe to remove semid 5373957 - no pid 269913
safe to remove semid 8814598 - no pid 138286
safe to remove semid 10878983 - no pid 30220
safe to remove semid 13860872 - no pid 175465
safe to remove semid 17301513 - no pid 268284
safe to remove semid 18907146 - no pid 271790
safe to remove semid 22347787 - no pid 156966
safe to remove semid 24412172 - no pid 191491
safe to remove semid 29229069 - no pid 322416
safe to remove semid 32669710 - no pid 369570
safe to remove semid 34275343 - no pid 77291
safe to remove semid 35880976 - no pid 179308
safe to remove semid 38862865 - no pid 81631
safe to remove semid 40468498 - no pid 172204
safe to remove semid 42991635 - no pid 139704
safe to remove semid 45056020 - no pid 195145
safe to remove semid 47120405 - no pid 32180
safe to remove semid 50561046 - no pid 207504
safe to remove semid 52166679 - no pid 374637
safe to remove semid 53772312 - no pid 204800
safe to remove semid 55377945 - no pid 137219
safe to remove semid 56983578 - no pid 237474
safe to remove semid 58589211 - no pid 278120
safe to remove semid 60653596 - no pid 146935
safe to remove semid 62259229 - no pid 394885
safe to remove semid 64323614 - no pid 208690
safe to remove semid 65929247 - no pid 269162
safe to remove semid 67534880 - no pid 323781
safe to remove semid 69599265 - no pid 31314
safe to remove semid 71204898 - no pid 115285
safe to remove semid 72810531 - no pid 178516
safe to remove semid 74416164 - no pid 251490
safe to remove semid 76021797 - no pid 323671
safe to remove semid 77627430 - no pid 366789
safe to remove semid 80150567 - no pid 21204
safe to remove semid 81756200 - no pid 21632
safe to remove semid 83361833 - no pid 164005
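
The loop above reports an array as soon as one of its recorded PIDs is gone. A more cautious variant reports an array only when every PID recorded for it has exited. The following is a sketch under assumptions, not Oracle's code: the function names are hypothetical, and it must run as root so that "kill -0" fails only when the process really does not exist.

```shell
#!/bin/bash
# Sketch only: list 3-semaphore arrays whose recorded PIDs have ALL exited.
# Function names are hypothetical, not taken from the Oracle note.

# Extract the unique PIDs from "ipcs -s -i <semid>" output on stdin.
# Per-semaphore rows start with a digit and the PID is the last column.
sem_pids() {
    awk '/^[0-9]/ {print $NF}' | sort -u
}

# Succeed only if every PID read from stdin (one per line) is gone.
# Assumes root: for other users "kill -0" also fails on permission errors.
all_pids_gone() {
    local pid
    while read -r pid; do
        [ -n "$pid" ] || continue
        if kill -0 "$pid" 2>/dev/null; then
            return 1    # at least one process still exists
        fi
    done
    return 0
}

# Print the semid of each array with exactly 3 semaphores (5th column
# of "ipcs -s") for which no recorded PID is still running.
list_leaked_semids() {
    local semid
    for semid in $(ipcs -s | awk '$1 ~ /^0x/ && $5 == 3 {print $2}'); do
        if ipcs -s -i "$semid" | sem_pids | all_pids_gone; then
            echo "$semid"
        fi
    done
}

# Usage, as root:
#   list_leaked_semids
```

The PID shown by ipcs is the last process that operated on each semaphore, so a reused PID can hide a leaked array; that errs on the side of keeping arrays rather than removing one still in use.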

CAUSE

Bug 28027670 

SOLUTION

This issue is fixed in 18.1.7.
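
Whether a given server still needs the workaround depends on its image version. The following sketch tests a version string against the range stated in this note (18.1.0 up to, but not including, 18.1.7). It is an illustration only: the function names are hypothetical, it relies on GNU "sort -V" for version ordering, and it checks the software version only, not whether the hardware is X7-2 or X7-8.

```shell
#!/bin/bash
# Sketch only: is an image version inside the affected range of this note?
# Function names are hypothetical, not taken from the Oracle note.

# Succeed if version $1 sorts strictly before version $2 (GNU sort -V).
version_lt() {
    [ "$1" != "$2" ] &&
        [ "$(printf '%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}

# Succeed if $1 is at least 18.1.0 and lower than 18.1.7.
is_affected_version() {
    local v=$1
    ! version_lt "$v" "18.1.0" && version_lt "$v" "18.1.7"
}

# Usage, assuming "imageinfo -ver" prints the active image version:
#   if is_affected_version "$(imageinfo -ver)"; then
#       echo "version is in the affected range"
#   fi
```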

Workaround

Remove the leaked semaphore arrays that are no longer associated with a running process by running the following bash code as the root user.  Run this code on all database servers in the cluster.  In an OVM configuration, run this code in dom0.

# for semid in $(ipcs -s | egrep ' 3[ ]*$' | awk '{print $2}'); do 
   for pid in $(ipcs -s -p -i $semid | awk '/^[0-9]/{print $NF}'|sort -u); do 
      if ! ps -p $pid >/dev/null 2>&1; then 
         echo "removing semid $semid"
         ipcrm -s $semid
      fi
   done
done
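
Because ipcrm cannot be undone, it can help to preview the removals first. The following sketch wraps the removal step in a function that takes semaphore IDs on standard input and supports a dry-run mode. The function name remove_semids and the --dry-run option are assumptions made for this illustration; they are not part of the Oracle note or of ipcrm.

```shell
#!/bin/bash
# Sketch only: remove semaphore arrays by ID, with an optional dry run.
# remove_semids and --dry-run are hypothetical names for illustration.

# Read semaphore IDs from stdin, one per line.
# With --dry-run, print what would be removed and change nothing.
remove_semids() {
    local dry_run=0 semid
    [ "${1:-}" = "--dry-run" ] && dry_run=1
    while read -r semid; do
        case $semid in
            ''|*[!0-9]*) continue ;;    # skip blank or non-numeric lines
        esac
        if [ "$dry_run" -eq 1 ]; then
            echo "would remove semid $semid"
        else
            echo "removing semid $semid"
            ipcrm -s "$semid"
        fi
    done
}

# Usage, as root, with IDs taken from the detection output:
#   printf '%s\n' 98306 1703939 | remove_semids --dry-run
#   printf '%s\n' 98306 1703939 | remove_semids
```

Each ID is passed to ipcrm once, so an array with several dead PIDs does not trigger repeated removal attempts.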