HBA card reported an OCR (Online Controller Reset)

Issue:

-- Server details

Product Manufacturer  : Oracle Corporation
Product Name          : Exadata X8-2
Product Part Number   : Exadata X8-2

FRU Device Description : /SYS (LUN 0, ID 3)
Chassis Type          : Rack Mount Chassis
Product Manufacturer  : Oracle Corporation
Product Name          : ORACLE SERVER X8-2

-- DB server

      
        bbuStatus:              normal
        coreCount:              48/48
        cpuCount:               96/96
        diagHistoryDays:        7
        fanCount:               16/16
        fanStatus:              normal
        httpsAccess:            ALL
        id:                     2020XLB0HR
        interconnectCount:      2
        interconnect1:          ib0
        interconnect2:          ib1
        kernelVersion:          4.1.12-124.42.4.el7uek.x86_64
        locatorLEDStatus:       off
        makeModel:              Oracle Corporation ORACLE SERVER X8-2
        metricHistoryDays:      7
        msVersion:              OSS_19.2.22.0.0_LINUX.X64_210113
        powerCount:             2/2
        powerStatus:            normal
        releaseImageStatus:     success
        releaseVersion:         19.2.22.0.0.210113
        releaseTrackingBug:     32367052
        status:                 online
        temperatureReading:     24.0
        temperatureStatus:      normal
        upTime:                 173 days, 4:58 <============= Host was not rebooted
        msStatus:               running
        rsStatus:               running

-- Image version running

Kernel version: 4.1.12-124.42.4.el7uek.x86_64 #2 SMP Thu Sep 3 16:14:48 PDT 2020 x86_64
Image kernel version: 4.1.12-124.42.4.el7uek
Image version: 19.2.22.0.0.210113
Image created: 2021-01-14 00:43:45 -0800
Image activated: 2021-03-26 09:45:37 +0900
Image image type: production
Image status: success
Image label: OSS_19.2.22.0.0_LINUX.X64_210113
Node type: COMPUTE
System partition on device: /dev/mapper/VGExaDb-LVDbSys1

-- No critical alerts reported in alert history

227     2021-09-24T15:00:46+09:00       info            "HDD disk controller battery temperature is normal for Adapter 0"

-- According to the OS logs, the controller reported a fatal error and was reset

Sep 24 15:00:45 hmymsdb3 kernel: megaraid_sas 0000:65:00.0: 9430 (685810825s/0x0020/CRIT) - Controller encountered a fatal error and was reset
Sep 24 15:00:45 hmymsdb3 kernel: megaraid_sas 0000:65:00.0: 9430 (685810825s/0x0020/CRIT) - Controller encountered a fatal error and was reset
Sep 24 15:00:45 hmymsdb3 kernel: megaraid_sas 0000:65:00.0: 9430 (685810825s/0x0020/CRIT) - Controller encountered a fatal error and was reset
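The fields inside the parentheses of the kernel message can be decoded to cross-check the event time. The following is a minimal sketch, not an official tool: `parse_megaraid_event` and the field names are hypothetical, and it assumes that the first field (9430) is the controller's event sequence number and that the `...s` value counts seconds from 2000-01-01 00:00:00 on the controller's local clock. Neither assumption is stated in the log itself.

```python
import re
from datetime import datetime, timedelta

# Assumption: the controller counts event time in seconds from 2000-01-01 00:00:00
# on its own (local) clock. The log does not say this explicitly.
MEGARAID_EPOCH = datetime(2000, 1, 1)

# Matches e.g. "megaraid_sas 0000:65:00.0: 9430 (685810825s/0x0020/CRIT) - <message>"
EVENT_RE = re.compile(
    r"megaraid_sas (?P<pci>[0-9a-f:.]+): (?P<seq>\d+) "
    r"\((?P<secs>\d+)s/(?P<code>0x[0-9a-fA-F]+)/(?P<severity>\w+)\) - (?P<message>.*)"
)

def parse_megaraid_event(line):
    """Split one megaraid_sas kernel message into its fields, or return None."""
    m = EVENT_RE.search(line)
    if m is None:
        return None
    return {
        "pci": m["pci"],
        "seq": int(m["seq"]),
        "controller_time": MEGARAID_EPOCH + timedelta(seconds=int(m["secs"])),
        "code": m["code"],          # kept as-is; its meaning is not interpreted here
        "severity": m["severity"],
        "message": m["message"],
    }

line = ("Sep 24 15:00:45 hmymsdb3 kernel: megaraid_sas 0000:65:00.0: 9430 "
        "(685810825s/0x0020/CRIT) - Controller encountered a fatal error and was reset")
event = parse_megaraid_event(line)
print(event["seq"], event["severity"], event["controller_time"])
# prints: 9430 CRIT 2021-09-24 15:00:25
```

Under that assumption the controller stamped the event at 15:00:25, which falls between the chip reset request at 15:00:13 in the firmware log and the syslog entry at 15:00:45.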

-- The controller firmware is as follows:


Product Name    : Avago MegaRAID SAS 9361-16i
Serial No       : SK01079259
FW Package Build: 24.22.0-0055

-- Checking the firmware terminal (term) logs
-- No errors are reported; however, the controller received a chip reset request

09/24/21  6:21:12: C0:EVT#09427-09/24/21  6:21:12:  35=Patrol Read complete
09/24/21 11:38:05: C0:Bad Block Count for LD 0 is 0
09/24/21 15:00:13: C1:MonSetAllowChipReset: MonAllowResetChip 1  <======== Controller received chip reset request

09/24/21 15:00:13: C1:In MonTask; Seconds from powerup = 0x00defce0
09/24/21 15:00:13: C1:Max Temperature = 68 on Channel 4
Firmware crash dump feature enabled
Crash dump collection will start immediately
copied 513 MB in 10040639 Microseconds Mfi State = f0010006
adapterResetRequestIsr  CCRMiscCfg c001ff06 timeUs: 86fa0850
SramFixedC->ioPathCode=c0060000 _io_path_code_start=c0060000
T0: C0:sramInitDynMemandParamsForPersonality: gDmaStructs: c0140000 gmfaHiPriFIFOAddr:c01c3800 gfpeFIFOAddr:c01c5800 greqPostFIFOAddr:c01c7800 ghsFIFOAddr:c01c9800 grlBypassFIFOAddr:c01ca000
T0: C0:sramInitDynMemandParamsForPersonality: gDmPlDynSramFreeHead:c01ce000 gDmPlDynSramFreeTail:c0400000, gDmPlDynSramFreeBytes:232000
T0: C0:Flash Size = 16 MB
T0: C0:initializeFlashTmo: maxMarginProgTmo=3072 maxMarginEraseTmo=3072
T0: C0:TtyInit: FlashLog @ 0xfc480000 Size = 0x200000
T0: C0:TtyInit: FlashTty @ 0xfc680000 Size = 0x80000

T0: C0:AVAGO ROC firmware
T0: C0:Copyright(C) AVAGO Technologies, 2014
T0: C0:Firmware version 4.740.00-8440 built on Jun 19 2019 at 05:40:28
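Two of the raw values in the firmware log are easier to read after conversion. The sketch below is only illustrative arithmetic; it assumes that `Seconds from powerup` is a plain hexadecimal seconds counter, as its label suggests, and the variable names are made up for this example.

```python
from datetime import timedelta

# Values copied from the firmware terminal log.
seconds_from_powerup = int("0x00defce0", 16)   # "Seconds from powerup = 0x00defce0"
dump_megabytes = 513                            # "copied 513 MB ..."
dump_microseconds = 10040639                    # "... in 10040639 Microseconds"

# Assumption: the counter is in seconds since the controller was powered up.
controller_uptime = timedelta(seconds=seconds_from_powerup)

dump_seconds = dump_microseconds / 1_000_000
dump_rate = dump_megabytes / dump_seconds       # MB per second

print(f"controller uptime at reset: {controller_uptime}")
print(f"crash dump: {dump_seconds:.2f} s at about {dump_rate:.1f} MB/s")
```

The counter works out to 169 days, 3:22:08, and the crash dump took about 10 seconds (roughly 51 MB/s). The controller uptime is of the same order as the host upTime of 173 days listed in the DB server details; the gap may simply reflect when each output was collected, which the note does not state.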

-- No other faults were reported
-- The controller reported an OCR (Online Controller Reset)
-- OCR is a controller feature that lets the controller perform an internal reset when it hits a firmware fault
-- Unfortunately, we have no further details about why the controller issued the OCR
-- The OCR completed successfully and the controller continued functioning normally after the event
-- We have observed similar OCR events on this controller, also without any detailed information
-- Engineering was engaged under the bug to investigate why the controller reports an OCR
-- However, because the controller did not log much information about the condition that led to the OCR, we are unable to determine the root cause
-- We also noticed that the affected systems did not report repeated OCR events, so the chance of another such incident is very low
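To check whether a system has logged more than one OCR, the kernel messages can be counted per controller. This is a rough sketch under stated assumptions: `count_reset_events` is a hypothetical helper, and it treats lines that share the same PCI address and the same leading event number as one event logged several times.

```python
import re
from collections import Counter

RESET_TEXT = "Controller encountered a fatal error and was reset"
# Captures the PCI address and the leading event number, e.g. "0000:65:00.0" and "9430" in
# "megaraid_sas 0000:65:00.0: 9430 (685810825s/0x0020/CRIT) - ..."
SEQ_RE = re.compile(r"megaraid_sas (\S+): (\d+) \(")

def count_reset_events(lines):
    """Return {pci_address: number of distinct reset events} for the given log lines."""
    seen = set()
    for line in lines:
        if RESET_TEXT not in line:
            continue
        m = SEQ_RE.search(line)
        if m:
            # Assumption: same (device, event number) means the same event repeated.
            seen.add((m.group(1), int(m.group(2))))
    return dict(Counter(pci for pci, _ in seen))

sample = [
    "Sep 24 15:00:45 hmymsdb3 kernel: megaraid_sas 0000:65:00.0: 9430 "
    "(685810825s/0x0020/CRIT) - Controller encountered a fatal error and was reset",
] * 3
print(count_reset_events(sample))
# prints: {'0000:65:00.0': 1}
```

On a live system the lines could be read from the OS log file, for example `/var/log/messages`; a count above 1 for a controller would indicate a recurring OCR.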

-- As a best practice, we recommend upgrading the image to one of the latest image releases
-- This also upgrades the controller and drive firmware, which helps resolve any underlying firmware bugs
