Issue:
-- Server details:
Product Manufacturer : Oracle Corporation
Product Name : Exadata X8-2
Product Part Number : Exadata X8-2
FRU Device Description : /SYS (LUN 0, ID 3)
Chassis Type : Rack Mount Chassis
Product Manufacturer : Oracle Corporation
Product Name : ORACLE SERVER X8-2
-- DB server
bbuStatus: normal
coreCount: 48/48
cpuCount: 96/96
diagHistoryDays: 7
fanCount: 16/16
fanStatus: normal
httpsAccess: ALL
id: 2020XLB0HR
interconnectCount: 2
interconnect1: ib0
interconnect2: ib1
kernelVersion: 4.1.12-124.42.4.el7uek.x86_64
locatorLEDStatus: off
makeModel: Oracle Corporation ORACLE SERVER X8-2
metricHistoryDays: 7
msVersion: OSS_19.2.22.0.0_LINUX.X64_210113
powerCount: 2/2
powerStatus: normal
releaseImageStatus: success
releaseVersion: 19.2.22.0.0.210113
releaseTrackingBug: 32367052
status: online
temperatureReading: 24.0
temperatureStatus: normal
upTime: 173 days, 4:58 <============= Host not reset
msStatus: running
rsStatus: running
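The status dump above is plain `name: value` text, so it can be parsed programmatically when checking status across nodes; a minimal Python sketch (the sample lines are taken from the output above):

```python
def parse_detail(text: str) -> dict:
    """Parse 'name: value' lines (dbmcli-style output) into a dict."""
    attrs = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        if sep and name.strip():
            attrs[name.strip()] = value.strip()
    return attrs

sample = """\
powerStatus: normal
status: online
temperatureReading: 24.0
upTime: 173 days, 4:58
"""

attrs = parse_detail(sample)
```

Splitting on the first colon only keeps values such as the uptime, which themselves contain colons, intact.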
-- Image version running
Kernel version: 4.1.12-124.42.4.el7uek.x86_64 #2 SMP Thu Sep 3 16:14:48 PDT 2020 x86_64
Image kernel version: 4.1.12-124.42.4.el7uek
Image version: 19.2.22.0.0.210113
Image created: 2021-01-14 00:43:45 -0800
Image activated: 2021-03-26 09:45:37 +0900
Image image type: production
Image status: success
Image label: OSS_19.2.22.0.0_LINUX.X64_210113
Node type: COMPUTE
System partition on device: /dev/mapper/VGExaDb-LVDbSys1
-- No critical alerts reported in alert history
227 2021-09-24T15:00:46+09:00 info "HDD disk controller battery temperature is normal for Adapter 0"
-- According to the OS logs, the controller reported a fatal error and was reset
Sep 24 15:00:45 hmymsdb3 kernel: megaraid_sas 0000:65:00.0: 9430 (685810825s/0x0020/CRIT) - Controller encountered a fatal error and was reset
Sep 24 15:00:45 hmymsdb3 kernel: megaraid_sas 0000:65:00.0: 9430 (685810825s/0x0020/CRIT) - Controller encountered a fatal error and was reset
Sep 24 15:00:45 hmymsdb3 kernel: megaraid_sas 0000:65:00.0: 9430 (685810825s/0x0020/CRIT) - Controller encountered a fatal error and was reset
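A quick way to count such events across the retained logs is to scan for the megaraid_sas CRIT marker; a hedged sketch in Python (in practice the input would be the syslog file, e.g. /var/log/messages; here an inline sample based on the lines above):

```python
import re

# CRIT-severity kernel events logged by the megaraid_sas driver.
CRIT_RE = re.compile(r"megaraid_sas\s+\S+:.*?/CRIT\)\s+-\s+(?P<msg>.*)")

def crit_events(lines):
    """Return the CRIT messages found in an iterable of syslog lines."""
    return [m.group("msg") for line in lines if (m := CRIT_RE.search(line))]

sample = [
    "Sep 24 15:00:45 hmymsdb3 kernel: megaraid_sas 0000:65:00.0: 9430 "
    "(685810825s/0x0020/CRIT) - Controller encountered a fatal error and was reset",
    "Sep 24 15:01:00 hmymsdb3 kernel: megaraid_sas 0000:65:00.0: scanning for scsi0...",
]

events = crit_events(sample)
```

Repeated identical timestamps, as seen above, are common when the driver emits the same event several times during recovery.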
-- The controller firmware is as follows:
Product Name : Avago MegaRAID SAS 9361-16i
Serial No : SK01079259
FW Package Build: 24.22.0-0055
-- Checking the firmware term logs
-- No errors were reported; however, the controller logged a chip reset
09/24/21 6:21:12: C0:EVT#09427-09/24/21 6:21:12: 35=Patrol Read complete
09/24/21 11:38:05: C0:Bad Block Count for LD 0 is 0
09/24/21 15:00:13: C1:MonSetAllowChipReset: MonAllowResetChip 1 <======== Controller received chip reset request
09/24/21 15:00:13: C1:In MonTask; Seconds from powerup = 0x00defce0
09/24/21 15:00:13: C1:Max Temperature = 68 on Channel 4
Firmware crash dump feature enabled
Crash dump collection will start immediately
copied 513 MB in 10040639 Microseconds Mfi State = f0010006
adapterResetRequestIsr CCRMiscCfg c001ff06 timeUs: 86fa0850
SramFixedC->ioPathCode=c0060000 _io_path_code_start=c0060000
T0: C0:sramInitDynMemandParamsForPersonality: gDmaStructs: c0140000 gmfaHiPriFIFOAddr:c01c3800 gfpeFIFOAddr:c01c5800 greqPostFIFOAddr:c01c7800 ghsFIFOAddr:c01c9800 grlBypassFIFOAddr:c01ca000
T0: C0:sramInitDynMemandParamsForPersonality: gDmPlDynSramFreeHead:c01ce000 gDmPlDynSramFreeTail:c0400000, gDmPlDynSramFreeBytes:232000
T0: C0:Flash Size = 16 MB
T0: C0:initializeFlashTmo: maxMarginProgTmo=3072 maxMarginEraseTmo=3072
T0: C0:TtyInit: FlashLog @ 0xfc480000 Size = 0x200000
T0: C0:TtyInit: FlashTty @ 0xfc680000 Size = 0x80000
T0: C0:AVAGO ROC firmware
T0: C0:Copyright(C) AVAGO Technologies, 2014
T0: C0:Firmware version 4.740.00-8440 built on Jun 19 2019 at 05:40:28
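For reference, the crash dump copy line above ("copied 513 MB in 10040639 Microseconds") works out to roughly 51 MB/s; the arithmetic:

```python
# Figures taken from the firmware term log above.
mb_copied = 513
microseconds = 10_040_639

# Convert microseconds to seconds, then divide.
throughput_mb_s = mb_copied / (microseconds / 1_000_000)
```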
-- No other faults reported
-- The controller reported an OCR (Online Controller Reset)
-- OCR is a controller feature that lets the firmware perform an internal reset and recover when it encounters a fatal fault
-- Unfortunately we do not have any further details about why the controller issued the OCR
-- The OCR completed successfully and the controller continued functioning normally after the event
-- We have observed similar OCR events on other controllers, likewise without any detailed information
-- Engineering has been engaged under the bug to investigate why the controller reported an OCR
-- However, since the controller did not log much information about the condition that led to the OCR, we are unable to determine the root cause
-- We also note that the systems have not reported repeated OCR events, so the chance of another such incident is very low
-- As a best practice, we recommend upgrading the image to one of the latest image releases
-- This would also upgrade the controller and drive firmware, which would resolve any underlying firmware bugs
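When verifying whether a node still runs this image, the dotted release strings compare cleanly as integer tuples; a sketch (the "latest" value below is a hypothetical placeholder, not a real release number):

```python
def version_tuple(version: str):
    """'19.2.22.0.0.210113' -> (19, 2, 22, 0, 0, 210113) for ordered comparison."""
    return tuple(int(part) for part in version.split("."))

running = "19.2.22.0.0.210113"  # from the image version output above
latest = "19.3.99.0.0.999999"   # hypothetical newer release, for illustration only

needs_upgrade = version_tuple(running) < version_tuple(latest)
```

Comparing tuples of integers avoids the pitfalls of lexicographic string comparison (e.g. "19.10" sorting before "19.2").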