mysql+call+setup_ORA-00600[kjhn_post_ha_alert0-862]原因分析

Bug 5173066 : REPEATED ORA-600 [KJHN_POST_HA_ALERT0-862] FROM MMON PROCESS

Hdr: 5173066 10.2.0.1.0 RDBMS 10.2.0.1.0 RAC PRODID-5 PORTID-207 ORA-600

Abstract: REPEATED ORA-600 [KJHN_POST_HA_ALERT0-862] FROM MMON PROCESS

*** RSERNA 04/19/06 12:06 pm ***

TAR:

----

@SR:5322693.992

PROBLEM:

--------

Tue Apr 18 18:17:58 2006

Completed: alter database open

Tue Apr 18 18:18:33 2006

Errors in file c:\oracle\product\10.2.0\admin\sfa\bdump\sfa_mmon_1384.trc:

ORA-600: internal error code, arguments: [kjhn_post_ha_alert0-862], [], [],

[], [], [], [], []

Tue Apr 18 18:18:59 2006

Errors in file c:\oracle\product\10.2.0\admin\sfa\bdump\sfa_mmon_1384.trc:

ORA-600: internal error code, arguments: [kjhn_post_ha_alert0-862], [], [],

[], [], [], [], []

Tue Apr 18 18:20:31 2006

Errors in file c:\oracle\product\10.2.0\admin\sfa\bdump\sfa_mmon_1384.trc:

ORA-600: internal error code, arguments: [kjhn_post_ha_alert0-862], [], [],

[], [], [], [], []

Tue Apr 18 18:20:35 2006

Errors in file c:\oracle\product\10.2.0\admin\sfa\bdump\sfa_mmon_1384.trc:

ORA-600: internal error code, arguments: [kjhn_post_ha_alert0-862], [], [],

[], [], [], [], []

Tue Apr 18 18:25:32 2006

Errors in file c:\oracle\product\10.2.0\admin\sfa\bdump\sfa_mmon_1384.trc:

ORA-600: internal error code, arguments: [kjhn_post_ha_alert0-862], [], [],

[], [], [], [], []

Tue Apr 18 18:25:35 2006

Errors in file c:\oracle\product\10.2.0\admin\sfa\bdump\sfa_mmon_1384.trc:

ORA-600: internal error code, arguments: [kjhn_post_ha_alert0-862], [], [],

[], [], [], [], []

generating large trace files.

DIAGNOSTIC ANALYSIS:

--------------------

Customer already has several laptops with the same scenario.

Personal Oracle Database 10g Release 10.2.0.1.0 - Production

With the Partitioning, OLAP and Data Mining options

Windows XP Version V5.1 Service Pack 2

CPU : 1 - type 586

Process Affinity : 0x00000000

Memory (Avail/Total): Ph:728M/1271M, Ph+PgF:2550M/3033M, VA:1746M/2047M

Instance name: sfa

*** SERVICE NAME:(SYS$BACKGROUND) 2006-04-18 18:18:33.736

*** SESSION ID:(106.1) 2006-04-18 18:18:33.736

*** 2006-04-18 18:18:33.736

ksedmp: internal or fatal error

ORA-600: internal error code, arguments: [kjhn_post_ha_alert0-862], [], [],

[], [], [], [], []

Current SQL statement for this session:

BEGIN :success := dbms_ha_alerts_prvt.check_ha_resources; END;

----- PL/SQL Call Stack -----

object line object

handle number name

6959CF70 418 package body SYS.DBMS_HA_ALERTS_PRVT

6959CF70 552 package body SYS.DBMS_HA_ALERTS_PRVT

6959CF70 305 package body SYS.DBMS_HA_ALERTS_PRVT

692627B4 1 anonymous block

Executing the procedure manually gives:

SQL> var success varchar2

SQL> begin

2 :success := sys.dbms_ha_alerts_prvt.check_ha_resources;

3 end;

4 /

PL/SQL procedure successfully completed.

SQL> print success

SUCCESS

--------------------------------

N

WORKAROUND:

-----------

None.

No documentation found on what the procedure does, but

assuming the success='N' is why the errors are occuring.

RELATED BUGS:

-------------

REPRODUCIBILITY:

----------------

As soon as database starts.

TEST CASE:

----------

None.

STACK TRACE:

------------

ksedst ksedmp ksfdmp kgerinv kgeasnmierr kjhn_post_ha_alert spefcmpa

spefmccallstd pextproc peftrusted psdexsp rpiswu2 psdextp pefccal

pefcal pevm_FCAL pfrinstr_FCAL pfrrun_no_tool pfrrun plsql_run peicnt

kkxexe opiexe kpoal8 opiodr kpoodr xupirtrc upirtrc kpurcsc

kpuexecv8 kpuexec OCIStmtExecute kjhn_mmon_action 459

kjhn_check_ha_reso urces kebm_ronce_dispatc her kebm_ronce_execute

ksb_run_timeout_an d_intr_action kebm_open_action ksbabs kebm_mmon_main

ksbrdp opirip VInfreq__opidrv sou2o opimai_real opimai

SUPPORTING INFORMATION:

-----------------------

24 HOUR CONTACT INFORMATION FOR P1 BUGS:

----------------------------------------

DIAL-IN INFORMATION:

--------------------

IMPACT DATE:

------------

*** RSERNA 04/19/06 12:44 pm *** (CHG: Sta->16)

*** RSERNA 04/19/06 12:44 pm ***

@ RDA and alert_traces have been uploaded.

*** SSRAO 04/19/06 09:03 pm *** (CHG: Asg->SSRAO)

*** SSRAO 05/09/06 01:49 am ***

@ BDE Screening in progress. Thanks

*** SSRAO 05/09/06 01:51 am ***

@ Reviewing code for the assert. Thanks

*** SSRAO 05/09/06 01:54 am ***

@ The assert condition is

@ --

@ KSEORAASSERTNM(instance_name && *instance_name &&

@ host_name && *host_name,

@ OERINM("kjhn_post_ha_alert0-862"));

@ -- which is the instance_name and the host_name are set within the code. This

@ function is kjhn_post_ha_alert0() which is internal and does the real work of

@ posting HA alerts. It is used by both kjhn_post_ha_alert and

@ kjn_post_ha_alert_plsql. Its parameters are basically the same as those of

@ kjhn_post_ha_alert,other than the fact that it uses individual parameters

@ rather than the more easily extensible structure. Also the parameters passed

@ to it are the instance_name and the host_name which is the kernelized

@ implementation for posting HA alerts. Without actually having the arguments

@ the guess is that either the host_name or the instance_name raised in the

@ assert is null which triggered it. Still reviewing. Thanks

*** SSRAO 05/09/06 01:54 am *** (CHG: SubComp->RAC)

*** SSRAO 05/09/06 02:21 am ***

@ From he stack it seems we have

@ --

@ kjhn_mmon_action kjhn_check_ha_resources kebm_ronce_dispatcher

@ kebm_ronce_execute ksb_run_timeout_and_intr_action

@ --

@ which is called from within the MMON routine for checking the HA resources.

@ Statement executed through PLSQL is

@ --

@ BEGIN :success := dbms_ha_alerts_prvt.check_ha_resources; END;

@ ..

@ 04/13/06 13:52:27 >ERROR: exception at

@ dbms_ha_alerts_prvt.check_ha_resources637: SQLCODE -600,ORA-600: internal

@ error

@ code, arguments: [kjhn_post_ha_alert0-862], [], [], [], [], [], [], []

@ 04/13/06 13:52:27 >parameter dump for dbms_ha_alerts_prvt.check_ha_resources

@ 04/13/06 13:52:27 > - local_db_unique_name (SFA)

@ 04/13/06 13:52:27 > - local_db_domain (==N/A==)

@ 04/13/06 13:52:27 > - rows deleted (0)

@ --

@ .

@ This check is triggered with FAN enabled at this instance and it seems to be

@ associated with a startup action. From the procedure itself which is called

@ this is a run-once MMON (startup) action which supports instance down

@ notification reliability. It does the folowing a) registers the current

@ instance incarnation in recent_resource_incarnations$ if it's not already

@ there b) deletes recent_resource_incarnations$ records that don't apply to

@ this database. They may, e.g., have been copied from seed db or from a former

@ DataGuard primary c) scans recent_resource_incarnations$ for instance

@ incarnations that are no longer alive, and submits instance down alerts for

@ them . If all is good then return 'Y' else 'N' (or error) if there is a

@ failure. That failure is to get back to MMON, so that it may retry this

@ action later. In the local instance I get a 'Y' but in the customer's system

@ it fails with a 'N' which seems related to the ORA-600 assert. Still

@ reviewing. Thanks

*** SSRAO 05/09/06 02:31 am ***

@ Tracking this trace output

@ --

@ 04/13/06 13:52:27 >parameter dump for

@ dbms_ha_alerts_prvt.check_ha_resources

@ 04/13/06 13:52:27 > - local_db_unique_name (SFA)

@ 04/13/06 13:52:27 > - local_db_domain (==N/A==)

@ 04/13/06 13:52:27 > - rows deleted (0)

@ -- we have

@ - the db_unique_name and domain printed

@ - The SQL

@ --

@ DELETE FROM recent_resource_incarnations$

@ WHERE db_unique_name <> local_db_unique_name

@ OR db_domain <> local_db_domain;

@ -- executed to delete rri$ records for other databases (e.g. seed db, former

@ DataGuard primary). which ptints (0) rows so nothing was deleted here . For

@ all suspected dead instances we have post_ha_alert() called with the

@ arguments

@ --

@ post_ha_alert(reason_id => dbms_server_alert.

@ RSN_FAN_INSTANCE_DOWN, same_transaction =>

@ TRUE,

@ clear_old_alert => TRUE,

@ database_unique_name => local_db_unique_name,

@ database_domain => alert_db_domain,

@ instance_name => instance.instance_name,

@ host_name => instance.host_name,

@ event_reason => 'unknown',

@ event_time => instance.shutdown_time);

@ --

@ - which calls the assert. So far we know the dbunique_name is SFA and the

@ domain is not set or N/A and hence '' by default. Thanks

*** SSRAO 05/09/06 02:40 am ***

@ The PLSQL code calls kjhn_post_ha_alert_plsql() which in turn calls

@ kjhn_post_ha_alert0() which causes the assert so the parameters are being

@ passed and called from the PLSQL function , we need to track these parameters

@ for the failure to be able to isolate what is causing the assert. Still

@ reviewing the code for tracing capabilities. Thanks

*** SSRAO 05/09/06 05:33 pm ***

@ It seems that v$instance is returning a null hostname and which also

@ indicates why this package is returning a "F" which indicates a failure from

@ their side. Also from the traces this might be owned by the HA side but these

@ alerts are irrespective of whether RAC is present or not.

@ Reference from kqfv.h we have the hostname come from KSUXSHST from

@ x$ksuxsinst which in turn is populated from the callback of ksuxsrow().

@ This calls slkmnm() which in turn calls gethostbynam() to populate this

@ information so the things we need to verify here are

@ a) output of select * from v$instance

@ b) show parameter instance_name

@ .

@ Thanks

*** SSRAO 05/09/06 05:34 pm *** (CHG: Sta->10)

*** SSRAO 05/09/06 05:35 pm ***

@ Note that there might be two issues at play here

@ a) The hostname is not being set or the instance_name is not being set

@ b) The assert is not documented and should it depend on this condition

@ Depending on the outcome of the provided data we might fork this into two

@ separate bugs as they deal with different areas of the code. Thanks

*** RSERNA 05/10/06 06:53 am *** (CHG: Sta->16)

*** RSERNA 05/10/06 06:53 am ***

@ SQL> select * from v$instance;

@ .

@ INSTANCE_NUMBER INSTANCE_NAME

@

@ --------------- ----------------

@

@ .

@ HOST_NAME

@

@ ----------------------------------------------------------------

@

@ .

@ VERSION STARTUP_T STATUS PAR THREAD# ARCHIVE

@ LOG_SWITCH_WAIT

@ ----------------- --------- ------------ --- ---------- -------

@ ---------------

@ .

@ LOGINS SHU DATABASE_STATUS INSTANCE_ROLE ACTIVE_ST BLO

@

@ ---------- --- ----------------- ------------------ --------- ---

@

@ .

@ 1 sfa

@

@ W8KKWR81

@

@ 10.2.0.1.0 10-MAY-06 OPEN NO 1 STOPPED

@

@ ALLOWED NO ACTIVE PRIMARY_INSTANCE NORMAL NO

@

@ .

@

@

@ .

@ .

@ SQL> show parameter instance_name

@ NAME TYPE VALUE

@

@ ------------------------------------ -----------

@ ------------------------------

@ instance_name string sfa

@

*** SSRAO 05/10/06 09:41 am ***

@ I do not seem to understand this as both the hostname and the instance_name

@ is populated correctly with v$instance and hence the callback ksuxsrow() is

@ executing correctly to provide this information. Going to back into the

@ package to see if I can find additional clues to capture some of the

@ parameters passed.

@ Is this a test setup ? Is the customer willing to take some diagnostic .plb

@ scripts to execute and provide us with additional data ? Thanks

*** SSRAO 05/10/06 09:41 am *** (CHG: Sta->10)

*** RSERNA 05/10/06 10:34 am *** (CHG: Sta->16)

*** RSERNA 05/10/06 10:34 am ***

@ Yes customer is willing to take a diagnostic script or patch.

*** SSRAO 05/15/06 11:41 am ***

@ Working on a diagnostic. Thanks

*** SSRAO 05/15/06 01:19 pm ***

@ Reviewing the procedure check_ha_resources also

@ - recent_resource_incarnations$ table is populated from v$instance

@ - This is another table which contains the instance and the host_name

@ - This could be another source from where the error is being reported

@ - Please provide the output of

@ --

@ set pages 1000

@ select * from recent_resource_incarnations$

@ --

@ - check_ha_resources() does

@ - a) registers the current instance incarnation in

@ recent_resource_incarnations$ if it's not already there b) deletes

@ recent_resource_incarnations$ records that don't apply to this database. They

@ may, e.g. have been copied from seed db or from a former DataGuard primary

@ c) Scans recent_resource_incarnations$ for instance incarnations that are no

@ longer alive, and submits instance down alerts for them. Breaking this down

@ into function code we have

@ - INSERT INTO recent_resource_incarnations$ as select * from v$instance

@ - DELETE FROM recent_resource_incarnations$

@ WHERE db_unique_name <> local_db_unique_name

@ OR db_domain <> local_db_domain;

@ - Cursor dead_instances defines this as join from

@ recent_resource_incarnations$ so it appears that there is already something

@ out there which is causing these problems. Thanks

*** SSRAO 05/15/06 01:19 pm *** (CHG: Sta->10)

*** RSERNA 05/16/06 06:44 am *** (CHG: Sta->16)

*** RSERNA 05/16/06 06:44 am ***

@ File: recent_resource_incarnations.out uploaded.

@ .

@ Had customer execute:

@ c:\hostname

@ W8KKWR81

*** SSRAO 05/16/06 07:41 am ***

@ From the provide traces this seems to be more clearer of why we are seeing

@ this. The output reveals entries like

@ --

@ INSTANCE 1

@ sfa

@ SFA

@ ==N/A==

@ sfa

@ .

@ .

@ ==N/A==

@ -- where a normal entry is

@ --

@ INSTANCE 1

@ sfa

@ SFA

@ ==N/A==

@ sfa

@ W8KKWR81

@ W8KKWR81

@ ==N/A==

@ 09-JAN-06 04.09.36.000000000 AM

@ -- so the hostname is missing for the above entries which is causing the

@ problem. For this table itself it is supposed to insert an entry when an

@ instance comes up and the cleanup is supposed to happen when it goes down or

@ another instance detects this instance going down (if RAC). Since this is a

@ single instance configuration the instance inserts this entry during startup

@ and then deletes the same when the instance is being shutdown or on

@ subsequent startup notes the state change. There are 57 rows in this table

@ which in itself does not look right . All the entries except the one with the

@ instance startup time around "10-JAN-06 08.17.00.000000000 PM" ar normal.

@ This table is populated using v$instance so there was some problem around

@ that time when the hostname was not listed which has caused this to manifest

@ .

@ Summary so far

@ - hostname for entry "10-JAN-06 08.17.00.000000000 PM" is missing

@ - Multiple entries in "recent_resource_incarnations$" which need cleanup

@ - Dead instances are constructed using this list which is why the assert

@ - post_instance_up() in MMON registers the current instance incarnation in

@ recent_resource_incarnations$ if it's not already there and scans the same

@ for instance incarnations that are no longer alive, and submits instance down

@ alerts for them

@ - check_ha_resources() - registers the db in recent_resource_incarnations

@ (RRI) or deletes recent_resource_incarnations$ records that don't apply

@ to this database. They may, e.g., have been copied from seed db or from a

@ former DataGuard primary and scans RRI for instance incarnations that

@ are no longer alive, and submits instance down alerts for them

@ - clear_instance_resources() - It clears all incarnations for the given

@ instance that started before the down event.

@ - The only deletes I found for this table are

@ --

@ DELETE FROM recent_resource_incarnations$

@ WHERE resource_type = 'INSTANCE'

@ AND resource_name = clear_instance_resources.instance_name

@ AND db_unique_name = database_unique_name

@ AND db_domain = NVL(database_domain, '==N/A==')

@ AND startup_time < SYS_EXTRACT_UTC(event_time);

@ ...

@ DELETE FROM recent_resource_incarnations$

@ WHERE db_unique_name <> local_db_unique_name

@ OR db_domain <> local_db_domain;

@ --

@ - Entries in this table are from "09-JAN-06 04.09.36.000000000 AM" to

@ "16-MAY-06 01.13.51.000000000 PM" with no fixed pattern in between.

@ - clear_instance_resources() is supposed to clean the entries out from the

@ description but that is not happening. This is called from

@ kjhn_clear_instance_resources() in kjhn.c which executes the following using

@ OCI

@ --

@ "BEGIN"

@ " dbms_ha_alerts_prvt.clear_instance_resources("

@ " :dbdomain, :dbuniquename, :instance_name, :event_time);"

@ "END;"

@ --

@ - I did not spot any OCI errors but that is what might be happening. Since

@ this in information for an alert this table can be purged (backup please) but

@ this seems to be a re-curring case.

@ - This is called from kjhn_post_ha_alert0()

@ --

@ if ( reason_id == KELT_FAN_INSTANCE_DOWN )

@ {

@ kjhn_clear_instance_resources(database_domain, database_unique_name,

@ instance_name, event_time);

@ }

@ --

*** SSRAO 05/16/06 07:45 am ***

@ The interesting thing is that the block which executes the code for ha_alerts

@ and which returns "kjhn_post_ha_alert0-862" ORA-600 be much before the

@ location where the assert is being processed. Since the code is in a KSETRY

@ block when the ORA-600 is reported is does not process

@ kjhn_clear_instance_resource() as that is the last funcction in the KSETRY

@ block which explains why the superfluos entries for all recent days as its

@ unable to clean it up when it gets posted by FAN or the reason_id of

@ KELT_FAN_INSTANCE_DOWN. The root cause of this still looks like the entry of

@ "09-JAN-06 04.09.36.000000000 AM" where the hostname is null. This is

@ generated from gv$instance at that time and populated

@ .

@ Questions

@ - Do we have the alert log dating that timeframe ?

@ - Is there something specific about 9th Jan which the customer can tell us as

@ the problems with the hostname have been fixed since then. Thanks

*** SSRAO 05/16/06 07:45 am *** (CHG: Sta->10)

*** SSRAO 05/16/06 07:47 am ***

@ Also if the customer wishes to get rid of the ORA-600 the following steps

@ as "sys" would help.

@ --

@ - create table recent_resource_incarnation$_backup as select * from

@ recent_resource_incarnation$;

@ - truncate table recent_resource_incarnation$;

@ - Please perform the truncate while this instance is up and running and not

@ going down and perform a clean shutdown after this

@ - This should stop the ORA-600's and subsequently the instance should have

@ clean shutdowns with no errors. Please provide the infomation requested above

@ too . Thanks

*** RSERNA 05/16/06 08:51 am *** (CHG: Sta->16)

*** RSERNA 05/16/06 08:51 am ***

@ THANKS. I will give customer workaround because they definitely want traces

@ to stop since they are large.

@ i have uploaded alert log: alert_sfa_latest.log

@ which covers Nov 2005 to April 2006.

@ Something definitely happened in January.

@ From customer

@ "We build laptops from an image. The database on the image was created on

@ November. But the laptop was imaged only in January."

@ .

@ I'm not sure how they do the image.

*** SSRAO 05/16/06 01:07 pm ***

@ Its hard to decipher this from the alert as the timeline is as follows

@ - Sun Jan 08 23:11:02 2006 - shutdown

@ - Tue Jan 10 15:17:00 2006 - startup

@ - Nothing in the timeframe listed

@ - Tue Jan 10 15:18:31 2006 - last startup

@ - Wed Jan 11 14:27:06 2006 - startup so unclean shutdown

@ - Wed Jan 11 14:29:36 2006 - ORA-600 reported

@ - We need to know what happened between Jan 8th and the 11th to be able to

@ conclude why v$instance did not return the valid hostname. If they imaged

@ their laptop then is this imaged with no hostname initially set when the

@ Oracle service came up. Also note the source is v$instance which pulls it

@ from x$ksuxsinst which calls slkmnm()->gethostbyname() which is a common

@ routine. The solution is to find out why the hostname was null on the

@ "09-JAN-06 04.09.36.000000000 AM" which is not in the alert log too. Thanks

*** SSRAO 05/16/06 01:07 pm *** (CHG: Sta->10)

*** RSERNA 05/16/06 01:42 pm ***

@ So far so good. They tried the suggested workarounds and no more errors.

@ Customer stated "We used ghost image from microsoft to image all the laptops.

@ We installed oracle on the host image and then slapped it on to other

@ laptops from the host image."

@ .

@ All the laptops they have imaged in this way are experiencing the problem.

@ So the problem may be on the original host. I will find out more.

*** RSERNA 05/19/06 11:47 am *** (CHG: Sta->16)

*** RSERNA 05/19/06 11:47 am ***

@ Customer's latest update

@ "I just got a new laptop that was created with the original image.

@ I checked the table recent_resource_incarnations$ and it has

@ the correct value for the host_name column. "

@ So it appears that the original image used to image all the laptops doesn't

@ always lead to the errors. Some laptops are fine, others are not.

@ .

@ They have also said "I checked with our desktop administrator and he told we

@ can't query the database on our image directly"

@ I'm not sure why.

@ .

@ How else can we diagnose this?

*** RSERNA 05/19/06 12:11 pm ***

@ I asked customer why they could not query the db directly on the image. Their

@ reply was "Our helpdesk administrator is saying that, taking a look

@ at copy of a brand new laptop is similar to seeing the image copy, which

@ makes sense. Image copy copies everything, that is how we have built our 300

@ laptops for our sales force. ". Doesn't really answer the question, but

@ thought I'd pass it along. Anyway, the same Image Copy is used on all the

@ laptops, some had the problems, some did not.

*** SSRAO 05/19/06 12:35 pm ***

@ We might be at a deadend for this. Summarizing the problem so we know what we

@ are looking for

@ .

@ - The routines called are for checking any instances are dead and for sending

@ the appropriate notification

@ - This scans the recent_resource_incarnations$ table for the details

@ - recent_resource_incarnations$ is populated from gv$instance

@ - The first instance (if RAC) which sees this clears the entry

@ - The notification is sent by this instance as well and this is cleared

@ - If the hostname is not populated we see this ORA-600

@ - The point which is not clear is how did gv$instance not have the hostname

@ - gv$instance -> x$ksuxsinst -> slkmnm()-> gethostbyname()

@ - gethostbyname() is generic call which gets the hostname from the OS

@ - I do not see much from gethostbyname() or failure codes from it

@ - The database alert log does not have enough information at that point

@ - The imaging activity is external to the database

@ - Clearing this table would prevent the assert

@ - Root cause is why the null hostname came into the table and this cannot be

@ deciphered with the information provided.

@ .

@ The reason I'm passing this bug is when there is an unclean shutdown the

@ cursor dead_instances is scanned and picked up from where a stale entry with

@ a bad hostname will cause every subsequent startup to assert with no cleanup.

@ I believe that this aberrational condition could be handled with cleanup of

@ the record and raising the ORA-600 , nonetheless not being able to send a

@ notification is not probbaly as serious to raise so many asserts.

@ The hostname issue cannot be debugged further without further details of what

@ happened on that day. BDE Screening complete

*** SSRAO 05/19/06 12:35 pm *** (CHG: Sta->11)

*** BUGPATCH 05/19/06 12:35 pm *** (CHG: Asg->RPARK)

*** BUGPATCH 05/19/06 12:35 pm ***

@ Assigned to RDBMS/RAC area queue owner

@ by SSRAO via the ST Bug Assignment Tool

*** TAKIBA 05/21/06 08:52 pm ***

@ Overflow queue screening.

@ .

@ I think it's better to investigate from viewpoint of PL/SQL.

@ The host_name was already corrupted on the table

@ recent_resource_incarnations$. Therefore, there are 2 possibilities.

@ .

@ 1. v$instance gave the empty host_name

@ .

@ 2. v$instance gave a correct host_name, but Pl/SQL function

@ rdbms/src/server/ccl/crs/prvtkjhn.sql: check_ha_resources

@ corrupted the value.

@ .

@ [quoted from rdbms/src/server/ccl/crs/prvtkjhn.sql: check_ha_resources

@ L562]

@ ============================================================

@ INSERT

@ INTO recent_resource_incarnations$

@ ( resource_type, resource_id, resource_name, db_unique_name,

@ db_domain, instance_name, host_name, startup_time, location,

@ incarnation )

@ SELECT 'INSTANCE', instance_number, instance_name,

@ local_db_unique_name,

@ local_db_domain,

@ instance_name, host_name,

@ SYS_EXTRACT_UTC(

@ instance_startup_timestamp_tz(vi.startup_time)),

@ host_name, '==N/A=='

@ FROM v$instance vi;

@ ============================================================

*** TAKIBA 05/21/06 08:55 pm *** (CHG: Asg->PLSREP Prod->11 Comp->PLSQL SubComp->)

*** TAKIBA 05/21/06 08:55 pm ***

@ I guess they can easily route the bug to port-specific (v$instance

@ issue probably in slkmnm()) or SQL execution.

*** AMANGAL 05/22/06 10:53 am *** (CHG: Asg->RDBMSREP Prod->5 Comp->RDBMS)

*** AMANGAL 05/22/06 10:53 am *** (CHG: Asg->RPARK)

*** AMANGAL 05/22/06 10:55 am ***

@ This needs to be investigated by the owners of check_ha_resources .

*** TAKIBA 05/28/06 04:34 am *** (CHG: SubComp->RAC)

*** TAKIBA 05/28/06 04:34 am ***

@ Put the subcomponent RAC for now.

*** BUGPATCH 06/05/06 10:21 pm *** (CHG: Asg->TAKIBA)

*** BUGPATCH 06/05/06 10:21 pm ***

@ Assigned from the area queue to TAKIBA

via the ST Bug Assignment Tool

*** TAKIBA 06/07/06 02:46 am ***

@ I describe the 2 possibilities again.

@ .

@ The host_name was already corrupted on the table

@ recent_resource_incarnations$. Therefore, there are 2 possibilities.

@ .

@ 1. v$instance gave the empty host_name

@ .

@ If GetComputerName() returned an error, slkmnm() can return a

@ NULL string.

@ .

@ [quoted from rdbms/src/common/osds/slkmnm.c@@/RDBMS_10.2.0.1.0_NT_RELEASE

@ slkmnm() L150]

@ ============================================================

@ len = mnm_l - 1;

@ if (GetComputerName(mnm, &len) == FALSE)

@ {

@ *mnm = '\0';

@ return(0);

@ }

@ else

@ return(len);

@ ============================================================

@ .

@ 2. v$instance gave a correct host_name, but Pl/SQL function

@ rdbms/src/server/ccl/crs/prvtkjhn.sql: check_ha_resources

@ corrupted the value.

@ .

@ [quoted from rdbms/src/server/ccl/crs/prvtkjhn.sql: check_ha_resources

@ L562]

@ ============================================================

@ INSERT

@ INTO recent_resource_incarnations$

@ ( resource_type, resource_id, resource_name, db_unique_name,

@ db_domain, instance_name, host_name, startup_time, location,

@ incarnation )

@ SELECT 'INSTANCE', instance_number, instance_name,

@ local_db_unique_name,

@ local_db_domain,

@ instance_name, host_name,

@ SYS_EXTRACT_UTC(

@ instance_startup_timestamp_tz(vi.startup_time)),

@ host_name, '==N/A=='

@ FROM v$instance vi;

@ ============================================================

*** TAKIBA 06/07/06 10:04 pm *** (CHG: Asg->NKOREHIS)

*** TAKIBA 06/07/06 10:04 pm ***

@ Assigning to Korehisa-san, to investigate from viewpoint of

@ Windows RAC.

*** NKOREHIS 06/13/06 11:20 pm *** (CHG: DevPri->2)

*** NKOREHIS 06/13/06 11:20 pm *** (CHG: Confirmed Flag->Y)

*** KEOCHI 07/30/06 02:39 pm ***

@ Sorry for rushing you..

@ My customer hits similar problem, and then BUG:5403213 for my customer is

@ marked as duplicate of this bug.

@ How the work is going? Thanks.

*** NKOREHIS 07/31/06 11:24 pm ***

@ slkmnm calls GetComputerName WIN32API. GetComputerName could fail if

@ buffer is too small to store the computer name.

@ .

@ on my local env, it is called with the following stack when select

@ from v$instance.

@ .

@ >oracommon10.dll!slkmnm(slerc * se=0x0b58d5e4,

@ unsigned char * mnm=0x0abd7ebc,

@ unsigned int mnm_l=0x00000040)

@ oracle.exe!_ksuxsrow() + 0x5a

@ oracle.exe!_qerfxFetch() + 0x3da

@ oracle.exe!_rwsfcd() + 0x5f

@ oracle.exe!_qerjotFetch() + 0xf3

@ oracle.exe!_rwsfcd() + 0x5f

@ oracle.exe!_qerjotFetch() + 0xf3

@ oracle.exe!_opifch2() + 0xc20

@ oracle.exe!_kpoal8() + 0xcf0

@ oracle.exe!_opiodr() + 0x44e

@ oracommon10.dll!_ttcpip() + 0x4fc

@ oracle.exe!_opitsk() + 0x3f9

@ oracle.exe!_opiino() + 0x444

@ oracle.exe!_opiodr() + 0x44e

@ oracle.exe!_opidrv() + 0x338

@ oracle.exe!_sou2o() + 0x32

@ oracle.exe!_opimai() + 0x171

@ oracle.exe!_opimai() + 0x61

@ oracle.exe!_OracleThreadStart@4() + 0x2c9

@ kernel32.dll!_BaseThreadStart@8() + 0x37

@ .

@ in ksuxsrow, call to slkmnm is made as follows.

@ .

@ DISCARD slkmnm(&se, xsc->ksuxshst, sizeof(xsc->ksuxshst)); /*

@ host name */

@ .

@ ksuxshst is defined as follows.

@ .

@ text ksuxshst[SKGP_HOSTNAME_LEN]; /* host

@ machine name */

@ .

@ in skgp0.h, SKGP_HOSTNAME_LEN is defined as follows.

@ .

@ #define SKGP_HOSTNAME_LEN SKGP_HOSTNAME_DEFAULT

@ /* operating system name

@ buffer size */

@ .

@ SKGP_HOSTNAME_DEFAULT is 64 as defined in skgp.h

@ .

@ #define SKGP_HOSTNAME_DEFAULT 64 /* host name

@ buffer size */

@ .

@ So, GetComputerName should not be failed due to the buffer size in this

@ case.

*** NKOREHIS 07/31/06 11:29 pm *** (CHG: Sta->30)

*** NKOREHIS 07/31/06 11:29 pm ***

@ As the slkmnm does not return any errors though GetComputerName fails,

@ we might need figure out what fails the function using debug patch.

@ Please let me know if the ct would agree to apply the debug patch.

*** RSERNA 08/01/06 06:19 am *** (CHG: Sta->11)

*** RSERNA 08/01/06 06:19 am ***

@ Yes, customer is willing to apply a debug patch.

*** NKOREHIS 08/10/06 04:55 am ***

@ working on the debug patch.

*** NKOREHIS 08/10/06 09:16 pm *** (CHG: Sta->30)

*** NKOREHIS 08/10/06 09:16 pm ***

@ the debug patch is available at bugftp:/upload/bug5173066/5173066.zip.

@ .

@ 1. How is the diagnostic information activated and de-activated:

@ Once installed the patch, it is automatically activated.

@ 2. What new information is being traced and when is it triggered?

@ Error information from GetComputerName will be traced in the

@ alert log.

@ 3. What is the performance impact of the new diagnostic tracing?

@ No performance impact is expected.

@ 4. What is expected from the generated diagnostic information and

@ where will this lead.

@ The patch is provided to confirm if GetComputerName is failed

@ in slkmnm. If so, it would show what is the OS error from the

@ call. Please provide alert log once the error reproduced.

*** RSERNA 08/18/06 07:15 am *** (CHG: Sta->11)

*** RSERNA 08/18/06 07:15 am ***

@ From customer:

@ "When i tried to start the Oracle database service, OracleDBService, it did

@ not start. It failed with a generic windows error,Couldn't start the service.

@ "

@ .

@ They followed the README that came with the patch and could not even start

@ services. Backing out the patch, everything was fine again.

*** NKOREHIS 08/22/06 04:06 am *** (CHG: Sta->30)

*** NKOREHIS 08/22/06 04:06 am ***

@ The patch worked fine on my development environment. Would you please

@ describe what error exactly the ct got?

*** RSERNA 08/28/06 08:12 am *** (CHG: Sta->11)

*** RSERNA 08/28/06 08:12 am ***

@ Customer attempted patch install again and same errors.

@ I've uploaded screenshots from the customer. File:patch_error_screenshot.doc

*** NKOREHIS 08/30/06 09:08 pm *** (CHG: Sta->30)

*** NKOREHIS 08/30/06 09:08 pm ***

@ rebuilt the diagnostic patch changing compiler options.

@ Would you please verify if it works on the customer's env?

@ It is available on bugftp:/upload/bug5173066/5173066_2.zip

*** NKOREHIS 08/31/06 06:16 pm ***

@ basically, why the patch failed to startup is still unknown. the

@ major difference between the debug version and the release version is

@ the compiler options for the patched file. The newer version is to

@ verify if it could lead to the failure. diagnostic code for the

@ original problem is present as well.

*** RSERNA 09/05/06 01:36 pm *** (CHG: Sta->11)

*** RSERNA 09/05/06 01:36 pm ***

@ The customer still had problems applying this Diag patch.

@ See file:screen_shot_of_second_patch_application.doc

@ that was uploaded which shows the customer's output following step-by-step

@ instructions from the readme.

*** NKOREHIS 09/13/06 01:38 am ***

@ built the patch on another development environment. regression

@ test is running.

*** NKOREHIS 09/13/06 07:18 am *** (CHG: Sta->30)

*** NKOREHIS 09/13/06 07:18 am ***

@ put the new diagnostic patch to bugftp:/upload/bug5173066/5173066_3.zip

@ This time, I zipped the patch with orageneric10.dll and

@ oraclient10.dll which exist on the development environment for

@ RDBMS_10.2.0.1.0_NT_RELEASE. I confirmed the patch worked on my local

@ testing env. Please try the new patch.

*** RSERNA 09/14/06 10:49 am *** (CHG: Sta->11)

*** RSERNA 09/14/06 10:49 am ***

@ Customer applied diag patch and started DB.

@ Alert log and mmon trace uploaded: alert_trace_14Sep06.ZIP

*** NKOREHIS 09/14/06 10:20 pm ***

@ alert log does not show any output from the diagnostic code. this

@ means that GetComputerName is not the culprit. we might need

@ another diagnostic.

*** NKOREHIS 09/26/06 04:33 am *** (CHG: Sta->30)

*** NKOREHIS 09/26/06 04:33 am ***

@ I added some more diagnostic code to ksuxsrow to see if xsc->ksuxshst

@ has a valid value after returning from slkmnm. the patch is in

@ bugftp:/upload/bug5173066/5173066_4.zip. Would you please have the

@ ct apply the patch? please provide alert log once the symptom reproduced.

*** RSERNA 09/26/06 05:35 am *** (CHG: Sta->11)

*** RSERNA 09/26/06 05:35 am ***

@ Hi, there is no file in: bugftp:/upload/bug5173066/5173066_4.zip

@ .

@ ftp> pwd

@ 257 "/upload/bug5173066"

@ ftp> get 5173066_4.zip

@ 200 PORT command successful. Consider using PASV.

@ 550 Failed to open file.

@ ftp> ls 5173066*.zip

@ 200 PORT command successful. Consider using PASV.

@ 150 Here comes the directory listing.

@ 5173066.zip

@ 5173066_2.zip

@ 5173066_3.zip

@ 226 Directory send OK.

*** NKOREHIS 09/26/06 06:13 pm *** (CHG: Sta->30)

*** NKOREHIS 09/26/06 06:13 pm ***

@ Sorry I re-uploaded the patch.

@ .

@ ftp> ls 5173066*.zip

@ 200 PORT command successful. Consider using PASV.

@ 150 Here comes the directory listing.

@ 5173066.zip

@ 5173066_2.zip

@ 5173066_3.zip

@ 5173066_4.zip

@ 226 Directory send OK.

*** RSERNA 09/28/06 10:10 am *** (CHG: Sta->11)

*** RSERNA 09/28/06 10:10 am ***

@ Diag Patch has been applied, latest alert log: alert_Sep28.log

@ has been uploaded.

*** NKOREHIS 09/28/06 06:10 pm ***

@ The following diagnostic trace indicates that MMON tid=1540 did not

@ fail to get hostname in ksuxsrow. the hostname was vaild until the

@ function finished. However the internal error singalled in the

@ same thread.

@ .

@ MMON:1540:ksuxsrow: ksuxshst = W8KKWR81

@ MMON:1540:ksuxsrow: ksuxshst = W8KKWR81

@ MMON:1540:ksuxsrow: ksuxshst = W8KKWR81

@ Thu Sep 28 08:05:04 2006

@ Errors in file c:\oracle\product\10.2.0\admin\sfa\bdump\sfa_mmon_1540.trc:

@ ORA-600: internal error code, arguments: [kjhn_post_ha_alert0-862], [], [],

@ [], [], [], [], []

*** NKOREHIS 09/28/06 06:13 pm *** (CHG: Asg->RDBMSREP)

*** NKOREHIS 09/28/06 06:13 pm ***

@ back to generic RAC to investigate check_ha_resources.

*** BUGPATCH 09/28/06 06:18 pm *** (CHG: Asg->EMERITT)

*** BUGPATCH 09/28/06 06:18 pm ***

@ Assigned to RDBMS/RAC area queue owner

@ by NKOREHIS via the ST Bug Assignment Tool

@ .

@ No problem found in GetComputerName, slknmn, ksuxsrow. However the internal

@ error signalled. We should verify if check_ha_resources works as expected.

@ .

*** MMPANDEY 10/03/06 11:46 pm *** (CHG: Asg->MMPANDEY)

*** MMPANDEY 10/03/06 11:46 pm ***

@ This not exactly related to RAC, but as I have worked on a similar

@ problem, I would review.

*** RSERNA 10/25/06 06:19 am ***

@ Please provide an update. Thanks.

*** MMPANDEY 11/02/06 10:05 am ***

@ I am just back from a 2-week vacation and would be reviewing the bug

@ soon, alongwith others. Thanks.

*** RSERNA 11/28/06 05:56 am ***

@ Please provide an update. Thanks.

*** MMPANDEY 12/03/06 08:17 pm ***

@ I would do so, shortly. Thanks.

*** MMPANDEY 12/13/06 07:04 am ***

@ I am looking at the traces now. A similar problem is seen when the

@ oracle_sid name exceeds 17 characters, which does not seem to be

@ the case here. Looking at the possibilities of internal error being

@ raised in ksuxsrow().

*** RSERNA 01/03/07 08:41 am ***

@ Please provide an update. Thanks.

*** MMPANDEY 01/19/07 02:17 pm ***

@ The code to raise the error appears to have changed a bit in the later

@ labels, i.e. we are no longer checking the condition (*host_name)

@ .

@ KSEORAASSERTNM(instance_name && *instance_name && host_name,

@ OERINM("kjhn_post_ha_alert0-862"));

@ .

@ I am verifying this fact.

*** KNEEL 01/19/07 05:47 pm *** (CHG: Asg->RSERNA)

*** KNEEL 01/19/07 05:47 pm ***

@ The new row being inserted by check_ha_resources should not be the one

@ causing the assert; the assert is caused when attempting to delete an old row

@ with null hostname. Are new rows in recent_resource_incarnations$ still

@ getting null host name; if not, when is/are the bad row(s) from? Is this

@ still the row from Jan 2006?

@ .

@ Also, this is XP, and not an earlier version, right? Was this db ever run on

@ a pre-XP version, or on a system with a long host or network name?

@ .

@ We may never be able to solve where that original bad row came from, and it's

@ not clear that we should bother customer with more diagnostic patches. If, on

@ the other hand, the problem is repeating -- i.e. new rows with null hostname

@ -- it would be good to figure out why.

@ .

@ If necessary, we can certainly backport the change to ignore null hostname.

@ But it would be nice to know if the null hostname is still being generated;

@ i.e. if another problem exists.

@ .

@ That change would include the line Mukul (MMPANDEY) suggests, but it also

@ needs to include the following change:

@ .

@ reasonargs[KJHN_REASON_ARG_HOST] = host_name;

@ =>

@ if ( *host_name )

@ reasonargs[KJHN_REASON_ARG_HOST] = host_name;

*** RSERNA 01/23/07 06:27 am *** (CHG: Sta->33)

*** RSERNA 01/23/07 06:27 am ***

@ Customer no longer wishes to pursue.

@ They always clone their machines so its something specific to the original

@ one which they can not reproduce.

*** MMPANDEY 01/23/07 09:37 am *** (CHG: Sta->30 Asg->MMPANDEY)

*** MMPANDEY 01/23/07 09:37 am ***

@ Assigning it back, looks like this bug got reassigned to Rowena when

@ we needed the information.

@ .

@ Hi Rowena,

@ .

@ One question. I think they were able to reproduce this problem, as we

@ were able to apply the diag patch, cf. [NKOREHIS 09/28/06 06:10 pm],

@ isn't it? If the customer still has the backup, I was wondering if

@ they could retry?

@ .

@ I think that we should give the patch of the changes to the customer,

@ and I will get this verified by Kevin. Thanks.

*** MMPANDEY 01/23/07 12:14 pm ***

@ Hi Rowena,

@ .

@ Kevin has the following comments, about the customer's situation and

@ the workaround (to truncate recent_resource_incarnations$):

@ .

@ So was this a permanent fix -- i.e. they have fixed the clone, so that

@ they no longer need to do this? Or are they still doing these workarounds?

@ I.e. ould we be creating work for them to apply the patch, or saving

@ them work/making their processes more efficient? To some degree, that's

@ up to hem.

@ .

@ Could you please request the customer to reply? Thanks.

*** MMPANDEY 01/24/07 02:46 pm *** (CHG: Fixed->11.1.0)

*** MMPANDEY 01/24/07 02:46 pm *** (CHG: Sta->35)

*** MMPANDEY 01/24/07 02:46 pm ***

@ Closing the bug, following Kevin's reply on my email:

@ .

@ "I don't think the "bad laptop" is the source of the "infection" of other

@ boxes; the source is the image made last January. The "bad laptop" is simply

@ one which has that bad image, without the workaround. It is a victim of the

@ illness -- the only one who has not taken the cure -- but is not a carrier,

@ infecting others. They just left it infected in case we wanted to look at the

@ living illness. Unfortunately, we can't learn much from looking at the

@ illness, since the entry path of the original virus didn't leave any trace.

@ We can't see how the immune system was weakened; we just know that it was

@ weakened at some point, and has returned to normal. IF they wanted a

@ workaround patch [see below], it could be a good machine on which to test it.

@ .

@ I'm not surprised that they don't want any more diagnostic patches. Even if

@ we succeed in figuring out what condition/event caused the original bad

@ row(s), that is not useful for them, since that condition/event in itself is

@ not recurring. What recurs is the interaction between the bad row(s) and our

@ assert. So they have essentially three choices:

@ .

@ 1. continue with the workaround

@ 2. create a new image without bad rows (seems they don't want to do this)

@ 3. we can give them a workaround patch -- which presumably would have to

@ be patched into their image -- to ignore the bad rows rather than assert

@ .

@ So if they want a workaround patch (#3) -- not a diagnostic one -- we can do

@ that. But I suspect that it may be more work for them to get it into their

@ image (and possibly existing boxes) than to keep their workaround.

@ .

@ It sounds like they have somehow automated the workaround as part of their

@ processes (e.g. part of a script run after reimaging), and have little need

@ to give themselves more work to install a different workaround (i.e. patch).

@ .

@ As for the bug, you could mark it fixed in 11.1."

@ .

@ Thanks.

  • 0
    点赞
  • 0
    收藏
    觉得还不错? 一键收藏
  • 0
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值