IBM-AUSTRIA - PC-HW-Support    30 Aug 1999

ORACLE TERMINATES WITH FAILURE OF NODE IN A TWO-NODE CLUSTER



Subject: ORACLE TERMINATES WITH FAILURE OF NODE IN A TWO-NODE CLUSTER
New Netfinity server RETAIN Tip: 

    Record number:       H165892
    Device:              D/T8680
    Model:               M
    Hit count:           UHC00000
    Success count:       USC0000
    Publication code:    PC50
    Tip key:
    Date created:        O99/02/10
    Date last altered:   A99/02/10
    Owning B.U.:         USA


Abstract: ORACLE TERMINATES WITH FAILURE OF NODE IN A TWO-NODE CLUSTER


TEXT:

Oracle database instance terminates after failure of one node in a two-node cluster.


SYMPTOMS:

  1.  When running a two-node Oracle Parallel Server (OPS) database, if one of the nodes fails, the remaining node also fails after a period ranging from 30 minutes to several hours.
  2.  Clients connected to the remaining node begin to time out and stop.
  3.  One or more of the following Oracle messages are seen:

    1.  ORA-00472: PMON process terminated with error
    2.  ORA-27103: internal error
    3.  An ORAnnnnn.TRC file contains the message "FATAL ERROR IN TWO TASK SERVER error 12571"

  4.  The file %oraclehome%\database\io.log is very large (2 MB or greater) and contains the following lines, repeated from the time of the first node failure until the time the database terminated on the second node:

       IOInService()...
       IOInService(OK)...
       IOOutOfService()...
       IOOutOfService(OK)...
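
The io.log symptom above can be checked with a small script. The following Python sketch is illustrative only and is not part of the original tip: the function names summarize_io_log and looks_like_symptom are hypothetical, and the check is a rough heuristic based solely on the size threshold and the four marker lines quoted in symptom 4.

```python
import os
from collections import Counter

# Marker lines quoted in symptom 4 of this tip.
MARKERS = (
    "IOInService()...",
    "IOInService(OK)...",
    "IOOutOfService()...",
    "IOOutOfService(OK)...",
)
SIZE_THRESHOLD = 2 * 1024 * 1024  # "2 MB or greater", per the tip


def summarize_io_log(path):
    """Return (size_in_bytes, Counter of marker lines) for an io.log file."""
    counts = Counter()
    with open(path, "r", errors="replace") as log:
        for line in log:
            text = line.strip()
            if text in MARKERS:
                counts[text] += 1
    return os.path.getsize(path), counts


def looks_like_symptom(size, counts):
    """Heuristic (an assumption): large file in which every marker appears."""
    return size >= SIZE_THRESHOLD and all(counts[m] > 0 for m in MARKERS)
```

To use it, pass the full path of io.log on the surviving node to summarize_io_log and feed the result to looks_like_symptom. A positive result only suggests that the log matches the pattern described here; it does not confirm the Oracle bug.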


PROBLEM ISOLATION AIDS:



Note: Supported configurations are listed at the following URL:

http://www.pc.ibm.com/us/compat/clustering/matrix.shtml
Look for "Oracle Parallel Server".


Windows NT 4.0 Enterprise Edition with Service Pack 4 applied.


FIX:

This problem has been reported to Oracle, and Oracle Bug number 812552 has been opened.

The fix will be determined by Oracle and is expected to be provided as a new patchset for Oracle V8.0.5.


WORKAROUND:

The problem is related to the load placed on the nodes in the cluster and whether the client programs have a time-out period of less than about 30 minutes.
To avoid the problem, follow one or both of these recommendations:

  1.  Limit the load on the two-node cluster so that, if the entire load were placed on a single node, the database could still handle it with some capacity to spare.
  2.  Increase the time-out period for client programs to at least 30 minutes.
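
The two recommendations above can be expressed as simple checks. This Python sketch is a rough illustration under stated assumptions: the function names, the idea of measuring load in a single consistent unit, and the 20% headroom default are all hypothetical, since the tip only says "some capacity to spare". The 30-minute figure comes from recommendation 2.

```python
MIN_CLIENT_TIMEOUT_MINUTES = 30  # recommendation 2 in this tip


def survives_failover(node_loads, single_node_capacity, headroom=0.2):
    """True if one node could carry the combined load with capacity to spare.

    node_loads: load currently carried by each node, in any consistent unit.
    single_node_capacity: what one node can handle, in the same unit.
    headroom: fraction of capacity to keep free (assumed value, not from the tip).
    """
    combined = sum(node_loads)
    return combined <= single_node_capacity * (1.0 - headroom)


def timeout_is_sufficient(client_timeout_minutes):
    """True if a client time-out meets the 30-minute recommendation."""
    return client_timeout_minutes >= MIN_CLIENT_TIMEOUT_MINUTES
```

For example, with a single-node capacity of 100 units, two nodes carrying 35 units each pass the check, while two nodes carrying 45 units each do not, because 90 units would leave less than the assumed 20% spare on the surviving node.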


DETAILS:

The termination of the database on the remaining node is caused by the heavy additional load placed on that node after the first node terminates. Rather than degrading gracefully under this load, the Oracle database may periodically pause for up to 20 minutes.
If clients have time-out periods shorter than this pause, they time out.
If those client programs then terminate, the Oracle database must free the resources (threads) that the Oracle server allocated for those clients.
If many clients terminate at once, the Oracle database does not free these resources correctly, and the remaining database instance terminates.

The workarounds address this problem by avoiding the pause condition and/or avoiding the termination of many clients at once.
This problem has not been seen on clusters with more than two nodes, since the workload of a single failed node tends to be distributed among the multiple remaining nodes.
In that case, the database does not pause and clients do not time out.
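
The effect of cluster size can be illustrated with simple arithmetic. The sketch below assumes, purely for illustration, that every node carries the same load and that the failed node's share is spread evenly over the survivors; real workloads may not redistribute this cleanly. The function name load_after_failure is hypothetical.

```python
def load_after_failure(node_count, load_per_node):
    """Load on each surviving node after one node fails.

    Assumes equal load on every node and even redistribution
    (a simplification, not a statement about Oracle's behavior).
    """
    if node_count < 2:
        raise ValueError("need at least two nodes")
    total = node_count * load_per_node
    return total / (node_count - 1)
```

Under these assumptions, a two-node cluster with 50 units per node leaves the survivor with 100 units, a doubling, whereas in a four-node cluster each survivor rises from 50 to about 66.7 units.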


TRADEMARKS:

Microsoft, Windows, Windows NT, and the Windows logo are trademarks of Microsoft Corporation in the United States and/or other countries.

Other company, product and service names may be the trademarks or service marks of others.

