|
Abstract: PROBLEM DIAGNOSTICS IN A SHARED DISK CLUSTER ENVIRONMENT_V02
Summary: Servicer Assistance Information for Shared Disk Clusters.
Shared Cluster - Problem Diagnostic
The intent of this document is to assist with the diagnosis of problems encountered when running the following on PC
Servers and Netfinity Servers that support them:
The IBM Storage Enclosures certified by IBM for clustering are:
This document should be used in conjunction with the Servicer Training Video Series Volume 19,
Shared Disk Clustering.
This document will be updated as necessary. The latest version of this document can be found at:
http://www.us.pc.ibm.com/support (You may search on the title of this document)
Maintaining Cluster High Availability requires that the cluster always be available for LAN-attached
users to access. With this in mind, always consider keeping at least one node supporting a cluster
online while performing problem isolation/diagnostics on other nodes that are part of the cluster.
1.0 Shared disk cluster identification:
Provided below are ways to identify shared disk cluster configurations. A positive match on one of
the checks below does not ensure that you are working with an active shared disk cluster, but it
should allow you to take appropriate steps as necessary. The problem determination steps outlined in
this document are safe to use with stand-alone server configurations.
2.0 Strategies to prioritize where to begin problem diagnosis:
When performing problem diagnosis on a node or on the cluster as a whole, maintaining cluster high
availability should be the greatest priority.
Remember:
Maintaining Cluster High Availability requires that the cluster
always be available for LAN-attached users to access.
With this in mind, always consider keeping at least one node supporting
a cluster online while performing problem isolation/diagnostics
on other nodes that are part of the cluster.
2.1 Problem determination guidelines:
The guidelines below can assist in maintaining cluster high availability and quick problem resolution.
Problem determination that requires the node NOT be available for cluster
operation:
Problem determination requiring the shutdown of an entire Cluster:
2.2 Recovering a down cluster before problem determination
2.3 Specific Cluster/Node Strategies:
If a cluster is down and each node is in a different node state, use the strategy for the
highest-ranking node state:
Rank Node State
1 Node failure
2 Hang/Trap condition
3 Running with errors
4 Running without errors
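If this selection were automated, the ranking above amounts to a simple lookup. The sketch below is illustrative only (the function and state names are not from any IBM tool); it assumes you have already classified each node into one of the four states:

```python
# Sketch: pick which node state should drive the diagnosis strategy.
# Rank 1 is the highest-ranking state (diagnose its strategy first).
NODE_STATE_RANK = {
    "node failure": 1,
    "hang/trap condition": 2,
    "running with errors": 3,
    "running without errors": 4,
}

def strategy_state(node_states):
    """Return the highest-ranking node state among those observed."""
    return min(node_states, key=lambda s: NODE_STATE_RANK[s])
```

For example, if one node is running with errors and another has failed, the node-failure strategy is used.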
The following Problem Determination Guideline Matrix cross-references cluster states with node
states; each cell points to the corresponding document section that follows.
+------------------------------------------------------------------------------+
| Cluster | Node(s) | Node(s) | Node(s) | Node(s) |
| State | State | State | State | State |
|--------------|---------------|---------------|---------------|---------------|
| Cluster | Node failure | Node(s) in | Node(s) run | Node(s) run |
| Down | | a Hang/Trap | with errors | w/o errors |
| | 2.3.1 | 2.3.2 | 2.3.3 | 2.3.4 |
|--------------|---------------|---------------|---------------|---------------|
| Cluster | Node failure | Node in a | Node running | Node(s) run |
| in | | Hang/Trap | with errors | w/o errors |
| failover | 2.3.5 | 2.3.6 | 2.3.7 | 2.3.8 |
| mode | | | | |
+------------------------------------------------------------------------------+
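The matrix can also be read as a plain lookup table from (cluster state, node state) to a document section. The sketch below simply restates the table; the key strings are illustrative labels, not values from any IBM tool:

```python
# Sketch: the Problem Determination Guideline Matrix as a lookup table.
# Keys are (cluster state, node state); values are document sections.
GUIDELINE_MATRIX = {
    ("cluster down", "node failure"): "2.3.1",
    ("cluster down", "hang/trap condition"): "2.3.2",
    ("cluster down", "running with errors"): "2.3.3",
    ("cluster down", "running without errors"): "2.3.4",
    ("failover mode", "node failure"): "2.3.5",
    ("failover mode", "hang/trap condition"): "2.3.6",
    ("failover mode", "running with errors"): "2.3.7",
    ("failover mode", "running without errors"): "2.3.8",
}
```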
The following strategies should be used in conjunction with the Problem Determination Guideline
Matrix above.
Note: Items are rated "High" (most probable failure) to "Low" (least probable failure).
2.3.1 Cluster down and node failure.
* Shared Disk Subsystem ------------------------------- High
* Cables to shared disk systems ----------------------- High
* Shared disk subsystem host adapter ------------------ Medium
* Node failure ---------------------------------------- Low
* Software errors or configuration -------------------- Low
* Communication Link ---------------------------------- Low
2.3.2 Cluster down and node(s) in a hang/trap condition.
* Software errors or configuration -------------------- High
* Cables to shared disk systems (termination) --------- High
* Subsystem host adapters (SCSI ID) ------------------- High
* Shared disk subsystem ------------------------------- Medium
* Communication Link ---------------------------------- Low
* Node failure ---------------------------------------- Low
2.3.3 Cluster down and node(s) running with errors.
* Shared disk subsystem ------------------------------- High
* Subsystem host adapters ----------------------------- High
* Cables to shared disk systems ----------------------- High
* Software errors or configuration -------------------- Medium
* Node failure ---------------------------------------- Low
* Communication link ---------------------------------- Low
2.3.4 Cluster down and node(s) running without errors.
* Communication link ---------------------------------- High
* Software errors or configuration -------------------- High
* Shared disk subsystem ------------------------------- Low
* Cables to shared disk systems ----------------------- Low
* Subsystem host adapters ----------------------------- Low
* Node failure ---------------------------------------- Low
2.3.5 Cluster in failover mode and node failure.
* Node failure ---------------------------------------- High
* Cables to shared disk systems ----------------------- Medium
* Subsystem host adapters ----------------------------- Medium
* Communication link ---------------------------------- Medium
* Software errors or configuration -------------------- Low
* Shared disk subsystems ------------------------------ Low
2.3.6 Cluster in failover mode and a node in a hang/trap condition.
* Software errors or configuration -------------------- High
* Node failure ---------------------------------------- High
* Subsystem host adapter ------------------------------ Medium
* Cables to shared disk systems (termination) --------- Medium
* Shared disk subsystems ------------------------------ Low
* Communication link ---------------------------------- Low
2.3.7 Cluster in failover mode and a node running with errors.
* Communication link ---------------------------------- High
* Subsystem host adapters ----------------------------- High
* Cables to shared disk systems ----------------------- High
* Node failure ---------------------------------------- Low
* Software errors or configuration -------------------- Low
* Shared disk subsystems ------------------------------ Low
2.3.8 Cluster in failover mode and node(s) running without errors.
* Communication link ---------------------------------- High
* Cables to shared disk systems ----------------------- High
* Subsystem host adapters ----------------------------- Medium
* Software errors or configuration -------------------- Medium
* Shared disk subsystem ------------------------------- Low
* Node failure ---------------------------------------- Low
3.0 Replacing a failed ServeRAID II adapter in a High-Availability configuration.
NOTE: The following procedure requires that specific configuration settings for the ServeRAID II
adapter be obtained from the adapter that is being replaced, or that they were noted when the
adapter was previously configured and are available for reconfiguring the new adapter.
NOTE: Obtaining the correct information for these settings is the responsibility of the user and is
required to accomplish this procedure.
Step 1:
Tip:
SCSI Bus Initiator_Ids for non-shared SCSI channels will normally be set to 7. However, for shared
SCSI channels the IDs will usually be 7 or 6 and must be different from the
SCSI Bus Initiator_Ids for the corresponding SCSI channels of the cluster partner adapter. You
may obtain the SCSI Bus Initiator_Ids from the corresponding cluster partner adapter by booting
the ServeRAID Configuration Diskette on the cluster partner system and selecting the
"Display/Change Adapter Params" option from the "Advanced Functions" menu. From this
information, the correct settings for the replacement adapter can be determined. For example, if the
cluster partner's shared SCSI Bus Initiator_Ids were set to 7, then the replacement adapter would
typically need to be set to 6.
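The pairing rule in the tip above can be stated as a tiny sketch. It assumes the typical two-node case where shared-channel Initiator_Ids are 7 or 6 (the function name is illustrative):

```python
def replacement_initiator_id(partner_id):
    """Given the cluster partner adapter's Initiator_Id on a shared SCSI
    channel, return the Initiator_Id the replacement adapter must use.
    The two IDs must differ, and are normally 7 or 6, so pick the other."""
    if partner_id not in (6, 7):
        raise ValueError("shared-channel Initiator_Ids are normally 6 or 7")
    return 13 - partner_id  # 7 -> 6, 6 -> 7
```

So a partner reading of 7 means the replacement adapter is set to 6, matching the example above.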
The proper settings for the Host_Id and Cluster Partner's Host_Id of the adapter being replaced
may be determined by reading the settings from the cluster partner system using the
"Display/Change Adapter Params" option. In this case, the adapter being replaced should have its
Host_Id set to the same value as is defined for the Cluster Partner's Host_Id on the corresponding
adapter in the cluster partner system. The Cluster Partner's Host_Id of the replacement adapter
should be set to the same value as is defined in the Host_Id of the corresponding adapter in the
cluster partner system.
Example:
Node A Node B
SCSI Bus Initiator_Ids 7 6
Adapter Host_Id Server001 Server002
Cluster Partner's Host_Id Server002 Server001
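In other words, the replacement adapter's two Host_Id settings are the partner adapter's two settings swapped. A minimal sketch of that rule (the function name is illustrative; "Server001"/"Server002" are the example values from the table above):

```python
def replacement_host_ids(partner_host_id, partner_cluster_partner_id):
    """Derive the replacement adapter's settings from the values read on
    the cluster partner adapter: the replacement's Host_Id is the
    partner's Cluster Partner's Host_Id, and vice versa."""
    return {
        "Host_Id": partner_cluster_partner_id,
        "Cluster Partner's Host_Id": partner_host_id,
    }
```

Using the Node B readings from the example (Host_Id "Server002", Cluster Partner's Host_Id "Server001"), the replacement adapter in Node A gets Host_Id "Server001" and Cluster Partner's Host_Id "Server002".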
Step 3:
NOTE: If the adapter being replaced is not the adapter that attaches the server's boot disk array or
other non-shared disk arrays, then the following steps may not apply and the system may now be
restarted normally.
_____________________________________________________________________________________________________
4.0 Contacting support
When it is necessary to contact IBM support for assistance in resolving a cluster problem, the
information below will help support understand the cluster environment and the problem more quickly.
5.0 Service tips
5.1 Diagnostics
5.2 Shared Disk Subsystems
5.3 Communication Link/LAN Adapters
6.0 Glossary of terms
Cluster:
A collection of interconnected whole computers utilized as a single unified computing resource.
Node:
A server participating in a cluster.
Communication Link:
The link between the nodes of the cluster used for Cluster communication. This is usually an Ethernet link.
Failover:
The action where processes and applications are stopped on one node and restarted on the other node.
Failback:
The action where processes and/or applications return to the node they are configured to
run on during normal cluster operation.
Cluster down:
A state where either multiple nodes are physically not functioning or clients cannot access the cluster
or virtual servers configured on the cluster.
Failover mode:
A state where one node in the cluster is handling cluster activity while the other node is offline or not functioning.
Node failure:
A state where a node hardware failure is exhibited by one of the following attributes:
Hang trap condition:
A state where a software failure has halted operation of a node exhibited by any of the following:
Running with errors:
The operating system on the node is capable of running and is reporting errors. Cluster activity on this
node has ceased to function.
Running without errors:
The operating system on the node is capable of running and is NOT reporting errors. Cluster activity on
this node is not functioning.
_____________________________________________________________________________________________________
Windows NT and Microsoft Cluster Server are trademarks of Microsoft Corporation.
Microsoft is a registered trademark of Microsoft Corporation.
NetWare and IntranetWare are trademarks of Novell, Inc.
SYMplicity is a trademark of Symbios Logic, Inc.
MetaStor is a trademark of Symbios Logic, Inc.
_____________________________________________________________________________________________________
Please see the LEGAL - Trademark notice.