HP OpenVMS Cluster Systems
C.5 Diagnosing LAN Component Failures
Section D.5 provides troubleshooting techniques for LAN component
failures (for example, broken LAN bridges). That appendix also
describes techniques for using the Local Area OpenVMS Cluster Network
Failure Analysis Program.
Intermittent LAN component failures (for example, packet loss) can
cause problems in the NISCA transport protocol that delivers System
Communications Services (SCS) messages to other nodes in the OpenVMS
Cluster. Appendix F describes troubleshooting techniques and
requirements for LAN analyzer tools.
C.6 Diagnosing Cluster Hangs
Conditions like the following can cause an OpenVMS Cluster computer to
suspend process or system activity (that is, to hang):
C.6.1 Cluster Quorum is Lost
The OpenVMS Cluster quorum algorithm coordinates activity among OpenVMS
Cluster computers and ensures the integrity of shared cluster
resources. (The quorum algorithm is described fully in Chapter 2.)
Quorum is checked after any change to the cluster configuration, for
example, when a voting computer leaves or joins the cluster. If quorum
is lost, process and I/O activity on all computers in the cluster are
blocked.
Information about the loss of quorum and about clusterwide events that
cause loss of quorum is sent to the OPCOM process, which broadcasts
messages to designated operator terminals. The information is also
broadcast to each computer's operator console (OPA0), unless broadcast
activity is explicitly disabled on that terminal. However, because
quorum may be lost before OPCOM has been able to inform the operator
terminals, the messages sent to OPA0 are the most reliable source of
information about events that cause loss of quorum.
If quorum is lost, you might add or reboot a node with additional votes.
Reference: See also the information about cluster
quorum in Section 10.11.
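The quorum check can be illustrated with a minimal Python sketch. It assumes the commonly documented derivation of quorum from the EXPECTED_VOTES parameter, (EXPECTED_VOTES + 2) / 2 with the fraction truncated, and it ignores quorum disk votes and dynamic adjustments. The function names and the three-node configuration are hypothetical and are not part of OpenVMS.

```python
def quorum_from_expected_votes(expected_votes):
    """Quorum derived from EXPECTED_VOTES, assuming (EXPECTED_VOTES + 2) / 2, truncated."""
    return (expected_votes + 2) // 2


def has_quorum(member_votes, expected_votes):
    """True if the votes held by the current members reach quorum (simplified model)."""
    return sum(member_votes) >= quorum_from_expected_votes(expected_votes)


# Hypothetical three-node cluster: one vote per node, EXPECTED_VOTES = 3.
votes = [1, 1, 1]
print(quorum_from_expected_votes(3))   # quorum value for EXPECTED_VOTES = 3
print(has_quorum(votes, 3))            # all three voting members present
print(has_quorum(votes[:1], 3))        # only one voting member remains
```

In this simplified model, losing two of the three voting members drops the vote total below quorum, which corresponds to the blocked state described above. Adding or rebooting a node with additional votes raises the total again.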
C.6.2 Inaccessible Cluster Resource
Access to shared cluster resources is coordinated by the distributed
lock manager. If a particular process is granted a lock on a resource
(for example, a shared data file), other processes in the cluster that
request incompatible locks on that resource must wait until the
original lock is released. If the original process retains its lock for
an extended period, other processes waiting for the lock to be released
may appear to hang.
Occasionally, a system activity must acquire a restrictive lock on a
resource for an extended period. For example, to perform a volume
rebuild, system software takes out an exclusive lock on the volume
being rebuilt. While this lock is held, no processes can allocate space
on the disk volume. If they attempt to do so, they may appear to hang.
Access to files that contain data necessary for the operation of the
system itself is coordinated by the distributed lock manager. For this
reason, a process that acquires a lock on one of these resources and is
then unable to proceed may cause the cluster to appear to hang.
For example, this condition may occur if a process locks a portion of
the system authorization file (SYS$SYSTEM:SYSUAF.DAT) for write access.
Any activity that requires access to that portion of the file, such as
logging in to an account with the same or similar user name or sending
mail to that user name, is blocked until the original lock is released.
Normally, this lock is released quickly, and users do not notice the
locking operation.
However, if the process holding the lock is unable to proceed, other
processes could enter a wait state. Because the authorization file is
used during login and for most process creation operations (for
example, batch and network jobs), blocked processes could rapidly
accumulate in the cluster. Because the distributed lock manager is
functioning normally under these conditions, users are not notified by
broadcast messages or other means that a problem has occurred.
C.7 Diagnosing CLUEXIT Bugchecks
The operating system performs bugcheck operations only
when it detects conditions that could compromise normal system activity
or endanger data integrity. A CLUEXIT bugcheck is a
type of bugcheck initiated by the connection manager, the OpenVMS
Cluster software component that manages the interaction of cooperating
OpenVMS Cluster computers. Most such bugchecks are triggered by
conditions resulting from hardware failures (particularly failures in
communications paths), configuration errors, or system management
errors.
C.7.1 Conditions Causing Bugchecks
The most common conditions that result in CLUEXIT bugchecks are as
follows:
Possible Bugcheck Causes |
Recommendations |
The cluster connection between two computers is broken for longer than
RECNXINTERVAL seconds. Thereafter, the connection is declared
irrevocably broken. If the connection is later reestablished, one of
the computers shuts down with a CLUEXIT bugcheck.
This condition can occur:
- Upon recovery with battery backup after a power failure
- After the repair of an SCS communication link
- After the computer was halted for a period longer than the number
of seconds specified for the RECNXINTERVAL parameter and was restarted
with a CONTINUE command entered at the operator console
|
Determine the cause of the interrupted connection and correct the
problem. For example, if recovery from a power failure is longer than
RECNXINTERVAL seconds, you may want to increase the value of the
RECNXINTERVAL parameter on all computers.
|
Cluster partitioning occurs. A member of a cluster discovers or
establishes connection to a member of another cluster, or a foreign
cluster is detected in the quorum file.
|
Review the setting of EXPECTED_VOTES on all computers.
|
The value specified for the SCSMAXMSG system parameter on a computer is
too small.
|
Verify that the value of SCSMAXMSG on all OpenVMS Cluster computers is
set to at least the default value.
|
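The RECNXINTERVAL condition in the first row of the table can be sketched in Python as follows. This is only an illustration of the timing rule stated there. The function name, the return labels, and the sample values are hypothetical, and the sketch does not model which of the two computers shuts down.

```python
def reconnect_outcome(outage_seconds, recnxinterval):
    """Classify what happens when a broken cluster connection is reestablished.

    Illustrative only: returns "CLUEXIT" when the outage lasted longer than
    RECNXINTERVAL seconds (the connection was declared irrevocably broken),
    otherwise "RESUME".
    """
    if outage_seconds > recnxinterval:
        return "CLUEXIT"
    return "RESUME"


# Hypothetical values: a power-fail recovery that takes 45 seconds.
print(reconnect_outcome(45, 20))   # outage exceeds the interval
print(reconnect_outcome(45, 60))   # interval raised above the recovery time
```

The second call reflects the recommendation in the table: if recovery routinely takes longer than RECNXINTERVAL seconds, increasing the parameter on all computers keeps the outage within the allowed interval.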
C.8 Port Communications
These sections provide detailed information about port communications
to assist in diagnosing port communication problems.
C.8.1 LAN Communications
For clusters that include Ethernet or FDDI interconnects, a multicast
scheme is used to locate computers on the LAN. Approximately every 3
seconds, the port emulator driver (PEDRIVER) sends a HELLO datagram
message through each LAN adapter to a cluster-specific multicast
address that is derived from the cluster group number. The driver also
enables the reception of these messages from other computers. When the
driver receives a HELLO datagram message from a computer with which it
does not currently share an open virtual circuit, it attempts to create
a circuit. HELLO datagram messages received from a computer with a
currently open virtual circuit indicate that the remote computer is
operational.
A standard, three-message exchange handshake is used to create a
virtual circuit. The handshake messages contain information about the
transmitting computer and its record of the cluster password. These
parameters are verified at the receiving computer, which continues the
handshake only if its verification is successful. Thus, each computer
authenticates the other. After the final message, the virtual circuit
is opened for use by both computers.
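The HELLO handling described in this section can be modeled with a short Python sketch. It is not PEDRIVER code. The class, the method names, and the timeout value are hypothetical, and the sketch assumes that the three-message handshake always succeeds.

```python
class HelloTracker:
    """Toy model of how a node reacts to HELLO datagram messages."""

    HELLO_INTERVAL = 3.0  # approximate seconds between HELLO messages (from the text)

    def __init__(self):
        self.open_circuits = {}  # remote node name -> time of last HELLO received

    def on_hello(self, node, now):
        """Return the action taken when a HELLO arrives from `node` at time `now`."""
        if node not in self.open_circuits:
            # No open virtual circuit yet: attempt to create one.
            # Assumption: the handshake succeeds, so the circuit is recorded as open.
            action = "attempt_circuit"
        else:
            # Circuit already open: the HELLO shows the remote node is operational.
            action = "node_operational"
        self.open_circuits[node] = now
        return action

    def silent_nodes(self, now, timeout):
        """Nodes whose last HELLO is older than `timeout` seconds (hypothetical value)."""
        return sorted(n for n, t in self.open_circuits.items() if now - t > timeout)


tracker = HelloTracker()
print(tracker.on_hello("NODEA", now=0.0))           # first HELLO from NODEA
print(tracker.on_hello("NODEA", now=3.0))           # later HELLO on an open circuit
print(tracker.silent_nodes(now=20.0, timeout=8.0))  # NODEA has gone quiet
```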
C.8.2 System Communications Services (SCS) Connections
System services such as the disk class driver, connection manager, and
the MSCP and TMSCP servers communicate between computers with a
protocol called System Communications Services (SCS). SCS is
responsible primarily for forming and breaking intersystem process
connections and for controlling flow of message traffic over those
connections. SCS is implemented in the port driver (for example,
PADRIVER, PBDRIVER, PEDRIVER, PIDRIVER), and in a loadable piece of the
operating system called SCSLOA.EXE (loaded automatically during system
initialization).
When a virtual circuit has been opened, a computer periodically probes
a remote computer for system services that the remote computer may be
offering. The SCS directory service, which makes known services that a
computer is offering, is always present on both computers and HSC
subsystems. As system services discover their counterparts on other
computers and HSC subsystems, they establish SCS connections to each
other. These connections are full duplex and are associated with a
particular virtual circuit. Multiple connections are typically
associated with a virtual circuit.
C.9 Diagnosing Port Failures
This section describes the hierarchy of communication paths and
where failures can occur.
C.9.1 Hierarchy of Communication Paths
Taken together, SCS, the port drivers, and the port itself support a
hierarchy of communication paths. Starting with the most fundamental
level, these are as follows:
- The physical wires. The Ethernet is a single coaxial cable. The
port chooses the free path or, if both are free, an arbitrary path
(implemented in the cables and managed by the port).
- The virtual circuit (implemented in LAN port emulator driver
(PEDRIVER) and partly in SCS software).
- The SCS connections (implemented in system software).
C.9.2 Where Failures Occur
Failures can occur at each communication level and in each component.
Failures at one level translate into failures elsewhere, as described
in Table C-3.
Table C-3 Port Failures
Communication Level |
Failures |
Wires
|
If the LAN fails or is disconnected, LAN traffic stops or is
interrupted, depending on the nature of the failure. All traffic is
directed over the remaining good path. When the wire is repaired, the
repair is detected automatically by port polling, and normal operations
resume on all ports.
|
Virtual circuit
|
If no path works between a pair of ports, the virtual circuit fails and
is closed. For the LAN, a path failure is discovered when no multicast
HELLO datagram message or incoming traffic is received from another
computer.
When a virtual circuit fails, every SCS connection on it is closed.
The software automatically reestablishes connections when the virtual
circuit is reestablished. Normally, reestablishing a virtual circuit
takes several seconds after the problem is corrected.
|
LAN adapter
|
If a LAN adapter device fails, attempts are made to restart it. If
repeated attempts fail, all channels using that adapter are broken. A
channel is a pair of LAN addresses, one local and one remote. If the
last open channel for a virtual circuit fails, the virtual circuit is
closed and the connections are broken.
|
SCS connection
|
When the software protocols fail or, in some instances, when the
software detects a hardware malfunction, a connection is terminated.
Other connections are usually unaffected, as is the virtual circuit.
Breaking of connections is also used under certain conditions as an
error recovery mechanism, most commonly when there is insufficient
nonpaged pool available on the computer.
|
Computer
|
If a computer fails because of operator shutdown, bugcheck, or halt,
all other computers in the cluster record the shutdown as failures of
their virtual circuits to the port on the shut down computer.
|
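The way failures propagate upward in Table C-3 can be illustrated with a small Python model. It is a sketch only. The class name, the channel tuples, and the connection names are hypothetical. It captures two rules from the table: a virtual circuit is closed when its last open channel fails, and closing a virtual circuit closes every SCS connection on it.

```python
class VirtualCircuit:
    """Toy model of a virtual circuit, its channels, and its SCS connections."""

    def __init__(self, channels, connections):
        # A channel is a pair of LAN addresses: (local, remote).
        self.channels = set(channels)
        self.connections = set(connections)
        self.state = "OPEN"

    def fail_channel(self, channel):
        """Remove a failed channel; return the SCS connections closed as a result."""
        self.channels.discard(channel)
        if not self.channels and self.state == "OPEN":
            # Last open channel is gone: close the circuit and all its connections.
            self.state = "CLOSED"
            closed = sorted(self.connections)
            self.connections.clear()
            return closed
        return []


# Hypothetical circuit with two channels and two SCS connections.
vc = VirtualCircuit(
    channels=[("local-1", "remote-1"), ("local-2", "remote-2")],
    connections=["conn-a", "conn-b"],
)
print(vc.fail_channel(("local-1", "remote-1")), vc.state)  # one channel still open
print(vc.fail_channel(("local-2", "remote-2")), vc.state)  # last channel fails
```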
C.9.3 Verifying Virtual Circuits
To diagnose communication problems, you can invoke the Show Cluster
utility using the instructions in Table C-4.
Table C-4 How to Verify Virtual Circuit States
Step |
Action |
What to Look for |
1
|
Tailor the SHOW CLUSTER report by entering the SHOW CLUSTER command ADD
CIRCUIT,CABLE_STATUS. This command adds a class of information about
all the virtual circuits as seen from the computer on which you are
running SHOW CLUSTER. CABLE_STATUS indicates the status of the path for
the circuit from the CI interface on the local system to the CI
interface on the remote system.
|
Primarily, you are checking whether there is a virtual circuit in the
OPEN state to the failing computer. Common causes of failure to open a
virtual circuit and keep it open are the following:
- Port errors on one side or the other
- Cabling errors
- A port set off line because of software problems
- Insufficient nonpaged pool on both sides
- Failure to set correct values for the SCSNODE, SCSSYSTEMID,
PAMAXPORT, PANOPOLL, PASTIMOUT, and PAPOLLINTERVAL system parameters
|
2
|
Run SHOW CLUSTER from each active computer in the cluster to verify
whether each computer's view of the failing computer is consistent with
every other computer's view.
WHEN... |
THEN... |
All the active computers have a consistent view of the failing computer
|
The problem may be in the failing computer.
|
Only one of several active computers detects that the newcomer is
failing
|
That particular computer may have a problem.
|
|
If no virtual circuit is open to the failing computer, check the bottom
of the SHOW CLUSTER display:
- For information about circuits to the port of the failing computer.
Virtual circuits in partially open states are shown at the bottom of
the display. If the circuit is shown in a state other than OPEN,
communications between the local and remote ports are taking place, and
the failure is probably at a higher level than in port or cable
hardware.
- To see whether both path A and path B to the failing port are good.
The loss of one path should not prevent a computer from participating
in a cluster.
|
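The comparison in step 2 of Table C-4 can be written as a small decision function. The Python sketch below is illustrative. The function name, the input format, and the "inconclusive" result for mixed views are assumptions that go beyond the two cases listed in the table.

```python
def localize_fault(views):
    """Suggest where to look, given each active computer's view of the suspect node.

    `views` maps an observer's node name to True if that observer shows a
    virtual circuit in the OPEN state to the suspect computer, else False.
    """
    failing = sorted(name for name, is_open in views.items() if not is_open)
    if not failing:
        return "no fault observed"
    if len(failing) == len(views):
        # All observers agree: the problem may be in the failing computer.
        return "suspect computer"
    if len(failing) == 1:
        # Only one observer reports a failure: that computer may have a problem.
        return "observer " + failing[0]
    return "inconclusive"  # assumption: mixed views are not covered by the table


# Hypothetical observations gathered by running SHOW CLUSTER on each node.
print(localize_fault({"NODEA": False, "NODEB": False, "NODEC": False}))
print(localize_fault({"NODEA": True, "NODEB": False, "NODEC": True}))
```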
C.9.4 Verifying LAN Connections
The Local Area OpenVMS Cluster Network Failure Analysis Program
described in Section D.4 uses the HELLO datagram messages to verify
continuously the network paths (channels) used by PEDRIVER. This
verification process, combined with physical description of the
network, can:
- Isolate failing network components
- Group failing channels together and map them onto the physical
network description
- Call out the common components related to the channel failures
C.10 Analyzing Error-Log Entries for Port Devices
Monitoring events recorded in the error log can help you anticipate and
avoid potential problems. From the total error count (displayed by the
DCL command SHOW DEVICES device-name), you can determine
whether errors are increasing. If so, you should examine the error log.
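One simple way to judge whether errors are increasing is to record the error count at intervals and compare successive readings. The Python sketch below assumes the readings have been copied by hand from the SHOW DEVICES display. The function names and the sample values are hypothetical.

```python
def error_count_deltas(samples):
    """Differences between successive error-count readings."""
    return [later - earlier for earlier, later in zip(samples, samples[1:])]


def errors_increasing(samples):
    """True if any reading is higher than the one before it."""
    return any(delta > 0 for delta in error_count_deltas(samples))


# Hypothetical error counts for one device, noted at regular intervals.
readings = [2, 2, 5, 9]
print(error_count_deltas(readings))
print(errors_increasing(readings))
```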
C.10.1 Examine the Error Log
The DCL command ANALYZE/ERROR_LOG invokes the Error Log utility to
report the contents of an error-log file.
Reference: For more information about the Error Log
utility, see the HP OpenVMS System Management Utilities Reference Manual.
Some error-log entries are informational only while others require
action.
Table C-5 Informational and Other Error-Log Entries
Error Type |
Action Required? |
Purpose |
Informational error-log entries require no action. For
example, if you shut down a computer in the cluster, all other active
computers that have open virtual circuits between themselves and the
computer that has been shut down make entries in their error logs. Such
computers record up to three errors for the event:
- Path A received no response.
- Path B received no response.
- The virtual circuit is being closed.
|
No
|
These messages are normal and reflect the change of state in the
circuits to the computer that has been shut down.
|
Other error-log entries indicate problems that degrade
operation or nonfatal hardware problems. The operating system might
continue to run satisfactorily under these conditions.
|
Yes
|
Detecting these problems early is important to preventing nonfatal
problems (such as loss of a single CI path) from becoming serious
problems (such as loss of both paths).
|
C.10.2 Formats
Errors and other events on LAN cause port drivers to enter information
in the system error log in one of two formats:
- Device attention
For the LAN, device-attention entries typically record errors on a LAN
adapter device.
- Logged message
Logged-message entries record the receipt of a
message packet that contains erroneous data or that signals an error
condition.
Section C.10.4 describes those formats.
C.10.3 LAN Device-Attention Entries
Example C-1 shows device-attention entries for the LAN.
Example C-1 LAN Device-Attention Entry |
**** V3.4 ********************* ENTRY 337 ******************************** (1)
Logging OS 1. OpenVMS
System Architecture 2. Alpha
OS version XC56-BL2
Event sequence number 96.
Timestamp of occurrence 16-SEP-2009 16:33:03 (2)
Time since reboot 0 Day(s) 0:50:08
Host name PERK
System Model AlphaServer ES45 Model 2 (3)
Entry Type 98. Asynchronous Device Attention
---- Device Profile ----
Unit PERK$PEA0 (4)
Product Name NI-SCA Port
---- NISCA Port Data ----
Error Type and SubType x0700 Device Error, Fatal Error Detected by
Datalink(5)
Status x0000120100000001 (6)
Datalink Device Name EIA2: (7)
Remote Node Name (8)
Remote Address x0000000000000000 (9)
Local Address x000063B4000400AA (10)
Error Count 1. Error Occurrences This Entry (11)
----- Software Info -----
UCB$x_ERRCNT 2. Errors This Unit
|
The following table describes the LAN device-attention entries in
Example C-1.
Entry |
Description |
(1)
|
The four lines are the entry heading. These lines contain the number of
the entry in this error log file, the architecture, the OS version and
the sequence number of this error. Each entry in the log file contains
such a heading.
|
(2)
|
This line contains the date and time.
|
(3)
|
The next two lines contain the system model and the entry type.
|
(4)
|
This line shows the name of the subsystem and component that caused the
entry.
|
(5)
|
This line shows the reason for the entry. The LAN driver has shut down
the data link because of a fatal error. The data link will be restarted
automatically, if possible.
|
(6)
|
The first longword shows the I/O completion status returned by the LAN
driver. The second longword is the VCI event code delivered to PEDRIVER
by the LAN driver.
|
(7)
|
DATALINK NAME is the name of the LAN device on which the error occurred.
|
(8)
|
REMOTE NODE is the name of the remote node to which the packet was
being sent. If zeros are displayed, either no remote node was available
or no packet was associated with the error.
|
(9)
|
REMOTE ADDR is the LAN address of the remote node to which the packet
was being sent. If zeros are displayed, no packet was associated with
the error.
|
(10)
|
LOCAL ADDR is the LAN address of the local node.
|
(11)
|
ERROR CNT. Because some errors can occur at extremely high rates, some
error log entries represent more than one occurrence of an error. This
field indicates how many. The errors counted occurred in the 3 seconds
preceding the timestamp on the entry.
|
|
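The hexadecimal fields in Example C-1 can be decoded with a few lines of Python. This sketch rests on two assumptions that the text does not state: that the "first longword" of the Status field is the low-order 32 bits of the displayed quadword, and that the address fields are displayed most significant byte first, so the LAN address octets appear in reverse order. The function names are hypothetical.

```python
def split_status(status_hex):
    """Split a Status quadword such as 'x0000120100000001' into two longwords.

    Assumption: the low-order longword is the I/O completion status and the
    high-order longword is the VCI event code.
    """
    value = int(status_hex.lstrip("xX"), 16)
    io_status = value & 0xFFFFFFFF
    vci_event = (value >> 32) & 0xFFFFFFFF
    return io_status, vci_event


def lan_address(field_hex):
    """Convert an address field such as 'x000063B4000400AA' to LAN address form.

    Assumption: the low-order byte of the displayed value is the first octet.
    """
    value = int(field_hex.lstrip("xX"), 16)
    octets = [(value >> (8 * i)) & 0xFF for i in range(6)]
    return "-".join(f"{octet:02X}" for octet in octets)


io_status, vci_event = split_status("x0000120100000001")
print(hex(io_status), hex(vci_event))    # 0x1 0x1201
print(lan_address("x000063B4000400AA"))  # AA-00-04-00-B4-63
```

Under these assumptions, the Local Address field in the example decodes to an address that begins with AA-00-04-00, which is the form used for DECnet Phase IV LAN addresses.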