HP OpenVMS Systems Documentation

HP OpenVMS Cluster Systems


F.6.2 Techniques for Troubleshooting

When there is a break in communications between two nodes and you suspect problems with channel formation, follow these instructions:

Step Action
1 Check the obvious:
  • Is the remote node powered on?
  • Is the remote node booted?
  • Are the required network connections connected?
  • Do the cluster multicast datagrams pass through all of the required bridges in both directions?
  • Are the cluster group code and password values the same on all nodes?
2 Check for dead channels by using SDA. The SDA command SHOW PORT/CHANNEL/VC=VC_remote_node can help you determine whether a channel ever existed; the command displays the channel's state.

Reference: Refer to Section F.3 for examples of the SHOW PORT command. Section F.11.1 describes how to use a LAN analyzer to troubleshoot channel formation problems.

3 See also Appendix D for information about using the LAVC$FAILURE_ANALYSIS program to troubleshoot channel problems.

F.7 Retransmission Problems

Retransmissions occur when the local node does not receive acknowledgment of a message in a timely manner.

F.7.1 Why Retransmissions Occur

The first time the sending node transmits the datagram containing the sequenced message data, PEDRIVER sets the REXMT flag bit in the TR header to 0. If the datagram requires retransmission, PEDRIVER sets the REXMT flag bit to 1 and resends the datagram. PEDRIVER retransmits the datagram until either the datagram is received or the virtual circuit is closed. If multiple channels are available, PEDRIVER tries to retransmit the message on a different channel to avoid the problem that caused the retransmission.
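The retransmission behavior described above can be sketched as a simple send loop. This is a minimal illustration, not PEDRIVER's implementation: the `transmit` callback, the `max_attempts` limit (standing in for the point at which the virtual circuit is closed), and the round-robin channel choice are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Datagram:
    seq: int        # sequence number of the sequenced message
    rexmt: bool     # REXMT flag bit: False on first send, True on every resend
    payload: bytes


def send_sequenced(
    seq: int,
    payload: bytes,
    channels: List[str],
    transmit: Callable[[str, Datagram], bool],
    max_attempts: int = 8,  # hypothetical limit; models closing the virtual circuit
) -> List[Tuple[str, bool]]:
    """Send one sequenced message until it is acknowledged.

    transmit(channel, datagram) is a hypothetical callback that returns True
    when an ACK arrives in time. Returns the (channel, rexmt) pair of each try.
    """
    attempts = []
    for attempt in range(max_attempts):
        # Assumed policy: rotate to a different channel on each retry.
        channel = channels[attempt % len(channels)]
        datagram = Datagram(seq, rexmt=(attempt > 0), payload=payload)
        attempts.append((channel, datagram.rexmt))
        if transmit(channel, datagram):
            return attempts
    raise ConnectionError("no acknowledgment received; virtual circuit closed")
```

With two channels where only the second one delivers, the first attempt goes out with REXMT clear and the retry goes out on the other channel with REXMT set.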

Retransmission typically occurs when a node runs out of a critical resource, such as large request packets (LRPs) or nonpaged pool, and a message is lost after it reaches the remote node. Other potential causes of retransmissions include overloaded LAN bridges, slow LAN adapters (such as the DELQA), and heavily loaded systems, which delay packet transmission or reception. Figure F-4 shows an unsuccessful transmission followed by a successful retransmission.

Figure F-4 Lost Messages Cause Retransmissions


Because the first message was lost, the local node does not receive an acknowledgment (ACK) from the remote node. The remote node acknowledges the second (successful) transmission of the message.

Retransmission can also occur if the cables are seated improperly, if the network is too busy and the datagram cannot be sent, or if the datagram is corrupted or lost during transmission either by the originating LAN adapter or by any bridges or repeaters. Figure F-5 illustrates another type of retransmission.

Figure F-5 Lost ACKs Cause Retransmissions


In Figure F-5, the remote node receives the message and transmits an acknowledgment (ACK) to the sending node. However, because the ACK from the receiving node is lost, the sending node retransmits the message.

F.7.2 Techniques for Troubleshooting

You can troubleshoot cluster retransmissions using a LAN protocol analyzer for each LAN segment. If multiple segments are used for cluster communications, then the LAN analyzers need to support a distributed enable and trigger mechanism (see Section F.9).

Reference: Techniques for isolating the retransmitted datagram using a LAN analyzer are discussed in Section F.11.2. See also Appendix G for more information about congestion control and PEDRIVER message retransmission.

F.8 Understanding NISCA Datagrams

Troubleshooting NISCA protocol communication problems requires an understanding of the NISCA protocol packet that is exchanged across the OpenVMS Cluster system.

F.8.1 Packet Format

The format of packets on the NISCA protocol is defined by the $NISCADEF macro, which is located in [DRIVER.LIS] on VAX systems and in [LIB.LIS] on Alpha systems on your CD listing disk.

Figure F-6 shows the general form of NISCA datagrams. A NISCA datagram consists of the following headers, which are usually followed by user data:

  • LAN headers, including an Ethernet or an FDDI header
  • Datagram exchange (DX) header
  • Channel control (CC) or transport (TR) header

Figure F-6 NISCA Headers


Caution: The NISCA protocol is subject to change without notice.

F.8.2 LAN Headers

The NISCA protocol is supported on LANs consisting of Ethernet; the Ethernet header is described in Section F.8.3. LAN headers contain information that is useful for diagnosing problems that occur between LAN adapters.

Reference: See Section F.10.4 for methods of isolating information in LAN headers.

F.8.3 Ethernet Header

Each datagram that is transmitted or received on the Ethernet is prefixed with an Ethernet header. The Ethernet header, shown in Figure F-7 and described in Table F-8, is 16 bytes long.

Figure F-7 Ethernet Header


Table F-8 Fields in the Ethernet Header
Field Description
Destination address LAN address of the adapter that should receive the datagram
Source address LAN address of the adapter sending the datagram
Protocol type NISCA protocol type, 60-07 (hexadecimal)
Length Number of data bytes in the datagram following the length field
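When you inspect a captured frame by hand, the fields in Table F-8 can be separated with a few lines of code. The following is a minimal sketch that assumes the fields appear in the order listed in the table (6-byte destination, 6-byte source, 2-byte protocol type, 2-byte length) and that the length field is stored low byte first; the byte order of the length field is an assumption, not something stated in this section. The function and key names are illustrative only.

```python
import struct

ETHERNET_HEADER_LEN = 16                      # header size given in this section
NISCA_PROTOCOL_TYPE = bytes([0x60, 0x07])     # protocol type 60-07 (hexadecimal)


def format_lan_address(raw: bytes) -> str:
    """Render raw address bytes in the hyphenated form used in this manual."""
    return "-".join(f"{b:02X}" for b in raw)


def parse_ethernet_header(frame: bytes) -> dict:
    """Split the first 16 bytes of a captured frame into the Table F-8 fields."""
    if len(frame) < ETHERNET_HEADER_LEN:
        raise ValueError("frame is shorter than the 16-byte Ethernet header")
    # "<" disables padding; H is read low byte first (assumed byte order).
    dest, src, ptype, length = struct.unpack_from("<6s6s2sH", frame, 0)
    return {
        "destination": format_lan_address(dest),
        "source": format_lan_address(src),
        "protocol_type": format_lan_address(ptype),
        "is_nisca": ptype == NISCA_PROTOCOL_TYPE,
        "length": length,
    }
```

Checking `is_nisca` first is a quick way to discard frames that belong to other protocols before looking at the DX, CC, or TR headers.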

F.8.4 Datagram Exchange (DX) Header

The datagram exchange (DX) header for the OpenVMS Cluster protocol is used to address the data to the correct OpenVMS Cluster node. The DX header, shown in Figure F-8 and described in Table F-9, is 14 bytes long. It contains information that describes the OpenVMS Cluster connection between two nodes. See Section F.10.3 for methods of isolating data for the DX header.

Figure F-8 DX Header


Table F-9 Fields in the DX Header
Field Description
Destination SCS address Manufactured using the address AA-00-04-00-remote-node-SCSSYSTEMID. Append the remote node's SCSSYSTEMID system parameter value for the low-order 16 bits. This address represents the destination SCS transport address or the OpenVMS Cluster multicast address.
Cluster group number The cluster group number specified by the system manager. See Chapter 8 for more information about cluster group numbers.
Source SCS address Represents the source SCS transport address and is manufactured using the address AA-00-04-00-local-node-SCSSYSTEMID. Append the local node's SCSSYSTEMID system parameter value as the low-order 16 bits.
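The address construction in Table F-9 can be illustrated with a short sketch. It assumes that the three fields appear in the order listed in the table (6-byte destination, 2-byte group number, 6-byte source, for 14 bytes in total) and that the 16-bit values are stored low byte first; both the field order and the byte order are assumptions for illustration, and the function names are hypothetical.

```python
import struct

SCS_ADDRESS_PREFIX = bytes([0xAA, 0x00, 0x04, 0x00])   # AA-00-04-00
DX_HEADER_LEN = 14                                     # size given in this section


def scs_address(scssystemid: int) -> bytes:
    """Build an SCS transport address from a node's SCSSYSTEMID value.

    Only the low-order 16 bits are appended to the AA-00-04-00 prefix.
    Storing them low byte first is an assumption.
    """
    return SCS_ADDRESS_PREFIX + struct.pack("<H", scssystemid & 0xFFFF)


def parse_dx_header(data: bytes) -> dict:
    """Split 14 bytes of DX header into the Table F-9 fields (assumed layout)."""
    if len(data) < DX_HEADER_LEN:
        raise ValueError("data is shorter than the 14-byte DX header")
    dest, group, src = struct.unpack_from("<6sH6s", data, 0)
    return {
        "destination_scs": "-".join(f"{b:02X}" for b in dest),
        "cluster_group": group,
        "source_scs": "-".join(f"{b:02X}" for b in src),
    }
```

Comparing `cluster_group` across captured datagrams is one way to spot nodes that were configured with mismatched group numbers.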

F.8.5 Channel Control (CC) Header

The channel control (CC) message is used to form and maintain working network paths between nodes in the OpenVMS Cluster system. The important fields for network troubleshooting are the datagram flags/type and the cluster password. Note that because the CC and TR headers occupy the same space, there is a TR/CC flag that identifies the type of message being transmitted over the channel. Figure F-9 shows the portions of the CC header needed for network troubleshooting, and Table F-10 describes these fields.

Figure F-9 CC Header


Table F-10 Fields in the CC Header
Field Description
Datagram type (bits <3:0>) Identifies the type of message on the Channel Control level. The following table shows the datagrams and their functions.
Value | Abbreviated Datagram Type | Expanded Datagram Type | Function
0 | HELLO | HELLO datagram message | Multicast datagram that initiates the formation of a channel between cluster nodes and tests and maintains the existing channels. This datagram does not contain a valid cluster password.
1 | BYE | Node-stop notification | Datagram that signals the departure of a cluster node.
2 | CCSTART | Channel start | Datagram that starts the channel-formation handshake between two cluster nodes. This datagram is sent in response to receiving a HELLO datagram from an unknown LAN adapter address.
3 | VERF | Verify | Datagram that acknowledges the CCSTART datagram and continues the channel-formation handshake. The datagram is sent in response to receiving a CCSTART or SOLICIT_SRV datagram.
4 | VACK | Verify acknowledge | Datagram that completes the channel-formation handshake. The datagram is sent in response to receiving a VERF datagram.
5 | Reserved | |
6 | SOLICIT_SRV | Solicit | Datagram sent by a booting node to form a channel to its disk server. The server responds by sending a VERF, which forms the channel.
7-15 | Reserved | |
Datagram flags (bits <7:4>) Provide additional information about the control datagram. The following bits are defined:
  • Bit <4> (AUTHORIZE): Set to 1 if the cluster password field is valid.
  • Bit <5> (Reserved): Set to 1.
  • Bit <6> (Reserved): Set to 0.
  • Bit <7> (TR/CC flag): Set to 1 to indicate the CC datagram.
Cluster password Contains the cluster password.
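The bit assignments in Table F-10 can be applied to a captured flags/type byte as follows. This is a minimal sketch: it decodes only the single byte that holds the datagram type (bits <3:0>) and the datagram flags (bits <7:4>), and the function and key names are illustrative rather than part of any OpenVMS interface.

```python
# Datagram type codes from Table F-10; unlisted values are reserved.
CC_DATAGRAM_TYPES = {
    0: "HELLO",
    1: "BYE",
    2: "CCSTART",
    3: "VERF",
    4: "VACK",
    6: "SOLICIT_SRV",
}


def decode_cc_byte(value: int) -> dict:
    """Decode the CC datagram flags/type byte."""
    if not 0 <= value <= 0xFF:
        raise ValueError("expected a single byte value")
    type_code = value & 0x0F                      # bits <3:0>: datagram type
    return {
        "type_code": type_code,
        "type": CC_DATAGRAM_TYPES.get(type_code, "Reserved"),
        "authorize": bool(value & 0x10),          # bit <4>: password field valid
        "is_cc": bool(value & 0x80),              # bit <7>: 1 indicates CC
    }
```

For example, a byte value of B2 (hexadecimal) decodes as a CCSTART datagram with the AUTHORIZE bit set, whereas A0 decodes as a HELLO datagram without a valid cluster password.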

F.8.6 Transport (TR) Header

The transport (TR) header is used to pass SCS datagrams and sequenced messages between cluster nodes. The important fields for network troubleshooting are the TR datagram flags, message acknowledgment, and sequence numbers. Note that because the CC and TR headers occupy the same space, a TR/CC flag identifies the type of message being transmitted over the channel.

Figure F-10 shows the portions of the TR header that are needed for network troubleshooting, and Table F-11 describes these fields.

Figure F-10 TR Header


Note: The TR header shown in Figure F-10 is used when both nodes are running Version 1.4 or later of the NISCA protocol. If one or both nodes are running Version 1.3 or an earlier version of the protocol, then both nodes will use the message acknowledgment and sequence number fields in place of the extended message acknowledgment and extended sequence number fields, respectively.

Table F-11 Fields in the TR Header
Field Description
Datagram flags (bits <7:0>) Provide additional information about the transport datagram.
Bit | Abbreviated Flag Name | Expanded Flag Name | Function
0 | DATA | Packet data | Contains data to be delivered to the upper levels of software.
1 | SEQ | Sequence flag | Set to 1 if this is a sequenced message and the sequence number is valid.
2 | Reserved | | Set to 0.
3 | ACK | Acknowledgment | Set to 1 if the acknowledgment field is valid.
4 | RSVP | Reply flag | Set when an ACK datagram is needed immediately.
5 | REXMT | Retransmission | Set for all retransmissions of a sequenced message.
6 | Reserved | | Set to 0.
7 | TR/CC flag | Transport flag | Set to 0; indicates a TR datagram.
Message acknowledgment An increasing value that specifies the last sequenced message segment received by the local node. All messages prior to this value are also acknowledged. This field is used when one or both nodes are running Version 1.3 or earlier of the NISCA protocol.
Extended message acknowledgment An increasing value that specifies the last sequenced message segment received by the local node. All messages prior to this value are also acknowledged. This field is used when both nodes are running Version 1.4 or later of the NISCA protocol.
Sequence number An increasing value that specifies the order of datagram transmission from the local node. This number is used to provide guaranteed delivery of this sequenced message segment to the remote node. This field is used when one or both nodes are running Version 1.3 or earlier of the NISCA protocol.
Extended sequence number An increasing value that specifies the order of datagram transmission from the local node. This number is used to provide guaranteed delivery of this sequenced message segment to the remote node. This field is used when both nodes are running Version 1.4 or later of the NISCA protocol.
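When you review a capture for retransmissions, it helps to translate the TR datagram flags byte into flag names. The following sketch uses the bit positions from Table F-11; the function names are hypothetical and the code handles only the flags byte, not the acknowledgment or sequence number fields.

```python
# Bit positions of the defined TR datagram flags (Table F-11).
TR_FLAG_BITS = {
    0: "DATA",
    1: "SEQ",
    3: "ACK",
    4: "RSVP",
    5: "REXMT",
}


def decode_tr_flags(value: int) -> list:
    """Return the names of the TR flags set in a flags byte, lowest bit first."""
    if not 0 <= value <= 0xFF:
        raise ValueError("expected a single byte value")
    if value & 0x80:
        # Bit <7> is the TR/CC flag; 1 means the header is a CC header instead.
        raise ValueError("TR/CC flag is set: this is a CC datagram")
    return [name for bit, name in sorted(TR_FLAG_BITS.items()) if value & (1 << bit)]


def is_retransmission(value: int) -> bool:
    """True when the REXMT flag is set in a TR flags byte."""
    return "REXMT" in decode_tr_flags(value)
```

A flags byte of 23 (hexadecimal) decodes as DATA, SEQ, and REXMT, which identifies a retransmitted sequenced message.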

F.9 Using a LAN Protocol Analysis Program

Some failures, such as packet loss resulting from congestion, intermittent network interruptions of less than 20 seconds, problems with backup bridges, and intermittent performance problems, can be difficult to diagnose. Intermittent failures may require the use of a LAN analysis tool to isolate and troubleshoot the NISCA protocol levels described in Section F.1.

As you evaluate the network analysis tools currently available, look for the capabilities described in the following sections.

F.9.1 Single or Multiple LAN Segments

Whether you need to troubleshoot problems on a single LAN segment or on multiple LAN segments, a LAN analyzer should help you isolate specific patterns of data. Choose a LAN analyzer that can isolate data matching unique patterns that you define. You should be able to define data patterns located in the data regions following the LAN header (described in Section F.8.2). In order to troubleshoot the NISCA protocol properly, a LAN analyzer should be able to match multiple data patterns simultaneously.

To troubleshoot single or multiple LAN segments, you must minimally define and isolate transmitted and retransmitted data in the TR header (see Section F.8.6). Additionally, for effective network troubleshooting across multiple LAN segments, a LAN analysis tool should include the following functions:

  • A distributed enable function that allows you to synchronize multiple LAN analyzers that are set up at different locations so that they can capture information about the same event as it travels through the LAN configuration
  • A distributed combination trigger function that automatically triggers multiple LAN analyzers at different locations so that they can capture information about the same event

The purpose of the distributed enable and distributed combination trigger functions is to capture packets as they travel across multiple LAN segments. The implementation of these functions discussed in the following sections uses multicast messages to reach all LAN segments of the extended LAN in the system configuration. By providing the ability to synchronize several LAN analyzers at different locations across multiple LAN segments, the distributed enable and combination trigger functions allow you to troubleshoot LAN configurations that span multiple sites over several miles.

F.9.2 Multiple LAN Segments

To troubleshoot multiple LAN segments, LAN analyzers must be able to capture the multicast packets and dynamically enable the trigger function of the LAN analyzer, as follows:

Step Action
1 Start capturing the data according to the rules specific to your LAN analyzer. HP recommends that only one LAN analyzer transmit a distributed enable multicast packet on the LAN. The packet must be transmitted according to the media access-control rules.
2 Wait for the distributed enable multicast packet. When the packet is received, enable the distributed combination trigger function. Prior to receiving the distributed enable packet, all LAN analyzers must be able to ignore the trigger condition. This feature is required in order to set up multiple LAN analyzers capable of capturing the same event. Note that the LAN analyzer transmitting the distributed enable should not wait to receive it.
3 Wait for an explicit (user-defined) trigger event or a distributed trigger packet. When the LAN analyzer receives either of these triggers, the LAN analyzer should stop the data capture.

Prior to receiving either trigger, the LAN analyzer should continue to capture the requested data. This feature is required in order to allow multiple LAN analyzers to capture the same event.

4 Once triggered, the LAN analyzer completes the distributed trigger function to stop the other LAN analyzers from capturing data related to the event that has already occurred.
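The four steps above amount to a small state machine in each LAN analyzer. The following sketch models that behavior under stated assumptions: the state names, event names, and the `Analyzer` class are invented for illustration and do not correspond to the interface of any particular analyzer product.

```python
from enum import Enum


class State(Enum):
    WAIT_ENABLE = "capturing; trigger condition ignored"
    ARMED = "capturing; trigger condition enabled"
    TRIGGERED = "capture stopped"


class Analyzer:
    """Hypothetical model of one LAN analyzer in a distributed capture."""

    def __init__(self, sends_enable: bool = False):
        self.sent = []  # multicast packets this analyzer has transmitted
        if sends_enable:
            # The analyzer that transmits the distributed enable packet
            # does not wait to receive it (step 2).
            self.sent.append("distributed_enable")
            self.state = State.ARMED
        else:
            self.state = State.WAIT_ENABLE

    def on_event(self, event: str) -> State:
        if self.state is State.WAIT_ENABLE:
            # Before the enable packet arrives, triggers are ignored.
            if event == "distributed_enable":
                self.state = State.ARMED
        elif self.state is State.ARMED:
            if event == "local_trigger":
                # Explicit trigger: stop, then tell the other analyzers (step 4).
                self.state = State.TRIGGERED
                self.sent.append("distributed_trigger")
            elif event == "distributed_trigger":
                # Another analyzer saw the event: stop capturing (step 3).
                self.state = State.TRIGGERED
        return self.state
```

In this model, an analyzer that sees the trigger condition before the distributed enable packet arrives keeps capturing, which is what allows every analyzer to record the same event.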

The HP 4972A LAN Protocol Analyzer, available from the Hewlett-Packard Company, is one example of a network failure analysis tool that provides the required functions described in this section.

Reference: Section F.11 provides examples that use the HP 4972A LAN Protocol Analyzer.

