F.14.2 Partner Program
The partner program waits for the distributed enable; then it captures
all of the LAN traffic and terminates as a result of either a
retransmission or the distributed trigger. Upon termination, this
program transmits the distributed trigger to make sure that other LAN
analyzers also capture the data at about the same time as when the
retransmitted packet was detected on this segment or another segment.
After the data capture completes, the data from multiple LAN segments
can be reviewed to locate the initial copy of the data that was
retransmitted. The partner program is shown in the following example:
Store:    frames matching LAVc_all
              or Distrib_Enable
              or Distrib_Trigger
          ending with Distrib_Trigger
Log file: not used
Block 1:  Wait_for_distributed_enable
          When frame matches Distrib_Enable then go to block 2
Block 2:  Wait_for_the_event
          When frame matches LAVc_TR_ReXMT then go to block 3
Block 3:  Send_the_distributed_trigger
          Mark frame
          and then
          Send message Distrib_Trigger
F.14.3 Scribe Program
The scribe program waits for the distributed enable and then captures
all of the LAN traffic and terminates as a result of the distributed
trigger. The scribe program allows a network manager to capture data at
about the same time as when the retransmitted packet was detected on
another segment. After the data capture has completed, the data from
multiple LAN segments can be reviewed to locate the initial copy of the
data that was retransmitted. The scribe program is shown in the
following example:
Store:    frames matching LAVc_all
              or Distrib_Enable
              or Distrib_Trigger
          ending with Distrib_Trigger
Log file: not used
Block 1:  Wait_for_distributed_enable
          When frame matches Distrib_Enable then go to block 2
Block 2:  Wait_for_the_event
          When frame matches LAVc_TR_ReXMT then go to block 3
Block 3:  Mark_the_frames
          Mark frame
          and then
          Go to block 2
Appendix G NISCA Transport Protocol Congestion Control
G.1 NISCA Congestion Control
Network congestion occurs as the result of complex interactions of
workload distribution and network topology, including the speed and
buffer capacity of individual hardware components.
Network congestion can have a negative impact on cluster performance in
several ways:
- Moderate levels of congestion can lead to increased queue lengths
in network components (such as adapters and bridges) that in turn can
lead to increased latency and slower response.
- Higher levels of congestion can result in the discarding of packets
because of queue overflow.
- Packet loss can lead to packet retransmissions and, potentially,
even more congestion. In extreme cases, packet loss can result in the
loss of OpenVMS Cluster connections.
At the cluster level, these congestion effects appear as delays in
cluster communication (for example, delays in lock transactions,
served I/O, and ICC messages). The user-visible effects of network
congestion can include sluggish application response or loss of
throughput.
Thus, although a particular network component or protocol cannot
guarantee the absence of congestion, the NISCA transport protocol
implemented in PEDRIVER incorporates several mechanisms to mitigate the
effects of congestion on OpenVMS Cluster traffic and to avoid having
cluster traffic exacerbate congestion when it occurs. These mechanisms
affect the retransmission of packets carrying user data and the
multicast HELLO datagrams used to maintain connectivity.
G.1.1 Congestion Caused by Retransmission
Associated with each virtual circuit from a given node is a
transmission window size, which indicates the number of packets that
can be outstanding to the remote node (that is, the number of packets
that can be sent to the node at the other end of the virtual circuit
before receiving an acknowledgment [ACK]).
If the window size is 8 for a particular virtual circuit, then the
sender can transmit up to 8 packets in a row but, before sending the
ninth, must wait until receiving an ACK indicating that at least the
first of the 8 has arrived.
If an ACK is not received, a timeout occurs; the packet is assumed to
be lost and must be retransmitted. If another timeout occurs for a
retransmitted packet, the timeout interval is significantly increased
and the packet is retransmitted again. After a large number of
consecutive retransmissions of the same packet, the virtual circuit is
closed.
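
The mechanism just described can be pictured with the following
sketch. It is illustrative only and is not PEDRIVER source code; the
window size of 8 matches the example above, but the timeout value, the
backoff factor, the retry limit, and all function names are
assumptions chosen for the example.

WINDOW_SIZE = 8       # packets that may be outstanding before an ACK is required
BASE_TIMEOUT = 0.5    # initial retransmission timeout in seconds (assumed value)
BACKOFF = 4           # factor by which the timeout grows after a failed retransmission (assumed)
MAX_RETRIES = 30      # consecutive retransmissions tolerated before the circuit is closed (assumed)

class VirtualCircuitClosed(Exception):
    """Raised when repeated retransmissions of the same packet fail."""

def send_over_circuit(packets, transmit, wait_for_ack):
    """Send packets, keeping at most WINDOW_SIZE of them unacknowledged.

    transmit(pkt) puts a packet on the wire; wait_for_ack(timeout) returns
    True if the oldest outstanding packet is acknowledged within timeout
    seconds, or False if the timer expires.
    """
    outstanding = []
    for pkt in packets:
        # With a window of 8, the ninth packet cannot be sent until the
        # first of the previous 8 has been acknowledged.
        while len(outstanding) >= WINDOW_SIZE:
            _wait_for_oldest(outstanding, transmit, wait_for_ack)
        transmit(pkt)
        outstanding.append(pkt)
    while outstanding:
        _wait_for_oldest(outstanding, transmit, wait_for_ack)

def _wait_for_oldest(outstanding, transmit, wait_for_ack):
    timeout = BASE_TIMEOUT
    for _ in range(MAX_RETRIES):
        if wait_for_ack(timeout):
            outstanding.pop(0)     # oldest packet delivered
            return
        # Timeout: assume the packet is lost and retransmit it with a
        # significantly increased timeout interval.
        transmit(outstanding[0])
        timeout *= BACKOFF
    # Too many consecutive retransmissions of the same packet.
    raise VirtualCircuitClosed("closing virtual circuit after repeated retransmissions")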
G.1.1.1 OpenVMS VAX Version 6.0 or OpenVMS AXP Version 1.5, or Later
This section pertains to PEDRIVER running on OpenVMS VAX Version 6.0 or
OpenVMS AXP Version 1.5, or later.
The retransmission mechanism is an adaptation of the algorithms
developed for the Internet TCP protocol by Van Jacobson and improves on
the old mechanism by making both the window size and the retransmission
timeout interval adapt to network conditions.
- When a timeout occurs because of a lost packet, the window size is
decreased immediately to reduce the load on the network. The window
size is allowed to grow only after congestion subsides. More
specifically, when a packet loss occurs, the window size is decreased
to 1 and remains there, allowing the transmitter to send only one
packet at a time until all the original outstanding packets have been
acknowledged.
After this occurs, the window is allowed to grow quickly until it
reaches half its previous size. After reaching the halfway point, the
window size is allowed to increase relatively slowly, to take
advantage of available network capacity, until it reaches a maximum
value determined by configuration variables (for example, the smaller
of the number of adapter buffers and the size of the remote node's
resequencing cache).
- The retransmission timeout interval is based on the measured average
round-trip time of packets transmitted over the virtual circuit and on
the mean deviation from that average. This allows PEDRIVER to be more
responsive to packet loss in most networks while avoiding premature
timeouts in networks whose actual round-trip delay is consistently
long. The algorithm can accommodate average delays of up to a few
seconds (both adaptations are sketched after this list).
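
A rough sketch of the two adaptations described above follows. It is
not the PEDRIVER implementation; the growth rule, the smoothing
constants (borrowed from the widely published Jacobson TCP
estimators), and all names are assumptions used only to illustrate a
loss-triggered window reduction and a timeout derived from measured
round-trip times.

class AdaptiveCircuit:
    """Illustrative model of the adaptive window and retransmission timeout."""

    def __init__(self, max_window):
        self.max_window = max_window   # bounded by adapter buffers and the remote resequencing cache
        self.window = max_window       # current transmission window
        self.threshold = max_window    # point where growth switches from fast to slow
        self.srtt = None               # smoothed round-trip time, in seconds
        self.rttvar = 0.0              # mean deviation of the round-trip time

    def on_packet_loss(self):
        # Congestion is assumed: remember half the old window, then drop to 1.
        self.threshold = max(self.window // 2, 1)
        self.window = 1

    def on_ack(self):
        # (The real protocol first waits for all previously outstanding
        # packets to be acknowledged before regrowing the window.)
        if self.window < self.threshold:
            self.window += 1                  # quick regrowth up to half the previous size
        elif self.window < self.max_window:
            self.window += 1.0 / self.window  # slow growth toward the configured maximum

    def on_rtt_sample(self, rtt):
        # Track the average round-trip time and its mean deviation.
        if self.srtt is None:
            self.srtt, self.rttvar = rtt, rtt / 2
        else:
            self.rttvar = 0.75 * self.rttvar + 0.25 * abs(self.srtt - rtt)
            self.srtt = 0.875 * self.srtt + 0.125 * rtt

    def retransmit_timeout(self):
        # A timeout that adapts to measured delay avoids premature
        # retransmissions on networks with consistently long round trips.
        return self.srtt + 4 * self.rttvar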
G.1.2 HELLO Multicast Datagrams
PEDRIVER periodically multicasts a HELLO datagram over each network
adapter attached to the node. The HELLO datagram serves two purposes:
- It informs other nodes of the existence of the sender so that they
can form channels and virtual circuits.
- It helps to keep communications open once they are established.
HELLO datagram congestion and loss of HELLO datagrams can prevent
connections from forming or cause connections to be lost. Table G-1
describes conditions causing HELLO datagram congestion and how PEDRIVER
helps avoid the problems. The result is a substantial decrease in the
probability of HELLO datagram synchronization and thus a decrease in
HELLO datagram congestion.
Table G-1 Conditions that Create HELLO Datagram Congestion

Condition that causes congestion:
If all nodes receiving a HELLO datagram from a new node responded
immediately, the receiving network adapter on the new node could be
overrun with HELLO datagrams and be forced to drop some, resulting in
connections not being formed. This is especially likely in large
clusters.

How PEDRIVER avoids the congestion:
- On nodes running VMS Version 5.5-2 or earlier, a node that receives
a HELLO datagram delays for a random time interval of up to 1 second
before responding.
- On nodes running OpenVMS VAX Version 6.0 or later, or OpenVMS AXP
Version 1.5 or later, this random delay is a maximum of 2 seconds to
support large OpenVMS Cluster systems.

Condition that causes congestion:
If a large number of nodes in a network became synchronized and
transmitted their HELLO datagrams at or near the same time, receiving
nodes could drop some datagrams and time out channels.

How PEDRIVER avoids the congestion:
On nodes running VMS Version 5.5-2 or earlier, PEDRIVER multicasts
HELLO datagrams over each adapter every 3 seconds, making this form of
HELLO datagram congestion more likely.
On nodes running OpenVMS VAX Version 6.0 or later, or OpenVMS AXP
Version 1.5 or later, PEDRIVER prevents this form of HELLO datagram
congestion by distributing its HELLO datagram multicasts randomly over
time. A HELLO datagram is still multicast over each adapter
approximately every 3 seconds, but not over all adapters at once.
Instead, if a node has multiple network adapters, PEDRIVER attempts to
distribute its HELLO datagram multicasts so that it sends a HELLO
datagram over some of its adapters during each second of the 3-second
interval.
In addition, rather than multicasting precisely every 3 seconds,
PEDRIVER varies the time between HELLO datagram multicasts between
approximately 1.6 and 3 seconds, changing the average interval from 3
seconds to approximately 2.3 seconds.
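
The randomized timing described in Table G-1 can be summarized in a
short sketch such as the following. It is illustrative only and the
function names are invented for the example, but the interval bounds
(1.6 to 3 seconds between HELLO multicasts on an adapter, and up to 2
seconds of response delay) come from the descriptions above.

import random

HELLO_MIN_INTERVAL = 1.6    # seconds between HELLO multicasts on one adapter
HELLO_MAX_INTERVAL = 3.0
MAX_RESPONSE_DELAY = 2.0    # maximum random delay before answering a new node's HELLO

def next_hello_time(now):
    """Schedule the next HELLO multicast on an adapter.

    Randomizing the interval between 1.6 and 3 seconds (an average of
    about 2.3 seconds) keeps the HELLO transmissions of many nodes from
    becoming synchronized.
    """
    return now + random.uniform(HELLO_MIN_INTERVAL, HELLO_MAX_INTERVAL)

def response_delay():
    """Delay before responding to a HELLO datagram from a newly seen node.

    Spreading responses over up to 2 seconds prevents the new node's
    adapter from being overrun when many cluster members answer at once.
    """
    return random.uniform(0.0, MAX_RESPONSE_DELAY)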
G.1.3 HELLO IP Unicast and IP Multicast Datagrams
PEDRIVER periodically transmits one IP multicast and one IP unicast
datagram for each of the IP multicast addresses. These unicast and
multicast addresses must be updated in the PE$IP_CONFIG.DAT file. The
HELLO datagram serves two purposes:
- It informs other nodes of the existence of the sender so that they
can form channels and virtual circuits.
- It helps to keep communications open once they are established.
HELLO datagram congestion and loss of HELLO datagrams can prevent
connections from forming or cause connections to be lost.
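
As a very loose illustration only (PEDRIVER does not use this code,
and the payload, port number, and addresses shown are placeholders
rather than values taken from PE$IP_CONFIG.DAT), transmitting one
HELLO datagram as an IP multicast and one as an IP unicast might look
like this:

import socket

def send_hello(payload, multicast_group, unicast_addr, port, ttl=1):
    """Send one HELLO datagram to an IP multicast group and one to a unicast peer.

    multicast_group, unicast_addr, port, and ttl are placeholder
    parameters; in a real cluster the addresses come from the
    PE$IP_CONFIG.DAT configuration described above.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, ttl)
        sock.sendto(payload, (multicast_group, port))   # IP multicast HELLO
        sock.sendto(payload, (unicast_addr, port))      # IP unicast HELLO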